The paper identifies a vulnerability in Spark-on-AWS-Lambda (SoAL): when a Lambda timeout interrupts a write to a Delta Lake or Iceberg table, the result can be silent data loss, with orphaned Parquet files left on S3 and the table's metadata unchanged. In controlled kill-injection experiments, the authors show that data loss occurred in 100% of trials where a SIGKILL landed between data upload and metadata commit. To address this, they introduce SafeWriter, a wrapper that triggers a format-native rollback shortly before the timeout and records a checkpoint on S3, turning every tested kill scenario into a detectable rollback while adding under 100 ms to normal write paths.
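A Lambda timeout that lands between data upload and metadata commit in a Spark write to Delta Lake or Iceberg caused silent data loss in 100% of trials, but a simple watchdog wrapper converted every tested kill into a clean, detectable rollback.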
AWS Lambda terminates containers with an uncatchable SIGKILL signal when a function exceeds its configured timeout. When a Spark-on-AWS-Lambda (SoAL) job is killed between Phase 1 (data upload) and Phase 2 (metadata commit) of a write, the result is silent data loss: orphaned Parquet files accumulate on S3 while the table's committed state remains unchanged and standard monitoring raises no alert. We characterize this vulnerability across Delta Lake and Apache Iceberg through 860 controlled kill-injection experiments at three dataset sizes. A SIGKILL landing in the inter-phase gap produced silent data loss in 100% of trials for both formats. We then present SafeWriter, a language-level wrapper that arms a watchdog thread 30 seconds before the Lambda timeout, triggers a format-native rollback via SQL, and records a checkpoint document on S3. SafeWriter converted every tested kill scenario into a clean, detectable rollback with under 100 ms added to normal write paths.
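The watchdog idea described above can be illustrated with a minimal Python sketch. It is not the authors' SafeWriter code. The class name `SafeWriterSketch` and the callables `remaining_ms`, `rollback`, `save_checkpoint`, `upload`, and `commit` are hypothetical stand-ins for the Lambda deadline query, the format-native SQL rollback, the S3 checkpoint write, and the two write phases. The sketch assumes the rollback is safe to run while Phase 1 is still in flight.

```python
import threading
import time
from typing import Callable


class SafeWriterSketch:
    """Illustrative watchdog wrapper (hypothetical, not the paper's code)."""

    def __init__(
        self,
        remaining_ms: Callable[[], float],        # time left before the hard kill
        rollback: Callable[[], None],             # stand-in for the SQL rollback
        save_checkpoint: Callable[[dict], None],  # stand-in for the S3 checkpoint
        margin_s: float = 30.0,                   # fire this long before timeout
    ) -> None:
        self._remaining_ms = remaining_ms
        self._rollback = rollback
        self._save_checkpoint = save_checkpoint
        self._margin_s = margin_s
        self._lock = threading.Lock()
        self._committed = False
        self.rolled_back = False

    def _on_deadline(self) -> None:
        # Runs on the timer thread. The lock keeps it from racing the commit.
        with self._lock:
            if self._committed:
                return  # the write already finished; nothing to undo
            self.rolled_back = True
            self._rollback()
            self._save_checkpoint({"status": "rolled_back", "at": time.time()})

    def write(self, upload: Callable[[], None], commit: Callable[[], None]) -> bool:
        """Run both phases; return True only if the commit happened."""
        delay = max(0.0, self._remaining_ms() / 1000.0 - self._margin_s)
        timer = threading.Timer(delay, self._on_deadline)
        timer.daemon = True
        timer.start()
        try:
            upload()  # Phase 1: data files are uploaded
            with self._lock:
                if self.rolled_back:
                    return False  # watchdog fired first; skip the commit
                commit()  # Phase 2: metadata commit
                self._committed = True
            return True
        finally:
            timer.cancel()  # harmless if the timer has already fired
```

In a Lambda handler written in Python, `remaining_ms` could be bound to `context.get_remaining_time_in_millis`, which is part of the Lambda context object. Holding the lock across the commit means the watchdog never rolls back a write that is partway through Phase 2. The cost is that a slow commit can still be killed by the timeout, which is why the margin has to exceed the expected commit time.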