Which statement describes Delta Lake Auto Compaction?
A.
An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
B.
Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
C.
Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
D.
Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
E.
An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.
A and E are wrong because auto compaction is a synchronous operation!
I vote for B
As per documentation - "Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously."
https://docs.delta.io/latest/optimizations-oss.html
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.
https://community.databricks.com/t5/data-engineering/what-is-the-difference-between-optimize-and-auto-optimize/td-p/21189
OPTIMIZE's default target file size is 1 GB; however, this question is about auto compaction, which, when enabled, runs OPTIMIZE with a 128 MB target file size by default.
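To illustrate the commenter's point: the 128 MB auto compaction target is configurable through a session setting. A minimal sketch, assuming a Databricks runtime (the conf names are the documented Databricks ones; the value shown is just the documented default restated explicitly):

```sql
-- Enable auto compaction for writes in this session
SET spark.databricks.delta.autoCompact.enabled = true;

-- Target file size for auto compaction (default is 128 MB,
-- unlike the 1 GB default used by a manual OPTIMIZE)
SET spark.databricks.delta.autoCompact.maxFileSize = 134217728;
```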
Delta Lake's Auto Compaction feature is designed to improve the efficiency of data storage by reducing the number of small files in a Delta table. After data is written to a Delta table, an asynchronous job can be triggered to evaluate the file sizes. If it determines that there are a significant number of small files, it will automatically run the OPTIMIZE command, which coalesces these small files into larger ones, typically aiming for files around 1 GB in size for optimal performance.
E is incorrect because the statement is similar to A but with an incorrect default file size target.
E fits best, but according to the docs it is a synchronous operation.
"Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously."
Correct answer is E:
Auto optimize consists of two complementary operations:
- Optimized writes: with this feature enabled, Databricks attempts to write out 128 MB files for each table partition.
- Auto compaction: this will check after an individual write, if files can further be compacted. If yes, it runs an OPTIMIZE job with 128 MB file sizes (instead of the 1 GB file size used in the standard OPTIMIZE)
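The two operations described above can also be enabled per table rather than per session. A minimal sketch, assuming a Databricks environment; `my_table` is a placeholder name:

```sql
-- Enable both halves of auto optimize on a specific Delta table:
--   optimizeWrite: attempt to write ~128 MB files per partition
--   autoCompact:   after a successful write, synchronously compact
--                  leftover small files toward a 128 MB target
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);
```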
Correct answer is E: auto compaction runs an asynchronous job to combine small files toward a default of 128 MB.
https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size