Exam Certified Data Engineer Professional topic 1 question 70 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 70
Topic #: 1

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Suggested Answer: A 🗳️
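
For reference, a minimal PySpark sketch of the suggested approach (option A). The paths, the column name and the narrow transformation are placeholders, and actual part-file sizes will still vary with compression and column cardinality:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Cap read partitions at 512 MB so the number of input partitions roughly
# matches the desired number of ~512 MB output files.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events_json/")          # placeholder source path

# Narrow transformations only (no shuffle), so the partition count set at
# read time carries through to the write.
cleaned = (df
           .filter(F.col("id").isNotNull())            # placeholder column
           .withColumn("ingest_date", F.current_date()))

cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")  # placeholder target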

Comments

aragorn_brego
Highly Voted 1 year, 8 months ago
Selected Answer: A
This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
upvoted 10 times
Def21
Highly Voted 1 year, 6 months ago
Selected Answer: D
D is the only one that does the trick. Note: we cannot do shuffling. Wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing. (The maximum number of bytes to pack into a single partition when reading files; this configuration is effective only when using file-based sources such as Parquet, JSON and ORC.)
B: spark.sql.shuffle.partitions only applies when data is shuffled, and sorting the data just to repartition it does not make sense.
C: Would work, but spark.sql.adaptive.advisoryPartitionSizeInBytes needs a shuffle to take effect. (The advisory size in bytes of a shuffle partition during adaptive optimization, when spark.sql.adaptive.enabled is true; it takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions.)
E: spark.sql.shuffle.partitions (configures the number of partitions to use when shuffling data for joins or aggregations) is not about writing.
upvoted 6 times
arekm
7 months, 1 week ago
D does repartition, which the question says we should try to avoid.
upvoted 1 times
carlosmps
8 months ago
spark.sql.files.maxPartitionBytes is not just for reading files
upvoted 1 times
azurefan777
9 months, 1 week ago
Answer D is wrong -> repartition does perform shuffling in Spark. When you use repartition, Spark redistributes the data across the specified number of partitions, which requires moving data between nodes to achieve the new partitioning. Answer A should be correct
upvoted 4 times
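
To make the thread above concrete, here is a hedged sketch of what option D actually does (paths are placeholders): repartition() is a wide transformation, so Spark inserts an Exchange, i.e. a full shuffle of the ~1 TB dataset, before the write.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/mnt/raw/events_json/")   # placeholder source path

# Option D: explicit control over the output file count, but at the cost of
# a full shuffle -- repartition() always redistributes data across the cluster.
repartitioned = df.repartition(2048)
repartitioned.explain()                         # physical plan shows an Exchange node
repartitioned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
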
dalupus
Most Recent 3 weeks, 3 days ago
Selected Answer: C
This is still only partly correct. A is wrong because there may be no correlation between the size of the data being read and the size of the data being written, and this setting only controls reads. The reason I say this is only partially correct is that it makes the false assumption that if you simply take a 1 TB dataset and divide it up, the sum of the pieces will equal 1 TB, which is not true. The size of the dataset will depend on the cardinality of the columns, but any time you increase the number of Parquet files you will increase the total data size. If you don't believe me, try it yourself: take a 5 GB DataFrame and save it as 1 Parquet file versus 500; you will see the sizes are very different.
upvoted 1 times
happyhelppy
3 weeks, 6 days ago
Selected Answer: C
You want target Parquet file sizes around 512 MB without causing a full shuffle. Using coalesce instead of repartition ensures that no shuffle occurs — perfect for performance-sensitive jobs that don’t need data redistribution. The spark.sql.adaptive.advisoryPartitionSizeInBytes helps optimize physical plan execution under Adaptive Query Execution (AQE) without enforcing a hard shuffle. Writing to 2,048 partitions (1 TB ÷ 512 MB) gives the best chance of achieving the desired file sizes while minimizing write latency.
upvoted 1 times
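
A sketch of option C as described above (paths are placeholders). Two caveats: spark.sql.adaptive.advisoryPartitionSizeInBytes only guides AQE when a shuffle is present, and coalesce() can only merge existing partitions, never split them, so the final file count depends on how many read partitions existed to begin with.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The advisory size only matters when AQE coalesces or splits shuffle partitions;
# with narrow transformations plus coalesce() there is no shuffle for it to act on.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events_json/")   # placeholder source path

# coalesce() is narrow: it merges partitions and cannot increase their number.
df.coalesce(2048).write.mode("overwrite").parquet("/mnt/curated/events_parquet/")
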
KadELbied
3 months, 2 weeks ago
Selected Answer: B
I found this question in other exam tests, and in all of them the answer looks like B.
upvoted 1 times
AlejandroU
7 months, 3 weeks ago
Selected Answer: D
Answer D. Explicitly repartitioning to 2,048 partitions ensures that the output files are close to the desired size of 512 MB, provided the data distribution is relatively even. Repartitioning directly addresses the problem by controlling the number of partitions, which directly affects the output file size. Why not option A? Misinterpretation of spark.sql.files.maxPartitionBytes in option A: the assessment incorrectly states that this configuration controls the maximum size of files when writing to Parquet. This setting controls the size of partitions when reading data, not during writing.
upvoted 1 times
AlejandroU
7 months, 3 weeks ago
Given the requirement to avoid shuffling, Option A is the most suitable choice. By setting spark.sql.files.maxPartitionBytes to 512 MB, you influence the partitioning during the read phase, which can help in achieving the desired file sizes during the write operation. However, it's important to note that this approach may not guarantee exact file sizes, and some variability may occur. If achieving precise file sizes is critical and shuffling is permissible, Option D would be the preferred strategy.
upvoted 2 times
temple1305
8 months ago
Selected Answer: C
spark.sql.adaptive.advisoryPartitionSizeInBytes - the advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions. And then we do coalesce, without a shuffle, so it has to work!
upvoted 1 times
...
nedlo
9 months, 2 weeks ago
Selected Answer: A
I thought D, but the default number of partitions is 200, so you can't do coalesce(2048) (you can't increase the number of partitions through coalesce), so it's not possible to do it without repartitioning and a shuffle. Only A can be done without a shuffle.
upvoted 2 times
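
A quick, illustrative way to check the coalesce behaviour mentioned above on a small local DataFrame: coalesce() never raises the partition count, while repartition() does, at the cost of a shuffle.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).repartition(8)            # start with 8 partitions
print(df.coalesce(2048).rdd.getNumPartitions())       # still 8: coalesce cannot add partitions
print(df.repartition(2048).rdd.getNumPartitions())    # 2048, but this forces a shuffle
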
sdas1
11 months ago
Option A: spark.sql.files.maxPartitionBytes controls the maximum size of partitions during reading on the Spark cluster, and reducing this value could lead to more partitions and thus potentially more output files. The key point is that it works best when no shuffles occur, which aligns with the scenario of having narrow transformations only.
upvoted 2 times
sdas1
11 months ago
Given that no shuffle occurs and you're aiming to control the file sizes during output, adjusting spark.sql.files.maxPartitionBytes could help indirectly by determining the partition size for reading. Since the number of input partitions can influence the size of the output files when no shuffle occurs, the partition size may closely match the size of the files being written out.
upvoted 1 times
sdas1
11 months ago
If the transformations remain narrow, then Spark won't repartition the data unless explicitly instructed to do so (e.g., through a repartition or coalesce operation). In this case, using spark.sql.files.maxPartitionBytes to adjust the read partition size to 512 MB could indirectly control the number of output files and ensure they align with the target file size.
upvoted 1 times
sdas1
11 months ago
Thus, Option A is also a valid strategy: Set spark.sql.files.maxPartitionBytes to 512 MB, process the data with narrow transformations, and write to Parquet. By reducing the value of spark.sql.files.maxPartitionBytes, you ensure more partitions are created during the read phase, leading to output files closer to the desired size, assuming the transformations are narrow and no shuffling occurs.
upvoted 1 times
vikram12apr
1 year, 5 months ago
Selected Answer: A
D is not correct as it will create 2,048 target files of 0.5 MB each. Only A will do the job, as it will read this file in 2 partitions (1 TB = 512*2 MB) and, since we are not doing any shuffling (none is mentioned in the option), it will create that many partition files, i.e. 2 part files.
upvoted 1 times
hal2401me
1 year, 5 months ago
hey, 1 TB = 1000 GB = 10^6 MB.
upvoted 4 times
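
For reference, the arithmetic behind the 2,048 figure in the options (binary units, as the options themselves use):

# 1 TB expressed in MB, divided by the 512 MB target part-file size
total_mb = 1 * 1024 * 1024      # 1 TB = 1,048,576 MB
target_mb = 512
print(total_mb // target_mb)    # 2048 part files, not 2
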
hal2401me
1 year, 5 months ago
Selected Answer: D
ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.
upvoted 1 times
Curious76
1 year, 5 months ago
Selected Answer: D
D is the most suitable.
upvoted 1 times
vctrhugo
1 year, 6 months ago
Selected Answer: A
This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.
upvoted 3 times
adenis
1 year, 6 months ago
Selected Answer: C
C is correct.
upvoted 1 times
spaceexplorer
1 year, 6 months ago
Selected Answer: A
The rest of the answers trigger shuffles.
upvoted 2 times
divingbell17
1 year, 7 months ago
Selected Answer: A
A is correct. The question asks which strategy will yield the best performance without shuffling data; the other options involve shuffling, either manually or through AQE.
upvoted 2 times
911land
1 year, 7 months ago
C is the correct answer.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other