A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction are not available.
Which strategy will yield the best performance without shuffling data?
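One commonly cited shuffle-free approach (an assumption here, not stated in the question itself) is to control the size of Spark's input partitions when reading the JSON, so that each partition flows through narrow transformations and lands as roughly one 512 MB Parquet part-file. The sketch below illustrates that idea using the real Spark SQL setting spark.sql.files.maxPartitionBytes; the paths and column name are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-parquet-ingest")
    # Cap each input split at 512 MB so narrow transformations preserve
    # roughly 512 MB partitions all the way to the Parquet writer.
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)

# Read the ~1 TB JSON dataset; Spark packs file splits into ~512 MB partitions.
df = spark.read.json("/mnt/raw/events/")  # hypothetical input path

# Apply only narrow transformations (no joins, groupBys, or repartition calls)
# so the 512 MB partitioning survives without triggering a shuffle.
cleaned = df.filter(df["event_type"].isNotNull())  # hypothetical column

# Each partition is written as one part-file, yielding ~512 MB Parquet files.
cleaned.write.mode("overwrite").parquet("/mnt/curated/events/")  # hypothetical output path
```

Because no wide transformation is introduced, the partition sizes established at read time carry through to the write, which is the property the question is probing.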