Exam Certified Data Engineer Professional All Questions

View all questions & answers for the Certified Data Engineer Professional exam

Exam Certified Data Engineer Professional topic 1 question 49 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 49
Topic #: 1

[All Certified Data Engineer Professional Questions]

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A. Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.
B. The only way to meaningfully troubleshoot code execution times in development notebooks Is to use production-sized data and production-sized clusters with Run All execution.
C. Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
E. The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Show Suggested Answer

Suggested Answer: B 🗳️

by tkg13 at Aug. 24, 2023, 4:09 a.m.

Comments

Submit Cancel

guillesd

Highly Voted 1 year, 5 months ago

Selected Answer: B

Both B and D are correct statements. However, D is not an adjustment (see the question), it is just an afirmation which happens to be correct. B, however, is an adjustment, and it will definitely help with profiling.

upvoted 7 times

...

79f0e18

Most Recent 1 week, 2 days ago

Selected Answer: D

Transformations like select(), filter(), and join() are lazy — they build a logical execution plan but do not trigger actual computation. Actions like display(), collect(), count(), or write() are what trigger execution of the plan. When a user repeatedly executes the same code, Spark may cache intermediate results or re-optimize the plan, which makes performance inconsistent and not representative of real production runs. Therefore, measuring execution time by re-running code interactively in a notebook (especially using display()) is not a reliable performance benchmark.

upvoted 1 times

...

sammy2025

1 month, 2 weeks ago

Selected Answer: D

Lazy Evaluation in Spark: Many transformations in Spark only modify the logical query plan until an action (like display(), collect(), or write()) is triggered.

upvoted 1 times

...

KadELbied

2 months ago

Selected Answer: B

suretly B

upvoted 1 times

...

Tedet

4 months, 1 week ago

Selected Answer: D

Explanation: Using display() in Databricks forces a job to trigger and display the output, which can lead to an inaccurate measure of performance when benchmarking code. This is because display() triggers the job and materializes the result, which does not accurately reflect how the code will perform in production when the job is run without the display output. Additionally, repeated execution of the same logic (with caching) may not give you meaningful performance results since the results are cached in memory and not representative of fresh computations, as they would occur in a production environment. To get a more accurate measure of execution time, the user should focus on using appropriate job execution techniques, such as running the notebook with "Run All" and avoiding reliance on display() calls, which are not representative of how the pipeline would behave in production.

upvoted 3 times

...

arekm

6 months, 1 week ago

Selected Answer: B

Answer B, see discussion under benni_ale.

upvoted 1 times

ultimomassimo

3 months, 2 weeks ago

in any real life commercial project answer B is not feasible, sorry. You always use representative sample, but using same data volumes (especially when they are massive) is impractical and no one would sing off on the cost

upvoted 1 times

...

AlejandroU

6 months, 4 weeks ago

Selected Answer: D

Answer D. While Option D doesn't directly provide an alternative adjustment, it points out a critical issue in the way interactive notebooks might give misleading results. It would be advisable to avoid using display() as a benchmark for performance in production-like environments.

upvoted 3 times

...

carlosmps

7 months ago

Selected Answer: B

Without much thought, I would vote for option B, but since it says 'the ONLY,' it makes me hesitate. While option D only points out the issues with the data engineer's executions, it doesn’t really provide the adjustments that need to be made. On the other hand, option B at least gives you a way to simulate production behavior. I’ll vote for B, but as I said, the word 'only' makes me doubt, because it’s not the only way.

upvoted 1 times

...

benni_ale

7 months, 3 weeks ago

Selected Answer: D

Answer: D. Explanation: Lazy Evaluation: Spark employs lazy evaluation, meaning transformations are not executed until an action (e.g., display(), count(), collect()) is called. Using display() triggers the execution of the transformations up to that point. Caching Effects: Repeatedly executing the same cell can lead to caching, where Spark stores intermediate results. This caching can cause subsequent executions to be faster, not reflecting the true performance of the code. Why not B: Production-Sized Data and Clusters: While using production-sized data and clusters (as mentioned in option B) can provide insights into performance, it's not the only way to troubleshoot execution times. Proper testing can often be conducted on smaller datasets and clusters, especially during the development phase.

upvoted 1 times

af4a20a

7 months ago

Yep, what if your production size is 10 TB... But you have a 10GB sample. No idea what's actually right for the test, but D is correct.

upvoted 1 times

arekm

6 months, 1 week ago

D is correct. However, it does not show direction on what to do to troubleshoot the problem, which is the first statement in the question. The only way to troubleshoot performance problems is to start with the data & processing platform of size that is representative of production. That is why I think B is a better choice.

upvoted 1 times

...

practicioner

10 months, 4 weeks ago

Selected Answer: B

B and D are correct. The question says "which statements" which suggests us that this is a question with multiple choices

upvoted 2 times

...

HelixAbdu

11 months, 2 weeks ago

Both D and B are correct. But in real life some times clients dose not accept to gave you there production data to test easily. Also it says in B it is “the only way” ans this is not true for me So i will go with D

upvoted 4 times

RyanAck24

9 months, 2 weeks ago

I would add to this and say that this *could* be a multi-choice question (possibly) as practicioner mentions above. But if it isn't, I would go with D as well.

upvoted 1 times

...

ffsdfdsfdsfdsfdsf

1 year, 4 months ago

Selected Answer: B

These people voting D have no reading comprehension.

upvoted 4 times

...

alexvno

1 year, 4 months ago

Selected Answer: B

Close env size volumes as possible so results make sense

upvoted 2 times

...

halleysg

1 year, 4 months ago

Selected Answer: D

D is correct

upvoted 3 times

...

Curious76

1 year, 4 months ago

Selected Answer: D

I will go with D

upvoted 1 times

...

agreddy

1 year, 4 months ago

D is the correct answer A. Scala is the only language accurately tested using notebooks: Not true. Spark SQL and PySpark can be accurately tested in notebooks, and production performance doesn't solely depend on language choice. B. Production-sized data and clusters are necessary: While ideal, it's not always feasible for development. Smaller datasets and clusters can provide indicative insights. C. IDE and local Spark/Delta Lake: Local environments won't replicate production's scale and configuration fully. E. Jobs UI and Photon: True that Photon benefits scheduled jobs, but Jobs UI can track execution times regardless of Photon usage. However, Jobs UI runs might involve additional overhead compared to notebook cells. Option D addresses the specific limitations of using display() for performance measurement

upvoted 4 times

...

DAN_H

1 year, 5 months ago

Selected Answer: D

As B not talking about how to deal with display() function. We know that way to testing performance for the whole notebook need to avoid using display as it is way to test the code and display the data

upvoted 3 times

arekm

6 months, 1 week ago

True, it is not addressing the display() function. However, D does not give any hint on how to go about the problem. On top of that display() function is an action that might help you out in investigating by triggering the actual processing. You still need the data volume that represents the inherent problem - which means that you need the production size of the data, which I think is the first step anyway. Not the last though :)

upvoted 1 times

...

Load full discussion...