Exam Certified Data Engineer Professional All Questions

View all questions & answers for the Certified Data Engineer Professional exam

Exam Certified Data Engineer Professional topic 1 question 21 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 21
Topic #: 1

[All Certified Data Engineer Professional Questions]

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Show Suggested Answer

Suggested Answer: E 🗳️

by asmayassineg at Aug. 2, 2023, 11:54 a.m.

Comments

Submit Cancel

RafaelCFC

Highly Voted 1 year, 6 months ago

Selected Answer: E

I believe this is a case of the least bad option, not exactly the best option possible. - A is wrong because in Streaming you very rarely have any executors idle, as all cores are engaged in processing the window of data; - B is wrong because triggering every 30s will not meet the 10s target processing interval; - C is wrong in two manners: increasing shuffle partitions to any number above the number of available cores in the cluster will worsen performance in streaming; also, the checkpoint folder has no connection with trigger time. - D is wrong because, keeping all other things the same as described by the problem, keeping the trigger time as 10s will not change the underlying conditions of the delay (i.e.: too much data to be processed in a timely manner). E is the only option that might improve processing time.

upvoted 9 times

6 months, 1 week ago

Selected Answer: A

Option A emphasizes utilizing idle executors to begin processing the next batch while longer-running tasks from previous batches finish. This approach can help maintain a steady flow of data processing and reduce the likelihood of bottlenecks.

upvoted 1 times

arekm

6 months, 1 week ago

Structured streaming processes batches in sequence. It does so since it guarantees exactly once processing, see: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

upvoted 1 times

...

Thameur01

7 months ago

Selected Answer: B

If microbatch execution occasionally exceeds 30 seconds, a trigger interval of 5 seconds would cause multiple batches to queue up while the previous batch is still running. This would exacerbate the delays and potentially lead to backpressure and failure. B is the best option in this case. If we assume for sure that execution time should be less than 10s, then in that case a 5s interval will make more sense and E should be the best answer.

upvoted 1 times

...

wdeleersnyder

11 months, 1 week ago

In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads. https://docs.databricks.com/en/structured-streaming/triggers.html Doesn't seem like E is a valid and recommended option given that it is deprecated.

upvoted 2 times

wdeleersnyder

11 months, 1 week ago

Ooops, I mean, D.

upvoted 2 times

...

1 year, 8 months ago

Only C. Even if you trigger more frequently you decrease both load and time for this load. E doesn't change anything.

upvoted 1 times

...

sturcu

1 year, 9 months ago

Selected Answer: E

Changing trigger interval to "one" will cause this to be a "batch" and will not execute in microbranches. This will not help at all

upvoted 4 times

...

Eertyy

1 year, 9 months ago

correct answer is E

upvoted 1 times

...

azurearch

1 year, 10 months ago

sorry, the caveat is holding all other variables constant.. that means we are not allowed to change trigger intervals. is C the answer then

upvoted 1 times

...

azurearch

1 year, 10 months ago

what if in between those 5 seconds trigger interval if there are more records, that would still increase the time it takes to process.. i doubt E is correct. I will go with answer D. it is not to execute all queries within 10 secs. it is to execute trigger now batch every 10 seconds.

upvoted 1 times

...

azurearch

1 year, 10 months ago

A option also is about setting trigger interval to 5 seconds, just to understand.. why its not the answer

upvoted 1 times

...

Load full discussion...