Exam Certified Data Engineer Professional All Questions

View all questions & answers for the Certified Data Engineer Professional exam

Exam Certified Data Engineer Professional topic 1 question 60 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 60
Topic #: 1

[All Certified Data Engineer Professional Questions]

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_saies_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
C. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Show Suggested Answer

Suggested Answer: C 🗳️

by hammer_1234_h at Sept. 10, 2023, 9:44 p.m.

Comments

Submit Cancel

hammer_1234_h

Highly Voted 1 year, 9 months ago

The answer should be A. it is the safest to generate accurate report

upvoted 12 times

Def21

1 year, 5 months ago

This is confusing: "overwrite the store_sales_summary table with each Update." sounds like it is only doing updates, not inserting new possible stories.

upvoted 4 times

...

alexvno

1 year, 6 months ago

Incorrect BATCH processing and OVERWRITE will give partial results

upvoted 2 times

...

djohn_prasad

4 months, 1 week ago

But what if each batch did not have historical data but only updates? Then doing an overwrite will risk removing historical data in the target table so Option C is much preferable

upvoted 2 times

...

KadELbied

Most Recent 1 month, 4 weeks ago

Selected Answer: C

suretly C

upvoted 1 times

...

ultimomassimo

3 months, 1 week ago

people that claim type 1 scd is not maintained by upsert need to make some more reading before posting anything here...

upvoted 1 times

...

Jamuro

4 months ago

Selected Answer: C

It makes no sense to recalculate all aggregations for all historical data and then overwrite. It would lead to a non scalable solution because all data should be recalculated to not "delete" rows in target table. Option C lets us filter by date ranges or even partitions to merge only target periods and get the same results in a much more performant, safer and cheaper way.

upvoted 1 times

...

AlejandroU

6 months, 2 weeks ago

Selected Answer: C

The answer is C. Option A is correct in ensuring accuracy, as it recalculates the entire store_sales_summary table based on the full historical data in daily_store_sales. However, it is computationally expensive and may not scale well. Option C (upsert logic) could be a better choice in most real-world scenarios, as it focuses only on the records that have changed, reducing computational costs and minimizing disruption for downstream systems.

upvoted 1 times

arekm

6 months ago

How do you know which records have changed? I think A is the safest answer.

upvoted 1 times

djohn_prasad

4 months, 1 week ago

Option C tells us based on upsert logic where matching records (matching key assumed as storeid and sales date) are checked for differences in sales values and the value from batch is taken over that of the target table. C is computational much easier then

upvoted 1 times

...

Sriramiyer92

6 months, 3 weeks ago

Selected Answer: A

A it is SCD Type 1, so clearly Append and Upsert logic should not be used.

upvoted 3 times

ultimomassimo

3 months, 1 week ago

you dont know what is scd type 1 , do you? you dont understand the difference between overwriting a column value and overwriting the whole table, do you?

upvoted 1 times

...

benni_ale

7 months ago

Selected Answer: A

A as the table does not require history

upvoted 1 times

...

shaojunni

8 months, 3 weeks ago

Selected Answer: A

daily_store_sales is type1 table, no history is maintained. You have to treat every record as new record and do aggregation for every store. Overwrite is much efficient than upsert.

upvoted 1 times

...

Ati1362

1 year ago

Selected Answer: C

I will go with c. upsert

upvoted 3 times

...

MDWPartners

1 year, 1 month ago

Selected Answer: C

A is not correct because the table is daily. If you overwrite you delete all history. You need to insert/update to keep history.

upvoted 4 times

fe3b2fc

10 months, 3 weeks ago

Incorrect. The daily store sales table contains all of the history needed to update the table. The summary table holds no historical records. Seeing as this is a nightly job, any manual changes made to daily store sales will be captured. A is the correct answer.

upvoted 2 times

...

ThoBustos

1 year, 2 months ago

Selected Answer: A

Not sure if that's right but I would go for A. What do you think? Type1: Data is overwritten Type 2: History is maintained, new data is inserted as new rows Type 3: Stores two versions per record: a previous and a current value A. batch + overwrite -> Match Type 1 requirements. YES B: batch + append new rows -> Would be for type 2. NO C. Batch + Upsert -> Data is not being overwritten (which is required for Type 1). NO D. ReadStream + Upsert -> Data is not being overwritten (which is required for Type 1). NO E. Change Data Feed to update -> Problem is manual edits + not overwriting (required for type 1). No I have doubts around "which approach is the safest". Maybe because due to some manual changes it is hard to track changes or do upsert, so to make sure that the stats are right overwriting is safer.

upvoted 4 times

...

vikram12apr

1 year, 3 months ago

Selected Answer: C

Not A because overwriting will only provide a daily based data not the history of it. Not B because it will not fix the issue of incorrect sales amount As these data are fit for natch processing so neither D or E. C will only upsert the changes while making sure we are updating the records based on sales_date & store_id

upvoted 2 times

...

Rinscy

1 year, 5 months ago

E definitely because it say that the total_sales column may be change by manual auditing so not via a job, so streaming with CDF is the only option here !

upvoted 1 times

...

Somesh512

1 year, 5 months ago

Selected Answer: A

I would go with Option A. Because it has manual auditing hence values can change. Uses type 1 hence replace original data

upvoted 3 times

...

spaceexplorer

1 year, 5 months ago

Selected Answer: E

It should be E, as structure streaming has built-in fault-tolerance feature.

upvoted 1 times

...

Rinscy

1 year, 5 months ago

It said type 1 so A is the correct answer !

upvoted 2 times

...

divingbell17

1 year, 6 months ago

The question is unclear whether the aggregated table needs to support a rolling history. Note the aggregated table does not have a date column to distinguish which date the summary is generated for so one could assume the table is maintained only for the current snapshot. Assuming the above - A would be the safest option as all stores and aggregates would need to be refreshed nightly

upvoted 2 times

...

Load full discussion...