Exam Certified Data Engineer Professional All Questions

View all questions & answers for the Certified Data Engineer Professional exam

Exam Certified Data Engineer Professional topic 1 question 13 discussion

Actual exam question from Databricks's Certified Data Engineer Professional

Question #: 13
Topic #: 1

[All Certified Data Engineer Professional Questions]

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?

A. Create a separate history table for each pk_id resolve the current state of the table by running a union all filtering the history tables for the most recent state.
B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.

Show Suggested Answer

Suggested Answer: E 🗳️

by Eertyy at Aug. 27, 2023, 6:06 p.m.

Comments

Submit Cancel

Eertyy

Highly Voted 1 year, 10 months ago

The answer given is correct

upvoted 10 times

fabiospont

5 months, 1 week ago

Option D

upvoted 2 times

...

Eertyy

1 year, 9 months ago

I want to correct my response.It seems the right answer Option D, it leverages Delta Lake's built-in capabilities for handling CDC data. It is designed to efficiently capture, process, and propagate changes, making it a more robust and scalable solution, particularly for large-scale data scenarios with frequent updates and auditing requirements.

upvoted 3 times

arekm

6 months, 2 weeks ago

The question names 2 requirements to keep the data - archival with all records - querying with only the currently valid values CDF is not designed as a permanent storage for archival purposes. It keeps the data to propagate it to downstream applications / workloads. CDF is also purged with the vacuum command, so this would make a very unreliable archival. Medallion architecture that Databricks promotes seems to be a clear winner.

upvoted 1 times

...

Sriramiyer92

6 months, 2 weeks ago

Actually you were correct in the first go. If you are dealing with small amount of data to changes, CDC is it way to go. But not in this case. (Keywords: For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system.) Answer is E in that case.

upvoted 1 times

...

Starvosxant

1 year, 9 months ago

Databricks is NOT able to process CDC alone. It needs a intermediare Tool to make it on an object storage and then ingest it. So how can be D?

upvoted 5 times

...

Load full discussion...

...

imatheushenrique

Highly Voted 1 year, 1 month ago

E. This is the correct answer because it meets the requirements of maintaining a full record of all values that have ever been valid in the source system and recreating the current table state with only the most recent value for each record. The code ingests all log information into a bronze table, which preserves the raw CDC data as it is. Then, it uses merge into to perform an upsert operation on a silver table, which means it will insert new records or update or delete existing records based on the change type and the pk_id columns. This way, the silver table will always reflect the current state of the source table, while the bronze table will keep the history of all changes.

upvoted 6 times

...

KadELbied

Most Recent 2 months, 1 week ago

Selected Answer: E

Suretly E

upvoted 1 times

...

Tedet

4 months, 2 weeks ago

Selected Answer: E

Bronze Table (Raw Ingest): You start by ingesting all the change data capture (CDC) records into a bronze table. Silver Table (Processed State): The silver table represents the most recent state of the data. You would use the MERGE INTO command to process the changes from the bronze table and update the silver table accordingly. Audit Trail: Since you're ingesting all the data into the bronze table, you maintain a full history of changes that have occurred over time, which satisfies the auditing requirement.

upvoted 2 times

...

fabiospont

5 months, 1 week ago

D is the option more indicated. CDC is the most effeciently.

upvoted 1 times

...

mwynn

6 months, 1 week ago

Selected Answer: E

If getting External CDC Data (Kafka, etc) no need for CDF! Just ingest to Bronze with (pipelines.reset.allowed = false)

upvoted 1 times

...

arekm

6 months, 2 weeks ago

Selected Answer: E

upvoted 1 times

arekm

6 months, 2 weeks ago

On top of that - CDC with CDF is not automatic. You still need SQL or Python to read the changes and put them somewhere.

upvoted 1 times

...

benni_ale

8 months ago

Selected Answer: E

You can only read the change data feed for enabled tables. You must explicitly enable the change data feed option using one of the following methods: TBLPROPERTIES (delta.enableChangeDataFeed = true) . this means it is a delta feature or in other words it is a feature supported by delta tables. the data to process in the question is external so it is not a delta table => can't be B... Hopefully I am correct.

upvoted 1 times

benni_ale

8 months ago

i meant can't be D

upvoted 1 times

...

benni_ale

9 months ago

Selected Answer: E

E . databricks cdc is not set to process external cdc. if u have external cdc u could send to bronze for auditing purposes and use bronze to get silver where u have only valid records

upvoted 1 times

...

databrick_work

9 months, 4 weeks ago

E is correct

upvoted 1 times

...

spaceexplorer

1 year, 5 months ago

Selected Answer: E

The answer is E

upvoted 2 times

...

RafaelCFC

1 year, 6 months ago

Selected Answer: E

Complimenting kz_data's response, be aware that the data that is being consumed is not a Databrick's CDC data feed object, but rather, CDC coming from somewhere else, that is, just regular data. So, indeed, it can't be processed without another tool.

upvoted 1 times

...

kz_data

1 year, 6 months ago

Selected Answer: E

Answer E is correct, as the CDC captured from the external database may contain duplicates for the same pk_id (key) due to multiple updates within the processed hour, we need to take the most recent update for the pk_id, and then MERGE into a silver table.

upvoted 2 times

...

a560fe1

1 year, 6 months ago

CDF captures changes only from a Delta table and is only forward-looking once enabled. The CDC logs are writing to object storage. So you would need to ingestion those and merge into downstream tables, hence the answer is E

upvoted 2 times

...

hamzaKhribi

1 year, 7 months ago

Selected Answer: D

For me the answer is D, the question states that CDC logs are emitted on an external storage meaning it can be ingested into the bronze layer on a table with CDF enabled. In this case we let databricks handle the complexity of following changes and only worry about data quality. meaning with CDF enabled databricks will already work the audit data for us with the table_changes of the pre-image and post-image and also give us the last updated value for our use case. here is a similar example: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html

upvoted 3 times

spaceexplorer

1 year, 5 months ago

This article shows exactly why D is not right. Since "CDF captures changes only from a Delta table and is only forward-looking once enabled."

upvoted 4 times

...

sturcu

1 year, 9 months ago

Selected Answer: E

E is correct

upvoted 3 times

...