Exam Certified Data Engineer Professional topic 1 question 13 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 13
Topic #: 1
[All Certified Data Engineer Professional Questions]

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?

  • A. Create a separate history table for each pk_id; resolve the current state of the table by running a union all and filtering the history tables for the most recent state.
  • B. Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
  • C. Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
  • D. Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
  • E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
Suggested Answer: E
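The pattern option E describes can be sketched in plain Python (not Spark, which the real solution would use) to show the two-layer logic: the bronze table keeps every raw CDC record for audit, while a MERGE-style step collapses the hour's changes to the latest one per pk_id and applies it to the silver "current state" table. The record fields (pk_id, change_type, values, seq) are assumptions for illustration, not the exact schema from the question.

```python
def latest_change_per_key(cdc_batch):
    """Keep only the most recent change for each pk_id.

    'seq' is an assumed monotonically increasing sequence number;
    in practice this could be a commit timestamp or log offset.
    """
    latest = {}
    for rec in cdc_batch:
        pk, seq = rec["pk_id"], rec["seq"]
        if pk not in latest or seq > latest[pk]["seq"]:
            latest[pk] = rec
    return latest


def merge_into_silver(silver, cdc_batch):
    """Mirror what MERGE INTO does against the silver table:
    upsert on insert/update, remove on delete."""
    for pk, rec in latest_change_per_key(cdc_batch).items():
        if rec["change_type"] == "delete":
            silver.pop(pk, None)
        else:  # insert or update: take the post-change values
            silver[pk] = rec["values"]
    return silver


# bronze keeps the full audit history; silver holds only current state
bronze = []
silver = {}

hourly_batch = [
    {"pk_id": 1, "change_type": "insert", "values": {"name": "a"}, "seq": 1},
    {"pk_id": 1, "change_type": "update", "values": {"name": "b"}, "seq": 2},
    {"pk_id": 2, "change_type": "insert", "values": {"name": "c"}, "seq": 3},
    {"pk_id": 2, "change_type": "delete", "values": None, "seq": 4},
]
bronze.extend(hourly_batch)      # every change is retained for auditing
merge_into_silver(silver, hourly_batch)
```

After the batch, bronze holds all four change records (satisfying the audit requirement), while silver holds only pk_id 1 with its latest values, since pk_id 2 was deleted within the hour.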


Highly Voted 3 months, 2 weeks ago
The answer given is correct
upvoted 6 times
2 months, 3 weeks ago
I want to correct my response. It seems the right answer is Option D: it leverages Delta Lake's built-in capabilities for handling CDC data. It is designed to efficiently capture, process, and propagate changes, making it a more robust and scalable solution, particularly for large-scale data scenarios with frequent updates and auditing requirements.
upvoted 1 times
2 months ago
Option D states "process CDC data from an external system," so this refers to Delta CDF.
upvoted 1 times
2 months ago
Databricks is NOT able to process CDC alone. It needs an intermediary tool to land the data on object storage and then ingest it. So how can it be D?
upvoted 1 times
Most Recent 1 week, 2 days ago
Selected Answer: D
For me the answer is D. The question states that CDC logs are emitted to external storage, meaning they can be ingested into the bronze layer on a table with CDF enabled. In this case we let Databricks handle the complexity of tracking changes and only worry about data quality. With CDF enabled, Databricks will already produce the audit data for us via table_changes (the pre-image and post-image) and also give us the last updated value for our use case. Here is a similar example: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
upvoted 1 times
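The comment above refers to the row shape Delta Lake's change data feed exposes through table_changes: an update surfaces as a preimage/postimage pair tagged with _change_type and a commit version. A minimal plain-Python sketch of that shape (the helper function is hypothetical, used only to illustrate the output format):

```python
def table_changes_for_update(pk_id, before, after, version):
    """Hypothetical helper: build CDF-style rows for a single update.

    Delta CDF tags update rows with _change_type values
    'update_preimage' (state before) and 'update_postimage' (state after),
    plus the commit version they belong to.
    """
    return [
        {"pk_id": pk_id, **before,
         "_change_type": "update_preimage", "_commit_version": version},
        {"pk_id": pk_id, **after,
         "_change_type": "update_postimage", "_commit_version": version},
    ]


# One update to pk_id 1 yields two rows: the old and the new values.
rows = table_changes_for_update(1, {"name": "a"}, {"name": "b"}, version=7)
```

Reading both rows gives the audit trail (what the value was and what it became), while filtering to the postimage rows of the latest commit gives the current state the analysts need.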
2 months ago
Selected Answer: E
E is correct
upvoted 2 times
Community vote distribution: A (35%), C (25%), B (20%)