
Exam Certified Data Engineer Professional topic 1 question 30 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 30
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: A
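For context, here is a minimal sketch of how the suggested answer (A) would complete the function and feed the next table in the pipeline. It assumes a Databricks/Spark session; the silver table name, checkpoint path, and trigger choice are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, spark already exists

def new_records():
    # Option A: an incremental (streaming) view of the bronze Delta table.
    # Structured Streaming tracks what has already been processed via the checkpoint.
    return spark.readStream.table("bronze")

# Hypothetical downstream step: write only unprocessed bronze records to the next table.
(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")  # assumed path
    .trigger(availableNow=True)  # process whatever is new, then stop; fits a nightly job
    .toTable("silver"))  # assumed target table name

Because the reader is a stream, repeated runs pick up only the records appended to bronze since the last successful run recorded in the checkpoint.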

Comments

AzureDE2522
Highly Voted 1 year, 5 months ago
Selected Answer: D
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 13 times
arekm
4 months ago
Answer D would require specifying the start and (optionally) the end version for reading data from CDF. So D does not seem to be correct.
upvoted 2 times
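To illustrate the point about the starting version, here is a minimal sketch of the batch change-data-feed read that option D would need; it assumes CDF (delta.enableChangeDataFeed) is enabled on the table, and the version number is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch CDF read: a startingVersion (or startingTimestamp) is required, which option D omits.
changes = (spark.read
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)  # placeholder; startingTimestamp would also work
    .table("bronze"))

# The CDF output adds _change_type, _commit_version and _commit_timestamp columns.
new_rows = changes.filter("_change_type = 'insert'")

A streaming CDF read behaves differently: without a starting version it begins from the latest snapshot, as noted in the comment above.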
shaojunni
6 months, 3 weeks ago
Change data feed (delta.enableChangeDataFeed) is disabled by default on the table.
upvoted 2 times
t_d_v
8 months, 2 weeks ago
There is no stream in option D
upvoted 3 times
GHill1982
8 months, 1 week ago
You can read Delta Lake Change Data Feed without using a stream. You can use batch queries to read the change data feed by setting the readChangeFeed option to true.
upvoted 2 times
arekm
4 months ago
CDF without a stream requires a starting version at the minimum.
upvoted 1 times
Laraujo2022
Highly Voted 1 year, 5 months ago
In my opinion E is not correct because no parameters (year, month, and day) are passed to the function... the function signature is def new_records():
upvoted 9 times
AlHerd
Most Recent 1 month, 1 week ago
Selected Answer: A
Option A is best because it creates a streaming source that reads only new appended data from the "bronze" table incrementally. Even if ingestion is done in batch, using spark.readStream.table("bronze") lets downstream processing treat the table as a live data stream.
upvoted 2 times
Tedet
2 months ago
Selected Answer: D
Explanation: this is the best option for Delta Lake, as it uses the readChangeFeed option. That option is specifically designed to read only the new changes (inserts, updates, or deletes) since the last read, which is exactly what is needed when you want to handle new records that have not yet been processed. Conclusion: this is the correct choice, as it ensures that only new records are read.
upvoted 1 times
asdsadasdas
2 months, 2 weeks ago
Selected Answer: A
"manipulate new records that have *not yet been processed* to the next table " readstream can incrementally pick data yet to be processed. with D the issue is spark.read it will read the entire table
upvoted 1 times
asdsadasdas
2 months, 2 weeks ago
Batch (read): reads all available CDF history starting from the earliest retained version; may load too much data or fail if old versions are deleted. Streaming (readStream): starts from the latest version unless a checkpoint exists.
upvoted 1 times
shaswat1404
2 months, 3 weeks ago
Selected Answer: E
Options A and B assume streaming ingestion, but ingestion is in batch mode. In option C, current_timestamp is used, which is dynamic and changes every time the query is executed, so it won't correctly filter the records ingested in the last batch. Option D only works if delta.enableChangeDataFeed = true was set on the table before the ingestion (it is disabled by default and the given query does not set it), so this option is invalid. Option E is correct, as it filters on the file path to retrieve only the data from the latest ingestion; the source_file column was created specifically for this purpose, ensuring the function returns only new records.
upvoted 2 times
arekm
4 months ago
Selected Answer: A
You can read data from the Delta table using structured streaming. You have two options: without CDF, you only process new rows (no updates or deletes); with CDF, you get all changes to the data, i.e. insert, update, delete. Answer A uses the first option. The question talks about "new records", so using streaming for new records is OK. Answer A is correct.
upvoted 2 times
arekm
4 months ago
At first I thought of answer D. However, after checking the docs I learned that a starting version is a must when reading from CDF using the batch pattern.
upvoted 1 times
sgerin
4 months, 2 weeks ago
Selected Answer: E
New records will be filtered for D /
upvoted 1 times
temple1305
4 months, 2 weeks ago
Selected Answer: D
New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
upvoted 1 times
AlejandroU
4 months, 2 weeks ago
Selected Answer: A
Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
upvoted 1 times
Thameur01
5 months ago
Selected Answer: E
Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
upvoted 1 times
benni_ale
5 months ago
Selected Answer: E
I tried them myself, but none of them really works.
upvoted 1 times
cbj
6 months, 2 weeks ago
Selected Answer: A
The other options can't ensure that unprocessed data is picked up, e.g. if the code doesn't run for one day and runs the next day, C or E will miss one day's data.
upvoted 2 times
shaojunni
6 months, 3 weeks ago
Selected Answer: A
since "bronze" table is a delta table, readStream() only returns new data.
upvoted 4 times
pk07
6 months, 3 weeks ago
Selected Answer: E
If the job runs only once per day, then option E could indeed be a valid and effective solution. Here's why:
Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records that were ingested from the current day's batch file.
Simplicity: this approach is straightforward and doesn't require maintaining additional state (like the last processed version or timestamp).
Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
upvoted 2 times
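As a rough sketch of the approach described here: the option E text is an image in the original question, so the source_file column, the path pattern, and the year/month/day values below are assumptions taken from the comments; a trailing % wildcard is added so that like() matches full file paths under the day's directory (the issue AndreFR raises below).

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Assumed batch parameters; in the real pipeline these would come from the nightly job.
year, month, day = 2024, 7, 14

def new_records():
    # Batch read of bronze, keeping only rows whose source_file falls under today's batch directory.
    # This only works if exactly one unprocessed batch lands per day and the job never skips a day.
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}%")))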
AndreFR
7 months, 1 week ago
Selected Answer: A
A is correct by elimination, as stated by Alaverdi in another comment: it reads the Delta table as a stream and processes only newly arrived records.
B is excluded because of incorrect syntax.
C is excluded: it will give an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
D is excluded because of a syntax error; it should be: spark.read.option("readChangeFeed", "true").option("startingVersion", 1).table("bronze")
E is excluded: it will give an empty result, because "source_file" gives a filename, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name.
upvoted 7 times
t_d_v
8 months, 2 weeks ago
Selected Answer: C
Actually it's hard to choose between C and E, as both are a bit incorrect:
Option E seems like it will give an empty result, as a file name is compared with a folder name.
Option C seems like it will give an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp. On the other hand, if the new_records method had an ingestion-time param, then the task would be obvious. Also considering that the very first line imports current_timestamp, let me say it's C :))
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other