Exam Certified Data Engineer Professional topic 1 question 30 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 30
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: A

Comments

AzureDE2522
Highly Voted 1 year, 7 months ago
Selected Answer: D
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 13 times
arekm
5 months, 2 weeks ago
Answer D would require specifying the start and (optionally) the end version for reading data from CDF. So D does not seem to be correct.
upvoted 2 times
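For reference, a minimal PySpark sketch of the batch change-data-feed read that option D hints at, assuming CDF were enabled on the bronze table; the startingVersion value is purely illustrative.

# Sketch only: batch CDF read of the bronze table. Assumes delta.enableChangeDataFeed
# is already set on the table; the startingVersion value is illustrative.
changes = (spark.read
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)   # batch CDF reads expect a start version or timestamp
           .table("bronze"))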
...
shaojunni
8 months, 1 week ago
Change data feed (delta.enableChangeDataFeed) is disabled by default, so readChangeFeed would not work here unless it had been enabled on the table.
upvoted 2 times
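For context, a hedged sketch of how change data feed could be enabled on the table; nothing in the question's ingest code shows this being done, which is the point of the comment above.

# Hypothetical: turning on change data feed for the existing bronze table.
# Nothing in the question indicates this property was ever set.
spark.sql("ALTER TABLE bronze SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")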
...
t_d_v
10 months ago
There is no stream in option D
upvoted 3 times
GHill1982
9 months, 3 weeks ago
You can read Delta Lake Change Data Feed without using a stream. You can use batch queries to read the change data feed by setting the readChangeFeed option to true.
upvoted 2 times
arekm
5 months, 2 weeks ago
CDF without a stream requires a starting version at the minimum.
upvoted 1 times
...
...
...
...
Laraujo2022
Highly Voted 1 year, 7 months ago
In my opinion E is not correct because we do not see parameters passed into the function (year, month and day)... the function signature is def new_records():
upvoted 9 times
...
BryOs
Most Recent 2 weeks, 2 days ago
Selected Answer: A
It's option A because it will return the latest changes even if CDF isn't enabled on the table. Option D would fail if CDF isn't enabled, and the question doesn't indicate whether CDF is active.
upvoted 1 times
...
KadELbied
1 month, 1 week ago
Selected Answer: A
Surely A.
upvoted 1 times
...
AlHerd
2 months, 3 weeks ago
Selected Answer: A
Option A is best because it creates a streaming source that reads only new appended data from the "bronze" table incrementally. Even if ingestion is done in batch, using spark.readStream.table("bronze") lets downstream processing treat the table as a live data stream.
upvoted 2 times
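As a minimal sketch of how the streaming handle from option A might be consumed downstream; the silver table name, checkpoint path, and availableNow trigger are illustrative assumptions, not part of the question.

# Sketch: consume the stream returned by new_records() incrementally.
# Target table, checkpoint path and trigger are hypothetical.
def new_records():
    return spark.readStream.table("bronze")

(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/silver")  # hypothetical path
    .trigger(availableNow=True)  # process the backlog once, then stop (recent runtimes)
    .toTable("silver"))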
...
Tedet
3 months, 2 weeks ago
Selected Answer: D
This is the best option for Delta Lake because it uses the readChangeFeed option, which is specifically designed to read only the new changes (inserts, updates, or deletions) since the last read. That is exactly what is needed when you want to handle new records that have not yet been processed, so only records that are new or changed since the last read are returned.
upvoted 1 times
...
asdsadasdas
4 months ago
Selected Answer: A
"manipulate new records that have *not yet been processed* to the next table " readstream can incrementally pick data yet to be processed. with D the issue is spark.read it will read the entire table
upvoted 1 times
asdsadasdas
4 months ago
Batch (read): reads all available CDF history starting from the earliest retained version, so it may load too much data or fail if old versions have been deleted. Streaming (readStream): starts from the latest version unless a checkpoint exists.
upvoted 1 times
...
...
shaswat1404
4 months, 1 week ago
Selected Answer: E
Options A and B assume streaming ingestion, but the ingestion is in batch mode. In option C, current_timestamp is dynamic and changes every time the query is executed, so it won't correctly filter the records ingested in the last batch. Option D only works if delta.enableChangeDataFeed = true was set on the table before the ingestion (it is disabled by default, and the given query does not set it), so that option is invalid. Option E is correct: it filters on the file path to retrieve only data from the latest ingestion, and the source_file column was created specifically for this purpose, ensuring the function returns only new records.
upvoted 2 times
...
arekm
5 months, 2 weeks ago
Selected Answer: A
You can read data from the Delta table using Structured Streaming, with two options: without CDF, you only process new rows (no updates or deletes); with CDF, you get all changes to the data, i.e. inserts, updates, and deletes. Answer A uses the first option. However, the question talks about "new records", so using streaming for new records is fine. Answer A is correct.
upvoted 2 times
arekm
5 months, 2 weeks ago
At first I thought of answer D. However, after checking the docs I learned that a starting version is required when reading from CDF using the batch pattern.
upvoted 1 times
...
...
sgerin
6 months ago
Selected Answer: E
New records will be filtered for D /
upvoted 1 times
...
temple1305
6 months ago
Selected Answer: D
New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
upvoted 1 times
...
AlejandroU
6 months ago
Selected Answer: A
Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
upvoted 1 times
...
Thameur01
6 months, 2 weeks ago
Selected Answer: E
Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
upvoted 1 times
...
benni_ale
6 months, 3 weeks ago
Selected Answer: E
I tried it myself, but none of them really works.
upvoted 1 times
...
cbj
8 months ago
Selected Answer: A
The others can't ensure that unprocessed data gets picked up. For example, if the code doesn't run for one day and runs the next day, C or E will miss processing that day's data.
upvoted 3 times
...
shaojunni
8 months, 1 week ago
Selected Answer: A
since "bronze" table is a delta table, readStream() only returns new data.
upvoted 4 times
...
pk07
8 months, 1 week ago
Selected Answer: E
If the job runs only once per day, then option E could indeed be a valid and effective solution. Here's why:
Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records that were ingested from the current day's batch file.
Simplicity: this approach is straightforward and doesn't require maintaining additional state (like a last-processed version or timestamp).
Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
upvoted 2 times
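Below is a hedged reconstruction of what option E appears to describe, based only on the filter quoted in this comment; the source_file column, the /mnt/daily_batch path layout, the zero-padding, and the trailing % wildcard are assumptions, and spark is taken to be the predefined Databricks session.

from datetime import date
from pyspark.sql.functions import col

# Reconstruction of option E as quoted above; the date parts are derived from
# today's date only to make the sketch self-contained.
today = date.today()
year, month, day = today.year, f"{today.month:02d}", f"{today.day:02d}"

def new_records():
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}%")))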
...
Community vote distribution: A (35%), C (25%), B (20%), Other