
Exam Certified Data Engineer Professional topic 1 question 30 discussion

Actual exam question from Databricks's Certified Data Engineer Professional
Question #: 30
Topic #: 1

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: A
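For context, here is a minimal sketch of how the suggested answer (A) would complete the function and feed the next table in the pipeline. It assumes a Databricks/Spark session; the silver table name, checkpoint path, and trigger choice are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, spark already exists

def new_records():
    # Option A: an incremental (streaming) view of the bronze Delta table.
    # Structured Streaming tracks what has already been processed via the checkpoint.
    return spark.readStream.table("bronze")

# Hypothetical downstream step: write only unprocessed bronze records to the next table.
(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze_to_silver")  # assumed path
    .trigger(availableNow=True)  # process whatever is new, then stop; fits a nightly job
    .toTable("silver"))  # assumed target table name

Because the reader is a stream, repeated runs pick up only the records appended to bronze since the last successful run recorded in the checkpoint.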

Comments

AzureDE2522
Highly Voted 1 year, 5 months ago
Selected Answer: D
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 13 times
arekm
4 months ago
Answer D would require specifying the start and (optionally) the end version for reading data from CDF. So D does not seem to be correct.
upvoted 2 times
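To illustrate the point about the starting version, here is a minimal sketch of the batch change-data-feed read that option D would need; it assumes CDF (delta.enableChangeDataFeed) is enabled on the table, and the version number is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch CDF read: a startingVersion (or startingTimestamp) is required, which option D omits.
changes = (spark.read
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)  # placeholder; startingTimestamp would also work
    .table("bronze"))

# The CDF output adds _change_type, _commit_version and _commit_timestamp columns.
new_rows = changes.filter("_change_type = 'insert'")

A streaming CDF read behaves differently: without a starting version it begins from the latest snapshot, as noted in the comment above.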
shaojunni
6 months, 3 weeks ago
Change data feed (delta.enableChangeDataFeed) is disabled by default on the table.
upvoted 2 times
t_d_v
8 months, 2 weeks ago
There is no stream in option D
upvoted 3 times
GHill1982
8 months, 1 week ago
You can read Delta Lake Change Data Feed without using a stream. You can use batch queries to read the change data feed by setting the readChangeFeed option to true.
upvoted 2 times
arekm
4 months ago
CDF without a stream requires a starting version at the minimum.
upvoted 1 times
Laraujo2022
Highly Voted 1 year, 5 months ago
In my opinion E is not correct because no parameters (year, month, and day) are passed to the function... the function signature is def new_records():
upvoted 9 times
AlHerd
Most Recent 1 month, 1 week ago
Selected Answer: A
Option A is best because it creates a streaming source that reads only new appended data from the "bronze" table incrementally. Even if ingestion is done in batch, using spark.readStream.table("bronze") lets downstream processing treat the table as a live data stream.
upvoted 2 times
Tedet
2 months ago
Selected Answer: D
Explanation: this is the best option for Delta Lake, as it uses the readChangeFeed option. That option is specifically designed to read only the new changes (inserts, updates, or deletes) since the last read, which is exactly what is needed when you want to handle new records that have not yet been processed. Conclusion: this is the correct choice, as it ensures that only new records are read.
upvoted 1 times
asdsadasdas
2 months, 2 weeks ago
Selected Answer: A
"manipulate new records that have *not yet been processed* to the next table " readstream can incrementally pick data yet to be processed. with D the issue is spark.read it will read the entire table
upvoted 1 times
asdsadasdas
2 months, 2 weeks ago
Batch (read): reads all available CDF history starting from the earliest retained version; may load too much data or fail if old versions are deleted. Streaming (readStream): starts from the latest version unless a checkpoint exists.
upvoted 1 times
shaswat1404
2 months, 3 weeks ago
Selected Answer: E
Options A and B assume streaming ingestion, but ingestion is in batch mode. In option C, current_timestamp is used, which is dynamic and changes every time the query is executed, so it won't correctly filter the records ingested in the last batch. Option D only works if delta.enableChangeDataFeed = true was set on the table before the ingestion (it is disabled by default and the given query does not set it), so this option is invalid. Option E is correct, as it filters on the file path to retrieve only the data from the latest ingestion; the source_file column was created specifically for this purpose, ensuring the function returns only new records.
upvoted 2 times
arekm
4 months ago
Selected Answer: A
You can read data from the Delta table using structured streaming. You have two options: without CDF, you only process new rows (no updates or deletes); with CDF, you get all changes to the data, i.e. insert, update, delete. Answer A uses the first option. The question talks about "new records", so using streaming for new records is OK. Answer A is correct.
upvoted 2 times
arekm
4 months ago
At first I thought of answer D. However, after checking the docs I learned that a starting version is a must when reading from CDF using the batch pattern.
upvoted 1 times
sgerin
4 months, 2 weeks ago
Selected Answer: E
New records will be filtered for D /
upvoted 1 times
temple1305
4 months, 2 weeks ago
Selected Answer: D
New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
upvoted 1 times
AlejandroU
4 months, 2 weeks ago
Selected Answer: A
Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
upvoted 1 times
Thameur01
5 months ago
Selected Answer: E
Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
upvoted 1 times
benni_ale
5 months ago
Selected Answer: E
I tried them myself, but none of them really works.
upvoted 1 times
cbj
6 months, 2 weeks ago
Selected Answer: A
The other options can't ensure that unprocessed data is picked up, e.g. if the code doesn't run for one day and runs the next day, C or E will miss one day's data.
upvoted 2 times
shaojunni
6 months, 3 weeks ago
Selected Answer: A
since "bronze" table is a delta table, readStream() only returns new data.
upvoted 4 times
pk07
6 months, 3 weeks ago
Selected Answer: E
If the job runs only once per day, then option E could indeed be a valid and effective solution. Here's why:
Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records that were ingested from the current day's batch file.
Simplicity: this approach is straightforward and doesn't require maintaining additional state (like the last processed version or timestamp).
Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
upvoted 2 times
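As a rough sketch of the approach described here: the option E text is an image in the original question, so the source_file column, the path pattern, and the year/month/day values below are assumptions taken from the comments; a trailing % wildcard is added so that like() matches full file paths under the day's directory (the issue AndreFR raises below).

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Assumed batch parameters; in the real pipeline these would come from the nightly job.
year, month, day = 2024, 7, 14

def new_records():
    # Batch read of bronze, keeping only rows whose source_file falls under today's batch directory.
    # This only works if exactly one unprocessed batch lands per day and the job never skips a day.
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}%")))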
AndreFR
7 months, 1 week ago
Selected Answer: A
A is correct by elimination, as stated by Alaverdi in another comment: it reads the Delta table as a stream and processes only newly arrived records.
B is excluded because of incorrect syntax.
C is excluded: it will give an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp.
D is excluded because of a syntax error; it should be: spark.read.option("readChangeFeed", "true").option("startingVersion", 1).table("bronze")
E is excluded: it will give an empty result, because "source_file" gives a filename, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name.
upvoted 7 times
t_d_v
8 months, 2 weeks ago
Selected Answer: C
Actually it's hard to choose between C and E, as both are a bit incorrect:
Option E seems like it will give an empty result, as a file name is compared with a folder name.
Option C seems like it will give an empty result, as the ingestion time (which comes as a param in the other method) is compared with the current timestamp. On the other hand, if the new_records method had an ingestion-time param, then the task would be obvious. Also considering that the very first line imports current_timestamp, let me say it's C :))
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other