Exam Professional Data Engineer topic 1 question 5 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 5
Topic #: 1

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

  • A. Use federated data sources, and check data in the SQL query.
  • B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
  • C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
  • D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Suggested Answer: D 🗳️
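To make answer D concrete, here is a minimal sketch of such a pipeline using the Apache Beam Python SDK, which Dataflow runs: rows that parse cleanly are loaded into the main BigQuery table, and rows that fail parsing are routed through a tagged side output to a dead-letter table along with the error message. The project, bucket, and table names and the two-column schema are hypothetical placeholders, not part of the exam question.

import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

VALID = "valid"
DEAD_LETTER = "dead_letter"

class ParseCsvRow(beam.DoFn):
    """Parse one CSV line; tag it as valid or route it to the dead-letter output."""

    def process(self, line):
        try:
            # Hypothetical two-column layout: name,value
            name, value = next(csv.reader([line]))
            yield beam.pvalue.TaggedOutput(VALID, {"name": name, "value": int(value)})
        except Exception as err:
            # Corrupted or mis-formatted rows are kept, together with the error, for analysis.
            yield beam.pvalue.TaggedOutput(DEAD_LETTER, {"raw_line": line, "error": str(err)})

def run():
    options = PipelineOptions(
        runner="DataflowRunner",  # use "DirectRunner" for local testing
        project="my-project",     # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        parsed = (
            pipeline
            | "ReadCsvFromGcs" >> beam.io.ReadFromText("gs://my-bucket/daily-dump/*.csv")
            | "ParseRows" >> beam.ParDo(ParseCsvRow()).with_outputs(VALID, DEAD_LETTER)
        )
        parsed[VALID] | "WriteValidRows" >> beam.io.WriteToBigQuery(
            "my-project:analytics.daily_data",
            schema="name:STRING,value:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
        parsed[DEAD_LETTER] | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:analytics.daily_data_errors",
            schema="raw_line:STRING,error:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )

if __name__ == "__main__":
    run()

The key design choice is that bad input is never dropped silently: it lands in a queryable table where it can be inspected and, if needed, reprocessed.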

Comments

[Removed]
Highly Voted 4 years, 1 month ago
Agreed: D
upvoted 25 times
...
Radhika7983
Highly Voted 3 years, 5 months ago
The answer is D. An ETL pipeline should be implemented for this scenario. Check out "Handling invalid inputs in Cloud Dataflow" ("ParDos . . . and don'ts: handling invalid inputs in Dataflow using Side Outputs as a 'Dead Letter' file"): https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
upvoted 13 times
jkhong
1 year, 4 months ago
The source you've provided can no longer be accessed. Here is an updated best practice: https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#use_dead_letter_queues
upvoted 5 times
...
...
RT_G
Most Recent 5 months, 2 weeks ago
Selected Answer: D
All other options only alert on bad data or make the load error out. As the question requires, option D sends bad data to a dead-letter table for further analysis while valid data is loaded into the main table.
upvoted 1 times
...
rocky48
5 months, 3 weeks ago
Selected Answer: D
Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities, so you would have to do it in the SQL query, which could slow down performance. Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline; it can't handle corrupted or incorrectly formatted data. Option C is incorrect because loading via the CLI with max_bad_records set to 0 gives you no way to keep the corrupted or incorrectly formatted rows for analysis. The answer is D: run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
upvoted 1 times
...
rtcpost
6 months ago
Selected Answer: D
Google Cloud Dataflow allows you to create a data pipeline that can preprocess and transform data before loading it into BigQuery. This approach will enable you to handle problematic rows, push them to a dead-letter table for later analysis, and load the valid data into BigQuery. Option A (using federated data sources and checking data in the SQL query) can be used but doesn't directly address the issue of handling corrupted or incorrectly formatted rows. Options B and C are not the best choices for handling data quality and error issues. Enabling monitoring and setting max_bad_records to 0 in BigQuery may help identify errors but won't store the problematic rows for further analysis, and it might prevent loading any data with issues, which may not be ideal.
upvoted 1 times
...
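As rtcpost notes above, option A can be made to work, but all validation ends up in the SQL. For illustration only, here is a minimal sketch of that approach with the BigQuery Python client: a temporary external (federated) table is defined over the GCS files and bad values are filtered with SAFE_CAST in the query. All names are hypothetical, and note the limitation: rejected rows are simply filtered out rather than kept anywhere for analysis, and every query has to repeat the checks.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Federated source: BigQuery reads the CSV files directly from GCS at query time.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/daily-dump/*.csv"]
external_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    # Read the numeric column as STRING so a bad value shows up as a failed cast
    # instead of failing the whole scan.
    bigquery.SchemaField("value", "STRING"),
]
# Rows with the wrong number of columns still abort the query unless this is raised.
external_config.max_bad_records = 1000

job_config = bigquery.QueryJobConfig(table_definitions={"daily_dump": external_config})

# The "check data in the SQL query" part: SAFE_CAST turns malformed values into NULLs,
# which are then filtered out; the bad rows are lost rather than stored for analysis.
query = """
    SELECT name, SAFE_CAST(value AS INT64) AS value
    FROM daily_dump
    WHERE SAFE_CAST(value AS INT64) IS NOT NULL
"""
for row in client.query(query, job_config=job_config).result():
    print(row.name, row.value)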
NeoNitin
7 months, 1 week ago
Ans: D. Thank you ExamTopics. Connect with me if you need any help: [email protected]
upvoted 1 times
...
NeoNitin
7 months, 3 weeks ago
D. Thank you ExamTopics: I passed the exam in August, and ExamTopics helped me a lot. Topic 1 is enough for the exam. Just last week I received the welcome kit from Google for the PDE exam, including a Google Cloud cup. If you need any help, reach out to me: neonitin6attherategoogledotcom
upvoted 1 times
...
vaga1
11 months, 1 week ago
Selected Answer: D
Agreed: D
upvoted 1 times
...
odiez3
1 year, 1 month ago
D, because you need to transform the data.
upvoted 1 times
...
Morock
1 year, 2 months ago
Selected Answer: D
D. The question asks for a pipeline, so let's build a pipeline.
upvoted 3 times
vaga1
11 months, 1 week ago
I agree. There is not much information on what to do. Every answer is valid except B, but in strictly technical terms only D generates a real pipeline.
upvoted 1 times
...
...
samdhimal
1 year, 3 months ago
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis. By running a Cloud Dataflow pipeline to import the data, you can perform data validation, cleaning and transformation before it gets loaded into BigQuery. Dataflow allows you to handle corrupted or incorrectly formatted rows by pushing them to another dead-letter table for analysis. This way, you can ensure that only clean and correctly formatted data is loaded into BigQuery for analysis.
upvoted 2 times
samdhimal
1 year, 3 months ago
Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities, so you would have to do it in the SQL query, which could slow down performance. Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline; it can't handle corrupted or incorrectly formatted data. Option C is incorrect because using the CLI and setting max_bad_records to 0 does not handle the bad rows in any useful way, which will lead to incomplete or incorrect analysis.
upvoted 5 times
hamza101
9 months ago
For option C, I think setting max_bad_records to 0 will prevent the load from completing at all, since the job fails if there is even one corrupted row.
upvoted 1 times
...
...
...
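As hamza101 points out above, with max_bad_records left at 0 the load job fails as soon as one malformed row is encountered. For illustration only, here is a minimal sketch of what option C amounts to, written with the BigQuery Python client (the setting belongs to the load job configuration, exposed by the bq tool and the client libraries rather than by gcloud); project, bucket, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("value", "INTEGER"),
    ],
    max_bad_records=0,  # the default: a single malformed row fails the whole load job
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/daily-dump/*.csv",
    "my-project.analytics.daily_data",
    job_config=job_config,
)
# Raises on any malformed row; the offending rows show up in the job's error list
# but are not stored anywhere for later analysis.
load_job.result()

Raising max_bad_records lets the load finish, but the skipped rows are still not retained, which is why D remains the better fit for this requirement.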
Besss
1 year, 6 months ago
Selected Answer: D
Agreed: D
upvoted 1 times
...
Dip1994
1 year, 8 months ago
The correct answer is D
upvoted 1 times
...
Arkon88
2 years, 1 month ago
Selected Answer: D
Correct: D, as we need to create a pipeline, which is possible via D.
upvoted 1 times
...
MaxNRG
2 years, 5 months ago
Looks like D. With C you will not import anything, Stackdriver alerts will not help you with this, and with federated data sources you won't know what happened with those bad records. D is the most complete one. https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
upvoted 3 times
...
anji007
2 years, 6 months ago
Ans: D
upvoted 1 times
...
nickozz
2 years, 7 months ago
D seems to be correct. This page explains how, combined with Pub/Sub, this can be achieved: https://cloud.google.com/pubsub/docs/handling-failures
upvoted 1 times
...