Exam Professional Data Engineer topic 1 question 5 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 5
Topic #: 1

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

  • A. Use federated data sources, and check data in the SQL query.
  • B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
  • C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
  • D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Suggested Answer: D 🗳️
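To make answer D concrete, here is a minimal sketch of such a pipeline using the Apache Beam Python SDK, which Dataflow runs: rows that parse cleanly are loaded into the main BigQuery table, and rows that fail parsing are routed through a tagged side output to a dead-letter table along with the error message. The project, bucket, and table names and the two-column schema are hypothetical placeholders, not part of the exam question.

import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

VALID = "valid"
DEAD_LETTER = "dead_letter"

class ParseCsvRow(beam.DoFn):
    """Parse one CSV line; tag it as valid or route it to the dead-letter output."""

    def process(self, line):
        try:
            # Hypothetical two-column layout: name,value
            name, value = next(csv.reader([line]))
            yield beam.pvalue.TaggedOutput(VALID, {"name": name, "value": int(value)})
        except Exception as err:
            # Corrupted or mis-formatted rows are kept, together with the error, for analysis.
            yield beam.pvalue.TaggedOutput(DEAD_LETTER, {"raw_line": line, "error": str(err)})

def run():
    options = PipelineOptions(
        runner="DataflowRunner",  # use "DirectRunner" for local testing
        project="my-project",     # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        parsed = (
            pipeline
            | "ReadCsvFromGcs" >> beam.io.ReadFromText("gs://my-bucket/daily-dump/*.csv")
            | "ParseRows" >> beam.ParDo(ParseCsvRow()).with_outputs(VALID, DEAD_LETTER)
        )
        parsed[VALID] | "WriteValidRows" >> beam.io.WriteToBigQuery(
            "my-project:analytics.daily_data",
            schema="name:STRING,value:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
        parsed[DEAD_LETTER] | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:analytics.daily_data_errors",
            schema="raw_line:STRING,error:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )

if __name__ == "__main__":
    run()

The key design choice is that bad input is never dropped silently: it lands in a queryable table where it can be inspected and, if needed, reprocessed.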

Comments

[Removed]
Highly Voted 4 years, 1 month ago
Agreed: D
upvoted 25 times
...
Radhika7983
Highly Voted 3 years, 5 months ago
The answer is D. An ETL pipeline should be implemented for this scenario. Check out "Handling invalid inputs in Cloud Dataflow" ("ParDos . . . and don'ts: handling invalid inputs in Dataflow using Side Outputs as a 'Dead Letter' file"): https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
upvoted 13 times
jkhong
1 year, 4 months ago
The source you've provided can no longer be accessed. Here is an updated best practice: https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#use_dead_letter_queues
upvoted 5 times
...
...
RT_G
Most Recent 5 months, 2 weeks ago
Selected Answer: D
All other options only alert on bad data or make the load error out. As the question requires, option D sends bad data to a dead-letter table for further analysis while valid data is loaded into the main table.
upvoted 1 times
...
rocky48
5 months, 3 weeks ago
Selected Answer: D
Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities, so you would have to do it in the SQL query, which could slow down performance. Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline; it can't handle corrupted or incorrectly formatted data. Option C is incorrect because loading via the CLI with max_bad_records set to 0 gives you no way to keep the corrupted or incorrectly formatted rows for analysis. The answer is D: run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
upvoted 1 times
...
rtcpost
6 months ago
Selected Answer: D
Google Cloud Dataflow allows you to create a data pipeline that can preprocess and transform data before loading it into BigQuery. This approach will enable you to handle problematic rows, push them to a dead-letter table for later analysis, and load the valid data into BigQuery. Option A (using federated data sources and checking data in the SQL query) can be used but doesn't directly address the issue of handling corrupted or incorrectly formatted rows. Options B and C are not the best choices for handling data quality and error issues. Enabling monitoring and setting max_bad_records to 0 in BigQuery may help identify errors but won't store the problematic rows for further analysis, and it might prevent loading any data with issues, which may not be ideal.
upvoted 1 times
...
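As rtcpost notes above, option A can be made to work, but all validation ends up in the SQL. For illustration only, here is a minimal sketch of that approach with the BigQuery Python client: a temporary external (federated) table is defined over the GCS files and bad values are filtered with SAFE_CAST in the query. All names are hypothetical, and note the limitation: rejected rows are simply filtered out rather than kept anywhere for analysis, and every query has to repeat the checks.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Federated source: BigQuery reads the CSV files directly from GCS at query time.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/daily-dump/*.csv"]
external_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    # Read the numeric column as STRING so a bad value shows up as a failed cast
    # instead of failing the whole scan.
    bigquery.SchemaField("value", "STRING"),
]
# Rows with the wrong number of columns still abort the query unless this is raised.
external_config.max_bad_records = 1000

job_config = bigquery.QueryJobConfig(table_definitions={"daily_dump": external_config})

# The "check data in the SQL query" part: SAFE_CAST turns malformed values into NULLs,
# which are then filtered out; the bad rows are lost rather than stored for analysis.
query = """
    SELECT name, SAFE_CAST(value AS INT64) AS value
    FROM daily_dump
    WHERE SAFE_CAST(value AS INT64) IS NOT NULL
"""
for row in client.query(query, job_config=job_config).result():
    print(row.name, row.value)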
NeoNitin
7 months, 1 week ago
Ans: D. Thank you ExamTopics. Connect with me if you need any help: [email protected]
upvoted 1 times
...
NeoNitin
7 months, 3 weeks ago
D. Thank you ExamTopics: I passed the exam in August, and ExamTopics helped me a lot. Topic 1 is enough for the exam. Just last week I received the welcome kit from Google for the PDE exam, including a Google Cloud cup. If you need any help, reach out to me: neonitin6attherategoogledotcom
upvoted 1 times
...
vaga1
11 months, 1 week ago
Selected Answer: D
Agreed: D
upvoted 1 times
...
odiez3
1 year, 1 month ago
D, because you need to transform the data.
upvoted 1 times
...
Morock
1 year, 2 months ago
Selected Answer: D
D. The question asks for a pipeline, so let's build a pipeline.
upvoted 3 times
vaga1
11 months, 1 week ago
I agree. There is not much information on what to do. Every answer is valid except B, but in strictly technical terms only D generates a real pipeline.
upvoted 1 times
...
...
samdhimal
1 year, 3 months ago
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis. By running a Cloud Dataflow pipeline to import the data, you can perform data validation, cleaning and transformation before it gets loaded into BigQuery. Dataflow allows you to handle corrupted or incorrectly formatted rows by pushing them to another dead-letter table for analysis. This way, you can ensure that only clean and correctly formatted data is loaded into BigQuery for analysis.
upvoted 2 times
samdhimal
1 year, 3 months ago
Option A is incorrect because federated data sources do not provide any data validation or cleaning capabilities, so you would have to do it in the SQL query, which could slow down performance. Option B is incorrect because Stackdriver monitoring can only monitor the performance of the pipeline; it can't handle corrupted or incorrectly formatted data. Option C is incorrect because using the CLI and setting max_bad_records to 0 does not handle the bad rows in any useful way, which will lead to incomplete or incorrect analysis.
upvoted 5 times
hamza101
9 months ago
For option C, I think setting max_bad_records to 0 will prevent the load from completing at all, since the job fails if there is even one corrupted row.
upvoted 1 times
...
...
...
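As hamza101 points out above, with max_bad_records left at 0 the load job fails as soon as one malformed row is encountered. For illustration only, here is a minimal sketch of what option C amounts to, written with the BigQuery Python client (the setting belongs to the load job configuration, exposed by the bq tool and the client libraries rather than by gcloud); project, bucket, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("value", "INTEGER"),
    ],
    max_bad_records=0,  # the default: a single malformed row fails the whole load job
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/daily-dump/*.csv",
    "my-project.analytics.daily_data",
    job_config=job_config,
)
# Raises on any malformed row; the offending rows show up in the job's error list
# but are not stored anywhere for later analysis.
load_job.result()

Raising max_bad_records lets the load finish, but the skipped rows are still not retained, which is why D remains the better fit for this requirement.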
Besss
1 year, 6 months ago
Selected Answer: D
Agreed: D
upvoted 1 times
...
Dip1994
1 year, 8 months ago
The correct answer is D
upvoted 1 times
...
Arkon88
2 years, 1 month ago
Selected Answer: D
Correct: D, as we need to create a pipeline, which is possible via D.
upvoted 1 times
...
MaxNRG
2 years, 5 months ago
Looks like D. With C you will not import anything, Stackdriver alerts will not help you with this, and with federated data sources you won't know what happened with those bad records. D is the most complete one. https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
upvoted 3 times
...
anji007
2 years, 6 months ago
Ans: D
upvoted 1 times
...
nickozz
2 years, 7 months ago
D seems to be correct. This page explains how, combined with Pub/Sub, this can be achieved: https://cloud.google.com/pubsub/docs/handling-failures
upvoted 1 times
...