Exam Professional Data Engineer topic 1 question 68 discussion

Actual exam question from Google's Professional Data Engineer

Question #: 68
Topic #: 1

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.

Suggested Answer: B

by madhu1171 at March 13, 2020, 2:01 p.m.

Comments

madhu1171

Highly Voted 4 years, 1 month ago

Answer should be C

upvoted 35 times

...

[Removed]

Highly Voted 4 years, 1 month ago

Answer: C. It is the best fit for the purpose, with autoscaling, and Dataflow is the Google-recommended transform engine between Pub/Sub and BigQuery.

upvoted 25 times

...

barnac1es

Most Recent 6 months, 3 weeks ago

Selected Answer: D

Option C suggests using Cloud Dataflow to run the transformations and monitoring the job system lag with Stackdriver while using the default autoscaling setting for worker instances. While using Cloud Dataflow is a suitable choice for processing data from Cloud Pub/Sub to BigQuery, and monitoring with Stackdriver provides valuable insights, the specific emphasis on configuring non-default Compute Engine machine types (as mentioned in option D) gives you more control over cost optimization and performance tuning. By configuring non-default machine types, you can precisely tailor the computational resources to match the specific requirements of your workload. This fine-grained control allows you to optimize costs further by avoiding over-provisioning of resources, especially if your workload is memory-intensive, CPU-bound, or requires specific configurations that are not met by the default settings.

upvoted 1 times

barnac1es

6 months, 3 weeks ago

Additionally, having the flexibility to adjust machine types based on workload characteristics ensures that you can achieve the desired performance levels without overspending on unnecessary resources. This level of customization is not provided by simply relying on the default autoscaling settings, making option D a more comprehensive and cost-effective solution for managing varying data volumes.

upvoted 1 times

...

Mathew106

9 months ago

Selected Answer: B

At first I answered C. However, Dataproc is indeed cheaper than Dataflow, and both of them can automatically scale horizontally. Dataflow horizontal scaling applies to both primary and secondary nodes. Scaling secondary nodes scales up CPU/compute, and scaling primary nodes scales up both memory and CPU/compute. I don't quite understand the second part of answer B, where it says I should adjust cluster resources. I guess I could do that, but autoscaling should be enough.

upvoted 1 times

...

AbdullahAnwar

1 year, 1 month ago

Answer should be C

upvoted 2 times

...

samdhimal

1 year, 2 months ago

C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances. Cloud Dataflow is a managed service that allows you to write and execute data transformations in a highly scalable and fault-tolerant way. By default, it will automatically scale the number of worker instances based on the input data volume and job performance, which can help minimize costs. Monitoring the job system lag with Stackdriver can help you identify any issues that may be impacting performance and take action as needed. Additionally, using the default autoscaling setting for worker instances can help you minimize manual intervention and ensure that resources are used efficiently.

upvoted 3 times

...
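As a rough illustration of what "use the default autoscaling setting" in option C can look like in practice, here is a minimal sketch that builds command-line style arguments for a streaming Apache Beam job on Dataflow. The helper name `dataflow_streaming_args` and the project, region, and bucket values are hypothetical. The flags themselves (`--runner`, `--streaming`, `--max_num_workers`) are standard Beam pipeline options. The sketch deliberately does not set `--autoscaling_algorithm`, so the service default stays in effect.

```python
def dataflow_streaming_args(project, region, temp_location, max_workers=None):
    """Build argument strings for a streaming Beam job on Dataflow.

    Hypothetical helper: it only assembles flags, it does not launch a job.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        "--streaming",  # Pub/Sub is an unbounded source
    ]
    # No --autoscaling_algorithm flag is added, so the service default applies.
    # An optional worker ceiling caps cost without turning autoscaling off.
    if max_workers is not None:
        args.append(f"--max_num_workers={max_workers}")
    return args


if __name__ == "__main__":
    # Placeholder values, not taken from the question.
    for flag in dataflow_streaming_args(
        "my-project", "us-central1", "gs://my-bucket/tmp", max_workers=10
    ):
        print(flag)
```

A list like this would typically be passed to Beam's `PipelineOptions` when constructing the pipeline. Capping workers is optional and is shown only because the question also asks to minimize cost.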

zellck

1 year, 4 months ago

Selected Answer: C

C is the answer.

upvoted 1 times

...

odacir

1 year, 4 months ago

Selected Answer: C

@admin why are all the answers wrong? I paid 30 euros for this website and it's garbage. Dataproc makes no sense in this scenario, because you want minimal intervention/operation. D is not good practice; the answer is C.

upvoted 9 times

zellck

1 year, 4 months ago

you need to look at community vote distribution and comments, and not the suggested answer.

upvoted 7 times

...

medeis_jar

2 years, 3 months ago

Selected Answer: C

C only as referred by MaxNRG

upvoted 4 times

...

MaxNRG

2 years, 4 months ago

Selected Answer: C

C. Dataproc does not seem to be a good solution in this case, as it always requires manual intervention to adjust resources. Autoscaling with Dataflow will automatically handle changing data volumes with no manual intervention, and monitoring through Stackdriver can be used to spot bottlenecks. Total execution time is not a good metric here, as it does not give a precise view of potential bottlenecks.

upvoted 9 times

...
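To make the "monitor system lag" part of option C concrete, here is a minimal sketch of the kind of check an alert on that metric performs. Dataflow reports system lag in seconds to Cloud Monitoring (formerly Stackdriver) as `dataflow.googleapis.com/job/system_lag`. The function name, the 60-second threshold, and the three-sample window below are assumptions chosen for illustration, not recommended values.

```python
def lag_is_sustained(lag_samples_s, threshold_s=60.0, min_consecutive=3):
    """Return True if lag exceeded threshold_s for min_consecutive samples in a row.

    lag_samples_s: system lag readings in seconds, oldest first (assumed to be
    fetched from Cloud Monitoring elsewhere; fetching is out of scope here).
    """
    run = 0
    for lag in lag_samples_s:
        # Extend the current streak, or reset it when lag drops back down.
        run = run + 1 if lag > threshold_s else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # A single spike is ignored; a sustained rise is flagged.
    print(lag_is_sustained([10, 90, 15, 20]))
    print(lag_is_sustained([10, 70, 80, 90]))
```

Requiring several consecutive samples avoids reacting to a brief spike that autoscaling would absorb on its own.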

StefanoG

2 years, 4 months ago

Selected Answer: C

Dataflow, Stackdriver and autoscaling

upvoted 3 times

...

victorlie

2 years, 5 months ago

Admin, please take a look at the comments. Almost all the suggested answers are wrong.

upvoted 4 times

...

nguyenmoon

2 years, 7 months ago

Answer should be C, as Dataflow suits data of unpredictable size (input that will vary in size), while Dataproc suits data of known size.

upvoted 4 times

Tanzu

2 years, 2 months ago

Dataflow over Dataproc is always the preferred way in GCP. Use Dataproc only when there are specific client requirements, such as existing Hadoop workloads.

upvoted 1 times

...

sandipk91

2 years, 8 months ago

Option C is the answer

upvoted 3 times

...

sumanshu

2 years, 9 months ago

Vote for C

upvoted 1 times

...

apnu

3 years, 3 months ago

B is correct, as the question says minimum service cost, and Dataflow is more expensive than Dataproc.

upvoted 2 times

daghayeghi

3 years, 1 month ago

But it also says "with minimal manual intervention", and with Dataproc you need to manage the cluster manually, so C is the best option.

upvoted 1 times

...

Believerath

3 years ago

You have to transform the JSON messages. Hence, you need to use dataflow.

upvoted 1 times

...
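For readers wondering what "transform the JSON messages" might involve, here is a minimal sketch of a per-message transform that turns a Pub/Sub payload into a flat row. The field names (`user_id`, `event`, `value`) and the function name `message_to_row` are hypothetical, since the question does not describe the message schema.

```python
import json


def message_to_row(data: bytes):
    """Decode one JSON payload into a flat dict, or return None if it is unusable.

    Hypothetical schema: user_id (required), event (optional), value (optional).
    """
    try:
        record = json.loads(data.decode("utf-8"))
        return {
            "user_id": str(record["user_id"]),
            "event": str(record.get("event", "")).lower(),
            "value": float(record.get("value", 0)),
        }
    except (ValueError, KeyError, TypeError):
        # Covers bad encoding, invalid JSON, a missing user_id,
        # non-object payloads, and values that are not numeric.
        return None


if __name__ == "__main__":
    print(message_to_row(b'{"user_id": 7, "event": "CLICK", "value": "1.5"}'))
    print(message_to_row(b"not json"))
```

In a Beam pipeline, a function like this would usually be applied to each element with a map step, with `None` results filtered out or routed to a dead-letter output before the BigQuery write.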

apnu

3 years, 3 months ago

B is correct, as the question says minimum service cost, and Dataflow is more expensive than Dataproc.

upvoted 2 times

...
