Exam Professional Data Engineer topic 1 question 68 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 68
Topic #: 1

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

  • A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
  • B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
  • C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
  • D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
Suggested Answer: B

Comments

madhu1171
Highly Voted 4 years, 1 month ago
Answer should be C
upvoted 35 times
...
[Removed]
Highly Voted 4 years, 1 month ago
Answer: C. It is the best fit for the purpose: it autoscales, and Dataflow is the Google-recommended transform engine between Pub/Sub and BigQuery.
upvoted 25 times
...
barnac1es
Most Recent 6 months, 3 weeks ago
Selected Answer: D
Option C suggests using Cloud Dataflow to run the transformations and monitoring the job system lag with Stackdriver while using the default autoscaling setting for worker instances. While using Cloud Dataflow is a suitable choice for processing data from Cloud Pub/Sub to BigQuery, and monitoring with Stackdriver provides valuable insights, the specific emphasis on configuring non-default Compute Engine machine types (as mentioned in option D) gives you more control over cost optimization and performance tuning. By configuring non-default machine types, you can precisely tailor the computational resources to match the specific requirements of your workload. This fine-grained control allows you to optimize costs further by avoiding over-provisioning of resources, especially if your workload is memory-intensive, CPU-bound, or requires specific configurations that are not met by the default settings.
upvoted 1 times
barnac1es
6 months, 3 weeks ago
Additionally, having the flexibility to adjust machine types based on workload characteristics ensures that you can achieve the desired performance levels without overspending on unnecessary resources. This level of customization is not provided by simply relying on the default autoscaling settings, making option D a more comprehensive and cost-effective solution for managing varying data volumes.
upvoted 1 times
...
...
Mathew106
9 months ago
Selected Answer: B
At first I answered C. However, Dataproc is indeed cheaper than Dataflow, and both of them can automatically scale horizontally. Dataproc horizontal scaling applies to both primary and secondary nodes. Scaling secondary nodes scales up CPU/compute, and scaling primary nodes scales up both memory and CPU/compute. I don't quite understand the second part of answer B, where it says I should adjust cluster resources. I guess I could do that, but autoscaling should be enough.
upvoted 1 times
...
AbdullahAnwar
1 year, 1 month ago
Answer should be C
upvoted 2 times
...
samdhimal
1 year, 2 months ago
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances. Cloud Dataflow is a managed service that allows you to write and execute data transformations in a highly scalable and fault-tolerant way. By default, it will automatically scale the number of worker instances based on the input data volume and job performance, which can help minimize costs. Monitoring the job system lag with Stackdriver can help you identify any issues that may be impacting performance and take action as needed. Additionally, using the default autoscaling setting for worker instances can help you minimize manual intervention and ensure that resources are used efficiently.
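To make the "transform JSON messages" step concrete, here is a minimal sketch in plain Python of the kind of per-message function such a Dataflow job might apply between the Pub/Sub read and the BigQuery write. The function name `to_bq_row` and the fields `user_id`, `event`, and `ts` are hypothetical, chosen only for illustration; none of them come from the question. Scaling is not handled here at all: with option C, the service adjusts the number of workers running code like this.

```python
import json
from typing import Optional


def to_bq_row(payload: bytes) -> Optional[dict]:
    """Parse one Pub/Sub message payload into a flat dict for a BigQuery row.

    Returns None for malformed input so the caller can route it elsewhere.
    All field names are hypothetical and used only for illustration.
    """
    try:
        msg = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    # Require a JSON object carrying the (assumed) key field.
    if not isinstance(msg, dict) or "user_id" not in msg:
        return None
    return {
        "user_id": str(msg["user_id"]),
        "event": str(msg.get("event", "unknown")).lower(),
        "ts": msg.get("ts"),
    }


if __name__ == "__main__":
    print(to_bq_row(b'{"user_id": 42, "event": "CLICK", "ts": "2024-01-01T00:00:00Z"}'))
    print(to_bq_row(b"not json"))
```

Returning `None` for bad input rather than raising keeps a single malformed message from failing the whole stream; a real pipeline would likely send such messages to a separate dead-letter destination.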
upvoted 3 times
...
zellck
1 year, 4 months ago
Selected Answer: C
C is the answer.
upvoted 1 times
...
odacir
1 year, 4 months ago
Selected Answer: C
@admin why are all the answers wrong? I paid 30 euros for this website and it's garbage. Dataproc makes no sense in this scenario, because you want minimal intervention/operation. D is not good practice; the answer is C.
upvoted 9 times
zellck
1 year, 4 months ago
You need to look at the community vote distribution and the comments, not the suggested answer.
upvoted 7 times
...
...
medeis_jar
2 years, 3 months ago
Selected Answer: C
C only as referred by MaxNRG
upvoted 4 times
...
MaxNRG
2 years, 4 months ago
Selected Answer: C
C. Dataproc does not seem to be a good solution in this case, as it always requires manual intervention to adjust resources. Autoscaling with Dataflow will automatically handle changing data volumes with no manual intervention, and monitoring through Stackdriver can be used to spot bottlenecks. Total execution time is not a good metric here, as it does not provide a precise view of potential bottlenecks.
upvoted 9 times
...
StefanoG
2 years, 4 months ago
Selected Answer: C
Dataflow, Stackdriver and autoscaling
upvoted 3 times
...
victorlie
2 years, 5 months ago
Admin, please take a look at the comments. Almost all the suggested answers are wrong.
upvoted 4 times
...
nguyenmoon
2 years, 7 months ago
Answer should be C, as Dataflow suits input of unpredictable size (input that will vary in size), whereas Dataproc suits workloads of known size.
upvoted 4 times
Tanzu
2 years, 2 months ago
Dataflow over Dataproc is always the preferred way in GCP. Use Dataproc only when there are specific client requirements, such as existing Hadoop workloads.
upvoted 1 times
...
...
sandipk91
2 years, 8 months ago
Option C is the answer
upvoted 3 times
...
sumanshu
2 years, 9 months ago
Vote for C
upvoted 1 times
...
apnu
3 years, 3 months ago
B is correct, as the question says to minimize service costs, and Dataflow is more expensive than Dataproc.
upvoted 2 times
daghayeghi
3 years, 1 month ago
But the question says "with minimal manual intervention", and with Dataproc you need to manage the cluster manually, so C is the best option.
upvoted 1 times
...
Believerath
3 years ago
You have to transform the JSON messages. Hence, you need to use Dataflow.
upvoted 1 times
...
...
apnu
3 years, 3 months ago
B is correct, as the question says to minimize service costs, and Dataflow is more expensive than Dataproc.
upvoted 2 times
...