Exam Professional Data Engineer topic 1 question 87 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 87
Topic #: 1

You've migrated a Hadoop job from an on-premises cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200–400 MB each). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.
What should you do?

  • A. Increase the size of your Parquet files to ensure they are at least 1 GB.
  • B. Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.
  • C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
  • D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
Suggested Answer: D 🗳️
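For context, a minimal sketch of what option D could look like when creating the cluster with the google-cloud-dataproc Python client (dataproc_v1). The project, region, machine types, worker counts, and disk sizes below are illustrative placeholders, not values given in the question; treat this as an assumption-laden outline rather than a recommended configuration.

from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
cluster_name = "spark-shuffle"   # placeholder

# Regional endpoint for the Dataproc cluster controller.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # The two non-preemptible workers from the question, on SSD boot disks.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible (secondary) workers: override the default boot disk size,
        # since shuffle data spills to these disks.
        "secondary_worker_config": {
            "num_instances": 8,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready

The same settings (boot disk type and secondary-worker boot disk size) can also be expressed as gcloud flags if you prefer the CLI.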

Comments

rickywck
Highly Voted 4 years, 1 month ago
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files and https://www.dremio.com/tuning-parquet/ . C and D will improve performance, but you'd need to pay more $$.
upvoted 68 times
diluvio
2 years, 6 months ago
It is A. Please read the links above.
upvoted 5 times
...
odacir
1 year, 4 months ago
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 1 times
...
raf2121
2 years, 9 months ago
Point for discussion: another reason why it can't be C or D is that SSDs are not available on preemptible worker nodes (the answers didn't say whether they wanted to switch from HDD to SSD for the master nodes). https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs
upvoted 8 times
rr4444
1 year, 9 months ago
You can have local SSDs for the Dataproc normal or preemptible VMs: https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-pd-ssd
upvoted 1 times
...
raf2121
2 years, 9 months ago
Also, for shuffling operations, one needs to override the preemptible VM configuration to increase the boot disk size. (The second half of answer D is correct, but the first half is wrong.)
upvoted 1 times
...
...
zellck
1 year, 4 months ago
https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files: "Store data in larger file sizes, for example, file sizes in the 256MB–512MB range."
upvoted 2 times
...
...
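To make the file-size argument (option A and the 256MB–512MB guidance in the thread above) concrete, here is a minimal PySpark sketch that rewrites many smaller Parquet files into fewer, larger ones. The bucket paths and the target partition count are made-up placeholders; pick the count from total input size divided by the file size you are aiming for.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

# Read the existing (smaller) Parquet files.
df = spark.read.parquet("gs://my-bucket/events/")           # placeholder path

# Example: ~200 GB of input at ~512 MB per output file -> ~400 files.
target_files = 400                                           # example value

(df.repartition(target_files)
   .write.mode("overwrite")
   .parquet("gs://my-bucket/events_compacted/"))             # placeholder path

Note that repartition() forces a shuffle; coalesce(target_files) avoids one when you are only reducing the partition count, at the cost of less even output file sizes.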
madhu1171
Highly Voted 4 years, 1 month ago
Answer should be D
upvoted 12 times
jvg637
4 years, 1 month ago
D: By default, preemptible node disk sizes are limited to 100 GB or the size of the non-preemptible node disks, whichever is smaller. However, you can override the default preemptible disk size to any requested size. Since the majority of our cluster is using preemptible nodes, the disk used for caching operations will see a noticeable performance improvement with a larger disk. Also, SSDs will perform better than HDDs. This will increase costs slightly, but it is the best option available while keeping costs down.
upvoted 15 times
ch3n6
3 years, 10 months ago
C is correct and D is wrong. They are using 'Dataproc and GCS', which is not related to the boot disk at all.
upvoted 2 times
VishalB
3 years, 9 months ago
C is recommended only for this case: "If you have many small files, consider copying files for processing to the local HDFS and then copying the results back."
upvoted 1 times
FARR
3 years, 8 months ago
The file sizes are already within the expected range for GCS (128 MB–1 GB), so not C. D seems most feasible.
upvoted 3 times
...
...
...
...
...
philli1011
Most Recent 2 months, 2 weeks ago
A. We don't know whether HDDs were used, so we can't act on that, but we do know that the Parquet files are small and numerous, and we can act on that by increasing their size so there are fewer of them.
upvoted 1 times
...
rocky48
4 months, 3 weeks ago
Selected Answer: A
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files
upvoted 1 times
rocky48
4 months, 3 weeks ago
Given the scenario and the cost-sensitive nature of your organization, the best option would be: C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job, and copy results back to GCS. Option C allows you to leverage the benefits of SSDs and HDFS while minimizing costs by continuing to use Dataproc on preemptible VMs. This approach optimizes both performance and cost-effectiveness for your analytical workload on Google Cloud.
upvoted 1 times
...
...
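For completeness, a rough sketch of the copy step that option C describes, run from the cluster's master node; the paths are placeholders. The GCS connector ships with Dataproc, so hadoop distcp can read gs:// URIs directly. This only illustrates the staging pattern, not an endorsement of C.

import subprocess

# Stage the input on the cluster's local HDFS before running the Spark job.
subprocess.run(
    ["hadoop", "distcp", "gs://my-bucket/events/", "hdfs:///staging/events/"],
    check=True,
)

# ... run the Spark job against hdfs:///staging/events/ ...

# Copy the results back to GCS once the job finishes.
subprocess.run(
    ["hadoop", "distcp", "hdfs:///staging/output/", "gs://my-bucket/output/"],
    check=True,
)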
Mathew106
9 months ago
Selected Answer: A
https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files Being cost-effective is the key in the question.
upvoted 1 times
...
Nandhu95
1 year, 1 month ago
Selected Answer: D
Preemptible VMs can't be used for HDFS storage. As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads.
upvoted 1 times
...
midgoo
1 year, 1 month ago
Selected Answer: D
Should NOT be A because: 1. The file size is already in the optimal range. 2. If the current file size works well on the existing Hadoop cluster, it should give similar performance on Dataproc. The only difference from the current setup is that Dataproc is using preemptible nodes. So yes, using SSDs may incur a bit more cost, but since the preemptibles already save most of it, we accept saving a little less in order to improve performance.
upvoted 1 times
Mathew106
9 months ago
Optimal size is 1GB
upvoted 1 times
...
...
[Removed]
1 year, 2 months ago
Selected Answer: A
Cost sensitive is the keyword.
upvoted 1 times
...
musumusu
1 year, 2 months ago
This question is asked by Google, so option C is not correct; otherwise it would be a good approach to put the initial data in HDFS and switch from HDDs to SSDs for the 2 non-preemptible nodes. Option D is right, but they don't mention that they will stop using the 2 non-preemptible nodes; I assume it :P
upvoted 2 times
...
PolyMoe
1 year, 2 months ago
Selected Answer: C
C. Ref: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance. The recommended size is 128 MB–1 GB, so it is not a size issue ==> not A. There is no issue mentioned with the file format ==> not B. D could be a good solution, but it requires overriding the preemptible VM configuration; however, the question asks to continue using preemptibles ==> not D. C is a good solution.
upvoted 3 times
ayush_1995
1 year, 2 months ago
Agreed, C over D: switching from HDDs to SSDs and overriding the preemptible VM configuration to increase the boot disk size may not be the best solution for improving performance in this scenario, because it doesn't address the main issue, which is the large number of shuffling operations causing the performance degradation. While SSDs may have faster read and write speeds than HDDs, they may not provide significant performance improvements for a workload that is primarily CPU-bound and heavily reliant on shuffling operations. Additionally, increasing the boot disk size of the preemptible VMs may not be necessary or cost-effective for this particular workload.
upvoted 1 times
...
...
slade_wilson
1 year, 4 months ago
Selected Answer: D
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance: "Manage Cloud Storage file sizes: To get optimal performance, split your data in Cloud Storage into files with sizes from 128 MB to 1 GB. Using lots of small files can create a bottleneck. If you have many small files, consider copying files for processing to the local HDFS and then copying the results back." "Switch to SSD disks: If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance."
upvoted 2 times
...
odacir
1 year, 4 months ago
Selected Answer: D
It's D, 100%. It's the recommended best practice for this scenario. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 3 times
...
zellck
1 year, 4 months ago
Selected Answer: D
D is the answer. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#switch_to_ssd_disks: "If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance." https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#use_preemptible_vms: "As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads. For details, see the page on preemptible VMs in the Dataproc documentation."
upvoted 1 times
...
sfsdeniso
1 year, 5 months ago
The answer is D, not C, because you cannot use HDFS with preemptible VMs: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#use_preemptible_vms
upvoted 2 times
...
dish11dish
1 year, 5 months ago
Selected Answer: D
Option D is correct. Elimination strategy: A. Increase the size of your Parquet files to at least 1 GB (doesn't make sense, as the file sizes already fit the given scenario; the recommended size is between 128 MB and 1 GB). B. Switch to TFRecord format (approx. 200 MB per file) instead of Parquet files (doesn't make sense to change the file format). C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS (doesn't make sense to copy the files from GCS to HDFS for a workload that consists of many shuffling operations). D. Switch from HDDs to SSDs and override the preemptible VM configuration to increase the boot disk size (a perfect fit, since the shuffle-heavy workload needs this attention to improve performance; reference: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance).
upvoted 4 times
...
piotrpiskorski
1 year, 5 months ago
Selected Answer: A
It's A. Larger Parquet files will be more efficient, and it's a no-cost solution to implement, in contrast to the SSD drives.
upvoted 1 times
zellck
1 year, 4 months ago
The recommended file size is not 1 GB. https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files: "Store data in larger file sizes, for example, file sizes in the 256MB–512MB range."
upvoted 1 times
...
...
gudiking
1 year, 5 months ago
Selected Answer: A
https://www.dremio.com/blog/tuning-parquet/
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other