Exam Professional Data Engineer topic 1 question 87 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 87
Topic #: 1

You've migrated a Hadoop job from an on-premises cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200–400 MB each). You see some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.
What should you do?

  • A. Increase the size of your Parquet files to ensure they are at least 1 GB.
  • B. Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet files.
  • C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
  • D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
Suggested Answer: D 🗳️
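For context, a minimal sketch of what option D could look like when creating the cluster with the google-cloud-dataproc Python client (dataproc_v1). The project, region, machine types, worker counts, and disk sizes below are illustrative placeholders, not values given in the question; treat this as an assumption-laden outline rather than a recommended configuration.

from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"           # placeholder
cluster_name = "spark-shuffle"   # placeholder

# Regional endpoint for the Dataproc cluster controller.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # The two non-preemptible workers from the question, on SSD boot disks.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible (secondary) workers: override the default boot disk size,
        # since shuffle data spills to these disks.
        "secondary_worker_config": {
            "num_instances": 8,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready

The same settings (boot disk type and secondary-worker boot disk size) can also be expressed as gcloud flags if you prefer the CLI.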

Comments

rickywck
Highly Voted 4 years, 1 month ago
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files and https://www.dremio.com/tuning-parquet/ . C and D will improve performance, but you'd need to pay more $$.
upvoted 68 times
diluvio
2 years, 6 months ago
It is A. Please read the links above.
upvoted 5 times
...
odacir
1 year, 4 months ago
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 1 times
...
raf2121
2 years, 9 months ago
Point for discussion: another reason why it can't be C or D is that SSDs are not available on preemptible worker nodes (the answers didn't say whether they wanted to switch from HDD to SSD for the master nodes). https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs
upvoted 8 times
rr4444
1 year, 9 months ago
You can have local SSDs for the Dataproc normal or preemptible VMs: https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-pd-ssd
upvoted 1 times
...
raf2121
2 years, 9 months ago
Also, for shuffling operations, one needs to override the preemptible VM configuration to increase the boot disk size. (The second half of answer D is correct, but the first half is wrong.)
upvoted 1 times
...
...
zellck
1 year, 4 months ago
https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files: "Store data in larger file sizes, for example, file sizes in the 256MB–512MB range."
upvoted 2 times
...
...
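To make the file-size argument (option A and the 256MB–512MB guidance in the thread above) concrete, here is a minimal PySpark sketch that rewrites many smaller Parquet files into fewer, larger ones. The bucket paths and the target partition count are made-up placeholders; pick the count from total input size divided by the file size you are aiming for.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

# Read the existing (smaller) Parquet files.
df = spark.read.parquet("gs://my-bucket/events/")           # placeholder path

# Example: ~200 GB of input at ~512 MB per output file -> ~400 files.
target_files = 400                                           # example value

(df.repartition(target_files)
   .write.mode("overwrite")
   .parquet("gs://my-bucket/events_compacted/"))             # placeholder path

Note that repartition() forces a shuffle; coalesce(target_files) avoids one when you are only reducing the partition count, at the cost of less even output file sizes.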
madhu1171
Highly Voted 4 years, 1 month ago
Answer should be D
upvoted 12 times
jvg637
4 years, 1 month ago
D: By default, preemptible node disk sizes are limited to 100 GB or the size of the non-preemptible node disks, whichever is smaller. However, you can override the default preemptible disk size to any requested size. Since the majority of our cluster is using preemptible nodes, the disk used for caching operations will see a noticeable performance improvement with a larger disk. Also, SSDs will perform better than HDDs. This will increase costs slightly, but it is the best option available while keeping costs down.
upvoted 15 times
ch3n6
3 years, 10 months ago
C is correct and D is wrong. They are using 'Dataproc and GCS', which is not related to the boot disk at all.
upvoted 2 times
VishalB
3 years, 9 months ago
C is recommended only for this case: "If you have many small files, consider copying files for processing to the local HDFS and then copying the results back."
upvoted 1 times
FARR
3 years, 8 months ago
The file sizes are already within the expected range for GCS (128 MB–1 GB), so not C. D seems most feasible.
upvoted 3 times
...
...
...
...
...
philli1011
Most Recent 2 months, 2 weeks ago
A. We don't know whether HDDs were used, so we can't act on that, but we do know that the Parquet files are small and numerous, and we can act on that by increasing their size so there are fewer of them.
upvoted 1 times
...
rocky48
4 months, 3 weeks ago
Selected Answer: A
Should be A: https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files
upvoted 1 times
rocky48
4 months, 3 weeks ago
Given the scenario and the cost-sensitive nature of your organization, the best option would be: C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job, and copy results back to GCS. Option C allows you to leverage the benefits of SSDs and HDFS while minimizing costs by continuing to use Dataproc on preemptible VMs. This approach optimizes both performance and cost-effectiveness for your analytical workload on Google Cloud.
upvoted 1 times
...
...
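For completeness, a rough sketch of the copy step that option C describes, run from the cluster's master node; the paths are placeholders. The GCS connector ships with Dataproc, so hadoop distcp can read gs:// URIs directly. This only illustrates the staging pattern, not an endorsement of C.

import subprocess

# Stage the input on the cluster's local HDFS before running the Spark job.
subprocess.run(
    ["hadoop", "distcp", "gs://my-bucket/events/", "hdfs:///staging/events/"],
    check=True,
)

# ... run the Spark job against hdfs:///staging/events/ ...

# Copy the results back to GCS once the job finishes.
subprocess.run(
    ["hadoop", "distcp", "hdfs:///staging/output/", "gs://my-bucket/output/"],
    check=True,
)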
Mathew106
9 months ago
Selected Answer: A
https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files Being cost-effective is the key in the question.
upvoted 1 times
...
Nandhu95
1 year, 1 month ago
Selected Answer: D
Preemptible VMs can't be used for HDFS storage. As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads.
upvoted 1 times
...
midgoo
1 year, 1 month ago
Selected Answer: D
Should NOT be A because: 1. The file size is already in the optimal range. 2. If the current file size works well on the existing Hadoop cluster, it should give similar performance on Dataproc. The only difference from the current setup is that Dataproc is using preemptible nodes. So yes, using SSDs may incur a bit more cost, but since the preemptibles already save most of it, we accept saving a little less in order to improve performance.
upvoted 1 times
Mathew106
9 months ago
Optimal size is 1GB
upvoted 1 times
...
...
[Removed]
1 year, 2 months ago
Selected Answer: A
Cost sensitive is the keyword.
upvoted 1 times
...
musumusu
1 year, 2 months ago
This question is asked by Google, so option C is not correct; otherwise it would be a good approach to put the initial data in HDFS and switch from HDDs to SSDs for the 2 non-preemptible nodes. Option D is right, but they don't mention that they will stop using the 2 non-preemptible nodes; I assume it :P
upvoted 2 times
...
PolyMoe
1 year, 2 months ago
Selected Answer: C
C. Ref: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance. The recommended size is 128 MB–1 GB, so it is not a size issue ==> not A. There is no issue mentioned with the file format ==> not B. D could be a good solution, but it requires overriding the preemptible VM configuration; however, the question asks to continue using preemptibles ==> not D. C is a good solution.
upvoted 3 times
ayush_1995
1 year, 2 months ago
Agreed, C over D: switching from HDDs to SSDs and overriding the preemptible VM configuration to increase the boot disk size may not be the best solution for improving performance in this scenario, because it doesn't address the main issue, which is the large number of shuffling operations causing the performance degradation. While SSDs may have faster read and write speeds than HDDs, they may not provide significant performance improvements for a workload that is primarily CPU-bound and heavily reliant on shuffling operations. Additionally, increasing the boot disk size of the preemptible VMs may not be necessary or cost-effective for this particular workload.
upvoted 1 times
...
...
slade_wilson
1 year, 4 months ago
Selected Answer: D
https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance: "Manage Cloud Storage file sizes: To get optimal performance, split your data in Cloud Storage into files with sizes from 128 MB to 1 GB. Using lots of small files can create a bottleneck. If you have many small files, consider copying files for processing to the local HDFS and then copying the results back." "Switch to SSD disks: If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance."
upvoted 2 times
...
odacir
1 year, 4 months ago
Selected Answer: D
It's D, 100%. It's the recommended best practice for this scenario. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
upvoted 3 times
...
zellck
1 year, 4 months ago
Selected Answer: D
D is the answer. https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#switch_to_ssd_disks: "If you perform many shuffling operations or partitioned writes, switch to SSDs to boost performance." https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#use_preemptible_vms: "As a default, preemptible VMs are created with a smaller boot disk size, and you might want to override this configuration if you are running shuffle-heavy workloads. For details, see the page on preemptible VMs in the Dataproc documentation."
upvoted 1 times
...
sfsdeniso
1 year, 5 months ago
The answer is D, not C, because you cannot use HDFS with preemptible VMs: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#use_preemptible_vms
upvoted 2 times
...
dish11dish
1 year, 5 months ago
Selected Answer: D
Option D is correct. Elimination strategy: A. Increase the size of your Parquet files to at least 1 GB (doesn't make sense, as the file sizes already fit the given scenario; the recommended size is between 128 MB and 1 GB). B. Switch to TFRecord format (approx. 200 MB per file) instead of Parquet files (doesn't make sense to change the file format). C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS (doesn't make sense to copy the files from GCS to HDFS for a workload that consists of many shuffling operations). D. Switch from HDDs to SSDs and override the preemptible VM configuration to increase the boot disk size (a perfect fit, since the shuffle-heavy workload needs this attention to improve performance; reference: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance).
upvoted 4 times
...
piotrpiskorski
1 year, 5 months ago
Selected Answer: A
It's A. Larger Parquet files will be more efficient, and it's a no-cost solution to implement, in contrast to the SSD drives.
upvoted 1 times
zellck
1 year, 4 months ago
The recommended file size is not 1 GB. https://cloud.google.com/dataproc/docs/support/spark-job-tuning#limit_the_number_of_files: "Store data in larger file sizes, for example, file sizes in the 256MB–512MB range."
upvoted 1 times
...
...
gudiking
1 year, 5 months ago
Selected Answer: A
https://www.dremio.com/blog/tuning-parquet/
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other