Exam Professional Data Engineer topic 1 question 225 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 225
Topic #: 1

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

  • A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • B. Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
  • C. Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
  • D. Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.
Suggested Answer: A 🗳️

Comments

raaad
Highly Voted 1 year, 4 months ago
Selected Answer: A
- This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs.
- Using Dataproc Metastore to manage metadata allows us to keep the Hadoop ecosystem's structural information.
- Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters.
- Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis.
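A minimal sketch of submitting one of the migrated jobs as a Dataproc Serverless batch with the google-cloud-dataproc Python client (project, region, bucket, and file names are placeholders):

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The existing Spark job, repackaged as a PySpark batch (paths are placeholders).
batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/daily_pipeline.py"
    )
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}", batch=batch
)
print(operation.result().state)  # waits for the batch to finish
```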
upvoted 6 times
...
380e3c6
Most Recent 3 months ago
Selected Answer: A
A is correct because it minimizes ETL changes, keeps Parquet data in Cloud Storage (cost-effective and Spark-compatible), and integrates with BigQuery via external tables. C is flawed since moving directly to BigQuery requires refactoring Spark jobs, increasing complexity and costs. B adds unnecessary governance overhead, and D focuses on infrastructure instead of pipeline efficiency.
upvoted 1 times
...
plum21
3 months, 1 week ago
Selected Answer: D
The requirement "You want to use managed services" excludes Dataproc Serverless; Dataproc on Compute Engine remains. Next requirement: "BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery" -> BigLake. Next requirement: "while minimizing ETL data processing changes and overhead costs" -> refactor Spark pipelines to write and read data on Cloud Storage. Note: Dataproc Metastore (DPMS) could be used on Dataproc to read data from BQ, but not the other way round.
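For context, the "migrate your data to BigLake" step in D would typically be a one-time DDL statement over the Parquet files in Cloud Storage; a sketch with the google-cloud-bigquery client (project, dataset, connection, and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# BigLake table over the Parquet files, authorized through a Cloud Resource connection.
ddl = """
CREATE EXTERNAL TABLE `my-project.analytics.customers`
WITH CONNECTION `my-project.us.gcs-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/customers/*.parquet']
)
"""
client.query(ddl).result()  # wait for the DDL job to finish
```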
upvoted 1 times
Positron75
5 days, 17 hours ago
How is Dataproc Serverless not a managed service, but running Dataproc on Compute Engine is? D is the first answer to rule out.
upvoted 1 times
...
...
skhaire
3 months, 2 weeks ago
Selected Answer: B
BigQuery Integration: The requirement is to make data available in BigQuery. Dataplex has built-in integration with BigQuery. It can automatically discover data in Cloud Storage and create external tables in BigQuery, making the data readily queryable. DPMS doesn't have this direct integration with BigQuery.
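What Dataplex discovery automates here amounts to roughly this manual step, sketched with the google-cloud-bigquery client (table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define a BigQuery external table over the Parquet files in Cloud Storage.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/customers/*.parquet"]

table = bigquery.Table("my-project.analytics.customers")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)  # the data stays in Cloud Storage
```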
upvoted 4 times
...
LP_PDE
3 months, 4 weeks ago
Selected Answer: A
Both Spark and BigQuery can directly access data in Cloud Storage.
upvoted 1 times
...
hrishi19
6 months, 1 week ago
Selected Answer: C
The question states that the data should be available in BigQuery, and only option C meets this requirement.
upvoted 3 times
...
JamesKarianis
9 months, 2 weeks ago
Selected Answer: A
A is correct
upvoted 1 times
...
Anudeep58
11 months, 3 weeks ago
Selected Answer: A
Option B: Registering the bucket as a Dataplex asset adds an additional layer of data governance and management. While useful, it may not be necessary for your immediate migration needs and can introduce additional complexity.
Option C: Migrating data directly to BigQuery would require significant changes to your Spark pipelines, since they would need to be refactored to read from and write to BigQuery instead of Parquet files. This approach could introduce higher costs due to BigQuery storage and querying.
Option D: Using BigLake and Dataproc on Compute Engine is more complex and requires more management compared to Dataproc Serverless. Additionally, it might not be as cost-effective as leveraging Cloud Storage and Dataproc Serverless.
upvoted 3 times
aoifneofi_ef
9 months, 1 week ago
Just adding further commentary on why A is correct; why the other options are incorrect is explained above. Parquet files have the schema embedded in them, so the Spark pipelines on the Hadoop cluster may not have needed tables at all. The simplest solution is therefore to move the data to Cloud Storage instead of BigQuery; this way there are minimal changes to the ETL pipelines: just change the HDFS file system pointer to the GCS file system for reads and writes, with no need for any additional tables.
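A minimal sketch of that pointer change (bucket and path names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-customer-job").getOrCreate()

# Before (on-premises): df = spark.read.parquet("hdfs://namenode/data/customers/")
# After (Cloud Storage): only the file system prefix changes.
df = spark.read.parquet("gs://my-bucket/data/customers/")

transformed = df.filter(df["active"])  # stand-in for the existing transformation logic
transformed.write.mode("overwrite").parquet("gs://my-bucket/data/customers_daily/")
```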
upvoted 2 times
...
...
josech
1 year ago
Selected Answer: A
The question says "You want to use managed services, while minimizing ETL data processing changes and overhead costs". Dataproc is a managed service, and you don't need to refactor the Spark data transformation code you already have (you only have to refactor the write and read code), and it has a BigQuery connector for future use. https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery
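For that future use, reading a BigQuery table into the existing Spark code via the connector might look like this (the connector ships with Dataproc runtimes; the table name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# Read a BigQuery table into a Spark DataFrame through the spark-bigquery connector.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.customers")
    .load()
)
df.show(5)
```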
upvoted 1 times
...
52ed0e5
1 year, 2 months ago
Selected Answer: C
Migrate your data directly to BigQuery, refactor the Spark pipelines to read from and write to BigQuery, and run the Spark jobs on Dataproc Serverless. This is the best choice for ensuring data availability in BigQuery: it allows seamless integration with BigQuery and minimizes ETL changes.
upvoted 3 times
...
Ramon98
1 year, 2 months ago
Selected Answer: C
A tricky one, because of "you need to ensure that your data is available in BigQuery". The easiest and most straightforward migration seems to be answer A, and then you can use external tables to make the Parquet data directly available in BigQuery. https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet However, creating the external tables is an extra step, so maybe C is the answer?
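The extra step is small either way; a sketch of loading the Parquet files into a native BigQuery table with the google-cloud-bigquery client (URIs and table name are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Parquet carries its own schema, so no explicit schema definition is needed.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data/customers/*.parquet",
    "my-project.analytics.customers",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```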
upvoted 4 times
...
Moss2011
1 year, 2 months ago
Selected Answer: C
I think the key phrase here is "you need to ensure that your data is available in BigQuery"; that's why I think C is the best option.
upvoted 1 times
...
JyoGCP
1 year, 3 months ago
Selected Answer: C
I think it's C. Dataproc can use BigQuery to read and write data. Dataproc's BigQuery connector is a library that allows Spark and Hadoop applications to process data from BigQuery and write data back to it. Here's how Dataproc can be used with BigQuery:
- Process large datasets: use Spark to process data stored in BigQuery.
- Write results: write the results back to BigQuery or other data storage for further analysis.
- Read data: the BigQuery connector can read data from BigQuery into a Spark DataFrame.
- Write data: the connector writes data to BigQuery by first buffering it in a temporary Cloud Storage location.
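A sketch of that write path, staging through a temporary Cloud Storage bucket (bucket and table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])  # stand-in result

# The connector buffers the rows in the temporary bucket, then issues a
# BigQuery load job to materialize the table.
(
    df.write.format("bigquery")
    .option("table", "my-project.analytics.results")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```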
upvoted 3 times
JyoGCP
1 year, 3 months ago
As per question.. "BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services (DATAPROC), while minimizing ETL data processing changes and overhead costs."
upvoted 3 times
...
...
matiijax
1 year, 3 months ago
Selected Answer: B
I think it's B, and the reason is that registering the data as a Dataplex asset enables seamless integration with BigQuery later on. Dataplex simplifies data discovery and lineage tracking, making it easier to prepare your data for BigQuery transformations.
upvoted 3 times
...
saschak94
1 year, 3 months ago
Why would I select A here? Why not move the data to BigQuery and run Dataproc Serverless jobs accessing the data in BigQuery?
upvoted 3 times
...
e70ea9e
1 year, 4 months ago
Selected Answer: A
Managed services: leverages Dataproc Serverless for a fully managed Spark environment, reducing overhead and administrative tasks.
Minimal data processing changes: keeps Spark pipelines largely intact by working with Parquet files on Cloud Storage, minimizing refactoring efforts.
BigQuery integration: Dataproc Serverless can directly access BigQuery, enabling future transformation pipelines without additional data movement.
Cost-effective: the serverless model scales resources only when needed, optimizing costs for intermittent workloads.
upvoted 3 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other.