Your organization has a petabyte of application logs stored as Parquet files in Cloud Storage. You need to quickly perform a one-time SQL-based analysis of the files and join them to data that already resides in BigQuery. What should you do?
A.
Create a Dataproc cluster, and write a PySpark job to join the data from BigQuery to the files in Cloud Storage.
B.
Launch a Cloud Data Fusion environment, use plugins to connect to BigQuery and Cloud Storage, and use the SQL join operation to analyze the data.
C.
Create external tables over the files in Cloud Storage, and perform SQL joins to tables in BigQuery to analyze the data.
D.
Use the bq load command to load the Parquet files into BigQuery, and perform SQL joins to analyze the data.
The correct answer is C. Creating external tables over the Parquet files in Cloud Storage lets BigQuery query the data in place with standard SQL and join it directly to native BigQuery tables, avoiding the time and cost of loading a petabyte of data. That makes it the most direct fit for a quick, one-time analysis. Option A (Dataproc/PySpark) and option B (Cloud Data Fusion) both require provisioning and configuring additional services, adding setup complexity that a one-time SQL analysis does not justify. Option D (bq load) would work, but loading a petabyte of Parquet into BigQuery is slow and incurs unnecessary storage cost for data that only needs to be queried once.
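As a minimal sketch of option C (the bucket path, dataset, table, and column names below are hypothetical), you would define an external table over the Parquet files and then join it to an existing native table in a single query:

```sql
-- Define an external table over Parquet files in Cloud Storage.
-- No data is loaded; BigQuery reads the files in place at query time.
CREATE EXTERNAL TABLE mydataset.app_logs_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-log-bucket/logs/*.parquet']
);

-- One-time analysis: join the external logs to a native BigQuery table.
SELECT
  u.user_id,
  COUNT(*) AS error_count
FROM mydataset.app_logs_ext AS l
JOIN mydataset.users AS u
  ON l.user_id = u.user_id
WHERE l.severity = 'ERROR'
GROUP BY u.user_id;
```

Because the external table is just metadata pointing at the files, it can be created in seconds and dropped after the analysis, which is exactly what a one-time job calls for.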