Exam Professional Data Engineer All Questions

View all questions & answers for the Professional Data Engineer exam

Exam Professional Data Engineer topic 1 question 111 discussion

Actual exam question from Google's Professional Data Engineer

Question #: 111
Topic #: 1

[All Professional Data Engineer Questions]

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the
Data Science team runs a query filtered on a date column and limited to 30`"90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.
What should you do?

A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
C. Modify your pipeline to maintain the last 30ג€"90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

Show Suggested Answer

Suggested Answer: A 🗳️

by [deleted] at March 22, 2020, 1:15 p.m.

Comments

Submit Cancel

[Removed]

Highly Voted 3 years, 9 months ago

should be A

upvoted 35 times

...

[Removed]

Highly Voted 3 years, 8 months ago

Answer: A Description: Partition is the solution for reducing cost and time

upvoted 18 times

willbot

3 years, 6 months ago

but how would recreating tables with 3 years of data, maintain the ability to conduct sql queries during that time?

upvoted 1 times

squishy_fishy

2 years, 2 months ago

Recreating the new table, the old table will still have new data coming, then append the difference to the new table.

upvoted 2 times

...

odacir

Most Recent 1 year ago

Selected Answer: A

Answer: A, has no cost to reload the data, Also Partition is the solution for reducing cost and time

upvoted 1 times

...

zellck

1 year ago

Selected Answer: A

A is the answer. https://cloud.google.com/bigquery/docs/partitioned-tables A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query. You can partition BigQuery tables by: - Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.

upvoted 3 times

1 year, 11 months ago

Why not D? You can use SQL. This is the cheapest and fastest option https://cloud.google.com/bigquery/docs/querying-wildcard-tables

upvoted 2 times

John_Pongthorn

1 year, 2 months ago

Partitioning is recommended over table sharding, because partitioned tables perform better This is a google recommendation nowaday.

upvoted 1 times

...

StefanoG

2 years ago

Selected Answer: A

The D solution is obviously discarded. The request NOT require ONLY LAST 30-90 days, so the C solution is not the right solution. In addition to this, the request ask to keep the possibility to made queries, so B is wrost. Is not mandatory make the queries while you make the modify so the right answer is A

upvoted 4 times

...

JayZeeLee

2 years, 1 month ago

B sounds more feasible. The point is 'historical' data, not new table/data. Recreating tables from the past three years is a lot of work. Might as well export the table and run analyses there. No cost for exporting in BigQuery.

upvoted 1 times

...

sumanshu

2 years, 5 months ago

Vote for A

upvoted 5 times

...

arghya13

3 years, 1 month ago

I will go with Option A

upvoted 5 times

...

Alasmindas

3 years, 1 month ago

I will go with Option A, although at first instance I felt Option C would be correct. Option A : Because partitioning will help to address both the concerns mentioned in the question - i.e. faster query and reducing cost. Option C : Modifying the data pipeline to store last 30-90 days data would have possible, if there was a point mentioned that only the latest data (30-90 days) is kept and the older data - beyond 90 days is moved to the master table. Since that point is mot mentioned, we will land up having multiple - 30-90 days data in separate tables + the master table.

upvoted 3 times

karthik89

2 years, 10 months ago

but how will you append the data that is older than 90days in to the master table?

upvoted 2 times

...

Cloud_Enthusiast

3 years, 1 month ago

Answer is A. Recreating the DDL with new parition is easy and does not require any changes on applications that read data from it

upvoted 4 times

...

Load full discussion...