Exam Professional Data Engineer topic 1 question 3 discussion

Actual exam question from Google's Professional Data Engineer

Question #: 3
Topic #: 1

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Suggested Answer: C 🗳️

by [deleted] at March 15, 2020, 8:14 a.m.

Comments

MaxNRG

Highly Voted 3 years, 8 months ago

C is correct because it causes the least inconvenience compared with prespecified date ranges or one table per clinic, and it also improves performance by avoiding self-joins. A is not correct because adding compute resources is not a recommended way to resolve database schema problems. B is not correct because it reduces the functionality of the database and makes running reports more difficult. D is not correct because it will likely increase the number of tables so much that generating reports becomes harder than with the correct option. https://cloud.google.com/bigquery/docs/best-practices-performance-patterns https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#explicit-alias-visibility
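To make option C concrete, here is a minimal sketch of a normalized layout and a report that uses an ordinary join between two tables instead of a self-join. It uses Python's built-in `sqlite3` module purely for illustration; the question does not name a database engine, and the table names, column names, and sample rows are all assumptions.

```python
import sqlite3


def build_normalized_db() -> sqlite3.Connection:
    """Create hypothetical patients/visits tables with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE patients (
            patient_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE visits (
            visit_id   INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
            clinic     TEXT NOT NULL,
            visit_date TEXT NOT NULL
        );
        -- Index the join key so the report does not scan every visit.
        CREATE INDEX idx_visits_patient ON visits(patient_id);
        """
    )
    conn.executemany(
        "INSERT INTO patients (patient_id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO visits (patient_id, clinic, visit_date) VALUES (?, ?, ?)",
        [
            (1, "Clinic A", "2020-01-05"),
            (1, "Clinic B", "2020-03-01"),
            (2, "Clinic A", "2020-02-10"),
        ],
    )
    conn.commit()
    return conn


def visits_per_patient(conn: sqlite3.Connection) -> list:
    """Report visit count and latest visit date per patient."""
    return conn.execute(
        """
        SELECT p.name, COUNT(v.visit_id), MAX(v.visit_date)
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.patient_id, p.name
        ORDER BY p.patient_id
        """
    ).fetchall()


report = visits_per_patient(build_normalized_db())
print(report)
```

Each patient is stored once, and the report joins a small `patients` table to `visits` on an indexed key, rather than matching one large table against itself.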

upvoted 8 times

gord_nat

6 months, 2 weeks ago

Why are we assuming the database in question is BigQuery? There are several other RDBMS options in GCP. Also, why was the original db pushed to prod without being normalized first? Typically, the normalized db is what gets released to prod, and when the data set becomes larger, you add partitioning. The scenario being presented is unrealistic.

upvoted 2 times

...

balseron99

Highly Voted 4 years, 5 months ago

A is incorrect because adding space won't solve the query performance problem. B is incorrect because the question specifies nothing about how reports are generated, and sharding tables by date range is not a good option because it creates many tables. C is correct because the question says "the scope of the project has expanded. The database must now store 100 times more patient records". As the data grows, the single table becomes difficult to manage and query, so splitting it into separate tables fits the need. D is incorrect because it partitions by clinic. We have to adjust the database design so that it performs optimally when generating reports, and the question says nothing about report generation that would favor per-clinic tables.

upvoted 7 times

...

Nanto90

Most Recent 5 days, 8 hours ago

Selected Answer: C

Because you can optimize the storage of each table; furthermore, you will avoid self-joins.

upvoted 1 times

...

Ahamada

4 months, 2 weeks ago

Selected Answer: C

The answer is C. The problem here is the self-join on a denormalized table (avoid self-joins where possible), so the solution is to normalize.
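For readers unsure what the self-join in the question might look like, here is a hedged sketch. The question does not describe the actual schema, so the single `records` table below (patient rows and visit rows mixed together, linked by a `parent_id` column) is only one hypothetical layout, shown with Python's built-in `sqlite3` module.

```python
import sqlite3


def self_join_report(conn: sqlite3.Connection) -> list:
    """Count visits per patient by joining the mixed table to itself."""
    return conn.execute(
        """
        SELECT p.name, COUNT(v.record_id)
        FROM records AS p
        JOIN records AS v ON v.parent_id = p.record_id
        WHERE p.record_type = 'patient' AND v.record_type = 'visit'
        GROUP BY p.record_id, p.name
        ORDER BY p.record_id
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical single-table design: one row per patient OR per visit.
conn.execute(
    """
    CREATE TABLE records (
        record_id   INTEGER PRIMARY KEY,
        record_type TEXT NOT NULL,
        parent_id   INTEGER,
        name        TEXT,
        visit_date  TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?)",
    [
        (1, "patient", None, "Ada", None),
        (2, "patient", None, "Grace", None),
        (3, "visit", 1, None, "2020-01-05"),
        (4, "visit", 1, None, "2020-03-01"),
        (5, "visit", 2, None, "2020-02-10"),
    ],
)
conn.commit()

report = self_join_report(conn)
print(report)
```

Both sides of the join read the same table, so every report has to work through all patient and visit rows together. With 100 times more records, that is the kind of query the question describes as too slow or running out of resources.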

upvoted 1 times

...

cqrm3n

5 months, 4 weeks ago

Selected Answer: C

Normalizing the database into separate Patients and Visits tables, along with creating other necessary tables, is the best solution for handling the increased data size while ensuring efficient query performance and maintainability. This approach addresses the root problem instead of applying temporary fixes.

upvoted 1 times

...

SamuelTsch

8 months, 3 weeks ago

Selected Answer: C

C is the most suitable solution for this situation. It scales better and is easier to monitor. B is constrained to predefined date ranges, which is usually not suitable for reporting.

upvoted 1 times

...

rocky48

1 year, 8 months ago

Selected Answer: C

Normalization is a technique used to organize data in a relational database to reduce data redundancy and improve data integrity. Breaking the patient records into separate tables (patient and visits) and eliminating self-joins will make the database more scalable and improve query performance. It also helps maintain data integrity and makes it easier to manage large datasets efficiently. Options A, B, and D may provide some benefits in specific cases, but for a scenario where the project scope has expanded significantly and there are performance issues with self-joins, normalization (Option C) is the most robust and scalable solution.
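As a rough illustration of the redundancy point, the sketch below splits a denormalized table into `patients` and `visits` using `INSERT ... SELECT`. It assumes a hypothetical `patient_records` table in which the patient's name is repeated on every visit row. All names and data are made up, and `sqlite3` stands in for whatever engine is actually used.

```python
import sqlite3


def normalize(conn: sqlite3.Connection) -> None:
    """Split the hypothetical patient_records table into two tables."""
    conn.executescript(
        """
        CREATE TABLE patients (
            patient_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE visits (
            visit_id   INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
            clinic     TEXT NOT NULL,
            visit_date TEXT NOT NULL
        );
        -- Each patient is stored once, however many visits they have.
        INSERT INTO patients (patient_id, name)
            SELECT DISTINCT patient_id, name FROM patient_records;
        -- Visits keep only the key that points back to the patient.
        INSERT INTO visits (patient_id, clinic, visit_date)
            SELECT patient_id, clinic, visit_date FROM patient_records;
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_records ("
    "patient_id INTEGER, name TEXT, clinic TEXT, visit_date TEXT)"
)
conn.executemany(
    "INSERT INTO patient_records VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "Clinic A", "2020-01-05"),
        (1, "Ada", "Clinic B", "2020-03-01"),
        (2, "Grace", "Clinic A", "2020-02-10"),
    ],
)
conn.commit()

normalize(conn)
patient_count = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
visit_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(patient_count, visit_count)
```

After the split, a change to a patient's details touches one row in `patients` instead of every visit row, which is the data-integrity benefit described above.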

upvoted 4 times

...

rtcpost

1 year, 8 months ago

Selected Answer: C

upvoted 3 times

...

vaga1

2 years, 1 month ago

Selected Answer: C

"100 times more patient records" immediately suggests creating a patient dimension table to save disk space, assuming a generic relational database is meant.

upvoted 1 times

...