Exam AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 132 discussion

Question #: 132
Topic #: 1

A company uploads .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to perform data discovery and to create the tables and schemas.

An AWS Glue job writes processed data from the tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift tables in the Redshift database appropriately.

If the company reruns the AWS Glue job for any reason, duplicate records are introduced into the Amazon Redshift tables. The company needs a solution that will update the Redshift tables without duplicates.

Which solution will meet these requirements?

A. Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
B. Modify the AWS Glue job to load the previously inserted data into a MySQL database. Perform an upsert operation in the MySQL database. Copy the results to the Amazon Redshift tables.
C. Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates. Write the data to the Redshift tables.
D. Use the AWS Glue ResolveChoice built-in transform to select the value of the column from the most recent record.

Suggested Answer: A

by Shanmahi at Aug. 6, 2024, 7:40 a.m.

Comments

Shanmahi

Highly Voted 1 year ago

Selected Answer: A

A two-step approach: have the Glue job load the rows into a staging table, then use Redshift's MERGE statement to update the target table from the staging table. Finally, truncate or otherwise clean up the staging table.
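As a rough sketch of how this could be wired into the Glue job: the Glue Redshift writer accepts `preactions` and `postactions` SQL strings in its connection options, so the job can load the staging table and run the MERGE afterward. The helper name `redshift_merge_options`, the table names `sales` and `sales_stage`, the key column `id`, and the column list are all hypothetical. The sketch also assumes the target table already exists and that the staging table has at most one row per key.

```python
def redshift_merge_options(database, target, staging, key, columns):
    """Build Glue connection options that load `staging`, then MERGE into `target`.

    Hypothetical helper: all table and column names are supplied by the caller.
    """
    # Columns to overwrite when a key already exists in the target.
    set_clause = ", ".join(f"{c} = {staging}.{c}" for c in columns if c != key)
    col_list = ", ".join(columns)
    val_list = ", ".join(f"{staging}.{c}" for c in columns)

    # Runs before the load: start from an empty staging table shaped like the target.
    pre = (
        f"DROP TABLE IF EXISTS {staging}; "
        f"CREATE TABLE {staging} (LIKE {target});"
    )
    # Runs after the load: upsert into the target, then clean up the staging table.
    post = (
        f"BEGIN; "
        f"MERGE INTO {target} USING {staging} "
        f"ON {target}.{key} = {staging}.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({val_list}); "
        f"DROP TABLE {staging}; "
        f"END;"
    )
    return {
        "database": database,
        "dbtable": staging,  # the Glue job writes to staging, not to the target
        "preactions": pre,
        "postactions": post,
    }


# Example with assumed names; in a Glue job this dict would be passed as
# connection_options to glueContext.write_dynamic_frame.from_jdbc_conf(...).
opts = redshift_merge_options("dev", "sales", "sales_stage", "id", ["id", "amount"])
print(opts["postactions"])
```

Because the job only ever writes to the staging table, a rerun replays the same MERGE against the same keys and updates the existing rows instead of appending new ones.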

upvoted 7 times


praveenu

Most Recent 3 months ago

Selected Answer: A

Write all processed data (including potential duplicates) to a temporary staging table in Redshift. Then run SQL commands in Redshift that match staging rows to target rows on a unique key: update existing records in the target table with the new values from the staging table, and insert records that don't already exist. Optionally truncate the staging table after the upsert. This approach does the merge in Redshift's own SQL engine and ensures that reruns add no duplicates to the final tables.
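The update-then-insert logic can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for Redshift, so it illustrates the pattern only and is not Redshift syntax. The table names `target` and `staging`, the columns `id` and `amount`, and the function `upsert_from_staging` are assumptions made for the example. The target table deliberately has no primary key, since Redshift treats primary key constraints as informational and does not enforce them. Each batch is assumed to hold at most one row per key.

```python
import sqlite3


def upsert_from_staging(conn, rows):
    """Load rows into staging, then update-or-insert them into target."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("INSERT INTO staging (id, amount) VALUES (?, ?)", rows)
        # Step 1: overwrite target rows whose key is present in staging.
        conn.execute("""
            UPDATE target
            SET amount = (SELECT s.amount FROM staging s WHERE s.id = target.id)
            WHERE id IN (SELECT id FROM staging)
        """)
        # Step 2: insert staging rows whose key is not yet in target.
        conn.execute("""
            INSERT INTO target (id, amount)
            SELECT s.id, s.amount FROM staging s
            WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
        """)
        # Step 3: empty the staging table for the next run.
        conn.execute("DELETE FROM staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL)")

batch = [(1, 10.0), (2, 20.0)]
upsert_from_staging(conn, batch)
upsert_from_staging(conn, batch)                    # simulated rerun of the job
upsert_from_staging(conn, [(2, 25.0), (3, 30.0)])   # one update, one new row

result = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
print(result)  # [(1, 10.0), (2, 25.0), (3, 30.0)]
```

The rerun of the first batch leaves the target unchanged, which is the behavior the question asks for. By contrast, calling `dropDuplicates()` on the incoming DataFrame (option C) only deduplicates within the batch being written and cannot see rows already in Redshift.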

upvoted 1 time
