Exam AWS Certified Data Analytics - Specialty topic 1 question 6 discussion

A company has a business unit that uploads .csv files to an Amazon S3 bucket. The company's data platform team has set up an AWS Glue crawler to discover the data and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database, handling column mapping and creating the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into the Amazon Redshift table.
Which solution will update the Redshift table without duplicates when jobs are rerun?

  • A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
  • B. Load the previously inserted data into a MySQL database in the AWS Glue job. Perform an upsert operation in MySQL, and copy the results to the Amazon Redshift table.
  • C. Use Apache Spark's DataFrame dropDuplicates() API to eliminate duplicates and then write the data to Amazon Redshift.
  • D. Use the AWS Glue ResolveChoice built-in transform to select the most recent value of the column.
Suggested Answer: A
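The staging-table pattern behind answer A can be sketched in a few lines. The executable part below only builds the merge SQL strings; the table names (`target_table`, `stage_table`), key column (`id`), connection name, and S3 path are all hypothetical placeholders, and the Glue writer call is shown as a comment since it only runs inside a Glue job.

```python
# Sketch of the staged-merge SQL an AWS Glue job can pass to the Redshift
# writer. Names (target_table, stage_table, id) are hypothetical examples.

def build_merge_actions(target, stage, key):
    """Return (preactions, postactions) SQL for an upsert via a staging table."""
    # Before the load: make sure the staging table exists with the same schema.
    preactions = f"CREATE TABLE IF NOT EXISTS {stage} (LIKE {target});"
    # After the load: delete matching rows from the main table, insert the
    # staged rows, and drop the staging table, all in one transaction.
    postactions = (
        f"BEGIN;"
        f"DELETE FROM {target} USING {stage} WHERE {target}.{key} = {stage}.{key};"
        f"INSERT INTO {target} SELECT * FROM {stage};"
        f"DROP TABLE {stage};"
        f"END;"
    )
    return preactions, postactions

preactions, postactions = build_merge_actions("target_table", "stage_table", "id")
print(postactions)

# Inside the Glue job, these strings go into the connection options of the
# Redshift writer, e.g. (placeholders throughout):
#
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dynamic_frame,
#     catalog_connection="redshift-connection",
#     connection_options={
#         "dbtable": "stage_table",
#         "database": "dev",
#         "preactions": preactions,
#         "postactions": postactions,
#     },
#     redshift_tmp_dir="s3://my-temp-bucket/tmp/",
# )
```

Because the delete-and-insert runs as postactions after every load, rerunning the job simply replaces the rows for the keys it carries instead of appending duplicates.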

Comments

testtaker3434
Highly Voted 3 years, 8 months ago
Answer should be A according to the link provided. Thoughts?
upvoted 19 times
lakediver
3 years, 5 months ago
Indeed A https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
upvoted 6 times
Huy
Highly Voted 3 years, 7 months ago
B is wrong: we don't need a staging database here, which would be costly, and MySQL is not the right choice anyway. C: dropDuplicates() removes duplicate records within the Spark DataFrame, not in the destination database. D: ResolveChoice casts data of an ambiguous type to a specified type, and it also works on the Spark side, not the destination database. A is the answer.
upvoted 15 times
NikkyDicky
Most Recent 1 year, 10 months ago
Selected Answer: A
It's A.
upvoted 1 times
Espa
2 years ago
Selected Answer: A
To me, A looks like the correct answer; check this link: https://stackoverflow.com/questions/52397646/aws-glue-to-redshift-duplicate-data
upvoted 2 times
pk349
2 years, 1 month ago
A: I passed the test
upvoted 1 times
AwsNewPeople
2 years, 2 months ago
A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.

To update the Redshift table without duplicates when AWS Glue jobs are rerun, the company should modify the AWS Glue job to copy the rows into a staging table, then add SQL commands that replace the existing rows in the main table as postactions in the DynamicFrameWriter class. This approach ensures that the data written to the Redshift table contains no duplicates and that the table holds only the latest data.

Loading the previously inserted data into a MySQL database and performing an upsert operation may be feasible, but it adds complexity to the architecture. Spark's dropDuplicates() API only removes duplicates within the incoming dataset, not against rows already stored in Redshift. The ResolveChoice built-in transform handles schema ambiguity in a column, not duplicate removal.
upvoted 4 times
itsme1
2 years, 3 months ago
Selected Answer: A
Option B would copy the Redshift data into MySQL and back to Redshift. Option A is simpler.
upvoted 1 times
tpompeu
2 years, 4 months ago
Selected Answer: A
A, for sure
upvoted 1 times
henom
2 years, 6 months ago
Correct answer: A. B is incorrect because you can't use the COPY command to copy data directly from a MySQL database into Amazon Redshift. A workaround is to move the MySQL data into Amazon S3 and use AWS Glue with a staging table to perform the upsert. Since that method requires more effort, it is not the best approach to solve the problem.
upvoted 1 times
cloudlearnerhere
2 years, 7 months ago
Selected Answer: A
Correct answer is A, as Redshift does not support merge or upsert on a single table. However, a staging table can be created and its data merged into the main table. Option B is wrong, as a staging database such as MySQL is not required. Option C is wrong, as dropDuplicates() removes duplicate records within Spark, not in the destination database. Option D is wrong, as ResolveChoice casts data of an unidentified data type to a specified data type; it does not handle duplicates.
upvoted 7 times
rocky48
2 years, 10 months ago
Selected Answer: A
Answer is A
upvoted 1 times
rocky48
2 years, 7 months ago
I got confused with C, since dataframe.dropDuplicates() would also work on the incoming data, but as per the given question we have to stick to the AWS Glue job, thus the answer is A.
upvoted 1 times
Bik000
3 years ago
Selected Answer: A
Answer is A
upvoted 1 times
Shivanikats
3 years, 4 months ago
Answer is A
upvoted 1 times
Donell
3 years, 7 months ago
Answer: A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
upvoted 3 times
leliodesouza
3 years, 7 months ago
The answer is A.
upvoted 2 times
ariane_tateishi
3 years, 7 months ago
A should be the right answer. I found a link that helps explain why: https://aws.amazon.com/pt/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
upvoted 1 times
lostsoul07
3 years, 7 months ago
A is the right answer
upvoted 4 times
Community vote distribution: A (35%), C (25%), B (20%), Other