Exam AWS Certified Data Analytics - Specialty topic 1 question 6 discussion

A company has a business unit that uploads .csv files to an Amazon S3 bucket. The company's data platform team has set up an AWS Glue crawler to discover the data and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database, handling column mapping and creating the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into the Amazon Redshift table.
Which solution will update the Redshift table without duplicates when jobs are rerun?

  • A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
  • B. Load the previously inserted data into a MySQL database in the AWS Glue job. Perform an upsert operation in MySQL, and copy the results to the Amazon Redshift table.
  • C. Use Apache Spark's DataFrame dropDuplicates() API to eliminate duplicates and then write the data to Amazon Redshift.
  • D. Use the AWS Glue ResolveChoice built-in transform to select the most recent value of the column.
Suggested Answer: A
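The staging-table pattern behind answer A can be sketched in a few lines. The executable part below only builds the merge SQL strings; the table names (`target_table`, `stage_table`), key column (`id`), connection name, and S3 path are all hypothetical placeholders, and the Glue writer call is shown as a comment since it only runs inside a Glue job.

```python
# Sketch of the staged-merge SQL an AWS Glue job can pass to the Redshift
# writer. Names (target_table, stage_table, id) are hypothetical examples.

def build_merge_actions(target, stage, key):
    """Return (preactions, postactions) SQL for an upsert via a staging table."""
    # Before the load: make sure the staging table exists with the same schema.
    preactions = f"CREATE TABLE IF NOT EXISTS {stage} (LIKE {target});"
    # After the load: delete matching rows from the main table, insert the
    # staged rows, and drop the staging table, all in one transaction.
    postactions = (
        f"BEGIN;"
        f"DELETE FROM {target} USING {stage} WHERE {target}.{key} = {stage}.{key};"
        f"INSERT INTO {target} SELECT * FROM {stage};"
        f"DROP TABLE {stage};"
        f"END;"
    )
    return preactions, postactions

preactions, postactions = build_merge_actions("target_table", "stage_table", "id")
print(postactions)

# Inside the Glue job, these strings go into the connection options of the
# Redshift writer, e.g. (placeholders throughout):
#
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dynamic_frame,
#     catalog_connection="redshift-connection",
#     connection_options={
#         "dbtable": "stage_table",
#         "database": "dev",
#         "preactions": preactions,
#         "postactions": postactions,
#     },
#     redshift_tmp_dir="s3://my-temp-bucket/tmp/",
# )
```

Because the delete-and-insert runs as postactions after every load, rerunning the job simply replaces the rows for the keys it carries instead of appending duplicates.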

Comments

testtaker3434
Highly Voted 3 years, 8 months ago
Answer should be A according to the link provided. Thoughts?
upvoted 19 times
lakediver
3 years, 5 months ago
Indeed A https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
upvoted 6 times
Huy
Highly Voted 3 years, 7 months ago
B is wrong: we don't need a staging database here, which would be costly, and MySQL is not the right choice anyway. C: dropDuplicates() removes duplicate records within the Spark DataFrame, not in the destination database. D: ResolveChoice casts data of an ambiguous type to a specified type, and it also works on the Spark side, not the destination database. A is the answer.
upvoted 15 times
NikkyDicky
Most Recent 1 year, 10 months ago
Selected Answer: A
It's A.
upvoted 1 times
Espa
2 years ago
Selected Answer: A
To me, A looks like the correct answer; check this link: https://stackoverflow.com/questions/52397646/aws-glue-to-redshift-duplicate-data
upvoted 2 times
pk349
2 years, 1 month ago
A: I passed the test
upvoted 1 times
AwsNewPeople
2 years, 2 months ago
A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.

To update the Redshift table without duplicates when AWS Glue jobs are rerun, the company should modify the AWS Glue job to copy the rows into a staging table, then add SQL commands that replace the existing rows in the main table as postactions in the DynamicFrameWriter class. This approach ensures that the data written to the Redshift table contains no duplicates and that the table holds only the latest data.

Loading the previously inserted data into a MySQL database and performing an upsert operation may be feasible, but it adds complexity to the architecture. Spark's dropDuplicates() API only removes duplicates within the incoming dataset, not against rows already stored in Redshift. The ResolveChoice built-in transform handles schema ambiguity in a column, not duplicate removal.
upvoted 4 times
itsme1
2 years, 3 months ago
Selected Answer: A
Option B would copy the Redshift data into MySQL and back to Redshift. Option A is simpler.
upvoted 1 times
tpompeu
2 years, 4 months ago
Selected Answer: A
A, for sure
upvoted 1 times
henom
2 years, 6 months ago
Correct answer: A. B is incorrect because you can't use the COPY command to copy data directly from a MySQL database into Amazon Redshift. A workaround is to move the MySQL data into Amazon S3 and use AWS Glue with a staging table to perform the upsert. Since that method requires more effort, it is not the best approach to solve the problem.
upvoted 1 times
cloudlearnerhere
2 years, 7 months ago
Selected Answer: A
Correct answer is A, as Redshift does not support merge or upsert on a single table. However, a staging table can be created and its data merged into the main table. Option B is wrong, as a staging database such as MySQL is not required. Option C is wrong, as dropDuplicates() removes duplicate records within Spark, not in the destination database. Option D is wrong, as ResolveChoice casts data of an unidentified data type to a specified data type; it does not handle duplicates.
upvoted 7 times
rocky48
2 years, 10 months ago
Selected Answer: A
Answer is A
upvoted 1 times
rocky48
2 years, 7 months ago
I got confused with C, since dataframe.dropDuplicates() would also work on the incoming data, but as per the given question we have to stick to the AWS Glue job, thus the answer is A.
upvoted 1 times
Bik000
3 years ago
Selected Answer: A
Answer is A
upvoted 1 times
Shivanikats
3 years, 4 months ago
Answer is A
upvoted 1 times
Donell
3 years, 7 months ago
Answer: A. Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.
upvoted 3 times
leliodesouza
3 years, 7 months ago
The answer is A.
upvoted 2 times
ariane_tateishi
3 years, 7 months ago
A should be the right answer. I found a link that helps explain why: https://aws.amazon.com/pt/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
upvoted 1 times
lostsoul07
3 years, 7 months ago
A is the right answer
upvoted 4 times
Community vote distribution: A (35%), C (25%), B (20%), Other