A company has a large, unstructured dataset. The dataset includes many duplicate records across several key attributes. Which solution on AWS will detect duplicates in the dataset with the LEAST code development?
A. Use Amazon Mechanical Turk jobs to detect duplicates.
B. Use Amazon QuickSight ML Insights to build a custom deduplication model.
C. Use Amazon SageMaker Data Wrangler to pre-process and detect duplicates.
D. Use the AWS Glue FindMatches transform to detect duplicates.
AWS Glue FindMatches is specifically designed to identify duplicate or matching records in datasets, even when the records share no unique identifier and their fields do not match exactly. It uses machine learning to find fuzzy matches, and you fine-tune the matching by labeling a small set of example records, making it ideal for this scenario.
https://aws.amazon.com/about-aws/whats-new/2021/11/aws-glue-findmatches-new-data-existing-dataset/
"allows you to identify duplicate or matching records in your dataset"
AWS Glue FindMatches requires the least code development because it's a purpose-built transform specifically for matching similar records. You can train it with examples of matches and non-matches, and then apply it to your entire dataset without writing complex matching algorithms.
Amazon SageMaker Data Wrangler can help with data preparation, but would require more manual coding to implement custom deduplication logic.
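To make the "least code" claim for FindMatches concrete, here is a minimal sketch of the request you would pass to the boto3 Glue client's `create_ml_transform` call. The transform name, database, table, primary key column, and role ARN are all hypothetical placeholders, and the tradeoff value is an arbitrary starting point rather than a recommendation. The sketch only builds the request; the function that actually calls AWS is never invoked here and would need valid credentials and an existing Glue Data Catalog table.

```python
def build_find_matches_request(name, database, table, primary_key, role_arn):
    """Build keyword arguments for glue.create_ml_transform (FIND_MATCHES).

    All argument values are supplied by the caller; nothing here is
    specific to the dataset in the question.
    """
    return {
        "Name": name,
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table},
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Column that uniquely identifies each input row.
                "PrimaryKeyColumnName": primary_key,
                # Assumed starting value between 0 and 1; tune after labeling.
                "PrecisionRecallTradeoff": 0.9,
            },
        },
        "Role": role_arn,
    }


def create_transform(request):
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    return glue.create_ml_transform(**request)["TransformId"]


# Hypothetical names, for illustration only.
request = build_find_matches_request(
    name="dedupe-customers",
    database="sales_db",
    table="customers",
    primary_key="customer_id",
    role_arn="arn:aws:iam::123456789012:role/GlueFindMatchesRole",
)
```

After the transform exists, the remaining work is labeling example records and running the transform in a Glue job, so no matching algorithm has to be written by hand.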
I would argue you should use Data Wrangler if you want the least code development. See https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-data-insights.html#data-wrangler-data-insights-samples
"You can remove duplicate samples from the dataset using the Drop duplicates transform under Manage rows."
The AWS Glue FindMatches transform is the most appropriate solution because it is specifically designed to detect duplicates, requires minimal development effort, and scales efficiently for large datasets.
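The disagreement above between a "Drop duplicates" step and FindMatches largely comes down to exact versus fuzzy duplicates. The following sketch illustrates that difference using only the Python standard library. It is not how either AWS service works internally: the sample records, the `difflib` similarity measure, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample records.
records = [
    {"id": 1, "name": "John Smith", "city": "Seattle"},
    {"id": 2, "name": "John Smith", "city": "Seattle"},    # exact duplicate of 1
    {"id": 3, "name": "Jon Smith", "city": "Seatle"},      # fuzzy duplicate of 1
    {"id": 4, "name": "Maria Garcia", "city": "Austin"},
]


def key(record):
    """Attributes compared when looking for duplicates."""
    return (record["name"], record["city"])


def drop_exact_duplicates(rows):
    """Keep the first row for each distinct key (exact matching only)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def similarity(a, b):
    """Character-level similarity ratio between two records, from 0 to 1."""
    text_a = " ".join(key(a)).lower()
    text_b = " ".join(key(b)).lower()
    return SequenceMatcher(None, text_a, text_b).ratio()


def fuzzy_pairs(rows, threshold=0.85):
    """Return id pairs whose similarity meets the assumed threshold."""
    return [
        (a["id"], b["id"])
        for i, a in enumerate(rows)
        for b in rows[i + 1:]
        if similarity(a, b) >= threshold
    ]


deduped = drop_exact_duplicates(records)
print([r["id"] for r in deduped])   # record 2 is gone, record 3 survives
print(fuzzy_pairs(deduped))         # the near-duplicate pair is still flagged
```

Exact deduplication removes record 2 but leaves the misspelled record 3 in place, which is the kind of duplicate a fuzzy matcher is meant to catch.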
Saransundar (Highly Voted, 4 months, 3 weeks ago)
GiorgioGss (Highly Voted, 5 months ago)
snna4 (Most Recent, 4 days, 15 hours ago)
conrad2023 (2 months, 2 weeks ago)
feelgoodfactor (4 months, 2 weeks ago)
nakidal495 (4 months, 4 weeks ago)