Exam AWS Certified Data Analytics - Specialty topic 1 question 13 discussion

Exam question from Amazon's AWS Certified Data Analytics - Specialty

Question #: 13
Topic #: 1

A company is planning to do a proof of concept for a machine learning (ML) project using Amazon SageMaker with a subset of existing on-premises data hosted in the company's 3 TB data warehouse. As part of the project, AWS Direct Connect is established and tested. To prepare the data for ML, data analysts are performing data curation. The data analysts want to perform multiple steps, including mapping, dropping null fields, resolving choice, and splitting fields. The company needs the fastest solution to curate the data for this project.
Which solution meets these requirements?

A. Ingest data into Amazon S3 using AWS DataSync and use Apache Spark scripts to curate the data in an Amazon EMR cluster. Store the curated data in Amazon S3 for ML processing.
B. Create custom ETL jobs on-premises to curate the data. Use AWS DMS to ingest data into Amazon S3 for ML processing.
C. Ingest data into Amazon S3 using AWS DMS. Use AWS Glue to perform data curation and store the data in Amazon S3 for ML processing.
D. Take a full backup of the data store and ship the backup files using AWS Snowball. Upload Snowball data into Amazon S3 and schedule data curation jobs using AWS Batch to prepare the data for ML.

Suggested Answer: C
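The four curation steps named in the question (mapping, dropping null fields, resolving choice, splitting fields) correspond by name to built-in AWS Glue transforms. The sketch below does not call Glue; it is a plain-Python illustration of what each step does to a small batch of records, assuming hypothetical field names (`cust_id`, `amt`, `notes`) and made-up sample rows.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Record = Dict[str, Any]

def apply_mapping(rec: Record, mapping: Dict[str, str]) -> Record:
    """Mapping: rename source fields to target names; unmapped fields are dropped."""
    return {target: rec.get(source) for source, target in mapping.items()}

def drop_null_fields(records: List[Record]) -> List[Record]:
    """Dropping null fields: remove fields that are null in every record."""
    keys = set().union(*records)
    all_null = {k for k in keys if all(r.get(k) is None for r in records)}
    return [{k: v for k, v in r.items() if k not in all_null} for r in records]

def resolve_choice(rec: Record, field: str, cast: Callable[[Any], Any]) -> Record:
    """Resolving choice: force a field that arrived with mixed types into one type."""
    out = dict(rec)
    if out.get(field) is not None:
        out[field] = cast(out[field])
    return out

def split_fields(rec: Record, keys: Set[str]) -> Tuple[Record, Record]:
    """Splitting fields: split one record into (selected fields, remaining fields)."""
    picked = {k: v for k, v in rec.items() if k in keys}
    rest = {k: v for k, v in rec.items() if k not in keys}
    return picked, rest

# Hypothetical sample rows: ids and amounts arrive as a mix of strings and numbers.
raw = [
    {"cust_id": "17", "amt": 12.5, "notes": None},
    {"cust_id": 42, "amt": "7", "notes": None},
]
mapping = {"cust_id": "customer_id", "amt": "amount", "notes": "notes"}

mapped = [apply_mapping(r, mapping) for r in raw]
non_null = drop_null_fields(mapped)
typed = [resolve_choice(resolve_choice(r, "customer_id", int), "amount", float)
         for r in non_null]
split = [split_fields(r, {"customer_id"}) for r in typed]
```

In an actual Glue job these steps would run on Spark against the data landed in S3; the helper functions here are illustrative only.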

by zanhsieh at Aug. 16, 2020, 1:59 p.m.

Comments

abhineet

Highly Voted 3 years, 10 months ago

C is correct; S3 is a valid target for DMS: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html

upvoted 28 times

GauravM17

3 years, 10 months ago

I guess it should be A. DMS cannot do the data preprocessing, and Spark is the best option for large datasets.

upvoted 2 times

Brijeshkrishna

3 years, 9 months ago

C is correct, as AWS Glue uses the Spark engine.

upvoted 2 times

...

zeronine

Highly Voted 3 years, 11 months ago

C. DMS supports S3 as a target.

upvoted 6 times

...

Frazy

Most Recent 1 year, 9 months ago

C: Option A, using AWS DataSync and Apache Spark scripts, involves provisioning and maintaining an Amazon EMR cluster, which adds complexity and management overhead. Option B, creating custom ETL jobs on-premises, requires significant development effort and may not be as efficient as using AWS Glue. Option D, using AWS Snowball for data transfer and AWS Batch for data curation, is less efficient and more time-consuming than the direct ingestion and curation approach.

upvoted 1 times

...

jerkane

1 year, 9 months ago

Selected Answer: C

C is correct; using Glue would be faster than using EMR.

upvoted 1 times

...

monkeydba

1 year, 9 months ago

This is the differentiator: DMS can read a database source; DataSync cannot. The question says "hosted in the company's 3 TB data warehouse". DataSync can read NFS, SMB, HDFS, and S3. https://docs.aws.amazon.com/datasync/latest/userguide/how-datasync-transfer-works.html#onprem-aws

upvoted 2 times

...

monkeydba

1 year, 9 months ago

DataSync can indeed pull a subset of data. https://docs.aws.amazon.com/datasync/latest/userguide/filtering.html

upvoted 1 times

...

monkeydba

1 year, 9 months ago

The question mentions "subset" of data. Can DataSync do that? DMS can.

upvoted 1 times

...
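For readers wondering how DMS selects a subset: a replication task takes a table-mapping JSON document in which selection rules choose schemas and tables and source filters restrict rows. The sketch below only builds such a document in Python; the schema, table, and column names (`sales`, `orders`, `order_date`) and the cutoff date are hypothetical, not taken from the question.

```python
import json

def selection_rule(rule_id, schema, table, column, operator, value):
    """Build one DMS selection rule that includes a table and filters its rows."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"include-{table}",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
        "filters": [
            {
                "filter-type": "source",
                "column-name": column,
                "filter-conditions": [
                    {"filter-operator": operator, "value": value}
                ],
            }
        ],
    }

# Hypothetical subset: only orders dated on or after 2020-01-01.
table_mappings = {
    "rules": [
        selection_rule(1, "sales", "orders", "order_date", "gte", "2020-01-01")
    ]
}
table_mappings_json = json.dumps(table_mappings, indent=2)
print(table_mappings_json)
```

The resulting JSON string is the kind of value supplied as the table mappings of a replication task; the rest of the task setup (endpoints, replication instance) is out of scope here.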

gofavad926

1 year, 10 months ago

Selected Answer: A

A. I don't understand why everyone agrees on C. DMS stands for Database Migration Service, and the question mentions a data warehouse, not a database, so this is not a DMS-compatible source: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html. A is the valid option because with DataSync you can migrate your data to S3 and then process it with EMR (more efficient than Glue).

upvoted 1 times

...

debasishg

1 year, 10 months ago

Selected Answer: C

C, because: 1. DataSync is used for file migration, DMS for database data. 2. Glue ETL is required to transform the data after migration.

upvoted 1 times

...

NikkyDicky

2 years ago

Selected Answer: C

C for sure

upvoted 1 times

...

pk349

2 years, 3 months ago

C: I passed the test

upvoted 1 times

...

cloudlearnerhere

2 years, 9 months ago

The correct answer is C, as DMS can be used for data migration to S3 and AWS Glue can be used for preprocessing and data curation. Option A is wrong, as DataSync is usually for storage migration and using Spark on EMR might not be as operationally efficient as Glue. Option B is wrong, as using on-premises custom ETL jobs might not be time-efficient. Option D is wrong, as the data migration using Snowball will take time.

upvoted 4 times

...
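Several comments dismiss option D on the grounds that Snowball takes too long for 3 TB. A rough back-of-the-envelope check supports that, under stated assumptions: the question does not give the Direct Connect port speed, so 1 Gbps and 10 Gbps are assumed here, along with a hypothetical 80% sustained link utilization.

```python
def transfer_hours(size_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimate hours to move size_tb terabytes over a link of link_gbps gigabits/s."""
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * utilization  # assumed sustained throughput
    return bits / effective_bps / 3600

hours_1g = transfer_hours(3, 1)    # assumed 1 Gbps port
hours_10g = transfer_hours(3, 10)  # assumed 10 Gbps port
print(f"1 Gbps: {hours_1g:.1f} h, 10 Gbps: {hours_10g:.1f} h")
```

Under these assumptions the full 3 TB moves in well under a day, and the question only needs a subset of it, whereas shipping a physical device in each direction is typically measured in days.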