Exam AWS Certified Machine Learning - Specialty topic 1 question 187 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 187
Topic #: 1

A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs. The data will be stored in Amazon S3. The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset. A portion of the schema is as follows:

TransactionTimestamp (Timestamp)
CardName (Varchar)
CardNo (Varchar)

The data engineer must provide the data so that any row with a CardNo value of NULL is removed. Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column. Finally, the CardName column must be renamed to NameOnCard.

The data will be extracted on a monthly basis and will be loaded into an S3 bucket. The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation. The solution also must be automated and must minimize the load on the Amazon Redshift cluster.

Which solution meets these requirements?

A. Set up an Amazon EMR cluster. Create an Apache Spark job to read the data from the Amazon Redshift cluster and transform the data. Load the data into the S3 bucket. Schedule the job to run monthly.
B. Set up an Amazon EC2 instance with a SQL client tool, such as SQL Workbench/J, to query the data from the Amazon Redshift cluster directly. Export the resulting dataset into a file. Upload the file into the S3 bucket. Perform these tasks monthly.
C. Set up an AWS Glue job that has the Amazon Redshift cluster as the source and the S3 bucket as the destination. Use the built-in transforms Filter, Map, and RenameField to perform the required transformations. Schedule the job to run monthly.
D. Use Amazon Redshift Spectrum to run a query that writes the data directly to the S3 bucket. Create an AWS Lambda function to run the query monthly.

Suggested Answer: C

by ystotest at Nov. 28, 2022, 3:05 a.m.

Comments

ystotest

Highly Voted 1 year, 8 months ago

Selected Answer: C

agreed with C

upvoted 12 times

...

2eb8df0

Most Recent 5 months, 1 week ago

Selected Answer: C

It's always Glue

upvoted 1 time

...

akgarg00

8 months, 3 weeks ago

Selected Answer: C

The choice was between C and D, but we are supposed to minimize the load on the Redshift cluster, so the answer is C. A and B require too much setup effort, so they are ruled out by the constraints of the question.

upvoted 2 times

...

teka112233

11 months, 1 week ago

Selected Answer: C

Put simply, the requirements describe a full ETL process: the data is extracted from Redshift (E), transformed by renaming a column, removing rows with NULL values, and splitting the timestamp column (T), and finally loaded into S3 (L), all with the least overhead, which makes AWS Glue ideal for these requirements.

upvoted 1 time
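
The three transformations in the question (drop rows with a NULL CardNo, split TransactionTimestamp, rename CardName) can be sketched in plain Python. This is only an illustration of the logic that option C assigns to the Glue transforms Filter, Map, and RenameField; it does not use the awsglue library, and the function names and sample rows are hypothetical.

```python
from datetime import datetime


def drop_null_card_no(rows):
    """Filter step: keep only rows whose CardNo is not None (NULL)."""
    return [row for row in rows if row.get("CardNo") is not None]


def split_timestamp(row):
    """Map step: replace TransactionTimestamp with separate date and time columns."""
    out = dict(row)  # copy so the input row is left untouched
    ts = out.pop("TransactionTimestamp")
    out["TransactionDate"] = ts.date().isoformat()
    out["TransactionTime"] = ts.time().isoformat()
    return out


def rename_field(row, old_name, new_name):
    """Rename step: move the value stored under old_name to new_name."""
    out = dict(row)
    out[new_name] = out.pop(old_name)
    return out


def transform(rows):
    """Apply the filter, map, and rename steps in order."""
    kept = drop_null_card_no(rows)
    return [rename_field(split_timestamp(r), "CardName", "NameOnCard") for r in kept]


# Hypothetical sample data, not taken from the question.
sample_rows = [
    {"TransactionTimestamp": datetime(2022, 11, 28, 3, 5, 0),
     "CardName": "A. Example", "CardNo": "4111"},
    {"TransactionTimestamp": datetime(2022, 11, 29, 14, 30, 15),
     "CardName": "B. Example", "CardNo": None},
]

result = transform(sample_rows)
print(result)
```

In an actual Glue job the same three steps would run on Spark against a DynamicFrame rather than on a list of dictionaries, so the work happens on Glue workers instead of on the Redshift cluster.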

...

loict

11 months, 2 weeks ago

Selected Answer: C

A. NO - AWS Glue (serverless) is a simpler way than EMR to run Spark jobs.
B. NO - Spark is a better option for data pipelines because it avoids the need for intermediary files.
C. YES - Spark on AWS Glue is the best combination.
D. NO - Amazon Redshift Spectrum belongs to a "Lake House" architecture and is meant to run SQL against both the DW and S3; here, we want to query only the DW.

upvoted 1 time

...

Mickey321

12 months ago

Selected Answer: C

The reason is that this solution can leverage the existing capabilities of AWS Glue, which is a fully managed service that can help users create, run, and manage ETL (extract, transform, and load) workflows. According to the web search results, AWS Glue can connect to various data sources and destinations, such as Amazon Redshift and Amazon S3, and use Apache Spark as the underlying processing engine. AWS Glue can also provide various built-in transforms that can perform common data manipulation operations, such as filtering, mapping, renaming, or joining. Moreover, AWS Glue can support scheduling and automation of ETL jobs using triggers or workflows.

upvoted 1 time
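
The monthly scheduling mentioned above could look roughly like the following sketch, which builds the request for a scheduled Glue trigger and passes it to the boto3 `create_trigger` call. The job and trigger names are hypothetical, the cron expression (midnight UTC on the first day of each month) is an assumed schedule, and the Glue job itself is assumed to exist already.

```python
def monthly_trigger_request(job_name, trigger_name):
    """Build the arguments for a Glue trigger that starts job_name once a month."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # AWS cron fields: minutes hours day-of-month month day-of-week year.
        # Assumed schedule: 00:00 UTC on the 1st of every month.
        "Schedule": "cron(0 0 1 * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_monthly_trigger(job_name, trigger_name):
    """Create the trigger in AWS; needs boto3, credentials, and an existing Glue job."""
    import boto3  # imported here so the request builder works without boto3 installed

    glue = boto3.client("glue")
    return glue.create_trigger(**monthly_trigger_request(job_name, trigger_name))


# Hypothetical names, used only for illustration.
request = monthly_trigger_request(
    "redshift-to-s3-monthly", "redshift-to-s3-monthly-trigger"
)
print(request)
```

Keeping the request builder separate from the AWS call makes the schedule easy to inspect before anything is created in the account.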

...

Mickey321

12 months ago

Selected Answer: C

agree with C

upvoted 1 time

...

kaike_reis

1 year ago

Selected Answer: C

C. Reason: we want to minimize the infrastructure effort, so we should prioritize serverless solutions; we also want something automated that minimizes the load on the Redshift cluster. Option A is wrong because an EMR cluster has to be set up and managed, and the same goes for the EC2 instance in option B, which is also a manual process. Option D brings in Redshift Spectrum, but that service is for querying data that already sits in S3 from Redshift using SQL, and here the source data is in Redshift, not S3, so this option is discarded. Option C is correct.

upvoted 1 time

...

Jerry84

1 year, 7 months ago

Selected Answer: C

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-transforms.html

upvoted 2 times

...