Exam AWS Certified Machine Learning - Specialty All Questions

Exam AWS Certified Machine Learning - Specialty topic 1 question 71 discussion

A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
✑ Combine multiple data sources.
✑ Reuse existing PySpark logic.
✑ Run the solution on the existing schedule.
✑ Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

  • A. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
  • C. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • D. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
Suggested Answer: B
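Option B lets the existing PySpark code move over largely unchanged, because a Glue ETL job script exposes a plain SparkSession. Below is a minimal sketch of such a job; the S3 paths, the join key `record_id`, and the job name are hypothetical, and the script only runs inside the AWS Glue job runtime, not as a standalone program:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name passed in by Glue
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session  # a regular SparkSession: existing PySpark logic runs here
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical raw inputs written to S3 by the upstream process
df_a = spark.read.parquet("s3://my-bucket/raw/source_a/")
df_b = spark.read.parquet("s3://my-bucket/raw/source_b/")

# The existing combine-and-format PySpark logic slots in here; a join
# on a shared key stands in for it
combined = df_a.join(df_b, "record_id", "inner")

# Write the consolidated output to the "processed" location for downstream use
combined.write.mode("overwrite").parquet("s3://my-bucket/processed/")

job.commit()
```

Because the script is ordinary PySpark apart from the `GlueContext`/`Job` wrapper, the migration cost is mostly the boilerplate above plus pointing reads and writes at S3.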

Comments

Paul_NoName
Highly Voted 3 years, 8 months ago
B it is.
upvoted 29 times
[Removed]
3 years, 8 months ago
I agree, B is serverless and reuses PySpark. A similar example is shown here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html
upvoted 11 times
...
...
SophieSu
Highly Voted 3 years, 8 months ago
A is not correct because the requirement is to minimize the number of servers to be managed, and EMR is not serverless. B is correct: AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load jobs. C is not correct because with Lambda for ETL you will not be able to reuse the existing PySpark logic. D is not correct because Kinesis Data Analytics is for streaming, and you cannot reuse the existing PySpark logic there either.
upvoted 12 times
...
xicocaio
Most Recent 8 months, 3 weeks ago
Selected Answer: B
Option B (using AWS Glue for the ETL process) is the best solution for the described requirements. A: This solution requires managing an Amazon EMR cluster, which would involve more server management than AWS Glue, violating the requirement to minimize the number of servers to be managed. C: AWS Lambda is not ideal for this use case because it has resource limitations, including memory and execution time limits (15 minutes max), which might not be suitable for large-scale ETL operations involving PySpark logic. D: Amazon Kinesis Data Analytics is focused on real-time stream processing, which doesn't fit the described scheduled batch processing scenario.
upvoted 1 times
...
akgarg00
1 year, 7 months ago
Answer is A, as B clearly mentions that the PySpark code is written by leveraging the already existing code. Also, the architecture currently used is on-premises, which will have more servers than solution A.
upvoted 2 times
...
sonoluminescence
1 year, 7 months ago
Selected Answer: B
Amazon Kinesis Data Analytics is more suited for real-time processing and streaming data. The given use case does not indicate a need for real-time processing, so this might not be the best fit. Furthermore, it doesn't support PySpark natively.
upvoted 1 times
...
Shenannigan
1 year, 9 months ago
Selected Answer: B
Voted B based on the serverless (minimum servers) requirement and https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming.html
upvoted 1 times
...
Mickey321
1 year, 9 months ago
Selected Answer: B
Indeed B using Glue
upvoted 1 times
...
kaike_reis
1 year, 10 months ago
B is correct. With A you have to manage the EMR cluster, so it's wrong. With D you don't use Spark, so it's wrong. With C you won't be using Spark either, so it's wrong.
upvoted 1 times
...
Maaayaaa
2 years, 2 months ago
Selected Answer: B
B ticks all boxes. Minimize servers -> AWS managed services -> Glue.
upvoted 2 times
...
bakarys
2 years, 3 months ago
Selected Answer: A
Option A would be the best response for this scenario. This solution allows the Data Scientist to reuse the existing PySpark logic while migrating the ETL process to the cloud. The raw data is written to Amazon S3, and a Lambda function is scheduled to trigger a Spark step on a persistent EMR cluster based on the existing schedule. The PySpark logic is used to run the ETL job on the EMR cluster, and the results are output to a processed location in Amazon S3 that is accessible for downstream use. This solution minimizes the number of servers that need to be managed, and it allows for a seamless migration of the existing ETL process to the cloud.
upvoted 1 times
...
sqavi
2 years, 4 months ago
Selected Answer: B
Option D is wrong; it should be B.
upvoted 1 times
...
Peeking
2 years, 6 months ago
D cannot be the answer, as there is no streaming data or real-time processing.
upvoted 2 times
...
salads
2 years, 10 months ago
Selected Answer: B
The answer is B.
upvoted 2 times
...
Nickname_L
3 years, 7 months ago
Answer should be B. Serverless, on a regular schedule (no real time requirement), reuses PySpark code in Glue ETL script.
upvoted 4 times
...
gcpwhiz
3 years, 8 months ago
Answer is B as they specifically ask about reusing existing PySpark, which can be done with Glue
upvoted 3 times
...
Aashi22
3 years, 8 months ago
https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html
upvoted 1 times
...
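The workflows documentation linked above covers scheduled triggers; the "run on the existing schedule" requirement maps to a Glue trigger with a cron expression. A minimal boto3 sketch follows; the trigger name, job name, and schedule are hypothetical, and the call needs AWS credentials and an existing Glue job, so it is a deployment fragment rather than a runnable example:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the cron expression mirrors the existing nightly schedule
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC every day
    Actions=[{"JobName": "consolidate-etl-job"}],
    StartOnCreation=True,
)
```

The same trigger can be defined in the Glue console or CloudFormation; the point is that the schedule lives in Glue itself, so no Lambda or cron server is needed.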
astonm13
3 years, 8 months ago
It is B! The requirement is to "minimize the number of servers to be managed", and B is a serverless solution that fulfils the other requirements too!
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other