Exam AWS Certified Machine Learning - Specialty All Questions

Exam AWS Certified Machine Learning - Specialty topic 1 question 71 discussion

A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
✑ Combine multiple data sources.
✑ Reuse existing PySpark logic.
✑ Run the solution on the existing schedule.
✑ Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

  • A. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
  • C. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • D. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
Suggested Answer: B
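Option B lets the existing PySpark code move over largely unchanged, because a Glue ETL job script exposes a plain SparkSession. Below is a minimal sketch of such a job; the S3 paths, the join key `record_id`, and the job name are hypothetical, and the script only runs inside the AWS Glue job runtime, not as a standalone program:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name passed in by Glue
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session  # a regular SparkSession: existing PySpark logic runs here
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical raw inputs written to S3 by the upstream process
df_a = spark.read.parquet("s3://my-bucket/raw/source_a/")
df_b = spark.read.parquet("s3://my-bucket/raw/source_b/")

# The existing combine-and-format PySpark logic slots in here; a join
# on a shared key stands in for it
combined = df_a.join(df_b, "record_id", "inner")

# Write the consolidated output to the "processed" location for downstream use
combined.write.mode("overwrite").parquet("s3://my-bucket/processed/")

job.commit()
```

Because the script is ordinary PySpark apart from the `GlueContext`/`Job` wrapper, the migration cost is mostly the boilerplate above plus pointing reads and writes at S3.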

Comments

Paul_NoName
Highly Voted 3 years, 8 months ago
B it is.
upvoted 29 times
[Removed]
3 years, 8 months ago
I agree, B is serverless and reuses PySpark. A similar example is shown here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html
upvoted 11 times
...
...
SophieSu
Highly Voted 3 years, 8 months ago
A is not correct because the requirement is to minimize the number of servers to be managed, and EMR is not serverless. B is correct: AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load jobs. C is not correct because with Lambda for ETL you will not be able to reuse the existing PySpark logic. D is not correct because Kinesis Data Analytics is for streaming, and you cannot reuse the existing PySpark logic there either.
upvoted 12 times
...
xicocaio
Most Recent 8 months, 3 weeks ago
Selected Answer: B
Option B (using AWS Glue for the ETL process) is the best solution for the described requirements. A: This solution requires managing an Amazon EMR cluster, which would involve more server management than AWS Glue, violating the requirement to minimize the number of servers to be managed. C: AWS Lambda is not ideal for this use case because it has resource limitations, including memory and execution time limits (15 minutes max), which might not be suitable for large-scale ETL operations involving PySpark logic. D: Amazon Kinesis Data Analytics is focused on real-time stream processing, which doesn't fit the described scheduled batch processing scenario.
upvoted 1 times
...
akgarg00
1 year, 7 months ago
Answer is A, as B clearly mentions that the PySpark code is written by leveraging the already existing code. Also, the architecture currently used is on-premises, which will have more servers than solution A.
upvoted 2 times
...
sonoluminescence
1 year, 7 months ago
Selected Answer: B
Amazon Kinesis Data Analytics is more suited for real-time processing and streaming data. The given use case does not indicate a need for real-time processing, so this might not be the best fit. Furthermore, it doesn't support PySpark natively.
upvoted 1 times
...
Shenannigan
1 year, 9 months ago
Selected Answer: B
Voted B based on the serverless (minimum servers) requirement and https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming.html
upvoted 1 times
...
Mickey321
1 year, 9 months ago
Selected Answer: B
Indeed B using Glue
upvoted 1 times
...
kaike_reis
1 year, 10 months ago
B is correct. With A you have to manage the EMR cluster, so it's wrong. With D you don't use Spark, so it's wrong. With C you won't be using Spark either, so it's wrong.
upvoted 1 times
...
Maaayaaa
2 years, 2 months ago
Selected Answer: B
B ticks all boxes. Minimize servers -> AWS managed services -> Glue.
upvoted 2 times
...
bakarys
2 years, 3 months ago
Selected Answer: A
Option A would be the best response for this scenario. This solution allows the Data Scientist to reuse the existing PySpark logic while migrating the ETL process to the cloud. The raw data is written to Amazon S3, and a Lambda function is scheduled to trigger a Spark step on a persistent EMR cluster based on the existing schedule. The PySpark logic is used to run the ETL job on the EMR cluster, and the results are output to a processed location in Amazon S3 that is accessible for downstream use. This solution minimizes the number of servers that need to be managed, and it allows for a seamless migration of the existing ETL process to the cloud.
upvoted 1 times
...
sqavi
2 years, 4 months ago
Selected Answer: B
Option D is wrong; it should be B.
upvoted 1 times
...
Peeking
2 years, 6 months ago
D cannot be the answer, as there is no streaming data or real-time processing.
upvoted 2 times
...
salads
2 years, 10 months ago
Selected Answer: B
The answer is B.
upvoted 2 times
...
Nickname_L
3 years, 7 months ago
Answer should be B. Serverless, on a regular schedule (no real time requirement), reuses PySpark code in Glue ETL script.
upvoted 4 times
...
gcpwhiz
3 years, 8 months ago
Answer is B as they specifically ask about reusing existing PySpark, which can be done with Glue
upvoted 3 times
...
Aashi22
3 years, 8 months ago
https://docs.aws.amazon.com/glue/latest/dg/creating_running_workflows.html
upvoted 1 times
...
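The workflows documentation linked above covers scheduled triggers; the "run on the existing schedule" requirement maps to a Glue trigger with a cron expression. A minimal boto3 sketch follows; the trigger name, job name, and schedule are hypothetical, and the call needs AWS credentials and an existing Glue job, so it is a deployment fragment rather than a runnable example:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the cron expression mirrors the existing nightly schedule
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",  # 02:00 UTC every day
    Actions=[{"JobName": "consolidate-etl-job"}],
    StartOnCreation=True,
)
```

The same trigger can be defined in the Glue console or CloudFormation; the point is that the schedule lives in Glue itself, so no Lambda or cron server is needed.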
astonm13
3 years, 8 months ago
It is B! The requirement is to "minimize the number of servers to be managed", and B is a serverless solution that fulfils the other requirements too!
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other