Exam AWS Certified Machine Learning - Specialty topic 1 question 169 discussion

A company is building a new version of a recommendation engine. Machine learning (ML) specialists need to keep adding new data from users to improve personalized recommendations. The ML specialists gather data from the users' interactions on the platform and from sources such as external websites and social media.
The pipeline cleans, transforms, enriches, and compresses terabytes of data daily, and this data is stored in Amazon S3. A set of Python scripts was coded to do the job and is stored in a large Amazon EC2 instance. The whole process takes more than 20 hours to finish, with each script taking at least an hour. The company wants to move the scripts out of Amazon EC2 into a more managed solution that will eliminate the need to maintain servers.
Which approach will address all of these requirements with the LEAST development effort?

  • A. Load the data into an Amazon Redshift cluster. Execute the pipeline by using SQL. Store the results in Amazon S3.
  • B. Load the data into Amazon DynamoDB. Convert the scripts to an AWS Lambda function. Execute the pipeline by triggering Lambda executions. Store the results in Amazon S3.
  • C. Create an AWS Glue job. Convert the scripts to PySpark. Execute the pipeline. Store the results in Amazon S3.
  • D. Create a set of individual AWS Lambda functions to execute each of the scripts. Build a step function by using the AWS Step Functions Data Science SDK. Store the results in Amazon S3.
Suggested Answer: C
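For context, here is a minimal sketch of what option C could look like as an AWS Glue PySpark job. The bucket paths, column names, and transformations are hypothetical placeholders; an actual port would carry over the logic of the existing Python scripts.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate: Glue passes --JOB_NAME at runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical input: raw user-interaction events landed in S3.
raw = spark.read.json("s3://example-bucket/raw/interactions/")

# Clean and enrich: drop incomplete records, derive a partition column.
cleaned = (
    raw.dropna(subset=["user_id", "item_id"])
       .withColumn("event_date", F.to_date("event_timestamp"))
)

# Write compressed, partitioned output back to S3.
(cleaned.write
        .mode("overwrite")
        .partitionBy("event_date")
        .option("compression", "snappy")
        .parquet("s3://example-bucket/processed/interactions/"))

job.commit()
```

Because PySpark is still Python, much of the existing cleaning and enrichment logic can usually be carried over with modest changes, which is why this counts as the least development effort among the options.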

Comments

spaceexplorer
Highly Voted 2 years, 6 months ago
Selected Answer: C
C. Lambda execution time has a hard limit of 15 minutes, which might not be enough for this data processing.
upvoted 15 times
ckkobe24
2 years, 5 months ago
But C requires some coding effort.
upvoted 1 times
daidaidai
1 year, 5 months ago
I think C is correct, because PySpark is also a kind of Python, and it only requires a little code change.
upvoted 1 times
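To make the 15-minute limit above concrete: Lambda's timeout is configurable only up to 900 seconds, so a script that runs for an hour can never finish in a single invocation. A quick illustration with boto3 (the function name is hypothetical):

```python
import boto3

lambda_client = boto3.client("lambda")

# 900 seconds (15 minutes) is the hard ceiling for a Lambda timeout;
# the API rejects anything larger, so an hour-long script cannot
# complete inside one invocation.
lambda_client.update_function_configuration(
    FunctionName="etl-step",  # hypothetical function name
    Timeout=900,              # the maximum value Lambda accepts
)
```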
Stokvisss
Most Recent 8 months, 1 week ago
Selected Answer: C
D is wrong as AWS Lambda has a maximum execution time of 15 minutes, which may not be sufficient for some of the scripts. C is right as it's serverless and not a lot of work.
upvoted 1 times
endeesa
11 months, 1 week ago
Selected Answer: C
Redshift is definitely going to require some effort to set up, and Lambda just won't cut it performance-wise if a large EC2 instance can't keep up. Guess what's left?
upvoted 1 times
giustino98
12 months ago
Selected Answer: C
C seems the most correct, but it misses the part about importing the data into AWS.
upvoted 1 times
teka112233
1 year, 1 month ago
Selected Answer: C
Option C fits all the requirements: it provides the least development effort by using AWS Glue, and converting the Python to PySpark provides the best performance. Option D is not suitable because a Lambda function is limited to 15 minutes of running time, while each script needs at least 1 hour.
upvoted 1 times
kaike_reis
1 year, 2 months ago
Selected Answer: C
We want to eliminate server management and reduce development effort. That said, Letter A is wrong, as it brings effort to refactor the code into SQL. Letter B is wrong, as loading the data into DynamoDB adds work and the same Lambda limit applies. Letter D is wrong because, despite the services being serverless (Lambda and Step Functions), the maximum timeout of a Lambda function is 15 minutes, which is less than what is needed (at least 1 hour per script). Letter C is correct: even though there is some effort to convert pure Python code to PySpark, this is the solution that fits the requirements.
upvoted 1 times
Mickey321
1 year, 3 months ago
Selected Answer: C
Option C is the best option because the existing Python scripts can be ported to PySpark rather than rewritten in a different language or framework. AWS Glue is a managed service that makes it easy to prepare data for analysis, and PySpark is the Python API for Spark, which processes data at this scale. This approach addresses all of the requirements with the least development effort and can handle large-scale data processing.
upvoted 1 times
Mickey321
1 year, 3 months ago
Selected Answer: D
Overall, option C with AWS Glue and PySpark is the most efficient approach, as it requires the least amount of development effort while effectively addressing all the requirements, including moving away from EC2 maintenance and handling large-scale data processing.
upvoted 1 times
Mickey321
1 year, 2 months ago
Corrected to option C.
upvoted 1 times
AjoseO
1 year, 8 months ago
Selected Answer: C
The data pipeline involves cleaning, transforming, enriching, and compressing terabytes of data and storing the data in Amazon S3. AWS Glue is an ETL service that makes it easy to move data between data stores. The Glue job allows you to use PySpark scripts to perform ETL tasks. With AWS Glue, you do not need to provision and manage servers, which eliminates the need to maintain servers, as required by the company. Therefore, AWS Glue would address all of the company's requirements with the least development effort.
upvoted 3 times
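As a rough sketch of how little setup this takes, a Glue job pointing at a PySpark script in S3 can be registered with a single boto3 call; the job name, role ARN, script location, and worker sizing below are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Register a serverless Spark ETL job; Glue provisions and tears down
# the workers, so there are no servers to maintain.
glue.create_job(
    Name="recommendation-etl",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueETLRole",  # hypothetical role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/pipeline.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,
)
```

The job can then be launched on a schedule or trigger with start_job_run, keeping the whole pipeline serverless.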
maxkm
1 year, 9 months ago
Selected Answer: D
1) Eliminate the need to maintain servers: Lambda is serverless. 2) The least development effort: the Python scripts do not need to be rewritten for a Lambda function.
upvoted 1 times
GiyeonShin
1 year, 8 months ago
"with each script taking at least an hour" - lambda would be time-out during taking job.
upvoted 5 times
milan_ml
2 years, 3 months ago
Selected Answer: C
C, as for Redshift I would need to build a new pipeline in SQL.
upvoted 1 times
ovokpus
2 years, 4 months ago
Selected Answer: C
Converting Python scripts to PySpark is less coding effort than rewriting everything in SQL, which is somewhat limited in the types of transformations it can do. The Lambda options are a dead end for the reason already given (the 15-minute timeout).
upvoted 2 times