
Exam AWS Certified Data Analytics - Specialty topic 1 question 63 discussion

A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages consist of a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs.
The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark.
Which solution improves the efficiency of the data processing jobs and is well architected?

  • A. Send the sensor and devices data directly to a Kinesis Data Firehose delivery stream to send the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running PySpark to process the data in Amazon S3.
  • B. Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data stream messages from the connected devices and sensors using Lambda.
  • C. Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and move the data processing jobs from Amazon EMR to Amazon Redshift.
  • D. Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.
Suggested Answer: D
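For concreteness, here is a minimal sketch of the kind of Glue PySpark job that option D describes: it reads the many small JSON files from S3, merges them into a few large partitions, and writes them back as Parquet. The bucket paths and partition count are hypothetical placeholders, not values from the question.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Minimal sketch of option D: a Glue PySpark job that merges many small
# JSON files into a few large Parquet files. All S3 paths are hypothetical.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw small files delivered by the Kinesis consumer application.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-sensor-bucket/raw/"]},
    format="json",
)

# coalesce() reduces the number of output partitions, so the job writes a
# handful of large Parquet files instead of thousands of small ones.
raw.toDF().coalesce(8).write.mode("append").parquet(
    "s3://example-sensor-bucket/compacted/")

job.commit()

Fewer, larger Parquet files cut both S3 request overhead and Spark task-scheduling overhead in the downstream jobs.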

Comments

Phoenyx89
Highly Voted 3 years, 7 months ago
D seems like the good choice because it is the only answer dealing with small files, but the doubt is... Glue is only batch! However, there is an article from April 2020 saying that Glue now also supports streaming processing. If we take that article into account, D is right. But can we? https://aws.amazon.com/it/about-aws/whats-new/2020/04/aws-glue-now-supports-serverless-streaming-etl/
upvoted 28 times
metin
3 years, 6 months ago
In the question, it is not mentioned that they definitely need a real-time solution. And Glue can be used for batch processing of stream data.
upvoted 2 times
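To make the streaming point concrete: with the feature announced in that article, a Glue job can consume a Kinesis-backed Data Catalog table directly and compact each micro-batch. A rough sketch, where the "sensors" database, "device_stream" table, and all S3 paths are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Rough sketch of Glue streaming ETL (the April 2020 feature linked above).
# The catalog table "device_stream" is assumed to be backed by the Kinesis
# stream; all names and paths are hypothetical placeholders.
glue_context = GlueContext(SparkContext.getOrCreate())

stream_df = glue_context.create_data_frame.from_catalog(
    database="sensors",
    table_name="device_stream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(batch_df, batch_id):
    # Compact each micro-batch and append it as Parquet.
    if batch_df.count() > 0:
        batch_df.coalesce(1).write.mode("append").parquet(
            "s3://example-sensor-bucket/streaming-compacted/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://example-sensor-bucket/checkpoints/",
    },
)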
Dr_Kiko
3 years, 5 months ago
They already use Firehose, which does data batching, so no problem with D.
upvoted 1 times
Ipc01
3 years, 3 months ago
Since the question is looking for an answer that reflects operational efficiency, setting up individual Glue jobs is definitely more time-consuming, so the answer is A.
upvoted 2 times
Ipc01
3 years, 3 months ago
Plus, Amazon Kinesis Data Firehose has built-in data transformation, so you don't need to add operational overhead by using AWS Lambda.
upvoted 2 times
juanife
1 year, 10 months ago
No, the question asks for a solution that replaces the PySpark ETL running on the EMR cluster, so AWS Glue ETL jobs would be a good choice here. Undoubtedly, D is the correct option; A is not as bad as it seems, but it is probably not as cheap as D.
upvoted 2 times
singh100
Highly Voted 3 years, 7 months ago
Ans: A. https://aws.amazon.com/blogs/big-data/optimizing-downstream-data-processing-with-amazon-kinesis-data-firehose-and-amazon-emr-running-apache-spark/
upvoted 23 times
GauravM17
3 years, 7 months ago
Should this not be D? Where are we handling the small files in A?
upvoted 1 times
GeeBeeEl
3 years, 6 months ago
You are changing them to Parquet on the fly with Firehose.
upvoted 1 times
Ashish1101
2 years, 5 months ago
Firehose will batch records during the buffer interval and reduce the number of files.
upvoted 2 times
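To see how the "Parquet on the fly" conversion in option A is actually wired up: record format conversion is a property of the delivery stream itself, configured alongside the buffering hints mentioned above. A hedged boto3 sketch, where every name and ARN is a hypothetical placeholder (conversion also requires a table schema registered in the Glue Data Catalog):

import boto3

firehose = boto3.client("firehose")

# Sketch of option A's ingestion side: a Firehose delivery stream fed by the
# existing Kinesis data stream that buffers records and converts them to
# Parquet before writing to S3. All names and ARNs are hypothetical.
firehose.create_delivery_stream(
    DeliveryStreamName="sensor-parquet-stream",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/sensors",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::example-sensor-bucket",
        # Larger buffers mean fewer, bigger objects landing in S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Format conversion reads the record schema from a Glue table.
            "SchemaConfiguration": {
                "DatabaseName": "sensors",
                "TableName": "device_messages",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
)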
kzu19878
3 years, 7 months ago
Remember: "They want to continue to use PySpark." D is migrating the PySpark jobs into Glue.
upvoted 8 times
AjNapa
3 years, 6 months ago
This part of the question is what many people here have missed. You're right. It's A.
upvoted 2 times
Haimett
2 years, 6 months ago
There is no problem in using PySpark with Glue.
upvoted 2 times
NarenKA
Most Recent 1 year, 2 months ago
Selected Answer: A
Converting data into Apache Parquet format before storing it in S3 optimizes the data for analytical processing. Kinesis Data Firehose is able to automatically batch, compress, and convert incoming streaming data into Parquet format. This reduces the overhead of processing a large number of small files without the need for additional processing or intermediate steps, and it allows the team to continue using PySpark on Amazon EMR for data processing. B - using AWS Lambda to process individual messages would introduce operational overhead and would not efficiently handle the conversion of a large number of small files. C - moving data processing to Redshift would require changes to the existing PySpark-based processing pipeline and is not the most cost-effective solution. D - merging small files into larger ones using Glue addresses the efficiency concern, but migrating the PySpark jobs from EMR to AWS Glue could involve refactoring the existing jobs.
upvoted 2 times
GCPereira
1 year, 3 months ago
EMR is very expensive, and Spark doesn't work well with a large number of small files... the best option is to merge the small files into large files and use a Glue job to decrease the cost of downstream processing... D is a perfect answer.
upvoted 1 times
GCPereira
1 year, 4 months ago
Take a look at this sentence: "...the solution needs to be well-architected". That is: cost-efficient, secure, highly available, and operationally efficient. Amazon EMR is not highly available and needs a lot of operational resources, so rule EMR out... to continue with PySpark jobs, AWS Glue is the best option.
upvoted 1 times
MLCL
1 year, 9 months ago
Selected Answer: D
D solves the issue with small files and replaces the EMR batch job with a Glue one, which is cheaper. If a transient EMR cluster were part of option A, it would be acceptable.
upvoted 2 times
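For reference, a transient cluster of the kind mentioned here is simply one launched with auto-termination, so it lives only as long as its steps. A hedged boto3 sketch, with hypothetical names, instance types, and script path:

import boto3

emr = boto3.client("emr")

# Sketch of a transient EMR cluster: it runs one PySpark step and shuts
# itself down when the step finishes. All names and paths are hypothetical.
emr.run_job_flow(
    Name="sensor-batch-processing",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # This flag is what makes the cluster transient.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "pyspark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-sensor-bucket/jobs/process.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)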
Debi_mishra
1 year, 11 months ago
Both A and D are technically correct, and it is very difficult to figure out the most cost-effective solution without more context. Other answers assume EMR is long-running and therefore expensive, but that's not mentioned here. D has the upper hand considering everything will be serverless.
upvoted 1 times
pk349
2 years ago
D: I passed the test
upvoted 3 times
akashm99101001com
2 years, 1 month ago
Selected Answer: D
"cost of downstream data processing" so migrate it
upvoted 1 times
Chelseajcole
2 years, 3 months ago
Selected Answer: D
This question is testing whether you know that Glue can run PySpark jobs.
upvoted 4 times
DeerSong
2 years, 3 months ago
Selected Answer: D
D for sure.
upvoted 1 times
Kinlive1991
2 years, 4 months ago
Selected Answer: D
D is correct
upvoted 1 times
siju13
2 years, 4 months ago
Selected Answer: D
Glue is cheaper than EMR.
upvoted 1 times
henom
2 years, 5 months ago
The correct answer is D: the option that says to replace Amazon EMR with AWS Glue, program an AWS Glue ETL script in Python to merge the small sensor data files into larger files, and convert them to Apache Parquet format. The option that says to deploy a Kinesis Data Firehose delivery stream to collect sensor data, convert it to Apache Parquet format, deliver the transformed data into an Amazon S3 bucket, and process the data from the bucket using a PySpark job running on an Amazon EMR cluster is incorrect. Although this option is valid, there is no significant cost reduction since Amazon EMR is still running; AWS Glue can provide the same function at a lower cost. In addition, it is better to merge the smaller files into large files than to just compress them using the Apache Parquet format, in order to improve ingestion performance.
upvoted 1 times
nadavw
2 years, 5 months ago
Selected Answer: D
The question is about the data processing, not the full pipeline including ingestion, so D is the most efficient from a processing perspective.
upvoted 1 times
thuyeinaung
2 years, 5 months ago
Selected Answer: D
Glue can run PySpark
upvoted 1 times
alinato
2 years, 5 months ago
Selected Answer: B
Lambda can run PySpark and is cost-effective and serverless, meaning well-architected.
upvoted 1 times
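One caveat on this comment: the Lambda Python runtime does not include Spark, so a Lambda consumer processes raw Kinesis records in plain Python, roughly as in the hypothetical handler below. Each invocation sees only a small batch of records, which is why per-message Lambda processing does not by itself solve the small-files problem.

import base64
import json

# Sketch of option B: a plain-Python Lambda handler attached to the Kinesis
# stream as an event source. There is no Spark here; field names are
# hypothetical.
def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # ... per-message transformation would go here ...
        print(message.get("device_id"))
    return {"processed": len(event["Records"])}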
Community vote distribution: A (35%), C (25%), B (20%), Other