
Exam AWS Certified Data Analytics - Specialty topic 1 question 63 discussion

A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages consist of a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs.
The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark.
Which solution improves the efficiency of the data processing jobs and is well architected?

  • A. Send the sensor and devices data directly to a Kinesis Data Firehose delivery stream to send the data to Amazon S3 with Apache Parquet record format conversion enabled. Use Amazon EMR running PySpark to process the data in Amazon S3.
  • B. Set up an AWS Lambda function with a Python runtime environment. Process individual Kinesis data stream messages from the connected devices and sensors using Lambda.
  • C. Launch an Amazon Redshift cluster. Copy the collected data from Amazon S3 to Amazon Redshift and move the data processing jobs from Amazon EMR to Amazon Redshift.
  • D. Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.
Suggested Answer: D
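For concreteness, here is a minimal sketch of the kind of Glue PySpark job that option D describes: it reads the many small JSON files from S3, merges them into a few large partitions, and writes them back as Parquet. The bucket paths and partition count are hypothetical placeholders, not values from the question.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Minimal sketch of option D: a Glue PySpark job that merges many small
# JSON files into a few large Parquet files. All S3 paths are hypothetical.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw small files delivered by the Kinesis consumer application.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-sensor-bucket/raw/"]},
    format="json",
)

# coalesce() reduces the number of output partitions, so the job writes a
# handful of large Parquet files instead of thousands of small ones.
raw.toDF().coalesce(8).write.mode("append").parquet(
    "s3://example-sensor-bucket/compacted/")

job.commit()

Fewer, larger Parquet files cut both S3 request overhead and Spark task-scheduling overhead in the downstream jobs.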

Comments

Phoenyx89
Highly Voted 3 years, 7 months ago
D seems like the good choice because it is the only answer dealing with small files, but the doubt is... Glue is only batch! However, there is an article from April 2020 saying that Glue now also supports streaming processing. If we take that article into account, D is right. But can we? https://aws.amazon.com/it/about-aws/whats-new/2020/04/aws-glue-now-supports-serverless-streaming-etl/
upvoted 28 times
metin
3 years, 6 months ago
In the question, it is not mentioned that they definitely need a real-time solution. And Glue can be used for batch processing of stream data.
upvoted 2 times
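To make the streaming point concrete: with the feature announced in that article, a Glue job can consume a Kinesis-backed Data Catalog table directly and compact each micro-batch. A rough sketch, where the "sensors" database, "device_stream" table, and all S3 paths are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Rough sketch of Glue streaming ETL (the April 2020 feature linked above).
# The catalog table "device_stream" is assumed to be backed by the Kinesis
# stream; all names and paths are hypothetical placeholders.
glue_context = GlueContext(SparkContext.getOrCreate())

stream_df = glue_context.create_data_frame.from_catalog(
    database="sensors",
    table_name="device_stream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(batch_df, batch_id):
    # Compact each micro-batch and append it as Parquet.
    if batch_df.count() > 0:
        batch_df.coalesce(1).write.mode("append").parquet(
            "s3://example-sensor-bucket/streaming-compacted/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://example-sensor-bucket/checkpoints/",
    },
)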
Dr_Kiko
3 years, 5 months ago
They already use Firehose, which does data batching, so no problem with D.
upvoted 1 times
Ipc01
3 years, 3 months ago
Since the question is looking for an answer that reflects operational efficiency, setting up individual Glue jobs is definitely more time-consuming, so the answer is A.
upvoted 2 times
Ipc01
3 years, 3 months ago
Plus, Amazon Kinesis Data Firehose has built-in data transformation, so you don't need to add operational overhead by using AWS Lambda.
upvoted 2 times
juanife
1 year, 10 months ago
No, the question asks for a solution that replaces the PySpark ETL running on the EMR cluster, so AWS Glue ETL jobs would be a good choice here. Undoubtedly, D is the correct option; A is not as bad as it seems, but it is probably not as cheap as D.
upvoted 2 times
singh100
Highly Voted 3 years, 7 months ago
Ans: A. https://aws.amazon.com/blogs/big-data/optimizing-downstream-data-processing-with-amazon-kinesis-data-firehose-and-amazon-emr-running-apache-spark/
upvoted 23 times
GauravM17
3 years, 7 months ago
Should this not be D? Where are we handling the small files in A?
upvoted 1 times
GeeBeeEl
3 years, 6 months ago
You are changing them to Parquet on the fly with Firehose.
upvoted 1 times
Ashish1101
2 years, 5 months ago
Firehose will batch records during the buffer interval and reduce the number of files.
upvoted 2 times
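To see how the "Parquet on the fly" conversion in option A is actually wired up: record format conversion is a property of the delivery stream itself, configured alongside the buffering hints mentioned above. A hedged boto3 sketch, where every name and ARN is a hypothetical placeholder (conversion also requires a table schema registered in the Glue Data Catalog):

import boto3

firehose = boto3.client("firehose")

# Sketch of option A's ingestion side: a Firehose delivery stream fed by the
# existing Kinesis data stream that buffers records and converts them to
# Parquet before writing to S3. All names and ARNs are hypothetical.
firehose.create_delivery_stream(
    DeliveryStreamName="sensor-parquet-stream",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/sensors",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::example-sensor-bucket",
        # Larger buffers mean fewer, bigger objects landing in S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Format conversion reads the record schema from a Glue table.
            "SchemaConfiguration": {
                "DatabaseName": "sensors",
                "TableName": "device_messages",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
)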
kzu19878
3 years, 7 months ago
Remember: "They want to continue to use PySpark." D is migrating the PySpark jobs into Glue.
upvoted 8 times
AjNapa
3 years, 6 months ago
This part of the question is what many people here have missed. You're right. It's A.
upvoted 2 times
Haimett
2 years, 6 months ago
There is no problem in using PySpark with Glue.
upvoted 2 times
NarenKA
Most Recent 1 year, 2 months ago
Selected Answer: A
Converting data into Apache Parquet format before storing it in S3 optimizes the data for analytical processing. Kinesis Data Firehose is able to automatically batch, compress, and convert incoming streaming data into Parquet format. This reduces the overhead of processing a large number of small files without the need for additional processing or intermediate steps, and it allows the team to continue using PySpark on Amazon EMR for data processing. B - using AWS Lambda to process individual messages would introduce operational overhead and would not efficiently handle the conversion of a large number of small files. C - moving data processing to Redshift would require changes to the existing PySpark-based processing pipeline and is not the most cost-effective solution. D - merging small files into larger ones using Glue addresses the efficiency concern, but migrating the PySpark jobs from EMR to AWS Glue could involve refactoring the existing jobs.
upvoted 2 times
GCPereira
1 year, 3 months ago
EMR is very expensive, and Spark doesn't work well with a large number of small files... the best option is to merge the small files into large files and use a Glue job to decrease the cost of downstream processing... D is a perfect answer.
upvoted 1 times
GCPereira
1 year, 4 months ago
Take a look at this sentence: "...the solution needs to be well-architected". That is: cost-efficient, secure, highly available, and operationally efficient. Amazon EMR is not highly available and needs a lot of operational resources, so rule EMR out... to continue with PySpark jobs, AWS Glue is the best option.
upvoted 1 times
MLCL
1 year, 9 months ago
Selected Answer: D
D solves the issue with small files and replaces the EMR batch job with a Glue one, which is cheaper. If a transient EMR cluster were part of option A, it would be acceptable.
upvoted 2 times
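For reference, a transient cluster of the kind mentioned here is simply one launched with auto-termination, so it lives only as long as its steps. A hedged boto3 sketch, with hypothetical names, instance types, and script path:

import boto3

emr = boto3.client("emr")

# Sketch of a transient EMR cluster: it runs one PySpark step and shuts
# itself down when the step finishes. All names and paths are hypothetical.
emr.run_job_flow(
    Name="sensor-batch-processing",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # This flag is what makes the cluster transient.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "pyspark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-sensor-bucket/jobs/process.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)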
Debi_mishra
1 year, 11 months ago
Both A and D are technically correct, and it is very difficult to figure out the most cost-effective solution without more context. Other answers assume EMR is long-running and therefore expensive, but that's not mentioned here. D has the upper hand considering everything will be serverless.
upvoted 1 times
pk349
2 years ago
D: I passed the test
upvoted 3 times
akashm99101001com
2 years, 1 month ago
Selected Answer: D
"cost of downstream data processing" so migrate it
upvoted 1 times
Chelseajcole
2 years, 3 months ago
Selected Answer: D
This question is testing whether you know that Glue can run PySpark jobs.
upvoted 4 times
DeerSong
2 years, 3 months ago
Selected Answer: D
D for sure.
upvoted 1 times
Kinlive1991
2 years, 4 months ago
Selected Answer: D
D is correct
upvoted 1 times
siju13
2 years, 4 months ago
Selected Answer: D
Glue is cheaper than EMR.
upvoted 1 times
henom
2 years, 5 months ago
The correct answer is D: the option that says to replace Amazon EMR with AWS Glue, program an AWS Glue ETL script in Python to merge the small sensor data files into larger files, and convert them to Apache Parquet format. The option that says to deploy a Kinesis Data Firehose delivery stream to collect sensor data, convert it to Apache Parquet format, deliver the transformed data into an Amazon S3 bucket, and process the data from the bucket using a PySpark job running on an Amazon EMR cluster is incorrect. Although this option is valid, there is no significant cost reduction since Amazon EMR is still running; AWS Glue can provide the same function at a lower cost. In addition, it is better to merge the smaller files into large files than to just compress them using the Apache Parquet format, in order to improve ingestion performance.
upvoted 1 times
nadavw
2 years, 5 months ago
Selected Answer: D
The question is about the data processing, not the full pipeline including ingestion, so D is the most efficient from a processing perspective.
upvoted 1 times
thuyeinaung
2 years, 5 months ago
Selected Answer: D
Glue can run PySpark
upvoted 1 times
alinato
2 years, 5 months ago
Selected Answer: B
Lambda can run PySpark and is cost-effective and serverless, meaning well-architected.
upvoted 1 times
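One caveat on this comment: the Lambda Python runtime does not include Spark, so a Lambda consumer processes raw Kinesis records in plain Python, roughly as in the hypothetical handler below. Each invocation sees only a small batch of records, which is why per-message Lambda processing does not by itself solve the small-files problem.

import base64
import json

# Sketch of option B: a plain-Python Lambda handler attached to the Kinesis
# stream as an event source. There is no Spark here; field names are
# hypothetical.
def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # ... per-message transformation would go here ...
        print(message.get("device_id"))
    return {"processed": len(event["Records"])}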
Community vote distribution: A (35%), C (25%), B (20%), Other