Exam AWS Certified Big Data - Specialty topic 1 question 39 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 39
Topic #: 1

An organization uses a custom MapReduce application to build monthly reports based on many small data files in an Amazon S3 bucket. The data is submitted from various business units on a frequent but unpredictable schedule. As the dataset continues to grow, it becomes increasingly difficult to process all of the data in one day. The organization has scaled up its Amazon EMR cluster, but other optimizations could improve performance.
The organization needs to improve performance with minimal changes to existing processes and applications.
What action should the organization take?

  • A. Use Amazon S3 Event Notifications and AWS Lambda to create a quick search file index in DynamoDB.
  • B. Add Spark to the Amazon EMR cluster and utilize Resilient Distributed Datasets in-memory.
  • C. Use Amazon S3 Event Notifications and AWS Lambda to index each file into an Amazon Elasticsearch Service cluster.
  • D. Schedule a daily AWS Data Pipeline process that aggregates content into larger files using S3DistCp.
  • E. Have business units submit data via Amazon Kinesis Firehose to aggregate data hourly into Amazon S3.
Suggested Answer: B

Comments

jay1ram2
Highly Voted 3 years, 6 months ago
The answer is D. S3DistCp has the native capability to combine multiple small files into larger files and does not require any coding, so it is a minimal change to existing processes (a sketch of such a step follows this thread). https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
Through a process of elimination, we can also exclude the rest:
A) DynamoDB is not a good service for large-scale data analytics.
B) Moving from MapReduce to Spark requires significant changes to existing processes and applications.
C) Moving from EMR to Elasticsearch requires significant changes to existing processes and applications.
E) Moving from S3 to Firehose requires significant changes to existing processes and applications.
upvoted 10 times
ME2000
3 years, 6 months ago
Option E is wrong: how would Firehose aggregate data hourly into Amazon S3?
upvoted 3 times
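As a concrete illustration of the S3DistCp approach argued above, here is a minimal sketch using boto3 to add an aggregation step to the existing EMR cluster. The region, cluster ID, bucket names, and --groupBy pattern are hypothetical placeholders, not values from the question, and the grouping regex would have to match the organization's actual file naming.

```python
# Hedged sketch (not from the discussion): add an S3DistCp aggregation step
# to an existing EMR cluster with boto3. All identifiers are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # the organization's existing cluster (placeholder)
    Steps=[
        {
            "Name": "Aggregate small files with S3DistCp",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs s3-dist-cp on the cluster
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://example-reports-bucket/incoming/",     # placeholder source prefix
                    "--dest=s3://example-reports-bucket/aggregated/",  # placeholder destination prefix
                    # Files whose names yield the same captured group are
                    # concatenated into one output file (e.g. per unit per month).
                    "--groupBy=.*(unit-\\w+-\\d{4}-\\d{2}).*",
                    "--targetSize=128",  # aim for roughly 128 MB output files
                ],
            },
        }
    ],
)
print(response["StepIds"])
```

The existing MapReduce job would then read the aggregated prefix instead of the raw one, leaving the way business units submit data unchanged.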
guruguru
Highly Voted 3 years, 6 months ago
D. Hadoop is optimized for reading a smaller number of large files rather than many small files, whether from S3 or HDFS. You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
upvoted 8 times
kriscool
Most Recent 3 years, 6 months ago
D is correct. You can aggregate files using the S3DistCp command. Hadoop is optimized for reading a smaller number of large files rather than many small files.
upvoted 1 times
Bulti
3 years, 6 months ago
The choice is really between B and D. With B, Spark can improve performance over the existing EMR setup if, say, it is currently using Hive or Pig. However, it still has to deal with a large number of small files, which is the real issue, so it won't solve the root cause of the problem, which is I/O rather than compute. With D, this is the simplest solution and doesn't require the business units to change the way they ingest those files today. It is a change that solves the root cause of the problem (I/O) while keeping the overall process and flow unchanged. So the correct answer is D.
upvoted 2 times
ramz123
3 years, 6 months ago
It is B. Why the others are wrong: A) requires a new function. C) Lambda with Elasticsearch; why introduce something new? D) the data is already there and it is just growing; the issue is not how to store or move the data but how to process it. E) a big change for everyone.
upvoted 2 times
Corram
3 years, 6 months ago
Hadoop has a default block size of 64 MB and is not efficient at processing many small files; that's why D actually does help (rough arithmetic follows this thread).
upvoted 2 times
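To put rough numbers on the small-files point above, here is a back-of-the-envelope sketch; the file count, file size, and block size are illustrative assumptions, not figures from the question.

```python
# Rough arithmetic on the small-files problem: each small file becomes its own
# input split (one map task), while aggregated files are split by block size.
# All numbers below are illustrative assumptions.
BLOCK_SIZE_MB = 128          # a common EMR/HDFS block size (64 MB on older defaults)
small_files = 100_000        # hypothetical number of small monthly input files
small_file_size_mb = 1       # hypothetical size of each file

tasks_before = small_files                          # one map task per tiny file
total_mb = small_files * small_file_size_mb
tasks_after = -(-total_mb // BLOCK_SIZE_MB)         # ceil(total size / block size)

print(f"map tasks before aggregation: {tasks_before:,}")   # 100,000
print(f"map tasks after aggregation:  {tasks_after:,}")    # 782
```

Per-task scheduling overhead and per-file S3 requests scale with the first number, which is why aggregating the input attacks the I/O bottleneck directly.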
san2020
3 years, 6 months ago
My selection is D.
upvoted 3 times
marwan
3 years, 6 months ago
Answer is D. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
upvoted 3 times
antoneti
3 years, 6 months ago
I support B, since you can add a custom JAR, so I believe you could use your own MapReduce app there.
upvoted 1 times
Raju_k
3 years, 6 months ago
I would choose D, as it is a simple solution to improve performance with minimal changes to the existing process.
upvoted 1 times
viduvivek
3 years, 6 months ago
D looks reasonable. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
upvoted 4 times
[Removed]
3 years, 7 months ago
What is the right answer for this? A or D?
upvoted 1 times
cybe001
3 years, 7 months ago
Option A requires EMR to access DynamoDB to get the file index, which needs a change in the current setup. With option D, you can add an S3DistCp step to the EMR job, which is a minimal change. Option D is correct.
upvoted 3 times
M2
3 years, 7 months ago
A, E, and C are not minimal changes; the question says minimal changes to existing processes and applications. The answer should be D or B.
upvoted 3 times
exams
3 years, 7 months ago
Any thoughts on B?
upvoted 1 times
jlpl
3 years, 7 months ago
A is correct
upvoted 1 times
muhsin
3 years, 7 months ago
A requires some additional services; the question asks for minimal changes.
upvoted 3 times
mattyb123
3 years, 7 months ago
Since the app is written for MapReduce, wouldn't adding Spark mean the app would have to be rewritten for Spark?
upvoted 2 times
mattyb123
3 years, 7 months ago
Thoughts on A?
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other