Exam AWS Certified Big Data - Specialty topic 1 question 39 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 39
Topic #: 1

An organization uses a custom MapReduce application to build monthly reports based on many small data files in an Amazon S3 bucket. The data is submitted from various business units on a frequent but unpredictable schedule. As the dataset continues to grow, it becomes increasingly difficult to process all of the data in one day. The organization has scaled up its Amazon EMR cluster, but other optimizations could improve performance.
The organization needs to improve performance with minimal changes to existing processes and applications.
What action should the organization take?

  • A. Use Amazon S3 Event Notifications and AWS Lambda to create a quick search file index in DynamoDB.
  • B. Add Spark to the Amazon EMR cluster and utilize Resilient Distributed Datasets in-memory.
  • C. Use Amazon S3 Event Notifications and AWS Lambda to index each file into an Amazon Elasticsearch Service cluster.
  • D. Schedule a daily AWS Data Pipeline process that aggregates content into larger files using S3DistCp.
  • E. Have business units submit data via Amazon Kinesis Firehose to aggregate data hourly into Amazon S3.
Suggested Answer: B

Comments

jay1ram2
Highly Voted 3 years, 6 months ago
The answer is D. S3DistCp has the native capability to combine multiple small files into larger files and does not require any coding, so it is a minimal change to existing processes (a sketch of such a step follows this thread). https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
Through a process of elimination, we can also exclude the rest:
A) DynamoDB is not a good service for large-scale data analytics.
B) Moving from MapReduce to Spark requires significant changes to existing processes and applications.
C) Moving from EMR to Elasticsearch requires significant changes to existing processes and applications.
E) Moving from S3 to Firehose requires significant changes to existing processes and applications.
upvoted 10 times
ME2000
3 years, 6 months ago
Option E is wrong: how would Firehose aggregate data hourly into Amazon S3?
upvoted 3 times
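As a concrete illustration of the S3DistCp approach argued above, here is a minimal sketch using boto3 to add an aggregation step to the existing EMR cluster. The region, cluster ID, bucket names, and --groupBy pattern are hypothetical placeholders, not values from the question, and the grouping regex would have to match the organization's actual file naming.

```python
# Hedged sketch (not from the discussion): add an S3DistCp aggregation step
# to an existing EMR cluster with boto3. All identifiers are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # the organization's existing cluster (placeholder)
    Steps=[
        {
            "Name": "Aggregate small files with S3DistCp",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs s3-dist-cp on the cluster
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://example-reports-bucket/incoming/",     # placeholder source prefix
                    "--dest=s3://example-reports-bucket/aggregated/",  # placeholder destination prefix
                    # Files whose names yield the same captured group are
                    # concatenated into one output file (e.g. per unit per month).
                    "--groupBy=.*(unit-\\w+-\\d{4}-\\d{2}).*",
                    "--targetSize=128",  # aim for roughly 128 MB output files
                ],
            },
        }
    ],
)
print(response["StepIds"])
```

The existing MapReduce job would then read the aggregated prefix instead of the raw one, leaving the way business units submit data unchanged.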
guruguru
Highly Voted 3 years, 6 months ago
D. Hadoop is optimized for reading a smaller number of large files rather than many small files, whether from S3 or HDFS. You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
upvoted 8 times
kriscool
Most Recent 3 years, 6 months ago
D is correct. You can aggregate files using the S3DistCp command. Hadoop is optimized for reading a smaller number of large files rather than many small files.
upvoted 1 times
Bulti
3 years, 6 months ago
The choice is really between B and D. With B, Spark can improve performance over the existing EMR setup if, say, it is currently using Hive or Pig. However, it still has to deal with a large number of small files, which is the real issue, so it won't solve the root cause of the problem, which is I/O rather than compute. With D, this is the simplest solution and doesn't require the business units to change the way they ingest those files today. It is a change that solves the root cause of the problem (I/O) while keeping the overall process and flow unchanged. So the correct answer is D.
upvoted 2 times
ramz123
3 years, 6 months ago
It is B. Why the others are wrong: A) requires a new function. C) Lambda with Elasticsearch; why introduce something new? D) the data is already there and it is just growing; the issue is not how to store or move the data but how to process it. E) a big change for everyone.
upvoted 2 times
Corram
3 years, 6 months ago
Hadoop has a default block size of 64 MB and is not efficient at processing many small files; that's why D actually does help (rough arithmetic follows this thread).
upvoted 2 times
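To put rough numbers on the small-files point above, here is a back-of-the-envelope sketch; the file count, file size, and block size are illustrative assumptions, not figures from the question.

```python
# Rough arithmetic on the small-files problem: each small file becomes its own
# input split (one map task), while aggregated files are split by block size.
# All numbers below are illustrative assumptions.
BLOCK_SIZE_MB = 128          # a common EMR/HDFS block size (64 MB on older defaults)
small_files = 100_000        # hypothetical number of small monthly input files
small_file_size_mb = 1       # hypothetical size of each file

tasks_before = small_files                          # one map task per tiny file
total_mb = small_files * small_file_size_mb
tasks_after = -(-total_mb // BLOCK_SIZE_MB)         # ceil(total size / block size)

print(f"map tasks before aggregation: {tasks_before:,}")   # 100,000
print(f"map tasks after aggregation:  {tasks_after:,}")    # 782
```

Per-task scheduling overhead and per-file S3 requests scale with the first number, which is why aggregating the input attacks the I/O bottleneck directly.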
san2020
3 years, 6 months ago
My selection is D.
upvoted 3 times
marwan
3 years, 6 months ago
Answer is D. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
upvoted 3 times
antoneti
3 years, 6 months ago
I support B, since you can add a custom JAR, so I believe you could use your own MapReduce app there.
upvoted 1 times
Raju_k
3 years, 6 months ago
I would choose D, as it is a simple solution to improve performance with minimal changes to the existing process.
upvoted 1 times
viduvivek
3 years, 6 months ago
D looks reasonable. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon EMR cluster.
upvoted 4 times
[Removed]
3 years, 7 months ago
What is the right answer for this? A or D?
upvoted 1 times
cybe001
3 years, 7 months ago
Option A requires EMR to access DynamoDB to get the file index, which needs a change in the current setup. With option D, you can add an S3DistCp step to the EMR job, which is a minimal change. Option D is correct.
upvoted 3 times
M2
3 years, 7 months ago
A, E, and C are not minimal changes; the question says minimal changes to existing processes and applications. The answer should be D or B.
upvoted 3 times
exams
3 years, 7 months ago
Any thoughts on B?
upvoted 1 times
jlpl
3 years, 7 months ago
A is correct
upvoted 1 times
muhsin
3 years, 7 months ago
A requires some additional services; the question asks for minimal changes.
upvoted 3 times
mattyb123
3 years, 7 months ago
Since the app is written for MapReduce, wouldn't adding Spark mean the app would have to be rewritten for Spark?
upvoted 2 times
mattyb123
3 years, 7 months ago
Thoughts on A?
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other