Exam AWS Certified Big Data - Specialty topic 1 question 2 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty

Question #: 2
Topic #: 1

A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage.
Which AWS service strategy is best for this use case?

A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.

Suggested Answer: C
Reference: https://aws.amazon.com/blogs/database/indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python/

by mattyb123 at Aug. 12, 2019, 5:49 a.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

Bulti

Highly Voted 3 years, 8 months ago

Answer is B: Hadoop Streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon EMR API or command line just like a standard JAR file. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UseCase_Streaming.html
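To illustrate the point about non-Java executables, here is a minimal sketch of what a Python mapper for a streaming step might look like. Hadoop Streaming feeds input lines to the script on stdin and reads tab-separated key/value records from stdout. The keyword list and the `classify` function are hypothetical stand-ins, since the question does not show the actual spam algorithm.

```python
import sys

# Hypothetical keyword list; the real spam algorithm is not shown in the question.
SPAM_WORDS = {"winner", "free", "prize"}


def classify(text):
    """Toy stand-in for the spam algorithm: flag text containing any keyword."""
    words = set(text.lower().split())
    return "spam" if words & SPAM_WORDS else "ham"


def map_lines(lines):
    """Yield one 'label<TAB>1' record per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield f"{classify(line)}\t1"


def main(stream=sys.stdin):
    """Entry point a streaming step would call: read stdin, write records to stdout."""
    for record in map_lines(stream):
        print(record)
```

A script deployed as a mapper would end with a call to `main()`; it is left out here so the functions can be imported and tried locally without blocking on stdin.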

upvoted 6 times

...

fagilom

Most Recent 1 year, 6 months ago

Option B (Use Amazon EMR) is the best service strategy for this use case. EMR can handle the scale of the data and provides the necessary computational resources to run complex text analysis algorithms in parallel across a large cluster, making it a suitable choice for processing 5 PB of data.

upvoted 1 times

...

Abhi09

3 years, 8 months ago

Option B. EMR on S3 supports 5 PB. Hadoop or Spark streaming supports a Python algorithm. Elasticsearch has a limit of 30 TB for a single domain. Even after a request to Amazon to increase it, the limit goes to a maximum of 300 TB, so it is not possible to store 5 PB. Secondly, using Lambda in this case might not be a good choice with such a huge volume of data flowing in.

upvoted 1 times

...

alopazo

3 years, 8 months ago

B https://docs.aws.amazon.com/emr/latest/ReleaseGuide/CLI_CreateStreaming.html

upvoted 2 times

...

YashBindlish

3 years, 8 months ago

C cannot be the right answer: the question does not mention a Lambda function to copy from S3, and it talks about 5 PB of data, which Elasticsearch cannot support. Option B is the correct answer: because the data is huge, EMR can be used to analyze it in parallel using a streaming program, which supports Python.

upvoted 2 times

...

jxj

3 years, 8 months ago

C is reasonable

upvoted 1 times

...

yuvaraj228

3 years, 8 months ago

C is right

upvoted 1 times

...

san2020

3 years, 8 months ago

Selected B

upvoted 1 times

...

ME2000

3 years, 9 months ago

The question is lying here: The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. This secondary information is for overthinking: The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage. (Must notice...also resides in Amazon S3 storage.) So clearly the correct answer is C

upvoted 1 times

...

shwang

3 years, 9 months ago

It can NOT be B, because it says 'using a streaming program step'. How can a streaming program analyze text content, figure out whether the content is spam, and then report which e-mail the content belongs to?

upvoted 2 times

Vlad511

3 years, 9 months ago

For me, C is the right one

upvoted 1 times

...

cert_learner

3 years, 9 months ago

AWS ES can't support more than 3 PB of data

upvoted 2 times

...

srirampc

3 years, 8 months ago

Streaming could mean using Hadoop Streaming or Spark Streaming to process files in S3. Each row in a CSV is sent to the streaming Python function to be processed. B could be a good option to parallelize a function that works well on a sample.
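As a rough sketch of the other half of such a job: in Hadoop Streaming the reducer receives the mapper output sorted by key, so per-label totals can be computed with a simple grouping pass. The `label<TAB>count` record format below is an assumption made for illustration, not something stated in the question.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts per key from 'key<TAB>count' lines that are already sorted by key."""
    # Skip blank lines and split each record into [key, count].
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent records, which is why sorted input is required.
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"
```

Because the framework handles splitting the input and sorting between the two phases, the per-row function that worked on the sample does not need to change when the cluster grows.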

upvoted 1 times

...

antoneti

3 years, 9 months ago

I don't understand. I would also have said that the answer is B, but it seems it is not. Is it possible that more than one answer is allowed in the exam?

upvoted 1 times

Corram

3 years, 8 months ago

The thing is, the answers provided by ExamTopics are often wrong for some reason (at least for this exam).

upvoted 1 times

...

WWODIN

3 years, 9 months ago

For data sizes in the PB range, it should be EMR, so answer B.

upvoted 2 times

...

M2

3 years, 9 months ago

Answer is B

upvoted 2 times

...

muhsin

3 years, 9 months ago

Spam filtering is a machine learning algorithm. It works with EMR and S3, which are the most suitable for this scenario. B is the correct answer.

upvoted 3 times

...

Jialu

3 years, 9 months ago

B is the correct answer, since Amazon Elasticsearch Service cannot support 5 PB.

upvoted 3 times

practicioner

3 years, 9 months ago

https://aws.amazon.com/ru/elasticsearch-service/faqs/
Q: Is there a limit on the amount of EBS storage that can be allocated to an Amazon Elasticsearch Service domain?
Yes. Amazon Elasticsearch Service supports one EBS volume (max size of 1.5 TB) per instance associated with a domain. With the default maximum of 20 data nodes allowed per Amazon Elasticsearch Service domain, you can allocate about 30 TB of EBS storage to a single domain. You can request a service limit increase up to 200 instances per domain by creating a case with the AWS Support Center. With 200 instances, you can allocate about 300 TB of EBS storage to a single domain.
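The arithmetic behind those FAQ figures can be checked with a quick sketch. The per-node and node-count limits are the ones quoted above; treating 5 PB as 5,000 TB (decimal units) is an assumption, and the limits themselves may have changed since the FAQ was quoted.

```python
import math

TB_PER_NODE = 1.5        # max EBS volume per instance, from the quoted FAQ
DEFAULT_NODES = 20       # default data-node limit per domain, from the quoted FAQ
MAX_NODES = 200          # limit after a service limit increase, from the quoted FAQ
DATASET_TB = 5 * 1000    # 5 PB expressed in decimal TB (assumption)

default_capacity_tb = TB_PER_NODE * DEFAULT_NODES   # about 30 TB
max_capacity_tb = TB_PER_NODE * MAX_NODES           # about 300 TB

# Number of maxed-out domains that would be needed just to hold the raw data.
domains_needed = math.ceil(DATASET_TB / max_capacity_tb)

print(default_capacity_tb, max_capacity_tb, domains_needed)
```

Even at the raised limit, a single domain holds well under a tenth of the production dataset, which is the core of the argument against option C.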

upvoted 1 times

...