Exam AWS Certified Big Data - Specialty topic 1 question 2 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 2
Topic #: 1

A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage.
Which AWS service strategy is best for this use case?

  • A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
  • B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
  • C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
  • D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
Suggested Answer: C
Reference: https://aws.amazon.com/blogs/database/indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python/

Comments

Bulti
Highly Voted 3 years, 8 months ago
Answer is B: Hadoop Streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon EMR API or command line just like a standard JAR file. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UseCase_Streaming.html
upvoted 6 times
...
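To make the streaming contract described above concrete, here is a minimal sketch of a mapper and reducer in Python. It is only an illustration under assumptions: `is_spam` is a hypothetical placeholder for the real algorithm, and the one-e-mail-per-line input layout is assumed, not stated in the question.

```python
import sys
from itertools import groupby


def is_spam(text):
    # Hypothetical stand-in for the real SPAM algorithm.
    return "free money" in text.lower()


def mapper(lines):
    # Emit one "label<TAB>1" record per non-empty input line
    # (one e-mail per line is an assumption).
    for line in lines:
        text = line.strip()
        if text:
            yield ("spam" if is_spam(text) else "ham") + "\t1"


def reducer(lines):
    # Hadoop Streaming hands the reducer mapper output sorted by key,
    # so records with the same key are adjacent and can be grouped.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key + "\t" + str(sum(int(value) for _, value in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked as "script.py map" or "script.py reduce";
    # records arrive on stdin and leave on stdout.
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(sys.stdin):
        print(record)
```

Because both stages only read lines and write lines, the same script can be tried locally with a shell pipeline (map, sort, reduce) before it is run on a cluster.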
fagilom
Most Recent 1 year, 6 months ago
Option B (Use Amazon EMR) is the best service strategy for this use case. EMR can handle the scale of the data and provides the necessary computational resources to run complex text analysis algorithms in parallel across a large cluster, making it a suitable choice for processing 5 PB of data.
upvoted 1 times
...
Abhi09
3 years, 8 months ago
Option B. EMR works directly on S3, which can hold 5 PB. Hadoop or Spark streaming supports Python algorithms. Elasticsearch has a limit of 30 TB for a single domain. Even after a limit-increase request to Amazon, it tops out at 300 TB, so it is not possible to store 5 PB. Secondly, using Lambda in this case might not be a good choice with such a huge volume of data flowing in.
upvoted 1 times
...
alopazo
3 years, 8 months ago
B https://docs.aws.amazon.com/emr/latest/ReleaseGuide/CLI_CreateStreaming.html
upvoted 2 times
...
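As a rough sketch of what a streaming step definition can look like, the function below builds a step dictionary in the shape used by boto3's EMR client. The bucket, paths, and script names are hypothetical, and the `command-runner.jar` / `hadoop-streaming` form assumes a reasonably recent EMR release. Nothing is submitted; the code only constructs the dictionary.

```python
def streaming_step(name, mapper_uri, reducer_uri, input_uri, output_uri):
    """Build a Hadoop Streaming step definition (all URIs are caller-supplied)."""
    # Files shipped with -files are referenced by their base name on the nodes.
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    reducer_file = reducer_uri.rsplit("/", 1)[-1]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Generic options such as -files come before streaming options.
                "-files", mapper_uri + "," + reducer_uri,
                "-mapper", mapper_file,
                "-reducer", reducer_file,
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


# Hypothetical locations, for illustration only.
step = streaming_step(
    "spam-classification",
    "s3://example-bucket/code/mapper.py",
    "s3://example-bucket/code/reducer.py",
    "s3://example-bucket/emails/",
    "s3://example-bucket/output/",
)
```

A dictionary like `step` is the kind of entry that goes into the `Steps` list of boto3's `add_job_flow_steps` call for an existing cluster.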
YashBindlish
3 years, 8 months ago
C cannot be the right answer: the question does not mention a Lambda function to copy from S3, and it is talking about 5 PB of data, which Elasticsearch cannot support. Option B is the correct answer: because the data is huge, EMR can be used to analyze it in parallel using a streaming program, which supports Python.
upvoted 2 times
...
jxj
3 years, 8 months ago
C is reasonable
upvoted 1 times
...
yuvaraj228
3 years, 8 months ago
C is right
upvoted 1 times
...
san2020
3 years, 8 months ago
Selected B
upvoted 1 times
...
ME2000
3 years, 9 months ago
The misleading part of the question is here: "The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3." This secondary information is there to make you overthink: "The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage." (Note "also resides in Amazon S3 storage.") So clearly the correct answer is C.
upvoted 1 times
...
shwang
3 years, 9 months ago
It can NOT be B, because it says 'using a streaming program step'. How can a streaming program analyze text content, figure out whether the content is spam, and then report which piece of mail the content belongs to?
upvoted 2 times
Vlad511
3 years, 9 months ago
For me, C is the right one
upvoted 1 times
...
cert_learner
3 years, 9 months ago
AWS ES can't support more than 3 PB of data.
upvoted 2 times
...
srirampc
3 years, 8 months ago
Streaming could mean using Hadoop Streaming or Spark Streaming to process files in S3. Each row in the CSV is sent to the streaming Python function to be processed. B could be a good option for parallelizing a function that works well on a sample.
upvoted 1 times
...
...
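On the question raised in this thread of how a streaming job ties a verdict back to a specific mail: one possible approach, sketched below, is to carry a message identifier through as the output key. The `message_id,body` CSV layout and the `is_spam` placeholder are assumptions made for illustration; the question does not specify the file format.

```python
import csv


def is_spam(text):
    # Hypothetical stand-in for the real SPAM algorithm.
    return "free money" in text.lower()


def classify_rows(lines):
    """Yield "message_id<TAB>label" for each CSV row of the form message_id,body."""
    for row in csv.reader(lines):
        if len(row) != 2:
            # Skip malformed rows rather than failing the whole task.
            continue
        message_id, body = row
        yield message_id + "\t" + ("spam" if is_spam(body) else "ham")
```

Each output record names the mail it belongs to, so the results can be written back to S3 and joined with the original messages later.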
antoneti
3 years, 9 months ago
I don't understand. I would also have said the answer is B, but it seems it is not. Is it possible that the exam allows more than one answer?
upvoted 1 times
Corram
3 years, 8 months ago
The thing is, the answers provided by examtopics are often wrong for some reason (at least for this exam).
upvoted 1 times
...
...
WWODIN
3 years, 9 months ago
For data sizes in PB, EMR fits, so the answer should be B.
upvoted 2 times
...
M2
3 years, 9 months ago
Answer is B
upvoted 2 times
...
muhsin
3 years, 9 months ago
Spam filtering is a machine learning problem. It works with EMR and S3, which are the most suitable services for this scenario. B is the correct answer.
upvoted 3 times
...
Jialu
3 years, 9 months ago
B is the correct answer, since Amazon Elasticsearch Service cannot support 5 PB.
upvoted 3 times
practicioner
3 years, 9 months ago
https://aws.amazon.com/ru/elasticsearch-service/faqs/
Q: Is there a limit on the amount of EBS storage that can be allocated to an Amazon Elasticsearch Service domain?
Yes. Amazon Elasticsearch Service supports one EBS volume (max size of 1.5 TB) per instance associated with a domain. With the default maximum of 20 data nodes allowed per Amazon Elasticsearch Service domain, you can allocate about 30 TB of EBS storage to a single domain. You can request a service limit increase up to 200 instances per domain by creating a case with the AWS Support Center. With 200 instances, you can allocate about 300 TB of EBS storage to a single domain.
upvoted 1 times
...
...
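The arithmetic behind the limits quoted in this thread can be checked with a short sketch. The per-node and node-count figures come from the FAQ text above; treating 1 PB as 1000 TB and ignoring replicas and index overhead are simplifying assumptions.

```python
import math

TB_PER_NODE = 1.5       # max EBS volume per instance, per the quoted FAQ
DEFAULT_NODES = 20      # default data-node limit per domain
MAX_NODES = 200         # limit after a service limit increase
DATASET_TB = 5 * 1000   # 5 PB, assuming decimal units (1 PB = 1000 TB)


def domain_capacity_tb(nodes, tb_per_node=TB_PER_NODE):
    # Raw EBS capacity of one domain, ignoring replicas and index overhead.
    return nodes * tb_per_node


def domains_needed(dataset_tb, nodes):
    # Number of full domains required to hold the dataset at that node count.
    return math.ceil(dataset_tb / domain_capacity_tb(nodes))


print(domain_capacity_tb(DEFAULT_NODES))      # 30.0
print(domain_capacity_tb(MAX_NODES))          # 300.0
print(domains_needed(DATASET_TB, MAX_NODES))  # 17
```

Even at the raised limit, the 5 PB dataset is more than sixteen times the capacity of a single domain under these assumptions.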
mattyb123
3 years, 10 months ago
B is the correct answer
upvoted 3 times
mattyb123
3 years, 10 months ago
https://aws.amazon.com/blogs/database/indexing-metadata-in-amazon-elasticsearch-service-using-aws-lambda-and-python/
upvoted 1 times
mattyb123
3 years, 10 months ago
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UseCase_Streaming.html
upvoted 3 times
exams
3 years, 9 months ago
B is correct
upvoted 2 times
shouvanik
3 years, 9 months ago
Hi, if B is correct, what does the URL you pasted explain?
upvoted 1 times
...
...
...
...
...