Exam AWS Certified Big Data - Specialty topic 1 question 41 discussion

Question #: 41
Topic #: 1

An organization uses Amazon Elastic MapReduce (EMR) to process a series of extract-transform-load (ETL) steps that run in sequence. The output of each step must be fully processed in subsequent steps but will not be retained.
Which of the following techniques will meet this requirement most efficiently?

  • A. Use the EMR File System (EMRFS) to store the outputs from each step as objects in Amazon Simple Storage Service (S3).
  • B. Use the s3n URI to store the data to be processed as objects in Amazon S3.
  • C. Define the ETL steps as separate AWS Data Pipeline activities.
  • D. Load the data to be processed into HDFS, and then write the final output to Amazon S3.
Suggested Answer: B

Comments

Jayraam
Highly Voted 3 years, 7 months ago
Answer is C. AWS Data Pipeline works well for a sequence of ETL processing steps. https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html "AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up."
upvoted 9 times
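For readers who want to see what "separate Data Pipeline activities" with dependencies might look like, here is a minimal, hypothetical boto3 sketch (not part of the original discussion). The object IDs, the step strings, and the bare EmrCluster resource are placeholders, and a real definition would also need a Default object with a schedule.

    import boto3

    dp = boto3.client("datapipeline")

    pipeline = dp.create_pipeline(name="etl-sequence", uniqueId="etl-sequence-demo")
    pipeline_id = pipeline["pipelineId"]

    objects = [
        # Transient EMR cluster that both activities run on (placeholder config).
        {"id": "EtlCluster", "name": "EtlCluster", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
        ]},
        # First ETL step (hypothetical script name).
        {"id": "ExtractStep", "name": "ExtractStep", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EtlCluster"},
            {"key": "step", "stringValue": "command-runner.jar,spark-submit,extract.py"},
        ]},
        # Second ETL step; dependsOn makes it wait for the first one to succeed.
        {"id": "TransformStep", "name": "TransformStep", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EtlCluster"},
            {"key": "dependsOn", "refValue": "ExtractStep"},
            {"key": "step", "stringValue": "command-runner.jar,spark-submit,transform.py"},
        ]},
    ]

    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
    dp.activate_pipeline(pipelineId=pipeline_id)

The dependsOn field is what gives the "dependent on the successful completion of previous tasks" behaviour quoted from the documentation.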
s3an
Highly Voted 3 years, 7 months ago
C should be the correct answer. The question never mentions anything about keeping the final output in S3; the ETL might be to and from any other database. And D only says to load the data to be processed into HDFS, not the output of each step.
upvoted 8 times
reg9
3 years, 7 months ago
With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
upvoted 1 times
Royk2020
Most Recent 3 years, 6 months ago
The only two logical answers are C and D. Of the two, Data Pipeline does not have the ability to share data between steps on its own (it has to be stored somewhere). My choice is D, as HDFS is ephemeral and the data is lost on cluster termination.
upvoted 2 times
vicks316
3 years, 6 months ago
Exactly what I was going to say, "Data Pipeline does not have the ability to share data on its own in between steps (it has to be stored somewhere)". Spot on, D from my perspective.
upvoted 1 times
Bulti
3 years, 7 months ago
C & D will both do the job. However I think D is more efficient than C so you do not have to deal with starting and terminating a transient EMR cluster on each intermediate step. AWS Data pipeline is being introduced to confuse us because it is the service used to execute a series of job in a sequence. However I think D is the right answer as its more efficient. The reason why the final output is persisted to S3 is because we cannot lose it as is it the result of all the map/reduce processing we did on the cluster. So when the cluster is terminate we don't want to lose the results of our Map/reduce processing we did on the data fed to the cluster for processing.
upvoted 5 times
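To make the approach in this comment (option D) concrete, here is a hedged boto3 sketch that submits a chain of steps to one long-running EMR cluster. The cluster ID, script locations, and input/output paths are invented placeholders; intermediate results go to hdfs:/// paths that disappear with the cluster, and only the last step writes to S3.

    import boto3

    emr = boto3.client("emr")

    def spark_step(name, script, src, dest):
        # Each step is submitted through command-runner.jar; the script paths
        # and URIs are placeholders, not values from the original question.
        return {
            "Name": name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script, src, dest],
            },
        }

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[
            # Intermediate outputs live only in HDFS and vanish with the cluster.
            spark_step("extract", "s3://example-bucket/jobs/extract.py",
                       "s3://example-bucket/raw/", "hdfs:///tmp/stage1/"),
            spark_step("transform", "s3://example-bucket/jobs/transform.py",
                       "hdfs:///tmp/stage1/", "hdfs:///tmp/stage2/"),
            # Only the final result is persisted durably to S3.
            spark_step("load", "s3://example-bucket/jobs/load.py",
                       "hdfs:///tmp/stage2/", "s3://example-bucket/final/"),
        ],
    )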
YashBindlish
3 years, 7 months ago
Correct answer is D. You can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR cluster can process it. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
upvoted 1 times
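As a hedged illustration of the S3DistCp point above (not part of the original answer), an S3DistCp copy is typically submitted as an EMR step via command-runner.jar; the bucket names and cluster ID below are placeholders.

    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": "copy-input-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # s3-dist-cp bulk-copies the S3 input into HDFS; the source
                # bucket and target path are placeholders.
                "Args": ["s3-dist-cp", "--src", "s3://example-bucket/raw/",
                         "--dest", "hdfs:///input/"],
            },
        }],
    )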
san2020
3 years, 7 months ago
my selection C
upvoted 1 times
balajisush0312
3 years, 7 months ago
A, B, and D support retaining the data. However, C doesn't. So for me, C is correct.
upvoted 1 times
ExamTopicSteven
3 years, 7 months ago
Moreover, "Datapipe line launches an Amazon EMR cluster for each scheduled interval, submits jobs as steps to the cluster, and terminates the cluster after tasks have completed." By this, data is not retained. So C looks good for me. Do you think the "final output" in D, means final output of each step. final output = output here.
upvoted 2 times
ExamTopicSteven
3 years, 7 months ago
"D. Load the data to be processed into HDFS" = retained?. So D is not right? Moreover, the question looks is only about ETL, did not mention what we should do regarding "final output" C. Define the ETL steps as SEPARATE AWS Data Pipeline activities. SEPARATE means "not ratained", doesn't it?
upvoted 1 times
sam3787
3 years, 7 months ago
@mattyb123 or anyone here: did you clear the exam recently? I need reviews on most of the contradictory answers.
upvoted 1 times
sam3787
3 years, 7 months ago
The output will not be retained. Do we still require S3? Not sure of the correct answer here.
upvoted 1 times
jay1ram2
3 years, 7 months ago
Correct answer is D. The main ask is efficiency.
A) EMRFS provides consistency; however, copying intermediate results into S3 is not an efficient approach.
B) s3n is an old protocol, so it is not efficient.
C) Using a Data Pipeline activity for each step is just orchestration and does not by itself guarantee efficiency.
D) Using HDFS for intermediate steps ensures that the data is replicated and stored within the EMR core nodes, which is the most efficient way to store data for processing in EMR (even compared to S3). Storing the final result in S3 provides durability, which may or may not matter in the context of this question.
upvoted 4 times
ME2000
3 years, 7 months ago
Answer is D. Here we go, from "Scalability and Flexibility": "Additionally, Amazon EMR provides the flexibility to use several file systems for your input, output, and intermediate data. For example, you might choose the Hadoop Distributed File System (HDFS), which runs on the master and core nodes of your cluster, for processing data that you do not need to store beyond your cluster's lifecycle. You might choose the EMR File System (EMRFS) to use Amazon S3 as a data layer for applications running on your cluster so that you can separate your compute and storage, and persist data outside of the lifecycle of your cluster." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html
upvoted 2 times
Raju_k
3 years, 7 months ago
I would choose D, since saving the intermediate output in HDFS is the best option for a chain of ETL jobs.
upvoted 1 times
cybe001
3 years, 7 months ago
My answer is D
upvoted 2 times
asadao
3 years, 7 months ago
B is correct
upvoted 1 times
M2
3 years, 7 months ago
Answer should be D, as it writes only the final output to S3.
upvoted 2 times
Community vote distribution: A (35%), C (25%), B (20%), Other