Exam AWS Certified Big Data - Specialty topic 1 question 41 discussion

Question #: 41
Topic #: 1

An organization uses Amazon Elastic MapReduce (EMR) to process a series of extract-transform-load (ETL) steps that run in sequence. The output of each step must be fully processed in subsequent steps but will not be retained.
Which of the following techniques will meet this requirement most efficiently?

  • A. Use the EMR File System (EMRFS) to store the outputs from each step as objects in Amazon Simple Storage Service (S3).
  • B. Use the s3n URI to store the data to be processed as objects in Amazon S3.
  • C. Define the ETL steps as separate AWS Data Pipeline activities.
  • D. Load the data to be processed into HDFS, and then write the final output to Amazon S3.
Suggested Answer: B

Comments

Jayraam
Highly Voted 3 years, 7 months ago
Answer is C. AWS Data Pipeline works well for a sequence of ETL processing steps. https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html "AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up."
upvoted 9 times
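For readers who want to see what "separate Data Pipeline activities" with dependencies might look like, here is a minimal, hypothetical boto3 sketch (not part of the original discussion). The object IDs, the step strings, and the bare EmrCluster resource are placeholders, and a real definition would also need a Default object with a schedule.

    import boto3

    dp = boto3.client("datapipeline")

    pipeline = dp.create_pipeline(name="etl-sequence", uniqueId="etl-sequence-demo")
    pipeline_id = pipeline["pipelineId"]

    objects = [
        # Transient EMR cluster that both activities run on (placeholder config).
        {"id": "EtlCluster", "name": "EtlCluster", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
        ]},
        # First ETL step (hypothetical script name).
        {"id": "ExtractStep", "name": "ExtractStep", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EtlCluster"},
            {"key": "step", "stringValue": "command-runner.jar,spark-submit,extract.py"},
        ]},
        # Second ETL step; dependsOn makes it wait for the first one to succeed.
        {"id": "TransformStep", "name": "TransformStep", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EtlCluster"},
            {"key": "dependsOn", "refValue": "ExtractStep"},
            {"key": "step", "stringValue": "command-runner.jar,spark-submit,transform.py"},
        ]},
    ]

    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
    dp.activate_pipeline(pipelineId=pipeline_id)

The dependsOn field is what gives the "dependent on the successful completion of previous tasks" behaviour quoted from the documentation.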
s3an
Highly Voted 3 years, 7 months ago
C should be the correct answer. The question never mentions anything about keeping the final output in S3; the ETL might be to and from any other database. And D only says to load the data to be processed into HDFS, not the output of each step.
upvoted 8 times
reg9
3 years, 7 months ago
With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
upvoted 1 times
Royk2020
Most Recent 3 years, 6 months ago
The only two logical answers are C and D. Of the two, Data Pipeline does not have the ability to share data between steps on its own (it has to be stored somewhere). My choice is D, as HDFS is ephemeral and the data is lost on cluster termination.
upvoted 2 times
vicks316
3 years, 6 months ago
Exactly what I was going to say, "Data Pipeline does not have the ability to share data on its own in between steps (it has to be stored somewhere)". Spot on, D from my perspective.
upvoted 1 times
Bulti
3 years, 7 months ago
C & D will both do the job. However I think D is more efficient than C so you do not have to deal with starting and terminating a transient EMR cluster on each intermediate step. AWS Data pipeline is being introduced to confuse us because it is the service used to execute a series of job in a sequence. However I think D is the right answer as its more efficient. The reason why the final output is persisted to S3 is because we cannot lose it as is it the result of all the map/reduce processing we did on the cluster. So when the cluster is terminate we don't want to lose the results of our Map/reduce processing we did on the data fed to the cluster for processing.
upvoted 5 times
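To make the approach in this comment (option D) concrete, here is a hedged boto3 sketch that submits a chain of steps to one long-running EMR cluster. The cluster ID, script locations, and input/output paths are invented placeholders; intermediate results go to hdfs:/// paths that disappear with the cluster, and only the last step writes to S3.

    import boto3

    emr = boto3.client("emr")

    def spark_step(name, script, src, dest):
        # Each step is submitted through command-runner.jar; the script paths
        # and URIs are placeholders, not values from the original question.
        return {
            "Name": name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script, src, dest],
            },
        }

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[
            # Intermediate outputs live only in HDFS and vanish with the cluster.
            spark_step("extract", "s3://example-bucket/jobs/extract.py",
                       "s3://example-bucket/raw/", "hdfs:///tmp/stage1/"),
            spark_step("transform", "s3://example-bucket/jobs/transform.py",
                       "hdfs:///tmp/stage1/", "hdfs:///tmp/stage2/"),
            # Only the final result is persisted durably to S3.
            spark_step("load", "s3://example-bucket/jobs/load.py",
                       "hdfs:///tmp/stage2/", "s3://example-bucket/final/"),
        ],
    )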
YashBindlish
3 years, 7 months ago
Correct answer is D. You can efficiently copy large amounts of data from Amazon S3 into HDFS, where subsequent steps in your EMR cluster can process it. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
upvoted 1 times
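As a hedged illustration of the S3DistCp point above (not part of the original answer), an S3DistCp copy is typically submitted as an EMR step via command-runner.jar; the bucket names and cluster ID below are placeholders.

    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": "copy-input-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # s3-dist-cp bulk-copies the S3 input into HDFS; the source
                # bucket and target path are placeholders.
                "Args": ["s3-dist-cp", "--src", "s3://example-bucket/raw/",
                         "--dest", "hdfs:///input/"],
            },
        }],
    )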
san2020
3 years, 7 months ago
my selection C
upvoted 1 times
balajisush0312
3 years, 7 months ago
A, B, and D support retaining the data. However, C doesn't. So for me, C is correct.
upvoted 1 times
ExamTopicSteven
3 years, 7 months ago
Moreover, "Datapipe line launches an Amazon EMR cluster for each scheduled interval, submits jobs as steps to the cluster, and terminates the cluster after tasks have completed." By this, data is not retained. So C looks good for me. Do you think the "final output" in D, means final output of each step. final output = output here.
upvoted 2 times
ExamTopicSteven
3 years, 7 months ago
"D. Load the data to be processed into HDFS" = retained?. So D is not right? Moreover, the question looks is only about ETL, did not mention what we should do regarding "final output" C. Define the ETL steps as SEPARATE AWS Data Pipeline activities. SEPARATE means "not ratained", doesn't it?
upvoted 1 times
sam3787
3 years, 7 months ago
@mattyb123 or anyone here: did you clear the exam recently? I need reviews on most of the contradictory answers.
upvoted 1 times
sam3787
3 years, 7 months ago
The output will not be retained. Do we still require S3? Not sure of the correct answer here.
upvoted 1 times
jay1ram2
3 years, 7 months ago
Correct answer is D. The main ask is efficiency.
A) EMRFS provides consistency; however, copying intermediate results into S3 is not an efficient approach.
B) s3n is an old protocol, so it is not efficient.
C) Using a Data Pipeline activity for each step is just orchestration and does not by itself guarantee efficiency.
D) Using HDFS for intermediate steps ensures that the data is replicated and stored within the EMR core nodes, which is the most efficient way to store data for processing in EMR (even compared to S3). Storing the final result in S3 provides durability, which may or may not matter in the context of this question.
upvoted 4 times
ME2000
3 years, 7 months ago
Answer is D. Here we go, from "Scalability and Flexibility": "Additionally, Amazon EMR provides the flexibility to use several file systems for your input, output, and intermediate data. For example, you might choose the Hadoop Distributed File System (HDFS), which runs on the master and core nodes of your cluster, for processing data that you do not need to store beyond your cluster's lifecycle. You might choose the EMR File System (EMRFS) to use Amazon S3 as a data layer for applications running on your cluster so that you can separate your compute and storage, and persist data outside of the lifecycle of your cluster." https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-benefits.html
upvoted 2 times
Raju_k
3 years, 7 months ago
I would choose D, since saving the intermediate output in HDFS is the best option for a chain of ETL jobs.
upvoted 1 times
cybe001
3 years, 7 months ago
My answer is D
upvoted 2 times
asadao
3 years, 7 months ago
B is correct
upvoted 1 times
M2
3 years, 7 months ago
Answer should be D, as it writes only the final output to S3.
upvoted 2 times
Community vote distribution: A (35%), C (25%), B (20%), Other