Exam AWS Certified Data Analytics - Specialty topic 1 question 30 discussion

A company has developed an Apache Hive script to batch process data stored in Amazon S3. The script needs to run once every day and store the output in
Amazon S3. The company tested the script, and it completes within 30 minutes on a small local three-node cluster.
Which solution is the MOST cost-effective for scheduling and executing the script?

  • A. Create an AWS Lambda function to spin up an Amazon EMR cluster with a Hive execution step. Set KeepJobFlowAliveWhenNoSteps to false and disable the termination protection flag. Use Amazon CloudWatch Events to schedule the Lambda function to run daily.
  • B. Use the AWS Management Console to spin up an Amazon EMR cluster with Python Hue, Hive, and Apache Oozie. Set the termination protection flag to true and use Spot Instances for the core nodes of the cluster. Configure an Oozie workflow in the cluster to invoke the Hive script daily.
  • C. Create an AWS Glue job with the Hive script to perform the batch operation. Configure the job to run once a day using a time-based schedule.
  • D. Use AWS Lambda layers and load the Hive runtime to AWS Lambda and copy the Hive script. Schedule the Lambda function to run daily by creating a workflow using AWS Step Functions.
Suggested Answer: A

Comments

carol1522
Highly Voted 3 years, 9 months ago
For me it is A. Not B, because core nodes are not supposed to run on Spot Instances, only task nodes, and it is more expensive because scheduling with Oozie requires the cluster to be up all the time. It is not C because Glue cannot run Hive scripts, and it is not D because Lambda cannot run Hive scripts either. https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
upvoted 44 times
awssp12345
3 years, 9 months ago
Agree with A
upvoted 2 times
...
Prodip
3 years, 9 months ago
Perfect explanation; I wanted to write something but your text covers everything.
upvoted 2 times
...
chengxu32
3 years, 8 months ago
https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html With the KeepJobFlowAliveWhenNoSteps parameter set to false, the cluster is shut down once the steps are completed, so the cost-effectiveness requirement is met.
upvoted 9 times
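A minimal sketch of what option A could look like inside the Lambda function, assuming boto3 and the RunJobFlow API linked above. The bucket paths, cluster name, instance types, release label, and the use of the default EMR roles are all hypothetical placeholders, not values from the question. The point is that the Lambda function only submits the request and returns; the transient cluster runs the Hive step and terminates on its own.

```python
def build_job_flow_request(script_s3_uri, log_s3_uri):
    """Build RunJobFlow parameters for a transient EMR cluster with one Hive step.

    All concrete values (name, release label, instance types, roles) are
    illustrative assumptions.
    """
    return {
        "Name": "daily-hive-batch",            # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",          # assumed EMR release
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Terminate the cluster as soon as there are no more steps.
            "KeepJobFlowAliveWhenNoSteps": False,
            # Termination protection off, so auto-termination is not blocked.
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "run-hive-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "hive-script", "--run-hive-script",
                        "--args", "-f", script_s3_uri,
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def lambda_handler(event, context):
    # boto3 is imported here so the request builder can be used without it.
    import boto3

    emr = boto3.client("emr")
    request = build_job_flow_request(
        "s3://example-bucket/scripts/daily.q",  # hypothetical script location
        "s3://example-bucket/emr-logs/",        # hypothetical log location
    )
    response = emr.run_job_flow(**request)
    # The function returns within seconds; the 30-minute job runs on EMR.
    return {"JobFlowId": response["JobFlowId"]}
```

Because the handler returns right after the request is accepted, the Lambda 15-minute limit does not apply to the Hive job itself.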
...
...
jove
Highly Voted 3 years, 8 months ago
+ A is the correct answer. - B : Spot Instances are not a good option to run a 30-min-script - C: Glue cannot run Hive scripts - D: Lambda can run for 15 minutes maximum. Not enough time to run that script.
upvoted 8 times
Bob888
2 years, 3 months ago
C: "Glue cannot run Hive scripts" ---> Glue can run Hive scripts. But the problem is that C keeps all the Glue settings and does not terminate them. A: By default, an Amazon EMR cluster will be terminated automatically when all steps have completed and there are no pending steps or other applications running on the cluster.
upvoted 4 times
...
...
tsangckl
Most Recent 1 year, 3 months ago
Selected Answer: C
Bing Option C is correct because AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. You can create a job in AWS Glue that incorporates your Hive script, and you can schedule this job to run once a day. This approach does not require the provisioning or management of servers, making it a cost-effective solution. Other options involve using Amazon EMR or AWS Lambda, which could incur higher costs due to the need for server provisioning and potential for idle resources.
upvoted 1 times
...
gofavad926
1 year, 8 months ago
Selected Answer: A
A. As carol1522 explains in her comment
upvoted 1 times
...
Hamza98
1 year, 9 months ago
Selected Answer: A
A satisfies all the requirements
upvoted 1 times
...
rlnd2000
1 year, 9 months ago
Selected Answer: C
Could anyone who chose "A" as the correct answer please explain how to make a Lambda function run for 30 minutes?
upvoted 1 times
gofavad926
1 year, 8 months ago
The Lambda function only creates and initialises the EMR cluster...
upvoted 2 times
...
...
petervu
2 years ago
Selected Answer: C
AWS Glue can run Hive scripts, so C will be cheaper than A.
upvoted 3 times
...
Venkkat
2 years ago
A for sure
upvoted 1 times
...
pk349
2 years, 2 months ago
A: I passed the test
upvoted 1 times
rlnd2000
1 year, 9 months ago
You passed, OK, but I think this question was wrong in your test. How can you make Lambda run for 30 minutes?
upvoted 1 times
uyendo123
1 year, 9 months ago
I guess A means the Lambda function would just spin up the EMR cluster; once the cluster has started, the Lambda function stops. Then the Hive script runs on the EMR cluster, and the cluster terminates when the script is done.
upvoted 1 times
...
...
...
Arjun777
2 years, 4 months ago
aws glue can run hive scripts - https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
upvoted 4 times
he11ow0rId
1 year, 11 months ago
I think you misunderstand this blog. Hive can use the catalog generated by Glue, but Glue does not run Hive scripts. So C is still wrong, and A is the correct answer.
upvoted 3 times
...
...
cloudlearnerhere
2 years, 8 months ago
Selected Answer: A
Correct answer is A, as the EMR cluster can be used to execute the Hive scripts. Setting KeepJobFlowAliveWhenNoSteps to false and disabling the termination protection flag lets the cluster terminate once there are no running jobs. CloudWatch Events with Lambda can be used to trigger the scheduled activity. Option B is wrong as Oozie requires the EMR cluster to be always running, else the job cannot be scheduled and executed. Using Spot Instances for core nodes is not recommended. Option C is wrong as Glue does not support running Hive scripts. Option D is wrong as Lambda would not be able to meet the 30-minute job runtime requirement.
upvoted 7 times
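The scheduling half of option A can be sketched the same way, assuming boto3 and the CloudWatch Events `put_rule` / `put_targets` calls plus Lambda `add_permission`. The rule name, run hour, statement ID, and function ARN below are hypothetical; this is one possible wiring, not the only one.

```python
def daily_cron_expression(hour_utc, minute=0):
    """Return a CloudWatch Events cron expression that fires once a day (UTC).

    Field order: minutes, hours, day-of-month, month, day-of-week, year.
    """
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"


def schedule_daily_lambda(function_arn, rule_name="daily-hive-trigger", hour_utc=2):
    # boto3 is imported here so the cron helper can be used without it.
    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    # Create (or update) the daily schedule rule.
    rule = events.put_rule(
        Name=rule_name,                          # hypothetical rule name
        ScheduleExpression=daily_cron_expression(hour_utc),
        State="ENABLED",
    )

    # Allow CloudWatch Events to invoke the Lambda function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",       # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )

    # Point the rule at the Lambda function that starts the EMR cluster.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "start-emr-lambda", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

With this wiring the only always-on resource is the schedule rule; the cluster exists only for the duration of the daily run, in contrast to the always-on cluster that Oozie scheduling in option B requires.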
...
Arka_01
2 years, 9 months ago
Selected Answer: A
This one is a classic scenario of Transient cluster. So A is the answer here.
upvoted 2 times
...
rocky48
2 years, 11 months ago
Selected Answer: A
Answer is A
upvoted 1 times
...
Pradhan
3 years, 8 months ago
I will go with A.
upvoted 3 times
...
Shraddha
3 years, 8 months ago
Ans A. B = wrong, termination flag should be off, Spot Instances not good for core nodes. C = Glue runs on Spark, cannot run Hive scripts. D = wrong, Lambda maximum running time is 15 minutes.
upvoted 4 times
...
lostsoul07
3 years, 8 months ago
A is the right answer
upvoted 2 times
...
Sai12
3 years, 8 months ago
A based on its similarity to this article https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs-process-sample-data.html
upvoted 2 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted