Exam AWS Certified Data Analytics - Specialty topic 1 question 10 discussion

An insurance company has raw data in JSON format that is sent without a predefined schedule through an Amazon Kinesis Data Firehose delivery stream to an
Amazon S3 bucket. An AWS Glue crawler is scheduled to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to provide access to the most up-to-date data.
Which solution meets these requirements?

  • A. Create an external schema based on the AWS Glue Data Catalog on the existing Amazon Redshift cluster to query new data in Amazon S3 with Amazon Redshift Spectrum.
  • B. Use Amazon CloudWatch Events with the rate (1 hour) expression to execute the AWS Glue crawler every hour.
  • C. Using the AWS CLI, modify the execution schedule of the AWS Glue crawler from 8 hours to 1 minute.
  • D. Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.
Suggested Answer: D
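The event-driven pattern in option D can be sketched as a small Lambda handler that starts the crawler whenever Firehose writes a new object to the bucket. This is a minimal illustration, not the exam's reference implementation: the crawler name `insurance-json-crawler` and the helper `object_keys` are assumptions.

```python
def object_keys(event):
    """Extract the object keys from an S3:ObjectCreated:* event payload."""
    return [r["s3"]["object"]["key"] for r in event.get("Records", [])]

def lambda_handler(event, context):
    import boto3  # imported lazily so the module also loads outside Lambda
    glue = boto3.client("glue")
    keys = object_keys(event)
    if not keys:
        return {"started": False, "keys": []}
    try:
        # If a crawl is already in progress, StartCrawler raises
        # CrawlerRunningException; that in-flight run will pick up the
        # newly created objects anyway, so we treat it as a no-op.
        glue.start_crawler(Name="insurance-json-crawler")  # assumed name
        return {"started": True, "keys": keys}
    except glue.exceptions.CrawlerRunningException:
        return {"started": False, "keys": keys}
```

To wire this up, the S3 bucket's `S3:ObjectCreated:*` event notification is pointed at the function. Swallowing `CrawlerRunningException` also addresses the concern raised in the comments below about Firehose delivering objects faster than the crawler can run.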

Comments

singh100
Highly Voted 3 years, 9 months ago
Answer: D. Data analysts use Apache Spark SQL on Amazon EMR to analyze the data stored in S3 in JSON format. A JSON file landing in S3 triggers a Lambda function, which invokes the Glue crawler.
upvoted 37 times
chinmayj213
1 year, 9 months ago
D is correct out of all these answers, but is running a crawler on the bucket every time a single object lands really a good idea?
upvoted 2 times
zanhsieh
Highly Voted 3 years, 9 months ago
A is on demand (triggered by hand). B's minimum interval is 1 hour. C's minimum is 1 minute, based on the cron schedule syntax. D could reach sub-minute latency, since it reacts to S3 new-object events.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html#RateExpressions
upvoted 9 times
NarenKA
Most Recent 1 year, 4 months ago
Selected Answer: D
Option A is not directly related to the issue of schema updates and would not address the staleness of data in the AWS Glue Data Catalog. Option B increases the frequency of crawls but still may not provide real-time updates. Option C is not practical or cost-effective due to the excessive number of crawler runs it would trigger, and the AWS Glue crawler cannot be scheduled to run every minute. Option D provides a dynamic, event-driven solution that ensures data analysts have access to the most current data available.
upvoted 1 times
NikkyDicky
1 year, 11 months ago
Selected Answer: D
It's a D.
upvoted 1 times
Bdtri
2 years, 1 month ago
Why does triggering the Glue crawler give us the latest data? Isn't it only updating the metastore?
upvoted 1 times
chinmayj213
1 year, 9 months ago
Yes, the metastore, but every file in the S3 bucket needs to be registered in the metastore before queries can pick up the latest data.
upvoted 2 times
pk349
2 years, 2 months ago
D: I passed the test
upvoted 1 times
kondi2309
1 year, 4 months ago
why D?
upvoted 1 times
lk23
2 years, 4 months ago
Very curious to know why, so far, so many suggested answers seem incorrect when the discussion reveals something else. Why?
upvoted 3 times
cloudlearnerhere
2 years, 8 months ago
Selected Answer: D
Correct answer is D as it is event-driven and would load the data as soon as the object-created event is triggered. Option A is wrong as this is still manual and on-demand. Option B is wrong as the refresh interval is still 1 hr. Option C is wrong as the minimum precision for the schedule is 5 mins.
upvoted 3 times
Abep
2 years, 10 months ago
Selected Answer: D
Answer: D
upvoted 1 times
rocky48
2 years, 11 months ago
Selected Answer: D
Answer: D
upvoted 1 times
Bik000
3 years, 1 month ago
Selected Answer: D
Answer is D
upvoted 1 times
jrheen
3 years, 2 months ago
Answer-D
upvoted 1 times
ShilaP
3 years, 3 months ago
D is correct
upvoted 1 times
aws2019
3 years, 7 months ago
The answer is D.
upvoted 1 times
iconara
3 years, 8 months ago
D seems correct, but could potentially be an expensive solution.
upvoted 1 times
Huy
3 years, 8 months ago
Although D is the correct answer, it should mention SQS: the crawler will not run fast enough to keep up with the rate at which objects are created.
upvoted 2 times
Shraddha
3 years, 8 months ago
This is a textbook question. A = wrong, won't work because the schema will still update only every 8 hours. B = wrong, not the most up-to-date. C = wrong, the minimum schedule is 5 minutes. https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
upvoted 2 times