Exam AWS Certified Data Analytics - Specialty topic 1 question 10 discussion

An insurance company has raw data in JSON format that is sent without a predefined schedule through an Amazon Kinesis Data Firehose delivery stream to an
Amazon S3 bucket. An AWS Glue crawler is scheduled to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to provide access to the most up-to-date data.
Which solution meets these requirements?

  • A. Create an external schema based on the AWS Glue Data Catalog on the existing Amazon Redshift cluster to query new data in Amazon S3 with Amazon Redshift Spectrum.
  • B. Use Amazon CloudWatch Events with the rate (1 hour) expression to execute the AWS Glue crawler every hour.
  • C. Using the AWS CLI, modify the execution schedule of the AWS Glue crawler from 8 hours to 1 minute.
  • D. Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.
Suggested Answer: D
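The event-driven pattern in option D can be sketched as a small Lambda handler that starts the crawler whenever Firehose writes a new object to the bucket. This is a minimal illustration, not the exam's reference implementation: the crawler name `insurance-json-crawler` and the helper `object_keys` are assumptions.

```python
def object_keys(event):
    """Extract the object keys from an S3:ObjectCreated:* event payload."""
    return [r["s3"]["object"]["key"] for r in event.get("Records", [])]

def lambda_handler(event, context):
    import boto3  # imported lazily so the module also loads outside Lambda
    glue = boto3.client("glue")
    keys = object_keys(event)
    if not keys:
        return {"started": False, "keys": []}
    try:
        # If a crawl is already in progress, StartCrawler raises
        # CrawlerRunningException; that in-flight run will pick up the
        # newly created objects anyway, so we treat it as a no-op.
        glue.start_crawler(Name="insurance-json-crawler")  # assumed name
        return {"started": True, "keys": keys}
    except glue.exceptions.CrawlerRunningException:
        return {"started": False, "keys": keys}
```

To wire this up, the S3 bucket's `S3:ObjectCreated:*` event notification is pointed at the function. Swallowing `CrawlerRunningException` also addresses the concern raised in the comments below about Firehose delivering objects faster than the crawler can run.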

Comments

singh100
Highly Voted 3 years, 9 months ago
Answer: D. Data analysts use Apache Spark SQL on Amazon EMR to analyze the data stored in S3 in JSON format. A JSON file landing in S3 triggers a Lambda function, which invokes the Glue crawler.
upvoted 37 times
chinmayj213
1 year, 9 months ago
D is correct out of all these answers, but is running a crawler on the bucket every time a single object lands really a good idea?
upvoted 2 times
zanhsieh
Highly Voted 3 years, 9 months ago
A is on demand (triggered by hand). B's minimum interval is 1 hour. C's minimum is 1 minute, based on the cron schedule syntax. D could reach sub-minute latency, since it reacts to S3 new-object events.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html#RateExpressions
upvoted 9 times
NarenKA
Most Recent 1 year, 4 months ago
Selected Answer: D
Option A is not directly related to the issue of schema updates and would not address the staleness of data in the AWS Glue Data Catalog. Option B increases the frequency of crawls but still may not provide real-time updates. Option C is not practical or cost-effective due to the excessive number of crawler runs it would trigger, and the AWS Glue crawler cannot be scheduled to run every minute. Option D provides a dynamic, event-driven solution that ensures data analysts have access to the most current data available.
upvoted 1 times
NikkyDicky
1 year, 11 months ago
Selected Answer: D
It's a D.
upvoted 1 times
Bdtri
2 years, 1 month ago
Why does triggering the Glue crawler give us the latest data? Isn't it only updating the metastore?
upvoted 1 times
chinmayj213
1 year, 9 months ago
Yes, the metastore, but every file in the S3 bucket needs to be registered in the metastore before queries can pick up the latest data.
upvoted 2 times
pk349
2 years, 2 months ago
D: I passed the test
upvoted 1 times
kondi2309
1 year, 4 months ago
why D?
upvoted 1 times
lk23
2 years, 4 months ago
Very curious to know why, so far, so many suggested answers seem incorrect when the discussion reveals something else. Why?
upvoted 3 times
cloudlearnerhere
2 years, 8 months ago
Selected Answer: D
Correct answer is D as it is event-driven and would load the data as soon as the object-created event is triggered. Option A is wrong as this is still manual and on-demand. Option B is wrong as the refresh interval is still 1 hr. Option C is wrong as the minimum precision for the schedule is 5 mins.
upvoted 3 times
Abep
2 years, 10 months ago
Selected Answer: D
Answer: D
upvoted 1 times
rocky48
2 years, 11 months ago
Selected Answer: D
Answer: D
upvoted 1 times
Bik000
3 years, 1 month ago
Selected Answer: D
Answer is D
upvoted 1 times
jrheen
3 years, 2 months ago
Answer-D
upvoted 1 times
ShilaP
3 years, 3 months ago
D is correct
upvoted 1 times
aws2019
3 years, 7 months ago
The answer is D.
upvoted 1 times
iconara
3 years, 8 months ago
D seems correct, but could potentially be an expensive solution.
upvoted 1 times
Huy
3 years, 8 months ago
Although D is the correct answer, it should mention SQS: the crawler will not run fast enough to keep up with the rate at which objects are created.
upvoted 2 times
Shraddha
3 years, 8 months ago
This is a textbook question. A = wrong, won't work because the schema will still update only every 8 hours. B = wrong, not the most up-to-date. C = wrong, the minimum schedule is 5 minutes. https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
upvoted 2 times