Exam AWS Certified Data Analytics - Specialty topic 1 question 125 discussion

An education provider's learning management system (LMS) is hosted in a 100 TB data lake that is built on Amazon S3. The provider's LMS supports hundreds of schools. The provider wants to build an advanced analytics reporting platform using Amazon Redshift to handle complex queries with optimal performance.
System users will query the most recent 4 months of data 95% of the time while 5% of the queries will leverage data from the previous 12 months.
Which solution meets these requirements in the MOST cost-effective way?

  • A. Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift Spectrum to query data in the data lake. Use S3 lifecycle management rules to store data from the previous 12 months in Amazon S3 Glacier storage.
  • B. Leverage DS2 nodes for the Amazon Redshift cluster. Migrate all data from Amazon S3 to Amazon Redshift. Decommission the data lake.
  • C. Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift Spectrum to query data in the data lake. Ensure the S3 Standard storage class is in use with objects in the data lake.
  • D. Store the most recent 4 months of data in the Amazon Redshift cluster. Use Amazon Redshift federated queries to join cluster data with the data lake to reduce costs. Ensure the S3 Standard storage class is in use with objects in the data lake.
Suggested Answer: C
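As a rough illustration of the tiered design in option C, the routing rule can be sketched in Python: queries touching only the most recent 4 months are served by local Redshift tables, while anything older is scanned in place on S3 via a Redshift Spectrum external schema. The function and tier names below are hypothetical, and the 4-month window is approximated as 120 days.

```python
from datetime import date, timedelta

# Sketch of option C's tiering rule (hypothetical names): the most recent
# ~4 months live in local Redshift tables; older data stays in the S3 data
# lake and is scanned by a Redshift Spectrum external table.
HOT_WINDOW_DAYS = 4 * 30  # ~4 months kept in the cluster

def pick_tier(oldest_date_needed: date, today: date) -> str:
    """Return which storage tier can satisfy a query."""
    hot_cutoff = today - timedelta(days=HOT_WINDOW_DAYS)
    if oldest_date_needed >= hot_cutoff:
        return "redshift_local"      # ~95% of queries land here
    return "spectrum_external"       # remaining ~5% scan S3 directly

today = date(2024, 6, 1)
print(pick_tier(date(2024, 4, 1), today))  # -> redshift_local
print(pick_tier(date(2023, 9, 1), today))  # -> spectrum_external
```

In practice the split is transparent to users: Spectrum external tables and local tables can be queried, and even joined, in the same SQL statement, so no application-level routing is required.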

Comments

srinivasa
Highly Voted 3 years, 9 months ago
Answer: C
upvoted 16 times
...
cloudlearnerhere
Highly Voted 2 years, 8 months ago
The correct answer is C. Since 95% of queries need only the last 4 months of data, that data can be stored in the Redshift cluster. The full 12 months of data can remain in S3 and be queried with Redshift Spectrum. This mix of S3 and Redshift is the most cost-effective option. https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html Option A is wrong because Redshift Spectrum cannot query data in S3 Glacier storage. Option B is wrong because keeping all the data in Redshift is not cost-effective. Option D is wrong because, although Redshift federated queries would work, for the 5% of queries it is more cost-effective to query S3 directly than to join cluster data with S3.
upvoted 9 times
...
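For reference, the lifecycle rule that option A proposes would look roughly like the configuration below, written in the shape that boto3's `put_bucket_lifecycle_configuration` expects (the rule ID and prefix are hypothetical). The comment above's objection is that once objects transition to the GLACIER storage class, Redshift Spectrum can no longer scan them, so the 5% of queries over older data would fail.

```python
# Hypothetical S3 lifecycle configuration in the shape expected by
# boto3's put_bucket_lifecycle_configuration. Transitioning lake objects
# to GLACIER (as option A suggests) makes them unreadable to Redshift
# Spectrum, which only queries standard-access S3 storage classes.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-older-than-4-months",   # hypothetical rule ID
            "Status": "Enabled",
            "Filter": {"Prefix": "lms-data/"},     # hypothetical prefix
            "Transitions": [
                {"Days": 120, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

rule = lifecycle_config["Rules"][0]
print(rule["Transitions"][0]["StorageClass"])  # -> GLACIER
```

This is why option C keeps the lake objects in S3 Standard: the older data stays directly queryable by Spectrum, at the cost of slightly higher storage prices than Glacier.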
chinmayj213
Most Recent 1 year, 4 months ago
Option A cannot be the answer because 5% of queries still use the older data regularly, and Glacier requires a 90-to-180-day minimum storage period. Option D: federated queries make sense when there are multiple data sources, since they require extra authentication and authorization setup; here we only have S3, so we can go with option C.
upvoted 2 times
...
roymunson
1 year, 7 months ago
I don't get it. The question says: "System users will query the most recent 4 months of data 95% of the time while 5% of the queries will leverage data from the previous 12 months." So 95% of queries use data from the previous 4 months, 5% use data from the previous 12 months, and 0% use data older than 12 months. So why not archive the older data?
upvoted 1 times
roymunson
1 year, 7 months ago
Ah, now I get it: they want to use Glacier to store the data from the previous 12 months.
upvoted 1 times
...
...
pk349
2 years, 2 months ago
C: I passed the test
upvoted 1 times
...
Arjun777
2 years, 4 months ago
Option C suggests storing the most recent 4 months of data in the Amazon Redshift cluster and using Amazon Redshift Spectrum to query the data lake, with the S3 Standard storage class on the lake objects. While this approach would work, it may not be the most cost-effective way to meet the requirements, because storing data in the Redshift cluster is more expensive than storing it in S3. Additionally, by keeping all data in the data lake, you may be able to use other data analysis services to query it, which can be cheaper than using Amazon Redshift. Therefore option A, which uses Amazon Redshift Spectrum to query the data lake and S3 lifecycle management rules to move data from the previous 12 months to Amazon S3 Glacier, is likely a more cost-effective solution.
upvoted 4 times
...
rocky48
2 years, 11 months ago
Selected Answer: C
upvoted 1 times
...
Alekx42
2 years, 12 months ago
Selected Answer: C
C is the answer. If you have to join old data coming from S3 with new data coming from the Redshift cluster, you can do that. It is described here: https://catalog.us-east-1.prod.workshops.aws/workshops/e5548031-3004-49ad-89be-a13e8cd616f6/en-US/perform-analytics-on-your-data/join-and-query-data-with-redshift-spectrum
upvoted 1 times
...
GiveMeEz
3 years ago
Not D. https://docs.aws.amazon.com/redshift/latest/dg/federated-limitations.html
upvoted 1 times
...
Bik000
3 years, 1 month ago
Selected Answer: C
Answer C should be correct
upvoted 1 times
...
certificationJunkie
3 years, 1 month ago
C. There is no mention of joining old and new data. Hence no need for federated queries.
upvoted 1 times
certificationJunkie
3 years, 1 month ago
Federated queries are for databases and are not specific to S3. The requirement here is S3, so Spectrum works fine.
upvoted 1 times
...
...
Shammy45
3 years, 1 month ago
Selected Answer: C
It's C, a textbook case for Spectrum.
upvoted 1 times
...
MWL
3 years, 2 months ago
Selected Answer: D
I think D is correct. Using federated queries to combine Redshift and Spectrum (over S3) will be more cost-effective.
upvoted 1 times
...
yogen
3 years, 6 months ago
A: data older than 12 months is not required, so Glacier is the most cost-effective for it. Redshift Spectrum is used for querying data that is 4 to 12 months old, and data up to 4 months old is queried from the Redshift cluster.
upvoted 2 times
yogen
3 years, 6 months ago
I correct myself: I misread it as moving the data from the last 12 months to Glacier. The answer is C.
upvoted 2 times
...
...
damaldon
3 years, 7 months ago
The question asks to "efficiently handle complicated queries". Option A recommends an S3 lifecycle rule, but I don't think Glacier would be efficient, even with expedited retrieval (1-5 minutes). I will go for C.
upvoted 3 times
...
tobsam
3 years, 7 months ago
ali98 on point. Answer is C
upvoted 1 times
...
awsmani
3 years, 7 months ago
Isn't it D? The 5% of users should be able to query both. Moving the 4 months of data to Redshift makes sense, but for the 12 months of data you need to query Redshift plus the data lake, and federated queries can help with that. If option C had stated that Redshift Spectrum queries S3 as an external schema, then choosing C would make sense, but I don't read it that way.
upvoted 4 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other