Exam AWS Certified Solutions Architect - Professional SAP-C02 topic 1 question 281 discussion

Exam question from Amazon's AWS Certified Solutions Architect - Professional SAP-C02

Question #: 281
Topic #: 1

A company is collecting a large amount of data from a fleet of IoT devices. Data is stored as Optimized Row Columnar (ORC) files in the Hadoop Distributed File System (HDFS) on a persistent Amazon EMR cluster. The company's data analytics team queries the data by using SQL in Apache Presto deployed on the same EMR cluster. Queries scan large amounts of data, always run for less than 15 minutes, and run only between 5 PM and 10 PM.

The company is concerned about the high cost associated with the current solution. A solutions architect must propose the most cost-effective solution that will allow SQL data queries.

Which solution will meet these requirements?

A. Store data in Amazon S3. Use Amazon Redshift Spectrum to query data.
B. Store data in Amazon S3. Use the AWS Glue Data Catalog and Amazon Athena to query data.
C. Store data in EMR File System (EMRFS). Use Presto in Amazon EMR to query data.
D. Store data in Amazon Redshift. Use Amazon Redshift to query data.

Suggested Answer: B

by psyx21 at June 21, 2023, 11 a.m.

Comments

Alabi

Highly Voted 2 years ago

Selected Answer: B

Storing the data in Amazon S3 is a cost-effective solution compared to running a persistent EMR cluster with HDFS. The AWS Glue Data Catalog provides a centralized metadata repository for organizing and cataloging data in S3. Amazon Athena is a serverless query service that allows you to run SQL queries directly against data in S3 without the need for a dedicated cluster or infrastructure. By using Amazon Athena, you only pay for the queries you run, which aligns with the requirement of cost-effectiveness.

upvoted 6 times

1 year, 3 months ago

Selected Answer: B

Option B - Athena can connect to your data stored in Amazon S3 using the AWS Glue Data Catalog to store metadata such as table and column names. After the connection is made, your databases, tables, and views appear in Athena's query editor. https://docs.aws.amazon.com/athena/latest/ug/data-sources-glue.html
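For illustration, here is a minimal sketch of what querying the S3 data through Athena could look like from Python. The database name `iot_db`, the table `device_readings`, and the results bucket are hypothetical and not part of the question; the sketch assumes the table is already registered in the Glue Data Catalog, and `run_query` additionally needs boto3 plus AWS credentials.

```python
def build_athena_request(sql, database, output_s3_uri):
    """Build keyword arguments for Athena's StartQueryExecution API call."""
    return {
        "QueryString": sql,
        # The database is looked up in the AWS Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(sql, database, output_s3_uri):
    """Submit the query and return its execution ID (requires AWS access)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_request(sql, database, output_s3_uri)
    )
    return response["QueryExecutionId"]


# Hypothetical names, used only for illustration.
request = build_athena_request(
    sql="SELECT device_id, avg(temperature) FROM device_readings GROUP BY device_id",
    database="iot_db",
    output_s3_uri="s3://example-athena-results/",
)
```

No cluster has to be running for this to work, which is the point the comment above is making: the only moving parts are the catalog entry, the files in S3, and the query itself.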

upvoted 2 times

kejam

1 year, 5 months ago

Selected Answer: C

The question doesn't provide enough info to calculate the answer. We need to know how large the EMR cluster is, how many queries run per day, and how many TBs/PBs of data each query scans. However, I'm leaning towards... Answer C: Store data in EMR File System (EMRFS). Use Presto in Amazon EMR to query data. EMRFS is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. The company could switch to EMRFS, continue to use Presto (which comes included in EMR), and turn off the clusters when not in use, while the data persists in EMRFS (S3). EMR comes in many flavors with different price points (EC2, Serverless) and is geared more towards daily data pipelines like the one this company is running. Regarding B: Athena is serverless and great for ad-hoc queries, but it is not cheap.
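To make the "not enough info" point concrete, here is a rough back-of-the-envelope sketch of the comparison. Every number in it is an assumption for illustration only: the node count, the per-node hourly price, the data scanned per query, and the Athena price per TB are not given in the question and should be checked against current AWS pricing. Only the 5-hour window (5 PM to 10 PM) comes from the question.

```python
def athena_daily_cost(tb_scanned_per_query, queries_per_day, price_per_tb_usd):
    """Athena bills by data scanned, so cost scales with query volume."""
    return tb_scanned_per_query * queries_per_day * price_per_tb_usd


def emr_daily_cost(node_count, price_per_node_hour_usd, hours_per_day):
    """A cluster bills per node-hour whether or not queries are running."""
    return node_count * price_per_node_hour_usd * hours_per_day


def break_even_queries_per_day(cluster_daily_cost, tb_scanned_per_query, price_per_tb_usd):
    """Number of daily queries at which Athena costs as much as the cluster."""
    return cluster_daily_cost / (tb_scanned_per_query * price_per_tb_usd)


# All values below are hypothetical assumptions, not figures from the question.
NODES = 10                # assumed cluster size
NODE_HOUR_USD = 0.30      # assumed EC2 + EMR price per node-hour
TB_PER_QUERY = 0.5        # assumed data scanned per query
ATHENA_USD_PER_TB = 5.0   # assumed Athena price per TB scanned

persistent_emr = emr_daily_cost(NODES, NODE_HOUR_USD, 24)  # current 24/7 setup
transient_emr = emr_daily_cost(NODES, NODE_HOUR_USD, 5)    # only 5 PM to 10 PM
athena = athena_daily_cost(TB_PER_QUERY, 20, ATHENA_USD_PER_TB)
break_even = break_even_queries_per_day(transient_emr, TB_PER_QUERY, ATHENA_USD_PER_TB)

print(f"persistent EMR: ${persistent_emr:.2f}/day")
print(f"transient EMR:  ${transient_emr:.2f}/day")
print(f"Athena:         ${athena:.2f}/day at 20 queries")
print(f"break-even:     {break_even:.1f} queries/day vs transient EMR")
```

With different assumptions the ranking between B and C flips, which is why the missing query volume and scan size matter. Either option avoids paying for a persistent cluster around the clock.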

upvoted 1 times

CProgrammer

1 year, 6 months ago

Storing data in Redshift is significantly more expensive than storing it in S3. However, according to https://docs.aws.amazon.com/redshift/latest/gsg/data-lake.html, you can use Amazon Redshift Spectrum to query data in Amazon S3 files without having to load the data into Amazon Redshift tables. As for Athena: while it is cost-effective for occasional ad-hoc queries, its serverless architecture may not be as performant for frequent, resource-intensive queries (the question states that "Queries scan large amounts of data").

upvoted 2 times

career360guru

1 year, 7 months ago

Selected Answer: B

B is the most cost-effective. A (Redshift Spectrum) can be a good option, but it needs a Redshift cluster, which may be more expensive. One piece of information missing from the question is how many queries/sec are run. If there is a large number of queries/sec, then A can be the better choice.

upvoted 3 times

2 years ago

Selected Answer: B

The correct answer is B.

upvoted 1 times
