Exam AWS Certified Data Analytics - Specialty topic 1 question 68 discussion

An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance.
Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)

  • A. Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.
  • B. Use an S3 bucket in the same account as Athena.
  • C. Compress the objects to reduce the data transfer I/O.
  • D. Use an S3 bucket in the same Region as Athena.
  • E. Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.
  • F. Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.
Suggested Answer: CDF
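
For concreteness, options C and F can be applied together with a single Athena CTAS statement. A minimal sketch, assuming a hypothetical source table flights_csv already registered over the existing .csv objects and a hypothetical output bucket (SNAPPY is Athena's default Parquet codec, shown explicitly here):

    -- Hypothetical table and bucket names; adjust to your environment.
    -- CTAS rewrites the CSV data as compressed, columnar Parquet (options C and F).
    CREATE TABLE flights_parquet
    WITH (
        format = 'PARQUET',              -- columnar layout: fetch only needed blocks (F)
        parquet_compression = 'SNAPPY',  -- compressed objects: less data transfer I/O (C)
        external_location = 's3://example-airline-metrics/flights-parquet/'
    ) AS
    SELECT * FROM flights_csv;

Queries with selective predicates then read only the relevant column chunks, which is exactly the I/O reduction option F describes.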

Comments

carol1522
Highly Voted 3 years, 9 months ago
For me it is CDF.
upvoted 26 times
GauravM17
3 years, 9 months ago
Parquet files are compressed by default, which is covered under F. The answer should be A, D, F.
upvoted 4 times
...
...
Woong
Highly Voted 3 years, 9 months ago
A is no longer best practice. Quoting the AWS guidance: "previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes."
upvoted 14 times
Abep
2 years, 10 months ago
@Woong Thanks for quoting the excerpt. This makes option "A" incorrect. Sharing the link to this statement for anyone who wishes to verify: https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.pdf "This guidance supersedes any previous guidance on optimizing performance for Amazon S3. For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes."
upvoted 3 times
...
vicks316
3 years, 9 months ago
That's absolutely right, hence it should be C, D, F.
upvoted 3 times
...
...
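
The practical upshot of the guidance quoted in this thread: instead of hashing prefixes, lay keys out by date and let Athena treat those prefixes as partitions. A hedged sketch with hypothetical table, column, and bucket names, where sequential date-based keys (dt=2024-01-15/ and so on) double as partitions Athena can prune:

    -- Keys such as s3://example-airline-metrics/flights/dt=2024-01-15/part-0000.csv
    -- use sequential date-based naming and map directly onto Hive-style partitions.
    CREATE EXTERNAL TABLE flights_csv (
        flight_id         string,
        origin            string,
        destination       string,
        dep_delay_minutes int
    )
    PARTITIONED BY (dt string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://example-airline-metrics/flights/';

    -- Register the date partitions that already exist under the prefix.
    MSCK REPAIR TABLE flights_csv;

A predicate such as WHERE dt = '2024-01-15' then skips every other prefix entirely, and S3's per-prefix request scaling still applies to the date-based prefixes without any hashing.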
monkeydba
Most Recent 1 year, 7 months ago
The point about random strings is also covered in this useful link: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/introduction.html
upvoted 1 times
...
Debi_mishra
2 years, 1 month ago
C and F are no doubt easy answers. But I believe A and D are both correct. People quoting that randomized prefixes are no longer required are ignoring "sequential date-based naming", which is also not mentioned in the question, and Athena can run longer because S3 list operations take time without a well-distributed prefix.
upvoted 1 times
...
pk349
2 years, 2 months ago
CDF: I passed the test
upvoted 3 times
...
henom
2 years, 7 months ago
A, C, F. Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. In this scenario, millions of data points are stored on Amazon S3, and it is recommended to create a random string and add it to the beginning of the object prefixes to increase read performance for S3 objects.
upvoted 1 times
...
cloudlearnerhere
2 years, 8 months ago
Selected Answer: CDF
Correct answers are C, D & F.
Options C & F: using compression and a columnar data format helps improve query performance and optimize storage.
Option D: using Athena and S3 within the same Region helps with query performance and cost.
Option A is wrong because S3 now scales automatically and is no longer bounded by that restriction.
Option B is wrong because using the same account does not help optimize cost or query performance.
Option E is wrong because JSON, like CSV, is row-based and does not help optimize cost or query performance.
upvoted 8 times
...
rocky48
2 years, 11 months ago
Selected Answer: CDF
upvoted 1 times
...
GiveMeEz
3 years ago
Answer A can't be correct at all. Adding a randomized string will make partition sizes too small to reap the benefits. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
upvoted 1 times
...
f4bi4n
3 years, 1 month ago
Selected Answer: CDF
C, D, F fulfill our needs.
upvoted 1 times
...
Bik000
3 years, 1 month ago
Selected Answer: CDF
My Answer is C, D & F
upvoted 1 times
...
aws2019
3 years, 7 months ago
For me it is CDF.
upvoted 4 times
...
mickies9
3 years, 8 months ago
As a best practice, S3 and Athena should be in the same Region and account, and a columnar format is appropriate for performance. My answer would be B, D, F.
upvoted 1 times
...
jueueuergen
3 years, 8 months ago
CDF. Parquet compresses by default, yes, but there is also an "uncompressed" option. So C is not redundant.
upvoted 3 times
...
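
That distinction shows up directly in the CTAS syntax: the file format and the compression codec are separate properties, so choosing Parquet (F) does not by itself settle the compression choice (C). A sketch with hypothetical names, selecting GZIP explicitly rather than relying on the SNAPPY default:

    -- format and parquet_compression are independent knobs:
    -- F picks the columnar layout, C picks the codec.
    CREATE TABLE flights_parquet_gzip
    WITH (
        format = 'PARQUET',
        parquet_compression = 'GZIP',  -- smaller objects, somewhat higher CPU than SNAPPY
        external_location = 's3://example-airline-metrics/flights-parquet-gzip/'
    ) AS
    SELECT * FROM flights_csv;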
gunjan4392
3 years, 8 months ago
I think CDF
upvoted 1 times
...
Donell
3 years, 8 months ago
I go with C, D, F.
upvoted 1 times
...
DerekKey
3 years, 8 months ago
C is WRONG in my opinion.
1. How do you want to compress Apache Parquet, which is already compressed by default? We selected Parquet as the file format in F: "For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable."
2. We still need prefixes, but we don't have to randomize them. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. BUT you no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.
3. You cannot reduce "data transfer I/O". I/O represents an entity that sends/receives data; therefore, you can only reduce parameters of I/O, e.g. data transfer bandwidth, speed, or number of operations (e.g. IOPS).
upvoted 1 times
...
