Exam AWS Certified Data Analytics - Specialty topic 1 question 68 discussion

An airline has been collecting metrics on flight activities for analytics. A recently completed proof of concept demonstrates how the company provides insights to data analysts to improve on-time departures. The proof of concept used objects in Amazon S3, which contained the metrics in .csv format, and used Amazon Athena for querying the data. As the amount of data increases, the data analyst wants to optimize the storage solution to improve query performance.
Which options should the data analyst use to improve performance as the data lake grows? (Choose three.)

  • A. Add a randomized string to the beginning of the keys in S3 to get more throughput across partitions.
  • B. Use an S3 bucket in the same account as Athena.
  • C. Compress the objects to reduce the data transfer I/O.
  • D. Use an S3 bucket in the same Region as Athena.
  • E. Preprocess the .csv data to JSON to reduce I/O by fetching only the document keys needed by the query.
  • F. Preprocess the .csv data to Apache Parquet to reduce I/O by fetching only the data blocks needed for predicates.
Suggested Answer: CDF
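
For concreteness, options C and F can be applied together with a single Athena CTAS statement. A minimal sketch, assuming a hypothetical source table flights_csv already registered over the existing .csv objects and a hypothetical output bucket (SNAPPY is Athena's default Parquet codec, shown explicitly here):

    -- Hypothetical table and bucket names; adjust to your environment.
    -- CTAS rewrites the CSV data as compressed, columnar Parquet (options C and F).
    CREATE TABLE flights_parquet
    WITH (
        format = 'PARQUET',              -- columnar layout: fetch only needed blocks (F)
        parquet_compression = 'SNAPPY',  -- compressed objects: less data transfer I/O (C)
        external_location = 's3://example-airline-metrics/flights-parquet/'
    ) AS
    SELECT * FROM flights_csv;

Queries with selective predicates then read only the relevant column chunks, which is exactly the I/O reduction option F describes.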

Comments

carol1522
Highly Voted 3 years, 9 months ago
For me it is CDF.
upvoted 26 times
GauravM17
3 years, 9 months ago
Parquet files are compressed by default, which is covered under F. The answer should be A, D, F.
upvoted 4 times
...
...
Woong
Highly Voted 3 years, 9 months ago
A is no longer best practice. Quoting the AWS guidance: "previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes."
upvoted 14 times
Abep
2 years, 10 months ago
@Woong Thanks for quoting the excerpt. This makes option "A" incorrect. Sharing the link to this statement for anyone who wishes to verify: https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.pdf "This guidance supersedes any previous guidance on optimizing performance for Amazon S3. For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes."
upvoted 3 times
...
vicks316
3 years, 9 months ago
That's absolutely right, hence it should be C, D, F.
upvoted 3 times
...
...
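
The practical upshot of the guidance quoted in this thread: instead of hashing prefixes, lay keys out by date and let Athena treat those prefixes as partitions. A hedged sketch with hypothetical table, column, and bucket names, where sequential date-based keys (dt=2024-01-15/ and so on) double as partitions Athena can prune:

    -- Keys such as s3://example-airline-metrics/flights/dt=2024-01-15/part-0000.csv
    -- use sequential date-based naming and map directly onto Hive-style partitions.
    CREATE EXTERNAL TABLE flights_csv (
        flight_id         string,
        origin            string,
        destination       string,
        dep_delay_minutes int
    )
    PARTITIONED BY (dt string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://example-airline-metrics/flights/';

    -- Register the date partitions that already exist under the prefix.
    MSCK REPAIR TABLE flights_csv;

A predicate such as WHERE dt = '2024-01-15' then skips every other prefix entirely, and S3's per-prefix request scaling still applies to the date-based prefixes without any hashing.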
monkeydba
Most Recent 1 year, 7 months ago
The point about random strings is also covered in this useful link: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/introduction.html
upvoted 1 times
...
Debi_mishra
2 years, 1 month ago
C and F are no doubt easy answers. But I believe A and D are both correct. People quoting that randomized prefixes are no longer required are ignoring "sequential date-based naming", which is also not mentioned in the question, and Athena can run longer because S3 list operations take time without a well-distributed prefix.
upvoted 1 times
...
pk349
2 years, 2 months ago
CDF: I passed the test
upvoted 3 times
...
henom
2 years, 7 months ago
A, C, F. Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. In this scenario, millions of data points are stored on Amazon S3, and it is recommended to create a random string and add it to the beginning of the object prefixes to increase read performance for S3 objects.
upvoted 1 times
...
cloudlearnerhere
2 years, 8 months ago
Selected Answer: CDF
Correct answers are C, D & F.
Options C & F: using compression and a columnar data format helps improve query performance and optimize storage.
Option D: using Athena and S3 within the same Region helps with query performance and cost.
Option A is wrong because S3 now scales automatically and is no longer bounded by that restriction.
Option B is wrong because using the same account does not help optimize cost or query performance.
Option E is wrong because JSON, like CSV, is row-based and does not help optimize cost or query performance.
upvoted 8 times
...
rocky48
2 years, 11 months ago
Selected Answer: CDF
upvoted 1 times
...
GiveMeEz
3 years ago
Answer A can't be correct at all. Adding a randomized string will make partition sizes too small to reap the benefits. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
upvoted 1 times
...
f4bi4n
3 years, 1 month ago
Selected Answer: CDF
C, D, F fulfill our needs.
upvoted 1 times
...
Bik000
3 years, 1 month ago
Selected Answer: CDF
My Answer is C, D & F
upvoted 1 times
...
aws2019
3 years, 7 months ago
For me it is CDF.
upvoted 4 times
...
mickies9
3 years, 8 months ago
As a best practice, S3 and Athena should be in the same Region and account, and a columnar format is appropriate for performance. My answer would be B, D, F.
upvoted 1 times
...
jueueuergen
3 years, 8 months ago
CDF. Parquet compresses by default, yes, but there is also an "uncompressed" option. So C is not redundant.
upvoted 3 times
...
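
That distinction shows up directly in the CTAS syntax: the file format and the compression codec are separate properties, so choosing Parquet (F) does not by itself settle the compression choice (C). A sketch with hypothetical names, selecting GZIP explicitly rather than relying on the SNAPPY default:

    -- format and parquet_compression are independent knobs:
    -- F picks the columnar layout, C picks the codec.
    CREATE TABLE flights_parquet_gzip
    WITH (
        format = 'PARQUET',
        parquet_compression = 'GZIP',  -- smaller objects, somewhat higher CPU than SNAPPY
        external_location = 's3://example-airline-metrics/flights-parquet-gzip/'
    ) AS
    SELECT * FROM flights_csv;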
gunjan4392
3 years, 8 months ago
I think CDF
upvoted 1 times
...
Donell
3 years, 8 months ago
I go with C, D, F.
upvoted 1 times
...
DerekKey
3 years, 8 months ago
C is WRONG in my opinion.
1. How do you want to compress Apache Parquet, which is already compressed by default? We selected Parquet as the file format in F: "For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable."
2. We still need prefixes, but we don't have to randomize them. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. BUT you no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.
3. You cannot reduce "data transfer I/O". I/O represents an entity that sends/receives data; therefore, you can only reduce parameters of I/O, e.g. data transfer bandwidth, speed, or number of operations (e.g. IOPS).
upvoted 1 times
...
