Exam AWS Certified Data Analytics - Specialty topic 1 question 11 discussion

Exam question from Amazon's AWS Certified Data Analytics - Specialty

Question #: 11
Topic #: 1

A company that produces network devices has millions of users. Data is collected from the devices on an hourly basis and stored in an Amazon S3 data lake.
The company runs analyses on the last 24 hours of data flow logs for abnormality detection and to troubleshoot and resolve user issues. The company also analyzes historical logs dating back 2 years to discover patterns and look for improvement opportunities.
The data flow logs contain many metrics, such as date, timestamp, source IP, and target IP. There are about 10 billion events every day.
How should this data be stored for optimal performance?

A. In Apache ORC partitioned by date and sorted by source IP
B. In compressed .csv partitioned by date and sorted by source IP
C. In Apache Parquet partitioned by source IP and sorted by date
D. In compressed nested JSON partitioned by source IP and sorted by date

Suggested Answer: A

by testtaker3434 at Aug. 9, 2020, 1:56 p.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

zanhsieh

Highly Voted 3 years, 9 months ago

A. B and D are dropped because they are row-based formats. Choosing between ORC and Parquet would be tough since their performance is very close. However, the data should be partitioned by date and then sorted by source IP, so C is dropped.
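
To make the partitioning argument concrete, here is a minimal sketch, assuming Hive-style `date=YYYY-MM-DD` prefixes in the S3 data lake. The `logs/` prefix, the helper name, and the example date are hypothetical and chosen only for illustration. It shows how few prefixes a last-24-hours query has to scan when the data is partitioned by date.

```python
from datetime import date, timedelta


def date_partition_prefixes(end_day: date, days: int) -> list[str]:
    """Hive-style prefixes covering `days` calendar days ending at `end_day`."""
    return [
        f"logs/date={(end_day - timedelta(days=i)).isoformat()}/"
        for i in range(days)
    ]


# Roughly two years of daily partitions (730 prefixes, ignoring leap days).
all_partitions = date_partition_prefixes(date(2020, 8, 9), 730)

# A 24-hour window that does not start at midnight spans at most two calendar days.
recent = date_partition_prefixes(date(2020, 8, 9), 2)

print(recent)
print(f"{len(recent)} of {len(all_partitions)} partitions scanned")
```

With partitioning by source IP instead, every partition would hold rows from all dates, so the same 24-hour query would have to open every partition.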

upvoted 60 times

abhineet

3 years, 9 months ago

correct

upvoted 2 times

...

Paitan

Highly Voted 3 years, 9 months ago

ORC and Parquet are ideal here. But the data should be partitioned by date and sorted by source IP, not the other way round. So option A is the right choice.

upvoted 12 times

...

kondi2309

Most Recent 1 year, 4 months ago

Selected Answer: A

The ideal choices are ORC and Parquet, with Apache Parquet as the first choice. But here we have to consider partitioning, and partitioning by date and then sorting by source IP is the best way to store the data.

upvoted 2 times

...

MLCL

1 year, 11 months ago

Selected Answer: A

Between A and C: if we have more IPs than dates, I would go with A, since analysis is performed on a daily schedule and anomalies are detected over a time interval.

upvoted 1 times

...

NikkyDicky

1 year, 11 months ago

Selected Answer: A

It's A.

upvoted 1 times

...

pk349

2 years, 2 months ago

A: I passed the test

upvoted 2 times

...

anonymous909

2 years, 3 months ago

Option A: In Apache ORC partitioned by date and sorted by source IP. Partitioning the data by date allows for faster query performance and efficient data retrieval based on time periods. Sorting the data by source IP enables efficient filtering and joins on that attribute. Overall, ORC partitioned by date and sorted by source IP would provide efficient storage and querying of the data.
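
The point about sorting can be illustrated with a rough sketch. Columnar formats such as ORC keep min/max statistics for chunks of rows, and a reader can skip any chunk whose range cannot contain the value being filtered on. The code below is a pure-Python simulation, not the ORC reader itself: the integer-encoded IPs, the chunk size, and the function names are all assumptions made for illustration.

```python
import random


def chunk_stats(values, chunk_size):
    """Split values into fixed-size chunks and record (min, max) per chunk."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_read(stats, target):
    """Count chunks whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


random.seed(0)
# Hypothetical source IPs encoded as integers: 100,000 rows in one date partition.
ips = [random.randrange(1_000) for _ in range(100_000)]
target = 500

unsorted_reads = chunks_to_read(chunk_stats(ips, 10_000), target)
sorted_reads = chunks_to_read(chunk_stats(sorted(ips), 10_000), target)
print(f"unsorted: {unsorted_reads} chunks, sorted: {sorted_reads} chunks")
```

When the rows are unsorted, nearly every chunk spans the full range of IPs and has to be read. Once they are sorted by source IP, the rows for a given IP sit together, so only the chunk or two that contain them need to be read.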

upvoted 1 times

...

srirnag

2 years, 4 months ago

Option C is the best. The analysis is done on the last 24 hours of data, so sorting by IP may not be ideal.

upvoted 1 times

...

cloudlearnerhere

2 years, 8 months ago

Selected Answer: A

A is the right answer: the company does daily analysis, so it only needs to look at the data generated for a given date. C is wrong because partitioning by source IP does not fit this use case; partitioning by date is optimal. B and D are wrong because they are not columnar storage formats; they are row-based formats, which are not optimal for large data retrievals in complex analytical queries.
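
As a toy illustration of the row-based versus columnar point, the sketch below counts how many values must be touched to read a single column under each layout. The records and field values are made up, and real ORC or Parquet files add encoding and compression on top of this.

```python
# Eight made-up flow-log records using the fields named in the question.
rows = [
    {
        "date": "2020-08-09",
        "timestamp": 1596981360 + i,
        "source_ip": f"10.0.0.{i % 4}",
        "target_ip": "10.1.0.1",
    }
    for i in range(8)
]

# Row-oriented layout (like .csv or JSON): reading one field still means
# scanning every field of every record.
row_values_read = sum(len(record) for record in rows)

# Column-oriented layout (like ORC or Parquet): each column is stored
# contiguously, so a query on source_ip reads only that column.
columns = {name: [record[name] for record in rows] for name in rows[0]}
column_values_read = len(columns["source_ip"])

print(row_values_read, column_values_read)  # 32 8
```

The gap widens with the number of columns, which matters here because the flow logs are said to contain many metrics.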

upvoted 3 times

...

rocky48

2 years, 11 months ago

Selected Answer: A

A is the right answer

upvoted 1 times

...

dushmantha

3 years ago

Selected Answer: A

Agree with "zanhsieh"

upvoted 1 times

...

Ayaa4

3 years ago

Columnar formats such as ORC and Parquet are faster, so the answer is A.

upvoted 1 times

...

Bik000

3 years, 1 month ago

Selected Answer: A

Answer is A

upvoted 2 times

...

jrheen

3 years, 2 months ago

Answer : A

upvoted 1 times

...

rav009

3 years, 5 months ago

For the previous 24 hours, sorting by date as in C is not helpful. Sorting by timestamp would make sense.

upvoted 2 times

...

Donell

3 years, 8 months ago

Answer is A: in Apache ORC, partitioned by date and sorted by source IP. The company analyzes historical logs dating back 2 years as well as the past 24 hours of data, so the data should be partitioned by date and sorted by IP, not the other way around. ORC is columnar and hence the preferred data format.

upvoted 5 times

...

Shraddha

3 years, 8 months ago

B and D = wrong, use a columnar format. C = wrong, partition by date so historical data can be separated.

upvoted 2 times

...
