Exam AWS Certified Data Analytics - Specialty topic 1 question 11 discussion

A company that produces network devices has millions of users. Data is collected from the devices on an hourly basis and stored in an Amazon S3 data lake.
The company runs analyses on the last 24 hours of data flow logs for abnormality detection and to troubleshoot and resolve user issues. The company also analyzes historical logs dating back 2 years to discover patterns and look for improvement opportunities.
The data flow logs contain many metrics, such as date, timestamp, source IP, and target IP. There are about 10 billion events every day.
How should this data be stored for optimal performance?

  • A. In Apache ORC partitioned by date and sorted by source IP
  • B. In compressed .csv partitioned by date and sorted by source IP
  • C. In Apache Parquet partitioned by source IP and sorted by date
  • D. In compressed nested JSON partitioned by source IP and sorted by date
Suggested Answer: A

Comments

zanhsieh
Highly Voted 3 years, 9 months ago
A. B and D are out because they are row-based formats. Choosing between ORC and Parquet is tough since their performance is very close. However, the data should be partitioned by date and then sorted by source IP, so C is out as well.
upvoted 60 times
abhineet
3 years, 9 months ago
correct
upvoted 2 times
Paitan
Highly Voted 3 years, 9 months ago
ORC and Parquet are ideal here. But the data should be partitioned by Date and sorted on IP and not the other way round. So option A is the right choice.
upvoted 12 times
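To make "partitioned by date, sorted by IP" concrete, here is a minimal sketch in plain Python. It is not how ORC or S3 work internally; the sample events, the `date=YYYY-MM-DD` key naming, and the helper names `partition_key`, `build_layout`, and `partitions_for_range` are all assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date
from ipaddress import ip_address

# Hypothetical sample events: (event_date, timestamp, source_ip, target_ip).
events = [
    (date(2024, 5, 2), "01:00:00", "10.0.0.10", "172.16.0.1"),
    (date(2024, 5, 1), "13:00:00", "10.0.0.7", "172.16.0.2"),
    (date(2024, 5, 2), "00:00:00", "10.0.0.9", "172.16.0.3"),
    (date(2024, 5, 1), "09:00:00", "10.0.0.1", "172.16.0.4"),
]

def partition_key(day):
    """Hive-style partition name for one calendar date (naming is an assumption)."""
    return f"date={day.isoformat()}"

def build_layout(rows):
    """Group rows by date, then sort each group by source IP."""
    layout = defaultdict(list)
    for row in rows:
        layout[row[0]].append(row)
    for day in layout:
        # Compare addresses numerically so 10.0.0.9 sorts before 10.0.0.10.
        layout[day].sort(key=lambda r: ip_address(r[2]))
    return dict(layout)

def partitions_for_range(layout, start, end):
    """Return only the partition names whose date falls in [start, end]."""
    return [partition_key(d) for d in sorted(layout) if start <= d <= end]

layout = build_layout(events)
recent = partitions_for_range(layout, date(2024, 5, 2), date(2024, 5, 2))
```

A query over the most recent day only has to look inside the partitions listed in `recent`; every other date is skipped without being opened.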
kondi2309
Most Recent 1 year, 4 months ago
Selected Answer: A
The ideal choices are ORC and Parquet, with Apache Parquet as the first choice. But here we have to consider partitioning, and partitioning by date and then sorting on IP is the best way to store the data.
upvoted 2 times
MLCL
1 year, 11 months ago
Selected Answer: A
Between A and C, if we have more IPs than dates, I would go with A, since analysis is performed on a daily schedule and anomalies are detected over a time interval.
upvoted 1 times
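The "more IPs than dates" point can be put into rough numbers. The event rate and the 2-year retention come from the question; the count of distinct source IPs is NOT given there, so the 1,000,000 below is purely an assumption (the question only says "millions of users").

```python
EVENTS_PER_DAY = 10_000_000_000          # "about 10 billion events every day"
RETENTION_DAYS = 2 * 365                 # two years, ignoring leap days
ASSUMED_DISTINCT_SOURCE_IPS = 1_000_000  # assumption, not stated in the question

# Number of partitions under each scheme.
date_partitions = RETENTION_DAYS
ip_partitions = ASSUMED_DISTINCT_SOURCE_IPS

# A "last 24 hours" window can straddle at most two calendar dates,
# but any source-IP partition may contain rows from that window.
date_partitions_touched_24h = 2
ip_partitions_touched_24h = ip_partitions

total_events = EVENTS_PER_DAY * RETENTION_DAYS
avg_events_per_ip_partition = total_events // ip_partitions

print(date_partitions, ip_partitions)
print(date_partitions_touched_24h, ip_partitions_touched_24h)
print(avg_events_per_ip_partition)
```

Under that assumption, the daily analysis touches 2 date partitions versus up to 1,000,000 source-IP partitions, which is the gap the comment is pointing at.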
NikkyDicky
1 year, 11 months ago
Selected Answer: A
It's A.
upvoted 1 times
pk349
2 years, 2 months ago
A: I passed the test
upvoted 2 times
anonymous909
2 years, 3 months ago
Option A: In Apache ORC partitioned by date and sorted by source IP. Partitioning the data by date allows for faster query performance and efficient data retrieval based on time periods. Sorting the data by source IP enables efficient filtering and joins on that attribute. Overall, ORC partitioned by date and sorted by source IP would provide efficient storage and querying of the data.
upvoted 1 times
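One way to see why sorting helps filtering: ORC keeps min/max statistics per chunk of data, so a reader can skip chunks whose range cannot contain the value it is looking for. The toy below only imitates that idea in plain Python. The integers stand in for source IPs, and the chunk size and helper names (`chunk_stats`, `chunks_to_read`) are assumptions for illustration, not ORC's actual layout or API.

```python
def chunk_stats(values, chunk_size):
    """Return a (min, max) pair for each fixed-size chunk of values."""
    stats = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        stats.append((min(chunk), max(chunk)))
    return stats

def chunks_to_read(stats, target):
    """Count chunks whose [min, max] range could contain the target."""
    return sum(1 for low, high in stats if low <= target <= high)

# 10,000 deterministic pseudo-shuffled "source IPs" in the range 0..999.
unsorted_ips = [(i * 37) % 1000 for i in range(10_000)]
sorted_ips = sorted(unsorted_ips)

CHUNK_SIZE = 500  # assumed chunk size for this toy
TARGET_IP = 500   # the value a filter is looking for

unsorted_reads = chunks_to_read(chunk_stats(unsorted_ips, CHUNK_SIZE), TARGET_IP)
sorted_reads = chunks_to_read(chunk_stats(sorted_ips, CHUNK_SIZE), TARGET_IP)
total_chunks = len(chunk_stats(sorted_ips, CHUNK_SIZE))
```

In this toy, the unsorted data forces a look into all 20 chunks, while the sorted data narrows the filter down to a single chunk.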
srirnag
2 years, 4 months ago
Option C is the best. The analysis is done on last 24 hours of data. Hence, sorting by IP may not be ideal.
upvoted 1 times
cloudlearnerhere
2 years, 8 months ago
Selected Answer: A
A is the right answer, as the company does daily analysis, so it only needs to look at the data generated for a given date. C is wrong, as partitioning by source IP is incorrect for this use case and partitioning by date is optimal. B and D are wrong: neither is a columnar storage format; they are row-based formats that are not optimal for big data retrieval in complex analytical queries.
upvoted 3 times
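For anyone unsure what "row-based vs columnar" buys you, here is a small sketch that counts how many stored values a scan passes over when a query needs just one field. It is a toy model in plain Python, not a real CSV, JSON, or ORC reader; the sample records and function names are made up for illustration.

```python
FIELDS = ("date", "timestamp", "source_ip", "target_ip")

# Hypothetical flow-log records.
records = [
    ("2024-05-01", "09:00:00", "10.0.0.1", "172.16.0.4"),
    ("2024-05-01", "13:00:00", "10.0.0.7", "172.16.0.2"),
    ("2024-05-02", "00:00:00", "10.0.0.2", "172.16.0.3"),
]

# Row-oriented layout: whole records stored one after another.
row_store = list(records)

# Column-oriented layout: one contiguous list per field.
column_store = {name: [r[i] for r in records] for i, name in enumerate(FIELDS)}

def scan_row_store(store, field):
    """Read one field from a row layout; every field of every record is passed over."""
    index = FIELDS.index(field)
    values, touched = [], 0
    for record in store:
        touched += len(record)
        values.append(record[index])
    return values, touched

def scan_column_store(store, field):
    """Read one field from a column layout; only that column is touched."""
    column = store[field]
    return list(column), len(column)

row_values, row_touched = scan_row_store(row_store, "source_ip")
col_values, col_touched = scan_column_store(column_store, "source_ip")
```

Both scans return the same `source_ip` values, but the row layout passes over 12 stored values and the column layout only 3. With many metrics per event, that ratio grows with the number of fields.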
rocky48
2 years, 11 months ago
Selected Answer: A
A is the right answer
upvoted 1 times
dushmantha
3 years ago
Selected Answer: A
Agree with "zanhsieh"
upvoted 1 times
Ayaa4
3 years ago
Columnar formats such as ORC and Parquet are faster. The answer is A.
upvoted 1 times
Bik000
3 years, 1 month ago
Selected Answer: A
Answer is A
upvoted 2 times
jrheen
3 years, 2 months ago
Answer : A
upvoted 1 times
rav009
3 years, 5 months ago
For the previous 24 hours, the sort by date in C is not helpful. Sorting by timestamp would make sense.
upvoted 2 times
Donell
3 years, 8 months ago
Answer is A: In Apache ORC partitioned by date and sorted by source IP. The company analyzes historical logs dating back 2 years as well as the past 24 hours of data, so the data should be partitioned by date and sorted by IP, not the other way around. ORC is columnar, hence the preferred data format.
upvoted 5 times
Shraddha
3 years, 8 months ago
B and D = wrong, use a columnar format. C = wrong, partition by date so historical data can be separated.
upvoted 2 times
Community vote distribution
A (35%)
C (25%)
B (20%)
Other