Exam AWS Certified Data Engineer - Associate DEA-C01 topic 1 question 31 discussion

A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.
Which actions will provide the FASTEST queries? (Choose two.)

  • A. Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.
  • B. Use a columnar storage file format.
  • C. Partition the data based on the most common query predicates.
  • D. Split the data into files that are less than 10 KB.
  • E. Use file formats that are not splittable.
Suggested Answer: BC

Comments

GiorgioGss
Highly Voted 1 year, 1 month ago
Selected Answer: BC
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-performance.html
upvoted 6 times
...
rralucard_
Highly Voted 1 year, 3 months ago
Selected Answer: BC
B. Use a columnar storage file format: Columnar formats like Parquet and ORC are highly recommended for use with Redshift Spectrum. They store data by column, which allows Spectrum to scan only the columns a query needs, significantly improving query performance and reducing the amount of data scanned.
C. Partition the data based on the most common query predicates: Partitioning data in S3 by commonly used query predicates (such as date or region) allows Redshift Spectrum to skip large portions of data that are irrelevant to a particular query. This can lead to substantial performance improvements, especially for large datasets.
upvoted 5 times
...
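The columnar-format argument in the comment above can be illustrated with a minimal, stdlib-only Python sketch (this is not Redshift or Parquet code; the record layout and byte counts are purely illustrative, standing in for Parquet's actual columnar encoding):

```python
import json

# Hypothetical table: 1000 rows with a wide "note" field.
records = [{"id": i, "amount": i * 10, "note": "comment " * 10} for i in range(1000)]

# Row-oriented scan: a query like SELECT SUM(amount) must still read
# every full record, including the columns it never uses.
row_scan_bytes = sum(len(json.dumps(r)) for r in records)

# Columnar scan: the engine reads only the "amount" column chunk.
amount_column = [r["amount"] for r in records]
col_scan_bytes = len(json.dumps(amount_column))

print(col_scan_bytes < row_scan_bytes)  # True: far fewer bytes scanned
```

The ratio between the two byte counts is the intuition behind Spectrum's per-terabyte-scanned pricing and speed: with Parquet or ORC, queries that touch a few columns of a wide table scan a small fraction of the data.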
andrologin
Most Recent 9 months, 4 weeks ago
Selected Answer: BC
Partitioning helps filter the data, and columnar storage is optimised for analytical (OLAP) queries.
upvoted 1 times
...
pypelyncar
10 months, 4 weeks ago
Selected Answer: BC
Redshift Spectrum is optimized for querying data stored in columnar formats like Parquet or ORC. These formats store each data column separately, allowing Redshift Spectrum to scan only the columns relevant to a specific query, significantly improving performance compared to row-oriented formats. Partitioning organizes data files in S3 based on specific column values (e.g., date, region). When your queries filter or join on these partitioning columns (common query predicates), Redshift Spectrum can quickly locate the relevant data files, minimizing the amount of data scanned and accelerating query execution.
upvoted 3 times
...
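The partition-pruning behavior described above can be sketched with a short stdlib-only Python example (the bucket prefix, column names, and key layout are hypothetical; real pruning is done by the query engine against Hive-style `col=value` key paths in S3):

```python
# Hypothetical S3 keys laid out in Hive-style partitions.
S3_KEYS = [
    "sales/region=us-east/date=2024-01-01/part-0.parquet",
    "sales/region=us-east/date=2024-01-02/part-0.parquet",
    "sales/region=eu-west/date=2024-01-01/part-0.parquet",
]

def parse_partitions(key):
    """Extract partition column=value pairs from a Hive-style key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

def prune(keys, **predicates):
    """Keep only keys whose partition values satisfy every predicate,
    so objects in non-matching partitions are never read."""
    return [k for k in keys
            if all(parse_partitions(k).get(c) == v for c, v in predicates.items())]

# A query with WHERE region = 'eu-west' scans one object instead of three.
print(prune(S3_KEYS, region="eu-west"))
# ['sales/region=eu-west/date=2024-01-01/part-0.parquet']
```

This is why option C says to partition on the *most common* query predicates: pruning only helps when the `WHERE` clause actually filters on the partition columns.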
d8945a1
12 months ago
Selected Answer: BC
https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
upvoted 1 times
...
certplan
1 year, 1 month ago
2. **Partitioning**: AWS documentation for Amazon Redshift Spectrum highlights the importance of partitioning data based on commonly used query predicates to improve query performance. By partitioning data, Redshift Spectrum can prune unnecessary partitions during query execution, reducing the amount of data scanned and improving overall query performance. This guidance can be found in the AWS documentation for Amazon Redshift Spectrum under "Using Partitioning to Improve Query Performance": https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum-partitioning.html
upvoted 1 times
...
certplan
1 year, 1 month ago
1. **Columnar Storage File Format**: According to AWS documentation, columnar storage file formats like Apache Parquet and Apache ORC are recommended for optimizing query performance with Amazon Redshift Spectrum. They state that these formats are highly efficient for selective column reads, which aligns with the way analytical queries typically operate. This can be found in the AWS documentation for Amazon Redshift Spectrum under "Choosing Data Formats": https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html#spectrum-columnar-storage
upvoted 1 times
...