Exam AWS Certified Data Analytics - Specialty topic 1 question 31 discussion

A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.
Which solution will improve the data loading performance?

  • A. Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
  • B. Split large .csv files, then use a COPY command to load data into Amazon Redshift.
  • C. Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
  • D. Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.
Suggested Answer: B
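
For illustration, a minimal sketch of what answer B looks like in practice; the bucket, prefix, table, and IAM role names below are hypothetical:

-- Load all split (and here gzip-compressed) .csv parts for one date partition.
-- Because FROM names an S3 prefix, COPY picks up every matching file and loads
-- them in parallel across the cluster's slices, instead of one serialized
-- stream from a single large file.
COPY sales_staging
FROM 's3://example-sales-bucket/sales/2023-05-01/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV
GZIP
IGNOREHEADER 1;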

Comments

Paitan
Highly Voted 3 years, 7 months ago
B for sure. The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster (a query to check the slice count is sketched below).
upvoted 44 times
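To size the file count the way this comment suggests, the slice total can be read from the STV_SLICES system view; a quick sketch, assuming a user with access to system tables:

-- Returns one row per slice; aim for a file count that is a multiple of this.
SELECT COUNT(*) AS slice_count FROM stv_slices;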
Shraddha
Highly Voted 3 years, 6 months ago
Ans B. A is wrong: compressing means downloading the files from S3 and compressing them first, which is time-consuming, and you use COPY for files, not INSERT. C is wrong: it will not improve performance. D is wrong: vacuuming frees up storage. This is a question about parallel loading.
upvoted 11 times
Mayank7g
Most Recent 1 year, 10 months ago
Selected Answer: B
B for sure
upvoted 1 times
pk349
2 years ago
B: I passed the test
upvoted 2 times
AwsNewPeople
2 years, 2 months ago
Selected Answer: B
Option B is the most appropriate solution for improving data loading performance. Splitting large .csv files and using a COPY command can parallelize the load process and reduce the data load time. The data partitioning by date can help further optimize the load process by reducing the data scanned for each load. Compressing the .csv files may help reduce the storage cost, but it may not improve the data load time. Using an INSERT statement to ingest data into Amazon Redshift can be slow and does not take advantage of Redshift's parallel processing capability. Amazon Kinesis Data Firehose can be used to ingest streaming data in real-time, but may not be the best choice for large batch loads. Loading the .csv files in an unsorted key order and vacuuming the table can help optimize the table for query performance but may not improve the data loading performance.
upvoted 3 times
cloudlearnerhere
2 years, 6 months ago
Selected Answer: B
Correct answer is B, as splitting the large file into multiple files can help improve the data loading performance using the COPY command. Option A is wrong as the COPY command, not INSERT, would provide the best benefit. Option C is wrong as Kinesis Data Firehose cannot move data from S3 to Redshift: Kinesis Data Firehose delivers your data to your S3 bucket first and then issues an Amazon Redshift COPY command to load the data into your Amazon Redshift cluster, so it still would not improve the load performance (a manifest-based COPY is sketched below). Option D is wrong as vacuuming will free up space but does not improve the load performance.
upvoted 5 times
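As a side note to the Firehose point above: when an exact set of split files must be loaded (effectively what Firehose's internally issued COPY does), COPY also accepts a manifest. A sketch with hypothetical names:

-- The manifest is a JSON file in S3 that lists the explicit part files to
-- load, avoiding accidental pickup of extra files under a shared prefix.
COPY sales_staging
FROM 's3://example-sales-bucket/manifests/2023-05-01.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV
GZIP
MANIFEST;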
Arka_01
2 years, 7 months ago
Selected Answer: B
Loading from S3 into Redshift is done with the COPY command. To utilize parallelism, it is recommended to split large files into smaller chunks.
upvoted 1 times
Binh12
2 years, 10 months ago
Don't know why B. These are uncompressed .csv files, so there should be no need to split them; Redshift does it automatically. "When you load delimited data from a large, uncompressed file, Amazon Redshift makes use of multiple slices. These slices work in parallel, automatically. This provides fast load performance." See https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html
upvoted 2 times
Ryo0w0o
2 years, 5 months ago
Agreed. It seems there is no correct answer among the choices.
upvoted 1 times
rocky48
2 years, 10 months ago
Selected Answer: B
Answer = B
upvoted 1 times
Bik000
2 years, 12 months ago
Selected Answer: B
Answer is B
upvoted 1 times
moon2351
3 years, 2 months ago
Selected Answer: B
Answer is B
upvoted 2 times
awsmani
3 years, 5 months ago
Ans: B. Splitting large files will help load performance. A single large file loads in a serialized manner, which lowers performance.
upvoted 1 times
lostsoul07
3 years, 7 months ago
B is the right answer
upvoted 3 times
BillyC
3 years, 7 months ago
B is correct for me
upvoted 2 times
sanjaym
3 years, 7 months ago
B for sure.
upvoted 2 times
syu31svc
3 years, 7 months ago
It's already in S3, so the answer is B, 100%.
upvoted 1 times
ali_baba_acs
3 years, 7 months ago
Answer is B. For A, compression is good practice, but you would then use the COPY command, not INSERT. Kinesis Firehose will not improve performance, and vacuuming helps free space but does not improve performance either.
upvoted 3 times