Exam AWS Certified Data Analytics - Specialty topic 1 question 31 discussion

A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.
Which solution will improve the data loading performance?

  • A. Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
  • B. Split large .csv files, then use a COPY command to load data into Amazon Redshift.
  • C. Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
  • D. Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.
Suggested Answer: B
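
For illustration, a minimal sketch of what answer B looks like in practice; the bucket, prefix, table, and IAM role names below are hypothetical:

-- Load all split (and here gzip-compressed) .csv parts for one date partition.
-- Because FROM names an S3 prefix, COPY picks up every matching file and loads
-- them in parallel across the cluster's slices, instead of one serialized
-- stream from a single large file.
COPY sales_staging
FROM 's3://example-sales-bucket/sales/2023-05-01/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV
GZIP
IGNOREHEADER 1;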

Comments

Paitan
Highly Voted 3 years, 7 months ago
B for sure. The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster (a query to check the slice count is sketched below).
upvoted 44 times
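To size the file count the way this comment suggests, the slice total can be read from the STV_SLICES system view; a quick sketch, assuming a user with access to system tables:

-- Returns one row per slice; aim for a file count that is a multiple of this.
SELECT COUNT(*) AS slice_count FROM stv_slices;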
Shraddha
Highly Voted 3 years, 6 months ago
Ans B. A is wrong: compressing means downloading the files from S3 and compressing them first, which is time-consuming, and you use COPY for files, not INSERT. C is wrong: it will not improve performance. D is wrong: vacuuming frees up storage. This is a question about parallel loading.
upvoted 11 times
Mayank7g
Most Recent 1 year, 10 months ago
Selected Answer: B
B for sure
upvoted 1 times
pk349
2 years ago
B: I passed the test
upvoted 2 times
AwsNewPeople
2 years, 2 months ago
Selected Answer: B
Option B is the most appropriate solution for improving data loading performance. Splitting large .csv files and using a COPY command can parallelize the load process and reduce the data load time. The data partitioning by date can help further optimize the load process by reducing the data scanned for each load. Compressing the .csv files may help reduce the storage cost, but it may not improve the data load time. Using an INSERT statement to ingest data into Amazon Redshift can be slow and does not take advantage of Redshift's parallel processing capability. Amazon Kinesis Data Firehose can be used to ingest streaming data in real-time, but may not be the best choice for large batch loads. Loading the .csv files in an unsorted key order and vacuuming the table can help optimize the table for query performance but may not improve the data loading performance.
upvoted 3 times
cloudlearnerhere
2 years, 6 months ago
Selected Answer: B
Correct answer is B, as splitting the large file into multiple files can help improve the data loading performance using the COPY command. Option A is wrong as the COPY command, not INSERT, would provide the best benefit. Option C is wrong as Kinesis Data Firehose cannot move data from S3 to Redshift: Kinesis Data Firehose delivers your data to your S3 bucket first and then issues an Amazon Redshift COPY command to load the data into your Amazon Redshift cluster, so it still would not improve the load performance (a manifest-based COPY is sketched below). Option D is wrong as vacuuming will free up space but does not improve the load performance.
upvoted 5 times
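As a side note to the Firehose point above: when an exact set of split files must be loaded (effectively what Firehose's internally issued COPY does), COPY also accepts a manifest. A sketch with hypothetical names:

-- The manifest is a JSON file in S3 that lists the explicit part files to
-- load, avoiding accidental pickup of extra files under a shared prefix.
COPY sales_staging
FROM 's3://example-sales-bucket/manifests/2023-05-01.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
CSV
GZIP
MANIFEST;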
Arka_01
2 years, 7 months ago
Selected Answer: B
Loading from S3 into Redshift is done with the COPY command. To utilize parallelism, it is recommended to split large files into smaller chunks.
upvoted 1 times
Binh12
2 years, 10 months ago
Don't know why B. These are uncompressed .csv files, so there should be no need to split them; Redshift does it automatically. "When you load delimited data from a large, uncompressed file, Amazon Redshift makes use of multiple slices. These slices work in parallel, automatically. This provides fast load performance." See https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html
upvoted 2 times
Ryo0w0o
2 years, 5 months ago
Agreed. It seems there is no correct answer among the choices.
upvoted 1 times
rocky48
2 years, 10 months ago
Selected Answer: B
Answer = B
upvoted 1 times
Bik000
2 years, 12 months ago
Selected Answer: B
Answer is B
upvoted 1 times
moon2351
3 years, 2 months ago
Selected Answer: B
Answer is B
upvoted 2 times
awsmani
3 years, 5 months ago
Ans: B. Splitting large files will help load performance. A single large file loads in a serialized manner, which lowers performance.
upvoted 1 times
lostsoul07
3 years, 7 months ago
B is the right answer
upvoted 3 times
BillyC
3 years, 7 months ago
B is correct for me
upvoted 2 times
sanjaym
3 years, 7 months ago
B for sure.
upvoted 2 times
syu31svc
3 years, 7 months ago
It's already in S3, so the answer is B, 100%.
upvoted 1 times
ali_baba_acs
3 years, 7 months ago
Answer is B. For A, compression is good practice, but you would then use the COPY command, not INSERT. Kinesis Firehose will not improve performance, and vacuuming helps free space but does not improve performance either.
upvoted 3 times