Exam AWS Certified Data Analytics - Specialty topic 1 question 19 discussion

A large company receives files from external parties on Amazon EC2 throughout the day. At the end of the day, the files are combined into a single file, compressed into a gzip file, and uploaded to Amazon S3. The total size of all the files is close to 100 GB daily. Once the files are uploaded to Amazon S3, an AWS Batch program executes a COPY command to load the files into an Amazon Redshift cluster.
Which program modification will accelerate the COPY process?

  • A. Upload the individual files to Amazon S3 and run the COPY command as soon as the files become available.
  • B. Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.
  • C. Split the number of files so they are equal to a multiple of the number of compute nodes in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.
  • D. Apply sharding by breaking up the files so the distkey columns with the same values go to the same file. Gzip and upload the sharded files to Amazon S3. Run the COPY command on the files.
Suggested Answer: B
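
A rough sketch of what option B looks like end to end: split the combined extract into a multiple of the cluster's slice count, gzip each part, upload the parts to Amazon S3, and then run a single COPY against the shared key prefix. Every name below (bucket, prefix, table, IAM role) and the pipe delimiter are placeholders, not values from the question.

import gzip
import boto3

BUCKET = "example-ingest-bucket"      # placeholder bucket name
PREFIX = "daily/part_"                # placeholder key prefix
NUM_SLICES = 8                        # your cluster's slice count (see STV_SLICES)
NUM_FILES = 2 * NUM_SLICES            # any multiple of the slice count works

def split_and_upload(src_path):
    """Round-robin whole rows across NUM_FILES gzip parts, then upload each."""
    s3 = boto3.client("s3")
    parts = [gzip.open(f"/tmp/part_{i:04d}.gz", "wb") for i in range(NUM_FILES)]
    with open(src_path, "rb") as src:
        for n, line in enumerate(src):
            parts[n % NUM_FILES].write(line)   # rows stay whole inside each part
    for i, part in enumerate(parts):
        part.close()
        s3.upload_file(f"/tmp/part_{i:04d}.gz", BUCKET, f"{PREFIX}{i:04d}.gz")

# One COPY against the shared prefix; Redshift assigns the parts to slices itself.
# DELIMITER assumes pipe-delimited rows; match it to the real file format.
COPY_SQL = f"""
COPY daily_events
FROM 's3://{BUCKET}/{PREFIX}'
IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
GZIP
DELIMITER '|';
"""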

Comments

singh100
Highly Voted 3 years, 9 months ago
B. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices.
upvoted 23 times
...
Shraddha
Highly Voted 3 years, 8 months ago
B : This is a textbook question. Sequential loading vs. parallel loading. https://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
upvoted 7 times
...
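
A quick way to find the slice count that the split should be a multiple of is the STV_SLICES system view. A minimal sketch using psycopg2, with placeholder connection settings:

import psycopg2

# STV_SLICES has one row per slice; its row count is the parallelism available
# to a single COPY. All connection settings below are placeholders.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="loader",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM stv_slices;")
    num_slices = cur.fetchone()[0]
conn.close()
print(num_slices)
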
GCPereira
Most Recent 1 year, 6 months ago
files -> EC2 -> merged into a single compressed file at the end of the day -> S3 (100 GB daily).
A: A COPY of one large file (100 GB) is slow and not effective, since Redshift has to divide the processing across the cluster and only after that division is the copy carried out.
B: The files should be sized so that each one is processed by a single slice of a node. With this pre-processing done, the leader node does not need to "allocate" the work of copying one huge file; if each slice processes one file, the transfer speed is optimal.
C: A smart option, but not the most effective. The number of slices is directly related to the number of nodes, but if the split is based only on the number of nodes, you can still end up running the COPY command against files that are too large.
D: The distribution style is applied after the data is loaded.
upvoted 1 times
...
nroopa
1 year, 10 months ago
Option B
upvoted 1 times
...
NikkyDicky
1 year, 11 months ago
Selected Answer: B
I think B
upvoted 1 times
...
pk349
2 years, 2 months ago
B: I passed the test
upvoted 1 times
roymunson
1 year, 7 months ago
Agree. I passed the test two times in a row.
upvoted 1 times
...
...
SamQiu
2 years, 6 months ago
Why can't I use Option D?
upvoted 1 times
...
[Removed]
2 years, 6 months ago
b https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html Granting access to Data Catalog resources across accounts enables your extract, transform, and load (ETL) jobs to query and join data from different accounts.
upvoted 1 times
...
cloudlearnerhere
2 years, 8 months ago
Selected Answer: B
B is correct as the COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster.
upvoted 7 times
...
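
Combining the two rules quoted above (a multiple of the slice count, and roughly 1 MB to 125 MB per file after compression), a small helper can pick the file count. The 125 MB target is from the guidance above; the compression ratio in the example is only an assumption:

import math

def pick_file_count(total_compressed_mb, num_slices, target_mb=125.0):
    """Smallest multiple of num_slices that keeps each part at or below
    target_mb, the upper end of the 1-125 MB range quoted above."""
    rounds = max(1, math.ceil(total_compressed_mb / (target_mb * num_slices)))
    return rounds * num_slices

# Assumption for illustration: if the 100 GB daily extract compresses to about
# 25,000 MB and the cluster has 8 slices, this yields 200 files of ~125 MB each.
print(pick_file_count(25_000, 8))   # -> 200
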
Arka_01
2 years, 9 months ago
Selected Answer: B
GZIP files cannot be split, so split the data first and then gzip the parts. It is also recommended that the number of files be a multiple of the total number of slices in the Redshift cluster, so the COPY command can engage all worker nodes in parallel and distribute the load evenly.
upvoted 1 times
...
renfdo
2 years, 9 months ago
Selected Answer: B
B is the right answer
upvoted 1 times
...
renfdo
2 years, 9 months ago
B is the right answer
upvoted 1 times
...
rocky48
2 years, 11 months ago
Selected Answer: B
B is the right answer
upvoted 1 times
...
lostsoul07
3 years, 8 months ago
B is the right answer
upvoted 1 times
...
BillyC
3 years, 8 months ago
B is correct for me
upvoted 1 times
...
sanjaym
3 years, 8 months ago
B is the answer.
upvoted 1 times
...
Karan_Sharma
3 years, 9 months ago
Option B. Using a single large file forces COPY to do a serial load. Splitting the data into a number of files that is a multiple of the number of slices in the cluster and compressing them gives better COPY performance.
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other