Exam AWS Certified Data Analytics - Specialty topic 1 question 19 discussion

A large company receives files from external parties on Amazon EC2 throughout the day. At the end of the day, the files are combined into a single file, compressed into a gzip file, and uploaded to Amazon S3. The total size of all the files is close to 100 GB daily. Once the files are uploaded to Amazon S3, an AWS Batch program executes a COPY command to load the files into an Amazon Redshift cluster.
Which program modification will accelerate the COPY process?

  • A. Upload the individual files to Amazon S3 and run the COPY command as soon as the files become available.
  • B. Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.
  • C. Split the number of files so they are equal to a multiple of the number of compute nodes in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.
  • D. Apply sharding by breaking up the files so the distkey columns with the same values go to the same file. Gzip and upload the sharded files to Amazon S3. Run the COPY command on the files.
Suggested Answer: B
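
A rough sketch of what option B looks like end to end: split the combined extract into a multiple of the cluster's slice count, gzip each part, upload the parts to Amazon S3, and then run a single COPY against the shared key prefix. Every name below (bucket, prefix, table, IAM role) and the pipe delimiter are placeholders, not values from the question.

import gzip
import boto3

BUCKET = "example-ingest-bucket"      # placeholder bucket name
PREFIX = "daily/part_"                # placeholder key prefix
NUM_SLICES = 8                        # your cluster's slice count (see STV_SLICES)
NUM_FILES = 2 * NUM_SLICES            # any multiple of the slice count works

def split_and_upload(src_path):
    """Round-robin whole rows across NUM_FILES gzip parts, then upload each."""
    s3 = boto3.client("s3")
    parts = [gzip.open(f"/tmp/part_{i:04d}.gz", "wb") for i in range(NUM_FILES)]
    with open(src_path, "rb") as src:
        for n, line in enumerate(src):
            parts[n % NUM_FILES].write(line)   # rows stay whole inside each part
    for i, part in enumerate(parts):
        part.close()
        s3.upload_file(f"/tmp/part_{i:04d}.gz", BUCKET, f"{PREFIX}{i:04d}.gz")

# One COPY against the shared prefix; Redshift assigns the parts to slices itself.
# DELIMITER assumes pipe-delimited rows; match it to the real file format.
COPY_SQL = f"""
COPY daily_events
FROM 's3://{BUCKET}/{PREFIX}'
IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
GZIP
DELIMITER '|';
"""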

Comments

singh100
Highly Voted 3 years, 9 months ago
B. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices.
upvoted 23 times
...
Shraddha
Highly Voted 3 years, 8 months ago
B : This is a textbook question. Sequential loading vs. parallel loading. https://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
upvoted 7 times
...
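
A quick way to find the slice count that the split should be a multiple of is the STV_SLICES system view. A minimal sketch using psycopg2, with placeholder connection settings:

import psycopg2

# STV_SLICES has one row per slice; its row count is the parallelism available
# to a single COPY. All connection settings below are placeholders.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="loader",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM stv_slices;")
    num_slices = cur.fetchone()[0]
conn.close()
print(num_slices)
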
GCPereira
Most Recent 1 year, 6 months ago
files -> EC2 -> merged into a single compressed file at the end of the day -> S3 (100 GB daily).
A: A COPY of one large file (100 GB) is slow and not effective, since Redshift has to divide the processing across the cluster and only after that division is the copy carried out.
B: The files should be sized so that each one is processed by a single slice of a node. With this pre-processing done, the leader node does not need to "allocate" the work of copying one huge file; if each slice processes one file, the transfer speed is optimal.
C: A smart option, but not the most effective. The number of slices is directly related to the number of nodes, but if the split is based only on the number of nodes, you can still end up running the COPY command against files that are too large.
D: The distribution style is applied after the data is loaded.
upvoted 1 times
...
nroopa
1 year, 10 months ago
Option B
upvoted 1 times
...
NikkyDicky
1 year, 11 months ago
Selected Answer: B
I think B
upvoted 1 times
...
pk349
2 years, 2 months ago
B: I passed the test
upvoted 1 times
roymunson
1 year, 7 months ago
Agree. I passed the test two times in a row.
upvoted 1 times
...
...
SamQiu
2 years, 6 months ago
Why can't I use Option D?
upvoted 1 times
...
[Removed]
2 years, 6 months ago
b https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html Granting access to Data Catalog resources across accounts enables your extract, transform, and load (ETL) jobs to query and join data from different accounts.
upvoted 1 times
...
cloudlearnerhere
2 years, 8 months ago
Selected Answer: B
B is correct as the COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to perform a serialized load, which is much slower. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression. The number of files should be a multiple of the number of slices in your cluster.
upvoted 7 times
...
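
Combining the two rules quoted above (a multiple of the slice count, and roughly 1 MB to 125 MB per file after compression), a small helper can pick the file count. The 125 MB target is from the guidance above; the compression ratio in the example is only an assumption:

import math

def pick_file_count(total_compressed_mb, num_slices, target_mb=125.0):
    """Smallest multiple of num_slices that keeps each part at or below
    target_mb, the upper end of the 1-125 MB range quoted above."""
    rounds = max(1, math.ceil(total_compressed_mb / (target_mb * num_slices)))
    return rounds * num_slices

# Assumption for illustration: if the 100 GB daily extract compresses to about
# 25,000 MB and the cluster has 8 slices, this yields 200 files of ~125 MB each.
print(pick_file_count(25_000, 8))   # -> 200
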
Arka_01
2 years, 9 months ago
Selected Answer: B
GZIP files cannot be split, so split the data first and then gzip the parts. It is also recommended that the number of files be a multiple of the total number of slices in the Redshift cluster, so the COPY command can engage all worker nodes in parallel and distribute the load evenly.
upvoted 1 times
...
renfdo
2 years, 9 months ago
Selected Answer: B
B is the right answer
upvoted 1 times
...
renfdo
2 years, 9 months ago
B is the right answer
upvoted 1 times
...
rocky48
2 years, 11 months ago
Selected Answer: B
B is the right answer
upvoted 1 times
...
lostsoul07
3 years, 8 months ago
B is the right answer
upvoted 1 times
...
BillyC
3 years, 8 months ago
B is correct for me
upvoted 1 times
...
sanjaym
3 years, 8 months ago
B is the answer.
upvoted 1 times
...
Karan_Sharma
3 years, 9 months ago
Option B. Using a single large file forces COPY to do a serial load. Splitting the data into a number of files that is a multiple of the number of slices in the cluster and compressing them gives better COPY performance.
upvoted 2 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other