Exam AWS Certified Data Analytics - Specialty topic 1 question 127 discussion

A global pharmaceutical company receives test results for new drugs from various testing facilities worldwide. The results are sent in millions of 1 KB-sized JSON objects to an Amazon S3 bucket owned by the company. The data engineering team needs to process those files, convert them into Apache Parquet format, and load them into Amazon Redshift for data analysts to perform dashboard reporting. The engineering team uses AWS Glue to process the objects, AWS Step Functions for process orchestration, and Amazon CloudWatch for job scheduling.
More testing facilities were recently added, and the time to process files is increasing.
What will MOST efficiently decrease the data processing time?

  • A. Use AWS Lambda to group the small files into larger files. Write the files back to Amazon S3. Process the files using AWS Glue and load them into Amazon Redshift tables.
  • B. Use the AWS Glue dynamic frame file grouping option while ingesting the raw input files. Process the files and load them into Amazon Redshift tables.
  • C. Use the Amazon Redshift COPY command to move the files from Amazon S3 into Amazon Redshift tables directly. Process the files in Amazon Redshift.
  • D. Use Amazon EMR instead of AWS Glue to group the small input files. Process the files in Amazon EMR and load them into Amazon Redshift tables.
Suggested Answer: B

Comments

srinivasa
Highly Voted 3 years, 9 months ago
Answer: B https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 18 times
lakediver
3 years, 6 months ago
Agree. df = glueContext.create_dynamic_frame.from_options("s3", {'paths': ["s3://s3path/"], 'recurse': True, 'groupFiles': 'inPartition', 'groupSize': '1048576'}, format="json")
upvoted 3 times
...
...
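The call quoted in the comment above can be sketched as a small helper. This is a minimal sketch, not a definitive job script: the function names `grouping_options` and `read_grouped_json` are hypothetical, the `s3://s3path/` path is the placeholder from the comment, and `glue_context` is assumed to be a `GlueContext` created inside an AWS Glue job.

```python
def grouping_options(s3_path, group_size_bytes=1048576):
    """Build S3 connection options that ask Glue to group small input files.

    Hypothetical helper; the keys mirror the options quoted in the comment above.
    """
    return {
        "paths": [s3_path],
        "recurse": True,
        # Group files that sit in the same S3 partition into one read task.
        "groupFiles": "inPartition",
        # Target group size in bytes, passed as a string as in the comment above.
        "groupSize": str(group_size_bytes),
    }


def read_grouped_json(glue_context, s3_path):
    """Read JSON objects under s3_path as a DynamicFrame with file grouping enabled.

    Assumes glue_context is an awsglue GlueContext (only available inside a Glue job).
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=grouping_options(s3_path),
        format="json",
    )
```

Inside a Glue job this would be used as `df = read_grouped_json(glueContext, "s3://s3path/")`, after which the existing conversion to Parquet and the load into Amazon Redshift stay unchanged.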
rajeevramadurai
Most Recent 1 year, 3 months ago
The data engineering team needs to process those files, convert them into Apache Parquet format---so answer is A?
upvoted 1 times
...
pk349
2 years, 1 month ago
B: I passed the test
upvoted 1 times
...
cloudlearnerhere
2 years, 8 months ago
Correct answer is B as the AWS Glue job can be updated to group files to create larger files which can help improve the processing time without any additional steps or changes. https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html Options A & D are wrong as using a staging space or EMR would add additional steps to the processing. Option C is wrong as this only performs the loading of data, not processing before the load.
upvoted 4 times
...
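To get a rough feel for why grouping helps, the number of read groups can be estimated from the file count and the group size. This is a back-of-the-envelope sketch: the figure of 5 million files is an assumption (the question only says "millions"), the 1 MB group size is taken from the earlier comment, and the real grouping done by Glue treats the size as a target and depends on how the data is partitioned.

```python
def estimated_groups(file_count, file_size_bytes, group_size_bytes):
    """Estimate how many groups result when small files are packed up to a target size."""
    total_bytes = file_count * file_size_bytes
    # Ceiling division: a partially filled last group still counts as one group.
    return -(-total_bytes // group_size_bytes)


# Assumed workload: 5,000,000 objects of 1 KB each, grouped to about 1 MB.
print(estimated_groups(5_000_000, 1024, 1_048_576))  # 4883
```

Under these assumptions the job reads a few thousand groups instead of millions of individual objects, which is the effect option B relies on.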
rocky48
2 years, 11 months ago
Selected Answer: B
Answer: B
upvoted 1 times
...
jealbave
2 years, 11 months ago
Answer C
upvoted 1 times
...
jrheen
3 years, 1 month ago
Answer-B
upvoted 1 times
...
Teraxs
3 years, 1 month ago
Selected Answer: B
https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
upvoted 2 times
...
simo40010
3 years, 3 months ago
I'm confused between B and C. But I think I will vote for B: since the input files are in JSON, I think they need to be processed before loading into Redshift (flattening the data, ...). We need ETL instead of ELT.
upvoted 2 times
...
cynthiacy
3 years, 6 months ago
groupFiles is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc. from https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html. B is wrong. Why not C?
upvoted 2 times
npt
3 years, 6 months ago
B is right. The question says that the input file format is JSON and the output is Parquet, so DynamicFrames with groupFiles can be used on the JSON input.
upvoted 1 times
...
...
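The format restriction quoted in this thread applies to the format being read, not the format being written. A minimal sketch of that check follows; the helper name `supports_group_files` is hypothetical, and the two format lists are copied from the documentation excerpt quoted in the comment above.

```python
# Formats taken from the documentation excerpt quoted in the thread.
GROUPABLE_FORMATS = frozenset({"csv", "ion", "grokLog", "json", "xml"})
NOT_GROUPABLE_FORMATS = frozenset({"avro", "parquet", "orc"})


def supports_group_files(input_format):
    """Return True if groupFiles applies when reading this input format."""
    return input_format in GROUPABLE_FORMATS


# The question's input is JSON, so grouping applies on the read side.
print(supports_group_files("json"))     # True
# Parquet is only the output format here, so the restriction does not matter.
print(supports_group_files("parquet"))  # False
```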
aws2019
3 years, 7 months ago
This is a weird question as both A and B are equally efficient
upvoted 2 times
lakeswimmer
3 years, 6 months ago
"most effectively" might be A
upvoted 1 times
...
cnmc
3 years, 4 months ago
A is not as efficient since you're creating an additional step (a "staging" location in S3 to store those large files). Not to mention now you're paying double the storage cost
upvoted 3 times
...
Olga2022
3 years, 7 months ago
You are correct, but the question says they already have a Glue job. So, I would choose B.
upvoted 5 times
...
...
ali98
3 years, 7 months ago
Answer : B
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other