Exam AWS Certified Big Data - Specialty topic 1 question 53 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 53
Topic #: 1

A company uses Amazon Redshift for its enterprise data warehouse. A new on-premises PostgreSQL OLTP DB must be integrated into the data warehouse. Each table in the PostgreSQL DB has an indexed timestamp column. The data warehouse has a staging layer to load source data into the data warehouse environment for further processing.
The data lag between the source PostgreSQL DB and the Amazon Redshift staging layer should NOT exceed four hours.
What is the most efficient technique to meet these requirements?

  • A. Create a DBLINK on the source DB to connect to Amazon Redshift. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and execute the event on the Amazon Redshift staging table.
  • B. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and write it to Amazon Kinesis Streams. Use a KCL application to execute the event on the Amazon Redshift staging table.
  • C. Extract the incremental changes periodically using a SQL query. Upload the changes to multiple Amazon Simple Storage Service (S3) objects, and run the COPY command to load to the Amazon Redshift staging layer.
  • D. Extract the incremental changes periodically using a SQL query. Upload the changes to a single Amazon Simple Storage Service (S3) object, and run the COPY command to load to the Amazon Redshift staging layer.
Suggested Answer: C

Comments

Bulti
Highly Voted 3 years, 7 months ago
A - Incorrect due to the DBLINK constraint: it works against RDS PostgreSQL, not an on-prem PostgreSQL instance. B - Using a Kinesis KCL application to write to Redshift, although it can be done via the JDBC/ODBC driver, doesn't seem efficient. C - This makes the best sense. Using a cron job, we could extract new/changed records based on the timestamp from each table and create an S3 object for that table, then use a COPY command to load the data from the S3 objects into the Redshift tables. D - It doesn't make any practical sense to extract changes across all tables into a single S3 object, because it's then not possible to split the records into multiple Redshift tables. Correct option is C.
upvoted 6 times
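As a rough illustration of the flow Bulti describes for option C (a periodic job extracts rows changed since the last run using the indexed timestamp column, writes them to several S3 objects, and then issues a single COPY against the prefix), here is a minimal Python sketch. The host names, bucket, prefix, "orders" table, IAM role, and credentials are made-up placeholders, not anything given in the question.

```python
# Hypothetical sketch of option C: extract changed rows since the last
# watermark, write them to multiple S3 objects, then COPY into Redshift.
# All connection details, names, and the "orders" table are invented examples.
import io

import boto3
import psycopg2

S3_BUCKET = "example-staging-bucket"      # assumed bucket
S3_PREFIX = "incremental/orders/"         # assumed key prefix
ROWS_PER_FILE = 100_000                   # split output so COPY can parallelise


def extract_and_load(last_watermark):
    s3 = boto3.client("s3")
    keys = []

    # 1. Pull only rows changed since the previous run, using the indexed
    #    timestamp column. Running e.g. hourly stays well under the 4-hour lag.
    src = psycopg2.connect(host="onprem-postgres.example.com",
                           dbname="oltp", user="etl", password="...")
    with src, src.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, amount, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        buf, part, n = io.StringIO(), 0, 0
        for row in cur:
            buf.write("|".join(str(col) for col in row) + "\n")
            n += 1
            if n % ROWS_PER_FILE == 0:            # start a new S3 object
                key = f"{S3_PREFIX}part-{part:04d}.txt"
                s3.put_object(Bucket=S3_BUCKET, Key=key,
                              Body=buf.getvalue().encode("utf-8"))
                keys.append(key)
                buf, part = io.StringIO(), part + 1
        if buf.tell():                            # flush the last partial file
            key = f"{S3_PREFIX}part-{part:04d}.txt"
            s3.put_object(Bucket=S3_BUCKET, Key=key,
                          Body=buf.getvalue().encode("utf-8"))
            keys.append(key)

    # 2. One COPY over the whole prefix loads every object in parallel
    #    across the cluster's slices.
    rs = psycopg2.connect(host="example-cluster.xyz.redshift.amazonaws.com",
                          port=5439, dbname="dw", user="etl", password="...")
    with rs, rs.cursor() as cur:
        cur.execute(
            f"COPY staging.orders FROM 's3://{S3_BUCKET}/{S3_PREFIX}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleCopyRole' "
            "DELIMITER '|'"
        )
    return keys
```

Scheduling a job like this hourly keeps the staging layer comfortably within the four-hour lag requirement, and splitting the extract across several objects lets COPY fan the load out across the cluster's slices.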
exams
Highly Voted 3 years, 7 months ago
correct... C?
upvoted 5 times
Kuntazulu
3 years, 7 months ago
Agreed. The COPY command is the fastest way to load data into Redshift, if you have multiple files, that is...
upvoted 2 times
Bulti
Most Recent 3 years, 7 months ago
https://stackoverflow.com/questions/55262639/automatically-load-data-into-redshift-with-the-copy-function
upvoted 1 times
Bulti
3 years, 7 months ago
I found this post, which suggests that a cron job, or a Lambda function triggered by an S3 event when a new file lands in an S3 bucket, could invoke the COPY command to load the data from each file into the corresponding Redshift table. My answer is C, with more confidence this time.
upvoted 3 times
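For the Lambda variant mentioned here (an S3 "object created" event triggers a COPY of the new file into the matching staging table), a hedged sketch might look like the following. The environment variables, key layout, and table naming are assumptions for illustration only, and psycopg2 would have to be packaged with the function (e.g. as a layer).

```python
# Hypothetical Lambda handler: each new S3 object triggers a COPY of that
# file into a Redshift staging table derived from the object's key prefix.
import os
import urllib.parse

import psycopg2  # assumed to be bundled with the function or provided as a layer


def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"], port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            # Assumed key layout: incremental/<table>/part-NNNN.txt, so the
            # target staging table can be read straight from the prefix.
            table = key.split("/")[1]
            with conn, conn.cursor() as cur:
                cur.execute(
                    f"COPY staging.{table} FROM 's3://{bucket}/{key}' "
                    f"IAM_ROLE '{os.environ['COPY_IAM_ROLE']}' DELIMITER '|'"
                )
    finally:
        conn.close()
```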
YashBindlish
3 years, 7 months ago
Correct Answer is A
upvoted 1 times
susan8840
3 years, 7 months ago
C. Per best practices, COPY is the best way to load data into Redshift, not INSERT. D is incorrect because Redshift can process multiple S3 objects in parallel.
upvoted 1 times
san2020
3 years, 7 months ago
my selection C
upvoted 2 times
am7
3 years, 7 months ago
@shwang, with multiple objects the parallelisation increases.
upvoted 1 times
shwang
3 years, 7 months ago
Why not D? What is the difference between storing the incremental data in a single S3 object versus multiple objects?
upvoted 3 times
Corram
3 years, 7 months ago
Because of a Redshift best practice: split the data to be copied into a number of files that is a multiple of the number of slices in the Redshift cluster. The load will be faster.
upvoted 2 times
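To make the "multiple of the number of slices" rule concrete, here is a tiny sketch of how the extract job might pick a file count; the slices-per-node figures are illustrative assumptions, not an authoritative table.

```python
# Sketch of the sizing rule: round the number of extract files up to a
# multiple of the cluster's slice count so every slice gets work during COPY.
SLICES_PER_NODE = {"dc2.large": 2, "ra3.4xlarge": 4}   # assumed example values


def target_file_count(node_type, node_count, rows, rows_per_file=100_000):
    slices = SLICES_PER_NODE[node_type] * node_count
    files = max(1, -(-rows // rows_per_file))           # ceiling division
    return -(-files // slices) * slices                 # next multiple of slices


# e.g. 1.2M rows on a 4-node dc2.large cluster (8 slices) -> 16 files
print(target_file_count("dc2.large", 4, 1_200_000))
```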
Corram
3 years, 7 months ago
Even worse, as Bulti correctly pointed out below, you might have multiple tables that you incrementally update, and their data must not end up in a common file.
upvoted 1 times
jlpl
3 years, 8 months ago
A! i think!
upvoted 2 times
mattyb123
3 years, 8 months ago
From the last post at this link: https://forums.aws.amazon.com/message.jspa?messageID=807709, it doesn't seem time-effective to use DBLINK. In addition, as @muhsin said, DBLINK to Redshift ONLY works with RDS PostgreSQL instances. Answer must be C.
upvoted 4 times
Corram
3 years, 7 months ago
Additionally, it appears that DBLINK is used in RDS to query from Redshift, not the other way round. https://stackoverflow.com/questions/45516920/does-amazon-redshift-support-extension-dblink
upvoted 1 times
mattyb123
3 years, 8 months ago
Is it A? 1. https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink/ 2. https://forums.aws.amazon.com/message.jspa?messageID=807709
upvoted 1 times
muhsin
3 years, 8 months ago
That is AWS RDS, but in the question the DB is on-prem, so we need to move the data to AWS.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other