Exam AWS Certified Big Data - Specialty topic 1 question 53 discussion

Exam question from Amazon's AWS Certified Big Data - Specialty
Question #: 53
Topic #: 1

A company uses Amazon Redshift for its enterprise data warehouse. A new on-premises PostgreSQL OLTP DB must be integrated into the data warehouse. Each table in the PostgreSQL DB has an indexed timestamp column. The data warehouse has a staging layer to load source data into the data warehouse environment for further processing.
The data lag between the source PostgreSQL DB and the Amazon Redshift staging layer should NOT exceed four hours.
What is the most efficient technique to meet these requirements?

  • A. Create a DBLINK on the source DB to connect to Amazon Redshift. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and execute the event on the Amazon Redshift staging table.
  • B. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and write it to Amazon Kinesis Streams. Use a KCL application to execute the event on the Amazon Redshift staging table.
  • C. Extract the incremental changes periodically using a SQL query. Upload the changes to multiple Amazon Simple Storage Service (S3) objects, and run the COPY command to load to the Amazon Redshift staging layer.
  • D. Extract the incremental changes periodically using a SQL query. Upload the changes to a single Amazon Simple Storage Service (S3) object, and run the COPY command to load to the Amazon Redshift staging layer.
Suggested Answer: C

Comments

Bulti
Highly Voted 3 years, 7 months ago
A - Incorrect due to the DBLINK constraint: it works against RDS PostgreSQL, not an on-prem PostgreSQL instance. B - Using a Kinesis KCL application to write to Redshift, although it can be done via the JDBC/ODBC driver, doesn't seem efficient. C - This makes the best sense. Using a cron job, we could extract new/changed records based on the timestamp from each table and create an S3 object for that table, then use a COPY command to load the data from the S3 objects into the Redshift tables. D - It doesn't make any practical sense to extract changes across all tables into a single S3 object, because it's then not possible to split the records into multiple Redshift tables. Correct option is C.
upvoted 6 times
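As a rough illustration of the flow Bulti describes for option C (a periodic job extracts rows changed since the last run using the indexed timestamp column, writes them to several S3 objects, and then issues a single COPY against the prefix), here is a minimal Python sketch. The host names, bucket, prefix, "orders" table, IAM role, and credentials are made-up placeholders, not anything given in the question.

```python
# Hypothetical sketch of option C: extract changed rows since the last
# watermark, write them to multiple S3 objects, then COPY into Redshift.
# All connection details, names, and the "orders" table are invented examples.
import io

import boto3
import psycopg2

S3_BUCKET = "example-staging-bucket"      # assumed bucket
S3_PREFIX = "incremental/orders/"         # assumed key prefix
ROWS_PER_FILE = 100_000                   # split output so COPY can parallelise


def extract_and_load(last_watermark):
    s3 = boto3.client("s3")
    keys = []

    # 1. Pull only rows changed since the previous run, using the indexed
    #    timestamp column. Running e.g. hourly stays well under the 4-hour lag.
    src = psycopg2.connect(host="onprem-postgres.example.com",
                           dbname="oltp", user="etl", password="...")
    with src, src.cursor() as cur:
        cur.execute(
            "SELECT id, customer_id, amount, updated_at "
            "FROM orders WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        buf, part, n = io.StringIO(), 0, 0
        for row in cur:
            buf.write("|".join(str(col) for col in row) + "\n")
            n += 1
            if n % ROWS_PER_FILE == 0:            # start a new S3 object
                key = f"{S3_PREFIX}part-{part:04d}.txt"
                s3.put_object(Bucket=S3_BUCKET, Key=key,
                              Body=buf.getvalue().encode("utf-8"))
                keys.append(key)
                buf, part = io.StringIO(), part + 1
        if buf.tell():                            # flush the last partial file
            key = f"{S3_PREFIX}part-{part:04d}.txt"
            s3.put_object(Bucket=S3_BUCKET, Key=key,
                          Body=buf.getvalue().encode("utf-8"))
            keys.append(key)

    # 2. One COPY over the whole prefix loads every object in parallel
    #    across the cluster's slices.
    rs = psycopg2.connect(host="example-cluster.xyz.redshift.amazonaws.com",
                          port=5439, dbname="dw", user="etl", password="...")
    with rs, rs.cursor() as cur:
        cur.execute(
            f"COPY staging.orders FROM 's3://{S3_BUCKET}/{S3_PREFIX}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleCopyRole' "
            "DELIMITER '|'"
        )
    return keys
```

Scheduling a job like this hourly keeps the staging layer comfortably within the four-hour lag requirement, and splitting the extract across several objects lets COPY fan the load out across the cluster's slices.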
exams
Highly Voted 3 years, 7 months ago
correct... C?
upvoted 5 times
Kuntazulu
3 years, 7 months ago
Agreed. The COPY command is the fastest way to load data into Redshift, if you have multiple files, that is...
upvoted 2 times
Bulti
Most Recent 3 years, 7 months ago
https://stackoverflow.com/questions/55262639/automatically-load-data-into-redshift-with-the-copy-function
upvoted 1 times
Bulti
3 years, 7 months ago
I found this post, which suggests that a cron job, or a Lambda function triggered by an S3 event when a new file lands in an S3 bucket, could invoke the COPY command to load the data from each file into the corresponding Redshift table. My answer is C, with more confidence this time.
upvoted 3 times
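For the Lambda variant mentioned here (an S3 "object created" event triggers a COPY of the new file into the matching staging table), a hedged sketch might look like the following. The environment variables, key layout, and table naming are assumptions for illustration only, and psycopg2 would have to be packaged with the function (e.g. as a layer).

```python
# Hypothetical Lambda handler: each new S3 object triggers a COPY of that
# file into a Redshift staging table derived from the object's key prefix.
import os
import urllib.parse

import psycopg2  # assumed to be bundled with the function or provided as a layer


def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"], port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            # Assumed key layout: incremental/<table>/part-NNNN.txt, so the
            # target staging table can be read straight from the prefix.
            table = key.split("/")[1]
            with conn, conn.cursor() as cur:
                cur.execute(
                    f"COPY staging.{table} FROM 's3://{bucket}/{key}' "
                    f"IAM_ROLE '{os.environ['COPY_IAM_ROLE']}' DELIMITER '|'"
                )
    finally:
        conn.close()
```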
YashBindlish
3 years, 7 months ago
Correct Answer is A
upvoted 1 times
susan8840
3 years, 7 months ago
C. Per best practices, COPY is the best way to load data into Redshift, not INSERT. D is incorrect because Redshift can process multiple S3 objects in parallel.
upvoted 1 times
san2020
3 years, 7 months ago
my selection C
upvoted 2 times
am7
3 years, 7 months ago
@shwang, with multiple objects the parallelisation increases.
upvoted 1 times
shwang
3 years, 7 months ago
Why not D? What is the difference between storing the incremental data in a single S3 object versus multiple objects?
upvoted 3 times
Corram
3 years, 7 months ago
Because of a Redshift best practice: split the data to be copied into a number of files that is a multiple of the number of slices in the Redshift cluster. The load will be faster.
upvoted 2 times
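To make the "multiple of the number of slices" rule concrete, here is a tiny sketch of how the extract job might pick a file count; the slices-per-node figures are illustrative assumptions, not an authoritative table.

```python
# Sketch of the sizing rule: round the number of extract files up to a
# multiple of the cluster's slice count so every slice gets work during COPY.
SLICES_PER_NODE = {"dc2.large": 2, "ra3.4xlarge": 4}   # assumed example values


def target_file_count(node_type, node_count, rows, rows_per_file=100_000):
    slices = SLICES_PER_NODE[node_type] * node_count
    files = max(1, -(-rows // rows_per_file))           # ceiling division
    return -(-files // slices) * slices                 # next multiple of slices


# e.g. 1.2M rows on a 4-node dc2.large cluster (8 slices) -> 16 files
print(target_file_count("dc2.large", 4, 1_200_000))
```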
Corram
3 years, 7 months ago
Even worse, as Bulti correctly pointed out below, you might have multiple tables that you incrementally update, and their data must not end up in a common file.
upvoted 1 times
jlpl
3 years, 8 months ago
A! i think!
upvoted 2 times
mattyb123
3 years, 8 months ago
From the last post at this link: https://forums.aws.amazon.com/message.jspa?messageID=807709, it doesn't seem time-effective to use DBLINK. In addition, as @muhsin said, DBLINK to Redshift ONLY works with RDS PostgreSQL instances. Answer must be C.
upvoted 4 times
Corram
3 years, 7 months ago
Additionally, it appears that DBLINK is used in RDS to query from Redshift, not the other way round. https://stackoverflow.com/questions/45516920/does-amazon-redshift-support-extension-dblink
upvoted 1 times
mattyb123
3 years, 8 months ago
Is it A? 1. https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink/ 2. https://forums.aws.amazon.com/message.jspa?messageID=807709
upvoted 1 times
muhsin
3 years, 8 months ago
That is AWS RDS, but in the question the DB is on-prem, so we need to move the data to AWS.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other