You used Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?
A. Create a cron schedule in Dataprep.
B. Create an App Engine cron job to schedule the execution of the Dataprep job.
C. Export the recipe as a Dataprep template, and create a job in Cloud Scheduler.
D. Export the Dataprep job as a Dataflow template, and incorporate it into a Composer job.
I'd pick D because it's the only option that handles the variable execution time (we need to run the Dataprep job only after the prior load job completes). D suggests exporting a Dataflow template, but this discussion suggests that the export option is no longer available (https://stackoverflow.com/questions/72544839/how-to-get-the-dataflow-template-of-a-dataprep-job). There are now Airflow operators for Dataprep, which we should use instead: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/dataprep.html
Since the load job's execution time is unpredictable, scheduling the Dataprep job for a fixed time window may not work.
When the Dataprep job runs for the first time, we can find the corresponding Dataflow job in the console and use it to create the template. Composer can then determine whether the load job has completed and trigger the Dataflow job.
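To make the fixed-window problem concrete, here is a minimal sketch in plain Python. The function name and all the times are hypothetical and chosen only for illustration; nothing here calls Dataprep or BigQuery.

```python
from datetime import datetime, timedelta


def fixed_schedule_is_safe(load_start: datetime,
                           load_duration: timedelta,
                           scheduled_run: datetime) -> bool:
    """Return True only if the load has finished by the scheduled run time."""
    return load_start + load_duration <= scheduled_run


# Hypothetical times: the load starts at 02:00 and the recipe is scheduled for 03:00.
load_start = datetime(2024, 1, 1, 2, 0)
scheduled_run = datetime(2024, 1, 1, 3, 0)

# A 45-minute load finishes before the scheduled run; a 90-minute load does not.
fast_day = fixed_schedule_is_safe(load_start, timedelta(minutes=45), scheduled_run)
slow_day = fixed_schedule_is_safe(load_start, timedelta(minutes=90), scheduled_run)
print(fast_day, slow_day)
```

On the slow day the scheduled run would start while the load is still in progress, which is the failure mode a fixed cron schedule cannot rule out.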
Dataprep by Trifacta allows you to schedule the execution of recipes. You can set up a cron schedule directly within Dataprep to automatically run your recipe at specified intervals, such as daily.
Why not D? This option involves significant additional complexity. Exporting the Dataprep job as a Dataflow template and then incorporating it into a Composer (Apache Airflow) job is a more complicated process and is typically used for more complex orchestration needs that go beyond simple scheduling.
We have an external dependency ("after the load job with variable execution time completes"), which requires a DAG, i.e. Airflow (Cloud Composer).
The reasons:
A scheduler like Cloud Scheduler won't handle the dependency on the BigQuery load completion time
Using Composer allows creating a DAG workflow that can:
Trigger the BigQuery load
Wait for BigQuery load to complete
Trigger the Dataprep Dataflow job
Dataflow template allows easy reuse of the Dataprep transformation logic
Composer coordinates everything based on the dependencies in an automated workflow
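The wait-then-trigger ordering described above can be sketched in plain Python. This is not Composer or Airflow code: `run_after_load`, `load_is_done`, and `trigger_transform` are hypothetical names, and the two callables stand in for "has the BigQuery load finished?" and "start the Dataflow job". In a real DAG, a sensor or operator dependency would play the role of the polling loop.

```python
import time
from typing import Callable


def run_after_load(
    load_is_done: Callable[[], bool],
    trigger_transform: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the load finishes, then trigger the transform exactly once."""
    waited = 0.0
    while not load_is_done():
        if waited >= timeout_seconds:
            raise TimeoutError("load job did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds
    return trigger_transform()


# Hypothetical stand-ins: the load reports "done" on the third check.
checks = iter([False, False, True])
result = run_after_load(
    load_is_done=lambda: next(checks),
    trigger_transform=lambda: "transform started",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(result)
```

The transform is triggered by the load's completion rather than by the clock, so a slow load only delays the run instead of causing it to process partial data.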
The key here is "after the load job with variable execution time completes", which means the execution of this job depends on the completion of another job that has a variable execution time. Hence D.
This approach ensures that the Dataprep job is triggered dynamically on completion of the preceding load job, so that data is processed accurately and in sequence.
A is correct. D is too complicated.
A is correct, because you can schedule a job right from the Dataprep UI.
https://cloud.google.com/blog/products/gcp/scheduling-and-sampling-arrive-for-google-cloud-dataprep
Scheduling and sampling arrive for Google Cloud Dataprep
Throughout our early releases, users’ most common request has been Flow scheduling. As of Thursday’s release, Flows can be scheduled with minute granularity at any frequency.
It's A. You can schedule the job directly in Dataprep, and it will use Dataflow under the hood. There is no need to export it or incorporate it into a Composer job.
Dataprep by Trifacta - https://docs.trifacta.com/display/DP/cron+Schedule+Syntax+Reference
Dataprep jobs use Dataflow - https://cloud.google.com/dataprep
jkhong (Highly Voted, 2 years ago)
midgoo (Highly Voted, 1 year, 9 months ago)
TVH_Data_Engineer (Most Recent, 12 months ago)
MaxNRG (1 year ago)
rocky48 (1 year ago)
gaurav0480 (1 year, 3 months ago)
god_brainer (1 year, 3 months ago)
Adswerve (1 year, 8 months ago)
lucaluca1982 (1 year, 8 months ago)
musumusu (1 year, 9 months ago)
jroig_ (1 year, 11 months ago)
zellck (2 years ago)
anicloudgirl (2 years ago)
anicloudgirl (2 years ago)
jkhong (2 years ago)
cloudmon (2 years, 1 month ago)
John_Pongthorn (2 years, 3 months ago)
AWSandeep (2 years, 3 months ago)