An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job that processes these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field is a unique key for the data, which has the following schema:
user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
All new records are ingested into a table named account_history, which maintains a full record of all data in the same schema as the source. The next table in the system, account_current, is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records are processed hourly, which implementation can efficiently update the described account_current table as part of each hourly batch job?
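The intended pattern here is an incremental upsert: deduplicate the hourly batch to one row per user_id (keeping the row with the largest last_updated), then merge that batch into account_current so matched keys are overwritten and new keys are inserted. In Databricks this is typically a Delta Lake MERGE INTO on user_id; the plain-Python sketch below only illustrates those semantics, with in-memory dicts standing in for the Delta tables (the helper names are ours, not from the question).

```python
# Hypothetical sketch of Type 1 upsert semantics, assuming an hourly
# batch may contain multiple records per user_id and that the largest
# last_updated value wins. Dicts stand in for the Delta tables.

def dedupe_batch(batch):
    """Keep only the most recent record per user_id within the batch."""
    latest = {}
    for rec in batch:
        uid = rec["user_id"]
        if uid not in latest or rec["last_updated"] > latest[uid]["last_updated"]:
            latest[uid] = rec
    return latest

def merge_into_current(account_current, batch):
    """Type 1 merge: overwrite matched keys, insert new ones."""
    for uid, rec in dedupe_batch(batch).items():
        # Overwrite (WHEN MATCHED THEN UPDATE) or insert
        # (WHEN NOT MATCHED THEN INSERT) the latest record.
        account_current[uid] = rec
    return account_current
```

In Spark SQL this corresponds roughly to `MERGE INTO account_current USING deduped_batch ON account_current.user_id = deduped_batch.user_id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *`, which touches only the tens of thousands of keys in the batch rather than rewriting all millions of accounts.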