Exam DP-300 topic 1 question 9 discussion

Actual exam question from Microsoft's DP-300
Question #: 9
Topic #: 1

HOTSPOT
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each StoreID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area: (code-completion template with three selections; image not reproduced)

Suggested Answer:
Box 1: .partitionBy
Example (Scala):
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("Year","Month","Day","Hour","StoreID")
Box 3: .parquet("/Purchases")
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
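Put together, the suggested answer corresponds roughly to the following PySpark write (a minimal sketch; the DataFrame name df, the staging table it reads from, and the append save mode are assumptions, since the hot-area code template is not reproduced here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; in the exam scenario, df already holds the Purchases columns.
df = spark.read.table("purchases_staging")

(df.write
   .partitionBy("Year", "Month", "Day", "Hour", "StoreID")  # Box 2: one folder per store per hour
   .mode("append")                                          # hourly incremental loads append new partitions
   .parquet("/Purchases"))                                  # Box 3: Parquet's columnar compression minimizes storage costs

Each distinct (Year, Month, Day, Hour, StoreID) combination becomes its own directory, e.g. /Purchases/Year=2021/Month=3/Day=14/Hour=9/StoreID=42, so an hourly pipeline only has to touch the folders for that hour.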

Comments

HichemZe
Highly Voted 1 year, 8 months ago
Question for DP-203, not for the DBA exam (DP-300).
upvoted 19 times
valente_sven1
1 year, 7 months ago
Is it correct then?
upvoted 2 times
ladywhiteadder
1 year ago
Answer should be ("StoreID","Year","Month","Day","Hour"), and this is indeed a question from DP-203 (recently passed that one).
upvoted 5 times
Backy
Highly Voted 10 months, 2 weeks ago
.partitionBy ("StoreID", "Year","Month","Day","Hour") or ("StoreID", "Hour") .parquet("/Purchases") // The problem is that ("StoreID", "Year","Month","Day","Hour") and ("Year","Month","Day","Hour", "StoreID") are basically the same // ("StoreID", "Hour") or even better ("StoreID") (not on the list) are also good. The problem is that you would have to keep offset of the last read // I would choose ("StoreID", "Year","Month","Day","Hour") because it is the cleanest
upvoted 5 times
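To illustrate the ordering point made above: the partition columns mainly determine the directory layout, and Spark prunes partitions from the filter under either order, so a per-store hourly read scans a single leaf directory in both cases (a sketch, assuming the /Purchases layout from the suggested answer; the filter values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With .partitionBy("StoreID", "Year", "Month", "Day", "Hour"), this scans only
# /Purchases/StoreID=42/Year=2021/Month=3/Day=14/Hour=9 thanks to partition pruning.
hourly = (spark.read
               .parquet("/Purchases")
               .filter("StoreID = 42 AND Year = 2021 AND Month = 3 AND Day = 14 AND Hour = 9"))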
sincerebb
Most Recent 3 weeks, 5 days ago
Question for DP-203, not for the DBA exam (DP-300).
upvoted 1 times
reachmymind
1 year, 1 month ago
.partitionBy ("Year","Month","Day","Hour","StoreID") .parquet("/Purchases") Correct at the expectation is incremental load pipelines so the smallest partition will be achieved by df.write.partitionBy ("Year","Month","Day","Hour","StoreID") .mode("append") .parquet("/Purchases") as parquet has the least data footprint
upvoted 2 times
Daba
1 year, 3 months ago
IMHO, since the hourly load should vary for each StoreID, it should be "StoreID, Year, Month, Day, Hour".
upvoted 2 times
Cindy_Lo
1 year, 6 months ago
The answer is correct. Reference: https://stackoverflow.com/questions/59278835/pyspark-how-to-write-dataframe-partition-by-year-month-day-hour-sub-directory
upvoted 3 times
learnazureportal
1 year, 6 months ago
The given answer is correct.
upvoted 2 times
o2091
1 year, 7 months ago
Is the answer correct?
upvoted 1 times