Exam DP-300 topic 1 question 9 discussion

Actual exam question from Microsoft's DP-300
Question #: 9
Topic #: 1

HOTSPOT
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each StoreID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area: (code-completion template with three selections; image not reproduced)

Suggested Answer:
Box 1: .partitionBy
Example (Scala):
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("Year","Month","Day","Hour","StoreID")
Box 3: .parquet("/Purchases")
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data
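Put together, the suggested answer corresponds roughly to the following PySpark write (a minimal sketch; the DataFrame name df, the staging table it reads from, and the append save mode are assumptions, since the hot-area code template is not reproduced here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; in the exam scenario, df already holds the Purchases columns.
df = spark.read.table("purchases_staging")

(df.write
   .partitionBy("Year", "Month", "Day", "Hour", "StoreID")  # Box 2: one folder per store per hour
   .mode("append")                                          # hourly incremental loads append new partitions
   .parquet("/Purchases"))                                  # Box 3: Parquet's columnar compression minimizes storage costs

Each distinct (Year, Month, Day, Hour, StoreID) combination becomes its own directory, e.g. /Purchases/Year=2021/Month=3/Day=14/Hour=9/StoreID=42, so an hourly pipeline only has to touch the folders for that hour.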

Comments

HichemZe
Highly Voted 1 year, 8 months ago
Question for DP-203, not for the DBA exam (DP-300).
upvoted 19 times
valente_sven1
1 year, 7 months ago
Is it correct then?
upvoted 2 times
ladywhiteadder
1 year ago
Answer should be ("StoreID","Year","Month","Day","Hour"), and this is indeed a question from DP-203 (recently passed that one).
upvoted 5 times
Backy
Highly Voted 10 months, 2 weeks ago
.partitionBy ("StoreID", "Year","Month","Day","Hour") or ("StoreID", "Hour") .parquet("/Purchases") // The problem is that ("StoreID", "Year","Month","Day","Hour") and ("Year","Month","Day","Hour", "StoreID") are basically the same // ("StoreID", "Hour") or even better ("StoreID") (not on the list) are also good. The problem is that you would have to keep offset of the last read // I would choose ("StoreID", "Year","Month","Day","Hour") because it is the cleanest
upvoted 5 times
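To illustrate the ordering point made above: the partition columns mainly determine the directory layout, and Spark prunes partitions from the filter under either order, so a per-store hourly read scans a single leaf directory in both cases (a sketch, assuming the /Purchases layout from the suggested answer; the filter values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With .partitionBy("StoreID", "Year", "Month", "Day", "Hour"), this scans only
# /Purchases/StoreID=42/Year=2021/Month=3/Day=14/Hour=9 thanks to partition pruning.
hourly = (spark.read
               .parquet("/Purchases")
               .filter("StoreID = 42 AND Year = 2021 AND Month = 3 AND Day = 14 AND Hour = 9"))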
sincerebb
Most Recent 3 weeks, 5 days ago
Question for DP-203, not for the DBA exam (DP-300).
upvoted 1 times
reachmymind
1 year, 1 month ago
.partitionBy ("Year","Month","Day","Hour","StoreID") .parquet("/Purchases") Correct at the expectation is incremental load pipelines so the smallest partition will be achieved by df.write.partitionBy ("Year","Month","Day","Hour","StoreID") .mode("append") .parquet("/Purchases") as parquet has the least data footprint
upvoted 2 times
Daba
1 year, 3 months ago
IMHO, since the hourly load should vary for each StoreID, it should be "StoreID, Year, Month, Day, Hour".
upvoted 2 times
Cindy_Lo
1 year, 6 months ago
The answer is correct. Reference: https://stackoverflow.com/questions/59278835/pyspark-how-to-write-dataframe-partition-by-year-month-day-hour-sub-directory
upvoted 3 times
learnazureportal
1 year, 6 months ago
The given answer is correct.
upvoted 2 times
o2091
1 year, 7 months ago
Is the answer correct?
upvoted 1 times