Exam DP-203 topic 1 question 36 discussion

Actual exam question from Microsoft's DP-203
Question #: 36
Topic #: 1

HOTSPOT
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Suggested Answer:
Box 1: partitionBy
We should overwrite at the partition level.
Example (Scala):
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour", "StoreID")
Box 3: parquet("/Purchases")
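Assembled, the completed code would look like the following (a minimal PySpark sketch; df is assumed to hold the Purchases dataset):

# One directory branch per store, date, and hour, so each hourly incremental
# load per StoreID touches only its own partition; columnar Parquet storage
# keeps costs low.
df.write \
    .partitionBy("StoreID", "Year", "Month", "Day", "Hour") \
    .mode("append") \
    .parquet("/Purchases")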
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data

Comments

Mahesh_mm
Highly Voted 2 years, 11 months ago
Answers are correct
upvoted 24 times
...
Aslam208
Highly Voted 2 years, 11 months ago
correct
upvoted 9 times
...
dgerok
Most Recent 8 months ago
The answer is correct
upvoted 2 times
...
ELJORDAN23
10 months, 2 weeks ago
Got this question on my exam on January 17; the answer is correct.
upvoted 7 times
...
hodashiyam
1 year, 2 months ago
Answers are correct
upvoted 1 times
...
kkk5566
1 year, 2 months ago
correct
upvoted 1 times
...
Rrk07
2 years ago
Why the parquet option? Can anyone explain?
upvoted 3 times
DataSaM
1 year, 4 months ago
I guess because of the requirement to "minimize storage costs".
upvoted 6 times
...
steveo123
1 year, 6 months ago
The solution must minimize storage costs.
upvoted 6 times
...
phydev
1 year, 1 month ago
Because Parquet is always the answer.
upvoted 32 times
...
...
gabrielkuka
2 years ago
Can somebody explain why we are partitioning by StoreID, Year, Month, Day, and Hour instead of just StoreID and Hour?
upvoted 6 times
dduque10
2 years ago
If partitioned by StoreID and hour only, the same hours from different days would go to the same partition, which would be inefficient.
upvoted 41 times
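A hypothetical sketch of the resulting layout (values and paths invented for illustration):

# Two purchases at the same hour on different days:
rows = [(10, 2024, 5, 16, 9),   # StoreID 10, 2024-05-16, Hour 9
        (10, 2024, 5, 17, 9)]   # StoreID 10, 2024-05-17, Hour 9
# With partitionBy("StoreID", "Year", "Month", "Day", "Hour") they land in
# separate directories:
#   /Purchases/StoreID=10/Year=2024/Month=5/Day=16/Hour=9/
#   /Purchases/StoreID=10/Year=2024/Month=5/Day=17/Hour=9/
# With partitionBy("StoreID", "Hour") alone, both would share:
#   /Purchases/StoreID=10/Hour=9/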
...
...
Keerthi24
2 years, 2 months ago
Can someone explain why parquet and not the saveAsTable option?
upvoted 3 times
uira
1 year, 11 months ago
Parquet is columnar, so it's faster to read from Azure Synapse Analytics via CETAS.
upvoted 7 times
hypersam
4 months, 3 weeks ago
saveAsTable saves a Delta table by default, which is Parquet but with an additional _delta_log directory.
upvoted 1 times
...
...
...
Deeksha1234
2 years, 4 months ago
given answers are correct
upvoted 3 times
...
hm358
2 years, 5 months ago
Correct
upvoted 2 times
...
sparkchu
2 years, 8 months ago
The answer should be saveAsTable; the format is defined by the format() method.
upvoted 4 times
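For context, .parquet(path) is shorthand for setting the format and saving to a path, while saveAsTable additionally registers a table in the metastore (a minimal PySpark sketch):

# These two writes are equivalent:
df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").parquet("/Purchases")
df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").format("parquet").save("/Purchases")
# saveAsTable, by contrast, creates a metastore table:
# df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").format("parquet").saveAsTable("Purchases")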
...
assU2
2 years, 10 months ago
Can anyone explain why it's partitioning and not bucketing, please?
upvoted 6 times
bhanuprasad9331
2 years, 9 months ago
There should be a different folder for each store. Partitioning will create a separate folder for each StoreID. In bucketing, multiple stores with the same hash value can be present in the same file, so multiple StoreIDs can end up in a single file.
upvoted 11 times
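A minimal sketch of the contrast in stock Spark (note that bucketBy requires saveAsTable, and Databricks' default Delta tables do not support it):

# Partitioning: one directory per distinct StoreID value.
df.write.partitionBy("StoreID").parquet("/Purchases")
# Bucketing: rows are hashed into a fixed number of bucket files, so several
# StoreIDs can share one file.
df.write.bucketBy(8, "StoreID").sortBy("StoreID").saveAsTable("purchases_bucketed")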
...
KashRaynardMorse
2 years, 7 months ago
The bucketing feature (part of the data skipping index) was removed, and Microsoft recommends using Delta Lake, which uses the partition syntax. https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/dataskipping-index
upvoted 6 times
...
assU2
2 years, 10 months ago
Is it a question of correct syntax (numBuckets is the number of buckets to save) or is it something else?
upvoted 2 times
...
...