Exam DP-203 topic 1 question 36 discussion

Actual exam question from Microsoft's DP-203
Question #: 36
Topic #: 1

HOTSPOT
You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:
✑ ProductID
✑ ItemPrice
✑ LineTotal
✑ Quantity
✑ StoreID
✑ Minute
✑ Month
✑ Hour
✑ Year
✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.
How should you complete the code? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Suggested Answer:
Box 1: partitionBy
We should overwrite at the partition level.
Example (Scala):
df.write.partitionBy("y", "m", "d")
  .mode(SaveMode.Append)
  .parquet("/data/hive/warehouse/db_name.db/" + tableName)
Box 2: ("StoreID", "Year", "Month", "Day", "Hour", "StoreID")
Box 3: parquet("/Purchases")
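Assembled, the completed code would look like the following (a minimal PySpark sketch; df is assumed to hold the Purchases dataset):

# One directory branch per store, date, and hour, so each hourly incremental
# load per StoreID touches only its own partition; columnar Parquet storage
# keeps costs low.
df.write \
    .partitionBy("StoreID", "Year", "Month", "Day", "Hour") \
    .mode("append") \
    .parquet("/Purchases")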
Reference:
https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data

Comments

Mahesh_mm
Highly Voted 2 years, 11 months ago
Answers are correct
upvoted 24 times
...
Aslam208
Highly Voted 2 years, 11 months ago
correct
upvoted 9 times
...
dgerok
Most Recent 8 months ago
The answer is correct
upvoted 2 times
...
ELJORDAN23
10 months, 2 weeks ago
Got this question on my exam on January 17; the answer is correct.
upvoted 7 times
...
hodashiyam
1 year, 2 months ago
Answers are correct
upvoted 1 times
...
kkk5566
1 year, 2 months ago
correct
upvoted 1 times
...
Rrk07
2 years ago
Why the parquet option? Can anyone explain?
upvoted 3 times
DataSaM
1 year, 4 months ago
I guess because of the requirement to "minimize storage costs".
upvoted 6 times
...
steveo123
1 year, 6 months ago
The solution must minimize storage costs.
upvoted 6 times
...
phydev
1 year, 1 month ago
Because Parquet is always the answer.
upvoted 32 times
...
...
gabrielkuka
2 years ago
Can somebody explain why we are partitioning by StoreID, Year, Month, Day, and Hour instead of just StoreID and Hour?
upvoted 6 times
dduque10
2 years ago
If partitioned by StoreID and hour only, the same hours from different days would go to the same partition, which would be inefficient.
upvoted 41 times
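A hypothetical sketch of the resulting layout (values and paths invented for illustration):

# Two purchases at the same hour on different days:
rows = [(10, 2024, 5, 16, 9),   # StoreID 10, 2024-05-16, Hour 9
        (10, 2024, 5, 17, 9)]   # StoreID 10, 2024-05-17, Hour 9
# With partitionBy("StoreID", "Year", "Month", "Day", "Hour") they land in
# separate directories:
#   /Purchases/StoreID=10/Year=2024/Month=5/Day=16/Hour=9/
#   /Purchases/StoreID=10/Year=2024/Month=5/Day=17/Hour=9/
# With partitionBy("StoreID", "Hour") alone, both would share:
#   /Purchases/StoreID=10/Hour=9/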
...
...
Keerthi24
2 years, 2 months ago
Can someone explain why parquet and not the saveAsTable option?
upvoted 3 times
uira
1 year, 11 months ago
Parquet is columnar, so it's faster to read from Azure Synapse Analytics via CETAS.
upvoted 7 times
hypersam
4 months, 3 weeks ago
saveAsTable saves a Delta table by default, which is Parquet but with an additional _delta_log directory.
upvoted 1 times
...
...
...
Deeksha1234
2 years, 4 months ago
given answers are correct
upvoted 3 times
...
hm358
2 years, 5 months ago
Correct
upvoted 2 times
...
sparkchu
2 years, 8 months ago
The answer should be saveAsTable; the format is defined by the format() method.
upvoted 4 times
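For context, .parquet(path) is shorthand for setting the format and saving to a path, while saveAsTable additionally registers a table in the metastore (a minimal PySpark sketch):

# These two writes are equivalent:
df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").parquet("/Purchases")
df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").format("parquet").save("/Purchases")
# saveAsTable, by contrast, creates a metastore table:
# df.write.partitionBy("StoreID", "Year", "Month", "Day", "Hour").format("parquet").saveAsTable("Purchases")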
...
assU2
2 years, 10 months ago
Can anyone explain why it's partitioning and not bucketing, please?
upvoted 6 times
bhanuprasad9331
2 years, 9 months ago
There should be a different folder for each store. Partitioning will create a separate folder for each StoreID. In bucketing, multiple stores with the same hash value can be present in the same file, so multiple StoreIDs can end up in a single file.
upvoted 11 times
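A minimal sketch of the contrast in stock Spark (note that bucketBy requires saveAsTable, and Databricks' default Delta tables do not support it):

# Partitioning: one directory per distinct StoreID value.
df.write.partitionBy("StoreID").parquet("/Purchases")
# Bucketing: rows are hashed into a fixed number of bucket files, so several
# StoreIDs can share one file.
df.write.bucketBy(8, "StoreID").sortBy("StoreID").saveAsTable("purchases_bucketed")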
...
KashRaynardMorse
2 years, 7 months ago
The bucketing feature (part of the data skipping index) was removed, and Microsoft recommends using Delta Lake, which uses the partition syntax. https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/dataskipping-index
upvoted 6 times
...
assU2
2 years, 10 months ago
Is it a question of correct syntax (numBuckets is the number of buckets to save) or is it something else?
upvoted 2 times
...
...