Exam DP-203 topic 2 question 100 discussion

Actual exam question from Microsoft's DP-203
Question #: 100
Topic #: 2

HOTSPOT

You have an Azure Blob storage account that contains a folder. The folder contains 120,000 files. Each file contains 62 columns.

Each day, 1,500 new files are added to the folder.

You plan to incrementally load five data columns from each new file into an Azure Synapse Analytics workspace.

You need to minimize how long it takes to perform the incremental loads.

What should you use to store the files and in which format? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Suggested Answer:

Comments

ababatunde_hs
Highly Voted 2 years, 1 month ago
Time partitioning is correct: it is the fastest way to load only the new files, but it requires that the time-slice information be part of the file or folder name (https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview). Parquet is the correct file format, since it is columnar; a rough sketch of the idea follows this comment.
upvoted 53 times
...
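A minimal PySpark sketch of the time-partitioning idea described above, assuming a hypothetical date-partitioned folder layout; the storage account, container, table, and column names are placeholders, not part of the question:

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout: new files land under ingest/YYYY/MM/DD/, so the folder
# name itself carries the time slice needed for an incremental load.
today = date.today()
daily_path = (
    "abfss://files@examplestorage.dfs.core.windows.net/"
    f"ingest/{today:%Y/%m/%d}/*.parquet"
)

# Parquet is columnar, so selecting 5 of the 62 columns reads only those
# column chunks instead of scanning every row in full.
wanted = ["col_a", "col_b", "col_c", "col_d", "col_e"]  # placeholder names
df = spark.read.parquet(daily_path).select(*wanted)

df.write.mode("append").saveAsTable("staging_incremental_load")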
kkk5566
Highly Voted 1 year, 9 months ago
Time partitioning and parquet
upvoted 9 times
...
suranga4
Most Recent 7 months ago
Answer should be Timeslicing and Parquet
upvoted 1 times
...
MBRSDG
1 year, 2 months ago
Parquet is the answer to the second question. You need to take only 5 columns out of 62: with CSV you would have to scan the file row by row, sequentially. Parquet is more efficient, since selection proceeds column-wise.
upvoted 1 times
MBRSDG
1 year, 2 months ago
Just a note: this answer should be backed by a benchmark. I didn't find any benchmark comparing the performance of the two formats. Logically, I'd expect Parquet to be a lot more efficient, but it would have to be measured in practice (a rough sketch of such a comparison follows this thread).
upvoted 2 times
...
...
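A rough, illustrative micro-benchmark sketch of the comparison mentioned above (not a rigorous test); the file and column names are placeholders:

import time
import pandas as pd

wanted = ["col_a", "col_b", "col_c", "col_d", "col_e"]  # placeholder names

start = time.perf_counter()
# CSV is row-oriented: the parser still scans every row in full, even though
# usecols discards the other 57 columns afterwards.
csv_df = pd.read_csv("wide_sample.csv", usecols=wanted)
print(f"CSV:     {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
# Parquet is columnar: only the requested column chunks are read from disk.
parquet_df = pd.read_parquet("wide_sample.parquet", columns=wanted)
print(f"Parquet: {time.perf_counter() - start:.3f} s")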
vctrhugo
1 year, 11 months ago
You need to minimize how long it takes to perform the incremental loads. With Parquet, which is a columnar format, it is much faster to select a few columns than with CSV.
upvoted 2 times
...
vegeta379
2 years ago
We can do an incremental load of Parquet files only with a Delta table, which is supported by Databricks or Synapse Spark; no such details are given here, so I think it will be CSV. (A rough sketch of that Delta approach follows.)
upvoted 1 times
...
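For reference, a minimal sketch of the Delta-table approach the comment above refers to, assuming a Synapse Spark (or Databricks) pool with Delta Lake available; all paths and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one day's newly arrived Parquet files (placeholder path).
new_files = spark.read.parquet(
    "abfss://files@examplestorage.dfs.core.windows.net/ingest/2024/01/15/*.parquet"
)

# Append only the five needed columns to a Delta table; Delta stores its data
# as Parquet underneath, so the columnar read benefits still apply.
(
    new_files
    .select("col_a", "col_b", "col_c", "col_d", "col_e")  # placeholder names
    .write
    .format("delta")
    .mode("append")
    .save("abfss://files@examplestorage.dfs.core.windows.net/curated/incremental_delta")
)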
pavankr
2 years ago
I think the requirement is to select specific columns, hence CSV?
upvoted 1 times
...
verisdev
2 years ago
It's supposed to be Parquet instead of CSV.
upvoted 5 times
...