
Exam DP-201 topic 2 question 10 discussion

Actual exam question from Microsoft's DP-201
Question #: 10
Topic #: 2

You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.
You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.
What should you recommend?

  • A. Avro
  • B. CSV
  • C. Parquet
  • D. JSON
Suggested Answer: A
The Avro format is well suited to data and message preservation.
An Avro schema, with its support for evolution, is essential for making data robust in streaming architectures such as Kafka, and the metadata the schema provides lets you reason about the data. Because the schema travels with the records, Avro files are self-describing.
References:
http://cloudurable.com/blog/avro/index.html
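As a minimal illustration of the schema-retention point being made here (assuming a Databricks notebook, where the Avro data source is built in; the path and column names below are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Avro embeds its schema in the file, so typed columns come back
    # without any inference step.
    df = spark.read.format("avro").load(
        "abfss://streaming@mylake.dfs.core.windows.net/social/"
    )
    df.printSchema()  # e.g. user_id: long, posted_at: timestamp, text: string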

Comments

felmasri
Highly Voted 4 years, 2 months ago
I think this answer is wrong, since PolyBase does not support Avro. I would pick Parquet.
upvoted 52 times
jms309
Highly Voted 4 years, 2 months ago
I understand that Databricks and PolyBase will consume the data independently, so the Stream Analytics output format must be compatible with both. Since we also need a file format that speeds up distributed queries, the only candidates are Avro and Parquet. And since Avro is not a valid option because PolyBase doesn't support it, the only possible answer is Parquet.
upvoted 15 times
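To make the compatibility argument concrete, here is a hypothetical sketch of the PolyBase side, driven from Python via pyodbc (the server, table, and column names are assumptions, and the external data source is assumed to already exist). CREATE EXTERNAL FILE FORMAT accepts DELIMITEDTEXT, RCFILE, ORC, and PARQUET, but not Avro, which is the commenters' point.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myworkspace.sql.azuresynapse.net;"
        "DATABASE=mypool;UID=loader;PWD=<password>"
    )
    # PARQUET is a supported FORMAT_TYPE; AVRO is not.
    conn.execute("""
        CREATE EXTERNAL FILE FORMAT ParquetFormat
        WITH (FORMAT_TYPE = PARQUET);
    """)
    conn.execute("""
        CREATE EXTERNAL TABLE dbo.SocialStream (
            user_id BIGINT,
            posted_at DATETIME2,
            body NVARCHAR(4000)
        )
        WITH (
            LOCATION = '/social/',
            DATA_SOURCE = MyDataLake,
            FILE_FORMAT = ParquetFormat
        );
    """)
    conn.commit()
    conn.close()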
massnonn
Most Recent 3 years, 6 months ago
For me, the correct answer is Parquet.
upvoted 1 times
dumpi
4 years ago
Parquet is the correct answer; I verified it.
upvoted 3 times
KpKo
4 years ago
Agreed with Parquet
upvoted 2 times
cadio30
4 years ago
Both services use CSV and Parquet as input files, though Parquet is the better candidate for this requirement: it is the recommended file format for Azure Databricks and is also supported by PolyBase.
upvoted 2 times
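And a minimal sketch of the Databricks side (the path and column names are hypothetical): the Parquet files that Stream Analytics writes can be loaded directly, with column types intact.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Parquet stores column types in the file metadata, so no schema
    # inference is needed on read.
    df = spark.read.parquet("abfss://streaming@mylake.dfs.core.windows.net/social/")
    df.printSchema()

    # Column pruning and predicate pushdown keep scans fast.
    df.select("user_id", "posted_at").where("posted_at >= '2021-01-01'").show()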
davita8
4 years, 1 month ago
C. Parquet
upvoted 3 times
maciejt
4 years, 2 months ago
JSON and CSV don't strongly define types, and we need to preserve the data types, so those two are excluded. Parquet is optimized for reads, Avro for writes, and the requirement is fast queries, so Parquet. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
upvoted 7 times
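The type-retention point is easy to demonstrate (a self-contained PySpark sketch; the sample columns are made up): round-tripping the same DataFrame through CSV loses the types, while Parquet preserves them.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 3.14, "hello")], ["id", "score", "text"])

    df.write.mode("overwrite").option("header", True).csv("/tmp/demo_csv")
    df.write.mode("overwrite").parquet("/tmp/demo_parquet")

    # Without explicit inference, every CSV column reads back as string ...
    spark.read.option("header", True).csv("/tmp/demo_csv").printSchema()
    # ... while Parquet returns the original long / double / string types.
    spark.read.parquet("/tmp/demo_parquet").printSchema()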
Nik71
4 years, 2 months ago
It's the Parquet file format.
upvoted 2 times
al9887655
4 years, 2 months ago
The PolyBase support requirement eliminates Avro. Not sure what the right answer is.
upvoted 1 times
H_S
4 years, 2 months ago
Avro is not supported by PolyBase, but why not CSV?
upvoted 1 times
H_S
4 years, 2 months ago
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs - it's Parquet.
upvoted 2 times
kz_data
4 years, 2 months ago
I think Parquet is the right answer.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), other.