Exam DP-201 topic 1 question 7 discussion

Actual exam question from Microsoft's DP-201

Question #: 7
Topic #: 1

HOTSPOT -
You are designing a data processing solution that will run as a Spark job on an HDInsight cluster. The solution will be used to provide near real-time information about online ordering for a retailer.
The solution must include a page on the company intranet that displays summary information.
The summary information page must meet the following requirements:
✑ Display a summary of sales to date grouped by product categories, price range, and review scope.
✑ Display sales summary information including total sales, sales as compared to one day ago and sales as compared to one year ago.
✑ Reflect information for new orders as quickly as possible.
You need to recommend a design for the solution.
What should you recommend? To answer, select the appropriate configuration in the answer area.
Hot Area:

Suggested Answer:

Box 1: DataFrame -

DataFrames -
Best choice in most situations.
Provides query optimization through Catalyst.
Whole-stage code generation.
Direct memory access.
Low garbage collection (GC) overhead.
Not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming.
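
The summary page in this question comes down to grouped aggregations plus a few running totals. As a rough illustration of that computation only, here is a minimal plain-Python sketch. It is not Spark code; the sample orders, the function names, and the reading of "compared to one day ago" as the change in cumulative sales since that point are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample orders: (timestamp, product category, amount).
ORDERS = [
    (datetime(2019, 3, 1, 9, 0), "books", 20.0),
    (datetime(2020, 3, 30, 14, 0), "games", 50.0),
    (datetime(2020, 4, 1, 7, 0), "books", 15.0),
    (datetime(2020, 4, 2, 7, 30), "games", 40.0),
]


def sales_by_category(orders):
    """Sum order amounts per product category."""
    totals = defaultdict(float)
    for _, category, amount in orders:
        totals[category] += amount
    return dict(totals)


def total_sales_as_of(orders, cutoff):
    """Cumulative sales for all orders placed at or before the cutoff."""
    return sum(amount for ts, _, amount in orders if ts <= cutoff)


def summary(orders, now):
    """Build the summary: grouped totals plus change versus earlier points."""
    total = total_sales_as_of(orders, now)
    day_ago = total_sales_as_of(orders, now - timedelta(days=1))
    # Assumption: "one year" is approximated as 365 days.
    year_ago = total_sales_as_of(orders, now - timedelta(days=365))
    return {
        "by_category": sales_by_category(orders),
        "total": total,
        "change_vs_one_day_ago": total - day_ago,
        "change_vs_one_year_ago": total - year_ago,
    }


result = summary(ORDERS, now=datetime(2020, 4, 2, 8, 0))
```

In a Spark job the same grouping and summing would be expressed as DataFrame operations, so that the query optimizer mentioned above can plan them.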

Box 2: parquet -
The best format for performance is parquet with snappy compression, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
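
To illustrate why a columnar layout suits summary queries, here is a minimal plain-Python sketch comparing a row layout with a column layout for a single-field sum. It is a toy model, not Parquet or Spark code: the records, field names, and the "values read" counter are hypothetical and only stand in for the amount of data a reader would have to load.

```python
# Hypothetical order records; field names are illustrative only.
ROWS = [
    {"order_id": 1, "category": "books", "price": 20.0, "review": 4},
    {"order_id": 2, "category": "games", "price": 50.0, "review": 5},
    {"order_id": 3, "category": "books", "price": 15.0, "review": 3},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field (columnar layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_price_row_layout(rows):
    """Sum prices from row storage; models loading every field of every row."""
    values_read = sum(len(row) for row in rows)
    return sum(row["price"] for row in rows), values_read


def total_price_column_layout(columns):
    """Sum prices from columnar storage; models loading only the price column."""
    prices = columns["price"]
    return sum(prices), len(prices)


COLUMNS = to_columns(ROWS)
row_total, row_values_read = total_price_row_layout(ROWS)
col_total, col_values_read = total_price_column_layout(COLUMNS)
```

Both layouts give the same total, but in this model the columnar one touches 3 values instead of 12, which is the kind of saving that matters when a summary page aggregates a few fields over many orders.
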
Incorrect Answers:

DataSets -
Good in complex ETL pipelines where the performance impact is acceptable.
Not good in aggregations where the performance impact can be considerable.

RDDs -
You do not need to use RDDs, unless you need to build a new custom RDD.
No query optimization through Catalyst.
No whole-stage code generation.
High GC overhead.
Reference:
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf

by kempstonjoystick at April 2, 2020, 8:11 a.m.

Comments

kempstonjoystick

Highly Voted 5 years, 3 months ago

The highlighted answer and the explanation differ. It should be DataFrame, I believe.

upvoted 50 times

...

apz333

Highly Voted 5 years, 2 months ago

I think it should be dataframe as well. In most cases parquet and dataframe are the best choice.

upvoted 22 times

frakcha

5 years, 2 months ago

They say Dataset is good for complex ETL situations https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf

upvoted 1 times

...

...

hsetin

Most Recent 3 years, 10 months ago

1. Dataframe 2. Parquet Confirmed

upvoted 1 times

...

mchatrvd

3 years, 11 months ago

Does anyone know why ExamTopics has taken the AWS certification questions offline? Nothing related to the AWS certifications that used to be here is left.

upvoted 1 times

victor90

3 years, 7 months ago

Hi, I found the link to the associate SA exam. https://www.examtopics.com/exams/amazon/aws-certified-solutions-architect-associate-saa-c02/view/

upvoted 1 times

...

...

satyamkishoresingh

3 years, 11 months ago

The practical combination is DataFrame + Parquet. The answer explanation here is ambiguous.

upvoted 1 times

...

HichemZe

3 years, 11 months ago

1 - DataFrame. 2 - Data format = Avro, because only Avro supports streaming (as opposed to Parquet).

upvoted 1 times

...

ismaelrihawi

4 years, 1 month ago

Data abstraction = Dataframe

upvoted 5 times

...

BobFar

4 years, 1 month ago

DataFrame is correct: https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage

upvoted 2 times

...

Deepu1987

4 years, 4 months ago

The selection shown in the display is wrong. It should be DataFrame. [Reasons for elimination from the explanation: "Not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming", and you don't need to use RDDs unless you need to build a new custom RDD.] Anyhow, "Parquet" is selected.

upvoted 1 times

...

syu31svc

4 years, 7 months ago

https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage: "Parquet stores data in columnar format, and is highly optimized in Spark." "DataFrames Best choice in most situations."

upvoted 2 times

...

BaisArun

4 years, 7 months ago

Dataset is not good for aggregation; it should be DataFrame.

upvoted 3 times

...

Nihar258255

4 years, 8 months ago

Can someone correct the answers?

upvoted 1 times

...

AhmedReda

5 years ago

The question needs quick processing, but Dataset adds overhead. The query is also an aggregation, and Dataset is not good at that. DataSets: adds serialization/deserialization overhead, high GC overhead, not good in aggregations where the performance impact can be considerable. DataFrames: best choice in most situations, direct memory access.

upvoted 11 times

...

Runi

5 years ago

Dataset is not good in aggregations where the performance impact can be considerable, so I think DataFrame should be the correct one. Can anyone confirm, please? Thanks.

upvoted 5 times

...

serger

5 years, 1 month ago

dataframe for sure

upvoted 4 times

...

Tombarc

5 years, 2 months ago

I think it's dataframe too.

upvoted 7 times

...