Exam DP-201 topic 1 question 7 discussion

Actual exam question from Microsoft's DP-201
Question #: 7
Topic #: 1

HOTSPOT -
You are designing a data processing solution that will run as a Spark job on an HDInsight cluster. The solution will be used to provide near real-time information about online ordering for a retailer.
The solution must include a page on the company intranet that displays summary information.
The summary information page must meet the following requirements:
✑ Display a summary of sales to date grouped by product categories, price range, and review scope.
✑ Display sales summary information including total sales, sales as compared to one day ago and sales as compared to one year ago.
✑ Reflect information for new orders as quickly as possible.
You need to recommend a design for the solution.
What should you recommend? To answer, select the appropriate configuration in the answer area.
Hot Area:

Suggested Answer:
Box 1: DataFrame -

DataFrames -
Best choice in most situations.
Provides query optimization through Catalyst.
Whole-stage code generation.
Direct memory access.
Low garbage collection (GC) overhead.
Not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming.

Box 2: parquet -
The best format for performance is parquet with snappy compression, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
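
To make the columnar point concrete, here is a toy sketch in plain Python. It is not Parquet or Spark: the order records, field names, and helper functions are all hypothetical, and it only illustrates why a grouped sum such as "sales by product category" benefits from a layout in which each field is stored contiguously.

```python
from collections import defaultdict

# Hypothetical order records; the field names are illustrative only.
ROWS = [
    {"order_id": 1, "category": "books", "price": 12.5},
    {"order_id": 2, "category": "games", "price": 59.0},
    {"order_id": 3, "category": "books", "price": 7.5},
    {"order_id": 4, "category": "games", "price": 41.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def sales_by_category(columns):
    """Sum prices per category, reading only the two columns the query needs."""
    totals = defaultdict(float)
    for category, price in zip(columns["category"], columns["price"]):
        totals[category] += price
    return dict(totals)


columns = to_columns(ROWS)
totals = sales_by_category(columns)
print(totals)  # {'books': 20.0, 'games': 100.0}
```

The aggregation never touches the `order_id` column. A columnar file format can exploit the same idea at the storage level by skipping columns a query does not reference, which is one reason it tends to suit summary queries like the ones in this question.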
Incorrect Answers:

DataSets -
Good in complex ETL pipelines where the performance impact is acceptable.
Not good in aggregations where the performance impact can be considerable.

RDDs -
You do not need to use RDDs, unless you need to build a new custom RDD.
No query optimization through Catalyst.
No whole-stage code generation.
High GC overhead.
Reference:
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf

Comments

kempstonjoystick
Highly Voted 5 years, 1 month ago
The highlighted answer and the explanation differ. It should be DataFrame, I believe.
upvoted 50 times
...
apz333
Highly Voted 5 years ago
I think it should be dataframe as well. In most cases parquet and dataframe are the best choice.
upvoted 22 times
frakcha
4 years, 11 months ago
They say Dataset is good for complex ETL situations: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf
upvoted 1 times
...
...
hsetin
Most Recent 3 years, 8 months ago
1. Dataframe 2. Parquet Confirmed
upvoted 1 times
...
mchatrvd
3 years, 8 months ago
Does anyone know why ExamTopics has taken the AWS certification questions offline? The AWS certification content that used to be here is gone.
upvoted 1 times
victor90
3 years, 5 months ago
Hi, I found the link to the associate SA exam. https://www.examtopics.com/exams/amazon/aws-certified-solutions-architect-associate-saa-c02/view/
upvoted 1 times
...
...
satyamkishoresingh
3 years, 8 months ago
The practical combination is DataFrame + Parquet. The answer explanation here is ambiguous.
upvoted 1 times
...
HichemZe
3 years, 9 months ago
1. DataFrame 2. Data format = Avro, because only Avro supports streaming (unlike Parquet).
upvoted 1 times
...
ismaelrihawi
3 years, 11 months ago
Data abstraction = Dataframe
upvoted 5 times
...
BobFar
3 years, 11 months ago
DataFrame is correct: https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage
upvoted 2 times
...
Deepu1987
4 years, 2 months ago
The selection shown in the display is wrong. It should actually be DataFrame. [Reasons for eliminating the others: DataFrames are not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming; you don't need to use RDDs unless you need to build a new custom RDD.] In any case, "Parquet" is selected.
upvoted 1 times
...
syu31svc
4 years, 4 months ago
https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage: "Parquet stores data in columnar format, and is highly optimized in Spark." "DataFrames Best choice in most situations."
upvoted 2 times
...
BaisArun
4 years, 5 months ago
Dataset is not good for aggregation. It should be DataFrame.
upvoted 3 times
...
Nihar258255
4 years, 5 months ago
Can someone correct the answers?
upvoted 1 times
...
AhmedReda
4 years, 10 months ago
The question needs quick processing, but Dataset adds overhead. The query is also an aggregation, and Dataset is not good at that. DataSets: adds serialization/deserialization overhead, high GC overhead, not good in aggregations where the performance impact can be considerable. DataFrames: best choice in most situations, direct memory access.
upvoted 11 times
...
Runi
4 years, 10 months ago
Dataset is not good in aggregations where the performance impact can be considerable, so I think DataFrame should be the correct one. Can anyone confirm, please? Thanks.
upvoted 5 times
...
serger
4 years, 11 months ago
dataframe for sure
upvoted 4 times
...
Tombarc
5 years ago
I think it's dataframe too.
upvoted 7 times
...