
Exam DP-200 topic 5 question 1 discussion

Actual exam question from Microsoft's DP-200
Question #: 1
Topic #: 5

You manage a process that performs analysis of daily web traffic logs on an HDInsight cluster. Each of the 250 web servers generates approximately 10 megabytes (MB) of log data each day. All log data is stored in a single folder in Microsoft Azure Data Lake Storage Gen2.
You need to improve the performance of the process.
Which two changes should you make? Each correct answer presents a complete solution.
NOTE: Each correct selection is worth one point.

  • A. Combine the daily log files for all servers into one file
  • B. Increase the value of the mapreduce.map.memory parameter
  • C. Move the log files into folders so that each day's logs are in their own folder
  • D. Increase the number of worker nodes
  • E. Increase the value of the hive.tez.container.size parameter
Suggested Answer: AC
A: Typically, analytics engines such as HDInsight and Azure Data Lake Analytics have a per-file overhead. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256MB to 100GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100GB in size.
C: For Hive workloads, partition pruning of time-series data can help some queries read only a subset of the data which improves performance.
Pipelines that ingest time-series data often use a very structured naming convention for files and folders. Below is a common example for data that is organized by date:
\DataSet\YYYY\MM\DD\datafile_YYYY_MM_DD.tsv
Notice that the datetime information appears both as folders and in the filename.
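To illustrate both suggested changes together, here is a minimal local Python sketch (the function name and paths are hypothetical, not from the exam or the referenced docs) that aggregates the many small per-server log files for one day into a single file stored in its own \YYYY\MM\DD folder:

```python
import os
from datetime import date

def combine_daily_logs(source_dir: str, dataset_root: str, day: date) -> str:
    """Combine all per-server log files for one day (option A) into a
    single file placed in a per-day folder (option C)."""
    # Build the DataSet\YYYY\MM\DD folder for this day.
    out_dir = os.path.join(dataset_root, f"{day:%Y}", f"{day:%m}", f"{day:%d}")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, f"datafile_{day:%Y_%m_%d}.tsv")
    with open(out_path, "w") as out:
        # One small input file per web server; concatenate in a stable order.
        for name in sorted(os.listdir(source_dir)):
            with open(os.path.join(source_dir, name)) as src:
                out.write(src.read())
    return out_path
```

With 250 servers at roughly 10 MB each, this yields one ~2.5 GB file per day, which sits comfortably in the 256 MB to 100 GB range the guidance recommends, and the per-day folders enable partition pruning for Hive queries.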
References:
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance

Comments

Cassielovedata
Highly Voted 4 years, 8 months ago
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance The linked guidance states that the cluster-level optimizations apply only to I/O-intensive jobs. The job in this question is closer to CPU- and memory-intensive, because the files are small; the slowness comes from processing each web server's files separately, since HDInsight and Azure Data Lake Analytics have a per-file overhead, and storing data as many small files negatively affects performance. Coming back to the question, only the options that restructure the data address this, so A and C are the answer.
upvoted 18 times
akram786
Most Recent 4 years, 3 months ago
A and C are correct
upvoted 2 times
mohowzeh
4 years, 5 months ago
My 2 cents: C and D.
Why C? Storing each day's files in a separate folder increases concurrency.
Why not A? Read literally, A advocates creating one single file that holds the log data of all 250 servers at the same time. Since one file cannot be stored in more than one folder, this contradicts C. The question also mentions "performs analysis", so optimising the analysis itself is a valid part of the answer; since no tools are named, a fair assumption is that it runs within the HDInsight realm.
Why not B? This parameter addresses memory problems, so it is not relevant for performance optimisation (https://dzone.com/articles/configuring-memory-for-mapreduce-running-on-yarn).
Why not E? If anything, this parameter should be reduced, not increased (https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-hive).
Why D? Scaling out worker nodes is mentioned at https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-optimize-hive-query
upvoted 3 times
Ab5381
4 years, 5 months ago
A does not contradict C. A is about combining all 250 files for a day into one; C is about keeping that one combined file in its own folder. The next day's combined file should go in a new folder, not the previous day's folder.
upvoted 4 times
syu31svc
4 years, 6 months ago
From the link provided: "In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream applications." -> This supports A.
Also: "Again, the choice you make with the folder and file organization should optimize for the larger file sizes and a reasonable number of files in each folder." -> This supports C.
upvoted 4 times
RajatNaik
4 years, 10 months ago
Answer should be C & E
upvoted 3 times
JPaul
4 years, 10 months ago
I don't think this is the right answer.
upvoted 3 times
Community vote distribution: A (35%), C (25%), B (20%), Other