Exam DP-100 topic 2 question 49 discussion

Actual exam question from Microsoft's DP-100

Question #: 49
Topic #: 2

HOTSPOT -
You are performing a classification task in Azure Machine Learning Studio.
You must prepare balanced testing and training samples based on a provided data set.
You need to split the data with a 0.75:0.25 ratio.
Which value should you use for each parameter? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area: (answer-area image not reproduced)

Suggested Answer:

Box 1 (Splitting mode): Split rows
Use the Split Rows option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50.
You can also randomize the selection of rows in each group, and use stratified sampling. In stratified sampling, you must select a single column of data for which you want values to be apportioned equally among the two result datasets.

Box 2 (Fraction of rows in the first output dataset): 0.75
If you specify a number as a percentage, or if you use a string that contains the "%" character, the value is interpreted as a percentage. All percentage values must be within the range (0, 100), not including the values 0 and 100.

Box 3 (Randomized split): Yes
To ensure the splits are balanced.

Box 4 (Stratified split): No
If you use the option for a stratified split, the output datasets can be further divided by subgroups, by selecting a strata column.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
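
To make the Split Rows description above concrete, here is a minimal Python sketch of a randomized 0.75:0.25 split that is stratified on one column. It is not the Split Data module's implementation: the function name stratified_split, its parameters, and the sample rows are hypothetical and only illustrate how each value of the strata column is apportioned across both outputs.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, fraction=0.75, seed=0):
    """Split rows into two lists so each value of `key` keeps roughly
    the same share in both outputs (hypothetical helper, not an Azure API)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    first, second = [], []
    for members in groups.values():
        rng.shuffle(members)                     # randomized selection within each stratum
        cut = round(len(members) * fraction)     # rows of this stratum sent to the first output
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second

# Assumed sample data: 80 rows of class "a" and 20 rows of class "b".
rows = [{"id": i, "label": "a" if i < 80 else "b"} for i in range(100)]
train, test = stratified_split(rows, key="label", fraction=0.75, seed=42)
print(len(train), len(test))
```

In scikit-learn, comparable behavior is available through train_test_split with its stratify argument.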

by Yong2020 at May 20, 2020, 9:56 a.m.

Comments

Yong2020

Highly Voted 4 years, 2 months ago

Stratified split should be true in order to be balanced

upvoted 66 times

snegnik

1 year, 2 months ago

Stratified split - TRUE. From the documentation: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender)." https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/split-data?view=azureml-api-2

upvoted 4 times

JUEI

4 years ago

I would tend to go with this answer too, because it doesn't make sense that a randomized split alone could ensure the testing and training samples are balanced. Since it performs a random selection, some target values might end up over- or under-represented in one of the sets.

upvoted 6 times

SnowCheetah

3 years, 1 month ago

I agree https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/split-data

upvoted 3 times

dija123

2 years, 7 months ago

Totally agree

upvoted 2 times

Andrexx

Highly Voted 3 years, 8 months ago

In my opinion, stratified split should be true. Based on this: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome, or with regard to some other column such as gender." https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows

upvoted 13 times

PI_Team

Most Recent 1 year ago

When performing data splitting, especially in scenarios where there is an imbalance in the distribution of certain categories or groups within a column, using a strata column can be beneficial. By grouping the rows based on the strata column, it helps ensure that each subset (e.g., training or testing) maintains a similar representation of the different categories or groups present in the strata column. So: Stratified split - TRUE. SaM

upvoted 5 times

MohammadKhubeb

2 years, 6 months ago

Stratified split should be FALSE, because this is not an imbalanced dataset. Please refer to: Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.

upvoted 3 times

azure1000

3 years ago

In a classification setting, stratified sampling is often chosen to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set. As a result, if the data set has a large number of samples of each class, stratified sampling is pretty much the same as random sampling. But if one class is poorly represented in the data set, which may be the case in your dataset since you plan to oversample the minority class, then stratified sampling may yield a different target class distribution in the train and test sets than random sampling would.

upvoted 1 times
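
The difference azure1000 describes can be sketched in a few lines of Python. This is only an illustration under assumed numbers (a hypothetical target with 380 negatives and 20 positives, split 0.75:0.25); the helper names are made up for this example. With a plain random split, the number of minority rows in the test part depends on the seed, while splitting each class separately fixes it.

```python
import random

# Assumed imbalanced target: 380 negatives, 20 positives (5% minority).
labels = [0] * 380 + [1] * 20

def random_test_positives(labels, fraction, seed):
    """Count minority rows that land in the test part of a plain random split."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return sum(shuffled[cut:])

def stratified_test_positives(labels, fraction, seed):
    """Count minority rows in the test part when each class is split separately."""
    rng = random.Random(seed)
    count = 0
    for value in set(labels):
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        cut = round(len(idx) * fraction)
        count += sum(labels[i] for i in idx[cut:])
    return count

random_counts = [random_test_positives(labels, 0.75, s) for s in range(10)]
stratified_counts = [stratified_test_positives(labels, 0.75, s) for s in range(10)]
print("random:    ", random_counts)      # may differ from seed to seed
print("stratified:", stratified_counts)  # 5 of the 20 positives every time
```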

Lucario95

3 years, 1 month ago

I think Stratified split should be set to True for balancing the 2 subsets, no info about Random Splitting True/False though...

upvoted 3 times

Padilha

1 year, 6 months ago

Normally you don't need to stratify if the dataset is balanced. It's the same in sklearn

upvoted 1 times

rishi_ram

3 years, 2 months ago

Based on the requirement below, I would say Stratified split should be False. "Additional requirements for stratified sampling: The strata column can contain only nominal or categorical data. If the column contains continuous numeric data, an error message is raised. A column with too many unique values is not a good candidate for stratification. You might try collapsing some categories or grouping values beforehand."

upvoted 1 times
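
The suggestion in the quoted requirement to group values beforehand might look like the following Python sketch. It assumes a hypothetical numeric "age" column and made-up bucket edges and names; it only shows how a continuous column can be collapsed into a few categories that could then serve as a strata column.

```python
from collections import Counter

def to_bucket(value, edges, names):
    """Return the name of the first bucket whose upper edge is >= value,
    or the last name when the value exceeds every edge."""
    for edge, name in zip(edges, names):
        if value <= edge:
            return name
    return names[-1]

# Assumed example data and bucket boundaries (not taken from the question).
ages = [19, 23, 31, 38, 45, 52, 67, 71]
edges = [29, 49]                                  # upper edges of the first two buckets
names = ["under_30", "30_to_49", "50_plus"]

strata = [to_bucket(a, edges, names) for a in ages]
print(Counter(strata))
```

Three buckets instead of eight distinct ages gives each stratum enough rows to be divided across both output datasets.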

Tusharsp

3 years, 2 months ago

Contrary to all the comments here, stratified split should be False. For it to be set to True, we need to select the column on which the data is stratified, and the question mentions no such column. In this scenario, just saying to set it to True does not make sense. Also, for people following the documentation, here is the extract: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender)."

upvoted 10 times

poons

3 years, 4 months ago

Since the dataset is balanced, Stratified split = False might work.

upvoted 3 times

PremPatrick

1 year, 8 months ago

It didn't say the dataset is balanced... it said you have to balance it.

upvoted 1 times

saurabhk1

3 years, 5 months ago

I too think Stratified should be set to True.

upvoted 2 times

Paa_Kwesi

3 years, 8 months ago

Stratified split should be set to True. "With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome, or with regard to some other column such as gender." https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows

upvoted 6 times

DanielGP

3 years, 6 months ago

You are absolutely right. For "balanced" set -> "Stratified" must be true

upvoted 2 times

Pucha

3 years, 8 months ago

Stratified should be true

upvoted 2 times

podval

4 years ago

Stratified means UNbalanced as it keeps the ratio between classes. That is why it needs to be set to False.

upvoted 4 times

dev2dev

3 years, 4 months ago

Where did you get the idea that stratified means unbalanced? Strata means layers/groups. In ML we stratify data to make sure the class balance is maintained.

upvoted 3 times

Gitty

3 years, 11 months ago

Correct. Stratified should be False.

upvoted 2 times

kath3624

4 years, 1 month ago

Because the question asks for balanced testing and training samples, Stratified should be set to True. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows

upvoted 8 times

jsnels86

4 years, 2 months ago

Stratified split is correct here

upvoted 4 times

ajithvajrala

4 years, 2 months ago

I too think Stratified should be set to True.

upvoted 6 times
