Exam DP-100 topic 2 question 49 discussion

Actual exam question from Microsoft's DP-100
Question #: 49
Topic #: 2

HOTSPOT -
You are performing a classification task in Azure Machine Learning Studio.
You must prepare balanced testing and training samples based on a provided data set.
You need to split the data with a 0.75:0.25 ratio.
Which value should you use for each parameter? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:

Suggested Answer:
Box 1: Split rows -
Use the Split Rows option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split, but by default, the data is divided 50-50.
You can also randomize the selection of rows in each group, and use stratified sampling. In stratified sampling, you must select a single column of data for which you want values to be apportioned equally among the two result datasets.

Box 2: 0.75 -
If you specify a number as a percentage, or if you use a string that contains the "%" character, the value is interpreted as a percentage. All percentage values must be within the range (0, 100), not including the values 0 and 100.

Box 3: Yes -
To ensure splits are balanced.

Box 4: No -
If you use the option for a stratified split, the output datasets can be further divided by subgroups, by selecting a strata column.
Reference:
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data
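To make the 0.75:0.25 stratified split described above concrete, here is a minimal sketch in plain Python. It is not the Split Data module itself and does not claim to match its internals; the function name `stratified_split`, the `key` callback, and the sample data are all hypothetical, chosen only to illustrate how splitting each stratum separately keeps the class mix the same in both outputs.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, key, fraction, seed=0):
    """Split each stratum separately so both outputs keep the class mix.

    Hypothetical helper for illustration; not the Azure ML implementation.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)      # group rows by strata value
    first, second = [], []
    for members in strata.values():
        rng.shuffle(members)              # randomize within the stratum
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second

# Hypothetical data: 1000 rows, 10% labelled "pos", 90% labelled "neg".
rows = [(i, "pos" if i % 10 == 0 else "neg") for i in range(1000)]

train, test = stratified_split(rows, key=lambda r: r[1], fraction=0.75)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
print(len(train), len(test), train_counts, test_counts)
```

Under these assumptions both outputs keep the 10% share of "pos" rows, which is the behaviour the documentation describes as each output dataset getting roughly the same percentage of each target value.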

Comments

Yong2020
Highly Voted 4 years ago
Stratified split should be true in order to be balanced
upvoted 66 times
snegnik
1 year ago
Stratified split: TRUE. From the documentation: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender)." https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/split-data?view=azureml-api-2
upvoted 4 times
...
JUEI
3 years, 10 months ago
I would tend to go with this answer too, because it doesn't make sense that a randomized split alone could ensure the testing and training samples are balanced. Since it performs a random selection, some target values might happen to end up over- or under-represented in one of the splits.
upvoted 6 times
...
SnowCheetah
2 years, 11 months ago
I agree https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-module-reference/split-data
upvoted 3 times
...
dija123
2 years, 5 months ago
Totally agree
upvoted 2 times
...
...
Andrexx
Highly Voted 3 years, 6 months ago
In my opinion, stratified split should be true. Based on this: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome, or with regard to some other column such as gender." https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows
upvoted 13 times
...
PI_Team
Most Recent 10 months, 4 weeks ago
When performing data splitting, especially in scenarios where there is an imbalance in the distribution of certain categories or groups within a column, using a strata column can be beneficial. Grouping the rows by the strata column helps ensure that each subset (e.g., training or testing) maintains a similar representation of the different categories or groups present in that column. So Stratified split: TRUE. SaM
upvoted 5 times
...
MohammadKhubeb
2 years, 4 months ago
Stratified split should be FALSE, because this is not an imbalanced dataset. Please refer to: Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of Artificial Intelligence Research 16 (2002): 321-357.
upvoted 3 times
...
azure1000
2 years, 10 months ago
In a classification setting, stratification is often chosen to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set. As a result, if the data set has a large number of samples of each class, stratified sampling is pretty much the same as random sampling. But if one class is poorly represented in the data set, which may be the case in your dataset since you plan to oversample the minority class, then stratified sampling may yield a different target class distribution in the train and test sets than what random sampling may yield.
upvoted 1 times
...
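The point above about a poorly represented class can be checked with a small simulation. This is only a sketch under assumed numbers: the helper `minority_in_test`, the 2% minority rate, and the 200 seeds are all hypothetical. It counts how many minority rows land in the 25% test part of a purely random split, repeated over many seeds.

```python
import random

def minority_in_test(labels, fraction, seed):
    """Count "pos" labels landing in the test part of a purely random split.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)   # first `fraction` goes to training
    return shuffled[cut:].count("pos")

# Hypothetical data: 20 minority rows out of 1000 (a 2% minority class).
labels = ["pos"] * 20 + ["neg"] * 980

counts = [minority_in_test(labels, 0.75, seed) for seed in range(200)]

# A stratified 0.75:0.25 split would put round(20 * 0.25) = 5 minority rows
# in the test set on every run; the random split only does so on average.
print("min:", min(counts), "max:", max(counts),
      "mean:", sum(counts) / len(counts))
```

With a large, evenly represented class the spread between runs becomes small relative to the class size, which is why random and stratified splits look nearly the same in that case.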
Lucario95
2 years, 12 months ago
I think Stratified split should be set to True for balancing the 2 subsets, no info about Random Splitting True/False though...
upvoted 3 times
Padilha
1 year, 4 months ago
Normally you don't need to stratify if the dataset is balanced. It's the same in sklearn
upvoted 1 times
...
...
rishi_ram
3 years ago
Based on the requirement below, I would say Stratified split should be False. "Additional requirements for stratified sampling: The strata column can contain only nominal or categorical data. If the column contains continuous numeric data, an error message is raised. A column with too many unique values is not a good candidate for stratification. You might try collapsing some categories or grouping values beforehand."
upvoted 1 times
...
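The suggestion above to collapse categories before stratifying could look something like the following. This is a sketch only: `collapse_rare`, the `min_count` threshold, and the city values are hypothetical, and the real module may handle high-cardinality columns differently.

```python
from collections import Counter

def collapse_rare(values, min_count, other="other"):
    """Replace categories seen fewer than `min_count` times with one bucket.

    Hypothetical preprocessing helper, not part of any Azure ML API.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Hypothetical strata column with two very rare categories.
cities = ["NY"] * 50 + ["LA"] * 30 + ["Boise"] * 2 + ["Reno"] * 1

strata = collapse_rare(cities, min_count=5)
print(Counter(strata))
```

The collapsed column has fewer unique values, so each stratum is large enough to be divided between the two output datasets.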
Tusharsp
3 years ago
Contrary to all comments here, stratified split should be False. For it to be set to True, we need to select the column on which the data is to be stratified. There is no mention of any column in the question. In this scenario, just saying to set it to True does not make sense. Also, for people following the documentation, here is the extract: "Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender)."
upvoted 10 times
...
poons
3 years, 2 months ago
Since the dataset is balanced, Stratified split = False might work.
upvoted 3 times
PremPatrick
1 year, 6 months ago
It didn't say it is balanced... it said you have to balance it
upvoted 1 times
...
...
saurabhk1
3 years, 3 months ago
I too think Stratified should be set to True.
upvoted 2 times
...
Paa_Kwesi
3 years, 6 months ago
Stratified split should be set to True. With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome, or with regard to some other column such as gender. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows
upvoted 6 times
DanielGP
3 years, 4 months ago
You are absolutely right. For "balanced" set -> "Stratified" must be true
upvoted 2 times
...
...
Pucha
3 years, 6 months ago
Stratified should be true
upvoted 2 times
...
podval
3 years, 11 months ago
Stratified means UNbalanced as it keeps the ratio between classes. That is why it needs to be set to False.
upvoted 4 times
dev2dev
3 years, 2 months ago
Where did you get the idea that stratified means unbalanced? Strata means layers/groups. In ML we stratify data to make sure the data balance is maintained.
upvoted 3 times
...
Gitty
3 years, 9 months ago
Correct. Stratified should be False.
upvoted 2 times
...
...
kath3624
3 years, 11 months ago
Because the testing and training samples must be balanced, Stratified should be set to True. https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/split-data-using-split-rows
upvoted 8 times
...
jsnels86
4 years ago
Stratified split is correct here
upvoted 4 times
...
ajithvajrala
4 years ago
I too think Stratified should be set to True
upvoted 6 times
...