
Exam AWS Certified Machine Learning - Specialty topic 1 question 59 discussion

A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible.
Which sequence of steps should the data scientist take to meet these requirements?

  • A. Apply random sampling to the dataset. Then split the dataset into training, validation, and test sets.
  • B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets.
  • C. Rescale the dataset. Then split the dataset into training, validation, and test sets.
  • D. Split the dataset into training, validation, and test sets. Then rescale the training set, the validation set, and the test set independently.
Suggested Answer: B 🗳️

Comments

cron0001
Highly Voted 3 years ago
Selected Answer: C
C would be my answer here. Rescaling each set independently could lead to strange skews. The training set, test set, and evaluation set should be on the same scale.
upvoted 17 times
GiyeonShin
2 years, 4 months ago
You're right that the test set and validation set should be rescaled to the same scale. But the scaling values should be derived only from statistics of the training data. I think C means the rescaling stage is affected by values from the whole dataset (including the validation and test sets), so I think B is correct.
upvoted 9 times
...
...
masoa3b
Highly Voted 2 years, 6 months ago
Selected Answer: B
https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data C also leads to data leakage. You are using the test data to scale everything, so the test set influences the scaling used when you build the model on the training set and check it against the validation set.
upvoted 17 times
...
ML_2
Most Recent 8 months, 3 weeks ago
Selected Answer: B
If you rescale all the data first, you cause data leakage by exposing the variance of the entire dataset during training. The rescaling needs to happen after splitting the data, not before.
upvoted 1 times
...
Denise123
1 year, 2 months ago
Selected Answer: B
The best practice is to split the dataset into training, validation, and test sets first, and then rescale the training set and apply the SAME scaling to the validation and test sets. This ensures that the scaling parameters (e.g., mean and standard deviation for standardization, or min and max values for min-max scaling) are calculated only from the training set, preventing data leakage and maintaining the integrity of the evaluation process. By following this approach, you prevent information from the validation and test sets from influencing the scaling parameters, which could lead to data leakage and overestimation of model performance. Keeping the scaling consistent across all subsets ensures a fair evaluation of the model's generalization performance on new, unseen data.
upvoted 5 times
...
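The workflow described in the comments above (split first, fit the scaler on the training set only, reuse it on validation and test) can be sketched with scikit-learn. The toy data and split ratios below are assumptions for illustration, not part of the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: two features whose dispersion differs by orders of magnitude
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 1000), rng.normal(0, 1000, 1000)])
y = rng.integers(0, 2, 1000)

# Step 1: split FIRST (60% train, 20% validation, 20% test)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Step 2: fit the scaling parameters on the training set ONLY...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME transform to all three sets
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```

`StandardScaler.fit` stores the training-set mean and standard deviation, so `transform` applies identical parameters to every set; a `MinMaxScaler` would follow the same fit-on-train pattern.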
phdykd
1 year, 3 months ago
Answer is B. The other options have shortcomings: A: Random sampling is a good practice, but it doesn't address the issue of feature scaling. Also, rescaling should occur after splitting the data. C: Rescaling the entire dataset before splitting could lead to data leakage, where information from the validation/test sets inadvertently influences the training process. D: Rescaling the sets independently would lead to inconsistencies in scale across the training, validation, and test sets, which could negatively impact model performance and evaluation.
upvoted 2 times
...
Sukhi4fornet
1 year, 4 months ago
OPTION C. Rescale the dataset. Then split the dataset into training, validation, and test sets. Explanation: Rescaling the dataset: This is the first step to address the varying statistical dispersion among features. By rescaling, you ensure that all features are on a similar scale, which is important for many machine learning algorithms. Splitting into training, validation, and test sets: After rescaling, the dataset is split into training, validation, and test sets. This ensures that the model is trained on one set, validated on another set, and tested on a third set. This separation helps evaluate the model's performance on unseen data. Option C ensures that the rescaling is applied before splitting the data, ensuring consistency in the scaling across different sets. This approach prevents data leakage and provides a more accurate representation of how the model will perform on new, unseen data.
upvoted 1 times
...
akgarg00
1 year, 5 months ago
Selected Answer: B
The validation and test sets should be scaled with the parameters used to scale the training set. Scaling the test set independently would mean the model drifts much more quickly in production, and it is not recommended practice in data science.
upvoted 1 times
...
elvin_ml_qayiran25091992razor
1 year, 5 months ago
Selected Answer: B
B is correct: fit the scaler on the training set and apply it to the other sets to prevent data leakage.
upvoted 1 times
...
akgarg00
1 year, 5 months ago
Selected Answer: B
Answer B. C is not good data science practice.
upvoted 1 times
...
DimLam
1 year, 6 months ago
Selected Answer: B
We need to split the data first to avoid data leakage from the test/validation sets, then rescale the data in all sets using statistics from the training set.
upvoted 1 times
...
DavidRou
1 year, 7 months ago
Selected Answer: B
I think the right answer here is B. We need to split the dataset into training, validation, and test sets. Then we can fit the scaling (using some technique) only on data contained in the training set. Data belonging to the validation and test sets must be scaled using the parameters fitted on the training set. For example, if we want to apply standardization, we can compute the mean and standard deviation only on the training set, as we are not allowed to use statistics computed on the validation/test sets. We must act as if we don't own those data!
upvoted 2 times
...
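DavidRou's standardization example can be made concrete with plain NumPy. The synthetic arrays below are assumptions; the sketch contrasts option B (training statistics reused everywhere) with option D (each set standardized with its own statistics):

```python
import numpy as np

# Hypothetical data drawn from the same distribution
rng = np.random.default_rng(1)
train = rng.normal(50, 10, size=(100, 1))
test = rng.normal(50, 10, size=(20, 1))

# Option B: statistics come from the training set only
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_b = (train - mu) / sigma
test_b = (test - mu) / sigma          # same mu/sigma reused on the test set

# Option D (wrong): the test set standardized with its own statistics
test_d = (test - test.mean(axis=0)) / test.std(axis=0)

# test_b keeps the test data on the training scale; test_d silently
# shifts it onto a different scale, which distorts evaluation.
```

In production, a single incoming row has no "set" statistics at all, so option D is not even computable row by row.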
Mickey321
1 year, 8 months ago
Selected Answer: B
option B
upvoted 1 times
...
kaike_reis
1 year, 9 months ago
Data Science 101: (A) Given the question, doesn't solve the magnitude problem. (B) Correct (C) Data Leakage (D) It's not correct, still data leakage.
upvoted 1 times
...
gusta_dantas
1 year, 9 months ago
Tricky question, but D, definitely! B: you can't apply the same scaling to the validation and test sets because you may suffer data leakage! C: you shouldn't rescale the whole dataset and then split it into training, validation, and test sets; it's not good practice and may suffer data leakage as well. D: you first split the whole dataset and then apply rescaling individually, preventing any data leakage, and each set is rescaled based on its own statistics.
upvoted 1 times
DavidRou
1 year, 7 months ago
Theoretically, you should not have test set data available at training time (when you're doing the scaling), so how would you do that? What if you don't have an entire test set, but instead receive each new row one at a time?
upvoted 1 times
...
kaike_reis
1 year, 9 months ago
but you are leaking information from validation samples between themselves.
upvoted 1 times
...
...
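The row-at-a-time scenario DavidRou raises is exactly why the training-set statistics must be persisted alongside the model and applied to each incoming observation. A minimal sketch, with hypothetical per-feature values:

```python
import numpy as np

# Statistics computed once, on the training set, and persisted with the model
# (the numbers here are hypothetical placeholders)
train_mean = np.array([3.2, 4500.0])   # per-feature means from training data
train_std = np.array([1.1, 900.0])     # per-feature stds from training data

def scale_row(row):
    """Apply the stored training-set scaling to one incoming production row."""
    return (np.asarray(row, dtype=float) - train_mean) / train_std

# A single new observation can be scaled without any "test set" statistics
scaled = scale_row([4.3, 5400.0])
```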
JK1977
1 year, 11 months ago
Selected Answer: B
From Bing chat (and it makes complete sense): "Based on the search results, I think the best sequence of steps for the data scientist to take is B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets. This sequence of steps ensures that the data scientist can evaluate the model performance on different subsets of data that have not been used for training or tuning. It also ensures that the data scientist can rescale the features to have a common scale without introducing any data leakage from the validation or test sets. Rescaling the features can help improve the accuracy of some machine learning algorithms that are sensitive to the magnitude or distribution of the data, such as distance-based methods or gradient-based methods."
upvoted 3 times
...
tommct
1 year, 11 months ago
Selected Answer: B
You want to measure how the model performs on new data. Scaling with the test set is a no-no.
upvoted 1 times
...
GOSD
2 years ago
B or D; I don't understand the semantics of "independently" and the effect it would have. It's definitely not done before splitting, because of data leakage. https://www.linkedin.com/pulse/feature-scaling-dataset-spliting-arnab-mukherjee/
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other