
Exam AWS Certified Machine Learning - Specialty topic 1 question 59 discussion

A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible.
Which sequence of steps should the data scientist take to meet these requirements?

  • A. Apply random sampling to the dataset. Then split the dataset into training, validation, and test sets.
  • B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets.
  • C. Rescale the dataset. Then split the dataset into training, validation, and test sets.
  • D. Split the dataset into training, validation, and test sets. Then rescale the training set, the validation set, and the test set independently.
Suggested Answer: B 🗳️

Comments

cron0001
Highly Voted 3 years ago
Selected Answer: C
C would be my answer here. Rescaling each set independently could lead to strange skews. The training set, test set, and evaluation set should be on the same scale.
upvoted 17 times
GiyeonShin
2 years, 4 months ago
You're right that the test set and validation set should be rescaled to the same scale. But the scaling values should be derived only from statistics of the training data. I think C means the rescaling stage is affected by values from the whole dataset (including the validation and test sets), so I think B is correct.
upvoted 9 times
...
...
masoa3b
Highly Voted 2 years, 6 months ago
Selected Answer: B
https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data C also leads to data leakage. You are using the test data to scale everything, so the test set influences the scaling used when you build the model on the training set and check it against the validation set.
upvoted 17 times
...
ML_2
Most Recent 8 months, 3 weeks ago
Selected Answer: B
If you rescale all the data first, you cause data leakage by exposing the variance of the entire dataset during training. The rescaling needs to happen after splitting the data, not before.
upvoted 1 times
...
Denise123
1 year, 2 months ago
Selected Answer: B
The best practice is to split the dataset into training, validation, and test sets first, and then rescale the training set and apply the SAME scaling to the validation and test sets. This ensures that the scaling parameters (e.g., mean and standard deviation for standardization, or min and max values for min-max scaling) are calculated only from the training set, preventing data leakage and maintaining the integrity of the evaluation process. By following this approach, you prevent information from the validation and test sets from influencing the scaling parameters, which could lead to data leakage and overestimation of model performance. Keeping the scaling consistent across all subsets ensures a fair evaluation of the model's generalization performance on new, unseen data.
upvoted 5 times
...
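The workflow described in the comments above (split first, fit the scaler on the training set only, reuse it on validation and test) can be sketched with scikit-learn. The toy data and split ratios below are assumptions for illustration, not part of the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: two features whose dispersion differs by orders of magnitude
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 1000), rng.normal(0, 1000, 1000)])
y = rng.integers(0, 2, 1000)

# Step 1: split FIRST (60% train, 20% validation, 20% test)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Step 2: fit the scaling parameters on the training set ONLY...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME transform to all three sets
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```

`StandardScaler.fit` stores the training-set mean and standard deviation, so `transform` applies identical parameters to every set; a `MinMaxScaler` would follow the same fit-on-train pattern.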
phdykd
1 year, 3 months ago
Answer is B. The other options have shortcomings: A: Random sampling is a good practice, but it doesn't address the issue of feature scaling. Also, rescaling should occur after splitting the data. C: Rescaling the entire dataset before splitting could lead to data leakage, where information from the validation/test sets inadvertently influences the training process. D: Rescaling the sets independently would lead to inconsistencies in scale across the training, validation, and test sets, which could negatively impact model performance and evaluation.
upvoted 2 times
...
Sukhi4fornet
1 year, 4 months ago
OPTION C. Rescale the dataset. Then split the dataset into training, validation, and test sets. Explanation: Rescaling the dataset: This is the first step to address the varying statistical dispersion among features. By rescaling, you ensure that all features are on a similar scale, which is important for many machine learning algorithms. Splitting into training, validation, and test sets: After rescaling, the dataset is split into training, validation, and test sets. This ensures that the model is trained on one set, validated on another set, and tested on a third set. This separation helps evaluate the model's performance on unseen data. Option C ensures that the rescaling is applied before splitting the data, ensuring consistency in the scaling across different sets. This approach prevents data leakage and provides a more accurate representation of how the model will perform on new, unseen data.
upvoted 1 times
...
akgarg00
1 year, 5 months ago
Selected Answer: B
The validation and test sets should be scaled with the parameters used to scale the training set. Scaling the test set independently would mean the model drifts much more quickly in production, and it is not recommended practice in data science.
upvoted 1 times
...
elvin_ml_qayiran25091992razor
1 year, 5 months ago
Selected Answer: B
B is correct: fit the scaler on the training set and apply it to the other sets to prevent data leakage.
upvoted 1 times
...
akgarg00
1 year, 5 months ago
Selected Answer: B
Answer B. C is not good data science practice.
upvoted 1 times
...
DimLam
1 year, 6 months ago
Selected Answer: B
We need to split the data first to avoid data leakage from the test/validation sets, then rescale the data in all sets using statistics from the training set.
upvoted 1 times
...
DavidRou
1 year, 7 months ago
Selected Answer: B
I think the right answer here is B. We need to split the dataset into training, validation, and test sets. Then we can fit the scaling (using some technique) only on data contained in the training set. Data belonging to the validation and test sets must be scaled using the parameters fitted on the training set. For example, if we want to apply standardization, we can compute the mean and standard deviation only on the training set, as we are not allowed to use statistics computed on the validation/test sets. We must act as if we don't own those data!
upvoted 2 times
...
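DavidRou's standardization example can be made concrete with plain NumPy. The synthetic arrays below are assumptions; the sketch contrasts option B (training statistics reused everywhere) with option D (each set standardized with its own statistics):

```python
import numpy as np

# Hypothetical data drawn from the same distribution
rng = np.random.default_rng(1)
train = rng.normal(50, 10, size=(100, 1))
test = rng.normal(50, 10, size=(20, 1))

# Option B: statistics come from the training set only
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_b = (train - mu) / sigma
test_b = (test - mu) / sigma          # same mu/sigma reused on the test set

# Option D (wrong): the test set standardized with its own statistics
test_d = (test - test.mean(axis=0)) / test.std(axis=0)

# test_b keeps the test data on the training scale; test_d silently
# shifts it onto a different scale, which distorts evaluation.
```

In production, a single incoming row has no "set" statistics at all, so option D is not even computable row by row.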
Mickey321
1 year, 8 months ago
Selected Answer: B
option B
upvoted 1 times
...
kaike_reis
1 year, 9 months ago
Data Science 101: (A) Given the question, doesn't solve the magnitude problem. (B) Correct (C) Data Leakage (D) It's not correct, still data leakage.
upvoted 1 times
...
gusta_dantas
1 year, 9 months ago
Tricky question, but D, definitely! B: you can't apply the same scaling to the validation and test sets because you may suffer data leakage! C: you shouldn't rescale the whole dataset and then split it into training, validation, and test sets; it's not good practice and may suffer data leakage as well. D: you first split the whole dataset and then apply rescaling individually, preventing any data leakage, and each set is rescaled based on its own statistics.
upvoted 1 times
DavidRou
1 year, 7 months ago
Theoretically, you should not have test set data available at training time (when you're doing the scaling), so how would you do that? What if you don't have an entire test set, but instead receive each new row one at a time?
upvoted 1 times
...
kaike_reis
1 year, 9 months ago
but you are leaking information from validation samples between themselves.
upvoted 1 times
...
...
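The row-at-a-time scenario DavidRou raises is exactly why the training-set statistics must be persisted alongside the model and applied to each incoming observation. A minimal sketch, with hypothetical per-feature values:

```python
import numpy as np

# Statistics computed once, on the training set, and persisted with the model
# (the numbers here are hypothetical placeholders)
train_mean = np.array([3.2, 4500.0])   # per-feature means from training data
train_std = np.array([1.1, 900.0])     # per-feature stds from training data

def scale_row(row):
    """Apply the stored training-set scaling to one incoming production row."""
    return (np.asarray(row, dtype=float) - train_mean) / train_std

# A single new observation can be scaled without any "test set" statistics
scaled = scale_row([4.3, 5400.0])
```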
JK1977
1 year, 11 months ago
Selected Answer: B
From Bing chat (and it makes complete sense): "Based on the search results, I think the best sequence of steps for the data scientist to take is B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets. This sequence of steps ensures that the data scientist can evaluate the model performance on different subsets of data that have not been used for training or tuning. It also ensures that the data scientist can rescale the features to have a common scale without introducing any data leakage from the validation or test sets. Rescaling the features can help improve the accuracy of some machine learning algorithms that are sensitive to the magnitude or distribution of the data, such as distance-based methods or gradient-based methods."
upvoted 3 times
...
tommct
1 year, 11 months ago
Selected Answer: B
You want to measure how the model performs on new data. Scaling with the test set is a no-no.
upvoted 1 times
...
GOSD
2 years ago
B or D; I don't understand the semantics of "independently" and the effect it would have. It's definitely not done before splitting, because of data leakage. https://www.linkedin.com/pulse/feature-scaling-dataset-spliting-arnab-mukherjee/
upvoted 1 times
...
Community vote distribution: A (35%), C (25%), B (20%), Other