Exam AWS Certified Machine Learning - Specialty topic 1 question 80 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 80
Topic #: 1

A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, the large number of features slows down the training speed significantly, and that there are some overfitting issues.
The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset.
Which feature engineering technique should the Data Scientist use to meet the objectives?

A. Run self-correlation on all features and remove highly correlated features
B. Normalize all numerical values to be between 0 and 1
C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features
D. Cluster raw data using k-means and use sample data from each cluster to build a new dataset

Suggested Answer: C

by ahquiceno at Feb. 3, 2021, 2:10 p.m.

Comments

ahquiceno

Highly Voted 3 years, 11 months ago

Answer C. We need to reduce the number of features while preserving the information in them; this is achieved using PCA.

upvoted 26 times

Dr_Kiko

3 years, 9 months ago

"Without losing a lot of information from the original dataset": since when does PCA retain information?

upvoted 3 times

...

VinceCar

2 years, 9 months ago

PCA helps to speed up the training

upvoted 4 times

...

[Removed]

Highly Voted 3 years, 10 months ago

Answer is A, because one must avoid information loss that PCA or autoencoders introduce through new features (https://www.i2tutorials.com/what-are-the-pros-and-cons-of-the-pca/). Otherwise, I would perform C.

upvoted 6 times

SophieSu

3 years, 10 months ago

If you REMOVE highly correlated features (that is, in pairs), the model loses a lot of information.

upvoted 4 times

...

rodrigus

2 years, 5 months ago

A doesn't make sense. Self-correlation is for time series data, not for pairwise correlation.

upvoted 2 times

...

xicocaio

Most Recent 10 months, 3 weeks ago

Selected Answer: A

This question can be misleading. I would choose A if "self-correlation" here means pairwise correlation, which is the most typical approach in real life. But if self-correlation means autocorrelation, as in time-series analysis, then A is wrong. Issues with answer C: autoencoders are notoriously hard to interpret. With PCA, interpretation is possible, but definitely not easy if you have a large dataset. In real life, in this scenario, you would always go with pairwise correlation as the simplest yet effective approach.

upvoted 1 times

...

Giodefa96

1 year ago

Selected Answer: C

Answer is C

upvoted 1 times

...

geoan13

1 year, 9 months ago

Answer C. PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables. Therefore, PCA can effectively eliminate multicollinearity between features. https://towardsdatascience.com/how-do-you-apply-pca-to-logistic-regression-to-remove-multicollinearity-10b7f8e89f9b#:~:text=PCA%20(Principal%20Component%20Analysis)%20takes,effectively%20eliminate%20multicollinearity%20between%20features.

upvoted 1 times
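
To make the multicollinearity point concrete, here is a minimal NumPy sketch. The synthetic data, the variable names, and the choice to compute PCA through an eigendecomposition of the covariance matrix are all assumptions for illustration; none of it comes from the linked article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three features, where
# the second is almost a linear copy of the first.
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Center the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Reorder so the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the principal axes: these are the "new features".
Z = Xc @ eigvecs

corr_before = np.corrcoef(X, rowvar=False)  # x1 and x2 strongly correlated
cov_after = np.cov(Z, rowvar=False)         # off-diagonal entries near zero

print(np.round(corr_before, 3))
print(np.round(cov_after, 3))
```

The covariance of the projected data is diagonal up to floating-point error, which is what "a set of uncorrelated variables" means in practice.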

...

Mickey321

1 year, 11 months ago

Selected Answer: C

Option C

upvoted 1 times

...

Mickey321

1 year, 11 months ago

Selected Answer: C

An autoencoder is a type of neural network that can learn a compressed representation of the input data, called the latent space, by encoding and decoding the data through multiple hidden layers. PCA is a statistical technique that can reduce the dimensionality of the data by finding a set of orthogonal axes, called the principal components, that capture the most variance in the data. Both methods can transform the original features into new features that are lower-dimensional and informative; with PCA the new features are also uncorrelated.

upvoted 1 times
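
As a rough sketch of the encode/decode idea, the following trains a tiny linear autoencoder with plain gradient descent in NumPy. It is a simplification under stated assumptions: the data is synthetic, there are no nonlinear hidden layers, and the learning rate and step count are hypothetical values picked for this toy problem. A real autoencoder would normally be built with a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumption): 6 observed features driven by 2 latent factors.
n, d, k = 500, 6, 2
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, d)) + 0.05 * rng.normal(size=(n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

# Encoder (d -> k) and decoder (k -> d) weights, small random init.
W_enc = 0.1 * rng.normal(size=(d, k))
W_dec = 0.1 * rng.normal(size=(k, d))

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error of the autoencoder."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_loss = mse(X, W_enc, W_dec)

lr = 0.1  # hypothetical learning rate for this toy problem
for _ in range(2000):
    Z = X @ W_enc                    # encode into the latent space
    R = Z @ W_dec - X                # reconstruction residual
    G = 2.0 * R / R.size             # gradient of the MSE w.r.t. the output
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

final_loss = mse(X, W_enc, W_dec)
codes = X @ W_enc  # the compressed features a downstream model would use

print(initial_loss, final_loss, codes.shape)
```

At its optimum, a linear autoencoder trained with squared error spans the same subspace as PCA, which is one reason the two techniques appear together in option C.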

...

kaike_reis

2 years ago

Selected Answer: C

C is correct. Self-correlation is for time series, which are not mentioned here. Besides, even if it meant plain correlation, try doing that with thousands of features...

upvoted 1 times

...

vbal

2 years, 2 months ago

A. Run a correlation matrix and remove highly correlated features.

upvoted 1 times
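
For comparison, this is roughly what the correlation-matrix approach looks like. It is only a sketch: the helper name `drop_correlated`, the 0.9 threshold, and the greedy keep-first rule are assumptions made for illustration, not something specified in the question.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep columns; skip any column whose absolute Pearson
    correlation with an already-kept column exceeds `threshold`.
    Returns the list of kept column indices."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(1)

# Synthetic data (assumption): column 1 is a near-duplicate of column 0,
# column 2 is independent of both.
a = rng.normal(size=300)
b = rng.normal(size=300)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=300), b])

kept = drop_correlated(X)
X_reduced = X[:, kept]
print(kept, X_reduced.shape)
```

Note that the correlation matrix for d features has d x d entries, and the result depends on column order and on the chosen threshold, which is part of why this gets awkward with thousands of attributes.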

...

JK1977

2 years, 2 months ago

Selected Answer: C

PCA for feature reduction

upvoted 1 times

...

GOSD

2 years, 3 months ago

is it just me or is every 15th answer here PCA?

upvoted 2 times

...

oso0348

2 years, 4 months ago

Selected Answer: C

Using an autoencoder or PCA can help reduce the dimensionality of the dataset by creating new features that capture the most important information in the original dataset while discarding some of the noise and highly correlated features. This can help speed up the training time and reduce overfitting issues without losing a lot of information from the original dataset. Option A may remove too many features and may not capture all the important information in the dataset, while option B only rescales the data and does not address the issue of highly correlated features. Option D is not a feature engineering technique and may not be an effective way to reduce the dimensionality of the dataset.

upvoted 1 times

...

Paolo991

2 years, 4 months ago

Selected Answer: C

PCA builds new features starting from highly correlated ones, so it matches the question.

upvoted 1 times

...

Sneep

2 years, 7 months ago

It's C. The Data Scientist should use principal component analysis (PCA) to replace the original features with new features. PCA is a technique that reduces the dimensionality of a dataset by projecting it onto a lower-dimensional space, while preserving as much of the original variation as possible. This can help to speed up the training time of the model and reduce overfitting issues, without losing a significant amount of information from the original dataset.

upvoted 1 times
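
One way to quantify "preserving as much of the original variation as possible" is the cumulative explained-variance ratio. The sketch below assumes synthetic low-rank data and a hypothetical 95% retention target; it uses NumPy's SVD rather than any particular PCA library.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumption): 20 observed features driven by 3 latent
# factors plus a small amount of noise.
n, d, r = 400, 20, 3
latent = rng.normal(size=(n, r))
mixing = rng.normal(size=(r, d))
X = latent @ mixing + 0.05 * rng.normal(size=(n, d))

# PCA via SVD of the centered data; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # variance share of each component
cumulative = np.cumsum(explained)

# Smallest number of components that keeps at least 95% of the variance
# (the 95% target is a hypothetical choice).
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project to k dimensions, then map back to measure what was lost.
Z = Xc @ Vt[:k].T
X_hat = Z @ Vt[:k]
rel_error = np.linalg.norm(Xc - X_hat) ** 2 / np.linalg.norm(Xc) ** 2

print(k, Z.shape, rel_error)
```

The relative reconstruction error equals one minus the retained variance share, so the information lost by dropping components can be read directly from the explained-variance curve.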

...

Aninina

2 years, 7 months ago

Selected Answer: C

C: PCA is the solution

upvoted 1 times

...

ovokpus

3 years, 1 month ago

Selected Answer: C

Correction to C. Removing correlated features from hundreds of columns will be tedious and time-consuming. PCA is the way to go here. Apologies for the flip.

upvoted 2 times

...

ovokpus

3 years, 1 month ago

Selected Answer: A

Answer is A. Eliminate features that are highly correlated. This will not compromise the quality of the feature space as much as PCA would.

upvoted 1 times

...
