
Exam AWS Certified Machine Learning - Specialty topic 1 question 92 discussion

A company wants to predict the sale prices of houses based on available historical sales data. The target variable in the company's dataset is the sale price. The features include parameters such as the lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices.
Which step should a machine learning specialist take to remove features that are irrelevant for the analysis and reduce the model's complexity?

  • A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance.
  • B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance.
  • C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores.
  • D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
Suggested Answer: D
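As a sketch of what the suggested answer (D) describes — running a correlation check of every feature against the target and dropping the weakly correlated ones — here is a minimal pure-Python example. The toy dataset, the `pearson` helper, and the 0.5 cutoff are all hypothetical, not from the question.

```python
import math

# Hypothetical toy dataset (values invented for illustration).
lot_size   = [5000, 6000, 7000, 8000, 9000]
bedrooms   = [2, 3, 3, 4, 5]
year_built = [1990, 1975, 2005, 1960, 2010]
sale_price = [200000, 240000, 280000, 320000, 360000]  # target variable

def pearson(x, y):
    """Plain Pearson correlation coefficient between two columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

features = {"lot_size": lot_size, "bedrooms": bedrooms, "year_built": year_built}
scores = {name: pearson(col, sale_price) for name, col in features.items()}

# Keep only features whose |correlation| with the target clears a
# (hypothetical) threshold; the rest are deemed irrelevant and dropped.
selected = [name for name, r in scores.items() if abs(r) >= 0.5]
```

In practice the same check is one line with pandas (`df.corr()["sale_price"]`), but the idea is identical: rank features by their correlation with the target and drop the weak ones.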

Comments

puffpuff
Highly Voted 2 years, 6 months ago
D is the more comprehensive answer. If a feature is not correlated with the target, a linear regression can't make use of it. A lot of others say B, but low variance can also be due to the nature or typical magnitude of the variable itself.
upvoted 29 times
hamimelon
1 year, 4 months ago
I think the problem with B is: what counts as "low variance"? The features are on different scales.
upvoted 1 times
...
V_B_
1 year, 8 months ago
Correlation captures only linear relationships, but there may be nonlinear ones as well. To exploit them in linear regression, you can raise the variables to some power or apply other nonlinear preprocessing, without changing the algorithm. So answer B seems much more solid to me.
upvoted 2 times
...
...
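The scale objection raised in this thread can be made concrete: the raw variance of a feature depends entirely on its unit of measurement, so a fixed "low variance" cutoff is not meaningful without standardizing first. A small sketch with hypothetical lot sizes:

```python
from statistics import pvariance

# The same underlying feature expressed in two units (hypothetical values).
lot_acres = [0.10, 0.12, 0.15, 0.20]
lot_sqft  = [a * 43560 for a in lot_acres]  # 1 acre = 43,560 sq ft

# Raw variance depends entirely on the unit of measurement...
var_acres = pvariance(lot_acres)  # tiny number
var_sqft  = pvariance(lot_sqft)   # huge number (scaled by 43560**2)

# ...so a fixed "low variance" cutoff would drop the acre version and
# keep the sqft version, even though they carry identical information.
```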
ahquiceno
Highly Voted 2 years, 7 months ago
Answer B. D is not the best solution; other analyses can be applied first. https://community.dataquest.io/t/feature-selection-features-with-low-variance/2418 If the variance is low or close to zero, a feature is approximately constant and will not improve the performance of the model; in that case it should be removed. Likewise, if only a handful of observations differ from a constant value, the variance will be very low.
upvoted 16 times
fshkkento
2 years, 5 months ago
Low variance does not mean the feature is unimportant, right? If the variance of the target is also small and the feature correlates with the target, the feature can still be an important feature.
upvoted 7 times
rb39
1 year, 7 months ago
It does. If the feature and target are correlated and you expect the target to change, the feature must have some variance. Otherwise the feature is almost constant, and so is the target.
upvoted 1 times
...
...
...
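The low-variance heuristic this thread debates can be sketched in a few lines. The columns, the near-constant `has_roof` flag, and the 0.2 threshold are all hypothetical; the approach mirrors what scikit-learn's `VarianceThreshold` automates:

```python
from statistics import pvariance

# Hypothetical feature columns; "has_roof" is nearly constant.
features = {
    "living_area": [1200, 1800, 950, 2400, 1600],
    "has_roof":    [1, 1, 1, 1, 0],  # almost constant -> variance near 0
    "bedrooms":    [2, 4, 1, 5, 3],
}

# Drop features whose variance falls below a small (hypothetical)
# threshold; an approximately constant column can't help the model.
THRESHOLD = 0.2
kept = {name: col for name, col in features.items()
        if pvariance(col) > THRESHOLD}
```

Note that this only catches near-constant columns; as other commenters point out, it says nothing about whether the surviving features are actually relevant to the target.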
akgarg00
Most Recent 5 months, 2 weeks ago
Selected Answer: D
D is the best answer, since the question specifies multi-variable linear regression, which relies on strong correlation between the dependent and independent variables.
upvoted 1 times
...
mirik
10 months, 3 weeks ago
Selected Answer: D
D: We should remove features that are strongly correlated with each other and weakly correlated with the target: https://androidkt.com/find-correlation-between-features-and-target-using-the-correlation-matrix/ You can evaluate the relationship between each feature and the target using correlation, and select those features that have the strongest relationship with the target variable.
upvoted 2 times
...
HunterZ9527
1 year ago
Selected Answer: D
I think D is the correct answer. If I remember correctly, the Benjamini-Hochberg method is essentially answer D if you take the hypothesis to be: the feature strongly influences the target. My problem with B is that variance is easily affected by scale. In this question, the number of bedrooms has very low variance while the square footage of the house has high variance, yet both could be very useful. Furthermore, postal codes are included; it is safe to assume their variance can be high, but the information they carry is very limited, especially if they are used as numerical rather than categorical features.
upvoted 1 times
...
Valcilio
1 year, 1 month ago
Selected Answer: D
B is defensible, but D is the better answer.
upvoted 2 times
...
AjoseO
1 year, 2 months ago
Selected Answer: D
D is preferred over C because the goal is to predict the sale price of houses, which is the target variable. By checking the correlation of each feature against the target variable, the machine learning specialist can identify which features are most relevant to the prediction of the sale price and which are less relevant. Removing features with low correlation to the target variable helps reduce the complexity of the model and potentially improve its accuracy. On the other hand, a heatmap showing the correlation of the dataset against itself (C) doesn't directly address the relevance of the features to the target variable, and so it's not as effective in reducing the complexity of the model.
upvoted 3 times
...
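For contrast, here is what option C's heatmap actually measures: mutual correlation between features, which flags redundant pairs rather than features irrelevant to the target. A pure-Python sketch with hypothetical columns (`total_area` is deliberately an offset copy of `living_area`):

```python
import math
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation coefficient between two columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical columns; total_area is living_area + 50 everywhere.
features = {
    "living_area": [950, 1200, 1600, 1800, 2400],
    "total_area":  [1000, 1250, 1650, 1850, 2450],
    "year_built":  [1960, 2005, 1975, 1990, 2010],
}

# A correlation heatmap visualizes this pairwise matrix; pairs with very
# HIGH mutual correlation are redundant (you would drop one of each such
# pair -- not the low-correlation pairs, as option C's wording suggests).
redundant = [(a, b) for a, b in combinations(features, 2)
             if abs(pearson(features[a], features[b])) > 0.95]
```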
expertguru
1 year, 3 months ago
The answer should be D; this is feature elimination/selection during feature engineering. Choice C is close precisely to tempt test takers into the wrong choice. Compare C and D: C would have been correct if the question asked how to visualize correlation among the independent variables, and even then its second sentence would need to be removed (or changed to say that, of two mutually correlated features, you eliminate the one with the lower correlation against the target). C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
upvoted 1 times
...
Ob1KN0B
1 year, 8 months ago
Selected Answer: D
The multiple regression model is based on the following assumptions: there is a linear relationship between the dependent variable and the independent variables; the independent variables are not too highly correlated with each other; the yi observations are selected independently and randomly from the population; and residuals are normally distributed with a mean of 0 and constant variance σ².
upvoted 5 times
...
wakuwaku
2 years, 2 months ago
I think the answer is D. If the model were a decision tree or something similar, I don't think you could decide based only on the direct correlation with the target variable. But in multiple linear regression, the relationship between the target variable and each feature is the only thing that matters. As for B: if the standard deviation is small but not zero, the feature still carries information.
upvoted 3 times
...
apprehensive_scar
2 years, 3 months ago
Selected Answer: B
B is correct.
upvoted 2 times
...
Peasfull
2 years, 3 months ago
To eliminate extraneous information. So, the answer is D.
upvoted 2 times
...
Asrivastava3
2 years, 4 months ago
Correct answer is D. The reason B is wrong is that it is hard to justify plotting a histogram at all; it's an unnecessary step and a distractor choice.
upvoted 4 times
...
[Removed]
2 years, 5 months ago
Selected Answer: B
D is not the proper answer. Here is why: it compares each feature with the target variable (the dependent variable), i.e. the correlation between the dependent and independent variables. That kind of comparison is usually done after a model is constructed, in order to assess the model's predictive strength. Comparing the target label — the label you wish to predict — with the other variables beforehand is premature and will likely weaken your model. Variables with low variance carry very little information, and including them will likely weaken model performance. Hence, B.
upvoted 4 times
...
MikkyO
2 years, 6 months ago
Answer is D. https://deep-r.medium.com/difference-between-variance-co-variance-and-correlation-ea0b7ddbaa1
upvoted 5 times
...
Huy
2 years, 6 months ago
Answer C. Heatmaps are used to visualize a correlation matrix: https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec
upvoted 1 times
mahmoudai
2 years, 6 months ago
But it says "Remove features with low mutual correlation scores," which is wrong: you should drop features with high mutual correlation scores. So the answer is D.
upvoted 5 times
...
...
hero67
2 years, 6 months ago
The problem with correlation checks is that they capture linear relationships only. So I would go with B.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted