
Exam AWS Certified Machine Learning - Specialty topic 1 question 92 discussion

A company wants to predict the sale prices of houses based on available historical sales data. The target variable in the company's dataset is the sale price. The features include parameters such as the lot size, living area measurements, non-living area measurements, number of bedrooms, number of bathrooms, year built, and postal code. The company wants to use multi-variable linear regression to predict house sale prices.
Which step should a machine learning specialist take to remove features that are irrelevant for the analysis and reduce the model's complexity?

  • A. Plot a histogram of the features and compute their standard deviation. Remove features with high variance.
  • B. Plot a histogram of the features and compute their standard deviation. Remove features with low variance.
  • C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores.
  • D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
Suggested Answer: D
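As a sketch of what the suggested answer (D) describes — running a correlation check of every feature against the target and dropping the weakly correlated ones — here is a minimal pure-Python example. The toy dataset, the `pearson` helper, and the 0.5 cutoff are all hypothetical, not from the question.

```python
import math

# Hypothetical toy dataset (values invented for illustration).
lot_size   = [5000, 6000, 7000, 8000, 9000]
bedrooms   = [2, 3, 3, 4, 5]
year_built = [1990, 1975, 2005, 1960, 2010]
sale_price = [200000, 240000, 280000, 320000, 360000]  # target variable

def pearson(x, y):
    """Plain Pearson correlation coefficient between two columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

features = {"lot_size": lot_size, "bedrooms": bedrooms, "year_built": year_built}
scores = {name: pearson(col, sale_price) for name, col in features.items()}

# Keep only features whose |correlation| with the target clears a
# (hypothetical) threshold; the rest are deemed irrelevant and dropped.
selected = [name for name, r in scores.items() if abs(r) >= 0.5]
```

In practice the same check is one line with pandas (`df.corr()["sale_price"]`), but the idea is identical: rank features by their correlation with the target and drop the weak ones.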

Comments

puffpuff
Highly Voted 2 years, 6 months ago
D is the more comprehensive answer. If a feature is not correlated with the target, a linear regression can't make use of it. A lot of others say B, but low variance can also be due to the nature or typical magnitude of the variable itself.
upvoted 29 times
hamimelon
1 year, 4 months ago
I think the problem with B is: what counts as "low variance"? The features are on different scales.
upvoted 1 times
...
V_B_
1 year, 8 months ago
Correlation captures only linear relationships, but there may be nonlinear ones as well. To exploit them in linear regression, you can raise the variables to some power or apply other nonlinear preprocessing, without changing the algorithm. So answer B seems much more solid to me.
upvoted 2 times
...
...
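The scale objection raised in this thread can be made concrete: the raw variance of a feature depends entirely on its unit of measurement, so a fixed "low variance" cutoff is not meaningful without standardizing first. A small sketch with hypothetical lot sizes:

```python
from statistics import pvariance

# The same underlying feature expressed in two units (hypothetical values).
lot_acres = [0.10, 0.12, 0.15, 0.20]
lot_sqft  = [a * 43560 for a in lot_acres]  # 1 acre = 43,560 sq ft

# Raw variance depends entirely on the unit of measurement...
var_acres = pvariance(lot_acres)  # tiny number
var_sqft  = pvariance(lot_sqft)   # huge number (scaled by 43560**2)

# ...so a fixed "low variance" cutoff would drop the acre version and
# keep the sqft version, even though they carry identical information.
```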
ahquiceno
Highly Voted 2 years, 7 months ago
Answer B. D is not the best solution; other analyses can be applied first. https://community.dataquest.io/t/feature-selection-features-with-low-variance/2418 If the variance is low or close to zero, a feature is approximately constant and will not improve the performance of the model; in that case it should be removed. Likewise, if only a handful of observations differ from a constant value, the variance will be very low.
upvoted 16 times
fshkkento
2 years, 5 months ago
Low variance does not mean the feature is unimportant, right? If the variance of the target is also small and the feature correlates with the target, the feature can still be an important feature.
upvoted 7 times
rb39
1 year, 7 months ago
It does. If the feature and target are correlated and you expect the target to change, the feature must have some variance. Otherwise the feature is almost constant, and so is the target.
upvoted 1 times
...
...
...
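The low-variance heuristic this thread debates can be sketched in a few lines. The columns, the near-constant `has_roof` flag, and the 0.2 threshold are all hypothetical; the approach mirrors what scikit-learn's `VarianceThreshold` automates:

```python
from statistics import pvariance

# Hypothetical feature columns; "has_roof" is nearly constant.
features = {
    "living_area": [1200, 1800, 950, 2400, 1600],
    "has_roof":    [1, 1, 1, 1, 0],  # almost constant -> variance near 0
    "bedrooms":    [2, 4, 1, 5, 3],
}

# Drop features whose variance falls below a small (hypothetical)
# threshold; an approximately constant column can't help the model.
THRESHOLD = 0.2
kept = {name: col for name, col in features.items()
        if pvariance(col) > THRESHOLD}
```

Note that this only catches near-constant columns; as other commenters point out, it says nothing about whether the surviving features are actually relevant to the target.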
akgarg00
Most Recent 5 months, 2 weeks ago
Selected Answer: D
D is the best answer, since the question specifies multi-variable linear regression, which relies on strong correlation between the dependent and independent variables.
upvoted 1 times
...
mirik
10 months, 3 weeks ago
Selected Answer: D
D: We should remove features that are strongly correlated with each other and weakly correlated with the target: https://androidkt.com/find-correlation-between-features-and-target-using-the-correlation-matrix/ You can evaluate the relationship between each feature and the target using correlation, and select those features that have the strongest relationship with the target variable.
upvoted 2 times
...
HunterZ9527
1 year ago
Selected Answer: D
I think D is the correct answer. If I remember correctly, the Benjamini-Hochberg method is essentially answer D if you take the hypothesis to be: the feature strongly influences the target. My problem with B is that variance is easily affected by scale. In this question, the number of bedrooms has very low variance while the square footage of the house has high variance, yet both could be very useful. Furthermore, postal codes are included; it is safe to assume their variance can be high, but the information they carry is very limited, especially if they are used as numerical rather than categorical features.
upvoted 1 times
...
Valcilio
1 year, 1 month ago
Selected Answer: D
B is defensible, but D is the better answer.
upvoted 2 times
...
AjoseO
1 year, 2 months ago
Selected Answer: D
D is preferred over C because the goal is to predict the sale price of houses, which is the target variable. By checking the correlation of each feature against the target variable, the machine learning specialist can identify which features are most relevant to the prediction of the sale price and which are less relevant. Removing features with low correlation to the target variable helps reduce the complexity of the model and potentially improve its accuracy. On the other hand, a heatmap showing the correlation of the dataset against itself (C) doesn't directly address the relevance of the features to the target variable, and so it's not as effective in reducing the complexity of the model.
upvoted 3 times
...
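For contrast, here is what option C's heatmap actually measures: mutual correlation between features, which flags redundant pairs rather than features irrelevant to the target. A pure-Python sketch with hypothetical columns (`total_area` is deliberately an offset copy of `living_area`):

```python
import math
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation coefficient between two columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical columns; total_area is living_area + 50 everywhere.
features = {
    "living_area": [950, 1200, 1600, 1800, 2400],
    "total_area":  [1000, 1250, 1650, 1850, 2450],
    "year_built":  [1960, 2005, 1975, 1990, 2010],
}

# A correlation heatmap visualizes this pairwise matrix; pairs with very
# HIGH mutual correlation are redundant (you would drop one of each such
# pair -- not the low-correlation pairs, as option C's wording suggests).
redundant = [(a, b) for a, b in combinations(features, 2)
             if abs(pearson(features[a], features[b])) > 0.95]
```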
expertguru
1 year, 3 months ago
The answer should be D; this is feature elimination/selection during feature engineering. Choice C is close precisely to tempt test takers into the wrong choice. Compare C and D: C would have been correct if the question asked how to visualize correlation among the independent variables, and even then its second sentence would need to be removed (or changed to say that, of two mutually correlated features, you eliminate the one with the lower correlation against the target). C. Build a heatmap showing the correlation of the dataset against itself. Remove features with low mutual correlation scores. D. Run a correlation check of all features against the target variable. Remove features with low target variable correlation scores.
upvoted 1 times
...
Ob1KN0B
1 year, 8 months ago
Selected Answer: D
The multiple regression model is based on the following assumptions: there is a linear relationship between the dependent variable and the independent variables; the independent variables are not too highly correlated with each other; the yi observations are selected independently and randomly from the population; and residuals are normally distributed with a mean of 0 and constant variance σ².
upvoted 5 times
...
wakuwaku
2 years, 2 months ago
I think the answer is D. If the model were a decision tree or something similar, I don't think you could decide based only on the direct correlation with the target variable. But in multiple linear regression, the relationship between the target variable and each feature is the only thing that matters. As for B: if the standard deviation is small but not zero, the feature still carries information.
upvoted 3 times
...
apprehensive_scar
2 years, 3 months ago
Selected Answer: B
B is correct.
upvoted 2 times
...
Peasfull
2 years, 3 months ago
To eliminate extraneous information. So, the answer is D.
upvoted 2 times
...
Asrivastava3
2 years, 4 months ago
Correct answer is D. The reason B is wrong is that it is hard to justify plotting a histogram at all; it's an unnecessary step and a distractor choice.
upvoted 4 times
...
[Removed]
2 years, 5 months ago
Selected Answer: B
D is not the proper answer. Here is why: it compares each feature with the target variable (the dependent variable), i.e. the correlation between the dependent and independent variables. That kind of comparison is usually done after a model is constructed, in order to assess the model's predictive strength. Comparing the target label — the label you wish to predict — with the other variables beforehand is premature and will likely weaken your model. Variables with low variance carry very little information, and including them will likely weaken model performance. Hence, B.
upvoted 4 times
...
MikkyO
2 years, 6 months ago
Answer is D. https://deep-r.medium.com/difference-between-variance-co-variance-and-correlation-ea0b7ddbaa1
upvoted 5 times
...
Huy
2 years, 6 months ago
Answer C. Heatmaps are used to visualize a correlation matrix: https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec
upvoted 1 times
mahmoudai
2 years, 6 months ago
But it says "Remove features with low mutual correlation scores," which is wrong: you should drop features with high mutual correlation scores. So the answer is D.
upvoted 5 times
...
...
hero67
2 years, 6 months ago
The problem with correlation checks is that they capture linear relationships only. So I would go with B.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted