exam questions

Exam AWS Certified Machine Learning - Specialty All Questions

View all questions & answers for the AWS Certified Machine Learning - Specialty exam

Exam AWS Certified Machine Learning - Specialty topic 1 question 274 discussion

A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy.

What is the MOST effective way to encode this categorical feature into a numeric feature?

  • A. Spell check the column. Use Amazon SageMaker one-hot encoding on the column to transform a categorical feature to a numerical feature.
  • B. Fix the spelling in the column by using char-RNN. Use Amazon SageMaker Data Wrangler one-hot encoding to transform a categorical feature to a numerical feature.
  • C. Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.
  • D. Use Amazon SageMaker Data Wrangler ordinal encoding on the column to encode categories into an integer between 0 and the total number of categories in the column.
Show Suggested Answer Hide Answer
Suggested Answer: C 🗳️

Comments

Chosen Answer:
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.
Switch to a voting comment New
usamazubairi
Highly Voted 11 months ago
Ansewer is A: Given the scenario, One-Hot Encoding would be the most effective way to encode the categorical feature into a numerical feature
upvoted 5 times
...
AIWave
Most Recent 8 months, 1 week ago
Selected Answer: C
Similarity encoding is meant to encode high cardinality features having misspelled values by group them closer. In high cardinality, it's more efficient o create vectors of real numbers than creating insane number of one hot encoded columns
upvoted 1 times
...
vkbajoria
8 months, 1 week ago
A - most effective. dataset seems small and manually fixing of spelling is possible since this is categorical
upvoted 2 times
...
Alice1234
9 months, 1 week ago
C- Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.
upvoted 1 times
...
Topg4u
9 months, 1 week ago
The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia". https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html
upvoted 2 times
...
aakash_0086
9 months, 2 weeks ago
Selected Answer: B
a character-level Recurrent Neural Network (char-RNN) can be used to fix spelling mistakes in a column containing medication names.
upvoted 2 times
...
xiaoeason
11 months ago
Selected Answer: C
Use similarity encoding when you have the following: 1. A large number of categorical variables 2. Noisy data
upvoted 2 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
A voting comment increases the vote count for the chosen answer by one.

Upvoting a comment with a selected answer will also increase the vote count towards that answer by one. So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.

SaveCancel
Loading ...
exam
Someone Bought Contributor Access for:
SY0-701
London, 1 minute ago