Exam AWS Certified Machine Learning - Specialty topic 1 question 274 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 274
Topic #: 1

[All AWS Certified Machine Learning - Specialty Questions]

A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy.

What is the MOST effective way to encode this categorical feature into a numeric feature?

A. Spell check the column. Use Amazon SageMaker one-hot encoding on the column to transform a categorical feature to a numerical feature.
B. Fix the spelling in the column by using char-RNN. Use Amazon SageMaker Data Wrangler one-hot encoding to transform a categorical feature to a numerical feature.
C. Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.
D. Use Amazon SageMaker Data Wrangler ordinal encoding on the column to encode categories into an integer between 0 and the total number of categories in the column.

Show Suggested Answer

Suggested Answer: C 🗳️

by usamazubairi at Dec. 14, 2023, 6:02 p.m.

Disclaimers:

- ExamTopics website is not related to, affiliated with, endorsed or authorized by Amazon.
- Trademarks, certification & product names are used for reference only and belong to Amazon.

Comments

Submit Cancel

usamazubairi

Highly Voted 1 year ago

Ansewer is A: Given the scenario, One-Hot Encoding would be the most effective way to encode the categorical feature into a numerical feature

upvoted 5 times

...

AIWave

Most Recent 9 months, 3 weeks ago

Selected Answer: C

Similarity encoding is meant to encode high cardinality features having misspelled values by group them closer. In high cardinality, it's more efficient o create vectors of real numbers than creating insane number of one hot encoded columns

upvoted 1 times

...

vkbajoria

9 months, 4 weeks ago

A - most effective. dataset seems small and manually fixing of spelling is possible since this is categorical

upvoted 2 times

...

Alice1234

10 months, 4 weeks ago

C- Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.

upvoted 1 times

...

Topg4u

10 months, 4 weeks ago

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia". https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-transform.html

upvoted 2 times

...

aakash_0086

11 months, 1 week ago

Selected Answer: B

a character-level Recurrent Neural Network (char-RNN) can be used to fix spelling mistakes in a column containing medication names.

upvoted 2 times

...

xiaoeason

1 year ago

Selected Answer: C

Use similarity encoding when you have the following: 1. A large number of categorical variables 2. Noisy data

upvoted 2 times

...