Exam AWS Certified Machine Learning - Specialty topic 1 question 209 discussion

Exam question from Amazon's AWS Certified Machine Learning - Specialty

Question #: 209
Topic #: 1

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

A. Pick a date so that 80% of the data points precede the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
B. Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
C. Starting from the earliest date in the dataset, pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain.
D. Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.

Suggested Answer: A

by merki at Feb. 9, 2023, 12:54 p.m.

Comments

blanco750

Highly Voted 1 year, 3 months ago

Selected Answer: A

For time series data, it is important to split the dataset chronologically, with the training dataset containing the earlier dates and the validation dataset containing the later dates.
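A minimal sketch of such a chronological split, using only the Python standard library. The `chronological_split` helper and the ten sample prices are hypothetical, made up purely for illustration:

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split (date, price) rows so every training row precedes every validation row."""
    ordered = sorted(rows, key=lambda r: r[0])  # oldest first
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical data: 10 consecutive daily prices.
start = date(2023, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, valid = chronological_split(prices)
print(len(train), len(valid))      # 8 2
print(train[-1][0] < valid[0][0])  # True: training ends before validation starts
```

Sorting first means the split still works if the rows arrive out of order.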

upvoted 6 times

...

loict

Most Recent 10 months, 1 week ago

Selected Answer: A

A. YES - it is forecasting, so you want to predict the future, and validating on the 20% of data points after a cutoff date does exactly that.
B. NO - it is forecasting; we want to simulate an actual use case, not predict the past.
C. NO - there is data leakage, as future data points are used in the predictions.
D. NO - there is data leakage, as future data points are used in the predictions.
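To make the comparison concrete, here is a rough sketch that builds each of the four splits over day indices and checks whether any training day falls after the earliest validation day. The 100-day series and the `leaks_future` helper are assumptions for illustration, not part of the question:

```python
import random

n = 100  # hypothetical number of daily observations, indexed oldest to newest
days = list(range(n))


def leaks_future(train, valid):
    """True if any training day comes after the earliest validation day."""
    return max(train) > min(valid)


# A: first 80% for training, last 20% for validation.
a_train, a_valid = days[:80], days[80:]
# B: last 80% for training, first 20% for validation.
b_train, b_valid = days[20:], days[:20]
# C: repeating blocks of eight training days followed by two validation days.
c_train = [d for d in days if d % 10 < 8]
c_valid = [d for d in days if d % 10 >= 8]
# D: random 80/20 sample without replacement (seeded for repeatability).
rng = random.Random(0)
d_train = rng.sample(days, 80)
d_train_set = set(d_train)
d_valid = [d for d in days if d not in d_train_set]

for name, train, valid in [("A", a_train, a_valid), ("B", b_train, b_valid),
                           ("C", c_train, c_valid), ("D", d_train, d_valid)]:
    print(name, "trains on days after the validation window starts:",
          leaks_future(train, valid))
```

Only option A keeps every training day strictly before every validation day.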

upvoted 1 time

...

Mickey321

11 months ago

Selected Answer: A

For time series, keep the order.

upvoted 2 times

...

SANDEEP_AWS

1 year, 4 months ago

Selected Answer: A

A, as it's a time series problem.

upvoted 3 times

...

Valcilio

1 year, 4 months ago

Selected Answer: A

It's a time series problem, so the split needs to be made by date.

upvoted 1 time

...

AjoseO

1 year, 4 months ago

Selected Answer: A

Option A is the recommended approach: the training dataset contains historical prices that precede a certain date, and the validation dataset contains prices that occur after that date. This ensures that the model is trained on past data and evaluated on future data, which is more representative of real-world performance. Option D is NOT recommended for time series data because it ignores the time ordering. Randomly sampling data points without regard to their sequence can cause data leakage, so the validation results look better than the model's real forecasting performance.
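A toy illustration of why the random split flatters a model. Everything here is assumed for the sake of the example: the steadily drifting price series, and a deliberately naive "nearest training day" predictor standing in for a real forecasting model:

```python
import random

# Hypothetical toy series: 100 daily prices with a steady upward drift.
prices = [100.0 + 0.5 * t for t in range(100)]
days = list(range(len(prices)))


def nearest_day_mae(train_days, valid_days):
    """Predict each validation price with the price of the closest training day."""
    errors = []
    for v in valid_days:
        nearest = min(train_days, key=lambda t: abs(t - v))
        errors.append(abs(prices[v] - prices[nearest]))
    return sum(errors) / len(errors)


# Chronological split: the model only ever sees days before the validation window.
chrono_mae = nearest_day_mae(days[:80], days[80:])

# Random split: validation days are surrounded by training days on both sides.
rng = random.Random(0)
train_days = set(rng.sample(days, 80))
valid_days = [d for d in days if d not in train_days]
random_mae = nearest_day_mae(sorted(train_days), valid_days)

print(random_mae < chrono_mae)  # True: the random split looks better than it should
```

With the random split, most validation days have a training day right next to them, so the predictor effectively interpolates between known prices instead of forecasting ahead.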

upvoted 3 times

...

Adi_09

1 year, 5 months ago

A, since this is a time series problem.

upvoted 2 times

...

drcok87

1 year, 5 months ago

A. See https://towardsdatascience.com/time-series-from-scratch-train-test-splits-and-evaluation-metrics-4fd654de1b37

upvoted 2 times

...

HP0510

1 year, 5 months ago

Selected Answer: D

Because it randomly selects data points for both the training and validation datasets, ensuring that the samples are representative of the entire dataset and reducing the chances of overfitting. By randomly sampling without replacement, the data scientist can avoid any biases in the selection of data points and ensure that the training and validation datasets are independent.

upvoted 2 times

kaike_reis

11 months ago

For time series you should keep the order.

upvoted 2 times

...

merki

1 year, 5 months ago

I think it should be A!

upvoted 1 time

...