Exam AWS Certified Machine Learning - Specialty topic 1 question 209 discussion

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

  • A. Pick a date so that 80% of the data points precede the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
  • B. Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
  • C. Starting from the earliest date in the dataset, pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain.
  • D. Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.
Suggested Answer: A

Comments

blanco750
Highly Voted 1 year, 1 month ago
Selected Answer: A
For time series data, it is important to split the dataset chronologically, with the training dataset containing the earlier dates and the validation dataset containing the later dates.
upvoted 6 times
...
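A minimal sketch of the chronological split described above, using only the Python standard library. The price series is made up for illustration, and `chronological_split` is a hypothetical helper name, not part of any library.

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split (date, price) rows so the earliest train_fraction go to training."""
    ordered = sorted(rows, key=lambda row: row[0])  # oldest first
    cutoff = int(len(ordered) * train_fraction)
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical data: 100 consecutive daily prices with made-up values.
start = date(2023, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + 0.5 * i) for i in range(100)]

train, validation = chronological_split(prices)
print(len(train), len(validation))      # 80 20
print(train[-1][0] < validation[0][0])  # True: every training date precedes validation
```

Every candidate model is then fit on `train` and scored on `validation`, so all models are compared on the same unseen, later period.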
loict
Most Recent 8 months ago
Selected Answer: A
A. YES - this is forecasting, so you want to predict the future, and holding out the 20% of data points after a date does that.
B. NO - this is forecasting; we want to simulate the actual use case, not predict the past.
C. NO - there is data leakage, as future data points are used in the predictions.
D. NO - there is data leakage, as future data points are used in the predictions.
upvoted 1 times
...
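To illustrate the leakage argument against option C, here is a small sketch that applies the "eight for training, two for validation" pattern to 100 time-ordered indices. `interleaved_split` is a hypothetical helper written for this example; index order stands in for date order.

```python
def interleaved_split(n_points, train_block=8, val_block=2):
    """Option C pattern: 8 points to training, then 2 to validation, repeated."""
    train_idx, val_idx = [], []
    period = train_block + val_block
    for i in range(n_points):
        if i % period < train_block:
            train_idx.append(i)
        else:
            val_idx.append(i)
    return train_idx, val_idx


train_idx, val_idx = interleaved_split(100)

# Validation points that are followed in time by at least one training point.
leaky = [v for v in val_idx if v < max(train_idx)]
print(len(train_idx), len(val_idx), len(leaky))  # 80 20 18
```

Here 18 of the 20 validation points are older than some training point, so a model evaluated on them has already seen later prices.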
Mickey321
8 months, 4 weeks ago
Selected Answer: A
For time series, keep the order.
upvoted 2 times
...
SANDEEP_AWS
1 year, 2 months ago
Selected Answer: A
A, as it's a time series problem.
upvoted 3 times
...
Valcilio
1 year, 2 months ago
Selected Answer: A
It's a time series problem, so the split needs to be made by date.
upvoted 1 times
...
AjoseO
1 year, 2 months ago
Selected Answer: A
Option A is the recommended approach where the training dataset contains historical prices that precede a certain date, and the validation dataset contains prices that occur after that date. This ensures that the model is trained on past data and evaluated on future data, which is more representative of real-world performance. Option D is NOT the recommended approach for time series data because it ignores the time aspect of the data. Randomly sampling data points without considering the time sequence can result in data leakage and poor model performance.
upvoted 3 times
...
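The contrast between options A and D can be sketched as follows. Both helpers are hypothetical and operate on time-ordered indices rather than real prices; the seed is arbitrary and only makes the shuffle repeatable.

```python
import random


def random_split(n_points, train_fraction=0.8, seed=0):
    """Option D: sample indices without replacement, ignoring time order."""
    rng = random.Random(seed)
    indices = list(range(n_points))
    rng.shuffle(indices)
    cutoff = int(n_points * train_fraction)
    return sorted(indices[:cutoff]), sorted(indices[cutoff:])


def chronological_split_idx(n_points, train_fraction=0.8):
    """Option A: the first train_fraction of indices train, the rest validate."""
    cutoff = int(n_points * train_fraction)
    return list(range(cutoff)), list(range(cutoff, n_points))


def has_future_leakage(train_idx, val_idx):
    """True if some training point is later in time than some validation point."""
    return max(train_idx) > min(val_idx)


rand_train, rand_val = random_split(100)
chrono_train, chrono_val = chronological_split_idx(100)

print(has_future_leakage(chrono_train, chrono_val))  # False
print(has_future_leakage(rand_train, rand_val))
```

The random split still yields an 80/20 partition, but for practically any shuffle the second check reports leakage, because training points end up scattered among and after the validation points.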
Adi_09
1 year, 2 months ago
A, since this is a time series problem.
upvoted 2 times
...
drcok87
1 year, 3 months ago
A. See https://towardsdatascience.com/time-series-from-scratch-train-test-splits-and-evaluation-metrics-4fd654de1b37
upvoted 2 times
...
HP0510
1 year, 3 months ago
Selected Answer: D
Because it randomly selects data points for both the training and validation datasets, ensuring that the samples are representative of the entire dataset and reducing the chances of overfitting. By randomly sampling without replacement, the data scientist can avoid any biases in the selection of data points and ensure that the training and validation datasets are independent.
upvoted 2 times
kaike_reis
9 months ago
For time series you should keep the order.
upvoted 2 times
...
...
merki
1 year, 3 months ago
I think it should be A!
upvoted 1 times
...