Exam DP-100 topic 2 question 23 discussion

Actual exam question from Microsoft's DP-100

Question #: 23
Topic #: 2

A set of CSV files contains sales records. All the CSV files have the same data schema.
Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:

At the end of each month, a new folder with that month's sales file is added to the sales folder.
You plan to use the sales data to train a machine learning model based on the following requirements:
✑ You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe.
✑ You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
✑ You must register the minimum number of datasets possible.
You need to register the sales data as a dataset in Azure Machine Learning service workspace.
What should you do?

A. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset each month, replacing the existing dataset and specifying a tag named month indicating the month and year it was registered. Use this dataset for all experiments.
B. Create a tabular dataset that references the datastore and specifies the path 'sales/*/sales.csv', register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments.
C. Create a new tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset_MM-YYYY each month with appropriate MM and YYYY values for the month and year. Use the appropriate month-specific dataset for experiments.
D. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary.

Suggested Answer: D 🗳️

by dev2dev at March 15, 2021, 6:59 a.m.

Comments

gamezone25

Highly Voted 3 years, 9 months ago

D seems to be the correct answer. B does not allow you to get the data from before a specific month. With D you create only one dataset with multiple versions (1 version per month). Similar example in 'Versioning best practice': https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets
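As a rough illustration of what option D involves each month, the sketch below builds the explicit list of 'sales/mm-yyyy/sales.csv' paths up to a given month. The helper name `monthly_sales_paths` and the start month are hypothetical, chosen only for the example. The commented lines show how the list might be passed to the Azure ML SDK v1 (`azureml-core`), assuming a `ws` workspace object and a `datastore` object already exist.

```python
from datetime import date
from typing import List


def monthly_sales_paths(first: date, last: date) -> List[str]:
    """Return 'sales/mm-yyyy/sales.csv' for every month from first to last, inclusive."""
    paths = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        paths.append(f"sales/{month:02d}-{year}/sales.csv")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


# Hypothetical date range, for illustration only.
paths = monthly_sales_paths(date(2019, 11, 1), date(2020, 2, 1))
print(paths)

# Assumed usage with azureml-core (SDK v1); `ws` and `datastore` are not defined here:
# from azureml.core import Dataset
# ds = Dataset.Tabular.from_delimited_files(path=[(datastore, p) for p in paths])
# ds = ds.register(workspace=ws, name="sales_dataset",
#                  tags={"month": "02-2020"}, create_new_version=True)
```

Registering under the same name with `create_new_version=True` keeps a single dataset entry in the workspace while adding one version per month.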

upvoted 28 times

chevyli

2 years, 5 months ago

I guess you can, by using a module like Split or Filter data? You can specify a condition to get data from before a particular month.

upvoted 3 times

Shailen

3 years, 1 month ago

But D doesn't satisfy the last requirement, to register the minimum number of datasets possible, since each specific sales file needs to be registered in option D. Answer B seems correct, as it fulfils all the conditions.

upvoted 4 times

chaudha4

3 years, 9 months ago

I agree. The example shown in the link below does exactly what is being asked in the question. https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets#versioning-best-practice

upvoted 2 times

levm39

3 years, 8 months ago

You must register the minimum number of datasets possible. Isn't D incorrect, because you would have to do this manually each month?

upvoted 4 times

YipingRuan

3 years, 6 months ago

But with B you can't select by month.

upvoted 1 times

TheCyanideLancer

Highly Voted 3 years ago

Quick update: verified, the correct answer is D. Cross-checked on Coursera and validated there.

upvoted 19 times

Lion007

Most Recent 1 year, 1 month ago

Selected Answer: D

The correct answer is D. Option D is the most appropriate choice because it allows for both the inclusion of all data to date for general training and the ability to use specific versions for experiments that require data up to a particular month.

The "minimum number of datasets" can be interpreted as the minimum number of distinct dataset entities registered in the workspace. With versioning (Option D), you're still working with one dataset entity, but with multiple versions, which aligns with the requirement of minimal dataset registration.

Justification:
- Versioning in Azure Machine Learning allows you to handle the evolving data by creating new versions of the dataset each month, without increasing the number of dataset entities in the workspace.
- By using version tags, you can manage and reference the appropriate data snapshot for experiments as needed.
- This approach offers a balance between efficient data management and the ability to run experiments on specific subsets of the data as of a given date, thus meeting all the stated requirements.
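To make the "identify the version by its month tag" step concrete, here is a minimal sketch that models registered versions as plain dictionaries and picks the latest version whose month tag does not exceed a cutoff. The names `parse_month`, `version_for_cutoff`, and the sample `registered` list are hypothetical, and treating the cutoff as inclusive is an assumption. With the Azure ML SDK v1, the version number found this way could presumably be passed to `Dataset.get_by_name(ws, "sales_dataset", version=...)`.

```python
from typing import Dict, List, Optional, Tuple


def parse_month(tag: str) -> Tuple[int, int]:
    """Convert an 'mm-yyyy' tag into a sortable (year, month) tuple."""
    mm, yyyy = tag.split("-")
    return int(yyyy), int(mm)


def version_for_cutoff(versions: List[Dict], cutoff: str) -> Optional[int]:
    """Return the newest version whose month tag is at or before the cutoff."""
    limit = parse_month(cutoff)
    eligible = [v for v in versions if parse_month(v["tags"]["month"]) <= limit]
    if not eligible:
        return None
    newest = max(eligible, key=lambda v: parse_month(v["tags"]["month"]))
    return newest["version"]


# Hypothetical in-memory stand-in for the versions of one registered dataset.
registered = [
    {"version": 1, "tags": {"month": "11-2019"}},
    {"version": 2, "tags": {"month": "12-2019"}},
    {"version": 3, "tags": {"month": "01-2020"}},
]

print(version_for_cutoff(registered, "12-2019"))
```

Comparing (year, month) tuples rather than the raw 'mm-yyyy' strings avoids ordering errors across year boundaries.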

upvoted 3 times
