Exam DP-100 topic 2 question 23 discussion

Actual exam question from Microsoft's DP-100
Question #: 23
Topic #: 2

A set of CSV files contains sales records. All the CSV files have the same data schema.
Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:
[Figure not preserved. The answer options imply one folder per month, each holding a single file: sales/mm-yyyy/sales.csv.]

At the end of each month, a new folder with that month's sales file is added to the sales folder.
You plan to use the sales data to train a machine learning model based on the following requirements:
✑ You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe.
✑ You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
✑ You must register the minimum number of datasets possible.
You need to register the sales data as a dataset in the Azure Machine Learning service workspace.
What should you do?

  • A. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset each month, replacing the existing dataset and specifying a tag named month indicating the month and year it was registered. Use this dataset for all experiments.
  • B. Create a tabular dataset that references the datastore and specifies the path 'sales/*/sales.csv', register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments.
  • C. Create a new tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset_MM-YYYY each month with appropriate MM and YYYY values for the month and year. Use the appropriate month-specific dataset for experiments.
  • D. Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary.
Suggested Answer: D
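
For readers who want to see what option D might look like in practice, here is a minimal sketch using the Azure ML Python SDK v1 (azureml-core). It is an illustration under assumptions, not the official solution: the month labels such as "01-2019" are hypothetical (the question only gives the pattern mm-yyyy), the helper names month_paths, version_for_cutoff and register_month are invented here, and ws and datastore are assumed to be an existing Workspace and Datastore.

```python
from typing import List


def month_paths(months: List[str]) -> List[str]:
    """Build one explicit path per month folder, e.g. 'sales/01-2019/sales.csv'."""
    return [f"sales/{m}/sales.csv" for m in months]


def version_for_cutoff(months: List[str], cutoff: str) -> int:
    """Return the dataset version that holds data up to and including `cutoff`.

    Assumption: exactly one version was registered per month, in order,
    with no gaps, and version numbers start at 1.
    """
    return months.index(cutoff) + 1


def register_month(ws, datastore, months: List[str]):
    """Register every file to date as a new version of 'sales_dataset'.

    `months` lists all month folders so far (hypothetical labels);
    the last entry is used as the value of the 'month' tag.
    """
    # Imported lazily so the pure helpers above work without azureml-core.
    from azureml.core import Dataset

    paths = [(datastore, p) for p in month_paths(months)]
    dataset = Dataset.Tabular.from_delimited_files(path=paths)
    return dataset.register(
        workspace=ws,
        name="sales_dataset",
        tags={"month": months[-1]},
        create_new_version=True,  # same dataset name, new version each month
    )
```

To run an experiment on data up to a given month, you would then fetch the matching version, for example Dataset.get_by_name(ws, "sales_dataset", version=version_for_cutoff(months, "02-2019")), and call to_pandas_dataframe() on the result. Mapping a month to a version number by position is a simplification; in a real workspace you would likely check the month tag on the retrieved version as well.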

Comments

gamezone25
Highly Voted 3 years, 7 months ago
D seems to be the correct answer. B does not allow you to get the data from before a specific month. With D you create only one dataset with multiple versions (1 version per month). Similar example in 'Versioning best practice': https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets
upvoted 28 times
chevyli
2 years, 3 months ago
I guess you can, by using a module like Split Data or a filter? You can specify a condition to get the data before a particular month.
upvoted 3 times
Shailen
2 years, 11 months ago
But D doesn't satisfy the last requirement, to register the minimum number of datasets possible, since each specific sales file needs to be registered in option D. Answer B seems correct, as it fulfils all the conditions.
upvoted 4 times
chaudha4
3 years, 7 months ago
I agree. The example shown in the link below does exactly what is being asked in the question. https://docs.microsoft.com/en-us/azure/machine-learning/how-to-version-track-datasets#versioning-best-practice
upvoted 2 times
levm39
3 years, 6 months ago
"You must register the minimum number of datasets possible." Isn't D incorrect, because you will have to do this manually each month?
upvoted 4 times
YipingRuan
3 years, 5 months ago
But with B you can't select by (each) month.
upvoted 1 times
TheCyanideLancer
Highly Voted 2 years, 11 months ago
Quick update: verified, the correct answer is D. Cross-checked on Coursera and validated there.
upvoted 19 times
Lion007
Most Recent 11 months, 2 weeks ago
Selected Answer: D
The correct answer is: D

Option D is the most appropriate choice because it allows for both the inclusion of all data to date for general training and the ability to use specific versions for experiments that require data up to a particular month. The "minimum number of datasets" can be interpreted as the minimum number of distinct dataset entities registered in the workspace. With versioning (Option D), you're still working with one dataset entity, but with multiple versions, which aligns with the requirement of minimal dataset registration.

Justification:
- Versioning in Azure Machine Learning allows you to handle the evolving data by creating new versions of the dataset each month, without increasing the number of dataset entities in the workspace.
- By using version tags, you can manage and reference the appropriate data snapshot for experiments as needed.
- This approach offers a balance between efficient data management and the ability to run experiments on specific subsets of the data as of a given date, thus meeting all the stated requirements.
upvoted 3 times
Kanwal001
1 year, 3 months ago
On exam 28/08/2023..
upvoted 4 times
Depayser
1 year, 6 months ago
Selected Answer: D
Option D
upvoted 1 times
phydev
1 year, 4 months ago
ChatGPT agrees.
upvoted 1 times
MarinaMijailovic
1 year, 7 months ago
Selected Answer: D
A: *replaces* the existing dataset -> can't directly filter data before the specific month
B: captures all the sales data from different folders in *one dataset* -> can't directly filter data before the specific month
C: requires registering multiple datasets
D: satisfies all the requirements
upvoted 1 times
Yuriy_Ch
1 year, 9 months ago
Exactly this question was on exam 07/03/2023
upvoted 2 times
Jit1981
1 year, 8 months ago
Is Answer B or D?
upvoted 2 times
mamau
1 year, 9 months ago
B. Create a tabular dataset that references the datastore and specifies the path 'sales/*/sales.csv', register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments.

This option meets all the requirements of the problem statement:
✑ The dataset loads all of the sales data to date into a structure that can be easily converted to a dataframe.
✑ You can create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month by filtering the dataset based on the "month" tag.
✑ The minimum number of datasets possible is registered (only one).
upvoted 2 times
phdykd
1 year, 10 months ago
Option D satisfies the last requirement of registering the minimum number of datasets possible. While option B uses a single dataset that references the entire path 'sales/*/sales.csv', it still requires registering the dataset every month with a new tag indicating the month and year. In comparison, option D registers each month's sales data as a new version of the same dataset with a tag indicating the month and year. This allows you to register only one dataset instead of multiple datasets, minimizing the number of registered datasets. Option B does not satisfy the requirement of being able to create experiments that use only data that was created before a specific previous month, as it only references the entire path and not the individual files for each month.
upvoted 2 times
Edriv
1 year, 11 months ago
Option C
upvoted 2 times
Arend78
2 years ago
If I look at the explanation for the "correct" (?) answer B, it seems that they mean to ask "How do you load CSVs from the appropriate folders using the fewest lines?" In the explanation they use an asterisk. Not a very clear question, IMO.
upvoted 1 times
fvil
2 years, 1 month ago
On exam 07/11/2022
upvoted 1 times
victorafb
2 years, 1 month ago
On exam 16/10/2022. I answered D.
upvoted 3 times
ning
2 years, 6 months ago
Selected Answer: D
Absolutely correct, one dataset with different versions. Versions are NOT the same as different dataset!
upvoted 2 times
JTWang
2 years, 7 months ago
on exam 04/22/2022
upvoted 1 times
azurelearner666
2 years, 8 months ago
The correct answer is D, even though this question is very badly written and invites misunderstanding and confusion.
upvoted 3 times
kkkk_jjjj
2 years, 8 months ago
on exam 18/03/2022
upvoted 2 times