Exam Professional Machine Learning Engineer topic 1 question 53 discussion

Actual exam question from Google's Professional Machine Learning Engineer
Question #: 53
Topic #: 1

Your team is working on an NLP research project to predict the political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:

[Dataset table omitted. From the answer options, each author (A, B, C, D, ...) has written multiple texts (TextA1, TextA2, ...), each text consists of sentences (SentenceA11, SentenceA12, ...), and the label is the author's political affiliation.]

You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets. How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?

  • A. Distribute texts randomly across the train-test-eval subsets: Train set: [TextA1, TextB2, ...] Test set: [TextA2, TextC1, TextD2, ...] Eval set: [TextB1, TextC2, TextD1, ...]
  • B. Distribute authors randomly across the train-test-eval subsets: (*) Train set: [TextA1, TextA2, TextD1, TextD2, ...] Test set: [TextB1, TextB2, ...] Eval set: [TextC1, TextC2, ...]
  • C. Distribute sentences randomly across the train-test-eval subsets: Train set: [SentenceA11, SentenceA21, SentenceB11, SentenceB21, SentenceC11, SentenceD21 ...] Test set: [SentenceA12, SentenceA22, SentenceB12, SentenceC22, SentenceC12, SentenceD22 ...] Eval set: [SentenceA13, SentenceA23, SentenceB13, SentenceC23, SentenceC13, SentenceD31 ...]
  • D. Distribute paragraphs of texts (i.e., chunks of consecutive sentences) across the train-test-eval subsets: Train set: [SentenceA11, SentenceA12, SentenceD11, SentenceD12 ...] Test set: [SentenceA13, SentenceB13, SentenceB21, SentenceD23, SentenceC12, SentenceD13 ...] Eval set: [SentenceA11, SentenceA22, SentenceB13, SentenceD22, SentenceC23, SentenceD11 ...]
Suggested Answer: B
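
For illustration, here is a minimal sketch of the author-level split that option B describes, using scikit-learn's GroupShuffleSplit so that all of an author's texts stay in a single subset. The DataFrame and its author/text/affiliation column names are assumptions for the example, not from the question:

```python
# Author-level 80/10/10 split: all of an author's texts stay in one subset.
# Assumes a hypothetical pandas DataFrame with columns "author", "text",
# and "affiliation" (the label); none of these names come from the question.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def author_level_split(df: pd.DataFrame, seed: int = 42):
    """Return (train, test, eval) frames split ~80/10/10 by author."""
    # First cut: 80% of authors form the training set.
    outer = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=seed)
    train_idx, rest_idx = next(outer.split(df, groups=df["author"]))
    train, rest = df.iloc[train_idx], df.iloc[rest_idx]

    # Second cut: split the remaining 20% of authors in half (10% / 10%).
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    test_idx, eval_idx = next(inner.split(rest, groups=rest["author"]))
    return train, rest.iloc[test_idx], rest.iloc[eval_idx]
```

Note that the 80/10/10 ratio here applies to authors, not texts, so the text-level proportions are only approximate unless authors contribute similar numbers of articles.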

Comments

rc380
Highly Voted 3 years, 8 months ago
I think since we are predicting the political leaning of authors, perhaps distributing authors makes more sense? (B)
upvoted 21 times
sensev
3 years, 8 months ago
Agree it should be B. Since every author has a distinct style, splitting different texts from the same author across different sets could result in label leakage.
upvoted 8 times
dxxdd7
3 years, 7 months ago
I don't agree, as we want to know the political affiliation from a text, not from an author. I think A is better.
upvoted 2 times
jk73
3 years, 7 months ago
It is the political affiliation from a text, but to whom does that text belong? The statement clearly says: predict political affiliation of authors based on articles they have written. Hence the political affiliation is for each author, according to the texts they wrote.
upvoted 3 times
jk73
3 years, 7 months ago
Exactly! I also consider it B. Check this out: if we randomly put texts, paragraphs, or sentences into the training, validation, and test sets, the model gains the ability to learn author-specific qualities of language use beyond the content of the articles themselves, so it ends up mixing different signals. If instead we divide things up at the author level, so that a given author appears only in the training data, only in the test data, or only in the validation data, the model will find it harder to get artificially high accuracy on test and validation (which is correct and makes more sense!), because it has to generalize author by author rather than infer a single political affiliation from a bunch of mixed articles by different authors. https://developers.google.com/machine-learning/crash-course/18th-century-literature
upvoted 12 times
inder0007
Highly Voted 3 years, 9 months ago
Should be A; we are trying to get a label for the entire text, so only A makes sense.
upvoted 8 times
GogoG
3 years, 6 months ago
Correct answer is B - https://developers.google.com/machine-learning/crash-course/18th-century-literature
upvoted 5 times
Dunnoth
2 years, 2 months ago
This is a known study. If you use A, the moment a new author appears in the test set the accuracy is way lower than what your metrics might suggest. To get realistic evaluation results it should be B. Also note that the label is for the "author", not a text.
upvoted 1 times
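
Following up on this point: a quick way to verify that a split is leak-free is to assert that the three subsets' author sets are pairwise disjoint. A small sketch, reusing the hypothetical DataFrame columns from the example above:

```python
# Sanity check: the author sets of train/test/eval must be pairwise
# disjoint, otherwise author style can leak across subsets.
def assert_no_author_overlap(train, test, eval_):
    a_train, a_test, a_eval = (set(s["author"]) for s in (train, test, eval_))
    assert a_train.isdisjoint(a_test), "author leaked from train into test"
    assert a_train.isdisjoint(a_eval), "author leaked from train into eval"
    assert a_test.isdisjoint(a_eval), "author shared by test and eval"
```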
chibuzorrr
Most Recent 4 months, 4 weeks ago
Selected Answer: A
I think A. Option B's training set would not contain texts from authors supporting party B.
upvoted 2 times
PhilipKoku
10 months, 4 weeks ago
Selected Answer: B
B) Authors
upvoted 1 times
girgu
11 months, 1 week ago
Selected Answer: B
We have to divide/split at the author level. Otherwise the model will use the text-to-author relationship, but we want it to find the text-to-political-affiliation relationship. At prediction time we already know which author a text belongs to; what we want to find is the text-to-affiliation relation (and therefore the author-to-affiliation relation is implied).
upvoted 1 times
tavva_prudhvi
1 year, 10 months ago
Selected Answer: B
This is the best approach as it ensures that the data is distributed in a way that is representative of the overall population. By randomly distributing authors across the subsets, we ensure that each subset has a similar distribution of political affiliations. This helps to minimize bias and increases the likelihood that our model will generalize well to new data. Distributing texts randomly or by sentences or paragraphs may result in subsets that are biased towards a particular political affiliation. This could lead to overfitting and poor generalization performance. Therefore, it is important to distribute the data in a way that maintains the overall distribution of political affiliations across the subsets.
upvoted 3 times
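
As an aside on this comment's point about keeping each subset's affiliation mix representative while still splitting by author: scikit-learn's StratifiedGroupKFold (added in scikit-learn 1.0) does exactly this best-effort balancing. A sketch under the same hypothetical column names as above:

```python
# Group by author (no leakage) while stratifying on the label so each
# fold's affiliation distribution stays close to the overall one.
# With n_splits=10, one fold (~10%) can serve as test, another as eval,
# and the remaining eight (~80%) as train.
from sklearn.model_selection import StratifiedGroupKFold

def stratified_author_folds(df, n_splits=10, seed=42):
    sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True,
                                random_state=seed)
    return list(sgkf.split(df, y=df["affiliation"], groups=df["author"]))
```

Because all of an author's texts must stay together, exact label balance per fold is not guaranteed; the stratification is best-effort.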
M25
1 year, 11 months ago
Selected Answer: B
Went with B
upvoted 1 times
John_Pongthorn
2 years, 2 months ago
Selected Answer: B
https://cloud.google.com/automl-tables/docs/prepare#split https://developers.google.com/machine-learning/crash-course/18th-century-literature
upvoted 1 times
enghabeth
2 years, 2 months ago
Selected Answer: B
Ans: B. The model is to predict which political party the author belongs to, not which political party the text belongs to. You do not have the political party of each text; you are assuming that the texts are associated with the political party of their author.
upvoted 1 times
bL357A
2 years, 7 months ago
Selected Answer: A
label is party, feature is text
upvoted 2 times
suresh_vn
2 years, 8 months ago
IMO, B is correct; A, C, and D leak labels.
upvoted 1 times
ggorzki
3 years, 3 months ago
Selected Answer: B
https://developers.google.com/machine-learning/crash-course/18th-century-literature Split by authors; otherwise there will be data leakage: the model will gain the ability to learn author-specific use of language.
upvoted 6 times
NamitSehgal
3 years, 3 months ago
B I agree
upvoted 1 times
JobQ
3 years, 4 months ago
I already saw the video at https://developers.google.com/machine-learning/crash-course/18th-century-literature. Based on this video I concluded that the answer is A. What answer B is saying is that you will have Author B's texts in the training set, Author A's texts in the testing set, and Author C's texts in the validation set. According to the video, B is incorrect. We want to have texts from author A in the training, testing, and validation sets, so A is correct. I think most people are choosing B because of the word "author", but let's be careful.
upvoted 2 times
giaZ
3 years, 1 month ago
I thought the same initially, but no. We'd want texts from author A in the training, testing, and validation sets if the task were to predict the author from a text (meaning, if the label were the author, right? You train the model to learn the style of a text and connect it to an author; you'd need new texts from the same author in the test and validation sets to see if the model can recognize him/her). Here, the task is to predict political affiliation from a text of an author. The author is given. In the test and validation sets you need new authors, to see whether the model is able to guess their political affiliation. So you would use 80 authors (and their corresponding texts) for training, 10 different authors for validation, and 10 different ones for test.
upvoted 5 times
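
The recipe in the comment above is easy to do by hand as well: shuffle the authors, slice them 80/10/10, then collect each author's texts. A dependency-free sketch; the texts_by_author mapping is an assumed layout, not from the question:

```python
import random

def split_by_author(texts_by_author, seed=42):
    """texts_by_author: assumed dict mapping author -> list of their texts."""
    authors = sorted(texts_by_author)
    random.Random(seed).shuffle(authors)
    cut80 = int(len(authors) * 0.8)
    cut90 = int(len(authors) * 0.9)
    collect = lambda names: [t for a in names for t in texts_by_author[a]]
    return (collect(authors[:cut80]),        # train: 80% of authors
            collect(authors[cut80:cut90]),   # validation: next 10%
            collect(authors[cut90:]))        # test: final 10%
```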
pddddd
3 years, 7 months ago
Partition by author - there is an actual example of this in the Coursera 'Production ML Systems' course.
upvoted 1 times
Macgogo
3 years, 7 months ago
I think it is B. -- Your test data includes data from populations that will not be represented in production. For example, suppose you are training a model with purchase data from a number of stores. You know, however, that the model will be used primarily to make predictions for stores that are not in the training data. To ensure that the model can generalize to unseen stores, you should segregate your data sets by stores. In other words, your test set should include only stores different from the evaluation set, and the evaluation set should include only stores different from the training set. https://cloud.google.com/automl-tables/docs/prepare#ml-use
upvoted 4 times
Danny2021
3 years, 7 months ago
Should be D. Please see the dataset provided; it is based on texts/paragraphs.
upvoted 1 times
george_ognyanov
3 years, 6 months ago
Have a look at the link the others have already provided twice. Splitting sentence by sentence is literally mentioned in that video as a bad example and something we should not do in this case.
upvoted 1 times
Community vote distribution: A (35%), C (25%), B (20%), Other