Exam Professional Machine Learning Engineer topic 1 question 82 discussion

Actual exam question from Google's Professional Machine Learning Engineer
Question #: 82
Topic #: 1

You are profiling the performance of your TensorFlow model training time and notice a performance issue caused by inefficiencies in the input data pipeline for a single 5 terabyte CSV file dataset on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?

  • A. Preprocess the input CSV file into a TFRecord file.
  • B. Randomly select a 10 gigabyte subset of the data to train your model.
  • C. Split into multiple CSV files and use a parallel interleave transformation.
  • D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.
Suggested Answer: C

Comments

5091a99
1 month, 4 weeks ago
Selected Answer: A
This is a bad question, but IMHO the answer is A. - TFRecord improves read speeds with its binary format. Presumably the large file is there for a reason, possibly as the output of an upstream process whose data may change in the future. Converting to TFRecord is a straightforward FIRST step as part of a pipeline. - The other option is parallel interleave. It also improves read speeds, but it is not as straightforward as a first step and requires many files in the database that need version control.
upvoted 1 times
...
NamitSehgal
2 months, 1 week ago
A. Preprocessing your data into TFRecord format can significantly improve I/O performance and reduce the time spent on parsing and loading data, which is critical for optimizing the input pipeline for large-scale datasets.
upvoted 1 times
...
bc3f222
2 months, 1 week ago
Selected Answer: A
According to the official docs, A. C seems to be a pre-TFX solution.
upvoted 1 times
...
phani49
4 months, 2 weeks ago
Selected Answer: A
Based on the official documentation, Option A (converting to TFRecord format) is actually the correct first action to try, and the claim otherwise is incorrect.
Why TFRecord is the best first option: the TFRecord format is specifically recommended for large datasets because:
- It provides extremely high throughput when reading from Cloud Storage, especially for large-scale training [2]
- It is the recommended format for structured data and large files [2]
- It is designed for efficient serialization of structured data and optimal performance with TensorFlow
upvoted 2 times
...
AB_C
5 months, 1 week ago
Selected Answer: C
C is the right answer.
upvoted 1 times
...
Prakzz
10 months ago
Selected Answer: A
Preprocessing the input CSV file into a TFRecord file optimizes the input data pipeline by enabling more efficient reading and processing. TFRecord is a binary format that is faster to read and more efficient for TensorFlow to process compared to CSV, which is a text-based format. This change can significantly reduce the time spent on data input operations during model training.
upvoted 3 times
...
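A minimal sketch of what option A could look like, assuming a hypothetical schema with two float columns named f1 and f2 and an integer column named label (none of these names come from the question). The helper names csv_to_tfrecord and read_tfrecord are illustrative only. At 5 TB the conversion would more plausibly run as a distributed job that writes many TFRecord shards; the single-file version below only shows the serialization and parsing steps.

```python
import csv

import tensorflow as tf


def csv_to_tfrecord(csv_path: str, tfrecord_path: str) -> int:
    """Serialize each CSV row as a tf.train.Example; return the row count."""
    count = 0
    with open(csv_path, newline="") as f, tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(f):
            # Hypothetical schema: float columns f1, f2 and an integer label.
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(float_list=tf.train.FloatList(
                    value=[float(row["f1"]), float(row["f2"])])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(row["label"])])),
            }))
            writer.write(example.SerializeToString())
            count += 1
    return count


# Parsing spec that mirrors the features written above.
FEATURE_SPEC = {
    "features": tf.io.FixedLenFeature([2], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def read_tfrecord(tfrecord_path: str) -> tf.data.Dataset:
    """Read the binary records back and parse them in parallel."""
    return tf.data.TFRecordDataset(tfrecord_path).map(
        lambda serialized: tf.io.parse_single_example(serialized, FEATURE_SPEC),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
```

The conversion is a one-time cost paid before training, which is the trade-off several commenters weigh against option C.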
PhilipKoku
11 months ago
Selected Answer: A
A) Converting the CSV file into TFRecord is more efficient than processing CSV in parallel (C).
upvoted 1 times
...
pinimichele01
1 year ago
Selected Answer: C
Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.
upvoted 4 times
...
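The "split into multiple CSV files" half of option C can be sketched with the standard library alone. This is only an illustration on a local file: split_csv, the part-NNNNN.csv naming, and rows_per_shard are hypothetical, and a 5 TB object on Cloud Storage would more realistically be split by a distributed job.

```python
import csv
from pathlib import Path


def split_csv(src: str, out_dir: str, rows_per_shard: int) -> list[str]:
    """Split one CSV into shards that each repeat the header row."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard_paths: list[str] = []
    handle = None
    writer = None
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return shard_paths
        for i, row in enumerate(reader):
            if i % rows_per_shard == 0:
                # Start a new shard and write the header into it.
                if handle is not None:
                    handle.close()
                path = out / f"part-{len(shard_paths):05d}.csv"
                handle = open(path, "w", newline="")
                writer = csv.writer(handle)
                writer.writerow(header)
                shard_paths.append(str(path))
            writer.writerow(row)
    if handle is not None:
        handle.close()
    return shard_paths
```

Repeating the header in every shard keeps each file independently readable, so a reader can skip one line per file.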
tavva_prudhvi
1 year, 5 months ago
Selected Answer: C
While preprocessing the input CSV file into a TFRecord file (Option A) can improve the performance of your input pipeline, it is not the first action to try in this situation. Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.
upvoted 1 times
...
andresvelasco
1 year, 7 months ago
Selected Answer: C
I think C, based on the phrase "Which action should you try first": continuing to use CSV should be the less disruptive change.
upvoted 1 times
...
TNT87
1 year, 11 months ago
Selected Answer: C
https://www.tensorflow.org/guide/data_performance#best_practice_summary
upvoted 2 times
...
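For reference, the parallel interleave in option C might look roughly like the sketch below once the data is already split into several CSV shards. The schema (two float features and an integer label), the function name make_csv_dataset, and cycle_length=4 are assumptions for illustration, not details from the question.

```python
import tensorflow as tf


def make_csv_dataset(file_pattern: str, batch_size: int = 32) -> tf.data.Dataset:
    """Read many CSV shards concurrently with a parallel interleave."""
    # Assumed schema (hypothetical): two float features and one integer label.
    record_defaults = [0.0, 0.0, 0]

    def parse_line(line):
        f1, f2, label = tf.io.decode_csv(line, record_defaults)
        return tf.stack([f1, f2]), label

    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    lines = files.interleave(
        # Each shard starts with a header row, so skip the first line.
        lambda path: tf.data.TextLineDataset(path).skip(1),
        cycle_length=4,                       # shards read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,  # let tf.data pick parallelism
        deterministic=False,                  # allow out-of-order reads
    )
    return (
        lines.map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
```

With a single file there is only one element for interleave to cycle over, which is why the option pairs the transformation with splitting the file first.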
M25
1 year, 12 months ago
Selected Answer: C
Went with C
upvoted 1 times
...
e707
2 years ago
Selected Answer: C
Option A, preprocess the input CSV file into a TFRecord file, is not as good because it requires additional processing time. Hence, I think C is the best choice.
upvoted 1 times
...
frangm23
2 years ago
Selected Answer: A
I think it could be A. https://cloud.google.com/architecture/best-practices-for-ml-performance-cost#preprocess_the_data_once_and_save_it_as_a_tfrecord_file
upvoted 1 times
...
[Removed]
2 years ago
Selected Answer: A
Clearly both A and C work here, but I can't find any documentation that suggests C is any better than A.
upvoted 1 times
...
Yajnas_arpohc
2 years, 1 month ago
"Which action should you try first" seems to be key -- C seems more intuitive as a first step! A is valid as well (interleave works with TFRecords) and definitely more efficient IMO, but maybe as a 2nd step!
upvoted 2 times
...
shankalman717
2 years, 2 months ago
Selected Answer: A
Option B (randomly selecting a 10 gigabyte subset of the data) could lead to a loss of useful data and may not be representative of the entire dataset. Option C (splitting into multiple CSV files and using a parallel interleave transformation) may also improve the performance, but may be more complex to implement and maintain, and may not be as efficient as converting to TFRecord. Option D (setting the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method) is not directly related to the input data format and may not provide as significant a performance improvement as converting to TFRecord.
upvoted 3 times
tavva_prudhvi
2 years, 1 month ago
Please read this page: https://www.tensorflow.org/tutorials/load_data/csv. It's simple to implement in the same input pipeline, and we cannot judge the answer by implementation difficulty!
upvoted 1 times
...
...
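On option D, a small sketch may help show why reshuffle_each_iteration is not a throughput setting: it only controls whether the element order changes from one epoch to the next. The helper epoch_orders and the toy range dataset are illustrative assumptions.

```python
import tensorflow as tf


def epoch_orders(reshuffle: bool, epochs: int = 2) -> list[list[int]]:
    """Return the element order seen in each epoch of a shuffled dataset."""
    ds = tf.data.Dataset.range(100).shuffle(
        buffer_size=100, seed=42, reshuffle_each_iteration=reshuffle
    )
    # Each pass over ds is one epoch; the same elements are read every time.
    return [[int(x) for x in ds.as_numpy_iterator()] for _ in range(epochs)]


fixed = epoch_orders(reshuffle=False)   # same order in every epoch
varied = epoch_orders(reshuffle=True)   # order is redrawn each epoch
```

Either way every epoch reads the same 100 elements, so the amount of data pulled through the pipeline is unchanged.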
Community vote distribution: A (35%, most voted), C (25%), B (20%), Other