Exam Professional Machine Learning Engineer topic 1 question 82 discussion

Actual exam question from Google's Professional Machine Learning Engineer

Question #: 82
Topic #: 1

You are profiling the performance of your TensorFlow model training time and notice a performance issue caused by inefficiencies in the input data pipeline for a single 5 terabyte CSV file dataset on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?

A. Preprocess the input CSV file into a TFRecord file.
B. Randomly select a 10 gigabyte subset of the data to train your model.
C. Split into multiple CSV files and use a parallel interleave transformation.
D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.

Suggested Answer: C 🗳️

by LearnSodas at Dec. 11, 2022, 6:03 p.m.

Comments

pinimichele01

Highly Voted 1 year, 2 months ago

Selected Answer: C

Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.
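For illustration, the "split into multiple CSV files" half of option C could be sketched as below. This is a minimal, standard-library-only sketch that assumes local file paths and a header row; the function name `split_csv`, the shard naming scheme, and the tiny demo file are hypothetical, and a real 5 TB object on Cloud Storage would more likely be split by a distributed job than by a single process.

```python
import csv
import os
import tempfile


def split_csv(src_path, out_dir, rows_per_shard):
    """Stream src_path into shard files holding at most rows_per_shard data rows.

    The header row is repeated at the top of every shard.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out, shard_writer, count = None, None, 0
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the file starts with a header row
        for row in reader:
            # Open a new shard at the start and whenever the current one is full.
            if shard_writer is None or count == rows_per_shard:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"shard-{len(shard_paths):05d}.csv")
                out = open(path, "w", newline="")
                shard_writer = csv.writer(out)
                shard_writer.writerow(header)
                shard_paths.append(path)
                count = 0
            shard_writer.writerow(row)
            count += 1
    if out is not None:
        out.close()
    return shard_paths


# Tiny stand-in for the real file: a header plus 10 data rows.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "big.csv")
with open(src_path, "w", newline="") as f:
    demo_writer = csv.writer(f)
    demo_writer.writerow(["feature", "label"])
    for i in range(10):
        demo_writer.writerow([i, i % 2])

# 10 rows at 4 rows per shard gives 3 shards (4 + 4 + 2 rows).
shards = split_csv(src_path, os.path.join(work_dir, "shards"), rows_per_shard=4)
print(len(shards))
```

The split only needs one sequential pass over the file, and the resulting shards can then be read concurrently by the input pipeline.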

upvoted 5 times

...

5091a99

Most Recent 3 months, 2 weeks ago

Selected Answer: A

This is a bad question, but IMHO the answer is A.
- TFRecords will improve read speeds thanks to the binary format. Presumably the large file is there for a reason, possibly as the output of an upstream process whose data may change in the future. Converting to TFRecord is a straightforward FIRST step as part of a pipeline.
- The other option is parallel interleave. It also improves read speeds, but it is not as straightforward as a first step, and it requires lots of files in the database that need version control.

upvoted 1 times

...

NamitSehgal

3 months, 3 weeks ago

A. Preprocessing your data into TFRecord format can significantly improve I/O performance and reduce the time spent on parsing and loading data, which is critical for optimizing the input pipeline for large-scale datasets.

upvoted 1 times

...

bc3f222

3 months, 3 weeks ago

Selected Answer: A

According to the official docs, A. C seems to be a pre-TFX solution.

upvoted 1 times

...

phani49

6 months ago

Selected Answer: A

Based on the official documentation, Option A (converting to TFRecord format) is actually the correct first action to try, and the claim is incorrect.

Why TFRecord is the Best First Option

TFRecord format is specifically recommended for large datasets because:
- It provides extremely high throughput when reading from Cloud Storage, especially for large-scale training[2]
- It's the recommended format for structured data and large files[2]
- It's designed for efficient serialization of structured data and optimal performance with TensorFlow

upvoted 2 times

...

AB_C

6 months, 3 weeks ago

Selected Answer: C

c is the right answer

upvoted 1 times

...

Prakzz

11 months, 3 weeks ago

Selected Answer: A

Preprocessing the input CSV file into a TFRecord file optimizes the input data pipeline by enabling more efficient reading and processing. TFRecord is a binary format that is faster to read and more efficient for TensorFlow to process compared to CSV, which is a text-based format. This change can significantly reduce the time spent on data input operations during model training.
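As a rough illustration of option A, the sketch below writes a few rows as `tf.train.Example` records into a TFRecord file and reads them back with `tf.data.TFRecordDataset`. The column names (`feature`, `label`), the sample rows, and the output path are assumptions made for the example; a real conversion of a 5 TB file would stream rows and typically write many shards.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical rows standing in for parsed CSV lines: (feature, label).
rows = [(0.5, 0), (1.5, 1), (2.5, 0)]
path = os.path.join(tempfile.mkdtemp(), "data.tfrecord")

# Serialize each row as a tf.train.Example and append it to the TFRecord file.
with tf.io.TFRecordWriter(path) as writer:
    for feature, label in rows:
        example = tf.train.Example(features=tf.train.Features(feature={
            "feature": tf.train.Feature(
                float_list=tf.train.FloatList(value=[feature])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())

# The schema must match what was written above.
schema = {
    "feature": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

dataset = tf.data.TFRecordDataset(path).map(
    lambda record: tf.io.parse_single_example(record, schema))

decoded = [(float(ex["feature"]), int(ex["label"])) for ex in dataset]
print(decoded)
```

Note that the text parsing happens once, at conversion time; training epochs then read the already-typed binary records.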

upvoted 3 times

...

PhilipKoku

1 year ago

Selected Answer: A

A) Converting the CSV file into TFRecord is more efficient than processing CSV in parallel (C).

upvoted 1 times

...

tavva_prudhvi

1 year, 7 months ago

Selected Answer: C

While preprocessing the input CSV file into a TFRecord file (Option A) can improve the performance of your input pipeline, it is not the first action to try in this situation. Converting a large 5 terabyte CSV file to a TFRecord can be a time-consuming process, and you would still be dealing with a single large file.

upvoted 1 times

...

andresvelasco

1 year, 9 months ago

Selected Answer: C

I think C, based on the wording "Which action should you try first", meaning it should be less impactful to continue using CSV.

upvoted 1 times

...

TNT87

2 years ago

Selected Answer: C

https://www.tensorflow.org/guide/data_performance#best_practice_summary

upvoted 2 times

...

M25

2 years, 1 month ago

Selected Answer: C

Went with C

upvoted 1 times

...

e707

2 years, 1 month ago

Selected Answer: C

Option A, preprocess the input CSV file into a TFRecord file, is not as good because it requires additional processing time. Hence, I think C is the best choice.

upvoted 1 times

...

frangm23

2 years, 1 month ago

Selected Answer: A

I think it could be A. https://cloud.google.com/architecture/best-practices-for-ml-performance-cost#preprocess_the_data_once_and_save_it_as_a_tfrecord_file

upvoted 1 times

...

[Removed]

2 years, 2 months ago

Selected Answer: A

Clearly both A and C work here, but I can't find any documentation which suggests C is any better than A.

upvoted 1 times

...

Yajnas_arpohc

2 years, 3 months ago

"Which action should you try first" seems to be key -- C seems more intuitive as first step! A is valid as well (interleave works w TFRecords) & definitely more efficient IMO, but maybe 2nd step!

upvoted 2 times

...

shankalman717

2 years, 3 months ago

Selected Answer: A

Option B (randomly selecting a 10 gigabyte subset of the data) could lead to a loss of useful data and may not be representative of the entire dataset. Option C (splitting into multiple CSV files and using a parallel interleave transformation) may also improve the performance, but may be more complex to implement and maintain, and may not be as efficient as converting to TFRecord. Option D (setting the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method) is not directly related to the input data format and may not provide as significant a performance improvement as converting to TFRecord.
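To make the point about option D concrete, here is a small sketch of what `reshuffle_each_iteration` controls. It only affects the order in which elements are produced on each pass over the dataset, not how fast they are read, and it already defaults to true in `tf.data.Dataset.shuffle`. The toy range dataset, buffer size, and seed are arbitrary choices for the example.

```python
import tensorflow as tf

# Ten integers stand in for training examples.
dataset = tf.data.Dataset.range(10).shuffle(
    buffer_size=10, seed=42, reshuffle_each_iteration=True)

# Each pass over the dataset is one "epoch"; with reshuffling enabled,
# the order is drawn again on every pass.
epoch1 = [int(x) for x in dataset]
epoch2 = [int(x) for x in dataset]

print(epoch1)
print(epoch2)
```

Both epochs contain exactly the same ten elements, so nothing about the amount of data read or parsed changes.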

upvoted 3 times

tavva_prudhvi

2 years, 2 months ago

Please read this site: https://www.tensorflow.org/tutorials/load_data/csv. It's simple to implement in the same input pipeline, and we cannot judge the answer by implementation difficulties!
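A minimal sketch of the interleave half of option C, assuming the data has already been split into CSV shards with a header row and two columns. The shard names, column layout, and `cycle_length` are made up for the example. The older `tf.data.experimental.parallel_interleave` transformation is deprecated; the sketch uses `tf.data.Dataset.interleave` with `num_parallel_calls` instead.

```python
import os
import tempfile

import tensorflow as tf

# Write three tiny CSV shards standing in for the split dataset.
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp_dir, f"shard-{i}.csv"), "w") as f:
        f.write("feature,label\n")
        for j in range(4):
            f.write(f"{i * 4 + j}.0,{j % 2}\n")


def parse_line(line):
    # One float column and one int column; the defaults also set the dtypes.
    feature, label = tf.io.decode_csv(line, record_defaults=[[0.0], [0]])
    return feature, label


files = tf.data.Dataset.list_files(
    os.path.join(tmp_dir, "shard-*.csv"), shuffle=True)

# Read several shards concurrently, skipping each shard's header line.
lines = files.interleave(
    lambda path: tf.data.TextLineDataset(path).skip(1),
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,
)

dataset = (
    lines.map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

features = sorted(float(x) for batch, _ in dataset for x in batch.numpy())
print(features)
```

The same `interleave` call works over TFRecord shards by swapping `tf.data.TextLineDataset` for `tf.data.TFRecordDataset`, so options A and C are not mutually exclusive.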

upvoted 1 times

...
