
Exam Professional Machine Learning Engineer topic 1 question 201 discussion

Actual exam question from Google's Professional Machine Learning Engineer
Question #: 201
Topic #: 1

You developed a Vertex AI pipeline that trains a classification model on data stored in a large BigQuery table. The pipeline has four steps, where each step is created by a Python function that uses the KubeFlow v2 API. The components have the following names:

[Screenshot of the four component definitions not reproduced in this transcript]
You launch your Vertex AI pipeline as the following:

[Screenshot of the pipeline launch code not reproduced in this transcript]
You perform many model iterations by adjusting the code and parameters of the training step. You observe high costs associated with the development, particularly the data export and preprocessing steps. You need to reduce model development costs. What should you do?

  • A. Change the components’ YAML filenames to export.yaml, preprocess.yaml, f"train-{dt}.yaml", f"calibrate-{dt}.yaml".
  • B. Add the {"kubeflow.v1.caching": True} parameter to the set of params provided to your PipelineJob.
  • C. Move the first step of your pipeline to a separate step, and provide a cached path to Cloud Storage as an input to the main pipeline.
  • D. Change the name of the pipeline to f"my-awesome-pipeline-{dt}".
Suggested Answer: A

Comments

guilhermebutzke
Highly Voted 1 year, 2 months ago
Selected Answer: A
My answer: A. From what I understood, it's about optimizing the code-adjustment loop while reusing previously processed results from the pipeline. Kubeflow caches these steps automatically, eliminating the need to explicitly store results at a designated path. However, the original filenames include a timestamp (`{dt}`), so by removing this timestamp the export and preprocess steps can hit the cache instead of rerunning. Option C could be an approach, but it would require more effort to implement (since Kubeflow handles caching automatically). Additionally, option C only mentions moving the first step, which is the export, and says nothing about preprocessing (which could be one of the more expensive steps). Considering all of these factors, I think A is the best choice.
upvoted 6 times
...
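[Editor's note] The reasoning above hinges on how step caching is keyed: Vertex AI Pipelines derives a step's cache key from the component's definition together with its resolved inputs, so a timestamp embedded in the component/YAML name changes the key on every run and forces re-execution. A minimal stdlib-only sketch of this content-addressed lookup (the `cache_key` helper and in-memory store are illustrative, not the actual Vertex AI implementation):

```python
import hashlib
import json

_cache: dict = {}  # stands in for the platform's store of cached step outputs

def cache_key(component_spec: dict, inputs: dict) -> str:
    """Key is a hash of the component definition plus its resolved inputs."""
    payload = json.dumps({"spec": component_spec, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step(component_spec: dict, inputs: dict, fn):
    """Return (result, cache_hit); skip execution when the key is known."""
    key = cache_key(component_spec, inputs)
    if key in _cache:
        return _cache[key], True   # cache hit: step skipped
    result = fn(inputs)
    _cache[key] = result
    return result, False           # cache miss: step executed

# Stable component name -> the second run hits the cache.
spec = {"name": "export.yaml"}
_, hit1 = run_step(spec, {"table": "bq://my_table"}, lambda i: "exported")
_, hit2 = run_step(spec, {"table": "bq://my_table"}, lambda i: "exported")

# Timestamped name (like f"export-{dt}.yaml") -> key changes, always a miss.
_, hit3 = run_step({"name": "export-2024-01-01.yaml"},
                   {"table": "bq://my_table"}, lambda i: "exported")
print(hit1, hit2, hit3)  # False True False
```

Under this model, dropping `{dt}` from the export and preprocess component names (option A) is exactly what keeps their cache keys stable across iterations.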
HaroonRaizada01
Most Recent 1 month, 3 weeks ago
Selected Answer: B
**Use Option B** (`{"kubeflow.v1.caching": True}`) to enable caching in your Vertex AI pipeline. This is the most efficient and cost-effective way to avoid redundant executions of expensive steps like data export and preprocessing.
upvoted 1 times
...
Sivaram06
3 months, 3 weeks ago
Selected Answer: B
Adding caching to your pipeline by setting the parameter {"kubeflow.v1.caching": True} is the most efficient and effective approach to reduce model development costs, particularly for steps like data export and preprocessing, which are often time-consuming and costly to repeat during multiple iterations. This will help you avoid unnecessary re-computation and save on resource usage.
upvoted 1 times
...
lunalongo
5 months ago
Selected Answer: C
Option A is a superficial change with no significant impact on cost optimization. Option C is the correct approach for effectively leveraging caching to reduce costs. C strategically uses the caching mechanism by separating the expensive preprocessing steps and storing their outputs in Cloud Storage, thus reducing costs by reusing the preprocessed data across multiple pipeline runs. Changing filenames could affect caching only if the caching mechanism relies on exact filename matching, which is unlikely. Besides, Kubeflow and Vertex AI Pipelines do not automatically handle caching of intermediate results; it is not inherent to the pipeline steps themselves; it's a feature that needs to be explicitly managed and leveraged.
upvoted 2 times
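[Editor's note] The manual-caching pattern option C describes — stage the expensive output at a fixed Cloud Storage path and skip the step when the artifact already exists — can be sketched as below. Paths and helper names are assumptions for illustration, with a local file standing in for a `gs://` object:

```python
import tempfile
from pathlib import Path

runs = {"export": 0}  # counts how many times the costly step really ran

def export_data(out_path: Path) -> Path:
    """Expensive step: run only if the cached artifact is absent."""
    if out_path.exists():
        return out_path          # reuse the staged export
    runs["export"] += 1          # simulate the costly BigQuery export
    out_path.write_text("exported rows")
    return out_path

with tempfile.TemporaryDirectory() as d:
    cached = Path(d) / "export.csv"   # stands in for a fixed gs:// path
    export_data(cached)               # first iteration: exports
    export_data(cached)               # later iterations: artifact reused
    print(runs["export"])             # 1
```

This works, but as the comments above note, it reimplements by hand what the pipeline's built-in caching already provides once the cache keys are stable.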
...
f084277
5 months, 2 weeks ago
Selected Answer: A
A. The dynamic filename is causing kubeflow to be unable to cache the export and preprocess steps, causing the problems mentioned in the question.
upvoted 1 times
...
Foxy2021
6 months, 3 weeks ago
I select C: By leveraging a Dataproc cluster, you can maintain compatibility with your existing PySpark jobs, minimize management overhead, and create a scalable proof of concept quickly and efficiently.
upvoted 1 times
...
Foxy2021
6 months, 3 weeks ago
I select B.
A: Changing the YAML filenames does not affect caching behavior or cost reduction. The pipeline's efficiency and cost-effectiveness are primarily governed by how it handles inputs and outputs rather than by the filenames of the components.
C: Moving the first step to a separate pipeline may help with organization but doesn't directly address the cost incurred by repeated data exports and preprocessing. Also, simply providing a cached path does not guarantee that the preprocessing step itself won't be executed multiple times.
D: Changing the name of the pipeline to include a timestamp or other identifier does not influence caching or resource usage. It merely alters the identification of the pipeline runs without any impact on the efficiency of the operations being performed.
upvoted 1 times
...
gscharly
1 year ago
Selected Answer: A
see guilhermebutzke
upvoted 1 times
...
pinimichele01
1 year ago
Selected Answer: A
see guilhermebutzke
upvoted 1 times
...
Yan_X
1 year, 1 month ago
Selected Answer: C
C. Caching should be enabled for all steps, e.g., export, preprocessing, and training.
upvoted 1 times
...
shadz10
1 year, 3 months ago
Selected Answer: C
Not A: changing file names does not help with reducing costs. Not B: you cannot directly use kubeflow.v1.caching on a pipeline that uses the KubeFlow v2 API; the kubeflow.v1.caching module is designed for KubeFlow Pipelines v1, and its structure and functionality are not compatible with KubeFlow Pipelines v2. So the best option here is C.
upvoted 2 times
...
b1a8fae
1 year, 3 months ago
Selected Answer: C
I considered B but a search of "kubeflow.v1.caching" on Google only produces 1 result, which is this very question on this very website. Thus, I rule it out as non-existent (please share a resource if there is any that proves it exists) and opt for C.
upvoted 1 times
...
BlehMaks
1 year, 3 months ago
Selected Answer: A
I think it's A. 1) If we want to reuse the same results several times, we shouldn't rename them, so we need to delete {dt} from the first two components' names. 2) We already have the option enable_caching=True, so why would we need kubeflow.v1.caching? 3) I'm not sure, but maybe it does matter
upvoted 2 times
BlehMaks
1 year, 3 months ago
3) I'm not sure, but maybe it does matter that the KubeFlow v2 API and kubeflow.v1.caching have different versions (v1 and v2)
upvoted 1 times
...
...
pikachu007
1 year, 3 months ago
Selected Answer: B
Enables caching: Setting this parameter instructs Vertex AI Pipelines to cache the outputs of pipeline steps that have successfully completed. This means that if a step's inputs haven't changed, its execution can be skipped, reusing the cached output instead. Targets costly steps: The prompt highlights that data export and preprocessing steps are particularly expensive. Caching these steps can significantly reduce costs during model iterations.
upvoted 2 times
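[Editor's note] The skip-if-inputs-unchanged behavior this comment describes can be modeled with a small memoized step. This is an illustrative stdlib sketch, not the Vertex AI API (note that the Vertex AI SDK exposes caching via the `enable_caching` argument on `PipelineJob`, as an earlier comment mentions, rather than via a `kubeflow.v1.caching` param):

```python
from functools import lru_cache

executions = {"preprocess": 0}  # counts actual (non-cached) runs

@lru_cache(maxsize=None)
def preprocess(table: str, fraction: float) -> str:
    """Runs only when the (table, fraction) inputs haven't been seen before."""
    executions["preprocess"] += 1
    return f"preprocessed:{table}:{fraction}"

preprocess("bq://my_table", 0.8)   # executed
preprocess("bq://my_table", 0.8)   # inputs unchanged -> cached, skipped
preprocess("bq://my_table", 0.9)   # input changed -> executed again
print(executions["preprocess"])    # 2
```

Only iterations that actually change a step's inputs pay for that step again, which is why caching targets the repeated export and preprocessing costs in the question.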
...
Community vote distribution: A (35%), C (25%), B (20%), Other
