Exam Professional Machine Learning Engineer topic 1 question 32 discussion

Actual exam question from Google's Professional Machine Learning Engineer

Question #: 32
Topic #: 1

You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?

A. Significantly increase the max_batch_size TensorFlow Serving parameter.
B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving.
C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter.
D. Recompile TensorFlow Serving using the source to support CPU-specific optimizations. Instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.

Suggested Answer: D

by DucLee3110 at July 1, 2021, 7:58 a.m.

Comments

Y2Data

Highly Voted 3 years, 10 months ago

D is correct, since this question focuses on server performance, where the development environment outperforms the production environment. The pods are already throttling, so increasing the pressure on them won't help, and both A and C essentially do that. B is a bit mysterious, but we definitely know that D would work.

upvoted 31 times

mousseUwU

3 years, 9 months ago

I think it's D too

upvoted 3 times

...

pico

Highly Voted 1 year, 8 months ago

Selected Answer: C

https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning

A may help to some extent, but it primarily affects how many requests are processed in a single batch, so it might not directly address latency issues. D is a valid way to optimize TensorFlow Serving for the serving CPUs, but it's a more involved process and might not be the quickest way to address latency issues.

upvoted 5 times
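
For readers weighing option C: the README linked in this comment describes max_enqueued_batches as a bound on queueing delay, so raising it allows a larger backlog rather than a faster one. A rough sketch of that reasoning follows. The function name, the fixed per-batch latency, and the sample numbers are assumptions made for illustration; this is not a TensorFlow Serving API.

```python
import math


def worst_case_queue_delay_ms(max_enqueued_batches, num_batch_threads, batch_latency_ms):
    """Rough upper bound on how long a newly queued batch waits to start.

    Assumes the queue is full, every batch takes the same time to run,
    and num_batch_threads batches run in parallel. All three are
    simplifications chosen for illustration.
    """
    rounds = math.ceil(max_enqueued_batches / num_batch_threads)
    return rounds * batch_latency_ms


# Hypothetical numbers: 4 batch threads, 20 ms per batch.
for limit in (8, 800):
    print(limit, worst_case_queue_delay_ms(limit, 4, 20))
```

Under these assumptions the worst-case wait grows in proportion to the queue limit, which is consistent with the view that a much larger max_enqueued_batches does not by itself lower latency.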

...

desertlotus1211

Most Recent 7 months, 1 week ago

Selected Answer: D

A is wrong. Increasing max_batch_size batches more requests together, but this introduces delays since the system must wait to accumulate a full batch. This approach can improve throughput but may increase per-query latency, which contradicts the goal of reducing latency.

upvoted 2 times

...

rajshiv

8 months ago

Selected Answer: A

I do not think D is correct as D is focused on optimizing CPU utilization, not on the batching process or managing latency. Since our goal is to improve serving latency, optimizing batching via the max_batch_size parameter is a more straightforward and effective solution.

upvoted 1 times

...

AB_C

8 months, 1 week ago

Selected Answer: A

A would work

upvoted 1 times

...

desertlotus1211

9 months, 2 weeks ago

max_batch_size: Increasing the max_batch_size parameter allows TensorFlow Serving to process more requests in a single batch. This can improve throughput and reduce latency, especially in high-query environments, as it allows more efficient utilization of CPU resources by processing larger batches of requests at once. Answer A

upvoted 2 times

...

taksan

11 months, 3 weeks ago

Selected Answer: D

I think the correct answer is D, because the question is about reducing latency. As for A, increasing the batch size might even hurt latency if the system is already overwhelmed by the number of requests it has to serve.

upvoted 2 times

...

chirag2506

1 year, 1 month ago

Selected Answer: D

it is D

upvoted 2 times

...

PhilipKoku

1 year, 2 months ago

Selected Answer: C

C) Batch enqueued

upvoted 1 times

...

pinimichele01

1 year, 3 months ago

Selected Answer: D

A, increasing the max_batch_size TensorFlow Serving parameter, is not the best choice because increasing the batch size may not necessarily improve latency. In fact, it may even lead to higher latency for individual requests, as they will have to wait for the batch to be filled before processing. This may be useful when optimizing for throughput, but not for serving latency, which is the primary goal in this scenario.

upvoted 2 times
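
The "wait for the batch to be filled" effect can be made concrete with a toy model. The sketch below is not TensorFlow Serving code: it assumes evenly spaced arrivals, a single queue, and zero compute time, and the function name and sample numbers are made up for illustration. A batch closes when it reaches max_batch_size or when batch_timeout_us has passed since its first request arrived.

```python
def mean_batch_wait_us(arrival_interval_us, max_batch_size, batch_timeout_us,
                       n_requests=2100):
    """Mean time (microseconds) a request waits for its batch to close.

    Toy model: evenly spaced arrivals, one queue, zero compute time.
    A batch closes when it is full or when batch_timeout_us has passed
    since its first request arrived, whichever comes first.
    """
    total_wait = 0.0
    i = 0
    while i < n_requests:
        # Arrivals land at offsets 0, interval, 2*interval, ... up to the timeout.
        fits_before_timeout = int(batch_timeout_us // arrival_interval_us) + 1
        size = min(max_batch_size, fits_before_timeout, n_requests - i)
        if size == max_batch_size:
            open_for = (size - 1) * arrival_interval_us  # closed because full
        else:
            open_for = batch_timeout_us                  # closed by the timeout
        # Request j of the batch arrived j*interval after the batch opened.
        total_wait += size * open_for - arrival_interval_us * size * (size - 1) / 2
        i += size
    return total_wait / n_requests


# Hypothetical load: one request every 500 microseconds, 10 ms timeout.
for size in (4, 64):
    print(size, mean_batch_wait_us(500, size, 10_000))
```

In this model the mean wait is 750 microseconds with max_batch_size=4 and 5000 microseconds with max_batch_size=64, because the larger batch never fills before the timeout. With batch_timeout_us set to 0 the wait is zero for any batch size, so the added latency depends on the timeout as much as on the batch size.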

...

ichbinnoah

1 year, 8 months ago

Selected Answer: A

I think A is correct, as D implies changes to the infrastructure (question says you must not do that).

upvoted 1 times

edoo

1 year, 5 months ago

This is purely a software optimization, plus a setting for how GKE selects nodes. GKE should be able to choose different CPU types for nodes within the same cluster, which doesn't represent a change in architecture.

upvoted 1 times

...

tavva_prudhvi

1 year, 12 months ago

Selected Answer: D

upvoted 2 times

...

harithacML

2 years ago

Selected Answer: D

max_batch_size parameter controls the maximum number of requests that can be batched together by TensorFlow Serving. Increasing this parameter can help reduce the number of round trips between the client and server, which can improve serving latency. However, increasing the batch size too much can lead to higher memory usage and longer processing times for each batch.

upvoted 2 times

...

Liting

2 years ago

Selected Answer: D

Definitely D. To improve the serving latency of an ML model on AI Platform, you can recompile TensorFlow Serving from source to support CPU-specific optimizations and instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes. This way GKE will schedule the pods on nodes with at least that CPU platform.

upvoted 2 times
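
The two halves of option D depend on each other: a binary compiled with instructions such as AVX2 fails on a CPU that lacks them, which is why the node pool needs a minimum CPU platform (set in GKE with the --min-cpu-platform flag on a node pool). A minimal sketch of picking compiler options from the CPU features a Linux node reports is shown below. The mapping table and helper names are assumptions for illustration, and the resulting --copt values would still have to be passed to a Bazel build of TensorFlow Serving.

```python
from pathlib import Path

# Illustrative mapping from /proc/cpuinfo feature flags to GCC -m options.
FLAG_TO_COPT = {
    "sse4_1": "-msse4.1",
    "sse4_2": "-msse4.2",
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
}


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def bazel_copts(flags):
    """Build --copt options for every mapped feature the CPU supports."""
    return [f"--copt={opt}" for name, opt in FLAG_TO_COPT.items() if name in flags]


# On a Linux x86 host this prints the options for the local CPU.
cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print(" ".join(bazel_copts(cpu_flags(cpuinfo.read_text()))))
```

Running this on a node from the target pool, rather than on a development machine, keeps the build from enabling instructions the serving CPUs do not have.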

...

M25

2 years, 2 months ago

Selected Answer: D

Went with D

upvoted 2 times

...

SergioRubiano

2 years, 4 months ago

Selected Answer: A

A is correct. max_batch_size TensorFlow Serving parameter

upvoted 2 times

...

Yajnas_arpohc

2 years, 4 months ago

Selected Answer: A

CPU-only: One Approach

If your system is CPU-only (no GPU), then consider starting with the following values: num_batch_threads equal to the number of CPU cores; max_batch_size to a really high value; batch_timeout_micros to 0. Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value.

https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching

upvoted 3 times
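
The starting values quoted above can be written out as a batching parameters file, which tensorflow_model_server reads when started with --enable_batching and --batching_parameters_file. The sketch below only generates the file contents. The helper name is hypothetical, and 1024 is an assumed stand-in for "a really high value", not a recommendation from the README.

```python
import os


def batching_parameters(num_cores, max_batch_size=1024, batch_timeout_micros=0):
    """Render the CPU-only starting point as batching-parameters text.

    max_batch_size=1024 is an assumed placeholder for "a really high value".
    """
    lines = [
        f"num_batch_threads {{ value: {num_cores} }}",
        f"max_batch_size {{ value: {max_batch_size} }}",
        f"batch_timeout_micros {{ value: {batch_timeout_micros} }}",
    ]
    return "\n".join(lines) + "\n"


# os.cpu_count() can return None, so fall back to a single thread.
print(batching_parameters(os.cpu_count() or 1), end="")
```

From there the README's advice is to experiment with batch_timeout_micros in the 1000-10000 range by changing the third argument and measuring latency under real load.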

frangm23

2 years, 3 months ago

In that very link, what it says is that max_batch_size is the parameter that governs the latency/throughput tradeoff. As I understand it, the higher the batch size, the higher the throughput, but that doesn't ensure that latency will be lower. I would go with D.

upvoted 4 times

...