Exam Professional Cloud DevOps Engineer topic 1 question 59 discussion

Actual exam question from Google's Professional Cloud DevOps Engineer

Question #: 59
Topic #: 1

You encounter a large number of outages in the production systems you support. You receive alerts for all the outages that wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices. What should you do?

A. Eliminate unactionable alerts.
B. Create an incident report for each of the alerts.
C. Distribute the alerts to engineers in different time zones.
D. Redefine the related Service Level Objective so that the error budget is not exhausted.

Suggested Answer: A

by job_search83 at Oct. 26, 2021, 6:33 p.m.

Comments

AL12

Highly Voted 3 years, 3 months ago

I reckon it's A. The problem is fixed automatically by a restart of the service within a minute, so engineers don't really need to be woken up for it. If the service failed multiple times or the restart itself failed, then an engineer should be woken up.

upvoted 14 times
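
The paging rule described in the comment above (stay silent when an automatic restart succeeds, page when the restart fails or the failures keep repeating) can be sketched roughly as follows. This is only an illustration: the `RestartEvent` record, the `should_page` function, and the window and threshold values are all hypothetical and do not come from the question or from any Google tooling.

```python
from dataclasses import dataclass


@dataclass
class RestartEvent:
    """One automatic restart attempt for a service (hypothetical record)."""
    timestamp: float   # seconds since an arbitrary epoch
    succeeded: bool    # True if the service came back healthy


def should_page(events, now, window_seconds=3600.0, max_restarts=3):
    """Return True only when a human probably needs to act.

    Assumed policy (thresholds are illustrative, not from the source):
    - page if the most recent restart in the window failed;
    - page if more than `max_restarts` restarts happened in the window;
    - otherwise do not page (log or file a ticket instead).
    """
    recent = [e for e in events if now - e.timestamp <= window_seconds]
    if not recent:
        return False
    latest = max(recent, key=lambda e: e.timestamp)
    if not latest.succeeded:
        return True
    return len(recent) > max_restarts


if __name__ == "__main__":
    now = 10_000.0
    # A single restart that healed itself: no reason to wake anyone.
    print(should_page([RestartEvent(now - 30, True)], now))
    # The restart itself failed: a human should look at it.
    print(should_page([RestartEvent(now - 30, False)], now))
```

Under these assumptions, a self-healed restart becomes a log entry or a daytime ticket, while the page is reserved for cases where someone can actually take action.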

MF2C

3 years, 2 months ago

A or C

upvoted 1 times

...

09bd94b

Most Recent 5 months, 2 weeks ago

Selected Answer: A

Agree with A. It does not make sense to wake up an engineer when you know that no remedial action is needed.

upvoted 1 times

...

JonathanSJ

2 years ago

Selected Answer: A

I agree with A.

upvoted 2 times

...

Greg123123

2 years, 1 month ago

Selected Answer: A

It should be A rather than D. To follow SRE practice, we should eliminate unactionable alerts, which are pointless, and thereby increase alert precision. While D also looks valid, the question never says that the application is being affected (e.g., has downtime) and never says that any action is needed. As a result, there is no need to redefine the SLO, and since they didn't spend time resolving it, no error budget is spent.

upvoted 2 times

...

ssmb

2 years, 3 months ago

It's between A and C; B and D are not good answers. I lean more towards A because those alerts seem unactionable at the moment the alert is received, i.e., the machine has already restarted automatically. This would be the best immediate action as per the question. Of course, the source of the alerts should be investigated and fixed separately from addressing the issue in question.

upvoted 2 times

...

zygomar

2 years, 11 months ago

Selected Answer: A

Agree with KyubiBlaze about having to remove unactionable items, aka spam: "good monitoring alerts on actionable problems" @ https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles

upvoted 4 times

...

Sekierer

3 years ago

A is correct

upvoted 1 times

...

KyubiBlaze

3 years ago

A - You have to remove "unactionable" alerts, these alerts are useless if you can't take any action. Simple reason, C might be following SRE practice, but it is distributing the problem, not solving it. B and D, totally No.

upvoted 3 times

...

gcpz

3 years, 1 month ago

The answer is C. It follows Google SRE and prevents staff burnout. https://sre.google/workbook/team-lifecycles/

upvoted 1 times

...

ESP_SAP

3 years, 1 month ago

The team may continue to work on non-reliability features if: The outage was caused by a company-wide networking problem. The outage was caused by a service maintained by another team, who have themselves frozen releases to address their reliability issues. The error budget was consumed by users out of scope for the SLO (e.g., load tests or penetration testers). Miscategorized errors consume budget even though no users were impacted. https://sre.google/workbook/error-budget-policy/

upvoted 3 times

ESP_SAP

3 years, 1 month ago

Correct Answer is (D):

upvoted 2 times

...

Manh

3 years, 2 months ago

Answer D

upvoted 1 times

...

NXD

3 years, 3 months ago

C follows the SRE.

upvoted 3 times

Feliphus

1 year, 1 month ago

The statement says you encounter a large number of outages in the production systems you support, so eliminating the alerts doesn't seem to be a good idea. What if there is another support team in another time zone? And what happens if the server doesn't reboot or the services don't start properly? None of the options is really correct; the correct answer would be to resolve the reboot problem. I don't know whether A or C is better; I suppose some information has been lost from the statement or from the answers. But in this situation I agree with @NXD and choose C.

upvoted 1 times

Feliphus

1 year, 1 month ago

Sorry, but I'm changing my answer to A. I have noticed this question is repeated as Q133, but without the text: "You receive alerts for all the outages that wake you up at night."

upvoted 1 times

...