
Site Reliability Principles

 The Site Reliability Principles:

1. SRE needs Service Level Objectives, with consequences.

2. SREs have time to make tomorrow better than today.

3. SRE teams have the ability to regulate their workload.


SRE needs Service Level Objectives, with consequences:

Any organization, even one that has not hired a single SRE, can have an Error Budget Policy.

The policy can be any lever that keeps your customers from experiencing pain when using your application.

You can implement this today: measure, account, and act.
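The measure, account, act loop can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `error_budget_status`, the `BudgetStatus` fields, the request-based SLO, and the "freeze feature releases" action are all hypothetical, and a real Error Budget Policy is whatever lever your organization agrees on.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: float   # failures the SLO tolerates in this window
    budget_remaining: float   # fraction of the budget left; negative if overspent
    action: str               # hypothetical policy decision


def error_budget_status(slo_target: float,
                        total_requests: int,
                        failed_requests: int) -> BudgetStatus:
    """Measure failures, account for them against the budget, then act."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # Account: the error budget is the share of requests allowed to fail.
    allowed = (1.0 - slo_target) * total_requests
    if allowed == 0:
        # No traffic in the window, so nothing has been spent.
        remaining = 1.0
    else:
        remaining = 1.0 - failed_requests / allowed

    # Act: an assumed policy that freezes releases once the budget is spent.
    if remaining <= 0:
        action = "freeze feature releases"
    else:
        action = "continue releasing"
    return BudgetStatus(allowed, remaining, action)


if __name__ == "__main__":
    # A 99.9% SLO over one million requests tolerates about 1000 failures.
    print(error_budget_status(0.999, 1_000_000, 400))
    print(error_budget_status(0.999, 1_000_000, 1_500))
```

The point of the sketch is that the consequence is decided in advance and applied mechanically, so nobody has to negotiate it during an incident.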


Making Tomorrow Better than Today:

SLOs and Error Budgets are the first step.

The next step is staffing an SRE role.

Be clear about what that role is really responsible for.

Your First SRE:

Defining and refining the Service Level Objectives.

Is the best-placed person to see that the Error Budget Policy is enacted when necessary.

Is now responsible for making sure that the application meets the reliability expectations of its users.

Another major part of the SRE role is toil:

On-Call and incident management.

Authoring and iterating on Monitoring.

Automation of repetitive work.

Coordinating implementation of Postmortem Action Items.

Shared Responsibility:

Dumping all production services on an SRE team cannot work.

If a team gets overloaded with operational toil, they cannot make tomorrow better than today.

Providing an SRE team some way of giving back-pressure to their dev partners provides balance.


Regulating Workload:

Give 5% of the operational work to the developers: on-call shifts, rollout management, ops tasks.

Track the project work of the SRE team; if it is not delivering completed projects, then something is wrong.

Analyse new production systems and only on-board them if they can be operated safely.

If every problem with a system has to be escalated to its developer, give the call to the developer instead!
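The escalation rule above can be turned into a simple check over incident counts. This is a rough sketch, not an established formula: the function name `should_hand_back_on_call` and the `threshold` parameter are assumptions, and the default of 1.0 simply encodes "every problem" literally.

```python
def should_hand_back_on_call(incidents_total: int,
                             incidents_escalated: int,
                             threshold: float = 1.0) -> bool:
    """Return True when the SRE team adds little beyond forwarding pages.

    threshold is the assumed fraction of incidents escalated to developers
    at or above which on-call should move back to the developers.
    """
    if incidents_total < 0 or incidents_escalated < 0:
        raise ValueError("incident counts must be non-negative")
    if incidents_escalated > incidents_total:
        raise ValueError("escalated incidents cannot exceed total incidents")
    if incidents_total == 0:
        # No incidents in the window, so there is no evidence either way.
        return False
    return incidents_escalated / incidents_total >= threshold


if __name__ == "__main__":
    print(should_hand_back_on_call(10, 10))        # every incident escalated
    print(should_hand_back_on_call(10, 3))         # most handled by SRE
    print(should_hand_back_on_call(10, 9, 0.9))    # looser, assumed threshold
```

A team might prefer a lower threshold in practice; the value is a negotiation between SRE and its dev partners rather than something the numbers decide on their own.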


Reliability and Consistency Up Front:

Fixing a product after launch is always more expensive.

SRE teams can and should consult up-front on designs:

 Architecting resilient systems.

 Maintaining consistency means fewer SREs can support more products.


Automation:
 Three places SRE teams can benefit from Automation:
  •   To eliminate their toil - don't do things over and over!
  •   To do capacity planning: use auto-scaling instead of manual forecasting.
  •   To fix issues automatically: write a script or playbook so that the system can do it.
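
As one illustration of auto-scaling in place of manual forecasting, the sketch below uses a proportional rule: scale the replica count by the ratio of observed to target utilization, then clamp the result. The function name `desired_replicas`, the percentage units, and the default bounds are assumptions for this example; production autoscalers add smoothing and cooldown periods that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization_pct: float,
                     target_utilization_pct: float,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Proportional scaling: keep per-replica utilization near the target."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    if current_utilization_pct < 0:
        raise ValueError("current_utilization_pct must be non-negative")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")

    # Round up so the service is never left short of capacity.
    raw = math.ceil(
        current_replicas * current_utilization_pct / target_utilization_pct
    )
    # Clamp to the assumed bounds to avoid runaway scaling in either direction.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% utilization with a 60% target: scale up.
    print(desired_replicas(4, 90, 60))
    # 10 replicas at 30% utilization with a 60% target: scale down.
    print(desired_replicas(10, 30, 60))
```

Running a rule like this on a schedule replaces a recurring forecasting task with a control loop, which is the kind of toil removal the first bullet describes.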


