Site Reliability Engineering Library

This library is a collection of resources that I’ve found helpful along the way in my journey of learning site reliability engineering.

SRE Articles

Monitoring, Logging, and Metrics Oh My

Guides

Platform Engineering

CI/CD

Chaos Engineering

Tools

Terraform

Infrastructure

DBA

Data Team

Talks

Newsletters

Big Lists

Web Technologies

Productivity

Performance Testing Tools

  • JMeter - Free-to-use performance testing application
  • NeoLoad - Easier to use but pricey load testing tool
  • K6 - Load testing tool scripted in JavaScript
  • Locust - Load testing tool scripted in Python
  • Vegeta - CLI tool written in Go
  • Oha - CLI tool written in Rust
  • Hey - Another CLI tool
  • Artillery
  • DDosify

CLI Tools

Visualization Tools

  • Kiri - Mindmap - Online mind map tool.
  • MindMup - Create public mind maps for free
  • Diagrams.net - Create shareable diagrams; Confluence integration available.

Conferences

Podcasts

Books

Skills Valued in SRE

  • Analysis and Systems Thinking
  • The Ability to Communicate with Tech Colleagues, Support, Management, Business People and Stakeholders
  • Observation, Modelling and Note Taking
  • Curiosity and a Desire for Exploration
  • Tenacity
  • Empathy for the User
  • Business Domain Knowledge
  • Formulating Rational and Consistent Arguments
  • Persuasiveness and Influencing Skills
  • Common Sense

Site Reliability Engineering Glossary

4 “Golden Signals” in Site Reliability Engineering (LETS)

Latency is the time it takes to serve a request, from the moment it is sent until the response is fully received. It is typically measured in milliseconds (ms).

Error rate measures how often requests fail, whether because of bugs in code, network outages or other causes. It is expressed as a percentage of total requests.

Throughput is the amount of work the system handles in a given period. It can be measured in bits per second or, for request-driven services, in requests per second.

Saturation is a measure of the load on your server resources. It can include measures like CPU utilization, memory usage and storage used.
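
To make the four signals concrete, here is a minimal sketch that summarizes one observation window of requests. The `Request` record, the `golden_signals` function, the choice of the 95th percentile, and the rule that only 5xx responses count as errors are all assumptions made for illustration; real monitoring systems define these in their own ways.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code of the response


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one observation window into the four signals.

    cpu_utilization is a fraction between 0 and 1, used here as a
    stand-in for saturation (an assumption of this sketch).
    """
    if not requests:
        raise ValueError("need at least one request in the window")

    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    p95_ms = latencies[rank - 1]

    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95_ms,
        "error_rate_pct": 100.0 * errors / len(requests),
        "throughput_rps": len(requests) / window_seconds,
        "saturation_pct": 100.0 * cpu_utilization,
    }
```

A percentile is used for latency rather than the mean because a few very slow requests can hide behind a healthy-looking average.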


SLAs, SLOs and SLIs for measuring SRE success

Service Level Agreements (SLAs) are contractual obligations between the service provider and service consumer/payer for a certain level of performance. The consumer may be entitled to compensation if the SLA is breached.

Service Level Objectives (SLOs) are the target levels of performance for engineers to aim for. They typically correlate with SLA requirements. For example, they can be goals for a certain level of availability for a service over a given period.

Service Level Indicators (SLIs) are measures of performance that allow engineers to understand whether they are meeting the SLOs for the system and, in turn, the business-level SLAs. For example, an SLI can be the uptime metric for a particular service.
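
The relationship between an SLI and an SLO can be sketched in a few lines. This example assumes a request-based availability SLI (successful requests divided by total requests); the function names and the sample numbers are hypothetical.

```python
def availability_sli(good_events, total_events):
    """Availability SLI as a percentage of successful events."""
    if total_events == 0:
        # Assumption: with no traffic, nothing has failed.
        return 100.0
    return 100.0 * good_events / total_events


def meets_slo(sli_pct, slo_pct):
    """True when the measured SLI is at or above the SLO target."""
    return sli_pct >= slo_pct


# Hypothetical month: 10,000 requests, 5 of which failed.
sli = availability_sli(good_events=9_995, total_events=10_000)
print(meets_slo(sli, slo_pct=99.9))   # SLO of 99.9% availability
print(meets_slo(sli, slo_pct=99.99))  # a stricter SLO of 99.99%
```

With these sample numbers the measured SLI is 99.95%, so the first check passes and the second does not.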


Software incident response lingo

On-call means that the engineer must be available to respond to incidents, including those that arise outside their normal working hours. This may mean evenings or weekends.

Pager Duty is a term used to refer to being on-call. It harks back to when operations engineers were required to carry a physical pager and respond if alerted by the device.

Follow the sun refers to an on-call model in which incident response is handed off between teams in different time zones, so that each team covers incidents during its own daytime hours.

Mean time to acknowledge (MTTA) is the average time it takes for an engineer to acknowledge an incident from the moment it has been identified and paged to them.

Mean time to recovery (MTTR) is the average time it takes to resolve an incident, measured from the moment the alerting system picks it up.

Mean time to failure (MTTF) is the average time a system or service is expected to function before it experiences a failure, such as a performance-degrading bug or outage.

Mean time between failures (MTBF) is the average time elapsed between consecutive incidents across a series of incidents.
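
As a rough illustration, these averages can be computed from incident timestamps. The `Incident` record and the function names below are hypothetical, and the sketch assumes MTBF is measured from one incident's detection to the next incident's detection; some teams measure from resolution to the next detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime      # alerting system picks up the incident
    acknowledged_at: datetime  # engineer acknowledges the page
    resolved_at: datetime      # incident is resolved


def _mean(durations):
    """Average a non-empty list of timedelta values."""
    return sum(durations, timedelta()) / len(durations)


def mtta(incidents):
    return _mean([i.acknowledged_at - i.detected_at for i in incidents])


def mttr(incidents):
    return _mean([i.resolved_at - i.detected_at for i in incidents])


def mtbf(incidents):
    # Assumption: gap between consecutive detection times.
    ordered = sorted(incidents, key=lambda i: i.detected_at)
    gaps = [b.detected_at - a.detected_at for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two incidents to compute MTBF")
    return _mean(gaps)
```

Given three incidents detected on 1, 3 and 7 January, for instance, the gaps are two and four days, so `mtbf` returns three days.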


What SREs do after an incident

Postmortems are reviews that engineers may hold after an incident has been resolved or contained. They may go through logs and analyze the root cause to identify patterns and prevent similar incidents in the future.

Blameless is the cultural mindset that (most) Site Reliability Engineering teams aim to have when going through an incident. They aim to find out what, not who, caused the problem. Even if they find the person behind the incident, they seek not to assign blame.


How SREs make better systems

Toil, in the site reliability engineering sense, is manual, repetitive work that should be automated away if possible. The catchphrase among SREs is to “eliminate toil”.

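Error budgets are the amount of unreliability a service is allowed within a given period, usually derived from its SLO. For example, a 99.9% availability SLO leaves a 0.1% error budget. While budget remains, teams can spend it on releases and experimental work that may eliminate toil and give developers more breathing room. Once it is used up, the focus shifts back to reliability. It is an advanced principle.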
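
One common way to put a number on an error budget is to derive it from an availability SLO: whatever the SLO does not promise is the budget. The sketch below assumes a time-based budget over a 30-day window; the function names and sample figures are hypothetical.

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Minutes of downtime the SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_pct) / 100.0


def budget_remaining_minutes(slo_pct, downtime_minutes, window_days=30):
    """Budget left after subtracting downtime already incurred.

    A negative result means the budget has been overspent.
    """
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
budget = error_budget_minutes(99.9)
remaining = budget_remaining_minutes(99.9, downtime_minutes=30)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A team might treat a shrinking remainder as a signal to slow down risky changes, and a healthy remainder as room for experiments.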