Site Reliability Engineering Library
This library is a collection of resources that I’ve found helpful along the way in my journey of learning site reliability engineering.
SRE Articles
- Reliability as a Product Feature
- Pattern for SRE Team
- Golden Signals of SRE
- SRE Google
- The Four Golden Signals - Monitoring
- Using ChatOps to help Actions on-call engineers
- How Complex Systems Fail
- HTTP/3 is Fast
- Hot Takes on Code Freezes
- Building SRE Team with Specializations
- What an SRE is Not
- The Calculus of Service Availability
- Cache Incidents at Twitter
- Renaming Post-Mortems to Post Incident Analysis (PIA)
- Incident Severity and Priority
- Delivery Hero Reliability Manifesto
- Honeycomb: How We Define SRE Work
- Lessons Learned in 10 Years of SRE: Part 1 - Starting SRE
- https://abakonski.com/resolving-502-and-504-errors-in-elastic/
- Amazon ECS Metrics
- Troubleshooting “the system is slow”
- Implementing Availability SLOs in Typeform
- OpenSLO
- On-Call: Leave It Better Than You Found It
- Are your SLOs realistic? How to analyze your risks like an SRE
- Founding Uber SRE.
- Continuous Load Testing at Slack
- JSON object values into CSV with jq
- SRE Interview Questions
- Best Practices for Fixing Your Alerts
- How to Build Software like an SRE
- Things I want as SRE/DevOps from Devs
- Runbook Template
- A New Definition of Reliability
- When to Alert on What?
- On Quality - Operational Excellence
- DevOps Topologies
- Reddit: stories about incidents
- Scaling Microservices Alerting With Zero Ops
- How Cloud Networks Fail
- Resilient Retry and Recovery Mechanism: Enhancing Fault Tolerance and System Reliability - Mercari
- You should never be responsible for what you don’t control
Software Engineering
Monitoring, Logging, and Metrics Oh My
- Sumo Logic Self-Paced Training
- Best practices for setting SLOs and SLIs for modern, complex systems
- Calculating Composite SLA
- On the Brittleness of Dashboards
- observability 101: terminology and concepts
- Monitoring at Shipt
- Uptime-Kuma (like uptime robot)
- We’re all doing metrics wrong - good insights!
- SLO Formulas in Prometheus PromQL
Developer Experience
- What is Developer Experience Roundup
- What Developer Self Service Should Look Like
- 3 Musketeers for an Epic Developer Experience
Guides
- The 12 Factors - questions you should ask
- The startup guide to sensible incident management
- Howie: The Post-Incident Guide
- Etsy Debriefing Facilitation Guide
- Retrospectives by PagerDuty
- OOPS (incident report) Guide
- Post Incident Investigation
- Introduction to Observability for Test Automation
Platform Engineering
- Top 10 fallacies in Platform Engineering
- The Future of Ops Is Platform Engineering
- Platform Engineering won’t kill devops
- Infraspec - Platform Engineering at Shipt
CI/CD
- Seamless Branch Deploys - Preview Builds
- Seamless Branch Deploys Part 2 - Preview Builds
- GitHub Deploying Preview Builds
- Continuous Previews Docker
- Continuous Previews Manifesto
- Automating safe, hands-off deployments
- Kubernetes Deployment Antipatterns – part 1 – applicable to any deployments
- How Do You Document CI/CD - Reddit
- Should your Infrastructure as Code get its own repository? - Reddit
- Reddit: How the hell do you reference an artifact to download from another pipeline in Github Actions?
- GitHub Actions Security Best Practices [cheat sheet included]
- Trigger GitHub Actions Workflow for comments on P/R
- Common Pitfalls of GitHub Actions
- The complexity of writing an efficient NodeJS Docker image
Chaos Engineering
- Principles of Chaos
- Awesome Chaos
- There’s more than Performance Testing - Chaos Engineering with k6 and Steadybit
- Chaos engineering helps DevOps cope with complexity
Performance Engineering
CLI
Tools
Terraform
- Folders to manage Terraform Environments? - Reddit
- Terragrunt - DRY maintainable code
- Fast Terraform Tutorial Hands On
- Build terraform modules with descriptions UM + LLM
Infrastructure
- Don’t Make My Mistakes: Common Infrastructure Errors I’ve Made
- DevOps the Hard Way AWS
- ELB vs. ALB vs. NLB: Choosing the Best AWS Load Balancer for Your Needs
- Some benefits of simple software architectures
- S3 Game Galaxy
- Kubernetes is a red flag signalling premature optimization.
- Writing Your First Operator: tutorial
- AWS Multi Region Fundamentals
- Visualizing Load Balancing
DBA
- Postgres Query Bottleneck
- Postgres Indexes for Newbies
- Useful Postgres tool PGMustard
- How Postgres Stores Rows
- Why Postgres
- GitLab Postgres schema design
- Postgres Playground
Data Team
DevSecOps
Talks
Newsletters
Big Lists
- Awesome CloudOps Automation
- SRE Cheat Sheet
- SRE Checklist
- How They SRE
- Awesome SRE
- ResiliencePapers.club
- “Platform Engineering” is rapidly becoming the new DevOps or SRE 🧵
- Is 90DaysOfDevOps good for learning? - Reddit
- Awesome SLO
- SRE CheckList
Web Technologies
- What happens when you go to google.com in browser
- How HTTPS works
- How DNS works
- How Browsers Work
- How Cookies Work
- How DOM Events Work
- The HTTP crash course nobody asked for
Productivity
Performance Testing Tools
- JMeter - Free-to-use performance testing application
- NeoLoad - Easier to use but pricey load testing tool
- k6 - Load testing tool using JavaScript
- Locust - Load testing tool using Python
- Vegeta - CLI tool written in Go
- Oha - CLI tool written in Rust
- Hey - another CLI tool
- Artillery
- DDosify
CLI Tools
Visualization Tools
- Kiri - Mindmap - Online mind map tool.
- MindMup - Create public mind maps free
- Diagrams.net - Create shareable diagrams with this tool. Confluence integration is available.
Conferences
Podcasts
- Getting There - Nora Jones & Niall Murphy
Books
- Free SRE books (digital) - Google
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Site Reliability Engineering: How Google Runs Production Systems
- 97 Things Every SRE Should Know: Collective Wisdom from the Experts
- Enterprise Road Map to SRE
- Observability Engineering
Skills Valued in SRE
- Analysis and Systems Thinking
- The Ability to Communicate with Tech Colleagues, Support, Management, Business People and Stakeholders
- Observation, Modelling and Note Taking
- Curiosity and a Desire for Exploration
- Tenacity
- Empathy for the User
- Business Domain Knowledge
- Formulating Rational and Consistent Arguments
- Persuasiveness and Influencing Skills
- Common Sense
Site Reliability Engineering Glossary
4 “Golden Signals” in Site Reliability Engineering (LETS)
Latency is the time it takes to service a request, that is, the delay between a request being sent and its response being fully received. It is typically measured in milliseconds (ms).
Error rate measures how often requests fail, for example because of bugs in code or network outages. It is expressed as a percentage of total requests.
Throughput is the amount of work the system handles in a given period. It can be measured in requests per second or, for data transfer, in bits per second.
Saturation is a measure of the load on your server resources. It can include measures like CPU utilization, and memory & storage used.
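As a rough illustration of the four signals above, the sketch below derives them from a small in-memory sample of requests. The `Request` record, its field names, the 10-second window, and the CPU figure are all hypothetical. In practice these values would come from a metrics backend rather than a Python list.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, errors, throughput, and saturation for one window.

    Assumes `requests` is non-empty.
    """
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value covering 95% of samples.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": median(latencies),
        "latency_p95_ms": latencies[p95_index],
        "error_rate_pct": 100 * server_errors / len(requests),
        "throughput_rps": len(requests) / window_seconds,
        # Saturation is passed in directly; here it stands for CPU utilization (0-1).
        "saturation": cpu_utilization,
    }


sample = [
    Request(100, 200), Request(120, 200), Request(80, 200),
    Request(500, 500), Request(90, 200),
]
signals = golden_signals(sample, window_seconds=10, cpu_utilization=0.72)
print(signals)
```

Counting only 5xx responses as errors is a simplification. Some teams also count slow or incorrect responses as failures.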
SLAs, SLOs and SLIs for measuring SRE success
Service Level Agreements (SLAs) are contractual obligations between the service provider and service consumer/payer for a certain level of performance. The consumer may be entitled to compensation, such as refunds or service credits, if the SLA is breached.
Service Level Objectives (SLOs) are the target levels of performance that engineers aim for. They are typically aligned with, and often stricter than, SLA requirements. For example, they can be goals for a certain level of availability for a service over a given period.
Service Level Indicators (SLIs) are measures of performance that allow engineers to understand whether they are meeting the SLOs for the system and, in turn, the business-level SLAs. For example, they can be the uptime metric for a particular service.
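To make the relationship between the three terms concrete, here is a minimal sketch that computes a request-based availability SLI and compares it with an SLO and an SLA. The function names, the request counts, and the 99.9% / 99.5% targets are assumptions chosen for illustration, not values from any real contract.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def check_targets(sli, slo_target, sla_target):
    """Compare a measured SLI against the internal SLO and the contractual SLA."""
    return {"meets_slo": sli >= slo_target, "meets_sla": sli >= sla_target}


# Hypothetical month: 1,000,000 requests, 1,500 of which failed.
sli = availability_sli(good_requests=998_500, total_requests=1_000_000)
result = check_targets(sli, slo_target=0.999, sla_target=0.995)
print(f"SLI={sli:.4%}", result)
```

In this made-up month the service misses its internal SLO but still meets the looser SLA. That gap is the usual reason for setting the SLO stricter than the SLA.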
Software incident response lingo
On-call implies that the engineer must be available to respond to incidents, should they arise when they are not typically working. This may mean evenings or weekends.
Pager Duty is a term used to refer to being on-call. It harks back to when operations engineers were required to carry a physical pager and respond if alerted by the device.
Follow the sun refers to an incident response model in which on-call responsibility is handed off between teams in different time zones, so that each engineer responds to incidents during their own daytime hours (roughly sunrise to sunset).
Mean time to acknowledge (MTTA) is the average time it takes for an engineer to acknowledge an incident from the moment it has been identified and paged to the engineer.
Mean time to recovery (MTTR) is the average time for the engineer to resolve the incident from the moment the alerting system picks up the incident.
Mean time to failure (MTTF) is the average time a system or service is expected to function before it experiences a failure, such as a performance-degrading bug or outage.
Mean time between failures (MTBF) is the average time elapsed between two incidents across a series of incidents.
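The four averages above can be sketched from a list of incident timestamps. The `Incident` record and the sample dates below are hypothetical. MTBF is taken here as the gap between the starts of consecutive incidents and MTTF as the healthy time between a recovery and the next failure, which is one common reading of the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps used below."""
    detected: datetime      # alerting system picks up the incident
    acknowledged: datetime  # an engineer acknowledges the page
    resolved: datetime      # service is restored


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents):
    """Compute MTTA, MTTR, MTTF, and MTBF; needs at least two incidents."""
    incidents = sorted(incidents, key=lambda i: i.detected)
    pairs = list(zip(incidents, incidents[1:]))
    return {
        "mtta": mean([i.acknowledged - i.detected for i in incidents]),
        "mttr": mean([i.resolved - i.detected for i in incidents]),
        # MTTF: healthy time from one recovery to the next failure.
        "mttf": mean([nxt.detected - prev.resolved for prev, nxt in pairs]),
        # MTBF: time from the start of one incident to the start of the next.
        "mtbf": mean([nxt.detected - prev.detected for prev, nxt in pairs]),
    }


history = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 15),
             datetime(2024, 1, 3, 10, 30)),
    Incident(datetime(2024, 1, 7, 10, 0), datetime(2024, 1, 7, 10, 10),
             datetime(2024, 1, 7, 12, 0)),
]
metrics = incident_metrics(history)
for name, value in metrics.items():
    print(name, value)
```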
What SREs do after an incident
Postmortems are reviews that engineers may undertake after an incident has been resolved or controlled. They may go through logs and analyze the root cause to identify patterns and prevent similar incidents in the future.
Blameless is the cultural mindset that (most) Site Reliability Engineering teams aim to have when going through an incident. They aim to find out what and not who caused the problem. Even if they find the person behind the incident, they seek not to blame.
How SREs make better systems
Toil, in the site reliability engineering sense, is manual, repetitive work that should be automated away if possible. The catchphrase among SREs is to “eliminate toil”.
Error budgets are an allowance for unreliability: the amount of errors or downtime a service can have over a period while still meeting its SLO (for example, a 99.9% availability SLO leaves a 0.1% error budget). While budget remains, SREs can take on experimental work that may eliminate toil and give developers more breathing room. It is an advanced principle.
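One common way to quantify an error budget is as the complement of the SLO (1 - SLO) over a time window. The sketch below works through that arithmetic for a time-based availability SLO. The function names, the 30-day window, and the downtime figure are assumptions for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime, in minutes, for a time-based availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent).

    Assumes slo_target < 1.0; a 100% SLO leaves no budget to divide by.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - downtime_minutes / budget


budget = error_budget_minutes(0.999)  # about 43.2 minutes in 30 days
remaining = budget_remaining(0.999, downtime_minutes=10.8)  # about 0.75
print(f"budget={budget:.1f} min, remaining={remaining:.0%}")
```

A team might use the remaining fraction as a signal: plenty of budget left suggests room for riskier changes, while a nearly spent budget suggests slowing down and prioritizing reliability work.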
Terraform useful commands
terraform init
— initializes a terraform working directory and installs backends, modules, and plugins.
terraform validate
— Checks that the configuration is syntactically valid and internally consistent.
terraform plan
— Takes your configuration and creates an execution plan. You then can verify what changes will be made before they are applied.
terraform apply
— Create or update the infrastructure
terraform destroy
— Destroy previously created infrastructure
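The core commands above are often run in a fixed order. The sketch below strings them together from Python as one possible workflow, saving the plan to a file so that `terraform apply` executes exactly what was reviewed. The helper names and the `tfplan` file name are hypothetical. The script only prints the commands unless `dry_run=False` is passed, in which case it assumes the `terraform` binary is on the `PATH`.

```python
import subprocess


def core_workflow(plan_file="tfplan"):
    """Return the init -> validate -> plan -> apply sequence as argument lists."""
    return [
        ["terraform", "init"],
        ["terraform", "validate"],
        # Save the plan so apply runs exactly what was reviewed.
        ["terraform", "plan", f"-out={plan_file}"],
        ["terraform", "apply", plan_file],
    ]


def run_workflow(workdir, dry_run=True):
    """Print each command; execute it in `workdir` only when dry_run is False."""
    for cmd in core_workflow():
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, cwd=workdir, check=True)


run_workflow(".")  # dry run: prints the four commands without executing them
```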
Other Terraform commands to know:
terraform console
— Opens the interactive Terraform console
terraform fmt
— Reformats your configuration files in the standard format and style. (Cleans up your code)
terraform import
— Used to import existing resources into the terraform state
terraform output
— Shows output values from the root module
terraform state list
— To list deployed resources
terraform show
— View all of the resources in your infrastructure’s state
terraform taint
— Marks a resource as not fully functional so that it is replaced on the next terraform apply. Deprecated since Terraform v0.15.2 in favor of terraform apply -replace.
terraform version
— Shows the current version of Terraform
terraform workspace
— Manages Terraform workspaces