Tagged | SRE
-
5 Design Patterns for Building Observable Services
(engineering.salesforce.com) -
Spike detection in Alert Correlation
(engineering.linkedin.com) -
Infrastructure Observability for Changing the Spend Curve
(slack.engineering) -
Presentation: User Simulation for Rapid Outage Mitigation
(www.infoq.com) -
Eats Safety Team On-Call Overview
(eng.uber.com) -
Presentation: True Observability Needs High-Cardinality
(www.infoq.com) -
Article: Building Reliable Software Systems with Chaos Engineering
(www.infoq.com) -
Presentation: Safe and Fast Deploys at Planet Scale
(www.infoq.com) -
Chaos Experimentation, an open-source framework built on top of Envoy Proxy
(eng.lyft.com) -
Architecture of a Java Agent to Inject Chaos
(product.hubspot.com) -
Achieving observability in async workflows
(netflixtechblog.com) -
Article: Site Reliability Engineering for Native Mobile Apps
(www.infoq.com) -
Article: Site Reliability Engineering Experiences at Instana
(www.infoq.com) -
Presentation: Solving Mysteries Faster with Observability
(www.infoq.com) -
How we handle incidents at Getaround
(getaround.tech) -
Presentation: Certainty Among the Chaos
(www.infoq.com) -
Lessons learned in incident management
(dropbox.tech) -
“What's the worst that could happen?”: A worked example of how we deal with live incidents.
(sbg.technology) -
Introducing Dispatch
(netflixtechblog.com) -
SRE as a team sport
(www.oreilly.com) -
Presentation: Incident Management in the Age of DevOps & SRE
(www.infoq.com) -
Improving Incident Retrospectives
(engineering.indeedblog.com) -
Presentation: Making a Lion Bulletproof: SRE in Banking
(www.infoq.com) -
Presentation: What Breaks Our Systems: A Taxonomy of Black Swans
(www.infoq.com) -
Presentation: How Did Things Go Right? Learning More From Incidents
(www.infoq.com) -
Evolving Regional Evacuation
(medium.com) -
The tale of the missing semaphore
(webuild.envato.com) -
Taming chaos: Preparing for your next incident
(www.oreilly.com) -
Resiliency Doctor – A tool to achieve resiliency in hybrid cloud application ecosystems
(medium.com) -
Athena: Automated Build Health Monitoring at Dropbox Engineering
(www.infoq.com) -
How to monitor Golden signals in Kubernetes
(sysdig.com) -
How to get started with site reliability engineering (SRE)
(www.oreilly.com) -
Efficient, reliable cluster management at scale with Tupperware
(code.fb.com) -
Alerting on SLOs like Pros
(developers.soundcloud.com) -
Article: Sustainable Operations in Complex Systems With Production Excellence
(www.infoq.com) -
Iris Mobile: An Open Source, Mobile Interface for Incident Management
(engineering.linkedin.com) -
Presentation: Lessons from 300k+ Lines of Infrastructure Code
(www.infoq.com) -
SRE Case Study: URL Distribution Issue Caused by an Application
(www.ebayinc.com) -
Using Machine Learning to Ensure the Capacity Safety of Individual Microservices
(eng.uber.com) -
SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
(www.ebayinc.com) -
Our Self-Service Hybrid Performance Engineering Platform
(tech.wayfair.com) -
Observability at Scale: Building Uber’s Alerting Ecosystem
(eng.uber.com) -
Coding Conversations: The “Perfect Storm” that Brought Down LinkedIn.com
(engineering.linkedin.com) -
SRE Case Study: Mysterious Traffic Imbalance
(www.ebayinc.com) -
The Yelp Production Engineering Documentation Style Guide
(engineeringblog.yelp.com) -
LinkedOut: A Request-Level Failure Injection Framework
(engineering.linkedin.com) -
Google: Addressing Cascading Failures
(highscalability.com) -
Google: A Collection of Best Practices for Production Services
(highscalability.com) -
Implementing ChatOps into our Incident Management Procedure
(shopifyengineering.myshopify.com) -
Open Sourcing Iris and Oncall
(engineering.linkedin.com)