Google created and popularized the idea of site reliability engineers, and the practice has started making significant inroads in software delivery organizations across industries and markets.
I think this is because SRE methodology and philosophy align very neatly with what most people think of as DevOps or digital transformation — SRE is focused on making all the parts work better as a system, which is a hard thing to do within traditional organizational structures.
Know What You Want, and Tell People
Just as with DevOps, a lot of organizations mean well, but go about implementing a reliability practice in the wrong way.
There isn’t a singular model that will work for everyone, but in general, if one can grok the core concepts of site reliability and DevOps, one can make sound decisions about how to implement these new ideas in an existing organization.
A lot of times, an organization will start a project to implement DevOps or SRE by assigning the roles and responsibilities to a small team of specialists — which isn’t necessarily a bad thing — but then fail to give that team the authority and agency to make hard changes to other teams’ workflows.
Having a clear understanding of what a reliability practice can bring, and making sure that communication into and out of that process is a top priority, are essential if there is going to be meaningful and lasting progress made.
Match Simple Solutions to Real Problems
In my view, SRE should be focused on clearing roadblocks and risks so that the teams can focus on meaningful work, and in a lot of cases that requires more process and culture changes than technical or tooling changes.
Experience has shown me that for most organizations, the technical problems are trivial, but the process and culture problems are novel — disruptive startups may be trying to solve technical problems that are new and evolving, but most organizations are going to be consuming existing proven technologies and services to build their products upon.
This means that by focusing on smaller and simpler — but frequently felt — problems in the workflow, we can have an immediate positive effect on the way a lot of people do their work, and build up goodwill and time/bandwidth to tackle larger and more complex problems down the line.
If Everything Worked, We Wouldn’t Need to Change
To get to where we need to go, a lot of things have to go right. Conway’s Law posits that organizations design systems which mirror their own communication structures. This means that when we have issues in communication, they will frequently manifest as a dropped ball between teams.
This makes any sort of goal setting, contracts between groups, or accountability in general very hard to assign — if the teams are not structured in a way that lets them truly own all the parts of their project that are critical to success, then failure can usually be externalized and explained away.
I have also seen an inverse of that law: when one tries to address issues around communication and team responsibilities, the members of those teams may see it as a threat or a black mark on their own work.
When we try to raise accountability and ownership, it can seem punitive — especially if the organization has not been fostering a culture of risk taking and learning from failure.
A key part of making these transformative changes is to give people peace of mind that the changes are going to be positive, that blame isn’t part of the equation, and that system-wide improvement is a goal worthy of some short-term pain.
Getting management buy-in for this can be a particular challenge, as we’re narrowing the focus of who could be blamed, but forbidding management from using the Old Ways to get things “working” when we have setbacks.
Giving teams that psychological safety and protection from unwarranted negative feedback is tough, but should be high on one’s list when trying to drag an unhappy organization into the light.
Slow is Smooth, and Smooth is Fast
I was in the army for 8 years, and did a lot of training on close-quarters fighting in dynamic environments. One of the things I took away from that is “slow is smooth, and smooth is fast”.
The general idea is tied to John Boyd’s OODA loop, in that people who can react to new input and make sound decisions will have better success in high-risk situations than people who simply rush through events faster than they can understand them and make smart decisions around them.
Moving faster than one can make informed decisions and maintain visibility on the indicators might buy some short-term speed, but it builds up tech debt and risk that will have long-term negative consequences, and imperils the business and teams that are going out on a limb to make these changes.
Working towards empowering teams to work at a slower but sustainable pace, while increasing the volume of work they can accomplish at that pace, is something I try to focus on.
This usually means making a compromise with management around feature and release cadence — most places won’t have the space to stop existing work and make redesigning everything the sole focus, so there are usually serious constraints around these transformative initiatives that can’t be ignored.
By letting the business set the must-have priorities for continuing existing operations, and giving them clear costs and milestones built around those priorities, we can set the table for renegotiating this list after the first few iterations and be able to demonstrate how much more will be possible if we give teams more authority to set their own pace.
Business as Usual Is Not an Indicator of Success
Many times, unhappy organizations will have issues getting traction on transformation because they don’t want to impact existing service delivery.
While this is absolutely a valid concern to keep in mind, specific teams or members are often pivotal to keeping things going, which both constrains improvement initiatives and drives the unsustainable workflows we’re trying to clean up.
An unhappy organization can continue making money and appear to outsiders as a functional system, but if the internal demands of maintaining that service level are not sustainable, the system will trend towards instability and unpredictability.
Ensure Initiatives Have Expectations and Continual Analysis
To be able to make transformative change, people need to be in a position to take on new roles and responsibilities without risking burnout, so a key metric in the initial phases should be the elimination of toil and work which doesn’t impact overall quality of the deliverable.
So much work in an organization is a result of the “XY Problem” — people will ask for a particular solution because they expect it will address an underlying problem, not because the ask itself has value.
Evaluating normal work flow patterns, identifying “muda” — work which seems to be productive but has no material benefit on the end product — and eliminating or streamlining these processes to their greatest benefit for the least amount of investment is a core loop that needs to be continually refined and executed to make these changes last.
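The “greatest benefit for the least amount of investment” loop above can be sketched as a simple payback calculation. This is a minimal illustration, not a real tool: the candidate changes, the hours saved, and the investment estimates are all hypothetical numbers one would gather from workflow evaluation.

```python
# A minimal sketch of the "greatest benefit for least investment" triage loop.
# The candidate list, time savings, and cost estimates are hypothetical.

from dataclasses import dataclass


@dataclass
class ProcessChange:
    name: str
    hours_saved_per_week: float  # estimated toil eliminated, per week
    investment_hours: float      # estimated one-time cost to implement

    @property
    def payback_weeks(self) -> float:
        """Weeks of saved toil needed to repay the up-front investment."""
        return self.investment_hours / self.hours_saved_per_week


candidates = [
    ProcessChange("automate release sign-off emails", 4.0, 8.0),
    ProcessChange("self-service test environments", 10.0, 120.0),
    ProcessChange("retire unused weekly status report", 3.0, 1.0),
]

# Tackle the quickest-payback items first to build goodwill and bandwidth
# for the larger, more complex problems down the line.
for change in sorted(candidates, key=lambda c: c.payback_weeks):
    print(f"{change.name}: pays back in {change.payback_weeks:.1f} weeks")
```

The point of the sketch is the ordering, not the arithmetic: making the cost/benefit of each process change explicit keeps the elimination of muda a repeatable loop rather than a one-off cleanup.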
Defining Domains and KPIs
When measuring overall satisfaction, be sure to query both internal and external groups, including developers, ops, and end users, to see who is focused on what.
Too many teams focus on internal or external measures of success to the exclusion of the other, creating situations where creators and consumers are not resolving ambiguity in the normal process, and generating useless work to remediate those gaps.
Relentless Introspection
Humans are often bad judges of their own biases and preferences, so back up human-supplied data with models that help make decisions in the best interest of the team.
To make long term positive changes to the process and culture, we need to have clear vision of a strategic goal, and key indicators that we can agree to and measure to see the effects of our changes.
I can’t count how many times I have done my research, surveyed teams, drawn up a plan, and then discovered in the first sprint that we need to pivot in a huge way or even throw out that plan and start fresh.
We should encourage risk taking and make space for failure as part of our learning and growing as an organization, and knowing when something is working or not requires that we have ways to measure system-wide and local indicators.
Monitoring
It is probably better to have no alerting than to have too many false alarms, as it’s usually easier to add new functions or tools than it is to remove them from the system.
Alert fatigue and noisy, low-signal monitoring will overwhelm the people who make critical decisions about production operations, so it is imperative that we minimize distractions and urgent workflows which do not actually align with business needs.
By trying to front-load monitoring and metrics, or, even worse, having them defined by teams outside the product stakeholders, we will usually find ourselves wrapped around specific mundane issues that should be automated, and putting humans in the critical path for issues outside their domains of expertise.
The focus should be on adding observability to the services that constitute the product, and using those metrics to answer real questions about business value in clear and unambiguous terms.
The era of monitoring being something only the NOC watched has passed, and the new intent should be building alerts and dashboards that roll up the low-level metrics into high-level measures of the health of the entire system.
Many times this means turning existing monitoring workflows around: from something that operations teams divine and apply based on outages, to something driven directly by business needs and product design.
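One common way to roll low-level metrics up into a single high-level health measure is an error budget against an availability SLO. The sketch below is a hedged illustration: the 99.9% target and the request counts are assumptions for the example, and a real implementation would pull these counts from the observability stack.

```python
# Hedged sketch: rolling low-level request metrics up into one system-level
# health number -- the remaining error budget against an availability SLO.
# The SLO target and the request counts are illustrative assumptions.

SLO_TARGET = 0.999  # fraction of requests that must succeed over the window


def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = SLO_TARGET) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget spent; 0.0 means exactly exhausted; negative
    values mean the SLO has been violated for this window.
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - (failed_requests / allowed_failures)


# Example window: 2,000,000 requests, 600 failures.
# A 99.9% target allows 2,000 failures, so 70% of the budget remains.
remaining = error_budget_remaining(2_000_000, 600)
print(f"error budget remaining: {remaining:.0%}")
```

A number like this answers a business question (“can we afford to ship the risky change this week?”) in unambiguous terms, which is exactly the roll-up from low-level metrics to system health that the dashboards should present.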
Tying It All Together
All of these things seem simple and straightforward by themselves, and may even seem easy if taken individually.
The complexity and risk come in when we realize that the entire system needs to make these improvements across the board, in a coordinated and sustained way, if the benefits are to be realized and remain in effect.
Regression to the mean and organizational inertia are major drags on organizations which are beginning their journey towards SRE/DevOps-driven operations, and improvements in one area can quickly be swamped by constraints and problems in other areas.
Raising the water level slowly, and being selective in which places need to be brought up in lock step or sequence to make the most positive impact with the least wasted effort, is the art of SRE in practice.