SRE is Applied DevOps

Google created and popularized the idea of site reliability engineers, and it’s started making significant inroads in software delivery organizations across industries and markets.

I think this is because SRE methodology and philosophy align very neatly with what most people think of as DevOps or digital transformation — SRE is focused on making all the parts work better as a system, which is a hard thing to do within traditional organizational structures.

Know What You Want, and Tell People

Just as with DevOps, a lot of organizations mean well, but go about implementing a reliability practice in the wrong way.

There isn’t a singular model that will work for everyone, but in general if one can grok the core concepts of site reliability and DevOps, then decisions can be made about how to implement these new ideas in an existing organization.

A lot of times, an organization will start a project to implement DevOps or SRE by assigning the roles and responsibilities to a small team of specialists — which isn’t necessarily a bad thing — but then fail to give those teams the authority and agency to make hard changes to other teams’ work flows.

Having a clear understanding of what a reliability practice can bring, and making sure that communication into and out of that process is a top priority, are essential if there is going to be meaningful and lasting progress made.

Match Simple Solutions to Real Problems

In my view, SRE should be focused on clearing roadblocks and risks so that the teams can focus on meaningful work, and in a lot of cases that requires more process and culture changes than technical or tooling changes.

Experience has shown me that for most organizations, the technical problems are trivial, but the process and culture problems are novel — disruptive startups may be trying to solve technical problems that are new and evolving, but most organizations are going to be consuming existing proven technologies and services to build their products upon.

This means that by focusing on smaller and simpler — but frequently felt — problems in the work flow, we can make immediate positive effects on the way a lot of people do their work, and build up good will and time/bandwidth to tackle the larger and more complex problems down the line.

If Everything Worked, We Wouldn’t Need to Change

To get to where we need to go, a lot of things have to go right. Conway’s Law posits that organizations will design systems which mirror their own communication structures. This means that when we have issues in communication, they will frequently manifest as a dropped ball between teams.

This makes any sort of goal setting, contracts between groups, or accountability in general very hard to assign — if the teams are not structured in a way that lets them truly own all the parts of their project that are critical to success, then failure can usually be externalized and explained away.

There’s a corollary to that law I have seen: when one tries to address issues around communication and team responsibilities, the members of those teams may see it as a threat or a black mark on their own work.

When we try and raise accountability and ownership, it sometimes seems punitive — especially if the organization has not been fostering a culture of risk taking and learning from failure.

A key part of making these transformative changes is to give people peace of mind that the changes are going to be positive, that blame isn’t part of the equation, and that system-wide improvement is a goal worthy of some short term pain.

Getting management buy-in for this can be a particular challenge, as we’re narrowing the focus of who could be blamed, but forbidding management from using the Old Ways to get things “working” when we have setbacks.

Giving teams that psychological safety and protection from unwarranted negative feedback is tough, but should be high on one’s list when trying to drag an unhappy organization into the light.

Slow is Smooth, and Smooth is Fast

I was in the army for 8 years, and did a lot of training on close-quarters fighting in dynamic environments. One of the things I took away from that is “slow is smooth, and smooth is fast”.

The general idea is tied to John Boyd’s OODA loop, in that people who can react to new input and make sound decisions will have better success in high-risk situations than people who simply rush through faster than they can understand events and make smart decisions around them.

Moving faster than one can make informed decisions and maintain visibility on the indicators might buy some short-term speed, but it builds up tech debt and risk that will have long-term negative consequences, and imperils the business and teams that are going out on a limb to make these changes.

Working towards empowering teams to work at a slower but sustainable pace, while increasing the volume of work they can accomplish at that pace, is something I try and focus on.

This usually means making a compromise with management around feature and release cadence — most places won’t have the space to stop old work and redesign everything as the sole focus, so there are often serious constraints around these transformative initiatives that can’t be ignored.

By letting the business set the must-have priorities for continuing existing operations, and giving them clear costs and milestones built around those priorities, we can set the table for renegotiating this list after the first few iterations and be able to demonstrate how much more will be possible if we give teams more authority to set their own pace.

Business as Usual Is Not an Indicator of Success

Many times, unhappy organizations will have issues getting traction on transformation because they don’t want to impact existing service delivery.

While this is absolutely a valid point that needs to be kept in mind, a lot of times specific teams or members will be pivotal to keeping things going, which both constrains improvement initiatives and drives the current unsustainable work flows that we’re trying to clean up.

An unhappy organization can continue making money and seem to outsiders like a functional system, but if the internal demands of maintaining that service level are not sustainable, then the system will trend towards instability and unpredictability.

Ensure Initiatives Have Expectations and Continual Analysis

To be able to make transformative change, people need to be in a position to take on new roles and responsibilities without risking burnout, so a key metric in the initial phases should be the elimination of toil and work which doesn’t impact overall quality of the deliverable.

So much work in an organization is a result of the “XY Problem” — people ask for a specific solution because they expect it will solve some underlying problem, not because the ask itself has value.

Evaluating normal work flow patterns, identifying “muda” — work which seems to be productive but has no material benefit on the end product — and eliminating or streamlining these processes to their greatest benefit for the least amount of investment is a core loop that needs to be continually refined and executed to make these changes last.
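To make that loop concrete, here is a minimal sketch of how one might tally where time actually goes in a delivery cycle and surface candidates for elimination. The step names, hours, and the value_adding flag are hypothetical stand-ins for whatever your own workflow data looks like:

```python
from collections import defaultdict

# Hypothetical sample of one release cycle, broken into observed steps.
# "value_adding" marks whether the step materially changes the deliverable;
# everything else is a candidate for the muda/toil analysis described above.
steps = [
    {"name": "write code",            "hours": 30, "value_adding": True},
    {"name": "manual env setup",      "hours": 6,  "value_adding": False},
    {"name": "wait for change board", "hours": 16, "value_adding": False},
    {"name": "code review",           "hours": 4,  "value_adding": True},
    {"name": "manual regression run", "hours": 10, "value_adding": False},
    {"name": "deploy and verify",     "hours": 2,  "value_adding": True},
]

totals = defaultdict(float)
for step in steps:
    totals["value" if step["value_adding"] else "toil"] += step["hours"]

total_hours = sum(totals.values())
print(f"Total cycle time: {total_hours:.0f}h, toil share: {totals['toil'] / total_hours:.0%}")

# Rank the non-value-adding steps so the biggest, cheapest wins surface first.
for step in sorted((s for s in steps if not s["value_adding"]),
                   key=lambda s: s["hours"], reverse=True):
    print(f"candidate for elimination or automation: {step['name']} ({step['hours']}h)")
```

Even something this crude, re-run each iteration, gives the team a shared and objective view of whether the streamlining is actually working.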

Defining Domains and KPIs

When measuring overall satisfaction, be sure to query internal and external groups, including developers, ops, and end users to see who is focused on what.

Too many teams focus on either internal or external measures of success to the exclusion of the other, creating situations where the creators and consumers are not resolving ambiguity in the normal process, and generating useless work to remediate these gaps.

Relentless Introspection

Humans may be bad judges of their own biases and preferences, so back up human-supplied data with models to help make decisions in the best interest of the team.

To make long term positive changes to the process and culture, we need to have clear vision of a strategic goal, and key indicators that we can agree to and measure to see the effects of our changes.

I can’t count how many times I have done my research, surveyed teams, drawn up a plan, and then discovered in the first sprint that we need to pivot in a huge way or even throw out that plan and start fresh.

We should encourage risk taking and make space for failure as part of our learning and growing as an organization, and knowing when something is working or not requires that we have ways to measure system-wide and local indicators.

Monitoring

It is probably better to have no alerting than to have too many false alarms, as it’s usually easier to add new functions or tools than it is to remove them from the system.

Alert fatigue and a high noise-to-signal ratio in monitoring will overwhelm the people who make critical decisions about production operations, so it is imperative that we minimize distractions and urgent work flows which do not actually align with business needs.

By trying to front-load monitoring and metrics, or, even worse, having them defined by teams outside the product stakeholders, we will usually find ourselves wrapped around specific mundane issues that should be automated, and putting humans in the critical path for issues outside their domains of expertise.

The focus should be on adding observability to the services that constitute the product, and using those metrics to answer real questions about business value in clear and unambiguous terms.

The era of monitoring being something only the NOC watched has passed, and the new intent should be building alerts and dashboards that roll up the low-level metrics into high-level measures of the health of the entire system.

Many times this means turning existing monitoring work flows around, from something that operations teams divine and apply based on outages, to something driven directly by the business needs and product design.
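As a minimal illustration of that kind of roll-up, the sketch below turns raw request counters into an error-budget burn rate, the sort of high-level, business-aligned signal worth alerting on. The SLO target, request counts, and alert thresholds are all hypothetical, and in practice the counters would come from the metrics system rather than being hard-coded:

```python
# Hypothetical roll-up: turn raw request counters into an SLO-level signal.
# In practice these counts would come from the metrics system; they are
# hard-coded here to keep the sketch self-contained.

SLO_TARGET = 0.999        # 99.9% of requests should succeed
WINDOW_HOURS = 1          # how far back the counters below cover
SLO_PERIOD_DAYS = 30      # the period the error budget is spread over

total_requests = 120_000
failed_requests = 480

error_rate = failed_requests / total_requests
error_budget = 1 - SLO_TARGET

# Burn rate: how fast the period's error budget is being consumed right now.
# A burn rate of 1.0 would exactly exhaust the budget over the full period.
burn_rate = error_rate / error_budget
days_to_exhaustion = SLO_PERIOD_DAYS / burn_rate

print(f"error rate over the last {WINDOW_HOURS}h: {error_rate:.3%}")
print(f"burn rate: {burn_rate:.1f}x (budget gone in ~{days_to_exhaustion:.1f} days at this rate)")

# Alert on budget consumption, not on individual low-level symptoms.
if burn_rate > 14.4:      # would exhaust a 30-day budget in about two days
    print("PAGE: error budget is burning fast enough to matter to the business")
elif burn_rate > 3:
    print("TICKET: worth a look during working hours")
```

The specific thresholds matter less than the shape of the alert: it fires on a business-level measure the whole organization can agree on, rather than on any single low-level symptom.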

Tying It All Together

All of these things seem simple and straightforward by themselves, and may even seem easy if taken individually.

The complexity and risk come in when we realize that the entire system needs to make these improvements across the board, in a coordinated and sustained way, if the benefits are to be realized and remain in effect.

Regression to the mean and organizational inertia are major negative forces for organizations which are beginning their journey towards SRE/DevOps driven operations, and improvements in one area can quickly be swamped by constraints and problems in other areas.

Raising the water level slowly, and being selective in which places need to be brought up in lock step or sequence to make the most positive impact with the least wasted effort, is the art of SRE in practice.

Distributed Design and Delivery

In modern software, I like to work towards a general set of assumptions I call “distributed design and delivery”.

It’s a little buzzword-ish, but it clearly states what I think are safe global assumptions to work towards — and I’m going to use “D3” to reference the process and the SRE/DevOps teams currently involved in or managing these things.

The general idea is building groups that not only fully own and understand their own domains, but are also in communication with other groups about the overall design and expectations for delivery.

If we want our developers to be good practitioners of distributed design and delivery, then we need to lead by example and provide solutions that are usable, measurable, and valuable for them.

I’d like to take a moment and clarify what each part of that name means, and the baseline assumptions for those things.

Distributed

Any group in the organization can do a thing, and advertise it as a service to any other group.

Decision making and knowledge need to be loosely held, while keeping all groups aware of the overall strategy and goals for their product.

Decentralized does not mean disorganized, so best practices and shared services should be widely socialized and agreed upon by every autonomous group.

Design

There is a clear vision and communicative customer for the final product we deliver, and all groups involved can adjust and notify others of alterations to the design.

The customer is directly available to answer questions, and decisions should be documented in a way that’s easy to query and visible for every stakeholder.

Delivery

Making the product available in the manner we promised, when we promised, by providing clear control of the necessary levers to perform all the actions needed to bring the blessed product to market, in the smallest and most manageable increments.

General Philosophy and Reasoning

A lot of teams engaged in digital transformation will try to take on too much, and then fail to make an impact — or worse, see the effort become more overhead in a process that’s already overloaded.

This can lead to negative feedback loops around behaviors we want to encourage, but if the process and culture changes don’t make things easier right away, we need to acknowledge that, measure it, and adjust based on the feedback.

I tend to find most technical problems are trivial, and I tend to find most process and culture problems novel.

What this means in practice is that the technical implementations of things can be well understood only when the process and organizational communication beneath them foster sustainable and reliable output.

I liken this to the Anna Karenina principle: all happy organizations provide the same fundamental foundation for software delivery and will likely have an easy time doing technical things, but unhappy organizations are each unhappy in their own way.

For an unhappy organization to get those foundational things reliable, they will have different issues and concerns than other organizations, and therefore should start changing their process and culture incrementally with clear priorities and goals.

Avoiding Constraints

If a DevOps or SRE team is the key holder for specific pieces of technology or process, then that team is now a constraint on scale for the organization.

Generally, there are many more developers than D3 people, so scaling developers is much more granular than scaling D3 staff.

This will lead towards skill gaps, inefficiency, and eventually underperformance of teams that are now constrained.

Decentralizing access and basic use of a thing is generally good, but that doesn’t mean that everyone is on their own. Having Subject Matter Experts in different domains is great, so long as the normal process doesn’t require their direct contribution for other teams to do their jobs.

Many times, I find that “expertise” becomes more about organizational and legacy knowledge than about mastery of the D3 principles, and that introducing incremental changes towards that goal helps make things move faster with less human intervention.

Focus on Shared Objectives, Own Your Part

Instead of being customer focused and goal oriented, teams should be customer oriented and goal focused.

This sounds trite, but if teams focus on the external inputs rather than on their part of the internal output, forecasting and feedback attenuate quickly, because abstracting the tasks doesn’t give an equally simplified abstraction of the work.

By having the person making the request for the thing directly involved with those making it, we remain focused on doing the immediate thing, while understanding the overall goal and general guidelines in case something unexpected happens.

In practice, this tends to push organizations away from splitting units of work across enormous domains: the pattern where one large development team builds half the platform while several ops teams each own a small functional slice of the operational work.

Real Life Is Messy

Unexpected work and bad process will always be an issue — if existing processes were working well we would not be so obsessed with digital transformations.

So when planning team structure and communication lines to start defining inputs, outputs, and functional domains for work, try and encourage teams to make baby steps from where they are rather than define exactly how they’re going to get to the strategic goal.

To start with, just try and remain consistent in every new promise or assumption advertised to other teams, and if that becomes too much, take a step back, or even stop making that advertisement.

The whole underpinning of a functional production model should be trust: by acknowledging problems and unexpected new discoveries, we show others that we tolerate risk taking and rolling back as needed.

Don’t Do Work You Don’t Have To

Too many DevOps and SRE practices are overly focused on work which doesn’t actually impact the delivered product, like building pipelines that allow for excessive variability before anyone has asked for those cases.

When a big request comes in, committing to just one small spike of it should be encouraged, and if that doesn’t work, we adjust and solve it as a cohesive system.

Many times, we’ll find that a use case with basic assumptions and limited configurability will be sufficient for most work, and we can then use an iterative process with live feedback to measure the impact and delta of new features.

Always Measure Everything

A major part of most unhappy organizations’ problems stems from a lack of objective measurements, and a lack of agreement on what those should look like.

A team which is sprinting towards a short term goal needs to be aware of what other teams are doing, and how their work impacts those other teams.

To do so, we need to make sure that the deliverable from development to ops includes relevant metrics and monitoring, thresholds, and alert settings, and that these measurements are advertised in a way that is naturally accessible to those who need them.
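As a sketch of what shipping measurements with the deliverable can look like in practice, assuming the Python prometheus_client library and purely illustrative metric names and threshold values, the instrumentation and the suggested alert settings can travel with the code itself:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics defined alongside the code, so ops inherits them with the deliverable.
REQUESTS = Counter("orders_processed_total", "Orders processed", ["outcome"])
LATENCY = Histogram("order_processing_seconds", "Time spent processing an order")

# Suggested operating thresholds, documented next to the instrumentation they
# describe; the values here are illustrative, not prescriptive.
SUGGESTED_ALERTS = {
    "orders_error_ratio_5m": 0.01,        # investigate above 1% failures
    "order_processing_p99_seconds": 0.5,  # investigate above 500ms p99
}

@LATENCY.time()
def process_order() -> None:
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    outcome = "ok" if random.random() > 0.02 else "error"
    REQUESTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for the monitoring system to scrape
    while True:
        process_order()
```

The point is less the specific library than the handoff: the team that builds the thing also states, in an easily queryable form, how to tell whether it is healthy.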

Too often, unhappy organizations will have developers make the thing, throw it to QA who tests the thing, who throw it to Release who deploys the thing, and then to Ops who manage the thing.

Segmenting these domains isn’t necessarily a bad thing, but if the teams involved don’t have a clear expectation of the inputs and outputs, and ways to measure the success of their part of the process, then risk is introduced at every handoff.

Ambiguity is the enemy of quality, so having clearly defined measurements which we can use to ensure that our work is not impacting reliability is essential.

Final Thoughts

Doing good work when the foundation is in place is the norm, but until the services and knowledge are at a level where teams can easily consume and improve them, it will require heroic efforts to maintain or scale delivery for unhappy organizations.

By keeping the focus on doing smaller incremental changes before undertaking major transformative projects, organizations can get themselves in a better place to have the bandwidth and reliability to make substantial changes.

In many cases, the biggest barriers to digital transformation are legacy processes, particularly the problems and burnout that emanate from those processes in everyday use.

Solving those issues requires that teams be given the agency and autonomy to do whatever they think is best to do their jobs, while ensuring communication and visibility to allow external stakeholders to know what they need to do their jobs.

DevOps Isn’t DevOps

Almost everyone I have ever met is terrible at defining DevOps, because I believe it is more about the process than the tools — essentially, we want to do everything right, verify that everything is actually right, and make everything visible.

There’s no part of the entire software enterprise that isn’t within the problem domain of a true cross-functional and effective DevOps team, so trying to bound it in functional or technical terms always comes up short.

A lot of people will say it’s about automation, or full stack development, or CI/CD and faster releases, but those are all visible outputs of an overall well-functioning software delivery organization.

I think DevOps is a philosophy and practice of taking hard things and making them manageable enough for people to accomplish reliably on human time scales, so that they can focus on work which provides value.

I don’t like the idea of DevOps Engineers as a functional role or team, because it’s like saying Agile Engineers or Git Engineers — DevOps is too big for any single team or role, and it is only effective when everyone in the organization has ownership and contributes to the whole system.

Team Structures Matter

To do good software delivery, teams need to be large enough to perform meaningful work, and small enough to be managed effectively from within — with a goal to normalize work among the different groups and keep communication flowing.

A lot of this is going to be changing behaviors and expectations — DevOps is only about the tools insofar as they are valuable to their users. The primary place that DevOps/Agile transformation fails is that people expect to shift an entire culture in one project or by adopting a new platform, but it’s extremely ambitious to try to pivot from a legacy waterfall Java shop to a microservice Kubernetes one as a single initiative.

Process Is a Means, Not an End

Developers doing Agile and running sprints doesn’t mean much if the release cadence isn’t in sync and directly driven by the product teams. If we don’t map out and improve the release process, to the point where developers can do prod deploys independently, then every other technology we layer on top will be a band-aid at best.

This generally happens because people see new technology that, when fully deployed and mature, fosters a much more sustainable workflow than what we have now — but we need to account for making those shifts while maintaining legacy service sufficient to our existing agreements.

Big Picture Strategy Drives Everything

Systems thinking, understanding how each decision impacts not only the final product but also how other teams and stakeholders engage in the process, is essential if momentum gained is to be maintained. Making an improvement in one area that streamlines a part of the process which wasn’t causing much pain may seem like a wise investment, but that time and energy will generally be better spent addressing a constraint which is causing the organization pain.

Focus on Demonstrable Value Above All Else

We need to prioritize on things that can help the business do business better — and, more to the point, we need to be able to show our customers and prospective customers how we deliver value, and to do so at or above expectations every time.

The entire software delivery lifecycle, and indeed the entirety of the company, is a tightly coupled unit — any deficiency in any area can compound to be detrimental far beyond its immediate effects. Slowdowns and missed communication between groups compound, and error probabilities rise significantly as the size of changes grows and the release cadence is further attenuated.

Agency and Autonomy are Paramount

Only by making everyone in the organization empowered and responsible for applying DevOps practices can this concept be successful. The general thrust should be to allow one person to do everything they need to create, deploy, and maintain software with minimal fuss and maximum simplicity.

To do this, there are two critical areas, and unhappy organizations tend to focus on one to the detriment of the other: first, the tools and services must be usable and useful to the teams we serve, and second, communication and normalization of decisions across the organization must be fostered.

Decentralizing decision making to allow more agency and autonomy can lead to dramatic increases in productivity and quality, but only if it’s done in a way that decentralizes the power of legacy processes to the teams. Too many organizations will declare “full stack” teams without giving the teams the trust and support needed for them to move at the pace they want, which leads to confusion and regression to the mean.

Set Reasonable Goals to Get Sustainable Wins

Be cautious of the psychological scope creep of shifting to a model agile organization in one jump. Setting small milestones, building confidence across the organization to try new things, and keeping or discarding them based on results and data means we don’t need to fix everything in one big project.

Many times, some change instituted may be better, but is dragged back to the old ways by pressures elsewhere in the organization. This isn’t great, but sometimes putting that part of the puzzle down, and focusing on areas tangential to that piece, can help build the feedback loops that will move that needle incrementally.

People Aren’t Machines

In a lot of cases, the early stages of a digital transformation should be focused on improving existing process and tackling burnout. Addressing the immediate pain points for developers might mean spending a lot of time in the legacy Jenkins server fixing and instrumenting the status quo, but these backwards-looking work streams are necessary to buy the time and focus teams need to make meaningful improvements.

In my experience, issues with scaling are rarely about technology — consultants, vendors, and new tools can be brought onboard in a truly agile organization without a lot of fuss — the issues generally relate to the scaling of the humans and processes that underpin the product.

Making burnout a top-tier priority will show that our process is aimed squarely at real needs and can demonstrate real value, and it will create opportunities for people across the organization to take a more hands-on role in the process.

Maintaining and Scaling Wins

The regression to the mean is a real problem in our field, as the natural tendency of a group is to do what they’ve always done. Introducing new processes, tools, and ideas means that the way things have always been done might not be the best way to get to the strategic goal.

By focusing on actual needs and making sure that we understand the ask and the expected result, we spend less time doing things that look useful but don’t make a meaningful contribution to the process. Making sure that communication with stakeholders is consistent, and a required part of the work flow, can give us the temperature of the teams and see how people are receiving our work.

It’s a Journey, Not a Destination

DevOps is, if anything, a metaphysical thing — we can’t ever really say “we are done and are fully DevOps capable now”, because it’s more about how we look at our internal processes and structure to continually maintain and improve the things we create.

Because it’s not something we can put in Artifactory and version, it can be hard to quantify the benefits of this process in the earliest stages. Many times this will lead to resistance, as teams who are overwhelmed by the legacy process have no bandwidth to take on more work.

Focusing on meeting these teams’ needs in the short term builds the reputation and trust in DevOps practices which will be critical in making the larger and more disruptive changes that will come in time.

Applying DevOps Practices in Home Remodeling

Lately, I’ve been doing a lot of home renovations – I’d rather not be, but a series of major multi-system failures cascaded into a job I needed to have done but had no way to afford.

Since I grew up in a family full of carpenters and tradesmen, I figured I’d take it on, with some assistance from friends. I think the progress so far is good, and I realize that the success is due in large part to how I’m planning and executing the project.

I realized tonight that my day job as an SRE/DevOps advocate is directly applicable to home repairs, at least as far as systems thinking and relentless refinement of work around newly learned details of the overall goal.

A Crisis Arises

We arrived home on a Friday evening to quite a scene. There was no hot water, and when I went down to the basement to troubleshoot, I found about 2 inches of boiling water on the floor. Water was running down the bathroom wall, and hot water was actively pouring out of the ruptured heater into the basement.

My water heater thermostat had failed, causing it to overheat, which burst a hot water pipe in the wall and the heater itself, and revealed numerous plumbing leaks from poor workmanship.

With a wife and two small boys, the only bathroom in the house being completely unusable was our production-down moment, and it demanded immediate action.

I could tell right away that the scope of the problem, and the solutions it would require, were going to be massive, so I just focused on the immediate things I had to do so that we could stabilize. I would get the water shut off, a new heater installed, and salvage whatever functionality remained in the bathroom.

In a crisis, it’s easy to get bogged down by incoming urgent demands, but the focus should remain on the key actions that mitigate the immediate danger, sidelining any urgent but unnecessary work from that narrow focus, to be addressed after the crisis is under control.

Prioritizing and Focusing on Immediate Needs

A number of issues popped up along the way, and many of them were addressed as they came up even though it added work – the key was to ask whether, at the end of the crisis or delivery of the completed product, this decision would have a negative impact on quality, and if so, whether the work required now to fix it is warranted by that risk.

By assuming that all decisions in the crisis are made outside normal process, we can understand that something like “the basement is full of electricity-conducting water, and the leak has soaked walls full of live wiring” can and should be broken down.

I’ll wear insulating shoes and shut off the power to the areas being soaked, but I won’t fully fix the problem and risks it has created until after the crisis is stabilized.

Protect Scope and Minimize Deliverable Size

The bathroom with the leaks had to be gutted and treated for mold, and demanded major leveling and reinforcement of the floor before I could start building the new bathroom, turning a big job into a massive one. The discovery that the pipe had been leaking for a long time, and had rotted out 12 feet of the sill beam, caused it to grow even more.

I don’t have a wood shop, and when I started this I didn’t have many tools, so I had to focus on the overall plan, reference the tasks against tool requirements, and factor the cost of tooling into defining priorities and schedules.

At work, I focus on improving tooling only when it has a direct need and clearly defined customer, and minimize the scope of delivery to only what is specifically asked for.

So, while planning and deciding on what I needed, I broke the work down into blocks that addressed critical needs first, and deferred anything that wasn’t going to impact the functional requirements of the project.

In many cases, the original expectation of the finished product changes significantly as work is completed and new things are discovered, so I try to keep a loosely held general strategic goal and focus my tactical decision making on the things that add the most value per unit of work.

Pick Low Hanging Fruit First

In a complex project, there are so many things to do, all of which seem critical, that decision paralysis can bog down progress.

When prioritization seems impossible due to the volume of high priority tasks, taking on the things which have the most short term impact can get the ball rolling.

I focused on addressing the things which directly impacted our ability to use the bathroom, and minimized things which would degrade usability.

Another rule of thumb is that adding a new feature for an unmet need is usually more valuable than refining an existing one, if both are similar priority.

Constant Communication and Feedback with Customer

As much as possible, when a decision comes up which can’t be easily answered with knowledge of existing plans and decisions, the customer should be consulted.

This seems like it might be a bother, but by making ambiguity visible to the customer, and getting a clear decision that all parties understand, a lot of time and energy will be saved compared to a process that “doesn’t bother” the customer until close to delivery.

I made sure that at the end of each sprint or task, I checked in with my family to make sure they knew the plan and what the next milestone would be, and shifted my work around their needs to make it as easy as I could for them.

Final Thoughts

This has been a pretty massive undertaking, but I’ve been able to make good progress without violating any of the agreements made at the outset by using these principles from the software world in the home.
