Postmortem anti-patterns
10 behaviors that prevent us from learning the right lessons, and a few principles for having better postmortems
Let’s talk about software postmortems. My opinions here will be debatable, almost necessarily so, because I think in terms of trade-offs and hunt for the boundary conditions where a good practice, taken too far, turns into a bad one. I couldn’t find many sources that reasoned through postmortem trade-offs—with one excellent exception being Incident Review and Postmortem Best Practices by Gergely Orosz at The Pragmatic Engineer—so I decided to write my own.
When something goes drastically wrong, such as having an outage in the product, the relevant teams usually reflect afterwards through a postmortem, retrospective, or incident review. Typically all of those involve writing a document where everyone contributes to a timeline and anybody can add observations, and then the people involved gather in a meeting to discuss it.1 We try to learn lessons, use the incident to spread best practices, and take suitable action items to prevent similar issues in the future. The term postmortem is a bit morbid, but it is less wordy than alternatives and conveys a useful image: we will examine the dead service, record information that is relevant or may become relevant, and diagnose the causes.
I’m far from an expert on the topic, but I’ve been in my fair share of these and made some observations of my own.2 There are many strong best practices for postmortems. Here I’d like to dwell a bit on common but flawed practices, the anti-patterns of postmortems.
Rules are made to be broken, including these ones. Many of the behaviors below are far from absolutely wrong, but you should watch out for doing them too often or taking them to extremes.
The anti-patterns
No postmortem
The biggest postmortem anti-pattern is not having a postmortem.
The trivial case of this issue is when companies or organizations don’t have postmortems at all. If you’ve read this far, I hope that isn’t your case, or won’t be for long.
The next most pressing case is companies that have a postmortem process but don’t use it regularly. That happens even when the company has plenty of production issues. I expect the roots of this are often social: people are scared of showcasing their mistakes to an audience they aren’t close with. They might instead opt for a retrospective within their core team. But frankly, I’ve been in my share of those, and I haven’t always found them as useful as public postmortems, because that approach doesn’t take advantage of outside perspectives from other parts of the organization. Solving the social issue is hard; it might require more positivity and less blame from the postmortem audience, more rewards for the postmortem presenters, or just generally improving the vibes and connections in the organization.
The next case is where companies have regular postmortems but don’t use them for the most important issues. We can’t have postmortems for every bug. But for whatever time and attention budget a company can devote to postmortems, we don’t always spend that budget effectively.
Postmortems happen for the most obvious problems, not necessarily the most important ones. Certain types of issues are more likely to be followed by a postmortem, typically sudden problems like an outage in a service. I also suspect that postmortems are much more likely for engineering problems, for bugs, than for major harm or unintended behavior driven by more complex interactions in the product.
Here’s my suggestion: look back on the last 10 postmortems at your company and ask if those were the 10 most consequential problems during that time.
Not capturing the history
Write down everything, with timestamps and links. Write down the technical events and also the moments of human discovery and reaction to those events. If we don’t include a piece of the timeline in the postmortem before sharing it, it might never be included, and might be nearly impossible to reconstruct later on. I’ve seen many postmortems with important events missing from the timeline and the subsequent discussion, or with events recorded out of order. We can learn the wrong lessons if we don’t have a shared understanding of what happened.
Many postmortem documents decay almost immediately. For example, they may link to dashboards that are not time-bound, so future viewers of those dashboards won’t see the same view. Screenshots help. The same applies to logs: copy the relevant lines into the document, or into files that will persist.
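As one concrete illustration, here is a minimal Python sketch of freezing evidence while it is still fresh. The dashboard query parameters, log path, and search string are assumptions standing in for whatever your monitoring and logging tools actually provide.

```python
# A minimal sketch of freezing evidence at capture time, assuming a Grafana-style
# dashboard that accepts "from"/"to" query parameters in milliseconds. The URL,
# log path, and search string are placeholders for your own tooling.
from datetime import datetime, timezone

INCIDENT_START = datetime(2024, 5, 3, 14, 0, tzinfo=timezone.utc)
INCIDENT_END = datetime(2024, 5, 3, 15, 30, tzinfo=timezone.utc)

def time_bound_dashboard_url(base_url: str) -> str:
    """Pin the dashboard view to the incident window instead of a rolling 'last hour'."""
    start_ms = int(INCIDENT_START.timestamp() * 1000)
    end_ms = int(INCIDENT_END.timestamp() * 1000)
    return f"{base_url}?from={start_ms}&to={end_ms}"

def copy_relevant_log_lines(log_path: str, needle: str, out_path: str) -> int:
    """Copy matching log lines into a file that lives alongside the postmortem doc."""
    copied = 0
    with open(log_path) as src, open(out_path, "w") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                copied += 1
    return copied

if __name__ == "__main__":
    print(time_bound_dashboard_url("https://dashboards.example.com/d/checkout-errors"))
```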
It’s important to be accurate with the timeline, because inaccuracies at this point will flow into inaccuracies of diagnosis and of lessons learned. Postmortems are exercises in practicing clear thinking. We must be accurate in our timeline, our enumeration of causes and consequences, our reasoning of the causal chain of events, and our logic with how new changes will be worthwhile.
We have to get the basics right before we’ll get anything out of postmortems. The hierarchy starts with having a postmortem in the first place, followed by writing down an accurate history.
Adding alerts
Possibly the most common action item in postmortems is to add an alert, such that if the situation were replayed with the extra alert in place, the alert would trigger and someone would be notified. The implication is that the problem would either be prevented entirely or at least be caught and acted on much earlier. As we go through more postmortems, we add more alerts, logging, monitoring charts, and documentation.
This is good, most of the time. We can harden our monitoring and alerting through each incident.
The catch is that the new alerts are reactive, chasing the last problem rather than the next one. In many companies, the production systems generating the incidents are active and changing, and they will typically keep changing at a pace that generates new failure surfaces, and with them more postmortems. The new alert (or log, or chart, or playbook entry) might never even be used. Some may even be counterproductive, adding noise and being ignored because of their low precision (meaning: lots of false positives relative to the number of important problems they catch). Low recall is an issue too, where the alert will only catch a portion of the events in which the same issue causes user problems. One might think that low recall is better than no recall, but it depends on whether we expect higher recall and overly rely on the alert instead of alternative solutions.
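To make the precision and recall framing concrete, here is a small sketch that scores a proposed alert against labeled history. The per-window labels are hypothetical; in practice you would join the alert’s firing history against your incident records.

```python
# A minimal sketch: each window is labeled (alert_fired, real_incident), where the
# labels are hypothetical stand-ins for your monitoring and incident records.

def precision_recall(windows: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Precision: of the times the alert fired, how often was it a real incident?
    Recall: of the real incidents, how often did the alert fire?"""
    true_pos = sum(1 for fired, real in windows if fired and real)
    fired_total = sum(1 for fired, _ in windows if fired)
    real_total = sum(1 for _, real in windows if real)
    precision = true_pos / fired_total if fired_total else 0.0
    recall = true_pos / real_total if real_total else 0.0
    return precision, recall

history = [(True, True), (True, False), (True, False), (False, True), (False, False)]
precision, recall = precision_recall(history)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.33 recall=0.50
```

An alert that scores poorly on both axes is a candidate for redesign rather than another round of threshold tweaking.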
Furthermore, debating adding an alert can be time consuming and become an easy way out of deeply thinking through the incident. It lets us focus on shallow solutions instead of deep ones.
Removing alerts
This is a corollary of reflexively adding alerts. We end up with many alerts, some of which have low precision. It’s typical in postmortems to note that alerts existed and were triggered, but were ignored.
Removing counterproductive alerts can be useful, but—much like adding alerts—it can be a cop-out from thinking through broader risks in the system. We can end up cycling through postmortems, adding alerts, removing them, adding them again, removing them again, ad infinitum. Same goes for changing the thresholds for alerting up and down. Tuning alerts and thresholds is worthwhile, but we shouldn’t let it consume all of our focus and time in the postmortem.
“We had a bug. Next time we shouldn’t code bugs.”
Nobody says it that way, but that’s what the lessons from some postmortems boil down to. They describe how a bug existed, and indicate how it will be fixed, but without looking at any contributing factors as to why the bug was created in the first place. This can be okay if the scope of postmortems is merely to document the event and explain immediate fixes. But usually the scope includes learning lessons, and the lesson shouldn’t be that we’ll try harder next time.
Sometimes there’s a bit of hubris implicit in how we talk our way through a postmortem, as if most of the audience do not write bugs and those who contributed to the bug will not write more after understanding this one. I’m skeptical. The people who wrote the bug aren’t any less smart than the people diagnosing the problem. We will continue to create bugs. I doubt a postmortem will teach someone to not code any more bugs. Even if it did, the current vintage of bug writers will soon enough move up the org chart or out of the company and be replaced by new vintages of bug writers.
“So, that's been my life: trying to roll back through the series of actions (or lack of actions) to see how things happened, and then trying to do something about it. The problem is that if you do this long enough, eventually the problems start leaving the tech realm and enter the squishy human realm.”
rachelbythebay, More than five whys and "layer eight" problems
Pinpointing a buggy line of code or suggesting an alert should not short-circuit deeper analysis. We need sustainable ways to improve our product development, and that product development comes from our teams as a human system. This is the “squishy human realm” from the quote above: our engineering practices are a human system, and that’s why Gergely Orosz suggests we “take a socio-technical systems approach to understand the outage”.
Overprescribing fixes
Postmortems can result in too many action items. We can be overly reactive and unnecessarily rip up our roadmap. This might not be the time to commit to a full rearchitecture of your systems. Let the ideas marinate for a bit first.
The people in every organization fall on a spectrum of how polished they want every part of the system to be. The right balance varies with the circumstances, but it’s likely neither to slap everything together without any foresight, nor to polish endlessly with no impetus toward new features. During a postmortem the perfectionists have the moral high ground, and they may overprescribe an extensive list of action items. Doing all of them might be a poor use of time and might distract from other important progress. It can be particularly counterproductive to undertake large overhauls of systems, risking new bugs and new sources of instability.
The postmortem is a good catalyst to nudge more in the direction of code and architecture improvements, but you likely shouldn’t drastically upset the balance in your organization.
No tracking of action items
Given some of my concerns about unproductive action items, I don’t mind processes that let the dust settle and don’t rigorously enforce action items. But we probably need just slightly more enforcement than none. It can be discouraging when nobody cares whether the action items happen at all, and then the process can start to feel useless. The sense of déjà vu from repeatedly hitting some of the same outages can also be demotivating.
I enjoyed reading the example of Honeycomb in Incident Review and Postmortem Best Practices, where they don’t track action items at all. I don’t see this as far off from my opinion, since it seems like they do expect teams to create tasks as an outcome. Furthermore, I strongly agree with their emphasis on learning ahead of overdesigning solutions on short notice. However, at the margin I think it’s still better to discuss and suggest action items in the postmortem meeting itself, while you have the attention and guidance of more people. That can still coexist with a system where teams are able to change their minds about action items.
The catch with no action items is that teams can abuse the implicit trust of that approach, saying they’ll undertake improvements while fully intending to forget about them. That can be both an outcome and a cause of bad team culture. Commitments made during the postmortem meeting should, at a minimum, be written down.
The tracking process doesn’t need a lot of overhead, nor does it need to insist on completion of all action items. It could be as simple as having one postmortem overseer responsible for checking action item statuses, who politely reminds people to finish their action items or to document why their team decided against them. Ideally with a demeanor that is more “wise elder” than “stern schoolteacher”. Just enough enforcement to make people circle back and acknowledge their previous commitments.
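For what little tracking is needed, something as small as the following sketch is enough. The data shape, statuses, and names are hypothetical stand-ins for whatever tracker your team already uses; the point is only that open items get a nudge and declined items get a recorded reason.

```python
# A minimal sketch of the overseer's check-in: nudge open items, accept documented
# declines. The records below are hypothetical examples.
from datetime import date

action_items = [
    {"owner": "ana", "summary": "Add time bounds to dashboard links",
     "status": "done", "due": date(2024, 6, 1)},
    {"owner": "raj", "summary": "Split the noisy checkout alert",
     "status": "open", "due": date(2024, 5, 20)},
    {"owner": "raj", "summary": "Rewrite payment retries", "status": "declined",
     "note": "Team decided the rewrite risk outweighs the benefit for now."},
]

def needs_follow_up(item: dict, today: date) -> bool:
    """Open and past due: either finish it or write down why the team decided against it."""
    return item["status"] == "open" and item.get("due", today) <= today

for item in action_items:
    if needs_follow_up(item, date.today()):
        print(f"Gentle reminder for {item['owner']}: "
              f"'{item['summary']}' is still open (due {item['due']}).")
```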
Mutually contradictory suggestions
A good practice in postmortems is to write down everything that went wrong or contributed to the problems. Yet some of the implications and their action items might not fit together. If we want to learn lessons, we need to reconcile those.
Here’s a paraphrased example of conclusions from the postmortem of an overdue project: a) the analysis was too shallow, and b) the analysis took too long. We can’t have both: should the analysis have been deeper but more time-consuming, or faster but shallower? To reconcile this we need to push the claim one step deeper: maybe we should have scheduled later launch dates for the project, or allocated more people and split the analysis between them. It is usually ineffective to simply yearn for faster work with fewer bugs.3
Another example of contradictory advice is an action item to have more alerts, and an action item to remove noisy alerts. Those can actually coexist and make sense, if the real request is to have better alerts that collectively are on an improved frontier of precision and recall. But if so, why didn’t we do that before? Is it even possible, or are we constrained to trading off precision against recall (e.g. moving an alerting threshold up or down)? Or if we actually can see how the package of alerts could be better, did we reach this situation before because we implicitly chose against that, choosing (for example) to have faster timelines or more experiments rather than more time invested in optimizing alerts? If so, are we sure that was the wrong choice?
Blame
Postmortems cannot become public trials of the people involved. That can be traumatizing for them, it can hurt future postmortem participation and openness, and we want postmortems to diagnose problems in the system rather than in individual people.
There are clear reasons why every guide to postmortems emphasizes the importance of blamelessness. I almost skipped this because it’s so universally accepted, but I include it to balance the next one:
Avoiding the problems
“Blameless” doesn’t mean we don’t talk about mistakes. I’m reminded of a postmortem for an outage where the code was entirely incorrect, and reaching that code path in any circumstance would trigger an error that ultimately propagated into a user-facing bug. The code wasn’t tested. The code reviewers had glanced over it without a thorough review. Yet we almost didn’t talk about how any of that happened.
People talked about which alerts could be better and all of the other peripheral improvements we could make. We almost missed the real lessons: what we expect of code reviewers (a topic where it turned out we were misaligned and could have a productive discussion about finding an effective policy), how we could make testing easier, whether or how to increase the visibility of our test coverage, and whether we were rushing a launch.
It’s an indication of weak relationships and poor psychological safety in the team if people are afraid to talk about what happened or can’t do so without risk of offending others. Dodging a constructive conversation doesn’t make the experience feel blameless; instead it accentuates that there is so much judgement in the room that it can’t be said out loud.
Blameless means we don’t point fingers. We don’t accuse anyone of making a mistake that others wouldn’t have in the same circumstances. We don’t have to put individuals in a public inquisition. We can manage that while still talking through the actual mistakes. We have to respect that our teammates are adults too, which means we can have professional conversations that acknowledge mistakes.
Better principles for postmortems
The anti-patterns above involve trade-offs, and many of the underlying behaviors are fine in moderation. I’d like to end with a positive set of suggestions. These are the principles I try to teach for effective postmortems.
Principles:
Document the history. What we don’t write down now will be forgotten forever.
Look for patterns. Look for causes that recur across multiple incidents (a simple tally across past postmortems, as in the sketch after this list, can surface them), such as the cases I wrote about in Looking for the simplest change that will eliminate the most bugs and Shared constants across programming languages. Should we be testing more? Should we invest in better tooling for testing? Do we have weak ownership? Do we have an unstable arrangement of dependencies?
Improve systems. After identifying patterns, look for systematic ways to reduce the number and severity of bugs in production.
Think through the consequences. The impact of a change or new policy will not be on the past outage, which already happened, but on future developments. What do we actually think we should do differently, that we could apply not just to that situation but to all of our development?
Recognize your company as a society. The people in your company are wonderful and flawed. They code bugs. They miss warning signs. They might be too busy. They err in communication. They exist in groups that are separated from other groups with fuzzy boundaries. People will join the society, change roles in the society, and leave the society. Everything is changing. If anything is built and accomplished, then bugs come as part of that deal.
Learn what there is to learn. Mistakes are precious and we need to learn from them.
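Here is the tally sketch referenced under “Look for patterns”: a few lines that count recurring cause tags across past postmortems. The tags and records are made-up examples; the value is in noticing when the same cause keeps showing up.

```python
# A minimal sketch, assuming each postmortem record carries a list of cause tags.
from collections import Counter

postmortems = [
    {"title": "Checkout outage", "cause_tags": ["missing test", "noisy alert"]},
    {"title": "Stale config rollout", "cause_tags": ["weak ownership", "missing test"]},
    {"title": "Search latency spike", "cause_tags": ["missing test", "dependency drift"]},
]

tag_counts = Counter(tag for pm in postmortems for tag in pm["cause_tags"])
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count} of {len(postmortems)} incidents")
```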
If you can do all of that, you will have a postmortem process that feeds into a continually strengthening organization, where lessons are learned and practices improve over time as part of a strong culture.
1. See PagerDuty’s Postmortems page for general background on postmortems.
2. Let’s not dwell on why I might have been in many of these.
3. Don’t get me wrong, there’s value in holding people to a high bar and pushing the team to do better work more efficiently, particularly if the bar isn’t being consistently met across the organization. But that message is usually misplaced in a postmortem and belongs instead in individual coaching or in organizational performance management policies.