Looking for the simplest change that will eliminate the most bugs
A case study of trying to reduce bugs across an engineering organization with a small process change
I am a testing advocate, but not a testing zealot. Testing has costs and should be done to the extent that the benefits of reduced risk outweigh those costs. No more, no less. Yet at work I felt that we had a suboptimal amount of testing. We had decent unit testing, where small parts of the code are tested in isolation. But we had trouble testing end-to-end product dynamics due to complicated dependencies between services, or feedback loops through data generated from user behavior. Certain types of complex behaviors were under-tested before launches, leading to bugs that would require urgent remediation. We care about delivering a solid customer experience. Furthermore, even smaller bugs can kill our momentum if they mean we have to pause what we’re doing to fix them, and they can even set us back if we have to fix and restart our experiments.
There were two moments in product development where I expected that a little more manual testing would eliminate a large proportion of negative customer experiences.
Right before launching a feature to real users (typically as an experiment)
After starting an experiment but before ramping up to substantial traffic[1]
There is a significant difference between the two: the latter one is when we have data and logs from real users, some of whom use the product in ways we don’t expect.
Both are high leverage moments to test the product, where small amounts of effort translate into large impact.
Right before launching a feature to any real users, we have all of our code merged and we think it’s ready to go. Our bugs won’t be fixed organically by subsequent pre-launch changes, because there are no more pre-launch changes. We also won’t hand-wave away any discovered bugs or odd product experiences under the mistaken impression that someone else knows about them or will fix them. Bugs found at this point have high precision. Additionally, the product changes are now fully implemented, so we can interact with them more easily than at earlier stages when parts were still in development. Often we can find bugs simply by looking at the product.
After starting an experiment but before ramping up to substantial traffic is the period where bugs haven’t done as much damage as they could. These bugs have yet to reach their full potential; instead they are contained to a small set of users. We don’t really need to test, we just need to analyze the testing our users have done for us. Our logs will be populated with useful diagnostics, and our charts will start showing user activity. This moment can be incredibly revealing with the simple effort of looking at our data.
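To make the "just look at our data" step concrete, here is a minimal sketch of what that analysis can look like. The log format and field names are hypothetical; the point is only that a quick error tally over the small canary population is often enough to surface a bug before ramp-up:

```python
from collections import Counter

def canary_error_rates(log_lines):
    """Tally per-user error rates from newline-delimited
    'user_id status' entries (a hypothetical log format)."""
    requests = Counter()
    errors = Counter()
    for line in log_lines:
        user_id, status = line.split()
        requests[user_id] += 1
        if status.startswith("5"):  # server-side error codes
            errors[user_id] += 1
    # Fraction of requests that failed, per user in the canary group
    return {u: errors[u] / requests[u] for u in requests}

sample = [
    "alice 200",
    "alice 500",
    "bob 200",
    "bob 200",
]
rates = canary_error_rates(sample)
# A single user with an elevated error rate is worth investigating
# before the experiment ramps up further.
```

In practice this would run against real diagnostics, but even a throwaway script like this captures the spirit: the users have already done the testing; we just have to read the results.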
For both of these scenarios, sometimes the bugs are right there. Superficially apparent. Yet they can go unseen if nobody bothers to look.
Changing behaviors is hard
Unfortunately, it can be hard to convince people to look at their own product or to look at their own data.
For some engineers, it feels alien. Manual testing can feel inefficient since it isn’t automated. It usually isn’t written down as a concrete, repeatable process—and if it is, that can cause even more hesitation if it reduces engineers’ sense of control and agency. Coders are excited to code. Not so much to verify. Unit testing can garner some enthusiasm, because it’s still coding. It still gives that dopamine hit of writing code and seeing it run as intended. Manual testing or data analysis is a different experience.
How do you convince someone to do something that they wouldn’t otherwise do? You could tell them it’s a best practice, but they already know that testing is a good thing. Another way is with a rule. You can tell them that it must be done. Yet rules have their own limitations.
Our workplaces are full of policies and processes, often too many to remember. Engineering organizations are inherently social. Not necessarily in the socializing sense, although many are. I’m referring to the society interpretation of the word. There are dynamics in organizations. Organizations have culture. While some policies may be written and followed, others are written and ignored, and others are followed yet unwritten. Rules are more effective if they align with incentives, and it is both hard and risky to change incentives.
Changing rules, policies, or best practices is ineffective if those changes are ignored. It’s not about having the best rules. It’s about having the best outcomes.
So how do we get the best outcomes?
In this case I looked for existing processes that occurred during our high leverage moments. Thankfully, we already had approval steps connected to the start of product launches, because the way we expand experiment coverage is through code changes. Those code changes require code review. We already have reviewers explicitly approving the next step in a launch.
Code review tools are configurable: they don’t need to be configured for only yes/no approval decisions. They can support custom labels. We have a label for “Tested”, the presence or absence of which is prominently displayed. So what we changed is that we asked reviewers to check for the Tested label and to ask if testing has been sufficient.
The testing doesn’t always have to be exhaustive. Engineers can give their code changes the Tested label, but that should be paired with writing down how they tested. It’s hard to write that down without doing any testing, and now it’s harder to continue experiments without adding the Tested label. The label policy is both an incentive and a reminder. Since the reviewers aren’t the ones who will have to do the legwork, they are a little more likely to enforce testing.
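The reviewer-side check described above can be sketched as a small predicate. This is a hypothetical illustration, not our actual tooling; real review systems (Gerrit, GitHub, etc.) express this kind of rule through label or status-check configuration rather than application code:

```python
def ready_to_launch(labels, testing_notes):
    """Hypothetical pre-approval check for a change that expands an
    experiment: require the 'Tested' label AND a written note on how
    the testing was done. Returns (approved, reason)."""
    if "Tested" not in labels:
        return False, "missing Tested label"
    if not testing_notes.strip():
        return False, "Tested label set, but no description of the testing"
    return True, "ok"

# A reviewer (or a presubmit hook) would apply this before approving:
approved, reason = ready_to_launch(
    {"Tested"},
    "Manually verified the new flow at 1% ramp and checked the dashboards.",
)
```

The pairing matters: the label alone is a checkbox, but the written note is what makes the label hard to apply dishonestly.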
That’s it. All we did was change that process and clarify responsibilities of the reviewer and the reviewee. This was designed as a small modification of an existing process, to have high effectiveness with minimal effort for everyone involved.
How’s it working out?
So far, I grade the change as a moderate success. Observationally, I see fewer customer-facing bugs from teams following the new process. We have cut down on bad product experiences for customers, reduced experiment restarts, and suffered fewer disruptive remediations. However, adoption is inconsistent. As mentioned before, organizations are social. It takes time to change practices across a large organization. We still need awareness of the rule, from the reviewers more than the coders. A faster way to enforce full adoption would be to mandate the Tested label in our code review configuration for the relevant repository, but that’s a blunt approach that I’m hesitant to use. Another reason to tread lightly is that we need the reviewers to build strong judgment about when to be more lax and when to be more firm.
I’d rather change behavior through a process improvement that grows into our culture, rather than imposing a heavy-handed rule that people rebel against and ignore. The last thing we need in our engineering society is a riot.
When we notice bug or incident patterns, we have an opportunity for solutions that will prevent multiple such problems in the future. That’s called leverage. Simple process changes can have a big impact. Yet it isn’t always easy; it might require adaptation and reinforcement. In the end we can only hope that we’ve made a difference, and aspire for those differences to outlast our own tenure, leaving a small but positive contribution to the organization’s culture.
[1] For organizations that don’t use a canary or ramp-up process for experiment launch, the second moment might not exist.