I spend a lot of time thinking about decision criteria for shipping experimental features. The topic could use a little more attention. Maybe it sits higher up Maslow’s hierarchy of experimentation, somewhere above the prerequisites of actually running randomized experiments, constructing them optimally, and analyzing them correctly. Yet it is still important to ask: under what conditions do we ship?
When it comes to this question, there’s a magic number that breaks the way people think about their choices. That magic number is 0.05, and I wish people would forget it.1
Why 0.05?
Quick refresher: 0.05, or 5%, is the most common p-value threshold used in experiment analysis. It’s the most common number used to decide whether an experimental result is meaningful or whether it should be discarded due to limited evidence. Naturally, 0.05 is also commonly used as a significance level in power calculations. I’m not the first to lament the “mechanical dichotomous decisions around a sacred .05 criterion”, highlighted in the colourfully named The Earth Is Round (p < .05) from three decades ago [1].
In its predominant use with a zero-difference null hypothesis, the p-value is the probability of seeing a result at least as extreme as the experiment’s, purely by random chance, under the hypothetical circumstances where nothing actually changed.
You test something on your users and observe a metric improvement of X%. Under the hypothetical where you made no change at all, and had purely run an A/A test of no actual change on your users, the p-value is how often you would see a result of X% or more.2 When we use a p-value of 0.05 as a threshold, we’re saying that as long as a result this extreme would occur only 5% of the time by chance in an A/A test, we’ll treat the test as successful and ship the change.3 This could be our ship criterion: ship if p < 0.05, otherwise go back to the drawing board and think of another experiment. So if all we ever tried were changes with no real effect (even literally no change, an A/A test), we would still ship about 5% (1 in 20) of them and mistakenly think we improved the product. Those are the false positives of experimentation.
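To make that concrete, here’s a minimal simulation sketch (the metric, sample size, and number of experiments are all made up for illustration) showing that repeated A/A tests come out “significant” at p < 0.05 about 5% of the time purely by chance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_users = 10_000        # hypothetical users per arm
n_experiments = 2_000   # number of simulated A/A tests

significant = 0
for _ in range(n_experiments):
    # Both arms draw from the same distribution: there is no real change.
    control = rng.normal(loc=1.0, scale=1.0, size=n_users)
    treatment = rng.normal(loc=1.0, scale=1.0, size=n_users)
    _, p_value = stats.ttest_ind(treatment, control)
    if p_value < 0.05:
        significant += 1

# Roughly 5% of these no-op "experiments" clear the bar and would ship.
print(f"{significant / n_experiments:.1%} of A/A tests were 'significant'")
```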
1 in 20 seems like a low enough rate of false positives.4 And that’s partially why R.A. Fisher chose and popularized this threshold back in 1925 [2]. He also liked that in a normal distribution it corresponds to nearly two standard deviations, which feels intuitive and was convenient for look-up tables before computers existed.
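The “nearly two standard deviations” part is easy to check: the two-sided 5% critical value on a normal distribution is about 1.96.

```python
from scipy import stats

# The z-score a result must exceed for two-sided p < 0.05 on a normal curve.
print(stats.norm.ppf(1 - 0.05 / 2))  # ~1.96, i.e. nearly two standard deviations
```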
Defaults matter
We’re all using a number that somebody chose nearly a century ago. Whether it is the right number depends on the circumstances. Plenty of people realize how arbitrary 0.05 is, but plenty of people don’t, and lots of teams use 0.05 as their decision criterion.
We see it reinforced in our blog posts and papers, as people pick 0.05 in their examples. We also see it reinforced in our tools. My company isn’t alone in having a pretty experiment analysis tool that shows the colour green when a metric improves with p < 0.05 and red when a metric decays with p < 0.05. Those colours put a stamp of authority on the threshold. That stamp of authority has a real effect.
0.05 is not very useful in the tech industry
Our goal isn’t just to minimize experiment false positives. We can accomplish that by not running any experiments at all. We want to find real product improvements. We have to care about false negatives.
For typical tech use cases, we tend to overemphasize false positives and underemphasize false negatives.
Note that these p-values are calculated against a neutral change, one with no real effect. Neutral changes aren’t costly to our users, so while we should be somewhat averse to wasting time and adding tech debt for them, shipping one isn’t the end of the world. The chance of mistakenly shipping a meaningfully negative change is lower still, because a genuinely harmful change is even less likely to clear the significance bar than a neutral one. A 0.05 threshold therefore implies extreme risk aversion against negative changes.
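Roughly how much lower? Here’s a sketch of my own, with true effects expressed in standard-error units (illustrative values, not tied to any particular metric), of the chance that a truly negative change still clears the positive p < 0.05 bar:

```python
from scipy import stats

# We "ship" when the observed effect clears the positive side of a two-sided
# p < 0.05 test, i.e. its z-score exceeds roughly 1.96.
z_crit = stats.norm.ppf(1 - 0.05 / 2)

# Hypothetical true effects, expressed in standard-error units.
for true_effect_se in [0.0, -0.5, -1.0, -2.0]:
    p_ship = 1 - stats.norm.cdf(z_crit - true_effect_se)
    print(f"true effect {true_effect_se:+.1f} SE -> ship probability {p_ship:.3%}")

# Output: roughly 2.5% at 0 SE, 0.7% at -0.5 SE, 0.15% at -1 SE, 0.004% at -2 SE.
```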
But do we need that much risk aversion? Most of the time, we’re not evaluating cancer drugs. We’re changing ad copy. We’re swapping ML models. We’re sending different emails.
Let’s keep some perspective.
When we switch to a different product interface or a new algorithm, we usually have the ability to reverse our decision later. The code is version controlled. We might even implicitly reverse the change through future experiments. This could happen invisibly, such as through a new ML model that ends up somewhat similar to a prior ML model, or visibly but obliviously, like a future UI change where nobody on the team was around when the product previously behaved that way. If we had found a local optimum of product behavior and incorrectly moved a bit off that optimum, the same forces that pulled us there once might pull us there once again.
On the other hand, when we have a false negative we can permanently learn the wrong lesson. We might discontinue a promising direction and never try something similar again. I posit that the forces pulling us to retry an experimental change are weaker than the forces pulling us back to a previous local optimum. In this way, false negatives might be more costly than false positives.
Hulu and Netflix are typical examples: they power experiments to tolerate 5% false positives and 20% false negatives.5 If we were starting over and choosing ideal defaults for experiments in tech, I doubt that’s what we would pick.
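To put those defaults in terms of sample, here’s a back-of-the-envelope sketch using the standard normal-approximation formula for a two-sample test; the 0.05-standard-deviation minimum detectable effect is a made-up number, not anything from Hulu or Netflix:

```python
from scipy import stats

def n_per_arm(alpha: float, power: float, effect_size: float) -> float:
    """Approximate users per arm for a two-sample test of means.

    effect_size is the minimum detectable effect in standard-deviation units.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # significance threshold (two-sided)
    z_beta = stats.norm.ppf(power)           # power = 1 - false negative rate
    return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

mde = 0.05  # hypothetical minimum detectable effect of 0.05 standard deviations
print(f"alpha=0.05, power=0.80: {n_per_arm(0.05, 0.80, mde):,.0f} users per arm")
print(f"alpha=0.20, power=0.80: {n_per_arm(0.20, 0.80, mde):,.0f} users per arm")
# Relaxing the significance threshold from 0.05 to 0.20 cuts the required
# sample per arm from roughly 6,300 to roughly 3,600 in this example.
```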
Effects of blindly using p < 0.05
I hypothesize that companies and teams with a 0.05 p-value threshold fall mostly into two archetypes:
Stasis
The experiment regime might be strict. The standard criterion could be p < 0.1, p < 0.05, or p < 0.01. We might even have stricter policies in effect through multiple hypothesis corrections. In these situations, too little gets shipped. Either the company progresses too slowly, because of too many experiment false negatives or an inefficient use of sample, or teams escape the testing regime entirely and the product follows a random walk directed by the product intuition of the most powerful and motivated leaders.6
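For a sense of how a multiple hypothesis correction tightens the bar, here’s a minimal Bonferroni-style sketch over a hypothetical set of ten success metrics:

```python
from scipy import stats

# Hypothetical Bonferroni correction: to keep a 5% family-wise false positive
# rate across 10 success metrics, each metric must clear p < 0.005 on its own.
n_metrics = 10
family_alpha = 0.05
per_metric_alpha = family_alpha / n_metrics  # 0.005

# The bar each metric must clear rises from ~1.96 to ~2.81 standard errors.
print(stats.norm.ppf(1 - family_alpha / 2))      # ~1.96
print(stats.norm.ppf(1 - per_metric_alpha / 2))  # ~2.81
```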
Arbitrary adjustments
Sometimes 0.05 is a de jure standard, but the de facto standard is something weaker. The experiment authorities might be in the room, but not in control. Everyone looks at the neutral results (coloured grey or black) and then decides whether to move forward anyway, citing “directional” evidence if the point estimate of the effect is positive. They might “look beyond just the numbers” with “strong and sensible judgment” [3].
This isn’t wrong; it seems better to me than the stasis approach, and I’m speaking as someone who’s been in the “experiment authority” role. But it does lead to inconsistent and unpredictable decision making. Why not just admit you want something akin to a p < 0.25 threshold in the first place? Don’t be shy about it.
Anchoring bias
I believe that nobody wants to write down a large p-value threshold, because it feels unscientific. Our software might show asterisks next to p-values based on some combination of 0.001, 0.01, 0.05, and 0.1 as thresholds, often implying that 0.05 is a middle ground or an upper limit for significance. Compared to those, 0.2 or 0.3 (or even higher!) seem extreme. So we don’t say that’s our decision criterion, even if it is.
The appropriate criterion depends on the situation. I have so much more to say about quantifiable decision criteria, because we’re just scratching the surface of the topic. Stay tuned for my next blog, where I’ll write in more detail about how people structure this problem rigorously. We aren’t restricted to regimes where we have to experiment only through traditional A/B tests with a ship criterion based on a p-value threshold.
[1] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
[2] Kennedy-Shaffer, L. (2019). Before p < 0.05 to Beyond p < 0.05: Using History to Contextualize p-Values and Significance Testing. The American Statistician, 73, 82-90.
[3] Tingley, M. (2021, November 15). Building confidence in a decision. Netflix TechBlog. Retrieved April 8, 2022, from https://netflixtechblog.com/building-confidence-in-a-decision-8705834e6fd8
1. I say, as I write about this very number.
2. “More” here meaning further from 0, either positive or negative in a two-sided test.
3. Assuming a positive improvement in this example.
4. I didn’t write “false positive rate” because technically I’m describing the false discovery rate.
5. A caveat here that these examples are for power calculations, where adding more sample reduces the risk of both false positives and false negatives. That’s in contrast to decision thresholds in experiment analysis, where there is a direct trade-off between the two.
6. The inefficient use of sample is that we run experiments longer in order to power for those p-value thresholds. For companies whose volume of experiments is bound by sample, this means fewer experiments. Even for companies unable to try enough experiments to consume their sample, this can extend the duration from idea to successfully shipping an improvement.