Don’t Run with Scissors! How to Safely Calculate Program ROI
Discovery Rate: What it is and Why it Matters
With thanks to Ton Wesseling for inspiring this post series with his brilliance and sagacious thought leadership.
The optimization industry borrows part of the hypothesis-testing toolkit used in science (p-values, t-tests, null hypotheses, etc.). But there’s more to optimization measurement than just these statistical standbys, including false and true discovery rates, magnitude errors, and sign errors. For the next several blog posts, I’ll detail these important considerations, and here’s why: as a norm within the industry, these statistical measures are not well understood, nor are they being fed into our test or program impact calculations.
Intuitively, we know that we shouldn’t be putting annualized test lift result estimates into our company’s annual P&L projection, but we don’t yet have a common language for communicating our hesitation. Currently, we don’t do a good job of helping business decision-makers understand how complex statistics affect the value of optimization programs.
We need to be better analysts and/or work more closely with data scientists to not over-report our value. Over-reporting is a detriment to the entire industry and can cause an erosion of trust and, consequently, budgets. The better we understand the statistical underpinning of our programs, the better we understand the risks involved (in communicating program ROI). If we can understand the statistics and the risks, we’ll be better able to responsibly calculate and communicate program impacts.
What’s a False Discovery Rate?
False Positive Rate (or FPR) is what you are probably most familiar with. This is what we’re controlling for with statistical significance levels and what we’re reporting when we say we have 95% “confidence” in a result. What we really mean is that, if the change truly had no effect, there would be only a 5% chance of the test reporting a significant result anyway (a false positive error). We calculate this (or our tools do) for each test we run.
In contrast, False Discovery Rate is a measure of the accuracy of your positive results in aggregate. Rather than using all results as your denominator as the False Positive Rate does, the FDR uses only the positive results (or “wins”) as the denominator, and it helps us understand how many of those wins are potentially false.
To break that down further, every test we run has 4 possible outcomes: True Negative (TN), False Negative (FN), True Positive (TP), and False Positive (FP). These outcomes can be most easily understood on a 2×2 grid like the one below:
|Test Outcomes||“True” Effect||“True” Null (no effect)|
|Statistically Significant Result||True Positive (TP)||False Positive (FP) — Type I Error|
|Statistically Insignificant Result||False Negative (FN) — Type II Error||True Negative (TN)|
|Total||Total Effect||Total Null|
Our columns represent an actual unknowable ‘True’ reality (i.e., the “reality” we’ve applied statistics to, which is not deterministic but probabilistic), while the rows represent our potential test outcomes. In the digital experimentation world, you can think of these test result possibilities as a statistically significant result vs. a statistically insignificant result.
Some of those statistically significant results are false positives, meaning we’ve made a Type I Error, and the impact is actually no different than control. In medicine, this is similar to the Placebo Effect (e.g., some beneficial effect occurs, but the effect can’t be attributed to the properties of the introduced factor.)
Sometimes, experiments report statistically insignificant results when there truly was an effect on our key performance indicator different from the control. These are false negatives or Type II Errors. In medicine, these are the test results that tell us we don’t have the flu when all signs (and our doctor) say we do.
Did you know? The false negative rate for the flu test can be as high as 80%(!), and the false positive rate is very low. What are the implications of a high false negative rate? Since doctors will generally treat and prescribe based on flu symptoms without administering the test, the effects of false negatives are minimal and, therefore, not a concern. However, a high false positive rate could lead to overmedication, and over-reporting could have other unintended and dangerous consequences! In contrast, a high false negative rate for COVID-19 tests would be devastating! With no approved treatment and a high mortality rate, it’s extremely important that the COVID-19 test has a low false negative rate, or people will not know when they need to quarantine to protect those at highest risk. Which type of error you care about most in your experiment should be carefully considered. To read more about Type I and II errors and how to mitigate them, check out our Mythbuster blog post.
Our False Positive Rate, then, is simply the percentage of “True” null results that will be reported (inaccurately) as positive. In our 2×2 grid, we’re only looking at the “True” null column and calculating False Positives / Total Null. We use statistics to estimate the False Positive Rate (since we are not omniscient and have no real way to know the ‘Truth’). Typically, before we launch a test, we decide what false positive rate we’re comfortable with. In the experimentation industry, standards tend to be 95% or sometimes 90% confidence, which reflects a willingness to accept 5–10% false positives.
If False Positive Rate is the percentage of falsely reported positive results out of the total “True” null (no effect) outcomes, what then is False Discovery Rate? False Discovery Rate looks at the proportion of false positive results out of the total of the statistically significant results. So rather than look at the % of the “True” null column, it looks at the % of the statistically significant results ROW (the number of False Positives / the total number of statistically significant results). Again — we cannot know the “truth” so we use statistics to estimate the likelihood of our false positives.
There are 2 other measures we can look at. The True Positive Rate, also known as sensitivity, is the percentage of true positive results out of the total number of actual “true” effects. The True Negative Rate, also known as specificity, is the percentage of true negatives out of the total number of “true” null (no effect) results. Again, both totals are unknowable, so these rates can only be estimated.
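For readers who prefer notation, the four rates described so far can be written against the cells of the 2×2 grid. These are the standard textbook definitions, using the same TP, FP, FN, and TN abbreviations as the grid:

```latex
\begin{align*}
\text{False Positive Rate (FPR)} &= \frac{FP}{FP + TN} \\
\text{False Discovery Rate (FDR)} &= \frac{FP}{FP + TP} \\
\text{True Positive Rate (sensitivity)} &= \frac{TP}{TP + FN} \\
\text{True Negative Rate (specificity)} &= \frac{TN}{TN + FP}
\end{align*}
```

Note that FPR and FDR share a numerator and differ only in the denominator: the "True" null column for FPR, the statistically significant row for FDR.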
Let’s put some numbers into our chart, based on:
- a company running 100 experiments per year
- with a win rate of 16%
- using a 95% confidence level to control false positives risk to ~5%
- and using an 80% power
|Test Outcomes||“True” Effect||“True” Null (no effect)||Totals|
|Statistically Significant Result||True Positives: 12||False Positives: 4||16|
|Statistically Insignificant Result||False Negatives: 3||True Negatives: 81||84|
|Totals||15||85||100|
Our False Positive Rate in this scenario is 4⁄85, or approximately 5%.
Our False Discovery Rate is then 4⁄16, or ~25%!
So, 5% of our “True” null (no effect) results will come back as “wins”. And 25% of our “wins” could still actually be nulls. That’s a real head scratcher, right?!
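If you want to check the arithmetic yourself, here is a minimal Python sketch that computes the rates from the four cells of the grid. The `Outcomes` class and its method names are made up for illustration; the counts are the ones from the example scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Counts from the 2x2 grid (hypothetical helper, for illustration only)."""

    tp: int  # significant result, "true" effect
    fp: int  # significant result, "true" null (Type I error)
    fn: int  # insignificant result, "true" effect (Type II error)
    tn: int  # insignificant result, "true" null

    def false_positive_rate(self) -> float:
        # Share of the "true" null COLUMN that came back significant.
        return self.fp / (self.fp + self.tn)

    def false_discovery_rate(self) -> float:
        # Share of the statistically significant ROW that is actually null.
        return self.fp / (self.tp + self.fp)

    def true_positive_rate(self) -> float:
        # Sensitivity: share of "true" effects that were detected.
        return self.tp / (self.tp + self.fn)

    def true_negative_rate(self) -> float:
        # Specificity: share of "true" nulls correctly reported as insignificant.
        return self.tn / (self.tn + self.fp)


# Counts from the 100-experiment scenario.
example = Outcomes(tp=12, fp=4, fn=3, tn=81)
print(f"FPR: {example.false_positive_rate():.1%}")
print(f"FDR: {example.false_discovery_rate():.1%}")
print(f"TPR: {example.true_positive_rate():.1%}")
print(f"TNR: {example.true_negative_rate():.1%}")
```

In real life the four counts are never observed directly, so treat this as a way to reason about a scenario, not as something you can run on your actual test log.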
Why it Matters
So why do we care? This is hard! Should we all just throw up our hands in frustration and find new jobs?
No! Of course not. As G.I. Joe reminds us, “Knowing is half the battle!” In our case, knowledge is power! And while power corrupts, and absolute power corrupts absolutely, power used responsibly makes the world a better place.
So, how should we use our newfound knowledge and power? We can apply this better understanding of error to our impact estimates. We can ensure our ROI math is adjusted by these potential error rates. We can calculate how many more tests we should run, and how we might want to adjust our standards to better mitigate (or not) each type of error. In short, we can apply our thinking brains to problems that historically we left to gut and heart, and make smarter decisions and better recommendations on how to safely and responsibly use this data.
We are all asked or required at some point in our careers to calculate the “impact” or “program ROI” of our experimentation efforts. Some companies even calculate an annualized revenue impact from every test they complete. And then they bake those numbers into the P&L. Those efforts typically last about one year and lead to multiple (challenging) conversations with finance and leadership teams, where you (the analyst) stumble about, trying to explain all the potential causes for why this or that win didn’t materialize post permanent implementation.
We point to seasonality, changes in the market, and the lack of a controlled environment. Sometimes we even point out that we cannot rule out that the permanent implementation IS providing a lift, one that has raised what would otherwise have been a dip up to level performance!
Here’s the deal, though–and I hope this has been made clear throughout this post, so this doesn’t come as too much of a shock (but maybe take a seat, just in case)–responsibly, we cannot say any of those things.
Based on everything you’ve read above, hopefully, you understand why that might be. Even testing at 99% confidence leaves a 1% false positive rate and, depending on your win rate, a 2–9% chance of a False Discovery. False Discovery Rate goes down as Win Rate goes up.
Credit: Ton Wesseling
Shoutout to Ton Wesseling for creating this lovely visual to help you see how this works. But why is it this way? Essentially, your significance gives you the number of tests you might see as “positive” even if you only ran A/A or null experiments. 90% confidence? You would expect 10% of tests run to come back as a “win” — even when you didn’t have ANY winners. So a win rate of 10% with confidence set to 90% could potentially mean all of your “wins” are… well, not. So start with your confidence level. 95%? What is your annual test volume? 100? You would expect 5 false positives. So you need more than 5 wins to truly “win.” And the more wins you have, the more confident you can be that you were detecting some “true” effects. Neat, huh?
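To make that relationship concrete, here is a rough Python sketch that estimates the False Discovery Rate from a win rate and a confidence level. It is a back-of-the-envelope model, not the method behind the visual: it assumes every test is run at the same significance level `alpha`, that every test with a true effect is detected with the same `power` (defaulting to 80%, which is my assumption), and that the observed win rate equals `power * share_with_effect + alpha * (1 - share_with_effect)`. The function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Wins you would expect if every test were an A/A (null) test."""
    return n_tests * alpha


def estimated_fdr(win_rate: float, alpha: float, power: float = 0.8) -> float:
    """Estimate the share of wins that are false, under the stated assumptions."""
    if not 0 < win_rate <= 1:
        raise ValueError("win_rate must be in (0, 1]")
    if not 0 < alpha < power <= 1:
        raise ValueError("need 0 < alpha < power <= 1")
    # Back out the share of tests with a true effect from the win rate.
    share_with_effect = (win_rate - alpha) / (power - alpha)
    share_with_effect = min(1.0, max(0.0, share_with_effect))
    # Wins that come from true nulls, as a share of all tests.
    false_win_rate = alpha * (1.0 - share_with_effect)
    # Cap at 1.0: with no more wins than chance predicts, all may be false.
    return min(1.0, false_win_rate / win_rate)


print(expected_false_positives(100, 0.05))  # 100 tests at 95% confidence
for rate in (0.10, 0.16, 0.30):
    print(f"win rate {rate:.0%}: FDR ~ {estimated_fdr(rate, alpha=0.05):.0%}")
```

Under these assumptions, a 10% win rate at 90% confidence returns an estimated FDR of 100%, and at 99% confidence win rates between 10% and 30% land at roughly 9% down to 2%. A 16% win rate at 95% confidence comes out a little under 27%, slightly above the 25% in the table because the table rounds to whole tests.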
In an ideal world, we wouldn’t use statistics designed to answer yes or no to measure financial impact. We’re being asked to run with scissors, folks, and we need to mitigate some danger in our work. To be responsible analysts, to mitigate the tendency to over-report, we need to multiply ROI by a factor of the True Discovery Rate (TDR, the share of our wins that are real, or 1 minus the FDR), and we should take the following steps:
- Talk about “potential” impacts instead of “estimated” impacts.
- Create a standard confidence level that we stick to for all tests.
- Pre-establish the runtime for each test based on our set standards.
- Keep the metadata about all tests in a single repository.
- Calculate TDR and use it as an adjustment factor for all ROI calculations.
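The TDR adjustment recommended in the steps above can be sketched in a few lines of Python. This is one simple reading of "multiply ROI by a factor of the TDR", not a prescribed formula: it scales the summed win estimates by TDR = 1 − FDR. The function names and the dollar figures are invented purely for illustration.

```python
from typing import Iterable


def true_discovery_rate(false_discovery_rate: float) -> float:
    """TDR is the share of wins expected to be real: 1 - FDR."""
    if not 0 <= false_discovery_rate <= 1:
        raise ValueError("false_discovery_rate must be in [0, 1]")
    return 1.0 - false_discovery_rate


def potential_program_impact(
    win_estimates: Iterable[float], false_discovery_rate: float
) -> float:
    """Scale the summed annualized win estimates by the TDR."""
    return sum(win_estimates) * true_discovery_rate(false_discovery_rate)


# Made-up annualized lift estimates for three winning tests.
wins = [120_000, 80_000, 50_000]
print(potential_program_impact(wins, false_discovery_rate=0.25))  # 187500.0
```

This only discounts for wins that may not be real. It does nothing about magnitude or sign errors, so the result is still best described as a "potential" impact.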