Calculating ‘Truth’ (while avoiding existential crises)
It’s an election year! There’s a pandemic! And you’re a crackerjack optimization analyst! Every day you’re called on to calculate the “Truth,” at least for your experiments. Lucky you!
So how can you calculate something unknowable like “Truth”? Statistics has a way to estimate that!
What you’ll need:
Power: typically defaults to 80%. Check your testing technology’s documentation if you are unsure, or simply use 80% as a default.
Confidence (or “Statistical Significance”): the statistical threshold you used to decide which tests count as wins (typically 90% or 95%)
Win Rate: the proportion of tests completed that were considered statistically significant
What you can do with that:
From these 3 numbers, we can calculate a “True Discovery Rate”. And 100% minus the “True Discovery Rate” is, of course, our “False Discovery Rate.” Here’s a handy formula for that:
$$\text{True Discovery Rate} = \frac{\text{Power} \times (\text{Win Rate} + \text{Confidence} - 1)}{\text{Win Rate} \times (\text{Power} + \text{Confidence} - 1)}$$
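The post gives the formula without showing where it comes from. One way to arrive at it is sketched below, assuming every test either has a real effect (detected with probability Power) or has none (flagged as a win with probability α = 1 − Confidence). The symbols π (the unknown share of tests with a real effect) and α are introduced here for illustration and do not appear in the original.

```latex
\begin{aligned}
\text{Win Rate} &= \pi \cdot \text{Power} + (1 - \pi)\,\alpha,
  \qquad \alpha = 1 - \text{Confidence} \\[4pt]
\pi &= \frac{\text{Win Rate} - \alpha}{\text{Power} - \alpha} \\[4pt]
\text{True Discovery Rate}
  &= \frac{\pi \cdot \text{Power}}{\text{Win Rate}}
   = \frac{\text{Power} \times (\text{Win Rate} + \text{Confidence} - 1)}
          {\text{Win Rate} \times (\text{Power} + \text{Confidence} - 1)}
\end{aligned}
```

Under these assumptions the result only makes sense when the win rate sits between α and Power. A win rate below α would imply a negative share of real effects.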
For an example using numbers, a likely scenario would be 80% power, 95% confidence, and a win rate of 10%:
$$\text{True Discovery Rate} = \frac{80\% \times (10\% + 95\% - 1)}{10\% \times (80\% + 95\% - 1)} \approx 53\%$$
False Discovery Rate then equals 100% minus 53%, or 47%.
That means that nearly half of all “wins” reported by organizations (using 95% confidence, 80% power, and achieving a 10% win rate) are actually illusory!
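The worked example above can be reproduced in a few lines of Python. This is a minimal sketch rather than anyone's official tooling: the function names `true_discovery_rate` and `false_discovery_rate` are made up for illustration, and all inputs are assumed to be proportions (0.80 rather than 80).

```python
def true_discovery_rate(power: float, confidence: float, win_rate: float) -> float:
    """Share of reported wins that are real, using the formula from the post.

    All inputs are proportions between 0 and 1 (assumption of this sketch).
    """
    if not 0 < win_rate <= 1:
        raise ValueError("win_rate must be in (0, 1]")
    numerator = power * (win_rate + confidence - 1)
    denominator = win_rate * (power + confidence - 1)
    return numerator / denominator


def false_discovery_rate(power: float, confidence: float, win_rate: float) -> float:
    """Share of reported wins that are illusory: 1 minus the True Discovery Rate."""
    return 1 - true_discovery_rate(power, confidence, win_rate)


# The scenario from the text: 80% power, 95% confidence, 10% win rate.
tdr = true_discovery_rate(power=0.80, confidence=0.95, win_rate=0.10)
fdr = false_discovery_rate(power=0.80, confidence=0.95, win_rate=0.10)
print(f"True Discovery Rate:  {tdr:.0%}")   # 53%
print(f"False Discovery Rate: {fdr:.0%}")   # 47%
```

The formula returns a negative number if the win rate is below 1 minus the confidence level, so treat results in that range as a sign that the inputs are inconsistent rather than as a real rate.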
So how can you increase that True Discovery Rate?
If you play around with the various elements in that formula, one of the first things you learn is that 80% power gives you nearly the same True Discovery Rate as 90%, 95%, and even 99% power. (Which is why 80% power is the industry standard for most technologies, I’m guessing!) In fact, with the win rate and confidence held fixed, True Discovery Rate falls very slightly as Power rises: at 95% confidence and a 10% win rate, it goes from about 53.3% at 80% power to about 52.7% at 99% power. Increasing power increases your likelihood of finding a significant difference, but with very large samples, sometimes those “significant” differences might not be important. It’s easy to find trends and correlations in data when you have a lot of data. So it’s important to adequately power your tests, but don’t over-power them either. Though if you have to make one error, err on the side of over-powering rather than under-powering. If you under-power the test, why are you even testing? You won’t be able to get a statistically valid read, and you’re just wasting time and resources.
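To see how little power moves the result, the formula can be evaluated at several power levels while holding confidence at 95% and the win rate at 10%. This is an illustrative sketch; the helper name `true_discovery_rate` is an assumption, not something from the original post.

```python
def true_discovery_rate(power, confidence, win_rate):
    """Formula from the post; inputs are proportions between 0 and 1."""
    return power * (win_rate + confidence - 1) / (win_rate * (power + confidence - 1))


# Hold confidence and win rate fixed; vary only power.
results = {p: true_discovery_rate(p, confidence=0.95, win_rate=0.10)
           for p in (0.80, 0.90, 0.95, 0.99)}

for power, tdr in results.items():
    print(f"power {power:.0%}: True Discovery Rate {tdr:.1%}")
```

Across the whole range, the True Discovery Rate changes by less than one percentage point, and it drifts down rather than up as power increases.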
However, lowering your confidence level from 95% to 90% (while maintaining the 10% win rate and 80% power) reduces the True Discovery Rate to 0%! At 90% confidence, false positives alone would be expected to produce a 10% win rate, so the formula leaves no room for real wins.
Of course, reducing your confidence level from 95% to 90% should increase your win rate simply by lowering your standards (and required runtime). So, if we lower confidence to 90% while increasing the win rate to 20%, we find a similar False Discovery Rate (43%, versus 47% at 95% confidence and a 10% win rate). Essentially, if you want to lower your confidence level to 90%, you’ll want to make sure you have a win rate of 20% or higher to ensure that more than half of your “wins” are actually real. Increase your confidence level to 99%, and you can afford a lower win rate of 10%.
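The trade-off between confidence and win rate can also be turned around: given a target True Discovery Rate, solve the formula for the smallest win rate that reaches it. The sketch below does that by rearranging the formula algebraically. The function `min_win_rate` and the symbol `alpha` (1 minus confidence) are introduced here for illustration only.

```python
def true_discovery_rate(power, confidence, win_rate):
    """Formula from the post; inputs are proportions between 0 and 1."""
    return power * (win_rate + confidence - 1) / (win_rate * (power + confidence - 1))


def min_win_rate(power, confidence, target_tdr):
    """Smallest win rate at which the formula reaches target_tdr.

    Obtained by solving the True Discovery Rate formula for win rate.
    """
    alpha = 1 - confidence
    return power * alpha / (power - target_tdr * (power - alpha))


# Break-even point where exactly half of the reported wins are real, at 80% power.
for confidence in (0.90, 0.95, 0.99):
    w = min_win_rate(power=0.80, confidence=confidence, target_tdr=0.50)
    print(f"confidence {confidence:.0%}: win rate of at least {w:.1%}")
```

At 90% confidence the break-even win rate comes out just under 18%, which lines up with the 20% rule of thumb in the text.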
Why it Matters
So why do we care? This is hard! Should we all just throw up our hands in frustration and find new jobs?
No! Of course not. As G.I.Joe reminds us, “Knowing is half the battle!” In our case, knowledge is power! And while power corrupts, and absolute power corrupts absolutely, power used responsibly makes the world a better place.
So, how should we use our newfound knowledge and power? We can apply this better understanding of error to our impact estimates. We can ensure our ROI math is adjusted by these potential error rates. We can calculate how many more tests we should run, and how we might want to adjust our standards to better mitigate (or not) each type of error. In short, we can apply our thinking brains to problems that historically we left to gut and heart, and we make smarter decisions and better recommendations on how to safely and responsibly use this data.
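As one small example of applying these error rates to a program summary, the sketch below estimates how many reported wins are likely to be real. The figure of 100 tests is a made-up illustration and `expected_real_wins` is a hypothetical helper. It says nothing about which wins are real or how large their true lift is.

```python
def expected_real_wins(tests_run, win_rate, power=0.80, confidence=0.95):
    """Return (reported wins, expected number of real wins).

    Uses the True Discovery Rate formula from the post; rates are proportions.
    """
    reported = tests_run * win_rate
    tdr = power * (win_rate + confidence - 1) / (win_rate * (power + confidence - 1))
    return reported, reported * tdr


# Hypothetical program: 100 completed tests with a 10% win rate.
reported, real = expected_real_wins(tests_run=100, win_rate=0.10)
print(f"Reported wins: {reported:.0f}, expected real wins: {real:.1f}")
```

With these inputs, roughly 5 of the 10 reported wins would be expected to hold up, which is worth keeping in mind before summing every win into an annual impact figure.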
We are all asked or required at some point in our careers to calculate the “impact” or “program ROI” of our experimentation efforts. Some companies even calculate an annualized revenue impact from every test they complete. And then they bake those numbers into the P&L. Those efforts typically last about one year and lead to multiple (challenging) conversations with finance and leadership teams, where you (the analyst) stumble about, trying to explain all the potential causes for why this or that win didn’t materialize post permanent implementation.
We point to seasonality, changes in the market, and the lack of a controlled environment. Sometimes we even point out that we cannot know that the permanent implementation is NOT providing a lift, one that has raised what would otherwise have been a dip up to flat performance!
Here’s the deal, though, and I hope this has been made clear throughout this post so it doesn’t come as too much of a shock (but maybe take a seat, just in case): responsibly, we cannot say any of those things.
Based on everything you’ve read above, hopefully, you understand why that might be. Even a result with 99% confidence means there is a 1% chance of a Type I error and, at 80% power, a 2–9% chance of a False Discovery (about 9% at a 10% win rate, falling to about 2% at a 30% win rate). False Discovery Rate goes down as Win Rate goes up.
Credit: Ton Wesseling
But are you satisfied with only half of your wins being “real”?
Would you like to feel confident your wins are real more than half of the time? Personally, I would aim for a True Discovery Rate of 70% or more; that’s about where you see the green appearing in Ton’s chart above. If your confidence level is set to 90%, you’ll want a win rate of at least 30%, which will still get you a True Discovery Rate of about 76%. If your confidence level is set to 95%, you can get away with a win rate of 20%, and your True Discovery Rate is 80%! Manage a win rate of 30% with 95% confidence, and you’ll win the lottery with a True Discovery Rate of close to 90%! But, note! Even with those much higher standards and results, roughly 10% of your wins are still not wins!*
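The figures quoted above can be checked by evaluating the formula over a small grid of confidence levels and win rates. This is a rough sketch assuming 80% power throughout. It is not a reproduction of Ton Wesseling's chart or calculator, and the names `grid` and `true_discovery_rate` are illustrative.

```python
def true_discovery_rate(power, confidence, win_rate):
    """Formula from the post; inputs are proportions between 0 and 1."""
    return power * (win_rate + confidence - 1) / (win_rate * (power + confidence - 1))


POWER = 0.80  # assumed fixed for the whole grid
confidences = (0.90, 0.95, 0.99)
win_rates = (0.10, 0.20, 0.30)

# True Discovery Rate for every (confidence, win rate) pair.
grid = {(c, w): true_discovery_rate(POWER, c, w)
        for c in confidences for w in win_rates}

# Header row of win rates, then one row per confidence level.
print(f"{'confidence':>10} " + "".join(f"{w:>10.0%}" for w in win_rates))
for c in confidences:
    print(f"{c:>10.0%} " + "".join(f"{grid[(c, w)]:>10.0%}" for w in win_rates))
```

Reading across a row shows the True Discovery Rate climbing as the win rate rises, and reading down a column shows the same for a stricter confidence level.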
Fine. But what if you want to figure out which of your wins are the “real” wins?
Um. Yeah. I can’t do that. And neither can you. So you best make peace with that.
But go to this post for some inner peace about it all: Don’t Run with Scissors! How to Safely Calculate Program ROI
*Find this really interesting but hate doing math by hand? Thanks to Ton Wesseling, there’s a calculator for that!