Don’t Run with Scissors! How to Safely Calculate Program ROI

Mar 16, 2020

Discovery Rate: What it is and Why it Matters

With thanks to Ton Wesseling for inspiring this post series with his brilliance and sagacious thought leadership.

The optimization industry uses part of the hypothesis-testing toolkit from science (p-values, t-tests, null hypotheses, etc.). But there’s more to optimization measurement than just these statistical standbys, including false and true discovery rates, magnitude errors, and sign errors. Over the next several blog posts, I’ll detail these important considerations, and here’s why: as a norm within the industry, these statistical measures are not well understood, nor are they being fed into our test or program impact calculations.


Intuitively, we know that we shouldn’t be putting annualized test lift estimates into our company’s annual P&L projection, but we don’t yet have a common language for communicating our hesitation. Currently, we don’t do a good job of helping business decision-makers understand how complex statistics affect the value of optimization programs.

We need to be better analysts and/or work more closely with data scientists so we don’t over-report our value. Over-reporting is a detriment to the entire industry and can erode trust and, consequently, budgets. The better we understand the statistical underpinnings of our programs, the better we understand the risks involved in communicating program ROI. If we understand the statistics and the risks, we’ll be better able to responsibly calculate and communicate program impact.

What’s a False Discovery Rate?

False Positive Rate (or FPR) is what you are probably most familiar with. This is what we’re controlling for with statistical significance levels and what we’re reporting when we say we have 95% “confidence” in a result. What we really mean is that the statistics tell us there is only a 5% chance we’ve made a false positive error. We calculate this (or our tools do) for each test we run.

In contrast, False Discovery Rate is a measure of the accuracy of your positive results in aggregate. Rather than using all of the truly null results as its denominator, as the False Positive Rate does, the FDR uses only the positive results (or “wins”) as the denominator, and it helps us understand how many of those wins are potentially false.

To break that down further, every test we run has four possible outcomes: True Negative (TN), False Negative (FN), True Positive (TP), and False Positive (FP). These outcomes can be most easily understood on a 2×2 grid like the one below:

Test Outcomes | “True” Effect | “True” Null (no effect)
Statistically Significant Result | True Positive (TP) | False Positive (FP), a Type I Error
Statistically Insignificant Result | False Negative (FN), a Type II Error | True Negative (TN)
Total | Total Effect | Total Null

 

Our columns represent an actual, unknowable “True” reality (i.e., the “reality” we’ve applied statistics to, which is not deterministic but probabilistic), while the rows represent our potential test outcomes. In the digital experimentation world, you can think of these test result possibilities as a statistically significant result vs. a statistically insignificant result.

Some of those statistically significant results are false positives, meaning we’ve made a Type I Error and the impact is actually no different from control. In medicine, this is similar to the Placebo Effect (i.e., some beneficial effect occurs, but the effect can’t be attributed to the properties of the introduced factor).


Sometimes, experiments report statistically insignificant results when there truly was an effect on our key performance indicator different from the control. These are false negatives, or Type II Errors. In medicine, these are the test results that tell us we don’t have the flu when all signs (and our doctor) say we do.

Did you know? The false negative rate for the flu test can be as high as 80%(!), and the false positive rate is very low. What are the implications of a high false negative rate? Since doctors will generally treat and prescribe based on flu symptoms without administering the test, the effects of false negatives are minimal and, therefore, not a concern. However, a high false positive rate could lead to overmedication, and over-reporting could have other unintended and dangerous consequences! In contrast, a high false negative rate for COVID-19 tests would be devastating! With no approved treatment and a high mortality rate, it’s extremely important that the COVID-19 test has a low false negative rate, or people will not know when they need to quarantine to protect those at highest risk. Which type of error you care about most in your experiment should be carefully considered. To read more about Type I and II errors and how to mitigate them, check out our Mythbuster blog post.

Our False Positive Rate, then, is simply the percentage of “True” null results that will be reported (inaccurately) as positive. In our 2×2 grid, we’re only looking at the “True” null column and calculating the number of False Positives divided by the column total. We use statistics to estimate the False Positive Rate (since we are not omniscient and have no real way to know the “Truth”). Typically, before we launch a test, we decide what false positive rate we’re comfortable with. In the experimentation industry, standards tend to be 95% or sometimes 90% confidence, which reflects a willingness to accept 5–10% false positives.

If False Positive Rate is the percentage of falsely reported positive results out of the total “True” null (no effect) outcomes, what then is False Discovery Rate? False Discovery Rate looks at the proportion of false positive results out of the total number of statistically significant results. So rather than looking at a percentage of the “True” null column, it looks at a percentage of the statistically significant results row (the number of False Positives divided by the total number of statistically significant results). Again, we cannot know the “truth,” so we use statistics to estimate the likelihood of our false positives.

There are two other measures we can look at: the True Positive Rate, also known as sensitivity, which is the percentage of true positive results out of the total number of actual “true” effects; and the True Negative Rate, also known as specificity, which is the percentage of “true” negatives out of the total number of null (no effect) results. Again, these “true” totals are unknowable, so we estimate them.
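
If it helps to see those definitions side by side, here is a minimal Python sketch (my own illustration, not from any particular testing tool) that computes all four rates from the counts in a 2×2 grid like the one above:

```python
def confusion_rates(tp, fp, fn, tn):
    """Estimate the four rates discussed above from the counts in a 2x2 grid.

    tp, fp, fn, tn are counts of true positives, false positives,
    false negatives, and true negatives. In practice these are estimates,
    since the "true" column totals are never directly observable.
    """
    return {
        # False Positive Rate: false positives out of the "true" null column
        "false_positive_rate": fp / (fp + tn),
        # False Discovery Rate: false positives out of the significant-result row
        "false_discovery_rate": fp / (fp + tp),
        # True Positive Rate (sensitivity): true positives out of the "true" effect column
        "true_positive_rate": tp / (tp + fn),
        # True Negative Rate (specificity): true negatives out of the "true" null column
        "true_negative_rate": tn / (tn + fp),
    }
```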

Let’s apply some numbers to our chart based on:

  • a company running 100 experiments per year
  • with a win rate of 16%
  • using a 95% confidence level to control false positive risk to ~5%
  • and using 80% power
Test Outcomes | “True” Effect | “True” Null (no effect) | Totals
Statistically Significant Result | True Positives: 12 | False Positives: 4 | 16
Statistically Insignificant Result | False Negatives: 3 | True Negatives: 81 | 84
Totals | 15 | 85 | 100

Our False Positive Rate in this scenario = 4/85, or approximately 5%.

Our False Discovery Rate then becomes = 4/16, or ~25%!

Want to learn how to calculate your False Discovery Rate?
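
Here is the arithmetic behind the example above as a minimal Python sketch. The inputs are the assumptions listed earlier (100 tests, 95% confidence, 80% power) plus the 15 “true” effects from the table, and the rounding is mine:

```python
# Rebuilding the example: 100 tests per year, 95% confidence (alpha = 0.05),
# 80% power, and (per the table above) 15 tests with a "true" effect.
tests = 100
alpha = 0.05                         # accepted false positive risk
power = 0.80                         # chance of detecting a "true" effect
true_effects = 15                    # unknowable in practice; assumed to match the example
true_nulls = tests - true_effects    # 85

true_positives = round(true_effects * power)       # 12
false_negatives = true_effects - true_positives    # 3
false_positives = round(true_nulls * alpha)        # ~4
true_negatives = true_nulls - false_positives      # 81

wins = true_positives + false_positives            # 16, i.e., the 16% win rate

false_positive_rate = false_positives / true_nulls   # 4 / 85, about 5%
false_discovery_rate = false_positives / wins        # 4 / 16 = 25%

print(f"FPR ~ {false_positive_rate:.0%}, FDR = {false_discovery_rate:.0%}")
```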

So, 5% of our “True” null (no effect) results will come back as “wins”. And 25% of our “wins” could still actually be nulls. That’s a real head scratcher, right?!

Why it Matters

So why do we care? This is hard! Should we all just throw up our hands in frustration and find new jobs?

No! Of course not. As G.I. Joe reminds us, “Knowing is half the battle!” In our case, knowledge is power! And while power corrupts, and absolute power corrupts absolutely, power used responsibly makes the world a better place.

So, how should we use our newfound knowledge and power? We can apply this better understanding of error to our impact estimates. We can ensure our ROI math is adjusted by these potential error rates. We can calculate how many more tests we should run, and how we might want to adjust our standards to better mitigate (or not) each type of error. In short, we can apply our thinking brains to problems that historically we left to gut and heart, and we can make smarter decisions and better recommendations on how to safely and responsibly use this data.

We are all asked or required at some point in our careers to calculate the “impact” or “program ROI” of our experimentation efforts. Some companies even calculate an annualized revenue impact from every test they complete. And then they bake those numbers into the P&L. Those efforts typically last about one year and lead to multiple (challenging) conversations with finance and leadership teams, where you (the analyst) stumble about, trying to explain all the potential reasons this or that win didn’t materialize after permanent implementation.

We point to seasonality, changes in the market, or the lack of a controlled environment. Sometimes we even point out that we cannot know that the permanent implementation is NOT providing a lift, one that has raised what would otherwise be a dip to level performance!

Here’s the deal, though, and I hope this has been made clear throughout this post so it doesn’t come as too much of a shock (but maybe take a seat, just in case): responsibly, we cannot say any of those things.

Based on everything you’ve read above, hopefully you understand why that might be. Even a result with 99% confidence means there is a 1% chance of a Type I error and a 2–9% chance of a False Discovery (depending on your win rate). False Discovery Rate goes down as Win Rate goes up.

[Visual: False Discovery Rate by win rate. Credit: Ton Wesseling]

Shoutout to Ton Wesseling for creating this lovely visual to help you see how this works. But why is it this way? Essentially, your significance level gives you the number of tests you might see as “positive” even if you only ran A/A or null experiments. 90% confidence? You would expect 10% of tests run to come back as a “win”, even when you didn’t have ANY winners. So a win rate of 10% with confidence set to 90% could potentially mean all of your “wins” are… well, not. So start with your confidence level. 95%? What is your annual test volume? 100? You would expect 5 false positives. So you need more than 5 wins to truly “win.” And the more wins you have, the more confident you can be that you were detecting some “true” effects. Neat, huh?
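
If you want to poke at that relationship yourself, here is a rough Python sketch of the same back-of-the-envelope logic (my own approximation of the idea behind the visual, not Ton Wesseling’s actual calculation). It uses the worst case, where every test is truly null, to bound how much of your win rate could be false:

```python
# Back-of-the-envelope version of the visual: at a fixed confidence level, the
# worst-case share of "wins" that are false shrinks as the win rate grows.
# Worst case = every test is truly null, so you still expect alpha * N chance "wins."

alpha = 0.05  # 95% confidence; try 0.01 for the 99%-confidence case mentioned above

for win_rate in (0.05, 0.10, 0.16, 0.25, 0.50):
    # Expected chance wins divided by observed wins, capped at 100%.
    worst_case_fdr = min(1.0, alpha / win_rate)
    print(f"win rate {win_rate:.0%}: FDR could be as high as {worst_case_fdr:.0%}")

# With alpha = 0.01, win rates of roughly 11% to 50% bound the FDR at about 2-9%,
# consistent with the range mentioned above.
```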

TLDR!
In an ideal world, we wouldn’t use statistics designed to answer yes-or-no questions to measure financial impact. We’re being asked to run with scissors, folks, and we need to mitigate some danger in our work. To be responsible analysts, and to mitigate the tendency to over-report, we need to multiply ROI by the True Discovery Rate (TDR, the share of wins that are likely real, i.e., 1 - FDR), and we should take the following steps:

 

  1. Talk about “potential” impacts instead of “estimated” impacts.
  2. Create a standard confidence level that we stick to for all tests.
  3. Pre-establish the runtime for each test based on our set standards.
  4. Keep the metadata about all tests in a single repository.
  5. Calculate TDR and use it as an adjustment factor for all ROI calculations (a rough sketch of this step follows the list).
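
As one hedged illustration of step 5, here is a minimal Python sketch; the function names and the simple multiply-by-TDR adjustment are my own shorthand under the definitions above, not a standard formula:

```python
# A minimal sketch of step 5: discount naively summed program ROI by the
# estimated True Discovery Rate (TDR = 1 - FDR). Names and the multiplicative
# adjustment are illustrative, not a standard.

def true_discovery_rate(estimated_false_positives: float, wins: int) -> float:
    """Estimated share of statistically significant results that are real."""
    return 1.0 - (estimated_false_positives / wins)

def adjusted_program_roi(raw_roi: float, estimated_false_positives: float, wins: int) -> float:
    """Scale the naively summed ROI by the estimated TDR."""
    return raw_roi * true_discovery_rate(estimated_false_positives, wins)

# Using the example program above: 16 wins, ~4 of them expected to be false positives.
print(adjusted_program_roi(raw_roi=1_000_000, estimated_false_positives=4, wins=16))  # 750000.0
```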

Still scratching your head and want some help? Reach out. We can help.
