Sample Size Calculation — Myth Buster Edition

May 20, 2018

Results interpretation is the bane of the optimization analyst’s existence.

We run a test, and we get ready to report the results. The first (and, often, only) question we know we will get from our stakeholders is:

“Are the results statistically significant?”

We know this is a dangerous question to answer, even as we sympathize with the stakeholder’s desire for a simple, concise, tangible, and clear response. For an analyst, the question can also be intimidating: there are a lot of moving parts and mind-bending concepts underlying A/B testing. Walking the tightrope between our stakeholders’ desire for simplicity and our intuition that there is a lot of underlying complexity can be paralyzing.

We believe that one way to perform that balancing act is to demystify some of those underlying concepts, and we’re embarking on the creation of a series of tools and posts to help do that. This first post has a companion sample size calculator (screen caps from the calculator are included in this post) and takes aim at some of the most common myths we have run into when it comes to test design. We would love to know what you think! Shoot me an email, find me on Twitter, or join the Test & Learn Community to share your feedback!

Calculating sample size before running your test is not always necessary

Reality: Calculating sample size is a crucial step in creating and prioritizing your idea roadmap. Without it, you have no sound way to determine the potential impact of your idea, nor can you validate the ability of your test to provide an answer to the question you’re asking in the time available for obtaining that answer.
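As a rough illustration of what such a calculation involves, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. It is not the formula behind the companion calculator or any vendor tool; the function names, the 5% baseline, the 10% lift, and the 4,000 daily visitors are all assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift,
                              confidence=0.95, power=0.80, two_sided=True):
    """Approximate visitors needed per variation (two-proportion z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope the variant reaches
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(visitors_per_variation, daily_visitors, variations=2):
    """Days needed if daily traffic is split evenly across all variations."""
    return ceil(visitors_per_variation * variations / daily_visitors)


# Hypothetical inputs: 5% baseline conversion rate, 10% relative lift.
n = sample_size_per_variation(baseline_rate=0.05, relative_lift=0.10)
print(f"{n:,} visitors per variation")
print(f"{days_to_run(n, daily_visitors=4000)} days at 4,000 visitors per day")
```

Running the numbers this way before a test goes on the roadmap shows whether the idea can be answered in the time you have.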

If the test “lift” is higher than the lift used for the sample size calculation, you can end the test early

Reality: This is tricky. To end a test earlier than originally planned, you must be checking the results repeatedly while it runs. Stopping as soon as one of those checks looks significant is often called “peeking,” a form of “p hacking,” and it increases the risk of false positives. There are ways to mitigate the increased error, though they are still not as statistically sound as running to the planned sample size. If you are willing and able to take that risk:

  • Look for a minimum of 3 days of consistent >X confidence, AND
  • Look for a flat or diverging cumulative difference trend

There are use cases where you might be okay with a false positive and just want to move faster. In those scenarios, the risk of a false positive may not concern you as much. By definition, a false positive means the new treatment is really neither better nor worse than the old treatment, so acting on it would not negatively impact the business.
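To see why repeated checking inflates false positives, the sketch below simulates A/A tests (no true difference between groups) and compares two stopping rules. It is a simplified model, not a reproduction of any testing tool: each day's standardized difference is assumed to be an independent draw from a standard normal distribution, and the 14-day length, simulation count, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(days=14, confidence=0.95, simulations=5000, seed=1):
    """Simulate A/A tests and count how often each stopping rule finds a 'win'.

    Each day's standardized difference between groups is drawn from N(0, 1),
    so the cumulative z-score after k days is sum(daily draws) / sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    any_look = 0        # significant at one or more daily checks
    planned_end = 0     # significant at the single planned final check
    for _ in range(simulations):
        total = 0.0
        crossed = False
        for day in range(1, days + 1):
            total += rng.gauss(0, 1)
            z = total / sqrt(day)
            if abs(z) > z_crit:
                crossed = True
        any_look += crossed
        planned_end += abs(z) > z_crit  # z here is the final-day value
    return any_look / simulations, planned_end / simulations


peeking, planned = false_positive_rates()
print(f"Stop at the first significant daily check: {peeking:.1%} false positives")
print(f"Check once at the planned end:             {planned:.1%} false positives")
```

In this model the single planned check stays close to the nominal 5% error rate, while stopping at the first significant daily check produces a noticeably higher one.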

Confidence is about “accuracy” of the specific lift %

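Reality: Confidence, as most testing tools report it, is simply the complement of the p-value (1 minus p). The p-value is the probability of seeing a difference at least as large as the one observed if group A and group B truly perform the same, so high confidence is evidence that the groups differ. It is NOT about your confidence in a specific AMOUNT of difference. The “lift %” is simply a measurement of the difference between the means of the groups’ distributions.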
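The distinction is easier to see when the lift, the p-value, and the confidence are computed side by side. The sketch below uses a pooled two-proportion z-test, which is one common approach and not necessarily what your testing tool uses; the function name and the visitor and conversion counts are made up for illustration. An interval around the difference is included because that, not confidence, describes the uncertainty in the amount.

```python
from math import sqrt
from statistics import NormalDist


def compare_groups(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return lift, p-value, confidence, and a 95% interval for the difference."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    diff = rate_b - rate_a

    # Lift is just the relative difference between the two observed rates.
    lift = diff / rate_a

    # Two-sided p-value from a pooled two-proportion z-test.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))

    # 95% interval for the absolute difference (unpooled standard error).
    se = sqrt(rate_a * (1 - rate_a) / visitors_a
              + rate_b * (1 - rate_b) / visitors_b)
    margin = NormalDist().inv_cdf(0.975) * se

    return {
        "lift": lift,
        "p_value": p_value,
        "confidence": 1 - p_value,
        "difference_interval": (diff - margin, diff + margin),
    }


# Hypothetical results: control 500/10,000, variant 560/10,000.
result = compare_groups(10_000, 500, 10_000, 560)
print(f"Observed lift: {result['lift']:.1%}")
print(f"Confidence:    {result['confidence']:.1%}")
low, high = result["difference_interval"]
print(f"Difference:    between {low:+.2%} and {high:+.2%} (absolute)")
```

The observed lift here is 12%, but the interval shows that the true difference could plausibly be much smaller or larger than that.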

You must have estimated lift (minimal detectable effect or MDE) to calculate sample size

Reality: While it is recommended to begin with a calculation of the minimum lift needed to make a decision, in certain scenarios your constraint may actually be the runtime rather than the minimum lift needed for decision-making. For example: if you need a lift of 5% to make a decision, but you only have 14 days to run the test, you may learn that the statistics will not be able to detect a 5% lift in that time period. If this happens, you must decide whether you believe you can achieve a higher lift % than you initially planned. This requires reviewing prior results from similar tests, if available, to see if this is realistic. Alternatively, you can revise the test target audience and the statistical constraints (power and confidence) you are using, or you can decide the test is not viable and look for alternate ways to get information to make a better decision.
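When runtime is the constraint, the calculation can be turned around: fix the traffic you will have and solve for the smallest lift the test could detect. The sketch below does this by bisection over a normal-approximation sample size formula. The function names, the 5% baseline, and the 2,000 visitors per day are assumptions for the example, not figures from the companion calculator.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(baseline_rate, relative_lift, confidence, power):
    """Approximate visitors per variation for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def minimum_detectable_lift(baseline_rate, visitors_per_variation,
                            confidence=0.95, power=0.80):
    """Smallest relative lift detectable with the traffic available."""
    # Upper bound keeps the variant's rate from exceeding 100%.
    low, high = 1e-4, (1 - baseline_rate) / baseline_rate
    for _ in range(60):
        mid = (low + high) / 2
        needed = required_sample_size(baseline_rate, mid, confidence, power)
        if needed > visitors_per_variation:
            low = mid    # this lift is too small to detect with this traffic
        else:
            high = mid   # detectable, so try a smaller lift
    return high


# Hypothetical: 14 days, 2,000 visitors per day, split across two variations.
available = 14 * 2000 // 2
mde = minimum_detectable_lift(baseline_rate=0.05, visitors_per_variation=available)
print(f"With {available:,} visitors per variation, the smallest detectable "
      f"lift is about {mde:.1%}")
```

Under these assumed inputs the detectable lift is well above 5%, which is exactly the situation described above: you either need to believe a larger lift is realistic, relax confidence or power, change the audience, or drop the test.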

95% confidence is “best practice” and should be used for all tests

Reality: 95% confidence is likely not needed for some (if not many) of your experiments. The confidence level controls how often your statistics will give you a false positive: at 95% confidence, about 5% of tests with no real difference will still report one. In other words, how often will your statistics tell you there is a difference between the performance of two groups when they are, actually, the same? In the scenario where they are the same, you may be willing to roll out a different experience even if it’s truly “flat” because you will not hurt your key metric(s). Some example scenarios:

  • Banner tests
  • Heading or copy tests
  • Content tests in general
  • Targeted or retargeted experiences (that do not require increased creation of content)
  • Streamlined pathing or tool functionality
  • Back-end func­tion­al­ity clean up

In all the examples provided above, the change will not hurt the front-end customer experience and MAY help with back-end operations by reducing complexity, cost, etc. In scenarios where the change would actually increase costs for the business (e.g., purchase of a new recommendations engine, complex or costly full implementation, high maintenance cost of new design/targeting, etc.), you will want high confidence that your “win” is truly a win. In all other cases, it may be more important to make a decision and move forward.

False positive errors are when the statistics report a “win” that is actually a “loss”

Reality: False positive really means “false difference,” positive or negative in direction. This is when the statistics tell us that there is a difference between the two groups when the reality is that they perform the same. Statistical confidence can be increased to decrease the risk of false positives (and is often set at 90–95% by default). See the examples above for why that may not always be necessary. (For a more in-depth explanation, check out the always enlightening Matt Gershoff’s blog post, “Do No Harm or AB Testing without P-Values”). The sample size calculator we’ve created allows you to adjust the confidence from 50% to 99% and has a companion visual that illustrates the impact of adjusting the confidence level.
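To get a feel for what adjusting confidence costs or saves, the sketch below sweeps the confidence level while holding everything else fixed. It uses a standard normal-approximation formula, not the companion calculator's own implementation; the function name, the 5% baseline, the 10% lift, and the 80% power are assumed values.

```python
from math import ceil
from statistics import NormalDist


def visitors_needed(baseline_rate, relative_lift, confidence, power=0.80):
    """Approximate visitors per variation for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical test: 5% baseline, 10% relative lift, 80% power.
levels = (0.50, 0.80, 0.90, 0.95, 0.99)
sizes = [visitors_needed(0.05, 0.10, confidence=c) for c in levels]

for confidence, n in zip(levels, sizes):
    false_positive_risk = 1 - confidence
    print(f"{confidence:.0%} confidence "
          f"({false_positive_risk:.0%} false positive risk): "
          f"{n:,} visitors per variation")
```

Each step up in confidence buys a lower false positive risk at the price of more visitors, which is the trade-off to weigh against how costly a false positive would actually be for the change being tested.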

False negatives are not as important to avoid as false positives

Reality: Many testing tools and online sample size calculators lock the statistical power at 80%, meaning they accept a 20% chance that the statistics will miss a true difference if one exists. (Many will not even let you adjust statistical power, only confidence, and we found one calculator that actually locks the statistical power at 50%!) In the medical industry, missing a true difference is less important than mistakenly believing drug X is better than a placebo (which could lead consumers to take a drug that will actually not help them). In the business world, however, we are testing specifically to detect a difference, and having high enough statistical power to do so is, therefore, very important.

We recommend keeping power at a minimum of 80% and considering adjusting upwards to see how you can increase your test’s ability to detect a difference in performance if one really exists. As with all risk reduction activities, there is a cost to increasing power: longer run time. However, low power settings (higher risk of false negatives) can decrease your “win rate” (the % of tests with statistically confident differences detected) to a point where your program is unable to provide a meaningful return on the testing investment. If your program sees a high percentage of “flat” or “inconclusive” results, consider adjusting your power settings higher when calculating sample sizes.

An analogy that may help make our point: setting power to 80% is like buying 100 raffle tickets and throwing 20 away before the winning tickets are read out.

The sample size calculator we’ve created allows you to adjust the statistical power from 50% to 99% and has a companion visual that illustrates the impact of that adjustment.
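The link between power, run time, and win rate can be sketched with simple arithmetic. The example below is a simplified model, not the calculator's logic: it assumes a hypothetical 30% of test ideas have a real effect as large as the lift planned for, that such effects are detected at a rate equal to the power, and that tests with no real effect still come up significant at a rate of one minus the confidence level.

```python
from math import ceil
from statistics import NormalDist


def visitors_needed(baseline_rate, relative_lift, confidence, power):
    """Approximate visitors per variation for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def expected_win_rate(share_with_real_effect, power, confidence):
    """Share of tests expected to report a statistically confident difference.

    Assumes real effects are as large as the planned lift, so they are
    detected at a rate equal to power; the rest are false positives.
    """
    alpha = 1 - confidence
    return share_with_real_effect * power + (1 - share_with_real_effect) * alpha


# Hypothetical program: 5% baseline, 10% lift, 95% confidence,
# and 30% of ideas having a real effect.
for power in (0.50, 0.80, 0.90, 0.95):
    n = visitors_needed(0.05, 0.10, confidence=0.95, power=power)
    win = expected_win_rate(0.30, power, confidence=0.95)
    print(f"{power:.0%} power: {n:,} visitors per variation, "
          f"expected win rate {win:.1%}")
```

Under these assumptions, raising power lengthens each test but lifts the share of tests that reach a confident result, which is the cost and benefit described above.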

I must use the sample size calculator provided by my testing tool vendor

Reality: While the sample size calculator provided by your tool vendor should be tied to the same statistical foundation as the tool’s results readout, you are not tied to only that calculator. There are many free online calculators to choose from, including ours. If you use a different calculator than the one provided by your tool vendor, you will need to either manually calculate the actual confidence (statistical significance) of your result OR rely on the original settings of your sample size calculation. For example, suppose you entered a 10% lift with 85% confidence in your calculator and saw that you needed 3,500 visitors per variation, assuming a starting conversion rate of 5% for the control group. Your results would then meet at LEAST your minimum 85% confidence if you had 3,500 visitors per variation, your control converted at 5%, and your variant was at least 0.5 percentage points higher or lower (10% of 5%). You would then report the mean difference and show that the difference met the standard of confidence without needing to calculate the actual statistical significance / p-value. We recommend the latter approach for ease and speed.

As we approached the creation of this calculator, we studied the available free calculators provided by various vendors and commonly shared on testing forums and within testing communities. There are many excellent examples that are easy to use and provide a quick answer to the question of how many visitors you need or how long your test must run. However, we noticed that few provided the flexibility we needed to pull the various levers available to us to account for different types of tests, different levels of risk tolerance, and different needs of the business.

As consultants, we see many different types of companies across all verticals running hundreds of different types of tests with thousands of different problems they’re trying to solve, all with varying degrees of risk tolerance. Because of this, there is no such thing as “best practice” in sample size calculation settings. Instead, we recommend using the right tool for each problem in each scenario for each business goal and risk tolerance level. For that reason, we needed a sample size calculator that was more flexible than those currently available.

We reached out to several leaders in the world of digital analytics and statistics, including Matt Gershoff, Dr. Elea Feit, and Dr. Isabelle Bauman (a huge thank you to all of them for their patience and willingness to work with us to make sure we understood the concepts behind our decisions). As we created this tool for our own purposes and for our clients, we realized that maybe it could serve a dual purpose: as a literal tool for calculating sample sizes, but also as a way to educate stakeholders and bust a few myths along the way.

So dive in! Play with all the different inputs and sliders and see how changing each will impact sample size and/or run time. Compare the results from this calculator with the sample sizes calculated by the tool you currently use. Use the calculator to educate your stakeholders and make decisions. And, as always, if you have questions or just want to chat about all things optimization, reach out! We’d love to hear from you. And please don’t hesitate to reach out with feedback and recommendations for 2.0!