Sample Size Calculation — Myth Buster Edition
Results interpretation is the bane of the optimization analyst’s existence.
“Are the results statistically significant?”
We know this is a dangerous question to answer, even as we sympathize with the stakeholder’s desire for a simple, concise, tangible, and clear response. And, as an analyst, the question can be intimidating: there are a lot of moving parts and mind-bending concepts underlying A/B testing, and walking the tightrope trying to balance our stakeholders’ desire for simplicity with our intuition that there is a lot of underlying complexity can be paralyzing.
We believe that one way to perform that balancing act is to demystify some of those underlying concepts, and we’re embarking on the creation of a series of tools and posts to help do that. This first post has a companion sample size calculator (screen caps from the calculator are included in this post) and aims at some of the most common myths we have run into when it comes to test design. We would love to know what you think! Shoot me an email, find me on Twitter, or join the Test & Learn Community to share your feedback!
Calculating sample size before running your test is not always necessary
Reality: Calculating sample size is a crucial step in creating and prioritizing your idea roadmap. Without it, you have no sound way to determine the potential impact of your idea, nor can you validate the ability of your test to provide an answer to the question you’re asking in the time available for obtaining that answer.
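To make the calculation concrete, here is a minimal sketch of the standard two-proportion sample size formula. It is not the formula behind our calculator or any vendor's tool; the function names, the two-sided test, the 5% baseline, the 10% relative lift, and the 2,000 daily visitors are all assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per variation (two-sided, two-proportion test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    # z-scores for the chosen confidence (two-sided) and power
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variation, daily_visitors, variations=2):
    """Days needed when daily traffic is split evenly across all variations."""
    return ceil(n_per_variation * variations / daily_visitors)


# Hypothetical inputs: 5% baseline conversion rate, 10% relative lift
n = sample_size_per_variation(0.05, 0.10)
print(n, "visitors per variation,", days_to_run(n, daily_visitors=2000), "days")
```

Knowing the run time an idea needs up front is what lets you rank it against the other ideas on your roadmap.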
If the test “lift” is higher than the lift used for the sample size calculation, you can end the test early
Reality: This is tricky. To end a test earlier than originally planned, you have to check the results repeatedly while the test is running and stop as soon as they look good. This practice is known as “peeking” (a form of “p hacking”), and it increases the risk of false positives. There are ways to mitigate the increased error. They are still not as statistically sound as running to the planned sample size, but if you are willing/able to risk it:
- Look for a minimum of 3 days of consistent >X confidence AND
- Look for a flat or diverging cumulative difference trend
There are use cases where you might be okay with a false positive and just want to move faster. In those scenarios, the risk of a false positive may not be something you’re as concerned about (by definition — a false positive means you would make a decision that would not negatively impact the business because the reality is that the new treatment is no better — or worse — than the old treatment).
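To see how much repeated checking can inflate false positives, the following rough simulation runs A/A tests (both groups share the same true conversion rate) and compares two decision rules: stop the first time any daily check crosses the confidence threshold, or look only once at the end. Every parameter here is a hypothetical choice, and daily conversions are drawn from a normal approximation to keep the sketch short.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_aa_tests(n_tests=2000, days=14, daily_n=500, rate=0.05,
                      confidence=0.95, seed=1):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    daily_mean = daily_n * rate
    daily_sd = sqrt(daily_n * rate * (1 - rate))  # normal approximation to binomial
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0.0
        crossed_on_some_day = False
        for day in range(1, days + 1):
            conv_a += rng.gauss(daily_mean, daily_sd)
            conv_b += rng.gauss(daily_mean, daily_sd)
            n = day * daily_n  # cumulative visitors per group
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(2 * pooled * (1 - pooled) / n)
            z = (conv_b - conv_a) / n / se
            if abs(z) > z_crit:
                crossed_on_some_day = True
        peeking_hits += crossed_on_some_day   # would have stopped early at some point
        fixed_hits += abs(z) > z_crit         # significant at the planned end only
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with one final look: {fixed_rate:.1%}")
```

In this setup, the single final look should land near the nominal 5% and the daily-peeking rule well above it. That gap is the extra risk you accept when you stop early.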
Confidence is about “accuracy” of the specific lift %
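Reality: Confidence, as most testing tools report it, is simply the complement of the p-value (1 − p). It speaks to whether group A is truly different from group B at all: the higher the confidence, the less likely it is that a gap this large would show up if the two groups actually performed the same. It is NOT about your confidence in a specific AMOUNT of difference. The “lift %” is simply a measurement of the difference between the means of the groups’ distributions.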
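One way to see the distinction is to put an interval around the lift itself. The sketch below uses a simple normal-approximation interval on the difference in conversion rates and expresses it relative to the control rate. The function name and the visitor and conversion counts are hypothetical, and treating the control rate as fixed when converting to relative lift is a simplification.

```python
from math import sqrt
from statistics import NormalDist


def lift_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Observed relative lift of B over A, with an approximate interval around it."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error of the difference between two proportions
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    low, high = diff - z * se, diff + z * se
    # Simplification: divide by the observed control rate as if it were exact
    return diff / rate_a, low / rate_a, high / rate_a


# Hypothetical result: control 500/10,000 (5.0%), variant 570/10,000 (5.7%)
observed, low, high = lift_interval(500, 10_000, 570, 10_000)
print(f"observed lift {observed:.1%}, plausible range {low:.1%} to {high:.1%}")
```

With these counts the interval stays above zero, so the difference clears 95% confidence. But it runs from a lift of a few percent to well over 20%, so the observed 14% is far from pinned down.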
You must have estimated lift (minimal detectable effect or MDE) to calculate sample size
Reality: While it is recommended to begin with a calculation of the minimum required lift needed to make a decision, in certain scenarios, your constraint may actually be the runtime rather than the minimum lift needed for decision-making. For example: if you need a lift of 5% to make a decision, but you only have 14 days to run the test, you may learn that the statistics will not be able to detect a 5% lift in that time period. If this happens, you must make a decision as to whether you believe you can achieve a higher lift % than you initially planned. This requires reviewing prior test results of similar type tests — if available — to see if this is realistic. Alternatively, you can revise the test target audience and the statistical constraints (power and confidence) you are using, or you can decide the test is not viable and look for alternate ways to get information to make a better decision.
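When run time is the constraint, you can turn the calculation around and ask what lift the available traffic can detect. The sketch below does that under stated assumptions: a two-sided test, the baseline variance used for both groups (an approximation), and a hypothetical 1,000 visitors per day split evenly across two variations.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_lift(baseline_rate, n_per_variation, confidence=0.95, power=0.80):
    """Approximate smallest relative lift detectable with the given sample size."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Approximation: both groups are assumed to have the baseline variance
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variation)
    absolute_effect = (z_alpha + z_beta) * se
    return absolute_effect / baseline_rate


# Hypothetical constraint: 14 days at 1,000 visitors per day, two variations
n_per_variation = 14 * 1000 // 2
mde = minimum_detectable_lift(0.05, n_per_variation)
print(f"smallest detectable lift in 14 days: {mde:.1%}")
```

With these made-up inputs the detectable lift comes out around 20%, far above the 5% needed for a decision. That is exactly the situation where you have to revisit the expected lift, the audience, the statistical settings, or the viability of the test.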
95% confidence is “best practice” and should be used for all tests
Reality: 95% confidence is likely not needed for some (if not many) of your experiments. The confidence level controls how often your statistics will give you a false positive: at 95% confidence, you accept roughly a 5% chance of one. In other words, how often will your statistics tell you there is a difference between the performance of two groups when they are, actually, the same? In the scenario where they are the same, you may be willing to roll out a different experience even if it’s truly “flat” because you will not hurt your key metric(s). Some example scenarios:
- Banner tests
- Heading or copy tests
- Content tests in general
- Targeted or retargeted experiences (that do not require increased creation of content)
- Streamlined pathing or tool functionality
- Back-end functionality clean up
In all the examples provided above, the change will not hurt the front-end customer experience and MAY help with back-end operations by reducing complexity, cost, etc. In scenarios where the change would actually increase costs for the business (e.g., purchase of a new recommendations engine, complex or costly full implementation, high maintenance cost of new design/targeting, etc.), you will want high confidence that your “win” is truly a win. In all other cases, it may be more important to make a decision and move forward.
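To get a feel for what a lower confidence level buys you, the sketch below compares the sample size needed at several confidence levels against the 95% default. It relies on the fact that, in the standard two-sided formula, sample size scales with the squared sum of the confidence and power z-scores. The function name and the 80% power setting are assumptions for illustration.

```python
from statistics import NormalDist


def relative_sample_size(confidence, power=0.80, ref_confidence=0.95, ref_power=0.80):
    """Sample size at the given settings as a multiple of the reference settings."""
    def multiplier(conf, pw):
        z_alpha = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided
        z_beta = NormalDist().inv_cdf(pw)
        return (z_alpha + z_beta) ** 2
    return multiplier(confidence, power) / multiplier(ref_confidence, ref_power)


for confidence in (0.80, 0.85, 0.90, 0.95, 0.99):
    ratio = relative_sample_size(confidence)
    print(f"{confidence:.0%} confidence: {ratio:.2f}x the 95% sample size")
```

Under these assumptions, dropping from 95% to 80% confidence cuts the required sample size by roughly 40%, which translates directly into a shorter run time for low-risk tests.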
False positive errors are when the statistics report a “win” that is actually a “loss”
Reality: False positive really means “false difference” — positive or negative in direction. This is when the statistics tell us that there is a difference between the two groups when the reality is that they perform the same. Statistical confidence can be increased to decrease the risk of false positives (and is often set at 90–95% by default). See the above for examples of why that may not always be necessary. (For a more in-depth explanation, check out the always enlightening Matt Gershoff’s blog post, “Do No Harm or AB Testing without P-Values”). The sample size calculator we’ve created allows you to adjust the confidence from 50% to 99% and has a companion visual that illustrates the impact of adjusting the confidence level.
False negatives are not as important to avoid as false positives
Reality: Many testing tools and online sample size calculators lock the statistical power at 80%, meaning they limit the chance of the statistics missing a true difference, if one exists, to 20%. Many will not even let you adjust statistical power, only confidence, and we found one calculator that actually locks the statistical power at 50%! In the medical industry, missing a true difference is less important than mistakenly believing drug X is better than a placebo (which could lead consumers to take a drug that will actually not help them). In the business world, however, we are testing specifically to detect a difference, and having high enough statistical power to do so is, therefore, very important.
We recommend keeping power at a minimum of 80% and considering adjusting upwards to see how you can increase your test’s ability to detect a difference in performance if one really exists. Just like all risk reduction activities, there is a cost to increasing power: longer run time. However, low power settings (higher risk of false negatives) can decrease your “win rate” (the % of tests with statistically confident differences detected) to a point where your program is unable to provide a meaningful return on the testing investment. If your program sees a high percentage of “flat” or “inconclusive” results, consider adjusting your power settings higher when calculating sample sizes. An analogy that may help make our point: Setting power to 80% is like buying 100 raffle tickets and throwing 20 away before the winning tickets are read out.
The sample size calculator we’ve created allows you to adjust the statistical power from 50% to 99% and has a companion visual that illustrates the impact of that adjustment.
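The link between power and win rate can be sketched with some back-of-the-envelope arithmetic. The function below is a deliberately simplified model, not something taken from our calculator. It assumes a hypothetical share of test ideas that truly move the metric, and it assumes each of those real effects is exactly as large as the lift used in the sample size calculation, so that the stated power applies.

```python
def expected_win_rate(share_with_real_effect, power, confidence=0.95):
    """Expected share of tests that report a statistically confident difference."""
    alpha = 1 - confidence
    detected_real_effects = share_with_real_effect * power
    false_positives = (1 - share_with_real_effect) * alpha
    return detected_real_effects + false_positives


# Hypothetical program: 30% of ideas truly change the metric
for power in (0.50, 0.80, 0.95):
    rate = expected_win_rate(0.30, power)
    print(f"power {power:.0%}: expected win rate {rate:.1%}")
```

In this toy model, moving from 50% to 80% power raises the expected win rate from 18.5% to 27.5% without changing a single test idea.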
I must use the sample size calculator provided by my testing tool vendor
Reality: While the sample size calculator provided by your tool vendor should be tied to the same statistical foundation as the tool results readout, you are not tied to only that calculator. There are many free online calculators to choose from, including ours. If you use a different calculator than the one provided by your tool vendor, you will need to either manually calculate the actual confidence (statistical significance) in your result OR rely on the original settings of your sample size calculation. For example, suppose you entered a 10% lift with 85% confidence into your calculator, assuming a starting conversion rate of 5% for the control group, and saw that you needed 3,500 visitors per variation. If you then collected 3,500 visitors per variation, your control converted at 5%, and your variant came in at least 0.5 percentage points higher or lower (a 10% relative difference), your result would meet at LEAST your minimum 85% confidence. You would then report the mean difference and show that the difference met the standards of confidence without needing to calculate the actual statistical significance / p-value. We recommend the latter approach for ease and speed.
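If you do go the manual route, one common way to compute the confidence is a two-sided, two-proportion z-test. The sketch below is that generic textbook test, which may not match the statistics your testing tool uses, and the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def observed_confidence(conv_a, n_a, conv_b, n_b):
    """Confidence (1 - p-value) from a two-sided, pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


# Hypothetical result: control 1,000/20,000 (5.0%), variant 1,100/20,000 (5.5%)
confidence = observed_confidence(1000, 20_000, 1100, 20_000)
print(f"observed confidence: {confidence:.1%}")
```

Because the test is two-sided, the same confidence is reported whether the variant is ahead or behind. Check the direction of the difference separately.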
As we approached the creation of this calculator, we studied the available free calculators provided by various vendors and commonly shared on testing forums and within testing communities. There are many excellent examples — easy to use and that provide a quick answer to your question of how many visitors I need or how long my test must run. However, we noticed there were not many that provided the flexibility we needed to allow us to pull the various levers available to us to adjust to account for different types of tests, different levels of risk tolerance, and different needs of the business. As consultants, we see many different types of companies across all verticals running hundreds of different types of tests with thousands of different problems they’re trying to solve — all with varying degrees of risk tolerance. Because of this — there is no such thing as “best practice” in sample size calculation settings. Instead — we recommend using the right tool for each problem in each scenario for each business goal and risk tolerance level. For that reason — we needed a sample size calculator that was more flexible than those currently available. We reached out to several leaders in the world of digital analytics and statistics including Matt Gershoff, Dr. Elea Feit, and Dr. Isabelle Bauman (a huge thank you to all of them for their patience and willingness to work with us to make sure we understood the concepts behind our decisions). As we created this tool for our own purposes and for our clients, we realized that maybe it could serve a dual purpose: as a literal tool for calculation of sample sizes — but also as a way to educate stakeholders and bust a few myths along the way.
So dive in! Play with all the different inputs and sliders and see how changing each will impact sample size and/or run time. Compare the results from this calculator with the sample sizes calculated by the tool you currently use. Use the calculator to educate your stakeholders and make decisions. And — as always — if you have questions or just want to chat about all things optimization — reach out! We’d love to hear from you. And please don’t hesitate to reach out with feedback and recommendations for 2.0!