Common Mistakes When Running A/B Tests
There are a few. When running a test, there are multiple places where a misstep can put the test, and the program, in jeopardy:
- Peeking at your test. If you have set up a beautiful fixed horizon A/B test, but then you peek for results and call it before the sample size is reached, you have now lost all the benefit of your planning and statistical rigor. Your “result” now means nothing. Bummer. (Of course, if you are using a sequential testing method, peeking can be done safely!)
- Not checking the tracking. There is a certain amount of risk with every test—risk to the site, the customer experience, and, most importantly, the business. If proper oversight is not taken before and during the test, it could cause harm. And worse, it could go unnoticed for quite some time. Tracking is also how you collect the data needed to understand your results, so if it isn’t there, the test can tell you nothing. The only bad outcome of a test is having nothing to learn from. This mistake leads to re-fielding tests, loss of time, and harm done to the business.
- Not getting buy-in from the business. If stakeholders don’t have trust in a test, its results, or the program, then they will not be inclined to take action from your learnings.
These mistakes are not only harmful to your work and business, but they are wildly inefficient for your entire program.
Monitoring: Huzzah for the Solution!
The good news is that we have one solution to help combat all three common mistakes… Monitor your test!
An important note about monitoring: it is only as effective as your planning stage is rigorous. The more detail and thought put into the test planning phase, the easier the rest of the test process becomes. It saves time by informing exactly what you need to watch for.
While the test is running, utilize your analytics tool to keep an eye on the health of your test. If one of our goals in testing is to mitigate the business risk of just rolling out a change, then we should make sure we are actively on the lookout for unforeseen issues.
What are we monitoring for exactly, you ask? There are four main areas to watch: sample size and SRM (sample ratio mismatch), guardrail metrics, operational metrics, and any other stakeholder/business concerns. Notice this list does not contain “determine if your test reached statistical significance and is a winner,” because monitoring is NOT where you analyze your outcome.
Monitor sample size and SRM
Do this to focus on the statistical health of the test. In our planning phase, we determined the amount of traffic that would qualify for our experiment, which allowed us to estimate how many days or weeks it would take to reach sample size and end our test. But there is no guarantee that traffic will arrive at the rate we estimated, so we may reach sample size earlier or later than planned. This is why it is so important to track your sample size!
All statistical tests are designed with a set of assumptions, and if those assumptions are not met, then the result of the test does not apply to your real-world scenario. One of those assumptions is the sample ratio between test variations (we suggest using equal splits).
The sample ratio should always be checked at the end of a test. However, if you see a large discrepancy mid-test (and find a statistically significant SRM error), you need to investigate the cause: technical issues with tracking or the testing tool, or an underlying bias. The cause of an SRM error almost always invalidates the test results, because it suggests there are lurking variables that have not been controlled for, so your sample is not truly randomized. This indicates bias in your results and prevents you from drawing conclusions about the effect of your treatment on your primary KPI.
Monitor any guardrail metrics
Hopefully you outlined guardrail metrics in your definition of success when planning the test. These are secondary to the primary KPI you designed your statistical test for, and they add important business considerations when understanding if the test is truly a winner. They are also in place to plan out what criteria would cause a test to be ended immediately.
For example, maybe your primary KPI is clicks to ‘Contact Us,’ and we see a 50% lift in our Challenger! But when we look at the guardrail metric of clicks to ‘Subscribe to our emails,’ we see a 35% drop in Challenger for five days straight. That may be too big of a hit to that part of our business to justify Challenger getting rolled out to production, and the test may need to be stopped. Those types of decisions need to be made prior to the test running, so we can properly monitor the situation. Guardrail metrics should have thresholds set in the test plan based on what the business can handle before it becomes an issue.
Monitor operational metrics
The difference between these and guardrail metrics is that operational metrics are used to ensure the test is running as expected, and that the tracking for data collection is healthy as well.
These metrics are chosen based on tracking surrounding the area of the site you’re testing. This way, if we see a metric drop to zero or spike for one or multiple variations, we know exactly where to go to look for an issue. Tracking this also informs us of any caveats in our analysis, or if the test needs to be refielded because of too much lost/inaccurate data. On the flip side, this can also help bolster confidence in a test’s result, since we would have been consistently checking in and making sure nothing of concern occurred during the test.
The final items to include in your monitoring are any specific worries or questions from your stakeholders. These may fall into the guardrail or operational metric groups, but it is beneficial to your program and analysis to call them out explicitly. It shows your stakeholders you have heard their concerns and are actively watching them, and it helps gain their trust and buy-in to your program!
Test Monitoring in Practice
Now that we have discussed why we should monitor our tests and what exactly we should be looking for, we need to talk about the where and when of monitoring.
The where is your analytics tool! Utilize the visualizations and sharing ability of your tool by creating a dashboard. This makes it easy to give everyone involved visibility into the live test, and it gives one centralized place for this information to live. Having one source of truth also aligns how people talk about the test by keeping all your details and definitions clearly documented. Again, when you give people this visibility, it helps them feel heard and more confident in the process.
But how about the when? This dashboard should be built prior to the test going live. All the conversations that happen in the test planning phase should facilitate the dashboard’s design to make sure the entire business case is understood and all concerns acknowledged. This should also be used when the test is QA’d prior to launch to make sure all operational metrics are checked before launch and that the dashboard isn’t missing any critical metric or scenario. And finally, it should be heavily used while the test is live! Set up alerts for important thresholds to ensure that if mass chaos occurs it won’t be missed and the test can be stopped immediately. The main purpose here is risk mitigation.
This practice of sharing out a single dashboard will go a long way in building relationships with stakeholders, and getting more interest in experimentation at your organization.
So let’s look at a real example:
For a B2B software company, the team we directly worked with was in charge of a recent site redesign. They had designed the new site to be built out in modules, which made the layout really flexible. After rolling out the initial setup, though, they had plans to start experimenting with the modules to determine the optimal layout. Since their site’s main goal was to generate leads for their sales team, they wanted to optimize on users signing up for a call or meeting.
What we observed
After launch, they had run a Digital Behavior Analysis to get a read on how users were interacting with the initial setup. Two major observations came out of the analysis. 1.) More visitors were filling out a lead form if they had viewed one of the Product pages than if they had viewed an About Us page, and 2.) The module hosting the Product page links on the homepage was getting twice as much engagement as the About Us module.
When we looked at the layout of the homepage, where both of these modules were, we saw that the Products module was actually below the fold of the initial page load and under the About Us module. So, knowing that it was already getting better engagement and seemed to lead to more forms being submitted, our test idea was born: Flip the position of the modules on the homepage to drive more traffic to the Product pages and ultimately increase the leads generated.
Utilizing our analytics tool
We used Adobe Analytics and Adobe Target for our test, but this approach can be applied to any stack!
The first panel is your overview panel. This is where you want to state the dashboard’s purpose, making sure it is clear that this is not where any outcome analysis is being done. Then, you want to include all the important details for understanding the rest of the dashboard: the change being tested, the audience and location of the test, your defined primary KPI, the estimated duration it will run, and the sample size that needs to be reached. You should also call out any specific guardrails or scenarios you are monitoring, so readers can understand what is included in the dashboard.
Monitoring sample size and SRM
The next panel focuses on the stats. This is where we monitor the health of our traffic split to the test, the sample size reached, and the primary KPI.
The most important parts of this panel are the check that traffic is split evenly across our test variations and the bullet chart showing progress toward the necessary sample size.
Our test was estimated to run 4 weeks, but it ended up taking just over 5 weeks. We ended up running a test during the part of the summer where traffic slowed, and this YoY seasonality hadn’t been used in our estimation. Good thing we checked and didn’t stop it on day 28!
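As a rough illustration of why the progress check matters, here is a minimal sketch that projects how many more days a test needs at the recent traffic rate. The function name and all numbers are hypothetical, chosen only to mirror a plan of 4 weeks that slips when traffic slows; they are not the figures from our test.

```python
import math

def days_remaining(visitors_so_far, required_sample_size, recent_daily_visitors):
    """Estimate how many more days are needed at the recent traffic rate."""
    remaining = max(required_sample_size - visitors_so_far, 0)
    if remaining == 0:
        return 0
    if recent_daily_visitors <= 0:
        raise ValueError("recent_daily_visitors must be positive")
    return math.ceil(remaining / recent_daily_visitors)

# Hypothetical plan: 28,000 visitors in 28 days (1,000 per day).
# Hypothetical reality: traffic slowed to 800 per day, so day 28 has 22,400.
required = 28_000
seen_by_day_28 = 22_400

print(f"{seen_by_day_28 / required:.0%} of sample size reached")
print(days_remaining(seen_by_day_28, required, 800), "more days needed")
```

With these made-up numbers, stopping on day 28 would have ended the test at 80% of the planned sample, with about a week of traffic still to collect.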
When looking at the primary KPI, we look at the lift and the total volume difference between Challenger and Control. We have this to keep an eye on what is going on in the test, but we don’t draw our conclusion from it! There are two reasons for that: 1.) It would be peeking if we used this to call a winner before sample size was reached, and 2.) This view is NOT a t-test, so it cannot tell us whether the observed lift reflects a true, statistically significant difference between variations.
We also checked the split of our test traffic and always saw an equal balance between the traffic to each variation, which is a good sign. But, as always, we used the Chi-Squared test to confirm that the difference in our variation sample sizes was not statistically significant, so we could move forward with test analysis.
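For readers who want to see the mechanics, below is a minimal sketch of a chi-squared goodness-of-fit check for a two-variation split, using only the Python standard library. The function name, the visitor counts, and the 0.001 cutoff are illustrative assumptions, not values from our test or from any particular tool. With two variations there is one degree of freedom, so the p-value can be computed with the complementary error function.

```python
import math

def srm_check(control_visitors, challenger_visitors,
              expected_control_share=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test for a two-variation traffic split.

    Returns (chi2, p_value, srm_detected). Valid for two variations only
    (1 degree of freedom). `alpha` is an assumed cutoff; set your own.
    """
    total = control_visitors + challenger_visitors
    expected_control = total * expected_control_share
    expected_challenger = total * (1 - expected_control_share)
    chi2 = ((control_visitors - expected_control) ** 2 / expected_control
            + (challenger_visitors - expected_challenger) ** 2 / expected_challenger)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

# Hypothetical counts: a small imbalance, then a suspicious one.
print(srm_check(5_050, 4_950))   # small chi2, large p-value: no SRM flagged
print(srm_check(5_000, 5_500))   # large chi2, tiny p-value: SRM flagged
```

A flagged result does not say what went wrong; it only says the split is unlikely to be chance, which is the cue to investigate tracking, the testing tool, or bias before analyzing anything else.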
Monitoring guardrail metrics
In the next panel, we want to set up the monitoring of our guardrail metrics. Always make the dashboard easy to read. Add text boxes and utilize the visualization headers to clearly call out what is being monitored, and where that lives in this panel.
For our test, we had guardrails set up around the CTR of the About Us module and the scroll depth on the homepage to make sure we didn’t see too much cannibalization of traffic (since we were trying to increase traffic to the Products module). The thresholds on both the 100% scroll depth and the CTR of the About Us module were determined in our Test Plan. The CTR of the module in Challenger could not drop below 55% over the entire test period, and the 100% scroll depth of the homepage could not drop below 35% over the entire test period.
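Whole-period floors like these are simple enough to write down as data and check mechanically. The sketch below assumes hypothetical metric names and hypothetical observed values; only the 55% and 35% floors come from the test plan described above.

```python
# Floors from the test plan described above; metric names are hypothetical.
GUARDRAIL_FLOORS = {
    "about_us_module_ctr": 0.55,
    "homepage_100pct_scroll_rate": 0.35,
}

def breached_guardrails(observed, floors=GUARDRAIL_FLOORS):
    """Return the guardrail metrics whose observed value is below its floor."""
    return [name for name, floor in floors.items() if observed[name] < floor]

# Hypothetical Challenger values, cumulative over the test period so far.
challenger_to_date = {
    "about_us_module_ctr": 0.58,
    "homepage_100pct_scroll_rate": 0.33,
}
print(breached_guardrails(challenger_to_date))
```

Keeping the floors in one place next to the dashboard definitions makes it harder for the thresholds people discuss to drift from the ones agreed on in planning.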
Note that for some guardrail metrics you may have a threshold that is time-bound and not just for the entire test period. It could be that your test has caused a large enough drop in an important site conversion for 3 days in a row, and because of the decisions you made during test planning, you have to stop the test. Again, all of these criteria are set in the planning phase and agreed upon based on business requirements.
For example, the third guardrail metric was the Lead Form Completion Rate. Since the site’s main goal is to get new prospects to fill out the form requesting a call, we had to make sure our test didn’t hurt this activity. The business said they would not be willing to see more than a 5% decrease in the Challenger’s Completion Rate for more than 3 days in a row. With the help of alerts, we were confident that this threshold was never passed in our test.
Those types of metrics are perfectly suited to have an alert set up! This way, if a metric drops below your threshold on a single day, you and the entire team can be notified, making it even easier to make sure your drop and time limits are not exceeded. These alerts are an extra safeguard to make sure nothing is missed and that the proper team members are notified in real time.
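A time-bound rule such as "no more than a 5% decrease for more than 3 days in a row" can be sketched as a streak counter over daily rates. This is only an illustration of the logic, not how any analytics tool implements alerts. It assumes the decrease is measured relative to Control on the same day, and the daily rates are made up.

```python
def longest_breach_streak(control_rates, challenger_rates, max_relative_drop=0.05):
    """Longest run of consecutive days where Challenger's rate is more than
    `max_relative_drop` below Control's rate (relative to Control)."""
    longest = current = 0
    for control, challenger in zip(control_rates, challenger_rates):
        if control > 0 and (control - challenger) / control > max_relative_drop:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a healthy day (or no Control data) resets the streak
    return longest

def should_stop(control_rates, challenger_rates, max_relative_drop=0.05, max_days=3):
    """True once the drop has persisted for MORE than `max_days` days in a row."""
    streak = longest_breach_streak(control_rates, challenger_rates, max_relative_drop)
    return streak > max_days

# Hypothetical daily Lead Form Completion Rates.
control = [0.040, 0.040, 0.040, 0.040, 0.040, 0.040]
challenger = [0.039, 0.037, 0.036, 0.037, 0.041, 0.036]
print(longest_breach_streak(control, challenger))  # three bad days, then a recovery
print(should_stop(control, challenger))
```

In this made-up series the drop lasts three days and then recovers, so the rule is not triggered; a fourth consecutive bad day would trigger it.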
Monitoring operational metrics
Lastly, we want an easy way to check that our web analytics tracking is working properly. This panel is here to quickly check that nothing has completely dropped off or spiked suspiciously. These visuals make it easy to quickly see any red flags that would warrant us to investigate further. For example, if we had seen overall module clicks on the homepage tank, we would be worried something with the deployment of the test had caused that tracking to break. That type of scenario would be detrimental to the test and is something we would want to be aware of immediately in the hopes of fixing it sooner rather than later.
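The same "did anything tank or spike?" eyeball check can be approximated in code. The sketch below compares each day's count with the median of the preceding days and flags large moves. The window and both factors are arbitrary assumptions to tune, and the click counts are invented. In practice you would run it separately per variation.

```python
from statistics import median

def flag_anomalies(daily_counts, window=7, drop_factor=0.5, spike_factor=3.0):
    """Flag days where a tracking metric falls far below or rises far above
    the median of the preceding `window` days."""
    flags = []
    for day, count in enumerate(daily_counts):
        history = daily_counts[max(0, day - window):day]
        if not history:
            continue  # nothing to compare the first day against
        baseline = median(history)
        if baseline <= 0:
            continue  # no meaningful baseline yet
        if count < drop_factor * baseline:
            flags.append((day, "drop"))
        elif count > spike_factor * baseline:
            flags.append((day, "spike"))
    return flags

# Hypothetical daily homepage module clicks for one variation.
module_clicks = [410, 395, 402, 388, 0, 0, 1450]
print(flag_anomalies(module_clicks))
```

A flag here is a prompt to go look at the deployment and tagging, not a verdict; it says nothing about which variation is winning.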
By ensuring the health of your live test, you are able to build trust in your results and wider program. Getting buy-in from stakeholders is made easier with increased visibility to testing. But remember, this is not where we draw any conclusions! We are not analyzing results, we are simply ensuring that the results are clean enough to be analyzed and it’s safe to continue running the experiment.
Would you like an opportunity to hear more about this and ask questions about how to make this work for your experimentation program? Be sure to watch the September 29th A/B Test Monitoring Workspace SDEC session!