Question: Four test subjects walk into a bar. The 1st wears a shirt and shoes. The 2nd wears a shirt but no shoes. The 3rd wears shoes but no shirt. The 4th wears no shirt and no shoes. Which ones got a drink?
Answer: Unfortunately, none of them accepted cookies, so we’ll never know.
Let’s talk about multivariate testing. But first, let’s address a few issues of semantics.
A/B/n vs. Multi-variant vs. Multivariate Experiments
A/B/n indicates a normal A/B test with additional (n) test variations to be compared, typically to the control. At some point, someone thought the notion of “multiple variants” should be shortened to “multi-variant,” and a baffling term was born. It’s really only used by a handful of digital experimentation people who don’t mind confusing others. Neither of these terms represents what is referred to as multivariate testing.
Multivariate Experiments are Factorial Experiment Designs
The term “multivariate test” or “MVT” was probably coined by an experimentation tool vendor since it has little basis in the scientific community from which our experimentation methods originate. Ask a statistics professor about MVT, and you’ll probably embark on a conversation about Multivariate Analysis, which isn’t technically off-topic, but likely wouldn’t help you on your way to understanding the concept at hand.
What you will find ample literature on is something called “Factorial Design” or “Factorial Experiments,” not to be confused with “Factor Analysis,” which is an entirely separate line of statistical procedures. But while we may have marketers to thank for the apparent misnomer, it’s not shocking to learn that the field of statistics failed to produce a lexicon fit for a sales pitch!
Factors and Levels
Factorial experiment design focuses on efficiently assessing how multiple types of interventions influence a metric. For us, it answers questions like, “Which of these page elements significantly influence conversion rate: the headline, the CTA, the positioning statements, or the customer testimonials?” Each element is called a factor, and each factor may contain multiple variations, referred to as levels. So the factor CTA, or even CTA button color, will have levels: green, blue, orange, etc.
An MVT consists of two or more factors and two or more levels within each factor. In classic use cases, factors are often binary (person has or does not have diabetes) or numerical (temperature or IQ). So if the language seems awkward, it’s because it wasn’t explicitly created for headline and hero image variants.
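To make the vocabulary concrete, here’s a quick R sketch that enumerates every factor-level combination in a small design (the factor names and levels are purely illustrative, not from any real test):
# Three hypothetical factors, each with its own levels
design <- expand.grid(
  cta_color = c("green", "blue", "orange"),  # factor 1: three levels
  headline = c("control", "variant"),        # factor 2: two levels
  testimonials = c("absent", "present")      # factor 3: two levels
)
nrow(design) # 3 x 2 x 2 = 12 cells in the full factorial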
What’s MVT Good For?
The beauty of an MVT is that it allows you to study these factors both independently and in combination, simultaneously, and to do so pretty efficiently. The result is knowledge about which factors exhibit significant variation in performance across levels. The challenge is in selecting levels for testing that are likely to manifest whatever variation in performance may exist.
When a significant difference in performance is identified, we call this an Effect. In Factorial Experiments, there are a few different types of effects that, depending on your interest, can have substantial implications on your experiment design and sample planning.
Main Effects
Main effects refer to effects observed within single factors regardless of combinations with other factors. So, main effects help you understand things like, How much does having testimonials on the page impact conversion rate? Or, Which image, if any, lifts conversion rate? MVTs are incredibly effective in answering questions like these, even with many factors. That’s because the sample gets recycled to study each factor independently without any increase in sample size.
You might argue that the best use for MVT testing is to study main effects. After all, once you determine that a particular factor plays a significant role in conversions, you would want to spend a lot more time optimizing that factor/feature over others that do not influence performance to the same degree.
Interaction Effects
Interaction effects are effects observed when factors are studied in combination with one another. But these effects are tiered. Secondary (interaction) effects refer to interactions between two factors, tertiary between three factors, and so on. Before running an MVT, you must decide on the degree to which you want to study interactions. Each layer of interaction increases your sample size requirements exponentially.
On Sample Sizes
When people talk about MVT tests being very costly from a sampling perspective, they’re referring to the cost of measuring all interactions. Take this scenario:
| Factor | Description | Levels |
| --- | --- | --- |
| 1 | Color | A-Green, B-Blue |
| 2 | Size | A-Small, B-Big |
| 3 | Shape | A-Round, B-Square |
To measure only main effects, you can plan your sample as if you were testing 1A vs. 1B because the sample gets recycled for 2A vs. 2B and 3A vs. 3B. However, take note that you actually need to plan around the lowest minimum detectable effect (MDE) among the factors. If 3B represents a fairly trivial change and you’ve planned for a 10% MDE, you might be underpowering your test, i.e., collecting a sample size too small to detect the impact of such a minor change.
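As a rough sketch of that planning step, base R’s power.prop.test covers the main-effects case; the baseline rate and MDE below are assumptions chosen purely for illustration:
# Size the main-effects sample as a simple two-group comparison,
# planned around the smallest MDE among the factors (here, a 5%
# relative lift on an assumed 50% baseline conversion rate)
power.prop.test(
  p1 = 0.50,         # assumed baseline conversion rate
  p2 = 0.50 * 1.05,  # baseline plus the smallest relative MDE
  sig.level = 0.05,  # Type 1 error rate
  power = 0.80       # 1 minus the Type 2 error rate
)
# The n in the output is the required sample per side, e.g., all A-level
# traffic pooled vs. all B-level traffic pooled for a given factor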
To measure secondary effects, you need to plan your sample as if you were testing 1A-2A vs. 1A-2B vs. 1B-2A vs. 1B-2B, so four variants. Your sample still gets recycled to compare combinations of factors 1 and 3 against each other, as well as factor 2 and 3 combinations. But notice, we’re only looking at interactions between two factors at a time.
On the other hand, tertiary effects require you to plan sampling around all 1-2-3 combinations, or eight variants (2x2x2). The number of variants for sample size planning would grow to 16 if you added another factor and intended to measure all possible interaction effects.
A data scientist might plan a sample size by doing power analysis on simulated data—an iterative process that requires some coding. For our purposes, we’ll probably be fine using our standard test planning tools under the heuristic explained above—even better if we adjust for multiple comparisons.
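For the curious, here’s a minimal sketch of what that simulation-based power analysis might look like in R for a two-factor design; every cell probability below is invented for illustration:
# Estimate power to detect one two-way interaction via simulation
set.seed(42)
cells <- expand.grid(color = c("blue", "green"), size = c("big", "small"))
n_per_cell <- 1000
# Assumed true conversion rates, with an extra lift when green meets small
p_true <- c(0.50, 0.52, 0.52, 0.57)
detected <- replicate(500, {
  conversions <- rbinom(4, n_per_cell, p_true)
  sim_data <- cbind(cells, rate = conversions / n_per_cell, n = n_per_cell)
  fit <- glm(rate ~ color * size, family = "binomial",
             data = sim_data, weights = n)
  summary(fit)$coefficients["colorgreen:sizesmall", "Pr(>|z|)"] < 0.05
})
mean(detected) # share of simulated tests flagging the interaction = power
If that share falls below your power target, increase n_per_cell and rerun.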
Choosing a Degree of Interaction to Measure
In classic factorial experiments, researchers typically assume that complex interactions between many factors do not exist and that there are probably only main effects and a few low-order interactions. So you might power a test for secondary effects and ignore possibilities beyond that. But that is up to you to decide. Just know that MVT shines in detecting main and secondary effects but becomes very demanding of traffic volumes when measuring interactions beyond that.
Analyzing MVTs
You can’t perform the same type of statistical test you use for A/B on MVT data and expect to get the level of insight you’ve (hopefully) designed the test for. But, you do have options. You could use either Analysis of Variance (ANOVA) or Linear Regression methods—logistic regression for conversion rates—to detect main and interaction effects with nominal Type 1 error rates (p-values!). These can be done with surprisingly few lines of code in R or Python, but most free online calculators do not provide the flexibility needed for MVT analysis.
Your test data probably looks something like this, with each test cell or variant combination in its own row:
| Variant | Color | Size | Shape | Conversion Rate | Traffic |
| --- | --- | --- | --- | --- | --- |
| 1 | green | small | round | 0.59 | 1000 |
| 2 | green | small | square | 0.55 | 1000 |
| 3 | green | big | round | 0.55 | 1000 |
| 4 | green | big | square | 0.50 | 1000 |
| 5 | blue | small | round | 0.55 | 1000 |
| 6 | blue | small | square | 0.50 | 1000 |
| 7 | blue | big | round | 0.55 | 1000 |
| 8 | blue | big | square | 0.50 | 1000 |
Here’s what the code might look like in R to do the analysis via logistic regression. It is set up to measure main effects and secondary interaction effects, but not tertiary effects, in a 2³ design (three factors with two levels each). Remember, not only are tertiary effects expensive to measure from a sampling perspective, they are generally thought to be rare.
# Put your data in a table object
test_data <- data.frame(
# Treatment levels for the first factor: color
f1_color = c("green","green","green","green","blue","blue","blue","blue"),
# Treatment levels for the second factor: size
f2_size = c("small","small","big","big","small","small","big","big"),
# Treatment levels for the third factor: shape
f3_shape = c("round","square","round","square","round","square","round","square"),
# Conversion rates for each variation
conversion_rates = c(.59,.55,.55,.5,.55,.5,.55,.5),
# Sample sizes for each variation
traffic = c(1000,1000,1000,1000,1000,1000,1000,1000)
)
# Run logistic regression on the data using a generalized linear model
# The exponent ^2 used below indicates the level of interaction effects
# we want to measure. Here we’re measuring secondary effects.
test_analysis <- glm(
conversion_rates ~ (f1_color + f2_size + f3_shape)^2, family = "binomial",
data = test_data,
weights = traffic
)
library(jtools) # An R package that pretties up regression model outputs
summ(test_analysis, model.info = FALSE, model.fit = FALSE)
plot_summs(test_analysis)
Interpreting Results
Here’s what the output looks like:
There are a lot of numbers here. If you’ve seen these types of tables before, you might know that the Est. column typically holds the estimated effect sizes, and the p column holds the p-values. In this case, the effects are not easy to interpret because they are presented as log odds ratios. But the p-values are the same as always.
But first, why are the names so weird? Each coefficient represents the effect of going from an unlisted baseline level to the listed level(s). So f2_sizesmall represents the main effect of size, going from big to small. That effect is negative, although with a very high (insignificant) p-value. The next item, f3_shapesquare, also displays a negative effect, but with a very low p-value of 0.01. By inference, this tells us that round shapes are clearly better than square ones.
The accompanying plot shows the effect of each factor/level along with a 95% confidence interval. If the p-values themselves weren’t clear, the plot certainly is, showing that in addition to f3_shapesquare having a clear negative effect, f1_colorgreen:f2_sizesmall shows a significant positive one. This is a secondary effect, and it means that combining the green color with the small size is better than any other combination of those two factors.
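If log odds ratios feel too abstract, a common follow-up, sketched here against the model object fitted above, is to exponentiate the coefficients into plain odds ratios:
# Convert log odds ratios into odds ratios; values above 1 mean the
# listed level (or combination) increases the odds of converting
exp(coef(test_analysis))
# Wald-type 95% confidence intervals, also on the odds ratio scale
exp(confint.default(test_analysis))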
MVT Benefits (vs. A/B)
MVTs are very useful for determining how page elements (factors) influence your KPI relative to one another; plus, they demonstrate which “levels” work best. If there are performance dependencies with certain levels of other factors, you’ll learn that too. This can be valuable information when it comes to determining what to spend your time optimizing.
An A/B or A/B/n test, on the other hand, will simply tell you whether a specific variation performs better than the control and by how much (within a range). In fact, as long as you’re not set on measuring higher-order interactions, an MVT actually gets you more insight at a lower cost than an A/B/n. In statistical parlance, it generates a “pooling benefit” whereby you can answer two related questions with less data than you would have needed to gather to answer them separately. MVT is the BOGO of optimization.
A Practical Example
We’d like our website to retain visitors with engaging content and appealing design. To that end, we’ve come up with a number of ideas for optimizing our pages and selected one of our most popular posts to test them on. Here’s what we’re thinking:
- Change the masthead from a dark to a light background
- Change the article image to a video
- Move article suggestions from the side rail to the content body
Here’s the problem: at least two of these changes, if successful, would be rolled out site-wide, so we want to be fairly certain that each specific change is driving an improvement. We could run this as an A/B test with all the variables mushed together, except we wouldn’t know which variables drove the difference. We could run it as a sequence of tests and study each variable in turn, but what if it’s the combination of video and in-body article suggestions that really makes the difference, and we test one without the other?
Instead, we decided to run a 2³ MVT, with each variable as a factor with two levels, and to measure main and secondary effects only. We’ll be able to do this in the same time it would take to run an A/B/C/D test, and at the end of it we should know whether any of the variables impact performance individually and whether any rely on each other to boost performance. Note: we will not learn whether all three variables mutually rely on each other to boost performance, only whether any two combined make a difference.
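For reference, here’s a minimal sketch of how that design might be laid out in R; the factor and column names are made up to mirror the three ideas above:
# The eight cells of the 2^3 design for the article page test
article_design <- expand.grid(
  masthead = c("dark", "light"),           # idea 1
  media = c("image", "video"),             # idea 2
  suggestions = c("side_rail", "in_body")  # idea 3
)
nrow(article_design) # 2 x 2 x 2 = 8 variants to run
# Once the test concludes, attach each cell's observed retention rate and
# traffic, then fit the same kind of model as before, capped at two-way
# interactions:
# glm(retention_rate ~ (masthead + media + suggestions)^2,
#     family = "binomial", data = article_results, weights = traffic)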
Conclusion
This has been an introduction to multivariate testing, or MVT, as we like to say. Hopefully, it has been a demystifying one, as the level of confusion surrounding this test type seems to be extreme. MVT is little more than a fancy repackaging of Factorial Design Experiments powered by ANOVA or Linear Regression.
We didn’t touch on partial (or fractional) factorial design, another term inspiring more fascination within the A/B testing community than it probably deserves, again thanks largely to technology vendors. It just refers to testing a representative-but-incomplete set of factor-level combinations. But based on what you now know about the lower-than-hyped sample size requirements of most MVTs, partial factorial designs are usually unnecessary.