As AB testing becomes more commonplace, companies are starting to move beyond thinking about how best to run experiments and to consider how best to set up and run experimentation programs. Unless the required time, effort, and expertise are invested in designing and running the AB testing program, experimentation is very unlikely to be useful. We have shared ideas about this already, for example here and here.
Interestingly, one can find excellent advice on how to get the most out of experimentation in a paper published almost 45 years ago by George Box. If that name rings a bell, perhaps it is because you have heard the famous line attributed to Box: “All models are wrong, but some are useful.” In the very same paper where this phrase appears, we can discover some guiding principles for running a successful experimentation program.
Thinking along with Box
In 1976 Box published “Science and Statistics” in the Journal of the American Statistical Association. In it he discusses what he considers to be the key elements to successfully applying the scientific method.
Why might this be useful for us? Because in a very real sense, experimentation and AB testing programs are the way we apply the scientific method to business decisions. They are how companies DO science. So learning about how to best employ the scientific method directly translates to how we should best set up and run our experimentation programs.
In his paper, Box argues that the scientific method is made up, in part, of the following:
1) Motivated Iteration
2) Flexibility
3) Parsimony
4) Selective Worry
According to Box, the attributes of the scientific method can best be thought of as “motivated iteration in which, in succession, practice confronts theory, and theory, practice.” He goes on to say that, “Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models [and] to worry selectively …”.
Let’s look at what he means in a little more detail and how it applies to an experimentation program.
Learning and Motivated Iteration
According to Box, learning occurs through the iteration between theory and practice. Experimentation programs are a way to formalize a process for continuous learning about marketing messaging, customer journeys, product improvements, or any number of other ideas/theories.
Box: “[L]earning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice. Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory. Deductions made from the modified theory now may or may not be in conflict with fact, and so on.”
Like the scientific method, experimenting with ideas naturally requires BOTH a theory about how things work AND the ability to collect facts/evidence that may or may not support that theory. By theory, in our case, we could mean an understanding of what motivates your customer, why they are your customer and not someone else’s, and what you might do to ensure that they stay that way.
Many times marketers purchase technology and tools in an effort to better understand their customers. However, without a well-formulated experimentation program, they are missing one half of the equation. The main takeaway is that just having AB testing and other analytics tools is not going to be sufficient for learning. It is vital for YOU to also have robust theories about customer behavior and what customers are likely to care about.
The theory is the foundation and drives everything else. It is through the iterative process of guided experimentation, which in turn feeds back into our theory and so on, that we establish a system for continuous learning.
Flexibility

Box: “On this view, efficient scientific iteration evidently requires unhampered feedback. In any feedback loop it is … the discrepancy between what tentative theory suggests should be so and what practice says is so that can produce learning. The good scientist must have the flexibility and courage to seek out, recognize, and exploit such errors … . In particular, using Bacon’s analogy, he must not be like Pygmalion and fall in love with his model.”
Notice the words that Box uses here: “unhampered” and “courage.” Just as inflexible thinkers are unable to consider alternative ways of thinking, and hence never learn, so it is with inflexible experimentation programs.
Just having a process for iterative learning is not enough. It must also be flexible. By flexible, Box doesn’t only mean it must be efficient in terms of throughput, but must also allow for ideas and experiments to flow unhampered, where neither influential stakeholders nor the data science team holds too dearly to any pet theory.
People must not be afraid of creating experiments that seek to contradict existing beliefs, nor should they fear reporting any results that do.
Parsimony

Box: “Since all models are wrong, the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary, following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.”
This is where the “All Models are Wrong” saying comes from! I take this to mean that rather than spend effort seeking the impossible, we should instead seek what is most useful and actionable: “How useful is this model or theory in helping to make effective decisions?”
In addition, we should try to keep analysis and experimental methods as simple as the problem requires. Companies can often get distracted, or worse, seduced by a new technology or method that adds complexity without advancing the cause. I am not saying that more complexity is always bad, but whatever the solution is, it should be the simplest one that can do the job.
That said, the ‘job’ may really be one of signaling rather than solving a specific task: for example, differentiating a product or service as more ‘advanced’ than the competition, regardless of whether it actually improves outcomes. It is not for me to say whether those are good enough reasons for making something more complex, but I do suggest being honest about it and going forward with eyes wide open.
Selective Worry

Box: “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”
This is my favorite line from Box. Being “alert to what is importantly wrong” is perhaps the most fundamental and yet most underappreciated analytic skill. It is vital, not just in building an experimentation program but in any analytics project, to be able to step back and ask, “While this isn’t exactly correct, will it matter to the outcome, and if so, by how much?”
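One way to operationalize “alert to what is importantly wrong” is to measure how much a questionable assumption actually moves the answer before worrying about it. Here is a minimal sketch, with simulated data and all numbers invented for illustration: does assuming equal variances in a t-test (Student’s t vs Welch’s t) change the conclusion of a typical AB comparison?

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-visitor revenue; the variant has a slightly higher mean
# and a slightly different variance (all numbers invented for illustration).
control = rng.normal(loc=1.00, scale=0.30, size=5000)
variant = rng.normal(loc=1.02, scale=0.36, size=5000)

# Same question under two assumptions:
# equal variances (Student's t) vs unequal variances (Welch's t)
p_student = stats.ttest_ind(variant, control, equal_var=True).pvalue
p_welch = stats.ttest_ind(variant, control, equal_var=False).pvalue

print(f"Student's t (equal variances assumed): p = {p_student:.4f}")
print(f"Welch's t (no equal-variance assumption): p = {p_welch:.4f}")
# If both p-values land on the same side of your decision threshold,
# the equal-variance assumption is a mouse here, not a tiger.
```

The same "vary the assumption, check the decision" move applies to priors, outlier rules, or metric definitions: if the answer barely moves, stop worrying about it.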
Is this a mouse or is this a tiger?
Of course, whether something is a mouse or a tiger depends on the situation and context. That said, in general, at least to me, the biggest tiger in AB Testing is fixating on solutions or tools before having defined the problem. Companies can easily fall into the trap of buying, or worse, building a new testing tool or technology without having thought about:
1) exactly what they are trying to achieve;
2) the edge cases and situations where the new solution may not perform well; and
3) how the solution will operate within the larger organizational framework.
As for the mice, they are legion. They have nests in all the corners of any business, and when we spot them we often adopt their own strategy, rushing from one approach to another in the hopes of not being caught. Here are a few of the ‘mice’ that have scampered around AB Testing:
1) One Tail vs Two Tails (eek! A two-tailed mouse – sounds horrible);
2) Bayes vs Frequentist;
3) Fixed vs Sequential designs;
4) Full Factorial Designs vs Taguchi (perhaps the biggest mouse).
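To see why the first of these is a mouse, consider that for a t-test the one-sided and two-sided p-values differ only by a factor of two when the observed effect is in the hypothesized direction. A hedged sketch with invented conversion data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented conversion data: baseline around 5%, variant around 6%
control = rng.binomial(1, 0.05, size=20_000)
variant = rng.binomial(1, 0.06, size=20_000)

# The same comparison, framed one-tailed vs two-tailed
two_sided = stats.ttest_ind(variant, control).pvalue
one_sided = stats.ttest_ind(variant, control, alternative="greater").pvalue

print(f"two-sided p: {two_sided:.5f}")
print(f"one-sided p: {one_sided:.5f}")
# With the variant ahead, the one-sided p is half the two-sided p.
# A decision that hinges on that factor of two was fragile to begin with.
```

If halving the p-value is what flips your call, the real issue is an underpowered or borderline test, not the number of tails.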
There is a pattern here: all of these mice dressed up as tigers tend to be features or methods introduced by vendors or agencies as new and improved, frequently over-selling their importance and implying that some existing approach is ‘wrong’.
It isn’t that there aren’t often principled reasons for preferring one approach over another (except for maybe Taguchi MVT—I’m not sure that is ever really useful for online testing). In fact, often, all of them are useful depending on the problem. It is just that none of them will be what makes or breaks a program’s usefulness.
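The Bayes-vs-frequentist mouse can be made concrete the same way. In the hedged sketch below (invented conversion counts, flat Beta(1,1) priors as an assumption), the Bayesian probability that the variant is better and the frequentist one-sided p-value tell essentially the same story on the same data: the posterior probability is approximately 1 minus the one-sided p.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented conversion counts for illustration
a_conv, a_n = 520, 10_000   # control: 5.2% conversion
b_conv, b_n = 590, 10_000   # variant: 5.9% conversion

# Bayesian view: flat Beta(1,1) priors give Beta posteriors;
# estimate P(variant better) by Monte Carlo sampling
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=200_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=200_000)
prob_b_better = (post_b > post_a).mean()

# Frequentist view: one-sided two-proportion z-test
p_pool = (a_conv + b_conv) / (a_n + b_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / a_n + 1 / b_n))
z = (b_conv / b_n - a_conv / a_n) / se
p_one_sided = stats.norm.sf(z)

print(f"P(variant better | data): {prob_b_better:.3f}")
print(f"1 - one-sided p-value:    {1 - p_one_sided:.3f}")
# With flat priors and this much data, the two summaries nearly coincide.
```

Informative priors or optional stopping can open a real gap between the two, but for a straightforward fixed-sample test like this, the framing rarely changes the decision.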
The real value in an experimentation program is the people involved, not some particular method or software. Don’t get me wrong: selecting the software and statistical methods most appropriate for your company matters, but it isn’t sufficient.
I think what Box says about the value of the statistician should be top of mind for any company looking to run experimentation at scale:
“Fisher’s work gradually made clear that the statistician’s job did not begin when all the work was over—it began long before it started. …[The Statistician’s] responsibility to the scientific team was that of the architect with the crucial job of ensuring that the investigational structure of a brand new experiment was sound and economical.”
So too for companies looking to build experimentation into their workflows. It is the experimenter’s responsibility to ensure that each experiment is both sound and economical. And it is the larger team’s responsibility to provide an environment and process—in part by following Box—that will encourage their success.