We’ve all heard of the one-armed bandit: the slot machine. The consummate victor engineered to best all players. But cue the lone tumbleweed and standoff music, because it’s high noon, and a stranger’s come to town.

Enter Machine Learning’s sinister-sounding solution: The multi-armed bandit. He’s not going to play the slots like any old two-armed, one lever-pullin’ Joe—ooooh, no. The Multi-armed Bandit is going to pull all the levers—that’s all the levers at once—and he’s going to learn quickly which machines pay fastest and when they’re paying fastest, and he’s going to play those particular machines and win(!), until he’s won everything (everything!), until the barkeep goes home crying and the casino is a penniless wasteland.

This aspect of the bandit problem has important implications. It eliminates our paradigm of measuring against a control, because a multi-armed bandit doesn’t test against an existing control, it tests among every experience.

Typically, in standard A/B/n or MVT experimentation, we measure the lift of a winner against a control, but with the bandit, you cannot say, “x% lift translates to y dollars annualized”; all you can say is, “this is your best performer.” Multi-armed-bandit type experimentation might help us shift the conversation from “how much can we expect?” to “what works best?” In this way, we’ll be able to transfer the precious skull sweat we currently use calculating upside to actually creating optimal user experiences.

Further, with the multi-armed bandit approach, you cannot say with statistical confidence what your worst performer is in a given experiment. Why? Because the bandit has greedily shifted traffic away from the non-winning experiences in an effort to exploit the gains from the higher performers. That means the lower-performing recipes will not have the statistical power (sample size) necessary to measure which is worst.
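This traffic-shifting behavior is easy to see in simulation. Below is a minimal, illustrative sketch of one common bandit strategy (Beta-Bernoulli Thompson sampling); the conversion rates are made up for demonstration, and this is not any particular vendor's implementation:

```python
import random

# Hypothetical true conversion rates for experiences A, B, and C.
true_rates = [0.05, 0.15, 0.04]

# Beta(1, 1) priors: one entry per arm.
alpha = [1, 1, 1]  # successes + 1
beta = [1, 1, 1]   # failures + 1
pulls = [0, 0, 0]  # how many visitors each experience received

random.seed(42)
for _ in range(10_000):
    # Draw a plausible conversion rate for each arm from its posterior,
    # then send this visitor to the arm with the best draw.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = samples.index(max(samples))
    pulls[arm] += 1
    # Simulate whether that visitor converts, and update the posterior.
    if random.random() < true_rates[arm]:
        alpha[arm] += 1
    else:
        beta[arm] += 1

print(pulls)  # the winning arm (B) ends up with the bulk of the traffic
```

After 10,000 simulated visitors, the winner has thousands of samples while the losers have comparatively few, which is exactly why you can't confidently rank the losers against each other.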

Why should you care which is worst? Well, maybe you don’t. Normally you wouldn’t. But you certainly couldn’t calculate the opportunity cost of choosing the losing variant, as many programs try to do in an effort to measure and report program ROI and the “savings” from not rolling out a subpar experience. Again, this shifts the way we think and communicate about our program. I would say that this shift is a good move—away from calculating ROI and toward focusing on the very thing our programs are typically named after: optimization.

The Minimum Detectable Effect Conundrum (aka My Former FAQ)

One of the most common questions I get from my clients is around runtime calculations: “But, how do I know what my minimum detectable effect (MDE, minimum lift, etc.) should be?” I have the same answer every time: “How big a lift would you need to see to go with the new experience?” I normally get blank stares, which leads to me (unhelpfully) lecturing about the cost of testing (it ain’t free, y’all!) and how they should, at a minimum, want to see a lift that would outweigh the costs of the test itself.

Do you know how many have been able to answer that question? Zero. That’s right: zero. Everyone can come up with a scenario where they would need that information—perhaps a new third-party recommendations engine or a test to remove a revenue-generating ad from a key funnel page. In each scenario, we know the challenger must provide gains that outweigh or erase the costs of the decision being made. Easy!

But what if you’re deciding between banner A or B? Page template X or Y? The assets have been created. There’s no additional cost to push them live. So…what does your lift need to be? Does it even matter, so long as it’s measurably better? BUT we can’t use a minimum lift like, say, 1% in a runtime calculator (unless you’re one of those programs gifted with an abundance of traffic and wonderful conversion rates) without seeing runtime estimates that leave us gasping for air.
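To see why a 1% minimum lift produces those gasp-inducing runtimes, here is a rough sketch using the standard normal-approximation sample-size formula for comparing two proportions (the baseline conversion rate and the lift figures below are hypothetical, and this is not the author's actual calculator):

```python
import math

def sample_size_per_variant(baseline, relative_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant for a two-proportion z-test.

    Uses the common normal-approximation formula with defaults of
    alpha = 0.05 (two-sided, z = 1.96) and power = 0.80 (z = 0.84).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 1% relative lift on a 3% baseline conversion rate:
n_small = sample_size_per_variant(0.03, 0.01)
print(n_small)  # roughly five million visitors per variant

# Detecting a 10% relative lift on the same baseline:
n_large = sample_size_per_variant(0.03, 0.10)
print(n_large)  # tens of thousands per variant, not millions
```

Because required sample size scales with the inverse square of the detectable difference, shrinking the MDE from 10% to 1% inflates the runtime by roughly a factor of one hundred.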

What’s the solution? We tried to back into lift estimates by using maximum runtime in this version of our calculator, and that helps to some extent. It helps you make decisions about whether those lifts are likely, at least. And knowing that can help you with prioritization. But it still doesn’t answer that niggling question: How big does a lift need to be in order for the business to make a decision?

Is that answer truly “any lift”? Because if it is—and I suspect it often really is—then why do we spend so much time arguing over lift % and annualized impacts? Why can’t we stand up and say, “B is better than A. Go with B”?

The New Frontier

This is essentially what the multi-armed bandit is doing. It doesn’t care HOW much better B is compared to A, or that C is worse than A. It only cares that B is better. And with the push from executive leadership toward machine learning and automation, maybe we can use this shift in methodologies to also shift the way we think about—and communicate—success. In that sense, maybe we’re all multi-armed bandits! Watch out casinos, here we come!

If you have any questions or would like to chat about how we could help build a new program or take an existing program to the next level, reach out!

