13 Apr 2014, 11:21

A/B tests and the Kelly criterion


When you’re testing a sequence of changes in an A/B test, the question naturally arises: how large a group should I test on? Normal practice is to wing it, usually based on some function of the number of fingers you have. This is suboptimal; a 1% test group for a Facebook-sized organization, or in the context of millions of ad impressions, is likely to be incredibly overpowered.

You can deal more rigorously with the issue by recognizing that an A/B test is a wager: we take, say, 1% of our income stream for the time period, and we spin the wheel, seeing if the payoff is negative or not.

There is a “multi-armed bandit” problem that makes this analogy explicit. However, it’s dealing with a slightly different formulation: given a number of possible “plays”, which sequence do we test & how many spins do we give each one? The vanilla posing of the problem doesn’t give a direct answer to the question of how much money we should be dumping into the slot each time (although the math composes nicely, and with some effort one can get a coherent solution for all of these problems combined).
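For concreteness, here is a minimal sketch (in Python) of one standard way the allocation half of that problem gets handled, Thompson sampling; the two-variant setup and the Beta priors are illustrative assumptions on my part, not anything the bandit formulation itself dictates:

    import random

    # Track a binary outcome (e.g. a conversion) for each variant.
    variants = {"A": {"wins": 0, "losses": 0}, "B": {"wins": 0, "losses": 0}}

    def pick_variant():
        """Thompson sampling: draw a plausible conversion rate from each
        variant's Beta posterior and play the variant with the highest draw."""
        draws = {name: random.betavariate(v["wins"] + 1, v["losses"] + 1)
                 for name, v in variants.items()}
        return max(draws, key=draws.get)

    def record(name, converted):
        """Feed an observed outcome back into that variant's posterior."""
        variants[name]["wins" if converted else "losses"] += 1

Traffic drifts toward whichever variant looks better as evidence accumulates, which answers “how many spins does each arm get” but, as noted, says nothing about how much money rides on each spin.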

The question of appropriate bet size has a claimed definitive answer in the Kelly criterion. It is remarkably simple:

Proportion to bet = (prob. of success) / (cost of failure) 
                  - (prob. of failure) / (payoff of success)

To pick some contrived numbers, if I have an experiment that I guesstimate to have a 70% chance of a 2% gain and a 30% chance of a 4.5% loss, I should put ~55% of my bankroll into that wager. Essentially, what this bet size gets you is the maximum expected long-term growth of your bankroll if you fold your winnings back into more wagers. This means that it’s not actually the appropriate criterion if you can’t continually reinvest your winnings: for instance, if you’re measuring the effect of switching from one JVM garbage collector to another, you get a choice of three and you’re done, and your “winnings” are something like getting to deploy a constant proportion fewer machines, or a constant reduction in latency (i.e., a one-time static payoff). On the other hand, if you’re trying to turn ads into customers into ads into customers, the analogy is pretty apt.
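As a quick sketch in Python (the function name is mine; the gain and loss are expressed as fractions of the amount wagered, matching the example above):

    def kelly_fraction(p_win, gain, loss):
        """Generalized Kelly bet size for a wager that pays `gain` (as a
        fraction of the stake) with probability `p_win`, and costs `loss`
        (as a fraction of the stake) otherwise."""
        return p_win / loss - (1.0 - p_win) / gain

    # The contrived example: 70% chance of a 2% gain, 30% chance of a 4.5% loss.
    print(kelly_fraction(0.70, 0.02, 0.045))  # ~0.556, the ~55% from above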

A few questions rear their ugly heads immediately:

  • How can I possibly know the expected payoffs if that’s what I’m trying to measure in the first place?
  • How does statistical significance play into this?

The first is more of an art than a science, but you can get an estimate by looking at the results of previous experiments. If all your previous futzing with your site’s fonts only shifted conversion by 0.5% in one direction or the other, your custom Lobster substitute is unlikely to change it by an order of magnitude. Still, it’s difficult to come up with reasoned estimates, especially for low-probability outcomes of highly variable magnitude, measured against a fickle and ever-changing customer base. It might help to think of the bet not as “shift N% of our traffic to an A/B test of button color”, but as “shift N% of our traffic to an A/B testing team with this track record”.
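As a rough sketch of that framing (Python again, reusing kelly_fraction from above; the lift numbers are invented purely for illustration): treat the team’s past relative lifts as a stand-in for what the next experiment might do, and read the Kelly inputs off that history.

    # Invented relative lifts from past experiments (+0.004 = +0.4%).
    past_lifts = [0.004, -0.002, 0.011, -0.007, 0.003, -0.001, 0.006, -0.004]

    wins = [x for x in past_lifts if x > 0]
    losses = [-x for x in past_lifts if x < 0]

    p_win = len(wins) / len(past_lifts)   # empirical probability of a positive result
    avg_gain = sum(wins) / len(wins)      # typical payoff when a change helps
    avg_loss = sum(losses) / len(losses)  # typical cost when it hurts

    # Capped at 1.0, since you can't route more than all of your traffic; with
    # per-experiment swings this small, the raw Kelly number usually comes out
    # far above 1, i.e. "bet as much as you can".
    print(min(1.0, kelly_fraction(p_win, avg_gain, avg_loss)))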

The second is trickier. As I mentioned above, it is possible to reconcile these ideas seamlessly, but it does take some math and some assumptions about what your “real” utility function and beliefs are. The Kelly criterion is fundamentally forward-looking and statistical confidence is fundamentally backward-looking, so the two need not be in conflict; one of the nice things about the Kelly criterion is that it has no explicit term for “variance”. In particular, because Kelly depends only on expected outcomes, you can apply Bayesian updating to your expected payoffs as results come in, and adjust proportions in real time.
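A minimal sketch of that loop, assuming a Beta-Bernoulli model over each variant’s conversion rate (Python, reusing kelly_fraction; the model and every specific number here are assumptions for illustration, not a recipe):

    import random

    def posterior_kelly(a_wins, a_trials, b_wins, b_trials, n_draws=10000):
        """Estimate the Kelly inputs by Monte Carlo over the Beta posteriors of
        the control (A) and treatment (B) conversion rates, then bet accordingly."""
        gains, losses = [], []
        for _ in range(n_draws):
            rate_a = random.betavariate(a_wins + 1, a_trials - a_wins + 1)
            rate_b = random.betavariate(b_wins + 1, b_trials - b_wins + 1)
            lift = (rate_b - rate_a) / rate_a  # relative change in conversion
            (gains if lift > 0 else losses).append(abs(lift))
        if not losses:
            return 1.0  # the posterior thinks the treatment essentially can't lose
        if not gains:
            return 0.0  # ...or essentially can't win
        p_win = len(gains) / n_draws
        return min(1.0, max(0.0, kelly_fraction(p_win,
                                                sum(gains) / len(gains),
                                                sum(losses) / len(losses))))

    # Re-run as observations accumulate and shift traffic to match, e.g.:
    # fraction = posterior_kelly(a_wins=480, a_trials=10000, b_wins=530, b_trials=10000)

As the evidence firms up in either direction, the fraction drifts toward 1 or 0 accordingly, which is the “adjust proportions in real time” part.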

If this sounds like it might get a bit complicated, that’s because it does. Unfortunately there is no way around it: the more rigorously you try to model a complex system of probabilistic payoffs, the more elaborate your model has to be, which is why in practice the solutions tend to be either sticking with ye olde A/B tests (which are, for the most part, predictably suboptimal) or hiring some consultants or in-house statisticians.