Part Five - A/B Testing to Contextual Bandits
The goal of iterative A/B testing is to test user experiences against each other and find the best converting variant. This is typically done by creating several variants, segmenting the audience (or a portion of it) into buckets, and collecting results on some action taken (a click or a purchase, for example). Statistical significance is often used to decide when to stop the experiment and adopt the best converting experience going forward. The process can then be repeated with new variants in search of something better. It is tedious, laborious, and error prone. Could we automate it and realize gains that were previously unreachable?
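For concreteness, bucket assignment is often implemented as a salted hash of a stable user identifier, so the same user always lands in the same bucket. A minimal sketch (the function name and salt are illustrative, not from any particular framework):

```python
import hashlib

def assign_bucket(user_id: str, num_buckets: int, salt: str = "exp-42") -> int:
    """Deterministically map a user to a bucket.

    The salt keeps assignments independent across experiments, so being
    in bucket 0 of one experiment says nothing about another.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets
```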
An automated system would consist of an audience segmentation and a set of variants. It would use the context, along with counters it maintains per variant for each context, to decide which variant to serve, aiming to use the best converting variant in every context as often as possible. All of these requirements are fulfilled by contextual multi-armed bandits.
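As a sketch of the state such a system maintains (the names are illustrative; a selection policy over these counters is sketched later in this section):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class VariantStats:
    successes: int = 0  # conversions observed
    trials: int = 0     # times the variant was served

class ContextualCounters:
    """Success/trial counters kept per (context, variant) pair."""

    def __init__(self, variants):
        self.variants = list(variants)
        self.stats = defaultdict(
            lambda: {v: VariantStats() for v in self.variants}
        )

    def record(self, context, variant, converted: bool) -> None:
        s = self.stats[context][variant]
        s.trials += 1
        s.successes += int(converted)

    def rate(self, context, variant) -> float:
        s = self.stats[context][variant]
        return s.successes / s.trials if s.trials else 0.0
```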
The "tyranny of averages" is the observation that averages hide a lot of detail, including outliers. A/B testing is typically done in one part of the product with a portion of the total audience, and results are usually read in aggregate. This unsegmented view leaves out how each variant performed per context.
Consider the following A/B test result.
V0 to V3 are four different variants. Conversion rates are broken down by an audience segmentation that gives us three contexts, with the per-variant average in the bottom row.
|           | V0   | V1   | V2   | V3   |
|-----------|------|------|------|------|
| Context 1 | 0.40 | 0.70 | 0.55 | 0.90 |
| Context 2 | 0.60 | 0.66 | 0.73 | 0.60 |
| Context 3 | 0.90 | 0.61 | 0.63 | 0.60 |
| Average   | 0.63 | 0.65 | 0.63 | 0.70 |
Looking at the averages across variants, V3 is the highest converting variant, and if this were an A/B test result you would conclude it is the best. Breaking the data down by context tells a different story: V3 wins the average only because of its very high conversion rate in context 1, while in context 2 it converts at 0.60 against V2's 0.73, and in context 3 at 0.60 against V0's 0.90. In context 3 especially, we could have gotten a much higher conversion rate with V0.
Ideally we would like a system that can use the context to choose the best variant: V3 as much as possible in context 1, V2 in context 2, and V0 as much as possible in context 3.
Looking at just the averages, serving V3 all the time gives a system conversion rate of 0.70. A system that can use the best variant given the context, however, can do as well as the best converting variant in each context: V3 with 0.90 in context 1, V2 with 0.73 in context 2, and V0 with 0.90 in context 3. The average of the best conversion rates across contexts is 0.84. That would be a huge uplift over serving any single variant.
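Assuming each context receives equal traffic, the comparison can be checked directly from the table:

```python
# Conversion rates from the table above.
rates = {
    "context 1": {"V0": 0.40, "V1": 0.70, "V2": 0.55, "V3": 0.90},
    "context 2": {"V0": 0.60, "V1": 0.66, "V2": 0.73, "V3": 0.60},
    "context 3": {"V0": 0.90, "V1": 0.61, "V2": 0.63, "V3": 0.60},
}
variants = ["V0", "V1", "V2", "V3"]

# Best single variant served everywhere: V3 at 0.70.
best_overall = max(variants, key=lambda v: sum(ctx[v] for ctx in rates.values()))
overall_rate = sum(ctx[best_overall] for ctx in rates.values()) / len(rates)

# Best variant per context: (0.90 + 0.73 + 0.90) / 3 = 0.84.
per_context_rate = sum(max(ctx.values()) for ctx in rates.values()) / len(rates)

print(best_overall, round(overall_rate, 2))  # V3 0.7
print(round(per_context_rate, 2))            # 0.84
```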
The results of our A/B test might not be reproducible if we ran the same experiment years apart, perhaps because of different traffic patterns, a demographic shift, or a change in societal sentiment. This can be checked with holdout groups; however, maintaining holdout groups long term across many experiments is burdensome.
Ideally we want a system that can adapt to audience shifts. To adapt, it would need to run continuously and keep making decisions about which variant to choose.
Typically, statistical significance is used to decide whether we have enough data to stop the experiment and start again with a new set of variants. Statistical significance is rooted in frequentism: it controls long-run error rates, so any individual conclusion is only correct most of the time.
An ideal system would not depend on statistical significance.
Contextual multi-armed bandits maintain counters per variant, per context. This way the best variant can be chosen for each context.
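As one concrete selection policy over those counters, here is a Thompson sampling sketch, assuming Bernoulli conversions and a uniform Beta(1, 1) prior, using the counter layout sketched earlier:

```python
import random

def choose_variant(counters, context):
    """Thompson sampling: draw from Beta(successes + 1, failures + 1)
    for each variant in this context and serve the highest draw.
    High-converting variants win most draws, while every variant keeps
    some probability of being explored.
    """
    best_variant, best_draw = None, -1.0
    for v in counters.variants:
        s = counters.stats[context][v]
        draw = random.betavariate(s.successes + 1, s.trials - s.successes + 1)
        if draw > best_draw:
            best_variant, best_draw = v, draw
    return best_variant
```

In a serving loop, choose_variant picks the experience for each request and record feeds the observed conversion back in, so the loop closes without any manual stopping rule.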
The system runs continually, collecting counters over time to make better decisions. It works in dynamic environments, in this case mainly shifts in audience sentiment toward the variants.
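One common way to track such drift is to discount old observations so that recent behavior dominates the counters. A minimal sketch (the decay constant is an assumption to tune, not a prescribed value):

```python
class DecayedStats:
    """Exponentially discounted success/trial counters.

    A decay close to 1.0 gives a long memory; smaller values forget
    faster and adapt more quickly to audience shifts.
    """

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.successes = 0.0
        self.trials = 0.0

    def record(self, converted: bool) -> None:
        self.successes = self.decay * self.successes + float(converted)
        self.trials = self.decay * self.trials + 1.0
```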
The system does not require statistical significance. Since it is constantly running and can adapt, there is no decision to make about when to stop and iterate on variants.
Reframing optimal audience experience for maximal conversion as a decision problem, we can apply contextual multi-armed bandits to automate an existing A/B testing system. Iterating on the set of variants, adding new ones and retiring poor performers, would also be possible.
Contextual multi-armed bandit systems can yield previously unrealized conversion gains because they optimize per context.