Multi-armed bandits are named after "one-armed bandits," a nickname for slot machines that implies they are stealing your money. A slot machine has one arm. Players pull the arm to play; if they win, they get a reward.
The multi-armed bandit problem imagines a row of slot machines, each with a different chance of winning. How should we choose which machine to play in order to maximize our profit?
As we select variants we keep two counters per variant. Pulls: the number of times we have used this variant. Rewards: the number of times something good happened for this variant. These counters comprise our model. Each variant has a conversion rate, which is the number of rewards divided by the number of pulls. We use the model to make decisions and update it based on outcomes.
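As a rough sketch of this model in code (TypeScript, with names like VariantModel and recordOutcome chosen here for illustration rather than taken from the article):

```ts
// Per-variant counters: pulls and rewards. Together these make up the model.
interface VariantModel {
  pulls: number;   // how many times this variant has been used
  rewards: number; // how many times it produced a reward
}

// Conversion rate = rewards per pull (0 until the variant has been pulled).
function conversionRate(v: VariantModel): number {
  return v.pulls === 0 ? 0 : v.rewards / v.pulls;
}

// Update the model after each decision based on the observed outcome.
function recordOutcome(v: VariantModel, rewarded: boolean): void {
  v.pulls += 1;
  if (rewarded) v.rewards += 1;
}
```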
As we select variants we want to balance two goals: Exploitation, using the best variant as much as possible, and Exploration, using the other variants in order to learn about them, either confirming that the best one really is the best or finding a new best variant.
There are four slot machines to choose from: V0, V1, V2, and V3. Their chances of winning are [10%, 30%, 70%, 90%]. We have 1000 opportunities to choose a slot machine (the horizon). We start with $1000. It costs $1 to play a slot machine. If we win, we get $2 back; if we lose, we lose the dollar.
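To make the setup concrete, here is a hedged sketch of that game loop, reusing the VariantModel and recordOutcome pieces above; the function names and loop structure are assumptions for illustration, not the article's actual worker code:

```ts
const WIN_PROBABILITIES = [0.1, 0.3, 0.7, 0.9]; // V0..V3
const HORIZON = 1000;                           // opportunities to play

// Pulling a machine wins with that machine's probability.
function pull(variant: number): boolean {
  return Math.random() < WIN_PROBABILITIES[variant];
}

// Run one full game: pay $1 per pull, collect $2 on a win, update the model.
function runSimulation(selectVariant: (models: VariantModel[]) => number): number {
  const models: VariantModel[] = WIN_PROBABILITIES.map(() => ({ pulls: 0, rewards: 0 }));
  let bankroll = 1000;
  for (let t = 0; t < HORIZON; t++) {
    const v = selectVariant(models);
    bankroll -= 1;            // it costs $1 to play
    const won = pull(v);
    if (won) bankroll += 2;   // a win pays $2, so the net gain is $1
    recordOutcome(models[v], won);
  }
  return bankroll;
}
```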
Not only do we want to use V3 as much as possible, we also have a secondary goal of using V0 as little as possible. We will still have to use the worst variant sometimes, because we have to learn about all the variants.
Three decision strategies are presented. Each is evaluated on this problem and the resulting dollar value is shown.
The charts visualize the distribution of the cumulative sum of variant usage over the test period. All the simulations run in the browser in a worker.
Our most naive solution could be to completely ignore our model and randomly select the variants.
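A sketch of this strategy is a one-liner that never reads the model; plugged into the simulation sketch above, it would be called as runSimulation(selectRandom):

```ts
// Uniform random selection: every variant is equally likely, regardless of results.
function selectRandom(models: VariantModel[]): number {
  return Math.floor(Math.random() * models.length);
}
```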
Variants are selected uniformly. As we run more trials, the bucket for each variant approaches the same size.
Buckets are roughly equally sized.
Total: $
This is a random walk around the expected average across all of the variants. Since the average win rate across the four variants is 50%, each pull has an expected net change of zero, so we should end up roughly where we started, at $1000.
Let's split our time between our two goals.
90% - Exploitation - Use the best variant
10% - Exploration - Pick a random variant
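A sketch of this epsilon-greedy rule, reusing the VariantModel shape and conversionRate helper from the earlier sketches (EPSILON = 0.1 matches the 10% exploration share; the names are illustrative):

```ts
const EPSILON = 0.1; // fraction of decisions spent exploring

function selectEpsilonGreedy(models: VariantModel[]): number {
  if (Math.random() < EPSILON) {
    // Explore: pick any variant at random.
    return Math.floor(Math.random() * models.length);
  }
  // Exploit: pick the variant with the highest conversion rate so far.
  let best = 0;
  for (let i = 1; i < models.length; i++) {
    if (conversionRate(models[i]) > conversionRate(models[best])) best = i;
  }
  return best;
}
```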
Rapid convergence on the best variant with consistent exploration.
Heavy utilization of the best variant.
Total: $
We used the best variant more than half of the time. In our exploration we used V0 as much as V1 and V2.
In this selection strategy, the Upper Confidence Bound (UCB), we score each variant by its conversion rate plus a boost that grows with how uncertain we are about it, i.e. how rarely it has been pulled relative to the total, and we choose the variant with the highest score. This selection strategy is deterministic (it doesn't depend on randomness).
rewards / pulls + sqrt((2 * log(totalPulls)) / pulls)
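In code, a sketch of this rule looks like the following; treating a never-pulled variant as having an infinite score (so every variant gets tried at least once) is a common convention assumed here, not something stated above:

```ts
function selectUCB(models: VariantModel[]): number {
  const totalPulls = models.reduce((sum, m) => sum + m.pulls, 0);
  let best = 0;
  let bestScore = -Infinity;
  for (let i = 0; i < models.length; i++) {
    const m = models[i];
    const score = m.pulls === 0
      ? Infinity // try every variant at least once
      : conversionRate(m) + Math.sqrt((2 * Math.log(totalPulls)) / m.pulls);
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  }
  return best;
}
```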
Convergence to the best variant but much more exploration.
Total: $
Our dollar figure is the highest. We used the best variant for the vast majority of our decisions and only minimally used our worst variant.
The model is a set of counters per variant. The conversion rate measures how often a reward is experienced per pull. The model uses these conversion rates to decide which variant to choose next. This chart visualizes how the conversion rates evolve over time.
UCB tends to undervalue the lower-performing variants. The conversion rate estimate for the best variant is the most accurate, since that variant is pulled the most in all three selection strategies.
We can run the simulation repeatedly and keep a running average per algorithm. There are three lines per selection strategy: the top and bottom lines are the maximum and minimum seen so far, and the middle line is the running average.
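A sketch of that bookkeeping (the RunningStats shape and helper names are illustrative):

```ts
interface RunningStats {
  count: number;
  min: number;
  max: number;
  average: number;
}

function newStats(): RunningStats {
  return { count: 0, min: Infinity, max: -Infinity, average: 0 };
}

// Fold one run's final bankroll into the min, max, and running average.
function updateStats(stats: RunningStats, finalBankroll: number): void {
  stats.count += 1;
  stats.min = Math.min(stats.min, finalBankroll);
  stats.max = Math.max(stats.max, finalBankroll);
  stats.average += (finalBankroll - stats.average) / stats.count;
}
```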
We see that the random selection strategy stays near $1000. The e-greedy average is slightly higher than UCB's; this is explained in part 2 and has to do with our limited number of trials. The band around UCB is much tighter than the band around e-greedy.