Part three - Dynamic Environments

Part four - Contextual Bandits

Multi-armed bandits are named after "one-armed bandits," a nickname for slot
machines that implies they are stealing your money. A slot machine has
an **arm**. Players **pull** the arm to play. If they
win they get a **reward**.

The multi-armed bandit problem imagines a row of slot machines, each with a different chance of winning. How should we choose among the slot machines as we play to maximize our profit?

As we select variants we are going to keep two counters per variant.
**Pulls**: The number of times we have used this variant.
**Rewards**: The number of times something good happened for this variant.
These counters for each variant comprise our **model**. Each variant
has a **conversion rate**: its number of rewards divided by its number of pulls. We will
use the model to make decisions, and we will update the model based on outcomes.
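The per-variant counters and conversion rate can be sketched as follows. This is a minimal illustration, not the article's actual code, and the helper names (`makeVariant`, `conversionRate`, `update`) are mine:

```javascript
// Each variant carries two counters: pulls and rewards.
function makeVariant(name) {
  return { name, pulls: 0, rewards: 0 };
}

// conversion rate = rewards / pulls (treated as 0 before the first pull)
function conversionRate(v) {
  return v.pulls === 0 ? 0 : v.rewards / v.pulls;
}

// Update the model after one play: always count the pull,
// count a reward only when something good happened.
function update(v, won) {
  v.pulls += 1;
  if (won) v.rewards += 1;
}

const v2 = makeVariant("V2");
update(v2, true);
update(v2, false);
update(v2, true);
// after 3 pulls and 2 rewards, conversionRate(v2) is 2/3
```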

As we select variants we want to balance two goals: **Exploitation** - using
the best variant as much as possible, and **Exploration** - using the other variants
in order to learn about them, either confirming that the best variant really is the best
or discovering a better one.

There are four slot machines to choose from: **V0, V1, V2 and V3**.
Their chances of winning are **[10%, 30%, 70%, 90%]** respectively.
We have **1000 opportunities** to choose a slot machine (the horizon).
We start with $1000.
It costs **$1 to use a slot machine**. If we win, we get $2; if we lose, we
lose the dollar.
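This setup can be captured in a few lines. A minimal sketch, assuming the numbers above; the constant and function names are mine:

```javascript
// The four slot machines and their chances of winning.
const WIN_CHANCE = [0.10, 0.30, 0.70, 0.90]; // V0..V3
const HORIZON = 1000; // number of opportunities to play
const COST = 1;       // $1 to play
const PAYOUT = 2;     // $2 back on a win

// Pull machine `i`: returns the change to our bankroll (+$1 or -$1).
// `random` is injectable so the outcome can be forced in tests.
function pull(i, random = Math.random) {
  const won = random() < WIN_CHANCE[i];
  return won ? PAYOUT - COST : -COST;
}

// Expected value of one pull of V3: 0.9 * $1 + 0.1 * (-$1) = $0.80
```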

Not only do we want to use V3 as much as possible, we also have a secondary goal: avoiding V0 as much as possible. We cannot avoid the worst variant entirely, because we have to learn about all the variants.

Three decision strategies are presented below. Each is evaluated on this problem, and the resulting dollar value is shown.

The charts visualize the distribution of the cumulative sum of variant usage over the test period. All the simulations run in the browser in a worker.

Our most naive strategy is to ignore the model completely and select variants at random.

Variants are selected uniformly. As we run more trials, the bucket for each variant approaches the same size.

Buckets are roughly equally sized.

Total: $

This is a random walk around the expected average across all of the variants. We should end up roughly where we started, at $1000.
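Uniform-random selection can be sketched in a couple of lines. These are assumed helper names, not the article's actual code:

```javascript
// Ignore the model entirely: every variant is equally likely.
function selectRandom(variantCount, random = Math.random) {
  return Math.floor(random() * variantCount);
}

// Tally how often each variant gets picked over a run of trials.
// With enough trials, each bucket approaches trials / variantCount.
function countSelections(trials, variantCount, random = Math.random) {
  const counts = new Array(variantCount).fill(0);
  for (let t = 0; t < trials; t++) {
    counts[selectRandom(variantCount, random)] += 1;
  }
  return counts;
}
```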

Let's split our time between our two goals.

**90% - Exploitation - Use the best variant**

**10% - Exploration - Pick a random variant**

Rapid convergence on the best variant with consistent exploration.

Heavy utilization of the best variant.

Total: $

We used the best variant more than half of the time, but in our exploration we used V0 as much as V1 and V2.
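The 90/10 split above is the epsilon-greedy rule. A sketch, assuming each variant is an object with `pulls` and `rewards` counters (the helper names are mine):

```javascript
const EPSILON = 0.1; // 10% of the time, explore

function conversionRate(v) {
  return v.pulls === 0 ? 0 : v.rewards / v.pulls;
}

function selectEpsilonGreedy(variants, random = Math.random) {
  if (random() < EPSILON) {
    // Explore: pick any variant uniformly at random.
    return Math.floor(random() * variants.length);
  }
  // Exploit: pick the variant with the best conversion rate so far.
  let best = 0;
  for (let i = 1; i < variants.length; i++) {
    if (conversionRate(variants[i]) > conversionRate(variants[best])) {
      best = i;
    }
  }
  return best;
}
```

Note that the explore branch may also land on the best variant; the 90% exploitation figure is therefore a floor, not an exact share.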

In this selection strategy we score each variant by its conversion rate plus a bonus that is larger for variants we are less certain about (those with fewer pulls). This selection strategy is deterministic (it doesn't depend on randomness).

**rewards / pulls + sqrt((2 * log(totalPulls)) / pulls)**
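The formula above is the UCB1 score. A sketch of the selection rule, with helper names of my choosing:

```javascript
// Score one variant: conversion rate plus an uncertainty bonus.
// An untried variant scores Infinity so it always gets picked first.
function ucbScore(v, totalPulls) {
  if (v.pulls === 0) return Infinity;
  const rate = v.rewards / v.pulls;
  const bonus = Math.sqrt((2 * Math.log(totalPulls)) / v.pulls);
  return rate + bonus;
}

// Deterministically pick the variant with the highest score.
function selectUCB(variants) {
  const totalPulls = variants.reduce((sum, v) => sum + v.pulls, 0);
  let best = 0;
  for (let i = 1; i < variants.length; i++) {
    if (ucbScore(variants[i], totalPulls) > ucbScore(variants[best], totalPulls)) {
      best = i;
    }
  }
  return best;
}
```

The bonus term shrinks as a variant accumulates pulls, so well-explored variants are judged mostly on their conversion rate while neglected ones get periodic second chances.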

Convergence to the best variant but much more exploration.

Best variant is used and worst variant avoided most of the time.

Total: $

Our dollar figure is the highest. We used the best variant for the vast majority of our decisions and only minimally used the worst variant.

The model is a pair of counters per variant. The conversion rate measures how often a reward is experienced per pull. The model uses these conversion rates to decide which variant to choose next. This chart visualizes how the conversion rates evolve over time.

UCB tends to undervalue lower performing variants. The conversion rate estimate for the best variant is the most accurate, since the best variant is used the most in all three selection strategies.

We can run the simulations and keep an average per algorithm. There are three lines per selection strategy. The top and bottom lines are the max and min seen so far. The middle line is the running average.
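The running statistics behind those three lines can be kept in constant memory per algorithm. A sketch with names of my choosing:

```javascript
// Track min, max, and running average of final dollar values
// across repeated simulations, without storing every result.
function makeStats() {
  return { n: 0, min: Infinity, max: -Infinity, avg: 0 };
}

function updateStats(s, finalDollars) {
  s.n += 1;
  s.min = Math.min(s.min, finalDollars);
  s.max = Math.max(s.max, finalDollars);
  // incremental mean: avg += (x - avg) / n
  s.avg += (finalDollars - s.avg) / s.n;
  return s;
}
```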

We see that the random selection strategy stays near $1000. The e-greedy average is slightly higher than UCB's; this is explained in part 2 and has to do with our limited number of trials. The band around UCB is much tighter than the one around e-greedy.