Multi-Armed Bandits

Dynamic Environments: When The World Changes

Part one - The Casino

Part two - Regret

Part three - Dynamic Environments

Part four - Contextual Bandits

Continuing with our slot machine story: imagine that you are pulling the slot machines and have to step away to the bathroom. When you come back, the positions of the slot machines have been shuffled. How fast can our algorithms figure out the change and react? So far we have also assumed a fixed number of slot machines. What if a new one is wheeled to the end of our row? How much time should we devote to it to potentially find a new best slot machine?

Each chart shows the average of many simulations run against a variant schedule. A variant schedule changes the conversion rates of the variants starting at the specified timestep.
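To make the schedule concrete, here is a minimal sketch of how one could drive a simulation (Python, with hypothetical names like `rates_at` and `pull` that the charts below do not depend on): the latest entry at or before the current timestep decides each variant's conversion rate.

```python
import random

# A hypothetical variant schedule in the same format used below:
# keys are timesteps, values are the conversion rates in effect
# from that timestep onward.
schedule = {
    0: [0.25, 0.4, 0.5, 0.7],
    1000: [0.7, 0.5, 0.4, 0.25],
}

def rates_at(t, schedule):
    """Return the conversion rates in effect at timestep t."""
    active = max(ts for ts in schedule if ts <= t)
    return schedule[active]

def pull(variant, t, schedule):
    """Simulate one pull: a Bernoulli reward drawn from the variant's current rate."""
    return 1 if random.random() < rates_at(t, schedule)[variant] else 0
```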

Scenario 1: Shuffling the variants

Variant Schedule

0: [0.25, 0.4, 0.5, 0.7]

1000: [0.7, 0.5, 0.4, 0.25]

We start with four variants. Halfway through the test we reverse the order, so our first variant is now the best and our last variant is the worst.

Variant Selection E-Greedy Over Time

E-greedy is slow to react but does eventually converge on the new best variant.
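One way to see why it is slow (a minimal epsilon-greedy sketch with illustrative names, not the exact code behind these charts): each variant's estimate is a running mean over its entire history, so a variant pulled a thousand times before the shuffle needs many fresh samples before its estimate drifts down to the new rate.

```python
import random

def e_greedy_select(estimates, epsilon=0.1):
    """With probability epsilon explore a random variant, otherwise exploit
    the variant with the highest estimated conversion rate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def e_greedy_update(estimates, counts, variant, reward):
    """Incremental running mean over the variant's entire history.
    Old observations never age out, so when the true rate shifts the
    estimate only drifts toward the new value one sample at a time."""
    counts[variant] += 1
    estimates[variant] += (reward - estimates[variant]) / counts[variant]
```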

Variant Selection UCB Over Time

UCB shifts to the new best variant very quickly and uses it most often.
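For reference, here is the usual UCB1 selection rule as a sketch (again with illustrative names): the confidence bonus stays large for under-sampled variants, which keeps a trickle of exploration going and lets UCB notice when a formerly weak variant starts paying off.

```python
import math

def ucb_select(estimates, counts, t):
    """UCB1: pick the variant with the highest estimate plus confidence bonus.
    A variant that has never been pulled is tried immediately."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(estimates)),
        key=lambda i: estimates[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )
```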

Regret Over Time

E-greedy's regret balloons when the world shifts, while UCB adapts even to wild swings in variant conversion rates.
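For concreteness, one way to compute the regret being plotted (a sketch that reuses the hypothetical `rates_at` helper from above): the benchmark at every timestep is the best conversion rate currently in effect, so the moment the schedule shuffles, the variant you are penalized for skipping changes too.

```python
def step_regret(chosen, t, schedule):
    """Per-pull regret: the gap between the best conversion rate currently
    on offer and the rate of the variant that was actually chosen."""
    rates = rates_at(t, schedule)
    return max(rates) - rates[chosen]
```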

Scenario 2: Adding New Best Variant

Variant Schedule

0: [0.25, 0.4, 0.5, 0.7]

1000: [0.25, 0.4, 0.5, 0.7, 0.4, 0.4, 0.9]

We start with four variants. Halfway through the test we add three new variants: two are not great and one is the new best.

Variant Selection E-Greedy Over Time

E-greedy holds on to the old best variant for a while before switching to the new best.

Variant Selection UCB Over Time

UCB restarts experimentation and quickly explores the new variants.
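That difference in behavior falls straight out of the selection rules. A sketch of the bookkeeping when variants are added mid-test (illustrative names again):

```python
def add_variants(estimates, counts, how_many):
    """When new variants appear mid-test, start them with empty statistics.
    Under the UCB rule sketched earlier, a zero-count variant is pulled
    right away, which is why UCB visibly restarts exploration here.
    Epsilon-greedy only reaches new variants through its random epsilon
    slice, so they get sampled far more slowly."""
    estimates.extend([0.0] * how_many)
    counts.extend([0] * how_many)
```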

Average Regret Accumulation Over Time

After the introduction of the new variants, epsilon-greedy's regret curve gets steeper.

Scenario 3: Adding New Bad Variants

Variant Schedule

0: [0.25, 0.4, 0.5, 0.7]

1000: [0.25, 0.4, 0.5, 0.7, 0.3, 0.3, 0.2]

We start with four variants. Halfway through the test we add three new variants, all of which are worse than the existing options.

Variant Selection E-Greedy Over Time

E-greedy shows only a tiny blip of exploration. Notice, though, that the new variants now take up a slice of the exploration time.

Variant Selection UCB Over Time

UCB has a larger exploration blip but then converges back on the best variant.

Regret Over Time

UCB pays a larger regret cost for assessing the new variants.

Scenario 4: An interesting failure case with UCB

Variant Schedule

0: [0.25, 0.4, 0.5, 0.7]

1000: [0.25, 0.4, 0.5, 0.7, 0.6, 0.6, 0.6]

Three new variants are added that are slightly worse than the best variant.

Variant Selection E-Greedy Over Time

E-greedy handles the case cleanly.

Variant Selection UCB Over Time

UCB over-explores the new variants, which adds to its regret.
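A back-of-the-envelope check with the UCB1 bonus from earlier shows why (the pull counts here are made up purely for illustration): a lightly sampled 0.6 variant carries a confidence bonus large enough to push its upper bound above the heavily sampled 0.7 incumbent, so UCB keeps coming back to it.

```python
import math

# Made-up counts for illustration: by t = 2000 each new 0.6 variant has been
# pulled ~50 times while the incumbent 0.7 variant has ~1500 pulls.
bonus_new = math.sqrt(2 * math.log(2000) / 50)    # ~0.55
bonus_old = math.sqrt(2 * math.log(2000) / 1500)  # ~0.10
print(0.6 + bonus_new)  # ~1.15 -> upper bound of a new variant
print(0.7 + bonus_old)  # ~0.80 -> upper bound of the incumbent
```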

Regret Over Time

UCB accumulates a lot of regret in this situation.

Read the textbook