Dynamic Environments: When The World Changes
Part three - Dynamic Environments
Part four - Contextual Bandits
Continuing on with our slot machines story. Imagine that you are pulling the slot machines and you have to go to the bathroom. When you come back, the position of the slot machines has been changed. How fast can our algorithms figure out the change and react. There have also been a fixed number of slot machines. What if a new slot machine is wheeled to the end of our row? How much time should we devote to it to potentially find a new best slot machine?
Each chart shows off the average of many simulations using a variant schedule. A variant schedule changes the conversion rates of the variants for the specified timestep.
Variant Schedule
0: [0.25, 0.4, 0.5, 0.7]
1000: [0.7, 0.5, 0.4, 0.25]
We start with four variants. Half way through the test we change the order. Now our first variant is the best and our last variant is the worst.
E-greedy is slow to react but does eventually converge on the new best variant.
UCB very quickly shift to the new variant and uses it most often.
E-greedy regret balloons when the world shifts while UCB is able to adapt even to wild swings in variant conversion.
Variant Schedule
0: [0.25, 0.4, 0.5, 0.7]
1000: [0.25, 0.4, 0.5, 0.7, 0.4, 0.4, 0.9]
We start with four variants. Half way through the test we add three new variants where two an not great and one is the new best variant.
E-greedy holds on to the old best variant for a while before switching to the new best.
UCB nicely restarted experimentation and starts to nicely explore the new variants.
After the introduction of the new variants, epsilon greedy's regret curve gets steeper.
Variant Schedule
0: [0.25, 0.4, 0.5, 0.7]
1000: [0.25, 0.4, 0.5, 0.7, 0.3, 0.3, 0.2]
We start with four variants. Half way through the test we add three new variants where all are bad.
E-greedy has only a tiny blip of exploration. Notice though that the new variants now take up a slice of the explore time.
UCB has a larger exploration blip but moves towards better convergence.
UCB has a larger investment in assessing new variants.
Variant Schedule
0: [0.25, 0.4, 0.5, 0.7]
1000: [0.25, 0.4, 0.5, 0.7, 0.6, 0.6, 0.6]
Three new variants are added that are slightly worse than the best variant.
E-greedy handles the case cleanly.
UCB over explores the new variants and contributes to regret.
UCB accumulates a lot of regret in this situation.