Multi-Armed Bandits

Part One - The Casino

Part Two - Regret

Part Three - Dynamic Environments

Part Four - Contextual Bandits

Part Five - A/B Testing to Contextual Bandits

Whenever we choose a variant that is not the best one, we incur an opportunity loss. For example, if our best variant converts at 0.9 and we serve a variant that converts at 0.7, we incur 0.2 units of regret on that decision. In terms of clicks per view, over 100 views that is roughly 20 potential clicks lost. This lost value is referred to as regret, and we can evaluate algorithms by how they accumulate it over time.
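To make the arithmetic concrete, here is the example above as a few lines of Python. The rates and view count come straight from the example; the variable names are just for illustration.

```python
best_rate = 0.9     # conversion rate of the best variant
chosen_rate = 0.7   # conversion rate of the variant we actually served

regret_per_decision = best_rate - chosen_rate        # 0.2 units of regret per decision
views = 100
potential_clicks_lost = regret_per_decision * views  # ~20 clicks of opportunity loss
print(regret_per_decision, potential_clicks_lost)
```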

For each algorithm we track cumulative regret over time. The horizon is extended to 3000 trials.
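A minimal simulation sketch of that setup, assuming Bernoulli arms with conversion rates [0.7, 0.8, 0.9], a fixed epsilon of 0.1, and 200 independent runs. These values and the function names are my own illustrative choices, not the exact configuration behind the figure below.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_RATES = [0.7, 0.8, 0.9]  # assumed Bernoulli conversion rates for each variant
HORIZON = 3000                # decisions per run, matching the extended horizon above
RUNS = 200                    # independent runs to average over (assuming "(200 trials)"
                              # in the figure caption counts runs, not decisions)
EPSILON = 0.1                 # assumed fixed exploration rate for e-greedy

def epsilon_greedy_choices(rates, horizon, epsilon):
    """Arms chosen by e-greedy with a fixed exploration rate."""
    k, counts, values = len(rates), np.zeros(len(rates)), np.zeros(len(rates))
    choices = np.empty(horizon, dtype=int)
    for t in range(horizon):
        arm = rng.integers(k) if rng.random() < epsilon else int(np.argmax(values))
        reward = float(rng.random() < rates[arm])             # Bernoulli reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean estimate
        choices[t] = arm
    return choices

def ucb1_choices(rates, horizon):
    """Arms chosen by UCB1 (mean estimate plus confidence bonus)."""
    k, counts, values = len(rates), np.zeros(len(rates)), np.zeros(len(rates))
    choices = np.empty(horizon, dtype=int)
    for t in range(horizon):
        if t < k:
            arm = t                                           # play each arm once to initialise
        else:
            arm = int(np.argmax(values + np.sqrt(2 * np.log(t + 1) / counts)))
        reward = float(rng.random() < rates[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        choices[t] = arm
    return choices

def avg_cumulative_regret(choose_fn, runs):
    """Cumulative regret after each decision, averaged over independent runs."""
    best = max(TRUE_RATES)
    total = np.zeros(HORIZON)
    for _ in range(runs):
        choices = choose_fn()
        total += np.cumsum([best - TRUE_RATES[a] for a in choices])
    return total / runs

eg_regret = avg_cumulative_regret(lambda: epsilon_greedy_choices(TRUE_RATES, HORIZON, EPSILON), RUNS)
ucb_regret = avg_cumulative_regret(lambda: ucb1_choices(TRUE_RATES, HORIZON), RUNS)
print("final average regret  e-greedy:", round(eg_regret[-1], 1), " UCB1:", round(ucb_regret[-1], 1))
```

Plotting `eg_regret` and `ucb_regret` against the decision index produces traces like the figure below.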

[Figure: Average Regret Accumulation Over Time (200 trials)]

Observations

E-greedy accumulates less regret early on, roughly for the first ~1300 decisions, but the traces then invert and UCB begins to accumulate regret more slowly. With a fixed exploration rate, e-greedy's regret grows linearly, whereas UCB's regret grows sublinearly (logarithmically in the number of decisions). In practice this means UCB's choices keep improving as more decisions are made: its per-decision regret shrinks toward zero, while e-greedy keeps paying a constant exploration cost.
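A rough back-of-the-envelope check of that constant exploration cost, assuming the same illustrative rates and fixed epsilon of 0.1 as in the sketch above: uniform exploration picks a suboptimal arm at a constant rate, so the cost never stops accruing.

```python
epsilon = 0.1
gaps = [0.9 - 0.7, 0.9 - 0.8]             # gap of each suboptimal arm to the best arm
per_step_floor = epsilon * sum(gaps) / 3  # uniform exploration over 3 arms pays the average gap
print(per_step_floor * 3000)              # ~30 units of regret from exploration alone over 3000 decisions
```

UCB, by contrast, explores an arm less and less as evidence against it accumulates, which is why its trace flattens out.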

Part Three - Dynamic Environments

Read the textbook.