Part Three - Dynamic Environments
Part Four - Contextual Bandits
Part Five - A/B Testing to Contextual Bandits
Whenever we choose a variant that is not the best one, we incur an opportunity loss known as regret. For example, if our best variant converts at 0.9 and we choose a variant that converts at 0.7, we incur 0.2 units of regret per decision; in the context of clicks per view over 100 views, that amounts to roughly 20 potential clicks lost. We can evaluate algorithms by the pattern in which they accumulate regret.
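A minimal sketch of that arithmetic, assuming the true conversion rates of each variant are known (in practice they are unknown, which is exactly what the bandit has to learn); the rates and view count below are the illustrative numbers from the example:

```python
# Per-decision regret, assuming we know the true conversion rates (illustrative values).
best_rate = 0.9      # conversion rate of the best variant
chosen_rate = 0.7    # conversion rate of the variant we actually served
views = 100

per_decision_regret = best_rate - chosen_rate        # 0.2 units of regret per decision
expected_lost_clicks = per_decision_regret * views   # ~20 potential clicks lost over 100 views

print(per_decision_regret, expected_lost_clicks)
```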
For each algorithm we track cumulative regret over time. The horizon is extended to 3000 trials.
While e-greedy initially accumulates less regret, particularly over the first ~1300 decisions, the traces then invert and UCB begins to accumulate less regret. E-greedy with a fixed exploration rate accumulates regret linearly, because it keeps choosing suboptimal variants at a constant rate, whereas UCB's regret grows sublinearly (roughly logarithmically): the more decisions we make, the more confident it becomes about which variant is best and the less often it picks a suboptimal one.
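A sketch of how such a comparison could be run, assuming two Bernoulli-reward variants with simulated conversion rates; the arm rates, epsilon value, and helper names are illustrative assumptions, not the exact setup behind the plot described above:

```python
import math
import random

TRUE_RATES = [0.7, 0.9]   # assumed true conversion rates of the two variants (illustrative)
BEST = max(TRUE_RATES)
HORIZON = 3000            # number of decisions, matching the extended horizon above

def pull(arm):
    """Simulate one view of a variant: 1 if it converts, else 0."""
    return 1 if random.random() < TRUE_RATES[arm] else 0

def run_epsilon_greedy(epsilon=0.1):
    counts = [0] * len(TRUE_RATES)
    values = [0.0] * len(TRUE_RATES)
    cum_regret, trace = 0.0, []
    for _ in range(HORIZON):
        if random.random() < epsilon:
            arm = random.randrange(len(TRUE_RATES))                         # explore
        else:
            arm = max(range(len(TRUE_RATES)), key=lambda a: values[a])      # exploit
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]                 # running mean
        cum_regret += BEST - TRUE_RATES[arm]                                # expected regret of this choice
        trace.append(cum_regret)
    return trace

def run_ucb1():
    counts = [0] * len(TRUE_RATES)
    values = [0.0] * len(TRUE_RATES)
    cum_regret, trace = 0.0, []
    for t in range(1, HORIZON + 1):
        if t <= len(TRUE_RATES):
            arm = t - 1                                                     # play each arm once first
        else:
            arm = max(range(len(TRUE_RATES)),
                      key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        cum_regret += BEST - TRUE_RATES[arm]
        trace.append(cum_regret)
    return trace

if __name__ == "__main__":
    eg, ucb = run_epsilon_greedy(), run_ucb1()
    print("final cumulative regret  e-greedy:", round(eg[-1], 1), " UCB1:", round(ucb[-1], 1))
```

Because e-greedy keeps exploring at a fixed rate, its `trace` grows roughly linearly with the horizon, while the UCB1 trace flattens as the confidence bounds tighten around the better variant.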