The Value Of a Blackjack Strategy

Sutton and Barto's Reinforcement Learning textbook describes a simplified Blackjack game. If we pick a strategy for blackjack, how good is it to be in each state of the decision space?

Decision Space


The state representation captures the player's total, the dealer's showing card, and whether the player's hand has a usable ace. If the player's total is less than 12 it always makes sense to hit, so there is no decision to be made; the player total therefore takes values from 12 to 21. The dealer's showing card can be a number from 1 to 10, and the usable ace is captured as a boolean.

[[12-21], [1-10], [true, false]]: 200 states
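
A minimal sketch in Python enumerating this state space (the variable name `states` is mine, not from the article):

```python
from itertools import product

# (player_total, dealer_showing_card, usable_ace) for every decision state.
states = list(product(range(12, 22), range(1, 11), (True, False)))
print(len(states))  # 200
```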

The reward is 1 for a win, 0 for a draw, and -1 for a loss. This is an undiscounted reward model: rewards during an episode are 0 and only the terminal reward counts. In practice this means we take the terminal reward and add it into the running average for each state visited during the episode.

The dealer hits until their total reaches 17 or more (i.e. the dealer hits on a total of 16 or less).
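
A small sketch of these two rules as I read them (function names are my own, and I assume a busted dealer loses to any non-busted player):

```python
def dealer_should_hit(dealer_total: int) -> bool:
    # The dealer hits on 16 or less and sticks on 17 or more.
    return dealer_total <= 16

def terminal_reward(player_total: int, dealer_total: int) -> int:
    # +1 win, 0 draw, -1 loss; a busted player loses, a busted dealer loses.
    if player_total > 21:
        return -1
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1
```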

State Value Visualization

What is the value of a state under this reward structure? The table shows the average reward achieved from each state; each cell ranges from -1 to 1.

We are going to generate episodes and observe the terminal reward. For each state encountered in an episode, we are going to add the observed reward into that state's running average.

Example: Given a strategy of "hit while the total is below 19"

State | Notes
Player Total: 13, Dealer Card: 5, Usable Ace: true, Action: Hit | The player has a usable ace, so the initial hand might have been an ace and a two. The total is below 19, so we hit.
Player Total: 18, Dealer Card: 5, Usable Ace: true, Action: Hit | We drew a 5 and our total is now 18. This is below 19, so we hit.
Player Total: 17, Dealer Card: 5, Usable Ace: false, Action: Hit | We drew a 9. Counting the ace as 11 would bust, so it now counts as 1 and is no longer usable, leaving a total of 17. This is below 19, so we hit.
Player Total: 21, Dealer Card: 5, Usable Ace: false, Action: Stick | We drew a 4 and our total is now 21. This is not below 19, so we stick.
Dealer: Hit | Cards: 5, 5 (total 10)
Dealer: Hit | Cards: 5, 5, 6 (total 16)
Dealer: Hit | Cards: 5, 5, 6, 5 (total 21)
Reward: 0 | The dealer matched our score. A tie.

For each state encountered (first four rows) we are going to update the average with the reward we saw at the end of the game.
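
A minimal sketch of that bookkeeping, assuming an every-visit update where each state in the episode is credited with the terminal reward:

```python
from collections import defaultdict

reward_sum = defaultdict(float)   # total reward credited to each state
visit_count = defaultdict(int)    # number of episodes each state appeared in

def update_state_values(visited_states, terminal_reward):
    # Credit the episode's terminal reward to every state encountered in it.
    for state in visited_states:
        visit_count[state] += 1
        reward_sum[state] += terminal_reward

def state_value(state):
    # The average observed reward for a state (0 if never visited).
    return reward_sum[state] / visit_count[state] if visit_count[state] else 0.0
```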

State Value Table

Strategy: Hit Max 19

Hit Max Strategy Win Rate and Cumulative Reward

I ran simulations for each hit-max threshold and copied the results once they looked like they had converged.

Win rate generally rises as you hit less, peaking at a hit max of 13, while cumulative reward peaks at a hit max of 16 with a value of 9.86. A rough simulation sketch follows the table below.

Hit Max | Win Rate (%) | Cumulative Reward
19 | 29.45 | -49.16
18 | 35.99 | -13.63
17 | 39.80 | 6.06
16 | 40.90 | 9.86
15 | 41.38 | 8.11
14 | 41.69 | 5.86
13 | 41.87 | 2.66
12 | 41.57 | -2.26
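
For reference, here is a rough sketch of how such a simulation could be run. The card-drawing model (infinite deck, face cards worth 10) and the threshold semantics (hit while strictly below the hit max, matching the worked example) are my assumptions, so the exact numbers will differ from the table:

```python
import random

def draw_card():
    # Infinite deck: ace is 1, face cards count as 10.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    # Best total for the hand and whether an ace is currently usable as 11.
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(hit_max):
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    while hand_value(player)[0] < hit_max:
        player.append(draw_card())
    player_total = hand_value(player)[0]
    if player_total > 21:
        return -1
    while hand_value(dealer)[0] <= 16:
        dealer.append(draw_card())
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1

rewards = [play_episode(hit_max=16) for _ in range(100_000)]
win_rate = 100 * sum(r == 1 for r in rewards) / len(rewards)
print(win_rate, sum(rewards))  # win rate (%) and cumulative reward
```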

State Action Values

Above we store the average of all observed rewards for each state. We now want to store an average for each state-action pair. This lets us build a strategy (policy) by selecting the best action in each state: for each state, we look at its state-action pairs and pick the action with the highest expected reward.
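
A sketch of the same bookkeeping, now keyed by (state, action) pairs; the action labels are my own:

```python
from collections import defaultdict

ACTIONS = ("hit", "stick")

action_reward_sum = defaultdict(float)
action_visit_count = defaultdict(int)

def action_value(state, action):
    # Average observed reward for taking this action in this state.
    key = (state, action)
    return action_reward_sum[key] / action_visit_count[key] if action_visit_count[key] else 0.0

def greedy_action(state):
    # The policy: pick the action with the highest average observed reward.
    return max(ACTIONS, key=lambda action: action_value(state, action))
```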

State Action Table

Here we are going to use "exploring starts". Since we use our policy to decide which actions to take, we might never visit some parts of the state space. To compensate, we start each episode from a randomly selected state and take a randomly selected first action from that state, then continue the episode under the policy as before. The observed reward is then used to update all the state-action pairs in the episode, as above.
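
Sketched out, one episode of exploring starts might look like this, building on the earlier sketches. `play_from` stands in for a hypothetical helper that plays the rest of the game under the current greedy policy and returns the visited (state, action) pairs plus the terminal reward:

```python
import random

def exploring_starts_episode(play_from):
    # Random starting state and random first action, regardless of the policy.
    state = random.choice(states)          # the 200 states enumerated earlier
    first_action = random.choice(ACTIONS)
    visited_pairs, terminal_reward = play_from(state, first_action)
    # Credit the terminal reward to every (state, action) pair in the episode.
    for s, a in visited_pairs:
        action_visit_count[(s, a)] += 1
        action_reward_sum[(s, a)] += terminal_reward
```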

Optimal Action Per State

The best action to take in every state. This is generated by looking at the table above and, for each state, picking the action that leads to the best expected reward.

Optimal State Action Values

Just like selecting the best action, we can also show the maximum expected reward given the best action. This is like the table above, except instead of showing the best action it shows the best expected value over all actions.
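
Both tables can be read straight off the state-action averages from the sketches above, something like:

```python
# Best action per state, and the best expected reward achievable from each state.
optimal_policy = {state: greedy_action(state) for state in states}
optimal_value = {state: max(action_value(state, a) for a in ACTIONS) for state in states}
```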

Evaluating The Optimal Policy

Just as we evaluated the fixed policies above, we can evaluate the optimal policy that we computed.
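
Reusing the helpers from the earlier hit-max sketch, evaluation might look like the following; how the learned policy plugs into the game loop here is my own assumption:

```python
def play_episode_with_policy(policy):
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    dealer_showing = dealer[0]
    while True:
        total, usable_ace = hand_value(player)
        if total > 21:
            return -1
        # Below 12 we always hit; otherwise ask the learned policy.
        if total < 12 or policy[(total, dealer_showing, usable_ace)] == "hit":
            player.append(draw_card())
        else:
            break
    player_total = hand_value(player)[0]
    while hand_value(dealer)[0] <= 16:
        dealer.append(draw_card())
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1

rewards = [play_episode_with_policy(optimal_policy) for _ in range(100_000)]
print(100 * sum(r == 1 for r in rewards) / len(rewards), sum(rewards))
```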

The optimal policy has a win rate of 42.92% and a cumulative reward of 18.78.

Conclusion

The optimal policy backs up that sticking is best in most situations, which we saw above with the fixed hit-max strategies. However, there is nuance depending on whether you have a usable ace.

In most cases exploring starts is not an option, since the state space is too large to visit every state. The next article will look at strategies that balance exploration and exploitation instead.