Reinforcement Learning: Easy21

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
In [2]:
import numpy as np
from deck import Deck, Card
from player import Player, Dealer, MonteCarloPlayer, SarsaPlayer, HumanPlayer
from utils import value, is_bust, policy_eval, cards_to_str, display_cards
from matplotlib import cm, gridspec
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%pylab inline
import mayavi
from mayavi import mlab
from tqdm import tqdm
from easy21 import Easy21, plot_q
import pickle

mlab.init_notebook()
Populating the interactive namespace from numpy and matplotlib
Notebook initialized with x3d backend.

1 Implementation of the game Easy21

The game is played according to the following rules:

  • The game is played with an infinite deck of cards (i.e. cards are sampled with replacement)
  • Each draw from the deck results in a value between 1 and 10 (uniformly distributed) with a color of red (probability 1/3) or black (probability 2/3).
  • There are no aces or picture (face) cards in this game
  • At the start of the game both the player and the dealer draw one black card (fully observed)
  • Each turn the player may either stick or hit
  • If the player hits then she draws another card from the deck
  • If the player sticks she receives no further cards
  • The values of the player’s cards are added (black cards) or subtracted (red cards)
  • If the player’s sum exceeds 21, or becomes less than 1, then she “goes bust” and loses the game (reward -1)
  • If the player sticks then the dealer starts taking turns. The dealer always sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise the outcome is decided by whoever has the larger sum: win (reward +1), lose (reward -1), or draw (reward 0). A minimal sketch of these dynamics follows the list.
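
For reference, here is a minimal sketch of these dynamics. It is illustrative only (the function names and structure are assumptions), not the implementation in easy21.py.

In [ ]:
import random

def draw_card():
    # Values 1..10 uniformly; black (probability 2/3) adds, red (1/3) subtracts.
    value = random.randint(1, 10)
    return value if random.random() < 2/3.0 else -value

def step(player_sum, dealer_card, action):
    """Return (player_sum, dealer_card, reward, done) after one action."""
    if action == "hit":
        player_sum += draw_card()
        if player_sum > 21 or player_sum < 1:      # player goes bust
            return player_sum, dealer_card, -1, True
        return player_sum, dealer_card, 0, False
    # action == "stick": the dealer plays, hitting until reaching 17 or busting.
    dealer_sum = dealer_card
    while 1 <= dealer_sum < 17:
        dealer_sum += draw_card()
    if dealer_sum > 21 or dealer_sum < 1:          # dealer goes bust
        return player_sum, dealer_card, 1, True
    reward = (player_sum > dealer_sum) - (player_sum < dealer_sum)
    return player_sum, dealer_card, reward, True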

Interact with the environment by playing the game in the following cell:

In [3]:
run easy21.py -a human
Enter your name: Dan
Let's play, Dan!

You have 8 and the dealer shows 7.
Dan: hit or stay? h

You have 13 and the dealer shows 7.
Dan: hit or stay? h

You have 20 and the dealer shows 7.
Dan: hit or stay? s
--------------------
Dealer: 7b = 7
Dan: 8b 5b 7b = 20
--------------------
Dealer draws 9b.
Dealer: 7b 9b = 16
--------------------
Dealer draws 1b.
Dealer: 7b 9b 1b = 17
--------------------
You win, Dan!

Play again? n

2 Monte Carlo Control in Easy21

The second task is to apply Monte Carlo Control to the Easy21 game. It is suggested to use a time-varying scalar step size of $\alpha_{t} = 1/N(s_{t}, a_{t})$ and an $\epsilon$-greedy exploration strategy with $\epsilon_{t} = N_{0}/(N_{0} + N(s_{t}))$, where $N_{0} = 100$ is a constant, $N(s)$ is the number of times that state $s$ has been visited, and $N(s, a)$ is the number of times that action $a$ has been selected from state $s$.
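
Before running the agent, here is a rough sketch of the update it performs at the end of each episode (illustrative only, with assumed names; the actual implementation is MonteCarloPlayer in player.py).

In [ ]:
import numpy as np
from collections import defaultdict

# Sketch of the tabular Monte Carlo control described above (not the
# MonteCarloPlayer code): every-visit updates with alpha = 1/N(s, a) and an
# epsilon-greedy policy with epsilon = N0 / (N0 + N(s)).
N0 = 100.0
Q = defaultdict(float)     # action-value estimates Q(s, a)
N_s = defaultdict(int)     # state visit counts N(s)
N_sa = defaultdict(int)    # state-action counts N(s, a)

def epsilon_greedy(state, actions=("hit", "stick")):
    eps = N0 / (N0 + N_s[state])
    if np.random.rand() < eps:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: Q[(state, a)])

def mc_update(episode, G):
    # episode: list of (state, action) pairs visited; G: the episode's return.
    for state, action in episode:
        N_s[state] += 1
        N_sa[(state, action)] += 1
        alpha = 1.0 / N_sa[(state, action)]        # alpha_t = 1 / N(s_t, a_t)
        Q[(state, action)] += alpha * (G - Q[(state, action)])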

In [92]:
run easy21.py -a mc -e 1000000 -p mpl
100%|██████████| 1000000/1000000 [03:35<00:00, 4643.31it/s]

This is not very satisfying as an optimal value function: it is too noisy, and the policy does not look optimal at all given the symmetry of the problem. It seems that epsilon may be going to zero too quickly, so a less-than-optimal policy freezes in too soon. I tried changing epsilon to $\epsilon_{t} = \sqrt{N_{0}/(N_{0} + N(s_{t}))}$ below, which seemed to help a little.

In [93]:
run easy21.py -a mc -e 1000000 -p mpl
100%|██████████| 1000000/1000000 [03:54<00:00, 4269.35it/s]

That still doesn't look like an optimal policy, though, and there is still noise, so I decided to take the plunge and calculate the optimal value function and policy without using Reinforcement Learning. For any given policy we can define a stochastic matrix giving the probability of reaching each final state from each starting state by playing the game out. For the Dealer, let's call that matrix $\mathbf{D}$. For a given policy $\pi$, we will write the player's stochastic matrix as $\mathbf{P}_\pi$. Then if we define a returns matrix $\mathbf{R}$ whose rows enumerate the player end values and whose columns enumerate the dealer end values, we can express the value function as the expected return under a given policy, $v_\pi = \mathbf{E}_\pi(R) = \mathbf{P}_\pi^{T}\mathbf{R}\mathbf{D}$.
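
Element-wise, with the index convention (used in the cells below) that rows of $\mathbf{P}_\pi$ and $\mathbf{D}$ run over final values and columns over starting states, this reads

$$v_\pi(s, d) = \sum_{f_p}\sum_{f_d} [\mathbf{P}_\pi]_{f_p s}\,[\mathbf{R}]_{f_p f_d}\,[\mathbf{D}]_{f_d d} = [\mathbf{P}_\pi^{T}\mathbf{R}\mathbf{D}]_{sd},$$

where $s$ is the player's starting sum, $d$ is the dealer's showing card, $f_p$ and $f_d$ run over the possible end values, and $[\mathbf{R}]_{f_p f_d} = \operatorname{sign}(f_p - f_d)$ apart from the bust adjustment made when the matrix is constructed below.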

  • We will assume that the policy will take the form of hitting for some range of card values and sticking outside of that range.
  • We can logically rule out some policies so we don't have to test them all.
  • For each Dealer card we will choose the policy that gives the highest average return across the range of possible starting cards.
In [4]:
dprobs = policy_eval((1, 17), 100000)
100%|██████████| 100000/100000 [03:27<00:00, 482.85it/s]

Let's run this again to check that we are taking a large enough sample to get a good estimate of the probability.

In [70]:
dprobs2 = policy_eval((1, 17), 100000)
100%|██████████| 100000/100000 [03:04<00:00, 542.64it/s]
In [89]:
ds = dprobs[(dprobs != 1.0) * (dprobs != 0.0)] # Only check those that are not equal to zero or one.
ds2 = dprobs2[(dprobs2 != 1.0) * (dprobs2 != 0.0)]
print "RMS Error is:", np.sqrt(((2 * (ds - ds2)/(ds + ds2))**2).mean())
print "RMS Difference is:", np.sqrt((((ds - ds2))**2).mean())
RMS Error is: 0.0125284041048
RMS Difference is: 0.0015728245558

Create the returns matrix.

In [5]:
returns = np.zeros((22, 22))
for i in range(22):
    for j in range(22):
        returns[i, j] = np.sign(i - j)
returns[0, 0] = -1.0  # If the player busts she loses even if the dealer would also bust: the one asymmetry between the player and the dealer.
In [ ]:
returns
In [6]:
pstick = policy_eval((22, 11), 100000)
100%|██████████| 100000/100000 [00:10<00:00, 9922.01it/s]
In [ ]:
pstick

Policies that are likely to be among the best:

In [13]:
policies = [[i, j] for j in range(18, 11, -1) for i in range(6, 11)]
In [14]:
[tup for tup in enumerate(policies)]
Out[14]:
[(0, [6, 18]),
 (1, [7, 18]),
 (2, [8, 18]),
 (3, [9, 18]),
 (4, [10, 18]),
 (5, [6, 17]),
 (6, [7, 17]),
 (7, [8, 17]),
 (8, [9, 17]),
 (9, [10, 17]),
 (10, [6, 16]),
 (11, [7, 16]),
 (12, [8, 16]),
 (13, [9, 16]),
 (14, [10, 16]),
 (15, [6, 15]),
 (16, [7, 15]),
 (17, [8, 15]),
 (18, [9, 15]),
 (19, [10, 15]),
 (20, [6, 14]),
 (21, [7, 14]),
 (22, [8, 14]),
 (23, [9, 14]),
 (24, [10, 14]),
 (25, [6, 13]),
 (26, [7, 13]),
 (27, [8, 13]),
 (28, [9, 13]),
 (29, [10, 13]),
 (30, [6, 12]),
 (31, [7, 12]),
 (32, [8, 12]),
 (33, [9, 12]),
 (34, [10, 12])]

We have 35 policies, plus the policy of always sticking.

In [ ]:
probs = []
for pol in policies:
    print pol
    probs.append(policy_eval(pol, 100000))
probs.append(pstick)
In [7]:
op = np.matmul(returns, dprobs)  # R D: expected return for each player final value, given the dealer's showing card

Let's take a look at the value function for a policy of always sticking.

In [8]:
expecteds = np.matmul(pstick.T, op)  # v_stick = P_stick^T R D
In [25]:
mlab.clf()
X = np.arange(1, 11, 1)
Y = np.arange(1, 22, 1)
X, Y = np.meshgrid(X, Y)
mlab.mesh(X, Y, 5*expecteds[1:22, 1:11])
Out[25]:
In [16]:
expected_returns = [np.matmul(prob.T, op) for prob in probs]
In [17]:
best_pols = []
for i in range(1, 11):
    best_pols.append(np.argmax([np.mean(rtn[1:11, i]) for rtn in expected_returns]))
In [18]:
best_pols
Out[18]:
[29, 29, 24, 23, 23, 23, 18, 12, 7, 6]
In [21]:
best_returns = np.array([expected_returns[col][1:, idx+1] for (idx, col) in enumerate(best_pols)]).T

Now we can plot the value function for the optimal policy along with the value function for the always sticking policy and see how much better we do.

In [39]:
mlab.clf()
X = np.arange(1, 11, 1)
Y = np.arange(1, 22, 1)
X, Y = np.meshgrid(X, Y)
mlab.mesh(X, Y, 5*expecteds[1:22, 1:11])
mlab.mesh(X, Y, 5*best_returns)
Out[39]:

Now we can also find the hit surface: the value of hitting from a given state and then following our optimal policy afterwards.

In [60]:
p_card_value = np.array(10*[1/30.0] + [0] + 10*[2/30.0])  # P(drawn card changes the sum by d), d = -10..10: red (subtract) 1/3 * 1/10 each, black (add) 2/3 * 1/10 each
In [143]:
hit_values = np.zeros((21, 10))
for j in range(10):
    for i in range(21):
        # First term: expected return of continuing with the optimal policy after a non-busting card.
        # Second term: -1 times the probability that the drawn card takes the sum outside [1, 21].
        hit_values[i, j] = (np.sum(optimal_values[(i-10)*(i-10>0):11+i, j] * p_card_value[(10-i)*(10-i>0):31-i])
                            + np.sum(-1 * p_card_value[((41-i)%31-10)*((41-i)%31-10>0):(42-i)%32]))
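
As a readability check, here is an equivalent but more explicit way of computing the same hit values, assuming row i corresponds to a player sum of i + 1 and p_card_value[d + 10] is the probability of a card that changes the sum by d.

In [ ]:
# Equivalent, more explicit computation of the hit values above (a readability
# check; the indexing assumptions are stated in the lead-in).
hit_values_check = np.zeros((21, 10))
for j in range(10):                       # dealer shows j + 1
    for i in range(21):                   # player sum is i + 1
        total = 0.0
        for d in range(-10, 11):
            if d == 0:
                continue
            p = p_card_value[d + 10]
            new_sum = (i + 1) + d
            if 1 <= new_sum <= 21:
                total += p * optimal_values[new_sum - 1, j]   # keep playing optimally
            else:
                total += p * -1.0                             # went bust
        hit_values_check[i, j] = total
assert np.allclose(hit_values_check, hit_values)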
In [156]:
qvalues = np.concatenate((np.array([hit_values]), np.array([stick_values])), axis=0).T
In [151]:
optimal_vals = qvalues.max(axis=2).T
In [153]:
mlab.clf()
X = np.arange(1, 11, 1)
Y = np.arange(1, 22, 1)
X, Y = np.meshgrid(X, Y)
mlab.mesh(X, Y, 5*hit_values)
mlab.mesh(X, Y, 5*stick_values)
mlab.mesh(X, Y, 5*optimal_values)
Out[153]:

The mean return for the optimal policy across all possible starting states is

In [123]:
np.mean(best_returns[:10, :])
Out[123]:
0.061907533098000025

So with the optimal policy it is possible to do about 6.2% better than the dealer (a mean return of about +0.062 per game, averaged over all starting states). We will save that so we have a benchmark for how well our Reinforcement Learning agents are doing.

In [157]:
np.savetxt("hit_values.txt", hit_values)
np.savetxt("optimal_values.txt", best_returns)
np.savetxt("stick_values.txt", expecteds[1:22, 1:11])
In [5]:
hit_values = np.loadtxt("hit_values.txt")
optimal_values = np.loadtxt("optimal_values.txt")
stick_values = np.loadtxt("stick_values.txt")
qvalues = np.concatenate((np.array([stick_values]), np.array([hit_values])), axis=0).T
In [6]:
best_returns = optimal_values
expecteds = np.zeros((22, 11))
expecteds[1:22, 1:11] = stick_values
In [7]:
qvals = np.zeros((11, 22, 2))
In [8]:
qvals[1:11, 1:22, :] = qvalues
In [21]:
from mpl_toolkits.mplot3d import Axes3D
X = np.arange(1, 22, 1)
Y = np.arange(1, 11, 1)
X, Y = np.meshgrid(X, Y)

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d', azim=-40, elev=45)
#ax1 = fig.add_subplot(gs[0], projection='3d', azim=-40, elev=45)
ax.plot_surface(Y, X, qvals[1:, 1:, 1], linewidth=0.5, shade=True,
                 cmap=cm.coolwarm, edgecolor=".7")
# ax.plot_surface(Y, X, qvals[1:, 1:, 0], linewidth=0.5, shade=True,
#                  cmap=cm.coolwarm, edgecolor=".7")
ax.set_zlim3d(bottom=-1.0, top=1.0)
ax.set_xticks([1, 4, 7, 10])
ax.set_yticks([1, 5, 9, 13, 17, 21])
ax.set_zticks([-1.0, -0.5, 0.0, 0.5, 1.0])
ax.set_xlabel('Dealer Card')
ax.set_ylabel('Player Cards Value')
ax.set_zlabel('Expected Return')
ax.set_title('Hit Value Surface')
Out[21]:
Text(0.5,0.92,u'Hit Value Surface')
In [30]:
fig = plot_q(qvals)
  • Unfortunately, matplotlib does not handle overlapping 3D surfaces well. The hit surface should appear poking through the stick surface wherever hitting is better.
  • This looks quite different from what we were getting from the Monte Carlo controlled RL, so let's compare a little more closely, plotting the optimal surface against a Monte Carlo RL optimal surface using Mayavi.
In [47]:
run easy21.py -a mc -e 1000000 -d
100%|██████████| 1000000/1000000 [04:35<00:00, 3633.18it/s]
In [26]:
import pickle
with open("q_mc_lam1.0_eps1000000.pkl", 'r') as f:
    mc_values = pickle.load(f)
In [27]:
mlab.clf()
X = np.arange(1, 11, 1)
Y = np.arange(1, 22, 1)
X, Y = np.meshgrid(X, Y)
mlab.mesh(X, Y, 5*mc_values.max(axis=2)[1:11, 1:22].T, opacity=.8)
mlab.mesh(X, Y, 5*optimal_values, representation='wireframe', colormap="bone")
Out[27]:
In [31]:
from IPython.display import Image
Image("mc_optimal_compare.png")
Out[31]:

We see that the Monte Carlo prediction for optimal values is consistently low for the areas where the policy calls to hit. Let's continue and see if other methods improve upon this and if we can understand how MC falls short.
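
One rough way to quantify this gap is the mean-squared error against the benchmark action values. This is an additional check not run above, and it assumes mc_values uses the same (dealer, player, action) layout and action ordering as qvals.

In [ ]:
# Mean-squared error between the Monte Carlo action values and the benchmark
# q-values; assumes the same (dealer, player, action) layout and action order.
mse = np.mean((mc_values[1:11, 1:22, :] - qvals[1:11, 1:22, :])**2)
print("MSE vs. benchmark: %.4f" % mse)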

3 TD Learning in Easy21
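
The Sarsa agent does backward-view TD learning with eligibility traces, controlled by the -l (lambda) flag. As a rough reference, a single Sarsa($\lambda$) step with accumulating traces might look like the sketch below; it uses a fixed step size for brevity (rather than the $1/N(s, a)$ schedule) and is not the SarsaPlayer code.

In [ ]:
from collections import defaultdict

# Sketch of one backward-view Sarsa(lambda) update with accumulating traces.
# Q and E map (state, action) -> float; gamma is 1 for the episodic Easy21 game.
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done, alpha, lam, gamma=1.0):
    target = r if done else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]        # TD error
    E[(s, a)] += 1.0                  # accumulate the trace for the visited pair
    for key in list(E.keys()):        # propagate the error along all traces
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam         # decay traces (reset E between episodes)
    return Q, E

Q, E = defaultdict(float), defaultdict(float)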

In [34]:
run easy21.py -a mc -e 100000 -c 1000 -p mpl
100%|██████████| 100000/100000 [00:32<00:00, 3031.57it/s]
In [35]:
run easy21.py -a sarsa -e 100000 -c 1000 -p mpl
100%|██████████| 100000/100000 [00:42<00:00, 2327.11it/s]
In [38]:
run easy21.py -a sarsa -e 1000 -m 10
100%|██████████| 1000/1000 [00:00<00:00, 1928.40it/s]
(10 further progress bars, one per run, omitted)
In [40]:
run easy21.py -a sarsa -e 10000 -l 0 -m 100
100%|██████████| 10000/10000 [00:04<00:00, 2322.29it/s]
(100 further progress bars, one per run, omitted)
In [41]:
run easy21.py -a sarsa -e 10000 -l 0 -c 100
100%|██████████| 10000/10000 [00:04<00:00, 2178.14it/s]
In [42]:
run easy21.py -a sarsa -e 10000 -l 1.0 -c 100
100%|██████████| 10000/10000 [00:05<00:00, 1973.21it/s]