Check out the full report and code here:
Teaching Deep RL To Checkmate (from scratch)
How strong can a deep RL agent get if you strip chess to the bare minimum, build your own neural nets in NumPy, and skip autograd entirely?
That was the goal of this project: learn to play a tiny chess variant on a 4×4 board where
- The agent controls a king and a queen
- The opponent has only a king and moves randomly among legal moves
- The game ends in checkmate or draw
The agent sees the full board and makes one move per turn.
The Environment As An MDP
I modeled the game as a Markov Decision Process:
- State
  - Piece positions encoded as binary planes
  - Extra features describing the opponent king's freedom of movement
- Action
  - Discrete set of legal king and queen moves (flattened to a fixed action space)
- Transition
  - Agent moves, environment updates, opponent king makes a random legal move
- Reward (sketched in code below)
  - +1 for checkmate
  - -1 for draw (a crucial design choice)
  - 0 otherwise
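To make the reward design concrete, here is a rough sketch of the terminal-reward logic; `board`, `is_checkmate`, and `is_draw` are hypothetical stand-ins, not the project's actual environment API:

```python
# Hedged sketch of the terminal-reward logic described above.
# `board` and the helper predicates are illustrative stand-ins,
# not the project's actual environment API.
def terminal_reward(board) -> float:
    if board.is_checkmate():   # opponent king is mated
        return +1.0
    if board.is_draw():        # stalemate / move-limit draw
        return -1.0            # punishing draws is the key design choice
    return 0.0                 # non-terminal step
```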
Early experiments showed something important.
|                | Random Agent (draw = 0) | Random Agent (draw = -1) |
|----------------|-------------------------|--------------------------|
| Average Reward | 0.20                    | -0.60                    |
| Draw %         | 80.1%                   | 80.1%                    |
| Checkmate %    | 19.9%                   | 19.9%                    |
The table shows that if draws are treated as zero reward, a random agent gets a misleadingly positive expected reward. Once draws are punished with -1, the baseline becomes negative, which correctly reflects that drawing with a whole extra queen is a failure rather than a neutral result.
This change becomes the backbone of later results.
- Discount factor
  - Tuned, but not the main performance driver
Q values are approximated with a small fully connected neural network since the state space is too large for a tabular method.
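For orientation, a Q-network of this kind can be written in plain NumPy roughly as follows; the layer sizes, initialization, and input/output dimensions are illustrative assumptions rather than the report's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch: a small fully connected Q-network in raw NumPy,
# two sigmoid hidden layers, one Q-value output per action.
# All sizes are assumptions for illustration.
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 58, 200, 64
W1 = rng.normal(0, 0.1, (n_state, n_hidden));   b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_hidden));  b2 = np.zeros(n_hidden)
W3 = rng.normal(0, 0.1, (n_hidden, n_actions)); b3 = np.zeros(n_actions)

def q_values(state_vec):
    """Forward pass: state features -> a Q-value for every action."""
    h1 = sigmoid(state_vec @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    return h2 @ W3 + b3
```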
Reward Shaping: Why Draws Must Hurt
Initially draws had reward 0. For the side with a queen advantage, a draw is clearly bad, not neutral, but the agent could not see that.
Changing the draw reward from 0 to -1 completely changed behavior:
- A random agent
  - Checkmates roughly 20 percent of games
  - Draws roughly 80 percent
  - Its expected reward drops sharply when draws become -1
- A SARSA agent
  - With draw = 0, checkmates about half of the games
  - With draw = -1, checkmates roughly three quarters of the games
|                    | SARSA (draw = 0) | SARSA (draw = -1) |
|--------------------|------------------|-------------------|
| Average # of Moves | 10.9             | 31.1              |
| Draw %             | 48.7%            | 24.6%             |
| Checkmate %        | 51.3%            | 75.4%             |
The average number of moves per game also increases, meaning the agent learns to fight for a win instead of drifting toward a draw. Punishing draws pushes it to actively hunt for checkmate, and in practice this single change mattered more than fine-tuning or network size.
Agent 1: SARSA With Function Approximation
The first agent uses semi-gradient SARSA with a neural network:
- Two hidden layers of 200 units
- Sigmoid activations
- Manual backprop only on the chosen action's output
- Epsilon-greedy policy with a decaying exploration rate ε
This on policy setup learns from the actual behavior policy, exploration included.
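To make that concrete, a semi-gradient SARSA update for a single transition (s, a, r, s', a') might look like the sketch below, shown for a one-hidden-layer network for brevity; only the chosen action's output contributes to the gradient, and the learning rate and discount are illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch of a semi-gradient SARSA update for one transition.
# params = (W1, b1, W2, b2) for a one-hidden-layer NumPy Q-network;
# gamma and lr are illustrative, not the report's exact settings.
def sarsa_update(params, s, a, r, s_next, a_next, done, gamma=0.85, lr=1e-3):
    W1, b1, W2, b2 = params
    h = sigmoid(s @ W1 + b1)                 # forward pass for current state
    q = h @ W2 + b2
    h_next = sigmoid(s_next @ W1 + b1)       # bootstrap from the action
    q_next = h_next @ W2 + b2                # actually taken next (on-policy)
    target = r if done else r + gamma * q_next[a_next]
    td_error = target - q[a]
    # Backprop only through the chosen action's output unit
    dh = td_error * W2[:, a] * h * (1 - h)   # gradient into the hidden layer
    W1 += lr * np.outer(s, dh)               # SGD on 0.5 * td_error**2,
    b1 += lr * dh                            # treating the target as constant
    W2[:, a] += lr * td_error * h
    b2[a]    += lr * td_error
    return td_error
```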
Exploration decay rate affects performance
Increasing the decay rate makes ε shrink faster, so the agent becomes greedy sooner. As the decay rate increases:
- Reward per episode increases
- Checkmate percentage rises
- Number of moves increases
|                    | Slower decay | Baseline decay | Faster decay |
|--------------------|--------------|----------------|--------------|
| Average # of Moves | 19.6         | 31.1           | 32.9         |
| Draw %             | 31.7%        | 24.6%          | 21.2%        |
| Checkmate %        | 68.3%        | 75.4%          | 78.8%        |
Once the agent has a good mating plan, it keeps using it, but sometimes needs more moves to maneuver against random play.
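One common way to implement such a schedule (the report's exact functional form may differ) is an exponential decay of ε toward a floor value:

```python
import numpy as np

# Hedged sketch of an exponentially decaying epsilon schedule.
# eps_start, eps_min and decay_rate are illustrative values; a larger
# decay_rate makes the agent greedy sooner, as discussed above.
def epsilon(episode, eps_start=1.0, eps_min=0.05, decay_rate=5e-5):
    return eps_min + (eps_start - eps_min) * np.exp(-decay_rate * episode)

# e.g. epsilon(0) == 1.0, while epsilon(100_000) is already close to eps_min
```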
SGD vs Adam
I trained SARSA with both vanilla SGD and a hand-written Adam optimizer.
|                    | SGD SARSA | Adam SARSA |
|--------------------|-----------|------------|
| Average # of Moves | 31.1      | 9.4        |
| Draw %             | 24.6%     | 27.3%      |
| Checkmate %        | 75.4%     | 72.7%      |
- SARSA + SGD
  - Higher checkmate rate
  - Longer games
- SARSA + Adam
  - Slightly lower checkmate rate
  - Much shorter games, often finishing in under 10 moves
Adam produces a more decisive style of play. If you care about fast tactical wins instead of absolute win percentage, this variant is attractive.
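For reference, a minimal hand-rolled Adam step in NumPy looks roughly like this; the formulation is the standard one with bias correction, and the hyperparameters are the usual defaults rather than the report's settings:

```python
import numpy as np

# Hedged sketch of a hand-written Adam update for one parameter array.
# Standard Adam with bias correction; hyperparameters are the common defaults.
class Adam:
    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)   # first-moment (mean) estimate
        self.v = np.zeros(shape)   # second-moment (uncentered variance) estimate
        self.t = 0                 # step counter

    def step(self, param, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad**2
        m_hat = self.m / (1 - self.b1**self.t)   # bias-corrected moments
        v_hat = self.v / (1 - self.b2**self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```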
Agent 2: Double DQN
The second family of agents uses Double Deep Q-Networks (DDQN), which address maximization bias by separating action selection from action evaluation:
- Online network selects the greedy action
- Target network evaluates that action
- Target network is updated from the online network every fixed number of episodes
Training is off policy with epsilon greedy behavior.
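The resulting target for a transition (s, a, r, s') can be sketched as follows, with `q_online` and `q_target` standing in for the two networks' forward passes (returning a vector of Q-values over all actions):

```python
import numpy as np

# Hedged sketch of the Double DQN target for one transition.
# q_online and q_target are assumed to map a state to a vector of
# Q-values over all actions; gamma is an illustrative discount factor.
def double_dqn_target(r, s_next, done, q_online, q_target, gamma=0.85):
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))    # online net selects the action
    return r + gamma * q_target(s_next)[a_star]  # target net evaluates it
```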
Experience replay and vectorization
I used an experience replay buffer and minibatch updates:
- Buffer stores recent transitions
- Start training after a warm up period
- Minibatch Q learning updates drawn from replay
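A minimal uniform replay buffer along these lines might look like this; the capacity and batch size are illustrative choices:

```python
import random
from collections import deque

# Hedged sketch of a uniform experience replay buffer.
# Capacity and batch size are illustrative choices, not the report's values.
class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```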
The first implementation updated one transition at a time in Python loops. The second version vectorized the NN forward and backward passes for full minibatches.
The vectorized Double DQN:
- Trained almost twice as fast
  - Non-vectorized: 3 hours 5 minutes
  - Vectorized: 1 hour 46 minutes
- Reached a higher checkmate rate
- Showed more stable learning
|                       | DDQN (SGD, non-vectorized) | DDQN (minibatch, vectorized) |
|-----------------------|----------------------------|------------------------------|
| Number of episodes    | 100,000                    | 100,000                      |
| Iterations per second | 8.99 it/s                  | 15.59 it/s                   |
| Total execution time  | 3:05:24                    | 1:46:55                      |
| Checkmate %           | 75.4%                      | 84.1%                        |
More interesting than the speedup, the table also shows a jump in playing strength: the checkmate rate rises from 75.4% to 84.1%. The gain comes from both better hardware utilization and more consistent gradient estimates from minibatch updates.
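The heart of the vectorization is simply evaluating the whole minibatch with matrix operations instead of looping over transitions, roughly as below (shown for a one-hidden-layer network; the parameter names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch: per-sample loop vs. vectorized minibatch forward pass.
# W1, b1, W2, b2 are assumed parameters of a one-hidden-layer Q-network.
def q_batch_loop(states, W1, b1, W2, b2):
    # one Python iteration per transition
    return np.stack([sigmoid(s @ W1 + b1) @ W2 + b2 for s in states])

def q_batch_vectorized(states, W1, b1, W2, b2):
    # states has shape (batch, n_state); one matmul covers the whole batch
    H = sigmoid(states @ W1 + b1)
    return H @ W2 + b2
```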
Tuning the exploration decay and discount factor
For Double DQN, a faster epsilon decay (a larger decay rate):
- Increased checkmate rate
- Made games longer on average, similar to SARSA
Different discount factors in a reasonable range had only minor effects once the agent was capable of reliably forcing checkmate.
Putting It All Together
Best Pure Performance
Vectorized DDQN (hyperparameters in the table below)
- 86.8% checkmate
- Average reward: 0.74
- Longest average games (around 53 moves)
Most Efficient Player
Adam SARSA
- Only 9.4 moves per game
- Still maintains a 72.7% checkmate rate
|                        | Vectorized DDQN  | Adam SARSA     |
|------------------------|------------------|----------------|
| Exploration decay rate | 1e-4             | 5e-5           |
| Discount factor        | 0.85             | 0.85           |
| Average # of Moves     | 53.2             | 9.4            |
| Draw %                 | 13.2%            | 27.3%          |
| Checkmate %            | 86.8%            | 72.7%          |
| Quality                | Best Performance | Most Efficient |
So:
- If you care about raw win rate in this toy environment, Double DQN wins
- If you want quick, sharp games with strong but slightly less optimal play, SARSA with Adam is surprisingly good
From a practical point of view:
- On policy SARSA is more conservative and better aligned with real world settings where exploration is risky
- Off policy Double DQN is great when you can simulate cheaply and push hard for optimality
What I Learned
A few broader lessons came out of this mini chess experiment:
- Reward design is critical
A single change, turning draws into negative outcomes, transformed the agent's behavior far more than algorithmic tweaks.
- Writing your own deep RL stack is painful and worth it
Implementing NN training, SARSA, Double DQN, replay buffers and optimizers by hand gives real intuition for what libraries do under the hood.
- Vectorization is not an optional optimization
Moving from per sample loops to minibatch matrix operations almost halved training time and improved results.
- On policy vs off policy is not just theory
Even in a tiny board game, SARSA and Double DQN learn qualitatively different styles, reflecting their underlying learning dynamics.
Mini chess on 4×4 squares will not beat Stockfish, but as a sandbox for understanding deep reinforcement learning end to end, it turned out to be a surprisingly rich playground.
