Teaching Deep RL To Checkmate (from scratch)

Tags
Computer Science
RL
Published
November 29, 2025
Author
Baris Ozakar
Check out the full report and code here:

Teaching Deep RL To Checkmate (from scratch)

How strong can a deep RL agent get if you strip chess to the bare minimum, build your own neural nets in NumPy, and skip autograd entirely?
That was the goal of this project: train an agent to play a tiny chess variant on a 4×4 board where
  • The agent controls a king and a queen
  • The opponent has only a king and moves randomly among legal moves
  • The game ends in checkmate or draw
The agent sees the full board and makes one move per turn.

The Environment As An MDP

I modeled the game as a Markov Decision Process:
  • State
    • Piece positions encoded as binary planes
    • Extra features describing the opponent king's freedom of movement
  • Action
    • Discrete set of legal king and queen moves (flattened to a fixed action space)
  • Transition
    • Agent moves, environment updates, opponent king makes a random legal move
  • Reward
    • +1 for checkmate
    • -1 for draw (a crucial design choice)
    • 0 otherwise
    • Early experiments showed something important, summarized in the table below.

      |  | Random Agent (draw = 0) | Random Agent (draw = -1) |
      | --- | --- | --- |
      | Average Reward | 0.20 | -0.60 |
      | Draw % | 80.1% | 80.1% |
      | Checkmate % | 19.9% | 19.9% |

      💡
      The table shows that if draws are treated as zero reward, a random agent gets a misleadingly positive expected value (0.199·(+1) + 0.801·0 ≈ 0.20). Once draws are punished with -1, the baseline drops to 0.199·(+1) + 0.801·(-1) ≈ -0.60, which correctly reflects how poorly a random policy converts its material advantage.
      This change becomes the backbone of later results.
  • Discount factor
    • Tuned, but not the main performance driver
Q values are approximated with a small fully connected neural network since the state space is too large for a tabular method.
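To make the state description concrete, here is a minimal sketch of that kind of encoding: one binary 4×4 plane per piece, plus a mobility feature for the opponent king. The function name and the exact mobility feature are illustrative assumptions, not the report's exact definition.
```python
import numpy as np

def encode_state(agent_king, agent_queen, opp_king, opp_king_moves):
    """Encode piece positions as binary 4x4 planes plus a mobility feature.

    Each position is a (row, col) tuple; opp_king_moves is the number of legal
    squares the opponent king could move to (a stand-in for the "freedom of
    movement" features mentioned above).
    """
    planes = np.zeros((3, 4, 4), dtype=np.float32)
    for plane, square in zip(planes, (agent_king, agent_queen, opp_king)):
        plane[square] = 1.0                     # one-hot plane per piece
    mobility = np.array([opp_king_moves / 8.0], dtype=np.float32)  # normalized
    return np.concatenate([planes.ravel(), mobility])              # flat input vector

# Example: agent king and queen near the center, opponent king boxed in a corner
s = encode_state((3, 0), (2, 2), (0, 3), opp_king_moves=1)
print(s.shape)   # (49,) -> 3 * 16 plane entries + 1 mobility feature
```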

Reward Shaping: Why Draws Must Hurt

Initially draws had reward 0. For the side with a queen advantage, a draw is clearly bad, not neutral, but the agent could not see that.
Changing the draw reward from 0 to -1 completely changed behavior:
  • A random agent
    • Checkmates roughly 20 percent of games
    • Draws roughly 80 percent
    • Expected reward drops sharply when draws become -1
  • A SARSA agent
    • With draw = 0, checkmates about half of the games
    • With draw = -1, checkmates roughly three quarters of the games
|  | SARSA (draw = 0) | SARSA (draw = -1) |
| --- | --- | --- |
| Average # of Moves | 10.9 | 31.1 |
| Draw % | 48.7% | 24.6% |
| Checkmate % | 51.3% | 75.4% |
The average number of moves per game also increases, meaning the agent learns to fight for a win instead of drifting toward a draw.
💡
Punishing draws pushes the agent to actively hunt checkmate instead of drifting toward stalemates. In practice, this mattered more than fine tuning or network size.
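In code, the whole design choice boils down to a single switch in the terminal reward. A minimal sketch (the function name and outcome strings are illustrative):
```python
def terminal_reward(outcome, draw_penalty=-1.0):
    """Reward assignment with the draw penalty exposed as a knob.

    draw_penalty=0.0 reproduces the original, misleading baseline;
    draw_penalty=-1.0 is the shaping that made the agents hunt for mate.
    """
    if outcome == "checkmate":
        return 1.0
    if outcome == "draw":        # stalemate or move-limit draw
        return draw_penalty
    return 0.0                   # non-terminal step
```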

Agent 1: SARSA With Function Approximation

The first agent uses semi-gradient SARSA with a neural network as the Q-function approximator:
  • Two hidden layers of 200 units
  • Sigmoid activations
  • Manual backprop only on the chosen action's output
  • Epsilon-greedy policy with a decaying exploration rate ε
This on policy setup learns from the actual behavior policy, exploration included.
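Putting the pieces together, the core update looks roughly like the sketch below: a forward pass through the MLP, a one-step on-policy TD target, and manual backprop that only flows through the chosen action's output. Parameter names, shapes, and the default hyperparameters here are my assumptions, not the exact code from the report.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(params, s):
    """Q(s, .) for a 2-hidden-layer sigmoid MLP, plus activations for backprop."""
    h1 = sigmoid(params["W1"] @ s + params["b1"])
    h2 = sigmoid(params["W2"] @ h1 + params["b2"])
    q = params["W3"] @ h2 + params["b3"]      # linear output, one value per action
    return q, (h1, h2)

def sarsa_update(params, s, a, r, s_next, a_next, done, gamma=0.85, lr=1e-4):
    """Semi-gradient SARSA step; only the chosen action's output gets a gradient."""
    q, (h1, h2) = forward(params, s)
    q_next, _ = forward(params, s_next)
    target = r if done else r + gamma * q_next[a_next]   # on-policy bootstrap
    td_error = target - q[a]

    # Gradient of 0.5 * td_error^2 w.r.t. the output vector: zero everywhere
    # except the chosen action.
    dq = np.zeros_like(q)
    dq[a] = -td_error
    dh2 = (params["W3"].T @ dq) * h2 * (1 - h2)   # sigmoid derivative
    dh1 = (params["W2"].T @ dh2) * h1 * (1 - h1)

    params["W3"] -= lr * np.outer(dq, h2)
    params["b3"] -= lr * dq
    params["W2"] -= lr * np.outer(dh2, h1)
    params["b2"] -= lr * dh2
    params["W1"] -= lr * np.outer(dh1, s)
    params["b1"] -= lr * dh1
    return td_error
```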

Exploration decay rate affects performance

Increasing the decay rate makes ε shrink faster, so the agent becomes greedy sooner (a typical schedule is sketched after the list below).
As the decay rate increases:
  • Reward per episode increases
  • Checkmate percentage rises
  • Number of moves increases
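The schedule itself can be as simple as an exponential interpolation between a high starting ε and a small floor. The exact form and constants used in the project may differ, so treat this as one plausible shape of the knob being tuned:
```python
import numpy as np

def epsilon(episode, decay_rate, eps_start=1.0, eps_min=0.05):
    """Exponentially decaying exploration rate; larger decay_rate -> greedy sooner."""
    return eps_min + (eps_start - eps_min) * np.exp(-decay_rate * episode)

# e.g. epsilon(10_000, decay_rate=5e-4) is already close to eps_min,
# while epsilon(10_000, decay_rate=5e-5) still explores a fair amount.
```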
|  | Slow ε decay | Medium ε decay | Fast ε decay |
| --- | --- | --- | --- |
| Average # of Moves | 19.6 | 31.1 | 32.9 |
| Draw % | 31.7% | 24.6% | 21.2% |
| Checkmate % | 68.3% | 75.4% | 78.8% |
Once the agent has a good mating plan, it keeps using it, but sometimes needs more moves to maneuver against random play.

SGD vs Adam

I trained SARSA with both vanilla SGD and a hand written Adam optimizer.
|  | SGD SARSA | Adam SARSA |
| --- | --- | --- |
| Average # of Moves | 31.1 | 9.4 |
| Draw % | 24.6% | 27.3% |
| Checkmate % | 75.4% | 72.7% |
  • SARSA + SGD
    • Higher checkmate rate
    • Longer games
  • SARSA + Adam
    • Slightly lower checkmate rate
    • Much shorter games, often finishing in under 10 moves
🔥
Adam produces a more decisive style of play. If you care about fast tactical wins instead of absolute win percentage, this variant is attractive.
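For reference, a hand-rolled Adam looks roughly like this in NumPy. The betas and epsilon below are the standard Adam defaults (not necessarily the project's values), and the 5e-5 learning rate is the Adam SARSA setting from the final comparison table:
```python
import numpy as np

class Adam:
    """Minimal Adam optimizer for a dict of NumPy parameter arrays."""

    def __init__(self, params, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = {k: np.zeros_like(v) for k, v in params.items()}  # 1st moment
        self.v = {k: np.zeros_like(v) for k, v in params.items()}  # 2nd moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for k in params:
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * grads[k]
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * grads[k] ** 2
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```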

Agent 2: Double DQN

The second family of agents uses Double Deep Q-Networks (DDQN), which address maximization bias by separating action selection from action evaluation:
  • Online network selects the greedy action
  • Target network evaluates that action
  • Target network is updated from the online network every fixed number of episodes
Training is off policy with epsilon greedy behavior.
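The decoupling is easiest to see in the target computation. A sketch, assuming q_online(s) and q_target(s) each return the full vector of Q-values for a state (the names are illustrative):
```python
import numpy as np

def ddqn_target(r, s_next, done, q_online, q_target, gamma=0.85):
    """Double DQN bootstrap target for a single transition."""
    if done:
        return r
    a_star = int(np.argmax(q_online(s_next)))    # online net selects the action
    return r + gamma * q_target(s_next)[a_star]  # target net evaluates it
```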

Experience replay and vectorization

I used an experience replay buffer and minibatch updates:
  • Buffer stores recent transitions
  • Start training after a warm up period
  • Minibatch Q learning updates drawn from replay
The first implementation updated one transition at a time in Python loops. The second version vectorized the NN forward and backward passes for full minibatches.
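The two ingredients look roughly like this: a replay buffer that hands back stacked arrays, and a batched forward pass where each layer is a single matrix multiply over the whole minibatch. Buffer capacity, batch size, and shapes are assumptions for illustration:
```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done   # stacked arrays, ready for batched math

def forward_batch(params, S):
    """Vectorized forward pass: S is (batch, features), output is (batch, actions)."""
    H1 = 1.0 / (1.0 + np.exp(-(S @ params["W1"].T + params["b1"])))
    H2 = 1.0 / (1.0 + np.exp(-(H1 @ params["W2"].T + params["b2"])))
    return H2 @ params["W3"].T + params["b3"]
```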
The vectorized Double DQN:
  • Trained almost twice as fast
    • Non-vectorized: 3 hours 5 minutes
    • Vectorized: 1 hour 46 minutes
  • Reached a higher checkmate rate
  • Showed more stable learning
 
|  | DDQN SGD Non-vectorized | DDQN Minibatch Vectorized |
| --- | --- | --- |
| Number of episodes | 100,000 | 100,000 |
| Iterations per second | 8.99 it/s | 15.59 it/s |
| Total execution time | 3:05:24 | 1:46:55 |
| Checkmate % | 75.4% | 84.1% |
More interesting than the speedup, the table also shows a clear jump in checkmate rate:
💡
The performance gain is due to both better hardware utilization and more consistent gradient estimates.

Tuning exploration decay and the discount factor

For Double DQN, a faster epsilon decay (a larger decay rate):
  • Increased checkmate rate
  • Made games longer on average, similar to SARSA
Different discount factors in a reasonable range had only minor effects once the agent was capable of reliably forcing checkmate.

Putting It All Together

Best Pure Performance

Vectorized DDQN, with the hyperparameters listed in the comparison table below:
  • 86.8% checkmate
  • Average reward: 0.74
  • Longest average games (around 53 moves)

Most Efficient Player

Adam SARSA
  • Only 9.4 moves per game
  • Still maintains a 72.7% checkmate rate
|  | Vectorized DDQN | Adam SARSA |
| --- | --- | --- |
| Learning rate | 1e-4 | 5e-5 |
| Discount factor | 0.85 | 0.85 |
| Average # of Moves | 53.2 | 9.4 |
| Draw % | 13.2% | 27.3% |
| Checkmate % | 86.8% | 72.7% |
| Quality | Best Performance | Most Efficient |
So:
  • If you care about raw win rate in this toy environment, Double DQN wins
  • If you want quick, sharp games with strong but slightly less optimal play, SARSA with Adam is surprisingly good
From a practical point of view:
⚠️
  • On policy SARSA is more conservative and better aligned with real world settings where exploration is risky
📈
  • Off policy Double DQN is great when you can simulate cheaply and push hard for optimality

What I Learned

A few broader lessons came out of this mini chess experiment:
  1. Reward design is critical
    • A single change, turning draws into negative outcomes, transformed the agent's behavior far more than algorithmic tweaks.
  2. Writing your own deep RL stack is painful and worth it
    • Implementing NN training, SARSA, Double DQN, replay buffers and optimizers by hand gives real intuition for what libraries do under the hood.
  3. Vectorization is not an optional optimization
    • Moving from per-sample loops to minibatch matrix operations almost halved training time and improved results.
  4. On policy vs off policy is not just theory
    • Even in a tiny board game, SARSA and Double DQN learn qualitatively different styles, reflecting their underlying learning dynamics.
Mini chess on 4×4 squares will not beat Stockfish, but as a sandbox for understanding deep reinforcement learning end to end, it turned out to be a surprisingly rich playground.