Generally, chess puzzles will present you with a position and ask you to find the best move. There’s only one, and all other moves are losing. These are great, but I’ve been thinking of a different style of puzzle that I’d like to practice with. What I want is a chess position, then two moves presented as options; one is a blunder, and one is the best move. The idea is to train my ability to spot blunders, like regular puzzles train your ability to find forcing lines.

The approach

If I just set up Stockfish and ask it to find the best and worst moves in a given position, it would usually be trivial to spot which is which. The worst move will almost always be a useless sac. So the plan is to get real blunders, from real Lichess games. I can read through all the moves, and use Stockfish to figure out which moves are blunders and what should have been played instead. Luckily, Lichess publishes a database of every game played, grouped by month.

Reading the database

First off, this thing is huge. Uncompressed, the December games PGN file is 223GB. That’s like half of my entire hard drive.

To read through all these games, I’m using the pgn-reader crate. So I wrote up a simple Visitor to count the moves, like in the example, and get this:

[INFO] Parsed 665565 moves in 1.79 seconds
[INFO] That's 370788 moves/sec

So that’s ridiculously fast. I don’t think that will be our bottleneck, to say the least.

Finding the blunders

The next step is to use Stockfish to identify blunders. For that I’ll be using the uciengine crate. Here’s what the visitor part looks like, for the first approach:

impl Visitor for PuzzleFinderVisitor {
    type Result = PuzzleFinderResults;

    fn header(&mut self, _key: &[u8], _value: RawHeader<'_>) {}

    fn begin_variation(&mut self) -> Skip {
        Skip(true) // stay in the mainline
    }

    async fn san(&mut self, san_plus: pgn_reader::SanPlus) {
        self.halfmoves += 1;
        if let Ok(m) = san_plus.san.to_move(&self.pos) {
            let eval = eval_move(&self.pos, &m, &self.engine).await;
            if let Some(last_eval) = self.last_eval {
                if (last_eval - eval).abs() > 500 {
                    self.puzzle_finder_results.puzzle_positions.push(PotentialPuzzlePosition {
                        fen: Fen::from_setup(&self.pos),
                        blunder: san_plus,
                    });
                }
            }
            self.pos.play_unchecked(&m); // advance the position for the next move
            self.last_eval = Some(eval);
        }
    }

    fn end_game(&mut self) -> PuzzleFinderResults {
        self.puzzle_finder_results.moves_analyzed = self.halfmoves;
        self.halfmoves = 0;
        self.pos = Chess::default();
        self.last_eval = None;
        std::mem::replace(&mut self.puzzle_finder_results, PuzzleFinderResults::new())
    }
}

So I ran the same test with the above visitor:

[INFO] Found 48 blunders, in 576 moves analyzed, in 62.86 seconds
[INFO] That's 9.2 moves/sec, and 0.7 blunders/sec

Alright, so that’s way slower. Like, 40,000 times slower.

Figuring out why Stockfish is slow

Let’s see what the eval function looks like, maybe we can optimize our Stockfish usage.

async fn eval_move(pos: &Chess, m: &Move, engine: &Arc<UciEngine>) -> i32 {
    let fen = Fen::from_setup(pos);
    let uci_move = Uci::from_move(m, CastlingMode::Standard);
    let go_job = GoJob::new()
        .uci_opt("Hash", 1024)
        .uci_opt("Threads", 10)
        .pos_fen(fen.to_string())
        .pos_moves(uci_move.to_string())
        .go_opt("nodes", 50 * 1000);
    let result = engine.go(go_job).await.unwrap();
    to_centipawn(&result) // helper that turns the engine's score (cp or mate) into centipawns
}

Measuring how long this takes for one call, I get about 100ms.

50k nodes was sort of a random choice, so maybe I can reduce that and still get good results. Luckily, someone else has already done the hard work of figuring out what number of nodes corresponds to what Elo. According to this article, Stockfish 13 plays at about the level of a 2500-rated player at 10,000 nodes (with NNUE enabled). So it looks like I can get away with reducing that from 50k to 10k. In theory this should get us to ~45 moves per second. Let’s see how much faster this is…

[INFO] Found 43 blunders, in 576 moves analyzed, in 62.41 seconds
[INFO] That's 9.2 moves/sec, and 0.7 blunders/sec

Hm. That’s… not 5 times faster. It’s barely faster at all, in fact. So I tried running the same command in Stockfish directly:

> stockfish
Stockfish 14.1 by the Stockfish developers (see AUTHORS file)
setoption name Hash value 1024
setoption name Threads value 10
go nodes 10000
info string NNUE evaluation using nn-13406b1dcbe0.nnue enabled
info depth 1 seldepth 1 multipv 1 score cp 38 nodes 140 nps 70000 tbhits 0 time 2 pv d2d4
info depth 2 seldepth 2 multipv 1 score cp 63 nodes 1184 nps 592000 tbhits 0 time 2 pv d2d4 a7a6
info depth 3 seldepth 3 multipv 1 score cp 36 nodes 2057 nps 685666 tbhits 0 time 3 pv c2c4 c7c5 g1f3 b8c6
info depth 4 seldepth 4 multipv 1 score cp 36 nodes 6751 nps 1687750 tbhits 0 time 4 pv c2c4 c7c5 g1f3 b8c6
info depth 5 seldepth 5 multipv 1 score cp 27 nodes 9604 nps 1920800 tbhits 0 time 5 pv c2c4 g8f6 g1f3 c7c5 b1c3 b8c6
info depth 6 seldepth 5 multipv 1 score cp 27 nodes 10013 nps 2002600 tbhits 0 time 5 pv c2c4 g8f6 g1f3 c7c5 b1c3 b8c6
info depth 5 seldepth 6 multipv 1 score cp 27 nodes 10013 nps 2002600 tbhits 0 time 5 pv e2e4 e7e6
bestmove e2e4 ponder e7e6

Stockfish, in that last info message, reports its own time as 5ms, which is 20x faster than what I’m getting from the Rust side. I went and enabled some debug logging in the uciengine crate to figure out the issue. Here’s that, minus some noise:

[114] [DEBUG] issuing engine command : setoption name Hash value 1024
[114] [DEBUG] write result Ok(())

[114] [DEBUG] issuing engine command : setoption name Threads value 10
[114] [DEBUG] write result Ok(())

[114] [DEBUG] issuing engine command : position fen rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 moves e2e3
[114] [DEBUG] write result Ok(())

[114] [DEBUG] issuing engine command : go nodes 50000
[114] [DEBUG] write result Ok(())

[219] [DEBUG] uci engine out ( 3 ) : info string NNUE evaluation using nn-13406b1dcbe0.nnue enabled
[224] [DEBUG] parse result ... time: 5, nodes: 50062 ...

Using fern, I added the milliseconds since startup to the beginning of each log message. So there’s virtually no time elapsed while setting up the options and position, but after issuing the go command, there’s about 105ms of lag. The info returned in the last line, though, matches up with what I saw when calling Stockfish directly: about 5ms.

Skipping over all the failed debugging paths I took, so that I look smart in this post, I immediately thought, “I bet Stockfish is just still dealing with the previous commands, to set the position and options, before outputting its first info message”. Sure enough, if I remove the calls to uci_opt, I get much faster results:

[INFO] Found 37 blunders, in 576 moves analyzed, in 7.49 seconds
[INFO] That's 77.0 moves/sec, and 5 blunders/sec

Alright, huge improvement: we’re up from 9 to 77 moves evaluated per second. But we can go even faster if we can get those options back without paying the cost on every move, by setting them once up front. So I added a check_ready function to the UciEngine, which just issues an isready command to Stockfish and waits for readyok to come back. Then I called that before doing all the blunder-finding stuff:

let engine = UciEngine::new("stockfish");
let setup_job = GoJob::new()
    .uci_opt("Hash", 1024)
    .uci_opt("Threads", 10);
let result = engine.check_ready(setup_job).await.unwrap();

And now we’re really moving:

[INFO] Found 41 blunders, in 576 moves analyzed, in 0.88 seconds
[INFO] That's 655.3 moves/sec, and 47 blunders/sec

Using false positives for even faster results

Alright so now we’ve got the equivalent of a GM looking at every move, and figuring out whether it’s a blunder. But what if we could have the equivalent of 10 club players identifying blunders, and the GM only needs to step in when one of those players finds something? Bit of a stretched analogy, but the idea is to bump down the nodes from 10,000 to 1,000. If a move is identified as a blunder at 1,000 nodes, then we validate that result with the 10,000 node, GM-equivalent engine. Just to validate the idea, here’s what the speed looks like at 1,000 nodes:

[INFO] Found 102 blunders, in 576 moves analyzed, in 0.41 seconds
[INFO] That's 1404.9 moves/sec, and 249 blunders/sec

Excellent, this gives us a 2.14× speedup. It’s an upper bound though, because we’re not doing the validation step yet. And the validation step actually incurs twice the cost: the GM Stockfish has to confirm not only the evaluation after the suspected blunder, but also the evaluation of the position before it. Maybe the club Stockfish thinks a certain move is a blunder, but the GM Stockfish knows that the game was already lost.
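As a sketch, the two-tier check boils down to something like this. The closures stand in for eval_move at 1,000 and 10,000 nodes; the names and the 500-centipawn threshold are illustrative, not the actual code:

```rust
/// Two-tier blunder check: a cheap, club-level evaluation screens every
/// move, and the expensive GM-level evaluation only runs on candidates.
/// `cheap` returns the club-level eval of the position after the move;
/// `strong` returns GM-level evals of the (previous, current) positions.
fn is_real_blunder<F, G>(last_eval: i32, cheap: F, strong: G) -> bool
where
    F: FnOnce() -> i32,
    G: FnOnce() -> (i32, i32),
{
    const THRESHOLD: i32 = 500; // centipawn swing that counts as a blunder

    if (last_eval - cheap()).abs() <= THRESHOLD {
        // The club engine sees no big swing, so we never pay for the
        // expensive evaluation. This is where false negatives creep in.
        return false;
    }

    // Candidate blunder: validate *both* evaluations at GM strength,
    // since the club engine may have misjudged either position.
    let (prev, cur) = strong();
    (prev - cur).abs() > THRESHOLD
}
```

A flagged move only counts if the GM-level swing also clears the threshold; that's what separates real blunders from false positives.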

With the GM Stockfish stepping in as necessary, we get:

[INFO] Found 51 real blunders, 35 false positives, in 576 moves analyzed, in 0.65 seconds
[INFO] That's 880.7 moves/sec, and 78 blunders/sec

About 37% slower than with just the club-level Stockfish, but we’re still finding blunders about 65% faster than before this optimization (78/second vs 47/second).

Of course, there are false negatives too, but I think among the 95 million games there will be enough blunders to keep me (and Stockfish) occupied for a while.


The hash size was also chosen somewhat arbitrarily, so I tried upping that. I get diminishing returns past ~8000, so let’s make it a clean 8192:

let engine = UciEngine::new("stockfish");
let setup_job = GoJob::new()
    .uci_opt("Threads", 10)
    .uci_opt("Hash", 8192);
let result = engine.check_ready(setup_job).await.unwrap();

This gives a ~15% speedup:

[INFO] Found 45 real blunders, 36 false positives, in 576 moves analyzed, in 0.50 seconds
[INFO] That's 1142.9 moves/sec, and 89 blunders/sec

In short

In terms of blunders found per second, which is ultimately the goal, we went from 0.7 initially, to 5 after fixing the per-move option latency, to 47 after restoring the threads and hash options up front, to 78 after using a club-level Stockfish to find candidate blunders, to 89 after upping the hash size considerably. 127 times faster than baseline isn’t too bad. 89 blunders found per second is way faster than I thought I could get.

Of course, this is all overkill. I just want ~10,000 puzzle positions to add to my site, and I could have gotten that by just letting the original approach run for ~4 hours. Now it will find those 10,000 blunders in less than 2 minutes instead. Not necessary, but kind of awesome.

Other potential optimizations

  • “Give up” on games where one side has an overwhelming advantage. If white is up a queen and I know white wins the game, there probably aren’t any good blunders to be found in the rest of the game.
  • If the eval of a position is, say, mate in 3, I could skip evaluating the next 3 moves, and only go back to those skipped moves if the game doesn’t end in 3 moves. This would only save a couple of moves per game on average.
  • Use an opening book, only evaluating moves once the game is “out of book”. This could save 10+ evaluations per game, especially in higher-rated games.
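The first of these would be cheap to implement. A minimal sketch, where the 900-centipawn window (roughly a queen) and the names are arbitrary choices of mine:

```rust
/// "Give up" heuristic: once the evaluation leaves this window, the game
/// is effectively decided and the remaining moves aren't worth engine time.
const RESIGN_WINDOW: i32 = 900; // centipawns, roughly a queen

fn worth_analyzing(eval: i32) -> bool {
    eval.abs() < RESIGN_WINDOW
}
```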