Background

I run a chess site, https://chessbook.com. We ingest a ton of games, nearly 6 billion, to power our various statistics. Our current database has about 900 million lines. We’re working on our next evolution of this database, and that entails re-ingesting all these games, but going much deeper on each game.

Our existing solution was already pretty optimized; you can't get through 1.6 terabytes of data in a reasonable timeframe without some optimization. But we needed to squeeze even more performance out of it this time, since we were bumping up the amount of data extracted from each of these 6 billion games. We needed everything to fit in memory, and we needed the job to finish in a reasonable timeframe. I don't want to wait around for weeks for this to finish, only to notice a data issue at the end and have to wait weeks again as we re-run everything. Quick feedback loops are essential to development; ideally, I wanted this ingestion job to take less than a day.

So I took our existing script, and bumped up the depth at which it indexed games. This is the journey of everything that broke along the way, and how I fixed it.

The program, in broad strokes

The script takes a bunch of PGN files, splits them up across ${available_cores}, generates some data from them (win-rates of different openings, master play-rates, play-rates by Elo range, etc.), runs some post-processing, and saves all of this data to our Postgres database.
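
As a rough sketch of the fan-out (process_pgn_file here is just a placeholder for the real per-file work, not our actual code):

use std::path::{Path, PathBuf};
use std::thread;

// Placeholder for the real per-file work: parse games, update the shared stats tree.
fn process_pgn_file(_path: &Path) {}

fn ingest(paths: Vec<PathBuf>) {
    let cores = thread::available_parallelism().map_or(1, usize::from);
    let chunk_size = ((paths.len() + cores - 1) / cores).max(1);
    thread::scope(|scope| {
        // Hand each worker thread its own slice of the file list.
        for chunk in paths.chunks(chunk_size) {
            scope.spawn(move || {
                for path in chunk {
                    process_pgn_file(path);
                }
            });
        }
    });
    // ...then post-processing, and the write to Postgres.
}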

The core data structure is a tree of moves, and at each node there are some statistics about how often that line is played, the win-rates at each level, etc etc. Each node in this tree is wrapped in an Arc<Mutex<...>>, so that multiple threads can get references to, lock, and write to these shared data structures.
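
In sketch form, with made-up field names rather than our actual schema, a node looked roughly like this:

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Illustrative only: per-node statistics, plus child edges keyed by the move played.
struct NodeStats {
    games: u64,
    white_wins: u64,
    // ...win-rates, play-rates by Elo range, and so on
}

struct Node {
    stats: NodeStats,
    children: HashMap<String, Arc<Mutex<Node>>>,
}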

The journey

Problem: not enough RAM

Up until this generation, I had been carefully tweaking the various parameters so that my M2 MacBook Pro could run this script by itself, barely squeaking by on 64GB of RAM. But if I wanted to index more games, I knew this wasn’t going to cut it.

Solution: dedicated server

I got a dedicated server, with 256GB of RAM, 80 cores, and a ton of disk space.

I would have liked to use my cloud provider, but getting these sorts of specs with cloud providers is ridiculously expensive. I’m paying about $220/mo for this machine, I’m sure this would be >$1,000 per month on any cloud provider.

Problem: CPU is vastly under-utilized

I downloaded 1.6 terabytes of Lichess game data, all the games from This Week in Chess, a 60-million game OTB database, and historical chess games, onto the new server. I ran the new script, crossed my fingers… then saw that it was going to take weeks to finish – orders of magnitude slower than my M2.

After a long investigation that honestly shouldn’t have even been needed, I figured out the issue. Mutexes (Mutices?). I had some common data structures that all threads wrote to, by acquiring locks on the mutexes. On my 10-thread machine the contention was negligible, but on this 80-thread powerhouse, the program was spending most of its time waiting to lock the data it needed. CPU utilization was under 5% on every thread.

Solution: DashMap, AtomicI32

I looked around for other data structures that would avoid the problem of a bunch of threads locking the same shared state. I came across DashMap, a concurrent HashMap. I swapped it in wherever we used HashMaps, and it worked a treat. CPU utilization went up by quite a bit.

There were still gains to be had though. Each node has a bunch of data points that look roughly like this:

struct ContinuationMeta {
    white_wins: i32,
    black_wins: i32,
    draws: i32,
}

And I was using DashMap's get_mut to access these so I could increment the field I needed to. get_mut takes a write lock on the shard holding that entry, though. So we've split up the HashMap, but we're still write-locking parts of it pretty often.
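
The hot path looked roughly like this (a sketch, with a hypothetical map from moves to their stats):

use dashmap::DashMap;

// Hypothetical: one entry per continuation, keyed by the move.
fn record_white_win(stats: &DashMap<String, ContinuationMeta>, mv: &str) {
    // get_mut write-locks the shard containing this key for as long as the borrow lives.
    if let Some(mut meta) = stats.get_mut(mv) {
        meta.white_wins += 1;
    }
}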

So I adjusted this data structure to use AtomicI32 instead:

struct ContinuationMeta {
    white_wins: AtomicI32,
    black_wins: AtomicI32,
    draws: AtomicI32,
}

Now I just need a reference to the struct, not a mutable reference, and I can increment the data like so:

match self.game_result.as_ref().unwrap() {
    GameResult::White => {
        meta.white_wins.fetch_add(1, atomic::Ordering::Relaxed);
    }
    GameResult::Black => {
        meta.black_wins.fetch_add(1, atomic::Ordering::Relaxed);
    }
    GameResult::Draw => {
        meta.draws.fetch_add(1, atomic::Ordering::Relaxed);
    }
};

I can be relaxed about the atomic addition, because nothing depends on the order of these increments and I don't actually need to read the value back while processing; the totals only matter once all the threads are done.

Problem: Not enough RAM, Pt. 2

I ran the script again. I saw 80 cores pegged at 100%. I thought this was the end. It wasn’t even close to the end. I left the script to run for 30 minutes or so, and noticed the RAM was way higher than it should be for how little data it had ingested. Orders of magnitude too high. This was only happening on my server, not on my local machine. What was going on?

Skipping some painful investigation… this was the problem:

fn default_shard_amount() -> usize {
    static DEFAULT_SHARD_AMOUNT: OnceCell<usize> = OnceCell::new();
    *DEFAULT_SHARD_AMOUNT.get_or_init(|| {
        (std::thread::available_parallelism().map_or(1, usize::from) * 4).next_power_of_two()
    })
}

DashMap shards its underlying HashMaps into multiple RwLock<HashMap>s. The number of shards to use is based on the available parallelism. My server has 80 cores, so DashMap was deciding to shard into 512 pieces (80 × 4 = 320, rounded up to the next power of two). Every time I wanted a single map, DashMap was creating 512 HashMaps under the hood.

Solution Part 1: Hello DashMap, goodbye DashMap

You can specify the capacity and shard amount when constructing a DashMap, using with_capacity_and_shard_amount. So I tried all sorts of schemes to reduce sharding at different depths, trying to avoid contention at the higher levels of the tree while keeping the data structures small at the lower levels.
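
For example, something along these lines (the capacities and shard counts here are illustrative, not values I actually settled on):

use dashmap::DashMap;

fn build_maps() {
    // Near the root of the tree: lots of contention, so keep more shards.
    let _root_children: DashMap<String, ContinuationMeta> =
        DashMap::with_capacity_and_shard_amount(1024, 64);

    // Deep in the tree: barely any contention, so keep the map as small as DashMap allows.
    // (The shard amount must be a power of two, and 2 is the minimum.)
    let _leaf_children: DashMap<String, ContinuationMeta> =
        DashMap::with_capacity_and_shard_amount(2, 2);
}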

Skip ahead past some long, painful experimentation: this just wasn't a tenable approach. When you initialize a HashMap sized for a single element in Rust, the actual capacity is 3. The minimum shard amount with DashMap is 2, so for all these tiny leaf nodes, we're allocating 6x as much memory as we need.

Back to the drawing board. DashMap wasn't going to work, but we still needed a way to lock and unlock access to these data structures without a ton of contention. That's when I realized I was being an idiot: when I changed everything to use atomics, that was all I really needed. So I undid the DashMap changes. Now that the counters are atomic, we almost always only need a read lock. So instead of a Mutex, I put a RwLock on our nodes, and thanks to the atomics, all our threads can increment the statistics while holding just a read lock.
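
Roughly, and again as a sketch rather than our exact types, the nodes ended up looking like this:

use std::collections::HashMap;
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::{Arc, RwLock};

struct ContinuationMeta {
    white_wins: AtomicI32,
    black_wins: AtomicI32,
    draws: AtomicI32,
}

struct Node {
    stats: ContinuationMeta,
    children: HashMap<String, Arc<RwLock<Node>>>,
}

fn record_draw(node: &Arc<RwLock<Node>>) {
    // A read lock is enough here: the atomics let many threads increment concurrently,
    // and a write lock is only needed when a new child has to be inserted.
    let node = node.read().unwrap();
    node.stats.draws.fetch_add(1, Ordering::Relaxed);
}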

Solution Part 2: SAN encoding

Our tree of nodes is connected by edges, and these edges are chess moves. Chess moves are generally stored in SAN notation, which isn’t the best in terms of bytes/move. Luckily, I had just been working on a separate project to encode these moves, and can store these using roughly a third of the bytes. Check out this blog post if you’re interested in some more details there.
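
Just as an illustration of the kind of savings available, and not the encoding from that post: a SAN string like "Qxd8+" is 5 bytes, while a from-square, to-square, and promotion piece pack comfortably into 2.

// Generic illustration, not the encoding described in the linked post:
// 6 bits for the from-square, 6 bits for the to-square, 4 bits for a promotion piece.
fn pack_move(from: u8, to: u8, promotion: u8) -> u16 {
    ((from as u16) << 10) | ((to as u16) << 4) | (promotion as u16 & 0xF)
}

fn unpack_move(packed: u16) -> (u8, u8, u8) {
    ((packed >> 10) as u8, ((packed >> 4) & 0x3F) as u8, (packed & 0xF) as u8)
}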

Solution Part 3: HalfBrown

I did some further profiling of allocations as the program ran, and found that HashMap was still taking up too much space. Most of our HashMaps were tiny, with fewer than 4 elements. So I thought, maybe I can just store these in a Vec, and reduce the memory usage. Luckily, someone has already created a crate that gives you the API of HashMap with an underlying storage of Vec (at least until you get to 32 elements, but there are very few chess positions with 32 common continuations). This is the halfbrown library. I subbed this in for our HashMaps, and memory usage went down by 30% or so.
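
Since halfbrown mirrors the std HashMap API, the swap is mostly a type change; a minimal sketch:

use halfbrown::HashMap;
use std::sync::atomic::AtomicI32;

// Same ContinuationMeta as above. halfbrown stores entries in a Vec until the map
// grows past 32 elements, then switches to a real hash table.
fn new_children() -> HashMap<String, ContinuationMeta> {
    let mut children = HashMap::new();
    children.insert(
        "Nf3".to_string(),
        ContinuationMeta {
            white_wins: AtomicI32::new(0),
            black_wins: AtomicI32::new(0),
            draws: AtomicI32::new(0),
        },
    );
    children
}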

Problem: CPU under-utilized, again?

After all this, I ran the script, waited a bit, then saw that the cores were hovering at 10% utilization again. What’s going on? There’s no lock contention, RAM is low, swap is unused, in theory these cores should be able to give it their all.

I ran hdparm, on a whim:

Timing buffered disk reads: 4 MB in  5.40 seconds = 758.68 kB/sec

758 kB/sec? Did my hosting provider set me up with an array of 16,000 floppy disks masquerading as a 1.6TB SATA drive? Actually, it turns out my script was still running in the background, so this benchmark was really telling me that the script had completely saturated the SATA drive storing all of our PGNs. This was sort of mind-blowing to me: to have optimized things to such a degree that I was actually limited by reading gzip-compressed files off of a solid state drive.

Solution: Split up the files

When I ordered the server, I had them set up an 800GB primary disk and a 2TB secondary disk. The secondary disk was where I was storing all the PGNs, but it was the slower of the two. So I created a new directory mounted on the primary disk, and shunted a third of the files over to it. Bash courtesy of Claude.

source_dir="secondary/opening_pgns/lichess_database"
dest_dir="primary/opening_pgns/lichess_database"
log_file="move_files.log"

mkdir -p "$dest_dir"

counter=1
for file in "$source_dir"/*; do
  if [ -e "$file" ]; then
    if [ $((counter % 3)) -eq 0 ]; then
      mv "$file" "$dest_dir"
      echo "$(date) - Moved file: $file to $dest_dir" >> "$log_file"
    fi
    ((counter++))
  fi
done

After this, the CPU utilization shot back up, at least while the script was crunching through files on both disks. I wish I had some faster disks, but we're well past the point of "good enough", so I'm stopping here.

Final results

What was originally going to take weeks to months now takes about 2 hours. We can now ingest about 1.1 million chess games per second.