Scraping Goodreads Sitemaps with Haskell

Recently I acquired a project, Now What Do I Read?. One of the steps in acquiring the data was to scrape the sitemaps of Goodreads. The project was written in Go, and I decided to rewrite it in Haskell, mostly to mix it up a bit. Writing everything in Python and JavaScript gets old eventually, quick as it may be.

This will probably turn into a series in which I go through the rest of the process of rewriting the project in Haskell. The intended audience is those with a passing familiarity with Haskell.

If you want to follow along, scroll down to the working example and delete everything under the imports and language pragmas. You'll also want these packages in your package.yaml:

  • optparse-applicative
  • http-conduit
  • filepath
  • directory
  • bytestring
  • text

Overview

Goodreads has about 400 sitemaps that we're interested in. Within these sitemaps are a lot of book urls. This is a minified example of what that looks like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:content="http://www.google.com/schemas/sitemap-content/1.0">
  <url>
    <loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>
    ... more info
  </url>
  ... more urls
</urlset>

What we want to do is download all the sitemaps, and then parse the xml to find all the urls for books (/book/show). Additionally, we want it to be snappy (my first approach would have taken >3 hours to find all the URLs, more on that later).

Structure

The existing project had what I thought was a neat structure. There was a single binary, and the step that got executed was based on what flags were passed. I liked that approach so I'm stealing it with the Haskell rewrite. So at the end we'll be able to run:

$ nowwhatdoiread --downloadsitemaps
$ nowwhatdoiread --parsesitemapurls

We'll be using optparse-applicative to parse the command line arguments.

We'll need a datatype for our flags, which is pretty self-explanatory:

data Opts = Opts {
    optDownloadSitemaps :: Bool,
    optParseSitemapUrls :: Bool
  }

We'll need a function to get these options:

parseCLI :: IO Opts
parseCLI = execParser $ info parseOptions (header "nowwhatdoiread")

The execParser function takes a ParserInfo a and returns an IO a. The a in this case is the Opts datatype we defined above. The parseOptions function isn't defined yet, so we'll need to create a value of type Parser Opts, which turns out to be pretty straightforward:

parseOptions :: Parser Opts
parseOptions = do
    shouldDownloadsitemaps <- switch (long "downloadsitemaps")
    shouldParsesitemaps <- switch (long "parsesitemapurls")
    return $ Opts {
      optDownloadSitemaps=shouldDownloadsitemaps,
      optParseSitemapUrls=shouldParsesitemaps
    }
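The do block above only type-checks because of the ApplicativeDo pragma in the full example: Parser is an Applicative, not a Monad. As a sketch of the same parser without that extension, the two switches can be combined with <$> and <*>. The help strings and the parseArgs helper are my own additions for illustration and are not part of the original project.

```haskell
import Options.Applicative

data Opts = Opts
  { optDownloadSitemaps :: Bool
  , optParseSitemapUrls :: Bool
  } deriving (Show, Eq)

-- Same two flags as above, combined applicatively; no ApplicativeDo needed.
-- The help texts are illustrative assumptions.
parseOptions :: Parser Opts
parseOptions = Opts
  <$> switch (long "downloadsitemaps" <> help "Download the sitemaps")
  <*> switch (long "parsesitemapurls" <> help "Extract book URLs from the downloaded sitemaps")

-- Hypothetical helper: run the parser on a list of arguments without any IO,
-- which is handy for trying it out in GHCi.
parseArgs :: [String] -> Maybe Opts
parseArgs = getParseResult . execParserPure defaultPrefs (info parseOptions mempty)
```

In GHCi, parseArgs ["--downloadsitemaps"] should come back as a Just with only the download flag set, and an unknown flag should give Nothing.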

Now that we've got a way to parse options, we'll need to choose to do something based on which options are enabled. The when function from the Control.Monad module is perfect for our needs, so we'll define our main function thusly:

main = do
  opts <- parseCLI
  when (optDownloadSitemaps opts) downloadSitemaps
  when (optParseSitemapUrls opts) getAllBookUrls

Here we're just getting the opts from parseCLI in the first line. Then, if the download flag is set, we call downloadSitemaps, and if the parse flag is set, we call getAllBookUrls. Both of these, as you may be able to guess, are going to be of type IO () once we define them later.

Preparing to download the sitemaps

First, let's define a couple "constants":

sitemapRange = [1..3]
sitemapsDirectory = "data/sitemaps"

sitemapRange lists the indices of the sitemaps we'll download. There are hundreds, but downloading them all takes a while, so we'll say we only want a few. sitemapsDirectory is where we'll be downloading the sitemaps to.

Before we download the sitemaps, we're going to need a list of the URLs:

sitemaps :: [String]
sitemaps = map sitemapUrl sitemapRange
  where
  sitemapUrl i = "https://www.goodreads.com/sitemap." ++ show i ++ ".xml.gz"

We'll also define a helper function to give us the file path of a saved sitemap:

sitemapFilepath i = joinPath [sitemapsDirectory, show i ++ "_sitemap.txt"]

This uses the joinPath function from the System.FilePath.Posix module (part of the filepath package) to give us a FilePath.
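To make it concrete what these definitions produce, here is a small sketch with explicit type signatures. The sitemapJobs name is hypothetical, and I've used </> instead of joinPath purely as a matter of taste; both come from the same module.

```haskell
import System.FilePath.Posix ((</>))

sitemapsDirectory :: FilePath
sitemapsDirectory = "data/sitemaps"

sitemapRange :: [Int]
sitemapRange = [1..3]

sitemapUrl :: Int -> String
sitemapUrl i = "https://www.goodreads.com/sitemap." ++ show i ++ ".xml.gz"

sitemapFilepath :: Int -> FilePath
sitemapFilepath i = sitemapsDirectory </> (show i ++ "_sitemap.txt")

-- Hypothetical helper: each sitemap index together with the URL it is
-- fetched from and the local path it will be saved to.
sitemapJobs :: [(Int, String, FilePath)]
sitemapJobs = [ (i, sitemapUrl i, sitemapFilepath i) | i <- sitemapRange ]
```

Running mapM_ print sitemapJobs in GHCi shows one triple per sitemap, the first being (1,"https://www.goodreads.com/sitemap.1.xml.gz","data/sitemaps/1_sitemap.txt").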

Downloading the sitemaps

Now, we need a function to actually download the list of urls we generated in the last step. Here's that function:

downloadSitemaps = mapM_ downloadSitemap $ zip sitemapRange sitemaps
  where
    downloadSitemap (i, url) = do
      putStrLn $ "Downloading sitemap " ++ show i
      let path = sitemapFilepath i
      createDirectoryIfMissing True $ takeDirectory path
      simpleHttp url >>= B.writeFile path 

We're using mapM_ here to run a function that returns IO () over a list of tuples. These tuples are the index of the sitemap and the sitemap url itself. We need the index to construct the download path, and the url to, well, download the sitemap.

The createDirectoryIfMissing function will create the directories we need, and the True flag makes it recursive, so that both data/ and data/sitemaps/ are created.

The simpleHttp function is the easiest way I found to download using Haskell. It comes from the Network.HTTP.Conduit module of the http-conduit package. Given a url, it will download the response body as a lazy ByteString. We use the writeFile function from the Data.ByteString.Lazy module to write that to our file path.
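One thing the version above doesn't handle is a failed request: simpleHttp throws an exception, which would stop the whole run partway through hundreds of downloads. Below is a rough sketch of one way to keep going and collect the failures instead. The downloadAll name and the idea of passing the fetch action in as an argument are my own assumptions, not something from the original project. It catches SomeException for brevity, which also swallows things like Ctrl-C, so real code would want to narrow that to the HTTP exception type.

```haskell
import Control.Exception (SomeException, try)
import qualified Data.ByteString.Lazy as B
import System.Directory (createDirectoryIfMissing)
import System.FilePath.Posix (takeDirectory)

-- Hypothetical helper: run the given fetch action for every (url, path)
-- pair, write each successful response to its path, and return the URLs
-- that failed so they can be retried later.
downloadAll :: (String -> IO B.ByteString) -> [(String, FilePath)] -> IO [String]
downloadAll fetch jobs = concat <$> mapM downloadOne jobs
  where
    downloadOne (url, path) = do
      createDirectoryIfMissing True (takeDirectory path)
      -- Catching SomeException is a shortcut for this sketch only.
      result <- try (fetch url >>= B.writeFile path) :: IO (Either SomeException ())
      case result of
        Right () -> pure []
        Left err -> do
          putStrLn ("Failed to download " ++ url ++ ": " ++ show err)
          pure [url]
```

With http-conduit imported, this would be called as downloadAll simpleHttp jobs, where jobs pairs each sitemap url with its file path. Taking the fetch action as a parameter also means the function can be tried out with a fake fetcher that never touches the network.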

Parsing the sitemaps

The sitemaps that Goodreads exposes are large. Each contains 50,000 entries and 33 million characters. I initially tried to parse them using an XML parser, but that took ~30s per sitemap. After that I committed the cardinal sin of using regex to parse XML. At the small price of my soul, I shaved off most of the time, down to a few seconds per sitemap.

That was still painfully slow when parsing hundreds of sitemaps, so I resorted to throwing together some text processing functions from Data.Text. I got another 10x speedup from that: each file takes about 0.3 seconds to parse. Ripgrep manages to do it 10x faster than that, at about 0.03 seconds per file, but matching that is a problem for another day.

First, we need a function that gets the sitemaps from the file system, calls our parsing function, then writes those urls to a new file:

getAllBookUrls :: IO ()
getAllBookUrls = mapM_ (getUrlsFromFile . sitemapFilepath) sitemapRange
  where
    getUrlsFromFile f = do
      b <- TIO.readFile f
      TIO.appendFile "data/bookurls.txt" $ T.unlines (findBookUrls b)

This is pretty unexciting. The parsing logic is in the findBookUrls function that gets called on the last line:

findBookUrls :: T.Text -> [T.Text]
findBookUrls = map getUrlFromLine . filter (T.isInfixOf "/book/") . T.lines
  where
    getUrlFromLine = T.replace "</loc>" "" . T.replace "<loc>" "" . T.strip 

First, we filter all the lines for the "/book/" substring. All the book urls we're interested in will have this substring. Now we're left with lines like this:

    <loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>

The getUrlFromLine function will do the work of extracting that inner text. First, we strip the whitespace from the sides, then we replace the opening and closing tags with empty strings.
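As a quick check of that logic, here is findBookUrls run against a cut-down sitemap. The sample text is made up for illustration: the book entry is the one from the overview, while the author entry is a hypothetical non-book url added to show the filter dropping it. Note that this approach assumes each <loc> element sits on its own line, as it does in the example earlier.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

findBookUrls :: T.Text -> [T.Text]
findBookUrls = map getUrlFromLine . filter (T.isInfixOf "/book/") . T.lines
  where
    -- Trim the indentation, then drop the opening and closing tags.
    getUrlFromLine = T.replace "</loc>" "" . T.replace "<loc>" "" . T.strip

-- Made-up sample input; the author url is hypothetical.
sample :: T.Text
sample = T.unlines
  [ "<urlset>"
  , "  <url>"
  , "    <loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>"
  , "  </url>"
  , "  <url>"
  , "    <loc>https://www.goodreads.com/author/show/1.Example_Author</loc>"
  , "  </url>"
  , "</urlset>"
  ]
```

Evaluating findBookUrls sample in GHCi gives a single-element list containing just the book url, with the tags and indentation stripped.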

Full working example

The full working example code is the following:

{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import Options.Applicative
import Control.Monad
import qualified Data.Char as C
import System.FilePath.Posix
import System.Directory
import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as B
import qualified Data.Text.IO as TIO
import qualified Data.Text as T

data Opts = Opts {
    optDownloadSitemaps :: Bool,
    optParseSitemapUrls :: Bool
  }

main = do
  opts <- parseCLI
  when (optDownloadSitemaps opts) downloadSitemaps
  when (optParseSitemapUrls opts) getAllBookUrls

parseCLI :: IO Opts
parseCLI = execParser $ info parseOptions (header "nowwhatdoiread")

parseOptions :: Parser Opts
parseOptions = do
    shouldDownloadsitemaps <- switch (long "downloadsitemaps")
    shouldParsesitemaps <- switch (long "parsesitemapurls")
    return $ Opts {
      optDownloadSitemaps=shouldDownloadsitemaps,
      optParseSitemapUrls=shouldParsesitemaps
    }

sitemapsDirectory = "data/sitemaps"
sitemapRange = [1..3]

sitemaps :: [String]
sitemaps = map sitemapUrl sitemapRange
  where
  sitemapUrl i = "https://www.goodreads.com/sitemap." ++ show i ++ ".xml.gz"

sitemapFilepath i = joinPath [sitemapsDirectory, show i ++ "_sitemap.txt"]

downloadSitemaps :: IO ()
downloadSitemaps = mapM_ downloadSitemap $ zip sitemapRange sitemaps
  where
    downloadSitemap (i, url) = do
      putStrLn $ "Downloading sitemap " ++ show i
      let path = sitemapFilepath i
      createDirectoryIfMissing True $ takeDirectory path
      simpleHttp url >>= B.writeFile path 

getAllBookUrls :: IO ()
getAllBookUrls = do
    let fs = map sitemapFilepath sitemapRange
    mapM_ getUrlsFromFile fs
  where
    getUrlsFromFile f = do
      b <- TIO.readFile f
      TIO.appendFile "data/bookurls.txt" $ T.unlines (findBookUrls b)

findBookUrls :: T.Text -> [T.Text]
findBookUrls = map getUrlFromLine . filter (T.isInfixOf "/book/") . T.lines
  where
    getUrlFromLine = T.replace "</loc>" "" . T.replace "<loc>" "" . T.strip 