Scraping Goodreads Sitemaps with Haskell
Recently I acquired a project, Now What Do I Read?. One of the steps in acquiring the data was to scrape the sitemaps of Goodreads. The project was written in Go, and I decided to rewrite it in Haskell, mostly to mix it up a bit. Writing everything in Python and JavaScript gets old eventually, quick as it may be.
This will probably turn into a series in which I go through the rest of the process of rewriting the project in Haskell. The intended audience is those with a passing familiarity with Haskell.
If you want to follow along, scroll down to the working example and delete everything under the imports and language pragmas. You’ll also want these packages in your package.yaml:
- optparse-applicative
- http-conduit
- filepath
- directory
- bytestring
- text
Overview
Goodreads has about 400 sitemaps that we’re interested in. Within these sitemaps are a lot of book URLs. This is a minified example of what that looks like:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:content="http://www.google.com/schemas/sitemap-content/1.0">
  <url>
    <loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>
    ... more info
  </url>
  ... more urls
</urlset>
What we want to do is download all the sitemaps, and then parse the XML to find all the URLs for books (/book/show). Additionally, we want it to be snappy (my first approach would have taken >3 hours to find all the URLs, more on that later).
Structure
The existing project had what I thought was a neat structure. There was a single binary, and the step that got executed was based on what flags were passed. I liked that approach so I’m stealing it with the Haskell rewrite. So at the end we’ll be able to run:
$ nowwhatdoiread --downloadsitemaps
$ nowwhatdoiread --parsesitemapurls
We’ll be using optparse-applicative to parse the command line arguments.
We’ll need a datatype for our flags, which is pretty self-explanatory:
data Opts = Opts {
  optDownloadSitemaps :: Bool,
  optParseSitemapUrls :: Bool
}
We’ll need a function to get these options:
parseCLI :: IO Opts
parseCLI = execParser $ info parseOptions (header "nowwhatdoiread")
The execParser function takes a ParserInfo a and returns an IO a. The a in this case is the Opts datatype we defined above. The parseOptions function isn’t defined yet; we’ll need to create a function of type Parser Opts, which turns out to be pretty straightforward:
parseOptions :: Parser Opts
parseOptions = do
  shouldDownloadsitemaps <- switch (long "downloadsitemaps")
  shouldParsesitemaps <- switch (long "parsesitemapurls")
  return $ Opts {
      optDownloadSitemaps = shouldDownloadsitemaps,
      optParseSitemapUrls = shouldParsesitemaps
    }
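As a small aside, optparse-applicative can also generate a --help screen for us. That’s not something the original program does, but a minimal sketch (the progDesc text here is made up) would wrap the parser with helper and add a couple of InfoMod modifiers:
-- A hypothetical variant of parseCLI (not used in the rest of this post) that
-- adds a generated --help screen; helper, fullDesc and progDesc all come from
-- Options.Applicative.
parseCLIWithHelp :: IO Opts
parseCLIWithHelp = execParser $
  info (parseOptions <**> helper)
       (fullDesc <> header "nowwhatdoiread" <> progDesc "Scrape Goodreads sitemaps")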
Now that we’ve got a way to parse options, we need to decide what to do based on which options are enabled. The when function from the Control.Monad module is perfect for our needs, so we’ll define our main function thusly:
main = do
  opts <- parseCLI
  when (optDownloadSitemaps opts) downloadSitemaps
  when (optParseSitemapUrls opts) getAllBookUrls
Here we’re just getting the opts from parseCLI in the first line. Then, if the download flag is set, we call downloadSitemaps, and if the parse flag is set, we call getAllBookUrls. Both of these, as you may be able to guess, are going to be of type IO (), once we define them later.
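One thing to note: if you run the binary with no flags at all, it just exits without doing anything. That’s fine for our purposes, but if you wanted a hint printed instead, a small tweak to main (my own sketch, not part of the original) could use unless from the same Control.Monad module:
main :: IO ()
main = do
  opts <- parseCLI
  -- Nudge the user if neither flag was passed, instead of exiting silently.
  unless (optDownloadSitemaps opts || optParseSitemapUrls opts) $
    putStrLn "Nothing to do; try --downloadsitemaps or --parsesitemapurls"
  when (optDownloadSitemaps opts) downloadSitemaps
  when (optParseSitemapUrls opts) getAllBookUrls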
Preparing to download the sitemaps
First, let’s define a couple “constants”:
sitemapRange = [1..3]
sitemapsDirectory = "data/sitemaps"
The sitemapRange value declares the indices of the sitemaps we’ll download. There are hundreds, but downloading them all takes a while, so we’ll say we only want a few. sitemapsDirectory is where we’ll be downloading the sitemaps to.
Before we download the sitemaps, we’re going to need a list of the URLs:
sitemaps :: [String]
sitemaps = map sitemapUrl sitemapRange
  where
    sitemapUrl i = "https://www.goodreads.com/sitemap." ++ show i ++ ".xml.gz"
We’ll also define a helper function to give us the file path of a saved sitemap:
sitemapFilepath i = joinPath [sitemapsDirectory, show i ++ "_sitemap.txt"]
This is using the joinPath function from the System.FilePath.Posix module to give us a FilePath.
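To make that concrete, here’s roughly what these definitions give us in GHCi (with sitemapRange = [1..3]):
ghci> head sitemaps
"https://www.goodreads.com/sitemap.1.xml.gz"
ghci> sitemapFilepath 1
"data/sitemaps/1_sitemap.txt"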
Downloading the sitemaps
Now, we need a function to actually download the list of URLs we generated in the last step. Here’s that function:
downloadSitemaps = mapM_ downloadSitemap $ zip [1..] sitemaps
  where
    downloadSitemap (i, url) = do
      putStrLn $ "Downloading sitemap " ++ show i
      let path = sitemapFilepath i
      createDirectoryIfMissing True $ takeDirectory path
      simpleHttp url >>= B.writeFile path
We’re using mapM_ here to run a function that returns IO () over a list of tuples. These tuples are the index of the sitemap and the sitemap URL itself. We need the index to construct the download path, and the URL to, well, download the sitemap.
The createDirectoryIfMissing function will create the directories we need, and the True flag makes it recursive, so that both data/ and data/sitemaps/ are created.
The simpleHttp function is the easiest way I found to download something in Haskell; it comes from the Network.HTTP.Conduit module. Given a URL, it will download a ByteString. We use the writeFile function from the Data.ByteString.Lazy module to write that to our file path.
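One caveat worth mentioning: simpleHttp throws an HttpException if a request fails, which would abort the whole run. The code above doesn’t guard against that, but if you wanted to skip failed downloads instead, a sketch using try (it needs an extra import Control.Exception (try); HttpException is re-exported by Network.HTTP.Conduit) might look like this:
-- Hypothetical variant of the download step that logs a failure and moves on
-- rather than letting an HttpException abort the run.
downloadSitemapSafe :: (Int, String) -> IO ()
downloadSitemapSafe (i, url) = do
  let path = sitemapFilepath i
  createDirectoryIfMissing True $ takeDirectory path
  result <- try (simpleHttp url) :: IO (Either HttpException B.ByteString)
  case result of
    Left err   -> putStrLn $ "Failed to download sitemap " ++ show i ++ ": " ++ show err
    Right body -> B.writeFile path body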
Parsing the sitemaps
The sitemaps that Goodreads exposes are large: each contains 50,000 entries and 33 million characters. I initially tried to parse them using an XML parser, but that took ~30s per sitemap. After that I committed the cardinal sin of using regex to parse XML. At the small price of my soul, I shaved off most of the time, down to a few seconds per sitemap.
That was still painfully slow across hundreds of sitemaps, so I resorted to throwing together some text processing functions from Data.Text. That got me another 10x speedup; each file now takes about 0.3 seconds to parse. Ripgrep manages to do it 10x faster still, at about 0.03 seconds per file, but matching that is a problem for another day.
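If you want to reproduce those rough timings yourself, a small helper along these lines works; it assumes the time and deepseq packages (neither is otherwise needed by the project) and forces the full result so laziness doesn’t make the parse look instant:
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Rough wall-clock timing of the parse step for a single downloaded sitemap.
timeParse :: FilePath -> IO ()
timeParse f = do
  contents <- TIO.readFile f
  start <- getCurrentTime
  _ <- evaluate (force (findBookUrls contents))
  end <- getCurrentTime
  putStrLn $ f ++ " parsed in " ++ show (diffUTCTime end start)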
First, we need a function that gets the sitemaps from the file system, calls our parsing function, then writes those URLs to a new file:
getAllBookUrls :: IO ()
getAllBookUrls = mapM_ (getUrlsFromFile . sitemapFilepath) sitemapRange
  where
    getUrlsFromFile f = do
      b <- TIO.readFile f
      TIO.appendFile "data/bookurls.txt" $ T.unlines (findBookUrls b)
This is pretty unexciting; the parsing logic is in the findBookUrls function that gets called on the last line:
findBookUrls :: T.Text -> [T.Text]
findBookUrls = map getUrlFromLine . filter (T.isInfixOf "/book/") . T.lines
  where
    getUrlFromLine = T.replace "</loc>" "" . T.replace "<loc>" "" . T.strip
First, we filter all the lines for the "/book/" substring. All the book URLs we’re interested in will have this substring. Now we’re left with lines like this:
<loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>
The getUrlFromLine function will do the work of extracting that inner text. First, we strip the whitespace from the sides, then we replace the opening and closing tags with empty strings.
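So, for the example line above, the whole pipeline hands us back just the URL (using T.pack here since the literal is a String in GHCi):
ghci> findBookUrls (T.pack "  <loc>https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders</loc>")
["https://www.goodreads.com/book/show/3730.The_Hidden_Persuaders"]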
Full working example
The full working example code is the following:
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import Options.Applicative
import Control.Monad
import qualified Data.Char as C
import System.FilePath.Posix
import System.Directory
import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as B
import qualified Data.Text.IO as TIO
import qualified Data.Text as T
data Opts = Opts {
  optDownloadSitemaps :: Bool,
  optParseSitemapUrls :: Bool
}
main = do
  opts <- parseCLI
  when (optDownloadSitemaps opts) downloadSitemaps
  when (optParseSitemapUrls opts) getAllBookUrls
parseCLI :: IO Opts
parseCLI = execParser $ info parseOptions (header "nowwhatdoiread")
parseOptions :: Parser Opts
parseOptions = do
  shouldDownloadsitemaps <- switch (long "downloadsitemaps")
  shouldParsesitemaps <- switch (long "parsesitemapurls")
  return $ Opts {
      optDownloadSitemaps = shouldDownloadsitemaps,
      optParseSitemapUrls = shouldParsesitemaps
    }
sitemapsDirectory = "data/sitemaps"
sitemapRange = [1..3]
sitemaps :: [String]
sitemaps = map sitemapUrl sitemapRange
  where
    sitemapUrl i = "https://www.goodreads.com/sitemap." ++ show i ++ ".xml.gz"
sitemapFilepath i = joinPath [sitemapsDirectory, show i ++ "_sitemap.txt"]
downloadSitemaps :: IO ()
downloadSitemaps = mapM_ downloadSitemap $ zip [1..] sitemaps
  where
    downloadSitemap (i, url) = do
      putStrLn $ "Downloading sitemap " ++ show i
      let path = sitemapFilepath i
      createDirectoryIfMissing True $ takeDirectory path
      simpleHttp url >>= B.writeFile path
getAllBookUrls :: IO ()
getAllBookUrls = do
  let fs = map sitemapFilepath sitemapRange
  mapM_ getUrlsFromFile fs
  where
    getUrlsFromFile f = do
      b <- TIO.readFile f
      TIO.appendFile "data/bookurls.txt" $ T.unlines (findBookUrls b)
findBookUrls :: T.Text -> [T.Text]
findBookUrls = map getUrlFromLine . filter (T.isInfixOf "/book/") . T.lines
  where
    getUrlFromLine = T.replace "</loc>" "" . T.replace "<loc>" "" . T.strip
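Assuming this lives in a stack project with an executable named nowwhatdoiread (adjust to taste if you use cabal or a different name), running both steps looks something like:
$ stack build
$ stack exec nowwhatdoiread -- --downloadsitemaps
$ stack exec nowwhatdoiread -- --parsesitemapurls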