Haskell is incredibly powerful and rewarding, but also can be frusturating for beginners as common problems that are easy to solve in more traditional programming languages suddenly appear to be hard again. Due to many available resources not always being easy to understand, there is a lack of beginner resources for these issues, while in other languages there is often an abundance of content. I experienced this myself especially when working with JSON and HTTP. In the articles on this blog, my goal is to expand the amount of beginner resources for common tasks specifially related to web development in Haskell.
In this short series, I will walk you through scraping JSON data from the web using basic libraries, and eventually bringing it into the IHP web framework to be used in an actual application.
The data
Attics contains data about tens of thousands of live concerts and recordings of these concerts from over a dozen bands, which is all sourced from the Internet Archive’s Live Music Archive. The Archive provides a search API to access data about the recordings, but in order to generate the data used in Attics for a performance’s average rating, number of recordings, and band metadata, I needed to scrape data from all the recordings, transform and analyze it, and store it persistantly.
The Archive makes it easy to scrape large amounts of data using their Scrape API.
By making a request to https://archive.org/services/search/v1/scrape
, we get back a collection of items such as
{
"date": "1966-10-02T00:00:00Z",
"coverage": "San Francisco, CA",
"identifier": "gd1966-10-02.sbd.bershaw.9517.shnf",
"venue": "San Francisco State University",
"transferer": "Alan Bershaw",
"downloads": 20676,
"avg_rating": "4.50",
"num_reviews": 2,
"source": "Soundboard"
}
Besides identifier
and date
, all of these fields are optional.
Let’s model this in Haskell as a record named ArchiveItem
.
data ArchiveItem = ArchiveItem
{ identifier :: Text,
date :: Text,
collection :: Maybe Text,
transferer :: Maybe Text,
downloads :: Maybe Int,
source :: Maybe Text,
avgRating :: Maybe Text,
numReviews :: Maybe Int,
lineage :: Maybe Text,
coverage :: Maybe Text,
venue :: Maybe Text
}
deriving (Show, Generic, Eq)
Using the aeson library for JSON, we can define an instance of the type class FromJSON
in order to parse the JSON from the request.
It is possible to auto generate instances for your records, but this gives you much less control of the names of the fields and how the JSON is parsed.
Manually defined, our instance looks like this.
instance FromJSON ArchiveItem where
parseJSON = withObject "ArchiveItem" $ \obj ->
ArchiveItem
<$> obj .: "identifier"
<*> obj .: "date"
<*> obj .:? "collection"
<*> obj .:? "transferer"
<*> obj .:? "downloads"
<*> obj .:? "source"
<*> obj .:? "avg_rating"
<*> obj .:? "num_reviews"
<*> obj .:? "lineage"
<*> obj .:? "coverage"
<*> obj .:? "venue"
This code uses a common pattern with the operators <$>
and <*>
defined for the Applicative
type class in order to
create a record type using monadic actions. This code could be equivalently written:
instance FromJSON ArchiveItem where
parseJSON = withObject "ArchiveItem" $ \obj -> do
identifier <- obj .: "identifier"
date <- obj .: "date"
...
pure $ ArchiveItem identifier date ...
which uses the standard do
notation to parse the fields and create the record object.
The final piece of data we need to model is the response from the endpoint, which contains some metadata about the request, as well
as the items as we showed above inside the items
key:
data ScrapeResponse = ScrapeResponse
{ scrapeItems :: [ArchiveItem],
scrapeCursor :: Maybe Text
}
deriving (Generic)
instance FromJSON ScrapeResponse where
parseJSON = withObject "ScrapeResponse" $ \obj ->
ScrapeResponse
<$> obj .: "items"
<*> obj .:? "cursor"
The cursor
field is included for paginating large queries: we’ll look into this more later.
HTTP requests
Working with HTTP was daunting to me when I started writing Haskell due to the many options available.
Today, we’ll be using http-conduit
, which makes dealing with simple requests easy.
Let’s define an initial function scrape :: Text -> IO [ArchiveItem]
which scrapes all the items for a given collection,
and use http-conduit
to make a request to the endpoint.
scrape :: Text -> IO [ArchiveItem]
scrape' collection =
let url = "https://archive.org/services/search/v1/scrape?fields=avg_rating,venue,coverage,num_reviews,date,downloads,source,transferer,lineage,identifier&q=collection:" <> collection
in do
request <- parseRequest (cs url)
response <- httpJSON request
let ScrapeResponse {..} = getResponseBody response
pure scrapeItems
A couple key points here: parseRequest
parses a URL, given as a String
, into a request object.
It is then passed to httpJSON
, which automatially parses the response body as JSON. We access this body
using getResponseBody
. Note we didn’t explicitlly give httpJSON
a type here:
by assigning the result of getResponseBody response
to a ScrapeResponse
, GHC is automatially able to infer the
types for us. Go Haskell!
To complicate matters, we need to account for a possible cursor
included, which we need to include in the next
request in order to get more results. We can do this with a simple refactor:
scrape :: Text -> IO [ArchiveItem]
scrape t = scrape' t Nothing
scrape' :: Text -> Maybe Text -> IO [ArchiveItem]
scrape' collection cursor =
let baseUrl = "https://archive.org/services/search/v1/scrape?fields=avg_rating,venue,coverage,num_reviews,date,downloads,source,transferer,lineage,identifier&q=collection:" <> collection
url = case cursor of
Just c -> baseUrl <> "&cursor=" <> c
Nothing -> baseUrl
in do
request <- parseRequest (cs url)
putStrLn url
response <- httpJSON request
let ScrapeResponse {..} = getResponseBody response
case scrapeCursor of
Just cursor -> do
rest <- scrape' collection (Just cursor)
return $ scrapeItems ++ rest
Nothing -> return scrapeItems
Here we define a helper function which is called recursively with the cursor if it’s included in the response.
Next
Next time, we’ll be looking at how to integrate this scraping code into IHP to easily store it persistantly. Please comment below with any questions – I’m writing these articles to help those newer to Haskell solve the problems that I struggled to solve as I was learning, so I’d be happy to help with any of the problems you encounter as you’re learning Haskell.