GNU Parallel as a poor man's scraper
This is a simple pattern I like to use when I want to fetch data from a website/API, and a quick and dirty approach is sufficient.
Before reading further, be aware that this post is about the GNU version of parallel. This will not work with moreutils' parallel.
$ parallel --version
GNU parallel 20221122
[...]
See it in action
As an example, let's download replies from a "Who is hiring?" post on HN.
Now you can simply grep to find your next role!
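For instance, once the script below has populated the posts/ directory, a case-insensitive search might look like this ('remote' is just a placeholder term):

$ grep -ril 'remote' posts/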
TL;DR
The invocation parallel --bar --jobs N does the heavy lifting of scheduling the jobs, while curl and jq handle the fetching and extracting.
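Stripped down to its bare shape, the pattern is roughly the following one-liner (ids.txt and the example.com endpoint are placeholders; the real script is below):

$ cat ids.txt | parallel --bar --jobs 4 'curl -fsS "https://example.com/api/{}.json" | jq -r .text > {}.txt'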
The script
run.sh
#!/usr/bin/env bash
set -euo pipefail
# export `URL` so that it's available for parallel's subprocesses
export URL=https://hacker-news.firebaseio.com/v0/item
function maybe_download_item {
item_id=$1
item_url=${URL}/${item_id}.json
item_path=posts/${item_id}.txt
# download item if it doesn't already exist
if [[ ! -f $item_path ]]; then
curl -fsS "$item_url" | jq -r .text > "$item_path"
sleep 0.5 # be gentle with the servers
fi
}
# export the function to use it with GNU Parallel
export -f maybe_download_item
# get reply ids from API
curl -fsS "$URL/43858554.json" | jq -r '.kids[]' > replies.txt
mkdir -p posts
# replace `head -n 20` with `cat` to actually download *all* replies
head -n 20 replies.txt | parallel --bar --jobs 4 maybe_download_item {}
How it works
Here's what's happening:
- --bar displays a progress bar.
- --jobs 4 runs 4 jobs in parallel.
- export-ing the function makes it visible to parallel's spawned subprocesses.
  - Alternatively, you could make it a separate script and invoke it with ... | parallel --bar --jobs 4 ./maybe_download_item.sh, but keeping everything in a single script is convenient for throwaway code. A sketch of that variant follows this list.
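A minimal sketch of that separate-script variant, assuming the same API and directory layout as run.sh (maybe_download_item.sh is simply the name used in the bullet above, and it needs to be executable):

#!/usr/bin/env bash
# maybe_download_item.sh - standalone variant of the exported function
set -euo pipefail
# nothing is exported to this process, so define the base URL here
URL=https://hacker-news.firebaseio.com/v0/item
item_id=$1
item_path=posts/${item_id}.txt
# download the item only if we don't already have it
if [[ ! -f $item_path ]]; then
  curl -fsS "${URL}/${item_id}.json" | jq -r .text > "$item_path"
  sleep 0.5 # be gentle with the servers
fi

It is then invoked exactly as in the bullet above: head -n 20 replies.txt | parallel --bar --jobs 4 ./maybe_download_item.sh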
More useful flags
Some other useful flags:
- --timeout 30s: terminate the job if it takes longer than 30 seconds.
- --delay 0.5s: wait half a second before kicking off a job.
  - Though I much prefer having sleep inside the function, and only triggered when a file actually needs downloading, so that if all files are already downloaded no waiting is necessary.
- --retries 3: retry failed jobs up to 3 times.
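Put together, a slightly more defensive version of the last line of run.sh might look like this (the 30s and 3 are just the values from the list above):

head -n 20 replies.txt | parallel --bar --jobs 4 --timeout 30s --retries 3 maybe_download_item {}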
Caveats
This only downloads the direct replies to the main post; nested replies aren't downloaded. If you need a proper scraper, where each job may spawn multiple new jobs, this pattern may fall short.
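If one extra level of nesting is enough, a second pass over the replies can get you there: ask the API for each direct reply's own kids, then push those ids through the same function. A rough sketch, appended to the end of run.sh (nested_replies.txt is a made-up name):

# collect the children of each direct reply; one extra API call per reply
while read -r item_id; do
  curl -fsS "$URL/${item_id}.json" | jq -r '.kids[]?'
  sleep 0.5
done < replies.txt > nested_replies.txt

# download them with the same exported function
head -n 20 nested_replies.txt | parallel --bar --jobs 4 maybe_download_item {}

Anything deeper than that and you're effectively writing a crawler by hand, at which point a dedicated scraping tool is a better fit.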