GNU Parallel as a poor man's scraper

· 2min · Giovanni Carvalho

This is a simple pattern I like to use when I want to fetch data from a website/API, and a quick and dirty approach is sufficient.

Before reading further, be aware that this post is about the GNU version of parallel. This will not work with moreutils' parallel.

$ parallel --version
GNU parallel 20221122
[...]

See it in action

As an example, let's download replies from a "Who is hiring?" post on HN.

Now you can simply grep to find your next role!
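
For instance, assuming the replies end up under posts/ as in the script below, a quick case-insensitive search for remote roles looks like this (the keyword is only an illustration):

$ grep -ril remote posts/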

TL;DR

The invocation parallel --bar --jobs N does most of the heavy lifting for scheduling the jobs, while curl and jq handle the fetching and extracting.
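
Condensed to a single line, the pattern is roughly this sketch (the endpoint, ids.txt, and the jq filter are placeholders, not the real HN API used below):

$ parallel --bar --jobs 4 'curl -fsS "https://example.com/item/{}.json" | jq -r .text > posts/{}.txt' < ids.txt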

The script

run.sh

#!/usr/bin/env bash
set -euo pipefail

# export `URL` so that it's available for parallel's subprocesses
export URL=https://hacker-news.firebaseio.com/v0/item

function maybe_download_item {
    local item_id=$1
    local item_url=${URL}/${item_id}.json
    local item_path=posts/${item_id}.txt

    # download item if it doesn't already exist
    if [[ ! -f "$item_path" ]]; then
        curl -fsS "$item_url" | jq -r .text > "$item_path"
        sleep 0.5  # be gentle with the servers
    fi
}

# export the function to use it with GNU Parallel
export -f maybe_download_item

# get reply ids from API
curl -fsS "$URL/43858554.json" | jq -r '.kids[]' > replies.txt

mkdir -p posts

# replace `head -n 20` with `cat` to actually download *all* replies
head -n 20 replies.txt | parallel --bar --jobs 4 maybe_download_item {}
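
Assuming curl, jq, and GNU parallel are installed, a run plus a quick sanity check looks like this (the file count is only a rough expectation):

$ bash run.sh
$ ls posts | wc -l   # 20, if the first 20 replies all downloaded

Re-running the script is cheap: replies that already exist under posts/ are skipped.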

How it works

Here's what's happening:

  • --bar displays a progress bar.
  • --jobs 4 runs 4 jobs in parallel.
  • export-ing the function makes it visible to the subprocesses that parallel spawns.
    • Alternatively, you could put it in a separate script and invoke it with ... | parallel --bar --jobs 4 ./maybe_download_item.sh, as sketched after this list, but keeping everything in a single script is convenient for throwaway code.
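
A minimal sketch of that standalone variant, reusing the same exported URL and posts/ layout as run.sh (the filename maybe_download_item.sh is just the one from the invocation above):

maybe_download_item.sh

#!/usr/bin/env bash
set -euo pipefail

# same logic as the exported function: $1 is the item id;
# URL must still be exported by the caller (or hard-coded here)
item_id=$1
item_url=${URL}/${item_id}.json
item_path=posts/${item_id}.txt

# download item if it doesn't already exist
if [[ ! -f "$item_path" ]]; then
    curl -fsS "$item_url" | jq -r .text > "$item_path"
    sleep 0.5  # be gentle with the servers
fi

Remember to chmod +x maybe_download_item.sh before pointing parallel at it.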

More useful flags

A few other flags come in handy (combined into a single invocation after this list):

  • --timeout 30s: terminate the job if it takes longer than 30 seconds.
  • --delay 0.5s: wait half a second before kicking off a job.
    • That said, I prefer keeping the sleep inside the function, triggered only when a file actually needs downloading, so a rerun with everything already cached doesn't wait at all.
  • --retries 3: retry failed jobs up to 3 times.
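
Put together with the invocation from run.sh, those flags might be combined like this (the values are simply the ones listed above):

head -n 20 replies.txt | parallel --bar --jobs 4 --timeout 30s --delay 0.5s --retries 3 maybe_download_item {}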

Caveats

This only downloads the direct replies to the main post; nested replies are not fetched. If you need a proper scraper, where each job may spawn new jobs of its own, this pattern may fall short.