Command Line Interface

spatula provides a command line interface that is useful for iterative development of scrapers.

Once installed within your Python environment, spatula can be invoked on the command line. E.g.:

  (scrape-venv) ~/scrape-proj $ spatula --version
  spatula, version 0.9.0

Or with poetry:

  ~/scrape-proj $ poetry run spatula --version
  spatula, version 0.9.0

The CLI provides four useful subcommands for different stages of development:

spatula

Usage:

spatula [OPTIONS] COMMAND [ARGS]...

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --version | boolean | Show the version and exit. | False |
| --help | boolean | Show this message and exit. | False |

scout

Run first step of scrape & output data to a JSON file.

This command is intended as a first approximation of whether a full scrape needs to be run. If the first layer detects any changes, it is safe to say that the full run will as well.

This works in the common case where a subpage is added or removed. In more advanced cases, it depends on the first page being scraped (typically a ListPage derivative) surfacing enough information (perhaps a last_updated date) to know whether any of the other pages have changed.
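One way to act on this idea is to keep the previous scout output and compare it with a fresh one. The sketch below is only an illustration: the function name `scout_changed` and the file names are hypothetical, and it assumes nothing about the layout of the JSON beyond it being valid JSON.

```python
import json
from pathlib import Path


def scout_changed(previous_path, current_path):
    """Return True when the two scout output files differ in content.

    A missing previous file is treated as "changed", since there is
    nothing to compare against yet.
    """
    previous_file = Path(previous_path)
    if not previous_file.exists():
        return True
    previous = json.loads(previous_file.read_text())
    current = json.loads(Path(current_path).read_text())
    # Compare parsed JSON rather than raw text, so whitespace-only
    # differences do not count as a change.
    return previous != current
```

A wrapper script could run `spatula scout` with `-o` pointing at a new file, call `scout_changed("scout.previous.json", "scout.json")` (hypothetical file names), and only start `spatula scrape` when it returns True.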

Usage:

spatula scout [OPTIONS] INITIAL_PAGE_NAME

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -s, --source | text | Provide (or override) source URL | None |
| -o, --output-file | text | override default output file [default: scout.json]. | scout.json |
| -ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
| --rpm | integer | set requests per minute (default: 60) | 60 |
| --timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
| --verify / --no-verify | boolean | control verification of SSL certs | True |
| --retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
| --retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
| -H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
| -v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
| --fastmode | boolean | use a cache to avoid making unnecessary requests | False |
| --help | boolean | Show this message and exit. | False |
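The HTTP options above (--rpm, -H, and so on) are repeated for every subcommand. To make two of them concrete, here is a small sketch of how a `-H` value in the documented 'Name: value' format splits into a header name and value, and what a requests-per-minute setting means as spacing between requests. Both helper functions are hypothetical illustrations, not spatula's internal code.

```python
def parse_header(raw):
    """Split a 'Name: value' string into a (name, value) tuple."""
    # Split on the first colon only, so values containing ':' survive.
    name, sep, value = raw.partition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'Name: value', got {raw!r}")
    return name.strip(), value.strip()


def seconds_between_requests(rpm):
    """Average spacing implied by a requests-per-minute limit.

    This sketch only handles positive values.
    """
    if rpm <= 0:
        raise ValueError("rpm must be positive in this sketch")
    return 60 / rpm


print(parse_header("Accept: application/json"))  # ('Accept', 'application/json')
print(seconds_between_requests(60))  # 1.0
```

With the default of 60 requests per minute, this works out to roughly one request per second.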

scrape

Run full scrape, and output data to disk.

Usage:

spatula scrape [OPTIONS] INITIAL_PAGE_NAME

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-dir | text | override default output directory. | None |
| --rmdir / --no-rmdir | boolean | remove output directory before scrape. | False |
| -s, --source | text | Provide (or override) source URL | None |
| --dump | text | Specify dump function | json.dump |
| -ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
| --rpm | integer | set requests per minute (default: 60) | 60 |
| --timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
| --verify / --no-verify | boolean | control verification of SSL certs | True |
| --retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
| --retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
| -H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
| -v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
| --fastmode | boolean | use a cache to avoid making unnecessary requests | False |
| --help | boolean | Show this message and exit. | False |

shell

Start a session to interact with a particular page.

Usage:

spatula shell [OPTIONS] URL

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -X, --verb | text | set HTTP verb such as POST | GET |
| -ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
| --rpm | integer | set requests per minute (default: 60) | 60 |
| --timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
| --verify / --no-verify | boolean | control verification of SSL certs | True |
| --retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
| --retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
| -H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
| -v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
| --fastmode | boolean | use a cache to avoid making unnecessary requests | False |
| --help | boolean | Show this message and exit. | False |

test

Scrape a single page and see output immediately.

This eases the common cycle of making modifications to a scraper, running a scrape (possibly with long-running but irrelevant portions commented out), and comparing output to what is expected.

test can also be useful for debugging existing scrapers: you can see exactly what a single step of the scrape is providing, which helps narrow down where erroneous data is coming from.

Example:

$ spatula test path.to.ClassName --source https://example.com

This will run the scraper defined at path.to.ClassName against the provided URL.

Usage:

spatula test [OPTIONS] CLASS_NAME

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| --interactive / --no-interactive | boolean | Interactively prompt for missing data. (Default: false) | False |
| -d, --data | text | Provide input data in name=value pairs. | None |
| -s, --source | text | Provide (or override) source URL | None |
| --pagination / --no-pagination | boolean | Determine whether or not pagination should be followed or one page is enough for testing. (Default: true) | True |
| --subpages / --no-subpages | boolean | Determine whether subpages should be scraped. (Default: false) | False |
| -ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
| --rpm | integer | set requests per minute (default: 60) | 60 |
| --timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
| --verify / --no-verify | boolean | control verification of SSL certs | True |
| --retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
| --retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
| -H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
| -v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
| --fastmode | boolean | use a cache to avoid making unnecessary requests | False |
| --help | boolean | Show this message and exit. | False |
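When test is run repeatedly from a script, it can help to assemble the command line from Python values. The following is a minimal sketch using only flags listed above; the helper `build_test_command` is hypothetical, and it assumes that `-d` can be repeated once per name=value pair.

```python
import shlex


def build_test_command(class_name, source=None, data=None, subpages=False):
    """Build an argument list for `spatula test` from Python values."""
    args = ["spatula", "test", class_name]
    if source:
        args += ["--source", source]
    # Assumption: one -d flag per name=value pair.
    for name, value in (data or {}).items():
        args += ["-d", f"{name}={value}"]
    if subpages:
        args.append("--subpages")
    return args


cmd = build_test_command(
    "path.to.ClassName",
    source="https://example.com",
    data={"name": "value"},
)
# Prints: spatula test path.to.ClassName --source https://example.com -d name=value
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when a value contains spaces.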