Command Line Interface
spatula provides a command line interface that is useful for iterative development of scrapers.
Once installed within your Python environment, spatula can be invoked on the command line. For example:
(scrape-venv) ~/scrape-proj $ spatula --version
spatula, version 0.9.0
Or with poetry:
~/scrape-proj $ poetry run spatula --version
spatula, version 0.9.0
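The same check can be made from Python, for instance in a setup script. This is a small sketch that assumes the package is installed under the distribution name `spatula`; the helper name `installed_spatula_version` is hypothetical and not part of spatula itself.

```python
from importlib import metadata
from typing import Optional


def installed_spatula_version() -> Optional[str]:
    """Return the installed spatula version string, or None if it is not installed."""
    try:
        # Assumption: the distribution is named "spatula" in this environment.
        return metadata.version("spatula")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_spatula_version()
    if version is None:
        print("spatula is not installed in this environment")
    else:
        print(f"spatula, version {version}")
```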
The CLI provides four useful subcommands for different stages of development:
spatula
Usage:
spatula [OPTIONS] COMMAND [ARGS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
--version | boolean | Show the version and exit. | False |
--help | boolean | Show this message and exit. | False |
scout
Run first step of scrape & output data to a JSON file.
This command is intended as a first approximation of whether a full scrape needs to be run. If the first layer detects any changes, it is safe to say that the full run will as well.
This works in the common case where a subpage is added or removed. In more advanced cases, it depends on the first page being scraped (typically a ListPage derivative) surfacing enough information (perhaps a last_updated date) to know whether any of the other pages need to be scraped again.
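The change check described here could be scripted roughly as follows. This is a minimal sketch, assuming the scout output is an ordinary JSON file and that the output of a previous run has been kept for comparison. The file name `scout.previous.json` and the helper `scout_changed` are hypothetical; only `scout.json` (the default output file) comes from the options below.

```python
import json
from pathlib import Path


def scout_changed(previous: Path, current: Path) -> bool:
    """Return True if the two scout outputs differ, or if no previous run exists."""
    if not previous.exists():
        # Nothing to compare against, so treat the data as changed.
        return True
    with previous.open() as f_prev, current.open() as f_cur:
        # Compare parsed JSON so that whitespace-only differences are ignored.
        return json.load(f_prev) != json.load(f_cur)


if __name__ == "__main__":
    # Hypothetical file name for the saved output of the previous scout run.
    if scout_changed(Path("scout.previous.json"), Path("scout.json")):
        print("changes detected: a full scrape is probably needed")
    else:
        print("no changes detected in the first layer")
```

Comparing whole files is the bluntest possible test. If the first page exposes something like a last_updated date, comparing only that field would be a reasonable refinement.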
Usage:
spatula scout [OPTIONS] INITIAL_PAGE_NAME
Options:
Name | Type | Description | Default |
---|---|---|---|
-s, --source | text | Provide (or override) source URL | None |
-o, --output-file | text | override default output file [default: scout.json]. | scout.json |
-ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
--rpm | integer | set requests per minute (default: 60) | 60 |
--timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
--verify / --no-verify | boolean | control verification of SSL certs | True |
--retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
--retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
-H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
-v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
--fastmode | boolean | use a cache to avoid making unnecessary requests | False |
--help | boolean | Show this message and exit. | False |
scrape
Run full scrape, and output data to disk.
Usage:
spatula scrape [OPTIONS] INITIAL_PAGE_NAME
Options:
Name | Type | Description | Default |
---|---|---|---|
-o, --output-dir | text | override default output directory. | None |
--rmdir / --no-rmdir | boolean | remove output directory before scrape. | False |
-s, --source | text | Provide (or override) source URL | None |
--dump | text | Specify dump function | json.dump |
-ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
--rpm | integer | set requests per minute (default: 60) | 60 |
--timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
--verify / --no-verify | boolean | control verification of SSL certs | True |
--retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
--retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
-H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
-v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
--fastmode | boolean | use a cache to avoid making unnecessary requests | False |
--help | boolean | Show this message and exit. | False |
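When a scrape is launched from a scheduler or wrapper script, the options above can be assembled programmatically. The following is only a sketch: the helper `build_scrape_command`, the page path `scrapers.people.PersonList`, and the `_scrapes` directory are hypothetical placeholders. It uses only flags listed in the table above.

```python
import shlex
from typing import List, Optional


def build_scrape_command(
    page: str,
    output_dir: Optional[str] = None,
    rpm: int = 60,
    retries: int = 0,
    fastmode: bool = False,
) -> List[str]:
    """Assemble an argv list for `spatula scrape` from a few documented flags."""
    cmd = ["spatula", "scrape", page, "--rpm", str(rpm), "--retries", str(retries)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if fastmode:
        cmd.append("--fastmode")
    return cmd


if __name__ == "__main__":
    # Hypothetical page path and output directory, for illustration only.
    cmd = build_scrape_command(
        "scrapers.people.PersonList", output_dir="_scrapes", rpm=30, retries=2
    )
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(cmd))
    # To actually run it (requires spatula to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when a value such as a header or path contains spaces.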
shell
Start a session to interact with a particular page.
Usage:
spatula shell [OPTIONS] URL
Options:
Name | Type | Description | Default |
---|---|---|---|
-X, --verb | text | set HTTP verb such as POST | GET |
-ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
--rpm | integer | set requests per minute (default: 60) | 60 |
--timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
--verify / --no-verify | boolean | control verification of SSL certs | True |
--retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
--retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
-H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
-v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
--fastmode | boolean | use a cache to avoid making unnecessary requests | False |
--help | boolean | Show this message and exit. | False |
test
Scrape a single page and see output immediately.
This eases the common cycle of making modifications to a scraper, running a scrape (possibly with long-running but irrelevant portions commented out), and comparing output to what is expected.
test can also be useful for debugging existing scrapers: you can see exactly what a single step of the scrape is providing, which helps narrow down where erroneous data is coming from.
Example:
$ spatula test path.to.ClassName --source https://example.com
This will run the scraper defined at path.to.ClassName against the provided URL.
Usage:
spatula test [OPTIONS] CLASS_NAME
Options:
Name | Type | Description | Default |
---|---|---|---|
--interactive / --no-interactive | boolean | Interactively prompt for missing data. (Default: false) | False |
-d, --data | text | Provide input data in name=value pairs. | None |
-s, --source | text | Provide (or override) source URL | None |
--pagination / --no-pagination | boolean | Determine whether or not pagination should be followed or one page is enough for testing. (Default: true) | True |
--subpages / --no-subpages | boolean | Determine whether subpages should be scraped. (Default: false) | False |
-ua, --user-agent | text | override default user-agent | spatula 0.9.0 |
--rpm | integer | set requests per minute (default: 60) | 60 |
--timeout | integer | set HTTP request timeout in seconds (default: 5) | 5 |
--verify / --no-verify | boolean | control verification of SSL certs | True |
--retries | integer | configure how many retries to perform on HTTP request error (default: 0) | 0 |
--retry-wait | integer | configure how many seconds to wait on HTTP request error (default: 10) | 10 |
-H, --header | text | add a header to all requests. example format: 'Accept: application/json' | None |
-v, --verbosity | integer | override default verbosity for command (0-3) | -1 |
--fastmode | boolean | use a cache to avoid making unnecessary requests | False |
--help | boolean | Show this message and exit. | False |