
API Reference

Scraper

Scraper is the most important class provided by scrapelib (and generally the only one to be instantiated directly). It provides a large number of options allowing for customization.

It wraps requests.Session and has the same attributes & methods available.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `raise_errors` | set to True to raise an HTTPError on 4xx or 5xx response | required |
| `requests_per_minute` | maximum requests per minute (0 for unlimited, defaults to 60) | required |
| `retry_attempts` | number of times to retry if a timeout occurs or the page returns a (non-404) error | required |
| `retry_wait_seconds` | number of seconds to wait after the first failure; subsequent retries double this wait | required |
| `verify` | set to False to disable HTTPS verification | required |

request(self, method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=True, cert=None, json=None, retry_on_404=False)

Overrides and wraps Session.request with caching.

The cache is only used if key_for_request returns a valid key and should_cache_response returned True.

urlretrieve(self, url, filename=None, method='GET', body=None, dir=None, **kwargs)

Save the result of a request to a file, similarly to urllib.request.urlretrieve.

If an error is encountered, this may raise any of the scrapelib exceptions.

A filename may be provided, or urlretrieve will safely create a temporary file. If a directory is provided, the file will be given a random name within that directory. Either way, it is the responsibility of the caller to delete the temporary file when it is no longer needed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | str | URL for request | required |
| `filename` | str | optional name for file | None |
| `method` | str | any valid HTTP method, but generally GET or POST | 'GET' |
| `body` | dict | optional body for request; to turn parameters into an appropriate string use urllib.parse.urlencode() | None |
| `dir` | str | optional directory to place file in | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[str, requests.models.Response] | tuple with the filename of the saved response (the same as the given filename if one was provided, otherwise a temp file in the OS temp directory) and a Response object that can be used to inspect the response headers |

Caching

Assign a MemoryCache, FileCache, or SQLiteCache to the cache_storage property of a scrapelib.Scraper to cache responses:

```python
from scrapelib import Scraper
from scrapelib.cache import FileCache

cache = FileCache('cache-directory')
scraper = Scraper()
scraper.cache_storage = cache
scraper.cache_write_only = False
```

MemoryCache

In memory cache for request responses.

FileCache

File-based cache for request responses.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `cache_dir` | directory for storing responses | required |
| `check_last_modified` | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |

SQLiteCache

SQLite cache for request responses.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `cache_path` | path for SQLite database file | required |
| `check_last_modified` | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |

Exceptions

HTTPError

Raised when a request encounters a 4xx or 5xx error code and the raise_errors option is True.

HTTPMethodUnavailableError

Raised when the supplied HTTP method is invalid or not supported by the HTTP backend.

FTPError

Raised when an error is encountered while retrieving an FTP URL.
