
API Reference

Scraper

Scraper is the most important class provided by scrapelib (and generally the only one to be instantiated directly). It provides a large number of options allowing for customization.

It wraps requests.Session and has the same attributes & methods available.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `raise_errors` | set to True to raise an HTTPError on 4xx or 5xx response | required |
| `requests_per_minute` | maximum requests per minute (0 for unlimited, defaults to 60) | required |
| `retry_attempts` | number of times to retry if a timeout occurs or the page returns a (non-404) error | required |
| `retry_wait_seconds` | number of seconds to wait after the first failure; subsequent retries double this wait | required |
| `verify` | set to False to disable HTTPS verification | required |

request(self, method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=True, cert=None, json=None, retry_on_404=False)

Overrides and wraps Session.request with caching.

The cache is only used if key_for_request returns a valid key and should_cache_response returned True.

urlretrieve(self, url, filename=None, method='GET', body=None, dir=None, **kwargs)

Save the result of a request to a file, similarly to urllib.request.urlretrieve.

If an error is encountered, this may raise any of the scrapelib exceptions.

A filename may be provided, or urlretrieve will safely create a temporary file. If a directory is provided, the file will be given a random name within that directory. Either way, it is the responsibility of the caller to delete the temporary file when it is no longer needed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | str | URL for request | required |
| `filename` | str | optional name for file | None |
| `method` | str | any valid HTTP method, but generally GET or POST | 'GET' |
| `body` | dict | optional body for request; to turn parameters into an appropriate string use urllib.parse.urlencode() | None |
| `dir` | str | optional directory to place file in | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[str, requests.models.Response] | tuple with the filename of the saved response (the same as the given filename if one was provided, otherwise a temp file in the OS temp directory) and a Response object that can be used to inspect the response headers |

Caching

Assign a MemoryCache, FileCache, or SQLiteCache to the cache_storage property of a scrapelib.Scraper to cache responses:

```python
from scrapelib import Scraper
from scrapelib.cache import FileCache

cache = FileCache('cache-directory')
scraper = Scraper()
scraper.cache_storage = cache
scraper.cache_write_only = False
```

MemoryCache

In memory cache for request responses.

FileCache

File-based cache for request responses.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `cache_dir` | directory for storing responses | required |
| `check_last_modified` | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |

SQLiteCache

SQLite cache for request responses.

Parameters:

| Name | Description | Default |
| --- | --- | --- |
| `cache_path` | path for SQLite database file | required |
| `check_last_modified` | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |

Exceptions

HTTPError

Raised when a request encounters a 4xx or 5xx error code and the raise_errors option is True.

HTTPMethodUnavailableError

Raised when the supplied HTTP method is invalid or not supported by the HTTP backend.

FTPError

Raised when an error is encountered while retrieving an FTP URL.
