API Reference¶
Scraper¶
Scraper is the most important class provided by scrapelib (and generally the only one to be instantiated directly). It provides a large number of options allowing for customization.
It wraps requests.Session and has the same attributes and methods available.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
raise_errors | | set to True to raise an HTTPError on 4xx or 5xx responses | required |
requests_per_minute | | maximum requests per minute (0 for unlimited, defaults to 60) | required |
retry_attempts | | number of times to retry if a timeout occurs or the page returns a (non-404) error | required |
retry_wait_seconds | | number of seconds to wait after the first failure; subsequent retries will double this wait | required |
verify | | set to False to turn off SSL verification | required |
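The doubling of retry_wait_seconds can be sketched as a small helper (illustrative only; retry_waits is not part of scrapelib's API):

```python
def retry_waits(retry_wait_seconds, retry_attempts):
    """Seconds to sleep before each retry: the first wait, then doubling.

    Assumption from the docs above: subsequent retries double the wait.
    """
    return [retry_wait_seconds * (2 ** i) for i in range(retry_attempts)]
```

For example, with retry_wait_seconds=10 and retry_attempts=3, the waits would be 10, 20, and 40 seconds.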
request(self, method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=True, cert=None, json=None, retry_on_404=False)¶
Override; wraps Session.request with caching.
The cache is only used if key_for_request returns a valid key and should_cache_response returned True.
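The caching flow described above can be sketched as follows. The method names key_for_request and should_cache_response come from the docs; their bodies here are illustrative stand-ins, not scrapelib's implementation:

```python
class CachingSketch:
    """Illustrative sketch of the request-caching flow, not scrapelib's code."""

    def __init__(self, storage):
        self.storage = storage  # e.g. a dict standing in for MemoryCache

    def key_for_request(self, method, url):
        # In this sketch, only GET requests get a cache key.
        return f"{method}:{url}" if method == "GET" else None

    def should_cache_response(self, response):
        # In this sketch, only successful responses are cached.
        return response["status"] == 200

    def request(self, method, url, fetch):
        key = self.key_for_request(method, url)
        if key is not None and key in self.storage:
            return self.storage[key]      # cache hit, skip the network
        response = fetch()                # stand-in for the real request
        if key is not None and self.should_cache_response(response):
            self.storage[key] = response  # store for next time
        return response
```

A repeated GET is served from storage, while methods without a valid cache key always hit the network.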
urlretrieve(self, url, filename=None, method='GET', body=None, dir=None, **kwargs)¶
Save the result of a request to a file, similarly to urllib.urlretrieve.
If an error is encountered, any of the scrapelib exceptions may be raised.
A filename may be provided, or urlretrieve will safely create a temporary file. If a directory is provided, the file will be given a random name within that directory. Either way, it is the responsibility of the caller to ensure that the temporary file is deleted when it is no longer needed.
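The filename handling described above can be sketched like this (illustrative helper, not scrapelib's implementation): use the caller's filename if given, otherwise create a temporary file, optionally inside a caller-supplied directory:

```python
import os
import tempfile

def pick_filename(filename=None, dir=None):
    """Return the given filename, or safely create a temp file.

    Illustrative sketch: if dir is given, the random-named file is placed
    there; either way the caller is responsible for deleting temp files.
    """
    if filename is not None:
        return filename
    fd, path = tempfile.mkstemp(dir=dir)  # random name, caller must delete
    os.close(fd)
    return path
```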
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL for request | required |
filename | str | optional name for file | None |
method | str | any valid HTTP method, but generally GET or POST | 'GET' |
body | dict | optional body for request; to turn parameters into an appropriate string, use urllib.parse.urlencode | None |
dir | str | optional directory to place file in | None |
Returns:
Type | Description |
---|---|
Tuple[str, requests.models.Response] | tuple of the filename for the saved response (the given filename if one was provided, otherwise a temp file in the OS temp directory) and the requests.models.Response |
Caching¶
Assign a MemoryCache, FileCache, or SQLiteCache to the cache_storage property of a scrapelib.Scraper to cache responses:
```python
from scrapelib import Scraper
from scrapelib.cache import FileCache

cache = FileCache('cache-directory')
scraper = Scraper()
scraper.cache_storage = cache
scraper.cache_write_only = False
```
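When cache_write_only is True, the cache is refreshed on every request but never consulted for reads. A minimal sketch of that behavior (the fetch helper and dict cache are illustrative stand-ins, not scrapelib's internals):

```python
def fetch(url, cache, do_request, write_only=False):
    """Illustrative sketch of cache_write_only semantics."""
    if not write_only and url in cache:
        return cache[url]       # read from cache
    response = do_request(url)  # stand-in for the real request
    cache[url] = response       # the cache is always written
    return response
```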
MemoryCache¶
In memory cache for request responses.
FileCache¶
File-based cache for request responses.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cache_dir | | directory for storing responses | required |
check_last_modified | | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |
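The check_last_modified option reuses a cached response only if its Last-Modified header still matches what a fresh HEAD request reports. Roughly (illustrative helper, not FileCache's code):

```python
def cached_is_fresh(cached_headers, head_headers):
    """Illustrative freshness check based on the Last-Modified header."""
    cached = cached_headers.get("last-modified")
    current = head_headers.get("last-modified")
    return cached is not None and cached == current
```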
SQLiteCache¶
SQLite cache for request responses.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cache_path | | path for SQLite database file | required |
check_last_modified | | set to True to compare the last-modified timestamp in the cached response with the value from a HEAD request | required |
Exceptions¶
HTTPError¶
Raised when urlopen encounters a 4xx or 5xx error code and the raise_errors option is True.
HTTPMethodUnavailableError¶
Raised when the supplied HTTP method is invalid or not supported by the HTTP backend.