API Reference¶
Scraper¶
Scraper is the most important class provided by scrapelib (and generally the only one to be instantiated directly). It provides a large number of options allowing for customization.
It wraps requests.Session and has the same attributes & methods
available.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raise_errors | | set to True to raise a HTTPError on 4xx or 5xx responses | required |
| requests_per_minute | | maximum requests per minute (0 for unlimited, defaults to 60) | required |
| retry_attempts | | number of times to retry if a timeout occurs or the page returns a (non-404) error | required |
| retry_wait_seconds | | number of seconds to wait before retrying after the first failure; subsequent retries double this wait | required |
| verify | | set to False to turn off HTTPS certificate verification | required |
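A minimal sketch of constructing a Scraper with the options above; the values here are illustrative, not recommendations:

```python
from scrapelib import Scraper

scraper = Scraper(
    raise_errors=True,        # raise HTTPError on 4xx/5xx responses
    requests_per_minute=60,   # throttle requests; 0 disables the limit
    retry_attempts=3,         # retry timeouts and non-404 errors
    retry_wait_seconds=5,     # wait before first retry; doubles each retry
)

# Scraper subclasses requests.Session, so the familiar methods work:
response = scraper.get("https://example.com")
print(response.status_code)
```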
request(self, method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=True, cert=None, json=None, retry_on_404=False)¶

Overrides Session.request, wrapping it with caching. The cache is only used if key_for_request returns a valid key and should_cache_response also returned true.
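For example, the extra retry_on_404 flag can be passed alongside the usual requests arguments. A sketch (the URL is hypothetical):

```python
from scrapelib import Scraper

scraper = Scraper(retry_attempts=2)

# retry_on_404=True also retries 404 responses, useful for pages that
# intermittently return 404 before becoming available.
response = scraper.request(
    "GET",
    "https://example.com/flaky-page",  # hypothetical URL
    timeout=10,
    retry_on_404=True,
)
```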
urlretrieve(self, url, filename=None, method='GET', body=None, dir=None, **kwargs)¶

Save the result of a request to a file, similar to urllib.request.urlretrieve. If an error is encountered, this may raise any of the scrapelib exceptions (see Exceptions below).

A filename may be provided, or urlretrieve will safely create a temporary file. If a directory is provided, the file will be given a random name within that directory. Either way, it is the caller's responsibility to ensure the temporary file is deleted when it is no longer needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| url | str | URL for request | required |
| filename | str | optional name for file | None |
| method | str | any valid HTTP method, but generally GET or POST | 'GET' |
| body | dict | optional body for request; to turn parameters into an appropriate string, use e.g. urllib.parse.urlencode | None |
| dir | str | optional directory to place file in | None |
Returns:

| Type | Description |
|---|---|
| Tuple[str, requests.models.Response] | tuple of the filename for the saved response (same as the given filename if one was provided, otherwise a temp file in the OS temp directory) and the requests.models.Response object |
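A short sketch of the temp-file path described above (the URL is hypothetical); note that the caller deletes the file:

```python
import os
from scrapelib import Scraper

scraper = Scraper()

# No filename given, so urlretrieve creates a temporary file.
filename, response = scraper.urlretrieve("https://example.com/data.csv")
try:
    print(filename, response.status_code)
finally:
    os.remove(filename)  # the caller must clean up the temporary file
```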
Caching¶
Assign a MemoryCache, FileCache, or SQLiteCache to
the cache_storage property of a scrapelib.Scraper to cache responses:
```python
from scrapelib import Scraper
from scrapelib.cache import FileCache

cache = FileCache('cache-directory')
scraper = Scraper()
scraper.cache_storage = cache
scraper.cache_write_only = False
```
MemoryCache¶
In-memory cache for request responses.
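A one-line sketch, assuming MemoryCache takes no constructor arguments:

```python
from scrapelib import Scraper
from scrapelib.cache import MemoryCache

scraper = Scraper()
scraper.cache_storage = MemoryCache()  # cache lives only for this process
```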
FileCache¶
File-based cache for request responses.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache_dir | | directory for storing responses | required |
| check_last_modified | | set to True to compare the last-modified timestamp in a cached response with the value from a HEAD request | required |
SQLiteCache¶
SQLite cache for request responses.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| cache_path | | path for SQLite database file | required |
| check_last_modified | | set to True to compare the last-modified timestamp in a cached response with the value from a HEAD request | required |
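A sketch mirroring the FileCache example above, assuming the same cache_storage workflow applies:

```python
from scrapelib import Scraper
from scrapelib.cache import SQLiteCache

scraper = Scraper()
# check_last_modified=True revalidates cached entries via HEAD requests.
scraper.cache_storage = SQLiteCache('cache.db', check_last_modified=True)
scraper.cache_write_only = False  # also serve responses from the cache
```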
Exceptions¶
HTTPError¶
Raised when urlopen encounters a 4xx or 5xx error code and the raise_errors option is true.
HTTPMethodUnavailableError¶
Raised when the supplied HTTP method is invalid or not supported by the HTTP backend.
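A sketch of handling these exceptions, assuming they are importable from the top-level scrapelib package (the URL is hypothetical):

```python
from scrapelib import Scraper, HTTPError

scraper = Scraper(raise_errors=True)
try:
    scraper.get("https://example.com/missing")  # hypothetical URL
except HTTPError as exc:
    # Raised for 4xx/5xx responses because raise_errors is True.
    print("request failed:", exc)
```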