API Reference

Pages

Page

Bases: ABC

Base class for all Page scrapers, used for scraping information from a single type of page.

For details on how these methods are called, it may be helpful to read Anatomy of a Scrape.

Attributes

source

Can be set on subclasses of Page to define the initial HTTP request that the page will handle in its process_page method.

For simple GET requests, source can be a string. URL should be used for more advanced use cases.

response
requests.Response object available if access is needed to the raw response for any reason.
input
Data passed in when this page is instantiated. Must be of type input_type.
input_type
A dataclass, attrs class, or pydantic model. If set, it will be used to prompt for and/or validate self.input.
example_input
Instance of input_type to be used when invoking spatula test.
example_source
Source to fetch when invoking spatula test.
dependencies

Dictionary mapping of names to Page objects that will be available before process_page.

For example:

class EmployeeDetail(HtmlPage):
    dependencies = {"awards": AwardsPage()}

This means that before EmployeeDetail.process_page is called, the output from AwardsPage is guaranteed to be available as self.awards.

See Specifying Dependencies for a more detailed explanation.
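As a rough illustration of the input_type and example_input attributes described above, the sketch below defines a dataclass and checks a value against it. It uses only the standard library: EmployeeInput, its fields, and check_input are hypothetical names, and this is not the library's own validation code.

```python
from dataclasses import dataclass, fields


@dataclass
class EmployeeInput:
    # Hypothetical input shape; the field names are illustrative only.
    name: str
    url: str


def check_input(value, input_type):
    """Return value unchanged, or raise TypeError if it is not an input_type."""
    if not isinstance(value, input_type):
        raise TypeError(
            f"expected {input_type.__name__}, got {type(value).__name__}"
        )
    return value


# An instance like this is the kind of value example_input would hold.
example_input = EmployeeInput(name="Ada", url="https://example.com/ada")
checked = check_input(example_input, EmployeeInput)

# The declared fields are what a tool could prompt for, one by one.
field_names = [f.name for f in fields(EmployeeInput)]
```

Declaring the input as a dataclass keeps the expected shape in one place, so a mismatched value fails early instead of inside process_page.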

Methods

do_scrape(scraper=None)

Yield results from this page and any subpages.

Parameters:

Name | Type | Description | Default
scraper | typing.Optional[scrapelib.Scraper] | Optional scrapelib.Scraper instance to use for running scrape. | None

Returns:

Type | Description
typing.Iterable[typing.Any] | Generator yielding results from the scrape.

get_next_source()

To be overridden for paginated pages.

Return a URL or other valid source to fetch the next page, or None if there isn't one.
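The pagination contract above can be sketched with a small in-memory stand-in. FAKE_SITE, FakePaginatedPage, and scrape_all are hypothetical names, and the loop is only an assumption about how a driver might follow get_next_source until it returns None. It is not the library's scheduling code.

```python
from typing import Iterator, Optional

# Hypothetical in-memory "site": each source maps to (items, next source).
FAKE_SITE = {
    "page-1": (["a", "b"], "page-2"),
    "page-2": (["c"], "page-3"),
    "page-3": (["d", "e"], None),
}


class FakePaginatedPage:
    """Stand-in page exposing process_page and get_next_source."""

    def __init__(self, source: str):
        self.source = source
        self.items, self._next = FAKE_SITE[source]

    def process_page(self):
        # Data from this page and this page alone.
        return list(self.items)

    def get_next_source(self) -> Optional[str]:
        # None signals that there is no further page.
        return self._next


def scrape_all(first_source: str) -> Iterator[str]:
    """Follow get_next_source until it returns None."""
    source: Optional[str] = first_source
    while source is not None:
        page = FakePaginatedPage(source)
        yield from page.process_page()
        source = page.get_next_source()


results = list(scrape_all("page-1"))
```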

get_source_from_input()

To be overridden.

Convert self.input to a Source object such as a URL.

postprocess_response()

To be overridden.

This is called after source.get_response but before self.process_page.

process_error_response(exception)

To be overridden.

This is called after source.get_response if an exception is raised.

process_page() abstractmethod

To be overridden.

Return data extracted from this page and this page alone.

HtmlPage

Bases: Page

Page that automatically handles parsing and normalizing links in an HTML response.

Attributes

root

lxml.etree.Element object representing the root element (e.g. <html>) on the page.

You can use the normal lxml methods (such as cssselect and getchildren), or use this element as the target of a Selector subclass.

JsonPage

Bases: Page

Page that automatically handles parsing a JSON response.

Attributes

data
JSON data from the response (same as self.response.json()).

PdfPage

Bases: Page

Page that automatically handles converting a PDF response to text using pdftotext.

Attributes

preserve_layout
Set to True on a derived class if you want the conversion function to use pdftotext's -layout option to attempt to preserve the layout of text. (False by default)
text
UTF-8 text extracted by pdftotext.

XmlPage

Bases: Page

Page that automatically handles parsing an XML response.

Attributes

root
lxml.etree.Element object representing the root XML element on the page.

ListPages

ListPage

Bases: Page

Base class for the common pattern of extracting many homogenous items from one page.

Instead of overriding process_page, subclasses should provide a process_item.

Methods

process_item(item)

To be overridden.

Called once per subitem on page, as defined by the particular subclass being used.

Should return data extracted from the item.

If SkipItem is raised, process_item will continue to be called with the next item in the stream.
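The per-item flow described above can be approximated in plain Python. The SkipItem class here is a local stand-in defined for the sketch, and process_all is a hypothetical driver loop; the real dispatch lives inside the library and may differ in detail.

```python
class SkipItem(Exception):
    """Local stand-in for the SkipItem exception documented in this reference."""


def process_item(item: dict) -> dict:
    # Skip placeholder rows, extract data from everything else.
    if item["name"] == "Vacant":
        raise SkipItem("vacant")
    return {"name": item["name"].title()}


def process_all(items):
    """Call process_item once per item, moving on when SkipItem is raised."""
    for item in items:
        try:
            yield process_item(item)
        except SkipItem:
            continue


rows = [{"name": "ada lovelace"}, {"name": "Vacant"}, {"name": "alan turing"}]
results = list(process_all(rows))
```

Raising an exception instead of returning None keeps "no data for this item" distinct from a legitimate empty result.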

CsvListPage

Bases: ListPage

Processes each row in a CSV (after the first, assumed to be headers) as an item with process_item.
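A minimal standard-library sketch of the same idea: the first row is treated as headers and every later row is handed to a process_item function. Whether the library delivers each row as a dict keyed by header is an assumption here, and CSV_TEXT is made-up sample data.

```python
import csv
import io

# Made-up sample data: the first row holds the column headers.
CSV_TEXT = "name,title\nAda,Engineer\nAlan,Analyst\n"


def process_item(item: dict) -> dict:
    # Each item is one data row, keyed by the header names.
    return {"name": item["name"], "title": item["title"].upper()}


# DictReader consumes the header row and yields only the data rows.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
results = [process_item(row) for row in reader]
```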

ExcelListPage

Bases: ListPage

Processes each row in an Excel file as an item with process_item.

HtmlListPage

Bases: LxmlListPage, HtmlPage

Selects homogenous items from an HTML page using selector and passes them to process_item.

Attributes

selector
Selector subclass which matches list of homogenous elements to process. (e.g. CSS("tbody tr"))

JsonListPage

Bases: ListPage, JsonPage

Processes each element in a JSON list as an item with process_item.

XmlListPage

Bases: LxmlListPage, XmlPage

Selects homogenous items from an XML document using selector and passes them to process_item.

Attributes

selector
Selector subclass which matches list of homogenous elements to process. (e.g. XPath("//item"))
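To show the select-then-process pattern without any third-party dependency, the sketch below uses the standard library's xml.etree.ElementTree in place of lxml. ElementTree only supports a subset of XPath and needs the relative form ".//item" where the example above writes "//item". XML_TEXT and process_item are illustrative names.

```python
import xml.etree.ElementTree as ET

# Made-up sample document with two homogenous <item> elements.
XML_TEXT = """
<feed>
  <item><title>First</title><link>https://example.com/1</link></item>
  <item><title>Second</title><link>https://example.com/2</link></item>
</feed>
""".strip()


def process_item(item: ET.Element) -> dict:
    # Extract data from a single matched element.
    return {"title": item.findtext("title"), "link": item.findtext("link")}


root = ET.fromstring(XML_TEXT)
# Select every <item> below the root, then process each one in turn.
results = [process_item(item) for item in root.findall(".//item")]
```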

Selectors

Selector

Bases: ABC

Base class implementing Selector interface.

match(element, *, min_items=None, max_items=None, num_items=None)

Return all matches of the given selector within element.

If the number of elements matched is outside the prescribed boundaries, a SelectorError is raised.

Parameters:

Name | Type | Description | Default
element | _Element | The element to match within. When used from within a Page will usually be self.root. | required
min_items | Optional[int] | A minimum number of items to match. | None
max_items | Optional[int] | A maximum number of items to match. | None
num_items | Optional[int] | An exact number of items to match. | None
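The boundary checks described for match can be sketched as a standalone function. SelectorError is redefined locally as a stand-in, check_count is a hypothetical helper, and the order in which the constraints are tested and the wording of the messages are assumptions, not the library's behavior.

```python
from typing import List, Optional, Sequence


class SelectorError(ValueError):
    """Local stand-in for the SelectorError documented in this reference."""


def check_count(
    matches: Sequence,
    *,
    min_items: Optional[int] = None,
    max_items: Optional[int] = None,
    num_items: Optional[int] = None,
) -> List:
    """Return the matches as a list, or raise if the count is out of bounds."""
    count = len(matches)
    if num_items is not None and count != num_items:
        raise SelectorError(f"expected exactly {num_items} items, got {count}")
    if min_items is not None and count < min_items:
        raise SelectorError(f"expected at least {min_items} items, got {count}")
    if max_items is not None and count > max_items:
        raise SelectorError(f"expected at most {max_items} items, got {count}")
    return list(matches)


rows = ["row-1", "row-2", "row-3"]
within_bounds = check_count(rows, min_items=1, max_items=5)

try:
    check_count(rows, num_items=2)
    error_message = None
except SelectorError as exc:
    error_message = str(exc)
```

Failing loudly when the count is unexpected is useful for scrapers, because a silent empty match usually means the page layout changed.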

match_one(element)

Return exactly one match.

Parameters:

Name | Type | Description | Default
element | _Element | Element to search within. | required

CSS

Utilize CSS-style selectors.

Parameters:

Name | Type | Description | Default
css_selector | str | CSS selector expression to use. | required
min_items | Optional[int] | A minimum number of items to match. | 1
max_items | Optional[int] | A maximum number of items to match. | None
num_items | Optional[int] | An exact number of items to match. | None

SimilarLink

Match links that fit a provided pattern.

Parameters:

Name | Type | Description | Default
pattern | str | Regular expression for link hrefs. | required
min_items | Optional[int] | A minimum number of items to match. | 1
max_items | Optional[int] | A maximum number of items to match. | None
num_items | Optional[int] | An exact number of items to match. | None
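The link-matching selector described above filters anchors by a regular expression on their href. The sketch below imitates that idea with the standard library only. LinkCollector and similar_links are hypothetical names, and the use of re.search (rather than an anchored match) is an assumption that this reference does not settle.

```python
import re
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def similar_links(html: str, pattern: str) -> List[str]:
    """Return hrefs in document order that the pattern matches somewhere."""
    collector = LinkCollector()
    collector.feed(html)
    regex = re.compile(pattern)
    return [href for href in collector.hrefs if regex.search(href)]


HTML = (
    '<a href="/people/1">A</a>'
    '<a href="/about">About</a>'
    '<a href="/people/22">B</a>'
)
people_links = similar_links(HTML, r"/people/\d+$")
```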

XPath

Utilize XPath selectors.

Parameters:

Name | Type | Description | Default
xpath | str | XPath expression to use. | required
min_items | Optional[int] | A minimum number of items to match. | 1
max_items | Optional[int] | A maximum number of items to match. | None
num_items | Optional[int] | An exact number of items to match. | None

Sources

URL

Defines a resource to fetch via URL, particularly useful for handling non-GET requests.

Parameters:

Name | Type | Description | Default
url | str | URL to fetch. | required
method | str | HTTP method to use, defaults to "GET". | 'GET'
data | dict | POST data to include in request body. | None
headers | dict | Dictionary of HTTP headers to set for the request. | None
verify | bool | Whether or not to verify SSL certificates for the request, defaults to True. | True
timeout | Optional[float] | HTTP(S) timeout in seconds. | None
retries | Optional[int] | Number of retries to make. | None
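To make the url, method, data, and headers parameters concrete, the sketch below assembles an equivalent non-GET request with the standard library's urllib. build_request is a hypothetical helper, form-encoding the data is an assumption, and nothing is sent over the network. It illustrates what such a source describes, not how the library performs the fetch.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_request(
    url: str,
    method: str = "GET",
    data: Optional[dict] = None,
    headers: Optional[dict] = None,
) -> Request:
    """Describe an HTTP request without sending it."""
    # Form-encode the data for the request body when any is given.
    body = urlencode(data).encode("utf-8") if data else None
    return Request(url, data=body, headers=headers or {}, method=method)


req = build_request(
    "https://example.com/search",
    method="POST",
    data={"q": "spatula"},
    headers={"Accept": "text/html"},
)
```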

NullSource

Bases: Source

Special class to set as a page's source to indicate no HTTP request needs to be performed.

Exceptions

SelectorError

Bases: ValueError

Error raised when a selector's constraint (min_items/max_items, etc.) is not met.

SkipItem

Bases: Exception

To be raised to skip processing of the current item and continue with the next item.

Example:

from spatula import HtmlListPage, SkipItem

class SomeListPage(HtmlListPage):
    def process_item(self, item):
        # Skip placeholder entries instead of emitting them.
        if item.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here

Or, if the skip logic needs to live within a detail Page:

from spatula import HtmlPage, SkipItem

class SomeDetailPage(HtmlPage):
    def process_page(self):
        # self.input is the item handed to this detail page.
        if self.input.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here