
API Reference

Pages

Page

Base class for all Page scrapers, used for scraping information from a single type of page.

For details on how these methods are called, it may be helpful to read Anatomy of a Scrape.

Attributes

source

Can be set on subclasses of Page to define the initial HTTP request whose response the page will handle in its process_page method.

For simple GET requests, source can be a string. URL should be used for more advanced use cases.

response
requests.Response object available if access is needed to the raw response for any reason.
input
Data instance passed in when this page is instantiated. Must be of type input_type.
input_type
A dataclass, attrs class, or pydantic model. If set, it will be used to prompt for and/or validate self.input.
example_input
Instance of input_type to be used when invoking spatula test.
example_source
Source to fetch when invoking spatula test.
dependencies

Dictionary mapping of names to Page objects that will be available before process_page.

For example:

class EmployeeDetail(HtmlPage):
    dependencies = {"awards": AwardsPage()}

This means that before EmployeeDetail.process_page is called, the output from AwardsPage is guaranteed to be available as self.awards.

See Specifying Dependencies for a more detailed explanation.
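
The idea can be sketched with plain-Python stand-ins rather than spatula itself: each dependency is run first and its output is attached under its name before process_page runs. MiniPage, resolve_dependencies, and the hard-coded data are hypothetical names invented for this illustration; the real library also fetches a source for each dependency page.

```python
class MiniPage:
    """Hypothetical stand-in for a Page; no HTTP request is made."""

    dependencies = {}

    def resolve_dependencies(self):
        # Run each dependency page and attach its output under its name.
        for name, page in self.dependencies.items():
            page.resolve_dependencies()
            setattr(self, name, page.process_page())

    def process_page(self):
        raise NotImplementedError


class AwardsPage(MiniPage):
    def process_page(self):
        # Hard-coded data standing in for scraped results.
        return {"Ada": ["Employee of the Month"]}


class EmployeeDetail(MiniPage):
    dependencies = {"awards": AwardsPage()}

    def process_page(self):
        # self.awards is already populated when this runs.
        return {"name": "Ada", "awards": self.awards.get("Ada", [])}


page = EmployeeDetail()
page.resolve_dependencies()
result = page.process_page()
```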

Methods

do_scrape(self, scraper=None)

Yield results from this page and any subpages.

:param scraper: Optional scrapelib.Scraper instance to use for running scrape.
:returns: Generator yielding results from the scrape.

get_next_source(self)

To be overridden for paginated pages.

Return a URL or valid source to fetch the next page, None if there isn't one.
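
A minimal sketch of the pagination contract, assuming a driver loop that keeps fetching until get_next_source returns None. FAKE_SITE, MiniPaginatedPage, and scrape_all are hypothetical stand-ins so the example runs without network access; they are not part of spatula.

```python
# Hypothetical in-memory "site": each URL maps to (items, next URL or None).
FAKE_SITE = {
    "https://example.com/list?page=1": (["a", "b"], "https://example.com/list?page=2"),
    "https://example.com/list?page=2": (["c"], None),
}


class MiniPaginatedPage:
    """Stand-in for a paginated page; it reads FAKE_SITE instead of the network."""

    def __init__(self, source):
        self.source = source
        self.items, self._next = FAKE_SITE[source]

    def process_page(self):
        return list(self.items)

    def get_next_source(self):
        # Return the next URL, or None when this is the last page.
        return self._next


def scrape_all(first_source):
    """Follow get_next_source until it returns None."""
    source = first_source
    while source is not None:
        page = MiniPaginatedPage(source)
        yield from page.process_page()
        source = page.get_next_source()


results = list(scrape_all("https://example.com/list?page=1"))
```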

get_source_from_input(self)

To be overridden.

Convert self.input to a Source object such as a URL.

postprocess_response(self)

To be overridden.

This is called after source.get_response but before self.process_page.

process_error_response(self, exception)

To be overridden.

This is called after source.get_response if an exception is raised.

process_page(self)

To be overridden.

Return data extracted from this page and this page alone.
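
The ordering of the hooks described above can be illustrated with a small stand-in. MiniSource, MiniPage, and run are hypothetical names for this sketch only. Stopping after process_error_response is an assumption made here, not a statement of what the library does after the hook returns.

```python
class MiniSource:
    """Hypothetical source whose get_response either returns text or raises."""

    def __init__(self, text=None, error=None):
        self.text = text
        self.error = error

    def get_response(self):
        if self.error is not None:
            raise self.error
        return self.text


class MiniPage:
    """Stand-in page that records the order in which its hooks run."""

    def __init__(self, source):
        self.source = source
        self.calls = []

    def postprocess_response(self):
        self.calls.append("postprocess_response")

    def process_error_response(self, exception):
        self.calls.append("process_error_response")

    def process_page(self):
        self.calls.append("process_page")
        return self.response.upper()

    def run(self):
        try:
            self.response = self.source.get_response()
        except Exception as exc:
            # Assumption: this sketch stops after the error hook.
            self.process_error_response(exc)
            return None
        self.postprocess_response()
        return self.process_page()


ok = MiniPage(MiniSource(text="hello"))
ok_result = ok.run()
failed = MiniPage(MiniSource(error=ValueError("boom")))
failed_result = failed.run()
```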

HtmlPage

Page that automatically handles parsing and normalizing links in an HTML response.

Attributes

root

lxml.etree.Element object representing the root element (e.g. <html>) on the page.

You can call the normal lxml methods (such as cssselect and getchildren) on it, or use this element as the target of a Selector subclass.

JsonPage

Page that automatically handles parsing a JSON response.

Attributes

data
JSON data from the response (same as self.response.json()).

PdfPage

Page that automatically handles converting a PDF response to text using pdftotext.

Attributes

preserve_layout
Set to True on a derived class if you want the conversion to use pdftotext's -layout option, which attempts to preserve the layout of the text. (False by default.)
text
UTF-8 text extracted by pdftotext.
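
As a rough illustration of what preserve_layout changes, the sketch below builds a pdftotext command line with and without -layout. pdftotext_command and report.pdf are hypothetical; how the library itself invokes pdftotext is not shown in this reference.

```python
def pdftotext_command(pdf_path, preserve_layout=False):
    """Build an argv list for pdftotext that writes the text to stdout."""
    command = ["pdftotext"]
    if preserve_layout:
        command.append("-layout")
    # "-" as the output file tells pdftotext to write to standard output.
    command += [pdf_path, "-"]
    return command


plain = pdftotext_command("report.pdf")
layout = pdftotext_command("report.pdf", preserve_layout=True)
```

Assuming pdftotext is installed and on the PATH, the list can be passed to subprocess.run(command, capture_output=True, check=True) and the resulting stdout decoded as UTF-8.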

XmlPage

Page that automatically handles parsing an XML response.

Attributes

root
lxml.etree.Element object representing the root XML element on the page.

ListPages

ListPage

Base class for the common pattern of extracting many homogenous items from one page.

Instead of overriding process_page, subclasses should provide a process_item method.

Methods

process_item(self, item)

To be overridden.

Called once per subitem on page, as defined by the particular subclass being used.

Should return data extracted from the item.

If SkipItem is raised, process_item will continue to be called with the next item in the stream.
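
The skip behavior amounts to a loop that catches the exception and moves on. The sketch below uses a local SkipItem class and a hypothetical process_all driver so that it runs without spatula installed; items are plain dicts here purely for illustration.

```python
class SkipItem(Exception):
    """Local stand-in for spatula's SkipItem exception."""


def process_item(item):
    if item["name"] == "Vacant":
        raise SkipItem("vacant")
    return item["name"].title()


def process_all(items):
    """Call process_item for each item, dropping any that raise SkipItem."""
    for item in items:
        try:
            yield process_item(item)
        except SkipItem:
            # Skip this item and carry on with the next one.
            continue


rows = [{"name": "ada lovelace"}, {"name": "Vacant"}, {"name": "alan turing"}]
names = list(process_all(rows))
```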

CsvListPage

Processes each row in a CSV (after the first, assumed to be headers) as an item with process_item.
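
A standard-library sketch of the same pattern: the first row supplies the headers and every later row becomes one item. Passing each row as a dict keyed by header (via csv.DictReader) is an assumption of this example, and CSV_TEXT and process_item are made up for illustration.

```python
import csv
import io

CSV_TEXT = "name,title\nAda,Engineer\nAlan,Analyst\n"


def process_item(item):
    # item is a dict keyed by the header row.
    return f"{item['name']} ({item['title']})"


# DictReader consumes the first row as headers, so only data rows are items.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
results = [process_item(row) for row in reader]
```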

ExcelListPage

Processes each row in an Excel file as an item with process_item.

HtmlListPage

Selects homogenous items from HTML page using selector and passes them to process_item.

Attributes

selector
Selector subclass which matches list of homogenous elements to process. (e.g. CSS("tbody tr"))

JsonListPage

Processes each element in a JSON list as an item with process_item.

XmlListPage

Selects homogenous items from XML document using selector and passes them to process_item.

Attributes

selector
Selector subclass which matches list of homogenous elements to process. (e.g. XPath("//item"))

Selectors

Selector

Base class implementing Selector interface.

match(self, element, *, min_items=None, max_items=None, num_items=None)

Return all matches of the given selector within element.

If the number of elements matched is outside the prescribed boundaries, a SelectorError is raised.

:param element: The element to match within. When used from within a Page will usually be self.root.
:param min_items: A minimum number of items to match.
:param max_items: A maximum number of items to match.
:param num_items: An exact number of items to match.
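
The boundary check can be pictured as follows. This is a hedged sketch, not the library's code: check_bounds is a hypothetical helper, SelectorError is redefined locally, and the error messages are invented.

```python
class SelectorError(Exception):
    """Local stand-in for spatula's SelectorError."""


def check_bounds(items, *, min_items=None, max_items=None, num_items=None):
    """Return items unchanged, or raise SelectorError if the count is out of bounds."""
    count = len(items)
    if num_items is not None and count != num_items:
        raise SelectorError(f"expected exactly {num_items} items, got {count}")
    if min_items is not None and count < min_items:
        raise SelectorError(f"expected at least {min_items} items, got {count}")
    if max_items is not None and count > max_items:
        raise SelectorError(f"expected at most {max_items} items, got {count}")
    return items


matched = check_bounds(["tr1", "tr2", "tr3"], min_items=1, max_items=5)

try:
    check_bounds(["tr1", "tr2", "tr3"], num_items=2)
except SelectorError as exc:
    error_message = str(exc)
else:
    error_message = None
```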

match_one(self, element)

Return exactly one match.

:param element: Element to search within.

CSS

Utilize CSS-style selectors.

:param css_selector: CSS selector expression to use.
:param min_items: A minimum number of items to match.
:param max_items: A maximum number of items to match.
:param num_items: An exact number of items to match.

SimilarLink

Match links that fit a provided pattern.

:param pattern: Regular expression for link hrefs.
:param min_items: A minimum number of items to match.
:param max_items: A maximum number of items to match.
:param num_items: An exact number of items to match.
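
To show what matching link hrefs against a pattern looks like, here is a standard-library sketch over a plain list of href strings. similar_links is a hypothetical helper, and anchoring the pattern at the start of the href (re.match semantics) is an assumption of this example rather than documented behavior.

```python
import re


def similar_links(hrefs, pattern):
    """Return the hrefs whose start matches the regular expression."""
    compiled = re.compile(pattern)
    # Assumption: match from the start of the href (re.match semantics).
    return [href for href in hrefs if compiled.match(href)]


hrefs = [
    "/people/12",
    "/people/34",
    "/about",
    "/archive/people/56",
]
links = similar_links(hrefs, r"/people/\d+")
```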

XPath

Utilize XPath selectors.

:param xpath: XPath expression to use.
:param min_items: A minimum number of items to match.
:param max_items: A maximum number of items to match.
:param num_items: An exact number of items to match.

Sources

URL

Defines a resource to fetch via URL, particularly useful for handling non-GET requests.

:param url: URL to fetch.
:param method: HTTP method to use, defaults to "GET".
:param data: POST data to include in request body.
:param headers: Dictionary of HTTP headers to set for the request.
:param verify: bool indicating whether or not to verify SSL certificates for request, defaults to True.
:param timeout: HTTP(S) timeout in seconds.
:param retries: Number of retries to make.
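
For orientation, the url, method, data, and headers parameters correspond to the usual parts of an HTTP request. The sketch below builds an equivalent POST request with the standard library only, so it runs without spatula and sends nothing over the network; the endpoint and form fields are made up.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields to send in the request body.
form = {"q": "budget", "year": "2024"}

# Construct (but do not send) a POST request with a body and a custom header.
request = urllib.request.Request(
    "https://example.com/search",
    data=urllib.parse.urlencode(form).encode("utf-8"),
    headers={"Accept": "text/html"},
    method="POST",
)
```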

NullSource

Special class to set as a page's source to indicate no HTTP request needs to be performed.

Exceptions

SelectorError

Error raised when a selector's constraint (min_items/max_items, etc.) is not met.

SkipItem

Raise this to skip processing of the current item and continue with the next item.

Examples:

from spatula import HtmlListPage, SkipItem

class SomeListPage(HtmlListPage):
    def process_item(self, item):
        # skip placeholder entries instead of emitting them
        if item.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here

Or, if the skip logic needs to live within a detail Page:

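from spatula import HtmlPage, SkipItem

class SomeDetailPage(HtmlPage):
    def process_page(self):
        # self.input is the data passed in when this page was instantiated
        if self.input.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here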