API Reference¶

Pages¶

Page¶

Bases: ABC

Base class for all Page scrapers, used for scraping information from a single type of page.

For details on how these methods are called, it may be helpful to read Anatomy of a Scrape.

Attributes

source

Can be set on subclasses of Page to define the initial HTTP request whose response the page handles in its process_page method.

For simple GET requests, source can be a plain URL string. Use a URL object (see Sources) for more advanced cases, such as non-GET requests or custom headers.

response

The requests.Response object, available when access to the raw response is needed.

input

The data passed in when this page is instantiated. Must be an instance of input_type.

input_type

A dataclass, attrs class, or pydantic model. If set, it is used to prompt for and/or validate self.input.

example_input

Instance of input_type to be used when invoking spatula test.

example_source

Source to fetch when invoking spatula test.

dependencies

Dictionary mapping names to Page objects whose output will be available before process_page is called.

For example:

class EmployeeDetail(HtmlPage):
    dependencies = {"awards": AwardsPage()}

This means that before EmployeeDetail.process_page is called, the output from AwardsPage is guaranteed to be available as self.awards.

See Specifying Dependencies for a more detailed explanation.

Methods

`do_scrape(scraper=None)` ¶

Yield results from this page and any subpages.

Parameters:

Name	Type	Description	Default
`scraper`	`typing.Optional[scrapelib.Scraper]`	Optional `scrapelib.Scraper` instance to use for running the scrape.	`None`

Returns:

Type	Description
`typing.Iterable[typing.Any]`	Generator yielding results from the scrape.

`get_next_source()` ¶

To be overridden for paginated pages.

Return a URL or other valid source for the next page, or None if there isn't one.
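The pagination contract can be sketched without any network access. This is a minimal stand-in, not spatula's implementation: `FakePaginatedPage`, `PAGES`, and `scrape_all` are hypothetical names, and an in-memory dictionary plays the role of the HTTP responses. It only illustrates that a driver keeps following `get_next_source()` until it returns `None`.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical in-memory "site": source -> (items on that page, next source or None).
PAGES: Dict[str, Tuple[List[str], Optional[str]]] = {
    "page-1": (["a", "b"], "page-2"),
    "page-2": (["c"], "page-3"),
    "page-3": (["d", "e"], None),
}


class FakePaginatedPage:
    """Stand-in for a paginated page; this is NOT a spatula class."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.items, self._next = PAGES[source]

    def process_page(self) -> Iterator[str]:
        # Return data extracted from this page and this page alone.
        yield from self.items

    def get_next_source(self) -> Optional[str]:
        # Return the next source to fetch, or None when there is no next page.
        return self._next


def scrape_all(first_source: str) -> Iterator[str]:
    """Follow get_next_source() until it returns None (assumed driver logic)."""
    source: Optional[str] = first_source
    while source is not None:
        page = FakePaginatedPage(source)
        yield from page.process_page()
        source = page.get_next_source()


results = list(scrape_all("page-1"))
```

In a real subclass, `get_next_source` would typically derive the next URL from the parsed response, for example from a "next" link on the page.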

`get_source_from_input()` ¶

To be overridden.

Convert self.input to a Source object such as a URL.
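As a rough illustration of what such an override might compute, the sketch below builds a URL string from a dataclass input. `EmployeeInput`, `FakeEmployeePage`, and the `example.com` URL are hypothetical, and the class is a plain stand-in rather than a real `Page` subclass. It assumes a plain URL string is an acceptable source for a simple GET request.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class EmployeeInput:
    """Hypothetical input_type for illustration only."""

    employee_id: int
    lang: str = "en"


class FakeEmployeePage:
    """Stand-in mirroring the documented attributes; NOT a spatula Page."""

    input_type = EmployeeInput

    def __init__(self, input_val: EmployeeInput) -> None:
        self.input = input_val

    def get_source_from_input(self) -> str:
        # Build the query string from fields on self.input.
        query = urlencode({"id": self.input.employee_id, "lang": self.input.lang})
        return f"https://example.com/employee?{query}"


source = FakeEmployeePage(EmployeeInput(employee_id=42)).get_source_from_input()
```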

`postprocess_response()` ¶

To be overridden.

This is called after source.get_response but before self.process_page.

`process_error_response(exception)` ¶

To be overridden.

This is called after source.get_response if an exception is raised.

`process_page()` `abstractmethod` ¶

To be overridden.

Return data extracted from this page and this page alone.

HtmlPage¶

Bases: Page

Page that automatically handles parsing and normalizing links in an HTML response.

Attributes

root

lxml.etree.Element object representing the root element (e.g. <html>) on the page.

Can use the normal lxml methods (such as cssselect and getchildren), or use this element as the target of a Selector subclass.

JsonPage¶

Bases: Page

Page that automatically handles parsing a JSON response.

Attributes

data: JSON data parsed from the response (same as self.response.json()).

PdfPage¶

Bases: Page

Page that automatically handles converting a PDF response to text using pdftotext.

Attributes

preserve_layout: set to True on a derived class to have the conversion use pdftotext's -layout option, which attempts to preserve the layout of the text. (False by default.)
text: UTF-8 text extracted by pdftotext.

XmlPage¶

Bases: Page

Page that automatically handles parsing an XML response.

Attributes

root: lxml.etree.Element object representing the root XML element on the page.

ListPages¶

ListPage¶

Bases: Page

Base class for the common pattern of extracting many homogeneous items from one page.

Instead of overriding process_page, subclasses should provide a process_item method.

Methods

`process_item(item)` ¶

To be overridden.

Called once per item on the page, as defined by the particular subclass being used.

Should return data extracted from the item.

If SkipItem is raised, process_item will continue to be called with the next item in the stream.
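The skip behavior described above can be sketched as a small loop. This is a stand-in under stated assumptions: the `SkipItem` class here is a local substitute for the library's exception, and `process_or_skip` is a hypothetical driver, not spatula's own code. It shows that a skipped item produces no output and that processing continues with the next item.

```python
from typing import Dict, Iterable, Iterator


class SkipItem(Exception):
    """Local stand-in for the SkipItem exception described in this reference."""


def process_item(item: Dict[str, str]) -> Dict[str, str]:
    # Raise SkipItem for rows that should not produce output.
    if item["name"] == "Vacant":
        raise SkipItem("vacant")
    return {"name": item["name"].title()}


def process_or_skip(items: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Hypothetical driver: call process_item once per item, skipping on SkipItem."""
    for item in items:
        try:
            yield process_item(item)
        except SkipItem:
            # Nothing is yielded for this item; move on to the next one.
            continue


rows = [{"name": "ada lovelace"}, {"name": "Vacant"}, {"name": "alan turing"}]
processed = list(process_or_skip(rows))
```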

CsvListPage¶

Bases: ListPage

Processes each row in a CSV (after the first, assumed to be headers) as an item with process_item.
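To make the header handling concrete, here is a minimal sketch using only the standard library. It assumes each row reaches `process_item` as a dictionary keyed by the header row, which is how `csv.DictReader` behaves; whether the library delivers rows in exactly this form is an assumption. `CSV_TEXT` and the free-standing `process_item` function are hypothetical.

```python
import csv
import io
from typing import Dict, Union

# Hypothetical CSV body; the first row is treated as headers, not as an item.
CSV_TEXT = "name,district\nAda Lovelace,1\nAlan Turing,2\n"


def process_item(item: Dict[str, str]) -> Dict[str, Union[str, int]]:
    # Convert one data row into cleaned output.
    return {"name": item["name"], "district": int(item["district"])}


reader = csv.DictReader(io.StringIO(CSV_TEXT))
items = [process_item(row) for row in reader]
```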

ExcelListPage¶

Bases: ListPage

Processes each row in an Excel file as an item with process_item.

HtmlListPage¶

Bases: LxmlListPage, HtmlPage

Selects homogeneous items from an HTML page using selector and passes them to process_item.

Attributes

selector: Selector subclass which matches the list of homogeneous elements to process. (e.g. CSS("tbody tr"))

JsonListPage¶

Bases: ListPage, JsonPage

Processes each element in a JSON list as an item with process_item.

XmlListPage¶

Bases: LxmlListPage, XmlPage

Selects homogeneous items from an XML document using selector and passes them to process_item.

Attributes

selector: Selector subclass which matches the list of homogeneous elements to process. (e.g. XPath("//item"))

Selectors¶

Selector¶

Bases: ABC

Base class implementing Selector interface.

`match(element, *, min_items=None, max_items=None, num_items=None)` ¶

Return all matches of the given selector within element.

If the number of elements matched is outside the prescribed boundaries, a SelectorError is raised.

Parameters:

Name	Type	Description	Default
`element`	`_Element`	The element to match within. When used from within a `Page` will usually be `self.root`.	required
`min_items`	`Optional[int]`	A minimum number of items to match.	`None`
`max_items`	`Optional[int]`	A maximum number of items to match.	`None`
`num_items`	`Optional[int]`	An exact number of items to match.	`None`
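The boundary checks can be pictured with a small stand-in. The `SelectorError` class below is a local substitute for the documented exception, and `check_count` is a hypothetical helper, not part of the library. How the library orders the checks, words its messages, or combines `num_items` with the other two limits is an assumption here.

```python
from typing import List, Optional, Sequence


class SelectorError(ValueError):
    """Local stand-in for the SelectorError documented under Exceptions."""


def check_count(
    items: Sequence[str],
    *,
    min_items: Optional[int] = None,
    max_items: Optional[int] = None,
    num_items: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: validate the number of matches, then return them."""
    n = len(items)
    if num_items is not None and n != num_items:
        raise SelectorError(f"expected exactly {num_items} items, got {n}")
    if min_items is not None and n < min_items:
        raise SelectorError(f"expected at least {min_items} items, got {n}")
    if max_items is not None and n > max_items:
        raise SelectorError(f"expected at most {max_items} items, got {n}")
    return list(items)


# Three matches fall inside the 1..5 range, so they are returned unchanged.
matched = check_count(["tr1", "tr2", "tr3"], min_items=1, max_items=5)

# Zero matches with min_items=1 raises the error.
try:
    check_count([], min_items=1)
    error_message = None
except SelectorError as exc:
    error_message = str(exc)
```

Because the error derives from ValueError, callers that already catch ValueError would also catch it.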

`match_one(element)` ¶

Return exactly one match.

Parameters:

Name	Type	Description	Default
`element`	`_Element`	Element to search within.	required

CSS¶

Utilize CSS-style selectors.

Parameters:

Name	Type	Description	Default
`css_selector`	`str`	CSS selector expression to use.	required
`min_items`	`Optional[int]`	A minimum number of items to match.	`1`
`max_items`	`Optional[int]`	A maximum number of items to match.	`None`
`num_items`	`Optional[int]`	An exact number of items to match.	`None`

SimilarLink¶

Match links that fit a provided pattern.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	Regular expression for link hrefs.	required
`min_items`	`Optional[int]`	A minimum number of items to match.	`1`
`max_items`	`Optional[int]`	A maximum number of items to match.	`None`
`num_items`	`Optional[int]`	An exact number of items to match.	`None`
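The idea of filtering links by a regular expression can be sketched with the standard `re` module. The `hrefs` list and the `example.com` pattern are hypothetical, and whether the library applies the pattern from the start of the href (as `re.match` does here) or anywhere within it is an assumption; the pattern below is written so that either behavior gives the same result for these inputs.

```python
import re

# Hypothetical href values as they might appear on a listing page.
hrefs = [
    "https://example.com/people/101",
    "https://example.com/people/102",
    "https://example.com/about",
    "https://example.com/committees/7",
]

# Regular expression describing the links of interest.
pattern = re.compile(r"https://example\.com/people/\d+")

# Keep only the hrefs that fit the pattern.
person_links = [href for href in hrefs if pattern.match(href)]
```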

XPath¶

Utilize XPath selectors.

Parameters:

Name	Type	Description	Default
`xpath`	`str`	XPath expression to use.	required
`min_items`	`Optional[int]`	A minimum number of items to match.	`1`
`max_items`	`Optional[int]`	A maximum number of items to match.	`None`
`num_items`	`Optional[int]`	An exact number of items to match.	`None`

Sources¶

URL¶

Defines a resource to fetch via URL, particularly useful for handling non-GET requests.

Parameters:

Name	Type	Description	Default
`url`	`str`	URL to fetch.	required
`method`	`str`	HTTP method to use.	`'GET'`
`data`	`dict`	POST data to include in the request body.	`None`
`headers`	`dict`	Dictionary of HTTP headers to set for the request.	`None`
`verify`	`bool`	Whether to verify SSL certificates for the request.	`True`
`timeout`	`Optional[float]`	HTTP(S) timeout in seconds.	`None`
`retries`	`Optional[int]`	Number of retries to make.	`None`

NullSource¶

Bases: Source

Special class to set as a page's source to indicate no HTTP request needs to be performed.

Exceptions¶

SelectorError¶

Bases: ValueError

Error raised when a selector's constraint (min_items/max_items, etc.) is not met.

SkipItem¶

Bases: Exception

Raise this to skip processing of the current item and continue with the next item.

Example:

class SomeListPage(HtmlListPage):
    def process_item(self, item):
        if item.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here

Or, if the skip logic needs to live within a detail Page:

class SomeDetailPage(HtmlPage):
    def process_page(self):
        if self.input.name == "Vacant":
            raise SkipItem("vacant")
        # do normal processing here
