API Reference¶
Pages¶
Page¶
Bases: ABC
Base class for all Page scrapers, used for scraping information from a single type of page.
For details on how these methods are called, it may be helpful to read Anatomy of a Scrape.
Attributes
source
- Can be set on subclasses of Page to define the initial HTTP request that the page will handle in its process_page method. For simple GET requests, source can be a string. URL should be used for more advanced use cases.
response
- requests.Response object available if access is needed to the raw response for any reason.
input
- Instance of the data passed in when this page is instantiated. Must be of type input_type.
input_type
- dataclass, attrs class, or pydantic model. If set, will be used to prompt for and/or validate self.input.
example_input
- Instance of input_type to be used when invoking spatula test.
example_source
- Source to fetch when invoking spatula test.
dependencies
- Dictionary mapping names to Page objects whose output will be available before process_page is called. For example:
    class EmployeeDetail(HtmlPage):
        dependencies = {"awards": AwardsPage()}
means that before EmployeeDetail.process_page is called, the output from AwardsPage is guaranteed to be available as self.awards. See Specifying Dependencies for a more detailed explanation.
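The dependency behavior described above can be modeled in plain Python. This is only a sketch of the idea, not spatula's implementation: SketchPage, AwardsSketch, EmployeeDetailSketch, and run_with_dependencies are hypothetical names, and no HTTP requests are made.

```python
from typing import Any, Dict


class SketchPage:
    """Hypothetical stand-in for a page class; this is not spatula's Page."""

    dependencies: Dict[str, "SketchPage"] = {}

    def process_page(self) -> Any:
        raise NotImplementedError


class AwardsSketch(SketchPage):
    def process_page(self) -> Any:
        # Pretend this mapping was scraped from an awards page.
        return {"alice": ["Employee of the Month"]}


class EmployeeDetailSketch(SketchPage):
    dependencies = {"awards": AwardsSketch()}

    def process_page(self) -> Any:
        # self.awards is attached by run_with_dependencies() before this runs.
        return {"name": "alice", "awards": self.awards["alice"]}


def run_with_dependencies(page: SketchPage) -> Any:
    """Resolve every dependency first, then process the page itself."""
    for name, dependency in page.dependencies.items():
        setattr(page, name, run_with_dependencies(dependency))
    return page.process_page()


result = run_with_dependencies(EmployeeDetailSketch())
```

The point of the sketch is the ordering: each dependency's output is attached under its dictionary key before process_page is invoked on the dependent page.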
Methods
do_scrape(scraper=None)¶
Yield results from this page and any subpages.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
scraper | typing.Optional[scrapelib.Scraper] | Optional | None |
Returns:
Type | Description |
---|---|
typing.Iterable[typing.Any] | Generator yielding results from the scrape. |
get_next_source()¶
To be overridden for paginated pages.
Return a URL or other valid source to fetch the next page, or None if there isn't one.
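A pagination loop built on this contract might look like the following sketch. It assumes a hypothetical in-memory FAKE_SITE in place of real HTTP responses, and PaginatedSketch and scrape_all are illustrative names rather than spatula classes.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical in-memory "site": each source maps to (items, next source).
FAKE_SITE: Dict[str, Tuple[List[str], Optional[str]]] = {
    "page/1": (["a", "b"], "page/2"),
    "page/2": (["c"], None),
}


class PaginatedSketch:
    """Illustrative page object; not a spatula class."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.items, self._next = FAKE_SITE[source]

    def process_page(self) -> List[str]:
        return list(self.items)

    def get_next_source(self) -> Optional[str]:
        # Return the next source, or None when this is the last page.
        return self._next


def scrape_all(first_source: str) -> Iterator[str]:
    """Yield items page by page until get_next_source() returns None."""
    source: Optional[str] = first_source
    while source is not None:
        page = PaginatedSketch(source)
        yield from page.process_page()
        source = page.get_next_source()


results = list(scrape_all("page/1"))
```

Returning None is what ends the loop, so a subclass only needs to decide whether a further page exists.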
get_source_from_input()¶
To be overridden.
Convert self.input to a Source object such as a URL.
postprocess_response()¶
To be overridden.
This is called after source.get_response but before self.process_page.
process_error_response(exception)¶
To be overridden.
This is called after source.get_response if an exception is raised.
process_page() abstractmethod¶
To be overridden.
Return data extracted from this page and this page alone.
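The order in which the hooks above are described as running can be sketched with a small stand-in class. LifecycleSketch, run, and failing_fetch are hypothetical names. What happens after process_error_response returns is not specified in this reference, so the sketch simply stops at that point, which is an assumption.

```python
from typing import Any, Callable, List


class LifecycleSketch:
    """Hypothetical page that records which hooks ran, in order."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self.fetch = fetch
        self.calls: List[str] = []
        self.response: Any = None

    def postprocess_response(self) -> None:
        self.calls.append("postprocess_response")

    def process_error_response(self, exception: Exception) -> None:
        name = type(exception).__name__
        self.calls.append(f"process_error_response({name})")

    def process_page(self) -> Any:
        self.calls.append("process_page")
        return self.response.upper()


def run(page: LifecycleSketch) -> Any:
    """Fetch, then call the hooks in the documented order."""
    try:
        page.response = page.fetch()
    except Exception as exc:
        page.process_error_response(exc)
        return None  # assumption: stop once the error hook has run
    page.postprocess_response()
    return page.process_page()


def failing_fetch() -> str:
    raise ConnectionError("boom")


ok_page = LifecycleSketch(lambda: "hello")
ok_result = run(ok_page)

bad_page = LifecycleSketch(failing_fetch)
bad_result = run(bad_page)
```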
HtmlPage¶
Bases: Page
Page that automatically handles parsing and normalizing links in an HTML response.
Attributes
root
- lxml.etree.Element object representing the root element (e.g. <html>) on the page. Can use the normal lxml methods (such as cssselect and getchildren), or use this element as the target of a Selector subclass.
JsonPage¶
Bases: Page
Page that automatically handles parsing a JSON response.
Attributes
data
- JSON data from response. (same as self.response.json())
PdfPage¶
Bases: Page
Page that automatically handles converting a PDF response to text using pdftotext.
Attributes
preserve_layout
- Set to True on a derived class if you want the conversion function to use pdftotext's -layout option to attempt to preserve the layout of text. (False by default)
text
- UTF-8 text extracted by pdftotext.
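To make the effect of preserve_layout concrete, here is a sketch of how a pdftotext command line could be assembled. pdftotext_command is a hypothetical helper, and how spatula itself invokes pdftotext is not shown in this reference. Only the -layout option and the use of "-" to write text to standard output are standard pdftotext behavior.

```python
from typing import List


def pdftotext_command(pdf_path: str, preserve_layout: bool = False) -> List[str]:
    """Build an argument list for pdftotext without running it."""
    command = ["pdftotext"]
    if preserve_layout:
        # -layout asks pdftotext to keep the original physical layout.
        command.append("-layout")
    # "-" as the output file sends the extracted text to standard output.
    command += [pdf_path, "-"]
    return command


plain = pdftotext_command("report.pdf")
with_layout = pdftotext_command("report.pdf", preserve_layout=True)
```

A list like this could be passed to subprocess.run with capture_output=True, with the captured bytes decoded as UTF-8, provided pdftotext is installed.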
XmlPage¶
Bases: Page
Page that automatically handles parsing an XML response.
Attributes
root
- lxml.etree.Element object representing the root XML element on the page.
ListPages¶
ListPage¶
Bases: Page
Base class for the common pattern of extracting many homogeneous items from one page.
Instead of overriding process_page, subclasses should provide a process_item method.
CsvListPage¶
Bases: ListPage
Processes each row in a CSV (after the first, assumed to be headers) as an item with process_item.
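The row handling described here can be sketched with the standard library's csv module. This illustrates the pattern only: CSV_TEXT, iter_rows, and this free-standing process_item are hypothetical, and passing each row as a dictionary keyed by the header names is an assumption rather than something this reference states.

```python
import csv
import io
from typing import Dict, Iterator

# Hypothetical CSV payload; the first row holds the column headers.
CSV_TEXT = "name,role\nAda,Engineer\nGrace,Admiral\n"


def iter_rows(text: str) -> Iterator[Dict[str, str]]:
    """Yield every row after the header row as a dict keyed by header name."""
    yield from csv.DictReader(io.StringIO(text))


def process_item(item: Dict[str, str]) -> str:
    # Stand-in for the per-row logic a subclass would provide.
    return f"{item['name']} ({item['role']})"


results = [process_item(row) for row in iter_rows(CSV_TEXT)]
```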
ExcelListPage¶
Bases: ListPage
Processes each row in an Excel file as an item with process_item.
HtmlListPage¶
Bases: LxmlListPage, HtmlPage
Selects homogeneous items from an HTML page using selector and passes them to process_item.
Attributes
selector
- Selector subclass which matches the list of homogeneous elements to process. (e.g. CSS("tbody tr"))
JsonListPage¶
Bases: ListPage, JsonPage
Processes each element in a JSON list as an item with process_item.
XmlListPage¶
Bases: LxmlListPage, XmlPage
Selects homogeneous items from an XML document using selector and passes them to process_item.
Attributes
selector
- Selector subclass which matches the list of homogeneous elements to process. (e.g. XPath("//item"))
Selectors¶
Selector¶
Bases: ABC
Base class implementing Selector interface.
match(element, *, min_items=None, max_items=None, num_items=None)¶
Return all matches of the given selector within element.
If the number of elements matched is outside the prescribed boundaries, a SelectorError is raised.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
element | _Element | The element to match within. When used from within a | required |
min_items | Optional[int] | A minimum number of items to match. | None |
max_items | Optional[int] | A maximum number of items to match. | None |
num_items | Optional[int] | An exact number of items to match. | None |
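The boundary checks described for match can be modeled independently of lxml. The sketch below is an assumption about the general shape of such a check, not the library's code: BoundsError stands in for SelectorError, and check_bounds is a hypothetical helper that works on any sequence.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


class BoundsError(ValueError):
    """Hypothetical stand-in for SelectorError."""


def check_bounds(
    items: Sequence[T],
    *,
    min_items: Optional[int] = None,
    max_items: Optional[int] = None,
    num_items: Optional[int] = None,
) -> List[T]:
    """Return the items as a list, or raise if the count is out of bounds."""
    count = len(items)
    if num_items is not None and count != num_items:
        raise BoundsError(f"expected exactly {num_items} items, got {count}")
    if min_items is not None and count < min_items:
        raise BoundsError(f"expected at least {min_items} items, got {count}")
    if max_items is not None and count > max_items:
        raise BoundsError(f"expected at most {max_items} items, got {count}")
    return list(items)


rows = check_bounds(["row1", "row2"], min_items=1, max_items=5)
```

Because the error type derives from ValueError, callers that already handle ValueError would also catch a failed constraint.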
match_one(element)¶
Return exactly one match.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
element | _Element | Element to search within. | required |
CSS¶
Utilize CSS-style selectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
css_selector | str | CSS selector expression to use. | required |
min_items | Optional[int] | A minimum number of items to match. | 1 |
max_items | Optional[int] | A maximum number of items to match. | None |
num_items | Optional[int] | An exact number of items to match. | None |
SimilarLink¶
Match links that fit a provided pattern.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern | str | Regular expression for link hrefs. | required |
min_items | Optional[int] | A minimum number of items to match. | 1 |
max_items | Optional[int] | A maximum number of items to match. | None |
num_items | Optional[int] | An exact number of items to match. | None |
XPath¶
Utilize XPath selectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
xpath | str | XPath expression to use. | required |
min_items | Optional[int] | A minimum number of items to match. | 1 |
max_items | Optional[int] | A maximum number of items to match. | None |
num_items | Optional[int] | An exact number of items to match. | None |
Sources¶
URL¶
Defines a resource to fetch via URL, particularly useful for handling non-GET requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL to fetch | required |
method | str | HTTP method to use, defaults to "GET" | 'GET' |
data | dict | POST data to include in request body. | None |
headers | dict | dictionary of HTTP headers to set for the request. | None |
verify | bool | bool indicating whether or not to verify SSL certificates for request, defaults to True | True |
timeout | Optional[float] | HTTP(S) timeout in seconds | None |
retries | Optional[int] | number of retries to make | None |
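For readers less familiar with non-GET requests, the sketch below shows what the url, method, data, and headers parameters correspond to, using only the standard library's urllib. It does not use spatula's URL class. build_request is a hypothetical helper, and form-encoding the data dictionary is an assumption about how POST data would be sent. No network request is made.

```python
from typing import Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_request(
    url: str,
    method: str = "GET",
    data: Optional[Dict[str, str]] = None,
    headers: Optional[Dict[str, str]] = None,
) -> Request:
    """Describe an HTTP request without sending it."""
    # Assumption: the data dict is sent as a form-encoded request body.
    body = urlencode(data).encode("utf-8") if data else None
    return Request(url, data=body, headers=headers or {}, method=method)


request = build_request(
    "https://example.com/search",
    method="POST",
    data={"q": "spatula"},
    headers={"Accept": "text/html"},
)
```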
NullSource¶
Bases: Source
Special class to set as a page's source to indicate no HTTP request needs to be performed.
Exceptions¶
SelectorError¶
Bases: ValueError
Error raised when a selector's constraint (min_items/max_items, etc.) is not met.
SkipItem¶
Bases: Exception
To be raised to skip processing of the current item and continue with the next item.
Example:
    from spatula import HtmlListPage, SkipItem

    class SomeListPage(HtmlListPage):
        def process_item(self, item):
            if item.name == "Vacant":
                raise SkipItem("vacant")
            # do normal processing here
Or, if the skip logic needs to live within a detail Page:
    from spatula import HtmlPage, SkipItem

    class SomeDetailPage(HtmlPage):
        def process_page(self):
            if self.input.name == "Vacant":
                raise SkipItem("vacant")
            # do normal processing here
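The effect of raising SkipItem can be illustrated with a plain-Python loop. This is a simplified model rather than spatula's internals: SkipItemSketch, process_item, and process_all are hypothetical names, and the items are ordinary dictionaries.

```python
from typing import Dict, Iterable, Iterator


class SkipItemSketch(Exception):
    """Hypothetical stand-in for SkipItem."""


def process_item(item: Dict[str, str]) -> str:
    if item["name"] == "Vacant":
        raise SkipItemSketch("vacant")
    return item["name"].title()


def process_all(items: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Process every item, dropping those that raise SkipItemSketch."""
    for item in items:
        try:
            yield process_item(item)
        except SkipItemSketch:
            # Skip this item and continue with the next one.
            continue


results = list(process_all([{"name": "ada"}, {"name": "Vacant"}, {"name": "grace"}]))
```

Only the skip exception is caught, so any other error raised while processing an item still propagates to the caller.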