Advanced Techniques
Specifying Dependencies
While the pattern laid out in Scraper Basics is fairly common, sometimes data is laid out in ways that don't fit as neatly into the list & detail page workflow.
For example, take a look at the page https://scrapple.fly.dev/awards, which lists awards that some Yoyodyne employees have won.
In some cases you may decide to scrape this data separately using the typical HtmlListPage approach, but let's say you want each employee to have a list of their awards as part of the EmployeeList/EmployeeDetail scrape.
Scraping the New Page
First let's write a quick page scraper to grab the award name & year and associate them with a person's name:
# add to imports
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Award:
    award: str
    year: str


class AwardsPage(HtmlPage):
    source = "https://scrapple.fly.dev/awards"

    def process_page(self):
        # augmentation pages will almost always return some kind of dictionary
        mapping = defaultdict(list)
        award_cards = CSS(".card .large", num_items=9).match(self.root)
        for item in award_cards:
            award = CSS("h2").match_one(item).text.strip()
            year = CSS("small").match_one(item).text
            # make sure we got exactly 2 <dd> tags, and we take the second
            name = CSS("dd").match(item, num_items=2)[1].text
            # map the data based on a key we know we have elsewhere in the scrape
            mapping[name].append(Award(award=award, year=year))
        return mapping
Running this scrape yields a single dictionary:
$ spatula test quickstart.AwardsPage
INFO:quickstart.AwardsPage:fetching https://scrapple.fly.dev/awards
defaultdict(<class 'list'>,
            {'John Coyote': [Award(award='Album of the Year', year='1997')],
             'John Fish': [Award(award='Cousteau Society Award', year='1989')],
             'John Lee': [Award(award='Most Creative Loophole', year='1985')],
             'John Many Jars': [Award(award='2nd Place, Most Jars Category', year='1987')],
             "John O'Connor": [Award(award='Nobel Prize in Physics', year='2986')],
             'John Two Horns': [Award(award='Paralegal of the Year', year='1999')],
             'John Whorfin': [Award(award='Nobel Prize in Physics', year='1934'),
                              Award(award='Best Supporting Actor', year='1985')],
             'John Ya Ya': [Award(award='ACM Award', year='1986')]})
Add the Dependency
Now that this page is working, we can connect it to our previously written EmployeeList scrape.
class EmployeeDetail(HtmlPage):
    dependencies = {"award_mapping": AwardsPage()}

    ...
This line ensures that each instance of EmployeeDetail will have a self.award_mapping attribute, pre-populated with the result of AwardsPage.
If you pass an instance of a page, each EmployeeDetail will share a cached copy of AwardsPage, ensuring it is only scraped once.
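Since dependencies is a dictionary, it can presumably hold more than one entry if a scrape needs several pieces of shared data. A minimal sketch, assuming a second augmentation page exists (CertificationsPage is hypothetical, not part of this tutorial):

class EmployeeDetail(HtmlPage):
    dependencies = {
        "award_mapping": AwardsPage(),
        # hypothetical second augmentation page, resolved the same way
        "cert_mapping": CertificationsPage(),
    }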
Use the Dependency Data
The final step now is to actually use the mapping to attach the awards to the employees:
class Employee(PartialEmployee):
    status: str
    hired: str
    awards: list[Award]
And then within EmployeeDetail:
def process_page(self):
    status = CSS("#status").match_one(self.root)
    hired = CSS("#hired").match_one(self.root)
    return Employee(
        first=self.input.first,
        last=self.input.last,
        position=self.input.position,
        url=self.input.url,
        status=status.text,
        hired=hired.text,
        # defaultdict(list) means employees with no awards get an empty list
        awards=self.award_mapping[f"{self.input.first} {self.input.last}"],
    )
We can test by passing fake data for a person that has an award:
$ spatula test quickstart.EmployeeDetail --data first=John --data last=Fish
INFO:quickstart.AwardsPage:fetching https://scrapple.fly.dev/awards
INFO:quickstart.EmployeeDetail:fetching https://scrapple.fly.dev/staff/1
{'awards': [Award(award='Cousteau Society Award', year='1989')],
 'first': 'John',
 'hired': '10/31/1938',
 'last': 'Fish',
 'status': 'Single',
 'position': 'Engineer',
 'url': 'https://scrapple.fly.dev/staff/1'}
Note that before fetching the EmployeeDetail page, AwardsPage is fetched, and the awards data is then correctly attached to John Fish.
Advanced Sources
NullSource
Every Page has a source which is fetched when it is executed. There are cases where you may wish to avoid that behavior. If you set NullSource for a page, no HTTP request will be performed prior to the process_item method being called.
A common use for this is dispatching multiple detail pages without a corresponding list page.
class NebraskaPageGenerator(ListPage):
    """
    When scraping the Nebraska legislature, the pages are named
    http://news.legislature.ne.gov/dist01/
    through
    http://news.legislature.ne.gov/dist49/
    but without an easy-to-scrape source.

    So we use this method to mimic the results that a ListPage would yield,
    without a wasted request.
    """

    source = NullSource()

    def process_page(self):
        # dispatch a detail page for each of the 49 districts
        for n in range(1, 50):
            yield NebraskaLegPage(source=f"http://news.legislature.ne.gov/dist{n:02d}/")
Custom Sources
Sometimes you need a page to do something that isn't easy to do with a single URL object.
To derive a custom Source, simply override the get_response method in your own custom source class.
For example:
import lxml.html
import requests
import scrapelib
from spatula import Source


class FauxFormSource(Source):
    """
    emulate a case where we need to get a hidden input value to successfully
    retrieve a form
    """

    def get_response(self, scraper: scrapelib.Scraper) -> requests.models.Response:
        url = "https://example.com/"
        resp = scraper.get(url)
        root = lxml.html.fromstring(resp.content)
        # pull the hidden CSRF token out of the initial response
        token = root.xpath(".//input[@name='csrftoken']")[0].value
        # do second request with data
        resp = scraper.post(url, {"csrftoken": token})
        return resp
You can do whatever you want within get_response as long as something resembling a requests.Response is returned.
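A custom source doesn't even need to touch the network. A minimal sketch, assuming you want to replay a saved HTML file (LocalFileSource and its path parameter are invented for illustration, not part of spatula):

import requests
import scrapelib
from spatula import Source


class LocalFileSource(Source):
    """
    hypothetical source that replays a saved HTML file instead of
    making an HTTP request -- handy for offline testing
    """

    def __init__(self, path: str):
        self.path = path

    def get_response(self, scraper: scrapelib.Scraper) -> requests.models.Response:
        # build a bare Response by hand; the scraper goes unused here
        resp = requests.models.Response()
        resp.status_code = 200
        with open(self.path, "rb") as f:
            resp._content = f.read()
        return resp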
Custom Page Types
Another powerful technique is to define your own Page or ListPage subclasses.
Most of the existing page types are fairly simple, typically only overriding postprocess_response, which is called after any source is turned into self.response, but before process_item is called.
If you wanted to use BeautifulSoup instead of spatula's default lxml.html, you could define a custom SoupPage:
from bs4 import BeautifulSoup
from spatula import Page


class SoupPage(Page):
    def postprocess_response(self) -> None:
        # self.response is guaranteed to have been set by now
        self.soup = BeautifulSoup(self.response.content, "html.parser")
This would let any pages that derive from SoupPage use self.soup the same way that self.root is available on HtmlPage.
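For example, a page built on SoupPage might look like this minimal sketch (ExampleHeadlinePage, its source, and its selector are invented for illustration):

class ExampleHeadlinePage(SoupPage):
    source = "https://example.com/"

    def process_page(self):
        # self.soup was populated by SoupPage.postprocess_response
        return {"headline": self.soup.find("h1").text}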
Retries
Sometimes a server returns incomplete or otherwise erroneous data intermittently. It can be useful to check if the page contains the expected data and retry after some wait period if not.
This can be done by adding an accept_response method to your Page subclass. If accept_response returns False, spatula will sleep for spatula.config.RETRY_WAIT_SECONDS and then retry.
By default this retry will only happen once, controlled by spatula.config.REJECTED_RESPONSE_RETRIES.
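Assuming these are plain module-level settings that can be reassigned before a scrape runs (spatula.config is experimental, per the warning below), tuning them might look like this sketch:

import spatula.config

# assumption: reassigning these module attributes adjusts retry behavior globally
spatula.config.RETRY_WAIT_SECONDS = 10        # wait 10 seconds before retrying
spatula.config.REJECTED_RESPONSE_RETRIES = 3  # allow up to 3 retries of a rejected response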
If you need to set a per-Source number of retries, you can also pass retries to URL like so:

RejectPartialPage(source=URL("https://openstates.org", retries=3))
Warning
spatula.config is experimental; any use of these variables may change before 1.0 is released.
An example:
class RejectPartialPage(Page):
    def accept_response(self, response: requests.Response) -> bool:
        # sometimes the page is returned missing the footer, retry if so
        return "<footer>" in response.text