# Usage

## Data Flow
Since most of the work is done by the API, the job of a `SchemaScraper` is to make it easier to pass HTML and get valid output.

If you are going to go beyond the basics, it is important to understand the data flow:
1. The page HTML is passed through any preprocessors.

    a. The `CleanHTML` preprocessor removes unnecessary tags and attributes. (This is done by default.)

    b. If an `XPath` or `CSS` preprocessor is used, the results are selected and re-combined into a single HTML string.

    c. Custom preprocessors can also execute here.

2. The HTML and schema are sent to the LLM with instructions to extract.

3. The results are passed through any postprocessors.

    a. The `JSONPostprocessor` converts the results to JSON. (This is done by default.) If the results are not valid JSON, a second (much smaller) request can be made to ask it to fix the JSON.

    b. Custom postprocessors can also execute here.
You can modify nearly any part of the process to suit your needs. (See Customization for more details.)
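To ground the flow above, here is a minimal end-to-end sketch; the schema and URL are placeholders for illustration, not part of the library's examples:

```python
from scrapeghost import SchemaScraper

# hypothetical schema and target page
scraper = SchemaScraper({"name": "str", "url": "url"})

# HTML is fetched, preprocessed, sent to the LLM, and postprocessed
result = scraper.scrape("https://example.com")
print(result.data)
```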
## Auto-splitting
While the flow above covers most cases, there is one special case that is worth mentioning.

If you set the `auto_split_length` parameter to a positive integer, the HTML will be split into multiple requests, where each request aims to be no larger than `auto_split_length` tokens.
!!! warning

    In list mode, a single call can make many requests. Keep an eye on the `max_cost` parameter if you're using this.
While this seems to work well enough for long lists of similar items, whether it is worth the time and money is up to you. Writing a bit of code is probably the better option in most cases.
Instead of recombining the results of the `XPath` or `CSS` preprocessor into a single HTML string, the results are chunked into smaller pieces (<= `auto_split_length` tokens) and sent to the API separately.

The instructions are also modified slightly, indicating that your schema is for a list of similar items.
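A hedged sketch of how this might be configured, assuming `auto_split_length` is passed to `SchemaScraper`; the schema, selector, and length are illustrative, not from the library's examples:

```python
from scrapeghost import SchemaScraper, CSS

scraper = SchemaScraper(
    {"name": "str", "url": "url"},             # placeholder schema for one list item
    auto_split_length=2000,                    # aim for chunks of <= 2000 tokens each
    extra_preprocessors=[CSS("ul.items li")],  # hypothetical selector for the repeated items
)
result = scraper.scrape("https://example.com/directory")
```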
## Customization

To make it easier to experiment with different approaches, it is possible to customize nearly every part of the process, from how the HTML is retrieved to how the results are processed.
### HTTP Requests

Instead of providing mechanisms to customize the HTTP request made by the library (e.g. to use caching, or to make a `POST` request), you can simply pass already-retrieved HTML to the `scrape` method.

This means you can use any HTTP library you want to retrieve the HTML.
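For example, a sketch using the `requests` library (any HTTP client works; the schema and URL are placeholders):

```python
import requests
from scrapeghost import SchemaScraper

scraper = SchemaScraper({"name": "str"})  # placeholder schema

# fetch the page however you like: caching, POST, custom headers, etc.
html = requests.get("https://example.com").text

# pass the already-retrieved HTML instead of a URL
result = scraper.scrape(html)
```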
### Preprocessors

Preprocessors allow you to modify the HTML before it is sent to the API.

Three preprocessors are provided:

- `CleanHTML` - Cleans the HTML using `lxml.html.clean.Cleaner`.
- `XPath` - Applies an XPath selector to the HTML.
- `CSS` - Applies a CSS selector to the HTML.
!!! note

    `CleanHTML` is always applied first, as it is part of the default preprocessors list.
You can add your own preprocessors by passing a list to the `extra_preprocessors` parameter of `SchemaScraper`.

```python
scraper = SchemaScraper(schema, extra_preprocessors=[CSS("table")])
```
It is also possible to pass preprocessors at scrape time:

```python
scraper = SchemaScraper(schema)
scraper.scrape("https://example.com", extra_preprocessors=[CSS("table")])
```
Implementing your own preprocessor is simple: just create a callable that takes an `lxml.html.HtmlElement` and returns a list of one or more `lxml.html.HtmlElement` objects. Look at `preprocessors.py` for examples.
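A minimal sketch of such a callable; the function name, schema, and XPath expression are hypothetical:

```python
from lxml.html import HtmlElement
from scrapeghost import SchemaScraper

def tables_only(doc: HtmlElement) -> list[HtmlElement]:
    # keep only the <table> elements from the document
    return doc.xpath("//table")

scraper = SchemaScraper({"name": "str"}, extra_preprocessors=[tables_only])
```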
### Altering the Instructions to GPT

Right now you can give GPT additional instructions by passing a list of strings to the `extra_instructions` parameter of `SchemaScraper`.

You can also use the `model_params` parameter to pass additional arguments to the API.
```python
schema = {"name": "str", "committees": [], "bio": "str"}
scraper = SchemaScraper(
    schema,
    models=["gpt-4"],
    extra_instructions=[
        "Put the legislator's bio in the 'bio' field. Summarize it so that it is no longer than 3 sentences."
    ],
)
scraper.scrape("https://norton.house.gov/about/full-biography").data
```
```python
{'name': 'Representative Eleanor Holmes Norton',
 'committees': [
    'House Subcommittee on Highways and Transit',
    'Committee on Oversight and Reform',
    'Committee on Transportation and Infrastructure'
 ],
 'bio': 'Congresswoman Eleanor Holmes Norton has been serving as the congresswoman for the District of Columbia since 1991. She is the Chair of the House Subcommittee on Highways and Transit and serves on two committees: the Committee on Oversight and Reform and the Committee on Transportation and Infrastructure. Before her congressional service, President Jimmy Carter appointed her to serve as the first woman to chair the U.S. Equal Employment Opportunity Commission.'}
```
These instructions can be useful for refining the results, but they are not required.
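The `model_params` parameter mentioned above can be used in the same call; a hedged sketch, assuming the underlying API accepts a `temperature` argument:

```python
scraper = SchemaScraper(
    schema,
    # forwarded to the API call; "temperature" is an assumed-supported key
    model_params={"temperature": 0},
)
```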
### Altering the API / Model

See https://github.com/jamesturk/scrapeghost/issues/18
### Postprocessors

Postprocessors take the results of the API call and modify them before returning them to the user.

Three postprocessors are provided:

- `JSONPostprocessor` - Converts the results to JSON.
- `HallucinationChecker` - Checks the results for hallucinations.
- `PydanticPostprocessor` - Converts the results to JSON and validates them using a `pydantic` model.
By default, `JSONPostprocessor` and `HallucinationChecker` are enabled.

`HallucinationChecker` verifies that values in the response are present in the source HTML. This is useful for ensuring that the results are not "hallucinations". This is done as a proof of concept, and to help determine how big of an issue hallucinations are for this use case.
### Using pydantic Models

If you want to validate that the returned data isn't just valid JSON, but is also in the format you expect, you can use `pydantic` models.
```python
from pydantic import BaseModel
from scrapeghost import SchemaScraper, CSS


class CrewMember(BaseModel):
    gender: str
    race: str
    alignment: str


# passing a pydantic model to the SchemaScraper will generate a schema from it
# and add the PydanticPostprocessor to the postprocessors
scrape_crewmember = SchemaScraper(schema=CrewMember)
result = scrape_crewmember.scrape(
    "https://spaceghost.fandom.com/wiki/Zorak",
    extra_preprocessors=[CSS(".infobox")],
)
print(repr(result.data))
```
```python
CrewMember(gender='Male', race='Dokarian', alignment='Evil\nProtagonist')
```
This works by converting the `pydantic` model to a schema and registering a `PydanticPostprocessor` to validate the results automatically.
## Pagination

One technique to handle pagination is provided by the `PaginatedSchemaScraper` class.

This class takes a schema that describes a single result, and wraps it in a schema that describes a list of results along with a `next_page` URL.
For example:

```python
{"first_name": "str", "last_name": "str"}
```

Automatically becomes:

```python
{"next_page": "url", "results": [{"first_name": "str", "last_name": "str"}]}
```
The `PaginatedSchemaScraper` class then takes care of following the `next_page` link until there are no more pages.
!!! note

    Right now, given the library's stance that customizing requests is "just use your own HTTP library", the `PaginatedSchemaScraper` class does not provide a means to customize the HTTP request used to retrieve the next page.

    If you need a more complicated approach, it is recommended you implement your own pagination logic for now; https://github.com/jamesturk/scrapeghost/blob/main/src/scrapeghost/scrapers.py#L238 may be a good starting point.

    If you have strong opinions here, please open an issue to discuss.
The `PaginatedSchemaScraper` then takes the combined `results` and returns them to the user.
Here's a functional example that scrapes several pages of employees:

```python
import json
from scrapeghost.scrapers import PaginatedSchemaScraper

schema = {"first_name": "str", "last_name": "str", "position": "str", "url": "url"}
url = "https://scrapple.fly.dev/staff"

scraper = PaginatedSchemaScraper(schema)
resp = scraper.scrape(url)

# the resulting response is a ScrapeResponse object just like any other
# all the results are gathered in resp.data
json.dump(resp.data, open("yoyodyne.json", "w"), indent=2)
```
!!! warning

    One caveat of the current approach: the `url` attribute on a `ScrapeResponse` from a `PaginatedSchemaScraper` is a semicolon-delimited list of all the URLs that were scraped to produce that result.
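If you need the individual page URLs back from `resp` in the example above, a minimal sketch, assuming plain `;` separators (whitespace is stripped in case the delimiter includes a space):

```python
page_urls = [u.strip() for u in resp.url.split(";")]
```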