About

scrapeghost is an experimental library for scraping websites using OpenAI's GPT.

The library provides a means to scrape structured data from HTML without writing page-specific code.

Important

Before you proceed, here are at least three reasons why you should not use this library:

  • It is very experimental; no guarantees are made about the stability of the API or the accuracy of the results.

  • It relies on the OpenAI API, which is quite slow and can be expensive. (See costs before using this library.)

  • Currently licensed under Hippocratic License 3.0. (See FAQ.)

Use at your own risk.

Quickstart

Step 1) Obtain an OpenAI API key (https://platform.openai.com) and set an environment variable:

export OPENAI_API_KEY=sk-...

Step 2) Install the library however you like:

pip install scrapeghost
or
poetry add scrapeghost

Step 3) Instantiate a SchemaScraper by defining the shape of the data you wish to extract:

from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

Note

There's no pre-defined format for the schema; the GPT models do a good job of figuring out what you want, and you can use whatever values you like to provide hints.
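For example, hint values can encode format or vocabulary expectations. A minimal sketch (the field names and hint strings below are invented for illustration):

from scrapeghost import SchemaScraper

# Values act as hints to the model rather than strict types,
# so they can be as loose or as specific as you like.
scrape_event = SchemaScraper(
    schema={
        "title": "string",
        "date": "YYYY-MM-DD",            # format hint
        "start_time": "HH:MM, 24-hour",  # format hint
        "admission": "free|paid",        # vocabulary hint
        "tags": ["string"],              # list of strings
    }
)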

Step 4) Passing a URL (or HTML) to the resulting scraper will return the scraped data:

resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data
{"name": "Emanuel 'Chris' Welch",
 "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
 "district": "7th",
 "party": "D",
 "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
 "offices": [
     {"name": "Springfield Office",
      "address": "300 Capitol Building, Springfield, IL 62706",
      "phone": "(217) 782-5350"},
     {"name": "District Office",
      "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154",
      "phone": "(708) 450-1000"}
 ]}

That's it!

Read the tutorial for a step-by-step guide to building a scraper.

Command Line Usage Example

If you've installed the package (e.g. with pipx), you can use the scrapeghost command line tool to experiment.

scrapeghost https://www.ncleg.gov/Members/Biography/S/436 \
    --schema "{'first_name': 'str', 'last_name': 'str',
               'photo_url': 'url', 'offices': [] }" \
    --css div.card | python -m json.tool
{
    "first_name": "Gale",
    "last_name": "Adcock",
    "photo_url": "https://www.ncleg.gov/Members/MemberImage/S/436/Low",
    "offices": [
        {
            "type": "Mailing",
            "address": "16 West Jones Street, Rm. 1104, Raleigh, NC 27601"
        },
        {
            "type": "Office Phone",
            "phone": "(919) 715-3036"
        }
    ]
}

See the CLI docs for more details.

Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.

While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use.

Python-based schema definition - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.

Preprocessing

  • HTML cleaning - Remove unnecessary HTML to reduce the size and cost of API requests.
  • CSS and XPath selectors - Pre-filter HTML by writing a single CSS or XPath selector.
  • Auto-splitting - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped. (See the sketch after this list.)
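
As a rough sketch of how these preprocessing options fit together (the selector and values here are illustrative; confirm parameter names against the docs for your installed version):

from scrapeghost import SchemaScraper, CSS

scraper = SchemaScraper(
    schema={"name": "string", "party": "string"},
    # Pre-filter the page down to the relevant element before it is
    # sent to the model.
    extra_preprocessors=[CSS("div.member-info")],  # hypothetical selector
    # Split HTML larger than this many tokens across multiple calls.
    auto_split_length=2000,
)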

Postprocessing

  • JSON validation - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)
  • Schema validation - Go a step further and use a pydantic schema to validate the response (sketched after this list).
  • Hallucination check - Does the data in the response truly exist on the page?
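
One way to get schema validation is to pass a pydantic model in place of a plain dict; a minimal sketch (the model and its fields are illustrative):

from pydantic import BaseModel
from scrapeghost import SchemaScraper

class Legislator(BaseModel):
    name: str
    district: str
    party: str

# Responses are validated against the pydantic model.
scrape_legislators = SchemaScraper(schema=Legislator)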

Cost Controls

  • Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
  • Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default and fall back to GPT-4 if needed).
  • Budgets can be set to stop the scraper if the running cost exceeds them. (See the sketch below.)
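
A sketch of what this can look like in practice (parameter names follow the scrapeghost docs at the time of writing; the URL is a placeholder):

from scrapeghost import SchemaScraper

scraper = SchemaScraper(
    schema={"name": "string", "party": "string"},
    # Try the cheaper model first; fall back to GPT-4 if it fails.
    models=["gpt-3.5-turbo", "gpt-4"],
    # Stop scraping once total spend exceeds this many dollars.
    max_cost=1.00,
)
resp = scraper("https://example.com/legislator")  # placeholder URL
print(scraper.stats())  # running token counts and total cost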