Tutorial¶
This tutorial will show you how to use scrapeghost to build a web scraper without writing page-specific code.
Prerequisites¶
Install scrapeghost¶
You'll need to install scrapeghost. You can do this with pip, poetry, or your favorite Python package manager.
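For example, with pip (assuming the package is published on PyPI under the same name):
$ pip install scrapeghost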
Getting an API Key¶
To use the OpenAI API you will need an API key. You can get one by creating an OpenAI account and then generating a key.
It's strongly recommended that you set a usage limit on your API key to avoid accidentally running up a large bill. You can set limits at https://platform.openai.com/account/billing/limits to avoid unpleasant surprises.
Using your key¶
Once an API key is created, you can set it as an environment variable:
$ export OPENAI_API_KEY=sk-...
You can also set the API key directly in Python:
import openai
openai.api_key_path = "~/.openai-key"
# - or -
openai.api_key = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Be careful not to expose this key to the public by checking it into a public repository.
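As a minimal sketch of one way to avoid that, you can read the key from the environment yourself instead of embedding it in your source (this uses only the standard library plus the openai package shown above):
import os
import openai

# read the key from the environment rather than hard-coding it
openai.api_key = os.environ["OPENAI_API_KEY"]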
Writing a Scraper¶
The goal of our scraper is to get a list of all of the episodes of the podcast Comedy Bang Bang.
To do this, we'll need two kinds of scrapers: one to get a list of all of the episodes, and one to get the details of each episode.
Getting Episode Details¶
At the time of writing, the most recent episode of Comedy Bang Bang is Episode 800, Operation Golden Orb.
The URL for this episode is https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb.
Let's say we want to build a scraper that finds out each episode's title, episode number, and release date.
We can do this by creating a SchemaScraper object and passing it a schema.
from scrapeghost import SchemaScraper
from pprint import pprint
url = "https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "str",
}
episode_scraper = SchemaScraper(schema)
response = episode_scraper(url)
pprint(response.data)
print(f"Total Cost: ${response.total_cost:.3f}")
There is no fixed format for the schema, but a good starting point is a dictionary resembling the data you want to scrape, where the keys are the names of the fields and the values are their types.
Once you have an instance of SchemaScraper, you can use it to scrape a specific page by passing it a URL (or HTML, if you prefer or need to fetch the data another way).
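As a quick sketch of that second option, you could fetch the page yourself and pass the raw HTML string to the scraper (requests is used here purely for illustration; any HTTP client works):
import requests
from scrapeghost import SchemaScraper

url = "https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "str",
}
episode_scraper = SchemaScraper(schema)

# fetch the HTML ourselves, then pass the markup instead of the URL
html = requests.get(url).text
response = episode_scraper(html)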
Either way, running our code gives an error:
scrapeghost.scrapers.TooManyTokens: HTML is 9710 tokens, max for gpt-3.5-turbo is 4096
This means the page content is too long; we'll need to reduce our token count in order to make this work.
What Are Tokens?¶
If you haven't used OpenAI's APIs before, you may not be aware of the token limits. Every request has a limit on the number of tokens it can use. For GPT-4 this is 8,192 tokens. For GPT-3.5-Turbo it is 4,096. (A token is about three characters.)
You are also billed per token, so even if you're under the limit, fewer tokens means cheaper API calls.
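If you want a rough token count for a piece of text before sending anything, here's a minimal sketch using OpenAI's tiktoken library (not part of scrapeghost):
import tiktoken

# pick the encoding used by the target model and count tokens in a string
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
html = "<div>...page content here...</div>"
print(len(enc.encode(html)))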
Supported Models (updated June 13, 2023 for v0.5.1)
Model | Token Limit | Cost per 1K Input Tokens | Cost per 1K Output Tokens |
---|---|---|---|
gpt-3.5-turbo | 4,096 | $0.0015 | $0.002 |
gpt-3.5-turbo-16k | 16,384 | $0.003 | $0.004 |
gpt-4 | 8,192 | $0.03 | $0.06 |
gpt-4-32k | 32,768 | $0.06 | $0.12 |
gpt-3.5-turbo-0613 | 4,096 | $0.0015 | $0.002 |
gpt-3.5-turbo-16k-0613 | 16,384 | $0.003 | $0.004 |
Example: A 3,000 token page that returns 1,000 tokens of JSON will cost $0.0065 with GPT-3.5-Turbo, but $0.15 with GPT-4.
(See OpenAI pricing page for latest info.)
Ideally, we'd only pass the relevant parts of the page to OpenAI. It shouldn't need anything outside of the HTML <body>, nor anything in comments, script tags, etc.
(For more details on how this library interacts with OpenAI's API, see the OpenAI API page.)
Preprocessors¶
To help with all this, scrapeghost provides a way to preprocess the HTML before it is sent to OpenAI. This is done by passing a list of preprocessor callables to the SchemaScraper constructor.
Info
A CleanHTML preprocessor is included by default. It removes HTML comments, script tags, and style tags.
If you visit https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb and view the source, you'll see that all of the interesting content is in an element <div id="content" class="page-content">.
Just as we would if we were writing a traditional scraper, we'll write a CSS selector to grab this element; div.page-content will do.
The CSS preprocessor will use this selector to extract the content of the element.
from scrapeghost import SchemaScraper, CSS
from pprint import pprint
url = "https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "str",
}
episode_scraper = SchemaScraper(
    schema,
    # can pass preprocessor to constructor or at scrape time
    extra_preprocessors=[CSS("div.page-content")],
)
response = episode_scraper(url)
pprint(response.data)
print(f"Total Cost: ${response.total_cost:.3f}")
Now, a call to our scraper will only pass the content of the <div> to OpenAI. We get the following output:
2023-03-24 19:18:57 [info ] API request html_tokens=1332 model=gpt-3.5-turbo
2023-03-24 19:18:59 [info ] API response completion_tokens=33 cost=0.00291 duration=2.3073980808258057 finish_reason=stop prompt_tokens=1422
{'episode_number': 800,
'release_date': 'March 12, 2023',
'title': 'Operation Golden Orb'}
Total Cost: $0.003
We can see from the logging output that the content length is much shorter now and we get the data we were hoping for.
All for less than a penny!
Tip
Even when the page fits under the token limit, it is still a good idea to pass a selector to limit the amount of content that OpenAI has to process.
Fewer tokens means faster responses and cheaper API calls. It should also get you better results.
Enhancing the Schema¶
That was easy! Let's enhance our schema to include the list of guests as well as requesting the dates in a particular format.
from scrapeghost import SchemaScraper, CSS
from pprint import pprint
url = "https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "YYYY-MM-DD",
    "guests": [{"name": "str"}],
}
episode_scraper = SchemaScraper(
    schema,
    # can pass preprocessor to constructor or at scrape time
    extra_preprocessors=[CSS("div.page-content")],
)
response = episode_scraper(url)
pprint(response.data)
print(f"Total Cost: ${response.total_cost:.3f}")
Just two small changes, but now we get the following output:
2023-03-24 19:19:00 [info ] API request html_tokens=1332 model=gpt-3.5-turbo
2023-03-24 19:19:05 [info ] API response completion_tokens=83 cost=0.003036 duration=4.687386989593506 finish_reason=stop prompt_tokens=1435
{'episode_number': 800,
'guests': [{'name': 'Jason Mantzoukas'},
{'name': 'Andy Daly'},
{'name': 'Paul F. Tompkins'}],
'release_date': '2023-03-12',
'title': 'Operation Golden Orb'}
Total Cost: $0.003
Let's try it on a different episode, from the beginning of the series.
episode_scraper(
    "https://comedybangbang.fandom.com/wiki/Welcome_to_Comedy_Bang_Bang",
).data
{'episode_number': 1,
'guests': [{'name': 'Rob Huebel'},
{'name': 'Tom Lennon'},
{'name': 'Doug Benson'}],
'release_date': '2009-05-01',
'title': 'Welcome to Comedy Bang Bang'}
Not bad!
Dealing With Page Structure Changes¶
If you've maintained a scraper for any amount of time you know that the biggest burden is dealing with changes to the structure of the pages you're scraping.
To simulate this, let's say we instead wanted to get the same information from a different page: https://www.earwolf.com/episode/operation-golden-orb/
This page has a completely different layout. We will need to change our CSS selector:
from scrapeghost import SchemaScraper, CSS
from pprint import pprint
url = "https://www.earwolf.com/episode/operation-golden-orb/"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "YYYY-MM-DD",
    "guests": [{"name": "str"}],
}
episode_scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".hero-episode")],
)
response = episode_scraper(url)
pprint(response.data)
print(f"Total Cost: ${response.total_cost:.3f}")
2023-03-24 19:19:08 [info ] API request html_tokens=2988 model=gpt-3.5-turbo
2023-03-24 19:19:13 [info ] API response completion_tokens=88 cost=0.006358000000000001 duration=5.002557992935181 finish_reason=stop prompt_tokens=3091
{'episode_number': 800,
'guests': [{'name': 'Jason Mantzoukas'},
{'name': 'Andy Daly'},
{'name': 'Paul F. Tompkins'}],
'release_date': '2023-03-12',
'title': 'EP. 800 — Operation Golden Orb'}
Total Cost: $0.006
Completely different HTML, one CSS selector change.
Extra Instructions¶
You may notice that the title changed.
The second source includes the episode number in the title, but the first source does not.
You could deal with this with a bit of cleanup, but you have another option at your disposal: you can give the underlying model additional instructions to modify its behavior.
from scrapeghost import SchemaScraper, CSS
from pprint import pprint
url = "https://www.earwolf.com/episode/operation-golden-orb/"
schema = {
    "title": "str",
    "episode_number": "int",
    "release_date": "YYYY-MM-DD",
    "guests": [{"name": "str"}],
}
episode_scraper = SchemaScraper(
    schema,
    extra_preprocessors=[CSS(".hero-episode")],
    extra_instructions=[
        "Do not include the episode number in the title.",
    ],
)
response = episode_scraper(url)
pprint(response.data)
print(f"Total Cost: ${response.total_cost:.3f}")
2023-03-24 19:19:14 [info ] API request html_tokens=2988 model=gpt-3.5-turbo
2023-03-24 19:19:18 [info ] API response completion_tokens=83 cost=0.006378 duration=4.542723894119263 finish_reason=stop prompt_tokens=3106
{'episode_number': 800,
'guests': [{'name': 'Jason Mantzoukas'},
{'name': 'Andy Daly'},
{'name': 'Paul F. Tompkins'}],
'release_date': '2023-03-12',
'title': 'Operation Golden Orb'}
Total Cost: $0.006
At this point, you may be wondering if you'll ever need to write a web scraper again.
So to temper that, let's take a look at something that is a bit more difficult for scrapeghost to handle.
Getting a List of Episodes¶
Now that we have a scraper that can get the details of each episode, we want a scraper that can get a list of all of the episode URLs.
https://comedybangbang.fandom.com/wiki/Category:Episodes has a link to each of the episodes; perhaps we can just scrape that page?
from scrapeghost import SchemaScraper
episode_list_scraper = SchemaScraper({"episode_urls": ["str"]})
episode_list_scraper("https://comedybangbang.fandom.com/wiki/Category:Episodes")
scrapeghost.scrapers.TooManyTokens: HTML is 292918 tokens, max for gpt-3.5-turbo is 4096
Yikes, nearly 300k tokens! This is a huge page.
We can try again with a CSS selector, but this time we'll try to get a selector for each individual item.
If you've gotten this far, you may want to just extract the links using lxml.html or BeautifulSoup instead.
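For reference, a minimal sketch of that traditional approach with lxml.html might look like this (the container class is borrowed from the CSS selector used below; the exact XPath is an assumption, not something from the original tutorial):
import lxml.html

# fetch and parse the category page, then pull every link inside the article body
doc = lxml.html.parse(
    "https://comedybangbang.fandom.com/wiki/Category:Episodes"
).getroot()
doc.make_links_absolute("https://comedybangbang.fandom.com")
episode_urls = doc.xpath("//div[contains(@class, 'mw-parser-output')]//a/@href")
print(len(episode_urls))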
But let's imagine that for some reason you don't want to; perhaps this is a one-off project and even a relatively expensive request is worth it.
SchemaScraper has a few options that will help; we'll change our scraper to use auto_split_length.
from scrapeghost import SchemaScraper, CSS
episode_list_scraper = SchemaScraper(
    "url",
    auto_split_length=2000,
    extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
)
response = episode_list_scraper(
    "https://comedybangbang.fandom.com/wiki/Category:Episodes"
)
episode_urls = response.data
print(episode_urls[:3])
print(episode_urls[-3:])
print("total:", len(episode_urls))
print(f"Total Cost: ${response.total_cost:.3f}")
We set the auto_split_length to 2000. This is the maximum number of tokens that will be passed to OpenAI in a single request.
Setting auto_split_length alters the prompt and response format so that instead of returning a single JSON object, the scraper returns a list of objects, where each should match your provided schema.
Because of this, we alter the schema to just be a single string because we're only interested in the URL.
It's a good idea to set this to about half the token limit, since the response counts against the token limit as well.
This winds up needing to make over twenty requests, but it can get there.
*relevant log lines shown for clarity*
2023-03-24 19:25:53 [debug ] got HTML length=1424892 url=https://comedybangbang.fandom.com/wiki/Category:Episodes
2023-03-24 19:25:53 [debug ] preprocessor from_nodes=1 name=CleanHTML nodes=1
2023-03-24 19:25:53 [debug ] preprocessor from_nodes=1 name=CSS(.mw-parser-output a[class!='image link-internal']) nodes=857
2023-03-24 19:25:53 [debug ] chunked tags num=20 sizes=[1971, 1994, 1986, 1976, 1978, 1990, 1993, 1974, 1995, 1983, 1975, 1979, 1967, 1953, 1971, 1973, 1987, 1960, 1966, 682]
2023-03-24 19:25:53 [info ] API request html_tokens=1971 model=gpt-3.5-turbo
2023-03-24 19:27:38 [info ] API response completion_tokens=2053 cost=0.008194 duration=104.66404676437378 finish_reason=length prompt_tokens=2044
2023-03-24 19:31:28 [warning ] retry model=gpt-4 wait=30
2023-03-24 19:36:34 [warning ] API request failed attempts=1 model=gpt-3.5-turbo
2023-03-24 19:36:34 [warning ] retry model=gpt-4 wait=30
2023-03-24 19:41:53 [warning ] API request failed attempts=1 model=gpt-3.5-turbo
2023-03-24 19:41:53 [warning ] retry model=gpt-4 wait=30
scrapeghost.errors.MaxCostExceeded: Total cost 1.04 exceeds max cost 1.00
As you can see, a couple of requests had to fall back to GPT-4, which raised the cost.
As a safeguard, the maximum cost for a single scrape is configured to $1 by default. If you want to change this, you can set the max_cost parameter.
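As a hedged sketch (the tutorial doesn't show exactly where max_cost goes; this assumes it is accepted by the SchemaScraper constructor and expressed in dollars):
from scrapeghost import SchemaScraper

# assumption: max_cost is a constructor argument giving the dollar ceiling per scrape
episode_list_scraper = SchemaScraper(
    "url",
    auto_split_length=2000,
    max_cost=2.00,  # allow up to $2 for a single scrape
)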
One option is to lower the auto_split_length a bit further. Many more requests means it takes even longer, but by sticking to GPT-3.5-Turbo it was possible to get a scrape to complete for $0.13.
But as promised, this is something that scrapeghost isn't currently very good at.
If you do want to see the pieces put together, jump down to the Putting it all Together section.
Next Steps¶
If you're planning to use this library, please keep in mind it is very much in flux and I can't commit to API stability yet.
If you are going to try to scrape using GPT, it'd probably be good to read the OpenAI API page to understand a little more about how the underlying API works.
To see what other features are currently available, check out the Usage guide.
You can also explore the command line interface to see how you can use this library without writing any Python.
Putting it all Together¶
import json
from scrapeghost import SchemaScraper, CSS
episode_list_scraper = SchemaScraper(
    '{"url": "url"}',
    auto_split_length=1500,
    # restrict this to GPT-3.5-Turbo to keep the cost down
    models=["gpt-3.5-turbo"],
    extra_preprocessors=[CSS(".mw-parser-output a[class!='image link-internal']")],
)
episode_scraper = SchemaScraper(
    {
        "title": "str",
        "episode_number": "int",
        "release_date": "YYYY-MM-DD",
        "guests": ["str"],
        "characters": ["str"],
    },
    extra_preprocessors=[CSS("div.page-content")],
)
resp = episode_list_scraper(
    "https://comedybangbang.fandom.com/wiki/Category:Episodes",
)
episode_urls = resp.data
print(f"Scraped {len(episode_urls)} episode URLs, cost {resp.total_cost}")
episode_data = []
for episode_url in episode_urls:
    print(episode_url)
    episode_data.append(
        episode_scraper(
            episode_url["url"],
        ).data
    )
# scrapers have a stats() method that returns a dict of statistics across all calls
print(f"Scraped {len(episode_data)} episodes, ${episode_scraper.stats()['total_cost']}")
with open("episode_data.json", "w") as f:
    json.dump(episode_data, f, indent=2)