OpenAI / GPT

This section assumes you are mostly unfamiliar with the OpenAI API and aims to provide a high-level overview of how it works in relation to this library.

API Keys

Getting an API Key

To use the OpenAI API you will need an API key. You can get one by creating an account and then creating an API key.

It's strongly recommended that you set a usage limit on your account to avoid accidentally running up a large bill.

https://platform.openai.com/account/billing/limits lets you set usage limits so there are no unpleasant surprises.

Using your key

Once an API key is created, you can set it as an environment variable:

$ export OPENAI_API_KEY=sk-...

You can also set the API Key directly in Python:

import openai

# Either point the library at a file containing your key...
openai.api_key_path = "~/.openai-key"
# ...or set the key directly in code (use one or the other, not both).
openai.api_key = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Be careful not to expose your key by committing it to a public repository.
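If you'd rather not hard-code the key at all, a common pattern is to read it from the environment at startup. This is a minimal sketch using the OPENAI_API_KEY variable from the export example above; note that the openai package will also pick this variable up automatically if it is set.

import os

import openai

# Read the key from the environment rather than embedding it in source code.
openai.api_key = os.environ["OPENAI_API_KEY"]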

Costs

Depending on usage, the OpenAI API can be quite expensive.

The cost of a call varies based on the model used and the number of input and output tokens.

The cost estimates provided by this library are based on the OpenAI pricing page and are not guaranteed to be accurate.

Supported Models (updated June 13, 2023 for v0.5.1)

Model                    Token Limit    Input ($ / 1K tokens)    Output ($ / 1K tokens)
gpt-3.5-turbo            4,096          0.0015                   0.002
gpt-3.5-turbo-16k        16,384         0.003                    0.004
gpt-4                    8,192          0.03                     0.06
gpt-4-32k                32,768         0.06                     0.12
gpt-3.5-turbo-0613       4,096          0.0015                   0.002
gpt-3.5-turbo-16k-0613   16,384         0.003                    0.004

Example: A 3,000 token page that returns 1,000 tokens of JSON will cost about $0.0065 with GPT-3.5-Turbo, but $0.15 with GPT-4.

(See OpenAI pricing page for latest info.)
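As a sanity check on the example above, the per-call cost can be computed directly from the table. This is a small sketch, not part of the library's API; the prices are the June 2023 per-1K-token figures and will drift as OpenAI updates its pricing.

# Per-1K-token prices (USD) from the table above; update as pricing changes.
PRICES = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
}

def estimate_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICES[model]
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price

print(estimate_cost("gpt-3.5-turbo", 3000, 1000))  # roughly 0.0065
print(estimate_cost("gpt-4", 3000, 1000))          # roughly 0.15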

Tokens

OpenAI encodes text using a tokenizer, which converts it into a sequence of integer tokens, each representing a word or piece of a word.

Billing is based on the number of tokens used. A token is approximately 3 characters, so 3,000 characters of HTML will correspond to roughly 1,000 tokens.

Warning

In practice, the above estimate turns out to be a bit low. Part of the issue is that HTML does not tokenize particularly efficiently: "Hello world!" is three tokens, but the same text wrapped in HTML tags can easily be nine.

You can experiment via https://platform.openai.com/tokenizer
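If you want an exact count rather than an estimate, OpenAI's tiktoken package exposes the same tokenizer the models use. A minimal sketch:

import tiktoken

# Look up the encoding used by a given chat model.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

html = "<div><h2>Joe</h2><span>Age: 42</span></div>"
tokens = encoding.encode(html)
print(len(tokens))  # number of tokens this HTML will consume as input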

Models are limited to a maximum number of tokens. For example, the default gpt-3.5-turbo model is limited to 4,096 tokens, and gpt-4's default limit is 8,192. Both models have larger-context versions available, but they are more expensive.

Various features in the library will help you avoid running into token limits, but it is still very common to exceed them in practice.

If your pages exceed these limits, you'll need to focus on improving your selectors so that only the required data is sent to the underlying models.
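One way to catch an oversized page before spending money on a doomed request is to compare its token count against the model's limit up front. This is only an illustrative sketch, not part of the library's API; the limits come from the table above, and some room has to be reserved for the prompt and the response.

import tiktoken

# Context window sizes from the table above (prompt and completion combined).
TOKEN_LIMITS = {"gpt-3.5-turbo": 4096, "gpt-4": 8192}

def fits_in_context(html, model="gpt-3.5-turbo", reserved_for_output=1000):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(html)) + reserved_for_output <= TOKEN_LIMITS[model]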

Prompts

The OpenAI API provides a chat-like interface, where there are three roles: system, user, and assistant. The system commands provide guidance to the assistant on how it should perform its tasks. The user provides a query to the assistant, which is then answered.

In practice, this results in a prompt that looks something like this:

System: For the given HTML, convert to a list of JSON objects matching this schema: {"name": "string", "age": "number"}

System: Be sure to provide valid JSON that is not truncated and contains no extra fields beyond those in the schema.

User: <html><div><h2>Joe</h2><span>Age: 42</span></div></html>

Assistant: {"name": "Joe", "age": 42}

It is possible to adjust the system commands the library sends, but the goal is to provide a simple default prompt that works well for most use cases.
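For reference, here is roughly how the prompt above maps onto a raw chat completion call with the openai package (the 0.27-era API this version of the library targets). The schema and HTML are copied from the example; the exact messages the library sends may differ, so treat this as a sketch of the underlying request rather than the library's implementation.

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": 'For the given HTML, convert to a list of JSON objects matching this schema: {"name": "string", "age": "number"}'},
        {"role": "system", "content": "Be sure to provide valid JSON that is not truncated and contains no extra fields beyond those in the schema."},
        {"role": "user", "content": "<html><div><h2>Joe</h2><span>Age: 42</span></div></html>"},
    ],
)
print(response["choices"][0]["message"]["content"])  # the assistant's JSON reply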