Anatomy of a Scrape

This diagram illustrates the control flow when Page.do_scrape is invoked programmatically or via spatula scrape:

Dependencies

When a scrape is initiated, the first thing spatula will do is examine any dependencies declared on the page being invoked.

Each dependency is resolved, which is in essence a full scrape of its own. The resulting data is saved to an internal cache so that a dependency shared by several pages is not scraped more than once.
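The caching behavior can be pictured with a small toy model. This is only a sketch of the idea, not spatula's actual implementation: ToyPage, resolve, scrape_counts, and the three page classes are hypothetical names, and the sketch assumes that each dependency's result is attached to the page as an attribute.

```python
# Toy model of dependency resolution with a shared cache.
# All names here are hypothetical; this does not import or mirror spatula itself.
scrape_counts = {}


class ToyPage:
    dependencies = {}  # assumed shape: attribute name -> page class to scrape first

    def scrape(self):
        raise NotImplementedError


class Roster(ToyPage):
    def scrape(self):
        # Count invocations so the effect of the cache is visible.
        scrape_counts["Roster"] = scrape_counts.get("Roster", 0) + 1
        return ["alice", "bob"]


class Votes(ToyPage):
    dependencies = {"roster": Roster}

    def scrape(self):
        return [f"{name}: yes" for name in self.roster]


class Committees(ToyPage):
    dependencies = {"roster": Roster}

    def scrape(self):
        return [f"{name}: finance" for name in self.roster]


def resolve(page_cls, cache):
    """Scrape page_cls after its dependencies, reusing cached dependency results."""
    page = page_cls()
    for attr, dep_cls in page_cls.dependencies.items():
        if dep_cls not in cache:
            # A dependency is resolved like any other page: a full scrape of its own.
            cache[dep_cls] = resolve(dep_cls, cache)
        setattr(page, attr, cache[dep_cls])
    return page.scrape()


cache = {}
votes = resolve(Votes, cache)
committees = resolve(Committees, cache)
```

Both Votes and Committees depend on Roster, but because they share one cache, Roster.scrape runs only once.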

See Specifying Dependencies for a practical application.

Ensuring a Source Exists

Once any dependencies are resolved, the page attempts to resolve a source attribute.

There are a number of places that a source might come from, in order of precedence:

  1. overridden using the --source parameter if using a CLI scrape
  2. passed in via class constructor (e.g. if this is a subpage)
  3. set as a class attribute (Page.source)
  4. retrieved via get_source_from_input

Note

If any of these methods returns a string, it is automatically converted to a URL.
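The order of precedence can be sketched as a simple "first match wins" lookup. This is an illustrative model only: resolve_source and ToyURL are hypothetical stand-ins, and the sketch assumes that a missing source is represented by None.

```python
from dataclasses import dataclass


@dataclass
class ToyURL:
    """Hypothetical stand-in for a URL source object."""
    url: str


def resolve_source(cli_source=None, constructor_source=None,
                   class_source=None, get_source_from_input=None):
    """Return the first available source, checked in order of precedence."""
    candidates = [
        lambda: cli_source,          # 1. --source on the command line
        lambda: constructor_source,  # 2. passed to the constructor
        lambda: class_source,        # 3. class attribute
        get_source_from_input or (lambda: None),  # 4. computed from the input
    ]
    for candidate in candidates:
        source = candidate()  # later candidates are only evaluated if needed
        if source is not None:
            # Plain strings are wrapped so that callers always get a source object.
            return ToyURL(source) if isinstance(source, str) else source
    raise ValueError("no source could be resolved")
```

For example, resolve_source(cli_source="https://example.com/cli", class_source="https://example.com/cls") returns the command-line value, since it is checked first.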

Processing the Response

Once a source is obtained, source.get_response is called, which typically means an HTTP request is performed.

If an exception is raised, it will be passed to process_error_response.

If all goes well, the response attribute of the page class is set and postprocess_response will be called. This is where classes like HtmlPage and CsvListPage process the response and do any additional parsing required.
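The two outcomes described above can be sketched roughly as follows. This is a toy model under stated assumptions, not spatula's code: ToySource, ToyPage, and fetch are hypothetical, the response is a plain string rather than an HTTP response, and the sketch assumes the scrape of this page simply stops once the error handler has run.

```python
class ToySource:
    """Hypothetical source that either returns a body or raises an error."""

    def __init__(self, body=None, error=None):
        self.body = body
        self.error = error

    def get_response(self):
        if self.error is not None:
            raise self.error
        return self.body


class ToyPage:
    def __init__(self, source):
        self.source = source
        self.response = None
        self.events = []

    def process_error_response(self, exception):
        # Receives whatever exception was raised while fetching.
        self.events.append(f"error: {exception}")

    def postprocess_response(self):
        # A format-specific page class would do its parsing here.
        self.lines = self.response.splitlines()
        self.events.append("postprocessed")


def fetch(page):
    """Fetch the page's source, routing failures to the error handler."""
    try:
        response = page.source.get_response()
    except Exception as exc:
        page.process_error_response(exc)
        return False
    page.response = response
    page.postprocess_response()
    return True


ok_page = ToyPage(ToySource(body="first\nsecond"))
bad_page = ToyPage(ToySource(error=RuntimeError("boom")))
ok_result = fetch(ok_page)
bad_result = fetch(bad_page)
```

Here ok_page ends up with its response set and parsed into lines, while bad_page never has a response and only records the error.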

User Code: Processing Page Contents

Once the response has been processed, spatula will call the page's process_page method.

In a standard use of spatula this is the first time that user-written code is run.

process_page can return or yield actual data, or additional pages to continue the scrape.

Handling Subpages

If subpages are returned, each of them will be handled in essentially the same cycle.
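A rough way to picture how returned data and returned subpages are treated differently is the recursive sketch below. It is a simplified model with hypothetical names (ToyPage, Detail, Listing, run), and it skips the source and response steps that each subpage would go through in the real cycle.

```python
import inspect


class ToyPage:
    """Hypothetical base class; only models the process_page step."""

    def __init__(self, source):
        self.source = source

    def process_page(self):
        raise NotImplementedError


class Detail(ToyPage):
    def process_page(self):
        # Returning a single piece of data.
        return {"id": self.source, "kind": "detail"}


class Listing(ToyPage):
    def process_page(self):
        # Yielding subpages, then a piece of data of its own.
        for item_id in (1, 2):
            yield Detail(item_id)
        yield {"id": 0, "kind": "summary"}


def run(page):
    """Yield data from a page, descending into any subpages it produces."""
    result = page.process_page()
    items = result if inspect.isgenerator(result) else [result]
    for item in items:
        if isinstance(item, ToyPage):
            # A subpage goes through the same handling as its parent.
            yield from run(item)
        else:
            yield item


results = list(run(Listing("start")))
```

In this sketch, results contains the two detail records followed by the summary record, in the order they were produced.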

Pagination

After process_page terminates, spatula checks if there is a result from get_next_source.

If so, a new instance of the page class is created with the new source set, and the process repeats starting from processing the response.
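The pagination loop can be sketched as below. This is a toy illustration rather than spatula's implementation: FAKE_SITE, ToyListPage, and scrape_all are hypothetical, the fetch step is replaced by a dictionary lookup, and the sketch assumes that get_next_source returns None when there are no more pages.

```python
# Fake "site": page number -> (items on that page, number of the next page or None).
FAKE_SITE = {
    1: (["a", "b"], 2),
    2: (["c"], 3),
    3: (["d"], None),
}


class ToyListPage:
    def __init__(self, source):
        self.source = source
        self.items, self.next_page = FAKE_SITE[source]  # stands in for fetching

    def process_page(self):
        yield from self.items

    def get_next_source(self):
        return self.next_page


def scrape_all(page_cls, first_source):
    """Run the page once per source until get_next_source returns None."""
    source = first_source
    while source is not None:
        page = page_cls(source)  # a fresh instance for every page of results
        yield from page.process_page()
        source = page.get_next_source()  # checked only after process_page finishes


all_items = list(scrape_all(ToyListPage, 1))
```

Each iteration builds a new ToyListPage, so no state from one page of results leaks into the next.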