Anatomy of a Scrape
This diagram illustrates the control flow when Page.do_scrape is invoked programmatically or via spatula scrape:
Dependencies
When a scrape is initiated, spatula first examines any dependencies declared on the page being invoked.
Each dependency is resolved, in essence as a full scrape of its own, and the resulting data is saved to an internal cache so that shared dependencies are not scraped more than once.
See Specifying Dependencies for a practical application.
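The caching behaviour can be pictured with a small toy model. This is only a sketch: it does not import spatula, and AwardsPage, EmployeePage, ManagerPage, do_scrape, scrape_log, and the way dependencies are declared here are all hypothetical stand-ins, not spatula's internals.

```python
scrape_log = []  # records every scrape that actually runs (hypothetical helper)


class AwardsPage:
    """Hypothetical shared dependency."""
    dependencies = {}

    def process_page(self):
        return {"A1": "Employee of the Year"}


class EmployeePage:
    # attribute name -> page class this page depends on (toy convention)
    dependencies = {"award_names": AwardsPage}

    def process_page(self):
        # the dependency's data is already attached when this runs
        return {"name": "Ann", "award": self.award_names["A1"]}


class ManagerPage:
    dependencies = {"award_names": AwardsPage}

    def process_page(self):
        return {"name": "Bob", "award": self.award_names["A1"]}


def do_scrape(page, cache):
    """Resolve dependencies first, then run the page's own code."""
    for attr_name, dep_class in page.dependencies.items():
        if dep_class not in cache:
            # a dependency is a full scrape of its own, run at most once
            cache[dep_class] = do_scrape(dep_class(), cache)
        setattr(page, attr_name, cache[dep_class])
    scrape_log.append(type(page).__name__)
    return page.process_page()


cache = {}
employee = do_scrape(EmployeePage(), cache)
manager = do_scrape(ManagerPage(), cache)
print(scrape_log)
```

Both pages share AwardsPage, yet the log shows it being scraped only once because the second lookup is served from the cache.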
Ensuring a Source Exists
Once any dependencies are resolved, the page attempts to resolve a source attribute.
There are a number of places that a source might come from, in order of precedence:
- overridden using the --source parameter if using a CLI scrape
- passed in via class constructor (e.g. if this is a subpage)
- set as a class attribute (Page.source)
- retrieved via get_source_from_input
Note
If any of these methods return a string, it will be automatically converted to a URL.
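The order of precedence can be sketched as a simple fall-through. The code below is a toy model that does not import spatula: resolve_source, its parameter names, and ToyURL are hypothetical, and only the name get_source_from_input comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class ToyURL:
    """Hypothetical stand-in for a URL source object."""
    url: str


def resolve_source(cli_source=None, constructor_source=None,
                   class_source=None, get_source_from_input=None):
    """Return the first available source, checked in order of precedence."""
    source = None
    for candidate in (cli_source, constructor_source, class_source):
        if candidate is not None:
            source = candidate
            break
    if source is None and get_source_from_input is not None:
        # last resort: ask the page to derive a source from its input
        source = get_source_from_input()
    if source is None:
        raise ValueError("no source could be resolved")
    if isinstance(source, str):
        # plain strings are wrapped in a URL object
        source = ToyURL(source)
    return source


# a constructor-supplied source wins over the class attribute
source = resolve_source(
    constructor_source="https://example.com/detail/1",
    class_source="https://example.com/list",
)
print(source)
```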
Processing the Response
Once a source is obtained, source.get_response is called, which typically means an HTTP request is performed.
If an exception is raised, it will be passed to process_error_response.
If all goes well, the response attribute of the page class is set and postprocess_response will be called. This is where classes like HtmlPage and CsvListPage process the response and do any additional parsing required.
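As a rough illustration of this step, the toy model below mirrors the method names from the text (get_response, process_error_response, postprocess_response) without importing spatula. ToySource, ToyPage, fetch, and the events list are hypothetical, and what happens after the error handler runs (here, the step simply stops) is an assumption of this sketch.

```python
class ToySource:
    """Hypothetical source that either returns a payload or raises."""

    def __init__(self, payload=None, error=None):
        self.payload = payload
        self.error = error

    def get_response(self):
        if self.error is not None:
            raise self.error
        return self.payload


class ToyPage:
    def __init__(self, source):
        self.source = source
        self.response = None
        self.parsed = None
        self.events = []  # records which hooks were called

    def process_error_response(self, exception):
        self.events.append(f"error: {exception}")

    def postprocess_response(self):
        # a real page type would parse HTML, CSV, etc. at this point
        self.parsed = self.response.upper()
        self.events.append("postprocessed")


def fetch(page):
    """Get the response, routing failures to the page's error hook."""
    try:
        response = page.source.get_response()
    except Exception as exc:
        page.process_error_response(exc)
        return False
    page.response = response
    page.postprocess_response()
    return True


ok_page = ToyPage(ToySource(payload="<html></html>"))
bad_page = ToyPage(ToySource(error=TimeoutError("timed out")))
print(fetch(ok_page), ok_page.events)
print(fetch(bad_page), bad_page.events)
```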
User Code: Processing Page Contents
Once the response has been processed, spatula will call the page's process_page method.
In a standard use of spatula this is the first time that user-written code is run.
process_page can return or yield actual data, or additional pages to continue the scrape.
Handling Subpages
If subpages are returned, each of them will be handled in essentially the same cycle.
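One way to picture this cycle is a recursive walk over whatever process_page produces. The sketch below is a toy model, not spatula's implementation: ListPage, DetailPage, and run are hypothetical, and treating anything with a process_page method as a subpage is a simplification made for the example.

```python
import inspect


class DetailPage:
    """Hypothetical subpage that returns a single item."""

    def __init__(self, name):
        self.name = name

    def process_page(self):
        return {"name": self.name}


class ListPage:
    """Hypothetical page that yields subpages instead of data."""

    def process_page(self):
        for name in ("ann", "bob"):
            yield DetailPage(name)


def run(page):
    """Yield data items, descending into any subpages along the way."""
    result = page.process_page()
    # process_page may return one value or yield several
    items = result if inspect.isgenerator(result) else [result]
    for item in items:
        if hasattr(item, "process_page"):
            # a subpage goes through the same cycle as its parent
            yield from run(item)
        else:
            yield item


results = list(run(ListPage()))
print(results)
```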
Pagination
After process_page terminates, spatula checks whether get_next_source returned a result.
If so, a new instance of the page class is created with the new source set, and the process repeats from the response-processing step.
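The pagination loop can be sketched as follows. Only the names process_page and get_next_source come from the text; ToyListPage, its FAKE_PAGES data, the use of integers as sources, and scrape_all are hypothetical, and no real requests are made.

```python
class ToyListPage:
    """Hypothetical paginated page; the source is just a page number."""

    FAKE_PAGES = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}

    def __init__(self, source):
        self.source = source

    def process_page(self):
        yield from self.FAKE_PAGES[self.source]

    def get_next_source(self):
        # return the next page number, or None when there are no more pages
        next_number = self.source + 1
        return next_number if next_number in self.FAKE_PAGES else None


def scrape_all(page):
    """Process a page, then keep following get_next_source until it is empty."""
    while page is not None:
        yield from page.process_page()
        next_source = page.get_next_source()
        if next_source is None:
            page = None
        else:
            # a fresh instance of the same class, with the new source set
            page = type(page)(next_source)


items = list(scrape_all(ToyListPage(1)))
print(items)
```

Creating a new instance for each page, rather than mutating the existing one, keeps per-page state such as the response from leaking between pages.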