Data Models

Why Use Data Models?

Back in Chaining Pages Together we saw that when chaining pages we can pass data through from the parent page.

class EmployeeDetail(HtmlPage):
    def process_page(self):
        status = CSS("#status").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return dict(
            status=status.text,
            hired=hired.text,
            # self.input is the data passed in from the prior scrape,
            # in this case a dict we can expand here
            **self.input,
        )

Dictionaries work well for this since we can decide what data to grab on each page and combine the results easily. But as scrapers evolve over time, it can be nice to have something a bit more self-documenting that also opens the door to validating the data we're collecting.

That's where we can introduce dataclasses, attrs, or pydantic models:

dataclasses:

from dataclasses import dataclass

@dataclass
class Employee:
    first: str
    last: str
    position: str
    status: str
    hired: str

attrs:

import attr

@attr.s(auto_attribs=True)
class Employee:
    first: str
    last: str
    position: str
    status: str
    hired: str

pydantic:

from pydantic import BaseModel

class Employee(BaseModel):
    first: str
    last: str
    position: str
    status: str
    hired: str

Not sure which one to pick?

dataclasses are built into Python and easy to start with. You'll notice the examples barely differ, so it is easy to switch between them later on.

If you want to add validation, pydantic is a great choice.
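
As a quick illustration of that validation (a minimal sketch using made-up values), pydantic will reject data that doesn't match the model:

from pydantic import ValidationError

try:
    # 'hired' is omitted, so pydantic raises a ValidationError
    Employee(
        first="John",
        last="Neptune",
        position="Engineer",
        status="Current",
    )
except ValidationError as e:
    print(e)  # reports that the 'hired' field is required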

And then we'll update EmployeeDetail.process_page to return an instance of our new Employee class:

    def process_page(self):
        status = CSS("#status").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return Employee(
            status=status.text,
            hired=hired.text,
            # self.input is the data passed in from the prior scrape,
            # in this case a dict we can expand here
            **self.input,
        )

Let's give this a run:

$ spatula test quickstart.EmployeeDetail --source "https://scrapple.fly.dev/staff/52"
INFO:quickstart.EmployeeDetail:fetching https://scrapple.fly.dev/staff/52
Traceback (most recent call last):
...
TypeError: __init__() missing 3 required positional arguments: 'first', 'last', and 'position'

Of course! We're expecting self.input to contain these values, but when running spatula test there's no way to know what data would have been passed in.

Defining input_type

spatula provides a way to make the dependency on self.input explicit and, as a bonus side effect, restore our ability to test.

Let's add a new data model that just includes the fields we're getting from the EmployeeList page:

dataclasses:

@dataclass
class PartialEmployee:
    first: str
    last: str
    position: str

attrs:

@attr.s(auto_attribs=True)
class PartialEmployee:
    first: str
    last: str
    position: str

pydantic:

class PartialEmployee(BaseModel):
    first: str
    last: str
    position: str

And then we'll update EmployeeDetail to set an input_type and stop assuming self.input is a dict:

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee

    def process_page(self):
        status = CSS("#status").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return Employee(
            first=self.input.first,
            last=self.input.last,
            position=self.input.position,
            status=status.text,
            hired=hired.text,
        )

And now when we re-run the test command:

$ spatula test quickstart.EmployeeDetail --source "https://scrapple.fly.dev/staff/52"
EmployeeDetail expects input (PartialEmployee):
  first: ~first
  last: ~last
  position: ~position
INFO:quickstart.EmployeeDetail:fetching https://scrapple.fly.dev/staff/52
Employee(first='~first', last='~last', position='~position', status='Current', hired='3/6/1963')

Test data has been used, so even though EmployeeList didn't pass data into EmployeeDetail, we can still see roughly what the data would look like if it had.

Using Inheritance

The above pattern is quite useful and common. Often part of the data comes from one page, and the rest from another (or perhaps even more pages).

A nice way to handle this without introducing a ton of redundancy is by setting up your models to inherit from one another:

dataclasses:

from dataclasses import dataclass

@dataclass
class PartialEmployee:
    first: str
    last: str
    position: str

@dataclass
class Employee(PartialEmployee):
    status: str
    hired: str

attrs:

import attr

@attr.s(auto_attribs=True)
class PartialEmployee:
    first: str
    last: str
    position: str

@attr.s(auto_attribs=True)
class Employee(PartialEmployee):
    status: str
    hired: str

pydantic:

from pydantic import BaseModel

class PartialEmployee(BaseModel):
    first: str
    last: str
    position: str

class Employee(PartialEmployee):
    status: str
    hired: str

Warning

Be sure to remember to decorate the derived class(es) if using dataclasses or attrs.
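
For example, here's a sketch of what goes wrong with dataclasses if the decorator is forgotten:

from dataclasses import dataclass

@dataclass
class PartialEmployee:
    first: str
    last: str
    position: str

# missing @dataclass here: status and hired are plain class-level
# annotations, so Employee's generated __init__ still only accepts
# the three inherited fields
class Employee(PartialEmployee):
    status: str
    hired: str

# TypeError: __init__() takes 4 positional arguments but 6 were given
Employee("John", "Neptune", "Engineer", "Current", "3/6/1963")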

Fixing EmployeeList

Don't forget to update EmployeeList so it now passes a PartialEmployee instance instead of a dict:

    def process_item(self, item):
        # this function is called for each <tr> we get from the selector
        # we know there are 4 <td>s
        first, last, position, details = item.getchildren()
        return EmployeeDetail(
            PartialEmployee(
                first=first.text,
                last=last.text,
                position=position.text,
            ),
            source=XPath("./a/@href").match_one(details),
        )

Overriding Default Values

Sometimes you may want to override default values (especially useful if the behavior of the second scrape varies based on data from the first).

via Command Line

The --data flag to spatula test allows overriding input values with key=value pairs.

$ spatula test quickstart.EmployeeDetail --source "https://scrapple.fly.dev/staff/52" --data first=John --data last=Neptune
EmployeeDetail expects input (PartialEmployee):
  first: John
  last: Neptune
  position: ~position
INFO:quickstart.EmployeeDetail:fetching https://scrapple.fly.dev/staff/52
Employee(first='John', last='Neptune', position='~position', status='Current', hired='3/6/1963')

Alternately, --interactive will prompt for input data.

via example_input

You can also provide the example_input attribute on the class in question. This value is assumed to be of type input_type.

For example:

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee
    example_input = PartialEmployee("John", "Neptune", "Engineer")
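
Note that pydantic models only accept keyword arguments, so if you chose pydantic above, the example input would instead look like:

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee
    example_input = PartialEmployee(
        first="John", last="Neptune", position="Engineer"
    )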

example_source

Like the above example_input, you can define example_source to set a default value for the --source parameter when invoking spatula test.

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee
    example_input = PartialEmployee("John", "Neptune", "Engineer")
    example_source = "https://scrapple.fly.dev/staff/52"

Warning

Be sure not to confuse source with example_source. The former is used whenever the class is invoked without a source parameter, while example_source is only used when running spatula test.
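
As a reminder, a class-level source looks like this (a sketch based on the EmployeeList page from earlier in the tutorial):

class EmployeeList(HtmlListPage):
    # fetched whenever the class is instantiated without an explicit
    # source parameter, whether via spatula test or spatula scrape
    source = "https://scrapple.fly.dev/staff"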

get_source_from_input

It is not uncommon to want to capture a URL as part of the data and then use that URL as the next source.

Let's go ahead and modify PartialEmployee to collect a URL:

dataclasses:

@dataclass
class PartialEmployee:
    first: str
    last: str
    position: str
    url: str

attrs:

@attr.s(auto_attribs=True)
class PartialEmployee:
    first: str
    last: str
    position: str
    url: str

pydantic:

class PartialEmployee(BaseModel):
    first: str
    last: str
    position: str
    url: str

And then we'll modify EmployeeList.process_item to capture this URL, and stop providing a redundant source:

    def process_item(self, item):
        # this function is called for each <tr> we get from the selector
        # we know there are 4 <tds>
        first, last, position, details = item.getchildren()
        return EmployeeDetail(
            PartialEmployee(
                first=first.text,
                last=last.text,
                position=position.text,
                url=XPath("./a/@href").match_one(details),
            ),
        )

And finally, add a get_source_from_input method to EmployeeDetail (as well as updating the other uses of Employee to include the url field):

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee
    example_input = PartialEmployee(
        "John",
        "Neptune",
        "Engineer",
        "https://scrapple.fly.dev/staff/1",
    )

    def get_source_from_input(self):
        return self.input.url

    def process_page(self):
        status = CSS("#status").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return Employee(
            first=self.input.first,
            last=self.input.last,
            position=self.input.position,
            url=self.input.url,
            status=status.text,
            hired=hired.text,
        )

Of course, if you have a more complex situation you can do whatever you like in get_source_from_input.
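
For instance, here's a hypothetical sketch where the input only carries an ID (employee_id is not a field in our models, just an illustration) and the URL is built from it:

class EmployeeDetail(HtmlPage):
    input_type = PartialEmployee

    def get_source_from_input(self):
        # build the detail URL from an ID stored on the input
        return f"https://scrapple.fly.dev/staff/{self.input.employee_id}"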

Data Models As Output

When running spatula scrape, data is written to disk as JSON. The exact method of obtaining that JSON varies a bit depending on what type of output you have:

Raw dict: Output will match exactly.

dataclasses: dataclasses.asdict will be used.

attrs: attr.asdict will be used to obtain a serializable representation.

pydantic: the model's dict() method will be used.
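
For instance, with the dataclass version of Employee, the JSON written to disk would be equivalent to this sketch (not spatula's exact internals):

import json
from dataclasses import asdict

employee = Employee(
    first="John",
    last="Neptune",
    position="Engineer",
    status="Current",
    hired="3/6/1963",
)
print(json.dumps(asdict(employee)))
# {"first": "John", "last": "Neptune", "position": "Engineer",
#  "status": "Current", "hired": "3/6/1963"}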

By default the filename will be a UUID, but if you wish to provide your own filename you can add a get_filename method to your model.

Warning

When providing get_filename, be sure that your filenames are still unique (you may wish to incorporate a UUID if you don't have a key you're sure is unique). spatula does not check for this, so you may overwrite data if your get_filename function does not guarantee uniqueness.
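
For example, here's a sketch of a get_filename that combines a readable slug with a UUID for uniqueness (whether the extension belongs in the return value is an assumption here):

import uuid
from dataclasses import dataclass

@dataclass
class Employee(PartialEmployee):
    status: str
    hired: str

    def get_filename(self):
        # readable prefix plus a UUID so collisions are impossible
        return f"{self.last.lower()}-{self.first.lower()}-{uuid.uuid4()}.json"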