Understanding spaCy's data model

After reading this page, you should understand the design considerations behind spaCy's data model, and how the core classes (Doc, Token, Span and Lexeme) are implemented in Cython.

Design considerations

No job too big

When writing spaCy, one of my mottos was "no job too big". I wanted to make sure that if Google or Facebook were founded tomorrow, spaCy would be the obvious choice for them. I wanted spaCy to be the obvious choice for web-scale NLP. This meant sweating over performance, because for web-scale tasks, Moore's law can't save you.

Most computational work gets less expensive over time. If you wrote a program to solve fluid dynamics in 2008, and you ran it again in 2014, you would expect it to be cheaper. For NLP, it often doesn't work out that way. The problem is that we're writing programs where the task is something like "Process all articles in the English Wikipedia". Sure, compute prices on AWS dropped from $0.80 per hour to $0.20 per hour between 2008 and 2014. But the size of Wikipedia grew from 3GB to 11GB. Scaling the 2008 cost by both factors gives (0.20 / 0.80) × (11 / 3) ≈ 0.92: the job is a little cheaper in 2014, but not by much.

Multiple layers of annotation

When I tell a certain sort of person that I'm a computational linguist, this comic is often the first thing that comes to their mind:

[xkcd comic omitted] © xkcd

I've thought a lot about what this comic is really trying to say. It's probably not making a point about data models, but in that respect at least, it rings true.

You'll often need to model a document as a sequence of sentences. Other times you'll need to model it as a sequence of words. Sometimes you'll care about paragraphs, other times you won't. Sometimes you'll care about extracting quotes, which can cross paragraph boundaries. A quote can also occur within a sentence. When we consider sentence structure, things get even more complicated and contradictory. We have syntactic trees, sequences of entities, sequences of phrases, sub-word units, multi-word units...

Different applications are going to need to query different, overlapping, and often contradictory views of the document. They're often going to need to query them jointly. You need to be able to get the syntactic head of a named entity, or the sentiment of a paragraph.

Solutions

Fat types, thin tokens

Static model, dynamic views

As noted above, different applications need to query different, overlapping, and often contradictory views of the document. For this reason, I think it's a bad idea to bake too much of the document structure into the data model. If you structure the data according to the needs of one layer of annotation, you'll have to copy the data and transform it to use a different layer. You'll soon have lots of copies, and no single source of truth.

Never go full stand-off

Implementation

Cython 101
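The snippets below lean on a few Cython constructs: C structs, typed pointers, and cdef classes. As a quick primer, here is a minimal sketch of each (illustrative only; the names are made up, not spaCy code):

cdef struct ExampleC:         # a plain C struct: fixed-size fields, no Python object overhead
    int value

cdef class Example:           # a cdef class: a Python-accessible object with C-level fields
    cdef ExampleC* c          # cdef fields are C-typed and invisible from Python...
    cdef readonly int length  # ...unless marked readonly, which exposes a read-only attribute

A cdef class can safely hold raw pointers, because its C-level fields can only be touched from Cython code.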

cdef class Doc

Let's start at the top. Here's the memory layout of the Doc class, minus irrelevant details:

from cymem.cymem cimport Pool
from ..vocab cimport Vocab
from ..structs cimport TokenC

cdef class Doc:
    cdef Pool mem            # memory pool that owns the allocations below
    cdef Vocab vocab         # shared vocabulary: lexemes and strings

    cdef TokenC* c           # contiguous array of tokens: the actual document content

    cdef int length          # number of tokens currently in the document
    cdef int max_length      # allocated capacity of the array

So, our Doc class is a wrapper around a TokenC* array — that's where the actual document content is stored. Here's the TokenC struct, in its entirety:

cdef struct TokenC

cdef struct TokenC:
    const LexemeC* lex   # shared vocabulary entry for this word type
    uint64_t morph       # morphological analysis
    univ_pos_t pos       # coarse-grained part-of-speech
    bint spacy           # is the token followed by a single space?
    int tag              # fine-grained part-of-speech tag
    int idx              # character offset of the token in the document
    int lemma            # lemma string ID
    int sense            # word sense ID
    int head             # offset to the token's syntactic head
    int dep              # dependency label ID
    bint sent_start      # does a sentence start at this token?

    uint32_t l_kids      # number of left children in the parse
    uint32_t r_kids      # number of right children in the parse
    uint32_t l_edge      # leftmost descendant of the token's subtree
    uint32_t r_edge      # rightmost descendant of the token's subtree

    int ent_iob          # IOB code of the named entity annotation
    int ent_type         # entity type ID # TODO: Is there a better way to do this? Multiple sources of truth..
    hash_t ent_id        # entity ID

The token owns all of its linguistic annotations, and holds a const pointer to a LexemeC struct. The LexemeC struct owns all of the vocabulary data about the word — all the dictionary definition stuff that we want to be shared by all instances of the type. Here's the LexemeC struct, in its entirety:

cdef struct LexemeC

cdef struct LexemeC:

    int32_t id

    int32_t orth     # Allows the string to be retrieved
    int32_t length   # Length of the string

    uint64_t flags   # These are the most useful parts.
    int32_t cluster  # Distributional similarity cluster
    float prob       # Probability
    float sentiment  # Slot for sentiment

    int32_t lang

    int32_t lower    # These string views made sense
    int32_t norm     # when NLP meant linear models.
    int32_t shape    # Now they're less relevant, and
    int32_t prefix   # will probably be revised.
    int32_t suffix

    float* vector # <-- This was a design mistake, and will change.
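The flags field is worth a note: it packs up to 64 boolean lexical features into a single integer, one per bit. A sketch of how such a flag would be read (a hypothetical helper, not spaCy's exact code):

cdef bint check_flag(const LexemeC* lex, int flag_id):
    # Each boolean feature occupies one bit of the 64-bit flags field
    return (lex.flags >> flag_id) & 1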

Dynamic views

Text

You might have noticed that in all of the structs above, there's not a string to be found. The strings are all stored separately, in the StringStore class. The lexemes don't know the strings; they only know their integer IDs. The document string is never stored anywhere, either. Instead, it's reconstructed by iterating over the tokens, which look up the orth attribute of their underlying lexeme. Once we have the orth ID, we can fetch the string from the vocabulary. Finally, each token knows whether a single whitespace character (' ') should separate it from the subsequent token. This is how we preserve whitespace.

cdef print_text(Vocab vocab, const TokenC* tokens, int length):
    cdef int i
    text = ''
    for i in range(length):
        # Look up each token's string via its lexeme's orth ID
        text += vocab.strings[tokens[i].lex.orth]
        if tokens[i].spacy:
            text += ' '  # the trailing-whitespace flag lives on the token
    print(text)

This is why you get whitespace tokens in spaCy — we need those tokens, so that we can reconstruct the document string. I also think you should have those tokens anyway. Most NLP libraries strip them, making it very difficult to recover the paragraph information once you're at the token level. You'll never have that sort of problem with spaCy — because there's a single source of truth.
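You can see the same round trip from the Python API: joining each token's text with its trailing whitespace reproduces the input exactly. (A sketch; it assumes a loaded pipeline object named nlp.)

text = u'The   cat  sat.'
doc = nlp(text)  # hypothetical loaded pipeline
# Each token carries its trailing whitespace, so the document string round-trips
assert u''.join(token.text_with_ws for token in doc) == text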

cdef class Token

When you do...

doc[i]

...you get back an instance of class spacy.tokens.token.Token. This instance owns no data. Instead, it holds the pair (doc, i), and retrieves everything it needs from the parent container.
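A sketch of that pattern (the exact declarations here are an assumption; the read-through delegation is as described above):

cdef class Token:
    cdef readonly Doc doc   # the parent container that owns the TokenC array
    cdef readonly int i     # this token's index into doc.c

    property orth:
        def __get__(self):
            # Nothing is stored here: attributes read through to the parent's array
            return self.doc.c[self.i].lex.orth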

cdef class Span

When you do...

doc[i : j]

...you get back an instance of class spacy.tokens.span.Span. Span instances are also returned by the .sents, .ents and .noun_chunks iterators of the Doc object. A Span is a slice of tokens, with an optional label attached. Its data model is:

cdef class Span:
    cdef readonly Doc doc   # the parent container
    cdef int start          # index of the span's first token
    cdef int end            # index one past the span's last token
    cdef int start_char     # character offset of the span's start
    cdef int end_char       # character offset one past the span's end
    cdef int label          # optional label ID, e.g. an entity type

Once again, the Span owns almost no data. Instead, it refers back to the parent Doc container.

The start and end attributes refer to token positions, while start_char and end_char record the character positions of the span. By recording the character offsets, we can still use the Span object if the tokenization of the document changes.
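For instance, once the character offsets are recorded, a span's token boundaries can be recomputed against a new tokenization. A sketch using only fields defined above (idx on TokenC, length on LexemeC); the helper name is hypothetical:

cdef int token_at_char(const TokenC* tokens, int length, int char_idx):
    # Return the index of the token whose character span covers char_idx
    cdef int i
    for i in range(length):
        if tokens[i].idx <= char_idx < tokens[i].idx + tokens[i].lex.length:
            return i
    return -1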

cdef class Lexeme

When you do...

vocab[u'the']

...you get back an instance of class spacy.lexeme.Lexeme. The Lexeme's data model is:

cdef class Lexeme:
    cdef LexemeC* c            # pointer to the shared vocabulary entry
    cdef readonly Vocab vocab  # the vocabulary that owns that entry
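Like Token and Span, the Lexeme is a thin view: its Python attributes read through to the shared LexemeC entry. A usage sketch (attribute names follow the LexemeC struct above):

the = vocab[u'the']
print(the.orth)   # integer ID from which the string 'the' can be retrieved
print(the.prob)   # the word's probability estimate
# Every occurrence of 'the', in every document, shares this one entry
assert vocab[u'the'].orth == the.orth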