Architecture

The central data structures in spaCy are the Doc and the Vocab. The Doc object owns the sequence of tokens and all their annotations. The Vocab object owns a set of look-up tables that make common information available across documents. By centralising strings, word vectors and lexical attributes, we avoid storing multiple copies of this data. This saves memory, and ensures there's a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
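
As a quick illustration of these relationships, here's a short sketch using spaCy's public API (the model name "en_core_web_sm" is just the usual small English package, assumed to be installed):

import spacy

nlp = spacy.load("en_core_web_sm")   # Language object: vocab + tokenizer + pipeline
doc = nlp("Apple is looking at buying a U.K. startup.")

token = doc[0]        # Token: a view into the Doc
span = doc[0:2]       # Span: a slice of the Doc

# The Doc owns the annotations; Token and Span just point back into it.
assert token.doc is doc and span.doc is doc

# The Vocab and StringStore are shared, so each string is stored only once.
assert doc.vocab is nlp.vocab
assert nlp.vocab.strings[token.orth] == token.text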

[Architecture diagram: the Language object (nlp) owns the Vocab (nlp.vocab), its StringStore (nlp.vocab.strings) and Morphology (nlp.vocab.morphology), and has language-specific subclasses (en, de, fr, es, pt, it, nl, sv, fi, nb, hu, he, bn, ja, zh). The Tokenizer makes the Doc via nlp.make_doc(), and the pipeline components (Tagger, DependencyParser, EntityRecognizer, Matcher, Lemmatizer) modify it in place; Token, Span and Lexeme all point back into the shared Doc and Vocab.]

Container objects

Name | Description
Doc | A container for accessing linguistic annotations.
Span | A slice from a Doc object.
Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc.
Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.
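
For example, a Lexeme carries only context-independent attributes, while the corresponding Token adds the contextual annotations. A short sketch with spaCy's public API:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like coffee")

lexeme = nlp.vocab["coffee"]     # Lexeme: word type, no context
token = doc[2]                   # Token: word token in this Doc

print(lexeme.orth_, lexeme.is_alpha, lexeme.lower_)   # lexical attributes only
print(token.text, token.pos_, token.dep_)             # plus contextual annotations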

Processing pipeline

Name | Description
Language | A text-processing pipeline. Usually you'll load this once per process as nlp and pass the instance around your application.
Pipe | Base class for processing pipeline components.
Tagger | Annotate part-of-speech tags on Doc objects.
DependencyParser | Annotate syntactic dependencies on Doc objects.
EntityRecognizer | Annotate named entities, e.g. persons or products, on Doc objects.
TextCategorizer | Assign categories or labels to Doc objects.
Tokenizer | Segment text, and create Doc objects with the discovered segment boundaries.
Lemmatizer | Determine the base forms of words.
Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag.
Matcher | Match sequences of tokens based on pattern rules, similar to regular expressions.
PhraseMatcher | Match sequences of tokens based on phrases.
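
For instance, the Matcher lets you match token sequences by attribute patterns. Here's a small sketch assuming the spaCy v2-era Matcher.add signature, i.e. add(key, callback, *patterns); newer versions take the patterns as a list instead:

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()                      # blank English pipeline: tokenizer + vocab
matcher = Matcher(nlp.vocab)

# Match "hello" followed by "world", case-insensitively.
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)

doc = nlp("Hello world! I said hello world.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)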

Other classes

Name | Description
Vocab | A lookup table for the vocabulary that allows you to access Lexeme objects.
StringStore | Map strings to and from hash values.
Vectors | Container class for vector data keyed by string.
GoldParse | Collection for training annotations.
GoldCorpus | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER.
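
For example, the StringStore maps between strings and their 64-bit hash values in both directions (spaCy's public API):

from spacy.strings import StringStore

stringstore = StringStore(["apple", "orange"])
apple_hash = stringstore["apple"]           # string -> hash (a 64-bit integer)
assert stringstore[apple_hash] == "apple"   # hash -> string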

Neural network model architecture

The parsing model is a blend of recent results. Its two most direct inspirations are the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar-Ilan University [1], and the SyntaxNet team at Google. The foundation of the parser is still based on the work of Joakim Nivre [2], who introduced the transition-based framework [3], the arc-eager transition system, and the imitation learning objective. The model is implemented using Thinc, spaCy's machine learning library. We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
    >> Maxout(token_width)
    >> convolution ** 4
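
To make that expression concrete, here is a rough numpy sketch of the same shape of computation — four embedding tables concatenated per token, mixed down to token_width by a Maxout layer, then four window "convolutions". Everything here (names, sizes, random weights, the ReLU inside the convolution) is illustrative, not spaCy's or Thinc's actual code:

import numpy as np

rng = np.random.default_rng(0)
token_width, n_rows, n_pieces = 128, 5000, 3

# One embedding table per lexical attribute; rows are selected by hashing.
tables = {attr: rng.normal(scale=0.1, size=(n_rows, token_width))
          for attr in ("lower", "prefix", "suffix", "shape")}

def embed(attr, value):
    return tables[attr][hash((attr, value)) % n_rows]

def maxout(x, W, b):
    # W: (n_pieces, out_dim, in_dim). Take the max over the candidate pieces.
    return np.max(W @ x + b, axis=0)

def convolution(X, W, b):
    # Concatenate each token with its immediate neighbours (window of 1 on
    # each side), then mix back down to token_width.
    pad = np.zeros((1, X.shape[1]))
    padded = np.vstack([pad, X, pad])
    windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])
    return np.maximum(0, windows @ W + b)

words = ["Apple", "is", "looking", "at", "buying", "a", "startup"]
features = [(w.lower(), w[:1], w[-3:], "Xxxx" if w[0].isupper() else "xxxx")
            for w in words]

# (embed_lower | embed_prefix | embed_suffix | embed_shape)
X = np.stack([np.concatenate([embed(a, v) for a, v in
                              zip(("lower", "prefix", "suffix", "shape"), f)])
              for f in features])

# >> Maxout(token_width)
W_max = rng.normal(scale=0.1, size=(n_pieces, token_width, 4 * token_width))
b_max = np.zeros((n_pieces, token_width))
X = np.stack([maxout(x, W_max, b_max) for x in X])

# >> convolution ** 4 (the real model uses four separately trained layers;
# we reuse one weight matrix here purely for brevity)
W_conv = rng.normal(scale=0.1, size=(3 * token_width, token_width))
b_conv = np.zeros(token_width)
for _ in range(4):
    X = convolution(X, W_conv, b_conv)

print(X.shape)   # (7, 128): one context-sensitive vector per token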

This convolutional layer is shared between the tagger, parser and NER, and will also be shared by the future neural lemmatizer. Because the parser shares these layers with the tagger, the parser does not require tag features. I got this trick from David Weiss's "Stack-propagation" paper [4].

To boost the representation, the tagger actually predicts a "super tag" with POS, morphology and dependency label [5]. The tagger predicts these supertags by adding a softmax layer onto the convolutional layer – so, we're teaching the convolutional layer to give us a representation that's one affine transform away from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too). The parser model makes a state vector by concatenating the vector representations of its context tokens. The current context tokens:

S0, S1, S2 | Top three words on the stack.
B0, B1 | First two words of the buffer.
S0L1, S1L1, S2L1, B0L1, B1L1; S0L2, S1L2, S2L2, B0L2, B1L2 | Leftmost and second leftmost children of S0, S1, S2, B0 and B1.
S0R1, S1R1, S2R1, B0R1, B1R1; S0R2, S1R2, S2R2, B0R2, B1R2 | Rightmost and second rightmost children of S0, S1, S2, B0 and B1.

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU-friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition [4]). A naive implementation would require 2*N matrix multiplications of shape (B, 13*T) @ (13*T, H) for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication to pre-compute the hidden weights for each positional feature with respect to the words in the batch. (Note that our token vectors come from the CNN — so we can't play this trick over the vocabulary. That's how Stanford's NN parser [3] works — and why its model is so big.)
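
For concreteness, here is a small numpy sketch of that pre-computation (not spaCy's code; all dimensions are made up). It checks that gathering and summing the pre-computed per-feature contributions gives the same hidden-layer input as the naive concatenate-then-multiply version:

import numpy as np

rng = np.random.default_rng(0)
N, T, H, F = 20, 128, 64, 13           # tokens, token width, hidden width, features

tokvecs = rng.normal(size=(N, T))      # CNN output: one vector per word
W = rng.normal(size=(F, T, H))         # hidden weights, one (T, H) block per feature

# Pre-compute once per batch: for every word and every positional feature,
# that word's contribution to the hidden layer. Shape (N, F, H).
precomputed = np.einsum('nt,fth->nfh', tokvecs, W)

# A parser state is just 13 token indices (stack/buffer/children positions).
state = rng.integers(0, N, size=F)

# Cheap per-state work: gather and sum the pre-computed contributions.
hidden_fast = precomputed[state, np.arange(F)].sum(axis=0)

# Naive version: concatenate the 13 token vectors and multiply by (13*T, H).
state_vector = np.concatenate([tokvecs[i] for i in state])   # (13*T,)
hidden_naive = state_vector @ W.reshape(F * T, H)            # (H,)

assert np.allclose(hidden_fast, hidden_naive)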

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is multilabel log loss [6], as the dynamic oracle allows multiple states to be 0 cost. This is defined as follows, where Z is the sum of exp(score) over all classes and gZ is the sum of exp(score) over the gold (zero-cost) classes:

(exp(score) / Z) - (exp(score) / gZ)
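
Read as a gradient, this is the derivative of -log(gZ / Z), the negative log-probability assigned to the set of zero-cost actions: for gold classes it is exp(score)/Z - exp(score)/gZ, and for all other classes the second term is zero. A toy numpy illustration (values made up, not spaCy's code):

import numpy as np

scores = np.array([2.0, 0.5, -1.0, 1.5])       # one score per transition
is_gold = np.array([1, 0, 0, 1], dtype=float)  # zero-cost transitions

exp_scores = np.exp(scores)
Z = exp_scores.sum()
gZ = (exp_scores * is_gold).sum()

loss = -np.log(gZ / Z)
d_scores = (exp_scores / Z) - is_gold * (exp_scores / gZ)

print(loss, d_scores)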

Cython conventions

spaCy's core data structures are implemented as Cython cdef classes. Memory is managed through the cymem.Pool class, which allows you to allocate memory that will be freed when the Pool object is garbage collected. This means you usually don't have to worry about freeing memory. You just have to decide which Python object owns the memory, and make it own the Pool. When that object goes out of scope, the memory will be freed. You do have to take care that no pointers outlive the object that owns them — but this is generally quite easy.
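
Here is a minimal Cython sketch of that ownership pattern. The Counts class is hypothetical; Pool and Pool.alloc are the real cymem API:

from cymem.cymem cimport Pool

cdef class Counts:
    cdef Pool mem
    cdef int* counts
    cdef int length

    def __init__(self, int length):
        # Counts owns the Pool; the Pool owns the allocation. When the Counts
        # object is garbage collected, so is the Pool — and the memory.
        self.mem = Pool()
        self.length = length
        self.counts = <int*>self.mem.alloc(length, sizeof(int))

    def increment(self, int i):
        self.counts[i] += 1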

All Cython modules should have the # cython: infer_types=True compiler directive at the top of the file. This makes the code much cleaner, as it avoids the need for many type declarations. If possible, you should prefer to declare your functions nogil, even if you don't especially care about multi-threading. The reason is that nogil functions help the Cython compiler reason about your code quite a lot — you're telling the compiler that no Python dynamics are possible. This lets the compiler raise many more errors for you, and ensures your function will run at C speed.
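
A minimal sketch of what that looks like in practice — the directive on the first line of the module and a small nogil helper (the function itself is made up for illustration):

# cython: infer_types=True

cdef int clipped_add(int a, int b, int max_value) nogil:
    # `total` needs no declaration: infer_types=True lets the compiler type it.
    total = a + b
    if total > max_value:
        return max_value
    return total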

Cython gives you many choices of sequences: you could have a Python list, a numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred, because they are fastest, have the most explicit semantics, and let the compiler check your code more strictly. C++ vectors are also great — but you should only use them internally in functions. It's less friendly to accept a vector as an argument, because that asks the user to do much more work. Here's how to get a pointer from a numpy array, memory view or vector:

from libcpp.vector cimport vector
cimport numpy as np

cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
    pointer1 = <int*>numpy_array.data
    pointer2 = cpp_vector.data()
    pointer3 = &memory_view[0]

Both C arrays and C++ vectors reassure the compiler that no Python operations are possible on your variable. This is a big advantage: it lets the Cython compiler raise many more errors for you.

When getting a pointer from a numpy array or memoryview, take care that the data is actually stored in C-contiguous order — otherwise you'll get a pointer to nonsense. The type-declarations in the code above should generate runtime errors if buffers with incorrect memory layouts are passed in. To iterate over the array, the following style is preferred:

cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total

If this is confusing, consider that the compiler couldn't deal with for item in int_array: — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of item in the code above — the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.

Your functions cannot be declared nogil if they need to create Python objects or call Python functions. This is perfectly okay — you shouldn't torture your code just to get nogil functions. However, if your function isn't nogil, you should compile your module with cython -a --cplus my_module.pyx and open the resulting my_module.html file in a browser. This will let you see how Cython is compiling your code. Calls into the Python run-time will be in bright yellow. This lets you easily see whether Cython is able to correctly type your code, or whether there are unexpected problems.

Working in Cython is very rewarding once you're over the initial learning curve. As with C and C++, the first way you write something in Cython will often be the performance-optimal approach. In contrast, Python optimisation generally requires a lot of experimentation. Is it faster to have an if item in my_dict check, or to use .get()? What about try/except? Does this numpy operation create a copy? There's no way to guess the answers to these questions, and you'll usually be dissatisfied with your results — so there's no way to know when to stop this process. In the worst case, you'll make a mess that invites the next reader to try their luck too. This is like one of those volcanic gas-traps, where the rescuers keep passing out from low oxygen, causing another rescuer to follow — only to succumb themselves. In short, just say no to optimizing your Python. If it's not fast enough the first time, just switch to Cython.