spaCy 101: Everything you need to know
Whether you're new to spaCy, or just want to brush up on some NLP basics and implementation details – this page should have you covered. Each section will explain one of spaCy's features in simple terms and with examples or illustrations. Some sections will also reappear across the usage guides as a quick introduction.
What's spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
What spaCy isn't
spaCy is not a platform or "an API". Unlike a platform, spaCy does not provide software as a service, or a web application. It's an open-source library designed to help you build NLP applications, not a consumable service.
spaCy is not an out-of-the-box chat bot engine. While spaCy can be used to power conversational applications, it's not designed specifically for chat bots, and only provides the underlying text processing capabilities.
spaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.
spaCy is not a company. It's an open-source library. Our company publishing spaCy and other software is called Explosion AI.
Features
In the documentation, you'll come across mentions of spaCy's features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.
Name | Description |
---|---|
Tokenization | Segmenting text into words, punctuation marks etc. |
Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun. |
Dependency Parsing | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
Lemmatization | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". |
Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences. |
Named Entity Recognition (NER) | Labelling named "real-world" objects, like persons, companies or locations. |
Entity Linking (EL) | Disambiguating textual entities to unique identifiers in a Knowledge Base. |
Similarity | Comparing words, text spans and documents and how similar they are to each other. |
Text Classification | Assigning categories or labels to a whole document, or parts of a document. |
Rule-based Matching | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
Training | Updating and improving a statistical modelâs predictions. |
Serialization | Saving objects to files or byte strings. |
Statistical models
While some of spaCy's features work independently, others require statistical models to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy and the data they include. The model you choose always depends on your use case and the texts you're working with. For a general-purpose use case, the small, default models are always a good start. They typically include the following components:
- Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
- Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- Data files like lemmatization rules and lookup tables.
- Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- Configuration options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.
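As a quick sketch of what that looks like in practice – assuming the small English model en_core_web_sm is installed – you can inspect a loaded model's meta data and pipeline:
import spacy

nlp = spacy.load("en_core_web_sm")
# The meta data describes the model: language, name, pipeline components, version etc.
print(nlp.meta["lang"], nlp.meta["name"])
# The pipeline components loaded from the model's configuration
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']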
Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.
Once you've downloaded and installed a model, you can load it
via spacy.load()
. This will return a Language
object containing all components and data needed to process text. We usually
call it nlp
. Calling the nlp
object on a string of text will return a
processed Doc
:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
print(token.text, token.pos_, token.dep_)
Even though a Doc
is processed – e.g. split into individual words and
annotated – it still holds all information of the original text, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you'll never lose any information when processing
text with spaCy.
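For example, here's a minimal sketch – reusing the sentence from above – that prints each token's character offset and rebuilds the original string from the tokens and their trailing whitespace:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    # Offset of the token in the original string, plus the token with its trailing whitespace
    print(token.idx, token.text_with_ws)
# Joining the tokens and their trailing whitespace restores the original text exactly
assert "".join(token.text_with_ws for token in doc) == doc.text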
Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into
words, punctuation and so on. This is done by applying rules specific to each
language. For example, punctuation at the end of a sentence should be split off
– whereas "U.K." should remain one token. Each Doc
consists of individual
tokens, and we can iterate over them:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
print(token.text)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |
First, the raw text is split on whitespace characters, similar to
text.split(' ')
. Then, the tokenizer processes the text from left to right. On
each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.
If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
While punctuation rules are usually pretty general, tokenizer exceptions
strongly depend on the specifics of the individual language. This is why each
available language has its own subclass like
English
or German
, that loads in lists of hard-coded data and exception
rules.
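As a rough sketch, you can also add your own exception rule at runtime via the tokenizer's add_special_case method – the "gimme" rule below is purely illustrative, not something that ships with spaCy:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("gimme that")])  # tokenization with the default rules
# Add a special case that splits "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']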
Part-of-speech tags and dependencies Needs model
After tokenization, spaCy can parse and tag a given Doc
. This is where
the statistical model comes in, which enables spaCy to make a prediction of
which tag or label most likely applies in this context. A model consists of
binary data and is produced by showing a system enough examples for it to make
predictions that generalize across the language – for example, a word following
"the" in English is most likely a noun.
Linguistic annotations are available as
Token
attributes. Like many NLP libraries, spaCy
encodes all strings to hash values to reduce memory usage and improve
efficiency. So to get the readable string representation of an attribute, we
need to add an underscore _
to its name:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
---|---|---|---|---|---|---|---|
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False |
is | be | AUX | VBZ | aux | xx | True | True |
looking | look | VERB | VBG | ROOT | xxxx | True | False |
at | at | ADP | IN | prep | xx | True | True |
buying | buy | VERB | VBG | pcomp | xxxx | True | False |
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False |
startup | startup | NOUN | NN | dobj | xxxx | True | False |
for | for | ADP | IN | prep | xxx | True | True |
$ | $ | SYM | $ | quantmod | $ | False | False |
1 | 1 | NUM | CD | compound | d | False | False |
billion | billion | NUM | CD | pobj | xxxx | True | False |
Using spaCy's built-in displaCy visualizer, here's what our example sentence and its dependencies look like:
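To generate the visualization yourself, a minimal sketch looks like this (displacy.serve starts a local web server on port 5000; in a Jupyter notebook you'd typically call displacy.render instead):
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Serve the dependency visualization at http://localhost:5000
displacy.serve(doc, style="dep")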
Named Entities Needs model
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents
property of a Doc
:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
Text | Start | End | Label | Description |
---|---|---|---|---|
Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
$1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
Using spaCy's built-in displaCy visualizer, here's what our example sentence and its named entities look like:
Word vectors and similarity Needs model
Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
banana.vector
array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02, -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02, -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01, -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01, 1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01, 3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01, 1.25210002e-01, -6.75960004e-01, 3.58420014e-01, # ... and so on ... 3.66849989e-01, 2.52470002e-03, -6.40089989e-01, -2.97650009e-01, 7.89430022e-01, 3.31680000e-01, -1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)
Models that come with built-in word vectors make them available as the
Token.vector
attribute. Doc.vector
and Span.vector
will default to an average of their token
vectors. You can also check if a token has a vector assigned, and get the L2
norm, which can be used to normalize vectors.
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")
for token in tokens:
print(token.text, token.has_vector, token.vector_norm, token.is_oov)
The words "dog", "cat" and "banana" are all pretty common in English, so they're
part of the model's vocabulary, and come with a vector. The word "afskfsd" on
the other hand is a lot less common and out-of-vocabulary – so its vector
representation consists of 300 dimensions of 0
, which means it's practically
nonexistent. If your application will benefit from a large vocabulary with
more vectors, you should consider using one of the larger models or loading in a
full vector package, for example,
en_vectors_web_lg
, which includes
over 1 million unique vectors.
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.
Each Doc
, Span
and Token
comes with a
.similarity()
method that lets you compare it with
another object, and determine the similarity. Of course similarity is always
subjective – whether "dog" and "cat" are similar really depends on how you're
looking at it. spaCy's similarity model usually assumes a pretty general-purpose
definition of similarity.
import spacy
nlp = spacy.load("en_core_web_md") # make sure to use larger model!
tokens = nlp("dog cat banana")
for token1 in tokens:
for token2 in tokens:
print(token1.text, token2.text, token1.similarity(token2))
In this case, the model's predictions are pretty on point. A dog is very similar
to a cat, whereas a banana is not very similar to either of them. Identical
tokens are obviously 100% similar to each other (just not always exactly 1.0
,
because of vector math and floating point imprecisions).
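The same method also works on whole documents and spans – here's a small sketch, again assuming the medium English model with word vectors:
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
# Compare two documents (their vectors default to an average of the token vectors)
print(doc1.similarity(doc2))
french_fries = doc1[2:4]   # the span "salty fries"
burgers = doc1[5]          # the token "hamburgers"
# Spans and tokens can be compared with each other, too
print(french_fries.similarity(burgers))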
Pipelines
When you call nlp
on a text, spaCy first tokenizes the text to produce a Doc
object. The Doc
is then processed in several different steps – this is also
referred to as the processing pipeline. The pipeline used by the
default models consists of a tagger, a parser and an entity
recognizer. Each pipeline component returns the processed Doc
, which is then
passed on to the next component.
Name | Component | Creates | Description |
---|---|---|---|
tokenizer | Tokenizer | Doc | Segment text into tokens. |
tagger | Tagger | Doc[i].tag | Assign part-of-speech tags. |
parser | DependencyParser | Doc[i].head , Doc[i].dep , Doc.sents , Doc.noun_chunks | Assign dependency labels. |
ner | EntityRecognizer | Doc.ents , Doc[i].ent_iob , Doc[i].ent_type | Detect and label named entities. |
textcat | TextCategorizer | Doc.cats | Assign document labels. |
⊠| custom components | Doc._.xxx , Token._.xxx , Span._.xxx | Assign custom attributes, methods or properties. |
The processing pipeline always depends on the statistical model and its capabilities. For example, a pipeline can only include an entity recognizer component if the model includes data to make predictions of entity labels. This is why each model will specify the pipeline to use in its meta data, as a simple list containing the component names:
"pipeline": ["tagger", "parser", "ner"]
In spaCy v2.x, the statistical components like the tagger or parser are independent and don't share any data between themselves. For example, the named entity recognizer doesn't use any features set by the tagger and parser, and so on. This means that you can swap them, or remove single components from the pipeline without affecting the others.
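For example, here's a sketch of loading a model without the parser and then dropping another component at runtime:
import spacy

# Load the model, but leave out the parser
nlp = spacy.load("en_core_web_sm", disable=["parser"])
print(nlp.pipe_names)   # e.g. ['tagger', 'ner']
# Components can also be removed after loading
nlp.remove_pipe("ner")
print(nlp.pipe_names)   # e.g. ['tagger']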
However, custom components may depend on annotations set by other components.
For example, a custom lemmatizer may need the part-of-speech tags assigned, so
it'll only work if it's added after the tagger. The parser will respect
pre-defined sentence boundaries, so if a previous component in the pipeline sets
them, its dependency predictions may be different. Similarly, it matters if you
add the EntityRuler
before or after the statistical entity
recognizer: if it's added before, the entity recognizer will take the existing
entities into account when making predictions.
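A short sketch of that ordering decision, using an EntityRuler with a single illustrative pattern:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "ORG", "pattern": "Explosion AI"}])
# Placed before "ner", the ruler's entities are set first, and the statistical
# entity recognizer will take them into account when making its predictions
nlp.add_pipe(ruler, before="ner")
doc = nlp("Explosion AI is a software company based in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])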
The EntityLinker
, which resolves named entities to
knowledge base IDs, should be preceded by
a pipeline component that recognizes entities such as the
EntityRecognizer
.
The tokenizer is a "special" component and isn't part of the regular pipeline.
It also doesn't show up in nlp.pipe_names
. The reason is that there can only
really be one tokenizer, and while all other pipeline components take a Doc
and return it, the tokenizer takes a string of text and turns it into a
Doc
. You can still customize the tokenizer, though. nlp.tokenizer
is
writable, so you can either create your own
Tokenizer
class from scratch,
or even replace it with an
entirely custom function.
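For instance, here's a minimal sketch of swapping in a custom tokenizer class that simply splits on spaces – a real tokenizer would of course need to handle punctuation and exceptions as well:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # In this naive scheme, every token is assumed to be followed by a space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])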
Vocab, hashes and lexemes
Whenever possible, spaCy tries to store data in a vocabulary, the
Vocab
, that will be shared by multiple documents. To save
memory, spaCy also encodes all strings to hash values – in this case for
example, "coffee" has the hash 3197928453018144401
. Entity labels like "ORG"
and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only
"speaks" in hash values.
If you process lots of documents containing the word "coffee" in all kinds of
different contexts, storing the exact string "coffee" every time would take up
way too much space. So instead, spaCy hashes the string and stores it in the
StringStore
. You can think of the StringStore
as a
lookup table that works in both directions – you can look up a string to get
its hash, or a hash to get its string:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"]) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee'
Now that all strings are encoded, the entries in the vocabulary don't need to
include the word text themselves. Instead, they can look it up in the
StringStore
via its hash value. Each entry in the vocabulary, also called
Lexeme
, contains the context-independent information about
a word. For example, no matter if "love" is used as a verb or a noun in some
context, its spelling and whether it consists of alphabetic characters won't
ever change. Its hash value will also always be the same.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
for word in doc:
lexeme = doc.vocab[word.text]
print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
Text | Orth | Shape | Prefix | Suffix | is_alpha | is_digit |
---|---|---|---|---|---|---|
I | 4690420944186131903 | X | I | I | True | False |
love | 3702023516439754181 | xxxx | l | ove | True | False |
coffee | 3197928453018144401 | xxxx | c | fee | True | False |
The mapping of words to hashes doesn't depend on any state. To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for "coffee" will always be the same, no matter which model you're using or how you've configured spaCy.
However, hashes cannot be reversed and there's no way to resolve
3197928453018144401
back to "coffee". All spaCy can do is look it up in the
vocabulary. That's why you always need to make sure all objects you create have
access to the same vocabulary. If they don't, spaCy might not be able to find
the strings it needs.
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee") # Original Doc
print(doc.vocab.strings["coffee"]) # 3197928453018144401
print(doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
empty_doc = Doc(Vocab()) # New Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(
empty_doc.vocab.strings.add("coffee") # Add "coffee" and generate hash
print(empty_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
new_doc = Doc(doc.vocab) # Create new doc with first doc's vocab
print(new_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
If the vocabulary doesn't contain a string for 3197928453018144401
, spaCy will
raise an error. You can re-add "coffee" manually, but this only works if you
actually know that the document contains that word. To prevent this problem,
spaCy will also export the Vocab
when you save a Doc
or nlp
object. This
will give you the object and its encoded annotations, plus the "key" to decode
it.
Knowledge Base
To support the entity linking task, spaCy stores external knowledge in a
KnowledgeBase
. The knowledge base (KB) uses the Vocab
to store
its data efficiently.
A knowledge base is created by first adding all entities to it. Next, for each potential mention or alias, a list of relevant KB IDs and their prior probabilities is added. The sum of these prior probabilities should never exceed 1 for any given alias.
import spacy
from spacy.kb import KnowledgeBase
# load the model and create an empty KB
nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])
# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
print()
print("Number of entities in KB:",kb.get_size_entities()) # 3
print("Number of aliases in KB:", kb.get_size_aliases()) # 2
Candidate generation
Given a textual entity, the Knowledge Base can provide a list of plausible
candidates or entity identifiers. The EntityLinker
will
take this list of candidates as input, and disambiguate the mention to the most
probable identifier, given the document context.
import spacy
from spacy.kb import KnowledgeBase
nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])
# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
candidates = kb.get_candidates("Douglas")
for c in candidates:
print(" ", c.entity_, c.prior_prob, c.entity_vector)
Serialization
If youâve been modifying the pipeline, vocabulary, vectors and entities, or made
updates to the model, you'll eventually want to save your progress – for
example, everything that's in your nlp
object. This means you'll have to
translate its contents and structure into a format that can be saved, like a
file or a byte string. This process is called serialization. spaCy comes with
built-in serialization methods and supports the
Pickle protocol.
All container classes, i.e. Language
(nlp
),
Doc
, Vocab
and StringStore
have the following methods available:
Method | Returns | Example |
---|---|---|
to_bytes | bytes | data = nlp.to_bytes() |
from_bytes | object | nlp.from_bytes(data) |
to_disk | - | nlp.to_disk("/path") |
from_disk | object | nlp.from_disk("/path") |
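A small sketch of a round trip with these methods – the path and names are arbitrary:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Serialize a Doc to a byte string and restore it, reusing the same vocab
doc_bytes = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(doc_bytes)
assert restored.text == doc.text

# Save the whole pipeline to disk and load it back
nlp.to_disk("/tmp/example_model")
nlp2 = spacy.load("/tmp/example_model")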
Training
spaCy's models are statistical and every "decision" they make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction. This prediction is based on the examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.
The model is then shown the unlabelled text and will make a prediction. Because we know the correct answer, we can give the model feedback on its prediction in the form of an error gradient of the loss function, which calculates the difference between the prediction and the expected output. The greater the difference, the more significant the gradient and the updates to our model.
When training a model, we don't just want it to memorize our examples – we want it to come up with a theory that can be generalized across other examples. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts like this, is most likely a company. That's why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.
This also means that in order to know how the model is performing, and whether it's learning the right things, you don't only need training data – you'll also need evaluation data. If you only test the model with the data it was trained on, you'll have no idea how well it's generalizing. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation. To update an existing model, you can already achieve decent results with very few examples – as long as they're representative.
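To give you an idea of what such examples look like, here's a sketch of NER training data in the format spaCy's v2.x training API expects – raw text paired with the character offsets and labels of the entities we want the model to predict:
# Each example is a (text, annotations) pair; offsets are character positions
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]
A complete loop that updates a model with data in this format appears in the lightning tour below.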
Language data
Every language is different – and usually full of exceptions and special
cases, especially amongst the most common words. Some of these exceptions are
shared across languages, while others are entirely specific – usually so
specific that they need to be hard-coded. The
lang
module
contains all language-specific data, organized in simple Python files. This
makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
emoticons, single-letter abbreviations and norms for equivalent tokens with
different spellings, like " and ”
. This helps the models make more accurate
predictions. The individual language data in a submodule contains rules that
are only relevant to a particular language. It also takes care of putting
together all components and creating the Language
subclass â for example,
English
or German
.
Name | Description |
---|---|
Stop words (stop_words.py) | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop. |
Tokenizer exceptions (tokenizer_exceptions.py) | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
Norm exceptions (norm_exceptions.py) | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. |
Punctuation rules (punctuation.py) | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
Character classes (char_classes.py) | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
Lexical attributes (lex_attrs.py) | Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred". |
Syntax iterators (syntax_iterators.py) | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. |
Tag map (tag_map.py) | Dictionary mapping strings in your tag set to Universal Dependencies tags. |
Morph rules (morph_rules.py) | Exception rules for morphological analysis of irregular words like personal pronouns. |
Lemmatizer (spacy-lookups-data) | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
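As a quick sketch of what working with this data looks like (the module paths below are those of spaCy v2.x):
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

print("and" in STOP_WORDS)    # True – "and" is an English stop word
nlp = English()               # just the language data and tokenizer, no statistical model
doc = nlp("We can't wait!")
print([t.text for t in doc])  # tokenizer exceptions split "can't" into "ca" and "n't"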
Lightning tour
The following examples and code snippets give you an overview of spaCy's functionality and its usage.
Install models and process text
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world. Here are two sentences.")
print([t.text for t in doc])
nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de("Ich bin ein Berliner.")
print([t.text for t in doc_de])
Get tokens, noun chunks & sentences Needs model
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Peach emoji is where it has always been. Peach is the superior "
"emoji. It's outranking eggplant đ ")
print(doc[0].text) # 'Peach'
print(doc[1].text) # 'emoji'
print(doc[-1].text) # '🍑'
print(doc[17:19].text) # 'outranking eggplant'
noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text) # 'Peach emoji'
sentences = list(doc.sents)
assert len(sentences) == 3
print(sentences[1].text) # 'Peach is the superior emoji.'
Get part-of-speech tags and flags Needs model
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
apple = doc[0]
print("Fine-grained POS tag", apple.pos_, apple.pos)
print("Coarse-grained POS tag", apple.tag_, apple.tag)
print("Word shape", apple.shape_, apple.shape)
print("Alphabetic characters?", apple.is_alpha)
print("Punctuation mark?", apple.is_punct)
billion = doc[10]
print("Digit?", billion.is_digit)
print("Like a number?", billion.like_num)
print("Like an email address?", billion.like_email)
Use hash values for any string
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
coffee_hash = nlp.vocab.strings["coffee"] # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash] # 'coffee'
print(coffee_hash, coffee_text)
print(doc[2].orth, coffee_hash) # 3197928453018144401
print(doc[2].text, coffee_text) # 'coffee'
beer_hash = doc.vocab.strings.add("beer") # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash] # 'beer'
print(beer_hash, beer_text)
unicorn_hash = doc.vocab.strings.add("🦄") # 18234233413267120783
unicorn_text = doc.vocab.strings[unicorn_hash] # '🦄'
print(unicorn_hash, unicorn_text)
Recognize and update named entities Needs model
import spacy
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
doc = nlp("FB is hiring a new VP of global policy")
doc.ents = [Span(doc, 0, 1, label="ORG")]
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
Train and update neural network models
import spacy
import random
nlp = spacy.load("en_core_web_sm")
train_data = [("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]})]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
optimizer = nlp.begin_training()
for i in range(10):
random.shuffle(train_data)
for text, annotations in train_data:
nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
Visualize a dependency parse and named entities in your browser v2.0 Needs model
from spacy import displacy
doc_dep = nlp("This is a sentence.")
displacy.serve(doc_dep, style="dep")
doc_ent = nlp("When Sebastian Thrun started working on self-driving cars at Google "
"in 2007, few people outside of the company took him seriously.")
displacy.serve(doc_ent, style="ent")
Get word vectors and similarity Needs model
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Apple and banana are similar. Pasta and hippo aren't.")
apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]
print("apple <-> banana", apple.similarity(banana))
print("pasta <-> hippo", pasta.similarity(hippo))
print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)
For the best results, you should run this example using the
en_vectors_web_lg
model (currently
not available in the live demo).
Simple and efficient serialization
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
nlp = spacy.load("en_core_web_sm")
customer_feedback = open("customer_feedback_627.txt").read()
doc = nlp(customer_feedback)
doc.to_disk("/tmp/customer_feedback_627.bin")
new_doc = Doc(Vocab()).from_disk("/tmp/customer_feedback_627.bin")
Match text with token rules
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
def set_sentiment(matcher, doc, i, matches):
doc.sentiment += 0.1
pattern1 = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
pattern2 = [[{"ORTH": emoji, "OP": "+"}] for emoji in ["đ", "đ", "đ€Ł", "đ"]]
matcher.add("GoogleIO", None, pattern1) # Match "Google I/O" or "Google i/o"
matcher.add("HAPPY", set_sentiment, *pattern2) # Match one or more happy emoji
doc = nlp("A text about Google I/O đđ")
matches = matcher(doc)
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id]
span = doc[start:end]
print(string_id, span.text)
print("Sentiment", doc.sentiment)
Minibatched stream processing
texts = ["One document.", "...", "Lots of documents"]
# .pipe streams input, and produces streaming output
iter_texts = (texts[i % 3] for i in range(100000000))
for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50)):
assert doc.is_parsed
if i == 100:
break
Get syntactic dependencies Needs model
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google "
"in 2007, few people outside of the company took him seriously.")
dep_labels = []
for token in doc:
while token.head != token:
dep_labels.append(token.dep_)
token = token.head
print(dep_labels)
Export to numpy arrays
import spacy
from spacy.attrs import ORTH, LIKE_URL
nlp = spacy.load("en_core_web_sm")
doc = nlp("Check out https://spacy.io")
for token in doc:
print(token.text, token.orth, token.like_url)
attr_ids = [ORTH, LIKE_URL]
doc_array = doc.to_array(attr_ids)
print(doc_array.shape)
print(len(doc), len(attr_ids))
assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]
assert list(doc_array[:, 1]) == [t.like_url for t in doc]
print(list(doc_array[:, 1]))
Calculate inline markup on original string
import spacy
def put_spans_around_tokens(doc):
"""Here, we're building a custom "syntax highlighter" for
part-of-speech tags and dependencies. We put each token in a
span element, with the appropriate classes computed. All whitespace is
preserved, outside of the spans. (Of course, HTML will only display
multiple whitespace if enabled – but the point is, no information is lost
and you can calculate what you need, e.g. <br />, <p> etc.)
"""
output = []
html = '<span class="{classes}">{word}</span>{space}'
for token in doc:
if token.is_space:
output.append(token.text)
else:
classes = "pos-{} dep-{}".format(token.pos_, token.dep_)
output.append(html.format(classes=classes, word=token.text, space=token.whitespace_))
string = "".join(output)
string = string.replace("\n", "")
string = string.replace("\t", " ")
return "<pre>{}</pre>".format(string)
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a test.\n\nHello world.")
html = put_spans_around_tokens(doc)
print(html)
Architecture
The central data structures in spaCy are the Doc
and the Vocab
. The Doc
object owns the sequence of tokens and all their annotations. The Vocab
object owns a set of look-up tables that make common information available
across documents. By centralizing strings, word vectors and lexical attributes,
we avoid storing multiple copies of this data. This saves memory, and ensures
there's a single source of truth.
Text annotations are also designed to allow a single source of truth: the Doc
object owns the data, and Span
and Token
are views that point into it.
The Doc
object is constructed by the Tokenizer
, and then modified in
place by the components of the pipeline. The Language
object coordinates
these components. It takes raw text and sends it through the pipeline, returning
an annotated document. It also orchestrates training and serialization.
Container objects
Name | Description |
---|---|
Doc | A container for accessing linguistic annotations. |
Span | A slice from a Doc object. |
Token | An individual token – i.e. a word, punctuation symbol, whitespace, etc. |
Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
Processing pipeline
Name | Description |
---|---|
Language | A text-processing pipeline. Usually you'll load this once per process as nlp and pass the instance around your application. |
Tokenizer | Segment text, and create Doc objects with the discovered segment boundaries. |
Lemmatizer | Determine the base forms of words. |
Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. |
Tagger | Annotate part-of-speech tags on Doc objects. |
DependencyParser | Annotate syntactic dependencies on Doc objects. |
EntityRecognizer | Annotate named entities, e.g. persons or products, on Doc objects. |
TextCategorizer | Assign categories or labels to Doc objects. |
Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
PhraseMatcher | Match sequences of tokens based on phrases. |
EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
Sentencizer | Implement custom sentence boundary detection logic that doesnât require the dependency parse. |
Other functions | Automatically apply something to the Doc , e.g. to merge spans of tokens. |
Other classes
Name | Description |
---|---|
Vocab | A lookup table for the vocabulary that allows you to access Lexeme objects. |
StringStore | Map strings to and from hash values. |
Vectors | Container class for vector data keyed by string. |
GoldParse | Collection for training annotations. |
GoldCorpus | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. |
Community & FAQ
We're very happy to see the spaCy community grow and include a mix of people from all kinds of different backgrounds – computational linguistics, data science, deep learning, research and more. If you'd like to get involved, below are some answers to the most important questions and resources for further reading.
Help, my code isn't working!
Bugs suck, and we're doing our best to continuously improve the tests and fix bugs as soon as possible. Before you submit an issue, do a quick search and check if the problem has already been reported. If you're having installation or loading problems, make sure to also check out the troubleshooting guide. Help with spaCy is available via the following platforms:
- Stack Overflow: Usage questions and everything related to problems with your specific code. The Stack Overflow community is much larger than ours, so if your problem can be solved by others, you'll receive help much quicker.
- GitHub discussions: General discussion, project ideas and usage questions. Meet other community members to get help with a specific code implementation, discuss ideas for new projects/plugins, support more languages, and share best practices.
- GitHub issue tracker: Bug reports and improvement suggestions, i.e. everything that's likely spaCy's fault. This also includes problems with the models beyond statistical imprecisions, like patterns that point to a bug.
How can I contribute to spaCy?
You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
help wanted (easy)
label
on GitHub, which we use to tag bugs and feature requests that are easy and
self-contained. We also appreciate contributions to the docs – whether it's
fixing a typo, improving an example or adding additional explanations. You'll
find a "Suggest edits" link at the bottom of each page that points you to the
source.
Another way of getting involved is to help us improve the language data – especially if you happen to speak one of the languages currently in alpha support. Even adding simple tokenizer exceptions, stop words or lemmatizer data can make a big difference. It will also make it easier for us to provide a statistical model for the language in the future. Submitting a test that documents a bug or performance issue, or covers functionality that's especially important for your application is also very helpful. This way, you'll also make sure we never accidentally introduce regressions to the parts of the library that you care about the most.
For more details on the types of contributions we're looking for, the code conventions and other useful tips, make sure to check out the contributing guidelines.
I've built something cool with spaCy – how can I get the word out?
First, congrats – we'd love to check it out! When you share your project on Twitter, don't forget to tag @spacy_io so we don't miss it. If you think your project would be a good fit for the spaCy Universe, feel free to submit it! Tutorials are also incredibly valuable to other users and a great way to get exposure. So we strongly encourage writing up your experiences, or sharing your code and some tips and tricks on your blog. Since our website is open-source, you can add your project or tutorial by making a pull request on GitHub.
If you would like to use the spaCy logo on your site, please get in touch and ask us first. However, if you want to show support and tell others that your project is using spaCy, you can grab one of our spaCy badges.