spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.
The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as an integer ID or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token. See the API reference for more details.
You can access token entity annotations using the token.ent_iob and token.ent_type attributes. The token.ent_iob attribute indicates whether an entity starts, continues or ends on the tag (the IOB scheme: Inside, Outside, Begin).
doc = nlp(u'London is a big city in the United Kingdom.')
print(doc[0].text, doc[0].ent_iob, doc[0].ent_type_)
# (u'London', 3, u'GPE')
print(doc[1].text, doc[1].ent_iob, doc[1].ent_type_)
# (u'is', 2, u'')
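These per-token annotations are what the doc.ents property is built from. As a rough pure-Python illustration (a sketch, not spaCy's actual implementation), grouping IOB-coded tokens into entity spans works like this, using spaCy's integer codes (1 = I, 2 = O, 3 = B):

```python
def group_entities(tokens):
    """Group per-token (iob, ent_type) annotations into entity spans,
    roughly as doc.ents does. IOB codes: 1 = I, 2 = O, 3 = B.
    Returns (start, end, ent_type) token-offset triples."""
    spans = []
    start, label = None, None
    for i, (iob, ent_type) in enumerate(tokens):
        if iob == 3:                    # B: close any open span, start a new one
            if start is not None:
                spans.append((start, i, label))
            start, label = i, ent_type
        elif iob != 1:                  # O (or unset): close any open span
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:               # entity runs to the end of the doc
        spans.append((start, len(tokens), label))
    return spans

# Tokens of 'London is a big city in the United Kingdom .':
annotations = [(3, 'GPE'), (2, ''), (2, ''), (2, ''), (2, ''),
               (2, ''), (3, 'GPE'), (1, 'GPE'), (1, 'GPE'), (2, '')]
print(group_entities(annotations))
# [(0, 1, 'GPE'), (6, 9, 'GPE')]
```

Note how a B token both closes the previous entity (if any) and opens a new one, which is what keeps adjacent entities distinct.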
Setting entity annotations
To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level — you can't write directly to the token.ent_iob or token.ent_type attributes. The easiest way to set entities is to assign to the doc.ents attribute.
from spacy.tokens import Span
doc = nlp(u'London is a big city in the United Kingdom.')
doc.ents = []
assert doc[0].ent_type_ == ''
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
assert doc[0].ent_type_ == 'GPE'
doc.ents = []
doc.ents = [(u'LondonCity', doc.vocab.strings['GPE'], 0, 1)]
The value you assign should be a sequence, the values of which can either be Span objects or (ent_id, ent_type, start, end) tuples, where start and end are token offsets that describe the slice of the document that should be annotated.
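To make the tuple form concrete, here is a small pure-Python sketch (illustrative only, not spaCy's internals) that expands such tuples into the per-token IOB and type codes the document stores, using spaCy's integer coding (1 = I, 2 = O, 3 = B):

```python
def expand_entities(n_tokens, entities):
    """Expand (ent_id, ent_type, start, end) tuples into per-token
    (iob, ent_type) pairs. IOB codes: 1 = I (inside), 2 = O (outside),
    3 = B (begin). Illustrative sketch only."""
    iob = [2] * n_tokens          # default: every token is outside an entity
    types = [''] * n_tokens
    for ent_id, ent_type, start, end in entities:
        iob[start] = 3            # first token begins the entity
        for i in range(start + 1, end):
            iob[i] = 1            # remaining tokens are inside it
        for i in range(start, end):
            types[i] = ent_type
    return list(zip(iob, types))

# 'London is a big city in the United Kingdom .' has 10 tokens;
# annotate 'London' (token 0) and 'the United Kingdom' (tokens 6-8).
tags = expand_entities(10, [('E1', 'GPE', 0, 1), ('E2', 'GPE', 6, 9)])
print(tags[0], tags[1], tags[7])
# (3, 'GPE') (2, '') (1, 'GPE')
```

The entity IDs ('E1', 'E2') are hypothetical placeholders; only the type, start and end are used here.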
You can also assign entity annotations using the
doc.from_array() method. To do this, you should include both the
ENT_TYPE and the
ENT_IOB attributes in the array you're importing from.
from spacy.attrs import ENT_IOB, ENT_TYPE
import numpy
doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype='uint64')
attr_array[0, 0] = 3 # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'
Finally, you can always write to the underlying struct, if you compile a Cython function. This is easy to do, and allows you to write efficient native code.
# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3  # B: first token begins the entity
    for i in range(start+1, end):
        doc.c[i].ent_iob = 1  # I: remaining tokens are inside it
Obviously, if you write directly to the array of TokenC* structs, you'll have responsibility for ensuring that the data is left in a consistent state.
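What counts as a consistent state? One key invariant is that an inside (I) token must directly follow a B or another I. The helper below (a hypothetical pure-Python check, for illustration only) validates that invariant over a sequence of integer IOB codes:

```python
def iob_is_consistent(iob_codes):
    """Check the IOB invariant: an I (1) token must directly follow a
    B (3) or another I (1); it can never follow an O (2) or start the
    document. Codes: 1 = I, 2 = O, 3 = B."""
    prev = 2  # treat "before the document" as outside any entity
    for code in iob_codes:
        if code == 1 and prev not in (1, 3):
            return False
        prev = code
    return True

# [B, I, O, I] is inconsistent: the final I follows an O.
print(iob_is_consistent([3, 1, 2, 1]))
# False
```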
The displaCy ENT visualizer
The displaCy ENT visualizer lets you explore an entity recognition model's behaviour interactively. If you're training a model, it's very useful to run the visualization server yourself. To help you do that, we've open-sourced both the back-end service and the front-end client.
Built-in entity types
|Type|Description|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FACILITY|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LANGUAGE|Any named language.|
The following values are also annotated in a style similar to names:
|Type|Description|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including "%".|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|"first", "second", etc.|
|CARDINAL|Numerals that do not fall under another type.|
Training and updating
To provide training examples to the entity recogniser, you'll first need to create an instance of the
GoldParse class. You can specify your annotations in a stand-off format or as token tags.
import spacy
import random
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

train_data = [
    ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()
If a character offset in your entity annotations doesn't fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recogniser is allowed to learn from examples that may feature tokenizer errors.
You can also provide token-level entity annotation, using the following tagging scheme to describe the entity boundaries:
|Tag|Description|
|B|The first token of a multi-token entity.|
|I|An inner token of a multi-token entity.|
|L|The final token of a multi-token entity.|
|U|A single-token entity.|
|O|A non-entity token.|
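Putting the last two points together, here is a minimal pure-Python sketch (not spaCy's implementation) of converting character-offset annotations into BILOU tags, where an annotation that doesn't line up with token boundaries is treated as a missing value ('-'):

```python
def offsets_to_biluo(token_spans, entities):
    """Convert character-offset annotations into one BILOU tag per token.
    token_spans: list of (start, end) character offsets, one per token.
    entities: list of (start, end, label) character-offset annotations.
    Misaligned annotations mark their overlapping tokens '-' (missing).
    Illustrative sketch only."""
    tags = ['O'] * len(token_spans)
    starts = {s: i for i, (s, e) in enumerate(token_spans)}
    ends = {e: i for i, (s, e) in enumerate(token_spans)}
    for start, end, label in entities:
        if start in starts and end in ends:      # aligned to token boundaries
            first, last = starts[start], ends[end]
            if first == last:
                tags[first] = 'U-%s' % label     # single-token entity
            else:
                tags[first] = 'B-%s' % label
                for i in range(first + 1, last):
                    tags[i] = 'I-%s' % label
                tags[last] = 'L-%s' % label
        else:                                    # misaligned: missing value
            for i, (ts, te) in enumerate(token_spans):
                if ts < end and te > start:
                    tags[i] = '-'
    return tags

# 'I like London and Berlin .'
tokens = [(0, 1), (2, 6), (7, 13), (14, 17), (18, 24), (24, 25)]
print(offsets_to_biluo(tokens, [(7, 13, 'LOC'), (18, 24, 'LOC')]))
# ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']
```

The '-' tag means the gold-standard cost of any action on that token is unknown, which is how learning from imperfect tokenization remains possible.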
spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILOU tagging scheme.