Entity recognition

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as an integer ID or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token. See the API reference for more details.
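
For instance, here's a minimal sketch of iterating over doc.ents; the exact entities and boundaries you get depend on the model:

doc = nlp(u'London is a big city in the United Kingdom.')
for ent in doc.ents:
    # Each entity is a Span with an integer ID (ent.label) and a string label (ent.label_)
    print(ent.text, ent.start, ent.end, ent.label_)
# Expected with the default English model (model-dependent):
# (u'London', 0, 1, u'GPE')
# (u'United Kingdom', 7, 9, u'GPE')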

You can also access token-level entity annotations, using the token.ent_iob and token.ent_type attributes. The token.ent_iob attribute indicates whether an entity starts, continues or is absent on the token, following the IOB scheme: as an integer, 3 means the token begins an entity (B), 1 means it is inside an entity (I), 2 means it is outside an entity (O), and 0 means no entity tag is set.

Example

doc = nlp(u'London is a big city in the United Kingdom.')
print(doc[0].text, doc[0].ent_iob, doc[0].ent_type_)
# (u'London', 3, u'GPE')
print(doc[1].text, doc[1].ent_iob, doc[1].ent_type_)
# (u'is', 2, u'')

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level — you can't write directly to the token.ent_iob or token.ent_type attributes. The easiest way to set entities is to assign to the doc.ents attribute.

Example

from spacy.tokens import Span

doc = nlp(u'London is a big city in the United Kingdom.')
# Clear the existing entity annotations
doc.ents = []
assert doc[0].ent_type_ == ''
# Assign a Span covering the first token, labelled GPE
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
assert doc[0].ent_type_ == 'GPE'
# Alternatively, assign (ent_id, ent_type, start, end) tuples
doc.ents = []
doc.ents = [(u'LondonCity', doc.vocab.strings['GPE'], 0, 1)]

The value you assign should be a sequence whose items are either Span objects or (ent_id, ent_type, start, end) tuples, where start and end are the token offsets describing the slice of the document that should be annotated.

You can also assign entity annotations using the doc.from_array() method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you're importing from.

Example

from spacy.attrs import ENT_IOB, ENT_TYPE
import numpy

doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
assert list(doc.ents) == []
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype='uint64')
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'

Finally, you can always write to the underlying struct if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

Example

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    # Set the IOB codes: 3 (B) on the first token, 1 (I) on the rest
    doc.c[start].ent_iob = 3
    for i in range(start + 1, end):
        doc.c[i].ent_iob = 1

Obviously, if you write directly to the array of TokenC* structs, you're responsible for ensuring that the data is left in a consistent state.

The displaCy ENT visualizer

The displaCy ENT visualizer lets you explore an entity recognition model's behaviour interactively. If you're training a model, it's very useful to run the visualization server yourself. To help you do that, we've open-sourced both the back-end service and the front-end client.

Built-in entity types

Type           Description
PERSON         People, including fictional.
NORP           Nationalities or religious or political groups.
FACILITY       Buildings, airports, highways, bridges, etc.
ORG            Companies, agencies, institutions, etc.
GPE            Countries, cities, states.
LOC            Non-GPE locations, mountain ranges, bodies of water.
PRODUCT        Objects, vehicles, foods, etc. (Not services.)
EVENT          Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART    Titles of books, songs, etc.
LANGUAGE       Any named language.

The following values are also annotated in a style similar to names:

Type           Description
DATE           Absolute or relative dates or periods.
TIME           Times smaller than a day.
PERCENT        Percentage, including "%".
MONEY          Monetary values, including unit.
QUANTITY       Measurements, as of weight or distance.
ORDINAL        "first", "second", etc.
CARDINAL       Numerals that do not fall under another type.

Training and updating

To provide training examples to the entity recogniser, you'll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format or as token tags.

import spacy
import random
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

train_data = [
    ('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

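# Load the pipeline without the default parser and entity recogniser,
# and create a blank entity recogniser with the labels we want to learn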
nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

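        # Run the tagger first: the entity recogniser uses its output as features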
        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()

If a character offset in your entity annotations doesn't fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recogniser is allowed to learn from examples that may feature tokenizer errors.
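
For instance, here's a minimal sketch of a misaligned annotation, continuing from the snippet above; treat the gold.ner attribute as illustrative, since the exact attributes exposed by GoldParse vary between versions:

doc = nlp.make_doc(u'I like London.')
# (8, 12) cuts into the middle of the token 'London' (characters 7-13)
gold = GoldParse(doc, entities=[(8, 12, 'LOC')])
print(gold.ner)  # the misaligned tokens come out as missing values ('-')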

You can also provide token-level entity annotations, using the following tagging scheme to describe the entity boundaries (see the sketch after the table):

Tag          Description
B (BEGIN)    The first token of a multi-token entity.
I (IN)       An inner token of a multi-token entity.
L (LAST)     The final token of a multi-token entity.
U (UNIT)     A single-token entity.
O (OUT)      A non-entity token.
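
For example, here's a minimal sketch of passing these token tags to GoldParse instead of character offsets, assuming the same nlp object as above; each tag combines a boundary letter with the entity label:

doc = nlp.make_doc(u'I like London and Berlin.')
# One tag per token: I, like, London, and, Berlin, .
gold = GoldParse(doc, entities=['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O'])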

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILOU tagging scheme.
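
If you want to inspect this translation yourself, newer versions of spaCy ship a helper that converts character offsets to BILOU tags; its availability and import location depend on your version, so treat this as a sketch:

from spacy.gold import biluo_tags_from_offsets

doc = nlp.make_doc(u'I like London and Berlin.')
tags = biluo_tags_from_offsets(doc, [(7, 13, 'LOC'), (18, 24, 'LOC')])
print(tags)  # ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']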