Processing text

Once you have loaded the nlp object, you can call it as though it were a function. This allows you to process a single unicode string.

doc = nlp(u'Hello, world! A three sentence document.\nWith new lines...')
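
The result is a Doc object, which behaves like a sequence of tokens. As a quick sketch (assuming the default English pipeline loaded above, whose parser provides the sentence boundaries):

# Iterate over the tokens produced by the tokenizer
for token in doc:
    print(token.text)

# Iterate over sentences; these come from the dependency parser
for sent in doc.sents:
    print(sent.text)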

The library should perform equally well with short or long documents. All algorithms are linear-time in the length of the string, and once the data is loaded, there's no significant start-up cost to consider. This means that you don't have to strategically merge or split your text — you should feel free to feed in either single tweets or whole novels.

If you run nlp = spacy.load('en'), the nlp object will be an instance of spacy.en.English. This means that when you run doc = nlp(text), you're executing spacy.en.English.__call__, which is implemented on its parent class, Language. This is all it does:

doc = nlp.make_doc(text)
for proc in nlp.pipeline:
    proc(doc)

I've tried to make sure that the Language.__call__ function doesn't do any "heavy lifting", so you won't have complicated logic to replicate if you need to make your own pipeline class.

The .make_doc() method and .pipeline attribute make it easier to customise spaCy's behaviour. If you're using the default pipeline, we can desugar one more time.

doc = nlp.tokenizer(text)
nlp.tagger(doc)
nlp.parser(doc)
nlp.entity(doc)
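
This also means you can run just the components you need. For example, if you only want part-of-speech tags, a sketch like this (using the same make_doc() method and component callables) skips the parser and entity recognizer:

doc = nlp.make_doc(text)
# Only run the tagger; nlp.parser and nlp.entity are skipped,
# so the Doc gets part-of-speech tags but no parse or entities.
nlp.tagger(doc)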

Finally, here's where you can find out about each of those components:

Name        Source
tokenizer   spacy.tokenizer.Tokenizer
tagger      spacy.pipeline.Tagger
parser      spacy.pipeline.DependencyParser
entity      spacy.pipeline.EntityRecognizer

Multi-threading with .pipe()

If you have a sequence of documents to process, you should use the .pipe() method. The .pipe() method takes an iterator of texts, and accumulates an internal buffer, which it works on in parallel. It then yields the documents in order, one-by-one. After a long and bitter struggle, the global interpreter lock was freed around spaCy's main parsing loop in v0.100.3. This means that the .pipe() method will be significantly faster in most practical situations, because it allows shared memory parallelism.

for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass

To make full use of the .pipe() function, you might want to brush up on Python generators.
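
For example, you can stream texts lazily from disk, so the whole corpus never has to fit in memory. This is a sketch; the file name and the one-document-per-line format are assumptions, not part of spaCy's API:

import io

def iter_texts(path):
    # Yield one unicode text at a time instead of building a list
    with io.open(path, encoding='utf8') as file_:
        for line in file_:
            yield line.strip()

# 'my_corpus.txt' is a hypothetical file with one document per line
for doc in nlp.pipe(iter_texts('my_corpus.txt'), batch_size=10000, n_threads=3):
    pass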

Bringing your own annotations

By default, spaCy assumes that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word is followed by a space.

from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])

If provided, the spaces list must be the same length as the words list. It affects the doc.text, span.text, token.idx, span.start_char and span.end_char attributes. If you don't provide a spaces sequence, spaCy will assume that every word is followed by a space.

good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
assert bad_spaces.text == u'Hello , world !'
assert good_spaces.text == u'Hello, world!'

Once you have a Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named entities and other annotations. For details, see the respective usage pages.
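
Alternatively, if you only want to bring your own tokenization and would rather let spaCy's statistical models fill in the annotations, you can run the pipeline components from the earlier section directly over your Doc. A sketch:

from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
          spaces=[False, True, False, False])
# Each component reads from and writes to the shared Doc,
# adding the tags, parse and entities in place.
nlp.tagger(doc)
nlp.parser(doc)
nlp.entity(doc)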