Once you have loaded the
nlp object, you can call it as though it were a function. This allows you to process a single unicode string.
doc = nlp(u'Hello, world! A three sentence document.\nWith new lines...')
The library should perform equally well with short or long documents. All algorithms are linear-time in the length of the string, and once the data is loaded, there's no significant start-up cost to consider. This means that you don't have to strategically merge or split your text — you should feel free to feed in either single tweets or whole novels.
If you run
nlp = spacy.load('en'), the
nlp object will be an instance of
spacy.en.English. This means that when you run
doc = nlp(text), you're executing
spacy.en.English.__call__, which is implemented on its parent class, Language.
doc = nlp.make_doc(text)
for proc in nlp.pipeline:
    proc(doc)
I've tried to make sure that the
Language.__call__ function doesn't do any "heavy lifting", so that you won't have complicated logic
to replicate if you need to make your own pipeline class. This is all it does. The
.make_doc() method and
.pipeline attribute make it easier to customise spaCy's behaviour. If you're using the default
pipeline, we can desugar one more time.
doc = nlp.tokenizer(text)
nlp.tagger(doc)
nlp.parser(doc)
nlp.entity(doc)
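Because the pipeline is just a sequence of callables applied to the Doc, you can run a reduced pipeline yourself if you only need some of the annotations. The lines below are a minimal sketch, assuming nlp = spacy.load('en') and a text variable as above; it simply skips the parser and entity recogniser.
# Minimal sketch: tokenize and tag only.
# Assumes nlp = spacy.load('en') and a unicode string in `text`.
doc = nlp.make_doc(text)
nlp.tagger(doc)
# nlp.parser(doc) and nlp.entity(doc) are simply not called, so the document
# gets part-of-speech tags but no dependency parse or entities.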
Finally, each of those components (the tokenizer, tagger, parser and entity recogniser) has its own documentation where you can find out more.
If you have a sequence of documents to process, you should use the
.pipe() method. The
.pipe() method takes an iterator of texts, and accumulates an internal buffer,
which it works on in parallel. It then yields the documents in order,
one-by-one. After a long and bitter struggle, the global interpreter
lock was freed around spaCy's main parsing loop in v0.100.3. This means that the
.pipe() method will be significantly faster in most practical situations, because it allows shared memory parallelism.
for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass
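For instance, you can feed .pipe() a generator so that texts are read lazily instead of being held in memory all at once. The file name and helper function below are illustrative assumptions, not part of spaCy's API.
import io

def iter_texts(path):
    # Yield one unicode text per line, lazily, so the whole file is never in memory.
    with io.open(path, 'r', encoding='utf8') as file_:
        for line in file_:
            yield line.strip()

texts = iter_texts('my_texts.txt')  # hypothetical file with one text per line
for doc in nlp.pipe(texts, batch_size=1000, n_threads=3):
    pass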
To make full use of the
.pipe() function, you might want to brush up on Python generators. Here are a few quick hints:
- Generator comprehensions can be written as (item for item in sequence)
- The itertools built-in library and the cytoolz package provide a lot of handy generator tools
- Often you'll have an input stream that pairs text with some important metadata, e.g. a JSON document. To pair up the metadata with the processed Doc object, you should use the tee function to split the generator in two, and then izip the extra stream to the document stream (see the sketch after this list).
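Here is one way that tee/izip pattern might look. The records iterator of (metadata, text) pairs is an assumption for the sketch; adapt the unpacking to however your own stream is shaped.
from itertools import tee
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3

# `records` is assumed to be an iterator of (metadata, text) pairs,
# e.g. parsed from a stream of JSON documents.
stream1, stream2 = tee(records, 2)
texts = (text for metadata, text in stream1)
metadatas = (metadata for metadata, text in stream2)
for doc, metadata in izip(nlp.pipe(texts, batch_size=1000, n_threads=3), metadatas):
    pass  # each processed Doc arrives paired with its original metadata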
Bringing your own annotations
spaCy generally assumes by default that your data is raw text. However,
sometimes your data is partially annotated, e.g. with pre-existing
tokenization, part-of-speech tags, etc. The most common situation is
that you have pre-defined tokenization. If you have a list of strings, you can create a
Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word has a subsequent space.
from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text output and character-offset attributes such as span.end_char. If you don't provide a
spaces sequence, spaCy will assume that all words are whitespace delimited.
good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'], spaces=[False, True, False, False])
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
assert bad_spaces.text == u'Hello , world !'
assert good_spaces.text == u'Hello, world!'
Once you have a
Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named
entities and other attributes. For details, see the respective usage guides.
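As a rough sketch, assuming the nlp object from above, writing annotations onto a manually created Doc can look like the following; the tag values and the GPE entity label are purely illustrative.
from spacy.tokens import Doc, Span

doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
          spaces=[False, True, False, False])
# Write part-of-speech tags directly onto individual tokens.
doc[0].tag_ = u'UH'
doc[2].tag_ = u'NN'
# Mark 'world' as a named entity by assigning a list of Spans to doc.ents.
doc.ents = [Span(doc, 2, 3, label=doc.vocab.strings[u'GPE'])]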