Pipeline Functions
merge_noun_chunks function
Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".
Name | Description | Type |
---|---|---|
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
RETURNS | The modified Doc with merged noun chunks. | Doc |
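As a minimal sketch, the function can be called directly on a Doc that already carries a dependency parse. The example below assumes spaCy v3 and builds the parse by hand instead of loading a trained pipeline; the words, heads, part-of-speech tags and dependency labels are illustrative values, not the output of a real parser.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (illustrative values); heads are absolute token indices.
words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 1, 4, 1]
pos = ["PRON", "VERB", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, pos=pos, deps=deps)

# Noun chunks need a dependency parse; each chunk becomes one token.
doc = merge_noun_chunks(doc)
tokens = [token.text for token in doc]
print(tokens)
```

In a trained pipeline, the same effect would typically be achieved by adding the component by its string name after the parser, e.g. nlp.add_pipe("merge_noun_chunks").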
merge_entities function
Merge named entities into a single token. Also available via the string name "merge_entities".
Name | Description | Type |
---|---|---|
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
RETURNS | The modified Doc with merged entities. | Doc |
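The following sketch shows one way to use the component by its string name. It assumes spaCy v3 and uses a blank English pipeline with an entity_ruler and a single made-up pattern so that it runs without a trained model; in practice the entities would usually come from a trained entity recognizer.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based entities stand in for a trained NER component (illustrative pattern).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])

# Must run after the component that sets doc.ents.
nlp.add_pipe("merge_entities")

doc = nlp("I moved to New York last year.")
tokens = [token.text for token in doc]
print(tokens)
# ['I', 'moved', 'to', 'New York', 'last', 'year', '.']
```

The merged token keeps the entity label, so doc[3].ent_type_ is "GPE" in this sketch.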
merge_subtokens function
Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict “subtokens” that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.
Name | Description | Type |
---|---|---|
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
label | The subtoken dependency label. Defaults to "subtok". | str |
RETURNS | The modified Doc with merged subtokens. | Doc |
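To illustrate the merging behavior, the sketch below constructs a Doc by hand with one token labeled "subtok" and passes it to the function. It assumes spaCy v3; the sentence and its annotations are invented for the example and do not come from a real parser.

```python
import spacy
from spacy.pipeline.functions import merge_subtokens
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (illustrative): "a" is marked as a subtoken of "sentence".
words = ["This", "is", "a", "sentence", "."]
heads = [1, 1, 3, 1, 1]  # absolute token indices
deps = ["nsubj", "ROOT", "subtok", "attr", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Each run of "subtok" tokens is merged together with the token that follows it.
doc = merge_subtokens(doc, label="subtok")
tokens = [token.text for token in doc]
print(tokens)
# ['This', 'is', 'a sentence', '.']
```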
token_splitter function v3.0
Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.
Setting | Description | Type |
---|---|---|
min_length | The minimum length for a token to be split. Defaults to 25. | int |
split_length | The length of the split tokens. Defaults to 5. | int |
RETURNS | The modified Doc with the split tokens. | Doc |
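A short sketch of how the two settings interact, assuming spaCy v3 and a blank English pipeline. The min_length value of 20 is chosen only so that the 22-character example string qualifies for splitting.

```python
import spacy

nlp = spacy.blank("en")

# Split any token of 20+ characters into pieces of at most 5 characters.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

doc = nlp("aaaaabbbbbcccccdddddee")
tokens = [token.text for token in doc]
print(tokens)
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```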
doc_cleaner function v3.2.1
Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer pipeline components that store tensors and other values that can require a lot of memory and frequently aren’t needed after the whole pipeline has run.
Setting | Description | Type |
---|---|---|
attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components. | dict |
silent | If False, show warnings if attributes aren’t found or can’t be set. Defaults to True. | bool |
RETURNS | The modified Doc with the modified attributes. | Doc |
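The sketch below shows the effect of the attrs setting on a single attribute. It assumes spaCy v3.2.1 or later and a blank English pipeline; because no tok2vec component is present, a zero-filled array with an arbitrary width of 4 is assigned to doc.tensor by hand as a stand-in.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Only reset doc.tensor here; the default attrs also cover "_.trf_data".
cleaner = nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp.make_doc("A short example sentence.")
# Stand-in for the tensor a tok2vec component would store (arbitrary shape).
doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")

# Apply the component directly; in a full pipeline it would run last.
doc = cleaner(doc)
tensor_after = doc.tensor
print(tensor_after)
```

In a real pipeline the component would simply be added last, so that every Doc returned by nlp has the listed attributes reset.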
span_cleaner function experimental
Remove SpanGroups from doc.spans based on a key prefix. This is used to clean up after the CoreferenceResolver when it’s paired with a SpanResolver.
Setting | Description | Type |
---|---|---|
prefix | A prefix to check SpanGroup keys for. Any matching groups will be removed. Defaults to "coref_head_clusters". | str |
RETURNS | The modified Doc with any matching spans removed. | Doc |