Pipeline Functions · spaCy API Documentation

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. Doc
RETURNS	The modified `Doc` with merged noun chunks. Doc
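
A minimal sketch of how the component might be applied. The dependency parse below is written by hand purely for illustration (an assumption, since a blank pipeline has no parser); in a trained pipeline the tagger and parser would supply these annotations.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# add_pipe returns the component, which can also be called directly on a Doc.
merge_noun_chunks = nlp.add_pipe("merge_noun_chunks")

# Hand-annotated parse, assumed for illustration only.
words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 1, 4, 1]
pos = ["PRON", "VERB", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, pos=pos, deps=deps)

tokens_before = len(doc)
doc = merge_noun_chunks(doc)
merged_tokens = [token.text for token in doc]
print(tokens_before, merged_tokens)
```

Here the noun chunk "New York" ends up as one token, so the `Doc` is one token shorter than before.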

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities".

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. Doc
RETURNS	The modified `Doc` with merged entities. Doc
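
As a rough illustration, the component can be placed after anything that sets `doc.ents`. The sketch below assumes a blank English pipeline with an `entity_ruler` and a single made-up pattern standing in for a trained named entity recognizer.

```python
import spacy

nlp = spacy.blank("en")

# The entity_ruler and its pattern are stand-ins for a trained NER component.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])

# merge_entities runs after the entities have been set.
nlp.add_pipe("merge_entities")

doc = nlp("I like New York in Autumn.")
merged = [(token.text, token.ent_type_) for token in doc]
print(merged)
```

After merging, "New York" is a single token that carries the entity label `GPE`.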

merge_subtokens function

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict “subtokens” that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the `Matcher` to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Name	Description
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline. Doc
`label`	The subtoken dependency label. Defaults to `"subtok"`. str
RETURNS	The modified `Doc` with merged subtokens. Doc
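
The sketch below shows the effect on a hand-built `Doc`. The parse is contrived (an assumption for illustration): the token "a" is given the label "subtok" so that it is merged with the token that follows it. A real parser would predict this label only where it has been trained to do so.

```python
import spacy
from spacy.pipeline.functions import merge_subtokens
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Contrived parse: "a" is marked as a subtoken of the following word.
words = ["This", "is", "a", "sentence", "."]
heads = [1, 1, 3, 1, 1]
deps = ["nsubj", "ROOT", "subtok", "attr", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

doc = merge_subtokens(doc, label="subtok")
merged_tokens = [token.text for token in doc]
print(merged_tokens)
```

Each run of "subtok" tokens is joined with the token that immediately follows it, so "a" and "sentence" become one token.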

token_splitter function v3.0

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can produce input text that exceeds the transformer model's maximum length.

Setting	Description
`min_length`	The minimum length for a token to be split. Defaults to `25`. int
`split_length`	The length of the split tokens. Defaults to `5`. int
RETURNS	The modified `Doc` with the split tokens. Doc
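
A short sketch of the settings in use, assuming a blank English pipeline. The values `20` and `5` are chosen only to keep the example small.

```python
import spacy

nlp = spacy.blank("en")

# Run the splitter first so later components see the shorter tokens.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

doc = nlp("aaaaabbbbbcccccdddddee")
pieces = [token.text for token in doc]
print(pieces)  # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```

The 22-character token is cut into chunks of `split_length` characters, with the remainder left in a shorter final token.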

doc_cleaner function v3.2.1

Clean up `Doc` attributes. Intended for use at the end of pipelines with `tok2vec` or `transformer` components, which store tensors and other values that can require a lot of memory and often aren’t needed once the whole pipeline has run.

Setting	Description
`attrs`	A dict of the `Doc` attributes and the values to set them to. Defaults to `{"tensor": None, "_.trf_data": None}` to clean up after `tok2vec` and `transformer` components. dict
`silent`	If `False`, show warnings if attributes aren’t found or can’t be set. Defaults to `True`. bool
RETURNS	The modified `Doc` with the modified attributes. Doc
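
The sketch below illustrates the cleanup on a blank pipeline. Since no `tok2vec` component is present, the tensor is set by hand as a stand-in (an assumption for illustration), and the component is then called directly on the `Doc`.

```python
import numpy
import spacy

nlp = spacy.blank("en")
cleaner = nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp.make_doc("This is a sentence.")
# Stand-in for the tensor a tok2vec component would normally store.
doc.tensor = numpy.zeros((len(doc), 96), dtype="float32")
had_tensor = doc.tensor is not None

doc = cleaner(doc)
print(had_tensor, doc.tensor)
```

In a full pipeline the component would normally be added last, so that every component that needs the tensor has already run.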

span_cleaner function experimental

Remove `SpanGroup`s from `doc.spans` based on a key prefix. This is used to clean up after the `CoreferenceResolver` when it’s paired with a `SpanResolver`.

Setting	Description
`prefix`	A prefix to check `SpanGroup` keys for. Any matching groups will be removed. Defaults to `"coref_head_clusters"`. str
RETURNS	The modified `Doc` with any matching spans removed. Doc
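
Because the component is experimental, the sketch below does not call it. Instead it re-creates the described behavior with a hypothetical helper, `remove_span_groups`, which is not part of spaCy. The span group keys are also made up for illustration.

```python
import spacy
from spacy.tokens import Doc


def remove_span_groups(doc: Doc, prefix: str = "coref_head_clusters") -> Doc:
    """Hypothetical helper: drop every span group whose key starts with prefix."""
    # Copy the keys first so the mapping is not modified while iterating.
    for key in list(doc.spans.keys()):
        if key.startswith(prefix):
            del doc.spans[key]
    return doc


nlp = spacy.blank("en")
doc = nlp.make_doc("Sarah said she was tired.")

# Illustrative keys: one matches the prefix, one does not.
doc.spans["coref_head_clusters_1"] = [doc[0:1], doc[2:3]]
doc.spans["coref_clusters_1"] = [doc[0:1], doc[2:3]]

doc = remove_span_groups(doc)
remaining_keys = sorted(doc.spans.keys())
print(remaining_keys)  # ['coref_clusters_1']
```

Only groups whose key begins with the prefix are removed; other span groups on the `Doc` are left untouched.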