Pipeline Functions

Other built-in pipeline components and helpers

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Name | Description | Type
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc
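As a minimal sketch of the merge, the example below builds a Doc by hand so that no trained pipeline is needed. The words, heads, dependency labels and part-of-speech tags are illustrative assumptions standing in for real tagger and parser output, and the direct import from spacy.pipeline.functions is one possible way to call the function outside a pipeline.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline.functions import merge_noun_chunks

# A blank English pipeline supplies the English noun-chunk rules
# without requiring a trained model.
nlp = spacy.blank("en")

# Hand-written annotations (assumed for illustration only).
words = ["The", "quick", "fox", "jumps"]
heads = [2, 2, 3, 3]  # absolute index of each token's head
deps = ["det", "amod", "nsubj", "ROOT"]
pos = ["DET", "ADJ", "NOUN", "VERB"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

tokens_before = [t.text for t in doc]
doc = merge_noun_chunks(doc)
tokens_after = [t.text for t in doc]

print(tokens_before)
print(tokens_after)
```

In a trained pipeline, the same effect would typically be achieved with nlp.add_pipe("merge_noun_chunks") placed after the parser.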

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities".

Name | Description | Type
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc
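The following sketch shows the effect on a small Doc. It assumes a blank English pipeline and sets the entities by hand, since no entity recognizer is loaded; the sentence and the labels are made up for illustration.

```python
import spacy
from spacy.tokens import Span
from spacy.pipeline.functions import merge_entities

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("Barack Obama visited Paris")

# Entities assigned manually (assumed annotations, not model output).
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="GPE")]

doc = merge_entities(doc)
merged_tokens = [t.text for t in doc]
merged_types = [t.ent_type_ for t in doc]

print(merged_tokens)
print(merged_types)
```

In a trained pipeline, nlp.add_pipe("merge_entities") after the entity recognizer would apply the same merge to every processed Doc.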

merge_subtokens function

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser can predict “subtokens” that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges each sequence into a single token.

Name | Description | Type
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc
label | The subtoken dependency label. Defaults to "subtok". | str
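A minimal sketch of the behavior, assuming a hand-built Doc in which the first token carries the "subtok" label. The words and the parse are invented for illustration; a real pipeline would get these labels from a parser trained to predict them.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline.functions import merge_subtokens

nlp = spacy.blank("en")

# Assumed annotations: "ice" is marked as a subtoken attached to "cream".
words = ["ice", "cream", "melts"]
heads = [1, 2, 2]  # absolute index of each token's head
deps = ["subtok", "nsubj", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The label argument is spelled out here; "subtok" is already the default.
doc = merge_subtokens(doc, label="subtok")
subtok_result = [t.text for t in doc]

print(subtok_result)
```

Note that the merged span covers the run of "subtok" tokens together with the token that follows it, which is why "ice" and "cream" end up as one token.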

token_splitter function (v3.0)

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can lead to input text that exceeds the transformer model’s maximum length.

Setting | Description | Type
min_length | The minimum length for a token to be split. Defaults to 25. | int
split_length | The length of the split tokens. Defaults to 5. | int
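The sketch below adds the component to a blank English pipeline with non-default settings. The input string and the chosen values (min_length of 20, split_length of 5) are arbitrary and only serve to make the split visible.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("token_splitter", config={"min_length": 20, "split_length": 5})

# The first "word" has 27 characters, so it is split into chunks of 5.
doc = nlp("aaaaabbbbbcccccdddddeeeeeff g h i")
pieces = [t.text for t in doc]

print(pieces)
```

The short tokens "g", "h" and "i" are left alone, and the last piece of the long token is simply whatever remains after the full-length chunks.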

doc_cleaner function (v3.2.1)

Clean up Doc attributes. Intended for use at the end of pipelines with tok2vec or transformer components, which store tensors and other values that can take up a lot of memory and often aren’t needed once the whole pipeline has run.

Setting | Description | Type
attrs | A dict of the Doc attributes and the values to set them to. Defaults to {"tensor": None, "_.trf_data": None} to clean up after tok2vec and transformer components. | dict
silent | If False, show warnings if attributes aren’t found or can’t be set. Defaults to True. | bool
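As a rough sketch, assuming spaCy v3.2.1 or later, the component can be added with its default settings and applied to a Doc that carries a tensor. The zero-filled array below is a made-up stand-in for what a tok2vec component would store.

```python
import numpy
import spacy

nlp = spacy.blank("en")
cleaner = nlp.add_pipe("doc_cleaner")  # default attrs and silent=True

doc = nlp.make_doc("A short example text")
# Placeholder tensor (assumption): one 8-dimensional row per token.
doc.tensor = numpy.zeros((len(doc), 8), dtype="float32")
tensor_shape_before = doc.tensor.shape

doc = cleaner(doc)
tensor_after = doc.tensor

print(tensor_shape_before)
print(tensor_after)
```

Because no trf_data extension is registered in this blank pipeline, the "_.trf_data" entry is skipped, and with silent left at True no warning is shown for it.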

span_cleaner function (experimental)

Remove SpanGroups from doc.spans based on a key prefix. This is used to clean up after the CoreferenceResolver when it’s paired with a SpanResolver.

Setting | Description | Type
prefix | A prefix to check SpanGroup keys for. Any matching groups will be removed. Defaults to "coref_head_clusters". | str