Pipeline Functions

Other built-in pipeline components and helpers

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

Name    Type    Description
doc     Doc     The Doc object to process, e.g. the Doc in the pipeline.

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

Name    Type    Description
doc     Doc     The Doc object to process, e.g. the Doc in the pipeline.
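
To make the effect of merging concrete, here is a toy sketch in plain Python. It does not use spaCy and is not how the component is implemented; it only shows what happens to a token sequence when each span (an entity here, but the same idea applies to noun chunks) is collapsed into one token. The function name merge_spans, the sample sentence, and the span boundaries are all hypothetical.

```python
def merge_spans(tokens, spans):
    """Return a new token list in which each (start, end) span is one token.

    `tokens` is a list of strings. `spans` holds non-overlapping
    (start, end) index pairs with an exclusive end, similar to slicing.
    """
    ends_by_start = {}
    for start, end in spans:
        if end <= start:
            raise ValueError("each span must cover at least one token")
        ends_by_start[start] = end

    merged = []
    i = 0
    while i < len(tokens):
        if i in ends_by_start:
            end = ends_by_start[i]
            # Join the covered tokens with single spaces into one token.
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["Ada", "Lovelace", "was", "born", "in", "London"]
entity_spans = [(0, 2), (5, 6)]  # hypothetical entity boundaries
print(merge_spans(tokens, entity_spans))
# ['Ada Lovelace', 'was', 'born', 'in', 'London']
```

In the real component the spans come from the entities already set on the Doc, so an entity recognizer (or manually assigned entities) has to run before it in the pipeline.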

merge_subtokens function (v2.1)

Merge subtokens into a single token. Also available via the string name "merge_subtokens". After initialization, the component is typically added to the processing pipeline using nlp.add_pipe.

As of v2.1, the parser is able to predict “subtokens” that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Name    Type       Description
doc     Doc        The Doc object to process, e.g. the Doc in the pipeline.
label   unicode    The subtoken dependency label. Defaults to "subtok".
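
The subtoken merge described above can be pictured with a small plain-Python sketch. It is only an illustration, not the component's code: it scans parallel lists of token texts and dependency labels instead of using the Matcher, and it assumes that a run of subtoken-labelled tokens is merged together with the token that directly follows the run. The function name merge_subtoken_runs and the sample labels are hypothetical.

```python
def merge_subtoken_runs(tokens, deps, label="subtok"):
    """Merge each run of `label` tokens with the token that follows it.

    `tokens` and `deps` are parallel lists of token texts and
    dependency labels. Pieces are joined without whitespace, as
    would be the case for text that has no spaces between words.
    """
    merged = []
    current = []  # pieces of the token currently being assembled
    for text, dep in zip(tokens, deps):
        current.append(text)
        if dep != label:
            # A token without the subtoken label closes the run
            # (or simply stands alone if no run was open).
            merged.append("".join(current))
            current = []
    if current:
        # Trailing subtokens with nothing after them stay as one token.
        merged.append("".join(current))
    return merged


# Hypothetical parse: the first piece is labelled as a subtoken of the second.
tokens = ["東", "京", "に", "行く"]
deps = ["subtok", "obl", "case", "ROOT"]
print(merge_subtoken_runs(tokens, deps))
# ['東京', 'に', '行く']
```

Passing a different label value changes which dependency label marks a subtoken, mirroring the label argument in the table above.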