Add language detection to your spaCy pipeline using CLD2

spaCy-CLD operates on spaCy Doc and Span objects. Once the detector has run on a Doc or Span, the object exposes two extension attributes under the ._ namespace: languages (a list of up to 3 language codes) and language_scores (a dictionary mapping each language code to a confidence score between 0 and 1).
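As a rough illustration of how these two attributes might be consumed together, the sketch below filters a language list by score. The helper name confident_languages and the min_score parameter are hypothetical and not part of spacy-cld, and the sample values are hand-written stand-ins shaped like the attributes described above, not real detector output.

```python
def confident_languages(languages, language_scores, min_score=0.5):
    """Return the codes in `languages` scoring at least `min_score`, best first.

    Hypothetical helper: `languages` is a list of codes and `language_scores`
    a dict of code -> score, mirroring the two attributes described above.
    """
    kept = [code for code in languages
            if language_scores.get(code, 0.0) >= min_score]
    return sorted(kept, key=lambda code: language_scores[code], reverse=True)


# Hand-written sample values (assumed shapes, not real detector output).
sample_languages = ["fr", "en", "de"]
sample_scores = {"en": 0.62, "fr": 0.30, "de": 0.08}

print(confident_languages(sample_languages, sample_scores, min_score=0.25))
```

With real data, the same call would take doc._.languages and doc._.language_scores as its first two arguments.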

spacy-cld is a small extension that wraps the pycld2 Python library, which in turn wraps the Compact Language Detector 2 (CLD2) C library originally built at Google for the Chromium project. CLD2 uses character n-grams as features and a Naive Bayes classifier to identify more than 80 languages from Unicode text strings (or XML/HTML). It can detect up to 3 different languages in a given document, and reports a confidence score with each language.
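To make the idea of character n-grams plus Naive Bayes concrete, here is a toy sketch in pure Python. It is not CLD2's implementation: the two-language training data, the trigram size, the uniform prior, the add-one smoothing, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split lowercased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-grams per language. `samples` maps language code -> list of texts."""
    counts = {lang: Counter() for lang in samples}
    for lang, texts in samples.items():
        for text in texts:
            counts[lang].update(char_ngrams(text, n))
    vocab = set().union(*counts.values())
    return counts, len(vocab)


def classify(text, counts, vocab_size, n=3):
    """Score each language with Naive Bayes (uniform prior, add-one smoothing).

    Returns a dict of language code -> score, normalized to sum to 1.
    """
    log_probs = {}
    for lang, ngram_counts in counts.items():
        total = sum(ngram_counts.values())
        log_probs[lang] = sum(
            math.log((ngram_counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
    # Normalize in a numerically stable way by subtracting the largest log score.
    peak = max(log_probs.values())
    weights = {lang: math.exp(lp - peak) for lang, lp in log_probs.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}


# Tiny made-up training set; a real detector is trained on far more text.
samples = {
    "en": [
        "this is some english text",
        "the quick brown fox jumps over the lazy dog",
        "where is the nearest train station",
    ],
    "fr": [
        "ceci est un texte en français",
        "le renard brun saute par-dessus le chien paresseux",
        "où est la gare la plus proche",
    ],
}

counts, vocab_size = train(samples)
scores = classify("the dog is in the station", counts, vocab_size)
best = max(scores, key=scores.get)
print(best, scores)
```

The normalized scores here play a role loosely comparable to the per-language confidence values mentioned above, though CLD2 computes its own scores differently.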


import spacy
from spacy_cld import LanguageDetector

nlp = spacy.load('en')
language_detector = LanguageDetector()
nlp.add_pipe(language_detector)
doc = nlp('This is some English text.')

doc._.languages  # ['en']
doc._.language_scores['en']  # 0.96
Nicholas D Haynes


