Adding languages

Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component. Obviously, there are lots of ways you can organise your code when you implement your own Language class. This guide will focus on how it's done within spaCy. For full language support, we'll need to:

  1. Create a Language subclass and implement it.
  2. Define custom language data, like a stop list, tag map and tokenizer exceptions.
  3. Build the vocabulary including word frequencies, Brown clusters and word vectors.
  4. Set up a model directory and train the tagger and parser.

For some languages, you may also want to develop a solution for lemmatization and morphological analysis.

Creating a Language subclass

Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a folder spacy/es, which can be imported as spacy.es.

To get started, you can use our templates for the most important files. Here's what the class template looks like:

__init__.py (excerpt)

# Import language-specific data
from .language_data import *

class Xxxxx(Language):
    lang = 'xx' # ISO code

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'xx'

        # override defaults
        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        tag_map = TAG_MAP
        stop_words = STOP_WORDS

Additionally, the new Language class needs to be added to the list of available languages in __init__.py. The languages are then registered using the set_lang_class() function.

spacy/__init__.py

from . import en
from . import xx

_languages = (en.English, ..., xx.Xxxxx)
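
The registration step itself might look roughly like this. This is a sketch only; the exact code in spacy/__init__.py may differ between versions:

from . import util

# Register each Language subclass under its ISO code, so it can later be
# looked up with spacy.util.get_lang_class('xx').
for _lang in _languages:
    util.set_lang_class(_lang.lang, _lang)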

You'll also need to list the new package in setup.py:

spacy/setup.py

PACKAGES = [
    'spacy',
    'spacy.tokens',
    'spacy.en',
    'spacy.xx',
    # ...
]

Adding language data

Every language is full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared between multiple languages, while others are entirely idiosyncratic. spaCy makes it easy to deal with these exceptions on a case-by-case basis, by defining simple rules and exceptions. The exception data is defined in Python, in the language data files, so that Python functions can be used to help you generalise and combine the data as you require.

Stop words

A "stop list" is a classic trick from the early days of information retrieval when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model.

To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string:

Example

STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
""".split())
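
As a rough illustration of how a stop list gets used, here's a minimal bag-of-words filter. This is a sketch only, assuming the STOP_WORDS set defined above:

from collections import Counter

def bag_of_words(words):
    # Count lower-cased words, dropping anything on the stop list.
    return Counter(w.lower() for w in words if w.lower() not in STOP_WORDS)

print(bag_of_words(["All", "about", "adding", "languages"]))
# Counter({'adding': 1, 'languages': 1}) -- "all" and "about" are filtered out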

Tag map

Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.

The keys of the tag map should be strings in your tag set. The values should be dictionaries, and each dictionary must have an entry POS whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map, which allows you to do simple rule-based morphological analysis.

Example

TAG_MAP = {
    "NNS": {POS: NOUN, "Number": "plur"},
    "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT": {POS: DET}
}
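
To make the structure concrete, here's what a single entry provides. This is a sketch assuming the TAG_MAP above and spaCy's symbol constants:

from spacy.symbols import POS, NOUN

entry = TAG_MAP["NNS"]
assert entry[POS] == NOUN         # coarse-grained Universal Dependencies tag
assert entry["Number"] == "plur"  # optional morphological feature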

Tokenizer exceptions

spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens covered by your special case, such as the subtokens' LEMMA or TAG.

Tokenizer exceptions can be added in the following format:

language_data.py

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", TAG: "RB"}
    ]
}
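
At runtime, the rule above means the string "don't" always produces the same two tokens. Here's a sketch, using the placeholder Xxxxx class from earlier and assuming its data is installed:

nlp = Xxxxx()
doc = nlp("don't")

assert [t.text for t in doc] == ["do", "n't"]
assert doc[1].lemma_ == "not"  # the LEMMA attribute set in the exception
assert doc[1].tag_ == "RB"     # the TAG attribute set in the exception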

Some exceptions, like certain abbreviations, will always be mapped to a single token containing only an ORTH property. To make your data less verbose, you can use the helper function strings_to_exc() with a simple array of strings:

Example

from ..language_data import update_exc, strings_to_exc

ORTH_ONLY = ["a.", "b.", "c."]
converted = strings_to_exc(ORTH_ONLY)
# {"a.": [{ORTH: "a."}], "b.": [{ORTH: "b."}], "c.": [{ORTH: "c."}]}

update_exc(TOKENIZER_EXCEPTIONS, converted)

Unambiguous abbreviations, like month names or locations in English, should be added to TOKENIZER_EXCEPTIONS with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}.
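
For example, such an entry can be added directly alongside the ORTH-only exceptions:

# "Jan." always expands to the same lemma, so it gets an explicit LEMMA,
# unlike the ORTH-only abbreviations above.
TOKENIZER_EXCEPTIONS["Jan."] = [{ORTH: "Jan.", LEMMA: "January"}]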

Custom tokenizer exceptions

For language-specific tokenizer exceptions, you can use the update_exc() function to update the existing exceptions with a custom dictionary. This is especially useful for exceptions that follow a consistent pattern. Instead of adding each exception manually, you can write a simple function that returns a dictionary of exceptions.

For example, here's how exceptions for time formats like "1a.m." and "1am" are generated in the English language_data.py:

language_data.py

from .. import language_data
from ..language_data import update_exc


def get_time_exc(hours):
    exc = {}
    for hour in hours:
        exc["%da.m." % hour] = [{ORTH: hour}, {ORTH: "a.m."}]
        exc["%dp.m." % hour] = [{ORTH: hour}, {ORTH: "p.m."}]
        exc["%dam" % hour] = [{ORTH: hour}, {ORTH: "am", LEMMA: "a.m."}]
        exc["%dpm" % hour] = [{ORTH: hour}, {ORTH: "pm", LEMMA: "p.m."}]
    return exc


TOKENIZER_EXCEPTIONS = dict(language_data.TOKENIZER_EXCEPTIONS)

hours = 12
update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, hours + 1)))

Shared utils

The spacy.language_data package provides constants and functions that can be imported and used across languages.

Name | Description
PRON_LEMMA | Special value for pronoun lemmas ("-PRON-").
DET_LEMMA | Special value for determiner lemmas, used in languages with inflected determiners ("-DET-").
ENT_ID | Special value for entity IDs ("ent_id").
update_exc(exc, additions) | Update an existing dictionary of exceptions exc with a dictionary of additions.
strings_to_exc(orths) | Convert an array of strings to a dictionary of exceptions of the format {"string": [{ORTH: "string"}]}.
expand_exc(excs, search, replace) | Search for a string search in a dictionary of exceptions excs and, if found, copy the entry and replace search with replace in both the key and the ORTH value. Useful to provide exceptions containing different versions of special unicode characters, like ' and ’.
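
For instance, expand_exc() can generate apostrophe variants of existing exceptions. This is a sketch only, assuming the function returns the expanded entries as a new dictionary that can be merged back in with update_exc():

from ..language_data import expand_exc, update_exc

TOKENIZER_EXCEPTIONS = {
    "don't": [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]
}

# Copy each entry containing a plain apostrophe, substituting the typographic
# apostrophe (’) in both the key and the ORTH values.
update_exc(TOKENIZER_EXCEPTIONS, expand_exc(TOKENIZER_EXCEPTIONS, "'", "’"))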

If you've written a custom function that seems like it might be useful for several languages, consider adding it to language_data/util.py instead of the individual language module.

Shared language data

Because languages can vary in quite arbitrary ways, spaCy avoids organising the language data into an explicit inheritance hierarchy. Instead, reusable functions and data are collected as atomic pieces in the spacy.language_data package.

Name | Description | Source
EMOTICONS | Common unicode emoticons without whitespace. | emoticons.py
TOKENIZER_PREFIXES | Regular expressions to match left-attaching tokens and punctuation, e.g. $, (, " | punctuation.py
TOKENIZER_SUFFIXES | Regular expressions to match right-attaching tokens and punctuation, e.g. %, ), " | punctuation.py
TOKENIZER_INFIXES | Regular expressions to match token separators, e.g. - | punctuation.py
TAG_MAP | A tag map keyed by the universal part-of-speech tags, mapping each tag to itself, with no morphological features. | tag_map.py
ENTITY_RULES | Patterns for named entities commonly missed by the statistical entity recognizer, for use in the rule matcher. | entity_rules.py
FALSE_POSITIVES | Patterns for phrases commonly mistaken for named entities by the statistical entity recognizer, for use in the rule matcher. | entity_rules.py

Individual languages can extend and override any of these expressions. Often, when a new language is added, you'll find a pattern or symbol that's missing. Even if this pattern or symbol isn't common in other languages, it might be best to add it to the base expressions, unless it has some conflicting interpretation. For instance, we don't expect to see guillemet quotation marks (» and «) in English text. But if we do see them, we'd probably prefer the tokenizer to split them off.
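
An individual language module can override the shared expressions like this. This is a sketch, assuming the shared constants are plain lists of regular-expression strings:

from .. import language_data

# Start from the shared rules and append language-specific punctuation,
# e.g. guillemets used as quotation marks.
TOKENIZER_PREFIXES = list(language_data.TOKENIZER_PREFIXES) + ['«']
TOKENIZER_SUFFIXES = list(language_data.TOKENIZER_SUFFIXES) + ['»']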

Building the vocabulary

spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train two types of distributional similarity model: Brown clusters, and word vectors. The Brown clusters are used as features by linear models, while the word vectors are useful for lexical similarity models and deep learning.

Word frequencies

To generate the word frequencies from a large, raw corpus, you can use the word_freqs.py script from the spaCy developer resources. Note that your corpus should not be preprocessed (i.e. it should still contain punctuation, for example). The model command expects a tab-separated word frequencies file with three columns:

  1. The number of times the word occurred in your language sample.
  2. The number of distinct documents the word occurred in.
  3. The word itself.

An example word frequencies file could look like this:

es_word_freqs.txt

6361109	111	Aunque
23598543	111	aunque
10097056	111	claro
193454	111	aro
7711123	111	viene
12812323	111	mal
23414636	111	momento
2014580	111	felicidad
233865	111	repleto
15527	111	eto
235565	111	deliciosos
17259079	111	buena
71155	111	Anímate
37705	111	anímate
33155	111	cuéntanos
2389171	111	cuál
961576	111	típico

You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we'd end up with incorrect frequency counts for the tokens "ca" and "n't".
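
The word_freqs.py script handles this at scale, but the core idea is simple. Here's a sketch, where Xxxxx is the placeholder Language class from earlier and texts is any iterable of raw documents:

from collections import Counter

nlp = Xxxxx()
word_counts = Counter()
doc_counts = Counter()

for text in texts:
    words = [w.text for w in nlp.tokenizer(text)]
    word_counts.update(words)      # total occurrences per token
    doc_counts.update(set(words))  # number of documents containing the token

with open('xx_word_freqs.txt', 'w') as f_out:
    for word, freq in word_counts.items():
        f_out.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))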

Training the Brown clusters

spaCy's tagger, parser and entity recognizer are designed to use distributional similarity features provided by the Brown clustering algorithm. You should train a model with between 500 and 1000 clusters. A minimum frequency threshold of 10 usually works well.

An example clusters file could look like this:

es_clusters.data

0000	Vestigial	1
0000	Vesturland	1
0000	Veyreau	1
0000	Veynes	1
0000	Vexilografía	1
0000	Vetrigne	1
0000	Vetónica	1
0000	Asunden	1
0000	Villalambrús	1
0000	Vichuquén	1
0000	Vichtis	1
0000	Vichigasta	1
0000	VAAH	1
0000	Viciebsk	1
0000	Vicovaro	1
0000	Villardeveyo	1
0000	Vidala	1
0000	Videoguard	1
0000	Vedás	1
0000	Videocomunicado	1
0000	VideoCrypt	1

Training the word vectors

Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves – they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match.

You can use our word vectors training script, which pre-processes the text with your language-specific tokenizer and trains the model using Gensim. The vectors.bin file should consist of one word and vector per line.
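
The training script takes care of streaming and pre-processing, but the core step looks roughly like this. This is a sketch assuming Gensim's Word2Vec API (the parameter is size in Gensim < 4.0, vector_size in later versions), with the language-specific tokenizer from earlier available as nlp:

from gensim.models import Word2Vec

# Tokenize each line with the language-specific tokenizer, so the vectors
# use the same segmentation spaCy will use at run-time.
sentences = [[w.text for w in nlp.tokenizer(line.strip())]
             for line in open('corpus.txt', encoding='utf8')]

model = Word2Vec(sentences, size=300, window=5, min_count=10, workers=4)
model.wv.save_word2vec_format('vectors.bin', binary=True)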

Setting up a model directory

Once you've collected the word frequencies, Brown clusters and word vectors files, you can use the model command to create a data directory:

python -m spacy model [lang] [model_dir] [freqs_data] [clusters_data] [vectors_data]

This creates a spaCy data directory with a vocabulary model, ready to be loaded. By default, the command expects to be able to find your language class using spacy.util.get_lang_class(lang_id).

Training the tagger and parser

You can now train the model using a corpus for your language annotated with Universal Dependencies. If your corpus uses the CoNLL-U format, i.e. files with the extension .conllu, you can use the convert command to convert it to spaCy's JSON format for training.
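
For example (the exact arguments may vary between spaCy versions):

python -m spacy convert [input_file] [output_dir]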

Once you have your UD corpus transformed into JSON, you can train your model using spaCy's train command:

python -m spacy train [lang] [output_dir] [train_data] [dev_data] [--n_iter] [--parser_L1] [--no_tagger] [--no_parser] [--no_ner]