Adding languages

Adding full support for a language touches many different parts of the spaCy library. This guide explains how to fit everything together, and points you to the specific workflows for each component. Obviously, there are lots of ways you can organise your code when you implement your own Language class. This guide will focus on how it's done within spaCy. For full language support, we'll need to:

  1. Create a Language subclass and implement it.
  2. Define custom language data, like a stop list, tag map and tokenizer exceptions.
  3. Build the vocabulary including word frequencies, Brown clusters and word vectors.

Once you have the tokenizer and vocabulary, you can train the tagger, parser and entity recognizer. For some languages, you may also want to develop a solution for lemmatization and morphological analysis.

Creating a Language subclass

Language-specific code and resources should be organised into a subpackage of spaCy, named according to the language's ISO code. For instance, code and resources specific to Spanish are placed into a folder spacy/es, which can be imported as spacy.es.

To get started, you can use our templates for the most important files. Here's what the class template looks like:

__init__.py (excerpt)

# Import language-specific data
from .language_data import *


class Xxxxx(Language):
    lang = 'xx' # ISO code

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'xx' # override defaults

        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        tag_map = TAG_MAP
        stop_words = STOP_WORDS

Additionally, the new Language class needs to be registered in spacy/__init__.py using the set_lang_class() function, so that you can use spacy.load().

spacy/__init__.py

from . import en
from . import xx

set_lang_class(en.English.lang, en.English)
set_lang_class(xx.Xxxxx.lang, xx.Xxxxx)

You'll also need to list the new package in setup.py:

spacy/setup.py

PACKAGES = [
    'spacy',
    'spacy.tokens',
    'spacy.en',
    'spacy.xx',
    # ...
]

Adding language data

Every language is full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared between multiple languages, while others are entirely idiosyncratic. spaCy makes it easy to deal with these exceptions on a case-by-case basis, by defining simple rules and exceptions. The exception data is defined in Python, as part of the language data, so that Python functions can be used to help you generalise and combine the data as you require.

Stop words

A "stop list" is a classic trick from the early days of information retrieval when search was largely about keyword presence and absence. It is still sometimes useful today to filter out common words from a bag-of-words model.

To improve readability, STOP_WORDS are separated by spaces and newlines, and added as a multiline string:

Example

STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at

back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
""".split())
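
As a quick illustration of how such a list is typically used, here's a minimal sketch that filters stop words out of a whitespace-tokenized bag of words (the sample sentence is made up, and the output reflects only the excerpt of STOP_WORDS shown above):

Example

words = "all those things that are about to happen".split()
content_words = [w for w in words if w not in STOP_WORDS]
# ['those', 'things', 'that', 'to', 'happen']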

Tag map

Most treebanks define a custom part-of-speech tag scheme, striking a balance between level of detail and ease of prediction. While it's useful to have custom tagging schemes, it's also useful to have a common scheme, to which the more specific tags can be related. The tagger can learn a tag scheme with any arbitrary symbols. However, you need to define how those symbols map down to the Universal Dependencies tag set. This is done by providing a tag map.

The keys of the tag map should be strings in your tag set. The values should be dictionaries. Each dictionary must have an entry POS whose value is one of the Universal Dependencies tags. Optionally, you can also include morphological features or other token attributes in the tag map. This allows you to do simple rule-based morphological analysis.

Example

TAG_MAP = {
    "NNS":  {POS: NOUN, "Number": "plur"},
    "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
    "DT":   {POS: DET}
}

Tokenizer exceptions

spaCy's tokenization algorithm lets you deal with whitespace-delimited chunks separately. This makes it easy to define special-case rules, without worrying about how they interact with the rest of the tokenizer. Whenever the key string is matched, the special-case rule is applied, giving the defined sequence of tokens. You can also attach attributes to the subtokens covered by your special case, such as the subtokens' LEMMA or TAG.

Tokenizer exceptions can be added in the following format:

language_data.py

TOKENIZER_EXCEPTIONS = {
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", TAG: "RB"}
    ]
}

Some exceptions, like certain abbreviations, will always be mapped to a single token containing only an ORTH property. To make your data less verbose, you can use the helper function strings_to_exc() with a simple array of strings:

Example

from ..language_data import update_exc, strings_to_exc

ORTH_ONLY = ["a.", "b.", "c."]
converted = strings_to_exc(ORTH_ONLY)
# {"a.": [{ORTH: "a."}], "b.": [{ORTH: "b."}], "c.": [{ORTH: "c."}]}

update_exc(TOKENIZER_EXCEPTIONS, converted)

Unambiguous abbreviations, like month names or locations in English, should be added to TOKENIZER_EXCEPTIONS with a lemma assigned, for example {ORTH: "Jan.", LEMMA: "January"}.
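
For instance, a handful of month abbreviations could be added like this (a short sketch using the helpers described above; only two entries are spelled out):

Example

ABBREVIATIONS = {
    "Jan.": [{ORTH: "Jan.", LEMMA: "January"}],
    "Feb.": [{ORTH: "Feb.", LEMMA: "February"}]
}

update_exc(TOKENIZER_EXCEPTIONS, ABBREVIATIONS)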

Custom tokenizer exceptions

For language-specific tokenizer exceptions, you can use the update_exc() function to update the existing exceptions with a custom dictionary. This is especially useful for exceptions that follow a consistent pattern. Instead of adding each exception manually, you can write a simple function that returns a dictionary of exceptions.

For example, here's how exceptions for time formats like "1a.m." and "1am" are generated in the English language_data.py:

language_data.py

from .. import language_data
from ..language_data import update_exc


def get_time_exc(hours):
    exc = {}
    for hour in hours:
        exc["%da.m." % hour] = [
            {ORTH: hour},
            {ORTH: "a.m."}
        ]

        exc["%dp.m." % hour] = [
            {ORTH: hour},
            {ORTH: "p.m."}
        ]

        exc["%dam" % hour] = [
            {ORTH: hour},
            {ORTH: "am", LEMMA: "a.m."}
        ]

        exc["%dpm" % hour] = [
            {ORTH: hour},
            {ORTH: "pm", LEMMA: "p.m."}
        ]
    return exc


TOKENIZER_EXCEPTIONS = dict(language_data.TOKENIZER_EXCEPTIONS)

hours = 12

update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, hours + 1)))

Shared utils

The spacy.language_data package provides constants and functions that can be imported and used across languages.

PRON_LEMMA: Special value for pronoun lemmas ("-PRON-").
DET_LEMMA: Special value for determiner lemmas, used in languages with inflected determiners ("-DET-").
ENT_ID: Special value for entity IDs ("ent_id").
update_exc(exc, additions): Update an existing dictionary of exceptions exc with a dictionary of additions.
strings_to_exc(orths): Convert an array of strings to a dictionary of exceptions of the format {"string": [{ORTH: "string"}]}.
expand_exc(excs, search, replace): Search for a string search in a dictionary of exceptions excs and, if found, copy the entry and replace search with replace in both the key and the ORTH value. Useful for providing exceptions containing different versions of special unicode characters, like ' and ’.
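
For instance, expand_exc() can be used to derive a second set of exceptions that uses the typographic apostrophe instead of the straight one. This is a minimal sketch based on the signature above, assuming the function returns the expanded dictionary:

Example

from ..language_data import update_exc, expand_exc

# copy all exceptions containing a straight apostrophe and replace it with
# the typographic apostrophe in both the keys and the ORTH values
curly_variants = expand_exc(TOKENIZER_EXCEPTIONS, "'", "’")
update_exc(TOKENIZER_EXCEPTIONS, curly_variants)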

If you've written a custom function that seems like it might be useful for several languages, consider adding it to language_data/util.py instead of the individual language module.

Shared language data

Because languages can vary in quite arbitrary ways, spaCy avoids organising the language data into an explicit inheritance hierarchy. Instead, reusable functions and data are collected as atomic pieces in the spacy.language_data package.

EMOTICONS: Common unicode emoticons without whitespace. Source: emoticons.py
TOKENIZER_PREFIXES: Regular expressions to match left-attaching tokens and punctuation, e.g. $, (, ". Source: punctuation.py
TOKENIZER_SUFFIXES: Regular expressions to match right-attaching tokens and punctuation, e.g. %, ), ". Source: punctuation.py
TOKENIZER_INFIXES: Regular expressions to match token separators, e.g. -. Source: punctuation.py
TAG_MAP: A tag map keyed by the universal part-of-speech tags, mapping them to themselves with no morphological features. Source: tag_map.py
ENTITY_RULES: Patterns for named entities commonly missed by the statistical entity recognizer, for use in the rule matcher. Source: entity_rules.py
FALSE_POSITIVES: Patterns for phrases commonly mistaken for named entities by the statistical entity recognizer, for use in the rule matcher. Source: entity_rules.py

Individual languages can extend and override any of these expressions. Often, when a new language is added, you'll find a pattern or symbol that's missing. Even if this pattern or symbol isn't common in other languages, it might be best to add it to the base expressions, unless it has some conflicting interpretation. For instance, we don't expect to see guillemet quotation marks (» and «) in English text. But if we do see them, we'd probably prefer the tokenizer to split them off.
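
As an illustration, a language module might extend the shared punctuation rules before passing them to its Defaults. The snippet below is a sketch under that assumption, adding the guillemets mentioned above as prefix and suffix patterns:

Example

from ..language_data import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES

# extend the shared expressions with language-specific patterns:
# treat the opening guillemet as a prefix and the closing one as a suffix
TOKENIZER_PREFIXES = TOKENIZER_PREFIXES + [r'«']
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES + [r'»']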

Building the vocabulary

spaCy expects that common words will be cached in a Vocab instance. The vocabulary caches lexical features, and makes it easy to use information from unlabelled text samples in your models. Specifically, you'll usually want to collect word frequencies, and train two types of distributional similarity model: Brown clusters, and word vectors. The Brown clusters are used as features by linear models, while the word vectors are useful for lexical similarity models and deep learning.

Once you've collected the word frequencies, Brown clusters and word vectors files, you can use the init.py script from our developer resources to create a spaCy data directory:

python training/init.py xx your_data_directory/ my_data/word_freqs.txt my_data/clusters.txt my_data/word_vectors.bz2

This creates a spaCy data directory with a vocabulary model, ready to be loaded. By default, the init.py script expects to be able to find your language class using spacy.util.get_lang_class(lang_id). You can edit the script to help it find your language class if necessary.

Word frequencies

The init.py script expects a tab-separated word frequencies file with three columns:

  1. The number of times the word occurred in your language sample.
  2. The number of distinct documents the word occurred in.
  3. The word itself.

You should make sure you use the spaCy tokenizer for your language to segment the text for your word frequencies. This will ensure that the frequencies refer to the same segmentation standards you'll be using at run-time. For instance, spaCy's English tokenizer segments "can't" into two tokens. If we segmented the text by whitespace to produce the frequency counts, we'd end up with incorrect counts for the tokens "ca" and "n't".
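
Here's a minimal sketch of producing such a file with the language's own tokenizer. It assumes nlp is an instance of your new Language class (for example via spacy.load('xx') once the class is registered), and that the corpus is a plain-text file with one document per line; the file names are hypothetical:

Example

from collections import Counter

word_counts = Counter()
doc_counts = Counter()
with open('my_data/corpus.txt', encoding='utf8') as f:
    for line in f:
        words = [token.orth_ for token in nlp.tokenizer(line.strip())]
        word_counts.update(words)
        doc_counts.update(set(words))

# write the three tab-separated columns: frequency, document count, word
with open('my_data/word_freqs.txt', 'w', encoding='utf8') as f:
    for word, freq in word_counts.items():
        f.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))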

Training the Brown clusters

spaCy's tagger, parser and entity recognizer are designed to use distributional similarity features provided by the Brown clustering algorithm. You should train a model with between 500 and 1000 clusters. A minimum frequency threshold of 10 usually works well.
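
The clusters are usually trained with an external tool rather than with spaCy itself. As an assumption-laden example, using Percy Liang's brown-cluster implementation, a run over the pre-tokenized corpus might look roughly like this (the file name is hypothetical; check the tool's own documentation for the exact flags):

Example

./wcluster --text my_data/tokenized_corpus.txt --c 1000 --min-occur 10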

Training the word vectors

Word2vec and related algorithms let you train useful word similarity models from unlabelled text. This is a key part of using deep learning for NLP with limited labelled data. The vectors are also useful by themselves – they power the .similarity() methods in spaCy. For best results, you should pre-process the text with spaCy before training the Word2vec model. This ensures your tokenization will match.

You can use our word vectors training script, which pre-processes the text with your language-specific tokenizer and trains the model using Gensim.
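
If you'd rather wire it up yourself, here's a minimal sketch of the same idea using Gensim directly. It assumes a file of pre-tokenized text with one sentence per line and tokens separated by spaces; the file names are hypothetical, and older Gensim versions spell the dimensionality argument size instead of vector_size:

Example

from gensim.models import Word2Vec

# read the pre-tokenized corpus: one sentence per line, tokens separated by spaces
with open('my_data/tokenized_corpus.txt', encoding='utf8') as f:
    sentences = [line.split() for line in f]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=4)

# write the vectors in the standard word2vec text format; compress the file afterwards if needed
model.wv.save_word2vec_format('my_data/word_vectors.txt', binary=False)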