Word Vectors and Semantic Similarity

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether "dog" and "cat" are similar really depends on how you're looking at it. spaCy's similarity model usually assumes a pretty general-purpose definition of similarity.

tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.similarity(token2))
        dog              cat              banana
dog     1.00 similar     0.80 similar     0.24 dissimilar
cat     0.80 similar     1.00 similar     0.28 dissimilar
banana  0.24 dissimilar  0.28 dissimilar  1.00 similar

In this case, the model's predictions are pretty on point. A dog is very similar to a cat, whereas a banana is not very similar to either of them. Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:

banana.vector

array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, 3.28450017e-02, -4.19569999e-01, 7.20689967e-02, -3.74760002e-01, 5.74599989e-02, -1.24009997e-02, 5.29489994e-01, -5.23800015e-01, -1.97710007e-01, -3.41470003e-01, 5.33169985e-01, -2.53309999e-02, 1.73800007e-01, 1.67720005e-01, 8.39839995e-01, 5.51070012e-02, 1.05470002e-01, 3.78719985e-01, 2.42750004e-01, 1.47449998e-02, 5.59509993e-01, 1.25210002e-01, -6.75960004e-01, 3.58420014e-01, -4.00279984e-02, 9.59490016e-02, -5.06900012e-01, -8.53179991e-02, 1.79800004e-01, 3.38669986e-01, 1.32300004e-01, 3.10209990e-01, 2.18779996e-01, 1.68530002e-01, 1.98740005e-01, -5.73849976e-01, -1.06490001e-01, 2.66689986e-01, 1.28380001e-01, -1.28030002e-01, -1.32839993e-01, 1.26570001e-01, 8.67229998e-01, 9.67210010e-02, 4.83060002e-01, 2.12709993e-01, -5.49900010e-02, -8.24249983e-02, 2.24079996e-01, 2.39749998e-01, -6.22599982e-02, 6.21940017e-01, -5.98999977e-01, 4.32009995e-01, 2.81430006e-01, 3.38420011e-02, -4.88150001e-01, -2.13589996e-01, 2.74010003e-01, 2.40950003e-01, 4.59500015e-01, -1.86049998e-01, -1.04970002e+00, -9.73049998e-02, -1.89080000e-01, -7.09290028e-01, 4.01950002e-01, -1.87680006e-01, 5.16870022e-01, 1.25200003e-01, 8.41499984e-01, 1.20970003e-01, 8.82389992e-02, -2.91959997e-02, 1.21510006e-03, 5.68250008e-02, -2.74210006e-01, 2.55640000e-01, 6.97930008e-02, -2.22580001e-01, -3.60060006e-01, -2.24020004e-01, -5.36990017e-02, 1.20220006e+00, 5.45350015e-01, -5.79980016e-01, 1.09049998e-01, 4.21669990e-01, 2.06619993e-01, 1.29360005e-01, -4.14570011e-02, -6.67770028e-01, 4.04670000e-01, -1.52179999e-02, -2.76400000e-01, -1.56110004e-01, -7.91980028e-02, 4.00369987e-02, -1.29439995e-01, -2.40900001e-04, -2.67850012e-01, -3.81150007e-01, -9.72450018e-01, 3.17259997e-01, -4.39509988e-01, 4.19340014e-01, 1.83530003e-01, -1.52600005e-01, -1.08080000e-01, -1.03579998e+00, 7.62170032e-02, 1.65189996e-01, 2.65259994e-04, 1.66160002e-01, -1.52810007e-01, 1.81229994e-01, 7.02740014e-01, 5.79559989e-03, 5.16639985e-02, -5.97449988e-02, -2.75510013e-01, -3.90489995e-01, 6.11319989e-02, 5.54300010e-01, -8.79969969e-02, -4.16810006e-01, 3.28260005e-01, -5.25489986e-01, -4.42880005e-01, 8.21829960e-03, 2.44859993e-01, -2.29819998e-01, -3.49810004e-01, 2.68940002e-01, 3.91660005e-01, -4.19039994e-01, 1.61909997e-01, -2.62630010e+00, 6.41340017e-01, 3.97430003e-01, -1.28680006e-01, -3.19460005e-01, -2.56330013e-01, -1.22199997e-01, 3.22750002e-01, -7.99330026e-02, -1.53479993e-01, 3.15050006e-01, 3.05909991e-01, 2.60120004e-01, 1.85530007e-01, -2.40429997e-01, 4.28860001e-02, 4.06219989e-01, -2.42559999e-01, 6.38700008e-01, 6.99829996e-01, -1.40430003e-01, 2.52090007e-01, 4.89840001e-01, -6.10670000e-02, -3.67659986e-01, -5.50890028e-01, -3.82649988e-01, -2.08430007e-01, 2.28320003e-01, 5.12179971e-01, 2.78679997e-01, 4.76520002e-01, 4.79510017e-02, -3.40079993e-01, -3.28729987e-01, -4.19669986e-01, -7.54989982e-02, -3.89539987e-01, -2.96219997e-02, -3.40700001e-01, 2.21699998e-01, -6.28560036e-02, -5.19029975e-01, -3.77739996e-01, -4.34770016e-03, -5.83010018e-01, -8.75459984e-02, -2.39289999e-01, -2.47109994e-01, -2.58870006e-01, -2.98940003e-01, 1.37150005e-01, 2.98919994e-02, 3.65439989e-02, -4.96650010e-01, -1.81600004e-01, 5.29389977e-01, 2.19919994e-01, -4.45140004e-01, 3.77979994e-01, -5.70620000e-01, -4.69460003e-02, 8.18059966e-02, 1.92789994e-02, 3.32459986e-01, -1.46200001e-01, 1.71560004e-01, 3.99809986e-01, 3.62170011e-01, 1.28160000e-01, 3.16439986e-01, 3.75690013e-01, -7.46899992e-02, -4.84800003e-02, -3.14009994e-01, 
-1.92860007e-01, -3.12940001e-01, -1.75529998e-02, -1.75139993e-01, -2.75870003e-02, -1.00000000e+00, 1.83870003e-01, 8.14339995e-01, -1.89129993e-01, 5.09989977e-01, -9.19600017e-03, -1.92950002e-03, 2.81890005e-01, 2.72470005e-02, 4.34089988e-01, -5.49669981e-01, -9.74259973e-02, -2.45399997e-01, -1.72030002e-01, -8.86500031e-02, -3.02980006e-01, -1.35910004e-01, -2.77649999e-01, 3.12860007e-03, 2.05559999e-01, -1.57720000e-01, -5.23079991e-01, -6.47010028e-01, -3.70139986e-01, 6.93930015e-02, 1.14009999e-01, 2.75940001e-01, -1.38750002e-01, -2.72680014e-01, 6.68910027e-01, -5.64539991e-02, 2.40170002e-01, -2.67300010e-01, 2.98599988e-01, 1.00830004e-01, 5.55920005e-01, 3.28489989e-01, 7.68579990e-02, 1.55279994e-01, 2.56359994e-01, -1.07720003e-01, -1.23590000e-01, 1.18270002e-01, -9.90289971e-02, -3.43279988e-01, 1.15019999e-01, -3.78080010e-01, -3.90120000e-02, -3.45930010e-01, -1.94040000e-01, -3.35799992e-01, -6.23340011e-02, 2.89189994e-01, 2.80319989e-01, -5.37410021e-01, 6.27939999e-01, 5.69549985e-02, 6.21469975e-01, -2.52819985e-01, 4.16700006e-01, -1.01079997e-02, -2.54339993e-01, 4.00029987e-01, 4.24320012e-01, 2.26720005e-01, 1.75530002e-01, 2.30489999e-01, 2.83230007e-01, 1.38820007e-01, 3.12180002e-03, 1.70570001e-01, 3.66849989e-01, 2.52470002e-03, -6.40089989e-01, -2.97650009e-01, 7.89430022e-01, 3.31680000e-01, -1.19659996e+00, -4.71559986e-02, 5.31750023e-01], dtype=float32)
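
Under the hood, the similarity scores above are (by default) the cosine similarity between these vectors. As a rough sketch, assuming the en_core_web_lg model is installed, you can recompute the score for two tokens directly from their .vector arrays with numpy, and it should come out (almost) the same as what .similarity() returns:

import numpy
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(u'dog cat')
dog, cat = doc[0], doc[1]

# cosine similarity: dot product divided by the product of the L2 norms
cosine = numpy.dot(dog.vector, cat.vector) / (
    numpy.linalg.norm(dog.vector) * numpy.linalg.norm(cat.vector))

print(cosine)               # roughly 0.80 for the large English model
print(dog.similarity(cat))  # should be (almost) the same number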

Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalise vectors.

nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana sasquatch')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
Text       Has vector   Vector norm          OOV
dog        True         7.033672992262838    False
cat        True         6.68081871208896     False
banana     True         6.700014292148571    False
sasquatch  False        0                    True

The words "dog", "cat" and "banana" are all pretty common in English, so they're part of the model's vocabulary, and come with a vector. The word "sasquatch" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it's practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger models or loading in a full vector package, for example, en_vectors_web_lg, which includes over 1 million unique vectors.

Similarities in context

Aside from spaCy's built-in word vectors, which were trained on a lot of text with a wide vocabulary, the parsing, tagging and NER models also rely on vector representations of the meanings of words in context. As the processing pipeline is applied, spaCy encodes a document's internal meaning representations as an array of floats, also called a tensor. This allows spaCy to make a reasonable guess at a word's meaning, based on its surrounding words. Even if a word hasn't been seen before, spaCy will know something about it. Because spaCy uses a 4-layer convolutional network, the tensors are sensitive to up to four words on either side of a word.

For example, here are three sentences containing the out-of-vocabulary word "labrador" in different contexts.

doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"the labrador people live in canada.")

for doc in [doc1, doc2, doc3]:
    labrador = doc[1]
    dog = nlp(u"dog")
    print(labrador.similarity(dog))

Even though the model has never seen the word "labrador", it can make a fairly accurate prediction of its similarity to "dog" in different contexts.

Context                              labrador.similarity(dog)
The labrador barked.                 0.56 similar
The labrador swam.                   0.48 dissimilar
the labrador people live in canada.  0.39 dissimilar

The same also works for whole documents. Here, the variance of the similarities is lower, as all words and their order are taken into account. However, the context-specific similarity is often still reflected pretty accurately.

doc1 = nlp(u"Paris is the largest city in France.")
doc2 = nlp(u"Vilnius is the capital of Lithuania.")
doc3 = nlp(u"An emu is a large bird.")

for doc in [doc1, doc2, doc3]:
    for other_doc in [doc1, doc2, doc3]:
        print(doc.similarity(other_doc))

Even though the sentences about Paris and Vilnius consist of different words and entities, they both describe the same concept and are seen as more similar than the sentence about emus. In this case, even a misspelled version of "Vilnius" would still produce very similar results.

                                       Paris is the largest   Vilnius is the capital   An emu is a
                                       city in France.        of Lithuania.            large bird.
Paris is the largest city in France.   1.00 identical         0.85 similar             0.65 dissimilar
Vilnius is the capital of Lithuania.   0.85 similar           1.00 identical           0.55 dissimilar
An emu is a large bird.                0.65 dissimilar        0.55 dissimilar          1.00 identical

Sentences that consist of the same words in different order will likely be seen as very similar – but never identical.

docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"dog man bites")]

for doc in docs:
    for other_doc in docs:
        print(doc.similarity(other_doc))

Interestingly, "man bites dog" and "man dog bites" are seen as slightly more similar than "man bites dog" and "dog bites man". This may be a coincidence – or the result of "man" being interpreted as both sentences' subject.

               dog bites man   man bites dog   man dog bites   dog man bites
dog bites man  1.00 identical  0.90 similar    0.89 similar    0.92 similar
man bites dog  0.90 similar    1.00 identical  0.93 similar    0.90 similar
man dog bites  0.89 similar    0.93 similar    1.00 identical  0.92 similar
dog man bites  0.92 similar    0.90 similar    0.92 similar    1.00 identical

Customising word vectors

Word vectors let you import knowledge from raw text into your model. The knowledge is represented as a table of numbers, with one row per term in your vocabulary. If two terms are used in similar contexts, the algorithm that learns the vectors should assign them rows that are quite similar, while words that are used in different contexts will have quite different values. This lets you use the row-values assigned to the words as a kind of dictionary, to tell you some things about what the words in your text mean.
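
To make this concrete, here's a small sketch, assuming the en_vectors_web_lg package is installed: the raw table lives in the Vectors.data array, with one row per entry, and looking up a word's .vector should give you the row assigned to that word.

import spacy

nlp = spacy.load('en_vectors_web_lg')

# the raw table: one row of floats per entry – (number of vectors, vector width)
print(nlp.vocab.vectors.data.shape)

# looking up a word returns its row from that table
row = nlp.vocab[u'coffee'].vector
print(row.shape)    # (300,) for this package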

Word vectors are particularly useful for terms which aren't well represented in your labelled training data. For instance, if you're doing named entity recognition, there will always be lots of names that you don't have examples of. Imagine your training data happens to contain some examples of the term "Microsoft", but no examples of the term "Symantec". In your raw text sample, there are plenty of examples of both terms, and they're used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won't see examples of "Symantec" labelled as a company. However, it'll see that "Symantec" has a word vector that usually corresponds to company terms, so it can make the inference.

In order to make best use of the word vectors, you want the word vectors table to cover a very large vocabulary. However, most words are rare, so most of the rows in a large word vectors table will be accessed very rarely, or never at all. You can usually cover more than 95% of the tokens in your corpus with just a few thousand rows in the vector table. However, it's those 5% of rare terms where the word vectors are most useful. The problem is that increasing the size of the vector table produces rapidly diminishing returns in coverage over these rare terms.
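
For example, you can measure coverage on your own corpus by counting how many tokens come with a vector. Here's a minimal sketch, assuming your texts are available as a list of strings:

import spacy

nlp = spacy.load('en_core_web_lg')
texts = [u'I like coffee.', u'The sasquatch drank a frappuccino.']  # your corpus here

n_tokens = 0
n_covered = 0
for doc in nlp.pipe(texts):
    for token in doc:
        n_tokens += 1
        if token.has_vector:
            n_covered += 1

print(100.0 * n_covered / n_tokens)  # percentage of tokens covered by the vector table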

Optimising vector coverage
New in spaCy v2.0

To help you strike a good balance between coverage and memory usage, spaCy's Vectors class lets you map multiple keys to the same row of the table. If you're using the spacy vocab command to create a vocabulary, pruning the vectors will be taken care of automatically. You can also do it manually in the following steps:

  1. Start with a word vectors model that covers a huge vocabulary. For instance, the en_vectors_web_lg model provides 300-dimensional GloVe vectors for over 1 million terms of English.
  2. If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
  3. Call Vocab.prune_vectors with the number of vectors you want to keep.
nlp = spacy.load('en_vectors_web_lg')
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)

assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries

Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to, and score the similarity score between the two words.

Removed words

{
    'Shore': ('coast', 0.732257),
    'Precautionary': ('caution', 0.490973),
    'hopelessness': ('sadness', 0.742366),
    'Continous': ('continuous', 0.732549),
    'Disemboweled': ('corpse', 0.499432),
    'biostatistician': ('scientist', 0.339724),
    'somewheres': ('somewheres', 0.402736),
    'observing': ('observe', 0.823096),
    'Leaving': ('leaving', 1.0)
}

In the example above, the vector for "Shore" was removed and remapped to the vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to the vector of "leaving", which is identical.
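
The remapped keys stay usable: after pruning, looking up a removed word's vector returns the row of the word it was merged with. Assuming "Shore" was among the removed words, as in the example output above, something like this should hold:

import numpy
import spacy

nlp = spacy.load('en_vectors_web_lg')
removed_words = nlp.vocab.prune_vectors(105000)

# 'Shore' no longer has its own row – its key now points at the row for 'coast'
assert numpy.allclose(nlp.vocab[u'Shore'].vector, nlp.vocab[u'coast'].vector)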

Adding vectors
New in spaCy v2.0

spaCy's new Vectors class greatly improves the way word vectors are stored, accessed and used. The data is stored in two structures:

  • An array, which can be either on CPU or GPU.
  • A dictionary mapping string-hashes to rows in the table.

Keep in mind that the Vectors class itself has no StringStore, so you have to store the hash-to-string mapping separately. If you need to manage the strings, you should use the Vectors via the Vocab class, e.g. vocab.vectors. To add vectors to the vocabulary, you can use the Vocab.set_vector method.

Adding vectors

import numpy
from spacy.vocab import Vocab

vector_data = {u'dog': numpy.random.uniform(-1, 1, (300,)),
               u'cat': numpy.random.uniform(-1, 1, (300,)),
               u'orange': numpy.random.uniform(-1, 1, (300,))}

vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)

Loading GloVe vectors
New in spaCy v2.0

spaCy comes with built-in support for loading GloVe vectors from a directory. The Vectors.from_glove method assumes a binary format, the vocab provided in a vocab.txt, and the naming scheme of vectors.{size}.[fd].bin. For example:

File name          Dimensions  Data type
vectors.128.f.bin  128         float32
vectors.300.d.bin  300         float64 (double)
nlp = spacy.load('en')
nlp.vocab.vectors.from_glove('/path/to/vectors')

If your instance of Language already contains vectors, they will be overwritten. To create your own GloVe vectors model package like spaCy's en_vectors_web_lg, you can call nlp.to_disk, and then package the model using the package command.
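
A rough sketch of that workflow, with hypothetical paths, could look like this:

import spacy

nlp = spacy.load('en_vectors_web_lg')
# ... add or modify vectors here, e.g. with nlp.vocab.set_vector ...

# write the model, including its vectors, to a directory of your choice
nlp.to_disk('/tmp/en_custom_vectors')

# then build an installable package from it on the command line:
# python -m spacy package /tmp/en_custom_vectors /tmp/packages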

Loading other vectors
New in spaCy v2.0

You can also choose to load in vectors from other sources, like the fastText vectors for 294 languages, trained on Wikipedia. After reading in the file, the vectors are added to the Vocab using the set_vector method.

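Here's a minimal sketch of that approach (error handling omitted), assuming you've downloaded one of the fastText .vec files – a plain-text format with a header line followed by one whitespace-separated row per word – and saved it locally as, say, wiki.en.vec:

import io
import numpy
import spacy

nlp = spacy.load('en')

with io.open('wiki.en.vec', 'r', encoding='utf8') as file_:
    header = file_.readline()
    nr_row, nr_dim = header.split()                   # row count and vector width
    for line in file_:
        pieces = line.rstrip().rsplit(' ', int(nr_dim))
        word = pieces[0]
        vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
        nlp.vocab.set_vector(word, vector)            # add the vector to the vocab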

Using custom similarity methods

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens. You can customise these behaviours by modifying the doc.user_hooks, doc.user_span_hooks and doc.user_token_hooks dictionaries.
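
For example, here's a sketch of a hypothetical SimilarityHook pipeline component that swaps in a custom (toy) metric by registering it in those dictionaries:

import spacy

def custom_similarity(obj1, obj2):
    # toy metric for illustration – plug in whatever comparison you need
    return 1.0 if obj1.text == obj2.text else 0.0

class SimilarityHook(object):
    def __call__(self, doc):
        doc.user_hooks['similarity'] = custom_similarity
        doc.user_span_hooks['similarity'] = custom_similarity
        doc.user_token_hooks['similarity'] = custom_similarity
        return doc

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe(SimilarityHook(), name='similarity_hook', last=True)

doc1 = nlp(u'dog')
doc2 = nlp(u'cat')
print(doc1.similarity(doc2))  # 0.0 – now computed by custom_similarity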

Storing vectors on a GPU

If you're using a GPU, it's much more efficient to keep the word vectors on the device. You can do that by setting the Vectors.data attribute to a cupy.ndarray object if you're using spaCy or Chainer, or a torch.Tensor object if you're using PyTorch. The data object just needs to support __iter__ and __getitem__, so if you're using another library such as TensorFlow, you could also create a wrapper for your vectors data.

spaCy, Thinc or Chainer

import cupy.cuda
import numpy
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)
with cupy.cuda.Device(0):
    vectors.data = cupy.asarray(vectors.data)

PyTorch

import numpy
import torch
from spacy.vectors import Vectors

vector_table = numpy.zeros((3, 300), dtype='f')
vectors = Vectors([u'dog', u'cat', u'orange'], vector_table)
vectors.data = torch.Tensor(vectors.data).cuda(0)