Using word vectors and semantic similarities

Dense, real-valued vectors that encode distributional similarity information are now a cornerstone of practical NLP. The most common way to train them is with the word2vec family of algorithms.

spaCy makes using word vectors very easy. The Lexeme, Token, Span and Doc classes all have a .vector property, which is a 1-dimensional numpy array of 32-bit floats:

import spacy

nlp = spacy.load('en')

apples, and_, oranges = nlp(u'apples and oranges')
print(apples.vector.shape)
# (300,)
print(apples.similarity(oranges))

By default, Token.vector returns the vector for its underlying lexeme, while Doc.vector and Span.vector return an average of the vectors of their tokens. You can customize these behaviours by modifying the doc.user_hooks, doc.user_span_hooks and doc.user_token_hooks dictionaries.
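For instance, here is a minimal sketch of a hook that averages only the non-stop-word tokens. The function name and the stop-word filter are illustrative choices, not part of the library's API; the hook simply has to accept the object and return a vector:

import numpy

def vector_without_stops(obj):
    # Average the vectors of non-stop-word tokens; fall back to all tokens.
    vectors = [t.vector for t in obj if not t.is_stop]
    if not vectors:
        vectors = [t.vector for t in obj]
    return numpy.mean(vectors, axis=0)

doc = nlp(u'apples and oranges')
doc.user_hooks['vector'] = vector_without_stops       # overrides doc.vector
doc.user_span_hooks['vector'] = vector_without_stops  # overrides span.vector for spans of this doc
print(doc.vector.shape)
# (300,)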

The default English model installs vectors for one million vocabulary entries: 300-dimensional GloVe vectors trained on the Common Crawl corpus. The GloVe Common Crawl vectors have become a de facto standard for practical NLP.

You can load new word vectors from a file-like buffer using the vocab.load_vectors() method. The file should be a whitespace-delimited text file, where the word is in the first column, and subsequent columns provide the vector data. For faster loading, you can use the vocab.vectors_from_bin_loc() method, which accepts a path to a binary file written by vocab.dump_vectors().
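As a sketch, assuming a plain-text vectors file named my_vectors.txt in the format described above (the file name and contents are placeholders):

# Each line: the word, then its vector components, whitespace-delimited, e.g.
# apple 0.123 -0.456 0.789 ...
with open('my_vectors.txt') as file_:
    nlp.vocab.load_vectors(file_)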

You can also load vectors from memory by writing to the lexeme.vector property. If the vectors you are writing have a different dimensionality from the ones currently loaded, you should first call vocab.resize_vectors(new_size).
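For example, a sketch of resizing to 128 dimensions and writing a vector for a single lexeme. The dimensionality and the random values are placeholders for illustration only:

import numpy

nlp.vocab.resize_vectors(128)  # only needed if the new dimensionality differs
nlp.vocab[u'cucumber'].vector = numpy.random.uniform(-1, 1, (128,)).astype('float32')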