Using word vectors and semantic similarities
Dense, real-valued vectors representing distributional similarity information are now a cornerstone of practical NLP. The most common way to train these vectors is the word2vec family of algorithms.
spaCy makes using word vectors very easy. The Lexeme, Token, Span and Doc classes all have a
.vector property, which is a 1-dimensional numpy array of 32-bit floats:
import spacy

nlp = spacy.load('en')
apples, and_, oranges = nlp(u'apples and oranges')
print(apples.vector.shape)
# (300,)
apples.similarity(oranges)
Token.vector returns the vector for its underlying lexeme, while
Span.vector and Doc.vector return an average of the vectors of their tokens. You can customize these behaviours by modifying the
doc.user_hooks, doc.user_span_hooks and doc.user_token_hooks dictionaries.
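Conceptually, a similarity score is the cosine of the angle between two vectors, and a span's vector is the element-wise mean of its token vectors. A minimal numpy sketch of both operations, using made-up 4-dimensional vectors in place of real spaCy output:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors,
    # divided by the product of their lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "word vectors" (spaCy's defaults are 300-dimensional)
apples = np.array([0.5, 0.2, -0.1, 0.8], dtype=np.float32)
oranges = np.array([0.4, 0.3, -0.2, 0.7], dtype=np.float32)

# A span's vector is the average of its tokens' vectors
span_vector = np.mean([apples, oranges], axis=0)

print(cosine(apples, oranges))  # a value in [-1, 1]; close to 1 here
print(span_vector.shape)        # (4,)
```

Because the score is a cosine, it is insensitive to vector magnitude: scaling a vector by a positive constant leaves its similarities unchanged.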
The default English model installs vectors for one million vocabulary entries, using 300-dimensional vectors trained on the Common Crawl corpus with the GloVe algorithm. The GloVe Common Crawl vectors have become a de facto standard for practical NLP.
You can load new word vectors from a file-like buffer using the
vocab.load_vectors() method. The file should be a whitespace-delimited text file, where the word is in the first column
and subsequent columns provide the vector data. For faster loading, you can use the
vocab.load_vectors_from_bin_loc() method, which accepts a path to a binary file written by
vocab.dump_vectors().
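The text format described above can be illustrated with a small stand-alone parser. The parse_vectors helper below is hypothetical, not part of spaCy; it only shows the expected file layout:

```python
import io
import numpy as np

def parse_vectors(file_):
    # Each line: the word, then its vector components, whitespace-delimited
    vectors = {}
    for line in file_:
        parts = line.split()
        word = parts[0]
        vectors[word] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# A tiny in-memory example of the expected layout
data = io.StringIO(u"apples 0.1 0.2 0.3\noranges 0.2 0.1 0.4\n")
vectors = parse_vectors(data)
print(vectors['apples'].shape)  # (3,)
```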
You can also load vectors from memory, by writing to the
lexeme.vector property. If the vectors you are writing are of different dimensionality
from the ones currently loaded, you should first call
vocab.resize_vectors().
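The resize-then-assign pattern can be sketched without spaCy, using a plain dict standing in for the vocabulary. The resize_vectors function here is a simulation (truncate or zero-pad), just to show why resizing must happen before writing vectors of a new size:

```python
import numpy as np

# A plain dict standing in for the vocabulary's lexemes
vocab = {
    'apples': np.zeros(300, dtype=np.float32),
    'oranges': np.zeros(300, dtype=np.float32),
}

def resize_vectors(vocab, new_dim):
    # Truncate or zero-pad every stored vector to the new dimensionality,
    # so that subsequently assigned vectors all have a consistent shape
    for word, vec in vocab.items():
        resized = np.zeros(new_dim, dtype=np.float32)
        n = min(new_dim, vec.shape[0])
        resized[:n] = vec[:n]
        vocab[word] = resized

# Writing 128-dimensional vectors: resize first, then assign
resize_vectors(vocab, 128)
vocab['apples'] = np.random.uniform(-1, 1, 128).astype(np.float32)
print(vocab['apples'].shape)  # (128,)
```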