Natural Language

Thousands of researchers are trying to make computers understand text. They're succeeding. spaCy is a Python NLP library that helps you get their work out of papers and into production.
Release update

spaCy v1.0 out now!

I'm excited — and more than a little nervous! — to finally make the 1.0 release of spaCy. By far my favourite part of the release is the new support for custom pipelines. Default support for GloVe vectors is also nice. The trickiest change was a significant rewrite of the Matcher class, to support entity IDs and attributes. I've added tutorials for the new features, and some training examples.

Read the blog post

Are you using spaCy?

Take the spaCy user survey

Two years after I started working on spaCy full time, I'm finally pushing forward with a 1.0 release. It's also past time to take a bit of a census. I hope you'll take a few minutes to fill out this survey, to help me understand how you're using the library, and how it can be better.

Thanks for your support!

Take the survey

Built for Production

Most AI software is built for research. Over the last ten years, we've used a lot of that software, and built some of it ourselves, especially for natural language processing (NLP). But the faster the research has moved, the more impatient we've become. We want to see advanced NLP technologies get out into great products, as the basis of great businesses. We built spaCy to make that happen.

Easy and Powerful

For any NLP task, there are always lots of competing algorithms. We don't believe in implementing them all and letting you choose. Instead, we just implement one – the best one. When better algorithms are developed, we can update the library without breaking your code or bloating the API. This approach makes spaCy both easier and more powerful than a pluggable architecture. spaCy also features a unique whole-document design. Where other NLP libraries rely on sentence detection as a pre-process, spaCy reads the whole document at once, making it much more robust to informal and poorly formatted text.
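
For instance, sentence boundaries fall out of the whole-document parse rather than a separate pre-processing step. A minimal sketch, assuming the English model data has been downloaded:

import spacy

nlp = spacy.load('en')
# No separate sentence detector runs first: sentence boundaries are
# read off the parse of the whole document, even for messy text.
doc = nlp(u'this text has no capitals or full stops so a sentence '
          u'detector would struggle but the parser still copes')
for sentence in doc.sents:
    print(sentence.text)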

Permissive open-source license (MIT)

We think spaCy is valuable software, so we made it free, to raise its value even higher. Making spaCy open-source puts us on the same side – we can tell you everything about how it works, and let you run it however you like. We think the software would be much less valuable as a service, which could disappear at any point.
# pip install spacy && python -m
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en')
# Process a document, of any size
text = open('war_and_peace.txt').read()
doc = nlp(text)

from spacy.attrs import *
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
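# A hedged sketch of one use for the array: count the most frequent
# lower-cased word forms. Column 0 holds LOWER, matching the order of
# the attrs list passed to to_array above.
import numpy
word_ids, counts = numpy.unique(np_array[:, 0], return_counts=True)
# Integer IDs map back to strings through the shared string store
freqs = {nlp.vocab.strings[int(i)]: int(n) for i, n in zip(word_ids, counts)}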

from reddit_corpus import RedditComments
reddit = RedditComments('/path/to/reddit/corpus')
# Parse a stream of documents, with multi-threading (no GIL!)
# Processes over 100,000 tokens per second.
for doc in nlp.pipe(reddit.texts, batch_size=10000, n_threads=4):
    # Multi-word expressions, such as names, dates etc
    # can be merged into single tokens
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.ent_type_)
    # Efficient, lossless serialization --- all annotations
    # saved, same size as uncompressed text
    byte_string = doc.to_bytes()
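
The byte string can be loaded back into a document later. This is a hedged sketch, assuming the from_bytes counterpart of to_bytes and an empty Doc constructed from the shared vocab:

from spacy.tokens import Doc
# Restore all annotations from the serialized bytes; because the new
# Doc shares nlp.vocab, integer IDs resolve to the same strings.
new_doc = Doc(nlp.vocab)
new_doc.from_bytes(byte_string)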

State-of-the-art speed and accuracy

spaCy is committed to rigorous evaluation under standard methodology. Two peer-reviewed papers in 2015 confirm that it offers the fastest syntactic parser in the world and that its accuracy is within 2% of the best available [1], [2], [3].

[Benchmark table: System, Language, Accuracy, Speed (WPS)]