Built for Production
Most AI software is built for research. Over the last ten years, we've used a lot of that software, and built some of it ourselves, especially for natural language processing (NLP). But the faster the research has moved, the more impatient we've become. We want to see advanced NLP technologies get out into great products, as the basis of great businesses. We built spaCy to make that happen.
Easy and Powerful
For any NLP task, there are always lots of competing algorithms. We don't believe in implementing them all and letting you choose. Instead, we just implement one – the best one. When better algorithms are developed, we can update the library without breaking your code or bloating the API. This approach makes spaCy both easier and more powerful than a pluggable architecture. spaCy also features a unique whole-document design. Where other NLP libraries rely on sentence detection as a pre-process, spaCy reads the whole document at once, making it much more robust to informal and poorly formatted text.
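To illustrate why sentence detection as a pre-process is fragile, here is a toy splitter of the kind spaCy avoids. This is not spaCy code; the regex heuristic and the example text are ours, chosen only to show how abbreviation-heavy informal text defeats a boundary-guessing pre-process.

```python
import re

# A naive pre-processing sentence splitter: break on '.', '!' or '?'
# followed by whitespace. Many pipelines start with something like this.
def naive_split(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Abbreviation-heavy text breaks the heuristic:
text = "Dr. Smith arrived at 9 a.m. He left by noon."
parts = naive_split(text)
# "Dr." is wrongly treated as a sentence boundary, so we get
# three fragments instead of two.
print(parts)
```

Because every later component sees these broken "sentences", one bad split early on corrupts tagging and parsing downstream; reading the whole document at once avoids committing to boundaries before the model has any evidence.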
Permissive open-source license (MIT)
We think spaCy is valuable software, so we made it free, to raise its value even higher. Making spaCy open-source puts us on the same side – we can tell you everything about how it works, and let you run it however you like. We think the software would be much less valuable as a service, which could disappear at any point.
# pip install spacy && python -m spacy.en.download
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en')

# Process a document, of any size
text = open('war_and_peace.txt').read()
doc = nlp(text)

from spacy.attrs import *
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

from reddit_corpus import RedditComments
reddit = RedditComments('/path/to/reddit/corpus')
# Parse a stream of documents, with multi-threading (no GIL!)
# Processes over 100,000 tokens per second.
for doc in nlp.pipe(reddit.texts, batch_size=10000, n_threads=4):
    # Multi-word expressions, such as names, dates etc
    # can be merged into single tokens
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.ent_type_)
    # Efficient, lossless serialization --- all annotations
    # saved, same size as uncompressed text
    byte_string = doc.to_bytes()
State-of-the-art speed and accuracy
spaCy is committed to rigorous evaluation under standard methodology. Two peer-reviewed papers in 2015 confirm that it offers the fastest syntactic parser in the world and that its accuracy is within 2% of the best available.