The Berlin-based company Explosion AI has released version 3.2 of the Python Natural Language Processing (NLP) library spaCy. The update promises developers better performance – especially on Nvidia GPU hardware and on Mac computers with Apple's M1 CPU – and the spaCy team has also improved usability and added new functionality around fastText.
The new release of the NLP library should work up to 8 times faster on M1 Mac computers than on comparable systems. The prerequisite, however, is the package thinc-apple-ops, which uses Apple's native Accelerate library and is specially tailored for matrix multiplication.
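According to the spaCy installation documentation, the package can be pulled in via the apple extra (shown here as a sketch; exact extras may vary by version):

```shell
# Install spaCy together with thinc-apple-ops on an M1 Mac
pip install 'spacy[apple]'
```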
More compact models
With the release of floret, Explosion AI recently presented an extended version of fastText. floret, which can be easily integrated via a Python wrapper, combines the subword units known from fastText with Bloom embeddings to produce compact full-word vectors. Thanks to the subwords there are no out-of-vocabulary (OOV) words, and thanks to the Bloom embedding the vector table can be kept small, at fewer than 100,000 entries. Agglutinating languages such as Finnish and Korean benefit most from this, as two examples in the pipelines/floret_vectors_demo project on GitHub illustrate. Combined with tok2vec, Bloom embeddings have already enabled compact spaCy models.
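The core trick behind Bloom embeddings can be sketched in a few lines of plain Python (a toy illustration, not floret's actual implementation): every word is hashed with several seeds into a small vector table and the resulting rows are summed, so an unbounded vocabulary shares a fixed, small number of entries and no word is ever out of vocabulary.

```python
import hashlib

TABLE_SIZE = 1000   # toy size; floret keeps real tables under 100,000 entries
DIM = 4             # toy vector dimensionality
HASH_COUNT = 2      # rows combined per word (floret exposes a similar knob)

# Toy vector table filled with deterministic pseudo-random values
table = [[((r * 31 + c) % 7) / 7.0 for c in range(DIM)]
         for r in range(TABLE_SIZE)]

def bucket(word: str, seed: int) -> int:
    """Hash a word into a row index, using the seed to get independent hashes."""
    digest = hashlib.md5(f"{seed}:{word}".encode()).hexdigest()
    return int(digest, 16) % TABLE_SIZE

def embed(word: str) -> list[float]:
    """Sum HASH_COUNT rows of the shared table to form the word's vector."""
    rows = [table[bucket(word, seed)] for seed in range(HASH_COUNT)]
    return [sum(values) for values in zip(*rows)]

# Even a long, unseen Finnish compound gets a vector -> no OOV words
print(embed("Konepajateollisuudessa"))
```

Because collisions are spread across several hash functions, two words rarely share all of their rows, which is why a table far smaller than the vocabulary still yields mostly distinct vectors.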
Also new in spaCy 3.2 is support for Doc objects as input to the pipeline and to nlp.pipe. If these containers for linguistic annotations are passed in instead of a string, the tokenizer is skipped, which – as the following listing shows – makes it easier to set user-defined extensions before processing or to create a Doc with custom tokenization.
# Process a doc object
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
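The same pattern extends to batch processing with nlp.pipe. The following self-contained sketch uses a blank English pipeline and a hypothetical text_id extension (any custom attribute works the same way); a trained pipeline would be used identically:

```python
import spacy
from spacy.tokens import Doc

# Register a hypothetical custom extension attribute on Doc
Doc.set_extension("text_id", default=None)

nlp = spacy.blank("en")  # blank pipeline for illustration

# Create Docs up front, attach metadata, then feed the Docs to the pipeline.
# Since spaCy 3.2, Doc input skips the tokenizer.
docs = [nlp.make_doc(f"This is text {i}.") for i in (1, 2)]
for i, doc in zip((1, 2), docs):
    doc._.text_id = i

for doc in nlp.pipe(docs):
    print(doc._.text_id, doc.text)
```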
A summary of the most important innovations in spaCy 3.2 can be found on the Explosion AI blog. A complete overview of all details is available in the release notes on GitHub.