The Berlin-based company Explosion AI has released version 3.2 of the Python Natural Language Processing (NLP) library spaCy. The update promises developers more performance, especially on Nvidia GPUs and on Mac computers with Apple's M1 CPU, and the spaCy team has also improved usability and added new functionality around fastText.
The new release of the NLP library is said to run up to eight times faster on M1 Macs than on comparable systems. The prerequisite, however, is the thinc-apple-ops package, which plugs spaCy into Apple's native Accelerate framework and is specially tailored for matrix multiplication.
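According to spaCy's packaging, the Apple-specific ops can be pulled in alongside spaCy via a pip extra:

```shell
# Install spaCy together with thinc-apple-ops for M1 acceleration
pip install 'spacy[apple]'
```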
More compact models
With floret, Explosion AI recently presented an extended version of fastText. floret, which can easily be integrated via a Python wrapper, combines the subwords known from fastText with Bloom embeddings to produce compact full vector tables. Thanks to the subwords, out-of-vocabulary (OOV) words are no longer an issue, and thanks to the Bloom embeddings the vector table stays small, with fewer than 100,000 entries. Agglutinative languages such as Finnish and Korean benefit most, as two examples in the floret demo project pipelines/floret_vectors_demo on GitHub make clear. Combined with HashEmbed in tok2vec, Bloom embeddings have already made for compact spaCy models.
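The idea behind Bloom embeddings can be sketched in a few lines of plain Python. This is a simplified illustration, not floret's actual implementation: each word is hashed several times into a small table and the selected rows are summed, so every word, including ones never seen before, maps to some vector.

```python
import hashlib

TABLE_ROWS = 1000  # small vector table (floret keeps it under 100,000 entries)
DIM = 4            # toy embedding dimension

# Toy table: row i holds a deterministic pseudo-random vector.
table = [[((i * 31 + j * 17) % 97) / 97.0 for j in range(DIM)]
         for i in range(TABLE_ROWS)]

def bloom_vector(word, num_hashes=2):
    """Sum several hashed rows of the table to form the word's embedding."""
    vec = [0.0] * DIM
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{word}".encode()).hexdigest()
        row = int(digest, 16) % TABLE_ROWS
        vec = [v + t for v, t in zip(vec, table[row])]
    return vec

# Every string gets a vector -- there are no out-of-vocabulary failures:
print(bloom_vector("Helsinki"))
print(bloom_vector("Helsingissä"))  # inflected Finnish form still gets a vector
```

Because several words can share table rows, the table can stay far smaller than the vocabulary, which is exactly the compactness the article describes.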
Also new in spaCy 3.2 is support for Doc objects as input to nlp and nlp.pipe. If these containers for linguistic annotations are passed to the pipeline instead of a string, the tokenizer is skipped, which makes it easier to set custom extension attributes before processing or to create Doc objects with custom tokenization, as the following listing shows.
# Process a Doc object
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
Doc.set_extension("text_id", default=None)
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)
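The same mechanism works for batches: Doc objects can be fed to nlp.pipe directly. The following sketch uses a blank English pipeline so no model download is needed; the text_id extension is the article's example attribute and must be registered before use.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # blank pipeline, no trained model required
# Register the custom extension attribute before assigning it.
Doc.set_extension("text_id", default=None)

ids = (500, 501)
docs = [nlp.make_doc(f"This is text {i}.") for i in ids]
for i, doc in zip(ids, docs):
    doc._.text_id = i

# New in spaCy 3.2: Doc objects passed to nlp.pipe skip the tokenizer,
# and the extension attributes set above survive processing.
for doc in nlp.pipe(docs):
    print(doc._.text_id, doc.text)
```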
A summary of the most important new features in spaCy 3.2 can be found on the Explosion AI blog. The release notes on GitHub provide a complete overview of all the details.
(map)