spaCy, the open-source software library for advanced natural language processing, released its third major version this year. Version 3.0 offers state-of-the-art transformer-based pipelines and pre-trained models in seventeen languages.
The first version of spaCy was a preliminary release with little support for deep-learning workflows. The second version introduced convolutional neural network models in seven different languages. The third version is a major improvement over both.
Version 3.0 has completely dropped support for Python 2 and requires Python 3.6 or higher. The major features and updates are listed below.
Transformer-based pipelines
The new version of spaCy offers state-of-the-art transformer-based pipelines. Users can train their pipeline on top of any pre-trained transformer, and a single transformer can be shared between multiple components through multi-task learning. Because spaCy interoperates with PyTorch and the Hugging Face transformers library, users also get access to thousands of pre-trained models for their pipelines.
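To make the idea of a shared transformer more concrete, here is a minimal sketch of how such a setup might look in a pipeline config, parsed with Thinc's Config class. It is an illustrative excerpt rather than a complete, trainable config: the architecture names and the bert-base-cased model name are assumptions chosen for the example, and the remaining model settings are left out.

```python
from thinc.api import Config

# Illustrative excerpt only: architecture and model names are assumptions,
# and the remaining settings of each component are omitted.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["transformer", "tagger", "ner"]

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
"""

# Parsing only reads the text into nested dictionaries; nothing is downloaded
# or built at this point.
config = Config().from_str(CONFIG_TEXT)

# Components whose token-to-vector layer listens to the shared transformer.
listeners = [
    name
    for name, component in config["components"].items()
    if component.get("model", {})
    .get("tok2vec", {})
    .get("@architectures", "")
    .endswith("TransformerListener.v1")
]
print(listeners)
```

In this sketch the tagger and the entity recogniser do not each load their own transformer. Both point to a listener layer, so the single transformer component earlier in the pipeline serves them both.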
Model customisation
Users can implement their own architectures via Thinc, spaCy's machine learning library, which provides various layers and utilities. Thinc also offers thin wrappers around frameworks such as PyTorch, MXNet, and TensorFlow. All component models follow the same unified Model API, and a single model can combine implementations from different frameworks. Together, these features make it easy to customise the neural network models used by the pipeline components.
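As a rough illustration of the unified Model API, the sketch below composes a small feed-forward network from Thinc layers. The layer sizes and the dummy data are arbitrary assumptions made for the example, not values taken from any spaCy pipeline.

```python
import numpy
from thinc.api import Relu, Softmax, chain

# Compose three layers into one Model object: two hidden layers, then a
# softmax output over 3 classes. The sizes are arbitrary example values.
model = chain(Relu(nO=16), Relu(nO=16), Softmax(nO=3))

# Dummy data: 4 examples with 10 features each, and 3 output classes.
X = numpy.zeros((4, 10), dtype="float32")
Y = numpy.zeros((4, 3), dtype="float32")

# Passing sample data lets Thinc infer the missing input dimensions.
model.initialize(X=X, Y=Y)

# Every Thinc model exposes the same predict() method, however it was built.
probs = model.predict(X)
print(probs.shape)
```

A layer wrapped from another framework, for instance through Thinc's PyTorchWrapper, is also a Model object, so it should be possible to pass it to chain in the same way as the native layers here.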
Ray
Ray is a fast and simple framework for building and running distributed applications. Using Ray can speed up the training process by running spaCy training on one or more remote machines. spaCy has a lightweight extension package, spacy-ray, that automatically adds the ray command to the spaCy CLI when it is installed in the same environment.
Managing end-to-end workflows
The new spaCy projects feature makes it easy to manage and share end-to-end spaCy workflows for various use cases and domains. It allows users to train, package and serve their custom pipelines. With the new version, users can integrate their workflows with other tools in the machine learning ecosystem, such as Prodigy for creating labelled data, DVC for data version control, Ray for parallel training, and FastAPI for serving models in production.
Training workflow and configuration system
With spaCy 3.0, users get a comprehensive and extensible system for configuring training runs. A single configuration file holds every detail of a training run, with no hidden defaults. This makes it easy to rerun an experiment, to track changes in the training procedure, and to plug in customised models and architectures.
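One way to see the "no hidden defaults" idea in practice is to inspect the config that spaCy keeps for a pipeline object. The sketch below assumes a blank English pipeline with only the built-in sentencizer added, so no trained model has to be downloaded.

```python
import spacy

# A blank pipeline needs no downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The pipeline carries its full configuration, including the settings of
# every component that has been added.
config = nlp.config
print(config["nlp"]["pipeline"])
print(config["components"]["sentencizer"])

# The same settings serialised as text, in the format used by config files.
config_text = config.to_str()
print(config_text)
```

The printed text lists each section and value explicitly, which is what allows a run to be reproduced or compared against a later one.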
New and updated documentation
This version comes with various new and rewritten documentation pages, including a guide on embeddings, transformers and transfer learning, a guide on training pipelines and models that was rewritten from scratch, and a page explaining the new spaCy projects. The documentation also contains API reference pages on spaCy’s machine learning model architectures and the expected data formats. API pages about pipeline components include additional information, such as the default config and implementation.
Backwards incompatibility
spaCy 3.0 has kept breaking changes to a minimum and focuses on essential changes to support new features, fix problems in the platform and improve the user experience.
Pipeline component APIs
The latest version makes it easier to define, configure, train, and analyse pipeline components. Any custom component can be included during training, and sourcing components from existing trained pipelines lets users mix and match parts to build custom pipelines.
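A minimal sketch of the component API might look like this. The component name token_counter and the n_tokens key are hypothetical names invented for the example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


# Register a function under a string name so it can be added to a pipeline.
# "token_counter" is a hypothetical name used only for this example.
@Language.component("token_counter")
def token_counter(doc: Doc) -> Doc:
    # Store the number of tokens on the document and pass it along.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("token_counter")  # components are added by their string name

doc = nlp("spaCy makes pipelines easy to extend.")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```

Because the component is referred to by name, the same name can also appear in a config file, which is how a custom component can take part in training.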
Type hints and type-based data validation
The platform has dropped support for Python 2 and now requires Python 3.6 or higher, which means it can take full advantage of type hints. The parts of spaCy’s user-facing API that are implemented in pure Python come with type hints. Coupled with the new version of Thinc, spaCy’s ML library, this gives the platform extensive type support.
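The sketch below shows how type hints might be used on a configurable component. The factory name length_flagger, the LengthFlagger class and the min_length setting are all hypothetical, and the example only exercises a valid setting rather than demonstrating a validation failure.

```python
from typing import List

import spacy
from spacy.language import Language
from spacy.tokens import Doc


class LengthFlagger:
    """Hypothetical component that records tokens of a minimum length."""

    def __init__(self, min_length: int):
        self.min_length = min_length

    def __call__(self, doc: Doc) -> Doc:
        long_tokens: List[str] = [
            token.text for token in doc if len(token) >= self.min_length
        ]
        doc.user_data["long_tokens"] = long_tokens
        return doc


# The type hint on min_length documents what kind of value the setting
# expects; spaCy can use such hints to check values supplied via the config.
@Language.factory("length_flagger", default_config={"min_length": 5})
def create_length_flagger(nlp: Language, name: str, min_length: int) -> LengthFlagger:
    return LengthFlagger(min_length)


nlp = spacy.blank("en")
nlp.add_pipe("length_flagger", config={"min_length": 8})

doc = nlp("Transformers improve accuracy")
print(doc.user_data["long_tokens"])
```

The default value is declared in one place, next to the factory, and overriding it when the component is added keeps the chosen setting visible in the pipeline config.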
Wrapping up
spaCy 3.0 offers retrained model families for more than 18 languages and 58 trained pipelines, including five transformer-based pipelines. With the platform, you can manage end-to-end, multi-step workflows from pre-processing data to model deployment. The platform also offers pre-built, efficient binary wheels for all pipeline models, along with various new methods and commands to train models.