“A few years ago, we noticed a pattern: teams were getting overburdened with increasing costs of maintaining their feature preparation pipelines,” LinkedIn wrote in a blog post introducing its feature store, Feathr. The feature store is a relatively recent concept that helps teams build better machine learning pipelines. A major chunk of a data scientist’s time goes into wrangling data and preparing it for analysis rather than building models, and feature stores go a long way towards solving this problem.
Uber introduced the first feature store, Michelangelo Palette, in 2017. Databricks recently announced that its feature store is generally available, and LinkedIn open-sourced Feathr a month ago. All the major tech firms have feature stores of their own, such as the Amazon SageMaker Feature Store, Vertex AI Feature Store (Google), ML Lake (Salesforce), and Overton (Apple).
What is a feature store?
A feature store enables the discovery, documentation and reuse of features. It is a feature computation and storage service that lets features be registered, discovered, and consumed both by ML training pipelines and by online applications for model inference. Feature stores hold feature data and offer low-latency access to it for online applications, and they ensure consistent feature computation across the batch and serving APIs.
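As a rough sketch of the pattern described above, a minimal in-memory store would combine a registry (for discovery and documentation) with an online table (for low-latency lookups). All names here are illustrative, not the API of any real feature store:

```python
from dataclasses import dataclass

# Illustrative sketch only -- not a real feature-store library.
@dataclass
class Feature:
    name: str          # e.g. "user_clicks_7d"
    dtype: type        # common type system across models
    description: str   # documentation aids discovery and reuse

class MiniFeatureStore:
    """Toy feature store: a registry plus an online key-value table."""

    def __init__(self):
        self.registry = {}  # feature name -> Feature (discovery/documentation)
        self.online = {}    # (entity_id, feature name) -> latest value

    def register(self, feature):
        """Make a feature discoverable and reusable across projects."""
        self.registry[feature.name] = feature

    def ingest(self, entity_id, values):
        """One computation path feeds serving, avoiding training/serving skew."""
        for name, value in values.items():
            assert name in self.registry, f"unregistered feature: {name}"
            self.online[(entity_id, name)] = value

    def get_online_features(self, entity_id, names):
        """Low-latency lookup for online model inference."""
        return {n: self.online.get((entity_id, n)) for n in names}

store = MiniFeatureStore()
store.register(Feature("user_clicks_7d", int, "clicks in the last 7 days"))
store.ingest("u1", {"user_clicks_7d": 42})
features = store.get_online_features("u1", ["user_clicks_7d"])
```

A production store layers the same ideas over durable offline storage (for training sets) and a low-latency online database (for serving).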
Most feature stores are still proprietary
Recently, LinkedIn announced that it is open-sourcing its feature store, Feathr, but most feature stores today remain proprietary. Popular open-source alternatives include Feast and Hopsworks. Feast was developed jointly by GO-JEK and Google Cloud to let teams store and discover features for use in machine learning projects. Hopsworks started as a collaborative project between KTH University, RISE, and Logical Clocks; its feature store is a data management system for machine learning features, covering both the feature engineering code and the feature data.
Why do we need them?
Priyanka Vergadia, developer advocate at Google, explains in a video the need for feature stores and the ML challenges they solve: “Most of the time spent by data scientists goes into wrangling data, more specifically, in feature engineering, which is transforming raw data into high-quality input signals for ML models. But this process is often inefficient and brittle.”
As she points out, ML features pose several challenges:
- They are hard to use and reuse across different steps of the ML workflow and across projects, which leads to duplicated effort.
- They are hard to serve reliably in production at low latency.
- Feature values often skew inadvertently between training and serving, which causes model quality to degrade over time.
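The skew problem in the last point can be made concrete: if the batch training pipeline and the online service both call one shared transformation, the feature is computed identically in both paths. A minimal sketch (all names hypothetical):

```python
# Sketch: one shared transformation used by both training and serving paths.
# Names and data are illustrative, not from any specific feature store.

def days_since(event_ts, now_ts):
    """Feature transformation: days elapsed since the last purchase (Unix seconds)."""
    return (now_ts - event_ts) / 86400.0

# Training path: compute the feature over historical rows.
history = [{"user": "u1", "last_purchase_ts": 1_000_000, "label": 1}]
train_rows = [
    {"days_since_purchase": days_since(r["last_purchase_ts"], 1_086_400),
     "label": r["label"]}
    for r in history
]

# Serving path: the SAME function computes the feature at inference time,
# so there is no opportunity for the two code paths to drift apart.
online_feature = days_since(1_000_000, 1_086_400)

assert train_rows[0]["days_since_purchase"] == online_feature  # no skew
```

Skew typically appears when the training pipeline (say, in SQL) and the serving code (say, in Python) each reimplement this logic; a feature store centralises it instead.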
Better performance and reusability are key attractions
Due to the sheer volume of data and the number of models big companies handle these days, feature stores have become a requisite for them. Hence, we are seeing more and more feature stores emerge.
- Smoother and faster deployment – A data scientist’s primary job is to build models that meet an organisation’s business needs, but they are often burdened with data engineering configuration instead. A feature store provides a consistent feature set, enabling a smoother deployment process.
- The biggest advantage of a feature store is reusability. Because the store keeps metadata alongside the actual features, data scientists can identify which features performed well in existing models. They can also share features with their team members, which encourages collaboration and avoids duplication.
- Feature stores help standardise feature definitions. If there is no common abstraction for features, there is no uniform way to name features across models, no common type system for features, and no standard way to deploy and serve features in production.
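As an illustration of what such standardisation might look like (the naming rule and type list below are assumptions for the sketch, not conventions from any particular feature store), a registration-time check can enforce a shared naming scheme and type system:

```python
import re
from dataclasses import dataclass

# Illustrative sketch of standardised feature definitions: a naming
# convention and a common type system, enforced when a feature is registered.

NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)+$")  # e.g. user_clicks_7d
ALLOWED_DTYPES = {"int64", "float64", "string", "bool"}

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    dtype: str
    owner: str  # team responsible, which aids reuse across projects

def validate(defn):
    """Reject definitions that break the shared convention."""
    if not NAME_PATTERN.match(defn.name):
        raise ValueError(f"non-standard feature name: {defn.name}")
    if defn.dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unknown dtype: {defn.dtype}")
    return defn

validate(FeatureDefinition("user_clicks_7d", "int64", "growth-team"))  # passes
```

With a gate like this, every model sees the same names and types for a given feature, and deployment tooling can treat all features uniformly.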