Last December, AWS launched Amazon SageMaker Feature Store — a fully managed and purpose-built repository to store, update, retrieve and share machine learning features for training and inference with low latency.
Many large tech companies, including Uber, Airbnb, Twitter, Facebook, and Netflix, already have feature stores. Companies without feature stores often end up with a lot of duplicated work, said Harish Doddi, co-founder and CEO of Datatron, on the O’Reilly Data Show Podcast.
A feature store has a key role to play in the ML workflow. The table below compares feature stores currently on the market.
For those unaware, a feature store is a data warehouse of features for machine learning purposes. Unlike a traditional data warehouse, a feature store is backed by two databases: one serves features at low latency to online applications, while the other stores large volumes of features that data scientists use to build training datasets.
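To make the dual-database idea concrete, here is a minimal in-memory sketch in Python. It is a toy illustration under stated assumptions, not how any particular product is implemented: the class `ToyFeatureStore`, its method names and the `avg_order_value` feature are all hypothetical. The online side keeps only the latest value per entity for fast lookups, while the offline side keeps the full history for building training data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Features = Dict[str, Any]


@dataclass
class ToyFeatureStore:
    """Hypothetical two-sided store: latest values online, full history offline."""

    online: Dict[str, Features] = field(default_factory=dict)
    offline: List[Tuple[str, float, Features]] = field(default_factory=list)

    def put(self, entity_id: str, event_time: float, features: Features) -> None:
        # The offline side is append-only, so every version is retained.
        self.offline.append((entity_id, event_time, dict(features)))
        # The online side keeps only the most recent version per entity.
        current = self.online.get(entity_id)
        if current is None or event_time >= current["event_time"]:
            self.online[entity_id] = {**features, "event_time": event_time}

    def get_online(self, entity_id: str) -> Optional[Features]:
        # Single key lookup, standing in for a low-latency serving call.
        return self.online.get(entity_id)

    def training_rows(self) -> List[Features]:
        # Full history ordered by event time, standing in for a batch export.
        ordered = sorted(self.offline, key=lambda rec: (rec[1], rec[0]))
        return [{**f, "entity_id": e, "event_time": t} for e, t, f in ordered]


store = ToyFeatureStore()
store.put("user_1", 1.0, {"avg_order_value": 20.0})
store.put("user_1", 2.0, {"avg_order_value": 25.0})

latest = store.get_online("user_1")   # served to an online application
history = store.training_rows()       # used to assemble a training dataset
```

Because both sides are written through the same `put` call, the value served online and the values seen in training come from the same source, which is the consistency benefit feature stores aim for.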
Source: Featurestore.org (Feature Store Vs Data Warehouse)
Feature stores help data professionals deploy machine learning applications faster. Besides feature stores such as Amazon’s, the ML tooling landscape includes data workflow management tools such as Apache Airflow (originally developed at Airbnb), Apache Kafka and Kubeflow, alongside ML tracking platforms like Comet ML, Neptune AI, and Weights and Biases.
Google & Microsoft: Late bloomers?
Last year, Google announced that a managed feature store would come to the Google Cloud AI platform by the end of 2020. The AI platform would be centred on Kubeflow and provide access to Google’s machine learning framework TensorFlow, the BigQuery data store and cloud-based Tensor Processing Units (TPUs).
At the time, Google had said its feature store would offer a centralised, organisation-wide repository of historical and new feature values which can be reused as desired by ML teams worldwide. Google Cloud AI’s Craig Wiley said, “this will boost users’ productivity by eliminating redundant steps in feature engineering.”
However, Google has yet to officially launch its feature store on the Google Cloud AI platform.
Microsoft does not have a feature store either. However, it recently announced a partnership with data company Logical Clocks, which said it now offers full support for Microsoft Azure on its cloud-managed data platform, Hopsworks. This complements the platform’s existing support for AWS.
In other words, enterprises can now manage features for training and serving models at scale on the Hopsworks feature store with their existing data stores and data science platforms in their cloud of choice, be it AWS or Azure.
Hopsworks offers a free version of its tools for feature engineering, feature management and machine learning model training. Its feature store uses Azure Blob Storage for storage and integrates with ML platforms, including Databricks, SageMaker and Kubeflow.
Amazon’s SageMaker Feature Store, meanwhile, looks promising. It lets users create feature groups using a Pythonic API, with access to PyData packages, including Pandas and NumPy, from the comfort of a Jupyter notebook.
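As a rough idea of what creating a feature group from a notebook can look like, here is a hedged sketch. It assumes version 2 of the SageMaker Python SDK and its `FeatureGroup` class, and the calls may differ in other SDK versions. The helper names, the `customer_id` and `avg_order_value` columns, and the group name, S3 URI and role ARN arguments are all hypothetical placeholders. Running `create_and_ingest` requires AWS credentials and the `sagemaker` and `pandas` packages.

```python
import time
from typing import Any, Dict, List


def with_event_time(rows: List[Dict[str, Any]], now: float) -> List[Dict[str, Any]]:
    """Return copies of the rows with an `event_time` column added.

    Feature groups need an event-time feature; here it is a Unix timestamp.
    """
    return [{**row, "event_time": float(now)} for row in rows]


def create_and_ingest(rows: List[Dict[str, Any]], group_name: str,
                      s3_uri: str, role_arn: str) -> None:
    """Sketch only: create a feature group and ingest rows (assumes SDK v2)."""
    # Imported lazily so the helper above works without AWS dependencies.
    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    # Numeric columns only, so feature types can be inferred from the dtypes.
    df = pd.DataFrame(with_event_time(rows, time.time()))

    group = FeatureGroup(name=group_name, sagemaker_session=sagemaker.Session())
    group.load_feature_definitions(data_frame=df)
    group.create(
        s3_uri=s3_uri,                         # offline store location (placeholder)
        record_identifier_name="customer_id",  # hypothetical key column
        event_time_feature_name="event_time",
        role_arn=role_arn,                     # IAM role (placeholder)
        enable_online_store=True,
    )
    # Creation is asynchronous, so wait before writing records.
    while group.describe().get("FeatureGroupStatus") == "Creating":
        time.sleep(5)
    group.ingest(data_frame=df, max_workers=1, wait=True)


sample = with_event_time([{"customer_id": 1, "avg_order_value": 20.0}], now=100)
```

Only `with_event_time` is executed here. The AWS call is left as a function so that nothing touches a real account until it is invoked with real values.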
Moreover, AWS claims it provides a unified store for features during training and real-time inference, without users having to write additional code to keep features consistent. Further, it keeps track of the metadata of stored features in real time using Amazon Athena, an interactive query service.
Tech experts believe AWS’ feature store is currently feasible for primary use cases. However, we could not independently verify this, as there are few documented uses of SageMaker Feature Store in large-scale industrial applications.