In a blog post published last week, Google AI introduced TensorStore, an open-source C++ and Python library for storing and manipulating n-dimensional data. The library aims to address key engineering challenges in scientific computing through better management and processing of large datasets.
Many contemporary applications in computer science and machine learning (ML) manipulate multidimensional datasets defined over a single, expansive coordinate system. One example is estimating the weather from air measurements over a geographic grid; another is making medical-imaging predictions from multi-channel image intensity values in a 2D or 3D scan.
A single dataset in these settings may require petabytes of storage, and working with such datasets is challenging, as users may read and write data at varying scales and unpredictable intervals.
Researchers at Google AI say that TensorStore has already been used to solve key engineering challenges in managing and processing large datasets in neuroscience, such as petabyte-scale 3D electron microscopy data and “4D” videos of neuronal activity.
Additionally, the library was used in the creation of PaLM, a large-scale machine learning model, to address the problem of managing model parameters (checkpoints) during distributed training.
The library natively supports storage systems such as Google Cloud Storage, HTTP servers, and local and network filesystems, and offers a unified API for reading and writing diverse array formats such as zarr and N5. It also provides read/writeback caching and transactions with strong atomicity, consistency, isolation, and durability (ACID) guarantees. Furthermore, it supports safe, efficient access from multiple processes and machines via optimistic concurrency.
TensorStore also offers an asynchronous API that enables high-throughput access even to high-latency remote storage, along with a simple Python API for loading and manipulating large array data. For example, a TensorStore object can be created to represent a 56-trillion-voxel 3D image of a fly brain, from which a small 100×100 patch of the data is accessed as a NumPy array:
(Code example image. Source: Google AI Blog)
The blog claims, “No actual data is accessed or stored in memory until the specific 100×100 slice is requested; hence arbitrarily large underlying datasets can be loaded and manipulated without having to store the entire dataset in memory, using indexing and manipulation syntax largely identical to standard NumPy operations.”