AWS Announces Amazon S3 Plugin For PyTorch

The plugin lets PyTorch dataset and data loader APIs read data directly from S3 buckets, with no need to download it to local storage first.

AWS recently announced the release of the Amazon S3 plugin for PyTorch, an open-source library for streaming data from Amazon Simple Storage Service (Amazon S3) into the deep learning framework PyTorch.

The plugin is available in the PyTorch Deep Learning Containers, so data in S3 buckets can be used directly with PyTorch dataset and data loader APIs without first being downloaded to local storage.

It also provides a way to transfer data from Amazon S3 in parallel when needed to get maximum performance without worrying about thread safety or multiple connections to Amazon S3. You can also stream data from .zip or .tar archives and shuffle the dataset within or across the shards as required. The Amazon S3 plugin for PyTorch offers the following benefits:

Support for both map-style and iterable-style dataset interfaces – PyTorch supports two types of datasets, and the Amazon S3 plugin for PyTorch lets you use either interface depending on your needs:

  • Map-style dataset – Represents a map from indexes or keys to data samples. It provides random access capabilities.
  • Iterable-style dataset – Represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable and where the batch size depends on the fetched data.
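The difference between the two interfaces can be sketched in plain Python. This is only an illustration, not the plugin's code: `FAKE_BUCKET`, `MapStyleBlobs`, and `IterableStyleBlobs` are hypothetical names, and an in-memory dictionary stands in for an S3 bucket. In PyTorch itself, such classes would typically subclass `torch.utils.data.Dataset` and `torch.utils.data.IterableDataset`.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for an S3 bucket: object key -> object bytes.
FAKE_BUCKET: Dict[str, bytes] = {
    "train/0.txt": b"cat",
    "train/1.txt": b"dog",
    "train/2.txt": b"bird",
}


class MapStyleBlobs:
    """Map-style: random access by index through __getitem__ and __len__."""

    def __init__(self, bucket: Dict[str, bytes]) -> None:
        self._bucket = bucket
        self._keys: List[str] = sorted(bucket)

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        key = self._keys[index]
        return key, self._bucket[key]


class IterableStyleBlobs:
    """Iterable-style: sequential access only, through __iter__."""

    def __init__(self, bucket: Dict[str, bytes]) -> None:
        self._bucket = bucket

    def __iter__(self) -> Iterator[Tuple[str, bytes]]:
        for key in sorted(self._bucket):
            yield key, self._bucket[key]


map_ds = MapStyleBlobs(FAKE_BUCKET)
iter_ds = IterableStyleBlobs(FAKE_BUCKET)

print(map_ds[2])            # jump straight to the third object
print(next(iter(iter_ds)))  # objects arrive in stream order
```

The map-style class can serve any index on demand, while the iterable-style class only promises the next sample, which suits a source that is read as a stream.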

Support for various data formats – Training data can be in a variety of different formats, such as CSV, Parquet, and JPEG. This plugin is file-format agnostic and presents objects in Amazon S3 as a binary buffer (blob). Thus, you can apply any additional transformations to the data received from Amazon S3.
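Because each object arrives as raw bytes, decoding is left to the caller. A minimal sketch of such a transformation step follows, assuming a CSV object; the `decode_blob` helper and the idea of choosing a decoder from the key's extension are illustrative assumptions, not part of the plugin.

```python
import csv
import io
from typing import Any


def decode_blob(key: str, blob: bytes) -> Any:
    """Decode a binary blob based on the object key's extension (hypothetical convention)."""
    if key.endswith(".csv"):
        # Interpret the bytes as UTF-8 text and parse rows into dictionaries.
        text = io.StringIO(blob.decode("utf-8"))
        return list(csv.DictReader(text))
    # Unknown format: hand the raw bytes back so the caller can decide.
    return blob


rows = decode_blob("data/labels.csv", b"id,label\n1,cat\n2,dog\n")
print(rows)

raw = decode_blob("data/photo.jpg", b"\xff\xd8\xff")
print(type(raw))
```

An image or Parquet decoder could be added as another branch in the same way, using whichever library the training pipeline already depends on.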

Support for shuffling – In deep learning, you may need to shuffle data across and within shards to reduce variance. This plugin provides a way to shuffle data in-memory within shards using ShuffleDataset or across shards by providing the input parameter shuffle_urls while extending S3IterableDataset.
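The two levels of shuffling can be illustrated without any AWS dependency. The sketch below is not the plugin's `ShuffleDataset` implementation: it borrows only the `shuffle_urls` name from the description above, and the shard URLs, the `SHARDS` dictionary, and both helper functions are hypothetical.

```python
import random
from typing import Dict, Iterable, Iterator, List

# Hypothetical shards: URL -> samples stored in that shard.
SHARDS: Dict[str, List[int]] = {
    "s3://example-bucket/shard-0.tar": [0, 1, 2, 3],
    "s3://example-bucket/shard-1.tar": [4, 5, 6, 7],
    "s3://example-bucket/shard-2.tar": [8, 9, 10, 11],
}


def iterate_shards(urls: List[str], shuffle_urls: bool, rng: random.Random) -> Iterator[int]:
    """Across-shard shuffling: randomize the order in which shards are read."""
    order = list(urls)
    if shuffle_urls:
        rng.shuffle(order)
    for url in order:
        yield from SHARDS[url]


def buffered_shuffle(samples: Iterable[int], buffer_size: int, rng: random.Random) -> Iterator[int]:
    """In-memory shuffling of a stream using a bounded buffer."""
    buffer: List[int] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it, so memory stays bounded.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Flush whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


rng = random.Random(0)
stream = iterate_shards(list(SHARDS), shuffle_urls=True, rng=rng)
shuffled = list(buffered_shuffle(stream, buffer_size=4, rng=rng))
print(shuffled)
```

Every sample is still emitted exactly once; only the order changes. A larger buffer mixes samples more thoroughly at the cost of more memory.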

The configuration, library, and detailed documentation are available from AWS.

A few days earlier, Amazon Web Services announced the general availability of Amazon FSx for NetApp ONTAP, a new storage service that lets customers launch and run complete, fully managed NetApp ONTAP file systems in the cloud for the first time.

Kumar Gandharv
Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech journalist at AIM. A keen observer of national and IR-related news.
