AWS Announces Amazon S3 Plugin For PyTorch

Data in S3 buckets can be used directly with PyTorch dataset and data loader APIs, without first downloading it to local storage.

AWS recently announced the release of the Amazon S3 plugin for PyTorch, an open-source library built for the deep learning framework PyTorch that streams data from Amazon Simple Storage Service (Amazon S3).

The feature is available in PyTorch Deep Learning Containers, where training jobs can read objects from S3 buckets through the standard PyTorch dataset and data loader APIs instead of copying them to local storage beforehand.

The plugin can also transfer data from Amazon S3 in parallel when maximum performance is needed, without the user having to worry about thread safety or managing multiple connections to Amazon S3. You can also stream data from .zip or .tar archives and shuffle the dataset within or across shards as required. The Amazon S3 plugin for PyTorch offers the following benefits:

Support for both map-style and iterable-style dataset interfaces – PyTorch supports two types of datasets, and the Amazon S3 plugin for PyTorch lets you use either interface depending on your needs:

  • Map-style dataset – Represents a map from indexes or keys to data samples. It provides random access capabilities.
  • Iterable-style dataset – Represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable and where the batch size depends on the fetched data.
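
The difference between the two interfaces can be sketched in plain Python. This is an illustration only: it does not import PyTorch or the plugin, `FAKE_BUCKET` is a hypothetical in-memory stand-in for an S3 bucket, and the two class names are made up for this example. The methods are the ones PyTorch's map-style datasets (`__getitem__`, `__len__`) and iterable-style datasets (`__iter__`) are built around.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory stand-in for an S3 bucket: object key -> raw bytes.
FAKE_BUCKET: Dict[str, bytes] = {
    "train/0.txt": b"sample zero",
    "train/1.txt": b"sample one",
    "train/2.txt": b"sample two",
}


class MapStyleObjects:
    """Map-style: random access by integer index via __getitem__ and __len__."""

    def __init__(self, bucket: Dict[str, bytes]) -> None:
        self._bucket = bucket
        # Sort the keys once so that a given index always maps to the same object.
        self._keys: List[str] = sorted(bucket)

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        key = self._keys[index]
        return key, self._bucket[key]


class IterableStyleObjects:
    """Iterable-style: sequential access only, via __iter__."""

    def __init__(self, bucket: Dict[str, bytes]) -> None:
        self._bucket = bucket

    def __iter__(self) -> Iterator[Tuple[str, bytes]]:
        # Objects are yielded one after another; there is no way to jump ahead.
        for key in sorted(self._bucket):
            yield key, self._bucket[key]


map_ds = MapStyleObjects(FAKE_BUCKET)
iter_ds = IterableStyleObjects(FAKE_BUCKET)

random_sample = map_ds[2]   # jump straight to the third object
streamed = list(iter_ds)    # walk every object in order
```

A map-style dataset suits data that can be fetched cheaply by key, while the iterable style fits streamed sources such as large archives, where only a forward pass over the data is practical.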

Support for various data formats – Training data can be in a variety of different formats, such as CSV, Parquet, and JPEG. This plugin is file-format agnostic and presents objects in Amazon S3 as a binary buffer (blob). Thus, you can apply any additional transformations to the data received from Amazon S3.
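
Because each object arrives as raw bytes, decoding is left to the caller. The sketch below shows one way that step might look for CSV data, using only the Python standard library. The `decode_blob` helper, the object key and the sample bytes are all hypothetical and are not part of the plugin.

```python
import csv
import io
from typing import Dict, List, Union


def decode_blob(key: str, blob: bytes) -> Union[List[Dict[str, str]], bytes]:
    """Turn a binary buffer into something usable, based on the object key.

    Hypothetical helper: the plugin only hands over bytes, so this kind of
    transformation is the caller's responsibility.
    """
    if key.endswith(".csv"):
        text = blob.decode("utf-8")
        # DictReader uses the first row as column names.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    # Other formats (for example JPEG or Parquet) are returned untouched so
    # that a format-specific library can decode them later.
    return blob


csv_blob = b"label,value\ncat,1\ndog,2\n"
rows = decode_blob("data/part-0.csv", csv_blob)
image_bytes = decode_blob("data/img-0.jpg", b"\xff\xd8\xff")
```

Note that CSV values come back as strings, so any numeric conversion would be a further transformation applied on top.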

Support for shuffling – In deep learning, you may need to shuffle data across and within shards to reduce variance. This plugin provides a way to shuffle data in-memory within shards using ShuffleDataset or across shards by providing the input parameter shuffle_urls while extending S3IterableDataset.
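
The two levels of shuffling can be illustrated without the plugin. The sketch below is a plain-Python approximation, not the plugin's implementation: the shard URLs, the `SHARDS` mapping and the `stream_samples` function are hypothetical. Shuffling the URL order stands in for shuffling across shards, and a small in-memory buffer stands in for shuffling within a shard.

```python
import random
from typing import Dict, Iterator, List

# Hypothetical shards: URL -> samples stored in that shard.
SHARDS: Dict[str, List[int]] = {
    "s3://example-bucket/shard-0.tar": [0, 1, 2, 3],
    "s3://example-bucket/shard-1.tar": [4, 5, 6, 7],
    "s3://example-bucket/shard-2.tar": [8, 9, 10, 11],
}


def stream_samples(
    shards: Dict[str, List[int]],
    shuffle_shards: bool,
    buffer_size: int,
    seed: int,
) -> Iterator[int]:
    """Yield every sample once, shuffled across and within shards."""
    rng = random.Random(seed)

    urls = list(shards)
    if shuffle_shards:
        rng.shuffle(urls)  # across shards: randomise the order shards are read in

    for url in urls:
        buffer: List[int] = []
        for sample in shards[url]:
            buffer.append(sample)
            if len(buffer) >= buffer_size:
                # Within a shard: emit a random element of the in-memory buffer.
                yield buffer.pop(rng.randrange(len(buffer)))
        # The shard is exhausted, so drain whatever is still buffered.
        rng.shuffle(buffer)
        yield from buffer


result = list(stream_samples(SHARDS, shuffle_shards=True, buffer_size=3, seed=0))
```

A larger buffer gives a more thorough shuffle within each shard at the cost of more memory, and a fixed seed keeps the order reproducible between runs.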

One can find the configuration, library and detailed information here.

A few days earlier, Amazon Web Services announced the general availability of Amazon FSx for NetApp ONTAP, a new storage service that, for the first time, lets customers launch and run complete, fully managed NetApp ONTAP file systems in the cloud.

Kumar Gandharv
Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech journalist at AIM. He is a keen observer of national and IR-related news.