How Amazon Aims To Take On Custom AI Training With Trainium Chips

A major announcement from the AWS re:Invent event is the launch of Trainium, the company's custom machine learning chip. It is the second ML chip from AWS after Inferentia, which was launched amid much fanfare at last year's event. While the two share the same AWS Neuron SDK, Trainium is claimed to provide better, more cost-effective performance for training ML models in the cloud. With support for TensorFlow, PyTorch and MXNet, AWS says Trainium will offer some of the best performance, with the most teraflops of computing power for machine learning in the cloud.

It can handle a variety of deep learning training workloads across applications such as image classification, translation, voice recognition, natural language processing and recommendation engines, among others.

How It Takes On Inferentia

As compute-intensive workloads multiply, the need for high-efficiency chips is growing dramatically. AWS is expanding its custom chip capabilities to meet these demands across the end-to-end ML lifecycle. Its larger aim is to make deep learning pervasive for everyday developers and to democratise access to cutting-edge infrastructure at an affordable cost.


Building on these goals, Trainium is claimed to offer the highest performance, with the most teraflops of computing power, while enabling a wide range of ML applications. Note that one teraflop corresponds to roughly one trillion floating-point operations per second.
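To put that unit in context, here is a back-of-the-envelope calculation in plain Python. The workload size used below is an illustrative assumption, not a published benchmark; the point is simply that doubling sustained teraflops halves the time needed for a fixed amount of compute.

```python
def training_time_seconds(total_flops: float, teraflops: float) -> float:
    """Seconds needed to execute `total_flops` floating-point operations
    at a sustained rate of `teraflops` TFLOP/s.
    One teraflop = 1e12 floating-point operations per second."""
    return total_flops / (teraflops * 1e12)

# Hypothetical workload of one quadrillion (1e15) floating-point operations.
workload = 1e15

print(training_time_seconds(workload, 50))   # 20.0 seconds at a sustained 50 TFLOP/s
print(training_time_seconds(workload, 100))  # 10.0 seconds at a sustained 100 TFLOP/s
```

Real training runs are far larger and rarely sustain peak throughput, but the arithmetic shows why raw teraflops translate directly into shorter, cheaper training jobs.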

Most importantly, Trainium has essentially been launched to address the shortcomings of Inferentia. While the two together provide an end-to-end flow of ML compute, from scaling training workloads to deploying accelerated inference, Trainium is positioned as the more cost-effective option for training and can handle a more extensive range of ML training workloads.


Inferentia has so far delivered good results on ML inference tasks. However, the expanding range of ML applications has increased the need for better performance in both inference and training, while keeping costs in tight check. Trainium addresses this by giving customers an end-to-end flow of ML compute, from training workloads to deploying accelerated inference. Trainium chips offer high performance, low latency and flexibility.

“While the cost of inference, which accounts for up to 90% of ML infrastructure costs, was addressed by Inferentia, many development teams are still constrained by fixed ML training budgets. This puts a limit on the extent and frequency of training necessary to enhance their models and applications. By offering the highest performance and lowest cost for cloud ML preparation, AWS Trainium answers this challenge,” stated the company.

The company believes that the combination of Trainium and Inferentia will offer an end-to-end flow of ML compute, from scaling training workloads to deploying accelerated inference.

Furthermore, AWS is collaborating with Intel to introduce EC2 instances for machine learning training based on Habana Gaudi accelerators, which are expected to deliver up to 40% better price-performance by next year.

Wrapping Up

Can researchers currently working with Inferentia switch to Trainium? Since both share the same AWS Neuron SDK, developers already using Inferentia should find it easy to get started with Trainium. The company also notes that developers can migrate from GPU-based instances to Trainium with minimal code changes.

Trainium invites comparison with Google's Tensor Processing Units (TPUs), the custom accelerators for AI training workloads hosted on Google Cloud Platform, but the offerings differ at many levels and a clear comparison cannot be made at this point. It will also compete with some of this year's newly launched AI chips, such as IBM Power10, which claims to be three times more efficient than previous models in the POWER CPU series, and the NVIDIA A100, which claims to offer 6x higher performance than NVIDIA's previous-generation chips.

That said, with these new chips AWS is aiming big: targeting enterprises to help them train ML models efficiently and cost-effectively, and to build stronger AI strategies.

Srishti Deoras
Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.
