
ChatGPT is Ruining Our Environment, But There’s a Way to Tackle It

Staying true to its name, FlexGen can be flexibly configured under a range of hardware resource constraints by aggregating memory and compute from the GPU, CPU and disk



As the ChatGPT hype peaked, so did the sense of reckoning around the carbon footprint the tool may leave behind. Media reports quoted wild estimates (24.86 metric tonnes per capita of CO₂ emissions per day) of how much energy LLMs like GPT-3.5 (on which ChatGPT is built) have drained.

This is not to say that these worries are fanciful or shouldn’t be discussed. But research into optimising GPU compute, along with other efforts to cut the compute behind LLMs, has come a long way since headlines described training GPT-3 as consuming as much energy as a trip to the moon.

Running a 175-billion-parameter model on a single GPU

A couple of days ago, Ying Sheng, a research student at the Stanford AI Lab; Lianmin Zheng from UC Berkeley; and several other AI researchers released FlexGen, a generation engine for running huge LLMs like GPT-3 with very limited GPU memory. But how far can the compute requirements be shrunk? The researchers managed to run Meta AI’s freely available OPT-175B model on a single 16GB GPU, as demonstrated in a paper titled ‘High-Throughput Generative Inference of Large Language Models with a Single GPU’.

Staying true to its name, FlexGen can be flexibly configured under a range of hardware resource constraints by aggregating memory and compute from the GPU, CPU and disk. Its main contribution is a more efficient offloading system, achieving up to 100 times higher throughput than other state-of-the-art offloading systems such as Hugging Face Accelerate. The researchers achieved this with a new algorithm designed for efficient batch-wise offloaded inference.
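The core idea behind batch-wise offloaded inference can be illustrated with a toy sketch (hypothetical names and numbers; this is not the real FlexGen API). All layer weights live in a slow CPU/disk store and the "GPU" holds only one layer at a time; iterating over layers in the outer loop and batches in the inner loop amortises each slow weight fetch over every batch, which is where the throughput win comes from:

```python
NUM_LAYERS = 4
# Per-layer weights kept in slow memory (CPU/disk), not on the GPU.
SLOW_STORE = {i: 0.5 * (i + 1) for i in range(NUM_LAYERS)}

def run_batches(batches):
    acts = [list(b) for b in batches]  # activations for each batch
    fetches = 0
    for layer in range(NUM_LAYERS):    # outer loop: one slow fetch per layer
        w = SLOW_STORE[layer]
        fetches += 1
        for acts_b in acts:            # inner loop: reuse the fetched weight
            for j in range(len(acts_b)):
                acts_b[j] += w         # stand-in for the layer's real compute
    return acts, fetches

outputs, num_fetches = run_batches([[0.0], [1.0], [2.0]])
# one weight fetch per layer, regardless of how many batches are processed
```

A naive schedule (batches outer, layers inner) would re-fetch every layer for every batch; swapping the loop order is what keeps the slow memory traffic constant as the batch count grows.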

Lighter, distilled and pretrained models

One only needs to look hard enough to find similar efforts by experts to go easier on GPUs. ML researcher Sebastian Raschka discussed a paper released earlier this month titled ‘MarioGPT: Open-Ended Text2Level Generation through Large Language Models’, authored by Shyam Sudhakaran and Miguel González-Duque, among others. The model generates tile-based game levels (the paper works with Super Mario Bros levels) from text. Falling into the space of light, fun and creative generative AI, the tool was built on a distilled GPT-2 model that can be trained on a single GPU.

The idea of distillation implemented here has gained popularity over the past couple of years. It essentially means training a smaller ‘student’ model to reproduce the predictions of a large, well-known ‘teacher’ model. For instance, DistilBERT, the smaller version of Google’s BERT, has 40% fewer parameters and runs 60% faster while retaining about 97% of BERT’s performance.
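A minimal numerical sketch of the distillation signal, with made-up logits: the teacher's outputs are softened with a temperature, and the student is trained to minimise the divergence between its own softened outputs and the teacher's, rather than fitting hard labels:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.5]

# A higher temperature softens the distribution, exposing how the teacher
# ranks the *wrong* classes too -- the signal the student learns from.
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

def kl_divergence(p, q):
    # Distillation minimises the divergence between the teacher's and the
    # student's softened output distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

student_logits = [3.0, 1.5, 1.0]
loss = kl_divergence(soft, softmax(student_logits, temperature=4.0))
```

In real training this loss (often mixed with a standard label loss) is backpropagated through the student; the sketch only shows how the softened targets and the loss are formed.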

Most models released by companies now also come pretrained, which cuts out a major chunk of the compute that repeated pretraining would otherwise consume.

Some AI platforms are constantly working on methods to solve the scaling problems of these models. Cohere, a Canadian startup founded by former Google Brain employees Aidan Gomez and Nick Frosst, partnered with Google last year to build a platform for powerful LLMs that doesn’t require the infrastructure or specialist expertise such undertakings usually demand. In a paper titled ‘Scalable Training of Language Models using JAX pjit and TPUv4’, its engineers described how their new FAX framework, deployed on Google Cloud’s TPU v4 Pods, could train larger models quicker and deliver prototypes to customers faster.

Data parallelism solutions

In December last year, Amazon SageMaker launched a new training technique called sharded data parallelism, which introduces a number of optimisations, including training speed-ups of up to 39.7%. Techniques like data parallelism are a knight in shining armour for AI practitioners because they kill several birds with one stone: they cut training time and cost, consume less energy, and shorten the time to market.
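The basic mechanic of data parallelism can be sketched in a few lines (names and numbers are illustrative; SageMaker's sharded variant additionally partitions the model state itself across workers). The global batch is split into shards, each worker computes a gradient on its shard, and the gradients are averaged, an all-reduce, so every worker applies the identical update:

```python
def gradient(weight, shard):
    # Gradient of mean squared error for a one-parameter model y = w * x
    # fit against the target y = 2 * x, on this worker's data shard.
    return sum(2 * (weight * x - 2 * x) * x for x in shard) / len(shard)

def all_reduce_mean(values):
    # Stand-in for the collective that averages gradients across workers.
    return sum(values) / len(values)

def train_step(weight, global_batch, num_workers, lr=0.1):
    size = len(global_batch) // num_workers
    shards = [global_batch[i * size:(i + 1) * size] for i in range(num_workers)]
    grads = [gradient(weight, s) for s in shards]   # computed in parallel
    return weight - lr * all_reduce_mean(grads)     # identical update everywhere

w = 0.0
data = [1.0, 2.0, 3.0, 4.0]
for _ in range(50):
    w = train_step(w, data, num_workers=2)
# w converges toward the true slope of 2.0
```

Because each worker touches only its shard, wall-clock time per step shrinks roughly with the number of workers, which is where the reported training speed-ups come from.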

Stability AI, the startup founded by Emad Mostaque that shot to fame with Stable Diffusion, its family of text-to-image models, has partnered with SageMaker. A blog post announcing the partnership stated that Stability AI had used the technique for its foundation models.

All this is to say that a tonne of work is happening under the radar at the systems and hardware level that isn’t necessarily visible, or hasn’t reached a tool like ChatGPT yet. But it’s safe to say that even as generative tools reach levels of accessibility previously unheard of in AI, research will keep finding ways to taper their power costs and, consequently, their environmental toll.

Poulomi Chatterjee

Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.