A saying in the developer community goes, “Training costs scale with the number of researchers, inference costs scale with the number of users.” Imagine a few years into the future, when all the big tech companies have their own models and there are specialised LLMs and multimodal foundation models for specific use cases.
Now consider this scenario: While the cost of conducting a single training run for an AI model can be substantial, running inference, which involves applying the trained model to real-world data, is relatively inexpensive. However, the sheer scale of potential users and diverse applications means that the accumulated total of inference operations will eventually surpass the total cycles spent on training. The demand would move from hardware and software for training to that required for inference.
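To make the crossover concrete, here is a back-of-the-envelope sketch in Python. Every figure in it (the compute for one training run, the compute per request, the daily traffic) is a made-up assumption for illustration only; none of them come from a real model or vendor.

```python
def breakeven_requests(training_compute: int, compute_per_request: int) -> float:
    """Number of inference requests whose total compute equals one training run."""
    return training_compute / compute_per_request


# Hypothetical figures, chosen only to illustrate the arithmetic.
TRAINING_COMPUTE = 10**23      # assumed operations for a single training run
COMPUTE_PER_REQUEST = 10**12   # assumed operations to serve one request
REQUESTS_PER_DAY = 10**8       # assumed daily traffic across all users

requests_needed = breakeven_requests(TRAINING_COMPUTE, COMPUTE_PER_REQUEST)
days_needed = requests_needed / REQUESTS_PER_DAY

print(f"Inference matches training after {requests_needed:.0e} requests")
print(f"At the assumed traffic, that takes about {days_needed:.0f} days")
```

Under these assumed numbers the training run is a one-off cost, while inference compute keeps accumulating for as long as the model is in service.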
Many organisations today prefer not to fine-tune LLMs due to the availability of pre-trained models that can be adjusted with parameter tweaks, prompt banks, or sampled responses. If they do fine-tune, it’s typically on a limited number of tokens from a domain-specific corpus, incurring training costs only occasionally.
The major cost for organisations arises during inference, especially as the number of users and queries increases. To manage these expenses, organisations are adopting various optimisation strategies at the inference level. Even for relatively modest use cases, like a customer chatbot in the automotive sector, the monthly cost can range from $2,000 to $2,500 when using a proprietary LLM, assuming only a small percentage of users engage with it. As usage grows, the cost can escalate significantly because more tokens are generated.
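A rough way to see how such a bill adds up is to multiply traffic by tokens and by the per-token price. The sketch below does exactly that. The user count, engagement rate, conversation length, and price per 1,000 tokens are all hypothetical inputs picked so the result lands in the range quoted above; they are not figures from any provider's price list.

```python
def monthly_cost_usd(
    monthly_users: int,
    engaged_fraction: float,
    chats_per_user: int,
    tokens_per_chat: int,
    usd_per_1k_tokens: float,
) -> float:
    """Estimate the monthly bill for a chatbot priced per 1,000 tokens."""
    total_tokens = monthly_users * engaged_fraction * chats_per_user * tokens_per_chat
    return total_tokens / 1000 * usd_per_1k_tokens


# Hypothetical inputs for illustration only.
baseline = monthly_cost_usd(
    monthly_users=100_000,
    engaged_fraction=0.05,    # assumed: 5% of users talk to the bot
    chats_per_user=10,        # assumed conversations per engaged user
    tokens_per_chat=1_500,    # assumed prompt plus response tokens
    usd_per_1k_tokens=0.03,   # assumed blended price
)
doubled = monthly_cost_usd(100_000, 0.10, 10, 1_500, 0.03)

print(f"Baseline: ${baseline:,.0f} per month")
print(f"With engagement doubled: ${doubled:,.0f} per month")
```

Because every factor multiplies the others, the cost grows linearly with engagement: doubling the share of users who chat doubles the bill, which is why the price per token is the lever organisations most want to pull.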
AMD is strategically focusing on AI inference, diverging from the traditional GPU-centric path. Its acquisition of Mipsology, an AI software company focused on inference, signals a commitment to strengthening its AI software capabilities and to offering a comprehensive solution that includes CPUs and streamlines AI model deployment through the AMD Unified AI Stack. This demonstrates AMD’s determination to establish itself as a major player in AI computing, with an emphasis on CPU-based inference solutions.
Intel is also emphasising AI inference by leveraging its CPU capabilities. Its Xeon Scalable processors, complemented by hardware features like Intel DL Boost VNNI and Intel AMX, are central to its AI inference strategy. Intel’s participation in benchmark tests like MLPerf Inference v3.1 demonstrates competitive AI inference performance across various models.
The Habana Gaudi2 accelerators and 4th Gen Intel Xeon Scalable processors are powerful options for AI workloads. Moreover, Intel’s balanced platform for AI inference, featuring a larger cache, higher core frequency, and other advantages, positions Intel CPUs as strong contenders for diverse AI inference pipelines. Intel’s active contribution to the community through open sourcing further challenges the GPU-centric perception of AI inference.
While it’s challenging to control user behaviour, organisations are seeking ways to reduce the cost per token at the hardware level, which could prove highly beneficial in managing overall expenses.
CPUs: The emerging tech in inference
CPUs are poised to become competitive players when it comes to inference, according to people in the ecosystem. While CPUs have long been considered slower than GPUs for training, they possess a set of advantages for inference. Furthermore, they can offer cost-effective performance per arithmetic operation compared to GPUs.
The distribution of training workloads remains challenging in AI, while inference can be efficiently distributed across numerous low-cost CPUs. This makes a swarm of commodity PCs an attractive option for applications reliant on ML inference.
Unlike training, inference often requires processing small or single-input batches, necessitating different optimisation approaches. Additionally, certain elements of the model, such as weights, remain constant during inference and can benefit from pre-processing techniques like weight compression or constant folding.
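As an illustration of one such pre-processing step, the sketch below compresses a weight vector to 8-bit integers once, ahead of time, and then reuses it for inference. It is a minimal, pure-Python example of symmetric per-tensor quantisation, one common form of weight compression; the weights and input are made up, and production toolchains use more elaborate schemes.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the compressed form."""
    return [q * scale for q in quantized]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights; they stay constant at inference time,
# so the compression is done once and stored.
weights = [0.42, -1.27, 0.05, 0.88]
q_weights, scale = quantize_int8(weights)

# Each incoming request reuses the same compressed weights.
x = [1.0, 2.0, -1.0, 0.5]
exact = dot(weights, x)
approx = dot(dequantize(q_weights, scale), x)

print(f"exact={exact:.4f} approx={approx:.4f} error={abs(exact - approx):.4f}")
```

Each stored weight shrinks from a 32-bit float to a single byte, at the price of a rounding error of at most half the scale per weight. That trade is only possible because the weights no longer change once training has finished.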
Inference presents unique challenges, particularly in terms of latency, which is critical for user-facing applications.
As inference costs take centre stage, they will significantly shape how AI applications are developed. Researchers value the ability to experiment and iterate rapidly, which requires flexible tools. Applications, by contrast, tend to keep their models for extended periods, sticking with the same fundamental architecture once it meets their needs. This contrast may lead to a future where model authors use specialised tools and hand the results over to deployment engineers for optimisation.
In this evolving landscape, traditional CPU platforms such as x86 and Arm are poised to emerge as the winners. Inference will need to be seamlessly integrated into conventional business logic for end-user applications, making it challenging for specialised inference hardware to function effectively due to latency concerns. Consequently, CPUs are expected to incorporate increasingly integrated machine learning support, initially as co-processors and eventually as specialised instructions, mirroring the evolution of floating-point support in CPUs.
This impending shift in the AI landscape has significant implications for hardware development.
How NVIDIA Optimises GPUs for Inference
NVIDIA has taken note. Through TensorRT-LLM, its new open-source software, NVIDIA is offering double the performance of the H100 GPU when running inference on LLMs, greatly enhancing overall speed and efficiency.
TensorRT-LLM optimises LLM inference in multiple ways. It includes ready-to-run versions of the latest LLMs like Meta Llama 2, GPT-2, GPT-3, Falcon, Mosaic MPT, and BLOOM. It also integrates cutting-edge open-source AI kernels for efficient LLM execution. Additionally, TensorRT-LLM automates the simultaneous execution of LLMs on multiple GPUs and GPU servers over NVIDIA’s NVLink and InfiniBand interconnects, eliminating manual management, and introduces in-flight batching to improve GPU utilisation.
Furthermore, it’s optimised for the H100’s Transformer Engine, reducing GPU memory usage. These features enhance LLM inference performance, scalability, and power efficiency, and TensorRT-LLM supports various NVIDIA GPUs beyond the H100.
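To see why in-flight batching helps utilisation, consider the toy scheduler below. It is a simplified simulation written for this article, not TensorRT-LLM’s implementation or API: it assumes each request needs a known number of decoding steps and that the GPU has a fixed number of batch slots, and it simply counts how many steps each policy needs.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes; short ones sit idle."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def inflight_batching_steps(lengths: List[int], slots: int) -> int:
    """A slot freed by a finished request is refilled before the next step."""
    queue = deque(lengths)
    active: List[int] = []  # decoding steps still needed by each running request
    steps = 0
    while queue or active:
        # Fill any free slots with waiting requests.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every running request produces one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (in tokens) for six requests, two batch slots.
request_lengths = [2, 8, 3, 8, 2, 3]
static_steps = static_batching_steps(request_lengths, slots=2)
inflight_steps = inflight_batching_steps(request_lengths, slots=2)

print(f"static batching:    {static_steps} steps")
print(f"in-flight batching: {inflight_steps} steps")
```

With these made-up lengths, static batching takes 19 steps while in-flight batching takes 13, because short requests no longer hold a slot hostage until the longest request in their batch completes.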
Traditionally, many ML researchers have regarded inference as a subset of training, but this perspective seems to be on the verge of change as inference takes centre stage.