After Quantization Aware Training (QAT) and Model Maker, tech giant Google has now open-sourced TensorFlow Runtime (TFRT), a new runtime intended to replace the existing TensorFlow runtime. The new runtime will be responsible for performance-critical work such as the efficient execution of kernels and low-level device-specific primitives on targeted hardware.
Machine learning is a complex domain, and the way models are built and deployed keeps changing as investment in the ML ecosystem grows. While researchers at TensorFlow invent new algorithms that demand more compute, application developers are enhancing their products with new techniques across edge and server deployments.
However, growing compute requirements and rising computing costs have sparked a proliferation of new hardware aimed at specific ML use cases. According to the developers, TFRT aims to provide a unified, extensible infrastructure layer that delivers strong performance across a wide variety of domain-specific hardware.
At the virtual TensorFlow Dev Summit 2020 in March, Megan Kacholia, VP of Engineering for Google Brain and TensorFlow at Google, made several interesting announcements, including the TensorFlow 2.2 pre-release, Model Maker, T5 (Text-to-Text Transfer Transformer) and TFRT. During the summit, Megan stated, “As a developer or a researcher, you won’t be exposed to TFRT directly, but it will be working under the covers to provide the best performance possible across a wide variety of domain-specific hardware.”

Behind TensorFlow Runtime (TFRT)
TFRT is a new runtime that provides efficient use of multithreaded host CPUs, supports fully asynchronous programming models, and focuses on low-level efficiency. According to the developers, the new runtime has three design highlights:
- To Achieve Higher Performance: The new runtime has a lock-free graph executor that supports concurrent op execution with low synchronisation overhead, and a thin eager op dispatch stack that makes eager API calls asynchronous and more efficient (see the sketch after this list).
- To Make Extending the TF Stack Easier: This is achieved by decoupling device runtimes from the host runtime, the core TFRT component that drives host CPU and I/O work.
- To Get Consistent Behaviour: The new runtime leverages common abstractions, such as shape functions and kernels, across both eager and graph execution.
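
To make the first highlight concrete, here is a minimal sketch that uses only standard TensorFlow 2.x public APIs; nothing in it is TFRT-specific, and whether TFRT actually runs underneath depends on how TensorFlow was built. The two matrix multiplications share no data dependency, so an executor that tracks per-op readiness, as TFRT's lock-free graph executor is designed to do, is free to dispatch them concurrently.

```python
import tensorflow as tf

@tf.function
def independent_branches(x, y):
    # a and b share no data dependency, so a graph executor that
    # tracks per-op readiness can dispatch the two matmuls
    # concurrently with low synchronisation overhead.
    a = tf.matmul(x, x)
    b = tf.matmul(y, y)
    return a + b

x = tf.random.normal([512, 512])
y = tf.random.normal([512, 512])
print(independent_branches(x, y).shape)  # (512, 512)
```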
How It Works
Unlike the existing TensorFlow runtime, the new runtime plays a crucial part in both eager and graph execution. According to the developers, in eager execution, TensorFlow APIs call directly into the new runtime, while in graph execution the computational graph of a program is lowered to an optimised target-specific program and dispatched to TFRT. In both execution paths, the new runtime invokes a set of kernels that call into the underlying hardware devices to complete the model execution.
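
The two paths are visible from ordinary user code. The sketch below uses only standard TensorFlow 2.x APIs; which runtime ultimately dispatches the kernels (TFRT or the current one) is an implementation detail hidden from this code.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Eager path: the API call is dispatched to the runtime immediately
# and the result is returned as soon as the kernel finishes.
eager_result = tf.matmul(x, x)

# Graph path: tf.function traces the Python function into a
# computational graph, which the runtime lowers to an optimised
# target-specific program before dispatching its kernels.
@tf.function
def square(m):
    return tf.matmul(m, m)

graph_result = square(x)
tf.debugging.assert_near(eager_result, graph_result)
```

In both cases the numerical result is the same; what differs is whether each op is dispatched as it is called or the whole graph is optimised and handed to the runtime at once.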
Wrapping Up
As part of a benchmarking study for TensorFlow Dev Summit 2020, developers at the tech giant integrated TFRT with TensorFlow Serving and measured the latency of sending requests to the model and getting prediction results back. Comparing GPU inference over TFRT against the current runtime, they observed a 28% improvement in average inference time. They stated that these early results are strong validation for the new runtime, which is expected to provide a big boost to performance. The project has been made available on GitHub.
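
For readers who want to run a similar measurement on their own setup, the sketch below is one rough way to time prediction requests against TensorFlow Serving's REST API. It is not Google's benchmark: it assumes a hypothetical local server on port 8501 serving a model named half_plus_two.

```python
import json
import time
import urllib.request

# Hypothetical setup: TensorFlow Serving running locally and serving a
# model named "half_plus_two" over its REST API on port 8501.
URL = "http://localhost:8501/v1/models/half_plus_two:predict"
payload = json.dumps({"instances": [1.0, 2.0, 5.0]}).encode("utf-8")

latencies = []
for _ in range(100):
    start = time.perf_counter()
    request = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request).read()
    latencies.append(time.perf_counter() - start)

print(f"mean request latency: {1000 * sum(latencies) / len(latencies):.2f} ms")
```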
According to the developers at TensorFlow, the new runtime will benefit researchers looking for faster iteration time and better error reporting when developing complex new models in eager mode, application developers looking for improved performance when training and serving models in production, and hardware makers looking to integrate edge and datacenter devices into TensorFlow in a modular way.