Google has launched TensorFlow RunTime (TFRT), which is a new runtime for its TensorFlow machine learning framework.
According to a recent blog post by Eric Johnson, TFRT Product Manager, and Mingsheng Hong, TFRT Tech Lead/Manager, “TensorFlow RunTime aims to provide a unified, extensible infrastructure layer with best-in-class performance across a wide variety of domain-specific hardware. It provides efficient use of multithreaded host CPUs, supports fully asynchronous programming models, and focuses on low-level efficiency.”
The company has made TFRT available on GitHub. According to the company, a benchmarking study for TensorFlow Dev Summit 2020 compared the performance of GPU inference over TFRT with the current runtime and found a 28% improvement in average inference time. The company describes these early results as strong validation that TFRT can provide a significant boost to performance.
The blog post further described how TFRT could benefit a broad range of users: researchers looking for faster iteration time and better error reporting when developing complex new models in eager mode; application developers looking for improved performance when training and serving models in production; and hardware makers looking to integrate edge and datacenter devices into TensorFlow in a modular way.
Explaining further, Johnson stated that TFRT is responsible for the efficient execution of kernels (low-level, device-specific primitives) on targeted hardware. It also plays a critical part in both eager and graph execution.
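To make the idea of kernels concrete, here is a minimal sketch of how a runtime might register and dispatch device-specific kernels. The names (`KERNEL_REGISTRY`, `register_kernel`, `dispatch`) are purely illustrative and are not part of TFRT's actual API:

```python
# Hypothetical sketch of kernel registration and dispatch.
# All names here are illustrative, not TFRT's real interfaces.

KERNEL_REGISTRY = {}

def register_kernel(op, device):
    """Associate an (op, device) pair with a low-level kernel function."""
    def decorator(fn):
        KERNEL_REGISTRY[(op, device)] = fn
        return fn
    return decorator

@register_kernel("add", "cpu")
def add_cpu(a, b):
    # A real kernel would call into optimised, device-specific code.
    return [x + y for x, y in zip(a, b)]

def dispatch(op, device, *args):
    """Look up the kernel for the target device and execute it."""
    return KERNEL_REGISTRY[(op, device)](*args)

print(dispatch("add", "cpu", [1, 2], [3, 4]))  # [4, 6]
```

The runtime's job, in this picture, is the lookup-and-invoke step: it selects the right primitive for the target device and hands the work to it.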
[Image: TensorFlow training stack]
“In eager execution, TensorFlow APIs call directly into the new runtime. In graph execution, your program’s computational graph is lowered to an optimised target-specific program and dispatched to TFRT. In both execution paths, the new runtime invokes a set of kernels that call into the underlying hardware devices to complete the model execution, as shown by the black arrows,” wrote Johnson.
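The two execution paths Johnson describes can be sketched in plain Python. The following is an illustration of the eager-versus-graph distinction only, not TensorFlow or TFRT code: the eager path runs an operation the moment the API is called, while the graph path first records operations into a program and then dispatches the whole program to a runtime:

```python
# Illustrative sketch of eager vs graph-style execution.
# Plain Python, not TensorFlow/TFRT code.

def add(a, b):
    return a + b

# Eager path: the API call executes the operation immediately.
eager_result = add(2, 3)

# Graph path: operations are first recorded into a program
# (here, a list of closures), then the lowered program is
# handed to a runtime for execution.
graph = [lambda: add(2, 3), lambda: add(10, 20)]

def run(program):
    """Stand-in for a runtime executing a lowered program."""
    return [op() for op in program]

print(eager_result)  # 5
print(run(graph))    # [5, 30]
```

In both paths the same underlying operation is ultimately invoked; what differs is whether execution happens at call time or after the program has been assembled and optimised.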
Compared with the current TensorFlow runtime, which was initially built for graph execution and training workloads, TFRT makes eager execution and inference first-class citizens, while placing special emphasis on architectural extensibility and modularity. Selected design highlights of TFRT include:
- To achieve higher performance, TFRT has a lock-free graph executor that supports concurrent op execution with low synchronisation overhead, and a thin, eager op dispatch stack so that eager API calls are asynchronous and more efficient.
- The company decoupled device runtimes from the host runtime, the core TFRT component that drives host CPU and I/O work, in order to make extending the TF stack easier.
- To ensure consistent behaviour, TFRT leverages common abstractions, such as shape functions and kernels, across both eager and graph execution.
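The asynchronous execution model in the first highlight can be illustrated with Python's standard `concurrent.futures` module. This is a sketch of the style of execution described above, not TFRT's actual executor: independent ops are dispatched without blocking, and a dependent op consumes their results only once the futures resolve:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of asynchronous op dispatch. Illustrative Python only,
# not TFRT's lock-free executor.

def matmul_stub(a, b):
    # Stand-in for a heavy device kernel.
    return a * b

with ThreadPoolExecutor(max_workers=4) as pool:
    # Two independent ops are submitted concurrently, without blocking.
    f1 = pool.submit(matmul_stub, 3, 4)
    f2 = pool.submit(matmul_stub, 5, 6)
    # A dependent op waits on both results before combining them.
    total = f1.result() + f2.result()

print(total)  # 42
```

The point of the design is that the caller never stalls on an individual op: work is queued, independent ops overlap on the host's threads, and synchronisation happens only at true data dependencies.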
According to the blog, “A high-performance low-level runtime is a key to enable the trends of today and empower the innovations of tomorrow.”
The company is limiting contributions to begin with, but is encouraging participation in the form of requirements and design discussions.