Google recently announced new Cloud TPU virtual machines, or TPU VMs, which give developers and researchers direct access to TPU host machines, where they can use Google's industry-leading TPU hardware to build machine learning models. Google is offering a new and improved user experience for developing and deploying TensorFlow, JAX and PyTorch on Cloud TPUs.
Here are some of the highlights of Google Cloud TPU VMs:
- Write and debug a machine learning model seamlessly using a single TPU VM, then scale it up on a Cloud TPU to take advantage of the super-fast TPU interconnect.
- Get access to every TPU VM you create, so you can install and run any code you wish in a tight loop with the TPU accelerators.
- Use local storage, execute custom code in your input pipelines, and integrate Cloud TPUs into your research and production workflows more efficiently.
- Beyond the built-in support for TensorFlow, PyTorch, and JAX on Cloud TPU, write your own integrations via the new ‘libtpu’ shared library on the VM.
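The points above boil down to running your code on the same machine as the accelerators. A minimal sketch of what that looks like in JAX (installed on a TPU VM with `pip install "jax[tpu]"`): JAX discovers the locally attached TPU cores and compiles computations for them, and the same script falls back to CPU on a machine without TPUs.

```python
import jax
import jax.numpy as jnp

# List the accelerator devices JAX can see. On a Cloud TPU VM these
# are the locally attached TPU cores; elsewhere, a CPU device.
print(jax.devices())

# A jitted function is compiled for whichever backend is present --
# no gRPC round trip to a remote TPU host is involved.
@jax.jit
def matmul(a, b):
    return jnp.dot(a, b)

x = jnp.ones((128, 128))
result = matmul(x, x)
print(result[0, 0])  # each entry is the sum of 128 ones
```

The same source runs unmodified on a laptop or a TPU VM; only the device list changes.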
Aidan Gomez, co-founder and CEO of Cohere, said direct access to TPU VMs has changed how his team builds machine learning models on TPUs, significantly enhancing both the developer experience and model performance.
Cloud TPU architecture
“Until now, you could only access Cloud TPU remotely. Typically, you would create one or more VMs that would then communicate with Cloud TPU host machines over the network using gRPC,” explained Google in its blog post.
gRPC (gRPC Remote Procedure Calls) is a high-performance, open-source, universal RPC framework that can run in any environment.
However, with the latest launch of Cloud TPU VMs, developers and researchers can now run their code directly on the TPU host machines that are physically attached to the TPU accelerators. The pictorial representation is shown below.
Google claimed its new Cloud TPU system architecture is simple and flexible. In addition to cost benefits, developers gain performance because their code no longer needs to make round trips across datacenter networks to reach the TPUs.
“If you previously needed a fleet of powerful Compute Engine VMs to feed data to remote hosts in a Cloud TPU Pod slice, you can now run that directly on the Cloud TPU host systems and eliminate the need for additional Compute Engine VMs,” said Google.
Since October last year, Google has given early access to a select group of customers, as well as several teams of its own researchers and engineers.
Patrick von Platen, a research engineer at Hugging Face, said they recently integrated JAX, alongside TensorFlow and PyTorch, into their Transformers library. This has enabled the natural language processing (NLP) community to effectively train popular NLP models like BERT on Cloud TPU VMs.
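Training on a TPU VM typically means replicating one training step across all locally attached TPU cores. A minimal, illustrative sketch of that pattern in JAX follows; `loss_fn`, `train_step`, and the toy linear model are hypothetical stand-ins for a real model, not part of any library. On a TPU v3 host `jax.local_device_count()` would be 8; on a CPU-only machine it is 1, so the sketch still runs anywhere.

```python
from functools import partial

import jax
import jax.numpy as jnp

n = jax.local_device_count()  # number of locally attached cores

def loss_fn(params, batch):
    # Toy squared-error loss standing in for a real model loss.
    pred = batch["x"] @ params
    return jnp.mean((pred - batch["y"]) ** 2)

@partial(jax.pmap, axis_name="batch")  # replicate across local cores
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients across cores via an all-reduce, which on a
    # TPU Pod slice rides the fast TPU interconnect.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return params - 0.1 * grads, loss

# Leading axis of size n shards the data one slice per device.
params = jnp.zeros((n, 4))
batch = {"x": jnp.ones((n, 2, 4)), "y": jnp.zeros((n, 2))}
new_params, loss = train_step(params, batch)
```

Scaling from a single TPU VM to a Pod slice keeps this structure; only the number of participating devices grows.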
Further, Platen believes easy access to Cloud TPU VMs will make pre-training of large language models possible for a much larger audience, including small startups and educational institutions.
Hugging Face is an open-source provider of NLP technologies and the creator of the popular Transformers library.
Shrestha Basu Mallick, a product manager on the Sandbox@Alphabet team, said, “Thanks to Google Cloud TPU VMs, and the ability to scale from one to 2048 TPU cores, our team has built the most powerful classical simulator of quantum circuits. The simulator can evolve a wave function of 40 qubits, which entails manipulating one trillion complex amplitudes! Scalability has been key to enabling our team to perform quantum chemistry computations of huge molecules, with up to 500k orbitals.”
Cloud TPU VMs are currently available in preview in the US and Europe regions, priced at $1.35 per hour per TPU host machine. Customers can use single Cloud TPU devices or Cloud TPU Pod slices, and choose between TPU v2 and TPU v3 accelerator hardware.