Last week, Facebook said it would migrate all its AI systems to PyTorch. Facebook’s AI models currently perform trillions of inference operations every day for the billions of people that use its technology. Its AI tools and frameworks help fast track research work at Facebook, educational institutions and businesses globally.
Why migrate to PyTorch?
Predominantly, Facebook has been using two distinct but synergistic frameworks for deep learning: PyTorch and Caffe2. PyTorch is optimised for research, while Caffe2 is optimised for production. Caffe2 is Facebook’s in-house production framework for training and deploying large-scale machine learning models.
Facebook said adopting PyTorch as its default AI framework ensures that all the experiences across its technologies will run optimally at scale.
“Over a year into the migration to PyTorch, there are more than 1.7K inference models in full production, and 93 percent of our new training models are on PyTorch,” said Lin Qiao, engineering director at Facebook AI.
Migration also means that Facebook will be closely working alongside the PyTorch developer community. “PyTorch not only makes our engineering and research work more efficient, collaborative and effective, but also allows us to share our work and learn from the advances made by thousands of PyTorch developers around the world,” she added.
The evolution of PyTorch
Traditionally, AI’s research-to-production pipeline has been plodding. Numerous steps and tools, fragmented processes, and lack of clear standardisation across the industry made it impossible to manage the end-to-end workflow. Researchers and engineers were forced to choose between AI frameworks optimised for either research or production.
In 2016, a group of ML/AI researchers at Facebook collaborated with the research community to better understand existing frameworks. The team experimented with machine learning (ML) frameworks such as Theano and Torch and advanced concepts from Lua Torch, Chainer, and HIPS Autograd. “After months of development, PyTorch was born,” said Qiao. It became the go-to deep learning library for AI researchers, thanks to its simple interface, dynamic computational graphs, first-class Python integration and back-end support for CPUs and GPUs.
In 2018, Facebook released PyTorch 1.0 and started the work to unify PyTorch’s research and production capabilities into a single framework. The new iteration merged Python-based PyTorch with production-ready Caffe2, providing both flexibility for research and performance optimisation for production.
With time, PyTorch engineers at Facebook introduced various tools, pretrained models, libraries, and data sets for each stage of advancement, enabling the developer and research community to quickly create and deploy new ML/AI innovations at scale. To this day, the platform continues to evolve, with the most recent release boasting more than 3K commits since the prior version.
Facebook is looking to create a smoother end-to-end developer experience for its engineers and developers and accelerate its research-to-production pipeline by using a single platform.
“By moving away from Caffe2 and standardising on PyTorch, we are decreasing the engineering and infrastructure burden associated with maintaining two systems, as well as unifying under one common umbrella, both internally and within the open-source community.
“This is an ongoing journey and spans product teams across Facebook. As we migrate our ML/AI workloads, we also need to maintain steady model performance and limit the disruption to any downstream product traffic or research progress,” said Qiao. On average, there are over 4K models running on PyTorch daily at Facebook.
Further, Qiao said Facebook’s developers go through multiple steps including critical online and offline testing, training, inference, and then publishing. Additionally, multiple tests are conducted to check for performance and correctness variance between Caffe2 and PyTorch, which can take engineers and researchers up to a few weeks to perform.
To address these migration scenarios, Facebook said its engineers have developed an internal workflow and custom tools to help teams decide on the best way to migrate their systems rather than replacing them outright.
While the migration is feasible, the latency of machine learning models poses a challenge. Facebook has created internal benchmarking tools to compare the performance of original models with their PyTorch counterparts ahead of time, making these evaluations easier.
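Facebook has not published its internal benchmarking tools, so the following is only a minimal sketch of the general idea: run an original model and its migrated counterpart on the same inputs, check that the outputs agree within a tolerance, and compare latency. Everything here is a hypothetical stand-in. The names `compare_models`, `max_abs_diff` and `median_latency`, the tolerance value, and the toy models are assumptions for illustration, and the models are plain Python callables rather than real Caffe2 or PyTorch networks.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence

# A "model" here is any callable mapping an input vector to an output vector.
# This is an illustrative assumption, not an actual Caffe2 or PyTorch API.
Model = Callable[[Sequence[float]], Sequence[float]]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two output vectors."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def median_latency(model: Model, inputs: List[Sequence[float]], repeats: int = 5) -> float:
    """Median wall-clock seconds for one full pass over the inputs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            model(x)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_models(original: Model, migrated: Model,
                   inputs: List[Sequence[float]],
                   tolerance: float = 1e-6, repeats: int = 5) -> Dict[str, object]:
    """Report output agreement and latency for two models on the same inputs."""
    worst = max(max_abs_diff(original(x), migrated(x)) for x in inputs)
    return {
        "max_abs_diff": worst,
        "within_tolerance": worst <= tolerance,
        "original_latency_s": median_latency(original, inputs, repeats),
        "migrated_latency_s": median_latency(migrated, inputs, repeats),
    }


# Toy stand-ins for an original model and its migrated counterpart.
def original_model(x: Sequence[float]) -> List[float]:
    return [2.0 * v + 1.0 for v in x]


def migrated_model(x: Sequence[float]) -> List[float]:
    return [v + v + 1.0 for v in x]


sample_inputs = [[0.0, 1.0, 2.0], [-3.5, 4.25, 10.0]]
report = compare_models(original_model, migrated_model, sample_inputs)
print(report)
```

In a real migration the callables would wrap the actual model runtimes, the inputs would be sampled from production traffic, and the tolerance would be chosen per model, since small numerical differences between frameworks are expected.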
Advantages of migrating to PyTorch
- ML/AI models are now easier to build, program, test and debug
- Research and production environments are brought closer than ever
- On-device deployment (PyTorch Mobile) is accelerating. PyTorch Mobile currently runs on devices like the Oculus Quest and Portal, as well as on desktops and in the Android and iOS mobile apps for Facebook, Instagram, and Messenger
- On-device AI will play a crucial role with emerging hardware technologies such as wearable AR
With PyTorch as the underlying framework powering all of Facebook’s AI workloads and innovations, its engineers can deploy new ML/AI models in minutes rather than in weeks or months. Real-world use cases include Instagram personalisation technologies, person segmentation models (especially in the AR/VR space), detection of harmful content such as hate speech and misinformation, text-to-speech, optical character recognition and more.
“PyTorch gives us the flexibility and scalability to move fast and innovate at Facebook,” said Aparna Lakshmi Ratan, director of product management at Facebook AI.