Ahead of Google I/O, Google Research launched a new pose detection model in TensorFlow.js called MoveNet. This ultra-fast and accurate model can detect 17 key points in the human body. MoveNet is currently available on TF Hub with two variants — Lightning and Thunder.
While Lightning is intended for latency-critical applications, Thunder is for applications that call for higher accuracy. Both models claim to run faster than real-time (30+ frames per second (FPS)) on most personal computers, laptops and phones.
The model runs entirely in the browser using TensorFlow.js, with no server calls needed after the initial page load and no external packages required. A live demo is available here.
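As a rough sketch of what in-browser usage looks like, the snippet below loads MoveNet through TensorFlow.js's pose-detection API and runs it on a video frame. The `videoElement` variable is a placeholder for a real `<video>` or `<img>` element, and the surrounding setup (bundler, camera permissions) is assumed rather than shown:

```javascript
import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs-backend-webgl';

// Create a MoveNet detector; Lightning is the faster, latency-oriented
// variant, while SINGLEPOSE_THUNDER trades some speed for accuracy.
const detector = await poseDetection.createDetector(
  poseDetection.SupportedModels.MoveNet,
  { modelType: poseDetection.movenet.modelType.SINGLEPOSE_LIGHTNING }
);

// videoElement is a placeholder for a live <video> element in the page.
const poses = await detector.estimatePoses(videoElement);

// Each detected pose carries the 17 COCO keypoints with positions
// and confidence scores.
console.log(poses[0].keypoints.length);
```

Swapping `SINGLEPOSE_LIGHTNING` for `SINGLEPOSE_THUNDER` is the only change needed to move between the two variants.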
Currently, the MoveNet model detects the pose of a single person in the camera's field of view. Google Research plans to extend MoveNet to the multi-person domain so that developers can support applications involving multiple people.
How is MoveNet different from other models?
OpenPose, VIBE, and Adobe’s BodyNet are other major players in the field. Human pose estimation has come a long way in the last five years but hasn’t appeared in many applications as most companies focused on making pose models larger and more accurate, rather than doing the engineering work to make them fast and deployable everywhere.
Google has designed a model that leverages SOTA architectures but has kept inference time as low as possible. Because of this, the model can deliver accurate key points across a wide variety of poses, environments and infrastructure.
Recently, Google Research collaborated with Ohio-based health tech company IncludeHealth to offer remote care to patients. Using MoveNet, the company developed an interactive web application that guides a patient through various routines via laptop, smartphone or tablet. The exercises were virtually built and prescribed by a physical therapist to test balance, strength, and range of motion.
Google provided MoveNet to IncludeHealth, accessible through the new pose-detection API. IncludeHealth integrated this model into their application.
Ryan Eder, founder and CEO at IncludeHealth, said the MoveNet model enhanced both the speed and the accuracy of delivering prescriptive care. “While other models trade one for the other, this unique balance has unlocked the next generation of care delivery,” said Eder.
With remote fitness, dance, physiotherapy, and yoga sessions moving online, MoveNet can help instructors and experts assess users in real time (30+ FPS) and offer personalised guidance accordingly.
While a typical detector is sufficient for tracking simple movements, more complicated poses can still challenge even SOTA detectors trained on the wrong data. Google claims MoveNet delivers fast and accurate results irrespective of body posture.
MoveNet uses heatmaps to localise human keypoints, i.e., it is a bottom-up estimation model. The architecture consists of a feature extractor and a set of prediction heads. The prediction scheme follows CenterNet, with notable changes that improve both speed and accuracy. All models are trained using the TensorFlow Object Detection API.
The feature extractor used in the MoveNet architecture is MobileNetV2 with an attached feature pyramid network (FPN), which allows for a high-resolution, semantically rich feature map output. Four prediction heads are attached to the feature extractor, predicting:
- Person centre heatmap
- Keypoint regression field
- Person keypoint heatmap
- 2D per-keypoint offset field
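To make the heatmap-plus-offset idea concrete, here is a simplified, self-contained sketch of how a keypoint heatmap and a 2D offset field can be decoded into image coordinates. The function names, array shapes, and toy values are illustrative assumptions, not MoveNet's actual tensors:

```javascript
// Find the (row, col) of the highest-scoring cell in a 2D heatmap.
function argmax2d(heatmap) {
  let best = { row: 0, col: 0, score: -Infinity };
  for (let r = 0; r < heatmap.length; r++) {
    for (let c = 0; c < heatmap[r].length; c++) {
      if (heatmap[r][c] > best.score) {
        best = { row: r, col: c, score: heatmap[r][c] };
      }
    }
  }
  return best;
}

// Refine the coarse heatmap peak with the per-cell 2D offset, then
// scale from feature-map coordinates to image pixels via the stride.
function decodeKeypoint(heatmap, offsets, stride) {
  const { row, col, score } = argmax2d(heatmap);
  const [dy, dx] = offsets[row][col];
  return { x: (col + dx) * stride, y: (row + dy) * stride, score };
}

// Toy 3x3 heatmap with its peak at row 1, col 2, plus a sub-cell offset.
const heatmap = [
  [0.1, 0.2, 0.1],
  [0.0, 0.3, 0.9],
  [0.1, 0.1, 0.2],
];
const offsets = [
  [[0, 0], [0, 0], [0, 0]],
  [[0, 0], [0, 0], [0.25, -0.5]],
  [[0, 0], [0, 0], [0, 0]],
];
console.log(decodeKeypoint(heatmap, offsets, 4)); // { x: 6, y: 5, score: 0.9 }
```

The offset field is what lets a coarse, low-resolution heatmap still yield sub-pixel-accurate keypoint positions, which is one reason heatmap-based models can stay fast without sacrificing much accuracy.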
MoveNet was trained on COCO and an internal Google dataset called Active. While COCO offers broad scene and scale diversity, it is less suited to fitness and dance applications, which exhibit challenging poses and significant motion blur; the Active dataset was therefore produced by labelling keypoints (adopting COCO's standard 17 body keypoints) on yoga, fitness, and dance videos from YouTube. The model details can be found here.
MoveNet browser performance
To quantify the inference speed of MoveNet, the model was benchmarked across multiple devices. Inference speed (FPS) was measured on the GPU with WebGL, as well as with WebAssembly (WASM).
The image below shows the latency across multiple devices, including MacBook Pro, iPhone, Pixel and personal desktop.
Showcasing the performance metrics of MoveNet in the browser. The first number in each cell is for Lightning, and the second for Thunder. (Source: TensorFlow)
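FPS figures like these are typically produced by timing many inference calls and averaging. The sketch below shows one common way to do this with `performance.now()`; `runInference` is a hypothetical placeholder for a real call such as a detector's pose-estimation step:

```javascript
// Time `frames` calls to runInference and report average latency and FPS.
// runInference is a placeholder for the actual per-frame model call.
function benchmark(runInference, frames = 100) {
  const start = performance.now();
  for (let i = 0; i < frames; i++) {
    runInference();
  }
  const msPerFrame = (performance.now() - start) / frames;
  return { msPerFrame, fps: 1000 / msPerFrame };
}

// Example with a dummy workload standing in for model inference.
const result = benchmark(() => {
  let sum = 0;
  for (let i = 0; i < 1e5; i++) sum += i;
  return sum;
});
console.log(result.fps > 0);
```

In practice a few warm-up runs are usually discarded first, since the first inferences include shader compilation and memory allocation costs on WebGL backends.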
Google Research used several techniques to improve performance, including a custom WebGL kernel for the depthwise separable convolutions and improved GL scheduling for mobile Chrome. Meanwhile, TensorFlow.js continues to optimise its backends to accelerate model execution across all supported devices, which, according to the team, "is achieved through repeated benchmarking and backend optimization."