Researchers from NVIDIA, the Vector Institute, and the University of Toronto have proposed a motion-capture method that uses only video input, improving on past motion-capture animation models. The new system does not require the expensive motion-capture hardware of earlier approaches. Because large video resources are available online, this work is expected to make human motion synthesis far more scalable.
Existing methods need accurate motion-capture data for training, which is expensive to collect. With the new system, the researchers can capture individual movements solely from video input and transfer them to a digital avatar. In the paper, they introduce a framework for training motion synthesis models from raw video pose estimates, without any motion-capture data. The framework refines noisy pose estimates by enforcing physics constraints through contact-invariant optimisation, which includes computing contact forces. This optimisation yields corrected 3D poses and motions together with their corresponding contact forces, and the physically corrected motions significantly outperform prior work on pose estimation.
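The paper's contact-invariant optimisation is not spelled out here, but the general idea of refining noisy pose estimates under physics constraints can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified stand-in rather than the authors' method: it refines a noisy 1-D joint-height trajectory by least squares, trading off fidelity to the video estimate, low acceleration, and near-zero height on frames flagged as being in ground contact. The function name, weights, and the 1-D simplification are all invented for illustration.

```python
import numpy as np

def refine_heights(noisy_z, contact, w_data=1.0, w_smooth=5.0, w_contact=100.0):
    """Toy least-squares refinement of a noisy 1-D joint-height trajectory.

    Minimises:
        w_data    * ||z - noisy_z||^2    (stay near the video estimate)
      + w_smooth  * ||D z||^2            (low acceleration; D = 2nd difference)
      + w_contact * ||z[contact]||^2     (height ~ 0 on ground-contact frames)
    All three terms are quadratic, so the minimiser solves a linear system.
    """
    T = len(noisy_z)
    # Second-difference operator: discrete acceleration of the trajectory.
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    # Diagonal selector for frames flagged as in ground contact.
    C = np.diag(contact.astype(float))
    # Normal equations of the combined quadratic objective.
    A = w_data * np.eye(T) + w_smooth * D.T @ D + w_contact * C.T @ C
    b = w_data * np.asarray(noisy_z)
    return np.linalg.solve(A, b)
```

In this toy version, the contact term plays the role of a hard physics constraint: frames in contact are pulled to the ground plane, and the smoothness term propagates that correction to neighbouring frames.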
The proposed framework thus learns generative models of physically plausible human motion directly from monocular RGB videos, which are far more widely available than motion-capture recordings.
On top of the refined poses, the researchers train a time-series generative model that synthesises both future motion and the corresponding contact forces. The results demonstrate significant gains in pose estimation from the physics-based refinement, along with strong motion synthesis from video.
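The paper's time-series generative model is likewise not detailed here. As a hedged illustration of the underlying idea, training on refined pose sequences and then rolling out future frames, the sketch below fits a simple linear autoregressive predictor and synthesises a continuation from seed poses. This is a toy stand-in, not the authors' model; the function names, the AR formulation, and the lag parameter are all assumptions for illustration.

```python
import numpy as np

def fit_autoregressive(poses, k=2):
    """Fit a linear model pose_t ~ W @ [pose_{t-k}, ..., pose_{t-1}, 1]
    by least squares over a (T, d) array of refined pose vectors."""
    T, d = poses.shape
    # Stack the k lagged pose blocks plus a bias column.
    X = np.hstack([poses[i:T - k + i] for i in range(k)] + [np.ones((T - k, 1))])
    Y = poses[k:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def rollout(W, seed, steps, k=2):
    """Autoregressively synthesise `steps` future frames from the
    last k frames of a seed pose sequence."""
    hist = list(seed[-k:])
    out = []
    for _ in range(steps):
        x = np.concatenate(hist[-k:] + [np.ones(1)])
        nxt = x @ W          # predict the next pose vector
        out.append(nxt)
        hist.append(nxt)     # feed the prediction back in
    return np.array(out)
```

The same rollout structure carries over to richer sequence models: the state at each step is conditioned on recent frames, and predictions are fed back in to generate arbitrarily long motions.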
Such a framework is expected to bring people one step closer to working and playing inside virtual worlds. With it, developers can animate human motion more affordably and with a greater diversity of movements.