Recently, researchers from ShanghaiTech University introduced a new GAN-based framework, known as Impersonator++, that performs human image synthesis with the help of a 3D body mesh recovery module. According to the researchers, Impersonator++ tackles human image synthesis tasks including human motion imitation, appearance transfer, and novel view synthesis.
With the growing prominence of adversarial methods, researchers around the globe have been working on human image synthesis. Human image synthesis aims to produce believable, photorealistic images of humans and covers tasks such as motion imitation, appearance transfer, and novel view synthesis. The researchers stated that human image synthesis has vast potential applications in character animation, reenactment, virtual try-on of clothes, and movie or game production.
According to them, existing task-specific methods mainly use 2D keypoints or pose estimates to represent the human body structure. These representations capture only layout locations and ignore the personalised body shape and limb (joint) rotations, which are even more essential in human image synthesis. In other words, such methods can express the position of body parts but cannot characterise a person's individual shape or model limb rotations.
To mitigate these issues, the researchers proposed a GAN-based framework that uses a 3D body mesh recovery module to disentangle pose and shape. The framework is claimed to model not only joint locations and rotations but also to characterise the personalised body shape.
Behind Impersonator++
Impersonator++ is an Attentional Liquid Warping GAN built around an Attentional Liquid Warping Block (AttLWB) that preserves crucial details such as texture, style, colour and face identity. The researchers designed the framework to address the loss of source information through the following key aspects:
- A denoising convolutional auto-encoder is used to extract features that preserve the source information, including texture, colour, style and face identity (a minimal sketch of such an auto-encoder follows this list).
- The source features of each local part are blended into a global feature stream by the proposed LWB and AttLWB to preserve the source details.
- Further, the framework supports multiple-source warping; in appearance transfer, for example, it can warp the head features (local identity) from one source and the body features from another, and aggregate them into a global feature stream.
- A one/few-shot learning strategy is utilised to improve the generalisation of the network.
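To make the first item concrete, here is a minimal sketch of a denoising convolutional auto-encoder; the layer sizes, channel counts and noise level are illustrative assumptions and do not reflect the authors' actual architecture.

```python
import torch
import torch.nn as nn

class DenoisingConvAutoEncoder(nn.Module):
    """Minimal denoising convolutional auto-encoder (illustrative sizes)."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Encoder: downsample the (noised) source image into a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels * 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: reconstruct the clean image from the features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_channels * 2, feat_channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_channels, in_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, x, noise_std=0.1):
        # Corrupt the input, then encode; the features are what downstream
        # blocks (e.g. LWB/AttLWB) would consume to preserve source details.
        noisy = x + noise_std * torch.randn_like(x)
        feats = self.encoder(noisy)
        recon = self.decoder(feats)
        return feats, recon
```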
The approach consists of three main modules: a body mesh recovery module, a flow composition module, and a GAN module equipped with the Liquid Warping Block (LWB) or the Attentional Liquid Warping Block (AttLWB).
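At a high level, data flows through these modules in sequence. The sketch below shows that flow in plain Python; the callables (`mesh_recovery`, `flow_composition`, `generator`) and their signatures are placeholders for illustration, not the authors' actual API.

```python
def synthesize(source_img, reference_img, mesh_recovery, flow_composition, generator):
    """Illustrative data flow through the three modules (placeholder API)."""
    # 1. Body mesh recovery: estimate SMPL pose and shape parameters for the
    #    source person and for the reference (target) pose.
    src_mesh = mesh_recovery(source_img)
    ref_mesh = mesh_recovery(reference_img)
    # 2. Flow composition: compute the transformation flow that maps the
    #    source body mesh onto the reference pose.
    flow = flow_composition(src_mesh, ref_mesh)
    # 3. GAN module with LWB/AttLWB: warp the preserved source features by
    #    the flow and synthesise the final image in the reference pose.
    return generator(source_img, flow)
```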
During the process, the researchers first use SMPL, a parametric statistical human body model, to disentangle the human body into pose (joint rotations) and shape. SMPL is a 3D body model that outputs a 3D mesh (without clothes) rather than a layout of joints and parts.
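As an illustration of how SMPL separates pose from shape, here is a sketch using the open-source smplx package; the model path is a placeholder, and the paper's body mesh recovery module regresses these parameters from an image rather than setting them by hand.

```python
import torch
import smplx

# Load a pre-trained SMPL model (the model files must be downloaded
# separately from the SMPL website; the path below is a placeholder).
model = smplx.create('path/to/smpl_models', model_type='smpl', gender='neutral')

betas = torch.zeros(1, 10)         # shape coefficients (personalised body shape)
body_pose = torch.zeros(1, 69)     # axis-angle rotations of the 23 body joints
global_orient = torch.zeros(1, 3)  # root orientation

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = output.vertices  # (1, 6890, 3) 3D body mesh, without clothes
joints = output.joints      # 3D joint locations derived from pose and shape
```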
They then apply the Attentional Liquid Warping Block (AttLWB), which learns the similarities between the global feature stream and the features of all the sources. It fuses the multiple source features by taking a linear combination weighted by these learned similarities in the feature space.
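A rough way to picture this fusion is an attention-style weighted sum: compute a similarity between the global feature stream and each source's features, normalise the similarities, and linearly combine the sources. The sketch below uses cosine similarity and a softmax as illustrative assumptions; the learned similarity in the paper may differ.

```python
import torch
import torch.nn.functional as F

def attentional_fusion(global_feat, source_feats):
    """Attention-style fusion sketch (shapes and similarity are illustrative).

    global_feat:  (B, C, H, W) global feature stream
    source_feats: (B, N, C, H, W) features warped from N source images
    """
    B, N, C, H, W = source_feats.shape
    q = global_feat.reshape(B, 1, C * H * W)   # query: the global stream
    k = source_feats.reshape(B, N, C * H * W)  # keys: each warped source
    # Similarity between the global stream and every source feature.
    sim = F.cosine_similarity(q, k, dim=-1)              # (B, N)
    weights = F.softmax(sim, dim=-1).reshape(B, N, 1, 1, 1)
    # Linear combination of the sources with the normalised similarities.
    fused = (weights * source_feats).sum(dim=1)          # (B, C, H, W)
    return fused
```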
Lastly, inspired by SinGAN and few-shot adversarial learning, the researchers apply a one/few-shot adversarial learning strategy that pushes the network to focus on the individual input through a few steps of adaptation, referred to as personalisation.
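The sketch below illustrates what such a personalisation step could look like: fine-tune a copy of a pretrained generator on a handful of images of the target person before inference. The loss terms, optimiser and step count are assumptions for illustration, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

def personalize(generator, discriminator, source_imgs, num_steps=50, lr=1e-4):
    """One/few-shot personalisation sketch.

    source_imgs: iterable of (1, C, H, W) image tensors of the same person.
    """
    g = copy.deepcopy(generator)  # keep the meta-trained weights intact
    opt = torch.optim.Adam(g.parameters(), lr=lr)
    for _ in range(num_steps):
        for img in source_imgs:
            recon = g(img)                 # self-reconstruction of the source
            adv = discriminator(recon)     # adversarial feedback on the output
            # Reconstruction term plus a simple adversarial term (assumed form).
            loss = F.l1_loss(recon, img) - adv.mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return g
```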
According to the researchers, thanks to the SMPL model and the Liquid Warping Block (LWB), the method can be further extended to other tasks, including human appearance transfer and novel view synthesis, at no extra cost, so that one model can handle all three tasks.
Dataset Used
The researchers built a new dataset, the Impersonator (iPER) dataset, to evaluate human motion imitation, appearance transfer and novel view synthesis. The videos cover diverse styles of clothing and feature 30 subjects of varying shape, height and gender. In total, the dataset contains 206 video sequences with 241,564 frames.
Contributions Of This Research
The contributions made by the researchers are mentioned below:
- The researchers proposed the Liquid Warping Block (LWB) and the Attentional Liquid Warping Block (AttLWB), which propagate the source information, such as texture, style, colour, and face identity, in both the image and the feature space, thereby addressing its loss.
- According to them, by taking advantage of both the LWB (AttLWB) and the 3D parametric model, the method is a unified framework for human motion imitation, appearance transfer, and novel view synthesis.
- Due to the limitations of previously available datasets, they built a new dataset for these tasks, especially for human motion imitation in videos, and released all code and data for the convenience of further research in the community.