Facial Motion Capture for Animation Using First Order Motion Model

First Order Motion Model is an open-source library that lets you create animated videos from facial capture videos and still images. Image animation here means generating a video sequence in which an object in a source image is animated according to the motion of a driving video.

Animation and 3D rendering are art forms of the present and the future, drawing audiences to movie theatres and games every day through their emotional power. Motion capture is the process of recording an actor's expressive movements and building a digital representation of them, so that a digital character can re-enact the captured motion. It is now widely used in film development, especially for visual effects and animation. With modern software, motion capture no longer requires actors in Lycra suits covered in white marker balls. This has also enabled game and film companies to perform motion capture remotely, even during the pandemic. That is an important technological advancement, because the hassles of conventional motion-capture systems had stalled production for both movie-makers and video game companies. Markerless capture addresses this problem and lowers the cost and effort of the work.

Adding face capture to 3D animation gives users more control over expressing their vision and enhances the look of the animation. Software can now render faces quickly and easily, generating 3D face animations in minutes. No special hardware is needed, so video captured on any device can be used to generate 3D face animations. A facial motion capture database describes the coordinates or relative positions of reference points on the actor's face. The capture may be in two dimensions, in which case the process is sometimes known as expression tracking, or in three dimensions. Two-dimensional capture can be achieved using a single camera and capture software. It produces less sophisticated tracking and cannot fully capture three-dimensional motions such as head rotations.

Three-dimensional capture is accomplished using multi-camera rigs or laser marker systems. Such systems are generally far more expensive, complicated, and time-consuming to use. Markerless capture instead lets a computer system assign the annotation points to the object, which simplifies the method and saves time, especially in facial motion capture. The process records human actors on video and uses that information to animate digital character models in 2D or 3D. This form of motion capture can also be applied to different parts of the human body, such as the face and legs, and to other objects that can be moved.

Motion capture belongs to the field of computer vision and depends on electronic devices such as computers and web cameras. In film production and 3D animation, the technique has traditionally relied on placing capture points on human faces. Newer approaches replace these physical markers with points that the computer detects on the face image, so that motion is captured without attaching anything to the real object. Markerless techniques ease the film production process by reducing both the preparation time and the material that motion capture requires. Combining face landmark detection with a rigged 3D face model also draws on geometric transformations: rotation and translation are derived by comparing distances between face landmark points or blendshapes to produce real-time motion.
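The rotation-and-translation step mentioned above can be illustrated with a small sketch. The article does not say which algorithm is used, so the example swaps in the standard Kabsch (orthogonal Procrustes) least-squares fit as an assumption. The function name `estimate_rigid_transform` and the landmark coordinates are hypothetical.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t.

    Kabsch fit, used here as an assumed stand-in for the
    'rotation and translation' step described in the text.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # Cross-covariance of the centred landmark sets.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Hypothetical 2D landmarks (eye corners, nose tip, mouth corners).
neutral = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0],
                    [38.0, 80.0], [62.0, 80.0]])
angle = np.deg2rad(10.0)
true_R = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
true_t = np.array([5.0, -3.0])
moved = neutral @ true_R.T + true_t  # the same face after a small head turn

R, t = estimate_rigid_transform(neutral, moved)
```

The recovered `R` and `t` describe the head pose change between the two landmark sets. A rigged model could apply them to its head joint, while the remaining per-landmark differences would drive expressions.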

Facial Motion Capture Using First Order Motion Model

The First Order Motion Model framework addresses image animation without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category, e.g. faces or human bodies, the method can be applied to any object of that class. To achieve this, it decouples appearance and motion information using a self-supervised formulation. To support complex motions, it uses a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models the occlusions arising during target motions and combines the appearance extracted from the source image with the motion derived from the driving video. According to its authors, the framework scores best on diverse benchmarks and on a variety of object categories.
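To make the idea of a keypoint with a local affine transformation more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the library's code. It assumes that, near one keypoint, motion is approximated to first order from the keypoint's position in the source and driving frames together with a 2×2 Jacobian for each. The function name `local_affine_motion` and all values are hypothetical.

```python
import numpy as np

def local_affine_motion(z, kp_source, kp_driving, jac_source, jac_driving):
    """Map driving-frame points z to source-frame coordinates near one keypoint.

    First-order approximation: kp_source + J @ (z - kp_driving),
    with J = jac_source @ inverse(jac_driving).
    """
    z = np.asarray(z, dtype=float)
    kp_source = np.asarray(kp_source, dtype=float)
    kp_driving = np.asarray(kp_driving, dtype=float)
    jacobian = (np.asarray(jac_source, dtype=float)
                @ np.linalg.inv(np.asarray(jac_driving, dtype=float)))
    # Points are stored as rows, so multiply by the transposed Jacobian.
    return kp_source + (z - kp_driving) @ jacobian.T

# Hypothetical values in normalised image coordinates.
points = np.array([[0.3, 0.1]])
same_scale = local_affine_motion(points, [0.0, 0.0], [0.2, 0.1],
                                 np.eye(2), np.eye(2))
zoomed = local_affine_motion(points, [0.0, 0.0], [0.2, 0.1],
                             2.0 * np.eye(2), np.eye(2))
```

With identity Jacobians the mapping is a pure translation between the two keypoint positions. A non-identity Jacobian additionally scales, rotates or shears the neighbourhood, which is what lets a small set of keypoints describe richer motion.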


The two main modules are the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field that aligns a driving frame with the source image. It assumes that there exists an abstract reference frame and independently estimates two transformations: from reference to source and from reference to driving. This choice allows it to process source and driving frames independently. This is desirable because, at test time, the model receives pairs of a source image and driving frames sampled from a different video, which can be visually very different.
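One way to picture how a dense motion field can be assembled from per-keypoint motions is a mask-weighted blend. Each pixel mixes the identity (background) motion and the candidate motions around each keypoint, using weights that sum to one. The sketch below is an assumption about the general scheme, not the library's implementation: in the real model a network predicts the masks, whereas here they are supplied by hand. The name `combine_local_motions` and the array shapes are hypothetical.

```python
import numpy as np

def combine_local_motions(grid, local_motions, mask_logits):
    """Blend K candidate motion fields plus the identity into one dense field.

    grid:          (H, W, 2) pixel coordinates, used as the background motion
    local_motions: (K, H, W, 2) candidate source coordinates per keypoint
    mask_logits:   (K + 1, H, W) unnormalised weights, index 0 = background
    """
    # Softmax over the K + 1 candidates so the weights sum to one per pixel.
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(logits)
    masks /= masks.sum(axis=0, keepdims=True)
    candidates = np.concatenate([grid[None], local_motions], axis=0)
    return (masks[..., None] * candidates).sum(axis=0)

# A tiny 2x2 image with a single keypoint motion that shifts everything by 0.5.
ys, xs = np.meshgrid(np.arange(2.0), np.arange(2.0), indexing="ij")
grid = np.stack([xs, ys], axis=-1)
local = (grid + 0.5)[None]

background_logits = np.zeros((2, 2, 2))
background_logits[0] = 50.0   # every pixel strongly prefers the background
keypoint_logits = np.zeros((2, 2, 2))
keypoint_logits[1] = 50.0     # every pixel strongly prefers the keypoint motion

static_field = combine_local_motions(grid, local, background_logits)
moving_field = combine_local_motions(grid, local, keypoint_logits)
```

Pixels dominated by the background mask keep their coordinates, while pixels dominated by a keypoint mask follow that keypoint's motion.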

Getting Started With Code

In this article, we will build a demo that renders an animated, moving face from a facial motion-capture video and an input image, demonstrating the power of the First Order Motion Model. The implementation is inspired by the creators of the First Order Motion Model, whose paper and repository are listed at the end of the article.

Installing the Library

Our first step is to install the library components required for the model. This can be done using the following lines of code:

#Installing the Library
!pip install ffmpy &> /dev/null
!git init -q .
!git remote add origin https://github.com/AliaksandrSiarohin/first-order-model
!git pull -q origin master
!git clone -q https://github.com/graphemecluster/first-order-model-demo demo

Importing the Dependencies

Let us now import our required dependencies for the model being created.

#Importing the Dependencies
import IPython.display
import PIL.Image
import cv2
import imageio
import io
import ipywidgets
import numpy
import os.path
import requests
import skimage.transform
import warnings
from base64 import b64encode
from demo import load_checkpoints, make_animation
from ffmpy import FFmpeg
from google.colab import files, output
from IPython.display import HTML, Javascript
from skimage import img_as_ubyte
os.makedirs("user", exist_ok=True)

Creating the Display Space

To display our results, let us create a display space using HTML components. We will also be creating a few buttons to generate the results. 

#Creating the Display Space
.widget-box > * {
  flex-shrink: 0;
}
.widget-tab {
  min-width: 0;
  flex: 1 1 auto;
}
.widget-tab .p-TabBar-tabLabel {
  font-size: 15px;
}
.widget-upload {
  background-color: tan;
}
.widget-button {
  font-size: 18px;
  width: 160px;
  height: 34px;
  line-height: 34px;
}
.widget-dropdown {
  width: 250px;
}
.widget-checkbox {
    width: 650px;
}
.widget-checkbox + .widget-checkbox {
    margin-top: -6px;
}
.input-widget .output_html {
  text-align: center;
  width: 266px;
  height: 266px;
  line-height: 266px;
  color: lightgray;
  font-size: 72px;
}
div.stream {
  display: none;
}
.title {
  font-size: 20px;
  font-weight: bold;
  margin: 12px 0 6px 0;
}
.warning {
  display: none;
  color: red;
  margin-left: 10px;
}
.warn {
  display: initial;
}
.resource {
  cursor: pointer;
  border: 1px solid gray;
  margin: 5px;
  width: 160px;
  height: 160px;
  min-width: 160px;
  min-height: 160px;
  max-width: 160px;
  max-height: 160px;
  -webkit-box-sizing: initial;
  box-sizing: initial;
}
.resource:hover {
  border: 6px solid crimson;
  margin: 0;
}
.selected {
  border: 6px solid seagreen;
  margin: 0;
}
.input-widget {
  width: 266px;
  height: 266px;
  border: 1px solid gray;
}
.input-button {
  width: 268px;
  font-size: 15px;
  margin: 2px 0 0;
}
.output-widget {
  width: 256px;
  height: 256px;
  border: 1px solid gray;
}
.output-button {
  width: 258px;
  font-size: 15px;
  margin: 2px 0 0;
}
.uploaded {
  width: 256px;
  height: 256px;
  border: 6px solid seagreen;
  margin: 0;
}
.label-or {
  align-self: center;
  font-size: 20px;
  margin: 16px;
}
.loading {
  align-items: center;
  width: fit-content;
}
.loader {
  margin: 32px 0 16px 0;
  width: 48px;
  height: 48px;
  min-width: 48px;
  min-height: 48px;
  max-width: 48px;
  max-height: 48px;
  border: 4px solid whitesmoke;
  border-top-color: gray;
  border-radius: 50%;
  animation: spin 1.8s linear infinite;
}
.loading-label {
  color: gray;
}
.comparison-widget {
  width: 256px;
  height: 256px;
  border: 1px solid gray;
  margin-left: 2px;
}
.comparison-label {
  color: gray;
  font-size: 14px;
  text-align: center;
  position: relative;
  bottom: 3px;
}
@keyframes spin {
  from { transform: rotate(0deg); }
  to { transform: rotate(360deg); }
}

Next, we define the components for each button and set up their functionality.

#Defining the components for each button 
def thumbnail(file):
  # Return the first frame of a video as an RGB array.
  return imageio.get_reader(file, mode='I', format='FFMPEG').get_next_data()
def create_image(i, j):
  # Thumbnail widget for the sample image in category i at position j.
  image_widget = ipywidgets.Image(
    value=open('demo/images/%d%d.png' % (i, j), 'rb').read(),
    format='png'
  )
  image_widget.add_class('resource-image%d%d' % (i, j))
  return image_widget
def create_video(i):
  # Thumbnail widget showing the first frame of sample video i, PNG-encoded.
  video_widget = ipywidgets.Image(
    value=cv2.imencode('.png', cv2.cvtColor(thumbnail('demo/videos/%d.mp4' % i), cv2.COLOR_RGB2BGR))[1].tobytes(),
    format='png'
  )
  video_widget.add_class('resource-video%d' % i)
  return video_widget
def create_title(title):
  title_widget = ipywidgets.Label(title)
  return title_widget
def download_output(button):
  complete.layout.display = 'none'
  loading.layout.display = ''
  # Assumed step (missing in the source): hand the result to the browser.
  files.download('output.mp4')
  loading.layout.display = 'none'
  complete.layout.display = ''
def convert_output(button):
  complete.layout.display = 'none'
  loading.layout.display = ''
  # Upscale to 1080x1080, then pad left and right to a 1920x1080 canvas.
  FFmpeg(inputs={'output.mp4': None}, outputs={'scaled.mp4': '-vf "scale=1080x1080:flags=lanczos,pad=1920:1080:420:0" -y'}).run()
  loading.layout.display = 'none'
  complete.layout.display = ''
def back_to_main(button):
  complete.layout.display = 'none'
  main.layout.display = ''
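The `pad=1920:1080:420:0` part of the FFmpeg filter in `convert_output` places the 1080×1080 result in the middle of a 1920×1080 canvas: the horizontal offset is (1920 − 1080) / 2 = 420 and the vertical offset is 0. As a small sketch, the same filter string can be derived for other canvas sizes. The helper names `centered_pad_offsets` and `scale_pad_filter` are hypothetical and not part of the notebook.

```python
def centered_pad_offsets(content_w, content_h, canvas_w, canvas_h):
    """Top-left offsets that centre the content on the canvas."""
    if content_w > canvas_w or content_h > canvas_h:
        raise ValueError("content does not fit on the canvas")
    return (canvas_w - content_w) // 2, (canvas_h - content_h) // 2

def scale_pad_filter(side, canvas_w, canvas_h):
    """Build a scale+pad filter string for a square video of the given side."""
    x, y = centered_pad_offsets(side, side, canvas_w, canvas_h)
    return f"scale={side}x{side}:flags=lanczos,pad={canvas_w}:{canvas_h}:{x}:{y}"

full_hd_filter = scale_pad_filter(1080, 1920, 1080)
```

For a 1080-pixel square on a 1920×1080 canvas this reproduces the filter string used in the notebook.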

Setting the label components:

#Setting the label components
label_or = ipywidgets.Label('or')
image_titles = ['Peoples', 'Cartoons', 'Dolls', 'Game of Thrones', 'Statues']
image_lengths = [8, 4, 8, 9, 4]
image_tab = ipywidgets.Tab()
image_tab.children = [ipywidgets.HBox([create_image(i, j) for j in range(length)]) for i, length in enumerate(image_lengths)]
for i, title in enumerate(image_titles):
  image_tab.set_title(i, title)
input_image_widget = ipywidgets.Output()
upload_input_image_button = ipywidgets.FileUpload(accept='image/*', button_style='primary')
# Assumed layout (closing lines are missing in the source): upload box, "or", sample tabs.
image_part = ipywidgets.HBox([
  ipywidgets.VBox([input_image_widget, upload_input_image_button]),
  label_or,
  image_tab
])
video_tab = ipywidgets.Tab()
video_tab.children = [ipywidgets.HBox([create_video(i) for i in range(5)])]
video_tab.set_title(0, 'All Videos')
input_video_widget = ipywidgets.Output()
upload_input_video_button = ipywidgets.FileUpload(accept='video/*', button_style='primary')
video_part = ipywidgets.HBox([
  ipywidgets.VBox([input_video_widget, upload_input_video_button]),
  label_or,
  video_tab
])
# The option list is missing in the source; 'vox' and 'fashion' are the only
# model names the later code refers to, so they are used here as placeholders.
model = ipywidgets.Dropdown(
  description='Model:',
  options=['vox', 'fashion']
)
warning = ipywidgets.HTML('<b>Warning:</b> Upload your own images and videos (see README)')
model_part = ipywidgets.HBox([model, warning])

relative = ipywidgets.Checkbox(description="Relative keypoint displacement (Inherit object proportions from the video)", value=True)
adapt_movement_scale = ipywidgets.Checkbox(description="Adapt movement scale (Don’t touch unless you know what you are doing)", value=True)
generate_button = ipywidgets.Button(description="Generate", button_style='primary')
# Assumed layout: only the two titles survive in the source.
main = ipywidgets.VBox([
  create_title('Choose Image'),
  image_part,
  create_title('Choose Video'),
  video_part,
  model_part,
  relative,
  adapt_movement_scale,
  generate_button
])
loader = ipywidgets.Label()
loading_label = ipywidgets.Label("This may take several minutes to process…")
loading = ipywidgets.VBox([loader, loading_label])
output_widget = ipywidgets.Output()
download = ipywidgets.Button(description='Download', button_style='primary')
convert = ipywidgets.Button(description='Convert to 1920×1080', button_style='primary')
back = ipywidgets.Button(description='Back', button_style='primary')
comparison_widget = ipywidgets.Output()
comparison_label = ipywidgets.Label('Comparison')
complete = ipywidgets.HBox([
  ipywidgets.VBox([output_widget, download, convert, back]),
  ipywidgets.VBox([comparison_widget, comparison_label])
])
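The "Relative keypoint displacement" checkbox is easier to reason about with a small numerical sketch. The idea, as the label suggests, is to move the source keypoints by the change in the driving keypoints relative to the first driving frame, rather than copying the driving positions directly. That way the animated face keeps its own proportions. The function below is a simplified assumption for illustration only: it ignores the affine part of the motion, and it takes the movement scale as a given number instead of estimating it. The name `relative_keypoints` and the coordinates are hypothetical.

```python
import numpy as np

def relative_keypoints(kp_source, kp_driving, kp_driving_initial, movement_scale=1.0):
    """Shift source keypoints by the driving motion since the first frame."""
    kp_source = np.asarray(kp_source, dtype=float)
    displacement = (np.asarray(kp_driving, dtype=float)
                    - np.asarray(kp_driving_initial, dtype=float))
    return kp_source + movement_scale * displacement

# Hypothetical keypoints in normalised coordinates.
source = np.array([[0.10, 0.20], [0.40, 0.25]])
driving_first = np.array([[0.00, 0.00], [0.50, 0.10]])
driving_now = np.array([[0.02, 0.00], [0.50, 0.14]])

unchanged = relative_keypoints(source, driving_first, driving_first)
animated = relative_keypoints(source, driving_now, driving_first)
half_motion = relative_keypoints(source, driving_now, driving_first, movement_scale=0.5)
```

While the driving face has not moved, the source keypoints stay where they are. Once it moves, only the displacement is transferred, optionally rescaled, which is roughly what the "Adapt movement scale" option controls.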

Setting the Algorithm Pipeline

With our display space set up, we can finally create the model's image processing pipeline, which will in turn render the animated video for us.

#setting the algorithm
selected_image = None
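def select_image(filename):
  global selected_image
  selected_image = resize(PIL.Image.open('demo/images/%s.png' % filename).convert("RGB"))
  input_image_widget.clear_output(wait=True)
  with input_image_widget:
    # Assumed preview; the body of this block is missing in the source.
    display(selected_image)
output.register_callback("notebook.select_image", select_image)
selected_video = None
def select_video(filename):
  global selected_video
  selected_video = 'demo/videos/%s.mp4' % filename
  input_video_widget.clear_output(wait=True)
  with input_video_widget:
    # Assumed preview; the body of this block is missing in the source.
    display(resize(PIL.Image.fromarray(thumbnail(selected_video)).convert("RGB")))
output.register_callback("notebook.select_video", select_video)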
def resize(image, size=(256, 256)):
    w, h = image.size
    d = min(w, h)
    r = ((w - d) // 2, (h - d) // 2, (w + d) // 2, (h + d) // 2)
    return image.resize(size, resample=PIL.Image.LANCZOS, box=r)
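def upload_image(change):
  global selected_image
  content = None  # stays None when nothing was uploaded
  for name, file_info in upload_input_image_button.value.items():
    content = file_info['content']
  if content is not None:
    selected_image = resize(PIL.Image.open(io.BytesIO(content)).convert("RGB"))
    with input_image_widget:
      # Assumed preview; the body of this block is missing in the source.
      display(selected_image)
upload_input_image_button.observe(upload_image, names='value')
def upload_video(change):
  global selected_video
  content = None  # stays None when nothing was uploaded
  for name, file_info in upload_input_video_button.value.items():
    content = file_info['content']
  if content is not None:
    selected_video = 'user/' + name
    preview = resize(PIL.Image.fromarray(thumbnail(content)).convert("RGB"))
    with input_video_widget:
      display(preview)
    # Save the upload so the FFMPEG reader can open it by path later.
    with open(selected_video, 'wb') as video:
      video.write(content)
upload_input_video_button.observe(upload_video, names='value')
def change_model(change):
  # Assumed behavior (body missing in the source): show the warning only for
  # non-face models, using the .warn class defined in the stylesheet.
  if model.value.startswith('vox'):
    warning.remove_class('warn')
  else:
    warning.add_class('warn')
model.observe(change_model, names='value')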
def generate(button):
  main.layout.display = 'none'
  loading.layout.display = ''
  filename = model.value + ('' if model.value == 'fashion' else '-cpk') + '.pth.tar'
  # Download the pretrained checkpoint once and cache it on disk.
  if not os.path.isfile(filename):
    download = requests.get(requests.get('https://cloud-api.yandex.net/v1/disk/public/resources/download?public_key=https://yadi.sk/d/lEw8uRm140L_eQ&path=/' + filename).json().get('href'))
    with open(filename, 'wb') as checkpoint:
      checkpoint.write(download.content)
  reader = imageio.get_reader(selected_video, mode='I', format='FFMPEG')
  fps = reader.get_meta_data()['fps']
  driving_video = []
  for frame in reader:
    driving_video.append(frame)
  generator, kp_detector = load_checkpoints(config_path='config/%s-256.yaml' % model.value, checkpoint_path=filename)
  # The trailing arguments are missing in the source; they are assumed here
  # from the variables and checkboxes defined above.
  predictions = make_animation(
    skimage.transform.resize(numpy.asarray(selected_image), (256, 256)),
    [skimage.transform.resize(frame, (256, 256)) for frame in driving_video],
    generator,
    kp_detector,
    relative=relative.value,
    adapt_movement_scale=adapt_movement_scale.value
  )
  if selected_video.startswith('user/') or selected_video == 'demo/videos/0.mp4':
    # Combine the generated frames with the streams of the driving video.
    imageio.mimsave('temp.mp4', [img_as_ubyte(frame) for frame in predictions], fps=fps)
    FFmpeg(inputs={'temp.mp4': None, selected_video: None}, outputs={'output.mp4': '-c copy -y'}).run()
  else:
    imageio.mimsave('output.mp4', [img_as_ubyte(frame) for frame in predictions], fps=fps)
  loading.layout.display = 'none'
  complete.layout.display = ''
  with output_widget:
    display(HTML('<video id="left" controls src="data:video/mp4;base64,%s" />' % b64encode(open('output.mp4', 'rb').read()).decode()))
  with comparison_widget:
    display(HTML('<video id="right" muted src="data:video/mp4;base64,%s" />' % b64encode(open(selected_video, 'rb').read()).decode()))
  # Keep the comparison video in sync with the output video.
  # The play and pause handler bodies are assumed; they are missing in the source.
  display(Javascript("""
  (function(left, right) {
    left.addEventListener("play", function() {
      right.play();
    });
    left.addEventListener("pause", function() {
      right.pause();
    });
    left.addEventListener("seeking", function() {
      right.currentTime = left.currentTime;
    });
  })(document.getElementById("left"), document.getElementById("right"));
  """))
#generate button for displaying the result
generate_button.on_click(generate)  # assumed wiring, not shown in the source
loading.layout.display = 'none'
complete.layout.display = 'none'
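The `resize` helper in the pipeline crops the largest centred square out of the input before scaling it to 256×256, so faces are not stretched when the upload is not square. The crop-box arithmetic can be checked on its own with the short sketch below, which uses the same formula. The function name `center_square_box` is hypothetical.

```python
def center_square_box(width, height):
    """(left, top, right, bottom) of the largest centred square in an image."""
    side = min(width, height)
    return (
        (width - side) // 2,
        (height - side) // 2,
        (width + side) // 2,
        (height + side) // 2,
    )

landscape_box = center_square_box(640, 480)
portrait_box = center_square_box(300, 500)
```

For a 640×480 frame this trims 80 pixels from the left and right edges. For a 300×500 portrait it trims 100 pixels from the top and bottom instead.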

Output

[Figures: input image and processed output]

The generated output is an animated video of the input image we provided.

End Notes

In this article, we saw how artificial intelligence is being used to create animations by harnessing the power of facial motion capture. We also saw how the First Order Motion Model creates animations from facial capture videos and brings still images to life while keeping the facial expressions intact. The implementation above is also available as a Colab notebook.

Happy Learning!


Official First Order Motion Model Paper

FOMM Github Repository 


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
