When a machine learning approach stops improving, researchers often look at how humans learn the same task and try to carry that insight over to the model. Video generation is a case in point: models sometimes fail to maintain the identity of an object when its pose or motion changes abruptly. From the model’s point of view, key attributes such as identity, pose and motion are entangled, so a change in one factor affects one or more of the others. Humans, in contrast, observe and understand these factors independently: a change in one does not affect the rest. Disentangled representation learning has evolved from attempts to bring this human ability into machine learning.
A disentangled representation is a feature representation in which the key factors are independent of each other. Disentanglement matters for tasks such as causal reasoning, fair machine learning and generative modeling, where it is needed to achieve strong performance. However, disentangled representation learning requires either supervision or an inductive bias. Generative models have typically relied on supervised learning with large amounts of training data, and the cost of preparing annotated data has led to a search for semi-supervised or unsupervised alternatives.
Against this background, Matthew J. Vowels, Necati Cihan Camgoz and Richard Bowden from the University of Surrey, United Kingdom, have developed VDSM, short for Video Disentanglement via State-Space Modeling. VDSM introduces a novel unsupervised approach that learns disentangled representations through inductive bias. It has a hierarchical architecture with a dynamic prior and an MoE (Mixture of Experts) decoder. The model separates time-varying factors, dynamic factors and static factors from each other, which leads to attribute independence: VDSM learns separate representations for an object’s identity and its motion. Despite being unsupervised, it greatly outperforms most supervised generative approaches.
Architecture of VDSM
The architecture of VDSM is built with disentanglement at its core. It separates the following three independent factors.
- Static latent factors, such as identity.
- Time-varying latent factors, such as pose.
- Action dynamics, such as waving a hand.
In a conventional generative model such as a bottleneck autoencoder, an encoder extracts features and a decoder reconstructs (or generates) images or videos from those features. Key factors such as identity, pose and action are entangled within the extracted features and cannot be separated.
VDSM incorporates state-space modeling blocks that connect the encoder and the decoder. These blocks disentangle the three factors listed above from the encoder’s extracted features and supply them to the decoder to generate output sequences (video).
The input image sequence (a set of n images) is fed into the encoder, which yields two different latent feature representations: one contains time-varying features (h), and the other contains static features (s). The time-varying features change continuously from frame to frame and help extract pose variations and action dynamics. The static features are constant for a given sequence and do not vary over time. They help extract the identity of an object or person.
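The split into a static code (s) and time-varying codes (h) can be illustrated with a toy example. This is only a sketch under a strong assumption: the static code is taken to be the time-average of the frame features, and the time-varying codes are the per-frame residuals. VDSM learns this separation with neural networks, and the function names and numbers below are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def split_static_dynamic(frames: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Toy split of per-frame features into a static code and residuals.

    Assumption: static code = time-average, time-varying code = residual.
    This stands in for the learned encoder and is not VDSM's method.
    """
    n = len(frames)
    dim = len(frames[0])
    static = [sum(f[i] for f in frames) / n for i in range(dim)]
    varying = [[f[i] - static[i] for i in range(dim)] for f in frames]
    return static, varying


def recombine(static: Vector, varying: List[Vector]) -> List[Vector]:
    """Rebuild frame features from a static code and time-varying codes."""
    return [[static[i] + h[i] for i in range(len(static))] for h in varying]


# Two hypothetical sequences of 3 frames with 2 features each.
person_a = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
person_b = [[10.0, 0.0], [10.0, 1.0], [10.0, 2.0]]

s_a, h_a = split_static_dynamic(person_a)
s_b, h_b = split_static_dynamic(person_b)

# Keep the "identity" of A and drive it with the "motion" of B.
swapped = recombine(s_a, h_b)
```

Because the two codes are stored separately, swapping one while keeping the other is a single function call, and this is the property that disentanglement aims for.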
Action dynamics must be disentangled from the time-varying features (h). A bi-directional LSTM encoder generates action dynamic features (d). These are transformed further into a stack of parameterized posterior distributions (z) through a seq2seq LSTM decoder and a combination network. The three disentangled feature sets are then ready for generation. In the decoder, an MoE (Mixture of Experts) generates output sequences by varying the dynamic and time-varying factors (action and pose, respectively) while keeping the static factor (identity) unaltered.
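The general idea behind a Mixture of Experts can be sketched in a few lines: a gate assigns a weight to each expert, and the output is the weighted blend of the expert outputs. The sketch below is generic and is not taken from the VDSM code. How VDSM computes its gate is not covered here, so the gate logits, the temperature parameter and the function names are all assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a lower temperature gives sharper weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(gate_logits: List[float],
                       expert_outputs: List[List[float]],
                       temperature: float = 1.0) -> List[float]:
    """Blend expert outputs using softmax gate weights (generic MoE sketch)."""
    weights = softmax(gate_logits, temperature)
    dim = len(expert_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, expert_outputs))
            for i in range(dim)]


# Two hypothetical experts, each producing a 2-dimensional output.
experts = [[1.0, 0.0], [0.0, 1.0]]

# Equal gate logits blend the experts evenly.
even_blend = mixture_of_experts([0.0, 0.0], experts)

# A low temperature makes the gate pick (almost) only the first expert.
sharp_blend = mixture_of_experts([2.0, 0.0], experts, temperature=0.1)
```

With a sharp gate, each input is effectively routed to one expert, which is one way a decoder can specialise per identity while sharing the rest of the network.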
Python Implementation of VDSM
Requirements of VDSM are as follows:
python 3.7.6, numpy 1.19.2, pandas 1.0.5, torch 1.6.0, torchvision 0.7.0, pyro 1.4.0, sklearn 0.23.2, imageio 2.9.0, scipy 1.5.3, pytz 2020.1, natsort 7.0.1, antialiased_cnns, nonechucks, argparse.
Further, VDSM requires a CUDA GPU runtime for both training and generation. Most of the code below follows the official source repository. The implementation requires the MUG-FED (Multimedia Understanding Group – Facial Expression Dataset) dataset. Download the MUG-FED dataset from its official site and preprocess each image sequence into a NumPy array of shape (sequence_length, 112, 112, 3), saved in .npz format.
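The preprocessing step can be sketched with NumPy alone. This is a minimal sketch: it assumes the frames of one sequence are already loaded as a uint8 array of shape (sequence_length, height, width, 3), and it uses a simple nearest-neighbour resize where a real pipeline would more likely use an image library. The function name, the file name and the array key in the commented save call are hypothetical, so check the repository’s data loader for the names it expects.

```python
import numpy as np


def resize_sequence(frames: np.ndarray, size: int = 112) -> np.ndarray:
    """Nearest-neighbour resize of (T, H, W, 3) frames to (T, size, size, 3)."""
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected an array of shape (T, H, W, 3)")
    _, height, width, _ = frames.shape
    # For every output row/column, pick the nearest source row/column.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return frames[:, rows][:, :, cols]


# Hypothetical input: 8 random frames of 240 x 320 pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(8, 240, 320, 3), dtype=np.uint8)

sequence = resize_sequence(video)

# Saving is left commented out; the file name and key are assumptions.
# np.savez_compressed("sequence_0001.npz", frames=sequence)
```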
Download the source files from the official GitHub repository.
!git clone https://github.com/matthewvowels1/DisentanglingSequences.git
Change directory to the newly downloaded source files.
!ls -p
%cd DisentanglingSequences/
Create a new directory, ‘data’ and a subdirectory ‘MUG-FED’ as expected by the model.
%cd /content/DisentanglingSequences/VDSM_release/
!mkdir ./data/
!ls -p
%cd ./data
!mkdir ./MUG-FED/
!ls -p
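Outside a notebook, the same directory layout can be created from plain Python. The sketch below uses only the standard library, and the helper name is hypothetical. The path in the usage comment is the Colab path used above and will differ on other machines.

```python
from pathlib import Path


def make_data_dirs(release_root: Path) -> Path:
    """Create <release_root>/data/MUG-FED and return the new directory."""
    target = Path(release_root) / "data" / "MUG-FED"
    # parents=True creates 'data' as well; exist_ok=True makes reruns harmless.
    target.mkdir(parents=True, exist_ok=True)
    return target


# Example call (adjust the root to where the repository was cloned):
# make_data_dirs(Path("/content/DisentanglingSequences/VDSM_release"))
```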
Move the preprocessed MUG-FED data to this newly created subdirectory. VDSM can use different pre-trained models, such as ResNeXt, Inception and ResNet, as its encoder. The default code implements an Inception model as the encoder. The following changes let users incorporate any suitable pre-trained model as the encoder (here, we implement a ResNet model).
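from collections import OrderedDict
from pathlib import Path

import torch

# CACHE_DIR and the resnet module are defined in the repository's own code.


def prepare_inception_model(weight_dir: Path = CACHE_DIR,
                            device: torch.device = torch.device("cpu")):
    # The original function name is kept so that existing callers keep working.
    filename = "resnext-101-kinetics.pth"
    weight_path = weight_dir / filename
    model = resnet.resnet101(num_classes=400, shortcut_type='B', cardinality=32,
                             sample_size=112, sample_duration=16, last_fc=False)
    model_data = torch.load(str(weight_path), map_location="cpu")
    # Strip the "module." prefix that DataParallel adds to parameter names.
    fixed_model_data = OrderedDict()
    for key, value in model_data["state_dict"].items():
        new_key = key.replace("module.", "")
        fixed_model_data[new_key] = value
    # strict=False tolerates keys that do not match the model definition.
    model.load_state_dict(fixed_model_data, strict=False)
    model = model.to(device)
    model.eval()  # inference mode: fixes dropout and batch-norm behaviour
    return model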
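Train the model for 200 epochs using the following command. Users may vary the parameters as needed. Note that the command as shown selects the Sprites configuration (--dataset_name sprites), so change the run and dataset names to train on another dataset. Training may take hours, depending on the available GPU and memory.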
# train the model
!python3 main.py --RUN release_test_sprites --rnn_layers 3 --rnn_dim 512 --bs 20 --seq_len 8 --epochs 200 --bs_per_epoch 50 \
    --num_test_ids 12 --dataset_name sprites --model_save_interval 50 --pretrained_model_VDSMEncDec 199 \
    --train_VDSMSeq True --train_VDSMEncDec True --model_test_interval 10 \
    --anneal_start_dynamics 0.1 --anneal_end_dynamics 1.0 --anneal_frac_dynamics 1 --lr_VDSMSeq 0.001 --z_dim 30 --n_e_w 40 \
    --dynamics_dim 50 --test_temp_id 10.0 --temp_id_end 10.0 --temp_id_start 10.0 --temp_id_frac 1 --anneal_end_id 1.0 --anneal_start_id .1 \
    --anneal_frac_id 3 --anneal_start_t 0.1 --anneal_mid_t1 0.4 --anneal_mid_t2 0.4 --anneal_end_t 1 --anneal_t_midfrac1 0.5 --anneal_t_midfrac2 0.8 \
    --rnn_dropout 0.2
Evaluate the model by setting training options to False, as shown in the following commands.
# evaluate the model
!python3 main.py --RUN release_test_sprites --rnn_layers 3 --rnn_dim 512 --bs 20 --seq_len 8 --epochs 200 --bs_per_epoch 50 \
    --num_test_ids 12 --dataset_name sprites --model_save_interval 50 --pretrained_model_VDSMEncDec 199 \
    --train_VDSMSeq False --train_VDSMEncDec False --model_test_interval 10 \
    --anneal_start_dynamics 0.1 --anneal_end_dynamics 1.0 --anneal_frac_dynamics 1 --lr_VDSMSeq 0.001 --z_dim 30 --n_e_w 40 \
    --dynamics_dim 50 --test_temp_id 10.0 --temp_id_end 10.0 --temp_id_start 10.0 --temp_id_frac 1 --anneal_end_id 1.0 --anneal_start_id .1 \
    --anneal_frac_id 3 --anneal_start_t 0.1 --anneal_mid_t1 0.4 --anneal_mid_t2 0.4 --anneal_end_t 1 --anneal_t_midfrac1 0.5 --anneal_t_midfrac2 0.8 \
    --rnn_dropout 0.2
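The training and evaluation commands differ only in the two train flags, so they can be generated from one shared flag list to avoid copy-and-paste mistakes. The helper below is a sketch with a hypothetical name. The flags and values are copied from the commands above, and the train flags are appended at the end on the assumption that main.py does not depend on argument order.

```python
import shlex
from typing import List

# Flags shared by the training and evaluation commands shown above.
COMMON_FLAGS = (
    "--RUN release_test_sprites --rnn_layers 3 --rnn_dim 512 --bs 20 "
    "--seq_len 8 --epochs 200 --bs_per_epoch 50 --num_test_ids 12 "
    "--dataset_name sprites --model_save_interval 50 "
    "--pretrained_model_VDSMEncDec 199 --model_test_interval 10 "
    "--anneal_start_dynamics 0.1 --anneal_end_dynamics 1.0 "
    "--anneal_frac_dynamics 1 --lr_VDSMSeq 0.001 --z_dim 30 --n_e_w 40 "
    "--dynamics_dim 50 --test_temp_id 10.0 --temp_id_end 10.0 "
    "--temp_id_start 10.0 --temp_id_frac 1 --anneal_end_id 1.0 "
    "--anneal_start_id .1 --anneal_frac_id 3 --anneal_start_t 0.1 "
    "--anneal_mid_t1 0.4 --anneal_mid_t2 0.4 --anneal_end_t 1 "
    "--anneal_t_midfrac1 0.5 --anneal_t_midfrac2 0.8 --rnn_dropout 0.2"
)


def build_vdsm_command(train: bool) -> List[str]:
    """Return the argument list for a training or an evaluation run."""
    flag = "True" if train else "False"
    return (
        ["python3", "main.py"]
        + shlex.split(COMMON_FLAGS)
        + ["--train_VDSMSeq", flag, "--train_VDSMEncDec", flag]
    )


train_cmd = build_vdsm_command(train=True)
eval_cmd = build_vdsm_command(train=False)

# To launch a run from the VDSM_release directory:
# import subprocess
# subprocess.run(eval_cmd, check=True)
```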
This notebook explains the implementation on the MUG dataset.
This notebook explains the implementation on the Sprites dataset.
Performance of VDSM
VDSM is trained and evaluated on the Sprites and MUG datasets alongside competing models such as G3AN, MoCoGAN, S3VAE, DSA, and MonkeyNet. The models are evaluated on different tasks, such as generation and action transformation. VDSM greatly outperforms every competitor on the reported metrics, namely Consistent Accuracy, Dynamic Accuracy and Identity Accuracy. While the top accuracy achieved by any other model on the MUG dataset is 70.51%, VDSM achieves 93.33%, a gap of about 22.8 percentage points.
This article has discussed VDSM (Video Disentanglement via State-Space Modeling), a recently developed state-of-the-art video generation model that disentangles dynamic and static factors. It has shown how this disentanglement lets the model generate video without losing identity, and it has covered the architecture and data flow along with a code implementation. VDSM may become one of the most acclaimed machine learning approaches, with the potential to match or even exceed human performance on this task.