GANs have made generating images, text, and other content much easier, and new applications appear every month. This article covers one striking application of deep learning: synthesizing new scenes from a single picture. More precisely, the model produces a sequence of progressive frames that follow on from the original image.
Infinite Nature, also known as perpetual view generation, lets you take an image and fly into it as a bird would, exploring the landscape. It generates a long sequence of novel views, new images that relate to the original but move progressively forward, corresponding to an arbitrarily long camera trajectory such as a bird's flight path. All this from a single image!
This is a challenging problem because the generation has to go far beyond the capabilities of current view synthesis models. Those models work only for a limited range of viewpoints around the base image from which synthesis starts. Pushed further, their output degenerates quickly, or the generated frames show only minimal changes.
The technique discussed in this article addresses these problems with a hybrid solution that integrates image synthesis and geometry in an iterative framework: render, refine, and repeat. This allows long-range generation that covers large distances even after hundreds of frames. The approach is trained on a set of monocular video sequences without any manual annotation, which saves a lot of time.
The key point is that the authors use the geometry of the image. First, a disparity map (a map showing how depth varies across the image) is created with a state-of-the-art network called MiDaS, which tells the model about the depths inside the image.
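To make the term concrete: disparity is commonly treated as inverse depth, so nearby points get large values and distant points get small ones. The following is a minimal sketch with NumPy on a made-up depth array. The helper name depth_to_disparity and the toy values are assumptions for illustration and are not part of MiDaS or Infinite Nature.

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Hypothetical helper: convert a depth map to disparity (inverse depth)."""
    depth = np.asarray(depth, dtype=np.float32)
    # Clamp very small depths so the division stays finite.
    return 1.0 / np.maximum(depth, eps)

# Toy 2x2 depth map: the top-left pixel is closest, the bottom-right is farthest.
toy_depth = np.array([[1.0, 2.0], [4.0, 10.0]], dtype=np.float32)
toy_disparity = depth_to_disparity(toy_depth)
print(toy_disparity)
```

The closest pixel ends up with the largest disparity, which is why a disparity map looks bright in the foreground and dark near the horizon.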
The goal of the renderer is to generate a new view from the previous one. The renderer is differentiable so that backpropagation can be used for training. It uses a 3D mesh to render the image from a novel viewpoint. Another state-of-the-art network, SPADE, then handles conditional image synthesis on the rendered view. This process repeats over and over, producing new frames that go deeper into the scene.
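The control flow of this render, refine and repeat cycle can be sketched in a few lines. This is only a toy illustration of the loop structure: the functions render, refine and next_pose below are placeholders invented for this sketch, and they stand in for the mesh renderer, the SPADE-based network and the camera controller without doing any real image processing.

```python
def render(frame, pose):
    """Placeholder renderer: move the frame to a new pose, leaving gaps to fill."""
    return {"image": frame["image"], "pose": pose, "holes": True}

def refine(rendered):
    """Placeholder refinement network: fill in the gaps left by rendering."""
    return {"image": rendered["image"], "pose": rendered["pose"], "holes": False}

def next_pose(pose):
    """Placeholder camera controller: advance one step along the trajectory."""
    return pose + 1

def fly(start_frame, num_steps):
    """Repeat render -> refine, feeding each output back in as the next input."""
    frames = [start_frame]
    frame = start_frame
    for _ in range(num_steps):
        pose = next_pose(frame["pose"])
        frame = refine(render(frame, pose))
        frames.append(frame)
    return frames

frames = fly({"image": "input.png", "pose": 0, "holes": False}, num_steps=5)
print([f["pose"] for f in frames])
```

The important design point is that each refined frame becomes the input of the next iteration, which is what lets the trajectory continue for hundreds of frames.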
Code Implementation
Below are the instructions for running the model locally.
Install the libraries listed in the provided requirements file.
# Install dependencies from the requirements file
pip3 install -r requirements.txt
As mentioned earlier, a 3D mesh renderer built for TensorFlow is required. The authors build the library with GCC instead of following the Bazel instructions given on GitHub.
The TensorFlow mesh renderer was originally written for TensorFlow versions below 2.x, but the authors have prepared a small patch that upgrades its functions to work on version 2.x.
# Download the TensorFlow mesh renderer from GitHub and compile it
source download_tf_mesh_renderer.sh
Now, download the required data and pre-trained checkpoints.
# Download the archive containing the model checkpoints
wget https://storage.googleapis.com/gresearch/infinite_nature_public/ckpt.tar.gz
# Extract the archive
tar xvf ckpt.tar.gz
Download the sample autocruise inputs.
# Sample inputs provided by the authors, as mentioned in the paper and the official GitHub repository
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input1.pkl
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input2.pkl
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input3.pkl
Each pickle file contains a dictionary whose entries hold a nature scene and the corresponding disparity map predicted by MiDaS.
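Before running the model, it can help to look inside one of these files. The sketch below lists the keys of a pickled dictionary together with the shape of each entry (or its type name when it has no shape). The helper describe_pickle is a hypothetical name introduced here, and the usage line assumes the file has already been downloaded into the working directory.

```python
import pickle

def describe_pickle(path):
    """Hypothetical helper: map each key of a pickled dict to its shape or type name."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return {
        key: getattr(value, "shape", type(value).__name__)
        for key, value in data.items()
    }

# Usage, assuming autocruise_input1.pkl is present in the working directory:
# print(describe_pickle("autocruise_input1.pkl"))
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.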
Run the code for 100 steps of Infinite Nature using autocruise, saving the frames to a file.
# Run the model for 100 steps; the frames are stored in the given output folder
python -m autocruise --output_folder=autocruise --num_steps=100
That covers running the pre-trained model on a local machine.
Let’s have a look at the application of Infinite Nature on Google Colab.
Importing Libraries
# imageio for image manipulation, IPython for showing images in the notebook
import imageio
import IPython
# numpy for arrays, pickle for reading the sample input files
import numpy as np
import pickle
# The authors' modules: config, infinite_nature_lib and fly_camera.
# These imports only work once the repository has been fetched and its
# folders have been added to sys.path (see the download and path setup cells).
import config
import infinite_nature_lib
import fly_camera
import tensorflow as tf
import tensorflow_hub as hub
Downloading model weights, sample data
# tensorflow, os and sys for libraries, environment variables and paths
import os
import sys
import tensorflow as tf
# Make sure dynamic linking is able to find the TensorFlow libraries
os.system('ldconfig ' + tf.sysconfig.get_lib())
# Let Python find the libraries defined by the authors
sys.path.append('infinite_nature')
sys.path.append('infinite_nature/tf_mesh_renderer/mesh_renderer')
# Tell the mesh renderer library where to load its .so file from
os.environ['TEST_SRCDIR'] = 'infinite_nature'
NOTE: The following snippet is taken from the official GitHub repository of Infinite Nature, which contains the links and the exact procedure.
%%shell
echo Fetching code from github...
# subversion is used to export only the infinite_nature folder of the repository
apt install subversion
svn export --force https://github.com/google-research/google-research/trunk/infinite_nature
# Fetch the sample inputs and the checkpoint archive
echo
echo Fetching trained model weights...
rm -f autocruise_input*.pkl
rm -f ckpt.tar.gz
rm -rf ckpt
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input1.pkl
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input2.pkl
wget https://storage.googleapis.com/gresearch/infinite_nature_public/autocruise_input3.pkl
wget https://storage.googleapis.com/gresearch/infinite_nature_public/ckpt.tar.gz
tar -xf ckpt.tar.gz
# Install the specific library versions listed in the requirements file
echo
echo Installing required dependencies...
pip install -r infinite_nature/requirements.txt
# Fetch the 3D mesh renderer and compile its kernels
echo
echo Fetching tf_mesh_renderer and compiling kernels...
cd infinite_nature
rm -rf tf_mesh_renderer
source download_tf_mesh_renderer.sh
echo Done.
Build Model
config.set_training(False)
# Path to the checkpoint file
mod_path = "ckpt/model.ckpt-6935893"
# Load the two model functions from the authors' library
render_refiner, style_encoding = infinite_nature_lib.load_model(mod_path)
# Initial RGB-D frames are taken from the sample inputs
initial_rgbds = [
    pickle.load(open("autocruise_input1.pkl", "rb"))['input_rgbd'],
    pickle.load(open("autocruise_input2.pkl", "rb"))['input_rgbd'],
    pickle.load(open("autocruise_input3.pkl", "rb"))['input_rgbd']]

# Code for an autopilot demo. We expose two functions that will be invoked
# from an HTML/JS frontend: reset and step.
# The state that we need to remember while flying:
state = {
    'intrinsics': None,
    'pose': None,
    'rgbd': None,
    'start_rgbd': None,
    'style_noise': None,
    'next_pose_function': None,
    # The offset is set later, when controlling with the mouse
    'direction_offset': None,
}

def current_image_png():
    # Encode the RGB channels of the current frame as a PNG for display
    img_data = tf.image.encode_png(
        tf.image.convert_image_dtype(state['rgbd'][..., :3], dtype=tf.uint8))
    return IPython.display.Image(data=img_data.numpy())
Reset Function
# Function to reset the RGB-D frame (the D channel is disparity)
def reset(rgbd=None):
    # Without a new input, restart from the stored starting frame
    if rgbd is None:
        rgbd = state['start_rgbd']
    ht, w, _ = rgbd.shape
    aspect_ratio = w / float(ht)
    # Resize the frame to 160x256
    rgbd_channel = tf.image.resize(rgbd, [160, 256])
    state['rgbd'] = rgbd_channel
    # Remember this frame as the default starting point
    state['start_rgbd'] = rgbd_channel
    state['pose'] = np.array(
        [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]],
        dtype=np.float32)
    # 0.8 focal_x corresponds to a FOV (field of view) of ~64 degrees.
    state['intrinsics'] = np.array(
        [0.8, 0.8 * aspect_ratio, .5, .5],
        dtype=np.float32)
    # No turning by default; the offset is set by the mouse or the autopilot
    state['direction_offset'] = (0.0, 0.0)
    state['style_noise'] = style_encoding(rgbd_channel)
    # Function that computes the next pose from the current frame
    state['next_pose_function'] = fly_camera.fly_dynamic(
        state['intrinsics'],
        state['pose'],
        # Turn the camera towards where the mouse points
        turn_function=(lambda _: state['direction_offset']))
    return current_image_png()
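The comment about a focal length of 0.8 giving a field of view of about 64 degrees can be checked with the pinhole camera relation FOV = 2 * atan(0.5 * extent / focal). The sketch below assumes the focal length is expressed as a fraction of the image extent, which is an assumption suggested by the principal point being (.5, .5). The function name fov_degrees is hypothetical.

```python
import math

def fov_degrees(focal, extent=1.0):
    """Field of view in degrees of a pinhole camera.

    Assumes focal and extent use the same units; with normalized
    intrinsics the image extent is 1.0.
    """
    return math.degrees(2.0 * math.atan(0.5 * extent / focal))

# Horizontal field of view for the focal length used in reset().
print(round(fov_degrees(0.8), 1))
```

A smaller focal length would widen the view and a larger one would narrow it, so this value controls how much of the landscape each generated frame covers.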
Step Function
# Function that generates the next frame in the given direction
def step(offx, offy):
    state['direction_offset'] = (offx, offy)
    # Ask the camera controller for the next pose
    next_pose = state['next_pose_function'](state['rgbd'])
    # Render and refine the new RGB-D frame at that pose
    next_rgbd = render_refiner(
        state['rgbd'], state['style_noise'],
        state['pose'], state['intrinsics'],
        next_pose, state['intrinsics'])
    state['pose'] = next_pose
    state['rgbd'] = next_rgbd
    return current_image_png()
Midas Disparity
# To run on user-supplied images, MiDaS v2 is used to obtain the initial disparity.
midas_mod = hub.load('https://tfhub.dev/intel/midas/v2/2', tags=['serve'])

def midas_dis(rgb):
    """Computes MiDaS v2 disparity on an RGB input image.

    Arguments:
      rgb: [H, W, 3] Range [0.0, 1.0].
    Returns:
      [H, W, 1] MiDaS disparity resized to the input size and in the range
      [0.0, 1.0].
    """
    size = rgb.shape[:2]
    resized_img = tf.image.resize(rgb, [384, 384], tf.image.ResizeMethod.BICUBIC)
    # The MiDaS network wants [1, C, H, W]
    midas_in = tf.transpose(resized_img, [2, 0, 1])[tf.newaxis]
    pred = midas_mod.signatures['serving_default'](midas_in)['default'][0]
    # Rescale the prediction to the range [0.0, 1.0]
    disp_min = tf.reduce_min(pred)
    disp_max = tf.reduce_max(pred)
    pred = (pred - disp_min) / (disp_max - disp_min)
    return tf.image.resize(
        pred[..., tf.newaxis], size, method=tf.image.ResizeMethod.AREA)
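The rescaling of the MiDaS output to the range [0.0, 1.0] is a plain min-max normalization. The NumPy sketch below shows the same arithmetic on a made-up array and adds a small guard for a constant input, where the range would otherwise be zero. The helper normalize_01 and the guard are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np

def normalize_01(values, eps=1e-8):
    """Hypothetical helper: min-max rescale an array to [0.0, 1.0]."""
    values = np.asarray(values, dtype=np.float32)
    lo, hi = values.min(), values.max()
    # Guard against division by zero when every value is identical.
    return (values - lo) / max(float(hi - lo), eps)

raw = np.array([[2.0, 4.0], [6.0, 10.0]])
scaled = normalize_01(raw)
print(scaled)
```

The smallest input maps to 0.0 and the largest to 1.0, which matches the range of the disparity maps stored in the sample inputs as described above.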
Load Function
# Start from one of the sample RGB-D inputs
def load_initial(i):
    return reset(rgbd=initial_rgbds[i])

def load_image(data):
    '''
    Data converted from JavaScript ends up as a string, so it needs to be
    converted back to bytes using Latin-1 encoding (maps 0-255 to 0-255).
    '''
    data = data.encode('Latin-1')
    # Decode the image bytes into three RGB channels
    rgb = tf.image.decode_image(data, channels=3, dtype=tf.float32)
    # Resize to the 160x256 frame size that reset() also uses
    resized = tf.image.resize(rgb, [160, 256], tf.image.ResizeMethod.AREA)
    # Concatenate with the MiDaS disparity map from the previous function
    rgbd = tf.concat([resized, midas_dis(resized)], axis=-1)
    return reset(rgbd=rgbd)
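The reason Latin-1 works for this conversion is that it maps each of the 256 possible byte values to the code point with the same number, so no information is lost in either direction. A short, self-contained check using only the standard library (the variable names are illustrative):

```python
# Every possible byte value, 0 through 255.
raw_bytes = bytes(range(256))

# Decoding as Latin-1 yields one character per byte.
as_text = raw_bytes.decode("latin-1")

# Encoding back as Latin-1 restores the original bytes exactly.
restored = as_text.encode("latin-1")
print(restored == raw_bytes)

# UTF-8, by contrast, uses two bytes for code points 128-255,
# so it would not give back the original image data.
print(len(as_text.encode("utf-8")))
```

This is why the image string coming from the frontend has to be encoded with Latin-1 rather than the default UTF-8 before it is decoded as an image.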
Output
The frontend for this application in HTML is given here.
# Colab helper for registering Python callbacks that the frontend can invoke
from google.colab import output
# Display the frontend; html holds the HTML script provided above
display(IPython.display.HTML(html))
# Load one of the sample inputs as the base image
output.register_callback('initial', load_initial)
# Load a user-supplied image
output.register_callback('image', load_image)
# Reset to the starting RGB-D frame
output.register_callback('reset', reset)
# Step forward to generate a new frame
output.register_callback('step', step)
EndNote
The output can be viewed here. We have successfully built a frontend application for the Infinite Nature model. A dataset I can recommend trying is ACID (Aerial Coastline Imagery Dataset).
Try varying the camera position in the application. An earlier approach to this problem can be read here.