Guide to OpenPose for Real-time Human Pose Estimation

OpenPose is a real-time multi-person keypoint detection library, and the first to demonstrate jointly detecting human body, face, and foot keypoints. The project is the work of Gines Hidalgo, Zhe Cao, Tomas Simon, Shih-En Wei, Hanbyul Joo, and Yaser Sheikh, and it relies heavily on the CMU Panoptic Studio dataset.

OpenPose is written in C++ and Caffe. Today we are going to look at this very popular library (almost 19.8k stars and 6k forks on GitHub) through a small Python implementation. The authors provide builds for several operating systems and languages, and you can try it on your local machine with or without a GPU, on Linux or on other operating systems.

The OpenPose library has many features; let's see some of the most remarkable ones:

  • Real-time 2D multi-person keypoint detection.
  • Real-time 3D single-person keypoint detection.
  • A calibration toolbox for estimating distortion and intrinsic and extrinsic camera parameters.
  • Single-person tracking to speed up detection and smooth the visualization.

OpenPose Pipeline

Figure 1. Full pipeline steps involved in OpenPose

Before going into the implementation, let's look at the pipeline followed by OpenPose.

  1. First, an input RGB (red, green, blue) image is fed into a "two-branch multi-stage" convolutional neural network (CNN); that is, the CNN produces two different outputs.
  2. The top branch, shown in beige in the figure above, predicts the confidence maps (Figure 1b) of different body parts, such as the right eye, left eye, and right elbow.
  3. The bottom branch predicts the affinity fields (Figure 1c), which represent the degree of association between different body parts in the input image.
  4. Next, the confidence maps and affinity fields are processed by greedy inference (Figure 1d).
  5. Finally, the 2D keypoints for all people in the image are output, as shown in Figure 1e.
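The first parsing step above, turning a confidence map into body-part candidates, can be sketched with a toy non-maximum-suppression peak finder (this is an illustrative simplification, not OpenPose's actual implementation):

```python
import numpy as np

def find_peaks(conf_map, threshold=0.1):
    """Toy sketch of extracting body-part candidates from one confidence map:
    a pixel counts as a peak if it exceeds the threshold and all 4 neighbours."""
    peaks = []
    h, w = conf_map.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = conf_map[y, x]
            if (v > threshold and
                    v > conf_map[y - 1, x] and v > conf_map[y + 1, x] and
                    v > conf_map[y, x - 1] and v > conf_map[y, x + 1]):
                peaks.append((x, y, v))
    return peaks

# Two synthetic "right wrist" blobs, as if two people were in the frame.
cmap = np.zeros((40, 40))
ys, xs = np.mgrid[0:40, 0:40]
for cx, cy in [(10, 12), (30, 25)]:
    cmap = np.maximum(cmap, np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / 8.0))

print(find_peaks(cmap))  # one candidate per person
```

Each peak becomes a candidate detection for that body part; the affinity fields are then used to decide which candidates belong to the same person.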

In order to capture finer outputs, a multi-stage approach is used to increase the depth of the neural network: the stages are stacked one on top of the other, and each stage refines the predictions of the previous one.
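The stacking idea can be sketched with NumPy: stage 1 sees only the backbone image features, while each later stage sees those features concatenated with the previous stage's confidence maps and affinity fields. All names, channel counts, and the stand-in "convolution" below are illustrative, not OpenPose's real layers:

```python
import numpy as np

H, W = 46, 46              # feature-map resolution (illustrative)
F_CH, J, C = 128, 19, 38   # image-feature, confidence-map, and PAF channels (illustrative)

def stage(inp, out_channels, rng):
    """Stand-in for one CNN branch: a random 1x1 'convolution' that maps
    inp's channels to out_channels (real stages are stacks of convolutions)."""
    weights = rng.standard_normal((inp.shape[0], out_channels))
    return np.einsum('chw,co->ohw', inp, weights)

rng = np.random.default_rng(0)
F = rng.standard_normal((F_CH, H, W))        # features from the VGG-style backbone

# Stage 1 sees only F; later stages see F concatenated with previous S and L.
S = stage(F, J, rng)
L = stage(F, C, rng)
for t in range(2, 4):                        # a few refinement stages
    inp = np.concatenate([F, S, L], axis=0)  # (128 + 19 + 38) input channels
    S = stage(inp, J, rng)
    L = stage(inp, C, rng)

print(S.shape, L.shape)  # (19, 46, 46) (38, 46, 46)
```

The point of the sketch is the wiring, not the weights: feeding earlier predictions back in is what lets later stages resolve ambiguities, as Figure 3 below illustrates.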

Figure 2. OpenPose Architecture of the two-branch multi-stage CNN.

In Figure 2, the top branch of the OpenPose network produces a set of detection confidence maps S. Following the paper's notation, the set is defined as

S = (S₁, S₂, …, S_J), where S_j ∈ ℝ^(w×h), j ∈ {1, …, J},

with one confidence map S_j per body part.
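For training, the paper defines the ground-truth confidence map for part j of person k as a Gaussian peak at the annotated keypoint, S*_{j,k}(p) = exp(−‖p − x_{j,k}‖² / σ²), with per-person maps merged by a pixel-wise max so nearby peaks stay distinct. A minimal sketch (grid size and σ are illustrative):

```python
import numpy as np

def confidence_map(keypoint, shape=(46, 46), sigma=2.0):
    """Ground-truth confidence map for one body part of one person:
    a Gaussian peak at the annotated (x, y) keypoint location."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - keypoint[0]) ** 2 + (ys - keypoint[1]) ** 2
    return np.exp(-d2 / sigma ** 2)

# Two people: merge the per-person maps with a pixel-wise maximum.
people = [(12, 20), (30, 8)]
S_j = np.max([confidence_map(p) for p in people], axis=0)
print(S_j.shape, S_j.max())  # (46, 46) 1.0
```

Using max rather than sum keeps the peaks sharp when two people's keypoints are close together.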

Outputs of Multi-Stage OpenPose network

Let's see how the pipeline outputs evolve stage by stage until, at the end, we get the pose estimation on the input image.

Figure 3. The outcome of a multi-stage network.

In Figure 3 above, the TOP row (blue overlay) shows the OpenPose network predicting confidence maps of the right wrist across different stages, while the BOTTOM row shows it predicting the Part Affinity Fields of the right forearm (right shoulder to right wrist) across the same stages.
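To turn those Part Affinity Fields into limb associations, the paper approximates a line integral: sample the 2-channel field along the segment between two candidate parts and average its dot product with the segment's unit vector. A toy NumPy sketch (field, sizes, and sampling count are illustrative):

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    """Approximate the PAF line integral between candidates p1 and p2
    (given as (x, y)): average dot product of the sampled field with
    the limb's unit direction vector."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    u = v / norm
    score = 0.0
    for t in np.linspace(0, 1, n_samples):
        x, y = (p1 + t * v).round().astype(int)
        score += paf[:, y, x] @ u   # channels are (field_x, field_y)
    return score / n_samples

# Synthetic field pointing along +x everywhere (e.g. shoulder -> wrist to the right).
paf = np.zeros((2, 32, 32))
paf[0] = 1.0
print(paf_score(paf, (5, 10), (25, 10)))  # high score: field agrees with the limb
print(paf_score(paf, (5, 10), (5, 28)))   # ~0: field is perpendicular to the limb
```

Candidate pairs are then matched greedily by this score, which is what lets OpenPose assemble whole skeletons without ever running a per-person detector.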


With the commands below, OpenPose is cloned from GitHub into your Google Colab GPU runtime; they install a newer CMake (required for CUDA 10), install all the dependencies needed to run the library, and build it. We will also need the youtube-dl library to run OpenPose pose estimation and keypoint detection directly on YouTube videos.

Installing OpenPose

import os
from os.path import exists, join, basename, splitext

# Official OpenPose repository.
git_repo_url = 'https://github.com/CMU-Perceptual-Computing-Lab/openpose.git'
project_name = splitext(basename(git_repo_url))[0]
if not exists(project_name):
  # install a newer CMake because of CUDA10
  # (download URL assumed; adjust the version/path if it has moved)
  !wget -q https://cmake.org/files/v3.13/cmake-3.13.0-Linux-x86_64.tar.gz
  !tar xfz cmake-3.13.0-Linux-x86_64.tar.gz --strip-components=1 -C /usr/local
  # clone openpose
  !git clone -q --depth 1 $git_repo_url
  # pin the bundled Caffe to a known-good commit
  !sed -i 's/execute_process(COMMAND git checkout master WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}\/3rdparty\/caffe)/execute_process(COMMAND git checkout f019d0dfe86f49d1140961f8c7dec22130c83154 WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}\/3rdparty\/caffe)/g' openpose/CMakeLists.txt
  # install system dependencies
  !apt-get -qq install -y libatlas-base-dev libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler libgflags-dev libgoogle-glog-dev liblmdb-dev opencl-headers ocl-icd-opencl-dev libviennacl-dev
  # install python dependencies
  !pip install -q youtube-dl
  # build openpose
  !cd openpose && rm -rf build || true && mkdir build && cd build && cmake .. && make -j`nproc`
from IPython.display import YouTubeVideo

Input & preprocess a Custom video for Pose estimation

We are going to use the Charlie Chaplin video as our input sample; for a quick test, we don't need to process the whole clip.


Input video:


import io
import base64
from IPython.display import HTML

# Helper to play a local MP4 inline in the notebook
# (function name and default size are illustrative).
def show_local_mp4_video(file_name, width=640, height=480):
  # read the video file and embed it as a base64 data URI
  video_encoded = base64.b64encode(io.open(file_name, 'rb').read())
  return HTML(data='''<video width="{0}" height="{1}" alt="test" controls>
                        <source src="data:video/mp4;base64,{2}" type="video/mp4" />
                      </video>'''.format(width, height, video_encoded.decode('ascii')))
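Besides the rendered video, OpenPose can emit one JSON file per frame when run with its `--write_json` flag; each detected person carries a flat `pose_keypoints_2d` array of (x, y, confidence) triples. A small parser sketch (function name and the synthetic sample frame are illustrative):

```python
import json

def parse_pose_json(text):
    """Parse one OpenPose --write_json frame into per-person keypoint lists.
    pose_keypoints_2d is a flat [x1, y1, c1, x2, y2, c2, ...] array."""
    frame = json.loads(text)
    people = []
    for person in frame.get("people", []):
        flat = person["pose_keypoints_2d"]
        people.append([(flat[i], flat[i + 1], flat[i + 2])
                       for i in range(0, len(flat), 3)])
    return people

# Minimal synthetic frame with one person and two keypoints.
sample = ('{"version": 1.3, "people": '
          '[{"pose_keypoints_2d": [100.0, 50.0, 0.9, 110.0, 60.0, 0.8]}]}')
print(parse_pose_json(sample))
# [[(100.0, 50.0, 0.9), (110.0, 60.0, 0.8)]]
```

This is handy when you want the raw keypoints for downstream analysis rather than the overlaid output video.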

Output video:


OpenPose is one of the best libraries for pose estimation and body keypoint detection, including accurate detection of the feet, limb joints, and face. To learn more, explore the official OpenPose repository and the accompanying research papers for a deeper understanding of how it works.


Mohit Maithani
Mohit is a data and technology enthusiast with good exposure to solving real-world problems in various avenues of IT and the deep learning domain. He believes in solving humans' daily problems with the help of technology.
