Make Python Code Faster With Numba

Numba is an open-source Just-In-Time compiler that enables Python developers to translate Python and NumPy code directly into machine code.

One of the major complaints that people, mostly die-hard C++ users, have about Python is that it’s slow. It’s true: Python is a dynamically typed, interpreted language, and it is slow. What many people don’t know is that Python can still give you direct access to your hardware for intensive calculations. Numba is an open-source Just-In-Time compiler that does exactly that. It lets Python developers translate a subset of Python and NumPy code directly into machine code, using the LLVM compiler as its backend. In addition, Numba offers a range of options for parallelizing Python code on CPUs and GPUs, often with minimal code changes. There are many ways to approach compiling Python; Numba’s approach is to compile individual functions, or collections of functions, just in time as you need them.


Numba takes the bytecode of your function and looks at the types of the arguments you pass to it. The arguments, which arrive as Python objects, are translated into representations with no CPython dependencies; this process is called “unboxing”. Once Numba has these two things, it runs an analysis pipeline to infer the types of everything inside the function from the types that were passed in. It then generates an intermediate representation (IR) of the function with all the data types filled in. LLVM does most of the hard work: it inlines functions, auto-vectorizes loops, applies the other low-level optimizations you would expect from a C compiler, and generates the machine code. This machine code is cached, so the next time the function is called with the same argument types, Numba skips the pipeline and goes straight to the compiled code.

An important thing to note is that Numba doesn’t interact with or change the interpreter. This means it can only optimize what is locally possible inside the function; for instance, it can’t go to other parts of your program and say, “Oh, this operation would be a lot faster if this list were a NumPy array”. Numba also looks for calls to built-in and NumPy functions and swaps them out for its own implementations.


Using Numba to make Python & NumPy code faster

Numba can be installed from PyPI as:

pip install numba

Numba uses decorators to convert Python functions into functions that compile themselves. The most common Numba decorator is @jit. Let’s create an example function and see @jit in action.

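 import numpy as np
 from numba import jit

 @jit(nopython=True)  # compile on first call, with no interpreter fallback
 def example_function(n):
     trace = 0.0
     for i in range(n.shape[0]):  # walk along the diagonal
         trace += np.tanh(n[i, i])
     return n + trace  # broadcast the scalar over the whole array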

The nopython=True option tells Numba to fully compile the function so that the Python interpreter calls are removed completely. If the function cannot be fully compiled, an exception is raised; these exceptions usually point to places in the function that need to be refactored to achieve better-than-Python performance. Using nopython=True is strongly recommended.

We’ll use the %timeit magic function to measure execution time because it runs the function multiple times, which gives a more accurate estimate for short functions. Our function has not been compiled yet; to trigger compilation, we need to call it:

 n = np.arange(10000).reshape(100, 100)
 %timeit example_function(n) 
 The slowest run took 20086.53 times longer than the fastest. This could mean that an intermediate result is being cached.
 1 loop, best of 5: 11.9 µs per loop 

The function was compiled, executed and cached. Now when it is called again, the previously generated machine code is executed directly without any need for compilation. 

%timeit example_function(n)
 The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
 100000 loops, best of 5: 11.8 µs per loop 

When benchmarking Numba-compiled functions, it is important to time them without including the compilation step since the compilation of a given function only happens once. Let’s compare to the uncompiled function. Numba-compiled functions have a .py_func attribute that can be used to access the original uncompiled Python function. 

%timeit example_function.py_func(n)
 The slowest run took 6.77 times longer than the fastest. This could mean that an intermediate result is being cached.
 1000 loops, best of 5: 239 µs per loop 

The original Python function is more than 20 times slower than the Numba-compiled version. However, our example function used explicit loops, which are very fast in Numba and not so much in Python. Our function is really simple so we can try optimizing it by rewriting it using only NumPy expressions:

 def numpy_example(n):
     return n + np.tanh(np.diagonal(n)).sum()

 %timeit numpy_example(n)
 The slowest run took 8.53 times longer than the fastest. This could mean that an intermediate result is being cached.
 10000 loops, best of 5: 29.2 µs per loop

The refactored NumPy version is roughly 8 times faster than the Python version (29.2 µs versus 239 µs) but still about 2.5 times slower than the Numba-compiled version.

Multithreading with Numba

Operations on NumPy array expressions are often broadcasted independently over the input elements and have a significant amount of implied parallelism. Numba’s ParallelAccelerator optimization identifies this parallelism and automatically distributes it over several threads. To enable the parallelization pass, all we need to do is use the parallel=True option.

 SQRT_2PI = np.sqrt(2 * np.pi)
 @jit(nopython=True, parallel=True)
 def gaussians(x, means, widths):
     n = means.shape[0]
     result = np.exp( -0.5 * ((x - means) / widths)**2 ) / widths
     return result / SQRT_2PI / n

Let’s call the function once to compile it:

means = np.random.uniform(-1, 1, size=1000000)
widths = np.random.uniform(0.1, 0.3, size=1000000)
gaussians(0.4, means, widths)

Now we can accurately compare the effect of threading and compiling with the normal Python version:

 gaussians_nothread = jit(nopython=True)(gaussians.py_func)
 %timeit gaussians(0.4, means, widths)  # numba-compiled and threading
 %timeit gaussians_nothread(0.4, means, widths) # no threading
 %timeit gaussians.py_func(0.4, means, widths) # normal python 
 10 loops, best of 5: 20.3 ms per loop
 1 loop, best of 5: 26.1 ms per loop
 10 loops, best of 5: 28.4 ms per loop 

There are situations suited to multithreading where there is no array expression but rather a loop in which each iteration is independent of the others. In these cases, we can use prange() in a for loop to tell ParallelAccelerator that the loop can be executed in parallel:

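 import random
 import numba
 # Serial version (also compiled, so the comparison isolates the effect of threading)
 @jit(nopython=True)
 def monte_carlo_pi_serial(nsamples):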
     acc = 0
     for i in range(nsamples):
         x = random.random()
         y = random.random()
         if (x**2 + y**2) < 1.0:
             acc += 1
     return 4.0 * acc / nsamples

 # Parallel version
 @jit(nopython=True, parallel=True)
 def monte_carlo_pi_parallel(nsamples):
     acc = 0
     # Only change is here
     for i in numba.prange(nsamples):
         x = random.random()
         y = random.random()
         if (x**2 + y**2) < 1.0:
             acc += 1
     return 4.0 * acc / nsamples

 %time monte_carlo_pi_serial(int(4e8))
 %time monte_carlo_pi_parallel(int(4e8)) 
 CPU times: user 5.07 s, sys: 23.9 ms, total: 5.09 s
 Wall time: 5.06 s
 CPU times: user 9.41 s, sys: 17 ms, total: 9.43 s
 Wall time: 4.9 s 

The above code implementations have been taken from the official tutorial binder available here.

One thing to note here is that prange() automatically handles the reduction variable acc in a thread-safe way. Additionally, Numba automatically initializes the random number generator in each thread independently.

Alternatively, you can use modules like concurrent.futures or Dask to run functions in multiple threads. For these use cases, ParallelAccelerator isn’t helpful; we only want the Numba-compiled function to run concurrently in different threads. To accomplish this, the Numba function needs to release the Global Interpreter Lock (GIL) during execution, which can be done with the nogil=True option.

If you want a more in-depth understanding of Numba, I highly recommend watching Gil Forsyth’s SciPy 2017 talk on Numba; or, in true Numba time-saving spirit, just refer to the documentation.

Aditya Singh
A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.
