One of the major complaints that people, mostly die-hard C++ users, have about Python is that it’s slow. It’s true: Python is a dynamically typed, interpreted language, and pure-Python numerical loops are slow. What many people don’t know is that Python code can be compiled to native machine code to perform intensive calculations. Numba is an open-source just-in-time (JIT) compiler that does exactly that. It translates a subset of Python and NumPy code directly into machine code, using LLVM as the compiler backend. In addition, Numba offers a range of options for parallelizing Python code on CPUs and GPUs, often with only small code changes. There are many ways to approach compiling Python; the approach Numba takes is to compile individual functions, or collections of functions, just in time, as you need them.
Numba takes the bytecode of your function and looks at the types of the arguments you pass to it. The arguments, which arrive as Python objects, are translated into native representations with no CPython dependencies. This process is called “unboxing”. Once Numba has these two things, it runs an analysis pipeline to infer the types of everything inside the function based on what was passed in. It then generates an intermediate representation (IR) of the function, annotated with those data types. LLVM is responsible for most of the hard work: it inlines functions, auto-vectorizes loops, performs the other low-level optimizations you would expect from a C compiler, and generates the machine code. This machine code is cached so that the next time the function is called with the same argument types, Numba doesn’t need to go through the whole pipeline and can skip straight to the end.
An important thing to note is that Numba doesn’t interact with or change the interpreter. This means it can only optimize what is locally visible in the function; for instance, it can’t look at other parts of your program and decide that “this operation would be a lot faster if this list were a NumPy array”. Numba also looks for built-in and NumPy functions and swaps them out for its own implementations.
Using Numba to make Python & NumPy code faster
Numba can be installed from PyPI as:
pip install numba
Numba uses decorators to convert Python functions into functions that compile themselves. The most common Numba decorator is @jit. Let’s create an example function and see @jit in action.
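import numpy as np
from numba import jit

@jit(nopython=True)
def example_function(n):
    trace = 0.0
    # Sum tanh over the diagonal; assumes n is a square 2-D array
    for i in range(n.shape[0]):
        trace += np.tanh(n[i, i])
    # Broadcast the scalar trace over the whole array
    return n + trace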
The nopython=True option tells Numba to compile the function fully, so that calls into the Python interpreter are removed completely. If that is not possible, an exception is raised; these exceptions usually indicate places in the function that need to be refactored to achieve better-than-Python performance. Using nopython=True is strongly recommended.
We’ll be using the %timeit magic function to measure execution time because it runs the function multiple times, which gives a more accurate estimate for short functions. Our function has not been compiled yet; compilation happens the first time we call it:
n = np.arange(10000).reshape(100, 100)
%timeit example_function(n)

The slowest run took 20086.53 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 11.9 µs per loop
The function was compiled, executed and cached. Now, when it is called again, the previously generated machine code is executed directly without any need for compilation:

%timeit example_function(n)

The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 5: 11.8 µs per loop
When benchmarking Numba-compiled functions, it is important to time them without including the compilation step, since the compilation of a given function for a given set of argument types only happens once. Let’s compare with the uncompiled function. Numba-compiled functions have a .py_func attribute that can be used to access the original, uncompiled Python function:

%timeit example_function.py_func(n)

The slowest run took 6.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 239 µs per loop
The original Python function is more than 20 times slower than the Numba-compiled version. However, our example function uses explicit loops, which are very fast in Numba but slow in Python. Our function is really simple, so we can try optimizing it by rewriting it using only NumPy expressions:
def numpy_example(n):
    # Same computation as example_function, expressed with NumPy calls only
    return n + np.tanh(np.diagonal(n)).sum()

%timeit numpy_example(n)

The slowest run took 8.53 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 29.2 µs per loop
The refactored NumPy version is roughly 8 times faster than the Python version (29.2 µs versus 239 µs) but still more than twice as slow as the Numba-compiled version.
Multithreading with Numba
Operations in NumPy array expressions are often applied independently to each input element and therefore carry a significant amount of implied parallelism. Numba’s ParallelAccelerator optimization identifies this parallelism and automatically distributes the work over several threads. To enable the parallelization pass, all we need to do is use the parallel=True option:
SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def gaussians(x, means, widths):
    # Number of Gaussians, used to normalize the result
    n = means.shape[0]
    result = np.exp(-0.5 * ((x - means) / widths)**2) / widths
    return result / SQRT_2PI / n
Let’s call the function once to compile it:
means = np.random.uniform(-1, 1, size=1000000)
widths = np.random.uniform(0.1, 0.3, size=1000000)

gaussians(0.4, means, widths)
Now we can accurately compare the effect of threading and compiling with the normal Python version:
gaussians_nothread = jit(nopython=True)(gaussians.py_func)

%timeit gaussians(0.4, means, widths)           # numba-compiled and threading
%timeit gaussians_nothread(0.4, means, widths)  # no threading
%timeit gaussians.py_func(0.4, means, widths)   # normal python

10 loops, best of 5: 20.3 ms per loop
1 loop, best of 5: 26.1 ms per loop
10 loops, best of 5: 28.4 ms per loop
There are situations suited to multithreading where there is no array expression but rather a loop in which each iteration is independent of the others. In these cases, we can use prange() in a for loop to indicate to ParallelAccelerator that this loop can be executed in parallel:
import random
import numba

# Serial version
@jit(nopython=True)
def monte_carlo_pi_serial(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

# Parallel version
@jit(nopython=True, parallel=True)
def monte_carlo_pi_parallel(nsamples):
    acc = 0
    # Only change is here
    for i in numba.prange(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

%time monte_carlo_pi_serial(int(4e8))
%time monte_carlo_pi_parallel(int(4e8))

CPU times: user 5.07 s, sys: 23.9 ms, total: 5.09 s
Wall time: 5.06 s
CPU times: user 9.41 s, sys: 17 ms, total: 9.43 s
Wall time: 4.9 s
The above code implementations have been taken from the official tutorial binder available here.
One thing to note here is that prange() automatically handles the reduction variable acc in a thread-safe way. Additionally, Numba automatically initializes the random number generator in each thread independently.
Alternatively, you can also use modules like Dask to run functions in multiple threads. For these use-cases, ParallelAccelerator isn’t helpful; we only want to obtain the Numba-compiled function to run concurrently in different threads. For accomplishing this, we need the Numba function to release the Global Interpreter Lock (GIL) during execution. This can be done using the