In the real world, datasets are often huge, which is a challenge for every data science programmer: processing them takes a lot of time, so we need techniques that speed up our algorithms. Most of us are familiar with parallelization, which distributes work across the available CPU cores. Python ships with two standard-library modules for this purpose, multiprocessing and threading.
Multi-Processing: Multiprocessing refers to a system's ability to use more than one processor at the same time. In Python, the multiprocessing module spawns separate processes that run in parallel and do not share memory with each other by default.
Threading: Threads are components of a process that run concurrently within it, and all threads of a process share the same memory.
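To see the memory difference concretely, here is a minimal sketch (not part of the original article) in which a child process and a thread each append to the same list. The thread's change is visible to the main program because memory is shared, while the process's change is not:

import multiprocessing
import threading

results = []

def worker():
    results.append(1)   # modify the "shared" list

if __name__ == '__main__':
    # a separate process works on its own copy of results, so the parent sees no change
    p = multiprocessing.Process(target=worker)
    p.start(); p.join()
    print('After process:', results)   # []

    # a thread shares the parent's memory, so the change is visible
    t = threading.Thread(target=worker)
    t.start(); t.join()
    print('After thread:', results)    # [1]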
In this article, we will first measure how long a problem takes to solve with a traditional, sequential approach. We will then look at parallelization techniques, multiprocessing and multithreading, that can reduce the running time of a data science workload. When working with larger machine learning implementations, time complexity is a major concern, and this article shows how to address it.

Practical Implementation
Normal Approach
First, we measure the time taken by the Python program when we follow the normal, sequential approach.
import time

def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

starttime = time.time()
for i in range(1, 10):
    time.sleep(2)   # simulate a slow, time-consuming task
    even(i)
print()
print('Time taken = {} seconds'.format(time.time() - starttime))
Using Multiprocessing
import time
import multiprocessing

def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)   # simulate a slow, time-consuming task
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    for i in range(1, 10):
        # spawn one process per number so they all run at the same time
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    # wait for every process to finish before reading the clock
    for process in processes:
        process.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))
From the above result, we can see that the computation time drops drastically, from 18.01 seconds to 2.07 seconds, using the Process class: all nine processes sleep through their two seconds at the same time instead of one after another.
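Because the "work" here is a sleep, all nine processes can overlap even on a machine with fewer cores; for genuinely CPU-bound work, the realistic ceiling is the number of cores. A minimal, standalone sketch to check that number:

import multiprocessing

# report how many CPU cores are available for truly parallel work
print('CPU cores available:', multiprocessing.cpu_count())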
Using Pool Class
import time
import multiprocessing

def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)   # simulate a slow, time-consuming task
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    # by default the pool creates one worker process per CPU core
    pool = multiprocessing.Pool()
    pool.map(multiprocessing_func, range(1, 10))
    pool.close()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))
The time taken for the computation has been reduced from 18.01 seconds to 10.04 seconds using the Pool class. This is slower than the Process approach above because a Pool creates only as many worker processes as there are CPU cores by default, so the nine two-second tasks are executed in batches rather than all at once. For maximum speed on this example, the Process class technique is the better choice.
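If you still want the convenience of Pool.map, you can pass the processes argument explicitly. This is a minimal sketch, not part of the original article, that reuses the multiprocessing_func defined above; with nine workers the timing should be close to the Process version:

if __name__ == '__main__':
    starttime = time.time()
    # one worker per task, so all nine sleeps overlap
    with multiprocessing.Pool(processes=9) as pool:
        pool.map(multiprocessing_func, range(1, 10))
    print('Time taken = {} seconds'.format(time.time() - starttime))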
Threading
import time
import threading

def even(num):
    if num % 2 == 0:
        print('The Number ' + str(num) + ' is Even Number')
    else:
        print('The Number ' + str(num) + ' is Odd Number')

def print_square(num):
    """Function to print square of given num."""
    print('Square: {}'.format(num * num))

if __name__ == '__main__':
    starttime = time.time()
    # creating and starting thread 1
    t1 = threading.Thread(target=print_square, args=(10,))
    t1.start()
    # create one thread per number and keep them so we can join them all
    threads = []
    for i in range(1, 10):
        t2 = threading.Thread(target=even, args=(i,))
        t2.start()
        threads.append(t2)
    # wait until thread 1 is completely executed
    t1.join()
    # wait until every numbered thread is completely executed
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))
Even with the increased number of tasks, the threaded computation is very fast: the measured time drops from 18.01 seconds to about 0.013 seconds. Note, however, that this version does not include the two-second sleep, so the workload itself is tiny. The broader point is that threads share memory and are cheap to start, and because of CPython's Global Interpreter Lock they shine for I/O-bound work (waiting on files, sockets or sleeps) rather than CPU-bound computation.
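To make the comparison with the earlier examples fairer, here is a small sketch (not from the original article) that puts the same two-second sleep back into each thread. Because the threads wait concurrently, the total time stays close to two seconds instead of eighteen:

import time
import threading

def slow_even(n):
    time.sleep(2)   # the sleep releases the GIL, so all threads wait together
    print('The Number ' + str(n) + ' is ' + ('Even' if n % 2 == 0 else 'Odd') + ' Number')

if __name__ == '__main__':
    starttime = time.time()
    threads = [threading.Thread(target=slow_even, args=(i,)) for i in range(1, 10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))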
Conclusion
So, we can conclude that multiprocessing and threading can greatly improve computational speed. As hardware parallelism continues to grow, these techniques will become more and more important for solving data science problems in much less time. The complete code of the above implementation is available in AIM's GitHub repositories. Please visit this link to find the code.