Optimization In Data Science Using Multiprocessing and Multithreading

Real-world datasets are often huge, which is a challenge for every data science programmer. Processing them takes a lot of time, so we need techniques that make our algorithms run faster. One such technique is parallelization, which distributes work across the available CPU cores. Python's standard library offers two modules for this, multiprocessing and threading.

Multiprocessing: Multiprocessing runs work in several processes at the same time, which lets a program use more than one CPU core. Each process has its own memory space, so processes run in parallel and do not share memory by default.

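Threading: Threads are units of execution inside a single process that run concurrently. All threads of a process share that process's memory. In the standard CPython interpreter, the global interpreter lock (GIL) allows only one thread to execute Python bytecode at a time, so threads help most with tasks that spend their time waiting, such as I/O or time.sleep.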
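
The memory difference described above can be seen in a small experiment. This is a minimal sketch, not part of the original implementation: the helper names append_item, run_in_thread and run_in_process are made up for illustration. A thread changes the caller's list, while a child process only changes its own copy.

```python
import multiprocessing
import threading


def append_item(items):
    """Append one value to the list that is passed in."""
    items.append(1)


def run_in_thread(items):
    """Run append_item in a thread; the thread works on the caller's list."""
    t = threading.Thread(target=append_item, args=(items,))
    t.start()
    t.join()
    return len(items)


def run_in_process(items):
    """Run append_item in a child process; the child works on its own copy."""
    p = multiprocessing.Process(target=append_item, args=(items,))
    p.start()
    p.join()
    return len(items)


if __name__ == "__main__":
    # The thread shares memory with the main program, so the list grows.
    print("Length after thread: ", run_in_thread([]))
    # The process has separate memory, so the parent's list stays empty.
    print("Length after process:", run_in_process([]))
```

To get data back from a process, it has to be sent explicitly, for example through the return values of Pool.map or through a multiprocessing.Queue.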

In this article, we first measure how long a task takes with a traditional, sequential approach. We then look at two parallelization techniques, multiprocessing and multithreading, that can reduce the time spent processing or training on a large dataset. In larger machine learning implementations, running time is a major concern, and this article shows how to address it.

Practical Implementation

Normal Approach

First, we measure the time taken by a Python program that handles the tasks one after another.

import time

def even(n):
    # Report whether n is even or odd
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

starttime = time.time()
for i in range(1, 10):
    time.sleep(2)  # simulate 2 seconds of work per number
    even(i)
print()
print('Time taken = {} seconds'.format(time.time() - starttime))

Using Multiprocessing

import time
import multiprocessing

def even(n):
    # Report whether n is even or odd
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)  # simulate 2 seconds of work per number
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    # Start one process per number
    for i in range(1, 10):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    # Wait until every process has finished
    for process in processes:
        process.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

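From the above result, we can see that the time taken for the computation has dropped drastically, from 18.01 sec to 2.07 sec, using the Process class. All nine processes start at once and sleep for their 2 seconds concurrently, so the total is close to the time of a single task.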

Using Pool Class

import time
import multiprocessing

def even(n):
    # Report whether n is even or odd
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)  # simulate 2 seconds of work per number
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    # With no argument, Pool() sizes itself from the machine's CPU count
    pool = multiprocessing.Pool()
    pool.map(multiprocessing_func, range(1, 10))
    pool.close()
    pool.join()  # wait for the worker processes to exit
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

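The time taken for the computation has been reduced from 18.01 sec to 10.04 sec using the Pool class. This is slower than the Process example, but only because multiprocessing.Pool() by default creates as many worker processes as the machine has CPUs, so the nine 2-second tasks are handled in batches rather than all at once. For a handful of tasks that mostly wait, starting one Process per task is faster; for a large number of tasks, a pool keeps the number of processes under control.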
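
The effect of the pool size can be checked by passing the number of workers explicitly. The following is a minimal sketch under the assumption that one worker per task is acceptable on your machine; the names parity and slow_parity are introduced here for illustration and return their result instead of printing it, so that Pool.map can collect the values.

```python
import multiprocessing
import time


def parity(n):
    """Return 'Even' or 'Odd' for the given number."""
    return "Even" if n % 2 == 0 else "Odd"


def slow_parity(n, delay=2):
    """Wait for `delay` seconds to simulate work, then classify n."""
    time.sleep(delay)
    return n, parity(n)


if __name__ == "__main__":
    numbers = range(1, 10)
    starttime = time.time()
    # One worker per task, so all nine waits can overlap
    with multiprocessing.Pool(processes=len(numbers)) as pool:
        results = pool.map(slow_parity, numbers)
    for n, label in results:
        print("The Number {} is {} Number".format(n, label))
    print("Time taken = {} seconds".format(time.time() - starttime))
```

With nine workers, the run should take roughly as long as the Process example, plus the overhead of starting the pool. The exact figure depends on the machine.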

Threading

import time
import threading

def even(num):
    # Report whether num is even or odd
    if num % 2 == 0:
        print('The Number ' + str(num) + ' is Even Number')
    else:
        print('The Number ' + str(num) + ' is Odd Number')

def print_square(num):
    """Print the square of the given number."""
    print("Square: {}".format(num * num))

if __name__ == "__main__":
    starttime = time.time()
    # one thread prints a square
    t1 = threading.Thread(target=print_square, args=(10,))
    t1.start()
    # one thread per number checks whether it is even or odd
    threads = []
    for i in range(1, 10):
        t2 = threading.Thread(target=even, args=(i,))
        threads.append(t2)
        t2.start()
    # wait until every thread has finished
    t1.join()
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))

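Even with the extra task, the threaded version finishes very quickly: the time taken is 0.013 sec, compared with 18.01 sec for the normal approach. Note, however, that this example does not call time.sleep(2), so the two figures are not directly comparable; most of the 18 seconds in the normal approach was spent sleeping.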
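
For a like-for-like comparison with the earlier examples, the same wait can be added to the threaded version. This is a sketch rather than the original implementation: the run_threaded helper and the delay parameter are assumptions added for illustration.

```python
import threading
import time


def even(num, delay):
    """Wait for `delay` seconds to simulate work, then report the parity."""
    time.sleep(delay)
    if num % 2 == 0:
        print("The Number {} is Even Number".format(num))
    else:
        print("The Number {} is Odd Number".format(num))


def run_threaded(numbers, delay=2):
    """Start one thread per number, wait for all, return elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=even, args=(n, delay)) for n in numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_threaded(range(1, 10))
    print("Time taken = {} seconds".format(elapsed))
```

Because time.sleep releases the GIL, the nine waits overlap and the total should be close to the 2 seconds of a single task, much like the Process example. For work that keeps the CPU busy instead of waiting, threads in CPython would not give the same gain.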

Conclusion

So, we can conclude that multiprocessing and threading can greatly reduce computation time. As the trend towards parallelism continues, these techniques will become more and more important for solving data science problems in much less time. The complete code of the above implementation is available at AIM's GitHub repositories.

Ankit Das

A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.