Optimization In Data Science Using Multiprocessing and Multithreading

In the real world, datasets are huge, and this is a challenge for every data science programmer: working with them takes a lot of time, so we need techniques that speed up our algorithms. Most of us are familiar with parallelization, which allows work to be distributed across all available CPU cores. Python offers two built-in modules for this: multiprocessing and threading.

Multiprocessing: Multiprocessing refers to a system's ability to run more than one process at the same time. Processes run in parallel, typically on separate CPU cores, and do not share memory with one another.
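To see what "does not share memory" means in practice, here is a minimal sketch (not part of the original tutorial; the variable name counter is purely illustrative). A child process increments a module-level variable, but the parent's copy is unchanged.

import multiprocessing

# counter is an ordinary module-level variable, not a shared object
counter = 0

def bump():
    global counter
    counter += 1
    print('Inside the child process, counter =', counter)   # prints 1

if __name__ == '__main__':
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print('In the parent process, counter =', counter)       # still 0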

Threading: Threads are components of a single process that can run concurrently. All threads of a process share the same memory space.
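By contrast with processes, threads share memory, so a list built by several worker threads is visible to the main thread afterwards. The sketch below is for illustration only; the names worker and results are not from the article.

import threading

results = []   # one list shared by every thread in the process

def worker(n):
    results.append(n * n)   # all threads append to the same list

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # the main thread sees every value the workers added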


In this article, we will first measure how long a problem takes to solve using the traditional, sequential approach. We will then explore parallelization techniques, multiprocessing and multithreading, that can reduce the training time on large datasets in a data science problem. When dealing with larger machine learning implementations, time complexity is a major concern, and this article shows how to address it.




Practical Implementation

Normal Approach

First, let us measure the time taken by the Python program when we use the normal, sequential approach.

import time

# Check whether a number is even or odd
def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

starttime = time.time()
for i in range(1, 10):
    time.sleep(2)   # simulate a 2-second task
    even(i)
print()
print('Time taken = {} seconds'.format(time.time() - starttime))

Using Multiprocessing

import time
import multiprocessing

# Check whether a number is even or odd
def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

# Each process sleeps for 2 seconds and then checks one number
def multiprocessing_func(x):
    time.sleep(2)
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    # Spawn one process per number
    for i in range(1, 10):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    # Wait for all processes to finish
    for process in processes:
        process.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

From the above result, we can see that the computation time has been reduced drastically, from 18.01 seconds to 2.07 seconds, using the Process class.

Using Pool Class

import time
import multiprocessing

# Check whether a number is even or odd
def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

# Each task sleeps for 2 seconds and then checks one number
def multiprocessing_func(x):
    time.sleep(2)
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    # By default the pool creates one worker process per CPU core
    pool = multiprocessing.Pool()
    pool.map(multiprocessing_func, range(1, 10))
    pool.close()
    pool.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

The time taken for the computation has been reduced from 18.01 seconds to 10.04 seconds using the Pool class. This method is slower than the Process approach above because the pool creates only as many worker processes as there are CPU cores, so the nine 2-second tasks are executed in batches rather than all at once. For this workload, it is better to go with the Process class to optimise the algorithm's speed.
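One way to narrow the gap, assuming the machine can schedule that many processes at once, is to pass an explicit processes argument to the Pool so that one worker is created per task. The sketch below is a variation on the example above, not a measurement from the article.

import time
import multiprocessing

def multiprocessing_func(x):
    time.sleep(2)
    print('The Number ' + str(x) + ' is ' + ('Even' if x % 2 == 0 else 'Odd') + ' Number')

if __name__ == '__main__':
    starttime = time.time()
    # One worker process per task, so all nine 2-second sleeps can overlap
    with multiprocessing.Pool(processes=9) as pool:
        pool.map(multiprocessing_func, range(1, 10))
    print('Time taken = {} seconds'.format(time.time() - starttime))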

Threading

import time
import threading

# Check whether a number is even or odd
def even(num):
    if num % 2 == 0:
        print('The Number ' + str(num) + ' is Even Number')
    else:
        print('The Number ' + str(num) + ' is Odd Number')

# Print the square of the given number
def print_square(num):
    print("Square: {}".format(num * num))

if __name__ == "__main__":
    starttime = time.time()
    # creating and starting thread 1
    t1 = threading.Thread(target=print_square, args=(10,))
    t1.start()
    # creating and starting one thread per number
    threads = []
    for i in range(1, 10):
        t2 = threading.Thread(target=even, args=(i,))
        threads.append(t2)
        t2.start()
    # wait until thread 1 is completely executed
    t1.join()
    # wait until all the even/odd threads are completely executed
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))

Even with the increased number of tasks, the threading computation is very fast: the time taken is reduced from 18.01 seconds to 0.013 seconds. Note that, unlike the earlier examples, the threaded tasks here do not include the 2-second sleep, which is why the run completes almost instantly.
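For a like-for-like comparison with the earlier timings, the hedged sketch below applies threads to the same 2-second tasks. Because time.sleep releases the GIL, the nine waits overlap and the loop should finish in roughly 2 seconds rather than 18; the exact figure will vary by machine.

import time
import threading

def even(n):
    time.sleep(2)   # the same simulated 2-second task used earlier
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

if __name__ == '__main__':
    starttime = time.time()
    threads = [threading.Thread(target=even, args=(i,)) for i in range(1, 10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))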

Conclusion

So, we can conclude that multiprocessing and threading offer great gains in computational speed. As the trend of increasing parallelism continues, these techniques will become more and more important for solving data science problems in much less time. The complete code of the above implementation is available in AIM's GitHub repositories. Please visit this link to find the code.

