
Optimization In Data Science Using Multiprocessing and Multithreading

Ankit Das


In the real world, datasets are often huge, which is a challenge for every data science programmer: working on them takes a lot of time, so we need techniques that increase an algorithm's speed. Most of us are familiar with parallelization, which distributes work across all available CPU cores. Python ships with two built-in modules for this kind of optimization: multiprocessing and threading (multithreading).

Multiprocessing: Multiprocessing refers to the ability of a system to run multiple processes at the same time, typically across more than one processor. Processes run in parallel and do not share memory.
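
To make the no-shared-memory point concrete, here is a minimal sketch (the worker function and list names are illustrative, not from the article): a child process appends to a list, but the parent's copy is unchanged because each process has its own address space.

import multiprocessing

results = []

def worker():
    # runs in a separate process, which has its own copy of `results`
    results.append(42)

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
    # the child's change is not visible in the parent process
    print(results)   # prints []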



Threading: Threads are components of a process that can run concurrently. All threads of a process share the same memory.
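
By contrast, a minimal sketch (again illustrative, not from the article) shows that threads of the same process share memory: every thread appends to the same list object.

import threading

results = []

def worker(n):
    # all threads see and modify the same `results` list
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))   # prints [0, 1, 4, 9]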

In this article, we will first measure how long a problem takes to solve with a plain sequential approach. We will then apply parallelization techniques, multiprocessing and multithreading, that can reduce the processing time of large datasets in a data science problem. When dealing with larger machine learning workloads, running time is a major concern, and this article shows how to address it.

Practical Implementation

Normal Approach

First, we check the time taken by the Python program when we go with the normal, sequential approach. With nine numbers and a two-second sleep per number, this loop takes roughly 18 seconds.

import time

def even(n):
    # classify a number as even or odd
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

starttime = time.time()
for i in range(1, 10):
    time.sleep(2)      # simulate two seconds of work per number
    even(i)
print()
print('Time taken = {} seconds'.format(time.time() - starttime))

Using Multiprocessing

import time
import multiprocessing

def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)      # simulate two seconds of work per number
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    processes = []
    # spawn one process per number so all nine sleeps run in parallel
    for i in range(1, 10):
        p = multiprocessing.Process(target=multiprocessing_func, args=(i,))
        processes.append(p)
        p.start()
    # wait for every process to finish before measuring the elapsed time
    for process in processes:
        process.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

From the above result, we can see that the computation time has been reduced drastically from 18.01 sec to 2.07 sec using the Process class: all nine processes sleep their two seconds concurrently, so the total time is roughly that of a single task.



Using Pool Class

import time
import multiprocessing

def even(n):
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

def multiprocessing_func(x):
    time.sleep(2)      # simulate two seconds of work per number
    even(x)

if __name__ == '__main__':
    starttime = time.time()
    # Pool creates one worker per CPU core by default and distributes
    # the nine numbers across those workers
    pool = multiprocessing.Pool()
    pool.map(multiprocessing_func, range(1, 10))
    pool.close()
    pool.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))

Using the Pool class, the computation time is reduced from 18.01 sec to 10.04 sec. This is slower than the Process approach because Pool creates only as many worker processes as there are CPU cores by default, so the nine two-second tasks have to queue up on the available workers, whereas the previous example launched one process per task. For this workload, the Process-class technique is therefore the faster way to optimize the algorithm's speed.
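
As a rough illustration of that explanation, the sketch below (not part of the original article; the task function is illustrative) passes an explicit worker count to Pool so that all nine two-second tasks can run at once, which should bring the timing close to the Process-class result.

import time
import multiprocessing

def task(x):
    time.sleep(2)      # same simulated two seconds of work per number
    return x % 2 == 0

if __name__ == '__main__':
    starttime = time.time()
    print('CPU cores available:', multiprocessing.cpu_count())
    # request one worker per task instead of the default one per CPU core
    with multiprocessing.Pool(processes=9) as pool:
        results = pool.map(task, range(1, 10))
    print(results)
    print('Time taken = {} seconds'.format(time.time() - starttime))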

Threading

import threading
import time

def even(num):
    if num % 2 == 0:
        print('The Number ' + str(num) + ' is Even Number')
    else:
        print('The Number ' + str(num) + ' is Odd Number')

def print_square(num):
    """Print the square of the given number."""
    print('Square: {}'.format(num * num))

if __name__ == '__main__':
    starttime = time.time()
    # creating and starting the first thread
    t1 = threading.Thread(target=print_square, args=(10,))
    t1.start()
    # one thread per number; keep references so every thread can be joined
    threads = []
    for i in range(1, 10):
        t = threading.Thread(target=even, args=(i,))
        t.start()
        threads.append(t)
    # wait until all threads are completely executed
    t1.join()
    for t in threads:
        t.join()
    print('Time taken = {} seconds'.format(time.time() - starttime))

Even with the increased number of tasks, the threaded computation is very fast: the time taken is reduced from 18.01 sec to 0.013 sec. Note, however, that this example omits the two-second sleep used in the earlier examples, which accounts for most of that difference.
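
For a like-for-like comparison with the earlier examples, here is a sketch (not in the original article) that keeps the two-second sleep. Because time.sleep releases the GIL, the nine sleeps overlap and the loop finishes in roughly two seconds rather than eighteen.

import threading
import time

def even(n):
    time.sleep(2)      # simulated I/O-bound work, as in the earlier examples
    if n % 2 == 0:
        print('The Number ' + str(n) + ' is Even Number')
    else:
        print('The Number ' + str(n) + ' is Odd Number')

if __name__ == '__main__':
    starttime = time.time()
    threads = [threading.Thread(target=even, args=(i,)) for i in range(1, 10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print()
    print('Time taken = {} seconds'.format(time.time() - starttime))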

Conclusion

We can conclude that multiprocessing and threading can greatly speed up computation. As the trend toward greater parallelism continues, these techniques will become more and more important for solving data science problems in far less time. The complete code of the above implementation is available at AIM's GitHub repositories; please visit this link to find the code.
