
How To Accelerate Pandas With Just One Line Of Code


Pandas is one of the most popular libraries in Python. The data manipulation and storage options it provides have made it a go-to choice for Kaggle competitions. Pandas DataFrames expose more than 280 methods, and the library offers more than 40 APIs, which is enough to serve almost all the needs of a data scientist. Python with pandas is used in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics, and more.

But pandas can get clumsy when dealing with large datasets, such as those in genomics. In such cases, the data is typically trimmed down with big data tools before it is run through pandas.

So a few data scientists at UC Berkeley came up with a new library, Modin, a multi-process DataFrame library with an API identical to pandas.

What Gives Modin The Edge

In pandas, computation can only use one core at a time, but Modin enables the user to use all of the CPU cores on the machine.

Unlike other parallel DataFrame systems, Modin is an extremely lightweight, robust DataFrame library. It provides speed-ups of up to 4x on machines with 4 physical cores.

Modin uses Ray to provide an effortless way to speed up pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code.

Source: Modin docs

Tests on read operations such as read_csv show large gains, because Modin efficiently distributes the work across the entire machine.

With Modin, the developers tried to bridge the gap between handling small datasets and handling large ones.

Installing Modin:

pip install modin
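
Modin supports more than one execution engine, so it can optionally be installed together with a specific backend. The extras form below, documented by Modin for the Ray backend used in this article, pulls in the Ray dependency as well:

pip install "modin[ray]"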

Using Modin:

import modin.pandas as pd  # that ONE line

The modin.pandas DataFrame is an extremely lightweight parallel DataFrame. Modin transparently distributes the data and computation.
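
As a minimal sketch of what this looks like in practice (the file name and column below are hypothetical), everything after the import is plain pandas code:

import modin.pandas as pd

df = pd.read_csv("data.csv")          # hypothetical file; the read is distributed across all cores
summary = df.groupby("city").size()   # familiar pandas API, executed in parallel under the hood
print(summary.head())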

The Modin DataFrame architecture follows in the footsteps of modern architectures for databases and high-performance matrix systems.

Source: Modin docs

Query Compiler

The Query Compiler receives queries from the pandas API layer. The API layer’s responsibility is to ensure clean input to the Query Compiler. The Query Compiler must have knowledge of the in-memory format of the data (currently a pandas DataFrame) in order to efficiently compile the queries.

Partition Manager

The Partition Manager is responsible for the data layout and shuffling, partitioning, and serializing the tasks that get sent to each partition.

Partition

Partitions are responsible for managing a subset of the DataFrame; the DataFrame is partitioned both row-wise and column-wise.
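
As an illustrative sketch only (not Modin's actual internals), block partitioning can be pictured as splitting a frame into a grid of row and column chunks, each of which a map-style operation can process independently:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 8))

# split the frame into a 4x2 grid of block partitions (row chunks x column chunks)
row_bounds = np.array_split(np.arange(len(df)), 4)
col_bounds = np.array_split(np.arange(df.shape[1]), 2)
blocks = [[df.iloc[r, c] for c in col_bounds] for r in row_bounds]

# a map operation (here: multiply by 2) can run on each block independently,
# which is what lets the work be spread across cores
result_blocks = [[block * 2 for block in row] for row in blocks]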

What Is Ray

Ray is another system under development. It is used for parallel and distributed Python, aiming to unify the ML ecosystem with low latency and high performance.

Pandas on Ray is the component of Modin that runs on the Ray execution framework. Currently, the in-memory format for Pandas on Ray is a pandas DataFrame on each partition, and Ray is the only execution framework supported by Modin.

The optimization that improves the performance the most is the pre-serialization of the tasks and parameters. This is primarily applicable to map operations.
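
As a hedged sketch of the general Ray pattern described here (not Modin's actual task code), parameters can be pre-serialized into Ray's object store with ray.put, so that each map task receives only a cheap reference instead of re-serializing the data:

import numpy as np
import pandas as pd
import ray

ray.init()

@ray.remote
def double(block):
    # a simple map task applied to one partition
    return block * 2

df = pd.DataFrame(np.random.rand(1000, 4))
row_bounds = np.array_split(np.arange(len(df)), 4)
partitions = [df.iloc[idx] for idx in row_bounds]

# serialize each partition into the object store once; tasks get lightweight references
refs = [ray.put(p) for p in partitions]
results = ray.get([double.remote(r) for r in refs])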

Modin will use all of the resources available on the machine. If required, this usage can be limited as follows:

import ray
ray.init(num_cpus=4)       # initialize Ray with a limited CPU count first...
import modin.pandas as pd  # ...so that Modin uses this Ray instance instead of starting its own


The benchmark figures from the Modin docs illustrate how Modin edges out pandas on a 4-core CPU.

Key Takeaways

  • Modin takes care of all the partitioning and shuffling of the data.
  • The performance of read_csv shows more than a gigabyte per second of read throughput, far better than pandas (roughly 1 GB in 25 seconds); a simple way to time such a comparison is sketched after this list.
  • The architecture allows Modin to exploit potential optimizations across the execution framework and the in-memory format of the data.
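
A minimal timing sketch for such a comparison (the CSV path is hypothetical, and the exact numbers depend heavily on hardware and file size):

import time

import pandas
import modin.pandas as mpd

start = time.time()
pandas_df = pandas.read_csv("big_file.csv")   # hypothetical large CSV
print("pandas:", round(time.time() - start, 2), "s")

start = time.time()
modin_df = mpd.read_csv("big_file.csv")       # same file, read in parallel by Modin
print("modin:", round(time.time() - start, 2), "s")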

Know more about Modin here

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.