Preparing datasets is one of a data scientist's day-to-day jobs. For preparing and analysing structured data, Pandas is among the most widely used dataframe libraries, and many data scientists swear by it.
Pandas is a Python library for data analysis created in 2008. It is built on top of NumPy, integrates with matplotlib for plotting, and is known for its powerful and flexible capabilities for quantitative analysis. Pandas is used extensively in data science as an essential component of many data analysis workflows. With its rich set of functions and methods for handling and manipulating data, Pandas makes complex data analysis tasks simpler to carry out.
However, two years ago, when the world was fighting COVID-19, Ritchie Vink was planning to launch something that could revolutionise the dataframe libraries ecosystem and pose a threat to the monopoly of Pandas.
Polars is an open source dataframe library released in March 2020. It stands out from other libraries in the field for its ability to perform many operations in parallel, thanks to its use of the Rust programming language. Rust was chosen specifically for its performance and parallelization capabilities. In addition, Polars does not rely on an index for its dataframe and supports lazy evaluation, making it a potential alternative to Pandas for some users.
In the benchmark test conducted by H2O.ai, Polars was well ahead of its competitors. For example, Polars completed the aggregation on the 50 GB dataset in just 143 seconds. In comparison, Pandas was unable to complete the task due to insufficient memory.
(credit: H2O.ai benchmark test result.)
In addition to its speed, Polars is highly user-friendly and boasts a well-written codebase. For those familiar with libraries like dplyr in R, using Polars is a breeze due to its similar syntax. Overall, Polars offers a powerful and easy-to-use solution for data aggregation tasks.
Rajesh Murthy, VP of engineering at Kyvos Insights, told AIM that Polars can be an excellent alternative to Pandas. Murthy said that Polars emphasises performance, is memory-efficient and supports parallel execution that utilises all CPU cores. “The lazy API, which transforms the requests into an optimal logical plan and parallelises the jobs when needed, makes Polars more efficient for large datasets, better performance and disk space complexities, and an effective and affordable solution,” said Murthy.
However, when it comes to replacing Pandas, data scientists beg to differ. Gunnvant Singh Saini, senior data scientist at Hero Vired, told AIM that most data scientists would not want Polars to replace Pandas completely; for that, Polars would have to be vastly superior and gel well with existing codebases. “I care about my codebase, if I have a lot of code written in Pandas already, I won’t be switching to Polars. Also, data sizes in most organisations can be handled adequately by Pandas,” said Saini.
According to Murthy, new libraries may be subject to limitations such as a nascent community, missing features and a lack of seamless support for the other Python packages that data engineers need. “The well-established Pandas has a substantial community base similar to StackOverflow and a successful ecosystem. Polars, however, unquestionably remains a compelling choice,” concluded Murthy.
Should Pandas users be scared of Polars?
Polars has both eager and lazy APIs. The eager API is comparable to what Pandas provides, while the lazy API is the hot topic. It is particularly interesting because it optimises the entire query before running it, which improves performance and reduces the memory footprint.
Because the project is still in the early stages of development and is constantly evolving, the documentation and release notes are incomplete in some areas. Additionally, Polars is not the most suitable option for handling datasets that exceed available RAM. As an in-memory dataframe library, similar to Pandas, Polars relies on being able to store all data in memory, which can be a limiting factor despite its lazy processing capabilities.
Additionally, as Saini said, there is no reason to switch to Polars if most of the code is already written in Pandas and works as intended for day-to-day operations. Speed alone isn’t necessarily a reason to switch.