Pandas, a Python library for data structures and analysis, has become one of the essential tools for data science. Imagining data science workflows without Pandas is nothing less than a nightmare. Although analysis can be carried out without importing data into Pandas data frames, data scientists prefer Pandas for its powerful attributes and methods, which make evaluating data more accessible. “Pandas allows us to focus more on research and less on programming. We have found pandas easy to learn, use, and maintain. The bottom line is that it has increased our productivity,” said Roni Israelov, PhD, portfolio manager at AQR Capital Management.
On 29 January, Pandas released version 1.0, which enhanced its functionality while also deprecating some features. This is the first major release, and it will help optimise data science practices. Beginning with this release, Pandas will only be supported on Python 3.6 and above. It has also introduced a new support policy for all future versions of the library: minor releases will deprecate functionality, while major releases will remove it.
Here are the changes that will have an impact on developers’ workflows.
Handling Missing Values
Datasets often have missing values, which cause hindrances during data analysis. Developers replace missing values with null, NaN, or NA values. The common practice was np.nan for float data, np.nan or None for object-dtype data, and pd.NaT for datetime-like data. This led to different behaviours in arithmetic and comparison operations; pd.NA, by contrast, represents a missing value as “unknown”, so comparisons involving it propagate pd.NA rather than evaluating to False.
To mitigate the problems associated with these different approaches, Pandas’ new feature unifies them all under pd.NA. As this is a significant change, the organisation considers it experimental and might change its behaviour if required.
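A minimal sketch of this behaviour, assuming pandas 1.0 with one of the nullable extension dtypes (here the nullable integer dtype "Int64"):

```python
import pandas as pd

# pd.NA is a single scalar for missing data across dtypes (experimental
# in pandas 1.0). Unlike np.nan, comparisons with pd.NA propagate as
# "unknown" instead of evaluating to False.
arr = pd.array([1, 2, None], dtype="Int64")  # nullable integer array

print(arr[2] is pd.NA)        # the missing slot holds pd.NA
print((pd.NA == 1) is pd.NA)  # comparison returns pd.NA, not False
```

Because pd.NA is a singleton, identity checks with `is pd.NA` are a reliable way to detect it.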
Introduction Of String Data Type
Until now, strings in NumPy arrays were stored with object dtype. However, one could also store non-string data in an array that was only supposed to hold strings, creating a barrier to maintaining a strings-only array. Since the object dtype didn’t check for strings before a value was appended, integers and floats often slipped in while working with string arrays.
Consequently, Pandas now has a dedicated string extension type, which ensures that the array contains only string objects. Besides, a StringArray will now display string instead of object as its dtype, improving readability.
The type can be specified as dtype=pd.StringDtype() or dtype="string".
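A short sketch of the new dtype in use, assuming pandas 1.0 or later:

```python
import pandas as pd

# The dedicated "string" dtype keeps the array strings-only and
# reports a readable dtype name.
s = pd.Series(["data", "science", None], dtype="string")

print(s.dtype)        # string (instead of object)
print(s[2] is pd.NA)  # missing entries use pd.NA
```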
Handling Missing Values In Boolean Data Type
A Boolean has only two values, True and False, but this caused hindrance when data was missing: with the NumPy bool dtype, missing values had to be coerced, typically being treated as False and biasing the data. Therefore, Pandas’ new release includes an extension type, BooleanDtype with its BooleanArray, for keeping track of missing values. Missing entries are shown as <NA>, improving the quality of the data.
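A minimal sketch of the nullable boolean dtype, assuming pandas 1.0 or later:

```python
import pandas as pd

# The nullable "boolean" dtype can hold True, False, and a genuine
# missing value (shown as <NA>) instead of coercing missing to False.
mask = pd.array([True, False, None], dtype="boolean")

print(mask)             # [True, False, <NA>]
print(mask[2] is pd.NA)
```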
Increased Performance With Numba
The apply() function is a powerful function that enables manipulation of data on a Series or DataFrame by passing a function, and it can also be used for performing rolling computations. Since developers work with huge datasets, the speed of apply() with the default Cython engine was often too slow. However, with the new Numba engine for rolling computations, one can gain significant performance improvements on large datasets.
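A sketch of a rolling apply, assuming pandas 1.0; the Numba variant is shown commented out since it additionally requires the numba package to be installed:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000, dtype=float))

# Default execution uses the Cython engine.
result = s.rolling(3).apply(np.mean, raw=True)

# With numba installed, pandas >= 1.0 can JIT-compile the function,
# which is typically much faster on large inputs:
# result = s.rolling(3).apply(np.mean, raw=True, engine="numba")

print(result.iloc[2])  # 1.0 (mean of 0, 1, 2)
```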
Data Frame Summary Readability
One of the annoying things about data frames in Python versus data frames in R was the readability of the summary output. Now, however, Pandas has enhanced the output of DataFrame.info() to help developers assimilate data effortlessly. DataFrame.info() will now display line numbers in the column summary when used with verbose=True.
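A small sketch of the enhanced summary; writing to a StringIO buffer (via the `buf` parameter) is only used here so the output can be inspected programmatically:

```python
import io
import pandas as pd

df = pd.DataFrame({"int_col": [1, 2, 3], "text_col": ["a", "b", None]})

# verbose=True forces the full per-column summary, which in pandas 1.0
# gains a numbered " # " column for easier reading.
buf = io.StringIO()
df.info(verbose=True, buf=buf)
print(buf.getvalue())
```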
Quite a few features have been deprecated. The most notable one concerns selecting columns from a DataFrameGroupBy object: passing a tuple of keys for subsetting is deprecated, and one should now pass a list of keys instead.
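A sketch of the deprecated and the preferred spelling, assuming pandas 1.0:

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "a": [1, 2, 3], "b": [4, 5, 6]})

# Deprecated in pandas 1.0 (tuple of keys):
#     df.groupby("key")["a", "b"].sum()

# Preferred: pass a list of column names.
result = df.groupby("key")[["a", "b"]].sum()
print(result)
```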
Another widely used change is in DataFrame.hist() and Series.hist(): figsize no longer accepts a default string value, and one needs to pass a tuple for the desired plot size.
The new release has also done away with numerous bugs to improve reliability during data analysis. fillna() on categorical data used to raise a ValueError inconsistently when it encountered a value outside the defined categories; the inconsistency is now tested for and the exceptions are handled consistently. Besides, casting categorical values into integers also produced undesired output: with NaN values in particular, the results were incorrect. The updated Pandas is free from this bug.
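An illustrative sketch of filling missing categorical data; the column names and values here are hypothetical, and filling is only valid with a value that belongs to the declared categories:

```python
import pandas as pd

cat = pd.Series(pd.Categorical(["a", "b", None], categories=["a", "b", "c"]))

# Filling with a value that is one of the categories works; a value
# outside the categories raises an error.
filled = cat.fillna("c")
print(filled.tolist())  # ['a', 'b', 'c']
```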