Pandas Tricks Not Known By Many

Share

Published on March 28, 2020

by Rohit Chatterjee

Pandas is a fast, powerful and easy to use open-source data analysis and manipulation tool which is designed on top of the Cytron, C, and Python programming language. It is an amalgamation of two different terms, i.e. panel and data. From combining data frames to reshaping them, Pandas comes with a host of advanced features. For example, it lets a user input a URL in the place of a file name. One can also scrape data from a webpage using its “read_html” function. Although Pandas is one of the most popular libraries among data scientists, due to its wide range of applications, it contains methods that not everyone is familiar with.

The list of functionalities Pandas have are too long and broad to be pointed here, but its vast nature amazes the users from time to time. However, there are a number of lesser-known Pandas tricks which one could further use to be more productive.

Data Ranges

Data ranges are often required to be specified while collecting data from a database or an external API (application program interface). To make the process easier for a user, Pandas have a data range function through which one can return the date incremented by days, months or years. It generates the integer numbers between the given start integer and the stop integers to return a ranged object.

To select a range of rows and columns :

DataFrame.iloc[Range of rows, Range of columns]

Eg. df.iloc[5:10,0:4] will select the rows indexed 5 to 9 and the first 4 columns in the data frame.

Similarly, the loc indexer can perform boolean selections:

DataFrame.loc[conditional , [column_labels]]

Eg. df.loc[df[‘Age’] < 15, [‘user_id’, ‘name’]] selects only the user_id and name from the data frame where the age is less than 15.

Merge With Indicators

Merge columns are useful for those who are working with larger and multiple datasets and want to merge multiple tables into a single data frame. Furthermore, it is also possible to align the rows each depending on the common attributes and columns.

To create a merge, the indicator arguments adds a_merge column to a DataFrame. This step allows a user to identify where the row came from. It could be from the left, right or both DataFrames. The merged column can be further utilised to view the expected number of rows with values from both DataFrames.

Nearest Merge

Useful for those who are working with financial data such as cryptocurrencies and stocks, and may need to combine price changes with real trades. For example, a user may need to merge each trade with a change in price that occurs every millisecond. To ensure this example can be streamlined easily, Pandas has a function “merge_asof” that allows merging DataFrames by the nearest key.

Example

Quotes of share price,

Trade information data

By default, we are taking the asof of the quotes

pd.merge_asof(trades, quotes,on=’time’, by=’ticker’)

Excel Report Creation

With XLsxWriter Library, Pandas allows a user to create an Excel report from the DataFrame. This helps in saving a lot of time as it means that a user does not need to save a DataFrame to CSV and then format it in Excel. One can also use different kinds of charts with the code “writer = pd.ExcelWriter(‘demo.xlsx’, engine=’xlsxwriter’).” This will create an Excel report with plots. To create a chart, one must describe the type of chart (for example, line chart) along with the data series for the chart.

Saving Disk Space

Working with a variety of data science projects leads to a huge pileup of various preprocessed datasets from different experiments. Due to the reason, the SSD of one’s system might be used up too soon with data, which may not even be required. Pandas enable a user to compress a dataset while saving it along with providing the ability to read it in compressed format as well. A file may grow up to a size of 300 MB, which can be easily compressed with a single argument compression= ‘gzip’ to 136 MB.

#read data

df = pd.read_csv(“Data.csv”)

#zip configuration

compression_opts = dict(method=’zip’, archive_name=’out.csv’)

#write the bigger data.csv file into .zip format, using below code.

df.to_csv(‘out.zip’, index=False, compression=compression_opts)

Access all our open Survey & Awards Nomination forms in one place