Pandas Tricks Not Known By Many

Pandas

Advertisement

Pandas is a fast, powerful and easy to use open-source data analysis and manipulation tool which is designed on top of the Cytron, C, and Python programming language. It is an amalgamation of two different terms, i.e. panel and data. From combining data frames to reshaping them, Pandas comes with a host of advanced features. For example, it lets a user input a URL in the place of a file name. One can also scrape data from a webpage using its “read_html” function. Although Pandas is one of the most popular libraries among data scientists, due to its wide range of applications, it contains methods that not everyone is familiar with. 

The list of functionalities Pandas have are too long and broad to be pointed here, but its vast nature amazes the users from time to time. However, there are a number of lesser-known Pandas tricks which one could further use to be more productive.

THE BELAMY

Sign up for your weekly dose of what's up in emerging technology.

Data Ranges

Data ranges are often required to be specified while collecting data from a database or an external API (application program interface). To make the process easier for a user, Pandas have a data range function through which one can return the date incremented by days, months or years. It generates the integer numbers between the given start integer and the stop integers to return a ranged object. 

To select a range of rows and columns :

DataFrame.iloc[Range of rows, Range of columns]

Eg. df.iloc[5:10,0:4] will select the rows indexed 5 to 9 and the first 4 columns in the data frame.

Similarly, the loc indexer can perform boolean selections:

DataFrame.loc[conditional , [column_labels]]

Eg. df.loc[df[‘Age’] < 15, [‘user_id’, ‘name’]] selects only the user_id and name from the data frame where the age is less than 15.

Merge With Indicators

Via Shane Lynn

Merge columns are useful for those who are working with larger and multiple datasets and want to merge multiple tables into a single data frame. Furthermore, it is also possible to align the rows each depending on the common attributes and columns.

To create a merge, the indicator arguments adds a_merge column to a DataFrame. This step allows a user to identify where the row came from. It could be from the left, right or both DataFrames. The merged column can be further utilised to view the expected number of rows with values from both DataFrames.

Nearest Merge

Useful for those who are working with financial data such as cryptocurrencies and stocks, and may need to combine price changes with real trades. For example, a user may need to merge each trade with a change in price that occurs every millisecond. To ensure this example can be streamlined easily, Pandas has a function merge_asof that allows merging DataFrames by the nearest key.

Example

Quotes of share price,

Trade information data

By default, we are taking the asof of the quotes

pd.merge_asof(trades, quotes,on=’time’, by=’ticker’)

Excel Report Creation

Via PBPython

With XLsxWriter Library, Pandas allows a user to create an Excel report from the DataFrame. This helps in saving a lot of time as it means that a user does not need to save a DataFrame to CSV and then format it in Excel. One can also use different kinds of charts with the code writer = pd.ExcelWriter(‘demo.xlsx’, engine=’xlsxwriter’).” This will create an Excel report with plots. To create a chart, one must describe the type of chart (for example, line chart) along with the data series for the chart.

Saving Disk Space

Working with a variety of data science projects leads to a huge pileup of various preprocessed datasets from different experiments. Due to the reason, the SSD of one’s system might be used up too soon with data, which may not even be required. Pandas enable a user to compress a dataset while saving it along with providing the ability to read it in compressed format as well. A file may grow up to a size of 300 MB, which can be easily compressed with a single argument compression= ‘gzip’ to 136 MB.

#read data

df = pd.read_csv(“Data.csv”)

#zip configuration

compression_opts = dict(method=’zip’, archive_name=’out.csv’)

#write the bigger data.csv file into .zip format, using below code.  

df.to_csv(‘out.zip’, index=False, compression=compression_opts)

More Great AIM Stories

Our Upcoming Events

Conference, in-person (Bangalore)
MachineCon 2022
24th Jun

Conference, Virtual
Deep Learning DevCon 2022
30th Jul

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM
MORE FROM AIM