## Table of contents

- What is a DataFrame?
- DataFrame creation and operations
- Create a Pandas DataFrame
- Get basic info of DataFrame
- Grab column(s) from DataFrame
- Check type of DataFrame object and its column
- Create a new column of DataFrame
- Drop a column from DataFrame
- Select a row from DataFrame
- Select specific element(s) from DataFrame
- Set index of DataFrame
- Reset index of DataFrame
- Deal with missing data
- GroupBy
- Sorting a DataFrame
- Apply a function to a column
- Pivot table creation

- EndNote

Python programming language provides a wide range of libraries and modules extensively used by Data Science enthusiasts. The most basic yet quite useful ones are NumPy and Pandas (built on the top of NumPy), which assist in systematic analysis and manipulation of data. This article talks about ‘DataFrame’ – a crucial concept for working with Pandas library.

## What is a DataFrame?

DataFrame is a 2D mutable data structure that can store heterogeneous data in tabular format (i.e. in the form of labelled rows and columns). By heterogeneous data, we mean a single DataFrame can comprise different data types’ content such as numerical, categorical etc.

The building block of a DataFrame is a **Pandas Series** object. Built on the top of the concept of NumPy arrays, Pandas Series is a 1D labelled array that can hold heterogeneous data. What differentiates a Pandas Series from a NumPy array is that it can be indexed using default numbering (starting from 0) or custom defined labels. Have a look at how NumPy array and Pandas Series differ.

```python
# First, import the NumPy and Pandas libraries
import numpy as np
import pandas as pd

my_data = [10,20,30]     # Initialize data elements
arr = np.array(my_data)  # Form a NumPy array of the elements
arr                      # Display the array
```

**Output:** `array([10, 20, 30])`

```python
ser = pd.Series(data = my_data)  # Form a Pandas Series of the elements
ser                              # Display the series
```

**Output:**

```
0    10
1    20
2    30
dtype: int64
```

Here, (0, 1, 2) are the index labels and (10, 20, 30) are the values of the 1D Series object.
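As mentioned above, a Series can also be indexed with custom defined labels instead of the default numbering. A minimal sketch (the labels 'a', 'b', 'c' and variable names here are illustrative):

```python
import pandas as pd

# A Series indexed with custom labels rather than 0, 1, 2
labels = ['a', 'b', 'c']
ser2 = pd.Series(data=[10, 20, 30], index=labels)
ser2['b']  # Access an element by its custom label
```

Accessing `ser2['b']` returns 20, just as `ser[1]` would with the default integer index.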

## DataFrame creation and operations

Here, we demonstrate how to deal with Pandas DataFrame using Pythonic code. Several (though not all) data operations possible with a DataFrame have been shown further in this article with explanation and code snippets.

**Note:** The code throughout this article has been implemented using Google Colab with Python 3.7.10, NumPy 1.19.5 and pandas 1.1.5.

**Create a Pandas DataFrame**

Populate a DataFrame with random numbers selected from a standard normal distribution using randn() function.

```python
from numpy.random import randn

np.random.seed(42)  # Seed the generator for reproducible results
df = pd.DataFrame(randn(5,4), ['A','B','C','D','E'], ['W','X','Y','Z'])
```

Here, the arguments to DataFrame() are the data to fill the DataFrame with, the list of row labels and the list of column names, respectively.

`df #Display the DataFrame`

**Output:**

All the columns in the above DataFrame are Pandas Series objects sharing the common index A, B, …, E. A DataFrame is thus a collection of Series with a shared index.

### Get basic info of DataFrame

A basic summary of the number of rows and columns, data types and memory usage of a DataFrame can be obtained using the info() function as follows:

`df.info()`

**Output:**

Aggregated statistics such as the total count of entries, mean, minimum, maximum, standard deviation etc. of each numerical column can be found as:

`df.describe()`

**Output:**

Column names of a DataFrame can be obtained using the ‘columns’ attribute as:

`df.columns`

**Output:** `Index(['W', 'X', 'Y', 'Z'], dtype='object')`

To know the range of indexes of a DataFrame, use the ‘index’ property of the DataFrame object.

`df.index`

**Output:** `Index(['A', 'B', 'C', 'D', 'E'], dtype='object')`

Since df was created with the custom labels A to E, an Index of those labels is returned. A DataFrame with the default integer index would instead return a `RangeIndex(start=0, stop=5, step=1)`, i.e. an index running from the ‘start’ value to (‘stop’ - 1), here 0 to 4, incrementing in steps of 1.

### Grab column(s) from DataFrame

Suppose we want to extract the column ‘W’ from the above DataFrame df. It can be done as:

`df['W']`

or `df.W`

**Output:**

Multiple columns from a DataFrame can be extracted as:

`df[['W','Z']]`

**Output:**

### Check type of DataFrame object and its column

`type(df)`

**Output:** `pandas.core.frame.DataFrame`

`type(df['W'])`

**Output:** `pandas.core.series.Series`

The above output verifies that each column of DataFrame is a Series object.

### Create a new column of DataFrame

Suppose we want to create a column with the label ‘new_col’ whose elements are the sums of the elements of columns W and Y. It can be created as follows:

```python
df['new_col'] = df['W'] + df['Y']
df  # Display the modified DataFrame
```

**Output:**

### Drop a column from DataFrame

Let’s say we now need to drop the ‘new_col’ column from df DataFrame.

`df.drop('new_col')`

Executing this line of code will result in an error as follows:

`KeyError: "['new_col'] not found in axis"`

To drop a column, we need to set the ‘axis’ parameter to 1 (indicates ‘column’). Its default value is 0, which denotes a ‘row’, and since there is no row with the label ‘new_col’ in df, we got the above error.

`df.drop('new_col',axis=1)`

**Output:**

**Why axis=0 denotes row, and axis=1 denotes column?**

If we check the shape of DataFrame df as:

`df.shape`

We get the output as (5,4), which means there are 5 rows and 4 columns in df. The shape of a DataFrame is thus stored as a tuple in which the element at index 0 denotes the number of rows and the element at index 1 denotes the number of columns. Hence, axis values 0 and 1 denote row and column, respectively.

However, if we check the df DataFrame now, it will still have the ‘new_col’ column in it. To drop the column from the original DataFrame as well, we need to specify the ‘inplace=True’ parameter in the drop() function as follows:

`df.drop('new_col',axis=1,inplace=True)`
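The effect of inplace can be verified on a small toy DataFrame (the data and variable name here are illustrative, not from the article's df):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with two columns
df_toy = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['W', 'new_col'])

# Without inplace=True, drop() returns a modified copy; df_toy is untouched
df_toy.drop('new_col', axis=1)

# With inplace=True, the column is removed from df_toy itself
df_toy.drop('new_col', axis=1, inplace=True)
```

After the second drop() call, `'new_col' in df_toy.columns` evaluates to False.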

### Select a row from DataFrame

The ‘loc’ property extracts a row by its label, while the ‘iloc’ property selects a row by its integer position (starting from 0 for the 1st row).

`df.loc['A'] #to extract row with index 'A'`

**Output:**

`df.iloc[2] #to extract row at position 2, i.e. the 3rd row of df`

**Output:**

### Select specific element(s) from DataFrame

To extract a single element:

`df.loc['B','Y'] #element at row B and column Y`

**Output:** `1.5792128155073915`

Multiple elements from specified rows and columns can also be extracted as:

`df.loc[['A','B'], ['W','Y']] #elements from rows A and B and columns W and Y`

**Output:**

**Conditional selection of elements:**

We can check specific conditions, say which elements are greater than 0. ‘True’ or ‘False’ in the output will indicate if the element at that location satisfies the condition or not.

`df>0`

**Output:**

To get the actual elements within the DataFrame which satisfy the condition:

`df[df>0] `

**Output:**

Here, NaN (Not a Number) is displayed in place of elements that are not greater than 0.

If we need only those rows for which a particular column, say W, has elements greater than 0:

`df[df['W']>0]`

**Output:**

To further extract specific columns (say columns W and X) from the above result:

`df[df['W']>0][['W','X']]`

**Output:**

**Multiple conditions** can also be applied using & (‘and’ operator) and | (‘or’ operator).

`df[(df['W']>0) & (df['Z']>1)] #both conditions should hold`

**Output:**

`df[(df['W']>0) | (df['Z']>1)] #at least one condition should hold`

**Output:**

### Set index of DataFrame

First, we add a new column called ‘States’ to df:

`df['States'] = ['CA', 'NY', 'WY', 'OR', 'CO']`

Now, set this column as the index of df. Note that, like drop(), set_index() returns a new DataFrame and leaves df unchanged unless inplace=True is specified.

`df.set_index('States')`

**Output:**

### Reset index of DataFrame

```python
df.reset_index(inplace=True)  # Setting 'inplace' to True resets the index in the original DataFrame
df  # Display the DataFrame
```

**Output:**

This resets the index of df, but the original indices remain intact under a new column called ‘index’. We can drop that column.

```python
df.drop('index', axis=1, inplace=True)
df
```

**Output:**
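As a side note, the two steps above can be combined: reset_index() accepts a drop parameter that discards the old index instead of inserting it as a column. A minimal sketch on a toy DataFrame (the data and variable name are illustrative):

```python
import pandas as pd

# Toy DataFrame with custom row labels
df_tmp = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

# drop=True discards the old labels instead of adding them as an 'index' column
df_tmp.reset_index(drop=True, inplace=True)
df_tmp  # Now has the default RangeIndex 0, 1, 2 and no extra column
```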

### Deal with missing data

First, we create a DataFrame having some NaN values.

```python
# Create a dictionary
d1 = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}
# Create a DataFrame from the dictionary's data
# Keys become column names, values become data elements
df1 = pd.DataFrame(d1)
df1  # Display the DataFrame
```

**Output:**

**Drop rows with NaN**

`df1.dropna()`

**Output:**

By default, axis=0, so executing the above line of code drops the rows (not columns) having at least one NaN value.

To drop columns with NaN instead, specify axis=1 explicitly.

`df1.dropna(axis=1)`

**Output:**

We can also specify a ‘thresh’ parameter, which sets a threshold: a row (or column) is retained only if it contains at least ‘thresh’ non-NaN values; the rest are dropped.

`df1.dropna(thresh=2)`

**Output:**

The third row of df1 had only one non-NaN value, so it got dropped. We can instead fill the missing entries with a custom value, say the word ‘FILL’.

`df1.fillna(value='FILL')`

**Output:**

Missing values can also be replaced with a computed value, e.g. 1 and 2 are the non-null values of column ‘A’, whose mean is 1.5. All the missing values of df1 can be replaced with this average:

`df1.fillna(value=df1['A'].mean())`

**Output:**
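Note that the line above fills every missing entry, in any column, with the mean of column ‘A’. To fill each column's NaNs with that column's own mean, a Series of per-column means can be passed instead; a quick sketch using the same data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, np.nan],
                    'B': [5, np.nan, np.nan],
                    'C': [1, 2, 3]})

# df1.mean() is a Series of column means; fillna() matches it by column label,
# so each column's NaNs are filled with that column's own mean
filled = df1.fillna(df1.mean())
filled  # Column A's NaN becomes 1.5, column B's NaNs become 5.0
```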

We can check whether each element is missing or not using the isnull() function as follows:

`df1.isnull()`

**Output:**

### GroupBy

GroupBy functionality allows us to group several rows of a DataFrame based on a column and perform an aggregate function such as mean, sum, standard deviation etc.

Suppose, we create a DataFrame having company names, employee names and sales as follows:

```python
# Create a dictionary to define the data elements
d = {'Company':['Google','Google','Msft','Msft','FB','FB'],
     'Person':['A','B','C','D','E','F'],
     'Sales':[120,200,340,119,130,150]}
# Create a DataFrame using the dictionary
df2 = pd.DataFrame(data = d)
df2  # Display the DataFrame
```

**Output:**

Details of all employees of each of the unique companies can be grouped together as:

```python
byComp = df2.groupby('Company')  # This will create a GroupBy object
byComp
```

**Output**:

`<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f70a3c198d0>`

We can now apply aggregate functions to this GroupBy object, e.g. getting the average sales amount for each company:

`byComp.mean()`

**Output:**

We can obtain information about a particular company; say, the sum of the sales of FB can be obtained as:

`byComp.sum().loc['FB']`

**Output:**

```
Sales    280
Name: FB, dtype: int64
```

A complete summary of company-wise sales can be obtained using describe() function.

`byComp.describe()`

**Output:**

### Sorting a DataFrame

Let’s say we create a DataFrame as follows:

```python
df3_data = {'col1':[1,2,3,4],
            'col2':[44,44,55,66],
            'col3':['a','b','c','d']}
df3 = pd.DataFrame(df3_data)
df3
```

**Output:**

The DataFrame can be sorted based on the values of a particular column by passing that column’s name to ‘by’ parameter of sort_values() method.

`df3.sort_values(by='col1')`

**Output:**

`df3.sort_values(by='col2')`

**Output:**

By default, sorting occurs in ascending order of the values of a specified column. To sort in descending order, explicitly specify the ‘ascending’ parameter as False.

`df3.sort_values(by='col3',ascending=False)`

**Output:**

### Apply a function to a column

Suppose we define a custom function to multiply an element by two as follows:

```python
def times2(x):
    return x*2
```

The function can then be applied to all the values of a column using apply() method.

`df3['col1'].apply(times2)`

**Output:**

The same process can be done by defining a lambda function as:

`df3['col1'].apply(lambda x: x*2)`

Here, (lambda x: x*2) means that each element x of the selected column is multiplied by 2 and the result is returned.
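For a simple arithmetic operation like this, the same result can also be obtained with a vectorized expression on the column, which is generally faster than apply(). A minimal sketch (df3 recreated here so the snippet is self-contained):

```python
import pandas as pd

df3 = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Vectorized arithmetic applies element-wise to the whole column at once
doubled = df3['col1'] * 2
doubled  # A Series with values 2, 4, 6, 8
```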

### Pivot table creation

A pivot table summarizing the data of a DataFrame can be created by specifying custom criteria for summarization.

E.g. We create a DataFrame as:

```python
df4_data = {'A':['f','f','f','b','b','b'],
            'B':['one','one','two','two','one','one'],
            'C':['x','y','x','y','x','y'],
            'D':[1,3,2,5,4,1]}
df4 = pd.DataFrame(df4_data)
df4
```

**Output:**

`df4.pivot_table(values='D', index=['A','B'], columns='C')`

This will create a pivot table from df4 with a multi-level index – the outer level will have the unique values of column A, while the inner level will have the unique values of column B. The unique values of column C, i.e. x and y, will form the column names of the pivot table, and the table is populated with the values of column D.

**Output:**

There are NaN values in the table for combinations that do not exist in df4, e.g. there is no row in df4 for which A='b', B='two' and C='x', so the corresponding D value in the pivot table is NaN.
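If NaNs are undesirable, pivot_table() accepts a fill_value parameter that substitutes a default for missing combinations. A quick sketch using the same data (df4 recreated here so the snippet is self-contained):

```python
import pandas as pd

df4 = pd.DataFrame({'A': ['f','f','f','b','b','b'],
                    'B': ['one','one','two','two','one','one'],
                    'C': ['x','y','x','y','x','y'],
                    'D': [1,3,2,5,4,1]})

# fill_value replaces entries for missing combinations (NaN) with 0
pt = df4.pivot_table(values='D', index=['A','B'], columns='C', fill_value=0)
pt  # The (b, two) / x cell is now 0 instead of NaN
```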

You can find all the above-explained code snippets executed in a single **Google Colab notebook** available here.

## EndNote

We have covered several basic operations and functionalities of Pandas DataFrame data structure in this article. However, there are many other fundamental and complex functionalities that can be efficiently handled using a DataFrame. To put all such easy-to-handle operations on a DataFrame into practice, refer to the following sources:

- Pandas documentation
- Official guide to Pandas DataFrame
- Related articles (article 1) (article 2)
