## Table of contents

- What is a DataFrame?
- DataFrame creation and operations
- Create a Pandas DataFrame
- Get basic info of DataFrame
- Grab column(s) from DataFrame
- Check type of DataFrame object and its column
- Create a new column of DataFrame
- Drop a column from DataFrame
- Select a row from DataFrame
- Select specific element(s) from DataFrame
- Set index of DataFrame
- Reset index of DataFrame
- Deal with missing data
- GroupBy
- Sorting a DataFrame
- Apply a function to a column
- Pivot table creation

- EndNote

Python programming language provides a wide range of libraries and modules extensively used by Data Science enthusiasts. The most basic yet quite useful ones are NumPy and Pandas (built on the top of NumPy), which assist in systematic analysis and manipulation of data. This article talks about ‘DataFrame’ – a crucial concept for working with Pandas library.

## What is a DataFrame?

DataFrame is a 2D mutable data structure that can store heterogeneous data in tabular format (i.e. in the form of labelled rows and columns). By heterogeneous data, we mean a single DataFrame can comprise different data types’ content such as numerical, categorical etc.

The building block of a DataFrame is a **Pandas Series** object. Built on the top of the concept of NumPy arrays, Pandas Series is a 1D labelled array that can hold heterogeneous data. What differentiates a Pandas Series from a NumPy array is that it can be indexed using default numbering (starting from 0) or custom defined labels. Have a look at how NumPy array and Pandas Series differ.

```python
# First, import the NumPy and Pandas libraries
import numpy as np
import pandas as pd

my_data = [10,20,30]     # Initialize data elements
arr = np.array(my_data)  # Form a NumPy array of the elements
arr                      # Display the array
```

**Output:** `array([10, 20, 30])`

```python
ser = pd.Series(data = my_data)  # Form a Pandas Series of the elements
ser                              # Display the series
```

**Output:**

```
0    10
1    20
2    30
dtype: int64
```

Here, (0, 1, 2) are the index labels and (10, 20, 30) are the values of the 1D Series object.
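As mentioned above, a Series can also be indexed with custom defined labels instead of the default numbering. A minimal sketch (the labels 'a', 'b', 'c' and variable names here are illustrative):

```python
import pandas as pd

# A Series indexed with custom labels rather than 0, 1, 2
labels = ['a', 'b', 'c']
ser2 = pd.Series(data=[10, 20, 30], index=labels)
ser2['b']  # Access an element by its custom label
```

Accessing `ser2['b']` returns 20, just as `ser[1]` would with the default integer index.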

## DataFrame creation and operations

Here, we demonstrate how to deal with Pandas DataFrame using Pythonic code. Several (though not all) data operations possible with a DataFrame have been shown further in this article with explanation and code snippets.

**Note:** The code throughout this article has been implemented using Google Colab with Python 3.7.10, NumPy 1.19.5 and pandas 1.1.5.

**Create a Pandas DataFrame**

Populate a DataFrame with random numbers selected from a standard normal distribution using randn() function.

```python
from numpy.random import randn

np.random.seed(42)  # Seed the generator for reproducible results
df = pd.DataFrame(randn(5,4), ['A','B','C','D','E'], ['W','X','Y','Z'])
```

Here, the arguments to DataFrame() are the data to fill the DataFrame with, the list of row labels and the list of column names, respectively.

`df #Display the DataFrame`

**Output:**

All the columns in the above DataFrame are Pandas Series objects sharing the common index A, B, …, E. A DataFrame is thus a collection of Series with a shared index.

### Get basic info of DataFrame

A basic summary of the number of rows and columns, data types and memory usage of a DataFrame can be obtained using the info() function as follows:

`df.info()`

**Output:**

Aggregated statistics such as the total count of entries, mean, minimum, maximum, standard deviation etc. of each numerical column can be found as:

`df.describe()`

**Output:**

Column names of a DataFrame can be obtained using the ‘columns’ attribute as:

`df.columns`

**Output:** `Index(['W', 'X', 'Y', 'Z'], dtype='object')`

To know the range of indexes of a DataFrame, use the ‘index’ property of the DataFrame object.

`df.index`

**Output:** `Index(['A', 'B', 'C', 'D', 'E'], dtype='object')`

Since df was created with the custom labels A to E, an Index of those labels is returned. A DataFrame with the default integer index would instead return a `RangeIndex(start=0, stop=5, step=1)`, i.e. an index running from the ‘start’ value to (‘stop’ - 1), here 0 to 4, incrementing in steps of 1.

### Grab column(s) from DataFrame

Suppose we want to extract the column ‘W’ from the above DataFrame df. It can be done as:

`df['W']`

or `df.W`

**Output:**

Multiple columns from a DataFrame can be extracted as:

`df[['W','Z']]`

**Output:**

### Check type of DataFrame object and its column

`type(df)`

**Output:** `pandas.core.frame.DataFrame`

`type(df['W'])`

**Output:** `pandas.core.series.Series`

The above output verifies that each column of DataFrame is a Series object.

### Create a new column of DataFrame

Suppose we want to create a column with the label ‘new_col’ whose elements are the sums of the elements of columns W and Y. It can be created as follows:

```python
df['new_col'] = df['W'] + df['Y']
df  # Display the modified DataFrame
```

**Output:**

### Drop a column from DataFrame

Let’s say we now need to drop the ‘new_col’ column from df DataFrame.

`df.drop('new_col')`

Executing this line of code will result in an error as follows:

`KeyError: "['new_col'] not found in axis"`

To drop a column, we need to set the ‘axis’ parameter to 1 (indicates ‘column’). Its default value is 0, which denotes a ‘row’, and since there is no row with the label ‘new_col’ in df, we got the above error.

`df.drop('new_col',axis=1)`

**Output:**

**Why axis=0 denotes row, and axis=1 denotes column?**

If we check the shape of DataFrame df as:

`df.shape`

We get the output as (5,4), which means there are 5 rows and 4 columns in df. The shape of a DataFrame is thus stored as a tuple in which the element at index 0 denotes the number of rows and the element at index 1 denotes the number of columns. Hence, axis values 0 and 1 denote row and column, respectively.

However, if we check the df DataFrame now, it will still have the ‘new_col’ column in it. To drop the column from the original DataFrame as well, we need to specify the ‘inplace=True’ parameter in the drop() function as follows:

`df.drop('new_col',axis=1,inplace=True)`
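The effect of inplace can be verified on a small toy DataFrame (the data and variable name here are illustrative, not from the article's df):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with two columns
df_toy = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['W', 'new_col'])

# Without inplace=True, drop() returns a modified copy; df_toy is untouched
df_toy.drop('new_col', axis=1)

# With inplace=True, the column is removed from df_toy itself
df_toy.drop('new_col', axis=1, inplace=True)
```

After the second drop() call, `'new_col' in df_toy.columns` evaluates to False.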

### Select a row from DataFrame

The ‘loc’ property extracts a row by its label, while the ‘iloc’ property selects a row by its integer position (starting from 0 for the 1st row).

`df.loc['A'] #to extract row with index 'A'`

**Output:**

`df.iloc[2] #to extract row at position 2, i.e. the 3rd row of df`

**Output:**

### Select specific element(s) from DataFrame

To extract a single element:

`df.loc['B','Y'] #element at row B and column Y`

**Output:** `1.5792128155073915`

Multiple elements from specified rows and columns can also be extracted as:

`df.loc[['A','B'], ['W','Y']] #elements from rows A and B and columns W and Y`

**Output:**

**Conditional selection of elements:**

We can check specific conditions, say which elements are greater than 0. ‘True’ or ‘False’ in the output will indicate if the element at that location satisfies the condition or not.

`df>0`

**Output:**

To get the actual elements within the DataFrame which satisfy the condition:

`df[df>0] `

**Output:**

Here, NaN (Not a Number) is displayed in place of elements that are not greater than 0.

If we need only those rows for which a particular column, say W, has elements greater than 0:

`df[df['W']>0]`

**Output:**

To further extract specific columns (say columns W and X) from the above result:

`df[df['W']>0][['W','X']]`

**Output:**

**Multiple conditions** can also be applied using & (‘and’ operator) and | (‘or’ operator).

`df[(df['W']>0) & (df['Z']>1)] #both conditions should hold`

**Output:**

`df[(df['W']>0) | (df['Z']>1)] #at least one condition should hold`

**Output:**

### Set index of DataFrame

First, we add a new column called ‘States’ to df:

`df['States'] = ['CA', 'NY', 'WY', 'OR', 'CO']`

Now, set this column as the index of df. Note that, like drop(), set_index() returns a new DataFrame and leaves df unchanged unless inplace=True is specified.

`df.set_index('States')`

**Output:**

### Reset index of DataFrame

```python
df.reset_index(inplace=True)  # Setting 'inplace' to True resets the index in the original DataFrame
df  # Display the DataFrame
```

**Output:**

This resets the index of df, but the original indices remain intact under a new column called ‘index’. We can drop that column.

```python
df.drop('index', axis=1, inplace=True)
df
```

**Output:**
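As a side note, the two steps above can be combined: reset_index() accepts a drop parameter that discards the old index instead of inserting it as a column. A minimal sketch on a toy DataFrame (the data and variable name are illustrative):

```python
import pandas as pd

# Toy DataFrame with custom row labels
df_tmp = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

# drop=True discards the old labels instead of adding them as an 'index' column
df_tmp.reset_index(drop=True, inplace=True)
df_tmp  # Now has the default RangeIndex 0, 1, 2 and no extra column
```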

### Deal with missing data

First, we create a DataFrame having some NaN values.

```python
# Create a dictionary
d1 = {'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}
# Create a DataFrame from the dictionary's data
# Keys become column names, values become data elements
df1 = pd.DataFrame(d1)
df1  # Display the DataFrame
```

**Output:**

**Drop rows with NaN**

`df1.dropna()`

**Output:**

By default, axis=0, so executing the above line of code drops the rows (not columns) having at least one NaN value.

To drop columns with NaN instead, specify axis=1 explicitly.

`df1.dropna(axis=1)`

**Output:**

We can also specify a ‘thresh’ parameter, which sets a threshold: a row (or column) is retained only if it contains at least ‘thresh’ non-NaN values; the rest are dropped.

`df1.dropna(thresh=2)`

**Output:**

The third row of df1 had only one non-NaN value, so it got dropped. We can instead fill the missing entries with a custom value, say the word ‘FILL’.

`df1.fillna(value='FILL')`

**Output:**

Missing values can also be replaced with a computed value, e.g. 1 and 2 are the non-null values of column ‘A’, whose mean is 1.5. All the missing values of df1 can be replaced with this average:

`df1.fillna(value=df1['A'].mean())`

**Output:**
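Note that the line above fills every missing entry, in any column, with the mean of column ‘A’. To fill each column's NaNs with that column's own mean, a Series of per-column means can be passed instead; a quick sketch using the same data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, np.nan],
                    'B': [5, np.nan, np.nan],
                    'C': [1, 2, 3]})

# df1.mean() is a Series of column means; fillna() matches it by column label,
# so each column's NaNs are filled with that column's own mean
filled = df1.fillna(df1.mean())
filled  # Column A's NaN becomes 1.5, column B's NaNs become 5.0
```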

We can check whether each element is missing or not using the isnull() function as follows:

`df1.isnull()`

**Output:**

### GroupBy

GroupBy functionality allows us to group several rows of a DataFrame based on a column and perform an aggregate function such as mean, sum, standard deviation etc.

Suppose, we create a DataFrame having company names, employee names and sales as follows:

```python
# Create a dictionary to define the data elements
d = {'Company':['Google','Google','Msft','Msft','FB','FB'],
     'Person':['A','B','C','D','E','F'],
     'Sales':[120,200,340,119,130,150]}
# Create a DataFrame using the dictionary
df2 = pd.DataFrame(data = d)
df2  # Display the DataFrame
```

**Output:**

Details of all employees of each of the unique companies can be grouped together as:

```python
byComp = df2.groupby('Company')  # This will create a GroupBy object
byComp
```

**Output**:

`<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f70a3c198d0>`

We can now apply aggregate functions to this GroupBy object, e.g. getting the average sales amount for each company:

`byComp.mean()`

**Output:**

We can obtain information about a particular company; say, the sum of the sales of FB can be obtained as:

`byComp.sum().loc['FB']`

**Output:**

```
Sales    280
Name: FB, dtype: int64
```

A complete summary of company-wise sales can be obtained using describe() function.

`byComp.describe()`

**Output:**

### Sorting a DataFrame

Let’s say we create a DataFrame as follows:

```python
df3_data = {'col1':[1,2,3,4],
            'col2':[44,44,55,66],
            'col3':['a','b','c','d']}
df3 = pd.DataFrame(df3_data)
df3
```

**Output:**

The DataFrame can be sorted based on the values of a particular column by passing that column’s name to ‘by’ parameter of sort_values() method.

`df3.sort_values(by='col1')`

**Output:**

`df3.sort_values(by='col2')`

**Output:**

By default, sorting occurs in ascending order of the values of a specified column. To sort in descending order, explicitly specify the ‘ascending’ parameter as False.

`df3.sort_values(by='col3',ascending=False)`

**Output:**

### Apply a function to a column

Suppose we define a custom function to multiply an element by two as follows:

```python
def times2(x):
    return x*2
```

The function can then be applied to all the values of a column using apply() method.

`df3['col1'].apply(times2)`

**Output:**

The same process can be done by defining a lambda function as:

`df3['col1'].apply(lambda x: x*2)`

Here, (lambda x: x*2) means that each element x of the selected column is multiplied by 2 and the result is returned.
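For a simple arithmetic operation like this, the same result can also be obtained with a vectorized expression on the column, which is generally faster than apply(). A minimal sketch (df3 recreated here so the snippet is self-contained):

```python
import pandas as pd

df3 = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Vectorized arithmetic applies element-wise to the whole column at once
doubled = df3['col1'] * 2
doubled  # A Series with values 2, 4, 6, 8
```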

### Pivot table creation

A pivot table summarizing the data of a DataFrame can be created by specifying custom criteria for summarization.

E.g. We create a DataFrame as:

```python
df4_data = {'A':['f','f','f','b','b','b'],
            'B':['one','one','two','two','one','one'],
            'C':['x','y','x','y','x','y'],
            'D':[1,3,2,5,4,1]}
df4 = pd.DataFrame(df4_data)
df4
```

**Output:**

`df4.pivot_table(values='D', index=['A','B'], columns='C')`

This will create a pivot table from df4 with a multi-level index – the outer level will have the unique values of column A, while the inner level will have the unique values of column B. The unique values of column C, i.e. x and y, will form the column names of the pivot table, and the table is populated with the values of column D.

**Output:**

There are NaN values in the table for combinations that do not exist in df4, e.g. there is no row in df4 for which A='b', B='two' and C='x', so the corresponding D value in the pivot table is NaN.
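If NaNs are undesirable, pivot_table() accepts a fill_value parameter that substitutes a default for missing combinations. A quick sketch using the same data (df4 recreated here so the snippet is self-contained):

```python
import pandas as pd

df4 = pd.DataFrame({'A': ['f','f','f','b','b','b'],
                    'B': ['one','one','two','two','one','one'],
                    'C': ['x','y','x','y','x','y'],
                    'D': [1,3,2,5,4,1]})

# fill_value replaces entries for missing combinations (NaN) with 0
pt = df4.pivot_table(values='D', index=['A','B'], columns='C', fill_value=0)
pt  # The (b, two) / x cell is now 0 instead of NaN
```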

You can find all the above-explained code snippets executed in a single **Google Colab notebook** available here.

## EndNote

We have covered several basic operations and functionalities of Pandas DataFrame data structure in this article. However, there are many other fundamental and complex functionalities that can be efficiently handled using a DataFrame. To put all such easy-to-handle operations on a DataFrame into practice, refer to the following sources:

- Pandas documentation
- Official guide to Pandas DataFrame
- Related articles (article 1) (article 2)
