How to create a pivot table in Python from scratch?

A Pivot table is an exquisite tool to calculate, scrutinize and epitomize a large amount of data by the method of aggregation functions. These aggregation functions include sum, average, count and statistics. It helps to encapsulate one or more numeric variables based on categorical variables. Pivot tables are popular data analysis tools because they are the best way to summarize data and have a high level of understanding of the data which helps in further breaking down the solutions. To learn how it can be implemented practically, we will take random data and perform the tests with python and so we will try to learn this concept from scratch. The major points to be covered in this article are listed below.

Table of contents

  1. Introduction to pandas.pivot_table()
  2. Creating a pivot table 
  3. Using multiple aggregation functions
  4. Aggregating for multiple features and specific features
  5. Calculate row and column total 
  6. Multi-level index pivot table
  7. Difference between pivot table and group by

Introduction to pandas.pivot_table()

To create a spreadsheet-style pivot table as a data frame in python, we use pandas.pivot_table() function. 

Syntax

AIM Daily XO

Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

Parameters

data: data frame ( Defining the dataset that is to be used for the pivot table.) 


Download our Mobile App



values: column to aggregate ( Feature that is to be seen in its statistical summary.)

index: column ( Indexes the value passed in the value argument )

columns: column ( For aggregating values based on certain features )

aggfunc: function or list of functions ( Aggregating functions like sum, mean, etc )

fill_value: scalar ( Value to replace missing values in the table )

margins: bool ( Add all row / columns ( e.g. for subtotal / grand totals ) )

Returns

Dataframe in excel style pivot table.

Creating a pivot table

By specifying the index and columns parameters in the pd.pivot_table() function, you can determine which features should appear in the columns and rows. In the values parameter, you should specify which feature should be used to fill in the cell values.

Import libraries to be used

import pandas as pd
import numpy as np

Creating a sample dataset 

df = pd.DataFrame({'Name': ['Minecraft', 'Grand Theft Auto V', 'Tetris (EA)', 'Wii Sports', 'PUBG: Battlegrounds', ],
                   'Genre': ['Survival,Sandbox', 'Action-adventure', 'Puzzle', 'Sports simulation', 'Battle royale'],
                   'Platform': ['Multi-platform', 'Multi-platform', 'Multi-platform', 
                            'Wii', 'PC'],
                   'Publishers': ['Xbox Game Studios', 'Rockstar Games', 'Electronic Arts', 
                                  'Nintendo', 'PUBG Corporation'],
                   'Total_Year': [9, 7, 14, 10, 5],
                   'Sales': [238, 160, 100, 82, 75]})
df

This data frame tells about the top 5 best-seller video games to date all over the world. Sales are in million USD. 

Create pivot table using pandas

table = pd.pivot_table( data=df, 
                        index=['Platform'], 
                        columns=['Publishers'], 
                        values='Sales',
                        aggfunc='mean')
table

Specifying the data as the name of the data frame created ‘df’, index as ‘platform’ because the need to see the sales according to different platforms and value as ‘sales’ and aggfunc as ‘mean’ to see the average sales across different platforms.

Here we have created a pivot table that shows average sales across different platforms. There would be Nan values so don’t worry about it, we will see further in the article how to handle missing values.

Using multiple aggregation functions

If the column parameter is not specified, it will aggregate based on the index. So, let’s not specify the column parameter and see what changes happen in our pivot table.

table = pd.pivot_table(data=df, index=['Platform'],
                        values='Sales',
                        aggfunc=['sum', 'mean', 'count'])
table

Here we can observe that the column parameter is not specified and a list of different aggregation functions is passed in the aggfunc parameter.

As discussed above, if the column parameter is not specified, the function itself aggregates on the index parameter.

Aggregating for multiple features and specific features

Additionally, we can perform different aggregations based on different features. The result is that multiple pivot tables need not be created to apply appropriate operations to different features.

table = pd.pivot_table(data=df, index='Platform', values=['Sales', 'Total_Year'],
                        columns=['Publishers'],
                        aggfunc={'Sales': np.sum, 'Total_Year': np.mean})
table

We can specify different aggregation on specific features or the same aggregation function on different multiple features whichever way you want your output. 

Left side in the ‘sales’ column np.sum is applied which will return the sum aggregated of the data and on the right side ‘Total_year’ column np.mean is applied which will return the average of the feature specified in the value parameter. 

Replacing missing values

In the data frame, there are a lot of missing values that can be handled by filling those by specifying the value in the fill_value parameter.

table = pd.pivot_table(data=df, index='Platform', values=['Sales', 'Total_Year'],
                        columns=['Publishers'],
                        aggfunc={'Sales': np.sum, 'Total_Year': np.mean},
                        fill_value='N/A')
table

Therefore all the missing values are filled with ‘N/A’ as specified in the fill_value parameter.

Calculate row and column total

Next, let’s examine the sales totals of each category of the platform. To do this, we will use the margins and margins_name parameters.

table = pd.pivot_table(data=df, index=['Platform'],
                        values='Sales',
                        aggfunc=['sum', 'mean', 'count'],
                        margins=True,
                        margins_name='Grand Total')
table

We can observe in that output that a new index is added as a total which contains the total sales for different aggregation functions applied on the sales column.

Multi-level index pivot table

Based on the above pivot tables, only one feature was used in the index, i.e., a single level index. We can, however, create pivot tables using multiple indices. Whenever data is organized hierarchically, a pivot table with multi-level indexes can provide very useful and detailed summary information.

table = pd.pivot_table(data=df, index=['Platform', 'Genre'],
                        values='Sales',
                        aggfunc=['sum', 'mean', 'count'],
                        margins=True,
                        margins_name='Grand Total')
table

Referring to the output there are two indexes (multi-level index) which explain that in multi-platform there are three genres and the sum of their sales, mean and count are following. The order in which the indexes are passed is important and the output will vary accordingly.

Difference between .pivot_table() and .groupby()

The main difference between pandas.pivot_table() and pandas.dataFrame.groupby()

is the resultant in pivot table the resultant is in the form of data frame whereas in group by resultant is in dataframe.groupby. Difference in output is shown below:

This is how a pivot Table is coded in python

able = pd.pivot_table( data=df, 
                        index=['Platform'], 
                        columns=['Publishers'], 
                        values='Sales',
                        aggfunc='mean')
table

This is how a groupby is done in python 

group= df.groupby('Platform')

Now let’s see the resultant output.

Output for the pivot table can be printed by simply calling the name of the data frame in which it is stored and the data type is DataFrame.

Output for groupby can be printed by using the .get_group(‘name of the group’) function. It only prints the group by which they formed. 

Conclusion

With the help of this article, we have learned about pivot tables and how to build a spreadsheet-style pivot table in python by using pandas. We could also understand how to use parameters of the pivot table function to get the desired output. Along with these, we have also learnt the core difference between a groupby and pivot table and where to use them.

References

  1. Link for the above code
  2. Pandas pivot table documentation 
  3. Pandas group by documentation

Sign up for The Deep Learning Podcast

by Vijayalakshmi Anandan

The Deep Learning Curve is a technology-based podcast hosted by Vijayalakshmi Anandan - Video Presenter and Podcaster at Analytics India Magazine. This podcast is the narrator's journey of curiosity and discovery in the world of technology.

Sourabh Mehta
Sourabh has worked as a full-time data scientist for an ISP organisation, experienced in analysing patterns and their implementation in product development. He has a keen interest in developing solutions for real-time problems with the help of data both in this universe and metaverse.

Our Upcoming Events

27-28th Apr, 2023 I Bangalore
Data Engineering Summit (DES) 2023

23 Jun, 2023 | Bangalore
MachineCon India 2023

21 Jul, 2023 | New York
MachineCon USA 2023

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
MOST POPULAR