A Pivot table is an exquisite tool to calculate, scrutinize and epitomize a large amount of data by the method of aggregation functions. These aggregation functions include sum, average, count and statistics. It helps to encapsulate one or more numeric variables based on categorical variables. Pivot tables are popular data analysis tools because they are the best way to summarize data and have a high level of understanding of the data which helps in further breaking down the solutions. To learn how it can be implemented practically, we will take random data and perform the tests with python and so we will try to learn this concept from scratch. The major points to be covered in this article are listed below.
Table of contents
- Introduction to pandas.pivot_table()
- Creating a pivot table
- Using multiple aggregation functions
- Aggregating for multiple features and specific features
- Calculate row and column total
- Multi-level index pivot table
- Difference between pivot table and group by
Introduction to pandas.pivot_table()
To create a spreadsheet-style pivot table as a data frame in python, we use pandas.pivot_table() function.
Syntax
AIM Daily XO
Join our editors every weekday evening as they steer you through the most significant news of the day, introduce you to fresh perspectives, and provide unexpected moments of joy
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)
Parameters
data: data frame ( Defining the dataset that is to be used for the pivot table.)
Download our Mobile App
values: column to aggregate ( Feature that is to be seen in its statistical summary.)
index: column ( Indexes the value passed in the value argument )
columns: column ( For aggregating values based on certain features )
aggfunc: function or list of functions ( Aggregating functions like sum, mean, etc )
fill_value: scalar ( Value to replace missing values in the table )
margins: bool ( Add all row / columns ( e.g. for subtotal / grand totals ) )
Returns
Dataframe in excel style pivot table.
Creating a pivot table
By specifying the index and columns parameters in the pd.pivot_table() function, you can determine which features should appear in the columns and rows. In the values parameter, you should specify which feature should be used to fill in the cell values.
Import libraries to be used
import pandas as pd
import numpy as np
Creating a sample dataset
df = pd.DataFrame({'Name': ['Minecraft', 'Grand Theft Auto V', 'Tetris (EA)', 'Wii Sports', 'PUBG: Battlegrounds', ], 'Genre': ['Survival,Sandbox', 'Action-adventure', 'Puzzle', 'Sports simulation', 'Battle royale'], 'Platform': ['Multi-platform', 'Multi-platform', 'Multi-platform', 'Wii', 'PC'], 'Publishers': ['Xbox Game Studios', 'Rockstar Games', 'Electronic Arts', 'Nintendo', 'PUBG Corporation'], 'Total_Year': [9, 7, 14, 10, 5], 'Sales': [238, 160, 100, 82, 75]}) df
This data frame tells about the top 5 best-seller video games to date all over the world. Sales are in million USD.
Create pivot table using pandas
table = pd.pivot_table( data=df, index=['Platform'], columns=['Publishers'], values='Sales', aggfunc='mean') table
Specifying the data as the name of the data frame created ‘df’, index as ‘platform’ because the need to see the sales according to different platforms and value as ‘sales’ and aggfunc as ‘mean’ to see the average sales across different platforms.
Here we have created a pivot table that shows average sales across different platforms. There would be Nan values so don’t worry about it, we will see further in the article how to handle missing values.
Using multiple aggregation functions
If the column parameter is not specified, it will aggregate based on the index. So, let’s not specify the column parameter and see what changes happen in our pivot table.
table = pd.pivot_table(data=df, index=['Platform'], values='Sales', aggfunc=['sum', 'mean', 'count']) table
Here we can observe that the column parameter is not specified and a list of different aggregation functions is passed in the aggfunc parameter.
As discussed above, if the column parameter is not specified, the function itself aggregates on the index parameter.
Aggregating for multiple features and specific features
Additionally, we can perform different aggregations based on different features. The result is that multiple pivot tables need not be created to apply appropriate operations to different features.
table = pd.pivot_table(data=df, index='Platform', values=['Sales', 'Total_Year'], columns=['Publishers'], aggfunc={'Sales': np.sum, 'Total_Year': np.mean}) table
We can specify different aggregation on specific features or the same aggregation function on different multiple features whichever way you want your output.
Left side in the ‘sales’ column np.sum is applied which will return the sum aggregated of the data and on the right side ‘Total_year’ column np.mean is applied which will return the average of the feature specified in the value parameter.
Replacing missing values
In the data frame, there are a lot of missing values that can be handled by filling those by specifying the value in the fill_value parameter.
table = pd.pivot_table(data=df, index='Platform', values=['Sales', 'Total_Year'], columns=['Publishers'], aggfunc={'Sales': np.sum, 'Total_Year': np.mean}, fill_value='N/A') table
Therefore all the missing values are filled with ‘N/A’ as specified in the fill_value parameter.
Calculate row and column total
Next, let’s examine the sales totals of each category of the platform. To do this, we will use the margins and margins_name parameters.
table = pd.pivot_table(data=df, index=['Platform'], values='Sales', aggfunc=['sum', 'mean', 'count'], margins=True, margins_name='Grand Total') table
We can observe in that output that a new index is added as a total which contains the total sales for different aggregation functions applied on the sales column.
Multi-level index pivot table
Based on the above pivot tables, only one feature was used in the index, i.e., a single level index. We can, however, create pivot tables using multiple indices. Whenever data is organized hierarchically, a pivot table with multi-level indexes can provide very useful and detailed summary information.
table = pd.pivot_table(data=df, index=['Platform', 'Genre'], values='Sales', aggfunc=['sum', 'mean', 'count'], margins=True, margins_name='Grand Total') table
Referring to the output there are two indexes (multi-level index) which explain that in multi-platform there are three genres and the sum of their sales, mean and count are following. The order in which the indexes are passed is important and the output will vary accordingly.
Difference between .pivot_table() and .groupby()
The main difference between pandas.pivot_table() and pandas.dataFrame.groupby()
is the resultant in pivot table the resultant is in the form of data frame whereas in group by resultant is in dataframe.groupby. Difference in output is shown below:
This is how a pivot Table is coded in python
able = pd.pivot_table( data=df, index=['Platform'], columns=['Publishers'], values='Sales', aggfunc='mean') table
This is how a groupby is done in python
group= df.groupby('Platform')
Now let’s see the resultant output.
Output for the pivot table can be printed by simply calling the name of the data frame in which it is stored and the data type is DataFrame.
Output for groupby can be printed by using the .get_group(‘name of the group’) function. It only prints the group by which they formed.
Conclusion
With the help of this article, we have learned about pivot tables and how to build a spreadsheet-style pivot table in python by using pandas. We could also understand how to use parameters of the pivot table function to get the desired output. Along with these, we have also learnt the core difference between a groupby and pivot table and where to use them.
References