
How to obtain a Pandas Dataframe from a gzip file?


Nowadays data is available in many formats, and it is often zipped to save storage space and to make it easier to transmit across platforms. Zipping usually means compressing the data without any loss of information, so the original data can be reconstructed on a different platform simply by unzipping it. gzip is one such format: large files are compressed into much smaller ones and can be decompressed easily, which is why it is widely used for data transmission to cloud platforms and servers and in various ETL tools. In this article, we will see how to decompress a gzip file directly into a pandas dataframe.

Table of Contents

  1. What is a gzip file?
  2. What are the benefits of a gzip file?
  3. Implementation for obtaining pandas dataframe from a gzip file
  4. Summary

What is a gzip file?

Among the various file-compression formats, gzip is one where larger files are compressed into much smaller ones. All gzip files end with the .gz file extension. The format was created in 1992 and released as an open, patent-free replacement for the Unix compress utility, and today gzip is used extensively for fast data transmission and in ETL tools.
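As a quick illustration of the lossless compress-and-decompress round trip described above, the sketch below uses Python's built-in gzip module on a small byte string (the sample text is just an assumption for demonstration).

import gzip

# A small, repetitive byte string used purely for demonstration.
original = b"sample text repeated many times " * 100

compressed = gzip.compress(original)    # compress to gzip-framed bytes
restored = gzip.decompress(compressed)  # decompress back to the original bytes

print(len(original), len(compressed))   # the compressed payload is much smaller
assert restored == original             # the round trip is lossless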


What are the benefits of a gzip file?

  1. Files are easy to compress and decompress across various platforms.
  2. Reduces data transmission time on cloud platforms.
  3. Can compress any type of data, from images to plain text.
  4. Speeds up delivery from web servers; around 75% of web servers are said to use this compression format.

Implementation for obtaining a pandas dataframe from a gzip file

As gzip supports compression of various data formats, the time it takes to load a gzip file varies with the platform and the resources available. On cloud-based or server-based platforms, gzip files may decompress faster than on local hardware.

So in this article, a standard gzip file is used, and the complete implementation of decompressing it into a standard pandas dataframe is shown.

Let us first import the basic libraries required for loading the dataframe.

import numpy as np
import pandas as pd

Here the subprocess module of Python is used, rather than the os module, to list the files available in the input directory: check_output runs the ls command and returns its output as bytes, which are then decoded into a UTF-8 string.

from subprocess import check_output

# List the files available in the input directory (requires a Unix-style "ls" command).
print(check_output(["ls", "../input"]).decode("utf8"))
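Note that shelling out to ls only works where a Unix-style ls command is available. As a platform-independent alternative (an assumption, not part of the original walkthrough), the same listing can be done with pathlib.

from pathlib import Path

# List the files in the input directory without relying on an external "ls" command.
for path in Path("../input").iterdir():
    print(path.name)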

Here, two gzip files of different sizes are used: one of roughly 465 MB and one of about 2 MB.

Let us see if there is any time difference between loading a smaller gzip file and a larger gzip file in the same working environment.
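Before loading anything, it can help to confirm the compressed size of each file on disk. A minimal sketch, assuming the same two file paths used below:

import os

# Print the on-disk (compressed) size of each gzip file in megabytes.
for path in ['../input/dot_traffic_stations_2015.txt.gz',
             '../input/dot_traffic_2015.txt.gz']:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.2f} MB")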

Loading a smaller gzip file

Here we decompress a 2.26 MB gzip file in the working environment.

# Read the compressed CSV directly into a dataframe; pandas decompresses the gzip on the fly.
gzip_df_small = pd.read_csv('../input/dot_traffic_stations_2015.txt.gz', compression='gzip',
                            header=0, sep=',', quotechar='"')
gzip_df_small.head(10)
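Note that read_csv defaults to compression='infer', which detects gzip from the .gz suffix, so the explicit compression argument is optional here; the sketch below is an equivalent call.

# Equivalent sketch: pandas infers gzip compression from the ".gz" file extension.
gzip_df_small = pd.read_csv('../input/dot_traffic_stations_2015.txt.gz',
                            header=0, sep=',', quotechar='"')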

Loading a larger gzip file

Here we decompress a 465.12 MB gzip file in the same working environment.

gzip_df_big = pd.read_csv('../input/dot_traffic_2015.txt.gz', compression='gzip', 
                         header=0, sep=',', quotechar='"')

gzip_df_big.head(10)
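To compare the loading times discussed above, a simple sketch with time.perf_counter can time both reads; the exact numbers will vary with the working environment.

import time

# Time the decompression and parsing of each gzip file.
for path in ['../input/dot_traffic_stations_2015.txt.gz',
             '../input/dot_traffic_2015.txt.gz']:
    start = time.perf_counter()
    df = pd.read_csv(path, compression='gzip', header=0, sep=',', quotechar='"')
    elapsed = time.perf_counter() - start
    print(f"{path}: {df.shape[0]} rows loaded in {elapsed:.2f} s")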

Key Outcomes of decompressing gzip files

  1. Depending on the size of the gzip file and the working environment, decompression time can vary from a fraction of a second to several minutes.
  2. The variation in decompression time across platforms is noticeable, although gzip consistently decompresses files within a reasonable time range.
  3. You need to know how the underlying data is delimited and quoted so that the appropriate separator, quote, and escape characters can be passed when reading the file, as sketched below.
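For example, a file delimited by semicolons and quoted with single quotes would need matching arguments; the file name below is hypothetical.

# Hypothetical sketch: reading a semicolon-delimited, single-quoted gzip file.
df = pd.read_csv('example_semicolon_delimited.csv.gz', compression='gzip',
                 sep=';', quotechar="'")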

Summary

Transferring large, uncompressed data across platforms is time-consuming and memory-inefficient, and serving that data to applications may not be feasible under such constraints. This is where zipped file formats play a vital role in efficient data transmission. gzip is one such format, used heavily for data transmission over web servers and in ETL tools because it is lightweight and decompresses quickly on any platform. Once decompressed into a pandas dataframe, the data can easily be manipulated as required by the user or the data handlers.


Darshan M

Darshan holds a Master's degree in Data Science and Machine Learning and is an everyday learner of the latest trends in the field. He is always keen to learn new things, implement them, and curate rich content on Data Science, Machine Learning, NLP and AI.