
Complete Guide To Different Persisting Methods In Pandas

In this article, we will understand how CSV handles the different pandas data types, explore other ways to store data, and compare the performance of each of these persisting methods.

For machine learning engineers, it is common practice to save data in CSV format. Though CSV stores data in a rectangular, tabular form, it is not always suitable for persisting pandas DataFrames: CSV files tend to be slow to read and write, take up more memory and disk space and, most importantly, do not store information about data types.



How does CSV handle different data types?

Pandas supports a wide range of data types and sub-types, which makes it easy to work with huge datasets. The most commonly used are object, string, timedelta, int, float, bool and category. Pandas uses NumPy arrays as its backend, so a data type such as int is actually stored as int8, int16, etc., for efficiency. CSV, however, stores no information about data types: the user has to specify them with the read_csv method, and types such as timedelta end up stored as plain strings.

Example: 

data_csv = pd.read_csv(example, dtype={"colA": "int8", "colB": "int16", "colC": "category"})

Specifying the dtype of every column like this quickly becomes repetitive when there are many columns, and it is not very efficient.
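To make the problem concrete, here is a minimal sketch (with a small made-up dataframe) showing that a plain CSV round trip silently drops the smaller integer and category dtypes:

import pandas as pd

# hypothetical example dataframe with explicitly set dtypes
df = pd.DataFrame({"colA": [1, 2, 3], "colC": ["x", "y", "x"]})
df["colA"] = df["colA"].astype("int8")
df["colC"] = df["colC"].astype("category")

df.to_csv("example.csv", index=False)
reloaded = pd.read_csv("example.csv")

print(df.dtypes)        # colA: int8, colC: category
print(reloaded.dtypes)  # colA: int64, colC: object -- the dtype information is gone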

Persisting methods other than CSV 

  1. Pickle: Pickle is Python's native format for object serialization. Its advantage is that it can serialize almost any Python object, so a dataframe is stored with all of its data types intact. It is much faster to read and write than CSV and, with compression enabled, the file size can drop to roughly half that of a CSV file. There is also no need to specify the data type of each column when reading, as is required with CSV.

The way to implement this is simple. 

dataframe.to_pickle(path)

Path: where the data will be stored
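A minimal sketch of a pickle round trip, assuming a dataframe df already exists and using the optional bz2 compression:

# assuming df is an existing dataframe
df.to_pickle("data.pkl.bz2", compression="bz2")                # write, compressed with bz2
df_loaded = pd.read_pickle("data.pkl.bz2", compression="bz2")  # read it back

print(df_loaded.dtypes)  # all original dtypes, including category, are preserved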

  2. Parquet: Parquet is a compressed, columnar storage format used widely in the Hadoop ecosystem. It can serialize complex nested structures, supports column-wise compression and column-wise encoding, and offers fast reads. Its advantages over CSV are that it preserves information about all data types, is faster to read and write, and is well supported across the Hadoop ecosystem, allowing fast filtering.

To implement in python:

dataframe.to_parquet(path, engine, compression, index, partition_cols)

Path: where the data is stored

Engine: pyarrow or fastparquet engine

Compression: the compression method to use (for example snappy, gzip or brotli)

Index: whether to save the dataframe index

Partition_cols: the columns by which to partition the dataset
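A minimal sketch, assuming df already exists and the pyarrow package is installed:

# assuming df is an existing dataframe and pyarrow is installed
df.to_parquet("data.parquet.gzip", engine="pyarrow", compression="gzip", index=False)
df_loaded = pd.read_parquet("data.parquet.gzip", engine="pyarrow")

print(df_loaded.dtypes)  # numeric, datetime and (with pyarrow) category dtypes survive the round trip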

  3. Excel: Exporting data to Excel has its own advantages because of the easy manipulation Excel offers. It also allows custom formatting and freezing of cells.

dataframe.to_excel(excel_writer, sheet_name, float_format, freeze_panes)

Excel_writer: pandas excel writer object or file path

sheet_name: name of the sheet where the data will be output

Float_format: format string for floating-point numbers (for example "%.2f")

Freeze_panes: a (row, column) tuple specifying how many rows and columns to keep frozen
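A minimal sketch, assuming df already exists and an Excel writer engine such as openpyxl is installed:

# assuming df is an existing dataframe and openpyxl is installed
df.to_excel("data.xlsx", sheet_name="data", float_format="%.2f",
            freeze_panes=(1, 0), index=False)   # keep the header row frozen
df_loaded = pd.read_excel("data.xlsx", sheet_name="data")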

  4. HDF5: This storage format is best suited for large amounts of heterogeneous data. The data is stored as an internal file-like structure, which also makes it useful for randomly accessing different parts of the data. For some data structures, the size and access speed are much better than CSV.

dataframe.to_hdf(path_or_buf, key, mode)

Path_or_buf:  file path

Key: identifier for the group in the store

Mode: 'w' (write), 'a' (append) or 'r+' (read/append)
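A minimal sketch, assuming df already exists and the PyTables (tables) package is installed:

# assuming df is an existing dataframe and PyTables is installed
df.to_hdf("data.h5", key="df", mode="w", format="fixed")
df_loaded = pd.read_hdf("data.h5", key="df")
# note: the fixed format cannot store some dtypes (such as category); format="table" is more flexible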

Testing the performance of different persisting methods

As discussed above, these formats have advantages over CSV, but to get a better understanding we will test and compare the performance of each. For this, we will generate a random dataset containing the most commonly used data types that pandas offers.

Importing libraries

import pandas as pd
import numpy as np
import os
import time

Random data generation

import random
import string
import datetime

def generate_strings(length):
    # build a random string of 3 to `length` letters
    letters = string.ascii_letters
    output_str = ''.join([random.choice(letters) for i in range(random.randint(3, length))])
    return output_str

def get_sentence(num_words):
    # build a short sentence out of random words of varying length
    words = []
    for i in range(num_words):
        words.append(generate_strings(10)[:random.randint(0, 10)])
    return " ".join(words)

def random_date(start, end, format):
    # pick a random date (as a formatted string) between start and end
    start = datetime.datetime.strptime(start, format)
    end = datetime.datetime.strptime(end, format)
    total = end - start
    int_total = (total.days * 24 * 60 * 60) + total.seconds
    random_second = random.randrange(int_total)
    return (start + datetime.timedelta(seconds=random_second)).strftime(format)

In the above code snippet, we have written functions to create random string and date data; using the random library, we also generate short random sentences.

Now, we will create random data covering the different data types: int8, int16, int32, int64, uint8, float, string, category, date and bool.

def random_data(size):
    data = []
    for i in range(size):
        data.append(
            [random.randint(-127, 127),
             random.randint(-32768, 32767),
             random.randint(-2147483648, 2147483647),
             random.randint(-9223372036854775808, 9223372036854775807),
             random.randint(0, 255),
             round(random.uniform(0, 10000), 2),
             round(random.uniform(0, 1000000), 2),
             generate_strings(10),
             get_sentence(5),
             random.choice([generate_strings(10) for i in range(25)]),
             random_date("1900-01-01", "2020-05-01", "%Y-%m-%d"),
             random_date("1900-01-01T00:00:00", "2020-05-01T23:59:59", "%Y-%m-%dT%H:%M:%S"),
             random.choice([True, False])])
    return data

def test_data(size, drop_timedelta, drop_timezone):
    data = random_data(size)
    labels = ["Int8", "Int16", "Int32", "Int64", "UInt8", "Float32", "Float64", "String", "Sentence", "Category", "Date", "DateTime", "Bool"]
    df = pd.DataFrame(data, columns=labels)
    df["Int8"] = df["Int8"].astype("int8")
    df["Int16"] = df["Int16"].astype("int16")
    df["Int32"] = df["Int32"].astype("int32")
    df["UInt8"] = df["UInt8"].astype("uint8")
    df["Float32"] = df["Float32"].astype("float32")
    df["Category"] = df["Category"].astype("category")
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%m-%dT%H:%M:%S")
    if not drop_timedelta:
        df["TimeDelta"] = df["DateTime"] - df["Date"]
    if not drop_timezone:
        df["DateTime+Zone"] = df["DateTime"].dt.tz_localize('Europe/Vienna')
    # in-memory size of the dataframe, used later as the reference for the compression ratio
    orig_size = df.memory_usage(deep=True).sum()
    return df, orig_size

dataset_size = 50000
df, orig_size = test_data(dataset_size, True, True)
df.info()

Now that we have our data, we can write functions to measure the performance.

from pathlib import Path

def measurement(dataframe, data_load, new_file, tp, benchmark, out, original_size):
    # compare the dtypes of the original dataframe with those of the reloaded one
    compare_frames = pd.DataFrame({"original": dataframe.dtypes, "new": data_load.dtypes})
    compare_frames["same"] = compare_frames["original"] == compare_frames["new"]
    new_size = Path(new_file).stat().st_size
    return pd.DataFrame({
        "preserve": compare_frames["same"].sum() / compare_frames["same"].count(),
        "compression": new_size / original_size,
        "write_time": out["write"] / benchmark["write"],
        "read_time": out["read"] / benchmark["read"],
    }, index=[tp]), compare_frames

In the above code, we have specified the parameters for comparison. The function calculates the fraction of columns that kept their original data type after reading the file back, the compression ratio relative to the original size, and the write and read times relative to the CSV benchmark.

Next, we will write a function that measures the time elapsed when writing the dataframe to a given format and loading it back, averaged over a number of repetitions (we use seven).

def time_elapsed(path, repetitions, dataframe, file):
    write_time = []
    read_time = []
    for r in range(repetitions):
        # time the write
        start_time = time.time()
        path["write_function"](dataframe, file, **path["write_params"])
        write_time.append(time.time() - start_time)
        # time the read
        start_time = time.time()
        dataframe_loaded = path["read_function"](file, **path["read_params"])
        read_time.append(time.time() - start_time)
    return {"write": sum(write_time) / len(write_time),
            "read": sum(read_time) / len(read_time),
            "dataframe_loaded": dataframe_loaded}

Finally, we will write the test performance function that takes our generated data as input and provides the performance as output. 

def test_performance(exporting_types, dataframe, output, repetitions):
    results = []
    benchmark = None
    for key, value in exporting_types.items():
        new_file = output + value["extension"]
        try:
            out = time_elapsed(value, repetitions, dataframe, new_file)
            if value["type"] == "CSV":
                benchmark = out  # the CSV timings serve as the benchmark
            results.append(measurement(dataframe, out["dataframe_loaded"], new_file,
                                        value["type"], benchmark, out, orig_size)[0])
        except Exception:
            print("Failed:", value["type"])
    return pd.concat(results)

Before we call all the functions we need to specify what types of persisting methods we are going to use. 

exporting_types = {
    "csv": {
        "type": "CSV",
        "extension": ".csv",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False},
        "read_function": pd.read_csv,
        "read_params": {}
    },
    "csv_zip": {
        "type": "CSV zip",
        "extension": ".zip",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False, "compression": "zip"},
        "read_function": pd.read_csv,
        "read_params": {"compression": "zip"}
    },
    "pickle": {
        "type": "Pickle bz2",
        "extension": ".pkl.bz2",
        "write_function": pd.DataFrame.to_pickle,
        "write_params": {"compression": "bz2"},
        "read_function": pd.read_pickle,
        "read_params": {"compression": "bz2"}
    },
    "parquet_pyarrow": {
        "type": "Parquet via PyArrow",
        "extension": ".parquet.gzip",
        "write_function": pd.DataFrame.to_parquet,
        "write_params": {},
        "read_function": pd.read_parquet,
        "read_params": {}
    },
    "Hdf5f": {
        "type": "Hdf5 fixed",
        "extension": ".h5",
        "write_function": pd.DataFrame.to_hdf,
        "write_params": {"key":"df", "format":"fixed"},
        "read_function": pd.read_hdf,
        "read_params": {"key":"df"}
    },
   "Excel": {
        "type": "Excel",
        "extension": ".xlsx",
        "write_function": pd.DataFrame.to_excel,
        "write_params": {"index":False},
        "read_function": pd.read_excel,
        "read_params": {}
    },
}

Let us measure the performance:

performance = test_performance(exporting_types, df, "test_data", 7)  # "test_data" is the output file prefix; 7 repetitions
print(performance)


In the resulting comparison, pickle shows the best overall performance of the formats tested. Although we used random data, it is enough to get a sense of the relative performance of each method.

Conclusion

This article covered the persisting methods available in pandas other than CSV and measured the performance of each. Using pickle or Parquet results in faster reads and writes, smaller files and little to no loss of data type information.


Bhoomika Madhukar
I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.
