
Complete Guide To Different Persisting Methods In Pandas

Bhoomika Madhukar

For a machine learning engineer, it is common practice to save data in CSV format. Though CSV stores data in a rectangular tabular format, it is not always suitable for persisting every Pandas DataFrame: CSV files tend to be slow to read and write, take up more memory and disk space and, most importantly, CSVs don't store information about data types.

In this article, we will understand how CSV handles different data types, explore other ways to store the data, and compare the performance of each of these persisting methods.

How does CSV handle different data types?

Pandas supports a wide range of data types and sub-types to make it easy to work with huge datasets. Some of the most commonly used are object, string, timedelta, int, float, bool and category. Pandas uses NumPy arrays as the backend to store the data, so each data type like int is actually stored as int8, int16, etc. for better efficiency. CSV, however, doesn't store any information about the data types: the user needs to specify them with the read_csv method, and data types like timedelta end up stored as plain strings.



Example: 

data_csv = pd.read_csv(example, dtype={"colA": "int8", "colB": "int16", "colC": "category"})

Specifying the types like this becomes verbose when there are many columns, and repeating it on every load is not very efficient.

Persisting methods other than CSV 

  1. Pickle: Pickle is the native Python format for object serialization. Its advantage is that it can serialize almost any Python object without extra configuration. It is much faster than CSV, reduces the file size to almost half of the equivalent CSV using its compression options, and there is no need to specify the data type of each column as with CSV.

The way to implement this is simple. 

dataframe.to_pickle(path)

Path: where the data will be stored
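
As a quick illustration, here is a minimal round-trip sketch; the file name and the bz2 compression choice are arbitrary assumptions.

import pandas as pd

# tiny example frame; the column names are arbitrary
example_df = pd.DataFrame({"colA": [1, 2, 3], "colB": ["x", "y", "z"]})
example_df["colB"] = example_df["colB"].astype("category")

# write with bz2 compression and read it back; dtypes such as category are preserved
example_df.to_pickle("example.pkl.bz2", compression="bz2")
loaded = pd.read_pickle("example.pkl.bz2", compression="bz2")
print(loaded.dtypes)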

  2. Parquet: This is a compressed, columnar storage format used in the Hadoop ecosystem. It allows serializing complex nested structures, supports column-wise compression and column-wise encoding, and offers fast reads. Its advantages over CSV are that it saves information about all data types, is faster to read and write, and is widely supported in the Hadoop ecosystem, allowing fast filtering.

To implement in python:

dataframe.to_parquet(path, engine, compression, index, partition_cols)

Path: where the data is stored

Engine: pyarrow or fastparquet engine

Compression: allowing to choose various compression methods

Index: saves dataframe index

Partition_cols: specify the order of the column partitioning
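
As a rough sketch, a round trip through Parquet with the pyarrow engine could look like this; the file name and the gzip compression are assumptions, and the pyarrow package needs to be installed.

import pandas as pd

example_df = pd.DataFrame({"colA": [1, 2, 3], "colB": ["x", "y", "z"]})

# write with column-wise gzip compression via pyarrow and read the whole file back
example_df.to_parquet("example.parquet.gzip", engine="pyarrow", compression="gzip", index=False)
loaded = pd.read_parquet("example.parquet.gzip", engine="pyarrow")
print(loaded.dtypes)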

  3. Excel: Exporting the data as Excel has its own advantages because of the easy manipulation Excel offers. It also allows custom formatting and freezing of rows and columns.

dataframe.to_excel(excel_writer, sheet_name, float_format, freeze_panes)

Excel_writer: pandas excel writer object or file path

sheet_name: name of the sheet where the data will be output

Float_format:  excel’s number formatting

Freeze_panes: option to freeze rows and columns
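
A minimal sketch, assuming an Excel writer engine such as openpyxl is installed; the file name, sheet name and formatting choices are arbitrary.

import pandas as pd

example_df = pd.DataFrame({"colA": [1.234, 5.678], "colB": ["x", "y"]})

# two-decimal number formatting, header row frozen so it stays visible while scrolling
example_df.to_excel("example.xlsx", sheet_name="export", float_format="%.2f", freeze_panes=(1, 0), index=False)
loaded = pd.read_excel("example.xlsx", sheet_name="export")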

  4. HDF5: This format of storage is best suited for storing large amounts of heterogeneous data. The data is stored as an internal file-like structure. It is also useful for randomly accessing different parts of the data. For some data structures, the size and access speed are much better than CSV.

dataframe.to_hdf(path_or_buf, key, mode)

Path_or_buf:  file path

Key: Identifier in the stored data

Mode: write, append or read-append
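
A minimal sketch, assuming the PyTables package is installed; the file name and the key are arbitrary.

import pandas as pd

example_df = pd.DataFrame({"colA": [1, 2, 3], "colB": ["x", "y", "z"]})

# mode "w" overwrites the file; the key identifies this dataframe inside the HDF5 store
example_df.to_hdf("example.h5", key="df", mode="w")
loaded = pd.read_hdf("example.h5", key="df")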

Testing the performance of different persisting methods

As discussed above, these formats do have some advantages over CSV, but to get a better understanding we will test and compare the performance of each of them. For this, we will generate a random dataset that contains the most commonly used data types that Pandas offers.


Importing libraries

import pandas as pd
import numpy as np
import os
import time
from pathlib import Path

Random data generation

import random
import string
import datetime
def generate_strings(length):
    # random string of 3 to `length` ASCII letters
    letters = string.ascii_letters
    output_str = ''.join([random.choice(letters) for i in range(random.randint(3, length))])
    return output_str

def get_sentence(n_words):
    # sentence made of `n_words` random words
    words = []
    for i in range(n_words):
        words.append(generate_strings(10)[:random.randint(0, 10)])
    return " ".join(words)

def random_date(start, end, format):
    # random date between `start` and `end`, returned in the given `format`
    start = datetime.datetime.strptime(start, format)
    end = datetime.datetime.strptime(end, format)
    total = end - start
    int_total = (total.days * 24 * 60 * 60) + total.seconds
    random_second = random.randrange(int_total)
    return (start + datetime.timedelta(seconds=random_second)).strftime(format)

In the above code snippet, we have written functions to create random string and date data. Using the random library, we also generate random sentences.

Now, we will create random data in the different data types. The ranges specified below correspond to int8, int16, int32, int64 and uint8, followed by float, string, date and bool values.

def random_data(size):
    data = []
    for i in range(size):
        data.append(
            [random.randint(-127,127), 
             random.randint(-32768,32767),  
             random.randint(-2147483648,2147483647), 
             random.randint(-9223372036854775808 ,9223372036854775807),  
             random.randint(0,255),  
             round(random.uniform(0,10000),2),
             round(random.uniform(0,1000000),2),
             generate_strings(10),
             get_sentence(5),
             random.choice([generate_strings(10) for i in range(25)]),
             random_date("1900-01-01","2020-05-01","%Y-%m-%d"),
             random_date("1900-01-01T00:00:00","2020-05-01T23:59:59","%Y-%m-%dT%H:%M:%S"),
             random.choice([True,False])])
    return data
def test_data(size, drop_timedelta, drop_timezone):
    data = random_data(size)
    labels = ["Int8", "Int16", "Int32", "Int64", "UInt8", "Float32", "Float64", "String", "Sentence", "Category", "Date", "DateTime", "Bool"]
    df = pd.DataFrame(data, columns=labels)
    df["Int8"] = df["Int8"].astype("int8")
    df["Int16"] = df["Int16"].astype("int16")
    df["Int32"] = df["Int32"].astype("int32")
    df["UInt8"] = df["UInt8"].astype("uint8")
    df["Float32"] = df["Float32"].astype("float32")
    df["Category"] = df["Category"].astype("category")
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%m-%dT%H:%M:%S")
    if not drop_timedelta:
        df["TimeDelta"] = df["DateTime"]-df["Date"]
    if not drop_timezone:
        df["DateTime+Zone"] = df["DateTime"].dt.tz_localize('Europe/Vienna')
    # write a plain CSV once so its file size can serve as the benchmark size
    df.to_csv("baseline.csv", index=False)
    orig_size = Path("baseline.csv").stat().st_size
    return df, orig_size
dataset_size = 50000
df, orig_size = test_data(dataset_size, True, True)
df.info()

Now that we have our data we can create functions for calculating the performance. 

def measurement(dataframe, data_load, new_file, tp, benchmark, out, original_size):
    compare_frames = pd.DataFrame({"original": dataframe.dtypes, "new": data_load.dtypes})
    compare_frames["same"] = compare_frames["original"] == compare_frames["new"]
    new_size = Path(new_file).stat().st_size
    return pd.DataFrame({
        "preserve": compare_frames["same"].sum()/compare_frames["same"].count(),
        "compression": new_size/original_size,
        "write_time": out["write"]/benchmark["write"],
        "read_time": out["read"]/benchmark["read"],
    }, index=[tp]), compare_frames

In the above code, we have specified the parameters for comparison. This function calculates the fraction of columns that keep their original data type after reading back, the compression ratio relative to the original file size, and the write and read times compared to the CSV benchmark.

Next, we will write a function that measures the time elapsed for writing a format and reading the dataframe back, averaged over a given number of repetitions (we use 7).

def time_elapsed(exporter, repetitions, dataframe, file):
    write_time = []
    read_time = []
    for r in range(repetitions):
        start_time = time.time()
        exporter["write_function"](dataframe, file, **exporter["write_params"])
        elapsed_writing = time.time() - start_time
        write_time.append(elapsed_writing)
        start_time = time.time()
        dataframe_loaded = exporter["read_function"](file, **exporter["read_params"])
        elapsed_reading = time.time() - start_time
        read_time.append(elapsed_reading)
    return {"write": sum(write_time)/len(write_time),
            "read": sum(read_time)/len(read_time),
            "dataframe_loaded": dataframe_loaded}

Finally, we will write the test performance function that takes our generated data as input and provides the performance as output. 

def test_performance(exporting_types, dataframe, output, repetitions):
    results = []
    benchmark = None
    for key, value in exporting_types.items():
        new_file = output + value["extension"]
        try:
            out = time_elapsed(value, repetitions, dataframe, new_file)
            if value["type"] == "CSV":
                # the plain CSV result serves as the benchmark for all other formats
                benchmark = out
            results.append(measurement(dataframe, out["dataframe_loaded"], new_file, value["type"], benchmark, out, orig_size)[0])
        except Exception:
            print("Failed")
    return pd.concat(results)

Before we call all the functions we need to specify what types of persisting methods we are going to use. 

exporting_types = {
    "csv": {
        "type": "CSV",
        "extension": ".csv",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False},
        "read_function": pd.read_csv,
        "read_params": {}
    },
    "csv_zip": {
        "type": "CSV zip",
        "extension": ".zip",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False, "compression": "zip"},
        "read_function": pd.read_csv,
        "read_params": {"compression": "zip"}
    },
    "pickle": {
        "type": "Pickle bz2",
        "extension": ".pkl.bz2",
        "write_function": pd.DataFrame.to_pickle,
        "write_params": {"compression": "bz2"},
        "read_function": pd.read_pickle,
        "read_params": {"compression": "bz2"}
    },
    "parquet_pyarrow": {
        "type": "Parquet via PyArrow",
        "extension": ".parquet.gzip",
        "write_function": pd.DataFrame.to_parquet,
        "write_params": {},
        "read_function": pd.read_parquet,
        "read_params": {}
    },
    "Hdf5f": {
        "type": "Hdf5 fixed",
        "extension": ".h5",
        "write_function": pd.DataFrame.to_hdf,
        "write_params": {"key":"df", "format":"fixed"},
        "read_function": pd.read_hdf,
        "read_params": {"key":"df"}
    },
   "Excel": {
        "type": "Excel",
        "extension": ".xlsx",
        "write_function": pd.DataFrame.to_excel,
        "write_params": {"index":False},
        "read_function": pd.read_excel,
        "read_params": {}
    },
}

Let us measure the performance:

performance = test_performance(exporting_types, df, "test_data", 7)  # "test_data" is an arbitrary output file prefix


In the results, pickle shows the best overall performance in comparison to the other formats. Though we have used random data, it is enough to understand the relative performance of these persisting methods.

Conclusion

This article covered the different types of persisting methods in Pandas other than CSV and measured the performance of each of them. It is clear that using pickle and parquet can result in faster reads and writes, smaller files, and better preservation of data types.

