

As a machine learning engineer, it is common practice to save data and models in CSV format. Although CSV stores data in a rectangular, tabular layout, it is not always the right choice for persisting pandas DataFrames: CSV files tend to be slow to read and write, take up more memory and disk space, and, most importantly, they do not store any information about data types.
In this article, we will look at how CSV handles the different pandas data types, explore other ways to persist the data, and compare the performance of each of these persisting methods.
How does CSV handle different data types?
Pandas supports a wide range of data types and sub-types to make it easy to work with huge datasets. Some of the most popular are object, string, timedelta, int, float, bool and category. Pandas uses NumPy arrays as the backend to store the data, so a type like int is actually stored as int8, int16 and so on for better efficiency. CSV, however, stores no information about these data types: the user has to specify them with the read_csv method, and types like timedelta end up being stored as strings.
Example:

data_csv = pd.read_csv("example.csv", dtype={"colA": "int8", "colB": "int16", "colC": "category"})
Specifying the dtypes like this becomes repetitive when there are many data columns, and it is not a very efficient way to work.
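To see what actually gets lost, here is a minimal sketch (the column names and file name are only illustrative) of a round trip through CSV without dtype hints:

import pandas as pd

# hypothetical dataframe with an int8, a category and a timedelta column
df = pd.DataFrame({"colA": pd.Series([1, 2, 3], dtype="int8"),
                   "colC": pd.Series(["a", "b", "a"], dtype="category"),
                   "colD": pd.to_timedelta(["1 days", "2 days", "3 days"])})
df.to_csv("example.csv", index=False)

# without dtype hints, int8 comes back as int64, category as object,
# and the timedelta column as a plain string (object) column
reloaded = pd.read_csv("example.csv")
print(df.dtypes)
print(reloaded.dtypes)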
Persisting methods other than CSV
- Pickle: Pickle is Python's native format for object serialization. Its advantage is that it can serialize almost any Python object, including a dataframe together with its data types. It is much faster to read and write than CSV and, with the compression options pandas offers, the file size can shrink to roughly half of the equivalent CSV. There is also no need to specify parameters for each data column, as with CSV.
The way to implement this is simple.
dataframe.to_pickle(path)
path: file path where the pickled dataframe will be stored
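As a minimal sketch of the round trip (the file name and compression choice are only illustrative):

import pandas as pd

df = pd.DataFrame({"colA": pd.Series([1, 2, 3], dtype="int8")})

# write and read back a bz2-compressed pickle; the dtype survives the round trip
df.to_pickle("example.pkl.bz2", compression="bz2")
reloaded = pd.read_pickle("example.pkl.bz2", compression="bz2")
print(reloaded.dtypes)  # colA is still int8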
- Parquet: This is a compressed, columnar storage format widely used in the Hadoop ecosystem. It can serialize complex nested structures, supports column-wise compression and column-wise encoding, and offers fast reads. Compared to CSV, it preserves information about all data types, is faster to read and write, and its wide support in the Hadoop ecosystem allows fast filtering.
To implement in python:
dataframe.to_parquet(path, engine, compression, index, partition_cols)
path: file path where the data is stored
engine: "pyarrow" or "fastparquet"
compression: which compression method to use
index: whether to save the dataframe index
partition_cols: columns by which to partition the dataset, in the given order
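A minimal sketch of a parquet round trip (assuming pyarrow is installed; the file name is only illustrative):

import pandas as pd

df = pd.DataFrame({"colA": pd.Series([1, 2, 3], dtype="int8"),
                   "colC": pd.Series(["a", "b", "a"], dtype="category")})

# column-wise compressed storage; data types, including category, are kept
df.to_parquet("example.parquet", engine="pyarrow", compression="gzip", index=False)
reloaded = pd.read_parquet("example.parquet", engine="pyarrow")
print(reloaded.dtypes)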
- Excel: Exporting the data to Excel has its own advantages because of how easily the data can then be manipulated in Excel. It also allows custom number formatting and freezing of rows and columns.
dataframe.to_excel(excel_writer, sheet_name,float_format,freeze_panes)
excel_writer: file path or pandas ExcelWriter object
sheet_name: name of the sheet the data will be written to
float_format: format string for floating-point numbers
freeze_panes: bottommost row and rightmost column to freeze
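A minimal sketch (an Excel engine such as openpyxl must be installed; the names are only illustrative):

import pandas as pd

df = pd.DataFrame({"price": [1234.5678, 9.1]})

# freeze the header row and write floats with two decimal places
df.to_excel("example.xlsx", sheet_name="data",
            float_format="%.2f", freeze_panes=(1, 0), index=False)
reloaded = pd.read_excel("example.xlsx", sheet_name="data")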
- HDF5: This storage format is best suited for large amounts of heterogeneous data. The data is organized in an internal, file-system-like hierarchy, which also makes it convenient to randomly access different parts of the data. For some data structures, both the file size and the access speed are much better than CSV.
dataframe.to_hdf(path_or_buf, key, mode)
path_or_buf: file path or HDFStore object
key: identifier for the group in the store
mode: "w" (write), "a" (append) or "r+" (read/append)
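A minimal sketch (the PyTables package is required; the file name and key are only illustrative):

import pandas as pd

df = pd.DataFrame({"colA": pd.Series([1, 2, 3], dtype="int8")})

# store the dataframe under the key "df"; mode "w" creates a new file
df.to_hdf("example.h5", key="df", mode="w")
reloaded = pd.read_hdf("example.h5", key="df")
print(reloaded.dtypes)  # data types are stored together with the data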
Testing the performance of different persisting methods
As discussed above, these formats do have advantages over CSV, but to get a better understanding we will test and compare their performance. For this, we will generate a random dataset that contains the most commonly used pandas data types.
Importing libraries
import pandas as pd
import numpy as np
import os
import time
from pathlib import Path  # used later to measure file sizes
Random data generation
import random
import string
import datetime

def generate_strings(length):
    # build a random string of 3 to `length` ASCII letters
    letters = string.ascii_letters
    output_str = ''.join([random.choice(letters) for i in range(random.randint(3, length))])
    return output_str

def get_sentence(words):
    # join several short random strings into a sentence-like value
    parts = []
    for i in range(words):
        parts.append(generate_strings(10)[:random.randint(0, 10)])
    return " ".join(parts)

def random_date(start, end, format):
    # pick a random timestamp between start and end, returned in the given format
    start = datetime.datetime.strptime(start, format)
    end = datetime.datetime.strptime(end, format)
    total = end - start
    int_total = (total.days * 24 * 60 * 60) + total.seconds
    random_second = random.randrange(int_total)
    return (start + datetime.timedelta(seconds=random_second)).strftime(format)
In the above code snippet, we have written functions that create random string and date data. Using the random library, we also generate short random sentences.
Now, we will create random data for the different data types. The ranges specified cover int8, int16, int32, int64 and uint8 integers, as well as float, string, category, date/datetime and bool columns.
def random_data(size):
    data = []
    for i in range(size):
        data.append([
            random.randint(-127, 127),                    # int8 range
            random.randint(-32768, 32767),                # int16 range
            random.randint(-2147483648, 2147483647),      # int32 range
            random.randint(-9223372036854775808, 9223372036854775807),  # int64 range
            random.randint(0, 255),                       # uint8 range
            round(random.uniform(0, 10000), 2),           # float32
            round(random.uniform(0, 1000000), 2),         # float64
            generate_strings(10),                         # string
            get_sentence(5),                              # sentence
            random.choice([generate_strings(10) for i in range(25)]),  # category
            random_date("1900-01-01", "2020-05-01", "%Y-%m-%d"),
            random_date("1900-01-01T00:00:00", "2020-05-01T23:59:59", "%Y-%m-%dT%H:%M:%S"),
            random.choice([True, False])])
    return data

def test_data(size, drop_timedelta, drop_timezone):
    data = random_data(size)
    labels = ["Int8", "Int16", "Int32", "Int64", "UInt8", "Float32", "Float64",
              "String", "Sentence", "Category", "Date", "DateTime", "Bool"]
    df = pd.DataFrame(data, columns=labels)
    df["Int8"] = df["Int8"].astype("int8")
    df["Int16"] = df["Int16"].astype("int16")
    df["Int32"] = df["Int32"].astype("int32")
    df["UInt8"] = df["UInt8"].astype("uint8")
    df["Float32"] = df["Float32"].astype("float32")
    df["Category"] = df["Category"].astype("category")
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    df["DateTime"] = pd.to_datetime(df["DateTime"], format="%Y-%m-%dT%H:%M:%S")
    if not drop_timedelta:
        df["TimeDelta"] = df["DateTime"] - df["Date"]
    if not drop_timezone:
        df["DateTime+Zone"] = df["DateTime"].dt.tz_localize('Europe/Vienna')
    # in-memory size of the dataframe, used as the baseline for the size ratio
    orig_size = df.memory_usage(deep=True).sum()
    return df, orig_size

dataset_size = 50000
df, orig_size = test_data(dataset_size, True, True)
df.info()
Now that we have our data, we can write the functions that measure the performance.
def measurement(dataframe, data_load, new_file, tp, benchmark, out, original_size):
    compare_frames = pd.DataFrame({"original": dataframe.dtypes,
                                   "new": data_load.dtypes})
    compare_frames["same"] = compare_frames["original"] == compare_frames["new"]
    new_size = Path(new_file).stat().st_size
    return pd.DataFrame({
        "preserve": compare_frames["same"].sum() / compare_frames["same"].count(),
        "compression": new_size / original_size,
        "write_time": out["write"] / benchmark["write"],
        "read_time": out["read"] / benchmark["read"],
    }, index=[tp]), compare_frames
In the above code, we have specified the parameters for the comparison. This function calculates how many columns kept their original data type after reading the file back, the size of the file relative to the in-memory size of the dataframe, and the write and read times relative to the CSV benchmark.
Next, we will write a function that measures the time elapsed for writing the dataframe to each format and loading it back. It averages the processing time over the given number of repetitions (we use 7).
def time_elapsed(exporting_type, repetitions, dataframe, file):
    write_time = []
    read_time = []
    for r in range(repetitions):
        # time the write
        start_time = time.time()
        exporting_type["write_function"](dataframe, file, **exporting_type["write_params"])
        write_time.append(time.time() - start_time)
        # time the read
        start_time = time.time()
        dataframe_loaded = exporting_type["read_function"](file, **exporting_type["read_params"])
        read_time.append(time.time() - start_time)
    return {"write": sum(write_time) / len(write_time),
            "read": sum(read_time) / len(read_time),
            "dataframe_loaded": dataframe_loaded}
Finally, we will write the test performance function that takes the export definitions, our generated dataframe, an output file prefix and the number of repetitions as input and returns the performance figures as output.
def test_performance(exporting_types, dataframe, output, repetitions):
    results = []
    benchmark = None
    for key, value in exporting_types.items():
        new_file = output + value["extension"]
        try:
            out = time_elapsed(value, repetitions, dataframe, new_file)
            if value["type"] == "CSV":
                # the plain CSV run serves as the benchmark for the other formats
                benchmark = out
            results.append(measurement(dataframe, out["dataframe_loaded"], new_file,
                                       value["type"], benchmark, out, orig_size)[0])
        except Exception as e:
            print(key, "failed:", e)
    return pd.concat(results)
Before we call all these functions, we need to specify which persisting methods we are going to test.
exporting_types = {
    "csv": {
        "type": "CSV",
        "extension": ".csv",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False},
        "read_function": pd.read_csv,
        "read_params": {}
    },
    "csv_zip": {
        "type": "CSV zip",
        "extension": ".zip",
        "write_function": pd.DataFrame.to_csv,
        "write_params": {"index": False, "compression": "zip"},
        "read_function": pd.read_csv,
        "read_params": {"compression": "zip"}
    },
    "pickle": {
        "type": "Pickle bz2",
        "extension": ".pkl.bz2",
        "write_function": pd.DataFrame.to_pickle,
        "write_params": {"compression": "bz2"},
        "read_function": pd.read_pickle,
        "read_params": {"compression": "bz2"}
    },
    "parquet_pyarrow": {
        "type": "Parquet via PyArrow",
        "extension": ".parquet.gzip",
        "write_function": pd.DataFrame.to_parquet,
        "write_params": {},
        "read_function": pd.read_parquet,
        "read_params": {}
    },
    "Hdf5f": {
        "type": "Hdf5 fixed",
        "extension": ".h5",
        "write_function": pd.DataFrame.to_hdf,
        "write_params": {"key": "df", "format": "fixed"},
        "read_function": pd.read_hdf,
        "read_params": {"key": "df"}
    },
    "Excel": {
        "type": "Excel",
        "extension": ".xlsx",
        "write_function": pd.DataFrame.to_excel,
        "write_params": {"index": False},
        "read_function": pd.read_excel,
        "read_params": {}
    }
}
Let us measure the performance:
# "exported_df" is just a file-name prefix for the exported files; we use 7 repetitions
performance = test_performance(exporting_types, df, "exported_df", 7)
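To inspect the outcome, the resulting table can, for example, be sorted by read time relative to the CSV baseline (the column names follow the measurement function above):

# each row is one format; the values are ratios against the CSV benchmark
print(performance.sort_values("read_time"))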
In these results, pickle shows very good performance compared to the other formats. Though we have used random data, it is enough to get a sense of the relative performance of each method.
Conclusion
This article covered different methods of persisting pandas dataframes other than CSV and measured the performance of each of them. It is clear that using pickle or parquet can give faster reads and writes and smaller files without losing the data type information.

I am an aspiring data scientist with a passion for teaching. I am a computer science graduate from Dayananda Sagar Institute. I have experience in building models in deep learning and reinforcement learning. My goal is to use AI in the field of education to make learning meaningful for everyone.