Now Reading
Complete Guide To Datetime Parsing With Pandas

Complete Guide To Datetime Parsing With Pandas

datetime parsing cover art

Time-series analysis and forecasting is one of the most widely applied machine learning problems. It finds applications in weather forecasting, earthquake prediction, space science, e-commerce, stock market prediction, medical sciences, and signal processing. While dealing with a time-series dataset, the data may contain the date, month, day, and time in any format. This is because people tend to use different date and time formats. Moreover, Python assumes a non-numbered entry as an object and a numbered entry as an integer or float. Hence, it is important to inform Python about the date and time entries. 

In this tutorial, we learn how to parse datetime using the Pandas library. Pandas is famous for its datetime parsing, processing, analysis and plotting functions. 

Import necessary libraries.

 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt 

A simple time-series data

Load some simple time-series data.

 url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/Algerian_forest_fires.csv'
 forestfire = pd.read_csv(url)
 forestfire.sample(10) 

Output:

original time-series dataset

We observe that there are three separate columns for day, month and year. Let’s look at the data type of these attributes.

forestfire.info()

Output:

Day, month and year values are in integers. We have to convert them to the datetime64 data type.

 forestfire['date'] = pd.to_datetime(forestfire[['day', 'month', 'year']])
 forestfire.head() 

Output:

parsed date

We used the to_datetime method available in Pandas to parse the day, month and year columns into a single date column. We can drop the first three columns as they are redundant.  Further, we can check attributes’ data types

 forestfire.drop(columns=['day','month','year'], inplace=True)
 forestfire.info() 

Output:

The parsed date can be broken down into elements, i.e., day, month and year back.

 days = forestfire['date'].dt.day
 months = forestfire['date'].dt.month
 years = forestfire['date'].dt.year
 recon = pd.DataFrame(zip(days,months,years), columns = ['DAY','MONTH','YEAR'])
 recon.head() 

Output:

Another Dataset With Both Date And Time

Let’s load another time-series dataset that contains both date and time, but in two separate columns.

 airquality_url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/AirQualityUCI.csv'
 # read first 5 columns for better visual clarity
 airquality = pd.read_csv(airquality_url, sep=';').iloc[:,:5]
 airquality.sample(5) 

Output:

original time-series dataset

This time-series dataset contains Date in one column and Time in another column. Check the data types of the attributes.

airquality.info()

Output:

As expected, both Date and Time columns are in object data type. In contrast to our previous example, the Date attribute is the DD/MM/YYYY format and the Time attribute is in the HH.MM.SS format. Whenever we know the format of either date or time, we should pass it as an argument to the to_datetime method. Refer to the official documentation here for more information about different formats.

 airquality['DATE'] = pd.to_datetime(airquality['Date'], format='%d/%m/%Y')
 airquality['TIME'] = pd.to_datetime(airquality['Time'], format='%H.%M.%S')
 airquality.drop(columns=['Date', 'Time'], inplace=True)
 airquality.info() 

Output:

We removed the original Date and Time columns as they were redundant to the new ones. The new attributes DATE and TIME are of datetime64 data type. As we have split the date in the previous example, we can split the time into an hour, minute and second elements using the dt method.

 airquality['DAY'] = airquality['DATE'].dt.day
 airquality['MONTH'] = airquality['DATE'].dt.month
 airquality['YEAR'] = airquality['DATE'].dt.year
 airquality['HOUR'] = airquality['TIME'].dt.hour
 airquality['MINUTE'] = airquality['TIME'].dt.minute
 airquality['SECOND'] = airquality['TIME'].dt.second
 airquality.drop(columns=['DATE', 'TIME'], inplace=True)
 airquality.head() 

Output:

datetime split features

We can recall this example from the origin. The original dataset had 2 datetime columns: date (as object), time (as object). We converted them into 2 columns of datetime64 data type. In the last step, we split each element to form 6 new columns. However, we can merge all these split elements into a single feature of datetime64 data type to have every detail of date and time.

 airquality['parsed'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']])
 airquality.head() 

Output:

parsed datetime

In the above step, the default format YYYY-mm-dd HH:MM:SS is presented. But, we can have parsed datetime in the format we wish using the strftime method. Refer to the official documentation here for more formats. 

 airquality['formatted_date'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']]).dt.strftime('%d %b %Y, %I.%M.%S %p')
 # display last 8 columns only for better visual clarity
 airquality.head().iloc[:,-8:] 

Output:

formatted datetime

A Dataset With Inconsistent Datetime Entries

We discuss some more interesting things about datetime parsing with a complex time-series dataset. 

 url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/landslides_data.csv'
 # load limited features only - for better visual clarity
 landslides = pd.read_csv(url).loc[:,['date', 'country_code', 'state/province', 'hazard_type']]
 landslides.head() 

Output:

It is observed that the feature date has different formats. Hence, we can not parse it with a predefined format. Let’s have a thorough check for any other formats.

 length = landslides['date'].str.len()
 length.value_counts() 

Output:

Date is presented in five different lengths. Lengths 7 and 8 may refer to a common format. Length 10 may refer to another format. Lengths 16 and 17 may refer to some other format.

Let’s do some analysis to find the hidden truth using NumPy!

 ind_7 = np.where([length==7])[1][0]
 ind_8 = np.where([length==8])[1][0]
 ind_10 = np.where([length==10])[1][0]
 ind_16 = np.where([length==16])[1][0]
 ind_17 = np.where([length==17])[1][0]
 # load one example row for each date length
 landslides.loc[[ind_7,ind_8,ind_10,ind_16,ind_17]] 

Output:

original time-series dataset

As we guessed, there are three different date formats in the dataset. The date presented along with time is the least available format with just 4 rows. Hence, we drop these 4 rows for the sake of simplicity.

 drop_ind = np.where([length>=16])[1]
 # drop rows where date length is greater than 15
 landslides.drop(index=drop_ind, inplace=True)
 # check for date lengths
 length = landslides['date'].str.len()
 length.value_counts() 

Output:

We need not worry about different formats in date. Pandas’ to_datetime method takes an optional boolean argument infer_datetime_format. If we pass True as the argument, Pandas will analyze the format and convert it suitably.

See Also
What Is A Time Series GAN?

 landslides['parsed_date'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)
 landslides.head() 

Output:

datetime parsing

Let’s remove the original column to avoid redundancy. We can explore some more features that Pandas provide along with datetime parsing.

 landslides.drop(columns=['date'], inplace=True)
 landslides.head() 

Output:

parsed date

We can calculate the number of landslides per day by analyzing the parsed_date and plot it using Pandas plotting. Pandas plotting is a simple interface built on top of Matplotlib.

 plt.figure(figsize=(8,5))
 landslides['parsed_date'].value_counts().sort_values().plot.line()
 plt.show() 

Output:

time-series plotting

Pandas provides a powerful analysis method, named resample for datetime64 features. This method permits different analysis year-wise, month-wise, day-wise, and so on. This helps us find the pattern among the time-series data.

The total number of yearly landslides can be calculated as follows.

 plt.figure(figsize=(8,5))
 landslides['parsed_date'].value_counts().resample('Y').sum().plot.line(color=’g’)
 plt.show() 

Output:

datetime year-wise analysis

Year-wise mean slides can be calculated as follows.

 plt.figure(figsize=(8,5))
 landslides['parsed_date'].value_counts().resample('Y').mean().plot.bar(color='r')
 plt.show() 

Output:

datetime monthly analysis

According to the plot, the year 2010 had more landslides than any other year (as per the dataset). 

The total number of landslides calculated in a month-wise manner is as follows.

 plt.figure(figsize=(8,5))
 landslides['parsed_date'].value_counts().resample('M').sum().plot.area(color='b')
 plt.show() 

Output:

datetime analysis

Though we have found in our previous plots that the year 2010 had the maximum number of landslides per year, the above plot reveals that most number of landslides per month occurred in the year 2013.

Wrapping Up Datetime Parsing

In this tutorial, we have discussed how to parse datetime with Pandas library. Further, we have explored some practical examples and have performed various datetime transformations. Finally, we have discussed the analysis and visualization tools available exclusively for the datetime features. 

The datasets used in this tutorial are collected from open-sources and can be found here.

The notebook with above code implementation can be found here.

References:


Join Our Telegram Group. Be part of an engaging online community. Join Here.

Subscribe to our Newsletter

Get the latest updates and relevant offers by sharing your email.
What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top