Time-series analysis and forecasting is one of the most widely applied machine learning problems. It finds applications in weather forecasting, earthquake prediction, space science, e-commerce, stock market prediction, medical sciences, and signal processing. While dealing with a time-series dataset, the data may contain the date, month, day, and time in any format. This is because people tend to use different date and time formats. Moreover, Python assumes a non-numbered entry as an object and a numbered entry as an integer or float. Hence, it is important to inform Python about the date and time entries.
In this tutorial, we learn how to parse datetime using the Pandas library. Pandas is famous for its datetime parsing, processing, analysis and plotting functions.
Import necessary libraries.
import numpy as np import pandas as pd import matplotlib.pyplot as plt
A simple time-series data
Load some simple time-series data.
url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/Algerian_forest_fires.csv' forestfire = pd.read_csv(url) forestfire.sample(10)
We observe that there are three separate columns for day, month and year. Let’s look at the data type of these attributes.
Day, month and year values are in integers. We have to convert them to the
datetime64 data type.
forestfire['date'] = pd.to_datetime(forestfire[['day', 'month', 'year']]) forestfire.head()
We used the
to_datetime method available in Pandas to parse the day, month and year columns into a single date column. We can drop the first three columns as they are redundant. Further, we can check attributes’ data types.
forestfire.drop(columns=['day','month','year'], inplace=True) forestfire.info()
The parsed date can be broken down into elements, i.e., day, month and year back.
days = forestfire['date'].dt.day months = forestfire['date'].dt.month years = forestfire['date'].dt.year recon = pd.DataFrame(zip(days,months,years), columns = ['DAY','MONTH','YEAR']) recon.head()
Another Dataset With Both Date And Time
Let’s load another time-series dataset that contains both date and time, but in two separate columns.
airquality_url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/AirQualityUCI.csv' # read first 5 columns for better visual clarity airquality = pd.read_csv(airquality_url, sep=';').iloc[:,:5] airquality.sample(5)
This time-series dataset contains Date in one column and Time in another column. Check the data types of the attributes.
As expected, both
Time columns are in object data type. In contrast to our previous example, the
Date attribute is the
DD/MM/YYYY format and the
Time attribute is in the
HH.MM.SS format. Whenever we know the format of either date or time, we should pass it as an argument to the
to_datetime method. Refer to the official documentation here for more information about different formats.
airquality['DATE'] = pd.to_datetime(airquality['Date'], format='%d/%m/%Y') airquality['TIME'] = pd.to_datetime(airquality['Time'], format='%H.%M.%S') airquality.drop(columns=['Date', 'Time'], inplace=True) airquality.info()
We removed the original
Time columns as they were redundant to the new ones. The new attributes
TIME are of
datetime64 data type. As we have split the date in the previous example, we can split the time into an hour, minute and second elements using the
airquality['DAY'] = airquality['DATE'].dt.day airquality['MONTH'] = airquality['DATE'].dt.month airquality['YEAR'] = airquality['DATE'].dt.year airquality['HOUR'] = airquality['TIME'].dt.hour airquality['MINUTE'] = airquality['TIME'].dt.minute airquality['SECOND'] = airquality['TIME'].dt.second airquality.drop(columns=['DATE', 'TIME'], inplace=True) airquality.head()
We can recall this example from the origin. The original dataset had 2 datetime columns:
date (as object),
time (as object). We converted them into 2 columns of
datetime64 data type. In the last step, we split each element to form 6 new columns. However, we can merge all these split elements into a single feature of
datetime64 data type to have every detail of date and time.
airquality['parsed'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']]) airquality.head()
In the above step, the default format
YYYY-mm-dd HH:MM:SS is presented. But, we can have parsed datetime in the format we wish using the
strftime method. Refer to the official documentation here for more formats.
airquality['formatted_date'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']]).dt.strftime('%d %b %Y, %I.%M.%S %p') # display last 8 columns only for better visual clarity airquality.head().iloc[:,-8:]
A Dataset With Inconsistent Datetime Entries
We discuss some more interesting things about datetime parsing with a complex time-series dataset.
url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/landslides_data.csv' # load limited features only - for better visual clarity landslides = pd.read_csv(url).loc[:,['date', 'country_code', 'state/province', 'hazard_type']] landslides.head()
It is observed that the feature
date has different formats. Hence, we can not parse it with a predefined format. Let’s have a thorough check for any other formats.
length = landslides['date'].str.len() length.value_counts()
Date is presented in five different lengths. Lengths 7 and 8 may refer to a common format. Length 10 may refer to another format. Lengths 16 and 17 may refer to some other format.
Let’s do some analysis to find the hidden truth using NumPy!
ind_7 = np.where([length==7]) ind_8 = np.where([length==8]) ind_10 = np.where([length==10]) ind_16 = np.where([length==16]) ind_17 = np.where([length==17]) # load one example row for each date length landslides.loc[[ind_7,ind_8,ind_10,ind_16,ind_17]]
As we guessed, there are three different date formats in the dataset. The date presented along with time is the least available format with just 4 rows. Hence, we drop these 4 rows for the sake of simplicity.
drop_ind = np.where([length>=16]) # drop rows where date length is greater than 15 landslides.drop(index=drop_ind, inplace=True) # check for date lengths length = landslides['date'].str.len() length.value_counts()
We need not worry about different formats in date. Pandas’
to_datetime method takes an optional boolean argument
infer_datetime_format. If we pass
True as the argument, Pandas will analyze the format and convert it suitably.
landslides['parsed_date'] = pd.to_datetime(landslides['date'], infer_datetime_format=True) landslides.head()
Let’s remove the original column to avoid redundancy. We can explore some more features that Pandas provide along with datetime parsing.
landslides.drop(columns=['date'], inplace=True) landslides.head()
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().sort_values().plot.line() plt.show()
Pandas provides a powerful analysis method, named
datetime64 features. This method permits different analysis year-wise, month-wise, day-wise, and so on. This helps us find the pattern among the time-series data.
The total number of yearly landslides can be calculated as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('Y').sum().plot.line(color=’g’) plt.show()
Year-wise mean slides can be calculated as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('Y').mean().plot.bar(color='r') plt.show()
According to the plot, the year 2010 had more landslides than any other year (as per the dataset).
The total number of landslides calculated in a month-wise manner is as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('M').sum().plot.area(color='b') plt.show()
Though we have found in our previous plots that the year 2010 had the maximum number of landslides per year, the above plot reveals that most number of landslides per month occurred in the year 2013.
Wrapping Up Datetime Parsing
In this tutorial, we have discussed how to parse datetime with Pandas library. Further, we have explored some practical examples and have performed various datetime transformations. Finally, we have discussed the analysis and visualization tools available exclusively for the datetime features.
The datasets used in this tutorial are collected from open-sources and can be found here.
The notebook with above code implementation can be found here.
Join Our Telegram Group. Be part of an engaging online community. Join Here.
Subscribe to our NewsletterGet the latest updates and relevant offers by sharing your email.
A geek in Machine Learning with a Master's degree in Engineering and a passion for writing and exploring new things. Loves reading novels, cooking, practicing martial arts, and occasionally writing novels and poems.