Time-series analysis and forecasting is one of the most widely applied machine learning problems. It finds applications in weather forecasting, earthquake prediction, space science, e-commerce, stock market prediction, medical sciences, and signal processing. While dealing with a time-series dataset, the data may contain the date, month, day, and time in any format. This is because people tend to use different date and time formats. Moreover, Python assumes a non-numbered entry as an object and a numbered entry as an integer or float. Hence, it is important to inform Python about the date and time entries.
In this tutorial, we learn how to parse datetime using the Pandas library. Pandas is famous for its datetime parsing, processing, analysis and plotting functions.
Import necessary libraries.
import numpy as np import pandas as pd import matplotlib.pyplot as plt
A simple time-series data
Load some simple time-series data.
url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/Algerian_forest_fires.csv' forestfire = pd.read_csv(url) forestfire.sample(10)
Output:
We observe that there are three separate columns for day, month and year. Let’s look at the data type of these attributes.
forestfire.info()
Output:
Day, month and year values are in integers. We have to convert them to the datetime64
data type.
forestfire['date'] = pd.to_datetime(forestfire[['day', 'month', 'year']]) forestfire.head()
Output:
We used the to_datetime
method available in Pandas to parse the day, month and year columns into a single date column. We can drop the first three columns as they are redundant. Further, we can check attributes’ data types.
forestfire.drop(columns=['day','month','year'], inplace=True) forestfire.info()
Output:
The parsed date can be broken down into elements, i.e., day, month and year back.
days = forestfire['date'].dt.day months = forestfire['date'].dt.month years = forestfire['date'].dt.year recon = pd.DataFrame(zip(days,months,years), columns = ['DAY','MONTH','YEAR']) recon.head()
Output:
Another Dataset With Both Date And Time
Let’s load another time-series dataset that contains both date and time, but in two separate columns.
airquality_url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/AirQualityUCI.csv' # read first 5 columns for better visual clarity airquality = pd.read_csv(airquality_url, sep=';').iloc[:,:5] airquality.sample(5)
Output:
This time-series dataset contains Date in one column and Time in another column. Check the data types of the attributes.
airquality.info()
Output:
As expected, both Date
and Time
columns are in object data type. In contrast to our previous example, the Date
attribute is the DD/MM/YYYY
format and the Time
attribute is in the HH.MM.SS
format. Whenever we know the format of either date or time, we should pass it as an argument to the to_datetime
method. Refer to the official documentation here for more information about different formats.
airquality['DATE'] = pd.to_datetime(airquality['Date'], format='%d/%m/%Y') airquality['TIME'] = pd.to_datetime(airquality['Time'], format='%H.%M.%S') airquality.drop(columns=['Date', 'Time'], inplace=True) airquality.info()
Output:
We removed the original Date
and Time
columns as they were redundant to the new ones. The new attributes DATE
and TIME
are of datetime64
data type. As we have split the date in the previous example, we can split the time into an hour, minute and second elements using the dt
method.
airquality['DAY'] = airquality['DATE'].dt.day airquality['MONTH'] = airquality['DATE'].dt.month airquality['YEAR'] = airquality['DATE'].dt.year airquality['HOUR'] = airquality['TIME'].dt.hour airquality['MINUTE'] = airquality['TIME'].dt.minute airquality['SECOND'] = airquality['TIME'].dt.second airquality.drop(columns=['DATE', 'TIME'], inplace=True) airquality.head()
Output:
We can recall this example from the origin. The original dataset had 2 datetime columns: date
(as object), time
(as object). We converted them into 2 columns of datetime64
data type. In the last step, we split each element to form 6 new columns. However, we can merge all these split elements into a single feature of datetime64
data type to have every detail of date and time.
airquality['parsed'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']]) airquality.head()
Output:
In the above step, the default format YYYY-mm-dd HH:MM:SS
is presented. But, we can have parsed datetime in the format we wish using the strftime
method. Refer to the official documentation here for more formats.
airquality['formatted_date'] = pd.to_datetime(airquality[['DAY','MONTH','YEAR','HOUR','MINUTE','SECOND']]).dt.strftime('%d %b %Y, %I.%M.%S %p') # display last 8 columns only for better visual clarity airquality.head().iloc[:,-8:]
Output:
A Dataset With Inconsistent Datetime Entries
We discuss some more interesting things about datetime parsing with a complex time-series dataset.
url = 'https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/TimeSeries/landslides_data.csv' # load limited features only - for better visual clarity landslides = pd.read_csv(url).loc[:,['date', 'country_code', 'state/province', 'hazard_type']] landslides.head()
Output:
It is observed that the feature date
has different formats. Hence, we can not parse it with a predefined format. Let’s have a thorough check for any other formats.
length = landslides['date'].str.len() length.value_counts()
Output:
Date is presented in five different lengths. Lengths 7 and 8 may refer to a common format. Length 10 may refer to another format. Lengths 16 and 17 may refer to some other format.
Let’s do some analysis to find the hidden truth using NumPy!
ind_7 = np.where([length==7])[1][0] ind_8 = np.where([length==8])[1][0] ind_10 = np.where([length==10])[1][0] ind_16 = np.where([length==16])[1][0] ind_17 = np.where([length==17])[1][0] # load one example row for each date length landslides.loc[[ind_7,ind_8,ind_10,ind_16,ind_17]]
Output:
As we guessed, there are three different date formats in the dataset. The date presented along with time is the least available format with just 4 rows. Hence, we drop these 4 rows for the sake of simplicity.
drop_ind = np.where([length>=16])[1] # drop rows where date length is greater than 15 landslides.drop(index=drop_ind, inplace=True) # check for date lengths length = landslides['date'].str.len() length.value_counts()
Output:
We need not worry about different formats in date. Pandas’ to_datetime
method takes an optional boolean argument infer_datetime_format
. If we pass True
as the argument, Pandas will analyze the format and convert it suitably.
landslides['parsed_date'] = pd.to_datetime(landslides['date'], infer_datetime_format=True) landslides.head()
Output:
Let’s remove the original column to avoid redundancy. We can explore some more features that Pandas provide along with datetime parsing.
landslides.drop(columns=['date'], inplace=True) landslides.head()
Output:
We can calculate the number of landslides per day by analyzing the parsed_date
and plot it using Pandas plotting. Pandas plotting is a simple interface built on top of Matplotlib.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().sort_values().plot.line() plt.show()
Output:
Pandas provides a powerful analysis method, named resample
for datetime64
features. This method permits different analysis year-wise, month-wise, day-wise, and so on. This helps us find the pattern among the time-series data.
The total number of yearly landslides can be calculated as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('Y').sum().plot.line(color=’g’) plt.show()
Output:
Year-wise mean slides can be calculated as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('Y').mean().plot.bar(color='r') plt.show()
Output:
According to the plot, the year 2010 had more landslides than any other year (as per the dataset).
The total number of landslides calculated in a month-wise manner is as follows.
plt.figure(figsize=(8,5)) landslides['parsed_date'].value_counts().resample('M').sum().plot.area(color='b') plt.show()
Output:
Though we have found in our previous plots that the year 2010 had the maximum number of landslides per year, the above plot reveals that most number of landslides per month occurred in the year 2013.
Wrapping Up Datetime Parsing
In this tutorial, we have discussed how to parse datetime with Pandas library. Further, we have explored some practical examples and have performed various datetime transformations. Finally, we have discussed the analysis and visualization tools available exclusively for the datetime features.
The datasets used in this tutorial are collected from open-sources and can be found here.
The notebook with above code implementation can be found here.