Can We Trust k-fold Cross Validation For Financial Modelling?

Machine learning is not a buzzword anymore, at least not in the case of financial modelling. Fraud prevention, algorithmic trading, digital assistants and risk management are some of the areas where machine learning has found its niche. The application of ML models in finance is unlike that in any other industry, because the decisions taken either pay dividends immediately or fail catastrophically.

Many financial companies rely on data engineering, statistics and visualization tools to meet their goals. With machine learning, models can be retrained repeatedly until the best solution emerges. That said, there is no universal machine learning solution for all business problems.

Source: Financial ML via n-ix

The graphic above illustrates the confidence of financial institutions in adopting machine learning methodologies. Speaking of methodologies, there are many statistical methods and frameworks that help in building models. For instance, validation techniques are widely used to assess the accuracy of models, with cross-validation being the most popular choice.

Cross-validation is used to estimate the generalization error of a machine learning algorithm and thereby prevent overfitting. But in the case of financial models, overfitting does take place and can go undetected by CV, and hyper-parameter tuning can contribute further to it. A model with such fundamental flaws can still pass validation, even while its forecasting power is reduced to nil.

A Quick Recap Of CV

CV splits the dataset into two sets: the training set and the testing set. Each observation in the complete dataset belongs to one, and only one, set. This is done so as to prevent leakage from one set into the other, since that would defeat the purpose of testing on unseen data.

There are many alternative CV schemes, of which one of the most popular is k-fold CV. It works as follows:

  1. The dataset is partitioned into k subsets.
  2. For i = 1,…,k
  3. The ML algorithm is trained on all subsets excluding i.
  4. The fitted ML algorithm is tested on i.

The outcome from k-fold CV is a k x 1 array of cross-validated performance metrics. For example, in a binary classifier, the model is deemed to have learned something if the cross-validated accuracy is over 1/2, more than what we would achieve by tossing a fair coin.
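The procedure above can be sketched in a few lines with scikit-learn. The synthetic dataset and the random-forest model below are purely illustrative choices, not part of the original discussion:

```python
# A minimal k-fold CV sketch on a synthetic binary-classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# k = 5 folds -> a 5 x 1 array of cross-validated accuracy scores
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores)         # one accuracy per fold
print(scores.mean())  # should beat 1/2, the fair-coin baseline
```

On IID data like this, the mean cross-validated accuracy is a reasonable estimate of out-of-sample performance; the rest of the article explains why that stops being true for financial series.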

How Well Does CV Fare In Finance?

One reason k-fold CV fails in finance is that observations cannot be assumed to be drawn from an IID (Independent and Identically Distributed) process.

A second reason for CV’s failure is that the testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.

So, whenever there is an overlap between the training and testing datasets, there will be some leakage. Leakage becomes especially problematic in the presence of irrelevant features, as it leads to false discoveries.

Problems With Sklearn’s Cross-Validation

Scikit-learn is the most popular ML library for implementing cross-validation. One of the many upsides of open-source code is that you can verify everything and adjust it to your needs. In Advances in Financial Machine Learning, Marcos Lopez de Prado lists the following two problems with sklearn:

  1. Scoring functions do not know classes_, a consequence of sklearn’s reliance on numpy arrays rather than pandas series.
  2. cross_val_score will give different results because it passes sample weights to the fit method, but not to the log_loss method.
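The second problem can be worked around by running the folds manually, so that sample weights reach both fit and the scoring function. The sketch below is in the spirit of the book's suggested fix; the dataset, the model, and the uniform placeholder weights are illustrative assumptions:

```python
# Run k-fold manually so sample weights reach BOTH fit and log_loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
w = np.ones(len(y))  # placeholder sample weights
clf = RandomForestClassifier(n_estimators=50, random_state=0)

scores = []
for train, test in KFold(n_splits=5).split(X):
    clf.fit(X[train], y[train], sample_weight=w[train])
    prob = clf.predict_proba(X[test])
    # pass the test-set weights (and the classes) to the scorer as well
    scores.append(-log_loss(y[test], prob, sample_weight=w[test],
                            labels=clf.classes_))
scores = np.array(scores)
```

Passing `labels=clf.classes_` explicitly also addresses the first problem: the scorer no longer has to guess the class ordering from the test fold alone.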

How To Reduce The Leakage

  • Drop from the training set any observation i where Yi is a function of information used to determine Yj, and j belongs to the testing set.
  • Avoid overfitting the classifier.
  • Use early stopping on the base estimators.
  • Bag classifiers while controlling for oversampling of redundant examples, so that the individual classifiers are as diverse as possible.
  • Set the average uniqueness.
  • Apply a sequential bootstrap.

An Alternative In The Form Of Purged K-Fold CV

One way to reduce leakage, called “purging,” is to drop from the training set all observations whose labels overlap in time with the labels included in the testing set.

If no training observations occur between the first and last testing observations, purging can be accelerated by representing the testing set as a pandas series with a single item spanning the entire testing period.

The larger the number of testing splits, the greater the number of overlapping observations that must be purged from the training set. In many cases, purging is enough to prevent leakage, and performance improves when the model is allowed to recalibrate more often.
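A simplified purged k-fold splitter can be written as below, in the spirit of the book's PurgedKFold. The assumptions here are illustrative: `t1` maps each observation's start time (its index) to the time its label ends, and the data are time-ordered:

```python
# A simplified purged k-fold: drop training observations whose label
# span overlaps the test window.
import numpy as np
import pandas as pd

def purged_kfold_indices(t1, n_splits=5):
    """Yield (train, test) index arrays with overlapping labels purged."""
    indices = np.arange(len(t1))
    for test_idx in np.array_split(indices, n_splits):
        test_start = t1.index[test_idx[0]]  # first test observation starts
        test_end = t1.iloc[test_idx].max()  # last test label ends
        # keep only training observations whose label span [start, t1]
        # ends before the test window or starts after it
        train_mask = (t1 < test_start) | (t1.index > test_end)
        train_idx = np.setdiff1d(indices[train_mask.values], test_idx)
        yield train_idx, test_idx

# toy example: each label spans 3 days beyond its start date
times = pd.date_range("2020-01-01", periods=20, freq="D")
t1 = pd.Series(times + pd.Timedelta(days=3), index=times)
for train, test in purged_kfold_indices(t1, n_splits=4):
    # no surviving training label may end inside the test window
    assert t1.iloc[train].between(times[test[0]],
                                  t1.iloc[test].max()).sum() == 0
```

Compared with plain KFold, each fold's training set here is strictly smaller: the observations sacrificed are exactly those whose labels would have leaked test-period information into training.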

Future Direction

The number of open-source machine learning algorithms and tools to curate financial data is increasing fast. And with the increasing interest of financial institutions in AI, the funds allocated will grow, which in turn will enable more methodologies to be developed. The advantage of this industry is its quantitative nature and its large repository of historical data, which is exactly what a machine learning model needs. Neglecting these advancements and latching on to conventional methods will prove costly in the future.


Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
