How To Use Deep Learning For Tabular Data

“Never overlook a Kaggle competition when it doesn’t award prizes/ranking, it may have even more interesting stuff (like xDeepFM) for you.” - Luca Massaron, Senior Data Scientist and Kaggle Master.

The session “Deep Learning For Tabular Data” was presented at the Deep Learning DevCon 2020 (DLDC 2020) by Luca Massaron, a Senior Data Scientist and Kaggle Master.

Deep Learning DevCon 2020 is a conference hosted by the Association of Data Scientists in partnership with Analytics India Magazine. Scheduled for 29th and 30th October, the conference brought together leading experts and some of the best minds in the deep learning and machine learning industry from around the globe.

Massaron opened the session with a brief overview of what deep learning and deep neural networks are and why they are relevant. He then turned to deep learning for tabular data and discussed three challenges: data preparation, high cardinality and architecture.

After mentioning some popular applications of deep neural networks, such as voice search and voice-activated assistants, self-driving cars, sophisticated recommender systems, and image generation and manipulation, he stated that a lack of data could prove to be an issue when building robust models.

He said, “Images, as well as text, are not the most frequent data that you handle. Since long, relational databases have fostered the storage of a mix of numeric, symbolic and textual data, scattered all together through tables.”

He explained that several challenges lie in tabular data. Some of them are:

  • Mixed feature data-types
  • Sparse data, which makes it harder for a DNN to converge
  • Sometimes less data than in image recognition problems
  • No state-of-the-art or best practice architecture
  • Suspicion from non-technical people, because deep neural networks are less interpretable than simpler ML algorithms
  • Often not the best-in-class solution, as a gradient boosting machine (GBM) might perform better.

In order to overcome these challenges and to use deep neural networks successfully for tabular data, Massaron suggested the following steps:

  • It takes some effort, and one must not expect an automated process or great results at once.
  • It is not necessary to reinvent the wheel, which means one can use TensorFlow/Keras, Scikit-learn and Pandas for the projects.
  • Process and pipeline input according to its type
  • While creating a suitable neural architecture, one must take into account the number of available examples
  • Use regularisation, such as L1/L2, dropout.
  • Encode prior knowledge (feature engineering)
  • Test and tune the network using cross-validation.
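The regularisation and cross-validation steps above can be sketched roughly as follows. This is only an illustration, not the code shown in the talk: the synthetic data, the layer sizes, the dropout rate, the L2 strength and the helper names `build_model` and `cross_validate` are all assumptions made for the example.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold


def build_model(n_features: int) -> tf.keras.Model:
    """Small dense network with an L2 weight penalty and dropout (sizes are assumptions)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(
            32,
            activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # L2 regularisation
        ),
        tf.keras.layers.Dropout(0.3),  # randomly zeroes 30% of activations in training
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


def cross_validate(X, y, n_splits=3, epochs=5):
    """Train a fresh model on each fold and return the validation accuracies."""
    scores = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kfold.split(X):
        model = build_model(X.shape[1])
        model.fit(X[train_idx], y[train_idx], epochs=epochs, batch_size=32, verbose=0)
        _, accuracy = model.evaluate(X[valid_idx], y[valid_idx], verbose=0)
        scores.append(float(accuracy))
    return scores


# Synthetic stand-in for an already prepared numeric table.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

fold_scores = cross_validate(X, y)
print(fold_scores)
```

Building a new model inside each fold keeps weights from leaking between folds, so the spread of `fold_scores` gives a rough idea of how stable a given architecture is.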

While explaining the challenges in data preparation, Massaron stressed using the right tools for the job, such as TensorFlow with Keras and Scikit-learn with Pandas. TensorFlow offers a high-level API and is production-ready, robust and effective, while Scikit-learn offers cross-validation, a host of classes for data processing, and the .fit and .transform methods.

Massaron also explained that, when dealing with numerical variables, one can build a machine learning pipeline with the following steps:

  • Exclude low variance variables
  • Deal with missing values: just impute the median, but don’t forget to create missing indicators to catch any information in missingness not at random
  • Strive to make your data more Gaussian-like
  • Catch outliers (or be confident that your BatchNormalisation layer will)
  • Normalise all the values before feeding them.
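One possible way to express these steps with Scikit-learn is sketched below. The toy data, the thresholds and the variable names are assumptions for illustration, and the ordering is a simplification: imputation runs before the low-variance filter so that the filter never sees missing values, and the quantile transform is used here both to make the data more Gaussian-like and to squash outliers.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Branch 1: clean and rescale the numeric values.
values = make_pipeline(
    SimpleImputer(strategy="median"),       # fill missing values with the median
    VarianceThreshold(threshold=0.0),       # drop constant (zero-variance) columns
    QuantileTransformer(                    # map each column to a Gaussian-like shape
        n_quantiles=100, output_distribution="normal", random_state=0
    ),
    StandardScaler(),                       # zero mean, unit variance
)

# Branch 2: binary flags marking where values were missing.
numeric_prep = make_union(values, MissingIndicator())

# Toy table: a skewed column with gaps, a normal column, a constant column.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.lognormal(size=200),
    rng.normal(size=200),
    np.ones(200),
])
X[::10, 0] = np.nan  # every tenth value of the first column is missing

X_prepared = numeric_prep.fit_transform(X)
print(X_prepared.shape)  # (200, 3): two scaled columns plus one missing indicator
```

Fitting the whole object on the training split only, and then calling `transform` on validation data, keeps the medians and quantiles from leaking information across splits.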

Next, he discussed the importance of Keras and walked through some Scikit-learn examples, such as importing QuantileTransformer. He also shared tips for parsing dates, extracting time parts, and handling low-cardinality and ordinal categorical variables.
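As a rough illustration of the date and ordinal tips, the Pandas sketch below parses a text column into timestamps, extracts a few time parts, and encodes an ordered category as integers. The column name `signup`, the sample values and the `size_order` list are hypothetical and were not taken from the talk.

```python
import pandas as pd

# Hypothetical raw column with one unparseable entry.
df = pd.DataFrame(
    {"signup": ["2020-10-29 09:15:00", "2020-10-30 18:40:00", "not a date"]}
)

# Parse strings to timestamps; invalid entries become NaT instead of raising.
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

# Extract time parts that a network can consume as separate features.
parts = pd.DataFrame({
    "year": df["signup"].dt.year,
    "month": df["signup"].dt.month,
    "dayofweek": df["signup"].dt.dayofweek,  # Monday = 0
    "hour": df["signup"].dt.hour,
})
print(parts)

# Ordinal categorical: keep the natural order by mapping to integer codes.
size_order = ["small", "medium", "large"]
sizes = pd.Series(["medium", "small", "large"])
codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes
print(list(codes))  # [1, 0, 2]
```

Rows that fail to parse end up with missing time parts, so they can be handled by the same imputation and missing-indicator logic as other numeric features.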

Talking about the second challenge, which is categorical data, Massaron discussed the importance of the embedding layer, high cardinality and how the embedding layer works. Coming to the third challenge, which is neural architectures, he mentioned some of the important topics, such as Deep Double Descent, Factorisation Machines, wide and deep learning, the xDeepFM architecture and the Compressed Interaction Network (CIN).
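To make the embedding idea concrete, the NumPy sketch below shows the mechanism in isolation: an embedding is a matrix with one row per category, and looking up a category id returns its row, which is equivalent to multiplying a one-hot vector by that matrix. The sizes and ids are arbitrary assumptions. In a real network, a layer such as Keras's `Embedding` performs this lookup and learns the matrix during training; here the matrix is simply random.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 1000   # high-cardinality feature, e.g. 1000 distinct ids (assumed)
embedding_dim = 8     # each category is represented by 8 numbers (assumed)

# Stand-in for the trainable weights of an embedding layer.
embedding_matrix = rng.normal(scale=0.05, size=(n_categories, embedding_dim))

# Integer-encoded categories for four rows of a table.
category_ids = np.array([3, 512, 3, 999])

# The "forward pass" of an embedding is just row selection.
dense_vectors = embedding_matrix[category_ids]
print(dense_vectors.shape)  # (4, 8)

# Same result as one-hot encoding followed by a matrix product,
# but without ever building the 1000-wide one-hot vectors in practice.
one_hot = np.eye(n_categories)[category_ids]
assert np.allclose(one_hot @ embedding_matrix, dense_vectors)
```

This is why embeddings suit high-cardinality columns: the input to later layers is 8 values per row rather than 1000, and rows with the same category share the same learned vector.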

Massaron wrapped up the session by stating that DNNs for tabular data are an alternative solution for modelling a response given tabular data. They require an ad-hoc architecture and a well-crafted pipeline, but they are easier to build thanks to specialised layers and packages. Entity embeddings of categorical data can also be of great use when dealing with high cardinality. Massaron concluded the talk by saying, “Never overlook a Kaggle competition when it doesn’t award prizes/ranking, it may have even more interesting stuff (like xDeepFM) for you.”

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.