“Never overlook a Kaggle competition when it doesn’t award prizes/ranking, it may have even more interesting stuff (like xDeepFM) for you.”- Luca Massaron, Senior Data Scientist and Kaggle Master.
Deep Learning DevCon 2020 (DLDC) is hosted by the Association of Data Scientists in partnership with Analytics India Magazine. Scheduled for 29th and 30th October, the conference brings together leading experts and some of the best minds in the deep learning and machine learning industry from around the globe.
In this session, Massaron began with a brief overview of what deep learning and deep neural networks are and why they are relevant. He then turned to deep learning for tabular data and discussed three challenges: data preparation, high cardinality and architecture.
Mentioning some popular applications of deep neural networks, such as voice search and voice-activated assistants, self-driving cars, sophisticated recommender systems, and image generation and manipulation, he stated that a lack of data could prove to be an issue when building robust models.
He said, “Images, as well as text, are not the most frequent data that you handle. Since long, relational databases have fostered the storage of a mix of numeric, symbolic and textual data, scattered all together through tables.”
He explained that several challenges lie in tabular data. Some of them are:
- Mixed feature data-types
- Sparse data, which makes it harder for a DNN to converge
- Sometimes less data than in image recognition problems
- No state-of-the-art or best practice architecture
- Suspicion from non-technical people, because deep neural networks are less interpretable than simpler ML algorithms
- Often not the best-in-class solution, as a GBM might perform better.
In order to overcome these challenges and to use deep neural networks successfully for tabular data, Massaron suggested the following steps:
- Expect to put in some effort; do not count on an automated process or great results at once
- There is no need to reinvent the wheel: TensorFlow/Keras, Scikit-learn and Pandas can be used for such projects
- Process and pipeline the input according to its type
- When creating a suitable neural architecture, take into account the number of available examples
- Use regularisation, such as L1/L2 and dropout
- Encode prior knowledge (feature engineering)
- Test and tune the network using cross-validation.
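As a rough illustration of the regularisation and cross-validation points above, here is a minimal sketch. It is not code from the talk: it swaps in Scikit-learn's `MLPClassifier` as a small stand-in for a Keras network, uses synthetic data, and picks arbitrary hyperparameters. The `alpha` parameter is an L2 penalty; `MLPClassifier` has no dropout, which in Keras would typically come from a `Dropout` layer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data, used here only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scaling sits inside the pipeline so it is re-fitted on each training fold.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(16,),  # small network for a small dataset
        alpha=1e-2,                # L2 regularisation strength (assumed value)
        max_iter=500,
        random_state=0,
    ),
)

# 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the fold scores across a few values of `alpha` or layer sizes is one simple way to tune such a network.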
While explaining the challenges in data preparation, Massaron discussed using the right tools for the job, such as TensorFlow with Keras and Scikit-learn with Pandas. He described TensorFlow as offering a high-level API and being production-ready, robust and effective, while Scikit-learn offers cross-validation, a host of classes for data processing, the .fit and .transform methods, and more.
Massaron also explained that, when dealing with numerical variables, one can build a machine learning pipeline with the following steps:
- Exclude low variance variables
- Deal with missing values: simply impute the median, but don’t forget to create missing indicators to capture any information in values that are missing not at random
- Strive to make your data more Gaussian-like
- Catch outliers (or be confident that your BatchNormalization layer will)
- Normalise all the values before feeding them.
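The numeric-variable steps above can be sketched as a Scikit-learn pipeline. This is an illustrative assumption about how the pieces might fit together, not the speaker's code: the data is synthetic, and the Gaussianising step is placed before imputation so that the missing-value indicator column is not quantile-transformed. Mapping values to a normal distribution also pulls extreme values in, which is one way to tame outliers.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)
n_rows = 200
X = np.column_stack([
    rng.lognormal(size=n_rows),  # skewed feature
    rng.normal(size=n_rows),     # roughly Gaussian feature
    np.ones(n_rows),             # constant feature (zero variance)
])
X[::10, 0] = np.nan              # every 10th value of column 0 is missing

numeric_pipeline = Pipeline([
    # 1. Drop features whose variance is zero (NaNs are ignored here).
    ("drop_low_variance", VarianceThreshold(threshold=0.0)),
    # 2. Map each feature to a normal distribution; NaNs pass through.
    ("gaussianise", QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0)),
    # 3. Impute the median and append a missing-value indicator column.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # 4. Standardise every column, including the indicator.
    ("scale", StandardScaler()),
])

X_ready = numeric_pipeline.fit_transform(X)
print(X_ready.shape)  # two kept features plus one indicator column
```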
Next, he discussed the importance of Keras and walked through some Scikit-learn examples, such as importing QuantileTransformer, and shared tips for parsing dates, extracting time parts, and handling low-cardinality and ordinal categorical variables.
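A minimal Pandas sketch of those tips might look as follows. The column names (`order_date`, `channel`, `size`) and the category ordering are hypothetical, chosen only to show one example of each case: a parsed date split into parts, a low-cardinality column one-hot encoded, and an ordinal column mapped to integer codes.

```python
import pandas as pd

# Hypothetical toy table; the column names are assumptions for this example.
df = pd.DataFrame({
    "order_date": ["2020-10-29 09:15:00", "2020-10-30 18:40:00",
                   "2020-10-30 07:05:00"],
    "channel": ["web", "store", "web"],      # low-cardinality categorical
    "size": ["small", "large", "medium"],    # ordinal categorical
})

# Parse the date strings and extract the time parts.
ts = pd.to_datetime(df["order_date"])
date_parts = pd.DataFrame({
    "month": ts.dt.month,
    "day_of_week": ts.dt.dayofweek,  # Monday = 0
    "hour": ts.dt.hour,
})

# Low cardinality: one-hot encode.
channel_onehot = pd.get_dummies(df["channel"], prefix="channel")

# Ordinal: map to integer codes that respect the category order.
size_order = ["small", "medium", "large"]
size_code = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

features = pd.concat([date_parts, channel_onehot], axis=1)
features["size_code"] = size_code
print(features)
```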
Talking about the second challenge, which is categorical data, Massaron discussed high cardinality, the importance of the embedding layer and how it works. Coming to the third challenge, which is neural architectures, he mentioned some of the important topics, such as Deep Double Descent, factorisation machines, wide and deep learning, the xDeepFM architecture and the Compressed Interaction Network (CIN).
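To make the embedding idea concrete, here is a small NumPy sketch, offered as an illustration rather than anything shown in the talk. An embedding layer can be viewed as a trainable lookup table: each category ID selects one row of a weight matrix, which is equivalent to multiplying a one-hot vector by that matrix but avoids materialising the huge one-hot input. The sizes and the random weights below are assumptions; in Keras the same role is played by `tf.keras.layers.Embedding(input_dim, output_dim)`, whose weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 10_000   # high-cardinality categorical feature (assumed size)
embedding_dim = 8       # size of the dense vector per category (assumed)

# Stand-in for trained weights: one row per category.
embedding_matrix = rng.normal(scale=0.05, size=(n_categories, embedding_dim))

# Integer-encoded category IDs for three rows of a table.
category_ids = np.array([3, 9_999, 3])

# Embedding lookup: pick the matching rows.
dense_vectors = embedding_matrix[category_ids]

# The same result via an explicit one-hot matrix product.
one_hot = np.zeros((len(category_ids), n_categories))
one_hot[np.arange(len(category_ids)), category_ids] = 1.0
via_one_hot = one_hot @ embedding_matrix

print(dense_vectors.shape)                      # (3, 8)
print(np.allclose(dense_vectors, via_one_hot))  # True
```

Rows that share a category ID receive the same vector, so the network sees 8 dense inputs per row instead of 10,000 sparse ones.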
Massaron wrapped up the session by stating that DNNs for tabular data are an alternative solution for modelling a response given tabular data. They require an ad-hoc architecture and a well-crafted pipeline, but they are easier to build thanks to specialised layers and packages. Also, entity embeddings of categorical data can be of great use when dealing with high cardinality. Massaron concluded the talk by saying, “Never overlook a Kaggle competition when it doesn’t award prizes/ranking, it may have even more interesting stuff (like xDeepFM) for you.”