Data Science Journey of Manu Joseph, The Creator of PyTorch Tabular

At Thoucentric, Manu Joseph leads research initiatives in causality, predictive maintenance, time series forecasting and NLP, among other areas.

“I thrive in situations where I have to get things done or create new systems and new modules. I like to satisfy my curiosity and maker trait,” said the creator of PyTorch Tabular, GATE and LAMA-Net, Manu Joseph. He said that he is fascinated with math, data science, and machine learning, particularly deep learning, because of its flexibility and scalability. 

Joseph currently heads applied research at Thoucentric, a niche management consulting company. There, he leads a group of researchers in productionising cutting-edge technology to add value to real-world customers, primarily in causality, predictive maintenance, time series forecasting and NLP. Prior to this, he worked with companies such as Philips, Entercoms, Schneider Electric and Cognizant Technology Solutions.

In an exclusive interview with Analytics India Magazine, Joseph talks about his journey into data science, alongside some of his passion projects, tips for people entering data science for better career opportunities, and more. 

A Self-taught Data Scientist 

From starting his career in industrial engineering, to working in the IT industry, to moving into data science and analytics, and now leading applied research, Joseph’s journey has been truly inspirational.

“Transitioning from a STEM role, say, engineering, to data science is relatively easier than other areas,” said Joseph. He said that whatever branch you study in engineering changes the way your brain is wired. “I think that is actually helpful in all of these things,” he added. 

However, he said that when shifting into domains like machine learning, statistics or computer science, you have to be comfortable with programming. “There’s no way around it,” he added.

He said you could learn all the machine learning theory you want, but at the end of the day, for any of it to be useful, you need to convert it into code. “In today’s scenario, nobody will do it for you. So you have to do it yourself,” he added, saying that a few years ago you had the luxury of leaving implementation to someone else, but with the industry growing rapidly, there is now no option but to learn.

Further, Joseph said that you should not be afraid of Math. “It is not going to get in your way in the beginning. You can get away without Math early on, but eventually, it will come knocking on your door and then it will make a lot of difference,” he added, saying that it is a lot easier to communicate concepts in Math than in English. “Understanding what’s happening is actually very important. Otherwise, you will be able to build a model; you will be able to predict and get results out of it. But, the first time you hit a wall, without knowing what is happening in the background you won’t be able to navigate around the problem,” said Joseph. 

Lastly, he said that people should start looking at interesting problems, create datasets, participate in hackathons, and develop models to make them more useful. “Move away from your standard Titanic datasets and solve something interesting that makes your resume stand out. It is very easy to identify people who have gone the extra mile,” he added. 

Origin of PyTorch Tabular 

An industrial engineer turned data scientist, Joseph said that when you are working on a business problem, about 90 per cent of the data you encounter is tabular, and classical machine learning methods are what practitioners typically rely on for it. However, these are just a small portion of what can be done, because there are many more avenues to explore.

“That is where we started looking at deep learning. During my research, I found out that there was not a lot of work happening in that area,” recalled Joseph, noting that previously people were still using standard feedforward networks with something like categorical embeddings on top as their tabular models.
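The feedforward-plus-categorical-embeddings baseline Joseph refers to can be sketched in a few lines: each category id indexes a learnable dense vector, which is concatenated with the numeric columns and fed through an MLP. This is an illustrative NumPy sketch only — all names, shapes and sizes are made up for the example, and it is not any particular library’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one categorical column with 4 levels, 3 numeric columns.
n_categories, embed_dim, n_numeric, hidden = 4, 2, 3, 8

# A categorical embedding is just a learnable lookup table:
# row i is the dense vector for category i.
embedding_table = rng.normal(size=(n_categories, embed_dim))

# MLP weights: one hidden layer and a scalar output.
W1 = rng.normal(size=(embed_dim + n_numeric, hidden))
W2 = rng.normal(size=(hidden, 1))

def forward(cat_ids, numeric):
    """Feedforward pass: look up embeddings, concatenate with numeric
    features, then apply a small MLP (ReLU hidden layer, linear output)."""
    emb = embedding_table[cat_ids]               # (batch, embed_dim)
    x = np.concatenate([emb, numeric], axis=1)   # (batch, embed_dim + n_numeric)
    h = np.maximum(x @ W1, 0.0)                  # ReLU
    return h @ W2                                # (batch, 1)

batch_cat = np.array([0, 3, 1])
batch_num = rng.normal(size=(3, n_numeric))
out = forward(batch_cat, batch_num)
print(out.shape)  # (3, 1)
```

In a real model the embedding table and MLP weights would be trained jointly by backpropagation; the sketch only shows the forward pass.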

“Since I was interested in the field, I kept tabs on what was happening. That’s when models like TabNet and a few other models came out. So I did see an acceleration in the space like more and more people were looking at how to use creative architectures for tabular data,” added Joseph. 

Further, he said that when all these models came out and people started to apply them to their own data, it was a lot of hassle. “Because apart from TabNet, which has a very good library, all the other models were mostly code bases. Making it work was extremely cumbersome,” he added.

That was the start of PyTorch Tabular, a framework for deep learning with tabular data. The framework is built on top of PyTorch and PyTorch Lightning and works directly on pandas data frames. It also brings SOTA models such as NODE and TabNet under a unified API.

“I started this as an internal project. At the time, it did not even have a name. The idea, however, was to unify all of that so that you can switch between different models, just like a Scikit-learn setup,” said Joseph. He said once the data pipeline is ready, switching to a new model is just about changing one line of code. That was the guiding principle behind the development of PyTorch Tabular. Soon he open-sourced the library for others to contribute and use. It is one of the most liked and talked about ML libraries on GitHub.  
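The “switch models by changing one line” principle can be illustrated with a toy sketch of a unified training interface: the data pipeline and trainer stay fixed, and only the model object passed in changes. This is not PyTorch Tabular’s actual API (in the library itself the entry point is a `TabularModel` configured with model-specific config objects) — the classes below are hypothetical stand-ins showing the design idea.

```python
# Illustrative sketch of a scikit-learn-style unified interface.
# Both model classes expose the same fit() contract, so the trainer
# never needs to know which architecture it is driving.

class TabNetModel:
    name = "TabNet"
    def fit(self, df):
        return f"trained {self.name} on {len(df)} rows"

class NodeModel:
    name = "NODE"
    def fit(self, df):
        return f"trained {self.name} on {len(df)} rows"

class TabularTrainer:
    """Shared pipeline: data handling stays fixed; only the model changes."""
    def __init__(self, model):
        self.model = model
    def fit(self, df):
        return self.model.fit(df)

data = [{"a": 1}, {"a": 2}]              # stands in for a pandas DataFrame
trainer = TabularTrainer(NodeModel())    # swap to TabNetModel() here: one line
print(trainer.fit(data))                 # trained NODE on 2 rows
```

Because every model honours the same contract, experimenting with a new architecture costs exactly the one-line swap Joseph describes.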

Enter GATE

One thing led to another; Joseph and his colleague Harsh Raj later released GATE (Gated Additive Tree Ensemble), a novel high-performance, parameter- and computation-efficient deep learning architecture for tabular data. Inspired by GRUs, GATE uses a gating mechanism as a feature-representation learning unit with a built-in feature selection mechanism. It also uses an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention, to predict the desired output.
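The GRU-inspired gating idea can be loosely sketched as an element-wise gate over a transformed feature vector: a sigmoid gate in (0, 1) performs soft feature selection on a tanh-transformed representation. This is an assumption-laden illustration of the general mechanism only, not the formulation from the GATE paper; the weight shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative GRU-style gating unit for feature representation learning.
n_features, n_units = 5, 5
W_gate = rng.normal(size=(n_features, n_units))
W_cand = rng.normal(size=(n_features, n_units))

def gated_unit(x):
    gate = sigmoid(x @ W_gate)       # in (0, 1): acts as soft feature selection
    candidate = np.tanh(x @ W_cand)  # transformed representation
    return gate * candidate          # element-wise gating

x = rng.normal(size=(2, n_features))
out = gated_unit(x)
print(out.shape)  # (2, 5)
```

Because the gate is differentiable, the network can learn which features to suppress end-to-end — the property that lets GATE build feature selection into training rather than bolting it on beforehand.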

Joseph said that GATE is a competitive alternative to SOTA methods like GBDTs, NODE and FT Transformers, having been benchmarked on several public datasets (both classification and regression). The code is yet to be open-sourced.

LAMA-Net 

At Thoucentric, Joseph, alongside Varchita Lalwani, recently developed LAMA-Net, a new Transformer-based encoder-decoder model with an induced bottleneck, latent alignment using maximum mean discrepancy, and manifold learning, to tackle the problem of unsupervised homogeneous domain adaptation for remaining useful life (RUL) prediction.

Citing predictive maintenance in manufacturing, Joseph said this is essentially a domain adaptation technique, focused on how training data with shifting distributions can be used to train a robust model to predict remaining useful life.

“In a real-world implementation, it is really difficult to get the data needed to train these models—you will need to have data for multiple failures in the past, and failures are usually a rare event. So, getting the data is difficult,” said Joseph, adding that with domain adaptation, models trained on existing datasets can now be applied to a new dataset without any labels.
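The maximum mean discrepancy term LAMA-Net uses for latent alignment has a compact closed form: it measures the distance between two distributions via kernel means. Below is a minimal NumPy sketch of the (biased) MMD² estimate between two samples; the RBF kernel, bandwidth and sample sizes here are illustrative assumptions, not the paper’s exact setup.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix between rows of a and rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. one shifted distribution.
same = mmd2(rng.normal(size=(100, 4)), rng.normal(size=(100, 4)))
shifted = mmd2(rng.normal(size=(100, 4)), rng.normal(loc=2.0, size=(100, 4)))
print(same < shifted)  # True: shifted distributions are further apart
```

Minimising a term like this between source- and target-domain latent representations pushes the encoder to produce features whose distributions match across domains, which is what lets a model trained on labelled source data transfer to an unlabelled target dataset.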

What next? 

To date, Joseph has worked on more than 20 AI/ML projects, and in a personal capacity, he has worked on more than ten. At Thoucentric, he is currently building a team of data scientists who will be working on new-age technologies to solve their customers’ problems. The team is working on four different projects and is planning to publish three papers in the coming months.

Joseph told AIM that he would continue developing new methods and technologies in areas that do not use a lot of training data and build domain-agnostic models. “Because, having worked in the industry for some time now, I know that training data is very difficult to come by. That too, like annotated training data, is very, very difficult to come by,” said Joseph. He said that is why he is interested in areas like transfer learning, self-supervised learning, etc. 

Go-to Resources Curated by Manu Joseph

Data science resources
Newsletters
AI/ML Courses
Must-read research papers

Amit Raja Naik

Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.