Data Science Journey of Manu Joseph, The Creator of PyTorch Tabular

At Thoucentric, Manu Joseph leads research initiatives in causality, predictive maintenance, time series forecasting and NLP, among other areas.

“I thrive in situations where I have to get things done or create new systems and new modules. I like to satisfy my curiosity and maker trait,” said the creator of PyTorch Tabular, GATE and LAMA-Net, Manu Joseph. He said that he is fascinated with math, data science, and machine learning, particularly deep learning, because of its flexibility and scalability. 

Joseph currently heads applied research at Thoucentric, a niche management consulting company. There, he leads a group of researchers in productionising cutting-edge technology to add value to real-world customers, primarily in causality, predictive maintenance, time series forecasting and NLP. Prior to this, he worked with companies such as Philips, Entercoms, Schneider Electric and Cognizant Technology Solutions.

In an exclusive interview with Analytics India Magazine, Joseph talks about his journey into data science, alongside some of his passion projects, tips for people entering data science for better career opportunities, and more. 

A Self-taught Data Scientist 

From starting his career in industrial engineering, to working in the IT industry, to moving into data science and analytics, and now leading applied research, Joseph’s journey has been truly inspirational.

“Transitioning from a STEM role, say, engineering, to data science is relatively easier than other areas,” said Joseph. He said that whatever branch you study in engineering changes the way your brain is wired. “I think that is actually helpful in all of these things,” he added. 

However, he said that when shifting into domains like machine learning, statistics or computer science, you have to be comfortable with programming. “There’s no way around it,” he added.

He said you could learn all the machine learning theory you want, but at the end of the day, for any of it to be useful, you need to convert it into code. “In today’s scenario, nobody will do it for you. So you have to do it yourself,” he added, saying that a few years ago you had the luxury of leaving implementation to someone else, but with the industry growing rapidly, there is now no option but to learn.

Further, Joseph said that you should not be afraid of Math. “It is not going to get in your way in the beginning. You can get away without Math early on, but eventually, it will come knocking on your door and then it will make a lot of difference,” he added, saying that it is a lot easier to communicate concepts in Math than in English. “Understanding what’s happening is actually very important. Otherwise, you will be able to build a model; you will be able to predict and get results out of it. But, the first time you hit a wall, without knowing what is happening in the background you won’t be able to navigate around the problem,” said Joseph. 

Lastly, he said that people should start looking at interesting problems, create datasets, participate in hackathons, and develop models to make them more useful. “Move away from your standard Titanic datasets and solve something interesting that makes your resume stand out. It is very easy to identify people who have gone the extra mile,” he added. 

Origin of PyTorch Tabular 

An industrial engineer turned data scientist, Joseph said that when you are working on a business problem, about 90 per cent of the data you encounter is tabular, and classical machine learning methods are what practitioners typically rely on for it. However, these are just a small portion of what can be done, because there are many more avenues to explore.

“That is where we started looking at deep learning. During my research, I found out that there was not a lot of work happening in that area,” recalled Joseph, noting that previously people were still using standard feedforward networks with something like categorical embeddings on top as their tabular models.
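The feedforward-plus-categorical-embeddings baseline Joseph refers to can be sketched in a few lines: each category id indexes a learnable dense vector, which is concatenated with the numeric columns and fed through an MLP. This is an illustrative NumPy sketch only — all names, shapes and sizes are made up for the example, and it is not any particular library’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one categorical column with 4 levels, 3 numeric columns.
n_categories, embed_dim, n_numeric, hidden = 4, 2, 3, 8

# A categorical embedding is just a learnable lookup table:
# row i is the dense vector for category i.
embedding_table = rng.normal(size=(n_categories, embed_dim))

# MLP weights: one hidden layer and a scalar output.
W1 = rng.normal(size=(embed_dim + n_numeric, hidden))
W2 = rng.normal(size=(hidden, 1))

def forward(cat_ids, numeric):
    """Feedforward pass: look up embeddings, concatenate with numeric
    features, then apply a small MLP (ReLU hidden layer, linear output)."""
    emb = embedding_table[cat_ids]               # (batch, embed_dim)
    x = np.concatenate([emb, numeric], axis=1)   # (batch, embed_dim + n_numeric)
    h = np.maximum(x @ W1, 0.0)                  # ReLU
    return h @ W2                                # (batch, 1)

batch_cat = np.array([0, 3, 1])
batch_num = rng.normal(size=(3, n_numeric))
out = forward(batch_cat, batch_num)
print(out.shape)  # (3, 1)
```

In a real model the embedding table and MLP weights would be trained jointly by backpropagation; the sketch only shows the forward pass.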

“Since I was interested in the field, I kept tabs on what was happening. That’s when models like TabNet and a few other models came out. So I did see an acceleration in the space like more and more people were looking at how to use creative architectures for tabular data,” added Joseph. 

Further, he said that when all these models came out and people started to apply them to their own data, it was a lot of hassle. “Because apart from TabNet, which has a very good library, all the other models were mostly code bases. Making it work was extremely cumbersome,” he added.

That was the start of PyTorch Tabular, a framework for deep learning with tabular data. The framework is built on top of PyTorch and PyTorch Lightning and works directly on pandas data frames. It also brings SOTA models such as NODE and TabNet under a unified API.

“I started this as an internal project. At the time, it did not even have a name. The idea, however, was to unify all of that so that you can switch between different models, just like a Scikit-learn setup,” said Joseph. He said once the data pipeline is ready, switching to a new model is just about changing one line of code. That was the guiding principle behind the development of PyTorch Tabular. Soon he open-sourced the library for others to contribute and use. It is one of the most liked and talked about ML libraries on GitHub.  
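The “switch models by changing one line” principle can be illustrated with a toy sketch of a unified training interface: the data pipeline and trainer stay fixed, and only the model object passed in changes. This is not PyTorch Tabular’s actual API (in the library itself the entry point is a `TabularModel` configured with model-specific config objects) — the classes below are hypothetical stand-ins showing the design idea.

```python
# Illustrative sketch of a scikit-learn-style unified interface.
# Both model classes expose the same fit() contract, so the trainer
# never needs to know which architecture it is driving.

class TabNetModel:
    name = "TabNet"
    def fit(self, df):
        return f"trained {self.name} on {len(df)} rows"

class NodeModel:
    name = "NODE"
    def fit(self, df):
        return f"trained {self.name} on {len(df)} rows"

class TabularTrainer:
    """Shared pipeline: data handling stays fixed; only the model changes."""
    def __init__(self, model):
        self.model = model
    def fit(self, df):
        return self.model.fit(df)

data = [{"a": 1}, {"a": 2}]              # stands in for a pandas DataFrame
trainer = TabularTrainer(NodeModel())    # swap to TabNetModel() here: one line
print(trainer.fit(data))                 # trained NODE on 2 rows
```

Because every model honours the same contract, experimenting with a new architecture costs exactly the one-line swap Joseph describes.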

Enter GATE

One thing led to another; Joseph and his colleague Harsh Raj later released GATE (Gated Additive Tree Ensemble), a novel high-performance, parameter- and computation-efficient deep learning architecture for tabular data. Inspired by GRUs, GATE uses a gating mechanism as a feature-representation learning unit with a built-in feature selection mechanism. It also uses an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention, to predict the desired output.
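The GRU-inspired gating idea can be loosely sketched as an element-wise gate over a transformed feature vector: a sigmoid gate in (0, 1) performs soft feature selection on a tanh-transformed representation. This is an assumption-laden illustration of the general mechanism only, not the formulation from the GATE paper; the weight shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative GRU-style gating unit for feature representation learning.
n_features, n_units = 5, 5
W_gate = rng.normal(size=(n_features, n_units))
W_cand = rng.normal(size=(n_features, n_units))

def gated_unit(x):
    gate = sigmoid(x @ W_gate)       # in (0, 1): acts as soft feature selection
    candidate = np.tanh(x @ W_cand)  # transformed representation
    return gate * candidate          # element-wise gating

x = rng.normal(size=(2, n_features))
out = gated_unit(x)
print(out.shape)  # (2, 5)
```

Because the gate is differentiable, the network can learn which features to suppress end-to-end — the property that lets GATE build feature selection into training rather than bolting it on beforehand.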

Joseph said that GATE is a competitive alternative to SOTA methods like GBDTs, NODE and FT Transformers, having been benchmarked on several public datasets (both classification and regression). The code is yet to be open-sourced.

LAMA-Net 

At Thoucentric, Joseph, alongside Varchita Lalwani, recently developed LAMA-Net, a new Transformer-based encoder-decoder model with an induced bottleneck, latent alignment using maximum mean discrepancy, and manifold learning, to tackle the problem of unsupervised homogeneous domain adaptation for remaining useful life (RUL) prediction.

Citing predictive maintenance in manufacturing, Joseph said this is essentially a domain adaptation technique, focused on how training data with shifting distributions can be used to train a robust model to predict remaining useful life.

“In a real-world implementation, it is really difficult to get the data needed to train these models—you will need to have data for multiple failures in the past, and failures are usually a rare event. So, getting the data is difficult,” said Joseph, adding that with domain adaptation, models trained on existing datasets can now be applied to a new dataset without any labels.
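The maximum mean discrepancy term LAMA-Net uses for latent alignment has a compact closed form: it measures the distance between two distributions via kernel means. Below is a minimal NumPy sketch of the (biased) MMD² estimate between two samples; the RBF kernel, bandwidth and sample sizes here are illustrative assumptions, not the paper’s exact setup.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix between rows of a and rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. one shifted distribution.
same = mmd2(rng.normal(size=(100, 4)), rng.normal(size=(100, 4)))
shifted = mmd2(rng.normal(size=(100, 4)), rng.normal(loc=2.0, size=(100, 4)))
print(same < shifted)  # True: shifted distributions are further apart
```

Minimising a term like this between source- and target-domain latent representations pushes the encoder to produce features whose distributions match across domains, which is what lets a model trained on labelled source data transfer to an unlabelled target dataset.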

What next? 

To date, Joseph has worked on more than 20 AI/ML projects, and in a personal capacity, he has worked on more than ten. At Thoucentric, he is currently building a team of data scientists who will be working on new-age technologies to solve their customers’ problems. The team is working on four different projects and is planning to publish three papers in the coming months.

Joseph told AIM that he would continue developing new methods and technologies in areas that do not use a lot of training data and build domain-agnostic models. “Because, having worked in the industry for some time now, I know that training data is very difficult to come by. That too, like annotated training data, is very, very difficult to come by,” said Joseph. He said that is why he is interested in areas like transfer learning, self-supervised learning, etc. 

Go-to Resources Curated by Manu Joseph

Data science resources
Newsletters
AI/ML Courses
Must-read research papers

Amit Raja Naik

Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.