Gone are the days when doing machine learning on large datasets required extensive programming and knowledge of ML frameworks. Now, even with limited machine learning knowledge and programming expertise, data analysts can harness the power of machine learning. Google Cloud’s BigQuery ML, which empowers data analysts to use machine learning through existing SQL tools and skills, is a good case in point. BigQuery ML allows analysts to build and evaluate ML models and accelerate model development and innovation by removing the need to export data from the data warehouse. Instead, BigQuery ML brings ML to the data.
Analytics India Magazine (AIM) got in touch with Abhishek Kashyap, Head of Product, Google Cloud BigQuery AI, to get a glimpse of how Abhishek and his team operate.
Abhishek has a bachelor’s degree in Electrical Engineering from the Indian Institute of Technology, Delhi, and a Master’s and PhD in Electrical and Computer Engineering from the University of Maryland, College Park. His research areas include computer networks, image processing, and graph theory.
AIM: How did your fascination with algorithms begin?
Abhishek: My fascination with algorithms began during my internship with IBM Research at IIT Delhi in 2000, where we worked on power-saving algorithms for Bluetooth devices. Bluetooth was not yet mainstream, and power management was critical for its success. I continued working on algorithms during my Master’s and PhD, as well as at Lucent Bell Labs. A few years later, when I started MarianaIQ for AI-based personalised marketing, I really got into data science. The most interesting and challenging data science component was automating it across customers without professional services: AI as a true Software as a Service. At Google Cloud, I have been focused on providing tools that let anyone build quality machine learning models quickly.
Currently, I am the product management lead for making BigQuery the best intelligent data warehouse platform. I started with making it a successful platform for machine learning, and I am now working on adding natural language capabilities for analytics democratization.
AIM: What books and other resources have you used in your journey?
Abhishek: I highly recommend two online courses that got me started, as they both provide a solid foundation for data science: Statistical Learning by Prof. Robert Tibshirani and Prof. Trevor Hastie at Stanford, and Learning from Data by Prof. Yaser Abu-Mostafa at Caltech.
AIM: What were the initial challenges and how did you address them?
Abhishek: Let’s look at our early challenges at each stage of the machine learning process:
- Training data: We almost always had a very small amount of training data, as we focused on B2B buyers and there aren’t that many in most companies’ databases. We ended up doing a lot of manual bootstrapping in the early days, and learnt from the manual process to automate it for our use case.
- Extensible modeling: As I mentioned above, we built our models to be served as a SaaS rather than having a data scientist customize them for each client. To achieve that, we applied a carefully curated mix of unsupervised learning, not-too-complex classification (due to small training data sizes), and NN embeddings that removed the need for a lot of feature engineering.
- Quality test suites: Model improvement is hardly ever universal for all data sets. Thus, we had to continuously update our quality data tests to ensure new models do not perform worse on any important data sets.
- Pipelines and ML Ops: There were no standard tools back then, so we had to build our own for data-ML pipelines and ops.
- Explainability: Our clients wanted to know why we made a certain recommendation, and we had to experiment with the tools available back then, like LIME. Unfortunately, that area was still nascent, and we could not get to a satisfactory answer at the time. Today, there are very credible algorithms like SHAP, which make it much easier to explain predictions.
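The quality test suite idea from the list above can be sketched as a simple regression gate. This is a minimal illustration, not the tooling Abhishek's team built: the function name `find_regressions`, the dataset names, the scores, and the tolerance are all hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the datasets on which the candidate model scores worse than the baseline."""
    regressions = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        # A dataset with no candidate score is treated as a regression,
        # so an important dataset cannot be skipped silently.
        if cand_score is None or cand_score < base_score - tolerance:
            regressions.append(name)
    return sorted(regressions)


# Made-up scores for three hypothetical client datasets.
baseline_scores = {"client_a": 0.81, "client_b": 0.77, "client_c": 0.90}
candidate_scores = {"client_a": 0.84, "client_b": 0.74, "client_c": 0.90}

regressed = find_regressions(baseline_scores, candidate_scores, tolerance=0.01)
print(regressed)
```

In this sketch the candidate improves on one dataset but is flagged because it drops on another, which is the situation the bullet on quality test suites describes.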
AIM: How do you approach any data science problem?
Abhishek: I always approach a data science problem as a business problem that needs to be solved, which can then be translated into a label and features. Always keeping the business problem in mind results in a much better intuition for feature engineering, and for the types of models or chaining that would be required. Beyond that, it’s the standard but not-always-followed advice: start with a simple model, experiment, and choose the simplest among the models with acceptable accuracy. There is a penalty in going complex, and accuracy improvements need to be significant to justify that penalty.
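The advice to choose the simplest model with acceptable accuracy can be written down as a small selection rule. The sketch below is one possible reading of it: the `Candidate` class, the integer complexity ranks, the accuracy figures, and the threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    complexity: int   # hypothetical rank: lower means simpler
    accuracy: float   # validation accuracy in [0, 1]


def pick_simplest_acceptable(
    candidates: List[Candidate], min_accuracy: float
) -> Optional[Candidate]:
    """Return the least complex candidate that meets the accuracy bar, or None."""
    acceptable = [c for c in candidates if c.accuracy >= min_accuracy]
    if not acceptable:
        return None
    # Prefer lower complexity; break ties by higher accuracy.
    return min(acceptable, key=lambda c: (c.complexity, -c.accuracy))


# Made-up results for three model families.
models = [
    Candidate("logistic_regression", complexity=1, accuracy=0.86),
    Candidate("boosted_trees", complexity=2, accuracy=0.88),
    Candidate("deep_neural_net", complexity=3, accuracy=0.885),
]

choice = pick_simplest_acceptable(models, min_accuracy=0.85)
print(choice.name if choice else "no model meets the bar")
```

Raising `min_accuracy` is how a business requirement would push the choice toward a more complex model. Otherwise the simplest acceptable one wins.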
AIM: What does your machine learning toolkit look like?
Abhishek: I am currently the product manager for BigQuery ML, so I end up using it by default. With it I am able to create my first model within 10 minutes, and iterate really fast. I do not see a need for learning more complex ML libraries, when I can do most of it with SQL.
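To give a sense of what a first model in SQL looks like, here is a minimal sketch that assembles a BigQuery ML `CREATE MODEL` statement as a string. The helper `build_create_model_sql` and the model, table, and column names are hypothetical. Running the statement would require a BigQuery project and is not shown here.

```python
def build_create_model_sql(
    model_name: str,
    source_table: str,
    label_column: str,
    model_type: str = "LOGISTIC_REG",
) -> str:
    """Assemble a BigQuery ML CREATE MODEL statement (illustrative helper only)."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# Hypothetical dataset, model, and label names.
sql = build_create_model_sql(
    model_name="my_dataset.churn_model",
    source_table="my_dataset.customer_features",
    label_column="churned",
)
print(sql)
```

The string formatting is only suitable for trusted, hard-coded names like the ones above. It is not meant to handle user-supplied input.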
AIM: There is a lot of hype around AI and ML. Which domain do you think will come out on top in the next 10 years?
Abhishek: Let us break it into two areas: custom models, and embedding in applications. At a high level, AutoML modeling and easier interfaces will be mainstream.
For custom modeling, it would be as easy as doing analytics. People will not need to learn programming languages and frameworks for most AI models, and most models would use an AutoML framework. Key expertise will be in understanding the business problem, the data, how it needs to be formatted to create training rows and labels, and how to iterate on those based on the predictions and explanations. The magic to find the right, clean data and create training data rows still doesn’t exist.
When it comes to applications, I believe AI will be embedded in all applications where it can be useful, and people won’t even notice it. That is already the case in a lot of consumer applications, such as YouTube, Maps, and Spotify. It will make its way into the enterprise as well. From a data science point of view, it will be enabled by AutoML. AutoML will improve and become cost-effective enough for a wide variety of applications to be AI-enabled easily.
AIM: What’s your advice to data science aspirants?
Abhishek: My advice would be to build a strong foundation for machine learning by learning both the practical and theoretical aspects. Otherwise, it can be difficult to make fast progress, or to debug a model in order to improve it.
Additionally, I would recommend the two online courses mentioned earlier, as well as the Machine Learning Design Patterns book by my colleagues Lak Lakshmanan, Sara Robinson, and Michael Munn.