“I asked Geoffrey Hinton to take me on as a student, but he said he was too busy.”
For this week’s ML practitioner’s series, Analytics India Magazine (AIM) got in touch with former Google Brain researcher Navdeep Jaitly. From leaving IIT to pursuing a Liberal Arts Degree, from switching between AI and computational biology to becoming a student of Geoff Hinton, Navdeep’s journey in the world of cutting-edge research has all the twists and turns to give Spielberg’s Catch Me If You Can a run for its money.
AIM: Tell us a bit about your educational background.
Navdeep: I started my undergraduate education in Engineering at IIT Delhi, but moved to the US in my second year to pursue a Liberal Arts Degree at Hanover College which offered me an academic scholarship. There, I completed a double major in Mathematics and Computer Science and also completed most of the premedical requirements. I finished my graduation early, moved to Canada, where my parents had moved, and worked for almost a year as a junior telecommunications analyst before starting my Masters at the University of Waterloo in Canada. I signed on to do Artificial Intelligence but found my supervisor’s specialization — planning and constraint satisfaction — particularly uninteresting from the standpoint of the data that was involved. Fortunately, I was taking a Computational Biology Research seminar, focusing on sequence assembly from shotgun sequencing and other string algorithms for biological problems, and I fell in love with that. I had always found Genetics fascinating since high school days, and this let me combine biology and computing.
Also, the first release of the Human Genome sequence was imminent (this was 1999), and it seemed like a momentous time to work on that. So, I switched into the Computational Biology group and worked as a researcher in that area for about eight years at a biotech startup and in the U.S. national lab system, before I went back to get my PhD degree in Computer Science at the University of Toronto. I started with the Computational Biology group, but in a reversal of my direction from my Masters, this time I ended up switching to the Machine Learning group and getting my PhD in that. I should say that somewhere in the middle, while I was working, I also met half of the requirements for a Masters in Statistics through an excellent online offering at Texas A&M, but I stopped pursuing that when I went back for my PhD in Computer Science. So, I have clearly had a winding academic journey, much to my spouse’s dismay.
AIM: How did your fascination with algorithms begin?
Navdeep: I had been developing statistical methods to characterise the quality of results from my signal processing algorithms for high throughput proteomics as part of my research at the Pacific Northwest National Lab. At first, I was educating myself from books I could find on Bayesian analysis (I particularly enjoyed Bayesian Data Analysis, by Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin), but I liked it so much that I enrolled in a new online offering at Texas A&M for a Masters in Statistics. So I was operating in a space adjacent to Machine Learning at this time, but Machine Learning only really entered my consciousness in a large way when I joined the University of Toronto in 2008 to get my PhD in Computational Biology.
The Computational Biology group was right next door to the ML group, which is where I first discovered Geoffrey Hinton, Richard Zemel and Radford Neal and their fantastic research. Until this point, I hadn’t really considered that images, text and speech could be modeled with statistical models. It was quite an epiphany for me: when I went to my first class in Machine Learning at the University of Toronto, Geoffrey Hinton was talking about images, maximum likelihood, and sampling algorithms that could generate images (using Deep Belief Networks). As soon as I saw a Deep Belief Network model generate an image of MNIST digits, by sampling activations hierarchically from the highest layers to the lowest, I knew I had to find out more. Later in the class, I asked Geoffrey Hinton to take me on as a student, but he said he was too busy. However, by the end of the class, he changed his mind; it turns out he likes students who did well in mathematics competitions such as the Putnam math competition and Mathematics Olympiads, which I had.
AIM: What were the initial challenges and how did you address them?
Navdeep: My first project at Geoff Hinton’s lab was to use Deep Belief Networks to separate waveforms into speech from different speakers. That project morphed into a related idea where Deep Belief Networks were used to find features that could be used in a speech recognizer. My lab mates George Dahl and Abdel-Rahman Mohamed had just released breakthrough results showing that Deep Belief Networks (DBNs) worked really well for pretraining neural networks that could be used for speech recognition. The inputs to their models were features that the speech community had developed several decades ago, called Mel-Frequency Cepstral Coefficients (MFCCs). I wrote a paper showing how the features that were found from raw waveforms using DBNs outperformed MFCCs (“Learning a better representation of speech sound waves using restricted Boltzmann machines” in ICASSP 2011). On the heels of this paper, I interned at Google and implemented a method for discriminative training of neural network speech models developed by Brian Kingsbury at IBM, merging it with George and Abdel-Rahman’s successful method. Initially, I was given a smaller toy dataset: the Deep Learning revolution hadn’t yet started (in the summer of 2011, AlexNet was still a year away), and researchers at Google were quite skeptical that Deep Neural Networks would work on large scale data. Back then, they believed that linear models trained on very large datasets would be hard to beat with complex, non-linear models such as neural networks.
The project had many challenges. The speech recognition system was built for Gaussian Mixture Model–Hidden Markov Model (GMM-HMM) systems, not neural networks. I had to spend a lot of time understanding how the Google speech recognizer worked, to find out how to plug neural network predictions into it. Further, neural network training is very slow without GPUs, and Google had no GPUs in the data centers back then. We had bought a machine with 4 GPUs that sat next to the printer and whirred away. Training one epoch of our model on this single machine took one whole day for our larger datasets. I was worried that either my code would break (we didn’t have TensorFlow or other frameworks available back then; I was just using my own code) or the machine would shut off, making me lose time in the short internship.
Meanwhile, we were up against baselines that ran with 10x more data, using Google’s cloud infrastructure (called Borg) on thousands of machines. However, within a few weeks I had better results than the best GMM-HMM model Google had built on this data, reducing word error rate from 23% down to 18%. We then started working on larger datasets from Voice Search and YouTube. By the end of the internship we had improved results by 15%, significantly better than the typical improvements over a year. Andrew Senior, a researcher in the group, later increased the gains to 30% by using more data, after my internship finished (“Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition” by Jaitly et al., in Interspeech 2012). The internship had a huge impact in the speech group at Google. Shortly after, Google improved the inference speed and a version of our model was put into production and used for transcription of Google Voice Search queries. By the next year, neural networks were the norm in the speech group (“Genius Makers” by Cade Metz has a nice story about this work). One of the interesting things about neural networks is that they are often driven by a relentless belief of the practitioners in their success. I too had no doubt that they would work, and so I kept at it until I got the results I wanted. I think it is often the case with new technologies that early practitioners have an almost irrational belief in their technology; that certainly helped our early work.
After that internship, I worked on the speech recognition problem for several years, with the goal of having a single neural network replace the entire complex system (which had the neural network as just one component). The quest was helped by the success of sequence-to-sequence models within Google Brain and outside, and through collaboration with incredible researchers like William Chan.
AIM: How do you approach any machine learning problem?
Navdeep: Sadly, I seem to view the world through an ML lens. Whenever someone describes a system or a use case they are working on, I’m automatically drawn to think about how machine learning could make it better. And, if machine learning is being used already, my thought experiments run wild on how the current ML solution can be improved further. Sometimes, the data required for a model just doesn’t exist, and cannot be gathered with reasonable resources before the system becomes available. Other times the use case doesn’t offer the computational resources needed for a fancy solution. Both of these can be mitigated. For the former, one might set up a self-supervised system, which trains on its own data and improves over time, starting with a simple model. In the latter case, one can use a wide variety of techniques to train models with a low compute cost. Other times, when the data does exist, the question to ask is: do we really need ML for this?
A good rule of thumb is that the data needs to be complicated enough to need machine learning, because an appropriate analytical understanding of the data cannot be had. If analytical characterizations exist that are correct, it’s probably better to find ways to use them. Another question to ask is: is the structure of the data such that large amounts of data will reveal patterns that cannot be modeled by other means? For example, you don’t need (and shouldn’t use) machine learning to compute the mean of numbers. Sometimes you have an interesting case where approximate analytical understanding exists for a domain, but there is a limit to its precision. If this domain has underlying structure, I have found that Machine Learning can be very useful for uncovering that structure when large amounts of data exist. This is a situation where you can start with a simple model based on an analytic understanding, but over time you can replace it with more complicated machine learning models that can pick up on emergent regularities that aren’t understood or known.
“ML is a field that requires both mathematical thinking and an attitude of tinkering at the same time.”
Generally speaking, for building machine learning solutions to problems, I have found it pretty useful to follow a graduated schedule of putting together something very basic and straightforward, and using it as a basis for improved systems. For example, you might want to build only a couple of simple ML models into your system at the start. But as you gain experience, you might supplant various components with ML replacements. Eventually you might consider an alternative where the entire system is a single ML model.
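As a rough illustration of the graduated approach described above, the following Python sketch starts from an approximate analytic model and adds a small learned correction only if it reduces error on held-out data. Everything in it is an assumption made for illustration: the synthetic data, the analytic rule `y ≈ 2x`, and the helper names `analytic_model`, `fit_residual_line` and `corrected_model` are hypothetical and do not come from the interview.

```python
import random


def analytic_model(x):
    # Hypothetical approximate analytic understanding: y is roughly 2 * x.
    return 2.0 * x


def fit_residual_line(xs, residuals):
    # Ordinary least squares fit of: residual ~ slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, residuals))
    slope = cov / var_x
    intercept = mean_r - slope * mean_x
    return slope, intercept


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption): the true process deviates from the
# analytic rule by an extra linear term plus a little noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 * x + 0.3 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

# Hold out the last 50 points to judge whether the learned part helps.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Learn only what the analytic model leaves unexplained.
residuals = [y - analytic_model(x) for x, y in zip(train_x, train_y)]
slope, intercept = fit_residual_line(train_x, residuals)


def corrected_model(x):
    # Analytic baseline plus the learned residual correction.
    return analytic_model(x) + slope * x + intercept


baseline_mse = mean_squared_error(analytic_model, test_x, test_y)
corrected_mse = mean_squared_error(corrected_model, test_x, test_y)

# Adopt the learned component only when the held-out error improves.
use_learned = corrected_mse < baseline_mse
print(f"baseline MSE: {baseline_mse:.4f}, corrected MSE: {corrected_mse:.4f}")
```

The residual fit here is deliberately the simplest possible learned component. In a real system it could later be swapped for a richer model, with the same held-out comparison deciding whether the swap is worthwhile.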
AIM: What does your machine learning toolkit look like?
Navdeep: During my PhD days I had my own ML toolkit (as every researcher in ML did in those days), implementing most of the algorithms I used. But now the community has so many great alternatives that I rarely start anything from scratch; instead, my job is often to figure out which toolkit works best for which task. For Deep Learning models, I now mostly use TensorFlow. I’ve experimented with PyTorch and minimally with JAX, and I like them both a lot as well. For more classical ML models, I have played with scikit-learn (sklearn), which is quite extensive in its offerings now. For programming languages, I use Python almost exclusively, unless I hit some speed bottleneck that can be solved by going to C++. In that case, I often fall back to ctypes, but there are much better alternatives now for speedups, e.g. Cython, Numba and Boost.Python, to name a few. I will also fall back on C++ for fast inference where precise timing guarantees are needed. In terms of Cloud, I’ve used Google Cloud, which works quite nicely. I’ve also looked at AWS, which looks great. To be honest, I think both platforms offer various tools and workflows to help you along.
AIM: We see a lot of hype around AI. Which domain of AI, do you think, will come out on top in this decade?
Navdeep: A lot of hype has indeed been generated around AI and ML focusing on imminent doom from machines taking over. Much less attention is paid to the fact that these methods have indeed revolutionized how we interact with machines. When I started working on using Deep Learning to improve Speech recognition, some of my friends liked to show me how poorly the recognizers worked for them and expressed the opinion that there was no real revolution in play. However, over time these things have improved to a point that my kids are often getting their questions answered by Google Home or Siri, rather than by their parents. And it is my opinion that this is just the start. The algorithms are improving continually still, and their application is percolating to more and more domains and more and more tasks.
I think language modeling is another domain that has been forever changed by the current progress in Deep Learning. GPT-3 from OpenAI shows what amazing things the current best-in-class models such as transformers can produce. Conversational agents are an area where we should see great progress with such language models, and this will again really change how we interact with machines.
Machine vision has also been transformed and its ramifications are visible in the improvements we see in self-driving technology, and also in better understanding of the contents of the media we capture with our phones. I think in the next 10 years, we will see the impact of these methods in the changed way we deal with machines. Every time I find that my phone has learned a new trick, or a new interactive behavior, I’m really excited to see where we will be in 10 years.
AIM: What would your advice be to aspirants who want to get into ML jobs?
Navdeep: ML is a field that requires both mathematical thinking and an attitude of tinkering at the same time. For people wanting to pursue data science and ML roles, I recommend getting your hands on data, trying things out and developing an intuition for data. Amazingly, data seems to show similar patterns from one domain to another, and turning the knobs in one domain helps with turning the knobs in another. Because of this almost vocational aspect to this field, my office mate and I often described ourselves as Neural Network technicians, and not Neural Network scientists. At the same time, understanding the underpinnings of why things work requires developing an understanding of the parallels between Machine Learning and model fitting, and of the role of optimisation algorithms in this. So I would also encourage people to spend time reading about the range of machine learning methods and models (I’m quite partial to Chris Bishop’s Pattern Recognition and Machine Learning as a starting book) and optimisation algorithms (e.g. Numerical Optimization by Jorge Nocedal and Stephen Wright).
AIM: What books and other resources have you used in your journey?
Navdeep: To be honest, books aren’t often the best way to understand this field — most of the cutting edge stuff is in the papers and the field is constantly moving. Nevertheless, the following books gave me real mileage in understanding the underlying concepts of Machine Learning:
- Pattern Recognition and Machine Learning by Chris Bishop
- The Elements of Statistical Learning by Robert Tibshirani
- An Introduction to Statistical Learning by Trevor Hastie
- Information Theory, Inference and Learning Algorithms by David MacKay
- Pattern Classification by David G. Stork, Peter E. Hart, and Richard O. Duda
A lot of people also find Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville quite useful, but I haven’t read it myself, since I was mostly only reading papers by the time Ian and co-authors released the book.