After the 2011 Cricket World Cup final, Indian captain MS Dhoni was asked why he had promoted himself up the batting order (even though he had not been at his best throughout the campaign). Though the decision was risky, Dhoni said, he wanted to handle the off-spinners tactfully and build pressure on the opponent, given that India had lost some key wickets early in the innings.
His decision was vindicated, and India won its second World Cup, 28 years after the first. Though Dhoni explained why he took the decision, the neural mechanism behind it can’t be rendered in language, owing to the complexity of the human brain.
“When someone asks ‘why do you think that?’, you can’t articulate the true reason: the billions of neuron firing patterns in your head that led you to your result. You project this complexity down into a low-dimensional language description,” said Russell Kaplan, head of Nucleus at Scale AI.
The holy grail of AI is to build a human-like brain. But we have only started to scratch the surface of the problem.
“It’s increasingly clear that language-aligned datasets are the rate limiter for AI progress in many areas. We see incredible results in text-to-image generation and image captioning in large part because the internet provides massive language<>image supervision for free,” said Kaplan.
For AI to produce accurate results, machine learning algorithms must be trained on the right datasets. Datasets form the basis for training machine learning models and play a foundational role in the advancement of the field. Models are generally trained on different kinds of datasets, such as numerical, time-series and text datasets. Of late, AI/ML models that generate images from text inputs have been hogging the limelight. These models are trained on large image datasets with corresponding textual descriptions, which results in higher-quality images and a broader range of descriptions they can handle. Examples include DALL·E 2, Imagen and Parti.
DALL·E 2, OpenAI’s new AI program, can create original, realistic images and art from a natural language description by combining concepts, attributes and styles. It can also make realistic edits to existing images based on a natural language caption. Google’s Imagen is a text-to-image model similar to DALL·E 2.
Left image text input: The Toronto skyline with Google Brain logo; right image text input: Dragon fruit wearing a karate belt.
“By scraping image+caption pairs, you can create a strong self-supervised objective to relate images and text: make image and text embeddings similar only if they’re from the same pair. But most data modalities don’t come with this language alignment,” Kaplan said.
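The objective Kaplan describes can be illustrated with a small sketch. This is not the training code of any particular model; it is a minimal, pure-Python version of a symmetric contrastive loss, and the function names, toy embeddings and temperature value are all assumptions made for illustration. The loss is low when each image embedding is most similar to its own caption and high when the pairs are mismatched.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every caption in the batch."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts] for img in images]

def diagonal_cross_entropy(logits):
    """Average cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric loss: match images to captions and captions to images."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy batch (hypothetical 2-D embeddings): caption i belongs to image i.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
shuffled_text_embs = [text_embs[1], text_embs[0]]

loss_aligned = contrastive_loss(image_embs, text_embs)
loss_shuffled = contrastive_loss(image_embs, shuffled_text_embs)
print(loss_aligned < loss_shuffled)
```

In practice the embeddings would come from trained image and text encoders, and the loss would be minimised over very large batches of scraped image and caption pairs.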
DALL·E 2 and Imagen use “diffusion” to generate images based on text inputs. Diffusion is a process in which the model learns to convert a pattern of random dots into images. Diffusion models have seen success in image and audio tasks such as enhancing image resolution, recolouring black-and-white photos, editing parts of an image, and text-to-speech synthesis. Google’s Pathways Autoregressive Text-to-Image (Parti) is an autoregressive text-to-image generation model that first converts a collection of images into a sequence of code entries, similar to puzzle pieces. A given text prompt is then translated into these code entries, creating a new image.
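To make the idea of turning random dots into an image more concrete, here is a rough sketch of the noising step that diffusion models are commonly built around. It assumes the widely used formulation in which a noisy sample is a weighted mix of the clean image and Gaussian noise; the article does not say which exact formulation DALL·E 2 or Imagen use, and the names and the value of `alpha_bar` are hypothetical. In a real model, a neural network learns to predict the noise. Here the true noise is handed back in, purely to show that a correct noise prediction recovers the image.

```python
import math
import random

def add_noise(pixels, alpha_bar, rng):
    """Mix an image with Gaussian noise; a smaller alpha_bar means a noisier result."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert the mixing step, given a prediction of the noise that was added."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]

rng = random.Random(0)            # fixed seed so the run is repeatable
image = [0.2, 0.8, -0.5, 1.0]     # a toy four-pixel "image"
alpha_bar = 0.3                   # assumed noise level for illustration

noisy, noise = add_noise(image, alpha_bar, rng)
# Stand-in for the trained network: pretend it predicted the noise perfectly.
recovered = remove_noise(noisy, noise, alpha_bar)
print(all(abs(a - b) < 1e-9 for a, b in zip(image, recovered)))
```

Generation runs this idea in reverse over many small steps, starting from pure noise and conditioning each noise prediction on the text prompt.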
Language models at play
Language models play a crucial role in text-to-image generation. Take, for example, the case of Imagen. It builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Large language models (LLMs) are gaining a lot of traction these days. LLMs are powerful programs trained on enormous amounts of text data, sometimes at the petabyte scale.
The rising popularity of LLMs is due to several factors: a single model can be used for multiple tasks such as text generation, image generation, document summarization and translation; they can make decent predictions based on a few labelled examples; and their performance continues to improve as more data and parameters are added. OpenAI’s GPT-3 and Megatron-Turing Natural Language Generation (MT-NLG), developed by Microsoft and Nvidia, are some of the popular LLMs.
Text-aligned datasets are driving huge strides in text-to-image generation. However, this is not true of domains like games, medical diagnostics and economic data, where AI models can suggest a particular move or action but fail to justify it. “Software actions, work tasks, healthcare, economic data, games… think about all the domains where we do *not* have this language-aligned training data, and what would be possible if we created that data,” Kaplan said.
For example, AlphaZero, DeepMind’s AI-driven computer program, suggests the best possible chess moves based on the vast number of self-play games it has been trained on. However, it cannot explain why a given move is supposedly the best.
“Of course, such an explanation won’t be perfectly accurate. Any explanation in language is a low-dimensional projection of what’s really happening inside AlphaZero’s torrent of matrix multiplies. But the same is true when we use language to describe our own thought processes,” Kaplan said.
Take another example: a diagnostic tool for identifying cardiovascular diseases, developed by a team of researchers from Nanyang Technological University, Singapore (NTU Singapore), Ngee Ann Polytechnic, Singapore (NP), and the National Heart Centre Singapore (NHCS). The machine learning algorithm, called Gabor-Convolutional Neural Network (Gabor-CNN), can recognise patterns in patients’ ECGs and predict coronary artery disease, myocardial infarction and congestive heart failure. However, it cannot tell why a certain patient developed the disorder.
The inexplicability arises from the fact that the datasets used in the above cases are not language aligned.
Language-aligned datasets hold a lot of promise concerning progress in ML interpretability. “Language-aligned datasets are the key to step-change progress on ML interpretability and neural networks helping with more and more problems. They will also help neural networks work *with* people instead of just replacing them,” said Kaplan.