In a major breakthrough, researchers at OpenAI have discovered neurons within an artificial neural network that behave like neurons observed in the human brain. These multimodal neurons were found in CLIP, one of the most advanced vision models to date.
The researchers have found these advanced neurons can respond to a cluster of abstract concepts centred around a common high-level theme rather than a specific visual feature. Like their biological counterparts, these neurons can respond to a range of emotions, animals, photographs, drawings and famous people.
The researchers wrote that these neurons in CLIP respond to the same concept, whether it is presented literally, symbolically, or conceptually.
The multimodal neurons were discovered in CLIP, a model that connects text and images by learning visual concepts from natural language supervision. This general-purpose vision system matches the performance of a ResNet-50 and outperforms existing vision systems on some of the most challenging datasets. For instance, one neuron, dubbed the ‘Spider-Man’ neuron, responds to an image of a spider, an image of the text ‘spider’, and the comic book character ‘Spider-Man’.
The researchers found multimodal neurons in several CLIP models of varying sizes, but they focused on studying the mid-sized RN50-x4 model. Researchers employed two tools to understand the activations of the model:
- Feature visualisation, which maximises the neuron’s firing by doing gradient-based optimisation on the input.
- Dataset examples, which look at the distribution of maximally activating images for a neuron from a dataset.
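To make the two tools concrete, here is a minimal sketch in Python. It does not use CLIP or the researchers' tooling: the "neuron" is a toy tanh unit, and `TOY_WEIGHTS`, `TOY_DATASET` and the `img_*` names are hypothetical stand-ins for a real model and real images. It only illustrates the idea of optimising an input by gradient ascent and of ranking dataset items by activation.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def activation(weights, x):
    """Toy neuron: tanh of a weighted sum of the input (not a real CLIP unit)."""
    return math.tanh(dot(weights, x))


def feature_visualisation(weights, steps=500, lr=0.5, seed=0):
    """Gradient ascent on the input, re-normalised each step so it stays bounded."""
    rng = random.Random(seed)
    x = normalise([rng.uniform(-1.0, 1.0) for _ in weights])
    for _ in range(steps):
        a = activation(weights, x)
        # Derivative of tanh(w . x) with respect to x is (1 - tanh^2(w . x)) * w.
        grad = [(1.0 - a * a) * w for w in weights]
        x = normalise([xi + lr * g for xi, g in zip(x, grad)])
    return x


def top_activating_examples(weights, dataset, k=2):
    """Return the names of the k dataset items that activate the neuron most."""
    ranked = sorted(dataset, key=lambda item: activation(weights, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical toy data: four-dimensional "images" and one neuron's weights.
TOY_WEIGHTS = [0.3, -0.1, 0.05, 0.2]
TOY_DATASET = [
    ("img_a", [1.0, 0.0, 0.0, 0.0]),
    ("img_b", [0.0, 1.0, 0.0, 0.0]),
    ("img_c", [0.5, -0.5, 0.0, 1.0]),
    ("img_d", [-1.0, 0.0, 0.0, -1.0]),
]

if __name__ == "__main__":
    print(feature_visualisation(TOY_WEIGHTS))
    print(top_activating_examples(TOY_WEIGHTS, TOY_DATASET))
```

In a real network the gradient would come from automatic differentiation through the whole model, and the optimised input would be an image rather than a four-number vector.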
The researchers carried out a series of carefully constructed experiments to find these neurons’ unique capabilities in the convolutional layer. Each layer consists of thousands of neurons. “For our preliminary analysis, we looked at feature visualisations, the dataset examples that most activated the neuron, and the English words that most activated the neuron when rastered as images,” the researchers said. Many of these neurons deal with sensitive topics, from political figures to emotions.
The experiment revealed an incredible diversity of features such as region neurons, person neurons, emotion neurons, art style neurons, time neurons, abstract neurons, colour neurons and more.
The researchers found that a majority of neurons in CLIP are readily interpretable. “From an interpretability perspective, these neurons can be seen as extreme examples of ‘multi-faceted neurons’ which respond to multiple distinct cases. Looking to neuroscience, they might sound like ‘grandmother neurons,’ but their associative nature distinguishes them from how many neuroscientists interpret that term,” the researchers stated.
The researchers also studied how these multimodal neurons offer insight into the way CLIP performs classification, for both images and text.
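CLIP-style classification works by embedding an image and a set of candidate text labels in a shared space, then choosing the label whose embedding lies closest to the image. The sketch below shows only that matching step, using cosine similarity. The three-number vectors in `TOY_LABELS` and `TOY_IMAGE` are hand-made, hypothetical values, not outputs of any real encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings):
    """Pick the text label whose embedding is most similar to the image embedding."""
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings; a real system would get these from image and text encoders.
TOY_LABELS = {
    "a photo of a spider": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.0, 0.1, 0.9],
}
TOY_IMAGE = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    label, scores = zero_shot_classify(TOY_IMAGE, TOY_LABELS)
    print(label)
```

Because the label set is just a list of text prompts, it can be swapped at inference time without retraining, which is what makes the approach zero-shot.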
Neural networks are loosely modelled on their biological counterparts. The drawback is that it is difficult to understand why a network makes a certain decision and how it arrives at a particular conclusion.
The researchers said that despite being trained on a curated subset of the internet, CLIP still inherits the internet's many unchecked biases and associations. “…we have discovered several cases where CLIP holds associations that could result in representational harm, such as denigration of certain individuals or groups,” the researchers stated. For instance, a “Middle East” neuron was associated with terrorism, and an “immigration” neuron responded to Latin America.
The researchers said these biases and associations would remain in the system despite fine-tuning and the use of zero-shot techniques. The CLIP findings are still evolving, and much research remains to be done before multimodal systems are well understood. In a bid to advance the area, the researchers have shared the tools, dataset examples, text feature visualisations, and more with the community.