“GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image?” - Geoff Hinton
Humans have an innate ability to guess the size, shape and type of an object from a partial view. The brain fills in the unknowns in a fraction of a second. To mimic such operations, machines need far more real estate and many millions of dollars (think GPUs, data centers, funding). So cutting down on memory or, in the AI context, on the amount of training data is of great significance. This is what Turing Award recipient Geoffrey Hinton of Google Research wants to do: make neural networks smarter so that they understand vision and language better.
According to Dr. Hinton, the obvious way to represent the part-whole hierarchy is by combining dynamically created graphs with neural network learning techniques. But, if the whole computer is a neural network, he explained, it is unclear how to represent part-whole hierarchies that are different for every image, if we want the structure of the neural net to be identical for all images.
Capsule networks, introduced by Dr. Hinton a couple of years ago, offer a solution: a group of neurons, called a capsule, is set aside for each possible type of object or part in each region of the image. However, the problem with capsules is that they use a mixture to model the set of possible parts. The computer will have a hard time answering questions such as “Is the headlight different from the tyres?” (more on this example in the next section).
To address all these challenges and show how neural networks can comprehend part-whole representations, Dr. Hinton introduces an idea called “GLOM”, inspired by biology, mathematics and neural scene representations.
Recent work on neural fields offers a simple way to represent the depth or intensity values of an image. A neural field is a neural network that takes as input a code vector representing the image, along with an image location, and outputs the predicted value at that location. In the GLOM architecture, by comparison, the scene-level top-down neural network converts the scene vector and the image location into an appropriate object vector for that location, which includes information about the 3-D pose of the object relative to the camera.
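To make the neural field idea concrete, here is a minimal NumPy sketch: a tiny network that receives a code vector plus an (x, y) location and returns one predicted value per location. The layer sizes, the `tanh` nonlinearity, the random untrained weights and the names `init_field` and `neural_field` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def init_field(code_dim, hidden_dim, seed=0):
    """Create random (untrained) weights for a two-layer field network."""
    rng = np.random.default_rng(seed)
    in_dim = code_dim + 2  # code vector plus an (x, y) image location
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, 1)),
        "b2": np.zeros(1),
    }

def neural_field(params, code, locations):
    """Predict one value (e.g. depth or intensity) at each location.

    code:      (code_dim,) vector describing the whole image
    locations: (n, 2) array of image coordinates
    """
    n = locations.shape[0]
    # The same code vector is paired with every queried location.
    tiled_code = np.broadcast_to(code, (n, code.shape[0]))
    inputs = np.concatenate([tiled_code, locations], axis=1)
    hidden = np.tanh(inputs @ params["w1"] + params["b1"])
    return (hidden @ params["w2"] + params["b2"]).squeeze(-1)

# Hypothetical usage: one code vector, three query locations.
params = init_field(code_dim=8, hidden_dim=16)
code = np.ones(8)
locations = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
values = neural_field(params, code, locations)
print(values.shape)
```

In GLOM's top-down direction the output would be an embedding vector for the level below rather than a single scalar, but the input pattern (a shared vector plus a location) is the same.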
In GLOM, wrote Dr. Hinton, a percept really is a field, and the shared embedding vector that represents a whole really is very different from the shared embedding vectors that represent the parts. Here an embedding can be thought of as a relatively low-dimensional space into which high-dimensional vectors can be translated.
Let’s use the example of a car to understand Dr. Hinton’s idea of part-whole representations in neural networks. A car has many distinct parts, and they become more specific with each brand. The front grille of a Rolls-Royce or the tail lights of an Audi have their own signature styles. We humans need only a glimpse of these details to guess the object. Similarly, Dr. Hinton’s idea is to construct a tree that branches out into layers that contain increasingly abstract information. If one were to take a 2D image of a car and run it through the neural network, the GLOM method would construct a column of layers on every patch of the image. For example, the side of a car displays tyres, door handles and so on. So a column of information layers will have information stacked from color density to texture, from shape to relative positioning and more. Now imagine multiple interconnected columns all over the picture; these connections between columns are where the attention mechanism comes into the picture. In an ideal case, if the neural network gets a taste of even a single layer of the first column, it should be able to guess what it is looking at!
(Image credits: Paper by Hinton)
The above illustration is a picture of the embeddings at a particular time in six nearby columns. All of the locations belong to the same object, and the scene level has not yet settled on a shared vector. To show the complete embedding vector for each location, the vector is divided into a separate section for each level in the part-whole hierarchy, and the high-dimensional embedding vector for each level is then drawn as a 2-D vector. According to Dr. Hinton, this makes it easy to illustrate the alignment of the embedding vectors of different locations.
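The layout described for the illustration can be mimicked with a small array: one row per column (location), one section per level, and a 2-D vector per section. The sketch below measures how well the columns agree at each level using mean pairwise cosine similarity. The number of levels, the index chosen for the object level, the random values and the helper `level_agreement` are assumptions made for illustration only.

```python
import numpy as np

def level_agreement(columns):
    """Mean pairwise cosine similarity between locations, per level.

    columns: (n_locations, n_levels, dim) array of embedding vectors.
    Returns an array of shape (n_levels,); 1.0 means every location
    holds a vector pointing in exactly the same direction.
    """
    unit = columns / np.linalg.norm(columns, axis=-1, keepdims=True)
    # sims[l, i, j] = cosine similarity of locations i and j at level l
    sims = np.einsum("ild,jld->lij", unit, unit)
    n_locations = columns.shape[0]
    off_diagonal = ~np.eye(n_locations, dtype=bool)  # skip self-similarity
    return sims[:, off_diagonal].mean(axis=1)

rng = np.random.default_rng(0)
n_locations, n_levels, dim = 6, 5, 2  # six columns; level count is assumed
columns = rng.normal(size=(n_locations, n_levels, dim))

OBJECT_LEVEL = 3  # hypothetical index of the object level
# All six locations belong to the same object, so they share one vector here.
columns[:, OBJECT_LEVEL, :] = np.array([1.0, 0.5])

agreement = level_agreement(columns)
print(agreement)
```

The object-level entry comes out at 1 (up to rounding) because every column holds the same vector there, while the scene-level entry stays lower, matching the description of a scene level that has not yet settled.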
Though Dr. Hinton doesn’t consider his paper to describe a “working idea”, GLOM picks up the best of other works. There is a tinge of convolutional neural networks, the attention mechanism from transformers and other ideas at play. To understand GLOM better, we can compare it with CNNs, the de facto standard for many computer vision applications:
- GLOM only uses 1×1 convolutions (except at the front end).
- Rather than using a single feed-forward pass through the levels of representation, GLOM iterates to allow top-down influences that are implemented by neural fields.
- GLOM includes contrastive self-supervised learning and performs hierarchical segmentation as a part of recognition rather than as a separate task. So, no more boxes.
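The iterative behaviour in the list above can be sketched in a few lines of NumPy. This is a loose illustration rather than the paper's trained model: the bottom-up and top-down networks are replaced by single random matrices (hypothetical stand-ins), the top-down path ignores image location for brevity, and the equal weighting of the contributions is an assumption. Applying the same matrix at every location plays the role of a 1×1 convolution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def glom_step(state, bottom_up, top_down, beta=1.0):
    """One update of every level in every column.

    state:     (n_locations, n_levels, dim) embeddings at the previous step
    bottom_up: (n_levels, dim, dim) placeholder weights, level-1 -> level
    top_down:  (n_levels, dim, dim) placeholder weights, level+1 -> level
    """
    n_levels = state.shape[1]
    new_state = np.empty_like(state)
    for level in range(n_levels):
        emb = state[:, level, :]
        # Attention between columns at the same level: similar vectors
        # pull each other together.
        weights = softmax(beta * emb @ emb.T, axis=-1)
        contributions = [emb, weights @ emb]
        if level > 0:  # bottom-up prediction from the level below
            contributions.append(np.tanh(state[:, level - 1, :] @ bottom_up[level]))
        if level < n_levels - 1:  # top-down prediction from the level above
            contributions.append(np.tanh(state[:, level + 1, :] @ top_down[level]))
        new_state[:, level, :] = np.mean(contributions, axis=0)
    return new_state

rng = np.random.default_rng(0)
n_locations, n_levels, dim = 6, 5, 4  # illustrative sizes
state = rng.normal(size=(n_locations, n_levels, dim))
bottom_up = rng.normal(0.0, 0.5, (n_levels, dim, dim))
top_down = rng.normal(0.0, 0.5, (n_levels, dim, dim))

settled = state
for _ in range(10):  # iterate instead of a single feed-forward pass
    settled = glom_step(settled, bottom_up, top_down)
print(settled.shape)
```

Each pass reads only the previous step's embeddings, so all levels and all columns are updated together, which is what lets top-down and bottom-up information meet over several iterations.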
“The difference between science and philosophy is that experiments can show that extremely plausible ideas are just wrong and extremely implausible ones, like learning an entire complicated system by end-to-end gradient descent, are just right.” - Geoff Hinton
Dr. Hinton’s “single idea” paper is a much-needed break from the hundreds of SOTA-chasing works on arXiv. Publishing ideas just to ignite innovation is rare in almost all scientific domains, and this paper might encourage others to put forward their own crazy ideas. “This paper has gone on long enough already,” quipped Dr. Hinton as he moved to the philosophical aspects of his idea towards the end of the paper.