How Goodhart’s Law Can Save Machine Learning

“When a measure becomes a target, it ceases to be a good measure.”

Goodhart’s Law

Stochastic Gradient Descent (SGD) has been responsible for many of the most outstanding achievements in machine learning. The objective of SGD is to optimise a target in the form of a loss function. But SGD fails on standard loss functions in some settings, as it converges to ‘easy’ shortcut solutions.
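As a minimal sketch of what SGD optimises, consider a hypothetical least-squares loss (an illustration, not tied to any example from the paper):

```python
import numpy as np

# Hypothetical least-squares loss: L(w) = mean((X w - y)^2)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless targets for this sketch

w = np.zeros(3)                      # initial parameters
lr = 0.01                            # learning rate
for _ in range(5000):
    i = rng.integers(len(X))         # sample one example: the "stochastic" part
    grad = 2 * (X[i] @ w - y[i]) * X[i]
    w -= lr * grad                   # step against the gradient of the loss

print(np.round(w, 2))                # close to true_w
```

Each update follows the gradient of a single example’s loss, which is exactly the “measure becomes a target” dynamic the rest of the article questions.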

As we see above, when classifying sheep, the network learns to use the green background to identify the sheep present. However, when it is shown an image of sheep on a beach instead, it fails altogether.


However, SGD is still the go-to solution as it has a fantastic track record in many ML applications. So, what is the benefit of finding diverse solutions? Why not stick with SGD or optimise the property we care about directly?

The answer to that is Goodhart’s law, which states: “When a measure becomes a target, it ceases to be a good measure.” Properties such as generalisation and zero-shot coordination, for instance, do not allow direct optimisation. So, researchers at Berkeley teamed up with FAIR, Oxford and NYU to come up with an alternative: the Ridge Rider algorithm.

Drawing inspiration from Goodhart’s adage that when a measure (here, following the gradient) becomes a target, it ceases to be a good measure, the researchers, in their blog post, set out to introduce the ridges approach and apply it to popular domains such as reinforcement learning and supervised learning.


Ridges are eigenvectors of the Hessian

Eigenvectors are conventionally normalised to unit vectors, meaning their magnitude is equal to 1, while eigenvalues are the coefficients applied to eigenvectors that give the vectors their length or magnitude.
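A quick sanity check of both definitions on a small symmetric matrix (NumPy’s `eigh` returns unit-length eigenvectors):

```python
import numpy as np

# A small symmetric matrix chosen for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)

v = eigenvectors[:, 0]              # first eigenvector (a column)
lam = eigenvalues[0]                # its eigenvalue

print(np.linalg.norm(v))            # magnitude 1
print(np.allclose(A @ v, lam * v))  # A scales v by exactly lam
```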

A Hessian matrix, or Hessian, is a square matrix of second-order partial derivatives. It helps to test whether a given point in space is a local maximum, a local minimum or a saddle point; a microcosm of all things optimisation in machine learning.
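The saddle-point test can be sketched on a toy function, f(x, y) = x² − y², whose Hessian we can write out by hand:

```python
import numpy as np

# Toy function f(x, y) = x^2 - y^2: the origin is a saddle point.
def hessian(x, y):
    # Second-order partials of f, written out by hand.
    return np.array([[2.0, 0.0],    # d2f/dx2,  d2f/dxdy
                     [0.0, -2.0]])  # d2f/dydx, d2f/dy2

eigs = np.linalg.eigvalsh(hessian(0.0, 0.0))
if np.all(eigs > 0):
    kind = "local minimum"
elif np.all(eigs < 0):
    kind = "local maximum"
else:
    kind = "saddle point"
print(kind)  # mixed eigenvalue signs -> saddle point
```

All eigenvalues positive means a minimum, all negative a maximum, and mixed signs a saddle.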

(This article assumes that the reader has a basic understanding of linear algebra and partial differentiation that includes eigenvalues and Hessian matrix as discussing these in detail is beyond the scope of this article.)

Ridge Rider (RR) Algorithm Applications

To find a diverse set of solutions, rather than the shortcut solutions of gradient descent techniques, the researchers propose following eigenvectors of the Hessian (‘ridges’) with negative eigenvalues from a saddle point, an approach they call Ridge Rider (RR). The researchers have illustrated their methodology with a tree diagram, as shown above. It goes as follows:

• Start at a saddle (shown in green), where the norm of the gradient is zero.
• Follow the eigenvectors with negative eigenvalues — ridges.
• Take a step along the ridge (shown in red) until a new point is reached.
• Because the ridge is an eigenvector of the Hessian, the gradient at the new point is the step size multiplied by the eigenvalue and the eigenvector.
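The steps above can be sketched numerically on the toy saddle f(x, y) = x² − y² (the Hessian is constant here, so we hard-code it; in general it would be recomputed at every point, e.g. by autodiff):

```python
import numpy as np

# One ridge-following step from the saddle of f(x, y) = x^2 - y^2.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

saddle = np.array([0.0, 0.0])            # gradient norm is zero here
eigenvalues, eigenvectors = np.linalg.eigh(H)

# Ridges: eigenvectors with negative eigenvalues.
ridge = eigenvectors[:, eigenvalues < 0][:, 0]
step_size = 0.1
new_point = saddle + step_size * ridge   # step along the ridge

# Because the ridge is a Hessian eigenvector, the gradient at the new
# point equals (step size) * (eigenvalue) * (eigenvector).
predicted_grad = step_size * eigenvalues[eigenvalues < 0][0] * ridge
print(np.allclose(grad(new_point), predicted_grad))
```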

The idea here is that if the inner product between the new and the old ridge is greater than zero, the step is theoretically guaranteed to improve the loss. Ridge Rider thus provides an orthogonal set of loss-reducing directions, as opposed to SGD, which will almost always follow just one. The Ridge Rider algorithm can be dissected as follows:


• Select a ridge and follow it.
• Update until a breaking point, then branch again.
• At this point, one can choose whether to continue along the current path or select another ridge from the buffer.
• This is equivalent to choosing between breadth-first search and depth-first search.
• Finally, the leaves of the tree are the solutions to the problem, each uniquely defined by its fingerprint.
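The buffer-and-branch loop above can be sketched on a hypothetical two-dimensional loss with four symmetric minima (an illustration in the spirit of the algorithm, not the authors’ implementation):

```python
import numpy as np

# Toy loss with four minima at (±1, ±1); the origin is a stationary
# point whose Hessian has two negative eigenvalues (two ridges).
def grad(p):
    x, y = p
    return np.array([4 * x * (x**2 - 1), 8 * y * (y**2 - 1)])

def hessian(p):
    x, y = p
    return np.array([[12 * x**2 - 4, 0.0],
                     [0.0, 24 * y**2 - 8]])

def ridges(p):
    vals, vecs = np.linalg.eigh(hessian(p))
    out = []
    for i in np.where(vals < -1e-6)[0]:
        out.append(vecs[:, i])   # each negative eigenvalue yields
        out.append(-vecs[:, i])  # two ridges: +v and -v
    return out

def follow(p, direction, lr=0.05, steps=200):
    # Nudge off the stationary point, then plain gradient descent.
    p = p + 0.1 * direction
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Breadth-first search over the ridge tree, starting from the origin.
buffer = [(np.zeros(2), r) for r in ridges(np.zeros(2))]
solutions = set()
while buffer:
    point, ridge = buffer.pop(0)
    end = follow(point, ridge)
    if np.linalg.norm(grad(end)) < 1e-4:
        new_ridges = ridges(end)
        if new_ridges:                       # still a saddle: branch again
            buffer += [(end, r) for r in new_ridges]
        else:                                # a leaf: record the solution
            solutions.add(tuple(round(float(v), 2) for v in end))

print(sorted(solutions))  # all four minima are found
```

Popping from the front of the buffer gives breadth-first search; popping from the back would give depth-first search.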

The researchers evaluated RR in settings like exploration in Reinforcement Learning, zero-shot coordination, and supervised learning on both MNIST and the more challenging Colored MNIST problem.

(Source: Paper by Jack Parker-Holder et al.)

In an RL setting, a toy binary tree environment was used with a tabular policy. As shown above, the agent begins at s1 and selects actions a ∈ {left, right}, receiving a reward r ∈ {1, 10} upon reaching a terminal node. The maximum reward is 10.
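A hypothetical sketch of such an environment (the tree depth and the placement of the high-reward leaf are assumptions for illustration, not taken from the paper):

```python
# Toy binary-tree environment: the agent starts at the root, chooses
# left (0) or right (1) at each level, and receives a reward of 1 at
# most leaves but 10 at one special leaf.
def run_episode(actions, depth=3, best_leaf=(0, 0, 0)):
    # actions: sequence of 0/1 choices, one per level of the tree
    path = tuple(actions[:depth])
    return 10 if path == best_leaf else 1

print(run_episode([0, 0, 0]))  # the rewarding leaf -> 10
print(run_episode([1, 0, 1]))  # any other leaf -> 1
```

Finding the single high-reward leaf among 2^depth leaves is what makes the exploration problem hard for gradient-based methods.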

The plot on the right shows the percentage of solutions found per algorithm, collated by tree depth. The authors state that RR outperforms all three baselines, often finding over 90% of the solutions, while gradient descent finds at most 50% in each setting. In the exploration setting of an RL environment, the authors conclude, their experiments indicate that RR is able to find more diverse solutions than SGD.

In spite of obtaining decent results, the authors of this work admit that the RR algorithm has a long way to go to match the results of SGD. But it is promising to see researchers not succumbing to the might of popular algorithms and instead coming up with out-of-the-box solutions. The key takeaways of this work can be summarised as follows:

• This work is the first to propose following the eigenvectors of the Hessian in parameter space to train neural networks.
• The method deviates from SGD as commonly used across a broad spectrum of applications.
• It has the potential to address texture, shape and other learned biases.

Check the original paper here.

What Do You Think?