Proxy metrics are everywhere in Machine Learning

25 Jan 2019 Gregory J. Stein

Summary: Many machine learning systems are optimized using metrics that don’t perfectly match the stated goals of the system. These so-called “proxy metrics” are incredibly useful, but must be used with caution.

The use of so-called proxy metrics to solve real-world machine learning problems happens with perhaps surprising regularity. The choice to optimize an alternative metric, one whose optimization target differs from the actual metric of interest, is often a conscious one. Such metrics have proven incredibly useful for the machine learning community: when used wisely, they make it possible to accomplish tasks that are otherwise extremely difficult. Here, I discuss a number of common scenarios in which I see machine learning practitioners using these proxy metrics, and how this approach can sometimes result in surprising behaviors and problems.

I have written before about the repercussions of optimizing a metric that doesn’t perfectly align with the stated goal of the system. Here, I touch upon why the use of such metrics is actually quite common.

One place in which proxy metrics appear is hidden in general-purpose or off-the-shelf tools like object detectors or semantic segmentation systems: the metric used to train these tools may not match the metric you actually care about. A neural network trained on ImageNet, for instance, is designed to penalize incorrect detections equally across its thousand object categories. As such, it may not be a particularly good choice for a machine learning system whose only job is to differentiate between different breeds of dog. A more custom-tailored metric may be appropriate, yet retraining the algorithm for specific use cases may be prohibitively expensive or difficult for some tasks. Owing to the difficulty associated with retraining word embedding networks, for example, the natural language processing community frequently uses unmodified off-the-shelf tools for research.

Sometimes the system can be cleverly engineered in such a way that learning can only perform well at optimizing the proxy metric if it succeeds at understanding the target concept. One fascinating example of this is MonoDepth, which aims to train a neural network to predict depth given a monocular image. The catch? The training data does not include any explicit depth information. Instead, the system is trained on left and right image pairs captured from a stereo camera. The algorithm succeeds if it is capable of figuring out what the left image should look like given only the right image (and vice versa). The only way to do this is to understand how far away objects are, since the 3D geometry is essential for reconstructing the scene from a different perspective. So, by iterating over predicted depth to drive down errors in the reconstructed image, the algorithm discovers how far away each object must be.
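The core of this idea is small enough to sketch. Below is a minimal, hypothetical version of the reconstruction proxy, assuming a rectified stereo pair and a crude nearest-neighbor warp; MonoDepth itself uses a differentiable bilinear sampler and a more elaborate loss, but the principle is the same: depth (via disparity) is never supervised directly, it only enters through the warp used to rebuild one image from the other.

```python
import numpy as np

def reconstruct_left(right, disparity):
    """Warp the right image toward the left view using per-pixel
    horizontal disparity (nearest-neighbor sampling for simplicity).
    In a calibrated stereo rig, disparity is inversely proportional
    to depth, so a good reconstruction implies good depth."""
    h, w = right.shape
    cols = np.arange(w)
    out = np.zeros_like(right)
    for r in range(h):
        src = np.clip(cols - disparity[r].round().astype(int), 0, w - 1)
        out[r] = right[r, src]
    return out

def photometric_loss(left, right, disparity):
    """The proxy metric: mean absolute reconstruction error. Driving
    this to zero forces the predicted disparity to be correct, even
    though no depth labels ever appear in the training data."""
    return np.abs(left - reconstruct_left(right, disparity)).mean()
```

A network trained to minimize `photometric_loss` over many stereo pairs ends up predicting disparity, and hence depth, as a side effect of the reconstruction objective.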

Note that MonoDepth has an interesting failure mode because of its choice of metric: the depth of windows and other transparent or reflective surfaces will be incorrect. This is an expected consequence of their approach, but may cause problems if avoiding windows is a critical part of your algorithm.

In some cases, the metric you care about may be hard to optimize directly. Consider, for example, a scenario in which a robot is instructed to clean a bedroom as efficiently as possible. Solving this problem exactly involves using onboard sensors to detect all messy objects and then taking actions to put them in their place. However, if the perception system occasionally misses objects, the set of actions the robot needs to take will also change, which makes direct optimization of the perception system difficult. Object detection systems for robot perception are therefore often trained in isolation, because of the difficulties inherent in jointly optimizing the perception system and the actions the robot takes as a function of its noisy output.

Finally, it is often the case that it isn't obvious how to mathematically specify the actual objective. This is one of the biggest challenges in many machine learning contexts: not knowing how to express the thing you want to improve so as to get the behavior you would like to see. This is a common problem in the reinforcement learning community, in which the behavior of an AI agent is determined by a user-specified reward function. Here is a particularly interesting example highlighted in a recent survey paper on surprising behavior in “digital evolution”:

In a seminal work from 1994, Karl Sims evolved 3D virtual creatures that could discover walking, swimming, and jumping behaviors in simulated physical environments. The creatures’ bodies were made of connected blocks, and their “brains” were simple computational neural networks that generated varying torque at their joints based on perceptions from their limbs, enabling realistic-looking motion. The morphology and control systems were evolved simultaneously, allowing a wide range of possible bodies and locomotion strategies. Indeed, these ‘creatures’ remain among the most iconic products of digital evolution.

However, when Sims initially attempted to evolve locomotion behaviors, things did not go smoothly. In a simulated land environment with gravity and friction, a creature’s fitness was measured as its average ground velocity during its lifetime of ten simulated seconds. Instead of inventing clever limbs or snake-like motions that could push them along (as was hoped for), the creatures evolved to become tall and rigid. When simulated, they would fall over, harnessing their initial potential energy to achieve high velocity. Some even performed somersaults to extend their horizontal velocity. To prevent this exploit, it was necessary to allocate time at the beginning of each simulation to relax the potential energy inherent in the creature’s initial stance before motion was rewarded.

Joel Lehman et al., August 2018
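The fix described in the quote amounts to a change in the fitness function itself. Here is a hypothetical re-creation of that change (the time constants and function name are illustrative, not Sims' actual code): by discarding an initial settling window, a creature can no longer score points simply by toppling over from a tall stance.

```python
def fitness(velocities, dt=0.01, settle_time=1.0):
    """Average ground speed over a creature's lifetime, ignoring an
    initial settling window so that velocity gained by converting the
    starting pose's potential energy into a fall is not rewarded.
    `velocities` is a list of speed samples taken every `dt` seconds."""
    start = int(settle_time / dt)
    scored = velocities[start:]
    return sum(scored) / len(scored) if scored else 0.0
```

With the naive metric (average over the whole lifetime), a creature that falls over fast and then lies still can outscore a slow but genuine walker; excluding the settling window reverses that ranking.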

Which brings me to my final point:

Whenever you are optimizing a proxy metric, you open yourself up to potentially surprising errors.

Some research I recently presented at CoRL involved training a neural-network-based classifier to predict dead ends while exploring building-like environments. The problem of intelligent decision making in unknown environments is notoriously difficult, so we instead introduced a classifier to predict where dead ends would appear in the unknown portions of the environment. The classifier would be used as part of a larger system that could navigate through unknown environments as humans do — by learning to recognize that offices and bathrooms are far less likely to lead to faraway goals than hallways.

Many classifiers are trained using a symmetric loss, in which the penalty for incorrectly labeling an example is independent of the type of example. What was not obvious to us during the first stages of development was that this assumption did not hold for our problem: the misclassification penalty depends on the trajectory of interest, since erroneously exploring an office wastes far less time than mistakenly ignoring the only hallway that leads to the goal. We corrected our training metric to capture this asymmetry, and the final system performed quite well.
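In code, correcting for this kind of asymmetry can be as simple as weighting the two error types differently in a cross-entropy loss. The sketch below uses illustrative weights and a hypothetical labeling convention, not the actual values or formulation from our system:

```python
import math

def weighted_bce(p, label, fn_cost=10.0, fp_cost=1.0):
    """Binary cross-entropy with asymmetric penalties. Here label=1
    means 'this route leads to the goal': confidently missing it
    (a false negative) is penalized fn_cost times more heavily than
    wastefully exploring a dead end (a false positive). Setting
    fn_cost == fp_cost recovers the usual symmetric loss."""
    eps = 1e-12  # guard against log(0)
    return -(fn_cost * label * math.log(p + eps)
             + fp_cost * (1 - label) * math.log(1 - p + eps))
```

During training, the network pays far more for assigning low probability to the one hallway that matters than for being overconfident about an office, which is exactly the asymmetry the navigation task demands.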

Fortunately, it was immediately obvious in our case that something wasn’t working as we expected. For other problem domains, the issues may be more subtle. As designers of machine learning systems, we need to be careful that our choice of metric does not cause unintended consequences or biases in production.

As always, I welcome discussion in the comments below. Feel free to ask questions, share your thoughts, or let me know of some research you would like to share.