Roughly a year and a half ago, I had the privilege of taking a graduate “Introduction to Machine Learning” course under the tutelage of the fantastic Professor Leslie Kaelbling. While I learned a great deal over the course of the semester, there was one minor point she made to the class that stuck with me more than I expected it to at the time: before using a really fancy or sophisticated or “in-vogue” machine learning algorithm to solve your problem, try a simple Nearest Neighbor Search first.
Let’s say I gave you a bunch of data points, each with a location in space and a value, and then asked you to predict the value of a new point in space. What would you do? Perhaps the values of your data are binary (just +’s and −’s) and you’ve heard of Support Vector Machines. Should you give that a shot? Maybe the values are continuous (anything on the real number line) and you feel like giving Linear Regression a whirl. Or you’ve heard of the ongoing Deep Learning Revolution and feel sufficiently bold as to tackle this problem with a Neural Net.
…or you could simply find the closest point in your dataset to the one you’re interested in and offer up the value of this “nearest neighbor”. A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized “veil of ignorance”. Though there exist slightly more complicated variations on the algorithm, the basic principle behind all of them is effectively the same.
Of course, I’ve swept some potential complications under the rug here. Perhaps the most important of these is how to answer the question “how do I compute the distance?” For simple problems, this may be easy, but there are indeed cases in which the distance must be specially defined. In addition, if you don’t have very many points in your initial data set, the performance of this approach is questionable.
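To make the procedure concrete, here’s a minimal sketch of a one-nearest-neighbor predictor in Python (the names are my own, chosen for illustration; the distance function is a parameter precisely because, as noted above, some problems need a specially defined one):

```python
import math

def euclidean(a, b):
    # The usual straight-line distance; swap in your own function
    # when "closeness" means something different for your problem.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, data, distance=euclidean):
    # `data` is a list of (point, value) pairs; return the value of
    # the training point closest to `query`.
    _, value = min(data, key=lambda pair: distance(query, pair[0]))
    return value

# A toy data set of labeled points in the plane.
training = [((0.0, 0.0), "-"), ((1.0, 1.0), "+"), ((2.0, 2.5), "+")]
print(nearest_neighbor((0.9, 1.2), training))  # prints "+"
```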
It’s worth noting that having few data points in one’s training set is already enough to give most machine learning researchers pause.
Obviously, using a nearest neighbor search probably isn’t going to be the answer to most of your problems, but there are compelling reasons to try it before going down another, more complicated path. It’s certainly relatively easy to implement, which means that getting preliminary readings from your data can be a much quicker process. In addition, a nearest neighbor search can often provide a reasonable baseline performance, against which more sophisticated algorithms can be compared.
This last term, I was fortunate enough to serve as a Teaching Assistant for the same Machine Learning course I mentioned above. The final project in the course encourages the students to find creative ways of applying some of the more sophisticated algorithms we’ve taught them to data sets of their own. Many of the students would come to me during their mid-term meetings with all kinds of statistics regarding the performance of their algorithms. However, I probably would not have encountered their particular data set before, so telling me that they were able to get “over 90% accuracy” was meaningless without something to compare it to; on some medical data sets, getting above a mere 50% predictive accuracy is often stellar performance. I found myself asking most of them the same question:
Have you tried using a 'Nearest Neighbor Search'?
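Running that check is rarely more than a few lines of work. As a sketch, assuming you have scikit-learn installed and substituting one of its bundled data sets for your own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Any labeled data set will do; this one ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# With n_neighbors=1 this is exactly the nearest neighbor search
# described above: predict the label of the single closest point.
baseline = KNeighborsClassifier(n_neighbors=1)
baseline.fit(X_train, y_train)
print("1-NN baseline accuracy:", baseline.score(X_test, y_test))
```

If a fancier model can’t beat that number, its extra machinery isn’t earning its keep.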
Machine learning is, in many ways, a science of comparison. Algorithms are compared against one another on established data sets, which are themselves adopted via some sort of popularity metric. Nearest neighbor search is so simple and easy to implement that it can serve as a reasonable baseline when other methods are more difficult to test against. Nowadays, the highest-performing machine learning algorithms are compared to human performance, and some have recently even proven capable of beating us.
There are often theoretical reasons to prefer one technique over another, but sometimes effectiveness or popularity is reason enough on its own, as has become the case for neural networks.
Certainly there are those of you who recognize that a nearest neighbor search isn’t even applicable in some application domains. However, there’s a more general lesson to be learned here: in the midst of an age characterized by algorithms of famously “unreasonable effectiveness”, it’s important to remember that simpler techniques are still powerful enough to solve many problems. Using a neural network often isn’t the best approach to take, even though they’ve dominated headlines for the last few years.
I was recently tasked with an object detection problem as part of a demo I’ve been preparing for. My job was to detect a vehicle: not a type of vehicle, not a make or model of car, but a single, specific vehicle that we have in our garage. My first instinct was to pull up the cutting-edge work in real-time object detection (which I did) and get it to work on my own machine (which I also did) and to train it with a massive amount of images of our particular vehicle (which I was unable to do). Neural networks notoriously require massive amounts of data; the network in this Google Neural Network paper can classify 1,000 different types of images and was trained on over a million photos. I quickly discovered that such a system was overkill for me, and resorted to using an open source implementation of a simpler algorithm. Not only was this easier to do, since there was fairly decent documentation, but the performance, while not “cutting edge”, was more than enough to satisfy the relatively lax constraints of our simple demo.
Collecting thousands of images on one’s own is difficult enough without having to vary backgrounds and lighting conditions to build an effective training set.
The takeaway here is that though simpler algorithms may not perform quite as well as the state of the art, the savings in time and computational cost often outweigh the modest performance edge of more sophisticated solutions. So the next time you’re faced with an unknown machine learning problem, remember to give Nearest Neighbor Search a try.