Welcome to Caches to Caches

This blog is devoted to the broad interests of Gregory J Stein, which include topics such as Numerical Modeling, Web Design, and Robotics, along with a number of my individual hobby projects. If there's any article you would like to see, or something you've been wondering about, be sure to let me know on Twitter.


Summary: Machine learning must always balance flexibility and prior assumptions about the data. In neural networks, the network architecture codifies these prior assumptions, yet the precise relationship between them is opaque. Deep learning solutions are therefore difficult to build without a lot of trial and error, and neural nets are far from an out-of-the-box solution for most applications.

Since I entered the machine learning community, I have frequently found myself engaging in conversation with researchers or startup-types from other communities about how they can get the most out of their data and, more often than not, we end up talking about neural networks. I get it: the allure is strong. The ability of a neural network to learn complex patterns from massive amounts of data has enabled computers to challenge (and even outperform) humans on tasks like object detection and games like Go. But reading about the successes of these newer machine learning techniques rarely makes clear one important point:

Nothing is ever free.

When training a neural network — or any machine learning system — a tradeoff is always made between flexibility in the sorts of things the system can learn and the amount of data necessary to train it. Yet in practice, the precise nature of this tradeoff is opaque. That a neural network is capable of learning complex concepts — like what an object looks like from a bunch of images — means that training it effectively requires a large amount of data to convincingly rule out other interpretations of the data and reject the impact of noise. On the face of it, this statement is perhaps obvious: of course it requires more work/data/effort to extract meaning from more complex problems. Yet, perhaps counterintuitively to many machine learning outsiders, the way in which these systems are designed and the relationships between the many hyperparameters that define them have a profound impact on how well the system performs.

Noise takes many forms. In the case of object detection, noise might include the color of the object: I should be able to identify that a car is a car regardless of its color.
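
To make this tradeoff concrete, here is a toy sketch (made-up data, not from any real experiment): the underlying relationship is linear, but a far more flexible model fit to the same handful of noisy samples happily latches onto the noise.

```python
# Toy illustration of flexibility vs. data: a truly linear relationship
# observed through noise, fit by a constrained model and a very flexible one.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)  # truly linear, plus noise
    return x, y

x_train, y_train = make_data(10)    # small, noisy training set
x_test, y_test = make_data(1000)    # large held-out set

for degree in (1, 9):               # constrained vs. highly flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: held-out MSE = {mse:.3f}")
```

With only ten samples, the degree-9 fit passes through the noise and does far worse on held-out data; give it enough data, and the gap closes.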


In my role as a Communication Advisor for the MIT Communication Lab, I see a lot of practice talks. Students, both graduate and undergraduate, sign up for a 30-minute or hour-long session during which they will present some material they're working on and ask for guidance on both content and presentation: "How clear is what I was trying to accomplish?" or "Are my results figures clear?"

Rarely do students ask "Did I use too much jargon?" It likely doesn't occur to them that, despite their relative inexperience, they might know more about the subject at hand than those to whom they are presenting.

One of the key components of good technical communication is providing the right amount of context. Provide too much background material and your audience will lose interest; provide too little, and they may not be able to follow the remainder of the talk. The first half of a talk should clearly communicate Why the audience should care about your work and How your work compares to other work in the field. Addressing these questions often requires an understanding of popular trends within a discipline or of how common certain tools or tricks are.

It should come as no surprise that newer researchers, typically undergraduates or first- or second-year graduate students, may find it difficult to decide what information to include when preparing a talk. More often than not, I find that technical talks — particularly those from newcomers to the field — spend too much time discussing the nitty-gritty details of an experiment while leaving out important details about the motivation of the research. Talks from neophyte researchers often vacillate between covering an overwhelming amount of unnecessary minutiae and unknowingly leaning on too much jargon when explaining difficult concepts, likely in an effort to seem experienced. It is not uncommon for such presentations — in the space of two slides — to transition from an in-depth description of background material the audience might consider "common knowledge" to a hasty description of domain-specific information essential for understanding the remainder of the talk. To make matters more complicated, the composition of the audience must be taken into consideration when deciding what material to address during the talk: what one group considers "common knowledge" may be completely foreign to another.

Preparing a talk requires understanding one's audience and, without external support, only experience yields such knowledge. Technical communication is understandably hard for newcomers: not only do they have trouble fully appreciating what they do and don't know, it is also extremely difficult for them to gauge what those around them know. Good mentorship is critical for shaping a younger student's perspective in this regard. Such students should seek out feedback from more established members of the community, and experienced communicators should make themselves available to provide support.

As always, I welcome your thoughts (and personal anecdotes) in the comments below or on Hacker News.


Summary: Many machine learning systems are optimized using metrics that don't perfectly match the stated goals of the system. These so-called "proxy metrics" are incredibly useful, but must be used with caution.

The use of so-called proxy metrics to solve real-world machine learning problems happens with perhaps surprising regularity. The choice to optimize an alternative metric, in which the optimization target differs from the actual metric of interest, is often a conscious one. Such metrics have proven incredibly useful for the machine learning community — when used wisely, proxy metrics make it possible to accomplish tasks that would otherwise be extremely difficult. Here, I discuss a number of common scenarios in which I see machine learning practitioners using these proxy metrics and how this approach can sometimes result in surprising behaviors and problems.
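
Perhaps the most familiar example: classification accuracy is often the metric we actually care about, but it is not differentiable, so models are trained against cross-entropy as a proxy. The toy sketch below (made-up numbers) shows how the two can quietly disagree: the proxy keeps improving while accuracy doesn't budge.

```python
# Toy example: accuracy (the metric of interest) vs. cross-entropy (the proxy
# actually optimized). Becoming more confident on already-correct examples
# improves the proxy without changing accuracy at all.
import numpy as np

labels = np.array([1, 0, 1, 1])

def accuracy(probs):
    return np.mean((probs > 0.5).astype(int) == labels)

def cross_entropy(probs):
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

before = np.array([0.6, 0.4, 0.6, 0.4])   # last example is misclassified
after = np.array([0.9, 0.1, 0.9, 0.4])    # more confident, same mistake

for name, probs in [("before", before), ("after", after)]:
    print(f"{name}: accuracy={accuracy(probs):.2f}, "
          f"cross-entropy={cross_entropy(probs):.3f}")
```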

I have written before about the repercussions of optimizing a metric that doesn't perfectly align with the stated goal of the system. Here, I touch upon why the use of such metrics is actually quite common.


It is especially easy at the beginning of a new year to fall into the trap of the New Year's Resolution. New goals and challenges and routines are established under the banner of self-improvement. Gyms and fitness centers become packed. Productivity books fly off the shelves. In the words of The New Yorker's Alexandra Schwartz, we're improving ourselves to death, and in our crusade to make time for all this self-improvement, hobbies are too often forgotten.

What I've discovered over time is that many of my skills and passions have developed only through little side-projects that never see the light of day.

For me, hobbies walk an enjoyable line between work and play: I often aim to learn something or try something new. Hobbies afford me the opportunity to challenge myself and embrace failure without worry of repercussion. A couple of years ago, during a self-imposed redesign of this blog, I came across Robert Bringhurst's fantastic book, The Elements of Typographic Style. Captivated by his prose on the art of typography and font design, I started exploring, and my website became an outlet for experiments in typography. My exploration grew to include other types of graphic design, and I've been experimenting with design software and creating vector art ever since. Mastery has never been the goal: learning something new and having an outlet for my creativity are their own rewards. Yet despite the independent nature of my exploration, making high-quality graphs and figures has become easier and the quality of my technical presentations at work has clearly improved.


Summary: Big, publicly available datasets are great. Yet many practitioners who seek to use models pretrained on outside data need to ask themselves how informative that data is likely to be for their purposes. "Dataset bias" and "task specificity" are important factors to keep in mind.

As I read deep learning papers these days, I am occasionally struck by the staggering amount of data some researchers are using for their experiments. While I typically work to develop representations that allow for good performance with less data, some scientists are racing full steam ahead in the opposite direction.

It was only a few years ago that we thought the ImageNet 2012 dataset, with its 1.2 million labeled images, was quite large. Only six years later, researchers from Facebook AI Research (FAIR) have dwarfed ImageNet 2012 with a 3-billion-image dataset of hashtag-labeled images from Instagram. Google's YouTube-8M dataset, geared towards large-scale video understanding, consists of audio/visual features extracted from 350,000 hours of video. Simulation tools have also been growing to incredible sizes: InteriorNet is a simulation environment consisting of 22 million 3D interior environments, hand-designed by over a thousand interior designers. And let's not forget about OpenAI, whose multiplayer-game-playing AI is trained using a massive cluster of computers so that it can play 180 years of games against itself every day.
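
For most practitioners, the relevant question is how to put all of this outside data to work on their own, much smaller problems. One common pattern, sketched below assuming PyTorch/torchvision (the class count is a placeholder), is to reuse an ImageNet-pretrained backbone and retrain only its final layer; whether that works well depends on how closely the outside data matches the task at hand.

```python
# A minimal sketch, assuming PyTorch/torchvision: reuse an ImageNet-pretrained
# backbone and retrain only a new final layer on a small, task-specific
# dataset. (Newer torchvision versions spell this via a `weights=` argument.)
import torch.nn as nn
from torchvision import models

num_classes = 5                              # placeholder for your own task
model = models.resnet50(pretrained=True)     # weights learned from ImageNet

for param in model.parameters():             # freeze the pretrained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, num_classes)  # new task head
# ...then train only model.fc on your (much smaller) labeled dataset...
```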


AlphaZero is incredible. If you have yet to read DeepMind's blog post about their recent paper in Science detailing the ins and outs of their legendary game-playing AI, I recommend you do so. In it, DeepMind's scientists describe an intelligent system capable of playing the games of Go, Chess, and Shogi at superhuman levels. Legendary chess Grandmaster Garry Kasparov says the moves selected by the system demonstrate a "superior understanding" of the games. Even more remarkable is that AlphaZero, a successor to the well-known AlphaGo and AlphaGo Zero, is trained entirely via self-play — it was able to learn good strategies without any meaningful human input.

So do these results imply that Artificial General Intelligence is soon to be a solved problem? Hardly. There is a massive difference between an artificially intelligent agent capable of playing chess and a robot that can solve practical real-world tasks, like exploring a building it's never seen before to find someone's office. AlphaZero's intelligence derives from its ability to make predictions about how a game is likely to unfold: it learns to predict which moves are better than others and uses this information to think a few moves ahead. As it learns to make increasingly accurate predictions, AlphaZero gets better at rejecting "bad moves" and is able to simulate deeper into the future. But the real world is almost immeasurably complex, and, to act in the real world, a system like AlphaZero must decide between a nearly infinite set of possible actions at every instant in time. Overcoming this limitation is not merely a matter of throwing more computational power at the problem:

Using AlphaZero to solve real problems will require a change in the way computers represent and think about the world.
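
To make concrete what "thinking a few moves ahead" means, here is a heavily simplified sketch of prediction-guided lookahead. It is emphatically not DeepMind's implementation (AlphaZero couples Monte Carlo tree search with learned policy and value networks), but it captures the core idea: a learned value estimate ranks candidate moves, and the search spends its budget only on the promising ones. The helpers value_fn, legal_moves, and apply_move are placeholders.

```python
# Heavily simplified, prediction-guided lookahead. `value_fn` stands in for a
# learned network that scores a position from the perspective of the player
# about to move; `legal_moves` and `apply_move` are placeholders for the
# game rules.

def search(state, value_fn, legal_moves, apply_move, depth, top_k=3):
    """Score `state` by looking `depth` moves ahead, expanding only the
    `top_k` most promising replies at each level."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return value_fn(state)

    # The learned prediction scores each successor from the *opponent's*
    # perspective; positions the opponent likes least look best for us.
    children = sorted((apply_move(state, m) for m in moves), key=value_fn)

    # Recurse only into the promising children ("rejecting bad moves"),
    # negating because the players alternate turns.
    return max(-search(child, value_fn, legal_moves, apply_move,
                       depth - 1, top_k)
               for child in children[:top_k])
```

Better value predictions let such a search discard more moves safely, which is exactly what allows it to look deeper with the same budget; in the real world, the set of candidate "moves" is so vast that no amount of pruning of this kind suffices on its own.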

Yet despite the complexity inherent in the real world, humans are still capable of making predictions about how the world behaves and using this information to make decisions. To understand how, we consider how humans learn to play games.


The modern revolution in machine learning and robotics has been largely enabled by access to massive repositories of labeled image data. AI has become synonymous with big data, chiefly because machine learning approaches to tasks like object detection or automated text translation require massive amounts of labeled training data. Yet obtaining real-world data can be expensive, time-consuming, and inconvenient. In response, many researchers have turned to simulation tools — which can generate nearly limitless training data. These tools have become fundamental to the development of algorithms, particularly in the fields of Robotics and Deep Reinforcement Learning.

This is the first post in a three-part series on the role of simulated image data in the era of Deep Learning. In this post, I discuss the significance of simulation tools in the field of robotics and the promise and limitations of photorealistic simulators.