Practical Guidelines for Getting Started with Machine Learning

As the use of machine learning systems grows beyond the academic sphere, one of the more worrying trends I have witnessed is a lack of understanding of how machine learning systems should be trained and applied. The lessons the AI community has learned over the last few decades of research are hard-earned, and it should go without saying that those who do not understand the inner workings of a machine learning tool risk having that system fail, often in surprising ways.

This advice is not limited to AI. Using any stochastic system without an understanding of when or how it is likely to fail comes with inherent risk.

However, the potential advantages of AI are many, and using machine learning to accelerate your business, whether empowering employees or improving your product, may outweigh potential pitfalls. If you are looking to use machine learning tools, here are a few guidelines you should keep in mind:

Establish clear metrics for success.
Start with the simplest approach.
Ask yourself if machine learning is even necessary.
Use both a test and a validation dataset.
Understand and mitigate data overfitting.
Be wary of bias in your data.

Even before you get started, it is important to establish clear metrics for success. How is performance measured? Is it measured as accuracy on a particular task? Is it improvement in user feedback or retention? You should also ask yourself how much of an improvement in performance you would require before you would put it into practice: is the hassle of maintaining the machine learning system worth the benefits it provides?

Having a clear idea of what success would look like is good advice in general.

In addition to establishing a performance metric, you should also have a performance baseline. Are you comparing the system to human performance or is there another automated system that you might be comparing against? With this in mind, I always recommend that one start with the simplest approach. This is a topic I’ve written about in the past, but it certainly bears repeating. Many modern machine learning tools are complex and opaque, and sometimes a simpler approach is enough to get excellent performance. Even if the simpler approaches do not reach the performance goal you might be looking for, they may serve as performance baselines of their own, against which the more complex techniques can be compared.

K-nearest neighbors and support vector machines are two relatively easy-to-implement machine learning techniques that each make for a good starting point when tackling a new task. Both apply to a large number of problems and do not require very much parameter tuning.
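To make the idea of a baseline concrete, here is a minimal sketch that compares a hand-rolled k-nearest neighbors classifier against the simplest possible baseline: always guessing the most common label. The toy data, the function names (knn_predict, make_blobs), and the choice of k=3 are all assumptions made for illustration; in practice you would likely reach for an existing library implementation.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # `train` is a list of (features, label) pairs; features are tuples of floats.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(predict, data):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def make_blobs(n, rng):
    """Hypothetical toy data: two well-separated 2-D clusters, imbalanced roughly 3:1."""
    data = []
    for _ in range(n):
        label = "a" if rng.random() < 0.75 else "b"
        center = (0.0, 0.0) if label == "a" else (4.0, 4.0)
        point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
        data.append((point, label))
    return data


rng = random.Random(0)
train = make_blobs(200, rng)
held_out = make_blobs(100, rng)

# Baseline: ignore the features and always predict the most common training label.
majority_label = Counter(y for _, y in train).most_common(1)[0][0]
baseline_acc = accuracy(lambda x: majority_label, held_out)

# Simple learned model: 3-nearest neighbors.
knn_acc = accuracy(lambda x: knn_predict(train, x, k=3), held_out)

print(f"majority-class baseline: {baseline_acc:.2f}")
print(f"3-nearest neighbors:     {knn_acc:.2f}")
```

Because the classes are imbalanced, the majority-class baseline already looks respectable on its own, which is exactly why it is worth measuring before crediting a more complex model with the result.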

You might also ask yourself if machine learning is even necessary to solve the problem you are looking to overcome. Almost a decade ago, I very nearly automated myself out of a job while working as a paralegal. Many of the tasks I had been assigned were various types of data entry and were rather straightforward to automate with conventional scripting in Visual Basic: no machine learning necessary. This example is rather clear-cut, but I hear about scenarios like this one in different industries rather frequently. A programmer may be better suited than a data scientist to solve your problem.

I recommend taking a look at this article in which a database programmer in the e-commerce industry describes how a combination of asking the right questions and writing a few clever SQL queries vastly improved the open rate for the company’s marketing emails.

The way in which the system is trained is also important to keep in mind. One important guideline is to use both a test and a validation dataset. Even for outsiders to the machine learning community, it is usually clear that the data used to tune the parameters of the algorithm, often known as the training set, should not be used to evaluate the performance of the algorithm. A separate test set should be provided to evaluate the performance once the algorithm is trained. However, it is tempting to keep tuning the algorithm in response to the results on the test set. This is bad practice, because the test set then stops being an independent measure of how the system will behave on unseen data. Instead, there should be another dataset, known as the validation set, on which the parameters of the algorithm can be tuned. Only once the parameters of the algorithm are fixed should the final performance statistics be generated on the test set.
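The workflow described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not a prescription: the toy data, the 60/20/20 split, and the use of k in a k-nearest neighbors classifier as the setting being tuned are all hypothetical choices.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of `data` classified correctly using `train` as the reference set."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Hypothetical toy data: two overlapping 2-D clusters.
rng = random.Random(42)
data = []
for _ in range(300):
    label = rng.choice([0, 1])
    center = 0.0 if label == 0 else 2.0
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))

# Shuffle once, then carve out three disjoint subsets (60% / 20% / 20%).
rng.shuffle(data)
n_train = len(data) * 6 // 10
n_val = len(data) * 2 // 10
train = data[:n_train]
validation = data[n_train:n_train + n_val]
test = data[n_train + n_val:]

# Tune k using the validation set only; the test set is not consulted here.
candidate_ks = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# With k fixed, evaluate on the test set exactly once for the reported number.
test_acc = accuracy(train, test, best_k)
print(f"best k on validation: {best_k}; test accuracy: {test_acc:.2f}")
```

The important design choice is that the test set appears in exactly one line, after every tuning decision has already been made.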

The primary reason you should want more data than just your training dataset is so that you can understand and mitigate data overfitting. With only a single dataset, it’s easy to be fooled into thinking performance is very good. Convolutional Neural Networks, a popular tool for tasks like image classification and object detection, are so powerful that they are capable of correctly classifying a training dataset of random noise with very high accuracy. However, a machine learning algorithm trained in such a way predictably performs poorly on the test and validation datasets, since there is no correlation between them and the data upon which the learning algorithm was trained. Some overfitting to your training data is expected, but a large performance gap between the training and test sets may indicate that more data is necessary or that a different machine learning technique should be used.
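Memorizing noise is not unique to neural networks. As a stand-in for a high-capacity model (an assumption for illustration, not the Convolutional Neural Network experiment mentioned above), the sketch below uses a 1-nearest-neighbor classifier on data whose labels are assigned at random. It scores perfectly on its own training set and roughly at chance on fresh data, which is the kind of gap worth looking for.

```python
import math
import random


def nearest_label(train, query):
    """Return the label of the single closest training point (1-nearest neighbor)."""
    return min(train, key=lambda pair: math.dist(pair[0], query))[1]


def accuracy(train, data):
    """Fraction of `data` whose label matches the 1-nearest-neighbor prediction."""
    return sum(nearest_label(train, x) == y for x, y in data) / len(data)


def random_examples(n, rng):
    """Features and labels are drawn independently, so there is nothing real to learn."""
    return [((rng.random(), rng.random()), rng.choice([0, 1])) for _ in range(n)]


rng = random.Random(7)
train = random_examples(300, rng)
test = random_examples(300, rng)

# On the training set, every point is its own nearest neighbor: pure memorization.
train_acc = accuracy(train, train)

# On fresh data drawn the same way, memorized labels carry no information.
test_acc = accuracy(train, test)

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
print(f"gap:               {train_acc - test_acc:.2f}")
```

Looking at the training accuracy alone, this model appears flawless; only the held-out data reveals that it has learned nothing.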

Finally, you should always be wary of bias in your data. This is another issue I have written about at length. Bias manifests itself in subtle ways and can influence the performance of your system in practice. In particular, if your machine learning system will directly interact with users, be sure that the training, test, and validation data you use are collected so that they match the sort of data that your system will see in production; a mismatch here is a common failure mode for machine learning systems. This effect has also been known to exacerbate social biases, especially when there is a disparity between the composition of those who train the system and the end users of the system.
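One rough way to check whether your collected data resembles production is to compare how some recorded attribute is distributed in each. The sketch below computes the total variation distance between two categorical samples. The attribute (device type), the counts, and the 0.1 warning threshold are all hypothetical; this is one simple check among many, and it will not catch bias in attributes you did not think to record.

```python
from collections import Counter


def proportions(values):
    """Map each category to its share of the sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation(sample_a, sample_b):
    """Total variation distance between two categorical samples.

    0 means the two samples have an identical mix of categories;
    1 means they share no categories at all.
    """
    p, q = proportions(sample_a), proportions(sample_b)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# Hypothetical example: device type recorded for training data vs. production traffic.
training_devices = ["desktop"] * 80 + ["mobile"] * 20
production_devices = ["desktop"] * 30 + ["mobile"] * 65 + ["tablet"] * 5

gap = total_variation(training_devices, production_devices)
print(f"total variation distance: {gap:.2f}")

# The threshold below is an arbitrary illustration, not an established standard.
if gap > 0.1:
    print("training data may not reflect what the system sees in production")
```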

For those curious to read more about dataset bias, I would recommend the fantastic research article Unbiased Look at Dataset Bias by Antonio Torralba and Alexei Efros, two influential researchers in the field of computer vision.

These are just a handful of the things you should keep in mind when using machine learning to make decisions in your business. A slightly more technical guide, Google’s Best Practices for ML Engineering, is worth taking a look at: it lists over 40 guidelines to consider when working with machine learning systems in production and is curated by some of the most talented machine learning engineers in the world. In short, be cautious when using machine learning; despite the incredible promise of these techniques, you should take care to ensure that these systems are actually performing well.