Learning Machine Learning

By Emil Kotrc posted Feb 02, 2017 02:01 PM


Machine learning is a rapidly growing field of computer science with many far-reaching applications. Unless you've been living in a cave, you have probably heard a lot about Machine Learning (ML) in the last couple of years. Of course, there are many relevant materials available as well as a lot of noise and misunderstandings. You probably remember other buzzwords floating around (like Big Data) and therefore you may ask: what's the deal here, and is it really something for me? Hopefully, this article will help you answer these questions and find an entry point for your journey. We will start with a brief introduction and definition of ML, then move on to some very basic ideas and concepts, and end with some tips on where to start learning more.

Introduction and definition

First of all, Machine Learning is nothing new. It has been here for decades, and many of the ideas and algorithms have been around for many years. What has changed, though, is the exciting technology that allows us to process huge amounts of data within an unbelievably short amount of time. This goes hand in hand with the huge volume of data we generate every day using our beloved (or hated) gadgets and devices, and with faster living.

This is the time when Machine Learning started spreading from academia and specialized industries into common businesses and even to individual users and fans.

I am not so old (yet), but I remember that a couple of years ago, in order to train certain kinds of models, I needed a few gigabytes of memory and had to use a computational cluster, which ran the algorithm for two days. Nowadays, I can easily do the same with my home computer in a fraction of the time.

So, what is Machine Learning? Machine learning itself is an interdisciplinary field that shares many paradigms and concepts with other fields of mathematics and statistics, and can also be viewed as a part of Artificial Intelligence (AI). However, in contrast to AI in general, ML does not try to imitate intelligent behavior, but rather focuses on algorithms that can process huge volumes of data and detect patterns that are not obvious or easily deduced by humans.

If you check Wikipedia for the definition of ML, you will see that machine learning explores the study and construction of algorithms that can learn from and make predictions on data. There are two important points in the previous sentence, so let's emphasise them:

  1. Machine Learning is about algorithms that create so-called models based on some known data called a training set. Models then make data-driven predictions (decisions) on new, unseen data. It means that the model is not a program you would code, but instead is generated logic that can interpret the data and provide some output.
  2. The algorithms learn from and make predictions on data. So, data are really crucial for ML, and to even start with ML you need some data. Later in the article we will learn a bit more about supervised versus unsupervised learning, but at this moment you can think about training a model as teaching a child by giving her some examples, and the predictions are made by a student who has graduated from University.

Unless you are a researcher developing new ML algorithms, you can use a variety of algorithms and families of algorithms that are already available today. However, the second point mentioned above is about data: you need some data you can apply the algorithms to. This is really crucial for the initial encounter with ML. Do you have any data that you can process? If you do, we get to another question: what can ML be used for?

Machine learning algorithms

In this section, we will briefly mention the traditional and most common use cases for ML, so that you can judge whether ML can help analyze your data.

If I oversimplify things, ML can be helpful in the following areas:

  1. Classification problems, where you need to classify some cases or observations into a given set of classes. This is usually referred to as binary or multiclass classification. For instance, a spam filter could be a good candidate, as could any other class assignment task.
  2. Regression problems, which are similar to classification, but the output is not a category or class from a discrete set, but rather continuous like a real number. Temperature forecasts, and stock price changes are two examples of this kind of analysis.
  3. Clustering, which often complements data mining, is helpful when you want to find some patterns in your data or divide the observations into groups, which are not known beforehand. For example, anomaly detection usually utilizes some algorithm from this family, where you want the ML to find what an anomaly actually looks like.

ML studies the algorithms that provide means for creating models. Before naming some of the most common ML algorithms, we will discuss the two typical classes of algorithms.

  1. Supervised learning algorithms, where you train your model based on cases with a known classification (the training set). These are used in classification and regression.
  2. Unsupervised learning algorithms, where you group the observations according to their properties. This is usually used for clustering and data mining. In this case you don't have any cases with known labels, but instead you want a model to find the patterns.

As we already know, the algorithms work on the data. We usually assume the following types of data:

  1. Training set - a data set of cases with a known classification (or more generally with a label), called labelled data. These data are used to build the model.
  2. Testing set - a different data set of cases with a known label, used for evaluation and validation of the model. This data is very important, especially due to a phenomenon called overfitting. It may happen that when you train a model, it memorizes the training set and as such gives perfect results on the training set, but behaves poorly for unknown cases. To validate each model, you should evaluate its accuracy, and of course there are many methods for doing that.

The overfitting problem is usually related to the so-called Occam's razor, which states that a simple explanation tends to be more valid than a complex explanation. In other words, an overcomplicated model is not necessarily the best one.

The question is also how to obtain a good and representative testing set. Usually, you will have just one labelled data set, but in the simplest scenario, it can be sampled to create two subsets - a training data set and a testing data set. Moreover, there are other techniques like bagging, boosting, and cross-validation that build or evaluate models based on random sampling of the data.
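The simplest scenario above, sampling one labelled data set into a training and a testing subset, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 25% test fraction, the toy even/odd data, and the deliberately bad "memorizer" model are all made up for this example and do not come from any particular library.

```python
import random

def train_test_split(cases, test_fraction=0.25, seed=0):
    """Shuffle the labelled cases and split them into training and testing subsets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, cases):
    """Fraction of cases for which the model predicts the known label."""
    correct = sum(1 for features, label in cases if model(features) == label)
    return correct / len(cases)

# Hypothetical labelled data: each case is (number, "even" or "odd").
data = [(n, "even" if n % 2 == 0 else "odd") for n in range(20)]
train, test = train_test_split(data)

# A "model" that only memorizes the training set: an extreme case of overfitting.
memory = dict(train)

def memorizer(number):
    # Unknown cases fall back to a blind guess.
    return memory.get(number, "even")

print("accuracy on training set:", accuracy(memorizer, train))
print("accuracy on testing set: ", accuracy(memorizer, test))
```

The memorizer is always perfect on the training set, because it has seen every one of those cases. Its score on the testing set is the only number that says anything about unknown cases.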

Please note that with unsupervised learning, there is usually no testing set, and the training set does not include the labels for the cases as it is what we want the ML algorithm to do for us.

If you look at some existing tools you will find many different algorithms that can be used for various use cases, but there are some algorithms and techniques you will find almost everywhere. We will briefly discuss them in the following subsections.

Decision trees

This set of algorithms is very popular and relatively easy to understand and interpret, because the resulting models can be seen as trees of if-then rules created from the training data. The induction of a decision tree is based on recursive partitioning of the training set, generally known as divide and conquer. I don't want to go into details, but as you can imagine, the rules for splitting are one of the main differentiators between methods and are also a means to tune them. A well-known term in decision tree terminology is pruning, which is a technique to avoid overfitting and is usually applied after a tree is fully grown.

The other advantage of decision trees is that they can be used for both classification and regression problems, and as such they are often used as expert systems. Speaking of experts, another family of decision tree based methods are ensembles of decision trees, sometimes called decision forests, where a set of trees votes on the final decision. This always reminds me of J.R.R. Tolkien's Ent decision makers of Middle-earth.
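To make the recursive partitioning idea concrete, here is a minimal sketch of a classification tree in Python. It is an illustration under stated assumptions, not a particular published algorithm: the splitting rule (Gini impurity), the `max_depth` limit used in place of real pruning, and the tiny weather-like data set are all choices made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows):
    """Return the (feature index, threshold) that lowers impurity the most, or None."""
    best = None
    best_score = gini([label for _, label in rows])
    for feature in range(len(rows[0][0])):
        for threshold in sorted({features[feature] for features, _ in rows}):
            left = [label for features, label in rows if features[feature] <= threshold]
            right = [label for features, label in rows if features[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, depth=0, max_depth=3):
    """Recursively partition the training rows (divide and conquer)."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        # Leaf: predict the majority class among the remaining rows.
        return Counter(label for _, label in rows).most_common(1)[0][0]
    feature, threshold = split
    left = [row for row in rows if row[0][feature] <= threshold]
    right = [row for row in rows if row[0][feature] > threshold]
    return (feature, threshold,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def predict(tree, features):
    """Follow the if-then rules from the root down to a leaf (labels must not be tuples)."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if features[feature] <= threshold else right
    return tree

# Hypothetical training set: (temperature, humidity) -> decision.
training = [((30, 85), "stay"), ((27, 90), "stay"), ((22, 70), "play"),
            ((20, 65), "play"), ((18, 95), "stay"), ((24, 60), "play")]
tree = build_tree(training)
print(tree)
print(predict(tree, (21, 62)), predict(tree, (25, 88)))
```

The printed tree is a nested tuple that reads directly as an if-then rule: test one feature against a threshold, then either return a class or test again.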

Neural networks

The human brain is fascinating and yet is still not fully understood. Researchers have been attracted by the brain and its core components - neurons - almost since the beginning of computer science. It sounds natural to build an artificial neural network and put it in an android, which will complete a new entity. Machine Learning has many models that try to simulate a human brain using a set of connected neurons. The simplest representative is the so-called perceptron, which is just a single neuron with multiple inputs and a single output. When you train such a neural network, you typically set a weight on each individual input, which means that some inputs are more important than others, and you make a final decision based on the combination of the weighted inputs. The output typically splits the space of observations into two sets separated by a hyperplane.
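A single perceptron is small enough to write out in full. The sketch below uses the classic perceptron learning rule on a made-up example, the logical AND of two inputs; the function names, learning rate, and number of epochs are assumptions chosen for illustration.

```python
def perceptron_output(weights, bias, inputs):
    """Weighted sum of the inputs followed by a hard threshold (output 0 or 1)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust the weights a little every time the perceptron misclassifies a sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - perceptron_output(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)

for inputs, target in and_samples:
    print(inputs, "->", perceptron_output(weights, bias, inputs), "expected", target)
```

The learned weights and bias define a line (a hyperplane in two dimensions) with the input (1, 1) on one side and the other three inputs on the other. A single perceptron can only learn problems that can be separated this way.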

The natural evolution of the single perceptron is a multilayer perceptron where you connect multiple layers of perceptrons. Each layer can communicate just with the neurons from the next layer and so on. The last layer provides the final outputs. This is just a single representation of neural networks, but there are a large number of methods that fit into this category. In recent years, a new term was introduced: deep learning, which in some literature is understood to be just a new (buzz)word for multi-layer neural networks.

Like decision trees, neural networks can be used for classification and regression, but you cannot easily interpret a neural network as a set of simple if-then rules. Neural networks can be very complex and nonlinear functions.

Naive Bayes

We mentioned spam detection earlier, and one well-known method was used in early spam detectors. This method is called the Naive Bayes classifier, and it is based on a simple property of conditional probability called Bayes' theorem.

Today's spam filters are much more advanced than this method, but still it can be helpful in certain types of applications.
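As a rough illustration of how such a classifier works, the sketch below scores each class by combining the prior probability of the class with the probability of every word given that class, which is Bayes' theorem plus the "naive" assumption that words are independent of each other. The four training messages, the function names, and the use of add-one smoothing are assumptions made for this example; it is not how any real spam filter is implemented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages):
    """Count messages per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in messages:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Pick the class with the highest log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical, tiny training set.
messages = [("win money now", "spam"), ("cheap money offer", "spam"),
            ("meeting at noon", "ham"), ("lunch meeting tomorrow", "ham")]
model = train_naive_bayes(messages)

print(classify(model, "cheap money"))
print(classify(model, "meeting tomorrow"))
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.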

k-Nearest Neighbours

This is a very simple and intuitive method, with lots of modified versions. Imagine you want to classify an unknown case into a class and you have groups of cases with a known classification. You can look at its k nearest neighbours according to some chosen distance metric and assign the class held by the majority of those k friends. Of course, this works only for some types of assignments.
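The whole method fits in a few lines. The sketch below assumes Euclidean distance as the chosen metric and k = 3; the function name and the two small groups of points are made up for illustration. It relies on `math.dist`, which is available in Python 3.8 and later.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Assign the class held by the majority of the k nearest training cases."""
    neighbours = sorted(training, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical cases with a known classification: two groups of 2-D points.
training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
            ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 7.0)))
```

Note that there is no separate training step here: the "model" is simply the stored training set, and all the work happens at prediction time.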


To name at least one representative of unsupervised learning, we mention the K-means algorithm. It partitions the space of observations into k clusters based on some defined properties that characterize each observation. At a minimum, you must specify how many clusters there are. Of course, there are many modifications of this algorithm. For example, adaptive k-means tries to find the optimal number of clusters according to some criterion.

If your problem is finding some patterns in your data, i.e. placing them into some categories, some kinds of unsupervised learning can help you. K-means is one of the simplest methods and is usually the first option that is tried.
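A minimal sketch of the basic K-means loop looks like this. It assumes Euclidean distance, picks the initial cluster centres by randomly sampling k of the observations, and runs a fixed number of iterations instead of checking for convergence; those choices, the function name, and the sample points are all assumptions made for illustration.

```python
import math
import random

def k_means(points, k, iterations=10, seed=0):
    """Alternate between assigning points to the nearest centre and moving the centres."""
    centroids = random.Random(seed).sample(points, k)  # assumed initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centre moves to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters = k_means(points, k=2)

for centre, cluster in zip(centroids, clusters):
    print(centre, cluster)
```

Notice that no labels are involved anywhere: the only thing you tell the algorithm is how many clusters to look for.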

Where to start

It all starts with data - you must have some data to play with. But do not expect that you can pick up a tool, feed it data, and magically have perfect output. Unfortunately, we are not there (quite) yet. This is typically the job of a data scientist, who looks at the data, extracts the valid properties, converts and transforms the data if needed, selects a relevant model, trains it, validates it, and deploys it. I believe that this will be one of the key roles of this millennium. It is very important to understand that ML is not just feeding the data into an algorithm, but that it also requires several other tasks, which can be more complex than the model training itself.

On the other hand, I am a big fan of various hobby markets and activities - like Arduino or Raspberry Pi. Even a very technical thing like Machine Learning deserves to have enthusiasts playing with stuff. Thanks to many open source projects, anyone can start playing with the basics even if they do not hold a Ph.D. and are not a professional data scientist; I think this is a good thing. From time to time I hear some voices stating that such technical things should be left to experts only, protecting their ivory towers. But I still think that opening such relatively new fields helps even the experts to see the patterns the public is seeking.

So, where can you learn more? If you are more interested in the theory, there are several great books available. When I was studying machine learning, one of the key publications, sometimes called the bible of statistical learning, was The Elements of Statistical Learning (ESL for short) by Hastie, Tibshirani, and Friedman. The other book I found interesting was Pattern Recognition and Machine Learning by Christopher Bishop. These two books are very technical and theoretical, and they require some non-trivial mathematical background. Recently, a friend of mine recommended another book, Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David, which is more condensed and more accessible than the previous two books, even for non-mathematicians. It is worth checking out (and you can download it for personal use).

If you are more practical than theoretical, there are several online courses available, on Coursera or on IBM Big Data University. You can also find many interesting videos on YouTube, like the series from Google or from Khan Academy. Such courses should give you a solid introduction and background if you are interested in this field.

After you have gone through some basic stuff, how can you start playing with it? What are some of the tools you could use? I would recommend the following ones (in no particular order), which are all available for free. Each of them deserves its own article, but let me be very brief.

  • R - I see R as the gold standard in statistics, machine learning, and other fields including data visualization. R is an open source project for statistical computations and visualization. Several machine learning algorithms are already included with the base install and many more are available as optional packages. R is also great because of its help system and sample data that you can use out of the box. R is also very popular in academia, which means that many brand new algorithms are implemented in the R language first and that there are many books that include examples in R, like An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. So if you want to be on the high end, R can be a good choice. Another advantage is R's great visualization capabilities, like the very popular ggplot2 package. Importantly, there are also several IDEs that can help you, like RStudio or R Commander. In our community, it is also important to mention that even IBM invests in R development and provides an R package that allows you to run R with IBM database management systems and appliances. R is therefore gaining popularity even outside academia, and recently in businesses as well.
  • Python - Python is another language that is very popular with students because it is a very popular first-time programming language. This may explain the rapidly increasing popularity of Python even in machine learning, which was the domain of other specialized languages including R. There are many libraries for Python, like pandas for data handling or scikit-learn for Machine Learning algorithms. A very attractive tool in the Python world is project Jupyter (formerly IPython), which provides an interactive and easy to use shell in a web browser. It is also worth saying that Jupyter is even available for R, and even Spark can be integrated with another popular project called Zeppelin. These so-called multipurpose notebooks seem like new IDEs for data science.
  • Apache Spark is a relatively new kid on the block. Spark is a very efficient and highly parallelizable framework, and a Machine Learning library is one of its core components that is included in the base install. Compared with R, Spark's Machine Learning library is not quite as mature, but it is evolving very quickly. Big players like IBM are investing in Spark because they see potential for their customers to use Spark in many future ML projects. The native programming language of Spark is Scala, and many new features are implemented in this language first. However, Spark can use other languages like Java, Python, or even R, which provides another reason to start learning about Machine Learning using R.

There are many other free tools available besides the ones I've just mentioned. Just to name a few without any further description - Weka, the Julia language and its Machine Learning packages, and Google's TensorFlow.


This article covers just a small fraction of the Machine Learning world, and I have left out many theoretical details and other broad technicalities. However, I hope that this article gave you at least some impressions, ideas, and a high level overview of what ML can provide to you. Just try to think about your data and possible use cases, try to learn, fiddle, and have fun!

Also, if you have any other useful tips or advice, please share them in the comments below. Everyone, including myself, will appreciate your feedback.