Separating Hyperplanes - Why Mathematicians Make Bad Bloggers

Aug 29, 2017

Tyler Hughes, Craftsman | Kingsmen Analytics™

The first blog I wrote, Using Jupyter Notebook for Powerful, Stunning Data Analysis, was, regrettably, a flawed endeavor. I wanted to blog about analytics, but floundered for a topic for days. Machine learning is the buzzword on everyone’s lips, so I knew I wanted to write about it. Additionally, I wanted the post to be relatively simple for the audience, and also for the newbie blogger who was writing it. This is where I erred. The problem I tackled involving the Iris dataset truly was simple - that dataset is designed to respond well to machine learning. However, the blog was riddled with terminology that would only make sense to those familiar with the jargon and foundations of the field. This became apparent to me when one of our managing partners, Kevin Carney, reposted the blog on LinkedIn with the following comment:

Tyler Hughes, I don’t know what “using a Support Vector Machine algorithm to find the hyperplane for linearly separable data” means, but I’m glad YOU do!

It’s time I took a step back to explain foundational principles of machine learning. You see Kevin, it’s really quite simple:

[Image: the hyperplane separation theorem, as stated on Wikipedia]

Glad I could clear that up…

Obviously I’ve only furthered your confusion on the topic. But, if you type “what is a separating hyperplane” into Google (which is what I did), this is the kind of language you’ll have to cope with. Explanations like this are only useful to geometric topologists and people who like to pretend they’re smart by doing fancy math on office whiteboards.

When I saw Kevin’s LinkedIn post, I realized that machine learning, along with other data analytics topics, is intimidating to people who haven’t studied it. It’s hard to blame people for feeling overwhelmed when the lessons are as convoluted as the Wikipedia article I snagged the hyperplane separation theorem from.

Thus, I will write this and future entries with a renewed purpose: to communicate advanced data analytics concepts to as broad an audience as possible. Let this post serve as a first step towards that goal as we take a principled approach to machine learning, starting with addressing Kevin’s quandary through an in-depth look at linear classifiers.

The Linear Classifier

Classification is simply the act of drawing a distinction between two different things. When we apply a classification algorithm to a dataset, we are making a hypothesis that our data is sufficient to distinguish between the different groupings of samples. What’s more, we know what the groupings are via the class labels. In Using Jupyter Notebook for Powerful, Stunning Data Analysis, the class labels were the species of Iris - it’s the thing we’re trying to classify. When we have labeled data, i.e. data with a class label, we’re performing a supervised learning task.

The rest of the information, i.e. petal width and petal length, forms what’s called a feature set. Our classification algorithm ingests the features in the feature set, and tries to develop an equation that will reproduce the answers given by the class labels. Given enough data, the algorithm is often able to accomplish this task. When it does, we say we have fit our classifier.

This brings us back to Kevin’s quandary. Linear binary classifiers seek to distinguish between two different class labels by finding a line that perfectly separates the data into two groups. Pretty simple, right? When I plotted the decision regions of my Linear Support Vector Machine in Using Jupyter Notebook for Powerful, Stunning Data Analysis, that line lies at the border between the differently colored regions. It is this straight, unintimidating line that mathematicians have decided to name the separating hyperplane. That’s the big mystery. It’s just a line that separates two groups of data. So why the fancy name and convoluted theorem?
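To make the idea concrete, here is a minimal sketch in plain Python. The measurements, the labels, and the line’s weights `w` and offset `b` are all made up for illustration (they are not taken from the Iris analysis). The point is only that classifying a sample amounts to asking which side of the line it falls on.

```python
# Toy feature set: (petal length, petal width) in cm. Hypothetical values.
features = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (5.0, 1.7)]
# Class labels: -1 for one species, +1 for the other. Also hypothetical.
labels = [-1, -1, -1, 1, 1, 1]

# A hand-picked line: w[0]*length + w[1]*width + b = 0.
w = (1.0, 1.0)
b = -4.0


def predict(sample, w, b):
    """Return +1 or -1 depending on which side of the line the sample lies."""
    score = w[0] * sample[0] + w[1] * sample[1] + b
    return 1 if score >= 0 else -1


predictions = [predict(x, w, b) for x in features]
# True when the line puts every sample on the side its label says it belongs.
print(predictions == labels)
```

A real classifier would learn `w` and `b` from the data rather than have them typed in by hand.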

Mathematicians often formalize concepts so that they extend to the general case. In this rather simple case where we’re plotting our data in two dimensions, the hyperplane is just a line. However, if we were to add a third dimension to the dataset, say sepal width, a line can no longer separate our data. We need a flat, two-dimensional plane instead. Add a fourth feature and we need a three-dimensional boundary, and so on: the separator always has one dimension fewer than the data, and “hyperplane” is the umbrella term for all of these cases. This is the source of the confusion. Mathematicians aren’t known for their communication skills.
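In symbols, the same equation simply gains a term with each new feature. The notation here is an assumption for illustration, not something from the original analysis: the $x_i$ are the features, the $w_i$ are weights, and $b$ is an offset.

```latex
\begin{aligned}
\text{2 features (a line):} \quad & w_1 x_1 + w_2 x_2 + b = 0 \\
\text{3 features (a plane):} \quad & w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0 \\
\text{$n$ features (a hyperplane):} \quad & w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0
\end{aligned}
```

Points that make the left-hand side positive fall on one side of the boundary, and points that make it negative fall on the other.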

In order to build a robust algorithm that generalizes to other datasets, it isn’t enough to draw a line in between two clusters of data and call it a day. The ideal hyperplane divides the two clusters of data while maximizing its distance to the individual data points. Think of it as giving the classes some “wiggle room”. Refer back to my Iris analysis and you’ll see that the model could have accurately predicted an Iris-setosa with slightly larger and wider petals than it had seen in the training data. This is due to the expansive area on either side of the hyperplane that separates the clusters of data. It’s a prime example of why the Iris dataset is so widely used - linear classifiers converge very easily. To illustrate this concept further, let’s look at the diagram below.

[Figure: two classes of data points with two candidate separating hyperplanes, A and B]

Both A and B are hyperplanes that separate the data. That fancy hyperplane separation theorem above doesn’t say that the hyperplane has to be ideal, only that disjoint (non-overlapping) sets of data with a convex shape will always have at least one hyperplane that separates them. Obviously, hyperplane A is superior for this dataset because it distinguishes between the classes and has more “wiggle room”. Though each of these models would have the same accuracy in classifying this set of data points, which we call the training set, the model converging to hyperplane A would likely generalize more effectively to other data.

Which brings us to the purpose of linear classification algorithms, like the Support Vector Machine from our previous exercise in irises. Quite simply, a linear classifier will try to find the best hyperplane it can that perfectly separates the data into two groups. How does it do this? It maximizes the margin, which is the distance between the hyperplane and the data samples closest to it. Those data points that lie at the margin are the so-called support vectors that lend the algorithm its name.
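The margin can be computed directly. The sketch below uses made-up points and two hand-picked candidate lines (illustrative stand-ins for A and B, not the ones in the diagram). It measures each line’s distance to its closest sample with the standard point-to-line distance, |w·x + b| / ||w||.

```python
import math

# Hypothetical 2-D samples: the first three belong to one class, the rest to the other.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]


def distance_to_line(point, w, b):
    """Perpendicular distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])


def margin(points, w, b):
    """Distance from the line to the closest sample."""
    return min(distance_to_line(p, w, b) for p in points)


def support_vectors(points, w, b, tol=1e-9):
    """Samples that sit exactly at the margin (within a small tolerance)."""
    m = margin(points, w, b)
    return [p for p in points if abs(distance_to_line(p, w, b) - m) < tol]


# Two candidate separating lines, each given as (w, b).
line_a = ((1.0, 1.0), -5.5)  # x + y = 5.5, runs midway between the clusters
line_b = ((1.0, 1.0), -3.5)  # x + y = 3.5, hugs the lower-left cluster

print(margin(points, *line_a), margin(points, *line_b))
print(support_vectors(points, *line_a))
```

Both lines classify every point correctly, but the first leaves far more room on either side, which is exactly the property the SVM is looking for.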

This is the theoretical basis for the SVM. If we maximize the margin, we’ll find the hyperplane that best separates our data. Distilling this logic into a mathematical expression brings us to a fundament of machine learning: the objective function, or the function which, when minimized, yields the best results in predicting the target variables:

[Image: the SVM objective function]

How many of you read that and started stressing out? Luckily, this is the part scikit-learn does for you. Some very kind data scientists have coded this objective function for our use, so no need to reinvent the wheel. But, it’s helpful to understand each term in this equation. The w represents the algorithm’s weights. Put simply, it represents the importance, or emphasis, the algorithm places on each input variable (e.g. iris petal length), based on its effectiveness in predicting the target variable (e.g. iris species). w is what machine learning attempts to “learn”. At each iteration, the algorithm will try some weights, see how well they do in classifying the training data, then rework the weights and try again. Many classification algorithms follow this same premise, and differ only in the implementation details and specific objective function.
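For readers who want the symbols spelled out, one common textbook way to write a linear SVM objective is sketched below. Treat the exact form as an assumption rather than a transcription of the expression above: here $x_i$ is a sample’s features, $y_i \in \{-1, +1\}$ is its class label, $b$ is an offset, $n$ is the number of samples, and $\lambda$ is a constant that sets the strength of the second term.

```latex
\min_{w,\,b} \; \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w \cdot x_i + b)\bigr) \;+\; \lambda \,\lVert w \rVert^2
```

In this form, the first term is zero for any sample that sits on the correct side of the hyperplane with room to spare, and it grows as a sample drifts toward or across the boundary.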

The second term in the SVM objective function represents something called regularization. It’s out of scope for this blog entry, but in short it allows our model to generalize to other data more effectively. Expect a more fleshed-out description in a future posting.
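To see the “try some weights, score them, rework them” loop in action, here is a minimal sketch in plain Python. It is not scikit-learn’s solver: it assumes an objective of the common textbook form (average hinge loss plus a regularization term) and minimizes it with plain subgradient descent. The toy data, `lam`, `learning_rate`, and `epochs` are all made-up values.

```python
# Hypothetical, linearly separable toy data: (features, label in {-1, +1}).
data = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
        ((4.0, 4.0), 1), ((5.0, 4.0), 1), ((4.0, 5.0), 1)]


def objective(w, b, data, lam):
    """Average hinge loss plus a regularization term on the weights."""
    hinge = sum(max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b)) for x, y in data)
    return hinge / len(data) + lam * (w[0] ** 2 + w[1] ** 2)


def train(data, lam=0.01, learning_rate=0.05, epochs=500):
    """Subgradient descent: nudge w and b downhill on the objective each pass."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        # Start from the gradient of the regularization term.
        grad_w = [2 * lam * w[0], 2 * lam * w[1]]
        grad_b = 0.0
        for x, y in data:
            # Only samples that are misclassified or too close to the line contribute.
            if y * (w[0] * x[0] + w[1] * x[1] + b) < 1:
                grad_w[0] -= y * x[0] / n
                grad_w[1] -= y * x[1] / n
                grad_b -= y / n
        w = [w[0] - learning_rate * grad_w[0], w[1] - learning_rate * grad_w[1]]
        b -= learning_rate * grad_b
    return w, b


w, b = train(data)
predictions = [1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1 for x, _ in data]
print(predictions == [y for _, y in data])
print(objective(w, b, data, lam=0.01) < objective([0.0, 0.0], 0.0, data, lam=0.01))
```

The last line compares the objective at the learned weights against the objective at the all-zero starting point, which is the “see how well they do” step reduced to a single number.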

The best thing about the Linear SVM is that it always converges for linearly separable data, i.e. data that meets the conditions of the hyperplane separation theorem. What happens when it encounters data that can’t be separated by a straight line? Well, a version that insists on a perfect separation proceeds to lose its sanity and keeps trying to find the separating hyperplane forever. Seriously. It’ll just keep going if you let it. That’s why we must always give the Linear SVM a maximum number of iterations to perform before stopping. We have to reassure it that nobody’s perfect, and everything will be okay.
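The “keeps trying forever” behavior is easiest to see with a simpler learner. The sketch below swaps in a perceptron-style update rule (an assumption for illustration, not scikit-learn’s SVM solver) that only stops early once every training sample is classified correctly. The two data sets and the `max_iter` cap are made up.

```python
def train_perceptron(data, max_iter=1000):
    """Repeat passes over the data until a pass has no mistakes or max_iter is hit.

    Returns (weights, bias, passes_used, converged).
    """
    w, b = [0.0, 0.0], 0.0
    for passes in range(1, max_iter + 1):
        mistakes = 0
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # sample is misclassified
                # Rework the weights: pull the line toward the correct side.
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, passes, True
    return w, b, max_iter, False


# Separable: one straight line can split the two classes.
separable = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 4.0), 1)]
# Not separable (XOR pattern): no straight line can split these labels.
not_separable = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((0.0, 1.0), 1), ((1.0, 0.0), 1)]

print(train_perceptron(separable)[3])
print(train_perceptron(not_separable)[3])
```

On the separable set the loop stops on its own. On the XOR set it only stops because `max_iter` runs out.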

Nevertheless, this is a severe limitation of the SVM. It’s very useful as a learning tool; it’s more readily understood than other, more useful linear classifiers, which often require some knowledge of calculus. Logistic regression is a more robust binary classification algorithm that you should definitely try in your own machine learning practice.

In Summary…

Let’s look at Kevin’s quote one more time:

Tyler Hughes, I don’t know what “using a Support Vector Machine algorithm to find the hyperplane for linearly separable data” means, but I’m glad YOU do!

A hyperplane is just a line with a fancy name. Mathematicians call it a hyperplane to generalize the theory to scatter plots in more than two dimensions. For binary classification, the separating hyperplane is just the line that separates two clusters of labeled data. The Support Vector Machine is just one of the many algorithms that can help find the best separating hyperplane for a data set. It does so by maximizing the hyperplane’s “wiggle room” - its distance to the dataset’s support vectors, the closest data points of each class. The SVM works best with linearly separable data, i.e. data that can be perfectly separated by a straight line, because its weights will converge. For non-linearly separable data, we should usually find a different classifier to use.

I hope this has clarified some of the material from my first blog post, which I implore you to check out if you haven’t already. It should be more understandable armed with the knowledge gleaned from this entry. As always, look for more posts from me in the future on all things analytics.

About Kingsmen Software

Kingsmen Software is a software development company that crafts high-quality software products and empowers their clients to do the same. Their exclusive approach, The Kingsmen Way, combines processes, tooling, automation, and frameworks to enable scalability, efficiency, and business agility. Kingsmen’s craftsmen can be found collaborating in the historic Camden Cotton Mill at the corner of 5th and Graham in Charlotte, NC.
