Jun 07, 2017
Tyler Hughes, Craftsman | Kingsmen Analytics™
If you’re an experienced Python developer, it’s unlikely that I’ll need to sing Jupyter’s praises to you. It’s a fantastic tool for general exploration as well as documenting programs. Sure, it probably isn’t going to be your platform for hardcore development, but it’s way more than a niche product.
For the uninitiated, Jupyter Notebook is an application that works as an amalgam of the IPython shell and a lightweight IDE. The notebook is made up of cells that run independently as single code snippets: type some code in a cell and execute it, and those variables are now available to use in the rest of your program. Cells can be copied, pasted, deleted, turned into markdown - it’s this modular nature of Jupyter that really endears it to me and other data scientists the world over. If you need a clean slate, just restart the kernel and re-run what you need.
As you may have guessed, since Jupyter runs on the IPython kernel, you also have access to IPython’s bevy of magic functions. Cell magics are very useful. One of my favorite use cases is timing individual cells when writing machine learning algorithms. Once I know how long each of my algorithms takes to run, I can choose the one that strikes the right balance between predictive power and computational efficiency.
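To illustrate the idea, here is a minimal sketch of cell timing. The `%%timeit` cell magic only runs inside IPython/Jupyter, so it appears below as a comment; the standard-library `timeit` module performs the equivalent measurement in a plain script. The `train_step` function is a hypothetical stand-in for an algorithm’s training step, not anything from the post.

```python
import timeit

# In a notebook, you would simply start the cell with the %%timeit cell magic:
#
#     %%timeit
#     model.fit(X_train, y_train)
#
# Outside Jupyter, timeit from the standard library gives the same measurement.
def train_step():
    # Hypothetical stand-in for a training step.
    return sum(i * i for i in range(10_000))

elapsed = timeit.timeit(train_step, number=100)
print(f"100 runs took {elapsed:.4f} s")
```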
I do most of my analytics in Jupyter Notebook, including algorithm builds. In a single file, I can bring in a dataset, cleanse it, do some basic statistics, plot it, and even train and test an algorithm. Better yet - I can bring this to the business and run the notebook. They can see each step of the process and watch plots render inline with the rest of the code. Jupyter allows for excellent documentation with support for markdown language. Jupyter notebook is one of the most elegant ways to deliver analytical insights to an organization, and its beauty scales with your Python expertise.
Let’s see how this all works. I’m going to be bringing in a public dataset that’s commonly used in machine learning applications called the Iris dataset. If you’ve studied machine learning in Python before, you’re probably sick of seeing people use the Iris dataset. But it’s the “Hello World” of machine learning at this point, so who am I to fight tradition?
The Iris dataset is composed of 150 records of Iris classification data. There are four features: sepal length, sepal width, petal length, and petal width. The dataset has information for three classes of Iris: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. The dataset is so widely used because of the exceedingly high value of petal length and petal width as predictive features, each of which has ~95% correlation with the class labels.
Basically, if you want to look like you know how to do machine learning, train an algorithm with the Iris dataset. It’s worked for scores of data scientists in the past (this one included). The dataset and information about it can be found at the UCI Machine Learning Repository.
Alright, enough talk - let’s do this thing.
Jupyter notebook is a great organizational tool. Headers make the overall presentation look more attractive, and it’s easy to give a short introduction before diving into the code. As far as pulling in the data is concerned, I’ve coded one cell for importing libraries and another for accessing the data and formatting it. Each step of the process is compartmentalized. It’s obvious what code led to the table at the bottom because our Pandas dataframe renders inline with the rest of our code. It’s hard to overstate the value of this, and as we’ll see later, plots behave the same way.
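The original notebook cells aren’t reproduced here, but a minimal sketch of those two cells might look like the following. I’m assuming the data is loaded from scikit-learn’s built-in copy of Iris rather than a downloaded CSV, and the column names and the `Iris-` label prefix are my own formatting choices to match the UCI version:

```python
# Cell 1: imports
import pandas as pd
from sklearn.datasets import load_iris

# Cell 2: load the data and format it as a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal length", "sepal width",
                                      "petal length", "petal width"])
df["class"] = ["Iris-" + iris.target_names[t] for t in iris.target]

df.head()  # in Jupyter, the DataFrame renders inline as a table
```

Leaving `df.head()` as the last expression in the cell is what makes the table render directly beneath the code.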
If we want to train a model on this data, our class labels need to be encoded to integers. Also, we’ll be doing a simple model for demonstration purposes here, so let’s only consider two Iris classes. Scikit-learn, Python’s excellent machine learning module, has a lot of tools that can help us accomplish these tasks. One such tool is LabelEncoder, which will let us map our Iris classes to integers and change them as needed.
We’ve taken the first 100 rows of the dataset, corresponding to two Iris classes: Iris-setosa and Iris-versicolor. The df.shape attribute confirms this. Additionally, we see the functionality of LabelEncoder().inverse_transform(). The original call to LabelEncoder turned our string-formatted class labels into 0 and 1 respectively, which we confirm with df.dtypes. Now we can easily get them back into their original forms. I’ll use this later to properly format a graph legend.
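A sketch of that encoding step, under the same assumption that the data comes from scikit-learn’s built-in copy of Iris:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
# First 100 rows: Iris-setosa and Iris-versicolor only.
df = pd.DataFrame(iris.data[:100], columns=["sepal length", "sepal width",
                                            "petal length", "petal width"])
df["class"] = ["Iris-" + iris.target_names[t] for t in iris.target[:100]]

le = LabelEncoder()
df["class"] = le.fit_transform(df["class"])  # strings -> 0 and 1

print(df.shape)                      # (100, 5)
print(df["class"].dtype)             # an integer dtype, as df.dtypes would show
print(le.inverse_transform([0, 1]))  # back to the original string labels
```

LabelEncoder assigns integers in sorted label order, so Iris-setosa becomes 0 and Iris-versicolor becomes 1, and `inverse_transform` reverses the mapping for things like plot legends.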
After splitting into separate training and test sets, we standardize our feature sets using sklearn.preprocessing.StandardScaler(). This is another very useful offering from scikit-learn as scaling is a requirement for many machine learning algorithms.
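The post doesn’t show the split parameters, so the test-set fraction, random seed, and the choice of petal length/width as the two features below are my assumptions; the fit-on-train, transform-on-test pattern is the important part:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data[:100, [2, 3]], iris.target[:100]  # petal length & width, two classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)  # learn mean/scale from training data only
X_test_std = sc.transform(X_test)        # apply the same transformation to the test set

print(X_train_std.mean(axis=0).round(6))  # ~[0. 0.]
print(X_train_std.std(axis=0).round(6))   # ~[1. 1.]
```

Fitting the scaler only on the training set keeps information about the test set from leaking into the model.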
We’re ready to train a model. Let’s train a Support Vector Classifier, or SVC. While we do so, I’ll continue to show some of the features of Jupyter that make it so useful.
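The exact hyperparameters aren’t shown in the post; a linear kernel with the default `C=1.0` is a plausible setup for this data, so that’s what the sketch below assumes (along with the same split and scaling as above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:100, [2, 3]], iris.target[:100]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

svm = SVC(kernel="linear", C=1.0, random_state=0)
svm.fit(X_train_std, y_train)

print("train accuracy:", svm.score(X_train_std, y_train))  # 1.0 on this easy data
print("test accuracy:", svm.score(X_test_std, y_test))     # 1.0 as well
```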
We can write mathematical expressions in markdown for documentation! This is something that Kingsmen Analytics is going to be using as part of our employee training process - execute code alongside the literature to reinforce learning.
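For example, a markdown cell can document the model’s math using LaTeX between `$$` delimiters. Here is one plausible expression for a linear SVC’s decision function (my choice of notation, not necessarily what the original notebook contained):

```latex
$$ f(\mathbf{x}) = \operatorname{sign}\left( \mathbf{w}^{\top}\mathbf{x} + b \right) $$
```

Jupyter renders this via MathJax, so the formula sits right next to the code that implements it.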
Let’s make a plot to visualize how our algorithm is classifying things. We’ll see some Jupyter inline plotting as well. The plot_decision_regions() function comes from Sebastian Raschka’s book Python Machine Learning. It’s a fantastic book for learning and reference. He has a GitHub repository where you can pull down some of his Jupyter notebooks, many of which come from his book. However, I still highly recommend giving Python Machine Learning a read if you’re looking to learn this stuff.
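Raschka’s `plot_decision_regions()` isn’t reproduced here; the sketch below implements the same idea from scratch - predict the class at every point of a fine grid, then shade the regions with `contourf`. In a notebook you would use `%matplotlib inline` instead of the `Agg` backend, and all parameter choices (grid resolution, colors) are my own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; in Jupyter, use %matplotlib inline instead
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:100, [2, 3]], iris.target[:100]
X_std = StandardScaler().fit_transform(X)

svm = SVC(kernel="linear", C=1.0).fit(X_std, y)

# Evaluate the classifier on a grid covering the feature space,
# then shade each region by its predicted class.
xx, yy = np.meshgrid(
    np.arange(X_std[:, 0].min() - 1, X_std[:, 0].max() + 1, 0.02),
    np.arange(X_std[:, 1].min() - 1, X_std[:, 1].max() + 1, 0.02))
Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_std[:, 0], X_std[:, 1], c=y, edgecolor="k")
plt.xlabel("petal length (standardized)")
plt.ylabel("petal width (standardized)")
plt.savefig("decision_regions.png")
```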
We achieved perfect accuracy on our training set! The different colored areas of the plot represent how the algorithm would classify a data point that fell within the region. We see that this SVC found a hyperplane that perfectly separated the two classes. However, this dataset is so nice that we could have done the same thing at a glance. Now, let’s see how the model does on the test set.
Perfect classification. Obviously, results may vary when trying to use these algorithms for your own applications. This dataset is known to be extremely easy to work with. If you’ve only recently been exposed to machine learning, I would recommend pulling the dataset into a Jupyter notebook and playing around with it. It’s a great way to get your feet wet.
Hopefully you’ve seen the usefulness of Jupyter. Having the entire process I just went through in an organized, documented form that can be shared with others is something that’s much harder to achieve with a plain IDE. If you’re interested in checking it out for yourself, I recommend the Anaconda distribution. It’ll set everything up for you, including IPython and a lot of important modules.
This was a pretty simple post for the first one - be on the lookout for more posts from me on all things Kingsmen Analytics.
Raschka, S. (2015). Python Machine Learning. Packt Publishing.
Kingsmen Software is a software development company that crafts high-quality software products and empowers their clients to do the same. Their exclusive approach, The Kingsmen Way, combines processes, tooling, automation, and frameworks to enable scalability, efficiency, and business agility. Kingsmen’s craftsmen can be found collaborating in the historic Camden Cotton Mill at the corner of 5th and Graham in Charlotte, NC.