Jul 12, 2017
Tyler Hughes, Craftsman | Kingsmen Analytics™
The question of how to visualize data has been persistent throughout the history of information design. When one of my clients needs me to build a data visualization, it’s important that I know how to use a wide variety of tools so that I can tackle their problems using their available resources. However, Kingsmen Software also tries to stay on the bleeding edge of the tech industry. This requires us to explore emerging technologies and experimental tooling, instead of relying on potentially outdated industry staples. The same is true for data visualization. In this post, I will attempt to consolidate some of my learning on the topic. It’s important to understand that data visualization is a broad topic that covers much more than tooling alone. Telling a data-driven story to your given audience isn’t always as simple as making a bar chart. Building the right visualization for a dataset requires practice, intuition, and knowledge of information design principles, regardless of what tool you choose to use. With that disclaimer out of the way, here are a few of the tools I’ve used for data visualization along with my opinions on their overall effectiveness.
Considered by many to be the premier BI tool on the market, Tableau is a powerful visualization product that can scale relatively well into Big Data applications (albeit with a marked decrease in performance). Tableau hits its stride with medium-sized datasets that have been extracted from a database and are updated via batch processes. If you’re presenting a quarterly dashboard to an executive, Tableau will be a good friend for you. If you’re streaming real-time data, it might only be an acquaintance.
With a little bit of practice, creating calculated fields is intuitive and powerful. Being able to define a level of detail over which to aggregate is vital, and Tableau’s implementation of this feature is sufficient (even if I’ve never found a use for any of the keywords besides FIXED). Additionally, Tableau’s coding language is robust, and I rarely find it impossible to do something. It might take some amount of Googling, but the functionality is present. As cool as level-of-detail calculations are, calculating fields is just a step in preparing your data set - you could use any number of tools for that. Tableau’s all about its visualizations, and admittedly, it gets the job done. With enough practice, you can create a wide range of unique visualizations using the built-in graphs and data coloring, sizing, and shaping tools. The graphs are attractive and very customizable. One of my primary sticking points with Tableau is the dashboarding feature. Dashboarding is possible, but frustratingly clunky. Users should be able to drag and drop their graphs onto a canvas, but instead must try to navigate confusing layout menus that needlessly complicate the task.
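For readers who haven’t met level-of-detail calculations before, here is a rough sketch of what a FIXED expression does, written in plain Python rather than Tableau’s own language. The `orders` rows, the field names, and the `fixed_sum` helper are all made up for illustration. The idea is that a Tableau expression such as `{ FIXED [Region] : SUM([Sales]) }` computes one total per region and makes it available on every row, no matter how detailed the view is.

```python
from collections import defaultdict

# Hypothetical order rows: (region, customer, sales)
orders = [
    ("East", "Acme", 100.0),
    ("East", "Birch", 300.0),
    ("West", "Cedar", 50.0),
    ("West", "Dune", 150.0),
]


def fixed_sum(rows, key_index, value_index):
    """Total the value column per key, independent of the view's detail level."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Rough analogue of { FIXED [Region] : SUM([Sales]) }
region_totals = fixed_sum(orders, key_index=0, value_index=2)

# Attach the region-level total to each customer-level row and
# derive every customer's share of its region.
shares = [
    (region, customer, sales / region_totals[region])
    for region, customer, sales in orders
]

for region, customer, share in shares:
    print(f"{region:5} {customer:6} {share:.0%}")
```

In pandas, the same pattern is usually written as `df.groupby("region")["sales"].transform("sum")`, which is one reason the data preparation step can live outside the BI tool entirely.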
Tableau 10 allows for data blending right out of the box. It can connect to most data sources, but be aware that performance suffers with larger data sets, even when connecting to an extracted database. Level-of-detail calculations also exacerbate the performance loss. Finally, Tableau’s functionality costs a pretty penny. Depending on your needs, software licenses and workbook hosting can cost in excess of $100 per user per month.
Power BI is somewhat new to the BI tool scene. As a part of the growing suite of Microsoft productivity applications, it slots nicely into the Office 365 cloud. This gives it some unique functionality that might position it ahead of its competitors in future iterations. With the growth of Microsoft Azure, Power BI may soon be the front-end viz tool of choice for Azure users. This is particularly exciting with the advent of Azure Machine Learning Studio in mind; however, both Azure and Power BI have a long way to go before they can match the likes of Amazon Web Services and Tableau.
Connecting Power BI to a data source is intuitive and will be familiar to anyone who uses Excel. What’s more, it offers much of the same functionality as Tableau for creating calculated fields. It uses a language called DAX that is comparable to Tableau’s coding language. However, I’ve noticed a few shortcomings with DAX that preclude me from giving it a glowing recommendation. Some functionality is simply missing, such as the ability to do certain datetime calculations to the minute.
When it comes to visualization, Power BI might even have the upper hand over Tableau. It has a dashboarding system that works exactly as it should: dragging and dropping visualizations onto a canvas, resizing and repositioning as necessary. This is way more intuitive than what Tableau is offering. Power BI also has some pre-built dashboards for services like GitHub that do a lot of the analysis work for you.
All accolades aside, the overall workflow of Power BI is one of its biggest drawbacks. There are too many different screens the user works between to build the data model, and performing level-of-detail calculations often requires building entirely new tables. I’ve also had issues with Power BI not allowing me to define relationships between these tables like I should be able to. Overall, the process of going from data import to beautiful dashboard is, quite simply, more of a hassle with Power BI.
Power BI does get one more feather in its cap: the service is much cheaper than Tableau. You can download Power BI Desktop and get limited functionality for no cost at all. Depending on how much love the folks over at Microsoft give it, Power BI could be an important player at some point in the future.
Should you continue reading my blog posts, you’ll learn how much I love Python in short order. When you use a prebuilt tool like those previously mentioned in this article, it necessitates sacrificing some functionality. You are beholden to what the application was designed to do, not necessarily what you want it to do. Naturally, this is true with Python too. But the libraries I’m about to mention allow for a much greater deal of flexibility than with a BI tool alone, and you get the power of Python data processing with NumPy and pandas alongside your visualization platform.
A gallery of Bokeh examples can be found in the Bokeh documentation.
It’s hard to make predictions about the tools data scientists will be using in the next few years. The industry is evolving extremely rapidly, and keeping up with the trends is difficult even now. With that said, here are some developing tools that data scientists are keeping their eyes on.
Superset is a relatively new player in the BI tool space that has two advantages right out of the gate: it’s totally open-source and began development at Airbnb, one of the world’s analytics powerhouses. Originally known as Caravel, Superset is a flexible, powerful web tool built on a Python framework. It utilizes a query layer that couples SQLAlchemy (a Python library for querying a wide variety of databases) with a columnar distributed data store known as Druid. The result is a flexible tool that scales very well to Big Data applications. Getting the data is simple too – once you’ve connected to a database, you can query against it to create data sources in Superset. Visualizations are created from slices of these data sources, and filtering these slices uses SQL “WHERE” and “HAVING” clauses. So, there’s some impressive stuff going on under the hood, and the developers deliver that analytical oomph with a nice coat of paint. As far as BI tools are concerned, I believe that Superset has the most attractive visualizations, and many of the chart types available with Superset are missing from other popular BI tools. Additionally, dashboarding is a breeze – it works in a similar fashion to Power BI. By combining the power of Tableau with the effective dashboarding of Power BI, there’s a lot that Superset can offer your analytics practice.
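If the difference between row-level and aggregate-level filtering is fuzzy, this small sketch might help. It uses Python’s built-in `sqlite3` module and an invented `bookings` table. It is not the SQL that Superset itself generates, just an illustration of how a WHERE clause and a HAVING clause act at different stages of the same query.

```python
import sqlite3

# Hypothetical in-memory table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, nights INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [
        ("Charlotte", 2, 120.0),
        ("Charlotte", 5, 300.0),
        ("Raleigh", 1, 80.0),
        ("Raleigh", 1, 90.0),
        ("Asheville", 3, 450.0),
        ("Durham", 2, 100.0),
    ],
)

query = """
    SELECT city, SUM(price) AS revenue
    FROM bookings
    WHERE nights >= 2          -- row-level filter, applied before grouping
    GROUP BY city
    HAVING SUM(price) > 400    -- group-level filter, applied after aggregation
    ORDER BY city
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Here the Raleigh rows are dropped by WHERE before any totals are computed, while Durham survives the row filter but is dropped by HAVING because its total is too small.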
Your opinion on Superset will likely fall in line with your opinion on open-source technologies as a whole. Unlike with traditional software, you won’t get the program wrapped up for you with a bow on top. However, spinning up a server is not overly complicated, and Superset has some excellent documentation. You can find help in setting up Superset along with some nice example dashboards at the GitHub repository.
Datashader is such a cool project. It takes one of the most fundamental aspects of data visualization and flips it on its head. When plotting something, we put data “points” onto a grid composed of two axes that represent the ranges of the variables we’re plotting. Each “point” is a collection of pixels that forms whatever pretty shape we desire. That shape is centered on the values of the two axis variables. But thinking about this logically, we have mapped a continuous variable (if we’re looking at a range of real numbers) to a discrete scale. What discrete scale, you ask? The scale represented by the pixels that make up our computer screen.
When we abstract out the notion of a pixel from our tool, we’ve lost a potential degree of freedom for presenting our data. It matters little for data sets with 1000 or so data points, but what if we’re considering 100,000? 1,000,000? 1,000,000,000? The programmers of datashader had the idea to aggregate over the grid of pixels. If multiple data points would fall within a single pixel, brighten the color of that pixel to indicate intensity/density of the data set at that point.
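To make the idea concrete, here is a minimal pure-Python sketch of aggregating points over a pixel grid. It is not datashader’s API or implementation. The function names `aggregate_to_pixels` and `shade`, the character ramp, and the random test data are all my own assumptions, chosen only to show the binning step.

```python
import random


def aggregate_to_pixels(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # point lies outside the visible window
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade(grid, ramp=" .:*#"):
    """Map each pixel count to a character; denser pixels get 'brighter' symbols."""
    peak = max(max(row) for row in grid) or 1
    top = len(ramp) - 1
    lines = []
    for row in grid:
        # Ceiling division so that any non-empty pixel is visible
        lines.append("".join(ramp[-(-count * top // peak)] for count in row))
    return lines


# Hypothetical data: 100,000 points drawn from a 2D normal distribution
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

grid = aggregate_to_pixels(points, (-3, 3), (-3, 3), width=40, height=20)
for line in reversed(shade(grid)):  # row 0 is the bottom of the plot
    print(line)
```

However many points go in, the output is always a fixed 40 by 20 grid of counts. Calling `aggregate_to_pixels` again with narrower ranges re-bins the same points at a finer scale, which is the essence of zooming in this approach.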
If some of this isn’t quite clear, don’t worry. I think that datashader plots speak for themselves. Feel free to ooh and aah as much as you want at those census plots.
Datashader lets a user zoom into its plots without losing meaning – the aggregate calculations are mapped to the grid of pixels, so zooming in will remap those calculations. By integrating with Dask, another Big Data Python library, this all happens quickly and efficiently, even with millions of data points. The datashader module is still very much a work in progress, but early signs indicate that it could shake up the way we visualize data at a fundamental level.
The analytics world is evolving rapidly, and it’s difficult to tell which of these tools will continue to be useful a couple of years from now, and which ones will flop. Hopefully this post is sufficiently helpful in finding a tool that meets your current needs whilst simultaneously getting you to think about the future. Thanks for reading, and look forward to more posts from me regarding all things analytics.
Kingsmen Software is a software development company that crafts high-quality software products and empowers their clients to do the same. Their exclusive approach, The Kingsmen Way, combines processes, tooling, automation, and frameworks to enable scalability, efficiency, and business agility. Kingsmen’s craftsmen can be found collaborating in the historic Camden Cotton Mill at the corner of 5th and Graham in Charlotte, NC.