I have frequently remarked that “traditional” analysis of service events and the plotting of the resulting data are highly misleading, because the data are distributed neither symmetrically nor normally. A useful plotting tool for asymmetric, non-normal data distributions is the violin plot.
This article joins my series of articles concerning graphical management tools.
What is a violin plot?
It is not some nefarious scheme by which a fiddler will dominate the world. No, a violin plot is a graphic representation of the probability density of a sample of data. The violin plot is derived from the box plot and appears to have been first defined in 1998 by Jerry Hintze and Ray Nelson in “Violin Plots: A Box Plot-Density Trace Synergism”, The American Statistician 52/2 (May 1998) 181-184. Because it traces the density of the data, the information in a violin plot is also closely related to that in a histogram.
Violin plots and data distribution
Violin plots are interesting to use when the data density function is neither normal nor consistent from data sample to data sample. If data were always similarly distributed and, for example, normally distributed, all the violins would have similar shapes—wide in the middle and tapered at both ends. If that distribution were commonly known, there would be little benefit to using a violin plot as opposed to a box plot.
The interest in using violin plots is precisely when the distribution of data varies from sample to sample and especially when it is not normal.
In its simplest form, a violin plot graphically shows a distribution of data points in the form of an enclosed shape that roughly looks like the outline of a violin. Imagine a histogram where the bars have been center aligned, rather than being bottom aligned at the origin. The violin shape would trace the outline of the histogram’s bars. The bars themselves are not displayed.
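The center-aligned histogram described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `violin_outline`, the bucket width and the sample lead times are all hypothetical. Plotting tools usually smooth the outline with a density estimate rather than using raw bucket counts, so treat this as a way to see the idea, not as how any particular tool works.

```python
from collections import Counter

def violin_outline(values, bucket_width):
    """Return (bucket_start, left_edge, right_edge) rows for one violin.

    Each bucket's count becomes a bar centered on zero, so its edges
    sit at -count/2 and +count/2. The outline follows those edges.
    """
    counts = Counter(int(v // bucket_width) for v in values)
    outline = []
    for index in range(min(counts), max(counts) + 1):
        count = counts.get(index, 0)  # empty buckets have zero width
        outline.append((index * bucket_width, -count / 2, count / 2))
    return outline

# Hypothetical lead times in days
lead_times = [1.5, 2.1, 2.4, 2.8, 3.0, 3.3, 3.9, 4.2, 5.7, 9.6]

for start, left, right in violin_outline(lead_times, bucket_width=2):
    print(f"bucket from {start}: {left:+.1f} .. {right:+.1f}")
```

Reading the rows from the lowest bucket to the highest gives the changing width of the violin from bottom to top.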
A violin plot will normally have three axes. Although the violins can be oriented vertically or horizontally, I will assume a vertical orientation in this discussion.
The Y axis will show the range of values of the variable being measured, such as lead time. In other words, it shows the boundaries of the buckets of the histograms. The scale of the Y axis could be linear but might also be logarithmic, if that is useful.
The X axis will vary according to the segmentation of the data to be plotted. Suppose you wish to compare categories of values, with one violin per category. You would show each category on the X axis. Another possibility would be a diagram showing the evolution of values by period, say, by month. In that case, the X axis would show a series of months.
The Z axis would be parallel to the X axis (and not, as is common elsewhere, parallel to the Y axis). It would show the width of the violins, in other words, the number of data points in each bucket in the histogram of each violin. The Z axis scale would not be continuous. Instead, there would be a separate set of values for each segment of the X axis.
As you see in Fig. 3, labels on a Z axis are hard to read, hard to apply and add little value. If it is important to know the exact counts in each bucket in each segment, it is best to use a table of aggregated data instead of a diagram. The diagram is easier to use when trends and relative volumes are important.
Other statistics may be displayed graphically with each violin. A common use would be to show the median value of the data sample as a dot or an X positioned along a vertical line centered within the violin. The mean value would not be terribly interesting, given the non-normal distribution of the data.
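A small sketch shows why the median is the more informative marker for skewed data. The lead times below are invented for illustration; only the standard library `statistics` module is used.

```python
from statistics import mean, median

# Hypothetical lead times in days, with two long delays in the tail
lead_times = [2, 2, 3, 3, 3, 4, 4, 5, 21, 43]

print("median:", median(lead_times))  # middle of the sorted sample
print("mean:  ", mean(lead_times))    # pulled upward by the two long delays
```

In this sample, eight of the ten items took five days or less, yet the mean lands well above all eight of them. The median stays within the bulk of the data.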
Another statistic might be the confidence interval, shown by a different symbol positioned along that same vertical line. The top of the line would be 100%, the bottom 0%.
While there is no limit to the number of statistics that might be displayed, attention should be paid to the readability of the plot. As it is, violin plots display a very large amount of information in a very dense way.
Using violin plots
Let’s look at some examples of how violin plots might be used to support the management of services or of the flow of work. In general, when you want to analyze or communicate the probability density of a sample of continuous data or compare it to another analyzed sample, a violin plot can be very helpful.
Suppose you have a selection of hardware components of different models, say, hard disks, and you wish to compare their reliability. For each model, you would plot the age before failure for each disk, during a certain period of time. A violin plot would quickly show any anomalies in the failure distribution, such as excessive failures during burn-in or bumps in the violin before the explosion of failures at the end of the useful life of the model of disk. Furthermore, you could easily compare one model to another. Thus, violin plots would give a much more useful and sophisticated analysis of reliability than simply depending on misleading statistics, such as the mean time before failure.
Comparing lead times by team
Suppose there are several teams, each of which is performing a similar type of work. We may wish to compare how each team performs and identify particular issues affecting them.
With the vast majority of knowledge work, the probability density function of lead times can be approximated by a Weibull distribution. That is to say, there is a minimum lead time, below which any data points are likely to be data errors. For teams that perform well, the histogram quickly rises to a maximum and then tails off asymmetrically to the right. That tail will tend to be relatively long, depending on the exceptional cases that cause delays to work.
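To get a feel for this shape, lead times can be simulated with the standard library's `random.weibullvariate`. This is a minimal sketch: the function name `simulate_lead_times` and all parameter values (minimum, scale, shape) are assumptions chosen for illustration, not values fitted to real work.

```python
import random
from statistics import mean, median

def simulate_lead_times(n, minimum, scale, shape, seed=0):
    """Draw n lead times from a Weibull distribution shifted by a minimum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape
    return [minimum + rng.weibullvariate(scale, shape) for _ in range(n)]

# Hypothetical parameters: no item takes less than 1 day
lead_times = simulate_lead_times(n=5000, minimum=1.0, scale=5.0, shape=1.5)

print("shortest:", round(min(lead_times), 2))
print("median:  ", round(median(lead_times), 2))
print("mean:    ", round(mean(lead_times), 2))
print("longest: ", round(max(lead_times), 2))
```

With a shape value like this one, the mean sits above the median, which is the numeric signature of the long tail to the right.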
The violin plot will quickly show if the lead time distribution resembles this model. If not, there are probably serious dysfunctions either in the team’s work methods or in the data collection. It is very easy to compare the relative vertical positions of the violins, where better performing teams would have lower violins. It is also easy to see how reliably a team can perform. If the violin tapers off very gradually at the top, then team performance is less reliable than in cases where the violin tapers off quickly.
In this context, I may point out that the display of confidence intervals might be helpful in managing service levels. As I have discussed elsewhere, service levels should not be defined in terms of thresholds that are breached or not. Instead, they should be defined in terms of the probability that a given threshold might be breached.
For example, suppose a service level is defined for lead time for some type of work, with a probability of 95%. Since the violin plot typically shows a 95% confidence level, there is a visual indication of the volume of instances that go beyond that level. This might not be an ideal way of testing for service level compliance, but it does allow for an initial, visual approximation of service level status.
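The probabilistic definition of a service level can also be checked numerically. The sketch below assumes a service level of the form "95% of items are completed within a threshold". The function names, the threshold values and the sample data are hypothetical.

```python
def share_within(lead_times, threshold):
    """Return the fraction of lead times at or below the threshold."""
    return sum(1 for t in lead_times if t <= threshold) / len(lead_times)

def service_level_met(lead_times, threshold, target=0.95):
    """True if at least `target` of the items finished within the threshold."""
    return share_within(lead_times, threshold) >= target

# Hypothetical sample: 20 lead times in days, one of them very long
lead_times = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 30]

for threshold in (8, 10):
    share = share_within(lead_times, threshold)
    met = service_level_met(lead_times, threshold)
    print(f"within {threshold} days: {share:.0%}, service level met: {met}")
```

The violin gives the visual approximation. A calculation of this kind gives the figure to report.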
When is a violin plot not very useful?
As we know, some variables are discrete and some are continuous. A discrete variable is one that has a certain list of possible values, such as True or False, or Female/Male/Both/Neither/Other. A continuous variable is generally a numeric value that might have any value within a range, such as any real number or any positive integer.
Violin plots are useful when the data distribution (the changing width of the violin outline) concerns continuous variables. That being said, the values plotted on the X axis are typically discrete variables. Thus, the continuous variable might measure something like lead time, whereas the discrete variable might be something like month of the year or team name.
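In practice, this pairing means the raw records must be split into one sample of continuous values per discrete category before plotting. Here is a minimal sketch of that preparation step. The record layout (team name, lead time in days) and the function name `group_by_category` are assumptions for illustration.

```python
from collections import defaultdict

def group_by_category(records):
    """Map each discrete category to the list of its continuous values."""
    samples = defaultdict(list)
    for category, value in records:
        samples[category].append(value)
    return dict(samples)

# Hypothetical (team, lead time in days) records
records = [
    ("Team A", 3.2), ("Team B", 5.1), ("Team A", 2.7),
    ("Team C", 8.4), ("Team B", 4.9), ("Team A", 12.0),
]

samples = group_by_category(records)
for team, values in samples.items():
    print(team, values)  # each list would become one violin
```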
Suppose you want to analyze the distribution of customers by country of residence. Since this is a set of discrete values, a violin plot will not be very useful (but a map would be great)! Suppose you want to analyze individual performance by gender (alas! some people are interested in such questions). The level of performance could be represented by a continuous lead time, which would be a good subject for a violin plot. There would be one violin per gender. But the reverse would not be very useful.
Tools for violin plots
Violin plots:
- are not commonly known
- present a lot of different types of data, making them relatively complex to create
Anyone interested in statistical analysis and data plotting should probably know about, if not be a user of, R. R is able to generate violin plots when the ggplot2 package is loaded. Natively, R has a terminal-type interface. Various graphical interfaces also exist. For more information, consult the R project web site or read the book on R graphics by Winston Chang, R Graphics Cookbook: Practical Recipes for Visualizing Data, 2nd ed.
BoxPlotR is an online service. It allows an anonymous user to upload data, generate a diagram from that data and download an export picture of the diagram in eps, pdf or svg format. A built-in sample data set lets you see the functionality easily. In spite of the name of the service, it is able to generate violin plots as well as box plots.
The diagrams in this article were generated, in part, with BoxPlotR.
For those with Python programming skills, a visualization library called seaborn supports the creation of violin plots. That library also allows for the creation of a wide variety of other diagrams, covering many of the graphical needs of statistical analysis.
Another Python library that can draw violin plots is matplotlib. There may be other libraries that support creating violin plots. The availability of such libraries is sure to evolve.
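As a rough illustration of the Python route, the sketch below draws one violin per team with matplotlib's `Axes.violinplot`. The team names, the simulated lead times, the helper name `draw_violins` and the output file name are all assumptions made for this example, and it requires matplotlib to be installed.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed
import matplotlib.pyplot as plt

def draw_violins(samples):
    """Draw one vertical violin per category, each with its median marked."""
    labels = list(samples)
    fig, ax = plt.subplots()
    parts = ax.violinplot([samples[name] for name in labels], showmedians=True)
    ax.set_xticks(range(1, len(labels) + 1))  # violins sit at 1, 2, 3, ...
    ax.set_xticklabels(labels)
    ax.set_ylabel("Lead time (days)")
    return fig, parts

# Hypothetical, simulated lead times for three teams
rng = random.Random(1)
samples = {
    "Team A": [1 + rng.weibullvariate(4, 1.5) for _ in range(200)],
    "Team B": [1 + rng.weibullvariate(6, 1.5) for _ in range(200)],
    "Team C": [1 + rng.weibullvariate(6, 0.9) for _ in range(200)],
}

fig, parts = draw_violins(samples)
fig.savefig("lead_times_by_team.png")
```

The discrete variable (team) goes on the X axis and the continuous variable (lead time) on the Y axis. The width of each violin is computed by the library from the data.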
Another online service that generates all sorts of plots is plot.ly. Its hosted service is a commercial offering, although the Plotly graphing libraries themselves are open source.
To my knowledge there is no functionality inherent in any spreadsheet tool to generate violin plots, nor are there any add-ons that do so. That being said, it may be possible to simulate violin plots by judiciously preparing tables with the requisite values and generating charts on that basis. See, for example, this discussion.
We have seen that violin plots can be very effective analytical and communication tools when continuous data is assessed against discrete categories. The information they display is intuitively understood. The plots densely convey much information with little ink.
We can only hope that an increasing demand for such sophisticated plotting will result in more integration of violin plotting in standard tools, such as flow management tools, operations management tools and service management tools.