---
title: "Violin plots are great"
date: '2015-09-26'
description: Fiddling with a visualization technique that's not common in psychology (yet).
categories:
  - Data Science
tags:
  - R
  - Dataviz
  - Statistics
---

Anybody who's used the ggplot2 package in R is probably familiar with
`geom_violin`, which creates [violin
plots](https://en.wikipedia.org/wiki/Violin_plot). I'm not going to reproduce
the Wikipedia article here; just think of violin plots as mirrored, sideways
density plots (which themselves are basically smooth histograms). They look
like this:

```r
library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin()
```

![Violin plot of mpg by cylinder](unnamed-chunk-1-1.png)

I wasn't familiar with this type of plot until I started using ggplot2. Judging
by a recent conversation with friends (because my friends and I are nerds),
that's not uncommon among people coming from psychology (but check out Figure 1
in the [recent report on reproducibility in
psychology](https://osf.io/ezcuj/wiki/home/) by the Open Science
Collaboration).

That's a shame, because this is a useful technique; a quick glance shows
viewers the shape of the distribution and lets them easily estimate its mode
and range. If you also need the median and interquartile range, it's simple to
display them by overlaying a box plot (violin plots are usually made this way).

```r
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(aes(fill = factor(cyl))) +
  geom_boxplot(width = 0.2)
```

![Violin plots with overlaid box plots](unnamed-chunk-2-1.png)

### Violin plots vs. bar graphs

Violin plots aren't popular in the psychology literature--at least among
vision/cognition researchers. Instead, it's more common to see bar graphs,
which throw away nearly all of the distributional information that a violin
plot shows.

```r
library(dplyr)
mtcarsSummary <- mtcars %>%
  group_by(cyl) %>%
  # Standard error of the mean: sqrt(variance / n)
  summarize(mpg_mean = mean(mpg),
            mpg_se = sqrt(var(mpg)/length(mpg)))

ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
  geom_bar(stat = "identity")
```

![Bar charts of mpg by cyl](unnamed-chunk-3-1.png)

Bar graphs highlight a single statistic of the analyst's choosing. In
psychology (and many other fields), researchers use bar graphs to visualize the
mean of the data, and usually include error bars to show the standard error (or
a confidence interval).

However, when audiences see bar graphs of means, they erroneously judge values
that fall *inside* a bar (i.e., below the mean) as more probable than values
equidistant from the mean but outside a bar ([Newman & Scholl,
2012](http://perception.research.yale.edu/papers/12-Newman-Scholl-PBR.pdf)).
This bias shouldn't affect violin plots because the region inside the violin
contains *all* of the observed data. I'd guess that observers will correctly
intuit that values in the wider parts of the violin are more probable than
those in narrower parts.

The mean and standard error are most informative when your data are roughly
normally distributed, and bar graphs don't help you check that assumption. For
studies with small samples, it's more effective to just plot every single
observation ([Weissgerber et al.,
2015](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128)).
Till Bergmann [explored this
approach](http://tillbergmann.com/blog/articles/barplots-are-pies.html) in a
post that includes R code.

If it's important to display the mean and standard error, these values can be
overlaid on any visualization.

```r
library(gridExtra)

# Bars of group means with +/- 1 standard error
plotBars <- ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
  geom_bar(aes(fill = factor(cyl)), stat = "identity") +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.4) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Bar graph")

# Every observation (jittered horizontally), with mean and SE overlaid
plotPoints <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
  geom_point(aes(y = mpg, color = factor(cyl)),
             position = position_jitter(width = 0.25, height = 0.0),
             alpha = 0.6) +
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Every observation")

# Violins, with mean and SE overlaid
plotViolins <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin(aes(y = mpg, fill = factor(cyl))) +
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Violin plot")

grid.arrange(plotBars, plotPoints, plotViolins, ncol = 3)
```

![Comparison of the same data visualized with bar charts, points, and violin plots](unnamed-chunk-4-1.png)

### Violin plots vs. density plots

A violin plot shows the distribution's density using the width of the plot,
which is symmetric about its axis, while traditional density plots use height
from a common baseline. It may be easier to judge relative differences in
density from a density plot, though I don't know of any research on the topic.
More importantly, density plots (or histograms) readily display the density
estimates on an axis, whereas violin plots typically don't present these.

```r
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.6)
```

![Density plots](unnamed-chunk-5-1.png)

Figures like the one above quickly become crowded as the number of groups
increases. It's easy to flip the coordinates and use faceting to construct
figures similar to violin plots.

```r
# trim = TRUE trims the tails to the range of the data,
# which is the default for geom_violin
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(trim = TRUE) +
  coord_flip() +
  facet_grid(. ~ cyl)
```

![Faceted density plots](unnamed-chunk-6-1.png)

I asked Twitter about making plots like this without faceting. [David
Robinson](http://varianceexplained.org/) came through with a [new geom that
does it](https://gist.github.com/dgrtwo/eb7750e74997891d7c20). Like traditional
violin plots, these toss out the density estimates--and currently only work
with the development version of ggplot2--but they do the trick.

```r
# geom_flat_violin() is not part of ggplot2; it is defined in the gist
# linked above, which must be loaded before running this
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_flat_violin()
```

![Flat violins](unnamed-chunk-7-1.png)

### Use violin plots

If you're into R's base graphics (why?), it looks like the [vioplot
package](http://www.inside-r.org/packages/cran/vioplot/docs/vioplot) can make
violin plots without using ggplot2. Seaborn appears to bring [very powerful
violin
plots](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.violinplot.html)
to Python, but I haven't had much opportunity to explore the awesome pandas
world that's emerged since I last used Python for most of my analyses.

Psychologists should use violin plots more often.
They're ideal for displaying
non-normal data, which we frequently encounter when looking at a single
participant's performance (e.g., response times). More importantly, previous
research--*in psychology*--has shown that viewers have difficulty interpreting
bar graphs, and violin plots present a viable alternative.
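
To make the response-time point concrete, here is a minimal sketch using
simulated data. Everything in it is an assumption made for illustration: the
log-normal parameters, the participant labels, and the names `rtSim`,
`rtMeans`, and `plotRT` are all hypothetical, and none of the numbers come
from a real experiment.

```r
library(ggplot2)

set.seed(1)  # make the simulated draws reproducible

# Simulated (not real) response times in ms for three hypothetical
# participants. Log-normal draws are right-skewed, a shape often used
# as a rough stand-in for response-time distributions.
rtSim <- data.frame(
  participant = factor(rep(c("P1", "P2", "P3"), each = 200)),
  rt = c(rlnorm(200, meanlog = 6.0, sdlog = 0.4),
         rlnorm(200, meanlog = 6.2, sdlog = 0.5),
         rlnorm(200, meanlog = 6.1, sdlog = 0.3))
)

# Per-participant means, i.e. what a bar graph of means would show.
# The result has columns `participant` and `rt`.
rtMeans <- aggregate(rt ~ participant, data = rtSim, FUN = mean)

# Violin shows the full shape, the box plot adds median and IQR,
# and the black diamond marks the mean.
plotRT <- ggplot(rtSim, aes(x = participant, y = rt, fill = participant)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white") +
  geom_point(data = rtMeans, color = "black", size = 3, shape = 18) +
  theme(legend.position = "none") +
  ggtitle("Simulated response times")
plotRT
```

With skewed data like these, the mean marker and the box plot's median line
need not coincide, and the violin shows the long upper tail that a bar of
means would leave out.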