2015-09-26-violin-plots.Rmd - blogposts - Code used to generate some blog posts.

---
title: "Violin plots are great"
author: "Eamon Caddigan"
date: '2015-09-26'
output: html_document
layout: post
summary: Fiddling with a visualization technique that's not common in psychology (yet).
categories: dataviz R psych
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(cache=TRUE, echo=TRUE, warning=FALSE, message=FALSE,
                      fig.width=8, fig.height=5, fig.align="center")

source("geom_flat_violin.R")
```

Anybody who's used the ggplot2 package in R is probably familiar with `geom_violin`, which creates [violin plots](https://en.wikipedia.org/wiki/Violin_plot). I'm not going to reproduce the Wikipedia article here; just think of violin plots as sideways density plots (which themselves are basically smooth histograms). They look like this:

```{r}
library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin()
```
     25 
     26 I wasn't familiar with this type of plot until I started using ggplot2. Judging by a recent conversation with friends (because my friends and I are nerds), that's not uncommon among people coming from psychology (but check out Figure 1 in the [recent report on reproducability in psychology](https://osf.io/ezcuj/wiki/home/) by the Open Science Collaboration). 
     27 
     28 That's a shame, because this is a useful technique; a quick glance shows viewers the shape of the distribution and lets them easily estimate its mode and range. If you also need the median and interquartile range, it's simple to display them by overlaying a box plot (violin plots are usually made this way).
     29 
     30 ```{r}
     31 ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + 
     32   geom_violin(aes(fill = factor(cyl))) + 
     33   geom_boxplot(width = 0.2)
     34 ```
     35 
     36 ### Violin plots vs. bar graphs
     37 
     38 Violin plots aren't popular in the psychology literature--at least among vision/cognition researchers. Instead, it's more common to see bar graphs, which throw away all of the information present in a violin plot.
     39 
     40 ```{r}
     41 library(dplyr)
     42 mtcarsSummary <- mtcars %>%
     43   group_by(cyl) %>%
     44   summarize(mpg_mean = mean(mpg),
     45             mpg_se = sqrt(var(mpg)/length(mpg)))
     46 
     47 ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) + 
     48   geom_bar(stat = "identity")
     49 ```
     50 
     51 Bar graphs highlight a single statistic of the analyst's choosing. In psychology (and many other fields), researchers use bar graphs to visualize the mean of the data, and usually include error bars to show the standard error (or another confidence interval). 
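
Error bars showing a standard error and error bars showing a confidence interval are not the same width, so it's worth being explicit about which one a figure uses. Here's a minimal sketch of both, assuming a t-based 95% interval; `mean_ci` and `ciTable` are hypothetical names of mine, not anything used elsewhere in this post.

```{r}
# Standard error of the mean and a t-based confidence interval for one group
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(mean = mean(x), se = se,
    lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per cylinder count
ciTable <- t(sapply(split(mtcars$mpg, mtcars$cyl), mean_ci))
ciTable
```

With groups this small, the 95% interval is more than twice as wide as a bar of plus or minus one standard error.
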

However, when audiences see bar graphs of means, they erroneously judge values that fall *inside* a bar (i.e., below the mean) as more probable than values equidistant from the mean but outside a bar ([Newman & Scholl, 2012](http://perception.research.yale.edu/papers/12-Newman-Scholl-PBR.pdf)). This bias shouldn't affect violin plots because the region inside the violin contains *all* of the observed data. I'd guess that observers will correctly intuit that values in the wider parts of the violin are more probable than those in narrower parts.

The mean and standard error are only useful statistics when you assume that your data are normally distributed; bar graphs don't help you check that assumption. For studies with small samples, it's more effective to just plot every single observation ([Weissgerber et al., 2015](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128)). Till Bergmann [explored this approach](http://tillbergmann.com/blog/articles/barplots-are-pies.html) in a post that includes R code.
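
If you do want to check the normality assumption directly, here's one way it might look. This is a sketch rather than a recommendation: the Shapiro-Wilk test has little power with groups this small, `normalityP` is a name I made up, and `stat_qq_line` assumes ggplot2 3.0.0 or later.

```{r}
library(ggplot2)

# Shapiro-Wilk p-value for each cylinder group; small p-values suggest non-normality
normalityP <- sapply(split(mtcars$mpg, mtcars$cyl),
                     function(x) shapiro.test(x)$p.value)
normalityP

# Normal Q-Q plots: points that stray far from the line are a warning sign
ggplot(mtcars, aes(sample = mpg, color = factor(cyl))) +
  stat_qq() +
  stat_qq_line() +
  facet_grid(. ~ cyl) +
  theme(legend.position = "none")
```
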

If it's important to display the mean and standard error, these values can be overlaid on any visualization.

```{r, fig.width=11, fig.height=5}
library(gridExtra)

# Bars of the group means with standard error bars
plotBars <- ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
  geom_bar(aes(fill = factor(cyl)), stat = "identity") +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.4) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Bar graph")

# Every observation (jittered horizontally), with the mean and standard error on top
plotPoints <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
  geom_point(aes(y = mpg, color = factor(cyl)),
             position = position_jitter(width = 0.25, height = 0.0),
             alpha = 0.6) +
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Every observation")

# Violins, with the mean and standard error on top
plotViolins <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin(aes(y = mpg, fill = factor(cyl))) +
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Violin plot")

grid.arrange(plotBars, plotPoints, plotViolins, ncol = 3)
```
     92 
     93 ### Violin plots vs. density plots
     94 
     95 A violin plot shows the distribution's density using the width of the plot, which is symmetric about its axis, while traditional density plots use height from a common baseline. It may be easier to estimate relative differences in density plots, though I don't know of any research on the topic. More importantly, density plots (or histograms) readily display the density estimates, whereas violin plots typically don't present these.
     96 
     97 ```{r}
     98 ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + 
     99   geom_density(alpha = 0.6)
    100 ```
    101 
    102 Figures like the one above quickly become crowded as the number of factors increases. It's easy to flip the coordinates and use faceting to construct figures similar to violin plots.
    103 
    104 ```{r}
    105 # trim = TRUE trims the tails to the range of the data,
    106 # which is the default for geom_violin
    107 ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + 
    108   geom_density(trim = TRUE) + 
    109   coord_flip() +
    110   facet_grid(. ~ cyl)
    111 ```
    112 
    113 I asked twitter about making plots like this without faceting. [David Robinson](http://varianceexplained.org/) came through with a [new geom that does it](https://gist.github.com/dgrtwo/eb7750e74997891d7c20). Like traditional violin plots, these toss out the density estimates--and currently only work with the development version of ggplot2--but they do the trick.
    114 
    115 ```{r}
    116 ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
    117     geom_flat_violin()
    118 ```
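
If you can't install the custom geom, a similar figure can be approximated with stock ggplot2. To be clear, this is not David Robinson's implementation, just a rough sketch of the same idea: it assumes a default `density()` estimate per group drawn with `geom_polygon`, and `flatViolins` and the 0.8 width factor are arbitrary choices of mine.

```{r}
library(ggplot2)

# One density estimate per cylinder group, trimmed to the observed range
flatViolins <- do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(d) {
  dens <- density(d$mpg, from = min(d$mpg), to = max(d$mpg))
  data.frame(cyl = d$cyl[1],
             # Zero-width points at both ends so each polygon closes along its baseline
             mpg = c(min(d$mpg), dens$x, max(d$mpg)),
             density = c(0, dens$y, 0))
}))

# Put the groups at x = 1, 2, 3 and let the widest violin extend 0.8 units to the right
flatViolins$x <- as.numeric(factor(flatViolins$cyl)) +
  0.8 * flatViolins$density / max(flatViolins$density)

ggplot(flatViolins, aes(x = x, y = mpg, group = cyl, fill = factor(cyl))) +
  geom_polygon() +
  scale_x_continuous(breaks = 1:3, labels = levels(factor(mtcars$cyl))) +
  labs(x = "factor(cyl)")
```
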

### Use violin plots

If you're into R's base graphics (why?), it looks like the [vioplot package](http://www.inside-r.org/packages/cran/vioplot/docs/vioplot) can make violin plots without using ggplot2. Seaborn appears to bring [very powerful violin plots](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.violinplot.html) to Python, but I haven't had much opportunity to explore the awesome pandas world that's emerged since I last used Python for most of my analyses.

Psychologists should use violin plots more often. They're ideal for displaying non-normal data, which we frequently encounter when looking at a single participant's performance (e.g., response times). More importantly, previous research--*in psychology*--has shown that viewers have difficulty interpreting bar graphs, and violin plots present a viable alternative.