www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net

---
title: "Violin plots are great"
date: '2015-09-26'
description: Fiddling with a visualization technique that's not common in psychology (yet).
categories:
- Data Science
tags:
- R
- Dataviz
- Statistics
---

Anybody who's used the ggplot2 package in R is probably familiar with
`geom_violin`, which creates [violin
plots](https://en.wikipedia.org/wiki/Violin_plot). I'm not going to reproduce
the Wikipedia article here; just think of violin plots as sideways density
plots (which themselves are basically smooth histograms). They look like this:

```r
library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin()
```

![Violin plot of mpg by cylinder](unnamed-chunk-1-1.png)

I wasn't familiar with this type of plot until I started using ggplot2. Judging
by a recent conversation with friends (because my friends and I are nerds),
that's not uncommon among people coming from psychology (but check out Figure 1
in the [recent report on reproducibility in
psychology](https://osf.io/ezcuj/wiki/home/) by the Open Science
Collaboration).

That's a shame, because this is a useful technique; a quick glance shows
viewers the shape of the distribution and lets them easily estimate its mode
and range. If you also need the median and interquartile range, it's simple to
display them by overlaying a box plot (violin plots are usually made this way).

```r
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(aes(fill = factor(cyl))) +
  geom_boxplot(width = 0.2)
```

![Violin plots with overlaid box plots](unnamed-chunk-2-1.png)

### Violin plots vs. bar graphs

Violin plots aren't popular in the psychology literature--at least among
vision/cognition researchers. Instead, it's more common to see bar graphs,
which throw away nearly all of the information present in a violin plot.

```r
library(dplyr)

# Per-cylinder mean and standard error of mpg
mtcarsSummary <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpg_mean = mean(mpg),
            mpg_se = sqrt(var(mpg)/length(mpg)))

ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
  geom_bar(stat = "identity")
```

![Bar charts of mpg by cyl](unnamed-chunk-3-1.png)

Bar graphs highlight a single statistic of the analyst's choosing. In
psychology (and many other fields), researchers use bar graphs to visualize the
mean of the data, and usually include error bars to show the standard error (or
a confidence interval).

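To make the error-bar options concrete, here is a minimal base R sketch that
computes the mean, the standard error, and a t-based 95% confidence interval of
`mpg` for each cylinder count. The helper `summarize_group` and its `level`
argument are names made up for this illustration, and the t-based interval is
just one common choice that assumes roughly normal data.

```r
# Split mpg by cylinder count (built-in mtcars data, base R only)
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Hypothetical helper: mean, standard error, and t-based confidence interval
summarize_group <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)
  # Half-width of the interval: the standard error scaled by a t quantile
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(n = n, mean = mean(x), se = se,
    ci_lower = mean(x) - half_width,
    ci_upper = mean(x) + half_width)
}

# One row per cylinder count, one column per statistic
ci_table <- t(sapply(mpg_by_cyl, summarize_group))
ci_table
```

With groups this small, the confidence interval is noticeably wider than plus
or minus one standard error, so it matters which of the two the error bars
show.
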
However, when audiences see bar graphs of means, they erroneously judge values
that fall *inside* a bar (i.e., below the mean) as more probable than values
equidistant from the mean but outside a bar ([Newman & Scholl,
2012](http://perception.research.yale.edu/papers/12-Newman-Scholl-PBR.pdf)).
This bias doesn't affect violin plots because the region inside the violin
contains *all* of the observed data. I'd guess that observers will correctly
intuit that values in the wider parts of the violin are more probable than
those in narrower parts.

The mean and standard error summarize your data well only when the data are
roughly normally distributed, and bar graphs don't help you check that
assumption. For studies with small sample sizes, it's more effective to just
plot every single observation ([Weissgerber et al.,
2015](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128)).
Till Bergmann [explored this
approach](http://tillbergmann.com/blog/articles/barplots-are-pies.html) in a
post that includes R code.

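One way to check the normality assumption directly is a normal Q-Q plot for
each group. This sketch isn't taken from the papers cited above; it assumes a
version of ggplot2 recent enough to provide `stat_qq_line()`, and the name
`qq_plot` is made up for this illustration.

```r
library(ggplot2)

# One normal Q-Q panel per cylinder count; points that stray far
# from the reference line suggest a departure from normality
qq_plot <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ cyl)

qq_plot
```
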
If it's important to display the mean and standard error, these values can be
overlaid on any visualization.

```r
library(gridExtra)

# Bar graph of the means with standard error bars
plotBars <- ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.4) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Bar graph")

# Every observation (jittered horizontally), with mean and standard error on top
plotPoints <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
  geom_point(position = position_jitter(width = 0.25, height = 0.0),
             alpha = 0.6) +
  # Summary layers use mtcarsSummary, so y is remapped to mpg_mean
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Every observation")

# Violins, with the same mean and standard error overlay
plotViolins <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_violin() +
  geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
  geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
                color = "black", width = 0.2, data = mtcarsSummary) +
  ylim(0, 35) +
  theme(legend.position = "none") +
  ggtitle("Violin plot")

# Show the three plots side by side
grid.arrange(plotBars, plotPoints, plotViolins, ncol = 3)
```

![Comparison of the same data visualized with bar charts, points, and violin
charts](unnamed-chunk-4-1.png)

### Violin plots vs. density plots

A violin plot shows the distribution's density using the width of the plot,
which is symmetric about its axis, while traditional density plots use height
from a common baseline. It may be easier to estimate relative differences in
density plots, though I don't know of any research on the topic. More
importantly, density plots (or histograms) show the density estimates on an
axis, whereas violin plots typically don't present these values.

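To see the numbers that a density plot puts on its axis, base R's `density()`
returns the estimate itself. The sketch below assumes the default settings
(Gaussian kernel, automatic bandwidth), which may not match what a particular
ggplot2 layer uses, and looks only at the four-cylinder cars. The variable
names are made up for this illustration.

```r
# Kernel density estimate of mpg for the four-cylinder cars
mpg4 <- mtcars$mpg[mtcars$cyl == 4]
dens <- density(mpg4)

# dens$x holds the evaluation points, dens$y the estimated densities
head(data.frame(mpg = dens$x, density = dens$y))

# The estimated mode is the mpg value where the density peaks
mode_estimate <- dens$x[which.max(dens$y)]
mode_estimate
```
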
```r
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.6)
```

![Density plots](unnamed-chunk-5-1.png)

Figures like the one above quickly become crowded as the number of groups
increases. It's easy to flip the coordinates and use faceting to construct
figures similar to violin plots.

```r
# trim = TRUE trims the tails to the range of the data,
# which is the default for geom_violin
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(trim = TRUE) +
  coord_flip() +
  facet_grid(. ~ cyl)
```

![Faceted density plots](unnamed-chunk-6-1.png)

I asked Twitter about making plots like this without faceting. [David
Robinson](http://varianceexplained.org/) came through with a [new geom that
does it](https://gist.github.com/dgrtwo/eb7750e74997891d7c20). Like traditional
violin plots, these toss out the density estimates--and currently only work
with the development version of ggplot2--but they do the trick.

```r
# geom_flat_violin() is defined in the gist linked above; source it first
ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_flat_violin()
```

![Flat violins](unnamed-chunk-7-1.png)

### Use violin plots

If you're into R's base graphics (why?), it looks like the [vioplot
package](http://www.inside-r.org/packages/cran/vioplot/docs/vioplot) can make
violin plots without using ggplot2. Seaborn appears to bring [very powerful
violin
plots](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.violinplot.html)
to Python, but I haven't had much opportunity to explore the awesome pandas
world that's emerged since I last used Python for most of my analyses.

Psychologists should use violin plots more often. They're ideal for displaying
non-normal data, which we frequently encounter when looking at a single
participant's performance (e.g., response times). More importantly, previous
research--*in psychology*--has shown that viewers have difficulty interpreting
bar graphs, and violin plots present a viable alternative.