www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git

---
title: "Labeling bar charts (and other graphics) in ggplot2"
description: "How to combine stats and geoms to simplify complex plots."
date: 2022-07-08T19:32:57-04:00
draft: False
knit: (function(input, ...) {
    rmarkdown::render(
      input,
      output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
    )
  })
output: 
  md_document:
    variant: markdown
    preserve_yaml: true
categories:
- Data Science
tags:
- R
- Dataviz
---

Bar charts are [rightfully criticized for being potentially
misleading](https://www.kickstarter.com/projects/1474588473/barbarplots),
but they're still useful for some data graphics. They're a great way to
display counts, for example, and non-technical audiences are comfortable
interpreting them.

One way that a typical bar plot can be improved is by removing
unnecessary axis labels and directly labeling the bars themselves. For
example, if we wanted to see how many counties are in some of the states
comprising the Midwestern US, we could use the `midwest` data set that's
packaged with ggplot2 and draw a simple bar plot.

``` r
library(ggplot2)

ggplot(midwest, aes(x = state)) +
  geom_bar() +
  labs(x = "", y = "Number of counties") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 16),
        axis.title.y = element_text(size = 14))
```

![Count of counties per state for IL, IN, MI, OH, and WI](simple_bar-1.png)

This is fine (provided that you can accept that Ohio is part of the
Midwest and Iowa isn't). However, we can easily strip away some of the
"chart junk" by removing unnecessary theme elements and labeling the
bars using `geom_text`.

``` r
ggplot(midwest, aes(state)) +
  geom_bar() +
  # after_stat() is the current spelling of the older ..count.. notation
  # (available in ggplot2 >= 3.3.0)
  geom_text(aes(y = after_stat(count), label = after_stat(count)), stat = "count",
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")
```

![Counties per state (with counts displayed above each bar)](better_plot-1.png)

There's still room for improvement---I'd consider rearranging the bars
based on something more meaningful than alphabetic order, and possibly
coloring them---but this simultaneously conveys more information than
the previous plot (objectively) and is less "cluttered" (subjectively).
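
The reordering and coloring ideas could be sketched roughly as follows: turn `state` into a factor whose levels are sorted by county count, then draw the same labeled plot with a fill color. The names `state_order`, `midwest_ordered`, and `p_ordered`, and the `"steelblue"` fill, are illustrative assumptions rather than part of the original plot.

``` r
library(ggplot2)

# Count counties per state and sort the state codes from most to fewest.
state_order <- names(sort(table(midwest$state), decreasing = TRUE))

# Copy the data and make `state` a factor so ggplot2 keeps that order
# instead of falling back to alphabetic order.
midwest_ordered <- midwest
midwest_ordered$state <- factor(midwest_ordered$state, levels = state_order)

p_ordered <- ggplot(midwest_ordered, aes(state)) +
  geom_bar(fill = "steelblue") +  # hypothetical color choice
  geom_text(aes(y = after_stat(count), label = after_stat(count)), stat = "count",
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")

p_ordered
```

Because the ordering lives in the factor levels, the bars and their labels stay aligned without any changes to the `geom_text` layer.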

## How this works

I'm surprised I'm still writing blog posts about ggplot2 after seven
years, but I continue to learn new things about it even after regular
use. Recently, I decided that it bothered me that there were "stats"
(like `stat_ecdf`) and "geoms" (like `geom_bar`), and I really didn't
understand the difference between them---they seemed interchangeable but
not identical. I finally looked into it and came across [this response on
StackOverflow](https://stackoverflow.com/a/44226841), which contained
the following quote from the [ggplot2 book](https://ggplot2-book.org/):

> You only need to set one of stat and geom: every geom has a default
> stat, and every stat has a default geom.

It turns out that "stats" and "geoms" *are* largely interchangeable,
because they're both wrappers around `layer()`, with different ways of
handling defaults. The code above works because I changed the `stat`
parameter of `geom_text` (which, according to [the
documentation](https://ggplot2.tidyverse.org/reference/geom_text.html),
defaults to `"identity"`) to `"count"`, which is the default `stat` of
`geom_bar`. Looking at [the documentation for
`stat_count`](https://ggplot2.tidyverse.org/reference/geom_bar.html), we
see that it provides two *computed variables*, and we use the
computed `count` variable in our aesthetics for `geom_text` to provide
both the vertical position and label.
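
One way to check these defaults and computed variables from the console is sketched below. It reads the default arguments of the layer constructors with base R's `formals()` and uses ggplot2's `layer_data()` to see what the stat passes to the geom. The object names (`p_geom`, `p_stat`, `bars_geom`, `bars_stat`) are made up for this illustration.

``` r
library(ggplot2)

# The "other half" of each layer is an ordinary argument with a default.
formals(geom_bar)$stat    # "count"
formals(geom_text)$stat   # "identity"
formals(stat_count)$geom  # "bar"
formals(stat_ecdf)$geom   # "step"

# The same layer written both ways: a geom with its default stat,
# and a stat with its default geom.
p_geom <- ggplot(midwest, aes(state)) + geom_bar()
p_stat <- ggplot(midwest, aes(state)) + stat_count()

# layer_data() returns the data frame the stat hands to the geom,
# including the computed variables `count` and `prop`.
bars_geom <- layer_data(p_geom)
bars_stat <- layer_data(p_stat)

bars_geom[, c("x", "count", "prop")]

# Both spellings should produce the same counts.
identical(bars_geom$count, bars_stat$count)
```

Inspecting a layer this way is a quick way to find out which computed variables are available to `after_stat()` before writing the labeling layer.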

Getting my head around this has helped me iterate through analyses more
quickly. For instance, I frequently use `stat_ecdf` to get a handle on
distributions, and now I can label these plots easily. Sticking with
data about the Midwest, here's the ECDF of population density for
Midwestern counties, with a few values highlighted.

``` r
ggplot(midwest, aes(popdensity)) +
  stat_ecdf() + # This could also be `geom_step(stat = "ecdf")`
  # Label points with "(density: cumulative percent)" using the stat's output
  geom_text(aes(label = sprintf("(%.0f: %.0f%%)", after_stat(x), floor(100 * after_stat(y)))),
            stat = "ecdf", hjust = 1, vjust = 0, check_overlap = TRUE, size = 5) +
  expand_limits(x = -15000) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  labs(x = "Population density", y = "CDF")
```

![The curve of the CDF of population density in Midwestern counties](ecdf-1.png)

I wouldn't call this a *beautiful* graphic, but visualizations like this are
often *useful*, [especially when getting a handle on a new data set]({{< ref
"/posts/tufte-plot/index.md" >}}). Here we see that over a third of
Midwestern counties have population densities fully two orders of magnitude
smaller than the most densely populated county, and we've only had to draw a
single plot.
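
The figure read off the plot can be cross-checked with a few lines of base R. This is only a sketch: the names `threshold`, `share_below`, and `density_ecdf` are introduced here for illustration, and "two orders of magnitude smaller" is taken to mean below one hundredth of the maximum density.

``` r
library(ggplot2)  # only needed here for the `midwest` data set

# Threshold: one hundredth of the highest county population density.
threshold <- max(midwest$popdensity) / 100

# Share of counties whose density falls below that threshold.
share_below <- mean(midwest$popdensity < threshold)
sprintf("%.0f%% of counties are below a density of %.0f",
        100 * share_below, threshold)

# Base R's ecdf() builds the empirical CDF as a function; evaluating it at
# the threshold should agree with the share above (it counts values <= threshold).
density_ecdf <- ecdf(midwest$popdensity)
density_ecdf(threshold)
```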

It's hard to justify putting much effort into producing plots that
aren't meant to be shared, and a typical data analysis will involve
creating dozens of plots that are only useful for the person analyzing
the data. Without understanding stats and geoms, I would never have
bothered labeling an ECDF plot; "quick and dirty" needs to be quick,
after all. On the other hand, labeling the values in a bar chart is
something that one should do when preparing publication-quality
graphics. ggplot2 is useful in both phases of a project.

In the words of Alan Kay, ["simple things should be simple, complex
things should be
possible."](https://www.quora.com/What-is-the-story-behind-Alan-Kay-s-adage-Simple-things-should-be-simple-complex-things-should-be-possible)
ggplot2 does both of these consistently. Occasionally, it even makes
complex things simple.