blogposts

Code used to generate some blog posts.
git clone https://git.eamoncaddigan.net/blogposts.git

2022-07-08-labeling-plots.Rmd (6064B)


---
title: "Labeling bar charts (and other graphics) in ggplot2"
description: "How to combine stats and geoms to simplify complex plots."
date: 2022-07-08T19:32:57-04:00
draft: False
knit: (function(input, ...) {
    rmarkdown::render(
      input,
      output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
    )
  })
output:
  md_document:
    variant: markdown
    preserve_yaml: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  fig.path = file.path("figs",
                       sub("\\.Rmd$", "",
                           basename(rstudioapi::getActiveDocumentContext()$path)),
                       "")
)
knitr::opts_knit$set(
  base.dir = file.path(Sys.getenv("HUGO_ROOT"), "static"),
  base.url = "/"
)
```

Bar charts are [rightfully criticized for being potentially misleading](https://www.kickstarter.com/projects/1474588473/barbarplots), but they're still useful for some data graphics. They're a great way to display counts, for example, and non-technical audiences are comfortable interpreting them.

One way that a typical bar plot can be improved is by removing unnecessary axis labels and directly labeling the bars themselves. For example, if we wanted to see how many counties are in some of the states comprising the Midwestern US, we could use the `midwest` data set that's packaged with ggplot2 and draw a simple bar plot.

```{r simple_bar}
library(ggplot2)

ggplot(midwest, aes(x = state)) +
  geom_bar() +
  labs(x = "", y = "Number of counties") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 16),
        axis.title.y = element_text(size = 14))
```

This is fine (provided that you can accept that Ohio is part of the Midwest and Iowa isn't). However, we can easily strip away some of the "chart junk" by removing unnecessary theme elements and labeling the bars using `geom_text`.

```{r better_plot}
ggplot(midwest, aes(state)) +
  geom_bar() +
  # stat = "count" gives geom_text access to the computed `count` variable;
  # after_stat() replaces the older `..count..` notation (ggplot2 >= 3.3.0)
  geom_text(aes(y = after_stat(count), label = after_stat(count)), stat = "count",
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")
```

There's still room for improvement (I'd consider rearranging the bars based on something more meaningful than alphabetic order, and possibly coloring them), but this simultaneously conveys more information than the previous plot (objectively) and is less "cluttered" (subjectively).
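
As a minimal sketch of the reordering idea, the bars can be sorted by county count by turning `state` into a factor before plotting. The names `state_order` and `midwest_ordered` are made up for this illustration, and the fill color is an arbitrary choice; treat this as one possible approach rather than the definitive one.

```{r ordered_bars}
library(ggplot2)

# Hypothetical helper objects: order the states from most to fewest counties
state_order <- names(sort(table(midwest$state), decreasing = TRUE))
midwest_ordered <- midwest
midwest_ordered$state <- factor(midwest_ordered$state, levels = state_order)

ggplot(midwest_ordered, aes(state)) +
  geom_bar(fill = "steelblue") +  # arbitrary color, chosen only for the example
  geom_text(aes(y = after_stat(count), label = after_stat(count)), stat = "count",
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")
```

Because the ordering lives in the factor levels, the labels follow the bars without any change to the `geom_text` layer.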

## How this works

I'm surprised I'm still writing blog posts about ggplot2 after seven years, but I continue to learn new things about it even after regular use. Recently, I decided that it bothered me that there were "stats" (like `stat_ecdf`) and "geoms" (like `geom_bar`), and I really didn't understand the difference between them: they seemed interchangeable but not identical. I finally looked into it and came across [this response on StackOverflow](https://stackoverflow.com/a/44226841), which contained the following quote from the [ggplot2 book](https://ggplot2-book.org/):

> You only need to set one of stat and geom: every geom has a default stat, and every stat has a default geom.

It turns out that "stats" and "geoms" _are_ largely interchangeable, because they're both wrappers around `layer()`, with different ways of handling defaults. The code above works because I changed the `stat` parameter of `geom_text` (which, according to [the documentation](https://ggplot2.tidyverse.org/reference/geom_text.html), defaults to `"identity"`) to `"count"`, which is the default `stat` of `geom_bar`. Looking at [the documentation for `stat_count`](https://ggplot2.tidyverse.org/reference/geom_bar.html), we see that it provides two _computed variables_, and we use the computed `count` variable in our aesthetics for `geom_text` to provide both the vertical position and label.
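
To make this concrete, here is a small sketch that inspects those defaults directly. It assumes that layer objects expose their stat and geom as `$stat` and `$geom` fields, which holds in recent ggplot2 releases as far as I know but is an implementation detail rather than a documented interface. It then uses `layer_data()` to look at what `stat_count` computes. The variable names `bar_data` and `count_data` are only for illustration.

```{r inspect_defaults}
library(ggplot2)

# Each layer records which stat and geom it ended up with
# (assumption: the `$stat` and `$geom` fields are internal details)
class(geom_bar()$stat)[1]    # the default stat of geom_bar
class(geom_text()$stat)[1]   # the default stat of geom_text
class(stat_count()$geom)[1]  # the default geom of stat_count

# geom_bar() and stat_count() should therefore compute the same data
bar_data   <- layer_data(ggplot(midwest, aes(state)) + geom_bar())
count_data <- layer_data(ggplot(midwest, aes(state)) + stat_count())

# The computed variables appear as columns alongside the aesthetics
bar_data[, c("x", "count", "prop")]
all.equal(bar_data$count, count_data$count)
```

The `count` column is the one that `geom_text` picks up once its `stat` is set to `"count"`.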

Getting my head around this has helped me iterate through analyses more quickly. For instance, I frequently use `stat_ecdf` to get a handle on distributions, and now I can label these plots easily. Sticking with data about the Midwest, here's the ECDF of population density for Midwestern counties, with a few values highlighted.

```{r ecdf}
ggplot(midwest, aes(popdensity)) +
  stat_ecdf() + # This could also be `geom_step(stat = "ecdf")`
  # after_stat() refers to the variables computed by the "ecdf" stat and
  # replaces the older `..x..` and `..y..` notation (ggplot2 >= 3.3.0)
  geom_text(aes(label = sprintf("(%.0f: %.0f%%)", after_stat(x), floor(100 * after_stat(y)))),
            stat = "ecdf", hjust = 1, vjust = 0, check_overlap = TRUE, size = 5) +
  expand_limits(x = -15000) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  labs(x = "Population density", y = "CDF")
```

I wouldn't call this a _beautiful_ graphic, but visualizations like this are often _useful_, especially when getting a handle on a new data set. Here we see that over a third of Midwestern counties have population densities fully two orders of magnitude smaller than the most densely populated county, and we've only had to draw a single plot.
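
The reading from the plot can also be checked numerically with base R's `ecdf()`, which builds an empirical distribution function from the same column. This is just a sketch; `density_ecdf`, `threshold`, and `share_below` are names invented for this example.

```{r ecdf_check}
library(ggplot2)  # only needed here for the midwest data set

# ecdf() returns a function giving the share of observations <= its argument
density_ecdf <- ecdf(midwest$popdensity)

# Two orders of magnitude below the most densely populated county
threshold <- max(midwest$popdensity) / 100

# Share of counties at or below that density
share_below <- density_ecdf(threshold)
share_below
```

A value above one third here would agree with what the labeled plot suggests.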

It's hard to justify putting much effort into producing plots that aren't meant to be shared, and a typical data analysis will involve creating dozens of plots that are only useful for the person analyzing the data. Without understanding stats and geoms, I would never have bothered labeling an ECDF plot; "quick and dirty" needs to be quick after all. On the other hand, labeling the values in a bar chart is something that one should do when preparing publication-quality graphics. ggplot2 is useful in both phases of a project.

In the words of Alan Kay, ["simple things should be simple, complex things should be possible."](https://www.quora.com/What-is-the-story-behind-Alan-Kay-s-adage-Simple-things-should-be-simple-complex-things-should-be-possible) ggplot2 does both of these consistently. Occasionally, it even makes complex things simple.