---
title: "Labeling bar charts (and other graphics) in ggplot2"
description: "How to combine stats and geoms to simplify complex plots."
date: 2022-07-08T19:32:57-04:00
draft: False
knit: (function(input, ...) {
    rmarkdown::render(
      input,
      output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
    )
  })
output:
  md_document:
    variant: markdown
    preserve_yaml: true
categories:
  - Data Science
tags:
  - R
  - Dataviz
---

Bar charts are [rightfully criticized for being potentially
misleading](https://www.kickstarter.com/projects/1474588473/barbarplots),
but they're still useful for some data graphics. They're a great way to
display counts, for example, and non-technical audiences are comfortable
interpreting them.

One way that a typical bar plot can be improved is by removing
unnecessary axis labels and directly labeling the bars themselves. For
example, if we wanted to see how many counties are in some of the states
comprising the Midwestern US, we could use the `midwest` data set that's
packaged with ggplot2 and draw a simple bar plot.

``` r
library(ggplot2)

ggplot(midwest, aes(x = state)) +
  geom_bar() +
  labs(x = "", y = "Number of counties") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 16),
        axis.title.y = element_text(size = 14))
```

![Count of counties per state for IL, IN, MI, OH, and WI](simple_bar-1.png)

This is fine (provided that you can accept that Ohio is part of the
Midwest and Iowa isn't). However, we can easily strip away some of the
"chart junk" by removing unnecessary theme elements and labeling the
bars using `geom_text`.

``` r
ggplot(midwest, aes(state)) +
  geom_bar() +
  # Borrow geom_bar's stat so the computed count supplies both the label
  # text and its height. after_stat() needs ggplot2 >= 3.3.0; it replaces
  # the older ..count.. notation.
  geom_text(aes(y = after_stat(count), label = after_stat(count)), stat = "count",
            vjust = 0, nudge_y = 2, size = 8) +
  expand_limits(y = 110) + # Leave headroom for the label above the tallest bar
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 16)) +
  labs(title = "Number of counties in Midwestern states")
```

![Counties per state (with counts displayed above each bar)](better_plot-1.png)

There's still room for improvement (I'd consider rearranging the bars
based on something more meaningful than alphabetical order, and possibly
coloring them), but this simultaneously conveys more information than
the previous plot (objectively) and is less "cluttered" (subjectively).

## How this works

I'm surprised I'm still writing blog posts about ggplot2 after seven
years, but I continue to learn new things about it even after regular
use. Recently, I decided that it bothered me that there were "stats"
(like `stat_ecdf`) and "geoms" (like `geom_bar`), and I really didn't
understand the difference between them: they seemed interchangeable but
not identical. I finally looked into it and came across [this response on
Stack Overflow](https://stackoverflow.com/a/44226841), which contained
the following quote from the [ggplot2 book](https://ggplot2-book.org/):

> You only need to set one of stat and geom: every geom has a default
> stat, and every stat has a default geom.

It turns out that "stats" and "geoms" *are* largely interchangeable,
because they're both wrappers around `layer()`, with different ways of
handling defaults.
The code above works because I changed the `stat`
parameter of `geom_text` (which, according to [the
documentation](https://ggplot2.tidyverse.org/reference/geom_text.html),
defaults to `"identity"`) to `"count"`, which is the default `stat` of
`geom_bar`. Looking at [the documentation for
`stat_count`](https://ggplot2.tidyverse.org/reference/geom_bar.html), we
see that it provides two *computed variables* (`count` and `prop`), and
we use the computed `count` variable in our aesthetics for `geom_text`
to provide both the vertical position and label.

Getting my head around this has helped me iterate through analyses more
quickly. For instance, I frequently use `stat_ecdf` to get a handle on
distributions, and now I can label these plots easily. Sticking with
data about the Midwest, here's the ECDF of population density for
Midwestern counties, with a few values highlighted.

``` r
ggplot(midwest, aes(popdensity)) +
  stat_ecdf() + # This could also be `geom_step(stat = "ecdf")`
  # Label points as "(density: cumulative percent)" from the values the
  # ecdf stat computes; check_overlap drops labels that would collide.
  geom_text(aes(label = sprintf("(%.0f: %.0f%%)", after_stat(x), floor(100 * after_stat(y)))),
            stat = "ecdf", hjust = 1, vjust = 0, check_overlap = TRUE, size = 5) +
  expand_limits(x = -15000) + # Make room on the left for right-aligned labels
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  labs(x = "Population density", y = "CDF")
```

![The curve of the CDF of population density in Midwestern counties](ecdf-1.png)

I wouldn't call this a *beautiful* graphic, but visualizations like this are
often *useful*, [especially when getting a handle on a new data set]({{< ref "/posts/tufte-plot/index.md" >}}). Here we see that over a third of
Midwestern counties have population densities fully two orders of magnitude
smaller than the most densely populated county, and we've only had to draw a
single plot.
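
The idea that a geom and a stat are two entry points to the same layer can be checked directly. The sketch below is not part of the original analysis: it builds the county bar chart once with `geom_bar()` and once with `stat_count(geom = "bar")`, then uses `layer_data()` to pull out what each layer computed. The object names (`p_geom`, `p_stat`, `d_geom`, `d_stat`) are made up for illustration.

``` r
library(ggplot2)

# The same layer, specified geom-first and stat-first.
p_geom <- ggplot(midwest, aes(state)) + geom_bar()
p_stat <- ggplot(midwest, aes(state)) + stat_count(geom = "bar")

# layer_data() returns the data frame a layer passes to its geom,
# including the variables computed by the stat (`count` and `prop`).
d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)

# One row per state, with the computed variables alongside the x position.
d_geom[, c("x", "count", "prop")]

# Both routes should produce the same counts.
identical(d_geom$count, d_stat$count)
```

Looking at the columns returned by `layer_data()` is also a quick way to see which computed variables are available to `geom_text` (or any other geom) when borrowing a stat.
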

It's hard to justify putting much effort into producing plots that
aren't meant to be shared, and a typical data analysis will involve
creating dozens of plots that are only useful for the person analyzing
the data. Without understanding stats and geoms, I would never have
bothered labeling an ECDF plot; "quick and dirty" needs to be quick,
after all. On the other hand, labeling the values in a bar chart is
something that one should do when preparing publication-quality
graphics. ggplot2 is useful in both phases of a project.

In the words of Alan Kay, ["simple things should be simple, complex
things should be
possible."](https://www.quora.com/What-is-the-story-behind-Alan-Kay-s-adage-Simple-things-should-be-simple-complex-things-should-be-possible)
ggplot2 does both of these consistently. Occasionally, it even makes
complex things simple.