---
title: "Recreating a Tufte Slopegraph"
author: "Eamon Caddigan"
date: "2018-07-05"
output: md_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r load_libraries, include=FALSE, warning=FALSE, message=FALSE}
#extrafont::loadfonts(device="win")
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggrepel))
```

Recently on Twitter, data visualization guru Edward R. Tufte wrote that graphics produced by R are not “publication ready”. His proposed workflow is to use statistical software to create an initial version of a plot, and then make final improvements in Adobe Illustrator.

I disagree with this advice. First I’ll show the steps a data analyst might take to create a high-quality graphic entirely in R. Then, I’ll explain why I think this is a better approach.

## Publication quality graphics in R

Page 158 of Tufte’s classic book, [The Visual Display of Quantitative Information (2nd ed.)](https://www.edwardtufte.com/tufte/books_vdqi), features a “slope graph” that shows the change in government receipts for several countries between 1970 and 1979. Below are the first few rows of these data in a [tidy data frame](https://www.jstatsoft.org/article/view/v059i10), `receiptData`, along with a quick and dirty slopegraph.

```{r receipt_data, echo=FALSE}
receiptData <- tribble(
  ~country,        ~year, ~receipts,
  "Belgium",       1970L, 35.2,
  "Belgium",       1979L, 43.2,
  "Britain",       1970L, 40.7,
  "Britain",       1979L, 39.0,
  "Canada",        1970L, 35.2,
  "Canada",        1979L, 35.8,
  "Finland",       1970L, 34.9,
  "Finland",       1979L, 38.2,
  "France",        1970L, 39.0,
  "France",        1979L, 43.4,
  "Germany",       1970L, 37.5,
  "Germany",       1979L, 42.9,
  "Greece",        1970L, 26.8,
  "Greece",        1979L, 30.6,
  "Italy",         1970L, 30.4,
  "Italy",         1979L, 35.7,
  "Japan",         1970L, 20.7,
  "Japan",         1979L, 26.6,
  "Netherlands",   1970L, 44.0,
  "Netherlands",   1979L, 55.8,
  "Norway",        1970L, 43.5,
  "Norway",        1979L, 52.2,
  "Spain",         1970L, 22.5,
  "Spain",         1979L, 27.1,
  "Sweden",        1970L, 46.9,
  "Sweden",        1979L, 57.4,
  "Switzerland",   1970L, 26.5,
  "Switzerland",   1979L, 33.2,
  "United States", 1970L, 30.3,
  "United States", 1979L, 32.5
)

receiptData %>%
  head(6) %>%
  knitr::kable()
```
```{r graph_1}
ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

This plot is not attractive, but it is useful for getting a handle on the data. Whether [iterating through an exploratory data analysis]({{< ref "/posts/2015-09-09-data-science.md" >}}) or preparing a graphic for publication, analysts will create many ugly graphics on the path to settling on a design and refining it.

For our first round of improvements, we can change the aspect ratio of the graphic and arrange the country labels so they don’t overlap with the data. We should also remove the “chart junk” in the background, such as the background grid, and label only the years of interest on the x-axis.

```{r graph_2, fig.height=8, fig.width=5}
ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1970),
                  nudge_x = -0.5, hjust = 1,
                  direction = "y", size = 5) +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1979),
                  nudge_x = 0.5, hjust = 0,
                  direction = "y", size = 5) +
  scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
  theme(panel.background = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 12)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

The [ggrepel](https://github.com/slowkow/ggrepel) package has done a great job preventing the labels from overlapping; I usually use `geom_text_repel()` instead of `geom_text()` during exploratory data analysis. Here, however, the segments connecting the labels to data points create confusing clutter. While these can be removed within the function, our final figure will be easier to understand if we nudge labels manually so that they’re as close to their data as possible.

We can also drop the axes; the year labels will go right above the data, and we’ll show each country’s value next to its label. Since we’re losing the axis titles, we’ll also embed a caption in this version of the graphic.

Finally, we’ll change the typeface. Since we’re trying to make something that would please Tufte, we’ll use a serif font. If you haven’t done so before, you may need to run `loadfonts()` from the [extrafont package](https://github.com/wch/extrafont) to tell R how to find system fonts. We’ll also further boost our “data-ink ratio” by making the lines thinner.
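
Whether a given typeface is available depends on the machine, so it can help to check before relying on it. The chunk below is a minimal sketch of one way to do that, not part of the original workflow: `pick_font_family()` is a hypothetical helper, `"Georgia"` is just an illustrative serif family, and the check assumes that fonts imported earlier with extrafont's `font_import()` are the ones worth looking for. It falls back to R's generic `"serif"` family when the preferred font is missing or extrafont is not installed.

```{r font_fallback_sketch}
# Hypothetical helper (not part of extrafont): return the preferred family
# if it appears in the list of available families, otherwise a fallback.
pick_font_family <- function(preferred, available, fallback = "serif") {
  if (preferred %in% available) preferred else fallback
}

# extrafont::fonts() lists the families recorded by a previous font_import();
# the guard lets this sketch run even where extrafont is not installed.
available_fonts <- if (requireNamespace("extrafont", quietly = TRUE)) {
  extrafont::fonts()
} else {
  character(0)
}

# "Georgia" is an illustrative choice of serif family
label_family <- pick_font_family("Georgia", available_fonts)
label_family
```

The resulting `label_family` could then be passed wherever a plot expects a `family` argument, so the figure still renders with a serif face on systems that lack the preferred font.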

```{r graph_3, fig.height=8, fig.width=8}
labelAdjustments <- tribble(
  ~country,        ~year, ~nudge_y,
  "Netherlands",   1970L,  0.3,
  "Norway",        1970L, -0.2,
  "Belgium",       1970L,  0.9,
  "Canada",        1970L, -0.1,
  "Finland",       1970L, -0.8,
  "Italy",         1970L,  0.4,
  "United States", 1970L, -0.5,
  "Greece",        1970L,  0.4,
  "Switzerland",   1970L, -0.3,
  "France",        1979L,  0.8,
  "Germany",       1979L, -0.7,
  "Britain",       1979L,  0.1,
  "Finland",       1979L, -0.1,
  "Canada",        1979L,  0.4,
  "Italy",         1979L, -0.5,
  "Switzerland",   1979L,  0.2,
  "United States", 1979L, -0.1,
  "Spain",         1979L,  0.3,
  "Japan",         1979L, -0.2
)

receiptData <- left_join(receiptData, labelAdjustments,
                         by = c("country", "year")) %>%
  mutate(receipts_nudged = ifelse(is.na(nudge_y), 0, nudge_y) + receipts)

update_geom_defaults("text", list(family = "Georgia", size = 4.0))
ggplot(receiptData, aes(year, receipts, group = country)) +
  # Slope lines (ggplot2 >= 3.4.0 prefers linewidth = 0.1 over size)
  geom_line(size = 0.1) +
  # Country names (manually nudged)
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -3) +
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 3) +
  # Data values
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -0.5) +
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 0.5) +
  # Column labels
  annotate("text", x = c(1970, 1979) + c(-0.5, 0.5),
           y = max(receiptData$receipts) + 3,
           label = c("1970", "1979"),
           hjust = c(1, 0)) +
  # Plot annotation
  annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
           label = paste("Current Receipts of Government as a",
                         "Percentage of Gross Domestic\nProduct, 1970 and 1979",
                         sep = "\n"),
           family = "Georgia", size = 4.0) +
  coord_cartesian(xlim = c(1945, 1990)) +
  theme(panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
```

## Why it’s worth it

It took several iterations to get the aesthetics of this plot just right, while an adept user of a graphics editor would be able to recreate it in minutes. However, this initial time saving obscures several advantages of completing the process in code rather than switching to a tool like Illustrator.

### R is free software

All of the programs I used to create this graphic (and share it with you) are [free software](https://www.fsf.org/about/what-is-free-software), unlike Adobe’s terrific but expensive Illustrator. [Inkscape](https://inkscape.org/en/) is an excellent open source alternative, but it is not as powerful as its commercial competitor. On the other hand, [R](https://www.r-project.org/) is arguably the most advanced and well-supported statistical analysis tool. Even if the politics of the open source movement aren’t important to you, free is a better price point.

### Code can be reproduced

Anybody who wants to recreate the final figure using a different set of countries, a new pair of years, or a favored economic indicator can run the code above on their own data. With minimal tweaking, they will have a graphic with aesthetics identical to those above. Reproducing this graphic using a multi-tool workflow is more complicated; a novice would first need to learn how to create a “rough draft” of it in software, and then learn how to use a graphics editor to improve its appearance.

### Scripted graphics are easier to update

Imagine that an analyst discovered a mistake in their analysis that needed to be corrected.
Or imagine the less stressful situation where a colleague suggested replacing some of the countries in the figure. In such cases, the work of creating the graphic will need to be redone: the analyst will need to add data or correct errors, generate a new plot in software, and re-edit the plot in their graphics editor. Using a scripted, code-based workflow, the analyst would only need to do the first step and re-run the script, possibly updating some manual tweaks. The initial time saving afforded by a graphics editor disappears after the first time this happens.

### Automation prevents errors

Not only does a scripted workflow make it easier to deal with errors, it also guards against them. When analysts adjust graphic elements in a GUI, there’s a risk of mistakenly moving, deleting, or mislabeling data points. Such mistakes are difficult to detect, but code-based workflows avoid these risks; if the data source and analysis are error-free, then the graphic will be too.

These considerations are among the reasons why scripted analyses are [considered best practice](https://peerj.com/preprints/3210/) in statistical analysis. Most of Tufte’s advice on creating graphics is excellent, but I recommend ignoring his suggestion to make the final edits in a program like Illustrator. Learning how to make polished graphics in software will ultimately save analysts time and help them avoid mistakes.
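
To make the reproducibility and error-checking arguments concrete, here is a minimal sketch of how the plotting code might be wrapped in a function. It is an illustration under stated assumptions rather than the code used for the figure above: `plot_slopegraph()` and its `label_gap` argument are hypothetical names, the input is assumed to be a tidy data frame with `country`, `year`, and `receipts` columns, and the manual label nudging is left out. The `stopifnot()` call shows how a script can refuse to draw a figure from malformed data.

```{r slopegraph_function_sketch, fig.height=5, fig.width=6}
library(dplyr)
library(ggplot2)

# Hypothetical helper: draw a bare-bones slopegraph from a tidy data frame
# with columns country, year, and receipts (an assumed layout).
plot_slopegraph <- function(data, label_gap = 0.5) {
  years <- sort(unique(data$year))
  # Fail early unless there are exactly two years and two rows per country
  stopifnot(
    length(years) == 2,
    all(count(data, country)$n == 2)
  )

  ggplot(data, aes(year, receipts, group = country)) +
    geom_line() +
    # Left-hand labels: country name followed by its value
    geom_text(aes(label = paste(country, sprintf("%0.1f", receipts))),
              data = filter(data, year == years[1]),
              hjust = 1, nudge_x = -label_gap) +
    # Right-hand labels: value followed by country name
    geom_text(aes(label = paste(sprintf("%0.1f", receipts), country)),
              data = filter(data, year == years[2]),
              hjust = 0, nudge_x = label_gap) +
    # Label only the two years and leave room for the text on either side
    scale_x_continuous(breaks = years, limits = years + c(-8, 8)) +
    theme(panel.background = element_blank(),
          axis.title = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())
}

# Three countries taken from the data set used earlier in this post
demoData <- tibble::tribble(
  ~country,  ~year, ~receipts,
  "Belgium", 1970L, 35.2,
  "Belgium", 1979L, 43.2,
  "Japan",   1970L, 20.7,
  "Japan",   1979L, 26.6,
  "Sweden",  1970L, 46.9,
  "Sweden",  1979L, 57.4
)

plot_slopegraph(demoData)
```

Swapping in different countries then means editing the data frame and calling the function again; a data set with a duplicated or missing row stops with an error instead of silently producing a misleading figure.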