---
title: "Recreating a Tufte Slope Graph"
description: Publication-ready graphics should be created in code, not a graphics editor.
date: 2018-07-22
categories:
  - Programming
  - Data Science
tags:
  - R
  - Dataviz
  - Design
---

Recently on Twitter, data visualization guru Edward R. Tufte wrote that
graphics produced by R are not “publication ready”. His proposed
workflow is to use statistical software to create an initial version of
a plot, and then make final improvements in Adobe Illustrator.

I disagree with this advice. First I’ll show the steps a data analyst
might take to create a high-quality graphic entirely in R. Then, I’ll
explain why I think this is a better approach.

## Publication-quality graphics in R

Page 158 of Tufte’s classic book, [The Visual Display of Quantitative
Information (2nd ed.)](https://www.edwardtufte.com/tufte/books_vdqi),
features a “slope graph” that shows the change in government receipts
for several countries between 1970 and 1979. Below are the first few
rows of these data in a [tidy data
frame](https://www.jstatsoft.org/article/view/v059i10), `receiptData`,
along with a quick and dirty slopegraph.

|country | year| receipts|
|:-------|----:|--------:|
|Belgium | 1970|     35.2|
|Belgium | 1979|     43.2|
|Britain | 1970|     40.7|
|Britain | 1979|     39.0|
|Canada  | 1970|     35.2|
|Canada  | 1979|     35.8|

```r
library(ggplot2)  # ggplot(), geom_line(), labs(), theme()
library(ggrepel)  # geom_text_repel()
library(dplyr)    # filter(), left_join(), mutate(), %>% (used further down)

ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

![A quick and dirty slopegraph](graph_1-1.png)

This plot is not attractive, but it is useful for getting a handle on the
data.
Whether [iterating through an exploratory data analysis]({{< ref "/posts/art-of-data-science/index.md" >}}) or preparing a graphic for
publication, analysts will create many ugly graphics on the path to settling
on a design and refining it.

For our first round of improvements, we can change the aspect ratio of
the graphic and arrange the country labels so they don’t overlap with
the data. We should also remove the “chart junk” in the background, such
as the background grid, and label only the years of interest on the
x-axis.

```r
ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1970),
                  nudge_x = -0.5, hjust = 1,
                  direction = "y", size = 5) +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1979),
                  nudge_x = 0.5, hjust = 0,
                  direction = "y", size = 5) +
  scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
  theme(panel.background = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 12)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

![A better but still imperfect graphic](graph_2-1.png)

The [ggrepel](https://github.com/slowkow/ggrepel) package has done a
great job preventing the labels from overlapping; I usually use
`geom_text_repel()` instead of `geom_text()` during exploratory data
analysis. Here, however, the segments connecting the labels to data
points create confusing clutter. While these can be removed within the
function, our final figure will be easier to understand if we nudge
labels manually so that they’re as close to their data as possible.

We can also drop the axes; the year labels will go right above the data,
and we’ll show each country’s value next to its label.
Since we’re
losing the axis titles, we’ll also embed a caption in this version of
the graphic.

Finally, we’ll change the typeface. Since we’re trying to make something
that would please Tufte, we’ll use a serif font. If you haven’t done so
before, you may need to run `loadfonts()` from the [extrafont
package](https://github.com/wch/extrafont) to tell R how to find system
fonts. We’ll also further boost our “data-ink ratio” by making the lines
thinner.

```r
library(tibble)  # tribble()

# Manual vertical offsets for labels that would otherwise collide
labelAdjustments <- tribble(
  ~country,        ~year, ~nudge_y,
  "Netherlands",   1970L,  0.3,
  "Norway",        1970L, -0.2,
  "Belgium",       1970L,  0.9,
  "Canada",        1970L, -0.1,
  "Finland",       1970L, -0.8,
  "Italy",         1970L,  0.4,
  "United States", 1970L, -0.5,
  "Greece",        1970L,  0.4,
  "Switzerland",   1970L, -0.3,
  "France",        1979L,  0.8,
  "Germany",       1979L, -0.7,
  "Britain",       1979L,  0.1,
  "Finland",       1979L, -0.1,
  "Canada",        1979L,  0.4,
  "Italy",         1979L, -0.5,
  "Switzerland",   1979L,  0.2,
  "United States", 1979L, -0.1,
  "Spain",         1979L,  0.3,
  "Japan",         1979L, -0.2
)

# Labels without an adjustment keep their original position (nudge of 0)
receiptData <- left_join(receiptData, labelAdjustments,
                         by = c("country", "year")) %>%
  mutate(receipts_nudged = ifelse(is.na(nudge_y), 0, nudge_y) + receipts)

update_geom_defaults("text", list(family = "Georgia", size = 4.0))
ggplot(receiptData, aes(year, receipts, group = country)) +
  # Slope lines (ggplot2 >= 3.4.0 prefers linewidth = 0.1 over size)
  geom_line(size = 0.1) +
  # Country names (manually nudged)
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -3) +
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 3) +
  # Data values
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -0.5) +
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 0.5) +
  # Column labels
  annotate("text", x = c(1970, 1979) + c(-0.5, 0.5),
           y = max(receiptData$receipts) + 3,
           label = c("1970", "1979"),
           hjust = c(1, 0)) +
  # Plot annotation
  annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
           label = paste("Current Receipts of Government as a",
                         "Percentage of Gross Domestic\nProduct, 1970 and 1979",
                         sep = "\n"),
           family = "Georgia", size = 4.0) +
  coord_cartesian(xlim = c(1945, 1990)) +
  theme(panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
```

![A publication-ready graphic](graph_3-1.png)

## Why it’s worth it

It took several iterations to get the aesthetics of this plot just
right, while an adept user of a graphics editor would be able to
recreate it in minutes. However, this initial time savings obscures
several advantages to completing this process in code as opposed to
switching to a tool like Illustrator.

### R is free software

All of the programs I used to create this graphic (and share it with
you) are [free
software](https://www.fsf.org/about/what-is-free-software), unlike
Adobe’s terrific but expensive Illustrator.
[Inkscape](https://inkscape.org/en/) is an excellent open source
alternative, but it is not as powerful as its commercial competitor. On
the other hand, [R](https://www.r-project.org/) is arguably the most
advanced and well-supported statistical analysis tool. Even if the
politics of the open source movement aren’t important to you, free is a
better price point.

### Code can be reproduced

Anybody who wants to recreate the final figure using a different set of
countries, a new pair of years, or a favored economic indicator can run
the code above on their own data.
With minimal tweaking, they will have
a graphic with aesthetics identical to those above. Reproducing this
graphic using a multi-tool workflow is more complicated; a novice would
first need to learn how to create a “rough draft” of it in software, and
then learn how to use a graphics editor to improve its appearance.

### Scripted graphics are easier to update

Imagine that an analyst discovers a mistake in their analysis that
needs to be corrected. Or imagine the less stressful situation where a
colleague suggests replacing some of the countries in the figure. In
such cases, the work of creating the graphic will need to be redone: the
analyst will need to add data or correct errors, generate a new plot in
software, and re-edit the plot in their graphics editor. Using a
scripted, code-based workflow, the analyst would only need to do the
first step, and possibly update some manual tweaks. The initial time
savings afforded by a graphics editor disappears after the first time
this happens.

### Automation prevents errors

Not only does a scripted workflow make it easier to deal with errors, it
also guards against them. When analysts adjust graphic elements in a GUI,
there’s a risk of mistakenly moving, deleting, or mislabeling data
points. Such mistakes are difficult to detect, but code-based workflows
avoid these risks; if the data source and analysis are error-free, then
the graphic will also be.

These considerations are among the reasons why scripted analyses are
[considered best practice](https://peerj.com/preprints/3210/) in
statistical analysis. Most of Tufte’s advice on creating graphics is
excellent, but I recommend ignoring his suggestion to make the final
edits in a program like Illustrator. Learning how to make polished
graphics in software will ultimately save analysts time and help them
avoid mistakes.
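
To make the update argument concrete, here is a minimal sketch; it is not the code used for the figures above. It relies only on the six rows shown in the table near the top of the post. The names `receiptSample` and `slopeGraph()` and the output file `slopegraph.pdf` are hypothetical, introduced purely for illustration, and the styling is deliberately simpler than the final figure.

```r
library(ggplot2)
library(dplyr)

# The six rows shown in the table earlier in the post (not the full data set).
receiptSample <- tibble::tribble(
  ~country,  ~year, ~receipts,
  "Belgium", 1970L, 35.2,
  "Belgium", 1979L, 43.2,
  "Britain", 1970L, 40.7,
  "Britain", 1979L, 39.0,
  "Canada",  1970L, 35.2,
  "Canada",  1979L, 35.8
)

# Hypothetical helper: draws a bare-bones slope graph for any data frame
# with country, year, and receipts columns covering exactly two years.
slopeGraph <- function(data) {
  years <- range(data$year)
  ggplot(data, aes(year, receipts, group = country)) +
    geom_line() +
    # Country labels to the left of the first year ...
    geom_text(aes(label = country),
              data = filter(data, year == years[1]),
              hjust = 1, nudge_x = -0.5) +
    # ... and to the right of the last year
    geom_text(aes(label = country),
              data = filter(data, year == years[2]),
              hjust = 0, nudge_x = 0.5) +
    scale_x_continuous(breaks = years, limits = years + c(-4, 4)) +
    theme(panel.background = element_blank(),
          axis.title = element_blank())
}

# Dropping a country (or correcting a value) only means re-running these lines.
p <- slopeGraph(filter(receiptSample, country != "Canada"))
ggsave("slopegraph.pdf", p, width = 6, height = 8)
```

Because the styling lives in one function, a corrected data set or a different selection of countries flows through to the saved file without any manual re-editing.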