index.Rmd - www.eamoncaddigan.net - Content and configuration for https://www.eamoncaddigan.net

index.Rmd (11834B)
---
title: "Recreating a Tufte Slopegraph"
author: "Eamon Caddigan"
date: "2018-07-05"
output: md_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r load_libraries, include=FALSE, warning=FALSE, message=FALSE}
#extrafont::loadfonts(device="win")
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(ggrepel))
```

Recently on Twitter, data visualization guru Edward R. Tufte wrote that graphics produced by R are not “publication ready”. His proposed workflow is to use statistical software to create an initial version of a plot, and then make final improvements in Adobe Illustrator.

I disagree with this advice. First I’ll show the steps a data analyst might take to create a high-quality graphic entirely in R. Then, I’ll explain why I think this is a better approach.

## Publication quality graphics in R

Page 158 of Tufte’s classic book, [The Visual Display of Quantitative Information (2nd ed.)](https://www.edwardtufte.com/tufte/books_vdqi), features a “slopegraph” that shows the change in government receipts for several countries between 1970 and 1979. Below are the first few rows of these data in a [tidy data frame](https://www.jstatsoft.org/article/view/v059i10), `receiptData`, along with a quick and dirty slopegraph.

```{r receipt_data, echo=FALSE}
receiptData <- tribble(
  ~country,        ~year, ~receipts,
  "Belgium",       1970L, 35.2,
  "Belgium",       1979L, 43.2,
  "Britain",       1970L, 40.7,
  "Britain",       1979L, 39.0,
  "Canada",        1970L, 35.2,
  "Canada",        1979L, 35.8,
  "Finland",       1970L, 34.9,
  "Finland",       1979L, 38.2,
  "France",        1970L, 39.0,
  "France",        1979L, 43.4,
  "Germany",       1970L, 37.5,
  "Germany",       1979L, 42.9,
  "Greece",        1970L, 26.8,
  "Greece",        1979L, 30.6,
  "Italy",         1970L, 30.4,
  "Italy",         1979L, 35.7,
  "Japan",         1970L, 20.7,
  "Japan",         1979L, 26.6,
  "Netherlands",   1970L, 44.0,
  "Netherlands",   1979L, 55.8,
  "Norway",        1970L, 43.5,
  "Norway",        1979L, 52.2,
  "Spain",         1970L, 22.5,
  "Spain",         1979L, 27.1,
  "Sweden",        1970L, 46.9,
  "Sweden",        1979L, 57.4,
  "Switzerland",   1970L, 26.5,
  "Switzerland",   1979L, 33.2,
  "United States", 1970L, 30.3,
  "United States", 1979L, 32.5
)

receiptData %>%
  head(6) %>%
  knitr::kable()
```
```{r graph_1}
ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

This plot is not attractive, but it is useful for getting a handle on the data. Whether [iterating through an exploratory data analysis]({{< ref "/posts/2015-09-09-data-science.md" >}}) or preparing a graphic for publication, analysts will create many ugly graphics on the path to settling on a design and refining it.

For our first round of improvements, we can change the aspect ratio of the graphic and arrange the country labels so they don’t overlap with the data. We should also remove “chartjunk” such as the background grid, and label only the years of interest on the x-axis.

```{r graph_2, fig.height=8, fig.width=5}
ggplot(receiptData, aes(year, receipts, group = country)) +
  geom_line() +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1970),
                  nudge_x = -0.5, hjust = 1,
                  direction = "y", size = 5) +
  geom_text_repel(aes(label = country),
                  data = filter(receiptData, year == 1979),
                  nudge_x = 0.5, hjust = 0,
                  direction = "y", size = 5) +
  scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
  theme(panel.background = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 12)) +
  labs(x = "Year", y = "Government receipts as percentage of GDP")
```

The [ggrepel](https://github.com/slowkow/ggrepel) package has done a great job preventing the labels from overlapping; I usually use `geom_text_repel()` instead of `geom_text()` during exploratory data analysis. Here, however, the segments connecting the labels to data points create confusing clutter. While these can be removed within the function, our final figure will be easier to understand if we nudge labels manually so that they’re as close to their data as possible.
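
For reference, here is a minimal sketch of removing the segments within the function. It relies on ggrepel's documented `min.segment.length` argument, where `Inf` means no connecting segments are drawn. The three-row `demo` data frame is a stand-in used only to keep the example self-contained; its values are copied from the 1979 rows of `receiptData`.

```r
library(ggplot2)
library(ggrepel)

# Stand-in data: three of the 1979 values from receiptData
demo <- data.frame(
  country  = c("Sweden", "Netherlands", "Norway"),
  year     = 1979L,
  receipts = c(57.4, 55.8, 52.2)
)

# min.segment.length = Inf asks ggrepel never to draw connecting segments;
# the labels are still repelled from one another along the y direction.
p_no_segments <- ggplot(demo, aes(year, receipts, label = country)) +
  geom_point() +
  geom_text_repel(min.segment.length = Inf,
                  nudge_x = 0.5, hjust = 0, direction = "y")
```

This tidies the plot, but the labels can still drift away from their data points, which is why the manual nudging approach is used for the final figure.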

We can also drop the axes; the year labels will go right above the data, and we’ll show each country’s value next to its label. Since we’re losing the axis titles, we’ll also embed a caption in this version of the graphic.

Finally, we’ll change the typeface. Since we’re trying to make something that would please Tufte, we’ll use a serif font. If you haven’t done so before, you may need to run `loadfonts()` from the [extrafont package](https://github.com/wch/extrafont) to tell R how to find system fonts. We’ll also further boost our “data-ink ratio” by making the lines thinner.

```{r graph_3, fig.height=8, fig.width=8}
# Manual vertical offsets for labels that would otherwise collide
labelAdjustments <- tribble(
  ~country,        ~year, ~nudge_y,
  "Netherlands",   1970L,  0.3,
  "Norway",        1970L, -0.2,
  "Belgium",       1970L,  0.9,
  "Canada",        1970L, -0.1,
  "Finland",       1970L, -0.8,
  "Italy",         1970L,  0.4,
  "United States", 1970L, -0.5,
  "Greece",        1970L,  0.4,
  "Switzerland",   1970L, -0.3,
  "France",        1979L,  0.8,
  "Germany",       1979L, -0.7,
  "Britain",       1979L,  0.1,
  "Finland",       1979L, -0.1,
  "Canada",        1979L,  0.4,
  "Italy",         1979L, -0.5,
  "Switzerland",   1979L,  0.2,
  "United States", 1979L, -0.1,
  "Spain",         1979L,  0.3,
  "Japan",         1979L, -0.2
)

# Rows without an adjustment get a nudge of 0, so their labels stay put
receiptData <- left_join(receiptData, labelAdjustments,
                         by = c("country", "year")) %>%
  mutate(receipts_nudged = coalesce(nudge_y, 0) + receipts)

update_geom_defaults("text", list(family = "Georgia", size = 4.0))
ggplot(receiptData, aes(year, receipts, group = country)) +
  # Slope lines (ggplot2 >= 3.4.0 prefers `linewidth` over `size` for lines)
  geom_line(size = 0.1) +
  # Country names (manually nudged)
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -3) +
  geom_text(aes(year, receipts_nudged, label = country),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 3) +
  # Data values
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1970),
            hjust = 1, nudge_x = -0.5) +
  geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
            data = filter(receiptData, year == 1979),
            hjust = 0, nudge_x = 0.5) +
  # Column labels
  annotate("text", x = c(1970, 1979) + c(-0.5, 0.5),
           y = max(receiptData$receipts) + 3,
           label = c("1970", "1979"),
           hjust = c(1, 0)) +
  # Plot annotation
  annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
           label = paste("Current Receipts of Government as a",
                         "Percentage of Gross Domestic\nProduct, 1970 and 1979",
                         sep = "\n"),
           family = "Georgia", size = 4.0) +
  coord_cartesian(xlim = c(1945, 1990)) +
  theme(panel.background = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())
```

## Why it’s worth it

It took several iterations to get the aesthetics of this plot just right, while an adept user of a graphics editor would be able to recreate it in minutes. However, this initial time savings obscures several advantages to completing this process in code as opposed to switching to a tool like Illustrator.

### R is free software

All of the programs I used to create this graphic (and share it with you) are [free software](https://www.fsf.org/about/what-is-free-software), unlike Adobe’s terrific but expensive Illustrator. [Inkscape](https://inkscape.org/en/) is an excellent open source alternative, but it is not as powerful as its commercial competitor. On the other hand, [R](https://www.r-project.org/) is arguably the most advanced and well-supported statistical analysis tool. Even if the politics of the open source movement aren’t important to you, free is a better price point.

### Code can be reproduced

Anybody who wants to recreate the final figure using a different set of countries, a new pair of years, or a favored economic indicator can run the code above on their own data. With minimal tweaking, they will have a graphic with aesthetics identical to those above. Reproducing this graphic using a multi-tool workflow is more complicated; a novice would first need to learn how to create a “rough draft” of it in software, and then learn how to use a graphics editor to improve its appearance.
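
One way to make that reuse concrete is to wrap the plotting code in a function. The sketch below is an illustration rather than the code used for the figure above: `plot_slopegraph()` and its arguments are hypothetical names, the `demo` data are made up, and the manual label nudging is left out for brevity.

```r
library(ggplot2)

# Hypothetical helper (not from the post): draws a bare-bones slopegraph from
# a long-format data frame. `group`, `time`, and `value` are column names
# given as strings.
plot_slopegraph <- function(data, group, time, value) {
  times <- sort(unique(data[[time]]))
  stopifnot(length(times) == 2)  # a slopegraph compares exactly two time points

  left  <- data[data[[time]] == times[1], ]
  right <- data[data[[time]] == times[2], ]

  ggplot(data, aes(.data[[time]], .data[[value]], group = .data[[group]])) +
    geom_line() +
    # Group names to the left of the first column and right of the second
    geom_text(aes(label = .data[[group]]), data = left,
              hjust = 1, nudge_x = -0.5) +
    geom_text(aes(label = .data[[group]]), data = right,
              hjust = 0, nudge_x = 0.5) +
    # Extra horizontal room so the labels are not clipped
    scale_x_continuous(breaks = times, expand = expansion(mult = 0.5)) +
    theme(panel.background = element_blank(),
          axis.title = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank())
}

# Made-up numbers, purely for illustration
demo <- data.frame(
  region = rep(c("North", "South", "East"), each = 2),
  period = rep(c(2000L, 2010L), times = 3),
  share  = c(10.5, 12.0, 8.2, 7.9, 9.1, 11.4)
)
p <- plot_slopegraph(demo, group = "region", time = "period", value = "share")
```

Swapping in a different indicator or pair of years then only means passing a different data frame and column names.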

### Scripted graphics are easier to update

Imagine that an analyst discovered a mistake in their analysis that needed to be corrected. Or imagine the less stressful situation where a colleague suggested replacing some of the countries in the figure. In such cases, the work of creating the graphic will need to be redone: the analyst will need to add data or correct errors, generate a new plot in software, and re-edit the plot in their graphics editor. Using a scripted, code-based workflow, the analyst would only need to do the first step, and possibly update some manual tweaks. The initial time savings afforded by a graphics editor disappears after the first time this happens.
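
As a rough sketch of that first step, a correction can be applied to the data in code, assuming dplyr 1.0.0 or later for `rows_update()`. The correction itself is invented for illustration (it is not a real erratum), and `receipts_demo` repeats four rows of the post's data so the example is self-contained.

```r
library(dplyr)
library(tibble)

# Four rows copied from receiptData
receipts_demo <- tribble(
  ~country,  ~year, ~receipts,
  "Belgium", 1970L, 35.2,
  "Belgium", 1979L, 43.2,
  "Britain", 1970L, 40.7,
  "Britain", 1979L, 39.0
)

# Hypothetical correction: pretend Britain's 1979 value should be 39.5
corrections <- tribble(
  ~country,  ~year, ~receipts,
  "Britain", 1979L, 39.5
)

# Overwrite matching rows, keyed on country and year; other rows are untouched
receipts_fixed <- rows_update(receipts_demo, corrections,
                              by = c("country", "year"))
```

Re-running the plotting code on the corrected data frame then regenerates the figure with no manual redrawing.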

### Automation prevents errors

Not only does a scripted workflow make it easier to deal with errors, it guards against them. When analysts adjust graphic elements in a GUI, there’s a risk of mistakenly moving, deleting, or mislabeling data points. Such mistakes are difficult to detect, but code-based workflows avoid these risks; if the data source and analysis are error-free, then the graphic will also be.

These considerations are among the reasons why scripted analyses are [considered best practice](https://peerj.com/preprints/3210/) in statistical analysis. Most of Tufte’s advice on creating graphics is excellent, but I recommend ignoring his suggestion to make the final edits in a program like Illustrator. Learning how to make polished graphics in software will ultimately save analysts time and help them avoid mistakes.