---
title: "Recreating a Tufte Slope Graph"
description: Publication-ready graphics should be created in code, not a graphics editor.
date: 2018-07-22
categories:
- Programming
- Data Science
tags:
- R
- Dataviz
- Design
---

Recently on Twitter, data visualization guru Edward R. Tufte wrote that
graphics produced by R are not “publication ready”. His proposed
workflow is to use statistical software to create an initial version of
a plot, and then make final improvements in Adobe Illustrator.

I disagree with this advice. First I’ll show the steps a data analyst
might take to create a high-quality graphic entirely in R. Then, I’ll
explain why I think this is a better approach.

## Publication-quality graphics in R

Page 158 of Tufte’s classic book, [The Visual Display of Quantitative
Information (2nd ed.)](https://www.edwardtufte.com/tufte/books_vdqi),
features a “slope graph” that shows the change in government receipts
for several countries between 1970 and 1979. Below are the first few
rows of these data in a [tidy data
frame](https://www.jstatsoft.org/article/view/v059i10), `receiptData`,
along with a quick and dirty slopegraph.

|country | year| receipts|
|:-------|----:|--------:|
|Belgium | 1970|     35.2|
|Belgium | 1979|     43.2|
|Britain | 1970|     40.7|
|Britain | 1979|     39.0|
|Canada  | 1970|     35.2|
|Canada  | 1979|     35.8|

```r
library(dplyr)    # filter(), left_join(), mutate(), %>% (used in later steps)
library(ggplot2)  # ggplot() and friends
library(ggrepel)  # geom_text_repel()

ggplot(receiptData, aes(year, receipts, group = country)) +
    geom_line() +
    geom_text_repel(aes(label = country)) +
    labs(x = "Year", y = "Government receipts as percentage of GDP")
```

![A quick and dirty slopegraph](graph_1-1.png)
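
The snippet above assumes that `receiptData` already exists; the post does not show how it was built. A minimal sketch, assuming only the six rows listed in the table (the full data set covers more countries) and integer years, could look like this:

```r
library(tibble)

# Only the six rows shown in the table above; the remaining countries
# would be entered the same way.
receiptData <- tribble(
    ~country,  ~year, ~receipts,
    "Belgium", 1970L,      35.2,
    "Belgium", 1979L,      43.2,
    "Britain", 1970L,      40.7,
    "Britain", 1979L,      39.0,
    "Canada",  1970L,      35.2,
    "Canada",  1979L,      35.8
)
```

Each row is one country in one year, which is the tidy layout that `ggplot()` expects when `group = country` is used to draw one line per country.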

This plot is not attractive, but it is useful for getting a handle on the
data. Whether [iterating through an exploratory data analysis]({{< ref
"/posts/art-of-data-science/index.md" >}}) or preparing a graphic for
publication, analysts will create many ugly graphics on the path to settling
on a design and refining it.

For our first round of improvements, we can change the aspect ratio of
the graphic and arrange the country labels so they don’t overlap with
the data. We should also remove “chartjunk” such as the background
grid, and label only the years of interest on the x-axis.

```r
ggplot(receiptData, aes(year, receipts, group = country)) +
    geom_line() +
    geom_text_repel(aes(label = country),
                    data = filter(receiptData, year == 1970),
                    nudge_x = -0.5, hjust = 1,
                    direction = "y", size = 5) +
    geom_text_repel(aes(label = country),
                    data = filter(receiptData, year == 1979),
                    nudge_x = 0.5, hjust = 0,
                    direction = "y", size = 5) +
    scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
    theme(panel.background = element_blank(),
          axis.title = element_text(size = 16),
          axis.text = element_text(size = 12)) +
    labs(x = "Year", y = "Government receipts as percentage of GDP")
```

![A better but still imperfect graphic](graph_2-1.png)

The [ggrepel](https://github.com/slowkow/ggrepel) package has done a
great job preventing the labels from overlapping; I usually use
`geom_text_repel()` instead of `geom_text()` during exploratory data
analysis. Here, however, the segments connecting the labels to data
points create confusing clutter. While these segments can be suppressed
through arguments to `geom_text_repel()`, our final figure will be
easier to understand if we nudge labels manually so that they’re as
close to their data as possible.
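
For reference, here is one way the connecting segments could be switched off while keeping ggrepel's automatic placement. This is a sketch on a three-row illustrative subset (the 1970 values from the table), not the approach used for the final figure; it relies on ggrepel's `min.segment.length` argument, where `Inf` means no segment is ever drawn.

```r
library(ggplot2)
library(ggrepel)

# Illustrative subset: the 1970 values from the table above
receipts1970 <- data.frame(
    country  = c("Belgium", "Britain", "Canada"),
    year     = 1970,
    receipts = c(35.2, 40.7, 35.2)
)

p <- ggplot(receipts1970, aes(year, receipts)) +
    geom_point() +
    # Labels are still repelled vertically, but no segment is drawn
    # because every segment is shorter than an infinite minimum length
    geom_text_repel(aes(label = country),
                    nudge_x = -0.5, hjust = 1, direction = "y",
                    min.segment.length = Inf)
```

Belgium and Canada share the same 1970 value, so their labels are pushed apart; without segments it becomes harder to tell which label belongs to which point, which is the reason for nudging by hand instead.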

We can also drop the axes; the year labels will go right above the data,
and we’ll show each country’s value next to its label. Since we’re
losing the axis titles, we’ll also embed a caption in this version of
the graphic.

Finally, we’ll change the typeface. Since we’re trying to make something
that would please Tufte, we’ll use a serif font. If you haven’t done so
before, you may need to run `loadfonts()` from the [extrafont
package](https://github.com/wch/extrafont) to tell R how to find system
fonts. We’ll also further boost our “data-ink ratio” by making the lines
thinner.
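
The font setup might look roughly like the following. This is a sketch that assumes the extrafont package is installed and that Georgia is present on the system; the one-time `font_import()` scan is left commented out because it can take several minutes.

```r
library(extrafont)

# One-time setup: scan system fonts and build extrafont's font table.
# font_import()

# Register the imported fonts with R's PDF device for this session
loadfonts(device = "pdf", quiet = TRUE)

# Check that the typeface is available before relying on it
if (!"Georgia" %in% fonts()) {
    message("Georgia is not in the font table; ",
            "run font_import() or choose another serif family")
}
```

If Georgia is unavailable, any other installed serif family can be substituted in the plotting code.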

```r
library(tibble)  # tribble()

# Manual vertical offsets for labels that would otherwise collide
labelAdjustments <- tribble(
    ~country,        ~year, ~nudge_y,
    "Netherlands",   1970L,      0.3,
    "Norway",        1970L,     -0.2,
    "Belgium",       1970L,      0.9,
    "Canada",        1970L,     -0.1,
    "Finland",       1970L,     -0.8,
    "Italy",         1970L,      0.4,
    "United States", 1970L,     -0.5,
    "Greece",        1970L,      0.4,
    "Switzerland",   1970L,     -0.3,
    "France",        1979L,      0.8,
    "Germany",       1979L,     -0.7,
    "Britain",       1979L,      0.1,
    "Finland",       1979L,     -0.1,
    "Canada",        1979L,      0.4,
    "Italy",         1979L,     -0.5,
    "Switzerland",   1979L,      0.2,
    "United States", 1979L,     -0.1,
    "Spain",         1979L,      0.3,
    "Japan",         1979L,     -0.2
)

# Attach the offsets; rows without an adjustment keep their original position
receiptData <- left_join(receiptData, labelAdjustments,
                         by = c("country", "year")) %>%
    mutate(receipts_nudged = ifelse(is.na(nudge_y), 0, nudge_y) + receipts)

# Use the serif typeface for every text layer
update_geom_defaults("text", list(family = "Georgia", size = 4.0))
ggplot(receiptData, aes(year, receipts, group = country)) +
    # Slope lines (`linewidth` needs ggplot2 >= 3.4.0; older versions use `size`)
    geom_line(linewidth = 0.1) +
    # Country names (manually nudged)
    geom_text(aes(year, receipts_nudged, label = country),
              data = filter(receiptData, year == 1970),
              hjust = 1, nudge_x = -3) +
    geom_text(aes(year, receipts_nudged, label = country),
              data = filter(receiptData, year == 1979),
              hjust = 0, nudge_x = 3) +
    # Data values
    geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
              data = filter(receiptData, year == 1970),
              hjust = 1, nudge_x = -0.5) +
    geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
              data = filter(receiptData, year == 1979),
              hjust = 0, nudge_x = 0.5) +
    # Column labels
    annotate("text", x = c(1970, 1979) + c(-0.5, 0.5),
             y = max(receiptData$receipts) + 3,
             label = c("1970", "1979"),
             hjust = c(1, 0)) +
    # Plot annotation
    annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
             label = paste("Current Receipts of Government as a",
                           "Percentage of Gross Domestic\nProduct, 1970 and 1979",
                           sep = "\n"),
             family = "Georgia", size = 4.0) +
    # Widen the x range to leave room for the labels and the caption
    coord_cartesian(xlim = c(1945, 1990)) +
    theme(panel.background = element_blank(),
          axis.title = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank())
```

![A publication-ready graphic](graph_3-1.png)
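
The post stops at the rendered image. To hand the finished graphic to a publisher, it also has to be written to disk at a fixed size; one possible way, sketched here with a stand-in plot and with an output path and dimensions that are assumptions rather than the values used for the image above, is `ggsave()`:

```r
library(ggplot2)

# Stand-in plot; in practice this would be the slopegraph built above
p <- ggplot(data.frame(x = c(1970, 1979), y = c(35.2, 43.2)), aes(x, y)) +
    geom_line()

# Vector output with explicit physical dimensions (inches by default).
# The file name and the 6 x 8 size are illustrative choices.
outFile <- file.path(tempdir(), "slopegraph.pdf")
ggsave(outFile, plot = p, width = 6, height = 8)
```

Because the dimensions live in the script, every regenerated version of the figure comes out at the same size. When a non-default typeface such as Georgia is used, passing `device = cairo_pdf` to `ggsave()` is one option for getting the font into the PDF on systems where R was built with Cairo support.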

## Why it’s worth it

It took several iterations to get the aesthetics of this plot just
right, while an adept user of a graphics editor would be able to
recreate it in minutes. However, this initial time savings obscures
several advantages to completing this process in code as opposed to
switching to a tool like Illustrator.

### R is free software

All of the programs I used to create this graphic (and share it with
you) are [free
software](https://www.fsf.org/about/what-is-free-software), unlike
Adobe’s terrific but expensive Illustrator.
[Inkscape](https://inkscape.org/en/) is an excellent open source
alternative, but it is not as powerful as its commercial competitor. On
the other hand, [R](https://www.r-project.org/) is arguably the most
advanced and well-supported statistical analysis tool. Even if the
politics of the open source movement aren’t important to you, free is a
better price point.

### Code can be reproduced

Anybody who wants to recreate the final figure using a different set of
countries, a new pair of years, or a favored economic indicator can run
the code above on their own data. With minimal tweaking, they will have
a graphic with aesthetics identical to those above. Reproducing this
graphic using a multi-tool workflow is more complicated; a novice would
first need to learn how to create a “rough draft” of it in statistical
software, and then learn how to use a graphics editor to improve its
appearance.
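
To make the reuse concrete, the plotting code could be wrapped in a function. The sketch below is hypothetical and not part of the original analysis: `plot_slopegraph()` is an invented name, it omits the manual label nudges and the caption, and it assumes ggplot2 3.4.0 or later (for `linewidth`) and a data frame with one row per entity per period.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: a bare-bones slopegraph from any tidy data frame
plot_slopegraph <- function(data, x, y, label) {
    # First and last period present in the data
    ends <- range(pull(data, {{ x }}))
    firstPeriod <- filter(data, {{ x }} == ends[1])
    lastPeriod  <- filter(data, {{ x }} == ends[2])

    ggplot(data, aes({{ x }}, {{ y }}, group = {{ label }})) +
        geom_line(linewidth = 0.1) +
        geom_text(aes(label = {{ label }}), data = firstPeriod,
                  hjust = 1, nudge_x = -0.5) +
        geom_text(aes(label = {{ label }}), data = lastPeriod,
                  hjust = 0, nudge_x = 0.5) +
        # Leave horizontal room for the labels on either side
        scale_x_continuous(expand = expansion(mult = 0.5)) +
        theme(panel.background = element_blank(),
              axis.title = element_blank(),
              axis.text = element_blank(),
              axis.ticks = element_blank())
}

# Usage with the six rows from the table earlier in the post
toyData <- data.frame(
    country  = rep(c("Belgium", "Britain", "Canada"), each = 2),
    year     = rep(c(1970, 1979), times = 3),
    receipts = c(35.2, 43.2, 40.7, 39.0, 35.2, 35.8)
)
p <- plot_slopegraph(toyData, year, receipts, country)
```

Swapping in a different indicator or pair of years then only means passing a different data frame and column names.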

### Scripted graphics are easier to update

Imagine that an analyst discovered a mistake in their analysis that
needed to be corrected. Or imagine the less stressful situation where a
colleague suggested replacing some of the countries in the figure. With
a multi-tool workflow, the work of creating the graphic will need to be
redone: the analyst will need to add data or correct errors, generate a
new plot in statistical software, and re-edit the plot in their graphics
editor. Using a scripted, code-based workflow, the analyst would only
need to do the first step, and possibly update some manual tweaks. The
initial time savings afforded by a graphics editor disappear the first
time this happens.

### Automation prevents errors

Not only does a scripted workflow make it easier to deal with errors, it
also guards against them. When analysts adjust graphic elements in a
GUI, there’s a risk of mistakenly moving, deleting, or mislabeling data
points. Such mistakes are difficult to detect, but code-based workflows
avoid these risks; if the data source and analysis are error-free, then
the graphic will also be.
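
A script can also check its own inputs before anything is drawn. The checks below are a hypothetical sketch rather than part of the original analysis: `check_receipts()` is an invented name, the rules are assumptions about what a valid slopegraph input looks like, and the named `stopifnot()` messages require R 4.0 or later.

```r
library(dplyr)

# Hypothetical input checks for data shaped like `receiptData`
check_receipts <- function(data) {
    perCountry <- count(data, country)
    stopifnot(
        "missing receipts values" = !anyNA(data$receipts),
        "duplicated country/year rows" =
            !any(duplicated(data[c("country", "year")])),
        "each country needs one row per year" =
            all(perCountry$n == n_distinct(data$year)),
        "receipts should be a percentage between 0 and 100" =
            all(data$receipts >= 0 & data$receipts <= 100)
    )
    invisible(data)
}

goodData <- data.frame(
    country  = rep(c("Belgium", "Britain"), each = 2),
    year     = rep(c(1970L, 1979L), times = 2),
    receipts = c(35.2, 43.2, 40.7, 39.0)
)
check_receipts(goodData)  # passes silently

# Dropping one row leaves Britain with a single year, which is caught
badResult <- try(check_receipts(goodData[-4, ]), silent = TRUE)
```

Running such a check at the top of the plotting script means a malformed update to the data stops the script instead of silently producing a misleading figure.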

These considerations are among the reasons why scripted analyses are
[considered best practice](https://peerj.com/preprints/3210/) in
statistical analysis. Most of Tufte’s advice on creating graphics is
excellent, but I recommend ignoring his suggestion to make the final
edits in a program like Illustrator. Learning how to make polished
graphics in code will ultimately save analysts time and help them
avoid mistakes.