---
engine: knitr
title: Improving performance
---

## Overview

1. Code organization
2. Check for existing solutions
3. Do as little as possible
4. Vectorise
5. Avoid copies

## Organizing code

- Write a function for each approach:
```{r}
mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)
```
- Keep the old functions that you've tried, even the failures.
- Generate a representative test case:
```{r}
x <- runif(1e5)
```
- Use `bench::mark()` to compare the different versions (and include unit tests):
```{r}
bench::mark(
  mean1(x),
  mean2(x)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Check for existing solutions

- CRAN task views (http://cran.rstudio.com/web/views/)
- Reverse dependencies of Rcpp (https://cran.r-project.org/web/packages/Rcpp/)
- Talk to others!
  - Google (rseek)
  - Stack Overflow (the [R] tag)
  - https://community.rstudio.com/
  - DSLC community

## Do as little as possible

- Use a function tailored to a more specific type of input or output, or to a more specific problem:
  - `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are faster than equivalent invocations that use `apply()` because they are vectorised.
  - `vapply()` is faster than `sapply()` because it pre-specifies the output type.
  - `any(x == 10)` is much faster than `10 %in% x` because testing equality is simpler than testing set inclusion.
- Some functions coerce their inputs into a specific type. If your input is not the right type, the function has to do extra work.
  - e.g. `apply()` will always turn a data frame into a matrix.
- Other examples:
  - `read.csv()`: specify known column types with `colClasses`. (Also consider switching to `readr::read_csv()` or `data.table::fread()`, which are considerably faster than `read.csv()`.)
  - `factor()`: specify known levels with `levels`.
  - `cut()`: don't generate labels with `labels = FALSE` if you don't need them, or, even better, use `findInterval()` as mentioned in the "see also" section of the documentation.
  - `unlist(x, use.names = FALSE)` is much faster than `unlist(x)`.
  - `interaction()`: if you only need combinations that exist in the data, use `drop = TRUE`.

## Avoiding method dispatch

Calling a method directly, e.g. `mean.default()` instead of `mean()`, skips the cost of S3 method dispatch. This is only safe when you know the input type, and the saving matters mostly for small inputs:

```{r}
x <- runif(1e2)
bench::mark(
  mean(x),
  mean.default(x)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

`.Internal()` skips even more overhead, although it is not intended for everyday code:

```{r}
x <- runif(1e2)
bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

With a longer vector, dispatch is a much smaller share of the total time:

```{r}
x <- runif(1e4)
bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Avoiding input coercion

- `as.data.frame()` is quite slow because it coerces each element into a data frame and then `rbind()`s them together.
- Instead, if you have a named list with vectors of equal length, you can directly transform it into a data frame:

```{r}
quickdf <- function(l) {
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(length(l[[1]]))
  l
}

l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters

bench::mark(
  as.data.frame = as.data.frame(l),
  quick_df = quickdf(l)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

*Caveat!* This method is fast because it's dangerous!
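How dangerous? `quickdf()` skips all the input checking that `as.data.frame()` performs. As a quick illustration (following the example in Advanced R), a list with unequal-length columns is accepted silently and yields a corrupt data frame:

```{r}
# No validation happens, so mismatched column lengths slip through and
# produce a corrupt data frame instead of an error.
quickdf(list(x = 1, y = 1:2))
```

If you strip a safety blanket like this, be sure you know exactly when the shortcut holds (here, a named list of equal-length vectors).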
## Vectorise

- Vectorisation means finding the existing R function that is implemented in C and most closely applies to your problem.
- Vectorised functions that apply to many scenarios: `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()`.
- Vectorised subsetting can lead to big improvements in speed.
- `cut()` and `findInterval()` for converting continuous variables to categorical.
- Be aware of vectorised functions like `cumsum()` and `diff()`.
- Matrix algebra is a general example of vectorisation.

## Avoiding copies

- Whenever you use `c()`, `append()`, `cbind()`, `rbind()`, or `paste()` to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home.

```{r}
random_string <- function() {
  paste(sample(letters, 50, replace = TRUE), collapse = "")
}
strings10 <- replicate(10, random_string())
strings100 <- replicate(100, random_string())

collapse <- function(xs) {
  out <- ""
  for (x in xs) {
    out <- paste0(out, x)
  }
  out
}

bench::mark(
  loop10 = collapse(strings10),
  loop100 = collapse(strings100),
  vec10 = paste(strings10, collapse = ""),
  vec100 = paste(strings100, collapse = ""),
  check = FALSE
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Case study: t-test

```{r}
m <- 1000
n <- 50
X <- matrix(rnorm(m * n, mean = 10, sd = 3), nrow = m)
grp <- rep(1:2, each = n / 2)
```

```{r}
# formula interface
system.time(
  for (i in 1:m) {
    t.test(X[i, ] ~ grp)$statistic
  }
)

# provide two vectors
system.time(
  for (i in 1:m) {
    t.test(X[i, grp == 1], X[i, grp == 2])$statistic
  }
)
```

Add functionality to save the values:

```{r}
compT <- function(i) {
  t.test(X[i, grp == 1], X[i, grp == 2])$statistic
}
system.time(t1 <- purrr::map_dbl(1:m, compT))
```

If you look at the source code of `stats:::t.test.default()`, you'll see that it does a lot more than just compute the t-statistic.

```{r}
# Do less work
my_t <- function(x, grp) {
  t_stat <- function(x) {
    m <- mean(x)
    n <- length(x)
    var <- sum((x - m) ^ 2) / (n - 1)
    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(x[grp == 1])
  g2 <- t_stat(x[grp == 2])

  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / se_total
}

system.time(t2 <- purrr::map_dbl(1:m, ~ my_t(X[., ], grp)))
stopifnot(all.equal(t1, t2))
```

This gives us a six-fold speed improvement!

```{r}
# Vectorise it
rowtstat <- function(X, grp) {
  t_stat <- function(X) {
    m <- rowMeans(X)
    n <- ncol(X)
    var <- rowSums((X - m) ^ 2) / (n - 1)
    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(X[, grp == 1])
  g2 <- t_stat(X[, grp == 2])

  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / se_total
}

system.time(t3 <- rowtstat(X, grp))
stopifnot(all.equal(t1, t3))
```

1000 times faster than when we started!

## Other techniques

* [Read R blogs](http://www.r-bloggers.com/) to see what performance problems other people have struggled with, and how they have made their code faster.

* Read other R programming books, like _The Art of R Programming_ or Patrick Burns' [_R Inferno_](http://www.burns-stat.com/documents/books/the-r-inferno/), to learn about common traps.
* Take an algorithms and data structures course to learn some well-known ways of tackling certain classes of problems. I have heard good things about Princeton's [Algorithms course](https://www.coursera.org/course/algs4partI) offered on Coursera.

* Learn how to parallelise your code. Two places to start are _Parallel R_ and _Parallel Computing for Data Science_ (see the sketch after this list).

* Read general books about optimisation, like _Mature Optimisation_ or _The Pragmatic Programmer_.

* Read more R code: Stack Overflow, the R mailing list, DSLC, GitHub, etc.
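As a minimal sketch of the parallelisation point above (not covered in the chapter): base R's `parallel` package can spread embarrassingly parallel work, such as the per-row t-tests from the case study, across cores. This sketch assumes `X`, `grp`, `m`, and `t1` still exist from the case study; a PSOCK cluster is used because it also works on Windows, where fork-based `mclapply()` is unavailable, and the worker count of 2 is arbitrary.

```{r}
library(parallel)

# Spread the per-row t-tests over two workers. Each row is independent,
# so the only coordination needed is exporting the data once.
cl <- makeCluster(2)
clusterExport(cl, c("X", "grp"))
t_par <- parSapply(cl, 1:m, function(i) {
  t.test(X[i, grp == 1], X[i, grp == 2])$statistic
})
stopCluster(cl)

stopifnot(all.equal(t1, unname(t_par)))
```

For work this cheap, cluster start-up and data transfer can easily outweigh the gains; parallelism pays off when each task is genuinely expensive.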