---
engine: knitr
title: Improving performance
---

## Overview

1. Code organization
2. Check for existing solutions
3. Do as little as possible
4. Vectorise
5. Avoid copies

## Organizing code

- Write a function for each approach:
```{r}
mean1 <- function(x) mean(x)
mean2 <- function(x) sum(x) / length(x)
```
- Keep the old functions that you've tried, even the failures.
- Generate a representative test case:
```{r}
x <- runif(1e5)
```
- Use `bench::mark()` to compare the different versions (and include unit tests):
```{r}
bench::mark(
  mean1(x),
  mean2(x)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Check for existing solutions

- CRAN task views (http://cran.rstudio.com/web/views/)
- Reverse dependencies of Rcpp (https://cran.r-project.org/web/packages/Rcpp/)
- Talk to others!
  - Google (rseek)
  - Stack Overflow (the [R] tag)
  - https://community.rstudio.com/
  - DSLC community

## Do as little as possible

- Use a function tailored to a more specific type of input or output, or to a more specific problem:
  - `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are faster than equivalent invocations that use `apply()` because they are vectorised.
  - `vapply()` is faster than `sapply()` because it pre-specifies the output type.
  - `any(x == 10)` is much faster than `10 %in% x` because testing equality is simpler than testing set inclusion.
- Some functions coerce their inputs into a specific type. If your input is not the right type, the function has to do extra work.
  - e.g. `apply()` will always turn a data frame into a matrix.
- Other examples:
  - `read.csv()`: specify known column types with `colClasses`. (Also consider switching to `readr::read_csv()` or `data.table::fread()`, which are considerably faster than `read.csv()`.)
  - `factor()`: specify known levels with `levels`.
  - `cut()`: don't generate labels with `labels = FALSE` if you don't need them, or, even better, use `findInterval()` as mentioned in the "see also" section of the documentation.
  - `unlist(x, use.names = FALSE)` is much faster than `unlist(x)`.
  - `interaction()`: if you only need combinations that exist in the data, use `drop = TRUE`.

## Avoiding method dispatch

Calling a method directly, e.g. `mean.default()` instead of `mean()`, skips the cost of S3 method dispatch. This is only safe when you know the input type, and the saving matters mostly for small inputs:

```{r}
x <- runif(1e2)
bench::mark(
  mean(x),
  mean.default(x)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

`.Internal()` skips even more overhead, although it is not intended for everyday code:

```{r}
x <- runif(1e2)
bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

With a longer vector, dispatch is a much smaller share of the total time:

```{r}
x <- runif(1e4)
bench::mark(
  mean(x),
  mean.default(x),
  .Internal(mean(x))
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Avoiding input coercion

- `as.data.frame()` is quite slow because it coerces each element into a data frame and then `rbind()`s them together.
- Instead, if you have a named list with vectors of equal length, you can directly transform it into a data frame:

```{r}
quickdf <- function(l) {
  class(l) <- "data.frame"
  attr(l, "row.names") <- .set_row_names(length(l[[1]]))
  l
}

l <- lapply(1:26, function(i) runif(1e3))
names(l) <- letters

bench::mark(
  as.data.frame = as.data.frame(l),
  quick_df = quickdf(l)
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

*Caveat!* This method is fast because it's dangerous!
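How dangerous? `quickdf()` skips all the input checking that `as.data.frame()` performs. As a quick illustration (following the example in Advanced R), a list with unequal-length columns is accepted silently and yields a corrupt data frame:

```{r}
# No validation happens, so mismatched column lengths slip through and
# produce a corrupt data frame instead of an error.
quickdf(list(x = 1, y = 1:2))
```

If you strip a safety blanket like this, be sure you know exactly when the shortcut holds (here, a named list of equal-length vectors).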
## Vectorise

- Vectorisation means finding the existing R function that is implemented in C and most closely applies to your problem.
- Vectorised functions that apply to many scenarios: `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()`.
- Vectorised subsetting can lead to big improvements in speed.
- `cut()` and `findInterval()` for converting continuous variables to categorical.
- Be aware of vectorised functions like `cumsum()` and `diff()`.
- Matrix algebra is a general example of vectorisation.

## Avoiding copies

- Whenever you use `c()`, `append()`, `cbind()`, `rbind()`, or `paste()` to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home.

```{r}
random_string <- function() {
  paste(sample(letters, 50, replace = TRUE), collapse = "")
}
strings10 <- replicate(10, random_string())
strings100 <- replicate(100, random_string())

collapse <- function(xs) {
  out <- ""
  for (x in xs) {
    out <- paste0(out, x)
  }
  out
}

bench::mark(
  loop10 = collapse(strings10),
  loop100 = collapse(strings100),
  vec10 = paste(strings10, collapse = ""),
  vec100 = paste(strings100, collapse = ""),
  check = FALSE
)[c("expression", "min", "median", "itr/sec", "n_gc")]
```

## Case study: t-test

```{r}
m <- 1000
n <- 50
X <- matrix(rnorm(m * n, mean = 10, sd = 3), nrow = m)
grp <- rep(1:2, each = n / 2)
```

```{r}
# formula interface
system.time(
  for (i in 1:m) {
    t.test(X[i, ] ~ grp)$statistic
  }
)

# provide two vectors
system.time(
  for (i in 1:m) {
    t.test(X[i, grp == 1], X[i, grp == 2])$statistic
  }
)
```

Add functionality to save the values:

```{r}
compT <- function(i) {
  t.test(X[i, grp == 1], X[i, grp == 2])$statistic
}
system.time(t1 <- purrr::map_dbl(1:m, compT))
```

If you look at the source code of `stats:::t.test.default()`, you'll see that it does a lot more than just compute the t-statistic.

```{r}
# Do less work
my_t <- function(x, grp) {
  t_stat <- function(x) {
    m <- mean(x)
    n <- length(x)
    var <- sum((x - m) ^ 2) / (n - 1)
    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(x[grp == 1])
  g2 <- t_stat(x[grp == 2])

  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / se_total
}

system.time(t2 <- purrr::map_dbl(1:m, ~ my_t(X[., ], grp)))
stopifnot(all.equal(t1, t2))
```

This gives us a six-fold speed improvement!

```{r}
# Vectorise it
rowtstat <- function(X, grp) {
  t_stat <- function(X) {
    m <- rowMeans(X)
    n <- ncol(X)
    var <- rowSums((X - m) ^ 2) / (n - 1)
    list(m = m, n = n, var = var)
  }

  g1 <- t_stat(X[, grp == 1])
  g2 <- t_stat(X[, grp == 2])

  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)
  (g1$m - g2$m) / se_total
}

system.time(t3 <- rowtstat(X, grp))
stopifnot(all.equal(t1, t3))
```

1000 times faster than when we started!

## Other techniques

* [Read R blogs](http://www.r-bloggers.com/) to see what performance problems other people have struggled with, and how they have made their code faster.

* Read other R programming books, like _The Art of R Programming_ or Patrick Burns' [_R Inferno_](http://www.burns-stat.com/documents/books/the-r-inferno/), to learn about common traps.
* Take an algorithms and data structures course to learn some well-known ways of tackling certain classes of problems. I have heard good things about Princeton's [Algorithms course](https://www.coursera.org/course/algs4partI) offered on Coursera.

* Learn how to parallelise your code. Two places to start are _Parallel R_ and _Parallel Computing for Data Science_ (see the sketch after this list).

* Read general books about optimisation, like _Mature Optimisation_ or _The Pragmatic Programmer_.

* Read more R code: Stack Overflow, the R mailing list, DSLC, GitHub, etc.
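As a minimal sketch of the parallelisation point above (not covered in the chapter): base R's `parallel` package can spread embarrassingly parallel work, such as the per-row t-tests from the case study, across cores. This sketch assumes `X`, `grp`, `m`, and `t1` still exist from the case study; a PSOCK cluster is used because it also works on Windows, where fork-based `mclapply()` is unavailable, and the worker count of 2 is arbitrary.

```{r}
library(parallel)

# Spread the per-row t-tests over two workers. Each row is independent,
# so the only coordination needed is exporting the data once.
cl <- makeCluster(2)
clusterExport(cl, c("X", "grp"))
t_par <- parSapply(cl, 1:m, function(i) {
  t.test(X[i, grp == 1], X[i, grp == 2])$statistic
})
stopCluster(cl)

stopifnot(all.equal(t1, unname(t_par)))
```

For work this cheap, cluster start-up and data transfer can easily outweigh the gains; parallelism pays off when each task is genuinely expensive.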