bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git
Log | Files | Refs | README | LICENSE

html.json (10676B)


      1 {
      2   "hash": "355b584146ead093c1e2e038c1794779",
      3   "result": {
      4     "engine": "knitr",
      5     "markdown": "---\nengine: knitr\ntitle: Improving performance\n---\n\n## Overview\n\n1. Code organization\n2. Check for existing solutions\n3. Do as little as possible\n4. Vectorise\n5. Avoid Copies\n\n## Organizing code\n\n- Write a function for each approach\n\n::: {.cell}\n\n```{.r .cell-code}\nmean1 <- function(x) mean(x)\nmean2 <- function(x) sum(x) / length(x)\n```\n:::\n\n- Keep old functions that you've tried, even the failures\n- Generate a representative test case\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e5)\n```\n:::\n\n- Use `bench::mark` to compare the different versions (and include unit tests)\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n  mean1(x),\n  mean2(x)\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 4\n#>   expression      min   median `itr/sec`\n#>   <bch:expr> <bch:tm> <bch:tm>     <dbl>\n#> 1 mean1(x)    146.4µs  166.8µs     5839.\n#> 2 mean2(x)     67.1µs   75.7µs    12950.\n```\n\n\n:::\n:::\n\n\n## Check for Existing Solution\n- CRAN task views (http://cran.rstudio.com/web/views/)\n- Reverse dependencies of Rcpp (https://cran.r-project.org/web/packages/Rcpp/)\n- Talk to others!\n  - Google (rseek)\n  - Stackoverflow ([R])\n  - https://community.rstudio.com/\n  - DSLC community\n\n## Do as little as possible\n- use a function tailored to a more specific type of input or output, or to a more specific problem\n  - `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()` are faster than equivalent invocations that use `apply()` because they are vectorised\n  - `vapply()` is faster than `sapply()` because it pre-specifies the output type\n  - `any(x == 10)` is much faster than `10 %in% x` because testing equality is simpler than testing set inclusion\n- Some functions coerce their inputs into a specific type. If your input is not the right type, the function has to do extra work\n  - e.g. `apply()` will always turn a dataframe into a matrix\n- Other examples\n  - `read.csv()`: specify known column types with `colClasses`. (Also consider\n  switching to `readr::read_csv()` or `data.table::fread()` which are \n  considerably faster than `read.csv()`.)\n\n  - `factor()`: specify known levels with `levels`.\n\n  - `cut()`: don't generate labels with `labels = FALSE` if you don't need them,\n  or, even better, use `findInterval()` as mentioned in the \"see also\" section\n  of the documentation.\n  \n  - `unlist(x, use.names = FALSE)` is much faster than `unlist(x)`.\n\n  - `interaction()`: if you only need combinations that exist in the data, use\n  `drop = TRUE`.\n  \n## Avoiding Method Dispatch\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e2)\nbench::mark(\n  mean(x),\n  mean.default(x)\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 4\n#>   expression           min   median `itr/sec`\n#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl>\n#> 1 mean(x)            2.8µs    3.1µs   295383.\n#> 2 mean.default(x)    900ns    1.1µs   839532.\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e2)\nbench::mark(\n  mean(x),\n  mean.default(x),\n  .Internal(mean(x))\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 3 × 4\n#>   expression              min   median `itr/sec`\n#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl>\n#> 1 mean(x)               2.7µs    2.9µs   310733.\n#> 2 mean.default(x)       900ns    1.1µs   849192.\n#> 3 .Internal(mean(x))    100ns    200ns  4495392.\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e4)\nbench::mark(\n  mean(x),\n  mean.default(x),\n  .Internal(mean(x))\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 3 × 4\n#>   expression              min   median `itr/sec`\n#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl>\n#> 1 mean(x)              16.6µs   17.3µs    55604.\n#> 2 mean.default(x)      14.4µs   14.9µs    64958.\n#> 3 .Internal(mean(x))   13.6µs   13.7µs    70368.\n```\n\n\n:::\n:::\n\n\n## Avoiding Input Coercion\n- `as.data.frame()` is quite slow because it coerces each element into a data frame and then `rbind()`s them together\n- instead, if you have a named list with vectors of equal length, you can directly transform it into a data frame\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquickdf <- function(l) {\n  class(l) <- \"data.frame\"\n  attr(l, \"row.names\") <- .set_row_names(length(l[[1]]))\n  l\n}\nl <- lapply(1:26, function(i) runif(1e3))\nnames(l) <- letters\nbench::mark(\n  as.data.frame = as.data.frame(l),\n  quick_df      = quickdf(l)\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 4\n#>   expression         min   median `itr/sec`\n#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl>\n#> 1 as.data.frame  589.5µs  656.2µs     1447.\n#> 2 quick_df         3.7µs    4.3µs   207929.\n```\n\n\n:::\n:::\n\n\n*Caveat!* This method is fast because it's dangerous!\n\n## Vectorise\n- vectorisation means finding the existing R function that is implemented in C and most closely applies to your problem\n- Vectorised functions that apply to many scenarios\n  - `rowSums()`, `colSums()`, `rowMeans()`, and `colMeans()`\n  - Vectorised subsetting can lead to big improvements in speed\n  - `cut()` and `findInterval()` for converting continuous variables to categorical\n  - Be aware of vectorised functions like `cumsum()` and `diff()`\n  - Matrix algebra is a general example of vectorisation\n\n## Avoiding copies\n\n- Whenever you use c(), append(), cbind(), rbind(), or paste() to create a bigger object, R must first allocate space for the new object and then copy the old object to its new home.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrandom_string <- function() {\n  paste(sample(letters, 50, replace = TRUE), collapse = \"\")\n}\nstrings10 <- replicate(10, random_string())\nstrings100 <- replicate(100, random_string())\ncollapse <- function(xs) {\n  out <- \"\"\n  for (x in xs) {\n    out <- paste0(out, x)\n  }\n  out\n}\nbench::mark(\n  loop10  = collapse(strings10),\n  loop100 = collapse(strings100),\n  vec10   = paste(strings10, collapse = \"\"),\n  vec100  = paste(strings100, collapse = \"\"),\n  check = FALSE\n)[c(\"expression\", \"min\", \"median\", \"itr/sec\", \"n_gc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 4 × 4\n#>   expression      min   median `itr/sec`\n#>   <bch:expr> <bch:tm> <bch:tm>     <dbl>\n#> 1 loop10       17.1µs     18µs    52743.\n#> 2 loop100     460.3µs  491.5µs     1959.\n#> 3 vec10         2.7µs    2.9µs   317912.\n#> 4 vec100       16.3µs   18.5µs    51297.\n```\n\n\n:::\n:::\n\n\n## Case study: t-test\n\n\n::: {.cell}\n\n```{.r .cell-code}\nm <- 1000\nn <- 50\nX <- matrix(rnorm(m * n, mean = 10, sd = 3), nrow = m)\ngrp <- rep(1:2, each = n / 2)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# formula interface\nsystem.time(\n  for (i in 1:m) {\n    t.test(X[i, ] ~ grp)$statistic\n  }\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#>    user  system elapsed \n#>    0.28    0.00    0.33\n```\n\n\n:::\n\n```{.r .cell-code}\n# provide two vectors\nsystem.time(\n  for (i in 1:m) {\n    t.test(X[i, grp == 1], X[i, grp == 2])$statistic\n  }\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#>    user  system elapsed \n#>    0.08    0.00    0.08\n```\n\n\n:::\n:::\n\n\nAdd functionality to save values\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompT <- function(i){\n  t.test(X[i, grp == 1], X[i, grp == 2])$statistic\n}\nsystem.time(t1 <- purrr::map_dbl(1:m, compT))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#>    user  system elapsed \n#>    0.09    0.00    0.09\n```\n\n\n:::\n:::\n\n\nIf you look at the source code of `stats:::t.test.default()`, you’ll see that it does a lot more than just compute the t-statistic.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Do less work\nmy_t <- function(x, grp) {\n  t_stat <- function(x) {\n    m <- mean(x)\n    n <- length(x)\n    var <- sum((x - m) ^ 2) / (n - 1)\n    list(m = m, n = n, var = var)\n  }\n  g1 <- t_stat(x[grp == 1])\n  g2 <- t_stat(x[grp == 2])\n  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)\n  (g1$m - g2$m) / se_total\n}\nsystem.time(t2 <- purrr::map_dbl(1:m, ~ my_t(X[.,], grp)))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#>    user  system elapsed \n#>    0.02    0.00    0.01\n```\n\n\n:::\n\n```{.r .cell-code}\nstopifnot(all.equal(t1, t2))\n```\n:::\n\n\nThis gives us a six-fold speed improvement!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Vectorise it\nrowtstat <- function(X, grp){\n  t_stat <- function(X) {\n    m <- rowMeans(X)\n    n <- ncol(X)\n    var <- rowSums((X - m) ^ 2) / (n - 1)\n    list(m = m, n = n, var = var)\n  }\n  g1 <- t_stat(X[, grp == 1])\n  g2 <- t_stat(X[, grp == 2])\n  se_total <- sqrt(g1$var / g1$n + g2$var / g2$n)\n  (g1$m - g2$m) / se_total\n}\nsystem.time(t3 <- rowtstat(X, grp))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#>    user  system elapsed \n#>    0.02    0.00    0.02\n```\n\n\n:::\n\n```{.r .cell-code}\nstopifnot(all.equal(t1, t3))\n```\n:::\n\n\n1000 times faster than when we started!\n\n## Other techniques\n* [Read R blogs](http://www.r-bloggers.com/) to see what performance\n  problems other people have struggled with, and how they have made their\n  code faster.\n\n* Read other R programming books, like The Art of R Programming or Patrick Burns'\n  [_R Inferno_](http://www.burns-stat.com/documents/books/the-r-inferno/) to\n  learn about common traps.\n\n* Take an algorithms and data structure course to learn some\n  well known ways of tackling certain classes of problems. I have heard\n  good things about Princeton's\n  [Algorithms course](https://www.coursera.org/course/algs4partI) offered on\n  Coursera.\n  \n* Learn how to parallelise your code. Two places to start are\n  Parallel R and Parallel Computing for Data Science\n\n* Read general books about optimisation like Mature optimisation\n  or the Pragmatic Programmer\n  \n* Read more R code. StackOverflow, R Mailing List, DSLC, GitHub, etc.\n",
      6     "supporting": [
      7       "24_files"
      8     ],
      9     "filters": [
     10       "rmarkdown/pagebreak.lua"
     11     ],
     12     "includes": {},
     13     "engineDependencies": {},
     14     "preserve": {},
     15     "postProcess": true
     16   }
     17 }