html.json - bookclub-advr - DSLC Advanced R Book Club

html.json (18624B)
      1 {
      2   "hash": "842c1c5354574cc08224d4899d85eeba",
      3   "result": {
      4     "engine": "knitr",
      5     "markdown": "---\nengine: knitr\ntitle: Rewriting R code in C++\n---\n\n## Learning objectives:\n\n-   Learn to improve performance by rewriting bottlenecks in C++\n\n-   Introduction to the [{Rcpp} package](https://www.rcpp.org/)\n\n## Introduction\n\nIn this chapter we'll learn how to rewrite **R** code in **C++** to make it faster using the **Rcpp package**. The **Rcpp**  package makes it simple to connect C++ to R! With C++ you can fix:\n\n-   Loops that can't be easily vectorised because subsequent iterations depend on previous ones.\n\n-   Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.\n\n-   Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queue\n\n<center>Like how?</center>\n\n<center> </center>\n\n<center>![](https://media.giphy.com/media/vLyZk5CJo12Wk/giphy.gif)</center>\n\n \n\n## Getting started with C++\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(Rcpp)\n```\n:::\n\n\nInstall a C++ compiler:\n\n-   Rtools, on Windows\n-   Xcode, on Mac\n-   Sudo apt-get install r-base-dev or similar, on Linux.\n\n\n### First example {-}\n\nRcpp compiling the C++ code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncppFunction('int add(int x, int y, int z) {\n  int sum = x + y + z;\n  return sum;\n}')\n# add works like a regular R function\nadd\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> function (x, y, z) \n#> .Call(<pointer: 0x00007ff9d7e315f0>, x, y, z)\n```\n\n\n:::\n\n```{.r .cell-code}\nadd(1, 2, 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 6\n```\n\n\n:::\n:::\n\n\nSome things to note:\n\n\n-   The syntax to create a function is different.\n-   Types of inputs and outputs must be explicitly declared\n-   Use = for assignment, not `<-`.\n-   Every statement is terminated by a ;\n-   C++ has it's own name for the types we are used to:\n    -   scalar types are `int`, `double`, `bool` and `String`\n    -   vector types (for Rcpp) are `IntegerVector`, `NumericVector`, `LogicalVector` and `CharacterVector`\n    -   Other R types are available in C++: `List`, `Function`, `DataFrame`, and more.\n    \n-   Explicitly use a `return` statement to return a value from a function.\n\n \n\n## Example with scalar input and output {-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsignR <- function(x) {\n  if (x > 0) {\n    1\n  } else if (x == 0) {\n    0\n  } else {\n    -1\n  }\n}\n\na <- -0.5\nb <- 0.5\nc <- 0\nsignR(c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 0\n```\n\n\n:::\n:::\n\n\nTranslation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncppFunction('int signC(int x) {\n  if (x > 0) {\n    return 1;\n  } else if (x == 0) {\n    return 0;\n  } else {\n    return -1;\n  }\n}')\n```\n:::\n\n\n* Note that the `if` syntax is identical! Not everything is different!\n\n## Vector Input, Scalar output:{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsumR <- function(x) {\n  total <- 0\n  for (i in seq_along(x)) {\n    total <- total + x[i]\n  }\n  total\n}\n\nx<- runif(100)\nsumR(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 51.16287\n```\n\n\n:::\n:::\n\n\nTranslation:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncppFunction('double sumC(NumericVector x) {\n  int n = x.size();\n  double total = 0;\n  for(int i = 0; i < n; ++i) {\n    total += x[i];\n  }\n  return total;\n}')\n```\n:::\n\n\nSome observations:\n\n-   vector indices *start at 0*\n-   The for statement has a different syntax: for(init; check; increment)\n-   Methods are called with `.`\n-   `total += x[i]` is equivalent to `total = total + x[i]`.\n-   other in-place operators are `-=`, `*=`, `and /=`\n\n\nTo check for the fastest way we can use:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?bench::mark\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e3)\nbench::mark(\n  sum(x),\n  sumC(x),\n  sumR(x)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 3 × 6\n#>   expression      min   median `itr/sec` mem_alloc `gc/sec`\n#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>\n#> 1 sum(x)        700ns    1.6µs   611939.        0B        0\n#> 2 sumC(x)       900ns    1.7µs   469982.        0B        0\n#> 3 sumR(x)      14.2µs   27.4µs    33865.        0B        0\n```\n\n\n:::\n:::\n\n\n## Vector input and output {-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\npdistR <- function(x, ys) {\n  sqrt((x - ys) ^ 2)\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncppFunction('NumericVector pdistC(double x, NumericVector ys) {\n  int n = ys.size();\n  NumericVector out(n);\n\n  for(int i = 0; i < n; ++i) {\n    out[i] = sqrt(pow(ys[i] - x, 2.0));\n  }\n  return out;\n}')\n```\n:::\n\n\nNote:   uses `pow()`, not `^`, for exponentiation\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- runif(1e6)\nbench::mark(\n  pdistR(0.5, y),\n  pdistC(0.5, y)\n)[1:6]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#>   expression          min   median `itr/sec` mem_alloc `gc/sec`\n#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>\n#> 1 pdistR(0.5, y)   4.37ms   8.76ms      119.    7.63MB     63.2\n#> 2 pdistC(0.5, y)   2.26ms    5.7ms      182.    7.63MB     87.5\n```\n\n\n:::\n:::\n\n\n## Source your C++ code {-}\n\nSource stand-alone C++ files into R using `sourceCpp()`\n\n\nC++ files have extension `.cpp`\n\n```\n#include <Rcpp.h>\nusing namespace Rcpp;\n```\n\nAnd for each function that you want available within R, you need to prefix it with:\n\n```\n// [[Rcpp::export]]\n```\n\nInside a cpp file you can include `R` code using special comments\n\n```\n/*** R\nrcode here\n*/\n```\n\n\n\n### Example {-}\n\nThis block in Rmarkdown uses `{Rcpp}` as a short hand for  engine = \"Rcpp\". \n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\ndouble meanC(NumericVector x) {\n  int n = x.size();\n  double total = 0;\n\n  for(int i = 0; i < n; ++i) {\n    total += x[i];\n  }\n  return total / n;\n}\n\n/*** R\nx <- runif(1e5)\nbench::mark(\n  mean(x),\n  meanC(x)\n)\n*/\n```\n:::\n\n\nNOTE: For some reason although the r code above runs, `knit` doesn't include the output. Why?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e5)\nbench::mark(\n  mean(x),\n  meanC(x)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#>   expression      min   median `itr/sec` mem_alloc `gc/sec`\n#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>\n#> 1 mean(x)     122.8µs    234µs     4084.        0B     2.02\n#> 2 meanC(x)     40.4µs    114µs     9270.        0B     0\n```\n\n\n:::\n:::\n\n\n\n\n## Data frames, functions, and attributes\n\n### Lists and Dataframes {-}\n\nContrived example to illustrate how to access a dataframe from c++:\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\ndouble mpe(List mod) {\n  if (!mod.inherits(\"lm\")) stop(\"Input must be a linear model\");\n\n  NumericVector resid = as<NumericVector>(mod[\"residuals\"]);\n  NumericVector fitted = as<NumericVector>(mod[\"fitted.values\"]);\n\n  int n = resid.size();\n  double err = 0;\n  for(int i = 0; i < n; ++i) {\n    err += resid[i] / (fitted[i] + resid[i]);\n  }\n  return err / n;\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmod <- lm(mpg ~ wt, data = mtcars)\nmpe(mod)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] -0.01541615\n```\n\n\n:::\n:::\n\n\n- Note that you must *cast* the values to the required type. C++ needs to know the types in advance.\n\n### Functions {-}\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\nRObject callWithOne(Function f) {\n  return f(1);\n}\n```\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncallWithOne(function(x) x + 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2\n```\n\n\n:::\n:::\n\n\n\n* Other values can be accessed from c++ including\n\n   * attributes (use: `.attr()`. Also `.names()` is alias for name attribute.\n   * `Environment`, `DottedPair`, `Language`, `Symbol` , etc. \n\n## Missing values\n\n### Missing values behave differently for C++ scalers{-}\n\n* Scalar NA's in Cpp : `NA_LOGICAL`, `NA_INTEGER`, `NA_REAL`, `NA_STRING`.\n\n* Integers (`int`) stores R NA's as the smallest integer. Better to use length 1 `IntegerVector`\n* Doubles use IEEE 754 NaN , which behaves a bit differently for logical expressions (but ok for math expressions). \n\n\n::: {.cell}\n\n```{.r .cell-code}\nevalCpp(\"NA_REAL || FALSE\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] TRUE\n```\n\n\n:::\n:::\n\n\n* Strings are a class from Rcpp, so they handle missing values fine.\n\n* `bool` can only hold two values, so be careful. Consider using vectors of length 1 or coercing to `int`\n\n\n### Vectors\n\n* Vectors are all type introduced by RCpp and know how to handle missing values if you use the specific type for that vector.\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\nList missing_sampler() {\n  return List::create(\n    NumericVector::create(NA_REAL),\n    IntegerVector::create(NA_INTEGER),\n    LogicalVector::create(NA_LOGICAL),\n    CharacterVector::create(NA_STRING)\n  );\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(missing_sampler())\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> List of 4\n#>  $ : num NA\n#>  $ : int NA\n#>  $ : logi NA\n#>  $ : chr NA\n```\n\n\n:::\n:::\n\n\n## Standard Template Library\n\nSTL provides powerful data structures and algorithms for C++.  \n\n### Iterators {-}\n\nIterators are used extensively in the STL to abstract away details of underlying data structures.\n\nIf you an iterator `it`, you can:\n\n- Get the value by 'dereferencing' with `*it`\n- Advance to the next value with `++it`\n- Compare iterators (locations) with `==`\n\n\n### Algorithms {-}\n\n* The real power of iterators comes from using them with STL algorithms. \n \n* A good reference is [https://en.cppreference.com/w/cpp/algorithm]\n\n* Book provides examples using `accumulate` and `upper_buond`\n\n* Another Example:\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n\n#include <algorithm>\n#include <Rcpp.h>\n\nusing namespace Rcpp;\n \n \n// Explicit iterator version\n \n// [[Rcpp::export]]\nNumericVector square_C_it(NumericVector x){\n  NumericVector out(x.size());\n  // Each container has its own iterator type\n  NumericVector::iterator in_it;\n  NumericVector::iterator out_it;\n  \n  for(in_it = x.begin(), out_it = out.begin(); in_it != x.end();  ++in_it, ++out_it) {\n    *out_it = pow(*in_it,2);\n  }\n  \n  return out;\n  \n}\n \n \n// Use algorithm 'transform'\n  \n// [[Rcpp::export]]\nNumericVector square_C(NumericVector x) {\n \n  NumericVector out(x.size());\n \n \n  std::transform(x.begin(),x.end(), out.begin(),\n            [](double v) -> double { return v*v; });\n  return out;\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsquare_C(c(1.0,2.0,3.0))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 4 9\n```\n\n\n:::\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsquare_C_it(c(1.0,2.0,3.0))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 4 9\n```\n\n\n:::\n:::\n\n\n## Data Structures {-}\n\nSTL provides a large set of data structures. Some of the most important:\n\n* `std::vector` - like an `R` vector, except knows how to grow efficiently\n\n* `std::unordered_set` - unique set of values. Ordered version `std::set`. Unordered is more efficient.\n\n* `std::map` - Moslty similar to `R` lists, provide an association between a key and a value. There is also an unordered version. \n\nA quick example illustrating the `map`:\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\nstd::map<double, int> tableC(NumericVector x) {\n  // Note the types are <key, value>\n  std::map<double, int> counts;\n\n  int n = x.size();\n  for (int i = 0; i < n; i++) {\n    counts[x[i]]++;\n  }\n\n  return counts;\n}\n```\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres = tableC(c(1,1,2,1,4,5))\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 1 2 4 5 \n#> 3 1 1 1\n```\n\n\n:::\n:::\n\n\n* Note that the map is converted to a named vector in this case on return\n\n \nTo learn more about the STL data structures see [containers](https://en.cppreference.com/w/cpp/container) at `cppreference`\n\n## Case Studies\n\n![Case Study](images/case_study.jpg)\n\nReal life uses of C++ to replace slow R code.\n\n\n## Case study 1: Gibbs sampler {-}\n\nThe [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. \n\n[Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster.\n\n> \"Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime.\n\n\n<center>![](https://media.giphy.com/media/13GIgrGdslD9oQ/giphy.gif)</center>\n\n\nR code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngibbs_r <- function(N, thin) {\n  mat <- matrix(nrow = N, ncol = 2)\n  x <- y <- 0\n\n  for (i in 1:N) {\n    for (j in 1:thin) {\n      x <- rgamma(1, 3, y * y + 4)\n      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))\n    }\n    mat[i, ] <- c(x, y)\n  }\n  mat\n}\n```\n:::\n\n\nActions to convert R to C++: \n\n- Add type declarations to all variables \n- Use `(` instead of `[` to index into the matrix \n- Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar.\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\n// [[Rcpp::export]]\nNumericMatrix gibbs_cpp(int N, int thin) {\n  NumericMatrix mat(N, 2);\n  double x = 0, y = 0;\n\n  for(int i = 0; i < N; i++) {\n    for(int j = 0; j < thin; j++) {\n      x = rgamma(1, 3, 1 / (y * y + 4))[0];\n      y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];\n    }\n    mat(i, 0) = x;\n    mat(i, 1) = y;\n  }\n\n  return(mat);\n}\n```\n:::\n\n\nChecking who's best:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n  gibbs_r(100, 10),\n  gibbs_cpp(100, 10),\n  check = FALSE\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#>   expression              min   median `itr/sec` mem_alloc `gc/sec`\n#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>\n#> 1 gibbs_r(100, 10)     1.31ms   3.83ms      283.  103.22KB     18.0\n#> 2 gibbs_cpp(100, 10)  150.1µs  416.3µs     2456.    1.61KB     16.9\n```\n\n\n:::\n:::\n\n\n## Case study 2: predict a model response from three inputs {-}\n\n[Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP.\n\nStarts with this code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvacc1a <- function(age, female, ily) {\n  p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily\n  p <- p * if (female) 1.25 else 0.75\n  p <- max(0, p)\n  p <- min(1, p)\n  p\n}\n```\n:::\n\n\nR code with a for loop:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvacc1 <- function(age, female, ily) {\n  n <- length(age)\n  out <- numeric(n)\n  for (i in seq_len(n)) {\n    out[i] <- vacc1a(age[i], female[i], ily[i])\n  }\n  out\n}\n```\n:::\n\n\nVectorized R code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvacc2 <- function(age, female, ily) {\n  p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily\n  p <- p * ifelse(female, 1.25, 0.75)\n  p <- pmax(0, p)\n  p <- pmin(1, p)\n  p\n}\n```\n:::\n\n\nC++:\n\n\n::: {.cell}\n\n```{.cpp .cell-code}\n#include <Rcpp.h>\nusing namespace Rcpp;\n\ndouble vacc3a(double age, bool female, bool ily){\n  double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily;\n  p = p * (female ? 1.25 : 0.75);\n  p = std::max(p, 0.0);\n  p = std::min(p, 1.0);\n  return p;\n}\n\n// [[Rcpp::export]]\nNumericVector vacc3(NumericVector age, LogicalVector female, \n                    LogicalVector ily) {\n  int n = age.size();\n  NumericVector out(n);\n\n  for(int i = 0; i < n; ++i) {\n    out[i] = vacc3a(age[i], female[i], ily[i]);\n  }\n\n  return out;\n}\n```\n:::\n\n\nSample data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 1000\nage <- rnorm(n, mean = 50, sd = 10)\nfemale <- sample(c(T, F), n, rep = TRUE)\nily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE)\n\nstopifnot(\n  all.equal(vacc1(age, female, ily), vacc2(age, female, ily)),\n  all.equal(vacc1(age, female, ily), vacc3(age, female, ily))\n)\n```\n:::\n\n\n<center>**Who's faster?**</center>\n<center>![](https://media.giphy.com/media/l41JGlWa1xOjJSsV2/giphy.gif)</center>\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n  vacc1 = vacc1(age, female, ily),\n  vacc2 = vacc2(age, female, ily),\n  vacc3 = vacc3(age, female, ily)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 3 × 6\n#>   expression      min   median `itr/sec` mem_alloc `gc/sec`\n#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>\n#> 1 vacc1       750.5µs   1.88ms      546.    7.86KB    28.8 \n#> 2 vacc2        48.4µs   72.1µs    10005.  146.68KB    16.9 \n#> 3 vacc3        30.8µs   41.5µs    15519.   11.98KB     4.06\n```\n\n\n:::\n:::\n\n\n## Resources\n\n-   [Rcpp: Seamless R and C++ Integration](https:\\\\Rcpp.org)\n-   [cpp-tutorial](https://www.learncpp.com) is often recommended. Lots of ads though!\n-   [cpp-reference](https://en.cppreference.com/w/cpp)\n-   [C++20 for Programmers](https://www.pearson.com/en-us/subject-catalog/p/c20-for-programmers-an-objects-natural-approach/P200000000211/9780137570461) is a newer book that covers modern c++ for people who know programming in another language.\n \n## Op Success!\n\n![Congrats!](images/we-did-it-celebration-meme.jpg)\n",
      6     "supporting": [
      7       "25_files"
      8     ],
      9     "filters": [
     10       "rmarkdown/pagebreak.lua"
     11     ],
     12     "includes": {},
     13     "engineDependencies": {},
     14     "preserve": {},
     15     "postProcess": true
     16   }
     17 }
	bookclub-advr DSLC Advanced R Book Club
	git clone https://git.eamoncaddigan.net/bookclub-advr.git
	Log \| Files \| Refs \| README \| LICENSE