bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git
Log | Files | Refs | README | LICENSE

25.Rmd (13565B)


      1 ---
      2 engine: knitr
      3 title: Rewriting R code in C++
      4 ---
      5 
      6 ## Learning objectives:
      7 
      8 -   Learn to improve performance by rewriting bottlenecks in C++
      9 
     10 -   Introduction to the [{Rcpp} package](https://www.rcpp.org/)
     11 
     12 ## Introduction
     13 
     14 In this chapter we'll learn how to rewrite **R** code in **C++** to make it faster using the **Rcpp package**. The **Rcpp**  package makes it simple to connect C++ to R! With C++ you can fix:
     15 
     16 -   Loops that can't be easily vectorised because subsequent iterations depend on previous ones.
     17 
     18 -   Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
     19 
     20 -   Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queue
     21 
     22 <center>Like how?</center>
     23 
     24 <center> </center>
     25 
     26 <center>![](https://media.giphy.com/media/vLyZk5CJo12Wk/giphy.gif)</center>
     27 
     28  
     29 
     30 ## Getting started with C++
     31 
     32 ```{r warning=FALSE}
     33 library(Rcpp)
     34 ```
     35 
     36 Install a C++ compiler:
     37 
     38 -   Rtools, on Windows
     39 -   Xcode, on Mac
     40 -   Sudo apt-get install r-base-dev or similar, on Linux.
     41 
     42 
     43 ### First example {-}
     44 
     45 Rcpp compiling the C++ code:
     46 
     47 ```{r}
     48 cppFunction('int add(int x, int y, int z) {
     49   int sum = x + y + z;
     50   return sum;
     51 }')
     52 # add works like a regular R function
     53 add
     54 
     55 add(1, 2, 3)
     56 ```
     57 
     58 Some things to note:
     59 
     60 
     61 -   The syntax to create a function is different.
     62 -   Types of inputs and outputs must be explicitly declared
     63 -   Use = for assignment, not `<-`.
     64 -   Every statement is terminated by a ;
     65 -   C++ has it's own name for the types we are used to:
     66     -   scalar types are `int`, `double`, `bool` and `String`
     67     -   vector types (for Rcpp) are `IntegerVector`, `NumericVector`, `LogicalVector` and `CharacterVector`
     68     -   Other R types are available in C++: `List`, `Function`, `DataFrame`, and more.
     69     
     70 -   Explicitly use a `return` statement to return a value from a function.
     71 
     72  
     73 
     74 ## Example with scalar input and output {-}
     75 
     76 ```{r}
     77 signR <- function(x) {
     78   if (x > 0) {
     79     1
     80   } else if (x == 0) {
     81     0
     82   } else {
     83     -1
     84   }
     85 }
     86 
     87 a <- -0.5
     88 b <- 0.5
     89 c <- 0
     90 signR(c)
     91 ```
     92 
     93 Translation:
     94 
     95 ```{r}
     96 cppFunction('int signC(int x) {
     97   if (x > 0) {
     98     return 1;
     99   } else if (x == 0) {
    100     return 0;
    101   } else {
    102     return -1;
    103   }
    104 }')
    105 ```
    106 
    107 * Note that the `if` syntax is identical! Not everything is different!
    108 
    109 ## Vector Input, Scalar output:{-}
    110 
    111 ```{r}
    112 sumR <- function(x) {
    113   total <- 0
    114   for (i in seq_along(x)) {
    115     total <- total + x[i]
    116   }
    117   total
    118 }
    119 
    120 x<- runif(100)
    121 sumR(x)
    122 ```
    123 
    124 Translation:
    125 
    126 ```{r}
    127 cppFunction('double sumC(NumericVector x) {
    128   int n = x.size();
    129   double total = 0;
    130   for(int i = 0; i < n; ++i) {
    131     total += x[i];
    132   }
    133   return total;
    134 }')
    135 ```
    136 
    137 Some observations:
    138 
    139 -   vector indices *start at 0*
    140 -   The for statement has a different syntax: for(init; check; increment)
    141 -   Methods are called with `.`
    142 -   `total += x[i]` is equivalent to `total = total + x[i]`.
    143 -   other in-place operators are `-=`, `*=`, `and /=`
    144 
    145 
    146 To check for the fastest way we can use:
    147 
    148 ```{r eval=FALSE}
    149 ?bench::mark
    150 ```
    151 
    152 ```{r}
    153 x <- runif(1e3)
    154 bench::mark(
    155   sum(x),
    156   sumC(x),
    157   sumR(x)
    158 )
    159 ```
    160 
    161 ## Vector input and output {-}
    162 
    163 ```{r}
    164 pdistR <- function(x, ys) {
    165   sqrt((x - ys) ^ 2)
    166 }
    167 ```
    168 
    169 ```{r}
    170 cppFunction('NumericVector pdistC(double x, NumericVector ys) {
    171   int n = ys.size();
    172   NumericVector out(n);
    173 
    174   for(int i = 0; i < n; ++i) {
    175     out[i] = sqrt(pow(ys[i] - x, 2.0));
    176   }
    177   return out;
    178 }')
    179 ```
    180 
    181 Note:   uses `pow()`, not `^`, for exponentiation
    182 
    183 ```{r}
    184 y <- runif(1e6)
    185 bench::mark(
    186   pdistR(0.5, y),
    187   pdistC(0.5, y)
    188 )[1:6]
    189 ```
    190 
    191 ## Source your C++ code {-}
    192 
    193 Source stand-alone C++ files into R using `sourceCpp()`
    194 
    195 
    196 C++ files have extension `.cpp`
    197 
    198 ```
    199 #include <Rcpp.h>
    200 using namespace Rcpp;
    201 ```
    202 
    203 And for each function that you want available within R, you need to prefix it with:
    204 
    205 ```
    206 // [[Rcpp::export]]
    207 ```
    208 
    209 Inside a cpp file you can include `R` code using special comments
    210 
    211 ```
    212 /*** R
    213 rcode here
    214 */
    215 ```
    216 
    217 
    218 
    219 ### Example {-}
    220 
    221 This block in Rmarkdown uses `{Rcpp}` as a short hand for  engine = "Rcpp". 
    222 
    223 ```{Rcpp}
    224 #include <Rcpp.h>
    225 using namespace Rcpp;
    226 
    227 // [[Rcpp::export]]
    228 double meanC(NumericVector x) {
    229   int n = x.size();
    230   double total = 0;
    231 
    232   for(int i = 0; i < n; ++i) {
    233     total += x[i];
    234   }
    235   return total / n;
    236 }
    237 
    238 /*** R
    239 x <- runif(1e5)
    240 bench::mark(
    241   mean(x),
    242   meanC(x)
    243 )
    244 */
    245 ```
    246 
    247 NOTE: For some reason although the r code above runs, `knit` doesn't include the output. Why?
    248 
    249 ```{r}
    250 x <- runif(1e5)
    251 bench::mark(
    252   mean(x),
    253   meanC(x)
    254 )
    255 ```
    256 
    257 
    258 
    259 ## Data frames, functions, and attributes
    260 
    261 ### Lists and Dataframes {-}
    262 
    263 Contrived example to illustrate how to access a dataframe from c++:
    264 
    265 ```{Rcpp}
    266 #include <Rcpp.h>
    267 using namespace Rcpp;
    268 
    269 // [[Rcpp::export]]
    270 double mpe(List mod) {
    271   if (!mod.inherits("lm")) stop("Input must be a linear model");
    272 
    273   NumericVector resid = as<NumericVector>(mod["residuals"]);
    274   NumericVector fitted = as<NumericVector>(mod["fitted.values"]);
    275 
    276   int n = resid.size();
    277   double err = 0;
    278   for(int i = 0; i < n; ++i) {
    279     err += resid[i] / (fitted[i] + resid[i]);
    280   }
    281   return err / n;
    282 }
    283 ```
    284 
    285 ```{r}
    286 mod <- lm(mpg ~ wt, data = mtcars)
    287 mpe(mod)
    288 ```
    289 
    290 - Note that you must *cast* the values to the required type. C++ needs to know the types in advance.
    291 
    292 ### Functions {-}
    293 
    294 ```{Rcpp}
    295 #include <Rcpp.h>
    296 using namespace Rcpp;
    297 
    298 // [[Rcpp::export]]
    299 RObject callWithOne(Function f) {
    300   return f(1);
    301 }
    302 ```
    303 
    304 
    305 ```{r}
    306 callWithOne(function(x) x + 1)
    307 ```
    308 
    309 
    310 * Other values can be accessed from c++ including
    311 
    312    * attributes (use: `.attr()`. Also `.names()` is alias for name attribute.
    313    * `Environment`, `DottedPair`, `Language`, `Symbol` , etc. 
    314 
    315 ## Missing values
    316 
    317 ### Missing values behave differently for C++ scalers{-}
    318 
    319 * Scalar NA's in Cpp : `NA_LOGICAL`, `NA_INTEGER`, `NA_REAL`, `NA_STRING`.
    320 
    321 * Integers (`int`) stores R NA's as the smallest integer. Better to use length 1 `IntegerVector`
    322 * Doubles use IEEE 754 NaN , which behaves a bit differently for logical expressions (but ok for math expressions). 
    323 
    324 ```{r}
    325 evalCpp("NA_REAL || FALSE")
    326 ```
    327 
    328 * Strings are a class from Rcpp, so they handle missing values fine.
    329 
    330 * `bool` can only hold two values, so be careful. Consider using vectors of length 1 or coercing to `int`
    331 
    332 
    333 ### Vectors
    334 
    335 * Vectors are all type introduced by RCpp and know how to handle missing values if you use the specific type for that vector.
    336 
    337 ```{Rcpp}
    338 #include <Rcpp.h>
    339 using namespace Rcpp;
    340 
    341 // [[Rcpp::export]]
    342 List missing_sampler() {
    343   return List::create(
    344     NumericVector::create(NA_REAL),
    345     IntegerVector::create(NA_INTEGER),
    346     LogicalVector::create(NA_LOGICAL),
    347     CharacterVector::create(NA_STRING)
    348   );
    349 }
    350 ```
    351 
    352 ```{r}
    353 str(missing_sampler())
    354 ```
    355 
    356 ## Standard Template Library
    357 
    358 STL provides powerful data structures and algorithms for C++.  
    359 
    360 ### Iterators {-}
    361 
    362 Iterators are used extensively in the STL to abstract away details of underlying data structures.
    363 
    364 If you an iterator `it`, you can:
    365 
    366 - Get the value by 'dereferencing' with `*it`
    367 - Advance to the next value with `++it`
    368 - Compare iterators (locations) with `==`
    369 
    370 
    371 ### Algorithms {-}
    372 
    373 * The real power of iterators comes from using them with STL algorithms. 
    374  
    375 * A good reference is [https://en.cppreference.com/w/cpp/algorithm]
    376 
    377 * Book provides examples using `accumulate` and `upper_buond`
    378 
    379 * Another Example:
    380 
    381 ```{Rcpp}
    382 
    383 #include <algorithm>
    384 #include <Rcpp.h>
    385 
    386 using namespace Rcpp;
    387  
    388  
    389 // Explicit iterator version
    390  
    391 // [[Rcpp::export]]
    392 NumericVector square_C_it(NumericVector x){
    393   NumericVector out(x.size());
    394   // Each container has its own iterator type
    395   NumericVector::iterator in_it;
    396   NumericVector::iterator out_it;
    397   
    398   for(in_it = x.begin(), out_it = out.begin(); in_it != x.end();  ++in_it, ++out_it) {
    399     *out_it = pow(*in_it,2);
    400   }
    401   
    402   return out;
    403   
    404 }
    405  
    406  
    407 // Use algorithm 'transform'
    408   
    409 // [[Rcpp::export]]
    410 NumericVector square_C(NumericVector x) {
    411  
    412   NumericVector out(x.size());
    413  
    414  
    415   std::transform(x.begin(),x.end(), out.begin(),
    416             [](double v) -> double { return v*v; });
    417   return out;
    418 }
    419 ```
    420 
    421 ```{r}
    422 square_C(c(1.0,2.0,3.0))
    423 ```
    424 ```{r}
    425 square_C_it(c(1.0,2.0,3.0))
    426 ```
    427 
    428 ## Data Structures {-}
    429 
    430 STL provides a large set of data structures. Some of the most important:
    431 
    432 * `std::vector` - like an `R` vector, except knows how to grow efficiently
    433 
    434 * `std::unordered_set` - unique set of values. Ordered version `std::set`. Unordered is more efficient.
    435 
    436 * `std::map` - Moslty similar to `R` lists, provide an association between a key and a value. There is also an unordered version. 
    437 
    438 A quick example illustrating the `map`:
    439 
    440 ```{Rcpp}
    441 #include <Rcpp.h>
    442 using namespace Rcpp;
    443 
    444 // [[Rcpp::export]]
    445 std::map<double, int> tableC(NumericVector x) {
    446   // Note the types are <key, value>
    447   std::map<double, int> counts;
    448 
    449   int n = x.size();
    450   for (int i = 0; i < n; i++) {
    451     counts[x[i]]++;
    452   }
    453 
    454   return counts;
    455 }
    456 ```
    457 
    458 
    459 ```{r}
    460 res = tableC(c(1,1,2,1,4,5))
    461 res
    462 ```
    463 
    464 * Note that the map is converted to a named vector in this case on return
    465 
    466  
    467 To learn more about the STL data structures see [containers](https://en.cppreference.com/w/cpp/container) at `cppreference`
    468 
    469 ## Case Studies
    470 
    471 ![Case Study](images/case_study.jpg)
    472 
    473 Real life uses of C++ to replace slow R code.
    474 
    475 
    476 ## Case study 1: Gibbs sampler {-}
    477 
    478 The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. 
    479 
    480 [Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster.
    481 
    482 > "Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime.
    483 
    484 
    485 <center>![](https://media.giphy.com/media/13GIgrGdslD9oQ/giphy.gif)</center>
    486 
    487 
    488 R code:
    489 
    490 ```{r}
    491 gibbs_r <- function(N, thin) {
    492   mat <- matrix(nrow = N, ncol = 2)
    493   x <- y <- 0
    494 
    495   for (i in 1:N) {
    496     for (j in 1:thin) {
    497       x <- rgamma(1, 3, y * y + 4)
    498       y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    499     }
    500     mat[i, ] <- c(x, y)
    501   }
    502   mat
    503 }
    504 ```
    505 
    506 Actions to convert R to C++: 
    507 
    508 - Add type declarations to all variables 
    509 - Use `(` instead of `[` to index into the matrix 
    510 - Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar.
    511 
    512 ```{Rcpp}
    513 #include <Rcpp.h>
    514 using namespace Rcpp;
    515 
    516 // [[Rcpp::export]]
    517 NumericMatrix gibbs_cpp(int N, int thin) {
    518   NumericMatrix mat(N, 2);
    519   double x = 0, y = 0;
    520 
    521   for(int i = 0; i < N; i++) {
    522     for(int j = 0; j < thin; j++) {
    523       x = rgamma(1, 3, 1 / (y * y + 4))[0];
    524       y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
    525     }
    526     mat(i, 0) = x;
    527     mat(i, 1) = y;
    528   }
    529 
    530   return(mat);
    531 }
    532 ```
    533 
    534 Checking who's best:
    535 
    536 ```{r}
    537 bench::mark(
    538   gibbs_r(100, 10),
    539   gibbs_cpp(100, 10),
    540   check = FALSE
    541 )
    542 ```
    543 
    544 ## Case study 2: predict a model response from three inputs {-}
    545 
    546 [Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP.
    547 
    548 Starts with this code:
    549 
    550 ```{r}
    551 vacc1a <- function(age, female, ily) {
    552   p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
    553   p <- p * if (female) 1.25 else 0.75
    554   p <- max(0, p)
    555   p <- min(1, p)
    556   p
    557 }
    558 ```
    559 
    560 R code with a for loop:
    561 
    562 ```{r}
    563 vacc1 <- function(age, female, ily) {
    564   n <- length(age)
    565   out <- numeric(n)
    566   for (i in seq_len(n)) {
    567     out[i] <- vacc1a(age[i], female[i], ily[i])
    568   }
    569   out
    570 }
    571 ```
    572 
    573 Vectorized R code:
    574 
    575 ```{r}
    576 vacc2 <- function(age, female, ily) {
    577   p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
    578   p <- p * ifelse(female, 1.25, 0.75)
    579   p <- pmax(0, p)
    580   p <- pmin(1, p)
    581   p
    582 }
    583 ```
    584 
    585 C++:
    586 
    587 ```{Rcpp}
    588 #include <Rcpp.h>
    589 using namespace Rcpp;
    590 
    591 double vacc3a(double age, bool female, bool ily){
    592   double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily;
    593   p = p * (female ? 1.25 : 0.75);
    594   p = std::max(p, 0.0);
    595   p = std::min(p, 1.0);
    596   return p;
    597 }
    598 
    599 // [[Rcpp::export]]
    600 NumericVector vacc3(NumericVector age, LogicalVector female, 
    601                     LogicalVector ily) {
    602   int n = age.size();
    603   NumericVector out(n);
    604 
    605   for(int i = 0; i < n; ++i) {
    606     out[i] = vacc3a(age[i], female[i], ily[i]);
    607   }
    608 
    609   return out;
    610 }
    611 ```
    612 
    613 Sample data:
    614 
    615 ```{r}
    616 n <- 1000
    617 age <- rnorm(n, mean = 50, sd = 10)
    618 female <- sample(c(T, F), n, rep = TRUE)
    619 ily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE)
    620 
    621 stopifnot(
    622   all.equal(vacc1(age, female, ily), vacc2(age, female, ily)),
    623   all.equal(vacc1(age, female, ily), vacc3(age, female, ily))
    624 )
    625 ```
    626 
    627 <center>**Who's faster?**</center>
    628 <center>![](https://media.giphy.com/media/l41JGlWa1xOjJSsV2/giphy.gif)</center>
    629 
    630 ```{r}
    631 bench::mark(
    632   vacc1 = vacc1(age, female, ily),
    633   vacc2 = vacc2(age, female, ily),
    634   vacc3 = vacc3(age, female, ily)
    635 )
    636 ```
    637 
    638 ## Resources
    639 
    640 -   [Rcpp: Seamless R and C++ Integration](https:\\Rcpp.org)
    641 -   [cpp-tutorial](https://www.learncpp.com) is often recommended. Lots of ads though!
    642 -   [cpp-reference](https://en.cppreference.com/w/cpp)
    643 -   [C++20 for Programmers](https://www.pearson.com/en-us/subject-catalog/p/c20-for-programmers-an-objects-natural-approach/P200000000211/9780137570461) is a newer book that covers modern c++ for people who know programming in another language.
    644  
    645 ## Op Success!
    646 
    647 ![Congrats!](images/we-did-it-celebration-meme.jpg)