25.Rmd (13565B)
1 --- 2 engine: knitr 3 title: Rewriting R code in C++ 4 --- 5 6 ## Learning objectives: 7 8 - Learn to improve performance by rewriting bottlenecks in C++ 9 10 - Introduction to the [{Rcpp} package](https://www.rcpp.org/) 11 12 ## Introduction 13 14 In this chapter we'll learn how to rewrite **R** code in **C++** to make it faster using the **Rcpp package**. The **Rcpp** package makes it simple to connect C++ to R! With C++ you can fix: 15 16 - Loops that can't be easily vectorised because subsequent iterations depend on previous ones. 17 18 - Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R. 19 20 - Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queue 21 22 <center>Like how?</center> 23 24 <center> </center> 25 26 <center></center> 27 28 29 30 ## Getting started with C++ 31 32 ```{r warning=FALSE} 33 library(Rcpp) 34 ``` 35 36 Install a C++ compiler: 37 38 - Rtools, on Windows 39 - Xcode, on Mac 40 - Sudo apt-get install r-base-dev or similar, on Linux. 41 42 43 ### First example {-} 44 45 Rcpp compiling the C++ code: 46 47 ```{r} 48 cppFunction('int add(int x, int y, int z) { 49 int sum = x + y + z; 50 return sum; 51 }') 52 # add works like a regular R function 53 add 54 55 add(1, 2, 3) 56 ``` 57 58 Some things to note: 59 60 61 - The syntax to create a function is different. 62 - Types of inputs and outputs must be explicitly declared 63 - Use = for assignment, not `<-`. 64 - Every statement is terminated by a ; 65 - C++ has it's own name for the types we are used to: 66 - scalar types are `int`, `double`, `bool` and `String` 67 - vector types (for Rcpp) are `IntegerVector`, `NumericVector`, `LogicalVector` and `CharacterVector` 68 - Other R types are available in C++: `List`, `Function`, `DataFrame`, and more. 69 70 - Explicitly use a `return` statement to return a value from a function. 71 72 73 74 ## Example with scalar input and output {-} 75 76 ```{r} 77 signR <- function(x) { 78 if (x > 0) { 79 1 80 } else if (x == 0) { 81 0 82 } else { 83 -1 84 } 85 } 86 87 a <- -0.5 88 b <- 0.5 89 c <- 0 90 signR(c) 91 ``` 92 93 Translation: 94 95 ```{r} 96 cppFunction('int signC(int x) { 97 if (x > 0) { 98 return 1; 99 } else if (x == 0) { 100 return 0; 101 } else { 102 return -1; 103 } 104 }') 105 ``` 106 107 * Note that the `if` syntax is identical! Not everything is different! 108 109 ## Vector Input, Scalar output:{-} 110 111 ```{r} 112 sumR <- function(x) { 113 total <- 0 114 for (i in seq_along(x)) { 115 total <- total + x[i] 116 } 117 total 118 } 119 120 x<- runif(100) 121 sumR(x) 122 ``` 123 124 Translation: 125 126 ```{r} 127 cppFunction('double sumC(NumericVector x) { 128 int n = x.size(); 129 double total = 0; 130 for(int i = 0; i < n; ++i) { 131 total += x[i]; 132 } 133 return total; 134 }') 135 ``` 136 137 Some observations: 138 139 - vector indices *start at 0* 140 - The for statement has a different syntax: for(init; check; increment) 141 - Methods are called with `.` 142 - `total += x[i]` is equivalent to `total = total + x[i]`. 143 - other in-place operators are `-=`, `*=`, `and /=` 144 145 146 To check for the fastest way we can use: 147 148 ```{r eval=FALSE} 149 ?bench::mark 150 ``` 151 152 ```{r} 153 x <- runif(1e3) 154 bench::mark( 155 sum(x), 156 sumC(x), 157 sumR(x) 158 ) 159 ``` 160 161 ## Vector input and output {-} 162 163 ```{r} 164 pdistR <- function(x, ys) { 165 sqrt((x - ys) ^ 2) 166 } 167 ``` 168 169 ```{r} 170 cppFunction('NumericVector pdistC(double x, NumericVector ys) { 171 int n = ys.size(); 172 NumericVector out(n); 173 174 for(int i = 0; i < n; ++i) { 175 out[i] = sqrt(pow(ys[i] - x, 2.0)); 176 } 177 return out; 178 }') 179 ``` 180 181 Note: uses `pow()`, not `^`, for exponentiation 182 183 ```{r} 184 y <- runif(1e6) 185 bench::mark( 186 pdistR(0.5, y), 187 pdistC(0.5, y) 188 )[1:6] 189 ``` 190 191 ## Source your C++ code {-} 192 193 Source stand-alone C++ files into R using `sourceCpp()` 194 195 196 C++ files have extension `.cpp` 197 198 ``` 199 #include <Rcpp.h> 200 using namespace Rcpp; 201 ``` 202 203 And for each function that you want available within R, you need to prefix it with: 204 205 ``` 206 // [[Rcpp::export]] 207 ``` 208 209 Inside a cpp file you can include `R` code using special comments 210 211 ``` 212 /*** R 213 rcode here 214 */ 215 ``` 216 217 218 219 ### Example {-} 220 221 This block in Rmarkdown uses `{Rcpp}` as a short hand for engine = "Rcpp". 222 223 ```{Rcpp} 224 #include <Rcpp.h> 225 using namespace Rcpp; 226 227 // [[Rcpp::export]] 228 double meanC(NumericVector x) { 229 int n = x.size(); 230 double total = 0; 231 232 for(int i = 0; i < n; ++i) { 233 total += x[i]; 234 } 235 return total / n; 236 } 237 238 /*** R 239 x <- runif(1e5) 240 bench::mark( 241 mean(x), 242 meanC(x) 243 ) 244 */ 245 ``` 246 247 NOTE: For some reason although the r code above runs, `knit` doesn't include the output. Why? 248 249 ```{r} 250 x <- runif(1e5) 251 bench::mark( 252 mean(x), 253 meanC(x) 254 ) 255 ``` 256 257 258 259 ## Data frames, functions, and attributes 260 261 ### Lists and Dataframes {-} 262 263 Contrived example to illustrate how to access a dataframe from c++: 264 265 ```{Rcpp} 266 #include <Rcpp.h> 267 using namespace Rcpp; 268 269 // [[Rcpp::export]] 270 double mpe(List mod) { 271 if (!mod.inherits("lm")) stop("Input must be a linear model"); 272 273 NumericVector resid = as<NumericVector>(mod["residuals"]); 274 NumericVector fitted = as<NumericVector>(mod["fitted.values"]); 275 276 int n = resid.size(); 277 double err = 0; 278 for(int i = 0; i < n; ++i) { 279 err += resid[i] / (fitted[i] + resid[i]); 280 } 281 return err / n; 282 } 283 ``` 284 285 ```{r} 286 mod <- lm(mpg ~ wt, data = mtcars) 287 mpe(mod) 288 ``` 289 290 - Note that you must *cast* the values to the required type. C++ needs to know the types in advance. 291 292 ### Functions {-} 293 294 ```{Rcpp} 295 #include <Rcpp.h> 296 using namespace Rcpp; 297 298 // [[Rcpp::export]] 299 RObject callWithOne(Function f) { 300 return f(1); 301 } 302 ``` 303 304 305 ```{r} 306 callWithOne(function(x) x + 1) 307 ``` 308 309 310 * Other values can be accessed from c++ including 311 312 * attributes (use: `.attr()`. Also `.names()` is alias for name attribute. 313 * `Environment`, `DottedPair`, `Language`, `Symbol` , etc. 314 315 ## Missing values 316 317 ### Missing values behave differently for C++ scalers{-} 318 319 * Scalar NA's in Cpp : `NA_LOGICAL`, `NA_INTEGER`, `NA_REAL`, `NA_STRING`. 320 321 * Integers (`int`) stores R NA's as the smallest integer. Better to use length 1 `IntegerVector` 322 * Doubles use IEEE 754 NaN , which behaves a bit differently for logical expressions (but ok for math expressions). 323 324 ```{r} 325 evalCpp("NA_REAL || FALSE") 326 ``` 327 328 * Strings are a class from Rcpp, so they handle missing values fine. 329 330 * `bool` can only hold two values, so be careful. Consider using vectors of length 1 or coercing to `int` 331 332 333 ### Vectors 334 335 * Vectors are all type introduced by RCpp and know how to handle missing values if you use the specific type for that vector. 336 337 ```{Rcpp} 338 #include <Rcpp.h> 339 using namespace Rcpp; 340 341 // [[Rcpp::export]] 342 List missing_sampler() { 343 return List::create( 344 NumericVector::create(NA_REAL), 345 IntegerVector::create(NA_INTEGER), 346 LogicalVector::create(NA_LOGICAL), 347 CharacterVector::create(NA_STRING) 348 ); 349 } 350 ``` 351 352 ```{r} 353 str(missing_sampler()) 354 ``` 355 356 ## Standard Template Library 357 358 STL provides powerful data structures and algorithms for C++. 359 360 ### Iterators {-} 361 362 Iterators are used extensively in the STL to abstract away details of underlying data structures. 363 364 If you an iterator `it`, you can: 365 366 - Get the value by 'dereferencing' with `*it` 367 - Advance to the next value with `++it` 368 - Compare iterators (locations) with `==` 369 370 371 ### Algorithms {-} 372 373 * The real power of iterators comes from using them with STL algorithms. 374 375 * A good reference is [https://en.cppreference.com/w/cpp/algorithm] 376 377 * Book provides examples using `accumulate` and `upper_buond` 378 379 * Another Example: 380 381 ```{Rcpp} 382 383 #include <algorithm> 384 #include <Rcpp.h> 385 386 using namespace Rcpp; 387 388 389 // Explicit iterator version 390 391 // [[Rcpp::export]] 392 NumericVector square_C_it(NumericVector x){ 393 NumericVector out(x.size()); 394 // Each container has its own iterator type 395 NumericVector::iterator in_it; 396 NumericVector::iterator out_it; 397 398 for(in_it = x.begin(), out_it = out.begin(); in_it != x.end(); ++in_it, ++out_it) { 399 *out_it = pow(*in_it,2); 400 } 401 402 return out; 403 404 } 405 406 407 // Use algorithm 'transform' 408 409 // [[Rcpp::export]] 410 NumericVector square_C(NumericVector x) { 411 412 NumericVector out(x.size()); 413 414 415 std::transform(x.begin(),x.end(), out.begin(), 416 [](double v) -> double { return v*v; }); 417 return out; 418 } 419 ``` 420 421 ```{r} 422 square_C(c(1.0,2.0,3.0)) 423 ``` 424 ```{r} 425 square_C_it(c(1.0,2.0,3.0)) 426 ``` 427 428 ## Data Structures {-} 429 430 STL provides a large set of data structures. Some of the most important: 431 432 * `std::vector` - like an `R` vector, except knows how to grow efficiently 433 434 * `std::unordered_set` - unique set of values. Ordered version `std::set`. Unordered is more efficient. 435 436 * `std::map` - Moslty similar to `R` lists, provide an association between a key and a value. There is also an unordered version. 437 438 A quick example illustrating the `map`: 439 440 ```{Rcpp} 441 #include <Rcpp.h> 442 using namespace Rcpp; 443 444 // [[Rcpp::export]] 445 std::map<double, int> tableC(NumericVector x) { 446 // Note the types are <key, value> 447 std::map<double, int> counts; 448 449 int n = x.size(); 450 for (int i = 0; i < n; i++) { 451 counts[x[i]]++; 452 } 453 454 return counts; 455 } 456 ``` 457 458 459 ```{r} 460 res = tableC(c(1,1,2,1,4,5)) 461 res 462 ``` 463 464 * Note that the map is converted to a named vector in this case on return 465 466 467 To learn more about the STL data structures see [containers](https://en.cppreference.com/w/cpp/container) at `cppreference` 468 469 ## Case Studies 470 471  472 473 Real life uses of C++ to replace slow R code. 474 475 476 ## Case study 1: Gibbs sampler {-} 477 478 The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. 479 480 [Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster. 481 482 > "Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime. 483 484 485 <center></center> 486 487 488 R code: 489 490 ```{r} 491 gibbs_r <- function(N, thin) { 492 mat <- matrix(nrow = N, ncol = 2) 493 x <- y <- 0 494 495 for (i in 1:N) { 496 for (j in 1:thin) { 497 x <- rgamma(1, 3, y * y + 4) 498 y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1))) 499 } 500 mat[i, ] <- c(x, y) 501 } 502 mat 503 } 504 ``` 505 506 Actions to convert R to C++: 507 508 - Add type declarations to all variables 509 - Use `(` instead of `[` to index into the matrix 510 - Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar. 511 512 ```{Rcpp} 513 #include <Rcpp.h> 514 using namespace Rcpp; 515 516 // [[Rcpp::export]] 517 NumericMatrix gibbs_cpp(int N, int thin) { 518 NumericMatrix mat(N, 2); 519 double x = 0, y = 0; 520 521 for(int i = 0; i < N; i++) { 522 for(int j = 0; j < thin; j++) { 523 x = rgamma(1, 3, 1 / (y * y + 4))[0]; 524 y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0]; 525 } 526 mat(i, 0) = x; 527 mat(i, 1) = y; 528 } 529 530 return(mat); 531 } 532 ``` 533 534 Checking who's best: 535 536 ```{r} 537 bench::mark( 538 gibbs_r(100, 10), 539 gibbs_cpp(100, 10), 540 check = FALSE 541 ) 542 ``` 543 544 ## Case study 2: predict a model response from three inputs {-} 545 546 [Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP. 547 548 Starts with this code: 549 550 ```{r} 551 vacc1a <- function(age, female, ily) { 552 p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily 553 p <- p * if (female) 1.25 else 0.75 554 p <- max(0, p) 555 p <- min(1, p) 556 p 557 } 558 ``` 559 560 R code with a for loop: 561 562 ```{r} 563 vacc1 <- function(age, female, ily) { 564 n <- length(age) 565 out <- numeric(n) 566 for (i in seq_len(n)) { 567 out[i] <- vacc1a(age[i], female[i], ily[i]) 568 } 569 out 570 } 571 ``` 572 573 Vectorized R code: 574 575 ```{r} 576 vacc2 <- function(age, female, ily) { 577 p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily 578 p <- p * ifelse(female, 1.25, 0.75) 579 p <- pmax(0, p) 580 p <- pmin(1, p) 581 p 582 } 583 ``` 584 585 C++: 586 587 ```{Rcpp} 588 #include <Rcpp.h> 589 using namespace Rcpp; 590 591 double vacc3a(double age, bool female, bool ily){ 592 double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily; 593 p = p * (female ? 1.25 : 0.75); 594 p = std::max(p, 0.0); 595 p = std::min(p, 1.0); 596 return p; 597 } 598 599 // [[Rcpp::export]] 600 NumericVector vacc3(NumericVector age, LogicalVector female, 601 LogicalVector ily) { 602 int n = age.size(); 603 NumericVector out(n); 604 605 for(int i = 0; i < n; ++i) { 606 out[i] = vacc3a(age[i], female[i], ily[i]); 607 } 608 609 return out; 610 } 611 ``` 612 613 Sample data: 614 615 ```{r} 616 n <- 1000 617 age <- rnorm(n, mean = 50, sd = 10) 618 female <- sample(c(T, F), n, rep = TRUE) 619 ily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE) 620 621 stopifnot( 622 all.equal(vacc1(age, female, ily), vacc2(age, female, ily)), 623 all.equal(vacc1(age, female, ily), vacc3(age, female, ily)) 624 ) 625 ``` 626 627 <center>**Who's faster?**</center> 628 <center></center> 629 630 ```{r} 631 bench::mark( 632 vacc1 = vacc1(age, female, ily), 633 vacc2 = vacc2(age, female, ily), 634 vacc3 = vacc3(age, female, ily) 635 ) 636 ``` 637 638 ## Resources 639 640 - [Rcpp: Seamless R and C++ Integration](https:\\Rcpp.org) 641 - [cpp-tutorial](https://www.learncpp.com) is often recommended. Lots of ads though! 642 - [cpp-reference](https://en.cppreference.com/w/cpp) 643 - [C++20 for Programmers](https://www.pearson.com/en-us/subject-catalog/p/c20-for-programmers-an-objects-natural-approach/P200000000211/9780137570461) is a newer book that covers modern c++ for people who know programming in another language. 644 645 ## Op Success! 646 647 