bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git
Log | Files | Refs | README | LICENSE

commit 9368efdb6d708da59873219ed48ab484ffe700f3
parent 207894ca0b62c6ce9e3ae268a433137b5ccb5734
Author: Federica Gazzelloni <61802414+Fgazzelloni@users.noreply.github.com>
Date:   Wed, 18 Jan 2023 22:02:18 +0100

Chapter25 (#46)

* chapter 25

* adding Rcpp

* second commit
Diffstat:
M25_Rewriting_R_code_in_C++.Rmd | 510+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
MDESCRIPTION | 1+
2 files changed, 503 insertions(+), 8 deletions(-)

diff --git a/25_Rewriting_R_code_in_C++.Rmd b/25_Rewriting_R_code_in_C++.Rmd @@ -2,12 +2,506 @@ **Learning objectives:** -- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY +- how to improve performance by rewriting key functions in C++ +- how to use [{Rcpp} package](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) (with key contributions by Doug Bates, John Chambers, and JJ Allaire) +- how to check who's faster -## SLIDE 1 +## Introduction -- ADD SLIDES AS SECTIONS (`##`). -- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF. +In this chapter we'll learn how to rewrite **R** code in **C++** for making it faster! We'll use the **Rcpp package** which provides **API for comparison**. + +A closer look at C++ will provide an overview of the language and its key conventions to focus on the differences with R. We will also have look at the standard template library STL, the C++ library to use, which provides a set of extremely useful data structures and algorithms. + +A very interesting part will involve comparing the two codes benchmarking the two implementations yields with `bench::mark()`. + +<center>Like how?</center> + +<center> </center> + +<center>![](https://media.giphy.com/media/vLyZk5CJo12Wk/giphy.gif)</center> + +## What C++ can handle + +The C language was originally implemented by Dennis Ritchie for Linux at the end of '70s. C++ is the object oriented version of C. + +- Loops that can't be easily vectorised because subsequent iterations depend on previous ones. + +- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R. + +- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues. + +## Getting started with C++ + +```{r warning=FALSE} +library(Rcpp) +``` + +Install a C++ compiler: + +- Rtools, on Windows +- Xcode, on Mac +- Sudo apt-get install r-base-dev or similar, on Linux. + +### C++ conventions + +<center> </center> + +<center> </center> + +<center>![](https://media.giphy.com/media/xT9KVfk0D9TEQmRwUU/giphy.gif)</center> + + + +- Use = for assignment, not \<-. +- Scalars and vectors are different: + - scalar equivalents of numeric, integer, character, + - logical vectors are: double, int, String, and bool. +- explicitly use a `return` statement to return a value from a function. +- Every statement is terminated by a ; +- The for statement has a different syntax: for(init; check; increment) +- vector indices start at 0 +- methods are called with . +- total += x[i] is equivalent to total = total + x[i]. +- in-place operators are -=, \*=, and /= +- uses pow(), not \^, for exponentiation +- comment block: /\*\*\* R \# This is R code \*/ + +## Examples with the cppFunction function + +### Make the sum function + +Rcpp compiling the C++ code: + +```{r} +cppFunction('int add(int x, int y, int z) { + int sum = x + y + z; + return sum; +}') +# add works like a regular R function +add + +add(1, 2, 3) +``` + +### Build a simple numerical function without arguments + +In R: + +```{r} +one <- function() 1L +one()+100 +``` + +In C++: + + int one() { + return 1; + } + +Translation: + +```{r} +cppFunction('int one() { + return 1; +}') +``` + +### The sign function + +```{r} +signR <- function(x) { + if (x > 0) { + 1 + } else if (x == 0) { + 0 + } else { + -1 + } +} + +a <- -0.5 +b <- 0.5 +c <- 0 +signR(c) +``` + +Translation: + +```{r} +cppFunction('int signC(int x) { + if (x > 0) { + return 1; + } else if (x == 0) { + return 0; + } else { + return -1; + } +}') +``` + +### Sum of a sequence: sumR vs sumC + +```{r} +sumR <- function(x) { + total <- 0 + for (i in seq_along(x)) { + total <- total + x[i] + } + total +} + +x<- runif(100) +sumR(x) +``` + +Translation: + +```{r} +cppFunction('double sumC(NumericVector x) { + int n = x.size(); + double total = 0; + for(int i = 0; i < n; ++i) { + total += x[i]; + } + return total; +}') +``` + +To check for the fastest way we can use: + +```{r eval=FALSE} +?bench::mark +``` + +```{r} +x <- runif(1e3) +bench::mark( + sum(x), + sumC(x), + sumR(x) +) +``` + +### Euclidean distance: pdistR versus pdistC + +```{r} +pdistR <- function(x, ys) { + sqrt((x - ys) ^ 2) +} +``` + +```{r} +cppFunction('NumericVector pdistC(double x, NumericVector ys) { + int n = ys.size(); + NumericVector out(n); + + for(int i = 0; i < n; ++i) { + out[i] = sqrt(pow(ys[i] - x, 2.0)); + } + return out; +}') +``` + +```{r} +y <- runif(1e6) +bench::mark( + pdistR(0.5, y), + pdistC(0.5, y) +)[1:6] +``` + +## Source your C++ code + +Source stand-alone C++ files into R using `sourceCpp()` + +<center> </center> + +<center> </center> + +<center>![](https://media.giphy.com/media/I0qyuvxbdzzYc8INP7/giphy.gif)</center> + + +C++ files have extension `.cpp` + +```{r eval=FALSE} +#include <Rcpp.h> +using namespace Rcpp; +``` + +And for each function that you want available within R, you need to prefix it with: + +```{r eval=FALSE} +// [[Rcpp::export]] +``` + +To call the files: + +- in R use `source(echo = TRUE)` +- in C++ use `sourceCpp("path/to/file.cpp")` + +### Example + + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +double meanC(NumericVector x) { + int n = x.size(); + double total = 0; + + for(int i = 0; i < n; ++i) { + total += x[i]; + } + return total / n; +} + +/*** R +x <- runif(1e5) +bench::mark( + mean(x), + meanC(x) +) +*/ +``` + +## Data frames, functions, and attributes + +Example of Data frames + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +double mpe(List mod) { + if (!mod.inherits("lm")) stop("Input must be a linear model"); + + NumericVector resid = as<NumericVector>(mod["residuals"]); + NumericVector fitted = as<NumericVector>(mod["fitted.values"]); + + int n = resid.size(); + double err = 0; + for(int i = 0; i < n; ++i) { + err += resid[i] / (fitted[i] + resid[i]); + } + return err / n; +} +``` + +```{r} +mod <- lm(mpg ~ wt, data = mtcars) +mpe(mod) +``` + +## Missing values + +Dealing with missing values can differs if: + +- scalars: + - integers + - doubles +- strings +- Boolean +- vectors + +## Standard Template Library + +STL is the fundamental library in C++, it provides a set of useful data structures and algorithms. + + +- Using iterators, the next step up from basic loops: + - NumericVector::iterator + - LogicalVector::iterator + - CharacterVector::iterator +- Algorithms +- Data structures: + - vector + - unordered_set + - unordered_map +- Vectors +- Map + +A good resource is **Effective STL by Scott Meyers**. +And one more about the STL data structures is [the container](https://en.cppreference.com/w/cpp/container) + +## Case Studies + +Real life uses of C++ to replace slow R code. + +### Case study 1: Gibbs sampler + +The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. + +[Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster. + +> "Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime. + + +<center>![](https://media.giphy.com/media/13GIgrGdslD9oQ/giphy.gif)</center> + + +R code: + +```{r} +gibbs_r <- function(N, thin) { + mat <- matrix(nrow = N, ncol = 2) + x <- y <- 0 + + for (i in 1:N) { + for (j in 1:thin) { + x <- rgamma(1, 3, y * y + 4) + y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1))) + } + mat[i, ] <- c(x, y) + } + mat +} +``` + +Actions to convert R to C++: + +- Add type declarations to all variables +- Use `(` instead of `[` to index into the matrix +- Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar. + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +NumericMatrix gibbs_cpp(int N, int thin) { + NumericMatrix mat(N, 2); + double x = 0, y = 0; + + for(int i = 0; i < N; i++) { + for(int j = 0; j < thin; j++) { + x = rgamma(1, 3, 1 / (y * y + 4))[0]; + y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0]; + } + mat(i, 0) = x; + mat(i, 1) = y; + } + + return(mat); +} +``` + +Checking who's best: + +```{r} +bench::mark( + gibbs_r(100, 10), + gibbs_cpp(100, 10), + check = FALSE +) +``` + +### Case study 2: predict a model response from three inputs + +[Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP. + +Starts with this code: + +```{r} +vacc1a <- function(age, female, ily) { + p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily + p <- p * if (female) 1.25 else 0.75 + p <- max(0, p) + p <- min(1, p) + p +} +``` + +R code with a for loop: + +```{r} +vacc1 <- function(age, female, ily) { + n <- length(age) + out <- numeric(n) + for (i in seq_len(n)) { + out[i] <- vacc1a(age[i], female[i], ily[i]) + } + out +} +``` + +R code without a for loop: + +```{r} +vacc2 <- function(age, female, ily) { + p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily + p <- p * ifelse(female, 1.25, 0.75) + p <- pmax(0, p) + p <- pmin(1, p) + p +} +``` + +C++: + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +double vacc3a(double age, bool female, bool ily){ + double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily; + p = p * (female ? 1.25 : 0.75); + p = std::max(p, 0.0); + p = std::min(p, 1.0); + return p; +} + +// [[Rcpp::export]] +NumericVector vacc3(NumericVector age, LogicalVector female, + LogicalVector ily) { + int n = age.size(); + NumericVector out(n); + + for(int i = 0; i < n; ++i) { + out[i] = vacc3a(age[i], female[i], ily[i]); + } + + return out; +} +``` + +Sample data: + +```{r} +n <- 1000 +age <- rnorm(n, mean = 50, sd = 10) +female <- sample(c(T, F), n, rep = TRUE) +ily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE) + +stopifnot( + all.equal(vacc1(age, female, ily), vacc2(age, female, ily)), + all.equal(vacc1(age, female, ily), vacc3(age, female, ily)) +) +``` + +<center>**Who's faster?**</center> +<center>![](https://media.giphy.com/media/l41JGlWa1xOjJSsV2/giphy.gif)</center> + +```{r} +bench::mark( + vacc1 = vacc1(age, female, ily), + vacc2 = vacc2(age, female, ily), + vacc3 = vacc3(age, female, ily) +) +``` + +## Resources + +- [Rcpp: Seamless R and C++ Integration](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) +- [cpp-tutorial](https://www.learncpp.com/cpp-tutorial/introduction-to-function-parameters-and-arguments/) +- [cpp-reference](https://en.cppreference.com/w/cpp) +- A good resource is **Effective STL by Scott Meyers** +- the STL data structures found in [the container](https://en.cppreference.com/w/cpp/container) +- [Exposing C++ functions and classes +with Rcpp modules](https://cran.rstudio.com/web/packages/Rcpp/vignettes/Rcpp-modules.pdf) +- All gifs are from: https://giphy.com/ ## Meeting Videos @@ -38,9 +532,9 @@ `r knitr::include_url("https://www.youtube.com/embed/URL")` <details> -<summary> Meeting chat log </summary> -``` -LOG -``` +<summary>Meeting chat log</summary> + + LOG + </details> diff --git a/DESCRIPTION b/DESCRIPTION @@ -29,6 +29,7 @@ Imports: profvis, purrr, R6, + Rcpp, reshape2, rlang, rmarkdown,