commit 9368efdb6d708da59873219ed48ab484ffe700f3
parent 207894ca0b62c6ce9e3ae268a433137b5ccb5734
Author: Federica Gazzelloni <61802414+Fgazzelloni@users.noreply.github.com>
Date:   Wed, 18 Jan 2023 22:02:18 +0100
Chapter25 (#46)
* chapter 25
* adding Rcpp
* second commit
Diffstat:
2 files changed, 503 insertions(+), 8 deletions(-)
diff --git a/25_Rewriting_R_code_in_C++.Rmd b/25_Rewriting_R_code_in_C++.Rmd
@@ -2,12 +2,506 @@
 
 **Learning objectives:**
 
-- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
+-   how to improve performance by rewriting key functions in C++
+-   how to use [{Rcpp} package](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) (with key contributions by Doug Bates, John Chambers, and JJ Allaire)
+-   how to check who's faster
 
-## SLIDE 1
+## Introduction
 
-- ADD SLIDES AS SECTIONS (`##`).
-- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
+In this chapter we'll learn how to rewrite **R** code in **C++** for making it faster! We'll use the **Rcpp package** which provides **API for comparison**.
+
+A closer look at C++ will provide an overview of the language and its key conventions to focus on the differences with R. We will also have look at the standard template library STL, the C++ library to use, which provides a set of extremely useful data structures and algorithms.
+
+A very interesting part will involve comparing the two codes benchmarking the two implementations yields with `bench::mark()`.
+
+<center>Like how?</center>
+
+<center> </center>
+
+<center></center>
+
+## What C++ can handle
+
+The C language was originally implemented by Dennis Ritchie for Linux at the end of '70s. C++ is the object oriented version of C.
+
+-   Loops that can't be easily vectorised because subsequent iterations depend on previous ones.
+
+-   Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
+
+-   Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.
+
+## Getting started with C++
+
+```{r warning=FALSE}
+library(Rcpp)
+```
+
+Install a C++ compiler:
+
+-   Rtools, on Windows
+-   Xcode, on Mac
+-   Sudo apt-get install r-base-dev or similar, on Linux.
+
+### C++ conventions
+
+<center> </center>
+
+<center> </center>
+
+<center></center>
+
+
+
+-   Use = for assignment, not \<-.
+-   Scalars and vectors are different:
+    -   scalar equivalents of numeric, integer, character,
+    -   logical vectors are: double, int, String, and bool.
+-   explicitly use a `return` statement to return a value from a function.
+-   Every statement is terminated by a ;
+-   The for statement has a different syntax: for(init; check; increment)
+-   vector indices start at 0
+-   methods are called with .
+-   total += x[i] is equivalent to total = total + x[i].
+-   in-place operators are -=, \*=, and /=
+-   uses pow(), not \^, for exponentiation
+-   comment block: /\*\*\* R \# This is R code \*/
+
+## Examples with the cppFunction function
+
+### Make the sum function
+
+Rcpp compiling the C++ code:
+
+```{r}
+cppFunction('int add(int x, int y, int z) {
+  int sum = x + y + z;
+  return sum;
+}')
+# add works like a regular R function
+add
+
+add(1, 2, 3)
+```
+
+### Build a simple numerical function without arguments
+
+In R:
+
+```{r}
+one <- function() 1L
+one()+100
+```
+
+In C++:
+
+    int one() {
+         return 1;
+              }
+
+Translation:
+
+```{r}
+cppFunction('int one() {
+  return 1;
+}')
+```
+
+### The sign function
+
+```{r}
+signR <- function(x) {
+  if (x > 0) {
+    1
+  } else if (x == 0) {
+    0
+  } else {
+    -1
+  }
+}
+
+a <- -0.5
+b <- 0.5
+c <- 0
+signR(c)
+```
+
+Translation:
+
+```{r}
+cppFunction('int signC(int x) {
+  if (x > 0) {
+    return 1;
+  } else if (x == 0) {
+    return 0;
+  } else {
+    return -1;
+  }
+}')
+```
+
+### Sum of a sequence: sumR vs sumC
+
+```{r}
+sumR <- function(x) {
+  total <- 0
+  for (i in seq_along(x)) {
+    total <- total + x[i]
+  }
+  total
+}
+
+x<- runif(100)
+sumR(x)
+```
+
+Translation:
+
+```{r}
+cppFunction('double sumC(NumericVector x) {
+  int n = x.size();
+  double total = 0;
+  for(int i = 0; i < n; ++i) {
+    total += x[i];
+  }
+  return total;
+}')
+```
+
+To check for the fastest way we can use:
+
+```{r eval=FALSE}
+?bench::mark
+```
+
+```{r}
+x <- runif(1e3)
+bench::mark(
+  sum(x),
+  sumC(x),
+  sumR(x)
+)
+```
+
+### Euclidean distance: pdistR versus pdistC
+
+```{r}
+pdistR <- function(x, ys) {
+  sqrt((x - ys) ^ 2)
+}
+```
+
+```{r}
+cppFunction('NumericVector pdistC(double x, NumericVector ys) {
+  int n = ys.size();
+  NumericVector out(n);
+
+  for(int i = 0; i < n; ++i) {
+    out[i] = sqrt(pow(ys[i] - x, 2.0));
+  }
+  return out;
+}')
+```
+
+```{r}
+y <- runif(1e6)
+bench::mark(
+  pdistR(0.5, y),
+  pdistC(0.5, y)
+)[1:6]
+```
+
+## Source your C++ code
+
+Source stand-alone C++ files into R using `sourceCpp()`
+
+<center> </center>
+
+<center> </center>
+
+<center></center>
+
+
+C++ files have extension `.cpp`
+
+```{r eval=FALSE}
+#include <Rcpp.h>
+using namespace Rcpp;
+```
+
+And for each function that you want available within R, you need to prefix it with:
+
+```{r eval=FALSE}
+// [[Rcpp::export]]
+```
+
+To call the files: 
+
+- in R use `source(echo = TRUE)` 
+- in C++ use `sourceCpp("path/to/file.cpp")`
+
+### Example
+
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+double meanC(NumericVector x) {
+  int n = x.size();
+  double total = 0;
+
+  for(int i = 0; i < n; ++i) {
+    total += x[i];
+  }
+  return total / n;
+}
+
+/*** R
+x <- runif(1e5)
+bench::mark(
+  mean(x),
+  meanC(x)
+)
+*/
+```
+
+## Data frames, functions, and attributes
+
+Example of Data frames
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+double mpe(List mod) {
+  if (!mod.inherits("lm")) stop("Input must be a linear model");
+
+  NumericVector resid = as<NumericVector>(mod["residuals"]);
+  NumericVector fitted = as<NumericVector>(mod["fitted.values"]);
+
+  int n = resid.size();
+  double err = 0;
+  for(int i = 0; i < n; ++i) {
+    err += resid[i] / (fitted[i] + resid[i]);
+  }
+  return err / n;
+}
+```
+
+```{r}
+mod <- lm(mpg ~ wt, data = mtcars)
+mpe(mod)
+```
+
+## Missing values
+
+Dealing with missing values can differs if:
+
+-   scalars:
+    -   integers
+    -   doubles
+-   strings
+-   Boolean
+-   vectors
+
+## Standard Template Library
+
+STL is the fundamental library in C++, it provides a set of useful data structures and algorithms.
+
+
+- Using iterators, the next step up from basic loops:
+    - NumericVector::iterator
+    - LogicalVector::iterator
+    - CharacterVector::iterator
+- Algorithms
+- Data structures:
+    - vector
+    - unordered_set
+    - unordered_map
+- Vectors
+- Map
+
+A good resource is **Effective STL by Scott Meyers**.
+And one more about the STL data structures is [the container](https://en.cppreference.com/w/cpp/container)
+
+## Case Studies
+
+Real life uses of C++ to replace slow R code.
+
+### Case study 1: Gibbs sampler
+
+The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. 
+
+[Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster.
+
+> "Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime.
+
+
+<center></center>
+
+
+R code:
+
+```{r}
+gibbs_r <- function(N, thin) {
+  mat <- matrix(nrow = N, ncol = 2)
+  x <- y <- 0
+
+  for (i in 1:N) {
+    for (j in 1:thin) {
+      x <- rgamma(1, 3, y * y + 4)
+      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
+    }
+    mat[i, ] <- c(x, y)
+  }
+  mat
+}
+```
+
+Actions to convert R to C++: 
+
+- Add type declarations to all variables 
+- Use `(` instead of `[` to index into the matrix 
+- Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar.
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+NumericMatrix gibbs_cpp(int N, int thin) {
+  NumericMatrix mat(N, 2);
+  double x = 0, y = 0;
+
+  for(int i = 0; i < N; i++) {
+    for(int j = 0; j < thin; j++) {
+      x = rgamma(1, 3, 1 / (y * y + 4))[0];
+      y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
+    }
+    mat(i, 0) = x;
+    mat(i, 1) = y;
+  }
+
+  return(mat);
+}
+```
+
+Checking who's best:
+
+```{r}
+bench::mark(
+  gibbs_r(100, 10),
+  gibbs_cpp(100, 10),
+  check = FALSE
+)
+```
+
+### Case study 2: predict a model response from three inputs
+
+[Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP.
+
+Starts with this code:
+
+```{r}
+vacc1a <- function(age, female, ily) {
+  p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
+  p <- p * if (female) 1.25 else 0.75
+  p <- max(0, p)
+  p <- min(1, p)
+  p
+}
+```
+
+R code with a for loop:
+
+```{r}
+vacc1 <- function(age, female, ily) {
+  n <- length(age)
+  out <- numeric(n)
+  for (i in seq_len(n)) {
+    out[i] <- vacc1a(age[i], female[i], ily[i])
+  }
+  out
+}
+```
+
+R code without a for loop:
+
+```{r}
+vacc2 <- function(age, female, ily) {
+  p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
+  p <- p * ifelse(female, 1.25, 0.75)
+  p <- pmax(0, p)
+  p <- pmin(1, p)
+  p
+}
+```
+
+C++:
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+double vacc3a(double age, bool female, bool ily){
+  double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily;
+  p = p * (female ? 1.25 : 0.75);
+  p = std::max(p, 0.0);
+  p = std::min(p, 1.0);
+  return p;
+}
+
+// [[Rcpp::export]]
+NumericVector vacc3(NumericVector age, LogicalVector female, 
+                    LogicalVector ily) {
+  int n = age.size();
+  NumericVector out(n);
+
+  for(int i = 0; i < n; ++i) {
+    out[i] = vacc3a(age[i], female[i], ily[i]);
+  }
+
+  return out;
+}
+```
+
+Sample data:
+
+```{r}
+n <- 1000
+age <- rnorm(n, mean = 50, sd = 10)
+female <- sample(c(T, F), n, rep = TRUE)
+ily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE)
+
+stopifnot(
+  all.equal(vacc1(age, female, ily), vacc2(age, female, ily)),
+  all.equal(vacc1(age, female, ily), vacc3(age, female, ily))
+)
+```
+
+<center>**Who's faster?**</center>
+<center></center>
+
+```{r}
+bench::mark(
+  vacc1 = vacc1(age, female, ily),
+  vacc2 = vacc2(age, female, ily),
+  vacc3 = vacc3(age, female, ily)
+)
+```
+
+## Resources
+
+-   [Rcpp: Seamless R and C++ Integration](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475)
+-   [cpp-tutorial](https://www.learncpp.com/cpp-tutorial/introduction-to-function-parameters-and-arguments/)
+-   [cpp-reference](https://en.cppreference.com/w/cpp)
+- A good resource is **Effective STL by Scott Meyers**
+- the STL data structures found in [the container](https://en.cppreference.com/w/cpp/container)
+- [Exposing C++ functions and classes
+with Rcpp modules](https://cran.rstudio.com/web/packages/Rcpp/vignettes/Rcpp-modules.pdf)
+- All gifs are from: https://giphy.com/
 
 ## Meeting Videos
 
@@ -38,9 +532,9 @@
 `r knitr::include_url("https://www.youtube.com/embed/URL")`
 
 <details>
-<summary> Meeting chat log </summary>
 
-```
-LOG
-```
+<summary>Meeting chat log</summary>
+
+    LOG
+
 </details>
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -29,6 +29,7 @@ Imports:
     profvis,
     purrr,
     R6,
+    Rcpp,
     reshape2,
     rlang,
     rmarkdown,