commit 9368efdb6d708da59873219ed48ab484ffe700f3
parent 207894ca0b62c6ce9e3ae268a433137b5ccb5734
Author: Federica Gazzelloni <61802414+Fgazzelloni@users.noreply.github.com>
Date: Wed, 18 Jan 2023 22:02:18 +0100
Chapter25 (#46)
* chapter 25
* adding Rcpp
* second commit
Diffstat:
2 files changed, 503 insertions(+), 8 deletions(-)
diff --git a/25_Rewriting_R_code_in_C++.Rmd b/25_Rewriting_R_code_in_C++.Rmd
@@ -2,12 +2,506 @@
**Learning objectives:**
-- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
+- how to improve performance by rewriting key functions in C++
+- how to use [{Rcpp} package](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) (with key contributions by Doug Bates, John Chambers, and JJ Allaire)
+- how to check who's faster
-## SLIDE 1
+## Introduction
-- ADD SLIDES AS SECTIONS (`##`).
-- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
+In this chapter we'll learn how to rewrite **R** code in **C++** for making it faster! We'll use the **Rcpp package** which provides **API for comparison**.
+
+A closer look at C++ will provide an overview of the language and its key conventions to focus on the differences with R. We will also have look at the standard template library STL, the C++ library to use, which provides a set of extremely useful data structures and algorithms.
+
+A very interesting part will involve comparing the two codes benchmarking the two implementations yields with `bench::mark()`.
+
+<center>Like how?</center>
+
+<center> </center>
+
+<center></center>
+
+## What C++ can handle
+
+The C language was originally implemented by Dennis Ritchie for Linux at the end of '70s. C++ is the object oriented version of C.
+
+- Loops that can't be easily vectorised because subsequent iterations depend on previous ones.
+
+- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
+
+- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.
+
+## Getting started with C++
+
+```{r warning=FALSE}
+library(Rcpp)
+```
+
+Install a C++ compiler:
+
+- Rtools, on Windows
+- Xcode, on Mac
+- Sudo apt-get install r-base-dev or similar, on Linux.
+
+### C++ conventions
+
+<center> </center>
+
+<center> </center>
+
+<center></center>
+
+
+
+- Use = for assignment, not \<-.
+- Scalars and vectors are different:
+ - scalar equivalents of numeric, integer, character,
+ - logical vectors are: double, int, String, and bool.
+- explicitly use a `return` statement to return a value from a function.
+- Every statement is terminated by a ;
+- The for statement has a different syntax: for(init; check; increment)
+- vector indices start at 0
+- methods are called with .
+- total += x[i] is equivalent to total = total + x[i].
+- in-place operators are -=, \*=, and /=
+- uses pow(), not \^, for exponentiation
+- comment block: /\*\*\* R \# This is R code \*/
+
+## Examples with the cppFunction function
+
+### Make the sum function
+
+Rcpp compiling the C++ code:
+
+```{r}
+cppFunction('int add(int x, int y, int z) {
+ int sum = x + y + z;
+ return sum;
+}')
+# add works like a regular R function
+add
+
+add(1, 2, 3)
+```
+
+### Build a simple numerical function without arguments
+
+In R:
+
+```{r}
+one <- function() 1L
+one()+100
+```
+
+In C++:
+
+ int one() {
+ return 1;
+ }
+
+Translation:
+
+```{r}
+cppFunction('int one() {
+ return 1;
+}')
+```
+
+### The sign function
+
+```{r}
+signR <- function(x) {
+ if (x > 0) {
+ 1
+ } else if (x == 0) {
+ 0
+ } else {
+ -1
+ }
+}
+
+a <- -0.5
+b <- 0.5
+c <- 0
+signR(c)
+```
+
+Translation:
+
+```{r}
+cppFunction('int signC(int x) {
+ if (x > 0) {
+ return 1;
+ } else if (x == 0) {
+ return 0;
+ } else {
+ return -1;
+ }
+}')
+```
+
+### Sum of a sequence: sumR vs sumC
+
+```{r}
+sumR <- function(x) {
+ total <- 0
+ for (i in seq_along(x)) {
+ total <- total + x[i]
+ }
+ total
+}
+
+x<- runif(100)
+sumR(x)
+```
+
+Translation:
+
+```{r}
+cppFunction('double sumC(NumericVector x) {
+ int n = x.size();
+ double total = 0;
+ for(int i = 0; i < n; ++i) {
+ total += x[i];
+ }
+ return total;
+}')
+```
+
+To check for the fastest way we can use:
+
+```{r eval=FALSE}
+?bench::mark
+```
+
+```{r}
+x <- runif(1e3)
+bench::mark(
+ sum(x),
+ sumC(x),
+ sumR(x)
+)
+```
+
+### Euclidean distance: pdistR versus pdistC
+
+```{r}
+pdistR <- function(x, ys) {
+ sqrt((x - ys) ^ 2)
+}
+```
+
+```{r}
+cppFunction('NumericVector pdistC(double x, NumericVector ys) {
+ int n = ys.size();
+ NumericVector out(n);
+
+ for(int i = 0; i < n; ++i) {
+ out[i] = sqrt(pow(ys[i] - x, 2.0));
+ }
+ return out;
+}')
+```
+
+```{r}
+y <- runif(1e6)
+bench::mark(
+ pdistR(0.5, y),
+ pdistC(0.5, y)
+)[1:6]
+```
+
+## Source your C++ code
+
+Source stand-alone C++ files into R using `sourceCpp()`
+
+<center> </center>
+
+<center> </center>
+
+<center></center>
+
+
+C++ files have extension `.cpp`
+
+```{r eval=FALSE}
+#include <Rcpp.h>
+using namespace Rcpp;
+```
+
+And for each function that you want available within R, you need to prefix it with:
+
+```{r eval=FALSE}
+// [[Rcpp::export]]
+```
+
+To call the files:
+
+- in R use `source(echo = TRUE)`
+- in C++ use `sourceCpp("path/to/file.cpp")`
+
+### Example
+
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+double meanC(NumericVector x) {
+ int n = x.size();
+ double total = 0;
+
+ for(int i = 0; i < n; ++i) {
+ total += x[i];
+ }
+ return total / n;
+}
+
+/*** R
+x <- runif(1e5)
+bench::mark(
+ mean(x),
+ meanC(x)
+)
+*/
+```
+
+## Data frames, functions, and attributes
+
+Example of Data frames
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+double mpe(List mod) {
+ if (!mod.inherits("lm")) stop("Input must be a linear model");
+
+ NumericVector resid = as<NumericVector>(mod["residuals"]);
+ NumericVector fitted = as<NumericVector>(mod["fitted.values"]);
+
+ int n = resid.size();
+ double err = 0;
+ for(int i = 0; i < n; ++i) {
+ err += resid[i] / (fitted[i] + resid[i]);
+ }
+ return err / n;
+}
+```
+
+```{r}
+mod <- lm(mpg ~ wt, data = mtcars)
+mpe(mod)
+```
+
+## Missing values
+
+Dealing with missing values can differs if:
+
+- scalars:
+ - integers
+ - doubles
+- strings
+- Boolean
+- vectors
+
+## Standard Template Library
+
+STL is the fundamental library in C++, it provides a set of useful data structures and algorithms.
+
+
+- Using iterators, the next step up from basic loops:
+ - NumericVector::iterator
+ - LogicalVector::iterator
+ - CharacterVector::iterator
+- Algorithms
+- Data structures:
+ - vector
+ - unordered_set
+ - unordered_map
+- Vectors
+- Map
+
+A good resource is **Effective STL by Scott Meyers**.
+And one more about the STL data structures is [the container](https://en.cppreference.com/w/cpp/container)
+
+## Case Studies
+
+Real life uses of C++ to replace slow R code.
+
+### Case study 1: Gibbs sampler
+
+The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples.
+
+[Example blogged by Dirk Eddelbuettel](https://dirk.eddelbuettel.com/blog/2011/07/14/), the R and C++ code is very similar but runs about 20 times faster.
+
+> "Darren Wilkinson stresses the rather pragmatic aspects of how fast and/or easy it is to write the code, rather than just the mere runtime.
+
+
+<center></center>
+
+
+R code:
+
+```{r}
+gibbs_r <- function(N, thin) {
+ mat <- matrix(nrow = N, ncol = 2)
+ x <- y <- 0
+
+ for (i in 1:N) {
+ for (j in 1:thin) {
+ x <- rgamma(1, 3, y * y + 4)
+ y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
+ }
+ mat[i, ] <- c(x, y)
+ }
+ mat
+}
+```
+
+Actions to convert R to C++:
+
+- Add type declarations to all variables
+- Use `(` instead of `[` to index into the matrix
+- Subscript the results of `rgamma` and `rnorm` to convert from a vector into a scalar.
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+NumericMatrix gibbs_cpp(int N, int thin) {
+ NumericMatrix mat(N, 2);
+ double x = 0, y = 0;
+
+ for(int i = 0; i < N; i++) {
+ for(int j = 0; j < thin; j++) {
+ x = rgamma(1, 3, 1 / (y * y + 4))[0];
+ y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
+ }
+ mat(i, 0) = x;
+ mat(i, 1) = y;
+ }
+
+ return(mat);
+}
+```
+
+Checking who's best:
+
+```{r}
+bench::mark(
+ gibbs_r(100, 10),
+ gibbs_cpp(100, 10),
+ check = FALSE
+)
+```
+
+### Case study 2: predict a model response from three inputs
+
+[Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP.
+
+Starts with this code:
+
+```{r}
+vacc1a <- function(age, female, ily) {
+ p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
+ p <- p * if (female) 1.25 else 0.75
+ p <- max(0, p)
+ p <- min(1, p)
+ p
+}
+```
+
+R code with a for loop:
+
+```{r}
+vacc1 <- function(age, female, ily) {
+ n <- length(age)
+ out <- numeric(n)
+ for (i in seq_len(n)) {
+ out[i] <- vacc1a(age[i], female[i], ily[i])
+ }
+ out
+}
+```
+
+R code without a for loop:
+
+```{r}
+vacc2 <- function(age, female, ily) {
+ p <- 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily
+ p <- p * ifelse(female, 1.25, 0.75)
+ p <- pmax(0, p)
+ p <- pmin(1, p)
+ p
+}
+```
+
+C++:
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+double vacc3a(double age, bool female, bool ily){
+ double p = 0.25 + 0.3 * 1 / (1 - exp(0.04 * age)) + 0.1 * ily;
+ p = p * (female ? 1.25 : 0.75);
+ p = std::max(p, 0.0);
+ p = std::min(p, 1.0);
+ return p;
+}
+
+// [[Rcpp::export]]
+NumericVector vacc3(NumericVector age, LogicalVector female,
+ LogicalVector ily) {
+ int n = age.size();
+ NumericVector out(n);
+
+ for(int i = 0; i < n; ++i) {
+ out[i] = vacc3a(age[i], female[i], ily[i]);
+ }
+
+ return out;
+}
+```
+
+Sample data:
+
+```{r}
+n <- 1000
+age <- rnorm(n, mean = 50, sd = 10)
+female <- sample(c(T, F), n, rep = TRUE)
+ily <- sample(c(T, F), n, prob = c(0.8, 0.2), rep = TRUE)
+
+stopifnot(
+ all.equal(vacc1(age, female, ily), vacc2(age, female, ily)),
+ all.equal(vacc1(age, female, ily), vacc3(age, female, ily))
+)
+```
+
+<center>**Who's faster?**</center>
+<center></center>
+
+```{r}
+bench::mark(
+ vacc1 = vacc1(age, female, ily),
+ vacc2 = vacc2(age, female, ily),
+ vacc3 = vacc3(age, female, ily)
+)
+```
+
+## Resources
+
+- [Rcpp: Seamless R and C++ Integration](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475)
+- [cpp-tutorial](https://www.learncpp.com/cpp-tutorial/introduction-to-function-parameters-and-arguments/)
+- [cpp-reference](https://en.cppreference.com/w/cpp)
+- A good resource is **Effective STL by Scott Meyers**
+- the STL data structures found in [the container](https://en.cppreference.com/w/cpp/container)
+- [Exposing C++ functions and classes
+with Rcpp modules](https://cran.rstudio.com/web/packages/Rcpp/vignettes/Rcpp-modules.pdf)
+- All gifs are from: https://giphy.com/
## Meeting Videos
@@ -38,9 +532,9 @@
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
-<summary> Meeting chat log </summary>
-```
-LOG
-```
+<summary>Meeting chat log</summary>
+
+ LOG
+
</details>
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -29,6 +29,7 @@ Imports:
profvis,
purrr,
R6,
+ Rcpp,
reshape2,
rlang,
rmarkdown,