bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git
Log | Files | Refs | README | LICENSE

commit 1294434bc010f048d1ede6b11cee225bbdaef060
parent ba2aa53434f47791c352a422080d0642c0856e29
Author: DrEntropy <DrEntropy@users.noreply.github.com>
Date:   Tue, 13 Jun 2023 04:11:53 -0700

notes for the last chapter ! (#54)

* Made changes for this cohort

* Add silly meme
Diffstat:
M25_Rewriting_R_code_in_C++.Rmd | 357+++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
Aimages/case_study.jpg | 0
Aimages/we-did-it-celebration-meme.jpg | 0
3 files changed, 249 insertions(+), 108 deletions(-)

diff --git a/25_Rewriting_R_code_in_C++.Rmd b/25_Rewriting_R_code_in_C++.Rmd @@ -2,17 +2,19 @@ **Learning objectives:** -- how to improve performance by rewriting key functions in C++ -- how to use [{Rcpp} package](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) (with key contributions by Doug Bates, John Chambers, and JJ Allaire) -- how to check who's faster +- Learn to improve performance by rewriting bottlenecks in C++ + +- Introduction to the [{Rcpp} package](https://www.rcpp.org/) ## Introduction -In this chapter we'll learn how to rewrite **R** code in **C++** for making it faster! We'll use the **Rcpp package** which provides **API for comparison**. +In this chapter we'll learn how to rewrite **R** code in **C++** to make it faster using the **Rcpp package**. The **Rcpp** package makes it simple to connect C++ to R! With C++ you can fix: -A closer look at C++ will provide an overview of the language and its key conventions to focus on the differences with R. We will also have look at the standard template library STL, the C++ library to use, which provides a set of extremely useful data structures and algorithms. +- Loops that can't be easily vectorised because subsequent iterations depend on previous ones. -A very interesting part will involve comparing the two codes benchmarking the two implementations yields with `bench::mark()`. +- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R. + +- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queue <center>Like how?</center> @@ -20,15 +22,7 @@ A very interesting part will involve comparing the two codes benchmarking the tw <center>![](https://media.giphy.com/media/vLyZk5CJo12Wk/giphy.gif)</center> -## What C++ can handle - -The C language was originally implemented by Dennis Ritchie for Linux at the end of '70s. C++ is the object oriented version of C. - -- Loops that can't be easily vectorised because subsequent iterations depend on previous ones. - -- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R. - -- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues. + ## Getting started with C++ @@ -42,33 +36,8 @@ Install a C++ compiler: - Xcode, on Mac - Sudo apt-get install r-base-dev or similar, on Linux. -### C++ conventions - -<center> </center> - -<center> </center> - -<center>![](https://media.giphy.com/media/xT9KVfk0D9TEQmRwUU/giphy.gif)</center> - - - -- Use = for assignment, not \<-. -- Scalars and vectors are different: - - scalar equivalents of numeric, integer, character, - - logical vectors are: double, int, String, and bool. -- explicitly use a `return` statement to return a value from a function. -- Every statement is terminated by a ; -- The for statement has a different syntax: for(init; check; increment) -- vector indices start at 0 -- methods are called with . -- total += x[i] is equivalent to total = total + x[i]. -- in-place operators are -=, \*=, and /= -- uses pow(), not \^, for exponentiation -- comment block: /\*\*\* R \# This is R code \*/ -## Examples with the cppFunction function - -### Make the sum function +### First example {-} Rcpp compiling the C++ code: @@ -83,30 +52,23 @@ add add(1, 2, 3) ``` -### Build a simple numerical function without arguments +Some things to note: -In R: -```{r} -one <- function() 1L -one()+100 -``` - -In C++: +- The syntax to create a function is different. +- Types of inputs and outputs must be explicitly declared +- Use = for assignment, not `<-`. +- Every statement is terminated by a ; +- C++ has it's own name for the types we are used to: + - scalar types are `int`, `double`, `bool` and `String` + - vector types (for Rcpp) are `IntegerVector`, `NumericVector`, `LogicalVector` and `CharacterVector` + - Other R types are available in C++: `List`, `Function`, `DataFrame`, and more. + +- Explicitly use a `return` statement to return a value from a function. - int one() { - return 1; - } + -Translation: - -```{r} -cppFunction('int one() { - return 1; -}') -``` - -### The sign function +## Example with scalar input and output {-} ```{r} signR <- function(x) { @@ -139,7 +101,9 @@ cppFunction('int signC(int x) { }') ``` -### Sum of a sequence: sumR vs sumC +* Note that the `if` syntax is identical! Not everything is different! + +## Vector Input, Scalar output:{-} ```{r} sumR <- function(x) { @@ -167,6 +131,15 @@ cppFunction('double sumC(NumericVector x) { }') ``` +Some observations: + +- vector indices *start at 0* +- The for statement has a different syntax: for(init; check; increment) +- Methods are called with `.` +- `total += x[i]` is equivalent to `total = total + x[i]`. +- other in-place operators are `-=`, `*=`, `and /=` + + To check for the fastest way we can use: ```{r eval=FALSE} @@ -182,7 +155,7 @@ bench::mark( ) ``` -### Euclidean distance: pdistR versus pdistC +## Vector input and output {-} ```{r} pdistR <- function(x, ys) { @@ -202,6 +175,8 @@ cppFunction('NumericVector pdistC(double x, NumericVector ys) { }') ``` +Note: uses `pow()`, not `^`, for exponentiation + ```{r} y <- runif(1e6) bench::mark( @@ -210,37 +185,37 @@ bench::mark( )[1:6] ``` -## Source your C++ code +## Source your C++ code {-} Source stand-alone C++ files into R using `sourceCpp()` -<center> </center> - -<center> </center> - -<center>![](https://media.giphy.com/media/I0qyuvxbdzzYc8INP7/giphy.gif)</center> - C++ files have extension `.cpp` -```{r eval=FALSE} +``` #include <Rcpp.h> using namespace Rcpp; ``` And for each function that you want available within R, you need to prefix it with: -```{r eval=FALSE} +``` // [[Rcpp::export]] ``` -To call the files: +Inside a cpp file you can include `R` code using special comments + +``` +/*** R +rcode here +*/ +``` + -- in R use `source(echo = TRUE)` -- in C++ use `sourceCpp("path/to/file.cpp")` -### Example +### Example {-} +This block in Rmarkdown uses `{Rcpp}` as a short hand for engine = "Rcpp". ```{Rcpp} #include <Rcpp.h> @@ -266,9 +241,23 @@ bench::mark( */ ``` +NOTE: For some reason although the r code above runs, `knit` doesn't include the output. Why? + +```{r} +x <- runif(1e5) +bench::mark( + mean(x), + meanC(x) +) +``` + + + ## Data frames, functions, and attributes -Example of Data frames +### Lists and Dataframes {-} + +Contrived example to illustrate how to access a dataframe from c++: ```{Rcpp} #include <Rcpp.h> @@ -295,42 +284,193 @@ mod <- lm(mpg ~ wt, data = mtcars) mpe(mod) ``` +- Note that you must *cast* the values to the required type. C++ needs to know the types in advance. + +### Functions {-} + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +RObject callWithOne(Function f) { + return f(1); +} +``` + + +```{r} +callWithOne(function(x) x + 1) +``` + + +* Other values can be accessed from c++ including + + * attributes (use: `.attr()`. Also `.names()` is alias for name attribute. + * `Environment`, `DottedPair`, `Language`, `Symbol` , etc. + ## Missing values -Dealing with missing values can differs if: +### Missing values behave differently for C++ scalers{-} + +* Scalar NA's in Cpp : `NA_LOGICAL`, `NA_INTEGER`, `NA_REAL`, `NA_STRING`. + +* Integers (`int`) stores R NA's as the smallest integer. Better to use length 1 `IntegerVector` +* Doubles use IEEE 754 NaN , which behaves a bit differently for logical expressions (but ok for math expressions). + +```{r} +evalCpp("NA_REAL || FALSE") +``` + +* Strings are a class from Rcpp, so they handle missing values fine. -- scalars: - - integers - - doubles -- strings -- Boolean -- vectors +* `bool` can only hold two values, so be careful. Consider using vectors of length 1 or coercing to `int` + + +### Vectors + +* Vectors are all type introduced by RCpp and know how to handle missing values if you use the specific type for that vector. + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +List missing_sampler() { + return List::create( + NumericVector::create(NA_REAL), + IntegerVector::create(NA_INTEGER), + LogicalVector::create(NA_LOGICAL), + CharacterVector::create(NA_STRING) + ); +} +``` + +```{r} +str(missing_sampler()) +``` ## Standard Template Library -STL is the fundamental library in C++, it provides a set of useful data structures and algorithms. +STL provides powerful data structures and algorithms for C++. + +### Iterators {-} + +Iterators are used extensively in the STL to abstract away details of underlying data structures. + +If you an iterator `it`, you can: + +- Get the value by 'dereferencing' with `*it` +- Advance to the next value with `++it` +- Compare iterators (locations) with `==` + + +### Algorithms {-} +* The real power of iterators comes from using them with STL algorithms. + +* A good reference is [https://en.cppreference.com/w/cpp/algorithm] -- Using iterators, the next step up from basic loops: - - NumericVector::iterator - - LogicalVector::iterator - - CharacterVector::iterator -- Algorithms -- Data structures: - - vector - - unordered_set - - unordered_map -- Vectors -- Map +* Book provides examples using `accumulate` and `upper_buond` -A good resource is **Effective STL by Scott Meyers**. -And one more about the STL data structures is [the container](https://en.cppreference.com/w/cpp/container) +* Another Example: + +```{Rcpp} + +#include <algorithm> +#include <Rcpp.h> + +using namespace Rcpp; + + +// Explicit iterator version + +// [[Rcpp::export]] +NumericVector square_C_it(NumericVector x){ + NumericVector out(x.size()); + // Each container has its own iterator type + NumericVector::iterator in_it; + NumericVector::iterator out_it; + + for(in_it = x.begin(), out_it = out.begin(); in_it != x.end(); ++in_it, ++out_it) { + *out_it = pow(*in_it,2); + } + + return out; + +} + + +// Use algorithm 'transform' + +// [[Rcpp::export]] +NumericVector square_C(NumericVector x) { + + NumericVector out(x.size()); + + + std::transform(x.begin(),x.end(), out.begin(), + [](double v) -> double { return v*v; }); + return out; +} +``` + +```{r} +square_C(c(1.0,2.0,3.0)) +``` +```{r} +square_C_it(c(1.0,2.0,3.0)) +``` + +## Data Structures {-} + +STL provides a large set of data structures. Some of the most important: + +* `std::vector` - like an `R` vector, except knows how to grow efficiently + +* `std::unordered_set` - unique set of values. Ordered version `std::set`. Unordered is more efficient. + +* `std::map` - Moslty similar to `R` lists, provide an association between a key and a value. There is also an unordered version. + +A quick example illustrating the `map`: + +```{Rcpp} +#include <Rcpp.h> +using namespace Rcpp; + +// [[Rcpp::export]] +std::map<double, int> tableC(NumericVector x) { + // Note the types are <key, value> + std::map<double, int> counts; + + int n = x.size(); + for (int i = 0; i < n; i++) { + counts[x[i]]++; + } + + return counts; +} +``` + + +```{r} +res = tableC(c(1,1,2,1,4,5)) +res +``` + +* Note that the map is converted to a named vector in this case on return + + +To learn more about the STL data structures see [containers](https://en.cppreference.com/w/cpp/container) at `cppreference` ## Case Studies +![Case Study](images/case_study.jpg) + Real life uses of C++ to replace slow R code. -### Case study 1: Gibbs sampler + +## Case study 1: Gibbs sampler {-} The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples. @@ -398,7 +538,7 @@ bench::mark( ) ``` -### Case study 2: predict a model response from three inputs +## Case study 2: predict a model response from three inputs {-} [Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP. @@ -427,7 +567,7 @@ vacc1 <- function(age, female, ily) { } ``` -R code without a for loop: +Vectorized R code: ```{r} vacc2 <- function(age, female, ily) { @@ -494,14 +634,15 @@ bench::mark( ## Resources -- [Rcpp: Seamless R and C++ Integration](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) -- [cpp-tutorial](https://www.learncpp.com/cpp-tutorial/introduction-to-function-parameters-and-arguments/) +- [Rcpp: Seamless R and C++ Integration](https:\\Rcpp.org) +- [cpp-tutorial](https://www.learncpp.com) is often recommended. Lots of ads though! - [cpp-reference](https://en.cppreference.com/w/cpp) -- A good resource is **Effective STL by Scott Meyers** -- the STL data structures found in [the container](https://en.cppreference.com/w/cpp/container) -- [Exposing C++ functions and classes -with Rcpp modules](https://cran.rstudio.com/web/packages/Rcpp/vignettes/Rcpp-modules.pdf) -- All gifs are from: https://giphy.com/ +- [C++20 for Programmers](https://www.pearson.com/en-us/subject-catalog/p/c20-for-programmers-an-objects-natural-approach/P200000000211/9780137570461) is a newer book that covers modern c++ for people who know programming in another language. + +## Op Success! + +![Congrats!](images/we-did-it-celebration-meme.jpg) + ## Meeting Videos diff --git a/images/case_study.jpg b/images/case_study.jpg Binary files differ. diff --git a/images/we-did-it-celebration-meme.jpg b/images/we-did-it-celebration-meme.jpg Binary files differ.