commit 1294434bc010f048d1ede6b11cee225bbdaef060
parent ba2aa53434f47791c352a422080d0642c0856e29
Author: DrEntropy <DrEntropy@users.noreply.github.com>
Date: Tue, 13 Jun 2023 04:11:53 -0700
notes for the last chapter ! (#54)
* Made changes for this cohort
* Add silly meme
Diffstat:
3 files changed, 249 insertions(+), 108 deletions(-)
diff --git a/25_Rewriting_R_code_in_C++.Rmd b/25_Rewriting_R_code_in_C++.Rmd
@@ -2,17 +2,19 @@
**Learning objectives:**
-- how to improve performance by rewriting key functions in C++
-- how to use [{Rcpp} package](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475) (with key contributions by Doug Bates, John Chambers, and JJ Allaire)
-- how to check who's faster
+- Learn to improve performance by rewriting bottlenecks in C++
+
+- Introduction to the [{Rcpp} package](https://www.rcpp.org/)
## Introduction
-In this chapter we'll learn how to rewrite **R** code in **C++** for making it faster! We'll use the **Rcpp package** which provides **API for comparison**.
+In this chapter we'll learn how to rewrite **R** code in **C++** to make it faster using the **Rcpp package**. The **Rcpp** package makes it simple to connect C++ to R! With C++ you can fix:
-A closer look at C++ will provide an overview of the language and its key conventions to focus on the differences with R. We will also have look at the standard template library STL, the C++ library to use, which provides a set of extremely useful data structures and algorithms.
+- Loops that can't be easily vectorised because subsequent iterations depend on previous ones.
-A very interesting part will involve comparing the two codes benchmarking the two implementations yields with `bench::mark()`.
+- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
+
+- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queue
<center>Like how?</center>
@@ -20,15 +22,7 @@ A very interesting part will involve comparing the two codes benchmarking the tw
<center></center>
-## What C++ can handle
-
-The C language was originally implemented by Dennis Ritchie for Linux at the end of '70s. C++ is the object oriented version of C.
-
-- Loops that can't be easily vectorised because subsequent iterations depend on previous ones.
-
-- Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
-
-- Problems that require advanced data structures and algorithms that R doesn't provide. Through the **standard template library (STL)**, C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.
+
## Getting started with C++
@@ -42,33 +36,8 @@ Install a C++ compiler:
- Xcode, on Mac
- Sudo apt-get install r-base-dev or similar, on Linux.
-### C++ conventions
-
-<center> </center>
-
-<center> </center>
-
-<center></center>
-
-
-
-- Use = for assignment, not \<-.
-- Scalars and vectors are different:
- - scalar equivalents of numeric, integer, character,
- - logical vectors are: double, int, String, and bool.
-- explicitly use a `return` statement to return a value from a function.
-- Every statement is terminated by a ;
-- The for statement has a different syntax: for(init; check; increment)
-- vector indices start at 0
-- methods are called with .
-- total += x[i] is equivalent to total = total + x[i].
-- in-place operators are -=, \*=, and /=
-- uses pow(), not \^, for exponentiation
-- comment block: /\*\*\* R \# This is R code \*/
-## Examples with the cppFunction function
-
-### Make the sum function
+### First example {-}
Rcpp compiling the C++ code:
@@ -83,30 +52,23 @@ add
add(1, 2, 3)
```
-### Build a simple numerical function without arguments
+Some things to note:
-In R:
-```{r}
-one <- function() 1L
-one()+100
-```
-
-In C++:
+- The syntax to create a function is different.
+- Types of inputs and outputs must be explicitly declared
+- Use = for assignment, not `<-`.
+- Every statement is terminated by a ;
+- C++ has it's own name for the types we are used to:
+ - scalar types are `int`, `double`, `bool` and `String`
+ - vector types (for Rcpp) are `IntegerVector`, `NumericVector`, `LogicalVector` and `CharacterVector`
+ - Other R types are available in C++: `List`, `Function`, `DataFrame`, and more.
+
+- Explicitly use a `return` statement to return a value from a function.
- int one() {
- return 1;
- }
+
-Translation:
-
-```{r}
-cppFunction('int one() {
- return 1;
-}')
-```
-
-### The sign function
+## Example with scalar input and output {-}
```{r}
signR <- function(x) {
@@ -139,7 +101,9 @@ cppFunction('int signC(int x) {
}')
```
-### Sum of a sequence: sumR vs sumC
+* Note that the `if` syntax is identical! Not everything is different!
+
+## Vector Input, Scalar output:{-}
```{r}
sumR <- function(x) {
@@ -167,6 +131,15 @@ cppFunction('double sumC(NumericVector x) {
}')
```
+Some observations:
+
+- vector indices *start at 0*
+- The for statement has a different syntax: for(init; check; increment)
+- Methods are called with `.`
+- `total += x[i]` is equivalent to `total = total + x[i]`.
+- other in-place operators are `-=`, `*=`, `and /=`
+
+
To check for the fastest way we can use:
```{r eval=FALSE}
@@ -182,7 +155,7 @@ bench::mark(
)
```
-### Euclidean distance: pdistR versus pdistC
+## Vector input and output {-}
```{r}
pdistR <- function(x, ys) {
@@ -202,6 +175,8 @@ cppFunction('NumericVector pdistC(double x, NumericVector ys) {
}')
```
+Note: uses `pow()`, not `^`, for exponentiation
+
```{r}
y <- runif(1e6)
bench::mark(
@@ -210,37 +185,37 @@ bench::mark(
)[1:6]
```
-## Source your C++ code
+## Source your C++ code {-}
Source stand-alone C++ files into R using `sourceCpp()`
-<center> </center>
-
-<center> </center>
-
-<center></center>
-
C++ files have extension `.cpp`
-```{r eval=FALSE}
+```
#include <Rcpp.h>
using namespace Rcpp;
```
And for each function that you want available within R, you need to prefix it with:
-```{r eval=FALSE}
+```
// [[Rcpp::export]]
```
-To call the files:
+Inside a cpp file you can include `R` code using special comments
+
+```
+/*** R
+rcode here
+*/
+```
+
-- in R use `source(echo = TRUE)`
-- in C++ use `sourceCpp("path/to/file.cpp")`
-### Example
+### Example {-}
+This block in Rmarkdown uses `{Rcpp}` as a short hand for engine = "Rcpp".
```{Rcpp}
#include <Rcpp.h>
@@ -266,9 +241,23 @@ bench::mark(
*/
```
+NOTE: For some reason although the r code above runs, `knit` doesn't include the output. Why?
+
+```{r}
+x <- runif(1e5)
+bench::mark(
+ mean(x),
+ meanC(x)
+)
+```
+
+
+
## Data frames, functions, and attributes
-Example of Data frames
+### Lists and Dataframes {-}
+
+Contrived example to illustrate how to access a dataframe from c++:
```{Rcpp}
#include <Rcpp.h>
@@ -295,42 +284,193 @@ mod <- lm(mpg ~ wt, data = mtcars)
mpe(mod)
```
+- Note that you must *cast* the values to the required type. C++ needs to know the types in advance.
+
+### Functions {-}
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+RObject callWithOne(Function f) {
+ return f(1);
+}
+```
+
+
+```{r}
+callWithOne(function(x) x + 1)
+```
+
+
+* Other values can be accessed from c++ including
+
+ * attributes (use: `.attr()`. Also `.names()` is alias for name attribute.
+ * `Environment`, `DottedPair`, `Language`, `Symbol` , etc.
+
## Missing values
-Dealing with missing values can differs if:
+### Missing values behave differently for C++ scalers{-}
+
+* Scalar NA's in Cpp : `NA_LOGICAL`, `NA_INTEGER`, `NA_REAL`, `NA_STRING`.
+
+* Integers (`int`) stores R NA's as the smallest integer. Better to use length 1 `IntegerVector`
+* Doubles use IEEE 754 NaN , which behaves a bit differently for logical expressions (but ok for math expressions).
+
+```{r}
+evalCpp("NA_REAL || FALSE")
+```
+
+* Strings are a class from Rcpp, so they handle missing values fine.
-- scalars:
- - integers
- - doubles
-- strings
-- Boolean
-- vectors
+* `bool` can only hold two values, so be careful. Consider using vectors of length 1 or coercing to `int`
+
+
+### Vectors
+
+* Vectors are all type introduced by RCpp and know how to handle missing values if you use the specific type for that vector.
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+List missing_sampler() {
+ return List::create(
+ NumericVector::create(NA_REAL),
+ IntegerVector::create(NA_INTEGER),
+ LogicalVector::create(NA_LOGICAL),
+ CharacterVector::create(NA_STRING)
+ );
+}
+```
+
+```{r}
+str(missing_sampler())
+```
## Standard Template Library
-STL is the fundamental library in C++, it provides a set of useful data structures and algorithms.
+STL provides powerful data structures and algorithms for C++.
+
+### Iterators {-}
+
+Iterators are used extensively in the STL to abstract away details of underlying data structures.
+
+If you an iterator `it`, you can:
+
+- Get the value by 'dereferencing' with `*it`
+- Advance to the next value with `++it`
+- Compare iterators (locations) with `==`
+
+
+### Algorithms {-}
+* The real power of iterators comes from using them with STL algorithms.
+
+* A good reference is [https://en.cppreference.com/w/cpp/algorithm]
-- Using iterators, the next step up from basic loops:
- - NumericVector::iterator
- - LogicalVector::iterator
- - CharacterVector::iterator
-- Algorithms
-- Data structures:
- - vector
- - unordered_set
- - unordered_map
-- Vectors
-- Map
+* Book provides examples using `accumulate` and `upper_buond`
-A good resource is **Effective STL by Scott Meyers**.
-And one more about the STL data structures is [the container](https://en.cppreference.com/w/cpp/container)
+* Another Example:
+
+```{Rcpp}
+
+#include <algorithm>
+#include <Rcpp.h>
+
+using namespace Rcpp;
+
+
+// Explicit iterator version
+
+// [[Rcpp::export]]
+NumericVector square_C_it(NumericVector x){
+ NumericVector out(x.size());
+ // Each container has its own iterator type
+ NumericVector::iterator in_it;
+ NumericVector::iterator out_it;
+
+ for(in_it = x.begin(), out_it = out.begin(); in_it != x.end(); ++in_it, ++out_it) {
+ *out_it = pow(*in_it,2);
+ }
+
+ return out;
+
+}
+
+
+// Use algorithm 'transform'
+
+// [[Rcpp::export]]
+NumericVector square_C(NumericVector x) {
+
+ NumericVector out(x.size());
+
+
+ std::transform(x.begin(),x.end(), out.begin(),
+ [](double v) -> double { return v*v; });
+ return out;
+}
+```
+
+```{r}
+square_C(c(1.0,2.0,3.0))
+```
+```{r}
+square_C_it(c(1.0,2.0,3.0))
+```
+
+## Data Structures {-}
+
+STL provides a large set of data structures. Some of the most important:
+
+* `std::vector` - like an `R` vector, except knows how to grow efficiently
+
+* `std::unordered_set` - unique set of values. Ordered version `std::set`. Unordered is more efficient.
+
+* `std::map` - Moslty similar to `R` lists, provide an association between a key and a value. There is also an unordered version.
+
+A quick example illustrating the `map`:
+
+```{Rcpp}
+#include <Rcpp.h>
+using namespace Rcpp;
+
+// [[Rcpp::export]]
+std::map<double, int> tableC(NumericVector x) {
+ // Note the types are <key, value>
+ std::map<double, int> counts;
+
+ int n = x.size();
+ for (int i = 0; i < n; i++) {
+ counts[x[i]]++;
+ }
+
+ return counts;
+}
+```
+
+
+```{r}
+res = tableC(c(1,1,2,1,4,5))
+res
+```
+
+* Note that the map is converted to a named vector in this case on return
+
+
+To learn more about the STL data structures see [containers](https://en.cppreference.com/w/cpp/container) at `cppreference`
## Case Studies
+
+
Real life uses of C++ to replace slow R code.
-### Case study 1: Gibbs sampler
+
+## Case study 1: Gibbs sampler {-}
The [Gibbs sampler](https://en.wikipedia.org/wiki/Gibbs_sampling) is a method for estimating parameters expectations. It is a **MCMC algorithm** that has been adapted to sample from multidimensional target distributions. Gibbs sampling generates a **Markov chain** of samples, each of which is correlated with nearby samples.
@@ -398,7 +538,7 @@ bench::mark(
)
```
-### Case study 2: predict a model response from three inputs
+## Case study 2: predict a model response from three inputs {-}
[Rcpp is smoking fast for agent based models in data frames](https://gweissman.github.io/post/rcpp-is-smoking-fast-for-agent-based-models-in-data-frames/) by Gary Weissman, MD, MSHP.
@@ -427,7 +567,7 @@ vacc1 <- function(age, female, ily) {
}
```
-R code without a for loop:
+Vectorized R code:
```{r}
vacc2 <- function(age, female, ily) {
@@ -494,14 +634,15 @@ bench::mark(
## Resources
-- [Rcpp: Seamless R and C++ Integration](https://www.jstatsoft.org/index.php/jss/article/view/v040i08/475)
-- [cpp-tutorial](https://www.learncpp.com/cpp-tutorial/introduction-to-function-parameters-and-arguments/)
+- [Rcpp: Seamless R and C++ Integration](https:\\Rcpp.org)
+- [cpp-tutorial](https://www.learncpp.com) is often recommended. Lots of ads though!
- [cpp-reference](https://en.cppreference.com/w/cpp)
-- A good resource is **Effective STL by Scott Meyers**
-- the STL data structures found in [the container](https://en.cppreference.com/w/cpp/container)
-- [Exposing C++ functions and classes
-with Rcpp modules](https://cran.rstudio.com/web/packages/Rcpp/vignettes/Rcpp-modules.pdf)
-- All gifs are from: https://giphy.com/
+- [C++20 for Programmers](https://www.pearson.com/en-us/subject-catalog/p/c20-for-programmers-an-objects-natural-approach/P200000000211/9780137570461) is a newer book that covers modern c++ for people who know programming in another language.
+
+## Op Success!
+
+
+
## Meeting Videos
diff --git a/images/case_study.jpg b/images/case_study.jpg
Binary files differ.
diff --git a/images/we-did-it-celebration-meme.jpg b/images/we-did-it-celebration-meme.jpg
Binary files differ.