Update Chapter 2 notes (Cohort 9) (#62) - bookclub-advr

commit e3cc65958c01c181704c314eae428a3ab8851ae5
parent 66932bacbed6d3d27a9180eaeba310accc0df1ba
Author: Steffi LaZerte <sel@steffilazerte.ca>
Date:   Thu, 30 May 2024 16:34:11 -0400

Update Chapter 2 notes (Cohort 9) (#62)

* Update Chapter 2 notes

* Repair meeting videos section

Something got cut off, and it still had numbers.

---------

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
Diffstat:
M 02_Names_and_values.Rmd  | 461 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------
A images/02-character-2.png  | 0 
A images/02-copy_on_modify_fig2.png  | 0 
A images/02-l-modify-2.png  | 0

4 files changed, 395 insertions(+), 66 deletions(-)
diff --git a/02_Names_and_values.Rmd b/02_Names_and_values.Rmd
@@ -2,176 +2,505 @@
 
 **Learning objectives:**
 
-- test your knowledge
-- identify objects in R (size, id, ...)
+- To understand the distinction between an *object* and its *name*
+- With this knowledge, to be able to write faster code that uses less memory
+- To better understand R's functional programming tools
 
+These notes use the lobstr package throughout.
+```{r}
+library(lobstr)
+```
 
-## Quiz {-}
 
-1. How do I create a new column called "3" that contains the sum of 1 and 2?
+### Quiz {-}
+
+##### 1. How do I create a new column called `3` that contains the sum of `1` and `2`? {-}
+
 ```{r}
 df <- data.frame(runif(3), runif(3))
+names(df) <- c(1, 2)
 df
 ```
 
+```{r}
+df$`3` <- df$`1` + df$`2`
+df
+```
+
+**What makes these names challenging?**
+
+> You need to use backticks (`) when the name of an object is non-syntactic: for example, when it
+> doesn't start with a letter or '.', or when it starts with '.' followed by a number.
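+
+A minimal sketch of the rule above, using only base R. `make.names()` leaves syntactic names alone and rewrites non-syntactic ones, and backticks let you bind and retrieve a non-syntactic name anyway. The example names are made up for illustration.
+
+```{r}
+# Syntactic names come back unchanged; non-syntactic ones are rewritten
+make.names(c("ok", ".ok", "1", ".2"))
+
+# Backticks bind and retrieve a non-syntactic name
+`1` <- "one"
+`1`
+```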
+
+##### 2. How much memory does `y` occupy? {-}
 
 ```{r}
-names(df) <- c(1, 2)
+x <- runif(1e6)
+y <- list(x, x, x)
 ```
+
+Need to use the lobstr package:
 ```{r}
-df$`3`
+lobstr::obj_size(y)
 ```
 
+> Note that the RStudio Environment pane and base R's `object.size()`
+> report 24 MB, because `object.size()` counts the shared vector `x` three times
 
 ```{r}
-df$`3` <- df$`1` + df$`2`
+object.size(y)
+```
 
-df
+##### 3. On which line does `a` get copied in the following example? {-}
+```{r}
+a <- c(1, 5, 3, 2)
+b <- a
+b[[1]] <- 10
 ```
 
+> Not until `b` is modified, on the third line
+
+## Binding basics {-}
+
+- Create values and *bind* a name to them
+- Names have values (rather than values have names)
+- Multiple names can refer to the same values
+- We can look at an object's address to keep track of the values independent of their names
 
-2. How much memory does y occupy?
-hint: use `library(lobstr)` 
 ```{r}
-x <- runif(1e6)
-y <- list(x, x, x)
+x <- c(1, 2, 3)
+y <- x
+obj_addr(x)
+obj_addr(y)
+```
 
-length(y)
 
+### Exercises {-}
+
+##### 1. Explain the relationships {-}
+```{r}
+a <- 1:10
+b <- a
+c <- b
+d <- 1:10
 ```
 
+> `a`, `b`, and `c` are all names that refer to the first value `1:10`
+> 
+> `d` is a name that refers to a *second*, separate value of `1:10`.
 
+
+##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()` {-}
 ```{r}
-library(lobstr)
-lobstr::obj_size(y)
+obj_addr(mean)
+obj_addr(base::mean)
+obj_addr(get("mean"))
+obj_addr(evalq(mean))
+obj_addr(match.fun("mean"))
 ```
 
+> Yes!
+
+## Copy-on-modify {-}
+
+- If you modify a value bound to multiple names, R copies it first ('copy-on-modify')
+- If you modify a value bound to a single name, R changes it directly ('modify-in-place')
+- Use `tracemem()` to see when an object is copied
 
-3. On which line does a get copied in the following example?
 ```{r}
-a <- c(1, 5, 3, 2)
-a
+x <- c(1, 2, 3)
+cat(tracemem(x), "\n")
+```
+
+```{r}
+y <- x
+y[[3]] <- 4L  # Changes (copy-on-modify)
+y[[3]] <- 5L  # Doesn't change (modify-in-place)
 ```
 
+Turn off `tracemem()` with `untracemem()`
+
+> Can also use `ref(x)` to get the address of the value bound to a given name
+
+
+## Functions {-}
+
+- The same copy-on-modify rules apply within functions
+- If you pass `x` to `f()` and return it without modifying it, no copy is made
 
 ```{r}
-b <- a
-b
+f <- function(a) {
+  a
+}
+
+x <- c(1, 2, 3)
+z <- f(x) # No change in value
+
+ref(x)
+ref(z)
 ```
 
+<!-- ![](images/02-trace.png) -->
+
+## Lists {-}
+
+- A list as a whole has its own reference (id)
+- List *elements* also each point to other values
+- A list doesn't store the values themselves, it *stores references to the values*
+- As of R 3.1.0, modifying lists creates a *shallow copy*
+    - References (bindings) are copied, but *values are not*
 
 ```{r}
-b[[1]] <- 10
+l1 <- list(1, 2, 3)
+l2 <- l1
+l2[[3]] <- 4
 ```
 
+- We can use `ref()` to see how they compare
+  - See how the list reference is different
+  - But the first two items in each list are the same
+
 ```{r}
-b
+ref(l1, l2)
 ```
 
+![](images/02-l-modify-2.png){width=50%}
+
+## Data Frames {-}
 
-## Object’s identifier {-}
+- Data frames are lists of vectors
+- So copying and modifying a column *only affects that column*
+- **BUT** if you modify a *row*, every column must be copied
 
 ```{r}
-x <- c(1, 2, 3)
-obj_addr(x)
+d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
+d2 <- d1
+d3 <- d1
 ```
 
+Only the modified column changes
+```{r}
+d2[, 2] <- d2[, 2] * 2
+ref(d1, d2)
+```
+
+All columns change
+```{r}
+d3[1, ] <- d3[1, ] * 3
+ref(d1, d3)
+```
+
+## Character vectors {-}
+
+- R has a **global string pool**
+- Elements of character vectors point to unique strings in the pool
+
+```{r}
+x <- c("a", "a", "abc", "d")
+```
+
+![](images/02-character-2.png)
 
 ## Exercises {-}
 
-Do they all point to the same underlying function object? hint: `lobstr::obj_addr()`
+##### 1. Why is `tracemem(1:10)` not useful? {-}
+
+> Because it tries to trace a value that is not bound to a name
+
+##### 2. Why are there two copies? {-}
 ```{r}
-mean
-base::mean
-get("mean")
-evalq(mean)
-match.fun("mean")
+x <- c(1L, 2L, 3L)
+tracemem(x)
+x[[3]] <- 4
 ```
 
+> Because we convert an *integer* vector (created with `1L`, etc.) to a *double* vector (by assigning a plain `4`).
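+
+A small base R sketch of the type change behind this answer (no `tracemem()` needed). Assigning a double into an integer vector coerces the whole vector, while assigning an integer keeps the type. The names `v` and `w` are arbitrary.
+
+```{r}
+v <- c(1L, 2L, 3L)
+typeof(v)    # "integer"
+v[[3]] <- 4  # `4` is a double, so the whole vector is coerced
+typeof(v)    # "double"
+
+w <- c(1L, 2L, 3L)
+w[[3]] <- 4L # assigning an integer keeps the vector's type
+typeof(w)    # "integer"
+```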
+
+##### 3. What are the relationships among these objects? {-}
 
-## Copy-on-modify {-}
 ```{r}
-x <- c(1, 2, 3)
-cat(tracemem(x), "\n")
-#> <0x7f80c0e0ffc8> 
+a <- 1:10      
+b <- list(a, a)
+c <- list(b, a, 1:10)
 ```
 
-> untracemem() is the opposite of tracemem(); it turns tracing off.
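+- `a` is bound to object 1 (`1:10`)
+- `b` is a list holding two references to object 1
+- `c` is a list holding `b`, another reference to object 1, and a new, separate `1:10`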
 
 ```{r}
-y <- x
+ref(c)
+```
 
-y[[3]] <- 5L
 
+##### 4. What happens here? {-}
+```{r}
+x <- list(1:10)
+x[[2]] <- x
+```
+
+- `x` is a list
+- `x[[2]] <- x` creates a new list, which in turn contains a reference to the 
+  original list
+- `x` is no longer bound to `list(1:10)`
+
+```{r}
+ref(x)
+```
+
+![](images/02-copy_on_modify_fig2.png){width=50%}
+
+## Object Size {-}
+
+- Use `lobstr::obj_size()` 
+- Lists may be smaller than expected because their elements can reference the same value
+- Strings may be smaller than expected because of the global string pool
+- Difficult to predict how big something will be
+  - Can only add sizes together if they share no references in common
+
+### Alternative Representation {-}
+- As of R 3.5.0, R has ALTREP (alternative representation)
+- It represents some vectors compactly
+    - e.g., `1:1000` stores just the endpoints 1 and 1,000, not all 1,000 values
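+
+A quick check of this, assuming lobstr is loaded as at the top of these notes and R is 3.5.0 or later: integer sequences created with `:` should all report the same size, however long they are.
+
+```{r}
+obj_size(1:3)
+obj_size(1:1e3)
+obj_size(1:1e6)
+```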
+
+### Exercises {-}
+
+##### 1. Why are the sizes so different? {-}
+
+```{r}
+y <- rep(list(runif(1e4)), 100)
+
+object.size(y) # ~8000 kB
+obj_size(y)    # ~80   kB
+```
+
+> From `?object.size`: 
+> 
+> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example."
+
+##### 2. Why is the size misleading? {-}
+
+```{r}
+funs <- list(mean, sd, var)
+obj_size(funs)
+```
+
+> Because they reference functions from base and stats, which are always available.
+> Why bother looking at the size? What use is that?
+
+##### 3. Predict the sizes {-}
+
+```{r}
+a <- runif(1e6) # 8 MB
+obj_size(a)
+```
+
+
+```{r}
+b <- list(a, a)
+```
+
+- There is one value ~8MB
+- `a` and `b[[1]]` and `b[[2]]` all point to the same value.
+
+```{r}
+obj_size(b)
+obj_size(a, b)
+```
+
+
+```{r}
+b[[1]][[1]] <- 10
+```
+- Now there are two values ~8MB each (16MB total)
+- `a` and `b[[2]]` point to the same value (8MB)
+- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed
+
+```{r}
+obj_size(b)     # 16 MB (two values, two element references)
+obj_size(a, b)  # 16 MB (a & b[[2]] point to the same value)
+```
+
+
+```{r}
+b[[2]][[1]] <- 10
+```
+- Finally, now there are three values ~8MB each (24MB total)
+- Although `b[[1]]` and `b[[2]]` have the same contents, 
+  they are not references to the same object.
+
+```{r}
+obj_size(b)
+obj_size(a, b)
+```
+
+
+## Modify-in-place {-}
+
+- Modifying usually creates a copy except for
+    - Objects with a single binding (performance optimization)
+    - Environments (special)
+
+### Objects with a single binding {-}
+
+- It's hard to predict whether a copy will occur
+- If an object has 2+ bindings and some are removed, R doesn't track how many remain (so it will still think there is more than one)
+- So R may make a copy even if there's only one binding left
+- Calling a function makes a reference to the object **unless it's a function based on C**
+- Best to use `tracemem()` to check rather than guess.
+
+
+#### Example - lists vs. data frames in for loop {-}
+
+**Setup**  
+
+Create the data to modify
+```{r}
+x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
+medians <- vapply(x, median, numeric(1))
+```
+
+
+**Data frame - Copied every time!**
+```{r}
+cat(tracemem(x), "\n")
+for (i in seq_along(medians)) {
+  x[[i]] <- x[[i]] - medians[[i]]
+}
 untracemem(x)
 ```
 
-## Introduction to functions {-}
+**List (uses internal C code) - Copied once!**
+```{r}
+y <- as.list(x)
 
-How to make a function in r:
-```{r eval=FALSE}
-name <- function(variables) {
-  
+cat(tracemem(y), "\n")
+for (i in seq_along(medians)) {
+  y[[i]] <- y[[i]] - medians[[i]]
 }
+untracemem(y)
 ```
 
+#### Benchmark this (Exercise #2) {-}
+
+**First wrap in a function**
 ```{r}
-f <- function(a) {
-  a
+med <- function(d, medians) {
+  for (i in seq_along(medians)) {
+    d[[i]] <- d[[i]] - medians[[i]]
+  }
 }
+```
 
-x <- c(1, 2, 3)
-cat(tracemem(x), "\n")
-#> <0x7fe1121693a8>
+**Try with 5 columns**
+```{r}
+x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
+medians <- vapply(x, median, numeric(1))
+y <- as.list(x)
+
+bench::mark(
+  "data.frame" = med(x, medians),
+  "list" = med(y, medians)
+)
+```
+
+**Try with 20 columns**
+```{r}
+x <- data.frame(matrix(runif(5 * 1e4), ncol = 20))
+medians <- vapply(x, median, numeric(1))
+y <- as.list(x)
+
+bench::mark(
+  "data.frame" = med(x, medians),
+  "list" = med(y, medians)
+)
+```
 
-z <- f(x)
-# there's no copy here!
+**WOW!**
 
-untracemem(x)
+
+### Environments {-}
+- Always modified in place (**reference semantics**)
+- Interesting because when you modify an environment, all existing bindings to it continue to have the same reference
+- If two names point to the same environment, and you update one, you update both!
+
+```{r}
+e1 <- rlang::env(a = 1, b = 2, c = 3)
+e2 <- e1
+e1$c <- 4
+e2$c
 ```
 
-![](images/02-trace.png)
+- This means that environments can contain themselves (!)
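+
+A minimal sketch of that last point, assuming only base R (`new.env()` rather than `rlang::env()`); the name `e` is arbitrary:
+
+```{r}
+e <- new.env()
+e$self <- e               # the environment now holds a reference to itself
+identical(e, e$self)      # TRUE: both refer to the same environment
+identical(e, e$self$self) # TRUE: the reference can be followed indefinitely
+```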
 
-## Lists {-}
+### Exercises {-}
+
+##### 1. Why isn't this circular? {-}
 ```{r}
-l1 <- list(1, 2, 3)
+x <- list()
+x[[1]] <- x
 ```
 
-## Data Frames {-}
+> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second.
+
+##### 2. (see "Objects with a single binding") {-}
+
+##### 3. What happens if you attempt to use tracemem() on an environment? {-}
+
 ```{r}
-d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
+#| error: true
+e1 <- rlang::env(a = 1, b = 2, c = 3)
+tracemem(e1)
 ```
 
-## Character vectors {-}
+> It throws an error: because environments are always modified in place, there's no point in tracing them
+
+
+## Unbinding and the garbage collector {-}
+
+- If you delete the 'name' bound to an object, the object still exists
+- R runs a "garbage collector" (GC) to remove these objects when it needs more memory
+- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try."
+- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed
+- You can force GC with `gc()`, but you never need to do so in order to use more memory *within* R
+- The only reasons to do so are to free memory for other system software, or to get the
+message printed about how much memory is being used
+
 ```{r}
-x <- c("a", "a", "abc", "d")
+gc()
+mem_used()
 ```
 
+- These numbers will **not** match what your OS tells you because:
+  1. They include objects created by R, but not the R interpreter itself
+  2. R and the OS are lazy and don't reclaim/release memory until it's needed
+  3. R counts memory from objects, but there are gaps left by those that are deleted -> 
+  *memory fragmentation* [less memory is actually available than you might think]
 
 
-## Meeting Videos
+## Meeting Videos {-}
 
-### Cohort 1
+### Cohort 1 {-}
 
 (no video recorded)
 
-### Cohort 2
+### Cohort 2 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/pCiNj2JRK50")`
 
-### Cohort 3
+### Cohort 3 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/-bEXdOoxO_E")`
 
-### Cohort 4
+### Cohort 4 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/gcVU_F-L6zY")`
 
-### Cohort 5
+### Cohort 5 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/aqcvKox9V0Q")`
 
-### Cohort 6
+### Cohort 6 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/O4Oo_qO7SIY")`
 
@@ -190,7 +519,7 @@ x <- c("a", "a", "abc", "d")
 ```
 </details>
 
-### Cohort 7
+### Cohort 7 {-}
 
 `r knitr::include_url("https://www.youtube.com/embed/kpAUoGO6elE")`
 
diff --git a/images/02-character-2.png b/images/02-character-2.png
Binary files differ.
diff --git a/images/02-copy_on_modify_fig2.png b/images/02-copy_on_modify_fig2.png
Binary files differ.
diff --git a/images/02-l-modify-2.png b/images/02-l-modify-2.png
Binary files differ.

	bookclub-advr DSLC Advanced R Book Club
	git clone https://git.eamoncaddigan.net/bookclub-advr.git