commit 802c1533e2ed1c0736cb9aade05099ae82f83df7
parent 011e94299c79a179e804427680c9786d18651635
Author: Nick Giangreco <nick.giangreco@gmail.com>
Date: Mon, 18 Aug 2025 07:53:01 -0400
Advanced R book club - Chapter 2 document edits (#87)
* first pass at restructuring slide headers to be a more compact presentation experience
* finished revising chapter 2 slides
* Update slides/02.qmd
make LO's declarative
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* Update slides/02.qmd
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
* using assertion slide making technique
* fix spelling and shorten titles
* Tweak 02 slides and repair shared metadata.
---------
Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
Diffstat:
4 files changed, 148 insertions(+), 358 deletions(-)
diff --git a/_freeze/slides/02/execute-results/html.json b/_freeze/slides/02/execute-results/html.json
@@ -1,15 +1,19 @@
{
- "hash": "a11b52aa373c64a45cd125fdb1d36946",
+ "hash": "387178242b89e67960838bbc8e8752bc",
"result": {
"engine": "knitr",
- "markdown": "---\nengine: knitr\ntitle: Names and values\n---\n\n## Learning objectives\n\n- To be able to understand distinction between an *object* and its *name*\n- With this knowledge, to be able write faster code using less memory\n- To better understand R's functional programming tools\n\nUsing lobstr package here.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lobstr)\n```\n:::\n\n\n\n## Quiz\n\n### 1. How do I create a new column called `3` that contains the sum of `1` and `2`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(runif(3), runif(3))\nnames(df) <- c(1, 2)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 1 2\n#> 1 0.8893205 0.9874973\n#> 2 0.4645398 0.7004741\n#> 3 0.7312149 0.2986040\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$`3` <- df$`1` + df$`2`\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 1 2 3\n#> 1 0.8893205 0.9874973 1.876818\n#> 2 0.4645398 0.7004741 1.165014\n#> 3 0.7312149 0.2986040 1.029819\n```\n\n\n:::\n:::\n\n\n**What makes these names challenging?**\n\n> You need to use backticks (`) when the name of an object doesn't start with a \n> a character or '.' [or . followed by a number] (non-syntactic names).\n\n### 2. How much memory does `y` occupy?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e6)\ny <- list(x, x, x)\n```\n:::\n\n\nNeed to use the lobstr package:\n\n::: {.cell}\n\n```{.r .cell-code}\nlobstr::obj_size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n> Note that if you look in the RStudio Environment or use R base `object.size()`\n> you actually get a value of 24 MB\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobject.size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 24000224 bytes\n```\n\n\n:::\n:::\n\n\n### 3. 
On which line does `a` get copied in the following example?\n\n::: {.cell}\n\n```{.r .cell-code}\na <- c(1, 5, 3, 2)\nb <- a\nb[[1]] <- 10\n```\n:::\n\n\n> Not until `b` is modified, the third line\n\n## Binding basics\n\n- Create values and *bind* a name to them\n- Names have values (rather than values have names)\n- Multiple names can refer to the same values\n- We can look at an object's address to keep track of the values independent of their names\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1, 2, 3)\ny <- x\nobj_addr(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a58503acd8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a58503acd8\"\n```\n\n\n:::\n:::\n\n\n\n### Exercises\n\n##### 1. Explain the relationships\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10\nb <- a\nc <- b\nd <- 1:10\n```\n:::\n\n\n> `a` `b` and `c` are all names that refer to the first value `1:10`\n> \n> `d` is a name that refers to the *second* value of `1:10`.\n\n\n##### 2. Do the following all point to the same underlying function object? 
hint: `lobstr::obj_addr()`\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_addr(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(base::mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(get(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(evalq(mean))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(match.fun(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n:::\n\n\n> Yes!\n\n## Copy-on-modify\n\n- If you modify a value bound to multiple names, it is 'copy-on-modify'\n- If you modify a value bound to a single name, it is 'modify-in-place'\n- Use `tracemem()` to see when a name's value changes\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1, 2, 3)\ncat(tracemem(x), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A585CC0FF8>\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- x\ny[[3]] <- 4L # Changes (copy-on-modify)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a585cc0ff8 -> 0x000002a58600d5e8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\ny[[3]] <- 5L # Doesn't change (modify-in-place)\n```\n:::\n\n\nTurn off `tracemem()` with `untracemem()`\n\n> Can also use `ref(x)` to get the address of the value bound to a given name\n\n\n## Functions\n\n- Copying also applies within functions\n- If you copy 
(but don't modify) `x` within `f()`, no copy is made\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) {\n a\n}\n\nx <- c(1, 2, 3)\nz <- f(x) # No change in value\n\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x2a58669dd18] <dbl>\n```\n\n\n:::\n\n```{.r .cell-code}\nref(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x2a58669dd18] <dbl>\n```\n\n\n:::\n:::\n\n\n<!--  -->\n\n## Lists\n\n- A list overall, has it's own reference (id)\n- List *elements* also each point to other values\n- List doesn't store the value, it *stores a reference to the value*\n- As of R 3.1.0, modifying lists creates a *shallow copy*\n - References (bindings) are copied, but *values are not*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl1 <- list(1, 2, 3)\nl2 <- l1\nl2[[3]] <- 4\n```\n:::\n\n\n- We can use `ref()` to see how they compare\n - See how the list reference is different\n - But first two items in each list are the same\n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(l1, l2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586f2e698] <list> \n#> ├─[2:0x2a5877133b8] <dbl> \n#> ├─[3:0x2a5877131f8] <dbl> \n#> └─[4:0x2a587713038] <dbl> \n#> \n#> █ [5:0x2a586fc3098] <list> \n#> ├─[2:0x2a5877133b8] \n#> ├─[3:0x2a5877131f8] \n#> └─[6:0x2a58770fc78] <dbl>\n```\n\n\n:::\n:::\n\n\n{width=50%}\n\n## Data Frames\n\n- Data frames are lists of vectors\n- So copying and modifying a column *only affects that column*\n- **BUT** if you modify a *row*, every column must be copied\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))\nd2 <- d1\nd3 <- d1\n```\n:::\n\n\nOnly the modified column changes\n\n::: {.cell}\n\n```{.r .cell-code}\nd2[, 2] <- d2[, 2] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a584931608] <df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] <dbl> \n#> └─y = [3:0x2a57f3b9c78] <dbl> \n#> \n#> █ [4:0x2a5810eb508] <df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] \n#> └─y = 
[5:0x2a57feb2058] <dbl>\n```\n\n\n:::\n:::\n\n\nAll columns change\n\n::: {.cell}\n\n```{.r .cell-code}\nd3[1, ] <- d3[1, ] * 3\nref(d1, d3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a584931608] <df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] <dbl> \n#> └─y = [3:0x2a57f3b9c78] <dbl> \n#> \n#> █ [4:0x2a57faa92c8] <df[,2]> \n#> ├─x = [5:0x2a585a91b38] <dbl> \n#> └─y = [6:0x2a585a91ae8] <dbl>\n```\n\n\n:::\n:::\n\n\n## Character vectors\n\n- R has a **global string pool**\n- Elements of character vectors point to unique strings in the pool\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"a\", \"a\", \"abc\", \"d\")\n```\n:::\n\n\n\n\n## Exercises\n\n##### 1. Why is `tracemem(1:10)` not useful?\n\n> Because it tries to trace a value that is not bound to a name\n\n##### 2. Why are there two copies?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1L, 2L, 3L)\ntracemem(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"<000002A5856391C8>\"\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[3]] <- 4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a5856391c8 -> 0x000002a585653f08]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a585653f08 -> 0x000002a58663f8b8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n:::\n\n\n> Because we convert an *integer* vector (using 1L, etc.) 
to a *double* vector (using just 4)- \n\n##### 3. What is the relationships among these objects?\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10 \nb <- list(a, a)\nc <- list(b, a, 1:10) # \n```\n:::\n\n\na <- obj 1 \nb <- obj 1, obj 1 \nc <- b(obj 1, obj 1), obj 1, 1:10 \n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586fc3ea8] <list> \n#> ├─█ [2:0x2a585a1a308] <list> \n#> │ ├─[3:0x2a585aa1c40] <int> \n#> │ └─[3:0x2a585aa1c40] \n#> ├─[3:0x2a585aa1c40] \n#> └─[4:0x2a585b13d90] <int>\n```\n\n\n:::\n:::\n\n\n\n##### 4. What happens here?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(1:10)\nx[[2]] <- x\n```\n:::\n\n\n- `x` is a list\n- `x[[2]] <- x` creates a new list, which in turn contains a reference to the \n original list\n- `x` is no longer bound to `list(1:10)`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586172508] <list> \n#> ├─[2:0x2a58641c040] <int> \n#> └─█ [3:0x2a586b06c48] <list> \n#> └─[2:0x2a58641c040]\n```\n\n\n:::\n:::\n\n\n{width=50%}\n\n## Object Size\n\n- Use `lobstr::obj_size()` \n- Lists may be smaller than expected because of referencing the same value\n- Strings may be smaller than expected because using global string pool\n- Difficult to predict how big something will be\n - Can only add sizes together if they share no references in common\n\n### Alternative Representation\n- As of R 3.5.0 - ALTREP\n- Represent some vectors compactly\n - e.g., 1:1000 - not 10,000 values, just 1 and 1,000\n\n### Exercises\n\n##### 1. 
Why are the sizes so different?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- rep(list(runif(1e4)), 100)\n\nobject.size(y) # ~8000 kB\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8005648 bytes\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(y) # ~80 kB\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 80.90 kB\n```\n\n\n:::\n:::\n\n\n> From `?object.size()`: \n> \n> \"This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example.\n\n##### 2. Why is the size misleading?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfuns <- list(mean, sd, var)\nobj_size(funs)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 18.76 kB\n```\n\n\n:::\n:::\n\n\n> Because they reference functions from base and stats, which are always available.\n> Why bother looking at the size? What use is that?\n\n##### 3. Predict the sizes\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- runif(1e6) # 8 MB\nobj_size(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb <- list(a, a)\n```\n:::\n\n\n- There is one value ~8MB\n- `a` and `b[[1]]` and `b[[2]]` all point to the same value.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[1]][[1]] <- 10\n```\n:::\n\n- Now there are two values ~8MB each (16MB total)\n- `a` and `b[[2]]` point to the same value (8MB)\n- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b) # 16 MB (two values, two element references)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b) 
# 16 MB (a & b[[2]] point to the same value)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[2]][[1]] <- 10\n```\n:::\n\n- Finally, now there are three values ~8MB each (24MB total)\n- Although `b[[1]]` and `b[[2]]` have the same contents, \n they are not references to the same object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 24.00 MB\n```\n\n\n:::\n:::\n\n\n\n## Modify-in-place\n\n- Modifying usually creates a copy except for\n - Objects with a single binding (performance optimization)\n - Environments (special)\n\n### Objects with a single binding\n\n- Hard to know if copy will occur\n- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one)\n- May make a copy even if there's only one binding left\n- Using a function makes a reference to it **unless it's a function based on C**\n- Best to use `tracemem()` to check rather than guess.\n\n\n#### Example - lists vs. 
data frames in for loop\n\n**Setup** \n\nCreate the data to modify\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\n```\n:::\n\n\n\n**Data frame - Copied every time!**\n\n::: {.cell}\n\n```{.r .cell-code}\ncat(tracemem(x), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A587857268>\n```\n\n\n:::\n\n```{.r .cell-code}\nfor (i in seq_along(medians)) {\n x[[i]] <- x[[i]] - medians[[i]]\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a587857268 -> 0x000002a584b5de78]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5de78 -> 0x000002a584b5d2a8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d2a8 -> 0x000002a584b5d238]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d238 -> 0x000002a584b5d1c8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart 
withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d1c8 -> 0x000002a584b5d158]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d158 -> 0x000002a584b5d0e8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d0e8 -> 0x000002a584b5d078]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d078 -> 0x000002a584b5d008]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d008 -> 0x000002a584bbfea8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList 
doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584bbfea8 -> 0x000002a584bbfe38]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\nuntracemem(x)\n```\n:::\n\n\n**List (uses internal C code) - Copied once!**\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- as.list(x)\n\ncat(tracemem(y), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A584B16388>\n```\n\n\n:::\n\n```{.r .cell-code}\nfor (i in seq_along(medians)) {\n y[[i]] <- y[[i]] - medians[[i]]\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a584b16388 -> 0x000002a582d8fea8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\nuntracemem(y)\n```\n:::\n\n\n#### Benchmark this (Exercise #2)\n\n**First wrap in a function**\n\n::: {.cell}\n\n```{.r .cell-code}\nmed <- function(d, medians) {\n for (i in seq_along(medians)) {\n d[[i]] <- d[[i]] - medians[[i]]\n }\n}\n```\n:::\n\n\n**Try with 5 columns**\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" 
= med(y, medians)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#> expression min median `itr/sec` mem_alloc `gc/sec`\n#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n#> 1 data.frame 52.3µs 68.2µs 13411. 410KB 201.\n#> 2 list 16.2µs 33µs 28621. 391KB 279.\n```\n\n\n:::\n:::\n\n\n**Try with 20 columns**\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 20))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" = med(y, medians)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#> expression min median `itr/sec` mem_alloc `gc/sec`\n#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n#> 1 data.frame 143.8µs 189.7µs 4722. 400KB 50.1\n#> 2 list 25.6µs 39.7µs 24419. 392KB 243.\n```\n\n\n:::\n:::\n\n\n**WOW!**\n\n\n### Environmments\n- Always modified in place (**reference semantics**)\n- Interesting because if you modify the environment, all existing bindings have the same reference\n- If two names point to the same environment, and you update one, you update both!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ne1 <- rlang::env(a = 1, b = 2, c = 3)\ne2 <- e1\ne1$c <- 4\ne2$c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 4\n```\n\n\n:::\n:::\n\n\n- This means that environments can contain themselves (!)\n\n### Exercises\n\n##### 1. Why isn't this circular?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list()\nx[[1]] <- x\n```\n:::\n\n\n> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second.\n\n##### 2. (see \"Objects with a single binding\")\n\n##### 3. 
What happens if you attempt to use tracemem() on an environment?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ne1 <- rlang::env(a = 1, b = 2, c = 3)\ntracemem(e1)\n```\n\n::: {.cell-output .cell-output-error}\n\n```\n#> Error in tracemem(e1): 'tracemem' is not useful for promise and environment objects\n```\n\n\n:::\n:::\n\n\n> Because environments always modified in place, there's no point in tracing them\n\n\n## Unbinding and the garbage collector\n\n- If you delete the 'name' bound to an object, the object still exists\n- R runs a \"garbage collector\" (GC) to remove these objects when it needs more memory\n- \"Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try.\"\n- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed\n- You can force GC with `gc()` but you never need to to use more memory *within* R\n- Only reason to do so is to free memory for other system software, or, to get the\nmessage printed about how much memory is being used\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> used (Mb) gc trigger (Mb) max used (Mb)\n#> Ncells 805637 43.1 1486050 79.4 1486050 79.4\n#> Vcells 4532584 34.6 10146329 77.5 10146315 77.5\n```\n\n\n:::\n\n```{.r .cell-code}\nmem_used()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 81.38 MB\n```\n\n\n:::\n:::\n\n\n- These numbers will **not** be what you OS tells you because, \n 1. It includes objects created by R, but not R interpreter\n 2. R and OS are lazy and don't reclaim/release memory until it's needed\n 3. R counts memory from objects, but there are gaps due to those that are deleted -> \n *memory fragmentation* [less memory actually available they you might think]\n",
+ "markdown": "---\nengine: knitr\ntitle: Names and values\n---\n\n## Learning objectives\n\n- Distinguish between an *object* and its *name*.\n- Identify when data are *copied* versus *modified*.\n- Trace and identify the memory used by R.\n\nThe `{lobstr}` package will help us throughout the chapter\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lobstr)\n```\n:::\n\n\n\n## Syntactic names are easier to create and work with than non-syntactic names\n\n\n- Syntactic names: `my_variable`, `x`, `cpp11`, `.by`.\n - Can't use names in `?Reserved`\n\n- Non-syntactic names need to be surrounded in backticks. \n\n## Names are *bound to* values with `<-`\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- c(1, 2, 3)\na\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n:::\n\n\n## Many names can be bound to the same values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb <- a\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n:::\n\n\n## If shared values are modified, the object is copied to a new address\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[1]] <- 5\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226fda7278\"\n```\n\n\n:::\n:::\n\n\n## Memory addresses can differ even if objects seem the same\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10\nb <- a\nc <- 1:10\n\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271c9e7b0\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 
\"0x22271c9e7b0\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271d77708\"\n```\n\n\n:::\n:::\n\n\n## Functions have a single address regardless of how they're referenced\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_addr(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(base::mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(get(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n:::\n\n\n## Unlike most objects, environments keep the same memory address on modify\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd <- new.env()\nobj_addr(d)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\ne <- d\ne[['a']] <- 1\nobj_addr(e)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(d)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nd[['a']]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n:::\n\n\n## Use `tracemem` to validate if values are copied or modified\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(10)\ntracemem(x)\n#> [1] \"<000001F4185B4B08>\"\ny <- x\nx[[1]] <- 10\n#> tracemem[0x000001f4185b4b08 -> 0x000001f4185b4218]:\nuntracemem(x)\n```\n:::\n\n\n## `tracemem` shows internal C code minimizes copying\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- as.list(x)\ntracemem(y)\n#> [1] \"<000001AD67FDCD38>\"\nmedians <- vapply(x, median, numeric(1))\nfor (i in 1:5) {\n y[[i]] <- y[[i]] - medians[[i]]\n}\n#> tracemem[0x000001ad67fdcd38 -> 0x000001ad61982638]:\nuntracemem(y)\n```\n:::\n\n\n## A function's environment follows copy-on-modify rules\n\n:::: 
{.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) {\n a\n}\n\nx <- c(1, 2, 3)\nz <- f(x) # No change in value\n\nobj_addr(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bdb8738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(z) # No address change \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bdb8738\"\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n\n:::\n\n::::\n\n::: notes\n- Diagrams will be explained more in chapter 7.\n- `a` points to same address as `x`.\n- If `a` modified inside function, `z` would have new address.\n:::\n\n\n## `ref()` shows the memory address of a list and its *elements*\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nl1 <- list(1, 2, 3)\nobj_addr(l1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226c315db8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nl2 <- l1\nl2[[3]] <- 4\nref(l1, l2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2226c315db8] <list> \n#> ├─[2:0x2226c730078] <dbl> \n#> ├─[3:0x2226c75cdc8] <dbl> \n#> └─[4:0x2226c75cc08] <dbl> \n#> \n#> █ [5:0x2226c36cd68] <list> \n#> ├─[2:0x2226c730078] \n#> ├─[3:0x2226c75cdc8] \n#> └─[6:0x2226c75b318] <dbl>\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n{width=50%}\n:::\n\n::::\n\n## Since dataframes are lists of (column) vectors, mutating a column modifies only that column\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))\nd2 <- d1\nd2[, 2] <- d2[, 2] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2226cca93c8] <df[,2]> \n#> ├─x = [2:0x22272216148] <dbl> \n#> └─y = [3:0x222722160f8] <dbl> \n#> \n#> █ [4:0x2226ce0cf48] <df[,2]> \n#> ├─x = [2:0x22272216148] \n#> └─y = [5:0x222727b0c38] <dbl>\n```\n\n\n:::\n:::\n\n\n## Since dataframes are lists of (column) vectors, mutating a row modifies the value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 
3))\nd2 <- d1\nd2[1, ] <- d2[1, ] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x22272912588] <df[,2]> \n#> ├─x = [2:0x222730bd408] <dbl> \n#> └─y = [3:0x222730bd3b8] <dbl> \n#> \n#> █ [4:0x22272b0e548] <df[,2]> \n#> ├─x = [5:0x222731501d8] <dbl> \n#> └─y = [6:0x22273150188] <dbl>\n```\n\n\n:::\n:::\n\n\n::: notes\n- Here \"mutate\" means \"change\", not `dplyr::mutate()`\n:::\n\n## Characters are unique due to the global string pool\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x22272f78460] <int>\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- 1:4\nref(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x222730899d8] <int>\n```\n\n\n:::\n\n```{.r .cell-code}\nx <- c(\"a\", \"a\", \"b\")\nref(x, character = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2227394f1d8] <chr> \n#> ├─[2:0x22267d9b118] <string: \"a\"> \n#> ├─[2:0x22267d9b118] \n#> └─[3:0x2226e17f3b8] <string: \"b\">\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- c(\"a\")\nref(y, character = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2227397fce8] <chr> \n#> └─[2:0x22267d9b118] <string: \"a\">\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n\n:::\n\n::::\n\n::: notes\n- \"a\" is always at the same address.\n- Each member of character vector has its own address (kind of list-like).\n:::\n\n## Memory amount can also be measured, using `lobstr::obj_size`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbanana <- \"bananas bananas bananas\"\nobj_addr(banana)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271bc25b8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(banana)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 136 B\n```\n\n\n:::\n:::\n\n\n## Alternative Representation or ALTREPs represent vector values efficiently\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\nobj_size(x)\n```\n\n::: 
{.cell-output .cell-output-stdout}\n\n```\n#> 680 B\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- 1:10000\nobj_size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 680 B\n```\n\n\n:::\n:::\n\n\n## We can measure memory & speed using `bench::mark()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmed <- function(d, medians) {\n for (i in seq_along(medians)) {\n d[[i]] <- d[[i]] - medians[[i]]\n }\n}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" = med(y, medians)\n)[, c(\"min\", \"median\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 3\n#> min median mem_alloc\n#> <bch:tm> <bch:tm> <bch:byt>\n#> 1 52.7µs 66.6µs 491KB\n#> 2 22.8µs 33.7µs 391KB\n```\n\n\n:::\n:::\n\n\n::: notes\n- The thing to see: list version uses less RAM and is faster\n:::\n\n## The garbage collector `gc()` explicitly clears out unbound objects\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:3\nx <- 2:4 # \"1:3\" is orphaned\nrm(x) # \"2:4\" is orphaned\ngc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> used (Mb) gc trigger (Mb) max used (Mb)\n#> Ncells 791104 42.3 1505464 80.5 1505464 80.5\n#> Vcells 1497631 11.5 8388608 64.0 8388482 64.0\n```\n\n\n:::\n\n```{.r .cell-code}\nlobstr::mem_used() # Wrapper around gc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 56.29 MB\n```\n\n\n:::\n:::\n\n\n::: aside\n`gc()` runs automatically, never *need* to call\n:::\n\n::: notes\n- `mem_used()` multiplies Ncells \"used\" by either 28 (32-bit architecture) or 56 (64-bit architecture)., and Vcells \"used\" by 8, adds them, and converts to Mb.\n:::",
"supporting": [
"02_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
- "includes": {},
+ "includes": {
+ "include-after-body": [
+ "\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n"
+ ]
+ },
"engineDependencies": {},
"preserve": {},
"postProcess": true
diff --git a/slides/.gitignore b/slides/.gitignore
@@ -0,0 +1 @@
+*_files/
+\ No newline at end of file
diff --git a/slides/02.qmd b/slides/02.qmd
@@ -5,132 +5,114 @@ title: Names and values
## Learning objectives
-- To be able to understand distinction between an *object* and its *name*
-- With this knowledge, to be able write faster code using less memory
-- To better understand R's functional programming tools
+- Distinguish between an *object* and its *name*.
+- Identify when data are *copied* versus *modified*.
+- Trace and identify the memory used by R.
+
+The `{lobstr}` package will help us throughout the chapter.
-Using lobstr package here.
```{r}
library(lobstr)
```
-## Quiz
-
-### 1. How do I create a new column called `3` that contains the sum of `1` and `2`?
+## Syntactic names are easier to create and work with than non-syntactic names
-```{r}
-df <- data.frame(runif(3), runif(3))
-names(df) <- c(1, 2)
-df
-```
-```{r}
-df$`3` <- df$`1` + df$`2`
-df
-```
+- Syntactic names: `my_variable`, `x`, `cpp11`, `.by`.
+ - Can't use names in `?Reserved`
-**What makes these names challenging?**
+- Non-syntactic names need to be surrounded by backticks.
-> You need to use backticks (`) when the name of an object doesn't start with a
-> a character or '.' [or . followed by a number] (non-syntactic names).
-
-### 2. How much memory does `y` occupy?
+## Names are *bound to* values with `<-`
```{r}
-x <- runif(1e6)
-y <- list(x, x, x)
+a <- c(1, 2, 3)
+a
+obj_addr(a)
```
-Need to use the lobstr package:
-```{r}
-lobstr::obj_size(y)
-```
-
-> Note that if you look in the RStudio Environment or use R base `object.size()`
-> you actually get a value of 24 MB
+## Many names can be bound to the same values
```{r}
-object.size(y)
-```
-
-### 3. On which line does `a` get copied in the following example?
-```{r}
-a <- c(1, 5, 3, 2)
b <- a
-b[[1]] <- 10
+obj_addr(a)
+obj_addr(b)
```
-> Not until `b` is modified, the third line
-
-## Binding basics
-
-- Create values and *bind* a name to them
-- Names have values (rather than values have names)
-- Multiple names can refer to the same values
-- We can look at an object's address to keep track of the values independent of their names
+## If shared values are modified, the object is copied to a new address
```{r}
-x <- c(1, 2, 3)
-y <- x
-obj_addr(x)
-obj_addr(y)
+b[[1]] <- 5
+obj_addr(a)
+obj_addr(b)
```
+## Memory addresses can differ even if objects seem the same
-### Exercises
-
-##### 1. Explain the relationships
```{r}
a <- 1:10
b <- a
-c <- b
-d <- 1:10
-```
+c <- 1:10
-> `a` `b` and `c` are all names that refer to the first value `1:10`
->
-> `d` is a name that refers to the *second* value of `1:10`.
+obj_addr(a)
+obj_addr(b)
+obj_addr(c)
+```
+## Functions have a single address regardless of how they're referenced
-##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()`
```{r}
obj_addr(mean)
obj_addr(base::mean)
obj_addr(get("mean"))
-obj_addr(evalq(mean))
-obj_addr(match.fun("mean"))
```
-> Yes!
-
-## Copy-on-modify
-
-- If you modify a value bound to multiple names, it is 'copy-on-modify'
-- If you modify a value bound to a single name, it is 'modify-in-place'
-- Use `tracemem()` to see when a name's value changes
+## Unlike most objects, environments keep the same memory address on modify
```{r}
-x <- c(1, 2, 3)
-cat(tracemem(x), "\n")
+d <- new.env()
+obj_addr(d)
+e <- d
+e[['a']] <- 1
+obj_addr(e)
+obj_addr(d)
+d[['a']]
```
+## Use `tracemem()` to check whether values are copied or modified in place
+
```{r}
+#| eval: false
+x <- runif(10)
+tracemem(x)
+#> [1] "<000001F4185B4B08>"
y <- x
-y[[3]] <- 4L # Changes (copy-on-modify)
-y[[3]] <- 5L # Doesn't change (modify-in-place)
+x[[1]] <- 10
+#> tracemem[0x000001f4185b4b08 -> 0x000001f4185b4218]:
+untracemem(x)
```
-Turn off `tracemem()` with `untracemem()`
-
-> Can also use `ref(x)` to get the address of the value bound to a given name
+## `tracemem` shows internal C code minimizes copying
+```{r}
+#| eval: false
+y <- as.list(x)
+tracemem(y)
+#> [1] "<000001AD67FDCD38>"
+medians <- vapply(x, median, numeric(1))
+for (i in 1:5) {
+ y[[i]] <- y[[i]] - medians[[i]]
+}
+#> tracemem[0x000001ad67fdcd38 -> 0x000001ad61982638]:
+untracemem(y)
+```
-## Functions
+## A function's environment follows copy-on-modify rules
-- Copying also applies within functions
-- If you copy (but don't modify) `x` within `f()`, no copy is made
+:::: {.columns}
+::: {.column}
```{r}
f <- function(a) {
a
@@ -139,264 +121,119 @@ f <- function(a) {
x <- c(1, 2, 3)
z <- f(x) # No change in value
-ref(x)
-ref(z)
+obj_addr(x)
+obj_addr(z) # No address change
```
+:::
-<!--  -->
+::: {.column}
+
+:::
-## Lists
+::::
-- A list overall, has it's own reference (id)
-- List *elements* also each point to other values
-- List doesn't store the value, it *stores a reference to the value*
-- As of R 3.1.0, modifying lists creates a *shallow copy*
- - References (bindings) are copied, but *values are not*
+::: notes
+- Diagrams will be explained more in chapter 7.
+- `a` points to the same address as `x`.
+- If `a` were modified inside the function, `z` would have a new address.
+:::
+
+## `ref()` shows the memory address of a list and its *elements*
+
+:::: {.columns}
+
+::: {.column}
```{r}
l1 <- list(1, 2, 3)
+obj_addr(l1)
l2 <- l1
l2[[3]] <- 4
-```
-
-- We can use `ref()` to see how they compare
- - See how the list reference is different
- - But first two items in each list are the same
-
-```{r}
ref(l1, l2)
```
+:::
+::: {.column}
{width=50%}
+:::
-## Data Frames
+::::
-- Data frames are lists of vectors
-- So copying and modifying a column *only affects that column*
-- **BUT** if you modify a *row*, every column must be copied
+## Since dataframes are lists of (column) vectors, mutating a column modifies only that column
```{r}
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
d2 <- d1
-d3 <- d1
-```
-
-Only the modified column changes
-```{r}
d2[, 2] <- d2[, 2] * 2
ref(d1, d2)
```
-All columns change
-```{r}
-d3[1, ] <- d3[1, ] * 3
-ref(d1, d3)
-```
-
-## Character vectors
-
-- R has a **global string pool**
-- Elements of character vectors point to unique strings in the pool
+## Since dataframes are lists of (column) vectors, mutating a row copies every column
```{r}
-x <- c("a", "a", "abc", "d")
-```
-
-
-
-## Exercises
-
-##### 1. Why is `tracemem(1:10)` not useful?
-
-> Because it tries to trace a value that is not bound to a name
-
-##### 2. Why are there two copies?
-```{r}
-x <- c(1L, 2L, 3L)
-tracemem(x)
-x[[3]] <- 4
-```
-
-> Because we convert an *integer* vector (using 1L, etc.) to a *double* vector (using just 4)-
-
-##### 3. What is the relationships among these objects?
-
-```{r}
-a <- 1:10
-b <- list(a, a)
-c <- list(b, a, 1:10) #
-```
-
-a <- obj 1
-b <- obj 1, obj 1
-c <- b(obj 1, obj 1), obj 1, 1:10
-
-```{r}
-ref(c)
+d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
+d2 <- d1
+d2[1, ] <- d2[1, ] * 2
+ref(d1, d2)
```
+::: notes
+- Here "mutate" means "change", not `dplyr::mutate()`
+:::
-##### 4. What happens here?
-```{r}
-x <- list(1:10)
-x[[2]] <- x
-```
+## Characters are unique due to the global string pool
-- `x` is a list
-- `x[[2]] <- x` creates a new list, which in turn contains a reference to the
- original list
-- `x` is no longer bound to `list(1:10)`
+:::: {.columns}
+::: {.column}
```{r}
+x <- 1:4
ref(x)
+y <- 1:4
+ref(y)
+x <- c("a", "a", "b")
+ref(x, character = TRUE)
+y <- c("a")
+ref(y, character = TRUE)
```
+:::
-{width=50%}
-
-## Object Size
-
-- Use `lobstr::obj_size()`
-- Lists may be smaller than expected because of referencing the same value
-- Strings may be smaller than expected because using global string pool
-- Difficult to predict how big something will be
- - Can only add sizes together if they share no references in common
-
-### Alternative Representation
-- As of R 3.5.0 - ALTREP
-- Represent some vectors compactly
- - e.g., 1:1000 - not 10,000 values, just 1 and 1,000
-
-### Exercises
-
-##### 1. Why are the sizes so different?
-
-```{r}
-y <- rep(list(runif(1e4)), 100)
-
-object.size(y) # ~8000 kB
-obj_size(y) # ~80 kB
-```
-
-> From `?object.size()`:
->
-> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example.
-
-##### 2. Why is the size misleading?
-
-```{r}
-funs <- list(mean, sd, var)
-obj_size(funs)
-```
-
-> Because they reference functions from base and stats, which are always available.
-> Why bother looking at the size? What use is that?
-
-##### 3. Predict the sizes
-
-```{r}
-a <- runif(1e6) # 8 MB
-obj_size(a)
-```
-
-
-```{r}
-b <- list(a, a)
-```
-
-- There is one value ~8MB
-- `a` and `b[[1]]` and `b[[2]]` all point to the same value.
-
-```{r}
-obj_size(b)
-obj_size(a, b)
-```
-
-
-```{r}
-b[[1]][[1]] <- 10
-```
-- Now there are two values ~8MB each (16MB total)
-- `a` and `b[[2]]` point to the same value (8MB)
-- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed
-
-```{r}
-obj_size(b) # 16 MB (two values, two element references)
-obj_size(a, b) # 16 MB (a & b[[2]] point to the same value)
-```
-
-
-```{r}
-b[[2]][[1]] <- 10
-```
-- Finally, now there are three values ~8MB each (24MB total)
-- Although `b[[1]]` and `b[[2]]` have the same contents,
- they are not references to the same object.
-
-```{r}
-obj_size(b)
-obj_size(a, b)
-```
-
-
-## Modify-in-place
-
-- Modifying usually creates a copy except for
- - Objects with a single binding (performance optimization)
- - Environments (special)
-
-### Objects with a single binding
-
-- Hard to know if copy will occur
-- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one)
-- May make a copy even if there's only one binding left
-- Using a function makes a reference to it **unless it's a function based on C**
-- Best to use `tracemem()` to check rather than guess.
+::: {.column}
+
+:::
+::::
-#### Example - lists vs. data frames in for loop
+::: notes
+- "a" is always at the same address.
+- Each element of a character vector has its own address (kind of list-like).
+:::
-**Setup**
+## Memory use can also be measured with `lobstr::obj_size()`
-Create the data to modify
```{r}
-x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
-medians <- vapply(x, median, numeric(1))
+banana <- "bananas bananas bananas"
+obj_addr(banana)
+obj_size(banana)
```
+## Alternative representations (ALTREP) store some vectors compactly
-**Data frame - Copied every time!**
-```{r}
-cat(tracemem(x), "\n")
-for (i in seq_along(medians)) {
- x[[i]] <- x[[i]] - medians[[i]]
-}
-untracemem(x)
-```
-
-**List (uses internal C code) - Copied once!**
```{r}
-y <- as.list(x)
-
-cat(tracemem(y), "\n")
-for (i in seq_along(medians)) {
- y[[i]] <- y[[i]] - medians[[i]]
-}
-untracemem(y)
+x <- 1:10
+obj_size(x)
+y <- 1:10000
+obj_size(y)
```
-#### Benchmark this (Exercise #2)
+## We can measure memory & speed using `bench::mark()`
-**First wrap in a function**
```{r}
med <- function(d, medians) {
for (i in seq_along(medians)) {
d[[i]] <- d[[i]] - medians[[i]]
}
}
-```
-
-**Try with 5 columns**
-```{r}
x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
@@ -404,78 +241,27 @@ y <- as.list(x)
bench::mark(
"data.frame" = med(x, medians),
"list" = med(y, medians)
-)
-```
-
-**Try with 20 columns**
-```{r}
-x <- data.frame(matrix(runif(5 * 1e4), ncol = 20))
-medians <- vapply(x, median, numeric(1))
-y <- as.list(x)
-
-bench::mark(
- "data.frame" = med(x, medians),
- "list" = med(y, medians)
-)
+)[, c("min", "median", "mem_alloc")]
```
-**WOW!**
+::: notes
+- Key takeaway: the list version uses less memory and runs faster.
+:::
-
-### Environmments
-- Always modified in place (**reference semantics**)
-- Interesting because if you modify the environment, all existing bindings have the same reference
-- If two names point to the same environment, and you update one, you update both!
-
-```{r}
-e1 <- rlang::env(a = 1, b = 2, c = 3)
-e2 <- e1
-e1$c <- 4
-e2$c
-```
-
-- This means that environments can contain themselves (!)
-
-### Exercises
-
-##### 1. Why isn't this circular?
-```{r}
-x <- list()
-x[[1]] <- x
-```
-
-> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second.
-
-##### 2. (see "Objects with a single binding")
-
-##### 3. What happens if you attempt to use tracemem() on an environment?
-
-```{r}
-#| error: true
-e1 <- rlang::env(a = 1, b = 2, c = 3)
-tracemem(e1)
-```
-
-> Because environments always modified in place, there's no point in tracing them
-
-
-## Unbinding and the garbage collector
-
-- If you delete the 'name' bound to an object, the object still exists
-- R runs a "garbage collector" (GC) to remove these objects when it needs more memory
-- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try."
-- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed
-- You can force GC with `gc()` but you never need to to use more memory *within* R
-- Only reason to do so is to free memory for other system software, or, to get the
-message printed about how much memory is being used
+## The garbage collector `gc()` explicitly clears out unbound objects
```{r}
+x <- 1:3
+x <- 2:4 # "1:3" is orphaned
+rm(x) # "2:4" is orphaned
gc()
-mem_used()
+lobstr::mem_used() # Wrapper around gc()
```
-- These numbers will **not** be what you OS tells you because,
- 1. It includes objects created by R, but not R interpreter
- 2. R and OS are lazy and don't reclaim/release memory until it's needed
- 3. R counts memory from objects, but there are gaps due to those that are deleted ->
- *memory fragmentation* [less memory actually available they you might think]
+::: aside
+`gc()` runs automatically, so you never *need* to call it.
+:::
+
+::: notes
+- `mem_used()` multiplies Ncells "used" by either 28 (32-bit architecture) or 56 (64-bit architecture) and Vcells "used" by 8, adds them, and converts to MB.
+:::
+\ No newline at end of file
diff --git a/slides/_metadata.yml b/slides/_metadata.yml
@@ -7,10 +7,7 @@ format:
incremental: false
execute:
eval: true
- r:
- echo: true
- mermaid:
- echo: false
+ echo: true
knitr:
opts_chunk:
comment: "#>"