bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git

commit 802c1533e2ed1c0736cb9aade05099ae82f83df7
parent 011e94299c79a179e804427680c9786d18651635
Author: Nick Giangreco <nick.giangreco@gmail.com>
Date:   Mon, 18 Aug 2025 07:53:01 -0400

Advanced R book club - Chapter 2 document edits (#87)

* first pass at restructuring slide headers to be a more compact presentation experience

* finished revising chapter 2 slides

* Update slides/02.qmd

make LO's declarative

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* Update slides/02.qmd

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>

* using assertion slide making technique

* fix spelling and shorten titles

* Tweak 02 slides and repair shared metadata.

---------

Co-authored-by: Jon Harmon <jonthegeek@gmail.com>
Diffstat:
M _freeze/slides/02/execute-results/html.json | 10 +++++++---
A slides/.gitignore | 2 ++
M slides/02.qmd | 489 +++++++++++++++++++++++--------------------------------------------------------
M slides/_metadata.yml | 5 +----
4 files changed, 148 insertions(+), 358 deletions(-)

diff --git a/_freeze/slides/02/execute-results/html.json b/_freeze/slides/02/execute-results/html.json @@ -1,15 +1,19 @@ { - "hash": "a11b52aa373c64a45cd125fdb1d36946", + "hash": "387178242b89e67960838bbc8e8752bc", "result": { "engine": "knitr", - "markdown": "---\nengine: knitr\ntitle: Names and values\n---\n\n## Learning objectives\n\n- To be able to understand distinction between an *object* and its *name*\n- With this knowledge, to be able write faster code using less memory\n- To better understand R's functional programming tools\n\nUsing lobstr package here.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lobstr)\n```\n:::\n\n\n\n## Quiz\n\n### 1. How do I create a new column called `3` that contains the sum of `1` and `2`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(runif(3), runif(3))\nnames(df) <- c(1, 2)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 1 2\n#> 1 0.8893205 0.9874973\n#> 2 0.4645398 0.7004741\n#> 3 0.7312149 0.2986040\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$`3` <- df$`1` + df$`2`\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 1 2 3\n#> 1 0.8893205 0.9874973 1.876818\n#> 2 0.4645398 0.7004741 1.165014\n#> 3 0.7312149 0.2986040 1.029819\n```\n\n\n:::\n:::\n\n\n**What makes these names challenging?**\n\n> You need to use backticks (`) when the name of an object doesn't start with a \n> a character or '.' [or . followed by a number] (non-syntactic names).\n\n### 2. 
How much memory does `y` occupy?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(1e6)\ny <- list(x, x, x)\n```\n:::\n\n\nNeed to use the lobstr package:\n\n::: {.cell}\n\n```{.r .cell-code}\nlobstr::obj_size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n> Note that if you look in the RStudio Environment or use R base `object.size()`\n> you actually get a value of 24 MB\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobject.size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 24000224 bytes\n```\n\n\n:::\n:::\n\n\n### 3. On which line does `a` get copied in the following example?\n\n::: {.cell}\n\n```{.r .cell-code}\na <- c(1, 5, 3, 2)\nb <- a\nb[[1]] <- 10\n```\n:::\n\n\n> Not until `b` is modified, the third line\n\n## Binding basics\n\n- Create values and *bind* a name to them\n- Names have values (rather than values have names)\n- Multiple names can refer to the same values\n- We can look at an object's address to keep track of the values independent of their names\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1, 2, 3)\ny <- x\nobj_addr(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a58503acd8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a58503acd8\"\n```\n\n\n:::\n:::\n\n\n\n### Exercises\n\n##### 1. Explain the relationships\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10\nb <- a\nc <- b\nd <- 1:10\n```\n:::\n\n\n> `a` `b` and `c` are all names that refer to the first value `1:10`\n> \n> `d` is a name that refers to the *second* value of `1:10`.\n\n\n##### 2. Do the following all point to the same underlying function object? 
hint: `lobstr::obj_addr()`\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_addr(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(base::mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(get(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(evalq(mean))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(match.fun(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2a5828bf738\"\n```\n\n\n:::\n:::\n\n\n> Yes!\n\n## Copy-on-modify\n\n- If you modify a value bound to multiple names, it is 'copy-on-modify'\n- If you modify a value bound to a single name, it is 'modify-in-place'\n- Use `tracemem()` to see when a name's value changes\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1, 2, 3)\ncat(tracemem(x), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A585CC0FF8>\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- x\ny[[3]] <- 4L # Changes (copy-on-modify)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a585cc0ff8 -> 0x000002a58600d5e8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\ny[[3]] <- 5L # Doesn't change (modify-in-place)\n```\n:::\n\n\nTurn off `tracemem()` with `untracemem()`\n\n> Can also use `ref(x)` to get the address of the value bound to a given name\n\n\n## Functions\n\n- Copying also applies within functions\n- If you copy 
(but don't modify) `x` within `f()`, no copy is made\n\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) {\n a\n}\n\nx <- c(1, 2, 3)\nz <- f(x) # No change in value\n\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x2a58669dd18] <dbl>\n```\n\n\n:::\n\n```{.r .cell-code}\nref(z)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x2a58669dd18] <dbl>\n```\n\n\n:::\n:::\n\n\n<!-- ![](images/02-trace.png) -->\n\n## Lists\n\n- A list overall, has it's own reference (id)\n- List *elements* also each point to other values\n- List doesn't store the value, it *stores a reference to the value*\n- As of R 3.1.0, modifying lists creates a *shallow copy*\n - References (bindings) are copied, but *values are not*\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl1 <- list(1, 2, 3)\nl2 <- l1\nl2[[3]] <- 4\n```\n:::\n\n\n- We can use `ref()` to see how they compare\n - See how the list reference is different\n - But first two items in each list are the same\n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(l1, l2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586f2e698] <list> \n#> ├─[2:0x2a5877133b8] <dbl> \n#> ├─[3:0x2a5877131f8] <dbl> \n#> └─[4:0x2a587713038] <dbl> \n#> \n#> █ [5:0x2a586fc3098] <list> \n#> ├─[2:0x2a5877133b8] \n#> ├─[3:0x2a5877131f8] \n#> └─[6:0x2a58770fc78] <dbl>\n```\n\n\n:::\n:::\n\n\n![](images/02-l-modify-2.png){width=50%}\n\n## Data Frames\n\n- Data frames are lists of vectors\n- So copying and modifying a column *only affects that column*\n- **BUT** if you modify a *row*, every column must be copied\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))\nd2 <- d1\nd3 <- d1\n```\n:::\n\n\nOnly the modified column changes\n\n::: {.cell}\n\n```{.r .cell-code}\nd2[, 2] <- d2[, 2] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a584931608] <df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] <dbl> \n#> └─y = [3:0x2a57f3b9c78] <dbl> \n#> \n#> █ [4:0x2a5810eb508] 
<df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] \n#> └─y = [5:0x2a57feb2058] <dbl>\n```\n\n\n:::\n:::\n\n\nAll columns change\n\n::: {.cell}\n\n```{.r .cell-code}\nd3[1, ] <- d3[1, ] * 3\nref(d1, d3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a584931608] <df[,2]> \n#> ├─x = [2:0x2a57f3b9cc8] <dbl> \n#> └─y = [3:0x2a57f3b9c78] <dbl> \n#> \n#> █ [4:0x2a57faa92c8] <df[,2]> \n#> ├─x = [5:0x2a585a91b38] <dbl> \n#> └─y = [6:0x2a585a91ae8] <dbl>\n```\n\n\n:::\n:::\n\n\n## Character vectors\n\n- R has a **global string pool**\n- Elements of character vectors point to unique strings in the pool\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"a\", \"a\", \"abc\", \"d\")\n```\n:::\n\n\n![](images/02-character-2.png)\n\n## Exercises\n\n##### 1. Why is `tracemem(1:10)` not useful?\n\n> Because it tries to trace a value that is not bound to a name\n\n##### 2. Why are there two copies?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1L, 2L, 3L)\ntracemem(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"<000002A5856391C8>\"\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[3]] <- 4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a5856391c8 -> 0x000002a585653f08]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a585653f08 -> 0x000002a58663f8b8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n:::\n\n\n> Because we convert an 
*integer* vector (using 1L, etc.) to a *double* vector (using just 4)- \n\n##### 3. What is the relationships among these objects?\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10 \nb <- list(a, a)\nc <- list(b, a, 1:10) # \n```\n:::\n\n\na <- obj 1 \nb <- obj 1, obj 1 \nc <- b(obj 1, obj 1), obj 1, 1:10 \n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586fc3ea8] <list> \n#> ├─█ [2:0x2a585a1a308] <list> \n#> │ ├─[3:0x2a585aa1c40] <int> \n#> │ └─[3:0x2a585aa1c40] \n#> ├─[3:0x2a585aa1c40] \n#> └─[4:0x2a585b13d90] <int>\n```\n\n\n:::\n:::\n\n\n\n##### 4. What happens here?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(1:10)\nx[[2]] <- x\n```\n:::\n\n\n- `x` is a list\n- `x[[2]] <- x` creates a new list, which in turn contains a reference to the \n original list\n- `x` is no longer bound to `list(1:10)`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2a586172508] <list> \n#> ├─[2:0x2a58641c040] <int> \n#> └─█ [3:0x2a586b06c48] <list> \n#> └─[2:0x2a58641c040]\n```\n\n\n:::\n:::\n\n\n![](images/02-copy_on_modify_fig2.png){width=50%}\n\n## Object Size\n\n- Use `lobstr::obj_size()` \n- Lists may be smaller than expected because of referencing the same value\n- Strings may be smaller than expected because using global string pool\n- Difficult to predict how big something will be\n - Can only add sizes together if they share no references in common\n\n### Alternative Representation\n- As of R 3.5.0 - ALTREP\n- Represent some vectors compactly\n - e.g., 1:1000 - not 10,000 values, just 1 and 1,000\n\n### Exercises\n\n##### 1. 
Why are the sizes so different?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- rep(list(runif(1e4)), 100)\n\nobject.size(y) # ~8000 kB\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8005648 bytes\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(y) # ~80 kB\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 80.90 kB\n```\n\n\n:::\n:::\n\n\n> From `?object.size()`: \n> \n> \"This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example.\n\n##### 2. Why is the size misleading?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfuns <- list(mean, sd, var)\nobj_size(funs)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 18.76 kB\n```\n\n\n:::\n:::\n\n\n> Because they reference functions from base and stats, which are always available.\n> Why bother looking at the size? What use is that?\n\n##### 3. Predict the sizes\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- runif(1e6) # 8 MB\nobj_size(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb <- list(a, a)\n```\n:::\n\n\n- There is one value ~8MB\n- `a` and `b[[1]]` and `b[[2]]` all point to the same value.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 8.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[1]][[1]] <- 10\n```\n:::\n\n- Now there are two values ~8MB each (16MB total)\n- `a` and `b[[2]]` point to the same value (8MB)\n- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b) # 16 MB (two values, two element references)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b) 
# 16 MB (a & b[[2]] point to the same value)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n:::\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[2]][[1]] <- 10\n```\n:::\n\n- Finally, now there are three values ~8MB each (24MB total)\n- Although `b[[1]]` and `b[[2]]` have the same contents, \n they are not references to the same object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_size(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 16.00 MB\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_size(a, b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 24.00 MB\n```\n\n\n:::\n:::\n\n\n\n## Modify-in-place\n\n- Modifying usually creates a copy except for\n - Objects with a single binding (performance optimization)\n - Environments (special)\n\n### Objects with a single binding\n\n- Hard to know if copy will occur\n- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one)\n- May make a copy even if there's only one binding left\n- Using a function makes a reference to it **unless it's a function based on C**\n- Best to use `tracemem()` to check rather than guess.\n\n\n#### Example - lists vs. 
data frames in for loop\n\n**Setup** \n\nCreate the data to modify\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\n```\n:::\n\n\n\n**Data frame - Copied every time!**\n\n::: {.cell}\n\n```{.r .cell-code}\ncat(tracemem(x), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A587857268>\n```\n\n\n:::\n\n```{.r .cell-code}\nfor (i in seq_along(medians)) {\n x[[i]] <- x[[i]] - medians[[i]]\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a587857268 -> 0x000002a584b5de78]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5de78 -> 0x000002a584b5d2a8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d2a8 -> 0x000002a584b5d238]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d238 -> 0x000002a584b5d1c8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart 
withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d1c8 -> 0x000002a584b5d158]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d158 -> 0x000002a584b5d0e8]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d0e8 -> 0x000002a584b5d078]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d078 -> 0x000002a584b5d008]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584b5d008 -> 0x000002a584bbfea8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList 
doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main \n#> tracemem[0x000002a584bbfea8 -> 0x000002a584bbfe38]: [[<-.data.frame [[<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\nuntracemem(x)\n```\n:::\n\n\n**List (uses internal C code) - Copied once!**\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- as.list(x)\n\ncat(tracemem(y), \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <000002A584B16388>\n```\n\n\n:::\n\n```{.r .cell-code}\nfor (i in seq_along(medians)) {\n y[[i]] <- y[[i]] - medians[[i]]\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tracemem[0x000002a584b16388 -> 0x000002a582d8fea8]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group withCallingHandlers <Anonymous> process_file <Anonymous> <Anonymous> execute .main\n```\n\n\n:::\n\n```{.r .cell-code}\nuntracemem(y)\n```\n:::\n\n\n#### Benchmark this (Exercise #2)\n\n**First wrap in a function**\n\n::: {.cell}\n\n```{.r .cell-code}\nmed <- function(d, medians) {\n for (i in seq_along(medians)) {\n d[[i]] <- d[[i]] - medians[[i]]\n }\n}\n```\n:::\n\n\n**Try with 5 columns**\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" 
= med(y, medians)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#> expression min median `itr/sec` mem_alloc `gc/sec`\n#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n#> 1 data.frame 52.3µs 68.2µs 13411. 410KB 201.\n#> 2 list 16.2µs 33µs 28621. 391KB 279.\n```\n\n\n:::\n:::\n\n\n**Try with 20 columns**\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 20))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" = med(y, medians)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 6\n#> expression min median `itr/sec` mem_alloc `gc/sec`\n#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>\n#> 1 data.frame 143.8µs 189.7µs 4722. 400KB 50.1\n#> 2 list 25.6µs 39.7µs 24419. 392KB 243.\n```\n\n\n:::\n:::\n\n\n**WOW!**\n\n\n### Environmments\n- Always modified in place (**reference semantics**)\n- Interesting because if you modify the environment, all existing bindings have the same reference\n- If two names point to the same environment, and you update one, you update both!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ne1 <- rlang::env(a = 1, b = 2, c = 3)\ne2 <- e1\ne1$c <- 4\ne2$c\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 4\n```\n\n\n:::\n:::\n\n\n- This means that environments can contain themselves (!)\n\n### Exercises\n\n##### 1. Why isn't this circular?\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list()\nx[[1]] <- x\n```\n:::\n\n\n> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second.\n\n##### 2. (see \"Objects with a single binding\")\n\n##### 3. 
What happens if you attempt to use tracemem() on an environment?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ne1 <- rlang::env(a = 1, b = 2, c = 3)\ntracemem(e1)\n```\n\n::: {.cell-output .cell-output-error}\n\n```\n#> Error in tracemem(e1): 'tracemem' is not useful for promise and environment objects\n```\n\n\n:::\n:::\n\n\n> Because environments always modified in place, there's no point in tracing them\n\n\n## Unbinding and the garbage collector\n\n- If you delete the 'name' bound to an object, the object still exists\n- R runs a \"garbage collector\" (GC) to remove these objects when it needs more memory\n- \"Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try.\"\n- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed\n- You can force GC with `gc()` but you never need to to use more memory *within* R\n- Only reason to do so is to free memory for other system software, or, to get the\nmessage printed about how much memory is being used\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> used (Mb) gc trigger (Mb) max used (Mb)\n#> Ncells 805637 43.1 1486050 79.4 1486050 79.4\n#> Vcells 4532584 34.6 10146329 77.5 10146315 77.5\n```\n\n\n:::\n\n```{.r .cell-code}\nmem_used()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 81.38 MB\n```\n\n\n:::\n:::\n\n\n- These numbers will **not** be what you OS tells you because, \n 1. It includes objects created by R, but not R interpreter\n 2. R and OS are lazy and don't reclaim/release memory until it's needed\n 3. 
R counts memory from objects, but there are gaps due to those that are deleted -> \n *memory fragmentation* [less memory actually available they you might think]\n", + "markdown": "---\nengine: knitr\ntitle: Names and values\n---\n\n## Learning objectives\n\n- Distinguish between an *object* and its *name*.\n- Identify when data are *copied* versus *modified*.\n- Trace and identify the memory used by R.\n\nThe `{lobstr}` package will help us throughout the chapter\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(lobstr)\n```\n:::\n\n\n\n## Syntactic names are easier to create and work with than non-syntactic names\n\n\n- Syntactic names: `my_variable`, `x`, `cpp11`, `.by`.\n - Can't use names in `?Reserved`\n\n- Non-syntactic names need to be surrounded in backticks. \n\n## Names are *bound to* values with `<-`\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- c(1, 2, 3)\na\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n:::\n\n\n## Many names can be bound to the same values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb <- a\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n:::\n\n\n## If shared values are modified, the object is copied to a new address\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb[[1]] <- 5\nobj_addr(a)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bfbc968\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226fda7278\"\n```\n\n\n:::\n:::\n\n\n## Memory addresses can differ even if objects seem the same\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- 1:10\nb <- a\nc <- 1:10\n\nobj_addr(a)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n#> [1] \"0x22271c9e7b0\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(b)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271c9e7b0\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(c)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271d77708\"\n```\n\n\n:::\n:::\n\n\n## Functions have a single address regardless of how they're referenced\n\n\n::: {.cell}\n\n```{.r .cell-code}\nobj_addr(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(base::mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(get(\"mean\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226f891738\"\n```\n\n\n:::\n:::\n\n\n## Unlike most objects, environments keep the same memory address on modify\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd <- new.env()\nobj_addr(d)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\ne <- d\ne[['a']] <- 1\nobj_addr(e)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(d)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226b7489d8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nd[['a']]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n:::\n\n\n## Use `tracemem` to validate if values are copied or modified\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- runif(10)\ntracemem(x)\n#> [1] \"<000001F4185B4B08>\"\ny <- x\nx[[1]] <- 10\n#> tracemem[0x000001f4185b4b08 -> 0x000001f4185b4218]:\nuntracemem(x)\n```\n:::\n\n\n## `tracemem` shows internal C code minimizes copying\n\n\n::: {.cell}\n\n```{.r .cell-code}\ny <- as.list(x)\ntracemem(y)\n#> [1] \"<000001AD67FDCD38>\"\nmedians <- vapply(x, median, numeric(1))\nfor (i in 1:5) {\n y[[i]] <- y[[i]] - 
medians[[i]]\n}\n#> tracemem[0x000001ad67fdcd38 -> 0x000001ad61982638]:\nuntracemem(y)\n```\n:::\n\n\n## A function's environment follows copy-on-modify rules\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nf <- function(a) {\n a\n}\n\nx <- c(1, 2, 3)\nz <- f(x) # No change in value\n\nobj_addr(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bdb8738\"\n```\n\n\n:::\n\n```{.r .cell-code}\nobj_addr(z) # No address change \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226bdb8738\"\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n![](images/02-trace.png)\n:::\n\n::::\n\n::: notes\n- Diagrams will be explained more in chapter 7.\n- `a` points to same address as `x`.\n- If `a` modified inside function, `z` would have new address.\n:::\n\n\n## `ref()` shows the memory address of a list and its *elements*\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nl1 <- list(1, 2, 3)\nobj_addr(l1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x2226c315db8\"\n```\n\n\n:::\n\n```{.r .cell-code}\nl2 <- l1\nl2[[3]] <- 4\nref(l1, l2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2226c315db8] <list> \n#> ├─[2:0x2226c730078] <dbl> \n#> ├─[3:0x2226c75cdc8] <dbl> \n#> └─[4:0x2226c75cc08] <dbl> \n#> \n#> █ [5:0x2226c36cd68] <list> \n#> ├─[2:0x2226c730078] \n#> ├─[3:0x2226c75cdc8] \n#> └─[6:0x2226c75b318] <dbl>\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n![](images/02-l-modify-2.png){width=50%}\n:::\n\n::::\n\n## Since dataframes are lists of (column) vectors, mutating a column modifies only that column\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))\nd2 <- d1\nd2[, 2] <- d2[, 2] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2226cca93c8] <df[,2]> \n#> ├─x = [2:0x22272216148] <dbl> \n#> └─y = [3:0x222722160f8] <dbl> \n#> \n#> █ [4:0x2226ce0cf48] <df[,2]> \n#> ├─x = [2:0x22272216148] \n#> └─y 
= [5:0x222727b0c38] <dbl>\n```\n\n\n:::\n:::\n\n\n## Since dataframes are lists of (column) vectors, mutating a row modifies the value\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))\nd2 <- d1\nd2[1, ] <- d2[1, ] * 2\nref(d1, d2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x22272912588] <df[,2]> \n#> ├─x = [2:0x222730bd408] <dbl> \n#> └─y = [3:0x222730bd3b8] <dbl> \n#> \n#> █ [4:0x22272b0e548] <df[,2]> \n#> ├─x = [5:0x222731501d8] <dbl> \n#> └─y = [6:0x22273150188] <dbl>\n```\n\n\n:::\n:::\n\n\n::: notes\n- Here \"mutate\" means \"change\", not `dplyr::mutate()`\n:::\n\n## Characters are unique due to the global string pool\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:4\nref(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x22272f78460] <int>\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- 1:4\nref(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1:0x222730899d8] <int>\n```\n\n\n:::\n\n```{.r .cell-code}\nx <- c(\"a\", \"a\", \"b\")\nref(x, character = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2227394f1d8] <chr> \n#> ├─[2:0x22267d9b118] <string: \"a\"> \n#> ├─[2:0x22267d9b118] \n#> └─[3:0x2226e17f3b8] <string: \"b\">\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- c(\"a\")\nref(y, character = TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> █ [1:0x2227397fce8] <chr> \n#> └─[2:0x22267d9b118] <string: \"a\">\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n![](images/02-character-2.png)\n:::\n\n::::\n\n::: notes\n- \"a\" is always at the same address.\n- Each member of character vector has its own address (kind of list-like).\n:::\n\n## Memory amount can also be measured, using `lobstr::obj_size`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbanana <- \"bananas bananas bananas\"\nobj_addr(banana)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"0x22271bc25b8\"\n```\n\n\n:::\n\n```{.r 
.cell-code}\nobj_size(banana)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 136 B\n```\n\n\n:::\n:::\n\n\n## Alternative Representation or ALTREPs represent vector values efficiently\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:10\nobj_size(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 680 B\n```\n\n\n:::\n\n```{.r .cell-code}\ny <- 1:10000\nobj_size(y)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 680 B\n```\n\n\n:::\n:::\n\n\n## We can measure memory & speed using `bench::mark()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmed <- function(d, medians) {\n for (i in seq_along(medians)) {\n d[[i]] <- d[[i]] - medians[[i]]\n }\n}\nx <- data.frame(matrix(runif(5 * 1e4), ncol = 5))\nmedians <- vapply(x, median, numeric(1))\ny <- as.list(x)\n\nbench::mark(\n \"data.frame\" = med(x, medians),\n \"list\" = med(y, medians)\n)[, c(\"min\", \"median\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 3\n#> min median mem_alloc\n#> <bch:tm> <bch:tm> <bch:byt>\n#> 1 52.7µs 66.6µs 491KB\n#> 2 22.8µs 33.7µs 391KB\n```\n\n\n:::\n:::\n\n\n::: notes\n- The thing to see: list version uses less RAM and is faster\n:::\n\n## The garbage collector `gc()` explicitly clears out unbound objects\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:3\nx <- 2:4 # \"1:3\" is orphaned\nrm(x) # \"2:4\" is orphaned\ngc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> used (Mb) gc trigger (Mb) max used (Mb)\n#> Ncells 791104 42.3 1505464 80.5 1505464 80.5\n#> Vcells 1497631 11.5 8388608 64.0 8388482 64.0\n```\n\n\n:::\n\n```{.r .cell-code}\nlobstr::mem_used() # Wrapper around gc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> 56.29 MB\n```\n\n\n:::\n:::\n\n\n::: aside\n`gc()` runs automatically, never *need* to call\n:::\n\n::: notes\n- `mem_used()` multiplies Ncells \"used\" by either 28 (32-bit architecture) or 56 (64-bit architecture), and Vcells \"used\" by 8, adds them, and converts to MB.\n:::",
"supporting": [ "02_files" ], "filters": [ "rmarkdown/pagebreak.lua" ], - "includes": {}, + "includes": { + "include-after-body": [ + "\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n" + ] + }, "engineDependencies": {}, "preserve": {}, "postProcess": true diff --git a/slides/.gitignore b/slides/.gitignore @@ -0,0 +1 @@ +*_files/ +\ No newline at end of file diff --git a/slides/02.qmd b/slides/02.qmd @@ -5,132 +5,114 @@ title: Names and values ## Learning objectives -- To be able to understand distinction between an *object* and its *name* -- With this knowledge, to be able write faster code using less memory -- To better understand R's functional programming tools +- Distinguish between an *object* and its *name*. +- Identify when data are *copied* versus *modified*. +- Trace and identify the memory used by R. + +The `{lobstr}` package will help us throughout the chapter -Using lobstr package here. ```{r} library(lobstr) ``` -## Quiz - -### 1. How do I create a new column called `3` that contains the sum of `1` and `2`? 
+## Syntactic names are easier to create and work with than non-syntactic names -```{r} -df <- data.frame(runif(3), runif(3)) -names(df) <- c(1, 2) -df -``` -```{r} -df$`3` <- df$`1` + df$`2` -df -``` +- Syntactic names: `my_variable`, `x`, `cpp11`, `.by`. + - Can't use names in `?Reserved` -**What makes these names challenging?** +- Non-syntactic names need to be surrounded in backticks. -> You need to use backticks (`) when the name of an object doesn't start with a -> a character or '.' [or . followed by a number] (non-syntactic names). - -### 2. How much memory does `y` occupy? +## Names are *bound to* values with `<-` ```{r} -x <- runif(1e6) -y <- list(x, x, x) +a <- c(1, 2, 3) +a +obj_addr(a) ``` -Need to use the lobstr package: -```{r} -lobstr::obj_size(y) -``` - -> Note that if you look in the RStudio Environment or use R base `object.size()` -> you actually get a value of 24 MB +## Many names can be bound to the same values ```{r} -object.size(y) -``` - -### 3. On which line does `a` get copied in the following example? -```{r} -a <- c(1, 5, 3, 2) b <- a -b[[1]] <- 10 +obj_addr(a) +obj_addr(b) ``` -> Not until `b` is modified, the third line - -## Binding basics - -- Create values and *bind* a name to them -- Names have values (rather than values have names) -- Multiple names can refer to the same values -- We can look at an object's address to keep track of the values independent of their names +## If shared values are modified, the object is copied to a new address ```{r} -x <- c(1, 2, 3) -y <- x -obj_addr(x) -obj_addr(y) +b[[1]] <- 5 +obj_addr(a) +obj_addr(b) ``` +## Memory addresses can differ even if objects seem the same -### Exercises - -##### 1. Explain the relationships ```{r} a <- 1:10 b <- a -c <- b -d <- 1:10 -``` +c <- 1:10 -> `a` `b` and `c` are all names that refer to the first value `1:10` -> -> `d` is a name that refers to the *second* value of `1:10`. 
+obj_addr(a) +obj_addr(b) +obj_addr(c) +``` +## Functions have a single address regardless of how they're referenced -##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()` ```{r} obj_addr(mean) obj_addr(base::mean) obj_addr(get("mean")) -obj_addr(evalq(mean)) -obj_addr(match.fun("mean")) ``` -> Yes! - -## Copy-on-modify - -- If you modify a value bound to multiple names, it is 'copy-on-modify' -- If you modify a value bound to a single name, it is 'modify-in-place' -- Use `tracemem()` to see when a name's value changes +## Unlike most objects, environments keep the same memory address on modify ```{r} -x <- c(1, 2, 3) -cat(tracemem(x), "\n") +d <- new.env() +obj_addr(d) +e <- d +e[['a']] <- 1 +obj_addr(e) +obj_addr(d) +d[['a']] ``` +## Use `tracemem` to validate if values are copied or modified + ```{r} +#| eval: false +x <- runif(10) +tracemem(x) +#> [1] "<000001F4185B4B08>" y <- x -y[[3]] <- 4L # Changes (copy-on-modify) -y[[3]] <- 5L # Doesn't change (modify-in-place) +x[[1]] <- 10 +#> tracemem[0x000001f4185b4b08 -> 0x000001f4185b4218]: +untracemem(x) ``` -Turn off `tracemem()` with `untracemem()` - -> Can also use `ref(x)` to get the address of the value bound to a given name +## `tracemem` shows internal C code minimizes copying +```{r} +#| eval: false +y <- as.list(x) +tracemem(y) +#> [1] "<000001AD67FDCD38>" +medians <- vapply(x, median, numeric(1)) +for (i in 1:5) { + y[[i]] <- y[[i]] - medians[[i]] +} +#> tracemem[0x000001ad67fdcd38 -> 0x000001ad61982638]: +untracemem(y) +``` -## Functions +## A function's environment follows copy-on-modify rules -- Copying also applies within functions -- If you copy (but don't modify) `x` within `f()`, no copy is made +:::: {.columns} +::: {.column} ```{r} f <- function(a) { a @@ -139,264 +121,119 @@ f <- function(a) { x <- c(1, 2, 3) z <- f(x) # No change in value -ref(x) -ref(z) +obj_addr(x) +obj_addr(z) # No address change ``` +::: -<!-- ![](images/02-trace.png) --> 
+::: {.column} +![](images/02-trace.png) +::: -## Lists +:::: -- A list overall, has it's own reference (id) -- List *elements* also each point to other values -- List doesn't store the value, it *stores a reference to the value* -- As of R 3.1.0, modifying lists creates a *shallow copy* - - References (bindings) are copied, but *values are not* +::: notes +- Diagrams will be explained more in chapter 7. +- `a` points to same address as `x`. +- If `a` modified inside function, `z` would have new address. +::: + +## `ref()` shows the memory address of a list and its *elements* + +:::: {.columns} + +::: {.column} ```{r} l1 <- list(1, 2, 3) +obj_addr(l1) l2 <- l1 l2[[3]] <- 4 -``` - -- We can use `ref()` to see how they compare - - See how the list reference is different - - But first two items in each list are the same - -```{r} ref(l1, l2) ``` +::: +::: {.column} ![](images/02-l-modify-2.png){width=50%} +::: -## Data Frames +:::: -- Data frames are lists of vectors -- So copying and modifying a column *only affects that column* -- **BUT** if you modify a *row*, every column must be copied +## Since dataframes are lists of (column) vectors, mutating a column modifies only that column ```{r} d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3)) d2 <- d1 -d3 <- d1 -``` - -Only the modified column changes -```{r} d2[, 2] <- d2[, 2] * 2 ref(d1, d2) ``` -All columns change -```{r} -d3[1, ] <- d3[1, ] * 3 -ref(d1, d3) -``` - -## Character vectors - -- R has a **global string pool** -- Elements of character vectors point to unique strings in the pool +## Since dataframes are lists of (column) vectors, mutating a row modifies the value ```{r} -x <- c("a", "a", "abc", "d") -``` - -![](images/02-character-2.png) - -## Exercises - -##### 1. Why is `tracemem(1:10)` not useful? - -> Because it tries to trace a value that is not bound to a name - -##### 2. Why are there two copies? 
-```{r} -x <- c(1L, 2L, 3L) -tracemem(x) -x[[3]] <- 4 -``` - -> Because we convert an *integer* vector (using 1L, etc.) to a *double* vector (using just 4)- - -##### 3. What is the relationships among these objects? - -```{r} -a <- 1:10 -b <- list(a, a) -c <- list(b, a, 1:10) # -``` - -a <- obj 1 -b <- obj 1, obj 1 -c <- b(obj 1, obj 1), obj 1, 1:10 - -```{r} -ref(c) +d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3)) +d2 <- d1 +d2[1, ] <- d2[1, ] * 2 +ref(d1, d2) ``` +::: notes +- Here "mutate" means "change", not `dplyr::mutate()` +::: -##### 4. What happens here? -```{r} -x <- list(1:10) -x[[2]] <- x -``` +## Characters are unique due to the global string pool -- `x` is a list -- `x[[2]] <- x` creates a new list, which in turn contains a reference to the - original list -- `x` is no longer bound to `list(1:10)` +:::: {.columns} +::: {.column} ```{r} +x <- 1:4 ref(x) +y <- 1:4 +ref(y) +x <- c("a", "a", "b") +ref(x, character = TRUE) +y <- c("a") +ref(y, character = TRUE) ``` +::: -![](images/02-copy_on_modify_fig2.png){width=50%} - -## Object Size - -- Use `lobstr::obj_size()` -- Lists may be smaller than expected because of referencing the same value -- Strings may be smaller than expected because using global string pool -- Difficult to predict how big something will be - - Can only add sizes together if they share no references in common - -### Alternative Representation -- As of R 3.5.0 - ALTREP -- Represent some vectors compactly - - e.g., 1:1000 - not 10,000 values, just 1 and 1,000 - -### Exercises - -##### 1. Why are the sizes so different? - -```{r} -y <- rep(list(runif(1e4)), 100) - -object.size(y) # ~8000 kB -obj_size(y) # ~80 kB -``` - -> From `?object.size()`: -> -> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example. - -##### 2. Why is the size misleading? 
- -```{r} -funs <- list(mean, sd, var) -obj_size(funs) -``` - -> Because they reference functions from base and stats, which are always available. -> Why bother looking at the size? What use is that? - -##### 3. Predict the sizes - -```{r} -a <- runif(1e6) # 8 MB -obj_size(a) -``` - - -```{r} -b <- list(a, a) -``` - -- There is one value ~8MB -- `a` and `b[[1]]` and `b[[2]]` all point to the same value. - -```{r} -obj_size(b) -obj_size(a, b) -``` - - -```{r} -b[[1]][[1]] <- 10 -``` -- Now there are two values ~8MB each (16MB total) -- `a` and `b[[2]]` point to the same value (8MB) -- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed - -```{r} -obj_size(b) # 16 MB (two values, two element references) -obj_size(a, b) # 16 MB (a & b[[2]] point to the same value) -``` - - -```{r} -b[[2]][[1]] <- 10 -``` -- Finally, now there are three values ~8MB each (24MB total) -- Although `b[[1]]` and `b[[2]]` have the same contents, - they are not references to the same object. - -```{r} -obj_size(b) -obj_size(a, b) -``` - - -## Modify-in-place - -- Modifying usually creates a copy except for - - Objects with a single binding (performance optimization) - - Environments (special) - -### Objects with a single binding - -- Hard to know if copy will occur -- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one) -- May make a copy even if there's only one binding left -- Using a function makes a reference to it **unless it's a function based on C** -- Best to use `tracemem()` to check rather than guess. +::: {.column} +![](images/02-character-2.png) +::: +:::: -#### Example - lists vs. data frames in for loop +::: notes +- "a" is always at the same address. +- Each member of character vector has its own address (kind of list-like). 
+::: -**Setup** +## Memory amount can also be measured, using `lobstr::obj_size` -Create the data to modify ```{r} -x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) -medians <- vapply(x, median, numeric(1)) +banana <- "bananas bananas bananas" +obj_addr(banana) +obj_size(banana) ``` +## Alternative Representation or ALTREPs represent vector values efficiently -**Data frame - Copied every time!** -```{r} -cat(tracemem(x), "\n") -for (i in seq_along(medians)) { - x[[i]] <- x[[i]] - medians[[i]] -} -untracemem(x) -``` - -**List (uses internal C code) - Copied once!** ```{r} -y <- as.list(x) - -cat(tracemem(y), "\n") -for (i in seq_along(medians)) { - y[[i]] <- y[[i]] - medians[[i]] -} -untracemem(y) +x <- 1:10 +obj_size(x) +y <- 1:10000 +obj_size(y) ``` -#### Benchmark this (Exercise #2) +## We can measure memory & speed using `bench::mark()` -**First wrap in a function** ```{r} med <- function(d, medians) { for (i in seq_along(medians)) { d[[i]] <- d[[i]] - medians[[i]] } } -``` - -**Try with 5 columns** -```{r} x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) medians <- vapply(x, median, numeric(1)) y <- as.list(x) @@ -404,78 +241,27 @@ y <- as.list(x) bench::mark( "data.frame" = med(x, medians), "list" = med(y, medians) -) -``` - -**Try with 20 columns** -```{r} -x <- data.frame(matrix(runif(5 * 1e4), ncol = 20)) -medians <- vapply(x, median, numeric(1)) -y <- as.list(x) - -bench::mark( - "data.frame" = med(x, medians), - "list" = med(y, medians) -) +)[, c("min", "median", "mem_alloc")] ``` -**WOW!** +::: notes +- The thing to see: list version uses less RAM and is faster +::: - -### Environmments -- Always modified in place (**reference semantics**) -- Interesting because if you modify the environment, all existing bindings have the same reference -- If two names point to the same environment, and you update one, you update both! 
- ```{r} -e1 <- rlang::env(a = 1, b = 2, c = 3) -e2 <- e1 -e1$c <- 4 -e2$c -``` - -- This means that environments can contain themselves (!) - -### Exercises - -##### 1. Why isn't this circular? -```{r} -x <- list() -x[[1]] <- x -``` - -> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second. - -##### 2. (see "Objects with a single binding") - -##### 3. What happens if you attempt to use tracemem() on an environment? - -```{r} -#| error: true -e1 <- rlang::env(a = 1, b = 2, c = 3) -tracemem(e1) -``` - -> Because environments are always modified in place, there's no point in tracing them - - -## Unbinding and the garbage collector - -- If you delete the 'name' bound to an object, the object still exists -- R runs a "garbage collector" (GC) to remove these objects when it needs more memory -- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try." -- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed -- You can force GC with `gc()` but you never need to in order to use more memory *within* R -- Only reason to do so is to free memory for other system software, or, to get the -message printed about how much memory is being used +## The garbage collector `gc()` explicitly clears out unbound objects ```{r} +x <- 1:3 +x <- 2:4 # "1:3" is orphaned +rm(x) # "2:4" is orphaned gc() -mem_used() +lobstr::mem_used() # Wrapper around gc() ``` -- These numbers will **not** be what your OS tells you because, - 1. It includes objects created by R, but not the R interpreter - 2. R and OS are lazy and don't reclaim/release memory until it's needed - 3.
R counts memory from objects, but there are gaps due to those that are deleted -> - *memory fragmentation* [less memory actually available than you might think] +::: aside +`gc()` runs automatically, never *need* to call +::: + +::: notes +- `mem_used()` multiplies Ncells "used" by either 28 (32-bit architecture) or 56 (64-bit architecture), and Vcells "used" by 8, adds them, and converts to MB. +::: +\ No newline at end of file diff --git a/slides/_metadata.yml b/slides/_metadata.yml @@ -7,10 +7,7 @@ format: incremental: false execute: eval: true - r: - echo: true - mermaid: - echo: false + echo: true knitr: opts_chunk: comment: "#>"