bookclub-advr

DSLC Advanced R Book Club

---
engine: knitr
title: Names and values
---

## Learning objectives

- Distinguish between an *object* and its *name*.
- Identify when data are *copied* versus *modified*.
- Trace and identify the memory used by R.

The `{lobstr}` package will help us throughout the chapter.

```{r}
library(lobstr)
```

## Syntactic names are easier to create and work with than non-syntactic names

- Syntactic names: `my_variable`, `x`, `cpp11`, `.by`.
  - Reserved words listed in `?Reserved` (such as `if` or `TRUE`) can't be used as names.

- Non-syntactic names need to be surrounded by backticks.

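As a small illustration of the backtick rule above, the sketch below binds a value to a name containing a space and then uses base R's `make.names()` to turn arbitrary strings into syntactic names. The name `my variable` is just an example, not something from the book.

```{r}
# A name with a space is non-syntactic, so it must be wrapped in backticks.
`my variable` <- c(1, 2, 3)
`my variable`

# make.names() rewrites arbitrary strings as syntactic names,
# including reserved words such as "if".
make.names(c("my variable", "if"))
```
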
## Names are *bound to* values with `<-`

```{r}
a <- c(1, 2, 3)
a
obj_addr(a)
```

## Many names can be bound to the same value

```{r}
b <- a
obj_addr(a)
obj_addr(b)
```

## If shared values are modified, the object is copied to a new address

```{r}
b[[1]] <- 5
obj_addr(a)
obj_addr(b)
```

## Memory addresses can differ even if objects seem the same

```{r}
a <- 1:10
b <- a
c <- 1:10

obj_addr(a)
obj_addr(b)
obj_addr(c)
```

## Functions have a single address regardless of how they're referenced

```{r}
obj_addr(mean)
obj_addr(base::mean)
obj_addr(get("mean"))
```

## Unlike most objects, environments keep the same memory address when modified

```{r}
d <- new.env()
obj_addr(d)
e <- d
e[['a']] <- 1
obj_addr(e)
obj_addr(d)
d[['a']]
```

## Use `tracemem()` to check whether values are copied or modified in place

```{r}
#| eval: false
x <- runif(10)
tracemem(x)
#> [1] "<000001F4185B4B08>"
y <- x
x[[1]] <- 10
#> tracemem[0x000001f4185b4b08 -> 0x000001f4185b4218]:
untracemem(x)
```

## `tracemem()` shows that internal C code minimizes copying

```{r}
#| eval: false
y <- as.list(x)
tracemem(y)
#> [1] "<000001AD67FDCD38>"
medians <- vapply(x, median, numeric(1))
for (i in 1:5) {
  y[[i]] <- y[[i]] - medians[[i]]
}
#> tracemem[0x000001ad67fdcd38 -> 0x000001ad61982638]:
untracemem(y)
```

## Function arguments follow copy-on-modify rules

:::: columns

::: column
```{r}
f <- function(a) {
  a
}

x <- c(1, 2, 3)
z <- f(x) # No change in value

obj_addr(x)
obj_addr(z) # No address change
```
:::

::: column
![](images/02-trace.png)
:::

::::

::: notes
- Diagrams are explained further in chapter 7.
- `a` points to the same address as `x`.
- If `a` were modified inside the function, `z` would have a new address.
:::

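The last note can be checked directly. The sketch below uses a hypothetical function `g()` (not from the book) that modifies its argument before returning it, so the returned object should no longer share an address with `x`.

```{r}
# Hypothetical helper: changes its argument, which forces a copy
# because `x` in the calling environment still points to the original.
g <- function(a) {
  a[[1]] <- 10
  a
}

x <- c(1, 2, 3)
z <- g(x)

x                                           # unchanged by the call
lobstr::obj_addr(x) == lobstr::obj_addr(z)  # the two addresses differ
```
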
## `ref()` shows the memory address of a list and its *elements*

:::: columns

::: column
```{r}
l1 <- list(1, 2, 3)
obj_addr(l1)
l2 <- l1
l2[[3]] <- 4
ref(l1, l2)
```
:::

::: column
![](images/02-l-modify-2.png){width=50%}
:::

::::

## Since data frames are lists of (column) vectors, mutating a column modifies only that column

```{r}
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
d2 <- d1
d2[, 2] <- d2[, 2] * 2
ref(d1, d2)
```

## Since data frames are lists of (column) vectors, mutating a row copies every column

```{r}
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))
d2 <- d1
d2[1, ] <- d2[1, ] * 2
ref(d1, d2)
```

::: notes
- Here "mutate" means "change", not `dplyr::mutate()`.
:::

## Character strings are unique because of the global string pool

:::: columns

::: column
```{r}
x <- 1:4
ref(x)
y <- 1:4
ref(y)
x <- c("a", "a", "b")
ref(x, character = TRUE)
y <- c("a")
ref(y, character = TRUE)
```
:::

::: column
![](images/02-character-2.png)
:::

::::

::: notes
- `"a"` is always at the same address.
- Each element of a character vector has its own address (kind of list-like).
:::

## Memory usage can also be measured with `lobstr::obj_size()`

```{r}
banana <- "bananas bananas bananas"
obj_addr(banana)
obj_size(banana)
```

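Because names and list elements are references, the size of a container is not simply the sum of its parts. The sketch below is an illustration with made-up data: it stores the same vector three times in a list and compares `lobstr::obj_size()` with base R's `object.size()`, which does not account for shared references.

```{r}
x <- runif(1e6)
y <- list(x, x, x)

lobstr::obj_size(x)
lobstr::obj_size(y)  # only slightly larger than x: all three elements share one vector

# object.size() counts the shared vector once per element.
object.size(y)
```
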
## Alternative representations (ALTREP) store some vectors compactly

```{r}
x <- 1:10
obj_size(x)
y <- 1:10000
obj_size(y)
```

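To make the effect more visible, the sketch below compares two integer sequences of very different lengths with an ordinary integer vector of the same length. The variable names are illustrative, and the example assumes an R version with ALTREP support (3.5.0 or later).

```{r}
small <- 1:3
large <- 1:1e6

# Both sequences are stored as just a start and an end, so their sizes match.
lobstr::obj_size(small) == lobstr::obj_size(large)

# A shuffled vector of the same length cannot be stored compactly.
shuffled <- sample.int(1e6)
lobstr::obj_size(shuffled)
```
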
## We can measure memory & speed using `bench::mark()`

```{r}
med <- function(d, medians) {
  for (i in seq_along(medians)) {
    d[[i]] <- d[[i]] - medians[[i]]
  }
}
x <- data.frame(matrix(runif(5 * 1e4), ncol = 5))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)

bench::mark(
  "data.frame" = med(x, medians),
  "list" = med(y, medians)
)[, c("min", "median", "mem_alloc")]
```

::: notes
- Key takeaway: the list version allocates less memory and runs faster.
:::

## The garbage collector frees unbound objects; `gc()` runs it explicitly

```{r}
x <- 1:3
x <- 2:4 # "1:3" is orphaned
rm(x) # "2:4" is orphaned
gc()
lobstr::mem_used() # Wrapper around gc()
```

::: aside
The garbage collector runs automatically; you never *need* to call `gc()`.
:::

::: notes
- `mem_used()` multiplies Ncells "used" by either 28 (32-bit architecture) or 56 (64-bit architecture) and Vcells "used" by 8, adds the two, and reports the total in megabytes.
:::
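
The arithmetic in the note can be reproduced by hand from the matrix that `gc()` returns. This is a rough base-R sketch of the calculation described above, not lobstr's actual source; the variable names are illustrative.

```{r}
gc_stats <- gc()

# Bytes per Ncell depend on pointer size: 56 on 64-bit, 28 on 32-bit.
node_bytes <- if (.Machine$sizeof.pointer == 8L) 56 else 28

# Each Vcell is 8 bytes.
used_bytes <- gc_stats["Ncells", "used"] * node_bytes +
  gc_stats["Vcells", "used"] * 8

used_bytes / 1e6  # approximate megabytes in use
```

The result should be close to `lobstr::mem_used()`, though not identical, since memory use changes between the two calls.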