bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git
Log | Files | Refs | README | LICENSE

commit 05745ca52c11ad9905777fba1e4c95a9e461be83
parent da54ce7f8fbe516c9ff405e981b28b1dabe3f1bc
Author: Jon Harmon <jonthegeek@gmail.com>
Date:   Mon,  4 Aug 2025 05:17:50 -0500

Restructure into listings (#80)


Diffstat:
A01.qmd | 16++++++++++++++++
A02.qmd | 16++++++++++++++++
A_chapter_body.qmd | 7+++++++
M_quarto.yml | 8++------
Alisting_new_window.js | 9+++++++++
Dslides/01-introduction.qmd | 105-------------------------------------------------------------------------------
Aslides/01.qmd | 99+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Aslides/02.qmd | 546+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dslides/02_Names_and_values.Rmd | 543-------------------------------------------------------------------------------
Mstyles.css | 3+++
Dvideos/01.qmd | 67-------------------------------------------------------------------
Avideos/01/02.qmd | 5+++++
Avideos/01/03.qmd | 5+++++
Avideos/01/04.qmd | 5+++++
Avideos/01/05.qmd | 5+++++
Avideos/01/06.qmd | 23+++++++++++++++++++++++
Avideos/01/07.qmd | 23+++++++++++++++++++++++
17 files changed, 764 insertions(+), 721 deletions(-)

diff --git a/01.qmd b/01.qmd @@ -0,0 +1,16 @@ +--- +title: "1. Introduction" +listing: + - contents: "slides/01.qmd" + id: slides + type: grid + - contents: "videos/01" + id: videos + type: grid + grid-columns: 2 + sort: "filename desc" +header-includes: + - '<script src="listing_new_window.js" defer></script>' +--- + +{{< include _chapter_body.qmd >}} diff --git a/02.qmd b/02.qmd @@ -0,0 +1,16 @@ +--- +title: "2. Names and values" +listing: + - contents: "slides/02.qmd" + id: slides + type: grid + - contents: "videos/02" + id: videos + type: grid + grid-columns: 2 + sort: "filename desc" +header-includes: + - '<script src="listing_new_window.js" defer></script>' +--- + +{{< include _chapter_body.qmd >}} diff --git a/_chapter_body.qmd b/_chapter_body.qmd @@ -0,0 +1,7 @@ +:::{#slides} +::: + +## Meeting videos + +:::{#videos} +::: diff --git a/_quarto.yml b/_quarto.yml @@ -21,14 +21,10 @@ website: target: advr_club-slides - section: "Getting started" contents: - - section: slides/01-introduction.qmd - target: advr_club-slides - contents: - - file: videos/01.qmd + - 01.qmd - section: "Foundations" contents: - - file: slides/02_Names_and_values.Rmd - target: advr_club-slides + - 02.qmd - file: slides/03_Vectors.Rmd target: advr_club-slides - file: slides/04_Subsetting.Rmd diff --git a/listing_new_window.js b/listing_new_window.js @@ -0,0 +1,9 @@ +// listing_new_window.js +document.addEventListener("DOMContentLoaded", function() { + const links = document.querySelectorAll('#listing-slides a, #listing-videos a'); + + links.forEach(link => { + link.target = '_blank'; + link.rel = 'noopener noreferrer'; + }); +}); diff --git a/slides/01-introduction.qmd b/slides/01-introduction.qmd @@ -1,105 +0,0 @@ ---- -engine: knitr -title: Introduction ---- - -# ️✅ Learning objectives - -## LOs for the entire book - -- Improve programming skills. -- Develop a deep understanding of R language fundamentals. -- Understand what functional programming means. -- Understand object-oriented programming as applied in R. -- Understand metaprogramming while developing in R. -- Be able to identify what to optimize and how to optimize it. - -## LOs for this chapter - -- Recognize the differences between the 1st and 2nd edition of this book. -- Describe the overall structure of the book. -- Decide whether this book is right for you. - -Books suggestions: - -- [The Structure and Interpretation of Computer Programs (SICP)](https://mitp-content-server.mit.edu/books/content/sectbyfn/books_pres_0/6515/sicp.zip/full-text/book/book.html) -- [Concepts, Techniques and Models of Computer Programming](https://mitpress.mit.edu/books/concepts-techniques-and-models-computer-programming) -- [The Pragmatic Programmer](https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/) - -# What's new? - -## Hadley's goals - -- Improve coverage of concepts Hadley understood better after 1e -- Reduce coverage of unimportant topics -- Easier to understand (including many more diagrams) - -## Base vs rlang - -- [1e](http://adv-r.had.co.nz) used base R almost exclusively -- 2e uses {[rlang](https://rlang.r-lib.org/)}, {[purrr](https://purrr.tidyverse.org/)}, etc - -# What we'll learn - -## The 5 sections - -- **Foundations:** (7 chapters) Building blocks of R -- **Functional programming:** (3 chapters) Treating functions as objects (that can be args in functions) -- **Object-oriented programming:** (5 chapters + 1) The many object systems of R (we'll add S7) -- **Metaprogramming:** (5 chapters) Generating code with code -- **Techniques:** (4 chapters) Debugging, measuring performance, improving performance - -::: notes -- Might be useful to open TOC here. -::: - -## Why R? - -- Diverse & welcoming community -- Many packages for stats & modeling, ML, dataviz, data wrangling -- Rmarkdown / Quarto -- RStudio / Positron -- Often used in science -- Functional programming powerful for data -- Metaprogramming -- Ease of connection to C, C++, etc - -## R imperfections - -- Much code by non-coders (messy) -- Community more about results than programming best practices -- Metaprogramming can lead to weird failures -- Inconsistency from > 30 years of evolution -- Poorly written R code runs very poorly - -## Who should read Advanced R? - -- Intermediate (and up) R programmers who want to really understand R -- Programmers from other langs who want to know why R is weird -- Prereqs: - - You've written lots of code - - You understand basics of data analysis - - You can install CRAN packages - -## What this book is not - -- [R for Data Science](https://r4ds.hadley.nz/) -- [R Packages](https://r-pkgs.org/) - -## Meta-techniques - -- Read source code - - F2 to see code in RStudio/Positron (with RStudio bindings) -- Adopt a scientific mindset - - Don't understand something? Hypothesize & experiment - -## Other books - -- The Structure and Interpretation of Computer Programs (Abelson, Sussman, and Sussman, 1996) [PDF](https://web.mit.edu/6.001/6.037/sicp.pdf) -- Concepts, Techniques and Models of Computer Programming (Van Roy & Haridi, 2003) [PDF](https://webperso.info.ucl.ac.be/~pvr/VanRoyHaridi2003-book.pdf) -- The Pragmatic Programmer (Hunt & Thomas, 1990) [buy eBook](https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/) - -::: notes -- As far as I can tell, first 2 PDFs are legal. -- I don't think a legal, free version of The Pragmatic Programmer is available. -::: diff --git a/slides/01.qmd b/slides/01.qmd @@ -0,0 +1,99 @@ +--- +engine: knitr +title: Introduction +--- + +# ️✅ Learning objectives + +## LOs for the entire book + +- Improve programming skills. +- Develop a deep understanding of R language fundamentals. +- Understand what functional programming means. +- Understand object-oriented programming as applied in R. +- Understand metaprogramming while developing in R. +- Be able to identify what to optimize and how to optimize it. + +## LOs for this chapter + +- Recognize the differences between the 1st and 2nd edition of this book. +- Describe the overall structure of the book. +- Decide whether this book is right for you. + +# What's new? + +## Hadley's goals + +- Improve coverage of concepts Hadley understood better after 1e +- Reduce coverage of unimportant topics +- Easier to understand (including many more diagrams) + +## Base vs rlang + +- [1e](http://adv-r.had.co.nz) used base R almost exclusively +- 2e uses {[rlang](https://rlang.r-lib.org/)}, {[purrr](https://purrr.tidyverse.org/)}, etc + +# What we'll learn + +## The 5 sections + +- **Foundations:** (7 chapters) Building blocks of R +- **Functional programming:** (3 chapters) Treating functions as objects (that can be args in functions) +- **Object-oriented programming:** (5 chapters + 1) The many object systems of R (we'll add S7) +- **Metaprogramming:** (5 chapters) Generating code with code +- **Techniques:** (4 chapters) Debugging, measuring performance, improving performance + +::: notes +- Might be useful to open TOC here. +::: + +## Why R? + +- Diverse & welcoming community +- Many packages for stats & modeling, ML, dataviz, data wrangling +- Rmarkdown / Quarto +- RStudio / Positron +- Often used in science +- Functional programming powerful for data +- Metaprogramming +- Ease of connection to C, C++, etc + +## R imperfections + +- Much code by non-coders (messy) +- Community more about results than programming best practices +- Metaprogramming can lead to weird failures +- Inconsistency from > 30 years of evolution +- Poorly written R code runs very poorly + +## Who should read Advanced R? + +- Intermediate (and up) R programmers who want to really understand R +- Programmers from other langs who want to know why R is weird +- Prereqs: + - You've written lots of code + - You understand basics of data analysis + - You can install CRAN packages + +## What this book is not + +- [R for Data Science](https://r4ds.hadley.nz/) +- [R Packages](https://r-pkgs.org/) + +## Meta-techniques + +- Read source code + - F2 to see code in RStudio/Positron (with RStudio bindings) +- Adopt a scientific mindset + - Don't understand something? Hypothesize & experiment + +## Other books + +- The Structure and Interpretation of Computer Programs (Abelson, Sussman, and Sussman, 1996) [PDF](https://web.mit.edu/6.001/6.037/sicp.pdf) +- Concepts, Techniques and Models of Computer Programming (Van Roy & Haridi, 2003) [PDF](https://webperso.info.ucl.ac.be/~pvr/VanRoyHaridi2003-book.pdf) +- The Pragmatic Programmer (Hunt & Thomas, 1990) [buy eBook](https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/) + +::: notes +- As far as I can tell, first 2 PDFs are legal. +- I don't think a legal, free version of The Pragmatic Programmer is available. +::: diff --git a/slides/02.qmd b/slides/02.qmd @@ -0,0 +1,546 @@ +--- +engine: knitr +title: Names and values +--- + +## Learning objectives + +- To be able to understand distinction between an *object* and its *name* +- With this knowledge, to be able write faster code using less memory +- To better understand R's functional programming tools + +Using lobstr package here. +```{r} +library(lobstr) +``` + + +## Quiz + +### 1. How do I create a new column called `3` that contains the sum of `1` and `2`? + +```{r} +df <- data.frame(runif(3), runif(3)) +names(df) <- c(1, 2) +df +``` + +```{r} +df$`3` <- df$`1` + df$`2` +df +``` + +**What makes these names challenging?** + +> You need to use backticks (`) when the name of an object doesn't start with a +> a character or '.' [or . followed by a number] (non-syntactic names). + +### 2. How much memory does `y` occupy? + +```{r} +x <- runif(1e6) +y <- list(x, x, x) +``` + +Need to use the lobstr package: +```{r} +lobstr::obj_size(y) +``` + +> Note that if you look in the RStudio Environment or use R base `object.size()` +> you actually get a value of 24 MB + +```{r} +object.size(y) +``` + +### 3. On which line does `a` get copied in the following example? +```{r} +a <- c(1, 5, 3, 2) +b <- a +b[[1]] <- 10 +``` + +> Not until `b` is modified, the third line + +## Binding basics + +- Create values and *bind* a name to them +- Names have values (rather than values have names) +- Multiple names can refer to the same values +- We can look at an object's address to keep track of the values independent of their names + +```{r} +x <- c(1, 2, 3) +y <- x +obj_addr(x) +obj_addr(y) +``` + + +### Exercises + +##### 1. Explain the relationships +```{r} +a <- 1:10 +b <- a +c <- b +d <- 1:10 +``` + +> `a` `b` and `c` are all names that refer to the first value `1:10` +> +> `d` is a name that refers to the *second* value of `1:10`. + + +##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()` +```{r} +obj_addr(mean) +obj_addr(base::mean) +obj_addr(get("mean")) +obj_addr(evalq(mean)) +obj_addr(match.fun("mean")) +``` + +> Yes! + +## Copy-on-modify + +- If you modify a value bound to multiple names, it is 'copy-on-modify' +- If you modify a value bound to a single name, it is 'modify-in-place' +- Use `tracemem()` to see when a name's value changes + +```{r} +x <- c(1, 2, 3) +cat(tracemem(x), "\n") +``` + +```{r} +y <- x +y[[3]] <- 4L # Changes (copy-on-modify) +y[[3]] <- 5L # Doesn't change (modify-in-place) +``` + +Turn off `tracemem()` with `untracemem()` + +> Can also use `ref(x)` to get the address of the value bound to a given name + + +## Functions + +- Copying also applies within functions +- If you copy (but don't modify) `x` within `f()`, no copy is made + +```{r} +f <- function(a) { + a +} + +x <- c(1, 2, 3) +z <- f(x) # No change in value + +ref(x) +ref(z) +``` + +<!-- ![](images/02-trace.png) --> + +## Lists + +- A list overall, has it's own reference (id) +- List *elements* also each point to other values +- List doesn't store the value, it *stores a reference to the value* +- As of R 3.1.0, modifying lists creates a *shallow copy* + - References (bindings) are copied, but *values are not* + +```{r} +l1 <- list(1, 2, 3) +l2 <- l1 +l2[[3]] <- 4 +``` + +- We can use `ref()` to see how they compare + - See how the list reference is different + - But first two items in each list are the same + +```{r} +ref(l1, l2) +``` + +![](images/02-l-modify-2.png){width=50%} + +## Data Frames + +- Data frames are lists of vectors +- So copying and modifying a column *only affects that column* +- **BUT** if you modify a *row*, every column must be copied + +```{r} +d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3)) +d2 <- d1 +d3 <- d1 +``` + +Only the modified column changes +```{r} +d2[, 2] <- d2[, 2] * 2 +ref(d1, d2) +``` + +All columns change +```{r} +d3[1, ] <- d3[1, ] * 3 +ref(d1, d3) +``` + +## Character vectors + +- R has a **global string pool** +- Elements of character vectors point to unique strings in the pool + +```{r} +x <- c("a", "a", "abc", "d") +``` + +![](images/02-character-2.png) + +## Exercises + +##### 1. Why is `tracemem(1:10)` not useful? + +> Because it tries to trace a value that is not bound to a name + +##### 2. Why are there two copies? +```{r} +x <- c(1L, 2L, 3L) +tracemem(x) +x[[3]] <- 4 +``` + +> Because we convert an *integer* vector (using 1L, etc.) to a *double* vector (using just 4)- + +##### 3. What is the relationships among these objects? + +```{r} +a <- 1:10 +b <- list(a, a) +c <- list(b, a, 1:10) # +``` + +a <- obj 1 +b <- obj 1, obj 1 +c <- b(obj 1, obj 1), obj 1, 1:10 + +```{r} +ref(c) +``` + + +##### 4. What happens here? +```{r} +x <- list(1:10) +x[[2]] <- x +``` + +- `x` is a list +- `x[[2]] <- x` creates a new list, which in turn contains a reference to the + original list +- `x` is no longer bound to `list(1:10)` + +```{r} +ref(x) +``` + +![](images/02-copy_on_modify_fig2.png){width=50%} + +## Object Size + +- Use `lobstr::obj_size()` +- Lists may be smaller than expected because of referencing the same value +- Strings may be smaller than expected because using global string pool +- Difficult to predict how big something will be + - Can only add sizes together if they share no references in common + +### Alternative Representation +- As of R 3.5.0 - ALTREP +- Represent some vectors compactly + - e.g., 1:1000 - not 10,000 values, just 1 and 1,000 + +### Exercises + +##### 1. Why are the sizes so different? + +```{r} +y <- rep(list(runif(1e4)), 100) + +object.size(y) # ~8000 kB +obj_size(y) # ~80 kB +``` + +> From `?object.size()`: +> +> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example. + +##### 2. Why is the size misleading? + +```{r} +funs <- list(mean, sd, var) +obj_size(funs) +``` + +> Because they reference functions from base and stats, which are always available. +> Why bother looking at the size? What use is that? + +##### 3. Predict the sizes + +```{r} +a <- runif(1e6) # 8 MB +obj_size(a) +``` + + +```{r} +b <- list(a, a) +``` + +- There is one value ~8MB +- `a` and `b[[1]]` and `b[[2]]` all point to the same value. + +```{r} +obj_size(b) +obj_size(a, b) +``` + + +```{r} +b[[1]][[1]] <- 10 +``` +- Now there are two values ~8MB each (16MB total) +- `a` and `b[[2]]` point to the same value (8MB) +- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed + +```{r} +obj_size(b) # 16 MB (two values, two element references) +obj_size(a, b) # 16 MB (a & b[[2]] point to the same value) +``` + + +```{r} +b[[2]][[1]] <- 10 +``` +- Finally, now there are three values ~8MB each (24MB total) +- Although `b[[1]]` and `b[[2]]` have the same contents, + they are not references to the same object. + +```{r} +obj_size(b) +obj_size(a, b) +``` + + +## Modify-in-place + +- Modifying usually creates a copy except for + - Objects with a single binding (performance optimization) + - Environments (special) + +### Objects with a single binding + +- Hard to know if copy will occur +- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one) +- May make a copy even if there's only one binding left +- Using a function makes a reference to it **unless it's a function based on C** +- Best to use `tracemem()` to check rather than guess. + + +#### Example - lists vs. data frames in for loop + +**Setup** + +Create the data to modify +```{r} +x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) +medians <- vapply(x, median, numeric(1)) +``` + + +**Data frame - Copied every time!** +```{r} +cat(tracemem(x), "\n") +for (i in seq_along(medians)) { + x[[i]] <- x[[i]] - medians[[i]] +} +untracemem(x) +``` + +**List (uses internal C code) - Copied once!** +```{r} +y <- as.list(x) + +cat(tracemem(y), "\n") +for (i in seq_along(medians)) { + y[[i]] <- y[[i]] - medians[[i]] +} +untracemem(y) +``` + +#### Benchmark this (Exercise #2) + +**First wrap in a function** +```{r} +med <- function(d, medians) { + for (i in seq_along(medians)) { + d[[i]] <- d[[i]] - medians[[i]] + } +} +``` + +**Try with 5 columns** +```{r} +x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) +medians <- vapply(x, median, numeric(1)) +y <- as.list(x) + +bench::mark( + "data.frame" = med(x, medians), + "list" = med(y, medians) +) +``` + +**Try with 20 columns** +```{r} +x <- data.frame(matrix(runif(5 * 1e4), ncol = 20)) +medians <- vapply(x, median, numeric(1)) +y <- as.list(x) + +bench::mark( + "data.frame" = med(x, medians), + "list" = med(y, medians) +) +``` + +**WOW!** + + +### Environmments +- Always modified in place (**reference semantics**) +- Interesting because if you modify the environment, all existing bindings have the same reference +- If two names point to the same environment, and you update one, you update both! + +```{r} +e1 <- rlang::env(a = 1, b = 2, c = 3) +e2 <- e1 +e1$c <- 4 +e2$c +``` + +- This means that environments can contain themselves (!) + +### Exercises + +##### 1. Why isn't this circular? +```{r} +x <- list() +x[[1]] <- x +``` + +> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second. + +##### 2. (see "Objects with a single binding") + +##### 3. What happens if you attempt to use tracemem() on an environment? + +```{r} +#| error: true +e1 <- rlang::env(a = 1, b = 2, c = 3) +tracemem(e1) +``` + +> Because environments always modified in place, there's no point in tracing them + + +## Unbinding and the garbage collector + +- If you delete the 'name' bound to an object, the object still exists +- R runs a "garbage collector" (GC) to remove these objects when it needs more memory +- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try." +- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed +- You can force GC with `gc()` but you never need to to use more memory *within* R +- Only reason to do so is to free memory for other system software, or, to get the +message printed about how much memory is being used + +```{r} +gc() +mem_used() +``` + +- These numbers will **not** be what you OS tells you because, + 1. It includes objects created by R, but not R interpreter + 2. R and OS are lazy and don't reclaim/release memory until it's needed + 3. R counts memory from objects, but there are gaps due to those that are deleted -> + *memory fragmentation* [less memory actually available they you might think] + + +## Meeting Videos + +### Cohort 1 + +(no video recorded) + +### Cohort 2 + +`r knitr::include_url("https://www.youtube.com/embed/pCiNj2JRK50")` + +### Cohort 3 + +`r knitr::include_url("https://www.youtube.com/embed/-bEXdOoxO_E")` + +### Cohort 4 + +`r knitr::include_url("https://www.youtube.com/embed/gcVU_F-L6zY")` + +### Cohort 5 + +`r knitr::include_url("https://www.youtube.com/embed/aqcvKox9V0Q")` + +### Cohort 6 + +`r knitr::include_url("https://www.youtube.com/embed/O4Oo_qO7SIY")` + +<details> +<summary> Meeting chat log </summary> + +``` +00:16:57 Federica Gazzelloni: cohort 2 video: https://www.youtube.com/watch?v=pCiNj2JRK50 +00:18:39 Federica Gazzelloni: cohort 2 presentation: https://r4ds.github.io/bookclub-Advanced_R/Presentations/Week02/Cohort2_America/Chapter2Slides.html#1 +00:40:24 Arthur Shaw: Just the opposite, Ryan. Very clear presentation! +00:51:54 Trevin: parquet? +00:53:00 Arthur Shaw: We may all be right. {arrow} looks to deal with feather and parquet files: https://arrow.apache.org/docs/r/ +01:00:04 Arthur Shaw: Some questions for future meetings. (1) I find Ryan's use of slides hugely effective in conveying information. Would it be OK if future sessions (optionally) used slides? If so, should/could we commit slides to some folder on the repo? (2) I think reusing the images from Hadley's books really helps understanding and discussion. Is that OK to do? Here I'm thinking about copyright concerns. (If possible, I would rather not redraw variants of Hadley's images.) +01:01:35 Federica Gazzelloni: It's all ok, you can use past presentation, you don't need to push them to the repo, you can use the images from the book +01:07:19 Federica Gazzelloni: Can I use: gc(reset = TRUE) safely? +``` +</details> + +### Cohort 7 + +`r knitr::include_url("https://www.youtube.com/embed/kpAUoGO6elE")` + +<details> + +<summary>Meeting chat log</summary> +``` +00:09:40 Ryan Honomichl: https://drdoane.com/three-deep-truths-about-r/ +00:12:51 Robert Hilly: Be right back +00:36:12 Ryan Honomichl: brb +00:41:18 Ron: I tried mapply and also got different answers +00:41:44 collinberke: Interesting, would like to know more what is going on. +00:49:57 Robert Hilly: simple_map <- function(x, f, ...) { + out <- vector("list", length(x)) + for (i in seq_along(x)) { + out[[i]] <- f(x[[i]], ...) + } + out +} +``` +</details> diff --git a/slides/02_Names_and_values.Rmd b/slides/02_Names_and_values.Rmd @@ -1,543 +0,0 @@ -# Names and values - -**Learning objectives:** - -- To be able to understand distinction between an *object* and its *name* -- With this knowledge, to be able write faster code using less memory -- To better understand R's functional programming tools - -Using lobstr package here. -```{r} -library(lobstr) -``` - - -### Quiz {-} - -##### 1. How do I create a new column called `3` that contains the sum of `1` and `2`? {-} - -```{r} -df <- data.frame(runif(3), runif(3)) -names(df) <- c(1, 2) -df -``` - -```{r} -df$`3` <- df$`1` + df$`2` -df -``` - -**What makes these names challenging?** - -> You need to use backticks (`) when the name of an object doesn't start with a -> a character or '.' [or . followed by a number] (non-syntactic names). - -##### 2. How much memory does `y` occupy? {-} - -```{r} -x <- runif(1e6) -y <- list(x, x, x) -``` - -Need to use the lobstr package: -```{r} -lobstr::obj_size(y) -``` - -> Note that if you look in the RStudio Environment or use R base `object.size()` -> you actually get a value of 24 MB - -```{r} -object.size(y) -``` - -##### 3. On which line does `a` get copied in the following example? {-} -```{r} -a <- c(1, 5, 3, 2) -b <- a -b[[1]] <- 10 -``` - -> Not until `b` is modified, the third line - -## Binding basics {-} - -- Create values and *bind* a name to them -- Names have values (rather than values have names) -- Multiple names can refer to the same values -- We can look at an object's address to keep track of the values independent of their names - -```{r} -x <- c(1, 2, 3) -y <- x -obj_addr(x) -obj_addr(y) -``` - - -### Exercises {-} - -##### 1. Explain the relationships {-} -```{r} -a <- 1:10 -b <- a -c <- b -d <- 1:10 -``` - -> `a` `b` and `c` are all names that refer to the first value `1:10` -> -> `d` is a name that refers to the *second* value of `1:10`. - - -##### 2. Do the following all point to the same underlying function object? hint: `lobstr::obj_addr()` {-} -```{r} -obj_addr(mean) -obj_addr(base::mean) -obj_addr(get("mean")) -obj_addr(evalq(mean)) -obj_addr(match.fun("mean")) -``` - -> Yes! - -## Copy-on-modify {-} - -- If you modify a value bound to multiple names, it is 'copy-on-modify' -- If you modify a value bound to a single name, it is 'modify-in-place' -- Use `tracemem()` to see when a name's value changes - -```{r} -x <- c(1, 2, 3) -cat(tracemem(x), "\n") -``` - -```{r} -y <- x -y[[3]] <- 4L # Changes (copy-on-modify) -y[[3]] <- 5L # Doesn't change (modify-in-place) -``` - -Turn off `tracemem()` with `untracemem()` - -> Can also use `ref(x)` to get the address of the value bound to a given name - - -## Functions {-} - -- Copying also applies within functions -- If you copy (but don't modify) `x` within `f()`, no copy is made - -```{r} -f <- function(a) { - a -} - -x <- c(1, 2, 3) -z <- f(x) # No change in value - -ref(x) -ref(z) -``` - -<!-- ![](images/02-trace.png) --> - -## Lists {-} - -- A list overall, has it's own reference (id) -- List *elements* also each point to other values -- List doesn't store the value, it *stores a reference to the value* -- As of R 3.1.0, modifying lists creates a *shallow copy* - - References (bindings) are copied, but *values are not* - -```{r} -l1 <- list(1, 2, 3) -l2 <- l1 -l2[[3]] <- 4 -``` - -- We can use `ref()` to see how they compare - - See how the list reference is different - - But first two items in each list are the same - -```{r} -ref(l1, l2) -``` - -![](images/02-l-modify-2.png){width=50%} - -## Data Frames {-} - -- Data frames are lists of vectors -- So copying and modifying a column *only affects that column* -- **BUT** if you modify a *row*, every column must be copied - -```{r} -d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3)) -d2 <- d1 -d3 <- d1 -``` - -Only the modified column changes -```{r} -d2[, 2] <- d2[, 2] * 2 -ref(d1, d2) -``` - -All columns change -```{r} -d3[1, ] <- d3[1, ] * 3 -ref(d1, d3) -``` - -## Character vectors {-} - -- R has a **global string pool** -- Elements of character vectors point to unique strings in the pool - -```{r} -x <- c("a", "a", "abc", "d") -``` - -![](images/02-character-2.png) - -## Exercises {-} - -##### 1. Why is `tracemem(1:10)` not useful? {-} - -> Because it tries to trace a value that is not bound to a name - -##### 2. Why are there two copies? {-} -```{r} -x <- c(1L, 2L, 3L) -tracemem(x) -x[[3]] <- 4 -``` - -> Because we convert an *integer* vector (using 1L, etc.) to a *double* vector (using just 4)- - -##### 3. What is the relationships among these objects? {-} - -```{r} -a <- 1:10 -b <- list(a, a) -c <- list(b, a, 1:10) # -``` - -a <- obj 1 -b <- obj 1, obj 1 -c <- b(obj 1, obj 1), obj 1, 1:10 - -```{r} -ref(c) -``` - - -##### 4. What happens here? {-} -```{r} -x <- list(1:10) -x[[2]] <- x -``` - -- `x` is a list -- `x[[2]] <- x` creates a new list, which in turn contains a reference to the - original list -- `x` is no longer bound to `list(1:10)` - -```{r} -ref(x) -``` - -![](images/02-copy_on_modify_fig2.png){width=50%} - -## Object Size {-} - -- Use `lobstr::obj_size()` -- Lists may be smaller than expected because of referencing the same value -- Strings may be smaller than expected because using global string pool -- Difficult to predict how big something will be - - Can only add sizes together if they share no references in common - -### Alternative Representation {-} -- As of R 3.5.0 - ALTREP -- Represent some vectors compactly - - e.g., 1:1000 - not 10,000 values, just 1 and 1,000 - -### Exercises {-} - -##### 1. Why are the sizes so different? {-} - -```{r} -y <- rep(list(runif(1e4)), 100) - -object.size(y) # ~8000 kB -obj_size(y) # ~80 kB -``` - -> From `?object.size()`: -> -> "This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but **does not detect if elements of a list are shared**, for example. - -##### 2. Why is the size misleading? {-} - -```{r} -funs <- list(mean, sd, var) -obj_size(funs) -``` - -> Because they reference functions from base and stats, which are always available. -> Why bother looking at the size? What use is that? - -##### 3. Predict the sizes {-} - -```{r} -a <- runif(1e6) # 8 MB -obj_size(a) -``` - - -```{r} -b <- list(a, a) -``` - -- There is one value ~8MB -- `a` and `b[[1]]` and `b[[2]]` all point to the same value. - -```{r} -obj_size(b) -obj_size(a, b) -``` - - -```{r} -b[[1]][[1]] <- 10 -``` -- Now there are two values ~8MB each (16MB total) -- `a` and `b[[2]]` point to the same value (8MB) -- `b[[1]]` is new (8MB) because the first element (`b[[1]][[1]]`) has been changed - -```{r} -obj_size(b) # 16 MB (two values, two element references) -obj_size(a, b) # 16 MB (a & b[[2]] point to the same value) -``` - - -```{r} -b[[2]][[1]] <- 10 -``` -- Finally, now there are three values ~8MB each (24MB total) -- Although `b[[1]]` and `b[[2]]` have the same contents, - they are not references to the same object. - -```{r} -obj_size(b) -obj_size(a, b) -``` - - -## Modify-in-place {-} - -- Modifying usually creates a copy except for - - Objects with a single binding (performance optimization) - - Environments (special) - -### Objects with a single binding {-} - -- Hard to know if copy will occur -- If you have 2+ bindings and remove them, R can't follow how many are removed (so will always think there are more than one) -- May make a copy even if there's only one binding left -- Using a function makes a reference to it **unless it's a function based on C** -- Best to use `tracemem()` to check rather than guess. - - -#### Example - lists vs. data frames in for loop {-} - -**Setup** - -Create the data to modify -```{r} -x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) -medians <- vapply(x, median, numeric(1)) -``` - - -**Data frame - Copied every time!** -```{r} -cat(tracemem(x), "\n") -for (i in seq_along(medians)) { - x[[i]] <- x[[i]] - medians[[i]] -} -untracemem(x) -``` - -**List (uses internal C code) - Copied once!** -```{r} -y <- as.list(x) - -cat(tracemem(y), "\n") -for (i in seq_along(medians)) { - y[[i]] <- y[[i]] - medians[[i]] -} -untracemem(y) -``` - -#### Benchmark this (Exercise #2) {-} - -**First wrap in a function** -```{r} -med <- function(d, medians) { - for (i in seq_along(medians)) { - d[[i]] <- d[[i]] - medians[[i]] - } -} -``` - -**Try with 5 columns** -```{r} -x <- data.frame(matrix(runif(5 * 1e4), ncol = 5)) -medians <- vapply(x, median, numeric(1)) -y <- as.list(x) - -bench::mark( - "data.frame" = med(x, medians), - "list" = med(y, medians) -) -``` - -**Try with 20 columns** -```{r} -x <- data.frame(matrix(runif(5 * 1e4), ncol = 20)) -medians <- vapply(x, median, numeric(1)) -y <- as.list(x) - -bench::mark( - "data.frame" = med(x, medians), - "list" = med(y, medians) -) -``` - -**WOW!** - - -### Environmments {-} -- Always modified in place (**reference semantics**) -- Interesting because if you modify the environment, all existing bindings have the same reference -- If two names point to the same environment, and you update one, you update both! - -```{r} -e1 <- rlang::env(a = 1, b = 2, c = 3) -e2 <- e1 -e1$c <- 4 -e2$c -``` - -- This means that environments can contain themselves (!) - -### Exercises {-} - -##### 1. Why isn't this circular? {-} -```{r} -x <- list() -x[[1]] <- x -``` - -> Because the binding to the list() object moves from `x` in the first line to `x[[1]]` in the second. - -##### 2. (see "Objects with a single binding") {-} - -##### 3. What happens if you attempt to use tracemem() on an environment? {-} - -```{r} -#| error: true -e1 <- rlang::env(a = 1, b = 2, c = 3) -tracemem(e1) -``` - -> Because environments always modified in place, there's no point in tracing them - - -## Unbinding and the garbage collector {-} - -- If you delete the 'name' bound to an object, the object still exists -- R runs a "garbage collector" (GC) to remove these objects when it needs more memory -- "Looking from the outside, it’s basically impossible to predict when the GC will run. In fact, you shouldn’t even try." -- If you want to know when it runs, use `gcinfo(TRUE)` to get a message printed -- You can force GC with `gc()` but you never need to to use more memory *within* R -- Only reason to do so is to free memory for other system software, or, to get the -message printed about how much memory is being used - -```{r} -gc() -mem_used() -``` - -- These numbers will **not** be what you OS tells you because, - 1. It includes objects created by R, but not R interpreter - 2. R and OS are lazy and don't reclaim/release memory until it's needed - 3. R counts memory from objects, but there are gaps due to those that are deleted -> - *memory fragmentation* [less memory actually available they you might think] - - -## Meeting Videos {-} - -### Cohort 1 {-} - -(no video recorded) - -### Cohort 2 {-} - -`r knitr::include_url("https://www.youtube.com/embed/pCiNj2JRK50")` - -### Cohort 3 {-} - -`r knitr::include_url("https://www.youtube.com/embed/-bEXdOoxO_E")` - -### Cohort 4 {-} - -`r knitr::include_url("https://www.youtube.com/embed/gcVU_F-L6zY")` - -### Cohort 5 {-} - -`r knitr::include_url("https://www.youtube.com/embed/aqcvKox9V0Q")` - -### Cohort 6 {-} - -`r knitr::include_url("https://www.youtube.com/embed/O4Oo_qO7SIY")` - -<details> -<summary> Meeting chat log </summary> - -``` -00:16:57 Federica Gazzelloni: cohort 2 video: https://www.youtube.com/watch?v=pCiNj2JRK50 -00:18:39 Federica Gazzelloni: cohort 2 presentation: https://r4ds.github.io/bookclub-Advanced_R/Presentations/Week02/Cohort2_America/Chapter2Slides.html#1 -00:40:24 Arthur Shaw: Just the opposite, Ryan. Very clear presentation! -00:51:54 Trevin: parquet? -00:53:00 Arthur Shaw: We may all be right. {arrow} looks to deal with feather and parquet files: https://arrow.apache.org/docs/r/ -01:00:04 Arthur Shaw: Some questions for future meetings. (1) I find Ryan's use of slides hugely effective in conveying information. Would it be OK if future sessions (optionally) used slides? If so, should/could we commit slides to some folder on the repo? (2) I think reusing the images from Hadley's books really helps understanding and discussion. Is that OK to do? Here I'm thinking about copyright concerns. (If possible, I would rather not redraw variants of Hadley's images.) -01:01:35 Federica Gazzelloni: It's all ok, you can use past presentation, you don't need to push them to the repo, you can use the images from the book -01:07:19 Federica Gazzelloni: Can I use: gc(reset = TRUE) safely? -``` -</details> - -### Cohort 7 {-} - -`r knitr::include_url("https://www.youtube.com/embed/kpAUoGO6elE")` - -<details> - -<summary>Meeting chat log</summary> -``` -00:09:40 Ryan Honomichl: https://drdoane.com/three-deep-truths-about-r/ -00:12:51 Robert Hilly: Be right back -00:36:12 Ryan Honomichl: brb -00:41:18 Ron: I tried mapply and also got different answers -00:41:44 collinberke: Interesting, would like to know more what is going on. -00:49:57 Robert Hilly: simple_map <- function(x, f, ...) { - out <- vector("list", length(x)) - for (i in seq_along(x)) { - out[[i]] <- f(x[[i]], ...) - } - out -} -``` -</details> diff --git a/styles.css b/styles.css @@ -0,0 +1,3 @@ +#listing-videos .listing-item-img-placeholder { + display: none; +} diff --git a/videos/01.qmd b/videos/01.qmd @@ -1,67 +0,0 @@ ---- -title: Meeting Videos ---- - -## Cohort 1 - -(no video recorded) - -## Cohort 2 - -{{< video https://www.youtube.com/embed/PCG52lU_YlA >}} - -## Cohort 3 - -{{< video https://www.youtube.com/embed/f6PuOnuZWBc >}} - -## Cohort 4 - -{{< video https://www.youtube.com/embed/qDaJvX-Mpls >}} - -## Cohort 5 - -{{< video https://www.youtube.com/embed/BvmiQlWOP5o >}} - -## Cohort 6 - -{{< video https://www.youtube.com/embed/dH72riiXrVI >}} - -<details> -<summary> Meeting chat log </summary> - -``` -00:14:40 SriRam: From Toronto, Civil Engineer. I use R for infrastructure planning/ GIS. Here coz of the ping 😄 , was not ready with a good computer with mic/audio ! -00:15:20 SriRam: I was with Ryan, Federica on other courses -00:23:21 SriRam: I think the only caution is about Copyright issues -00:31:32 Ryan Metcalf: Citation, giving credit back to source. Great comment SriRam. -00:34:33 SriRam: one = one, in my opinion -00:41:53 Ryan Metcalf: https://docs.google.com/spreadsheets/d/1_WFY82UxAdvP4GUdZ2luh15quwdO1n0Km3Q0tfYuqvc/edit#gid=0 -00:48:35 Arthur Shaw: The README has a nice step-by-step process at the bottom: https://github.com/r4ds/bookclub-advr#how-to-present. I've not done this myself yet, but it looks fairly straightforward. -00:54:13 lucus w: Thanks Ryan. Probably {usethis} will be easier. It looks straight forward -01:00:02 Moria W.: Thank you for sharing that. This has been good! -01:00:08 Vaibhav Janve: Thank you -01:00:44 Federica Gazzelloni: hi SriRam we are going.. -``` -</details> - -## Cohort 7 - -{{< video https://www.youtube.com/embed/vfTg6upHvO4 >}} -{{< video https://www.youtube.com/embed/3wRyE6-3OKQ >}} - -<details> - -<summary>Meeting chat log</summary> -``` -00:20:42 collinberke: https://rich-iannone.github.io/pointblank/ -00:27:36 Ryan Honomichl: brb -00:37:05 collinberke: https://rstudio.github.io/renv/articles/renv.html -00:51:52 Ryan Honomichl: gotta sign off I'll be ready to lead chapter 2 next week! -00:52:43 collinberke: https://r4ds.had.co.nz/iteration.html -00:59:44 collinberke: https://mastering-shiny.org/action-tidy.html -01:00:12 collinberke: https://dplyr.tidyverse.org/articles/programming.html -01:05:02 collinberke: https://usethis.r-lib.org/reference/create_from_github.html -01:05:53 collinberke: https://github.com/r4ds/bookclub-advr -01:06:28 Ron: I gotta run , fun conversation, and nice to meet you Matthew ! -``` -</details> diff --git a/videos/01/02.qmd b/videos/01/02.qmd @@ -0,0 +1,5 @@ +--- +title: Cohort 2 +--- + +{{< video https://www.youtube.com/embed/PCG52lU_YlA >}} diff --git a/videos/01/03.qmd b/videos/01/03.qmd @@ -0,0 +1,5 @@ +--- +title: Cohort 3 +--- + +{{< video https://www.youtube.com/embed/f6PuOnuZWBc >}} diff --git a/videos/01/04.qmd b/videos/01/04.qmd @@ -0,0 +1,5 @@ +--- +title: Cohort 4 +--- + +{{< video https://www.youtube.com/embed/qDaJvX-Mpls >}} diff --git a/videos/01/05.qmd b/videos/01/05.qmd @@ -0,0 +1,5 @@ +--- +title: Cohort 5 +--- + +{{< video https://www.youtube.com/embed/BvmiQlWOP5o >}} diff --git a/videos/01/06.qmd b/videos/01/06.qmd @@ -0,0 +1,23 @@ +--- +title: Cohort 6 +--- + +{{< video https://www.youtube.com/embed/dH72riiXrVI >}} + +<details> +<summary> Meeting chat log </summary> + +``` +00:14:40 SriRam: From Toronto, Civil Engineer. I use R for infrastructure planning/ GIS. Here coz of the ping 😄 , was not ready with a good computer with mic/audio ! +00:15:20 SriRam: I was with Ryan, Federica on other courses +00:23:21 SriRam: I think the only caution is about Copyright issues +00:31:32 Ryan Metcalf: Citation, giving credit back to source. Great comment SriRam. +00:34:33 SriRam: one = one, in my opinion +00:41:53 Ryan Metcalf: https://docs.google.com/spreadsheets/d/1_WFY82UxAdvP4GUdZ2luh15quwdO1n0Km3Q0tfYuqvc/edit#gid=0 +00:48:35 Arthur Shaw: The README has a nice step-by-step process at the bottom: https://github.com/r4ds/bookclub-advr#how-to-present. I've not done this myself yet, but it looks fairly straightforward. +00:54:13 lucus w: Thanks Ryan. Probably {usethis} will be easier. It looks straight forward +01:00:02 Moria W.: Thank you for sharing that. This has been good! +01:00:08 Vaibhav Janve: Thank you +01:00:44 Federica Gazzelloni: hi SriRam we are going.. +``` +</details> diff --git a/videos/01/07.qmd b/videos/01/07.qmd @@ -0,0 +1,23 @@ +--- +title: Cohort 7 +--- + +{{< video https://www.youtube.com/embed/vfTg6upHvO4 >}} +{{< video https://www.youtube.com/embed/3wRyE6-3OKQ >}} + +<details> + +<summary>Meeting chat log</summary> +``` +00:20:42 collinberke: https://rich-iannone.github.io/pointblank/ +00:27:36 Ryan Honomichl: brb +00:37:05 collinberke: https://rstudio.github.io/renv/articles/renv.html +00:51:52 Ryan Honomichl: gotta sign off I'll be ready to lead chapter 2 next week! +00:52:43 collinberke: https://r4ds.had.co.nz/iteration.html +00:59:44 collinberke: https://mastering-shiny.org/action-tidy.html +01:00:12 collinberke: https://dplyr.tidyverse.org/articles/programming.html +01:05:02 collinberke: https://usethis.r-lib.org/reference/create_from_github.html +01:05:53 collinberke: https://github.com/r4ds/bookclub-advr +01:06:28 Ron: I gotta run , fun conversation, and nice to meet you Matthew ! +``` +</details>