bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git

commit cc607fe9afa0887d0c24fc4cd952e322354194e7
parent ddee529ad1dc157a6ae90c16f8fdbe9d7bd42643
Author: Trevin Flickinger <twflick@gmail.com>
Date:   Wed, 15 Jun 2022 16:41:15 -0400

subsetting chapter cohort 6 (#12)


Diffstat:
M03_Vectors.Rmd | 217+++++++++++++++++++++++++++++++++++++++----------------------------------------
M04_Subsetting.Rmd | 405++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
MDESCRIPTION | 3++-
Aimages/subsetting/hadley-tweet.png | 0
Aimages/subsetting/train-1.png | 0
Aimages/subsetting/train-2.png | 0
Aimages/subsetting/train-3.png | 0
7 files changed, 509 insertions(+), 116 deletions(-)

diff --git a/03_Vectors.Rmd b/03_Vectors.Rmd @@ -2,29 +2,27 @@ **Learning objectives:** -- Learn about different types of vectors -- Learn how these types relate to one another +- Learn about different types of vectors +- Learn how these types relate to one another ## Types of vectors The family tree of vectors: -![](images/vectors/summary-tree.png) -Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham +![](images/vectors/summary-tree.png) Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham -- **Atomic.** Elements all the same type. -- **List.** Elements are different Types. -- **NULL** Null elements. Length zero. +- **Atomic.** Elements all the same type. +- **List.** Elements are different Types. +- **NULL** Null elements. Length zero. ## Atomic vectors ### Types -- The vector family tree revisited. -- Meet the children of atomic vectors +- The vector family tree revisited. +- Meet the children of atomic vectors -![](images/vectors/summary-tree-atomic.png) -Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham +![](images/vectors/summary-tree-atomic.png) Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham ### Length one @@ -73,7 +71,6 @@ lgl_vec <- c(TRUE, FALSE) ``` - **2. With other vectors** ```{r long_vec} @@ -84,19 +81,19 @@ c(c(1, 2), c(3, 4)) `{rlang}` has [vector constructor functions too](https://rlang.r-lib.org/reference/vector-construction.html): -- `rlang::lgl(...)` -- `rlang::int(...)` -- `rlang::dbl(...)` -- `rlang::chr(...)` +- `rlang::lgl(...)` +- `rlang::int(...)` +- `rlang::dbl(...)` +- `rlang::chr(...)` They look to do both more and less than `c()`. 
-- More: - - Enforce type - - Splice lists - - More types: `rlang::bytes()`, `rlang::cpl(...)` -- Less: - - Stricter rules on names +- More: + - Enforce type + - Splice lists + - More types: `rlang::bytes()`, `rlang::cpl(...)` +- Less: + - Stricter rules on names Note: currently has `questioning` lifecycle badge, since these constructors may get moved to `vctrs` @@ -115,14 +112,15 @@ sum(c(1, 2, NA, 3)) sum(c(1, 2, NA, 3), na.rm = TRUE) ``` + **Types** Each type has its own NA type -- Logical: `NA` -- Integer: `NA_integer_` -- Double: `NA_real_` -- Character: `NA_character_` +- Logical: `NA` +- Integer: `NA_integer_` +- Double: `NA_real_` +- Character: `NA_character_` This may not matter in many contexts. @@ -134,23 +132,23 @@ But this does matter for operations where types matter like `dplyr::if_else()`. Test data type: -- Logical: `is.logical()` -- Integer: `is.integer()` -- Double: `is.double()` -- Character: `is.character()` +- Logical: `is.logical()` +- Integer: `is.integer()` +- Double: `is.double()` +- Character: `is.character()` **What type of object is it?** Don't test objects with these tools: -- `is.vector()` -- `is.atomic()` -- `is.numeric()` +- `is.vector()` +- `is.atomic()` +- `is.numeric()` Instead, maybe, use `{rlang}` -- `rlang::is_vector` -- `rlang::is_atomic` +- `rlang::is_vector` +- `rlang::is_atomic` ```{r test_rlang} # vector @@ -163,7 +161,6 @@ rlang::is_atomic(list(1, "a")) ``` - See more [here](https://rlang.r-lib.org/reference/type-predicates.html) ### Coercion @@ -176,8 +173,8 @@ R can coerce either automatically or explicitly Two contexts for automatic coercion: -1. Combination -1. Mathematical +1. Combination +2. 
Mathematical Combination: @@ -199,15 +196,15 @@ sum(has_attribute) Use `as.*()` -- Logical: `as.logical()` -- Integer: `as.integer()` -- Double: `as.double()` -- Character: `as.character()` +- Logical: `as.logical()` +- Integer: `as.integer()` +- Double: `as.double()` +- Character: `as.character()` But note that coercions may fail in one of two ways, or both: -- With warning/error -- NAs +- With warning/error +- NAs ```{r coerce_error} as.integer(c(1, 2, "three")) @@ -215,16 +212,16 @@ as.integer(c(1, 2, "three")) ## Attributes -- What -- How -- Why +- What +- How +- Why ### What Two perspectives: -- Name-value pairs -- Metadata +- Name-value pairs +- Metadata **Name-value pairs** @@ -232,20 +229,20 @@ Formally, attributes have a name and a value. **Metadata** -- Not data itself -- But data about the data +- Not data itself +- But data about the data ### How Two operations: -1. Get -1. Set +1. Get +2. Set Two cases: -1. Single attribute -2. Multiple attributes +1. Single attribute +2. Multiple attributes **Single attribute** @@ -261,10 +258,10 @@ attr(x = a, which = "some_attribute_name") <- "some attribute" # get attribute attr(x = a, which = "some_attribute_name") ``` + **Multiple attributes** -To set multiple attributes, use `structure()` -To get multiple attributes, use `attributes()` +To set multiple attributes, use `structure()` To get multiple attributes, use `attributes()` ```{r attr_multiple} b <- c(4, 5, 6) @@ -284,8 +281,8 @@ str(attributes(b)) Two common use cases: -- Names -- Dimensions +- Names +- Dimensions **Names** @@ -326,27 +323,26 @@ matrix(1:6, nrow = 2, ncol = 3) ## S3 atomic vectors -- The vector family tree revisited. -- Meet the children of typed atomic vectors +- The vector family tree revisited. 
+- Meet the children of typed atomic vectors -![](images/vectors/summary-tree-s3-1.png) -Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham +![](images/vectors/summary-tree-s3-1.png) Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham -This list could (more easily) be expanded to new vector types with [`{vctrs}`](https://vctrs.r-lib.org/). See [rstudio::conf(2019) talk on the package around 18:27](https://www.rstudio.com/resources/rstudioconf-2019/vctrs-tools-for-making-size-and-type-consistent-functions/). See also [rstudio::conf(2020) talk on new vector types for dealing with non-decimal currencies](https://www.rstudio.com/resources/rstudioconf-2020/vctrs-creating-custom-vector-classes-with-the-vctrs-package/). +This list could (more easily) be expanded to new vector types with [`{vctrs}`](https://vctrs.r-lib.org/). See [rstudio::conf(2019) talk on the package around 18:27](https://www.rstudio.com/resources/rstudioconf-2019/vctrs-tools-for-making-size-and-type-consistent-functions/). See also [rstudio::conf(2020) talk on new vector types for dealing with non-decimal currencies](https://www.rstudio.com/resources/rstudioconf-2020/vctrs-creating-custom-vector-classes-with-the-vctrs-package/). What makes S3 atomic vectors different than their parents? Two things: -1. Class -2. Attributes (typically) +1. Class +2. 
Attributes (typically) ### Factors Factors are integer vectors with: -- Class: "factor" -- Attributes: "levels", or the set of allowed values +- Class: "factor" +- Attributes: "levels", or the set of allowed values ```{r factor} # Build a factor @@ -387,8 +383,8 @@ ordered_factor Dates are: -- Double vectors -- With class "Date" +- Double vectors +- With class "Date" The double component represents the number of days since `1970-01-01` @@ -406,14 +402,14 @@ attributes(notes_date) There are 2 Date-time representations in base R: -- POSIXct, where "ct" denotes calendar time -- POSIXlt, where "lt" designates local time. +- POSIXct, where "ct" denotes calendar time +- POSIXlt, where "lt" designates local time. Let's focus on POSIXct because: -- Simplest -- Built on an atomic vector -- Most apt to be in a data frame +- Simplest +- Built on an atomic vector +- Most apt to be in a data frame Let's now build and deconstruct a Date-time @@ -436,14 +432,13 @@ typeof(note_date_time) attributes(note_date_time) ``` - ### Durations Durations are: -- Double vectors -- Class: "difftime" -- Attributes: "units", or the unit of duration (e.g., weeks, hours, minutes, seconds, etc.) +- Double vectors +- Class: "difftime" +- Attributes: "units", or the unit of duration (e.g., weeks, hours, minutes, seconds, etc.) 
```{r durations} # Construct @@ -461,8 +456,8 @@ attributes(one_minute) See also: -- [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html) -- [`clock::date_time_build()`](https://clock.r-lib.org/reference/date_time_build.html) +- [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html) +- [`clock::date_time_build()`](https://clock.r-lib.org/reference/date_time_build.html) ## Lists @@ -492,6 +487,7 @@ typeof(simple_list) str(simple_list) ``` + Nested lists: ```{r list_nested} @@ -528,8 +524,8 @@ str(list_comb2) Check that is a list: -- `is.list()` -- `rlang::is_list()`` +- `is.list()` +- \`rlang::is_list()\`\` The two do the same, except that the latter can check for the number of elements @@ -546,26 +542,24 @@ rlang::is_list(x = list_comb2, n = 4) rlang::is_vector(list_comb2) ``` - ### Coercion ## Data frames and tibbles -- The vector family tree revisited. -- Meet the children of lists +- The vector family tree revisited. +- Meet the children of lists -![](images/vectors/summary-tree-s3-2.png) -Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham +![](images/vectors/summary-tree-s3-2.png) Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham ### Data frame A data frame is a: -- Named list of vectors (i.e., column names) -- Class: "data frame" -- Attributes: - - (column) `names` - - `row.names`` +- Named list of vectors (i.e., column names) +- Class: "data frame" +- Attributes: + - (column) `names` + - \`row.names\`\` ```{r data_frame} # Construct @@ -588,23 +582,22 @@ typeof(df) attributes(df) ``` - Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame). 
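The equal-length rule for data frame columns can be checked directly. The following is a minimal sketch using base R only; the names `df_ok`, `df_recycled`, and `res` are made up for illustration and do not appear in the chapter notes.

```r
# Columns of equal length build a data frame without complaint
df_ok <- data.frame(x = 1:3, y = c("a", "b", "c"))
nrow(df_ok)

# A length-1 column is recycled to match the longer column
df_recycled <- data.frame(x = 1:3, y = "a")
df_recycled$y

# Lengths that cannot be recycled evenly raise an error;
# try() captures it so the script keeps running
res <- try(data.frame(x = 1:3, y = c("a", "b")), silent = TRUE)
inherits(res, "try-error")
```

Note that `data.frame()` still recycles shorter columns when the lengths divide evenly, which is one of the behaviours tibbles restrict.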
### Tibble As compared to data frames, tibbles are data frames that are: -- Lazy -- Surly +- Lazy +- Surly #### Lazy Tibbles do not: -- Coerce strings -- Transform non-syntactic names -- Recycle vectors of length greater than 1 +- Coerce strings +- Transform non-syntactic names +- Recycle vectors of length greater than 1 **Coerce strings** @@ -663,13 +656,12 @@ tbl <- tibble::tibble( ) ``` - #### Surly Tibbles do only what they're asked and complain if what they're asked doesn't make sense: -- Subsetting always yields a tibble -- Complains if cannot find column +- Subsetting always yields a tibble +- Complains if cannot find column **Subsetting always yields a tibble** @@ -715,15 +707,15 @@ Whether tibble: `tibble::is_tibble`. Note: only tibbles are tibbles. Vanilla dat ### Coercion -- To data frame: `as.data.frame()` -- To tibble: `tibble::as_tibble()` +- To data frame: `as.data.frame()` +- To tibble: `tibble::as_tibble()` ## `NULL` Special type of object that: -- Length 0 -- Cannot have attributes +- Length 0 +- Cannot have attributes ```{r null, error=TRUE} typeof(NULL) @@ -736,7 +728,6 @@ x <- NULL attr(x, "y") <- 1 ``` - ## Meeting Videos ### Cohort 1 @@ -764,9 +755,13 @@ attr(x, "y") <- 1 `r knitr::include_url("https://www.youtube.com/embed/URL")` <details> -<summary> Meeting chat log </summary> -``` -LOG -``` +<summary> + +Meeting chat log + +</summary> + + LOG + </details> diff --git a/04_Subsetting.Rmd b/04_Subsetting.Rmd @@ -2,12 +2,406 @@ **Learning objectives:** -- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY +- Learn about the 6 ways to subset atomic vectors +- Learn about the 3 subsetting operators: `[[`, `[`, and `$` +- Learn how subsetting works with different vector types -## SLIDE 1 +## Selecting multiple elements + +### Atomic Vectors + +- 6 ways to subset atomic vectors + +Let's take a look with an example vector. 
+ +```{r atomic_vector} +x <- c(3.1, 2.2, 1.3, 4.4) +``` + +**Positive integers** + +```{r positive_int} +# return elements at specified positions +x[c(4, 1)] + +# duplicate indices return duplicate values +x[c(2, 2)] + +# real numbers truncate to integers +x[c(3.2, 3.8)] +``` + +**Negative integers** + +```{r, eval=FALSE} +### excludes elements at specified positions +# x[-c(1, 3)] # same as x[c(-1, -3)] + +### mixing positive and negative is a no-no +# x[c(-1, 3)] +``` + +**Logical Vectors** + +```{r logical_vec} +x[c(TRUE, TRUE, FALSE, TRUE)] + +x[x < 3] +``` + +- **Recycling rules** apply when subsetting this way: `x[y]` +- Easy to understand if x or y has length 1; best to avoid other lengths + +```{r missing} +# missing value in index will also return NA in output +x[c(NA, TRUE)] +``` + + +**Nothing** + +```{r nothing} +# returns the original vector +x[] +``` + +**Zero** + +```{r zero} +# returns a zero-length vector +x[0] +``` + +**Character vectors** + +```{r character} +# if the vector is named, character indices return the matching elements +(y <- setNames(x, letters[1:4])) + +y[c("d", "b", "a")] +``` + +### Lists + +- Subsetting works the same way +- `[` always returns a list, `[[` and `$` let you pull elements out of a list + +### Matrices and arrays + +You can subset higher dimensional structures in three ways: +- with multiple vectors +- with a single vector +- with a matrix + +```{r, eval=FALSE} +a <- matrix(1:9, nrow = 3) +colnames(a) <- c("A", "B", "C") +a[1:2, ] +#> A B C +#> [1,] 1 4 7 +#> [2,] 2 5 8 +a[c(TRUE, FALSE, TRUE), c("B", "A")] +#> B A +#> [1,] 4 1 +#> [2,] 6 3 +a[0, -2] +#> A C + +a[1, ] +#> A B C +#> 1 4 7 + +a[1, 1] +#> A +#> 1 +``` + +Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham + +Matrices and arrays are just special vectors; can subset with a single vector +(arrays in R are stored column-wise) + +```{r} +vals <- outer(1:5, 1:5, FUN = "paste", sep = ",") +vals + +vals[c(3, 15)] +``` + +### Data frames and tibbles + +Data frames act like 
lists and matrices +- single index -> list +- two indices -> matrix + +```{r penguins} +library(palmerpenguins) + +# single index +penguins[1:2] + +penguins[c("species","island")] + +# two indices +penguins[1:2, ] +``` + +Subsetting a tibble with `[` always returns a tibble + +### Preserving dimensionality + +- Data frames and tibbles behave differently +- tibble will default to preserve dimensionality, data frames do not +- this can lead to unexpected behavior and code breaking in the future + +Can use `drop = FALSE` when using a data frame or can use tibbles + +## Selecting a single element + +`[[` and `$` are used to extract single elements + +### `[[]]` + +```{r train} +x <- list(1:3, "a", 4:6) +``` + +![](images/subsetting/train-1.png) + +![](images/subsetting/train-2.png) + +![](images/subsetting/train-3.png) + +Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham + +![](images/subsetting/hadley-tweet.png) + +### `$` + +- `x$y` is equivalent to `x[["y"]]` + +the `$` operator doesn't work with a name stored in a variable + +```{r, eval=FALSE} +var <- "cyl" +# Doesn't work - mtcars$var translated to mtcars[["var"]] +mtcars$var +#> NULL + +# Instead use [[ +mtcars[[var]] +#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4 +``` + +`$` allows partial matching, `[[]]` does not + +```{r, eval=FALSE} +x <- list(abc = 1) +x$a +#> [1] 1 +x[["a"]] +#> NULL +``` + +Hadley advises to change Global settings: + +```{r, eval=FALSE} +options(warnPartialMatchDollar = TRUE) +x$a +#> Warning in x$a: partial match of 'a' to 'abc' +#> [1] 1 +``` + +tibbles don't have this behavior +```{r} +penguins$s +``` + +### missing and out of bound indices +- Due to the inconsistency of how R handles such indices, `purrr::pluck()` and `purrr::chuck()` are recommended +```{r, eval=FALSE} +x <- list( + a = list(1, 2, 3), + b = list(3, 4, 5) +) +purrr::pluck(x, "a", 1) +# [1] 1 +purrr::pluck(x, "c", 1) +# NULL +purrr::pluck(x, "c", 1, .default = NA) +# [1] NA +``` + 
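The inconsistency mentioned above can be seen with base R alone. This is a minimal sketch, not part of the cohort's notes: the objects `v` and `l` and the helper `fails()` are made up for illustration.

```r
v <- c(a = 1, b = 2)     # named atomic vector (hypothetical example data)
l <- list(a = 1, b = 2)  # named list (hypothetical example data)

# Helper for this sketch only: TRUE if evaluating the expression raises an error
fails <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

# `[` with an out-of-bounds index quietly returns a missing value
oob_single <- v[3]
is.na(oob_single)

# `[[` with an out-of-bounds integer index is an error for both types
fails(v[[3]])
fails(l[[3]])

# `[[` with an unknown name: error for the atomic vector, NULL for the list
fails(v[["z"]])
is.null(l[["z"]])
```

Because the result depends on both the container type and the index type, a single accessor with one documented behaviour (such as `purrr::pluck()` returning `NULL` or a default) is easier to reason about.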
+### `@` and `slot()` +- `@` is `$` for S4 objects (to be revisited in Chapter 15) + +- `slot()` is `[[ ]]` for S4 objects + +## Subsetting and Assignment + +- Subsetting can be combined with assignment to edit values + +```{r} +x <- c("Tigers", "Royals", "White Sox", "Twins", "Indians") + +x[5] <- "Guardians" + +x +``` + +- length of the subset and assignment vector should be the same to avoid recycling + +You can use NULL to remove a component + +```{r} +x <- list(a = 1, b = 2) +x[["b"]] <- NULL +str(x) +``` + +Subsetting with nothing can preserve structure of original object + +```{r, eval=FALSE} +# mtcars[] <- lapply(mtcars, as.integer) +# is.data.frame(mtcars) +# [1] TRUE +# mtcars <- lapply(mtcars, as.integer) +# is.data.frame(mtcars) +# [1] FALSE +``` + +## Applications + +Applications copied from cohort 2 slide + +### Lookup tables (character subsetting) +```{r, eval=FALSE} +x <- c("m", "f", "u", "f", "f", "m", "m") +lookup <- c(m = "Male", f = "Female", u = NA) +lookup[x] +# m f u f f m m +# "Male" "Female" NA "Female" "Female" "Male" "Male" +``` + +### Matching and merging by hand (integer subsetting) +- The `match()` function allows merging a vector with a table +```{r, eval=FALSE} +grades <- c("D", "A", "C", "B", "F") +info <- data.frame( + grade = c("A", "B", "C", "D", "F"), + desc = c("Excellent", "Very Good", "Average", "Fair", "Poor"), + fail = c(FALSE, FALSE, FALSE, FALSE, TRUE) +) +id <- match(grades, info$grade) +id +# [1] 4 1 3 2 5 +info[id, ] +# grade desc fail +# 4 D Fair FALSE +# 1 A Excellent FALSE +# 3 C Average FALSE +# 2 B Very Good FALSE +# 5 F Poor TRUE +``` + + +### Random samples and bootstrapping (integer subsetting) +```{r, eval=FALSE} +# mtcars[sample(nrow(mtcars), 3), ] # use replace = TRUE to sample with replacement +# mpg cyl disp hp drat wt qsec vs am gear carb +# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 +# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 +# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 +``` + + +### Ordering 
(integer subsetting) +```{r, eval=FALSE} +# mtcars[order(mtcars$mpg), ] +# mpg cyl disp hp drat wt qsec vs am gear carb +# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 +# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 +# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 +# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 +# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 +# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 +# ... +``` + + +### Expanding aggregated counts (integer subsetting) +- We can expand a count column by using `rep()` +```{r, eval=FALSE} +df <- tibble::tibble(x = c("Amy", "Julie", "Brian"), n = c(2, 1, 3)) +df[rep(1:nrow(df), df$n), ] +# A tibble: 6 x 2 +# x n +# <chr> <dbl> +# 1 Amy 2 +# 2 Amy 2 +# 3 Julie 1 +# 4 Brian 3 +# 5 Brian 3 +# 6 Brian 3 +``` + + + +### Removing columns from data frames (character) +- We can remove a column by subsetting, which does not change the object +```{r, eval=FALSE} +df[, 1] +# A tibble: 3 x 1 +# x +# <chr> +# 1 Amy +# 2 Julie +# 3 Brian +``` +- We can also delete the column using `NULL` +```{r, eval=FALSE} +df$n <- NULL +df +# A tibble: 3 x 1 +# x +# <chr> +# 1 Amy +# 2 Julie +# 3 Brian +``` + + + +### Selecting rows based on a condition (logical subsetting) + +```{r, eval=FALSE} +# mtcars[mtcars$gear == 5, ] +# mpg cyl disp hp drat wt qsec vs am gear carb +# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 +# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 +# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 +# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 +# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 +``` + + + +### Boolean algebra versus sets (logical and integer) +- `which()` gives the indices of a Boolean vector + +```{r, eval=FALSE} +(x1 <- 1:10 %% 2 == 0) # 1-10 divisible by 2 +# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE +(x2 <- which(x1)) +# [1] 2 4 6 8 10 +(y1 <- 1:10 %% 5 == 0) # 1-10 
divisible by 5 +# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE +(y2 <- which(y1)) +# [1] 5 10 +x1 & y1 +# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE +``` -- ADD SLIDES AS SECTIONS (`##`). -- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF. ## Meeting Videos @@ -42,3 +436,6 @@ LOG ``` </details> + + + diff --git a/DESCRIPTION b/DESCRIPTION @@ -16,4 +16,5 @@ Imports: bookdown, rmarkdown, tidyverse, - DiagrammeR + DiagrammeR, + palmerpenguins diff --git a/images/subsetting/hadley-tweet.png b/images/subsetting/hadley-tweet.png Binary files differ. diff --git a/images/subsetting/train-1.png b/images/subsetting/train-1.png Binary files differ. diff --git a/images/subsetting/train-2.png b/images/subsetting/train-2.png Binary files differ. diff --git a/images/subsetting/train-3.png b/images/subsetting/train-3.png Binary files differ.