bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git

commit 53cd24d75806f8dbda59b97aff056c7f1d9c44c2
parent 802c1533e2ed1c0736cb9aade05099ae82f83df7
Author: Jon Harmon <jonthegeek@gmail.com>
Date:   Tue,  2 Sep 2025 11:10:03 -0500

Update chapter 4 Subsetting (#90)


Diffstat:
M _freeze/slides/04/execute-results/html.json | 14 ++++++++------
D slides/04.Rmd | 523 -------------------------------------------------------------------------------
A slides/04.qmd | 376 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 384 insertions(+), 529 deletions(-)

diff --git a/_freeze/slides/04/execute-results/html.json b/_freeze/slides/04/execute-results/html.json @@ -1,15 +1,17 @@ { - "hash": "706ff898197dfba9ce51c9d83ac97658", + "hash": "b5ac0f8cfdf1cfbdbe9f5056fb42d7ee", "result": { "engine": "knitr", - "markdown": "---\nengine: knitr\ntitle: Subsetting\n---\n\n## Learning objectives:\n\n- Learn about the 6 ways to subset atomic vectors\n- Learn about the 3 subsetting operators: `[[`, `[`, and `$`\n- Learn how subsetting works with different vector types\n- Learn how subsetting can be combined with assignment\n\n## Selecting multiple elements\n\n### Atomic Vectors\n\n- 6 ways to subset atomic vectors\n\nLet's take a look with an example vector.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1.1, 2.2, 3.3, 4.4)\n```\n:::\n\n\n**Positive integer indices**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# return elements at specified positions which can be out of order\nx[c(4, 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 4.4 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\n# duplicate indices return duplicate values\nx[c(2, 2)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 2.2\n```\n\n\n:::\n\n```{.r .cell-code}\n# real numbers truncate to integers\n# so this behaves as if it is x[c(3, 3)]\nx[c(3.2, 3.8)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 3.3 3.3\n```\n\n\n:::\n:::\n\n\n**Negative integer indices**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### excludes elements at specified positions\nx[-c(1, 3)] # same as x[c(-1, -3)] or x[c(2, 4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\n### mixing positive and negative is a no-no\nx[c(-1, 3)]\n```\n\n::: {.cell-output .cell-output-error}\n\n```\n#> Error in x[c(-1, 3)]: only 0's may be mixed with negative subscripts\n```\n\n\n:::\n:::\n\n\n**Logical Vectors**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(TRUE, TRUE, FALSE, TRUE)]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[x < 3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2\n```\n\n\n:::\n\n```{.r .cell-code}\ncond <- x > 2.5\nx[cond]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 3.3 4.4\n```\n\n\n:::\n:::\n\n\n- **Recyling rules** applies when the two vectors are of different lengths\n- the shorter of the two is recycled to the length of the longer\n- Easy to understand if x or y is 1, best to avoid other lengths\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(F, T)] # equivalent to: x[c(FALSE, TRUE, FALSE, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 4.4\n```\n\n\n:::\n:::\n\n\n**Missing values (NA)**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Missing values in index will also return NA in output\nx[c(NA, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] NA 2.2 NA 4.4\n```\n\n\n:::\n:::\n\n\n**Nothing**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# returns the original vector\nx[]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3 4.4\n```\n\n\n:::\n:::\n\n\n**Zero**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# returns a zero-length vector\nx[0]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> numeric(0)\n```\n\n\n:::\n:::\n\n\n**Character vectors**\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# if name, you can use to return matched elements\n(y <- setNames(x, letters[1:4]))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a b c d \n#> 1.1 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\ny[c(\"d\", \"b\", \"a\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> d b a \n#> 4.4 2.2 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\n# Like integer indices, you can repeat indices\ny[c(\"a\", \"a\", \"a\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a a a \n#> 1.1 1.1 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\n# When subsetting with [, names are always matched exactly\nz <- 
c(abc = 1, def = 2)\nz\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> abc def \n#> 1 2\n```\n\n\n:::\n\n```{.r .cell-code}\nz[c(\"a\", \"d\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <NA> <NA> \n#> NA NA\n```\n\n\n:::\n:::\n\n\n### Lists\n\n- Subsetting works the same way\n- `[` always returns a list\n- `[[` and `$` let you pull elements out of a list\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_list <- list(a = c(T, F), b = letters[5:15], c = 100:108)\nmy_list\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] TRUE FALSE\n#> \n#> $b\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n#> \n#> $c\n#> [1] 100 101 102 103 104 105 106 107 108\n```\n\n\n:::\n:::\n\n\n**Return a (named) list**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl1 <- my_list[2]\nl1\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $b\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n```\n\n\n:::\n:::\n\n\n**Return a vector**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl2 <- my_list[[2]]\nl2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n```\n\n\n:::\n\n```{.r .cell-code}\nl2b <- my_list$b\nl2b\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n```\n\n\n:::\n:::\n\n\n**Return a specific element**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl3 <- my_list[[2]][3]\nl3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"g\"\n```\n\n\n:::\n\n```{.r .cell-code}\nl4 <- my_list[['b']][3]\nl4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"g\"\n```\n\n\n:::\n\n```{.r .cell-code}\nl4b <- my_list$b[3]\nl4b\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"g\"\n```\n\n\n:::\n:::\n\n\n**Visual Representation**\n\n![](images/subsetting/hadley-tweet.png) \n\nSee this stackoverflow article for more detailed information about 
the differences: https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el\n\n### Matrices and arrays\n\nYou can subset higher dimensional structures in three ways:\n\n- with multiple vectors\n- with a single vector\n- with a matrix\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- matrix(1:12, nrow = 3)\ncolnames(a) <- c(\"A\", \"B\", \"C\", \"D\")\n\n# single row\na[1, ]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> A B C D \n#> 1 4 7 10\n```\n\n\n:::\n\n```{.r .cell-code}\n# single column\na[, 1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\n# single element\na[1, 1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> A \n#> 1\n```\n\n\n:::\n\n```{.r .cell-code}\n# two rows from two columns\na[1:2, 3:4]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> C D\n#> [1,] 7 10\n#> [2,] 8 11\n```\n\n\n:::\n\n```{.r .cell-code}\na[c(TRUE, FALSE, TRUE), c(\"B\", \"A\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> B A\n#> [1,] 4 1\n#> [2,] 6 3\n```\n\n\n:::\n\n```{.r .cell-code}\n# zero index and negative index\na[0, -2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> A C D\n```\n\n\n:::\n:::\n\n\n**Subset a matrix with a matrix**\n\n\n::: {.cell}\n\n```{.r .cell-code}\nb <- matrix(1:4, nrow = 2)\nb\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2]\n#> [1,] 1 3\n#> [2,] 2 4\n```\n\n\n:::\n\n```{.r .cell-code}\na[b]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 7 11\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvals <- outer(1:5, 1:5, FUN = \"paste\", sep = \",\")\nvals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2] [,3] [,4] [,5] \n#> [1,] \"1,1\" \"1,2\" \"1,3\" \"1,4\" \"1,5\"\n#> [2,] \"2,1\" \"2,2\" \"2,3\" \"2,4\" \"2,5\"\n#> [3,] \"3,1\" \"3,2\" \"3,3\" \"3,4\" \"3,5\"\n#> [4,] \"4,1\" \"4,2\" \"4,3\" \"4,4\" \"4,5\"\n#> [5,] \"5,1\" 
\"5,2\" \"5,3\" \"5,4\" \"5,5\"\n```\n\n\n:::\n\n```{.r .cell-code}\nselect <- matrix(ncol = 2, byrow = TRUE, \n c(1, 1,\n 3, 1,\n 2, 4))\nselect\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2]\n#> [1,] 1 1\n#> [2,] 3 1\n#> [3,] 2 4\n```\n\n\n:::\n\n```{.r .cell-code}\nvals[select]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"1,1\" \"3,1\" \"2,4\"\n```\n\n\n:::\n:::\n\n\nMatrices and arrays are just special vectors; can subset with a single vector\n(arrays in R stored column wise)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvals[c(3, 15, 16, 17)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"3,1\" \"5,3\" \"1,4\" \"2,4\"\n```\n\n\n:::\n:::\n\n\n### Data frames and tibbles\n\nData frames act like both lists and matrices\n\n- When subsetting with a single index, they behave like lists and index the columns, so `df[1:2]` selects the first two columns.\n- When subsetting with two indices, they behave like matrices, so `df[1:3, ]` selects the first three rows (and all the columns).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(palmerpenguins)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#> \n#> Attaching package: 'palmerpenguins'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#> The following objects are masked from 'package:datasets':\n#> \n#> penguins, penguins_raw\n```\n\n\n:::\n\n```{.r .cell-code}\npenguins <- penguins\n\n# single index selects first two columns\ntwo_cols <- penguins[2:3] # or penguins[c(2,3)]\nhead(two_cols)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 6 × 2\n#> island bill_length_mm\n#> <fct> <dbl>\n#> 1 Torgersen 39.1\n#> 2 Torgersen 39.5\n#> 3 Torgersen 40.3\n#> 4 Torgersen NA \n#> 5 Torgersen 36.7\n#> 6 Torgersen 39.3\n```\n\n\n:::\n\n```{.r .cell-code}\n# equivalent to the above code\nsame_two_cols <- penguins[c(\"island\", \"bill_length_mm\")]\nhead(same_two_cols)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 6 × 
2\n#> island bill_length_mm\n#> <fct> <dbl>\n#> 1 Torgersen 39.1\n#> 2 Torgersen 39.5\n#> 3 Torgersen 40.3\n#> 4 Torgersen NA \n#> 5 Torgersen 36.7\n#> 6 Torgersen 39.3\n```\n\n\n:::\n\n```{.r .cell-code}\n# two indices separated by comma (first two rows of 3rd and 4th columns)\npenguins[1:2, 3:4]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 2\n#> bill_length_mm bill_depth_mm\n#> <dbl> <dbl>\n#> 1 39.1 18.7\n#> 2 39.5 17.4\n```\n\n\n:::\n\n```{.r .cell-code}\n# Can't do this...\npenguins[[3:4]][c(1:4)]\n```\n\n::: {.cell-output .cell-output-error}\n\n```\n#> Error:\n#> ! The `j` argument of `[[.tbl_df()` can't be a vector of length 2 as of\n#> tibble 3.0.0.\n#> ℹ Recursive subsetting is deprecated for tibbles.\n```\n\n\n:::\n\n```{.r .cell-code}\n# ...but this works...\npenguins[[3]][c(1:4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 39.1 39.5 40.3 NA\n```\n\n\n:::\n\n```{.r .cell-code}\n# ...or this equivalent...\npenguins$bill_length_mm[1:4]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 39.1 39.5 40.3 NA\n```\n\n\n:::\n:::\n\n\nSubsetting a tibble with `[` always returns a tibble\n\n### Preserving dimensionality\n\n- Data frames and tibbles behave differently\n- tibble will default to preserve dimensionality, data frames do not\n- this can lead to unexpected behavior and code breaking in the future\n- Use `drop = FALSE` to preserve dimensionality when subsetting a data frame or use tibbles\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntb <- tibble::tibble(a = 1:2, b = 1:2)\n\n# returns tibble\nstr(tb[, \"a\"])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> tibble [2 × 1] (S3: tbl_df/tbl/data.frame)\n#> $ a: int [1:2] 1 2\n```\n\n\n:::\n\n```{.r .cell-code}\ntb[, \"a\"] # equivalent to tb[, \"a\", drop = FALSE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 2 × 1\n#> a\n#> <int>\n#> 1 1\n#> 2 2\n```\n\n\n:::\n\n```{.r .cell-code}\n# returns integer vector\n# str(tb[, 
\"a\", drop = TRUE])\ntb[, \"a\", drop = TRUE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2\n```\n\n\n:::\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(a = 1:2, b = 1:2)\n\n# returns integer vector\n# str(df[, \"a\"])\ndf[, \"a\"]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2\n```\n\n\n:::\n\n```{.r .cell-code}\n# returns data frame with one column\n# str(df[, \"a\", drop = FALSE])\ndf[, \"a\", drop = FALSE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a\n#> 1 1\n#> 2 2\n```\n\n\n:::\n:::\n\n**Factors**\n\nFactor subsetting drop argument controls whether or not levels (rather than dimensions) are preserved.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- factor(c(\"a\", \"b\", \"c\"))\nz[1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] a\n#> Levels: a b c\n```\n\n\n:::\n\n```{.r .cell-code}\nz[1, drop = TRUE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] a\n#> Levels: a\n```\n\n\n:::\n:::\n\n\n## Selecting a single element\n\n`[[` and `$` are used to extract single elements (note: a vector can be a single element)\n\n### `[[]]`\n\nBecause `[[]]` can return only a single item, you must use it with either a single positive integer or a single string. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(1:3, \"a\", 4:6)\nx[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n:::\n\n\nHadley Wickham recommends using `[[]]` with atomic vectors whenever you want to extract a single value to reinforce the expectation that you are getting and setting individual values. 
\n\n### `$`\n\n- `x$y` is equivalent to `x[[\"y\"]]`\n\nthe `$` operator doesn't work with stored vals\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvar <- \"cyl\"\n\n# Doesn't work - mtcars$var translated to mtcars[[\"var\"]]\nmtcars$var\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\n# Instead use [[\nmtcars[[var]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4\n```\n\n\n:::\n:::\n\n\n`$` allows partial matching, `[[]]` does not\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(abc = 1)\nx$a\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[\"a\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n:::\n\n\nHadley advises to change Global settings:\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(warnPartialMatchDollar = TRUE)\nx$a\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#> Warning in x$a: partial match of 'a' to 'abc'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n:::\n\n\ntibbles don't have this behavior\n\n\n::: {.cell}\n\n```{.r .cell-code}\npenguins$s\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#> Warning: Unknown or uninitialised column: `s`.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n:::\n\n\n### missing and out of bound indices\n\n- Due to the inconsistency of how R handles such indices, `purrr::pluck()` and `purrr::chuck()` are recommended\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(\n a = list(1, 2, 3),\n b = list(3, 4, 5)\n)\npurrr::pluck(x, \"a\", 1)\n# [1] 1\npurrr::pluck(x, \"c\", 1)\n# NULL\npurrr::pluck(x, \"c\", 1, .default = NA)\n# [1] NA\n```\n:::\n\n\n### `@` and `slot()`\n- `@` is `$` for S4 objects (to be revisited in Chapter 15)\n\n- `slot()` is `[[ ]]` for S4 objects\n\n## Subsetting and Assignment\n\n- Subsetting can be 
combined with assignment to edit values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"Tigers\", \"Royals\", \"White Sox\", \"Twins\", \"Indians\")\n\nx[5] <- \"Guardians\"\n\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"Tigers\" \"Royals\" \"White Sox\" \"Twins\" \"Guardians\"\n```\n\n\n:::\n:::\n\n\n- length of the subset and assignment vector should be the same to avoid recycling\n\nYou can use NULL to remove a component\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1, b = 2)\nx[[\"b\"]] <- NULL\nstr(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> List of 1\n#> $ a: num 1\n```\n\n\n:::\n:::\n\n\nSubsetting with nothing can preserve structure of original object\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# mtcars[] <- lapply(mtcars, as.integer)\n# is.data.frame(mtcars)\n# [1] TRUE\n# mtcars <- lapply(mtcars, as.integer)\n#> is.data.frame(mtcars)\n# [1] FALSE\n```\n:::\n\n\n## Applications\n\nApplications copied from cohort 2 slide\n\n### Lookup tables (character subsetting)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"m\", \"f\", \"u\", \"f\", \"f\", \"m\", \"m\")\nlookup <- c(m = \"Male\", f = \"Female\", u = NA)\nlookup[x]\n# m f u f f m m \n# \"Male\" \"Female\" NA \"Female\" \"Female\" \"Male\" \"Male\"\n```\n:::\n\n\n### Matching and merging by hand (integer subsetting)\n\n- The `match()` function allows merging a vector with a table\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngrades <- c(\"D\", \"A\", \"C\", \"B\", \"F\")\ninfo <- data.frame(\n grade = c(\"A\", \"B\", \"C\", \"D\", \"F\"),\n desc = c(\"Excellent\", \"Very Good\", \"Average\", \"Fair\", \"Poor\"),\n fail = c(F, F, F, F, T)\n)\nid <- match(grades, info$grade)\nid\n# [1] 3 2 2 1 3\ninfo[id, ]\n# grade desc fail\n# 4 D Fair FALSE\n# 1 A Excellent FALSE\n# 3 C Average FALSE\n# 2 B Very Good FALSE\n# 5 F Poor TRUE\n```\n:::\n\n\n### Random samples and bootstrapping (integer subsetting)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# mtcars[sample(nrow(mtcars), 
3), ] # use replace = TRUE to replace\n# mpg cyl disp hp drat wt qsec vs am gear carb\n# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2\n# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4\n# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4\n```\n:::\n\n\n### Ordering (integer subsetting)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# mtcars[order(mtcars$mpg), ]\n# mpg cyl disp hp drat wt qsec vs am gear carb\n# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4\n# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4\n# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4\n# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4\n# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4\n# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8\n# ...\n```\n:::\n\n\n### Expanding aggregated counts (integer subsetting)\n\n- We can expand a count column by using `rep()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- tibble::tibble(x = c(\"Amy\", \"Julie\", \"Brian\"), n = c(2, 1, 3))\ndf[rep(1:nrow(df), df$n), ]\n# A tibble: 6 x 2\n# x n\n# <chr> <dbl>\n# 1 Amy 2\n# 2 Amy 2\n# 3 Julie 1\n# 4 Brian 3\n# 5 Brian 3\n# 6 Brian 3\n```\n:::\n\n\n### Removing columns from data frames (character)\n\n- We can remove a column by subsetting, which does not change the object\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, 1]\n# A tibble: 3 x 1\n# x \n# <chr>\n# 1 Amy \n# 2 Julie\n# 3 Brian\n```\n:::\n\n\n- We can also delete the column using `NULL`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$n <- NULL\ndf\n# A tibble: 3 x 1\n# x \n# <chr>\n# 1 Amy \n# 2 Julie\n# 3 Brian\n```\n:::\n\n\n### Selecting rows based on a condition (logical subsetting)\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# mtcars[mtcars$gear == 5, ]\n# mpg cyl disp hp drat wt qsec vs am gear carb\n# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2\n# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2\n# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4\n# Ferrari Dino 
19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6\n# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8\n```\n:::\n\n\n### Boolean algebra versus sets (logical and integer)\n\n- `which()` gives the indices of a Boolean vector\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(x1 <- 1:10 %% 2 == 0) # 1-10 divisible by 2\n# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE\n(x2 <- which(x1))\n# [1] 2 4 6 8 10\n(y1 <- 1:10 %% 5 == 0) # 1-10 divisible by 5\n# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n(y2 <- which(y1))\n# [1] 5 10\nx1 & y1\n# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE\n```\n:::\n\n", - "supporting": [ - "04_files" - ], + "markdown": "---\nengine: knitr\ntitle: Subsetting\n---\n\n## Learning objectives:\n\n- Select multiple elements from a vector with `[`\n- Learn about the 3 subsetting operators: `[[`, `[`, and `$`\n- Learn how subsetting works with different vector types\n- Learn how subsetting can be combined with assignment\n\n# Selecting multiple elements\n\n## 1. Positive integers return elements at specified positions\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(1.1, 2.2, 3.3, 4.4) # decimal = original position\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(4, 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 4.4 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(1, 1, 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 1.1 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(1.9999)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1\n```\n\n\n:::\n:::\n\n\nReals *truncate* to integers.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(1.0001, 1.9999)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 1.1\n```\n\n\n:::\n:::\n\n\n## 2. 
Negative integers remove specified elements\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[-c(1, 3)] # same as x[c(-1, -3)] or x[c(2, 4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 4.4\n```\n\n\n:::\n:::\n\n\n## 2b. Mixing negative and positive integers throws an error\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(-1, 3)]\n```\n\n::: {.cell-output .cell-output-error}\n\n```\n#> Error in x[c(-1, 3)]: only 0's may be mixed with negative subscripts\n```\n\n\n:::\n:::\n\n\n## 2c. Zeros ignored with other ints \n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(-1, 0)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(-1, 0, 0, 0, 0, 0 ,0 ,0)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(1, 0, 2, 0, 3, 0)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3\n```\n\n\n:::\n:::\n\n\n\n## 3. Logical vectors select specified elements\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(TRUE, TRUE, FALSE, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[x < 3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2\n```\n\n\n:::\n\n```{.r .cell-code}\ncond <- x > 2.5\nx[cond]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 3.3 4.4\n```\n\n\n:::\n:::\n\n\n## 3b. Shorter element are recycled to higher length\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[FALSE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> numeric(0)\n```\n\n\n:::\n\n```{.r .cell-code}\nx[TRUE]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\nx[c(FALSE, TRUE)] # equivalent to: x[c(FALSE, TRUE, FALSE, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 4.4\n```\n\n\n:::\n:::\n\n\n- Easy to understand if x or y is 1, best to avoid other lengths\n\n## 3c. 
NA index returns NA\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(NA, TRUE, NA, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] NA 2.2 NA 4.4\n```\n\n\n:::\n:::\n\n## 3d. Extra TRUE index returns NA\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 2.2 3.3 4.4 NA NA\n```\n\n\n:::\n\n```{.r .cell-code}\nx[1:5]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3 4.4 NA\n```\n\n\n:::\n:::\n\n\n## 4. Indexing with nothing returns original vector\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1.1 2.2 3.3 4.4\n```\n\n\n:::\n:::\n\n\n## 5. Indexing with just 0 returns 0-length vector (with class)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[0]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> numeric(0)\n```\n\n\n:::\n\n```{.r .cell-code}\nletters[0]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> character(0)\n```\n\n\n:::\n:::\n\n\n## 6. Indexing with character vector returns element of named vector\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(y <- setNames(x, letters[1:4]))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a b c d \n#> 1.1 2.2 3.3 4.4\n```\n\n\n:::\n\n```{.r .cell-code}\ny[c(\"d\", \"b\", \"a\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> d b a \n#> 4.4 2.2 1.1\n```\n\n\n:::\n\n```{.r .cell-code}\ny[c(\"a\", \"a\", \"a\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a a a \n#> 1.1 1.1 1.1\n```\n\n\n:::\n:::\n\n\n## 6b. 
Names must be exact for `[`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- c(abc = 1, def = 2)\nz\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> abc def \n#> 1 2\n```\n\n\n:::\n\n```{.r .cell-code}\nz[c(\"a\", \"d\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> <NA> <NA> \n#> NA NA\n```\n\n\n:::\n:::\n\n\n## Subsetting a list with `[` returns a list\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_list <- list(a = c(T, F), b = letters[5:15], c = 100:108)\nmy_list\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] TRUE FALSE\n#> \n#> $b\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n#> \n#> $c\n#> [1] 100 101 102 103 104 105 106 107 108\n```\n\n\n:::\n\n```{.r .cell-code}\nmy_list[c(\"a\", \"b\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] TRUE FALSE\n#> \n#> $b\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n```\n\n\n:::\n:::\n\n\n## Lists use same rules for `[`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmy_list[2:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $b\n#> [1] \"e\" \"f\" \"g\" \"h\" \"i\" \"j\" \"k\" \"l\" \"m\" \"n\" \"o\"\n#> \n#> $c\n#> [1] 100 101 102 103 104 105 106 107 108\n```\n\n\n:::\n\n```{.r .cell-code}\nmy_list[c(TRUE, FALSE, TRUE)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] TRUE FALSE\n#> \n#> $c\n#> [1] 100 101 102 103 104 105 106 107 108\n```\n\n\n:::\n:::\n\n\n## Matrices & arrays take multidimensional indices\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- matrix(1:9, nrow = 3)\na\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2] [,3]\n#> [1,] 1 4 7\n#> [2,] 2 5 8\n#> [3,] 3 6 9\n```\n\n\n:::\n\n```{.r .cell-code}\na[1:2, 2:3] # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2]\n#> [1,] 4 7\n#> [2,] 5 8\n```\n\n\n:::\n:::\n\n\n## Matrices & arrays can accept character, logical, etc\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(a) <- 
c(\"A\", \"B\", \"C\")\na[c(TRUE, TRUE, FALSE), c(\"B\", \"A\")] # a[1:2, 2:1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> B A\n#> [1,] 4 1\n#> [2,] 5 2\n```\n\n\n:::\n:::\n\n\n## Matrices & arrays are also vectors\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvals <- outer(1:5, 1:5, FUN = \"paste\", sep = \",\") # All chr combos of 1:5\nvals\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [,1] [,2] [,3] [,4] [,5] \n#> [1,] \"1,1\" \"1,2\" \"1,3\" \"1,4\" \"1,5\"\n#> [2,] \"2,1\" \"2,2\" \"2,3\" \"2,4\" \"2,5\"\n#> [3,] \"3,1\" \"3,2\" \"3,3\" \"3,4\" \"3,5\"\n#> [4,] \"4,1\" \"4,2\" \"4,3\" \"4,4\" \"4,5\"\n#> [5,] \"5,1\" \"5,2\" \"5,3\" \"5,4\" \"5,5\"\n```\n\n\n:::\n\n```{.r .cell-code}\nvals[c(4, 15)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"4,1\" \"5,3\"\n```\n\n\n:::\n\n```{.r .cell-code}\na[a > 5]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 6 7 8 9\n```\n\n\n:::\n:::\n\n\n## Data frames subset list-like with single index\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])\ndf[1:2]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x y\n#> 1 1 3\n#> 2 2 2\n#> 3 3 1\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[c(\"x\", \"z\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x z\n#> 1 1 a\n#> 2 2 b\n#> 3 3 c\n```\n\n\n:::\n:::\n\n\n## Data frames subset matrix-like with multiple indices\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[1:2, c(\"x\", \"z\")] # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x z\n#> 1 1 a\n#> 2 2 b\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[df$x == 2, ] # matching rows, all columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x y z\n#> 2 2 2 b\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[, c(\"x\", \"z\")] # equivalent to no ,\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x z\n#> 1 1 a\n#> 2 2 b\n#> 3 3 c\n```\n\n\n:::\n:::\n\n\n## Subsetting a tibble with `[` returns a tibble\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\ntbl <- tibble::as_tibble(df)\ndf[, 1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[, 1, drop = FALSE] # Prevent errors\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x\n#> 1 1\n#> 2 2\n#> 3 3\n```\n\n\n:::\n\n```{.r .cell-code}\ntbl[, 1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> # A tibble: 3 × 1\n#> x\n#> <int>\n#> 1 1\n#> 2 2\n#> 3 3\n```\n\n\n:::\n:::\n\n\n# Selecting a single element\n\n## `[[` selects a single element\n\n:::: {.columns}\n\n::: {.column}\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(1:3, \"a\", 4:6)\nx[1]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [[1]]\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(x[1])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"list\"\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(x[[1]])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[1]][[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column}\n\n![](images/subsetting/hadley-tweet.png)\n:::\n\n::::\n\n## `$` is shorthand for `[[..., exact = FALSE]]`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(abc = 1)\nx$abc\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\nx$a\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[\"a\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\nx[[\"a\", exact = FALSE]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\noptions(warnPartialMatchDollar = TRUE)\nx$a\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\n#> Warning in x$a: partial match of 
'a' to 'abc'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1\n```\n\n\n:::\n:::\n\n\n## Behavior for missing-ish indices is inconsistent\n\n\n::: {.cell}\n\n```{.r .cell-code}\na <- c(a = 1L, b = 2L)\nlst <- list(a = 1:2)\n\n# Errors:\n# a[[NULL]]\n# lst[[NULL]]\n# a[[5]]\n# lst[[5]]\n# a[[\"c\"]]\n# a[[NA]]\n\nlst[[\"c\"]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\nlst[[NA]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n:::\n\n\n## `purrr::pluck()` and `purrr::chuck()` provide consistent wrappers\n\n- `purrr::pluck()` always returns `NULL` or `.default` for (non-`NULL`) missing\n- `purrr::chuck()` always throws error\n\n\n::: {.cell}\n\n```{.r .cell-code}\npurrr::pluck(a, 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\npurrr::pluck(a, \"c\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\npurrr::pluck(lst, 5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n\n```{.r .cell-code}\npurrr::pluck(lst, \"c\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> NULL\n```\n\n\n:::\n:::\n\n\n## S4 has two additional subsetting operators\n\n- `@` equivalent to `$` (but error if bad)\n- `slot()` equivalent to `[[`\n\nMore in Chapter 15\n\n# Subsetting and assignment\n\n## Can assign to position with `[`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- 1:5\nx[1:2] <- c(101, 102)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 101 102 3 4 5\n```\n\n\n:::\n\n```{.r .cell-code}\nx[1:3] <- 1:2\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 1 4 5\n```\n\n\n:::\n:::\n\n\n## Remove list component with `NULL`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1, b = 2)\nx[[\"b\"]] <- NULL\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] 1\n```\n\n\n:::\n:::\n\n\n## Use 
`list(NULL)` to add `NULL`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- list(a = 1, b = 2)\nx[[\"b\"]] <- list(NULL)\nx\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> $a\n#> [1] 1\n#> \n#> $b\n#> $b[[1]]\n#> NULL\n```\n\n\n:::\n:::\n\n\n## Subset with nothing to retain shape\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(a = 1:3, b = 1:3)\ndf[] <- \"a\"\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a b\n#> 1 a a\n#> 2 a a\n#> 3 a a\n```\n\n\n:::\n\n```{.r .cell-code}\ndf <- \"a\"\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"a\"\n```\n\n\n:::\n:::\n\n\n# Applications\n\n## Use a lookup vector and recycling rules to translate values\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"b\", \"g\", \"x\", \"g\", \"g\", \"b\")\nlookup <- c(b = \"blue\", g = \"green\", x = NA)\nlookup[x]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> b g x g g b \n#> \"blue\" \"green\" NA \"green\" \"green\" \"blue\"\n```\n\n\n:::\n\n```{.r .cell-code}\nunname(lookup[x])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] \"blue\" \"green\" NA \"green\" \"green\" \"blue\"\n```\n\n\n:::\n:::\n\n\n## Use a lookup table to generate rows of data\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninfo <- data.frame(\n code = c(\"b\", \"g\", \"x\"),\n color = c(\"blue\", \"green\", NA),\n other_thing = 3:1\n)\nmatch(x, info$code) # Indices of info$code in x\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 2 3 2 2 1\n```\n\n\n:::\n\n```{.r .cell-code}\ninfo[match(x, info$code), ]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> code color other_thing\n#> 1 b blue 3\n#> 2 g green 2\n#> 3 x <NA> 1\n#> 2.1 g green 2\n#> 2.2 g green 2\n#> 1.1 b blue 3\n```\n\n\n:::\n:::\n\n\n## Sort with `order()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- c(\"b\", \"c\", \"a\")\norder(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 3 1 2\n```\n\n\n:::\n\n```{.r .cell-code}\nx[order(x)]\n```\n\n::: 
{.cell-output .cell-output-stdout}\n\n```\n#> [1] \"a\" \"b\" \"c\"\n```\n\n\n:::\n\n```{.r .cell-code}\ndf <- data.frame(b = 3:1, a = 1:3)\ndf[order(df$b), ]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> b a\n#> 3 1 3\n#> 2 2 2\n#> 1 3 1\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[, order(names(df))]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> a b\n#> 1 1 3\n#> 2 2 2\n#> 3 3 1\n```\n\n\n:::\n:::\n\n\n## Expand counts\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(x = c(2, 4, 1), y = c(9, 11, 6), n = c(3, 5, 1))\nrep(1:nrow(df), df$n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> [1] 1 1 1 2 2 2 2 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\ndf[rep(1:nrow(df), df$n), ]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n#> x y n\n#> 1 2 9 3\n#> 1.1 2 9 3\n#> 1.2 2 9 3\n#> 2 4 11 5\n#> 2.1 4 11 5\n#> 2.2 4 11 5\n#> 2.3 4 11 5\n#> 2.4 4 11 5\n#> 3 1 6 1\n```\n\n\n:::\n:::\n\n\n## Ran out of time to make slides for\n\nIdeally a future cohort should expand these:\n\n- Remove df columns with `setdiff()`\n- Logically subset rows `df[df$col > 5, ]`\n- The next slide about `which()`\n\n## Boolean algebra versus sets (logical and integer)\n\n- `which()` gives the indices of a Boolean vector\n\n\n::: {.cell}\n\n```{.r .cell-code}\n(x1 <- 1:10 %% 2 == 0) # 1-10 divisible by 2\n# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE\n(x2 <- which(x1))\n# [1] 2 4 6 8 10\n(y1 <- 1:10 %% 5 == 0) # 1-10 divisible by 5\n# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n(y2 <- which(y1))\n# [1] 5 10\nx1 & y1\n# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE\n```\n:::\n\n", + "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" ], - "includes": {}, + "includes": { + "include-after-body": [ + "\n<script>\n // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n // slide changes (different for each 
slide format).\n (function () {\n // dispatch for htmlwidgets\n function fireSlideEnter() {\n const event = window.document.createEvent(\"Event\");\n event.initEvent(\"slideenter\", true, true);\n window.document.dispatchEvent(event);\n }\n\n function fireSlideChanged(previousSlide, currentSlide) {\n fireSlideEnter();\n\n // dispatch for shiny\n if (window.jQuery) {\n if (previousSlide) {\n window.jQuery(previousSlide).trigger(\"hidden\");\n }\n if (currentSlide) {\n window.jQuery(currentSlide).trigger(\"shown\");\n }\n }\n }\n\n // hookup for slidy\n if (window.w3c_slidy) {\n window.w3c_slidy.add_observer(function (slide_num) {\n // slide_num starts at position 1\n fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n });\n }\n\n })();\n</script>\n\n" + ] + }, "engineDependencies": {}, "preserve": {}, "postProcess": true diff --git a/slides/04.Rmd b/slides/04.Rmd @@ -1,523 +0,0 @@ ---- -engine: knitr -title: Subsetting ---- - -## Learning objectives: - -- Learn about the 6 ways to subset atomic vectors -- Learn about the 3 subsetting operators: `[[`, `[`, and `$` -- Learn how subsetting works with different vector types -- Learn how subsetting can be combined with assignment - -## Selecting multiple elements - -### Atomic Vectors - -- 6 ways to subset atomic vectors - -Let's take a look with an example vector. 
- -```{r atomic_vector} -x <- c(1.1, 2.2, 3.3, 4.4) -``` - -**Positive integer indices** - -```{r positive_int} -# return elements at specified positions which can be out of order -x[c(4, 1)] - -# duplicate indices return duplicate values -x[c(2, 2)] - -# real numbers truncate to integers -# so this behaves as if it is x[c(3, 3)] -x[c(3.2, 3.8)] -``` - -**Negative integer indices** - -```{r, error=TRUE} -### excludes elements at specified positions -x[-c(1, 3)] # same as x[c(-1, -3)] or x[c(2, 4)] - -### mixing positive and negative is a no-no -x[c(-1, 3)] -``` - -**Logical Vectors** - -```{r logical_vec} -x[c(TRUE, TRUE, FALSE, TRUE)] - -x[x < 3] - -cond <- x > 2.5 -x[cond] -``` - -- **Recyling rules** applies when the two vectors are of different lengths -- the shorter of the two is recycled to the length of the longer -- Easy to understand if x or y is 1, best to avoid other lengths - -```{r} -x[c(F, T)] # equivalent to: x[c(FALSE, TRUE, FALSE, TRUE)] -``` - -**Missing values (NA)** - -```{r missing} -# Missing values in index will also return NA in output -x[c(NA, TRUE)] -``` - -**Nothing** - -```{r nothing} -# returns the original vector -x[] -``` - -**Zero** - -```{r zero} -# returns a zero-length vector -x[0] -``` - -**Character vectors** - -```{r character} -# if name, you can use to return matched elements -(y <- setNames(x, letters[1:4])) - -y[c("d", "b", "a")] - -# Like integer indices, you can repeat indices -y[c("a", "a", "a")] - -# When subsetting with [, names are always matched exactly -z <- c(abc = 1, def = 2) -z -z[c("a", "d")] -``` - -### Lists - -- Subsetting works the same way -- `[` always returns a list -- `[[` and `$` let you pull elements out of a list - -```{r} -my_list <- list(a = c(T, F), b = letters[5:15], c = 100:108) -my_list -``` - -**Return a (named) list** - -```{r} -l1 <- my_list[2] -l1 -``` - -**Return a vector** - -```{r} -l2 <- my_list[[2]] -l2 -l2b <- my_list$b -l2b -``` - -**Return a specific element** - -```{r} -l3 <- 
my_list[[2]][3] -l3 -l4 <- my_list[['b']][3] -l4 -l4b <- my_list$b[3] -l4b -``` - -**Visual Representation** - -![](images/subsetting/hadley-tweet.png) - -See this stackoverflow article for more detailed information about the differences: https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-el - -### Matrices and arrays - -You can subset higher dimensional structures in three ways: - -- with multiple vectors -- with a single vector -- with a matrix - -```{r} -a <- matrix(1:12, nrow = 3) -colnames(a) <- c("A", "B", "C", "D") - -# single row -a[1, ] - -# single column -a[, 1] - -# single element -a[1, 1] - -# two rows from two columns -a[1:2, 3:4] - -a[c(TRUE, FALSE, TRUE), c("B", "A")] - -# zero index and negative index -a[0, -2] -``` - -**Subset a matrix with a matrix** - -```{r} -b <- matrix(1:4, nrow = 2) -b -a[b] -``` - -```{r} -vals <- outer(1:5, 1:5, FUN = "paste", sep = ",") -vals - -select <- matrix(ncol = 2, byrow = TRUE, - c(1, 1, - 3, 1, - 2, 4)) -select - -vals[select] -``` - -Matrices and arrays are just special vectors; can subset with a single vector -(arrays in R stored column wise) - -```{r} -vals[c(3, 15, 16, 17)] -``` - -### Data frames and tibbles - -Data frames act like both lists and matrices - -- When subsetting with a single index, they behave like lists and index the columns, so `df[1:2]` selects the first two columns. -- When subsetting with two indices, they behave like matrices, so `df[1:3, ]` selects the first three rows (and all the columns). - -```{r penguins, error=TRUE} -library(palmerpenguins) -penguins <- penguins - -# single index selects first two columns -two_cols <- penguins[2:3] # or penguins[c(2,3)] -head(two_cols) - -# equivalent to the above code -same_two_cols <- penguins[c("island", "bill_length_mm")] -head(same_two_cols) - -# two indices separated by comma (first two rows of 3rd and 4th columns) -penguins[1:2, 3:4] - -# Can't do this... 
-penguins[[3:4]][c(1:4)] -# ...but this works... -penguins[[3]][c(1:4)] -# ...or this equivalent... -penguins$bill_length_mm[1:4] -``` - -Subsetting a tibble with `[` always returns a tibble - -### Preserving dimensionality - -- Data frames and tibbles behave differently -- tibble will default to preserve dimensionality, data frames do not -- this can lead to unexpected behavior and code breaking in the future -- Use `drop = FALSE` to preserve dimensionality when subsetting a data frame or use tibbles - - -```{r} -tb <- tibble::tibble(a = 1:2, b = 1:2) - -# returns tibble -str(tb[, "a"]) -tb[, "a"] # equivalent to tb[, "a", drop = FALSE] - -# returns integer vector -# str(tb[, "a", drop = TRUE]) -tb[, "a", drop = TRUE] -``` - -```{r} -df <- data.frame(a = 1:2, b = 1:2) - -# returns integer vector -# str(df[, "a"]) -df[, "a"] - -# returns data frame with one column -# str(df[, "a", drop = FALSE]) -df[, "a", drop = FALSE] -``` -**Factors** - -Factor subsetting drop argument controls whether or not levels (rather than dimensions) are preserved. - -```{r} -z <- factor(c("a", "b", "c")) -z[1] -z[1, drop = TRUE] -``` - -## Selecting a single element - -`[[` and `$` are used to extract single elements (note: a vector can be a single element) - -### `[[]]` - -Because `[[]]` can return only a single item, you must use it with either a single positive integer or a single string. - -```{r train} -x <- list(1:3, "a", 4:6) -x[[1]] -``` - -Hadley Wickham recommends using `[[]]` with atomic vectors whenever you want to extract a single value to reinforce the expectation that you are getting and setting individual values. 
- -### `$` - -- `x$y` is equivalent to `x[["y"]]` - -the `$` operator doesn't work with stored vals - -```{r} -var <- "cyl" - -# Doesn't work - mtcars$var translated to mtcars[["var"]] -mtcars$var - -# Instead use [[ -mtcars[[var]] -``` - -`$` allows partial matching, `[[]]` does not - -```{r} -x <- list(abc = 1) -x$a - -x[["a"]] - -``` - -Hadley advises to change Global settings: - -```{r} -options(warnPartialMatchDollar = TRUE) -x$a -``` - -tibbles don't have this behavior - -```{r} -penguins$s -``` - -### missing and out of bound indices - -- Due to the inconsistency of how R handles such indices, `purrr::pluck()` and `purrr::chuck()` are recommended - -```{r, eval=FALSE} -x <- list( - a = list(1, 2, 3), - b = list(3, 4, 5) -) -purrr::pluck(x, "a", 1) -# [1] 1 -purrr::pluck(x, "c", 1) -# NULL -purrr::pluck(x, "c", 1, .default = NA) -# [1] NA -``` - -### `@` and `slot()` -- `@` is `$` for S4 objects (to be revisited in Chapter 15) - -- `slot()` is `[[ ]]` for S4 objects - -## Subsetting and Assignment - -- Subsetting can be combined with assignment to edit values - -```{r} -x <- c("Tigers", "Royals", "White Sox", "Twins", "Indians") - -x[5] <- "Guardians" - -x -``` - -- length of the subset and assignment vector should be the same to avoid recycling - -You can use NULL to remove a component - -```{r} -x <- list(a = 1, b = 2) -x[["b"]] <- NULL -str(x) -``` - -Subsetting with nothing can preserve structure of original object - -```{r, eval=FALSE} -# mtcars[] <- lapply(mtcars, as.integer) -# is.data.frame(mtcars) -# [1] TRUE -# mtcars <- lapply(mtcars, as.integer) -#> is.data.frame(mtcars) -# [1] FALSE -``` - -## Applications - -Applications copied from cohort 2 slide - -### Lookup tables (character subsetting) - -```{r, eval=FALSE} -x <- c("m", "f", "u", "f", "f", "m", "m") -lookup <- c(m = "Male", f = "Female", u = NA) -lookup[x] -# m f u f f m m -# "Male" "Female" NA "Female" "Female" "Male" "Male" -``` - -### Matching and merging by hand (integer subsetting) - 
-- The `match()` function allows merging a vector with a table - -```{r, eval=FALSE} -grades <- c("D", "A", "C", "B", "F") -info <- data.frame( - grade = c("A", "B", "C", "D", "F"), - desc = c("Excellent", "Very Good", "Average", "Fair", "Poor"), - fail = c(F, F, F, F, T) -) -id <- match(grades, info$grade) -id -# [1] 3 2 2 1 3 -info[id, ] -# grade desc fail -# 4 D Fair FALSE -# 1 A Excellent FALSE -# 3 C Average FALSE -# 2 B Very Good FALSE -# 5 F Poor TRUE -``` - -### Random samples and bootstrapping (integer subsetting) - -```{r, eval=FALSE} -# mtcars[sample(nrow(mtcars), 3), ] # use replace = TRUE to replace -# mpg cyl disp hp drat wt qsec vs am gear carb -# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 -# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 -# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 -``` - -### Ordering (integer subsetting) - -```{r, eval=FALSE} -# mtcars[order(mtcars$mpg), ] -# mpg cyl disp hp drat wt qsec vs am gear carb -# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 -# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 -# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 -# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 -# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 -# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 -# ... 
-``` - -### Expanding aggregated counts (integer subsetting) - -- We can expand a count column by using `rep()` - -```{r, eval=FALSE} -df <- tibble::tibble(x = c("Amy", "Julie", "Brian"), n = c(2, 1, 3)) -df[rep(1:nrow(df), df$n), ] -# A tibble: 6 x 2 -# x n -# <chr> <dbl> -# 1 Amy 2 -# 2 Amy 2 -# 3 Julie 1 -# 4 Brian 3 -# 5 Brian 3 -# 6 Brian 3 -``` - -### Removing columns from data frames (character) - -- We can remove a column by subsetting, which does not change the object - -```{r, eval=FALSE} -df[, 1] -# A tibble: 3 x 1 -# x -# <chr> -# 1 Amy -# 2 Julie -# 3 Brian -``` - -- We can also delete the column using `NULL` - -```{r, eval=FALSE} -df$n <- NULL -df -# A tibble: 3 x 1 -# x -# <chr> -# 1 Amy -# 2 Julie -# 3 Brian -``` - -### Selecting rows based on a condition (logical subsetting) - -```{r, eval=FALSE} -# mtcars[mtcars$gear == 5, ] -# mpg cyl disp hp drat wt qsec vs am gear carb -# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 -# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 -# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 -# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 -# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 -``` - -### Boolean algebra versus sets (logical and integer) - -- `which()` gives the indices of a Boolean vector - -```{r, eval=FALSE} -(x1 <- 1:10 %% 2 == 0) # 1-10 divisible by 2 -# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE -(x2 <- which(x1)) -# [1] 2 4 6 8 10 -(y1 <- 1:10 %% 5 == 0) # 1-10 divisible by 5 -# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE -(y2 <- which(y1)) -# [1] 5 10 -x1 & y1 -# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE -``` diff --git a/slides/04.qmd b/slides/04.qmd @@ -0,0 +1,376 @@ +--- +engine: knitr +title: Subsetting +--- + +## Learning objectives: + +- Select multiple elements from a vector with `[` +- Learn about the 3 subsetting operators: `[[`, `[`, and `$` +- Learn how subsetting works with different vector types 
+- Learn how subsetting can be combined with assignment + +# Selecting multiple elements + +## 1. Positive integers return elements at specified positions + +```{r} +#| label: positive_int +x <- c(1.1, 2.2, 3.3, 4.4) # decimal = original position +x +x[c(4, 1)] +x[c(1, 1, 1)] +x[c(1.9999)] +``` + +Reals *truncate* to integers. + +```{r} +#| label: positive_real +x[c(1.0001, 1.9999)] +``` + +## 2. Negative integers remove specified elements + +```{r} +#| label: negative_int +x[-c(1, 3)] # same as x[c(-1, -3)] or x[c(2, 4)] +``` + +## 2b. Mixing negative and positive integers throws an error + +```{r} +#| label: mixed_int +#| error: true +x[c(-1, 3)] +``` + +## 2c. Zeros ignored with other ints + +```{r} +#| label: negative_int_zero +x[c(-1, 0)] +x[c(-1, 0, 0, 0, 0, 0 ,0 ,0)] +x[c(1, 0, 2, 0, 3, 0)] +``` + + +## 3. Logical vectors select specified elements + +```{r} +#| label: logical_vec +x[c(TRUE, TRUE, FALSE, TRUE)] +x[x < 3] + +cond <- x > 2.5 +x[cond] +``` + +## 3b. Shorter element are recycled to higher length + +```{r} +#| label: recycling +x[FALSE] +x[TRUE] +x[c(FALSE, TRUE)] # equivalent to: x[c(FALSE, TRUE, FALSE, TRUE)] +``` + +- Easy to understand if x or y is 1, best to avoid other lengths + +## 3c. NA index returns NA + +```{r} +#| label: missing_index +x[c(NA, TRUE, NA, TRUE)] +``` +## 3d. Extra TRUE index returns NA + +```{r} +#| label: extra_index +x[c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)] +x[1:5] +``` + +## 4. Indexing with nothing returns original vector + +```{r nothing} +x[] +``` + +## 5. Indexing with just 0 returns 0-length vector (with class) + +```{r zero} +x[0] +letters[0] +``` + +## 6. Indexing with character vector returns element of named vector + +```{r character} +(y <- setNames(x, letters[1:4])) +y[c("d", "b", "a")] +y[c("a", "a", "a")] +``` + +## 6b. 
Names must be exact for `[` + +```{r} +#| label: exact_names +z <- c(abc = 1, def = 2) +z +z[c("a", "d")] +``` + +## Subsetting a list with `[` returns a list + +```{r} +#| label: list_subset_basics +my_list <- list(a = c(T, F), b = letters[5:15], c = 100:108) +my_list +my_list[c("a", "b")] +``` + +## Lists use same rules for `[` + +```{r} +#| label: list_subset_multiple +my_list[2:3] +my_list[c(TRUE, FALSE, TRUE)] +``` + +## Matrices & arrays take multidimensional indices + +```{r} +#| label: array_subset +a <- matrix(1:9, nrow = 3) +a +a[1:2, 2:3] # rows, columns +``` + +## Matrices & arrays can accept character, logical, etc + +```{r} +#| label: array_named +colnames(a) <- c("A", "B", "C") +a[c(TRUE, TRUE, FALSE), c("B", "A")] # a[1:2, 2:1] +``` + +## Matrices & arrays are also vectors + +```{r} +#| label: array_vector +vals <- outer(1:5, 1:5, FUN = "paste", sep = ",") # All chr combos of 1:5 +vals +vals[c(4, 15)] +a[a > 5] +``` + +## Data frames subset list-like with single index + +```{r} +#| label: df_subset1 +df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3]) +df[1:2] +df[c("x", "z")] +``` + +## Data frames subset matrix-like with multiple indices + +```{r} +df[1:2, c("x", "z")] # rows, columns +df[df$x == 2, ] # matching rows, all columns +df[, c("x", "z")] # equivalent to no , +``` + +## Subsetting a tibble with `[` returns a tibble + +```{r} +tbl <- tibble::as_tibble(df) +df[, 1] +df[, 1, drop = FALSE] # Prevent errors +tbl[, 1] +``` + +# Selecting a single element + +## `[[` selects a single element + +:::: {.columns} + +::: {.column} +```{r} +x <- list(1:3, "a", 4:6) +x[1] +class(x[1]) +x[[1]] +class(x[[1]]) +x[[1]][[1]] +``` +::: + +::: {.column} + +![](images/subsetting/hadley-tweet.png) +::: + +:::: + +## `$` is shorthand for `[[..., exact = FALSE]]` + +```{r} +#| label: dollar_subset +#| warning: true +x <- list(abc = 1) +x$abc +x$a +x[["a"]] +x[["a", exact = FALSE]] + +options(warnPartialMatchDollar = TRUE) +x$a +``` + +## Behavior for 
missing-ish indices is inconsistent + +```{r} +#| label: missingish_indices +#| error: true +a <- c(a = 1L, b = 2L) +lst <- list(a = 1:2) + +# Errors: +# a[[NULL]] +# lst[[NULL]] +# a[[5]] +# lst[[5]] +# a[["c"]] +# a[[NA]] + +lst[["c"]] +lst[[NA]] +``` + +## `purrr::pluck()` and `purrr::chuck()` provide consistent wrappers + +- `purrr::pluck()` always returns `NULL` or `.default` for (non-`NULL`) missing +- `purrr::chuck()` always throws error + +```{r} +purrr::pluck(a, 5) +purrr::pluck(a, "c") +purrr::pluck(lst, 5) +purrr::pluck(lst, "c") +``` + +## S4 has two additional subsetting operators + +- `@` equivalent to `$` (but error if bad) +- `slot()` equivalent to `[[` + +More in Chapter 15 + +# Subsetting and assignment + +## Can assign to position with `[` + +```{r} +x <- 1:5 +x[1:2] <- c(101, 102) +x +x[1:3] <- 1:2 +x +``` + +## Remove list component with `NULL` + +```{r} +x <- list(a = 1, b = 2) +x[["b"]] <- NULL +x +``` + +## Use `list(NULL)` to add `NULL` + +```{r} +x <- list(a = 1, b = 2) +x[["b"]] <- list(NULL) +x +``` + +## Subset with nothing to retain shape + +```{r} +df <- data.frame(a = 1:3, b = 1:3) +df[] <- "a" +df +df <- "a" +df +``` + +# Applications + +## Use a lookup vector and recycling rules to translate values + +```{r} +x <- c("b", "g", "x", "g", "g", "b") +lookup <- c(b = "blue", g = "green", x = NA) +lookup[x] +unname(lookup[x]) +``` + +## Use a lookup table to generate rows of data + +```{r} +info <- data.frame( + code = c("b", "g", "x"), + color = c("blue", "green", NA), + other_thing = 3:1 +) +match(x, info$code) # Indices of info$code in x +info[match(x, info$code), ] +``` + +## Sort with `order()` + +```{r} +x <- c("b", "c", "a") +order(x) +x[order(x)] + +df <- data.frame(b = 3:1, a = 1:3) +df[order(df$b), ] +df[, order(names(df))] +``` + +## Expand counts + +```{r} +df <- data.frame(x = c(2, 4, 1), y = c(9, 11, 6), n = c(3, 5, 1)) +rep(1:nrow(df), df$n) +df[rep(1:nrow(df), df$n), ] +``` + +## Ran out of time to make slides for + 
+Ideally a future cohort should expand these: + +- Remove df columns with `setdiff()` +- Logically subset rows `df[df$col > 5, ]` +- The next slide about `which()` + +## Boolean algebra versus sets (logical and integer) + +- `which()` gives the indices of a Boolean vector + +```{r, eval=FALSE} +(x1 <- 1:10 %% 2 == 0) # 1-10 divisible by 2 +# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE +(x2 <- which(x1)) +# [1] 2 4 6 8 10 +(y1 <- 1:10 %% 5 == 0) # 1-10 divisible by 5 +# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE +(y2 <- which(y1)) +# [1] 5 10 +x1 & y1 +# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE +```
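The final slide in the new `slides/04.qmd` shows `which()` turning a logical vector into integer positions, but stops at `x1 & y1`. As a minimal sketch (not part of the commit), the correspondence between Boolean operators on logical vectors and set functions on positions could be checked as follows. It reuses the slide's `x1`, `x2`, `y1`, `y2` names; the `and_*` and `or_*` names are introduced here only for illustration.

```r
# Logical vectors from the slide: multiples of 2 and of 5 within 1:10
x1 <- 1:10 %% 2 == 0
y1 <- 1:10 %% 5 == 0

# Integer positions of the TRUE values
x2 <- which(x1)
y2 <- which(y1)

# Boolean AND on logicals corresponds to intersect() on positions
and_logical <- which(x1 & y1)
and_sets <- intersect(x2, y2)

# Boolean OR corresponds to union() on positions;
# union() keeps first-seen order, so sort before comparing
or_logical <- which(x1 | y1)
or_sets <- sort(union(x2, y2))

identical(and_logical, and_sets)
identical(or_logical, or_sets)
```

Both comparisons rely on the inputs having no `NA` values: `which()` silently drops `NA`, so the logical and integer forms can diverge once missing values are present.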