bookclub-advr

DSLC Advanced R Book Club

03.qmd (45023B)


      1 ---
      2 engine: knitr
      3 title: Vectors
      4 ---
      5 
      6 ## Learning objectives:
      7 
      8 -   Learn about different types of vectors and their attributes
      9 -   Navigate through vector types and their value types
     10 -   Venture into factors and date-time objects
     11 -   Discuss the differences between data frames and tibbles
     12 -   Do not get absorbed by the `NA` and `NULL` black hole
     13 
     14 
     15 ## Session Info
     16 
     17 ```{r ch9_setup, message = FALSE, warning = FALSE}
     18 library("dplyr")
     19 library("gt")
     20 library("palmerpenguins")
     21 ```
     22 
     23 
     24 <details>
     25 <summary>Session Info</summary>
     26 ```{r}
     27 utils::sessionInfo()
     28 ```
     29 </details>
     30 
     31 ## Aperitif
     32 
     33 ![Palmer Penguins](images/vectors/lter_penguins.png)
     34 
     35 ## Counting Penguins
     36 
     37 Consider this code to count the number of Gentoo penguins in the `penguins` data set. We see that there are 124 Gentoo penguins.
     38 
     39 ```{r, eval = FALSE}
     40 sum("Gentoo" == penguins$species)
     41 # output: 124
     42 ```
     43 
     44 ## In
     45 
     46 One subtle error can arise in trying out `%in%` here instead.
     47 
     48 ```{r, results = 'hide'}
     49 species_vector <- penguins |> select(species)
     50 print("Gentoo" %in% species_vector)
     51 # output: FALSE
     52 ```
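Why `FALSE`? A likely explanation is that `select()` returns a data frame (a list of columns) rather than the column itself, so `%in%` compares `"Gentoo"` against each column as a whole. Here is a minimal base R sketch, where the small data frame `df` is a hypothetical stand-in for `penguins`:

```r
# Hypothetical stand-in for the penguins data: a one-column data frame
df <- data.frame(species = factor(c("Adelie", "Gentoo", "Gentoo")))

one_col <- df["species"]    # still a data frame (a list holding one column)
as_vec  <- df[["species"]]  # the underlying factor vector

is.data.frame(one_col)      # TRUE
"Gentoo" %in% one_col       # FALSE: matched against the column as a whole
"Gentoo" %in% as_vec        # TRUE: matched element by element
```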
     53 
     54 ![Where did the penguins go?](images/vectors/lter_penguins_no_gentoo.png)
     55 
     56 ## Fix: base R 
     57 
     58 ```{r, results = 'hide'}
     59 species_unlist <- penguins |> select(species) |> unlist()
     60 print("Gentoo" %in% species_unlist)
     61 # output: TRUE
     62 ```
     63 
     64 ## Fix: dplyr
     65 
     66 ```{r, results = 'hide'}
     67 species_pull <- penguins |> select(species) |> pull()
     68 print("Gentoo" %in% species_pull)
     69 # output: TRUE
     70 ```
     71 
     72 ## Motivation
     73 
     74 * What are the different types of vectors?
     75 * How does this affect accessing vectors?
     76 
     77 <details>
     78 <summary>Side Quest: Looking up the `%in%` operator</summary>
     79 If you want to look up the manual pages for the `%in%` operator with the `?`, use backticks:
     80 
     81 ```{r, eval = FALSE}
     82 ?`%in%`
     83 ```
     84 
     85 and we find that `%in%` is a wrapper for the `match()` function.
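
A minimal sketch of that relationship; the vectors `x` and `lookup` are made up for illustration:

```r
x <- c("Gentoo", "Emperor")
lookup <- c("Adelie", "Chinstrap", "Gentoo")

match(x, lookup)                     # position of first match: 3 NA
match(x, lookup, nomatch = 0L) > 0L  # TRUE FALSE
x %in% lookup                        # TRUE FALSE
```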
     86 
     87 </details>
     88 
     89 
     90 ## Types of Vectors
     91 
     92 ![Image Credit: Advanced R](images/vectors/summary-tree.png) 
     93 
     94 Two main types:
     95 
     96 -   **Atomic**: Elements all the same type.
-   **List**: Elements can have different types.
     98 
     99 Closely related but not technically a vector:
    100 
-   **NULL**: The null object. Always length zero; often serves as a generic zero-length vector.
    102 
    103 
    104 ## Types of Atomic Vectors (1/2)
    105 
    106 ![Image Credit: Advanced R](images/vectors/summary-tree-atomic.png){width=50%} 
    107 
    108 ## Types of Atomic Vectors (2/2)
    109 
    110 -   **Logical**: True/False
    111 -   **Integer**: Numeric (discrete, no decimals)
    112 -   **Double**: Numeric (continuous, decimals)
    113 -   **Character**: String
    114 
    115 ## Vectors of Length One
    116 
    117 **Scalars** are vectors that consist of a single value.
    118 
    119 ## Logicals
    120 
    121 ```{r vec_lgl}
    122 lgl1 <- TRUE
    123 lgl2 <- T #abbreviation for TRUE
    124 lgl3 <- FALSE
    125 lgl4 <- F #abbreviation for FALSE
    126 ```
    127 
    128 ## Doubles
    129 
    130 ```{r vec_dbl}
# integer, decimal, scientific, or hexadecimal format
dbl1 <- 1
dbl2 <- 1.234 # decimal
dbl3 <- 1.234e0 # scientific format
dbl4 <- 0xcafe # hexadecimal format
    136 ```
    137 
    138 ## Integers
    139 
Integers must be followed by `L` and cannot have fractional values.
    141 
    142 ```{r vec_int}
    143 int1 <- 1L
    144 int2 <- 1234L
    145 int3 <- 1234e0L
    146 int4 <- 0xcafeL
    147 ```
    148 
    149 <details>
    150 <summary>Pop Quiz: Why "L" for integers?</summary>
    151 Wickham notes that the use of `L` dates back to the **C** programming language and its "long int" type for memory allocation.
    152 </details>
    153 
    154 ## Strings
    155 
Strings can use single or double quotes, and special characters are escaped with `\`.
    157 
    158 ```{r vec_str}
    159 str1 <- "hello" # double quotes
    160 str2 <- 'hello' # single quotes
    161 str3 <- "مرحبًا" # Unicode
    162 str4 <- "\U0001f605" # sweaty_smile 😅
    163 ```
    164 
    165 ## Longer 1/2
    166 
    167 There are several ways to make longer vectors:
    168 
**1. With single values** inside `c()`, short for combine.
    170 
    171 ```{r long_single}
    172 lgl_var <- c(TRUE, FALSE)
    173 int_var <- c(1L, 6L, 10L)
    174 dbl_var <- c(1, 2.5, 4.5)
    175 chr_var <- c("these are", "some strings")
    176 ```
    177 
    178 ![Image Credit: Advanced R](images/vectors/atomic.png) 
    179 
    180 ## Longer 2/2
    181 
    182 **2. With other vectors**
    183 
    184 ```{r long_vec}
    185 c(c(1, 2), c(3, 4)) # output is not nested
    186 ```
    187 
    188 ## Type and Length
    189 
We can determine the type of a vector with `typeof()` and its length with `length()`.
    191 
    192 ```{r type_length, echo = FALSE}
    193 # typeof(lgl_var)
    194 # typeof(int_var)
    195 # typeof(dbl_var)
    196 # typeof(chr_var)
    197 # 
    198 # length(lgl_var)
    199 # length(int_var)
    200 # length(dbl_var)
    201 # length(chr_var)
    202 
    203 var_names <- c("lgl_var", "int_var", "dbl_var", "chr_var")
    204 var_values <- c("TRUE, FALSE", "1L, 6L, 10L", "1, 2.5, 4.5", "'these are', 'some strings'")
    205 var_type <- c("logical", "integer", "double", "character")
    206 var_length <- c(2, 3, 3, 2)
    207 
    208 type_length_df <- data.frame(var_names, var_values, var_type, var_length)
    209 
    210 # make gt table
    211 type_length_df |>
    212   gt() |>
    213   cols_align(align = "center") |>
    214   cols_label(
    215     var_names ~ "name",
    216     var_values ~ "value",
    217     var_type ~ "typeof()",
    218     var_length ~ "length()"
    219   ) |>
    220   tab_header(
    221     title = "Types of Atomic Vectors",
    222     subtitle = ""
    223   ) |>
    224   tab_footnote(
    225     footnote = "Source: https://adv-r.hadley.nz/index.html",
    226     locations = cells_title(groups = "title")
    227   ) |>
    228   tab_style(
    229     style = list(cell_fill(color = "#F9E3D6")),
    230     locations = cells_body(columns = var_type)
    231   ) |>
    232   tab_style(
    233     style = list(cell_fill(color = "lightcyan")),
    234     locations = cells_body(columns = var_length)
    235   )
    236 ```
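
The table above is typed in by hand. A minimal sketch of how the same values could be computed directly follows; the helper list `vars` is an assumption made for this example:

```r
lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")

# Collect the vectors in a named list so each function runs over all of them
vars <- list(
  lgl_var = lgl_var, int_var = int_var,
  dbl_var = dbl_var, chr_var = chr_var
)

sapply(vars, typeof)  # "logical" "integer" "double" "character"
sapply(vars, length)  # 2 3 3 2
```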
    237 
    238 ## Side Quest: Penguins
    239 
    240 <details>
    241 ```{r}
    242 typeof(penguins$species)
    243 class(penguins$species)
    244 
    245 typeof(species_unlist)
    246 class(species_unlist)
    247 
    248 typeof(species_pull)
    249 class(species_pull)
    250 ```
    251 
    252 </details>
    253 
    254 ## Missing values: Contagion
    255 
    256 For most computations, an operation over values that includes a missing value yields a missing value (unless you're careful)
    257 
    258 ```{r na_contagion}
    259 # contagion
    260 5*NA
    261 sum(c(1, 2, NA, 3))
    262 ```
    263 
    264 ## Missing values: Contagion Exceptions
    265 
    266 ```{r na_exceptions, eval = FALSE}
    267 NA ^ 0
    268 #> [1] 1
    269 NA | TRUE
    270 #> [1] TRUE
    271 NA & FALSE
    272 #> [1] FALSE
    273 ```
    274 
    275 
#### Inoculation
    277 
    278 ```{r na_innoculation, eval = FALSE}
    279 sum(c(1, 2, NA, 3), na.rm = TRUE)
    280 # output: 6
    281 ```
    282 
    283 To search for missing values use `is.na()`
    284 
    285 ```{r na_search, eval = FALSE}
    286 x <- c(NA, 5, NA, 10)
    287 x == NA
    288 # output: NA NA NA NA [BATMAN!]
    289 ```
    290 
    291 ```{r na_search_better, eval = FALSE}
    292 is.na(x)
    293 # output: TRUE FALSE TRUE FALSE
    294 ```
    295 
    296 ## Missing Values: NA Types 
    297 
    298 <details>
    299 Each type has its own NA type
    300 
    301 -   Logical: `NA`
-   Integer: `NA_integer_`
-   Double: `NA_real_`
-   Character: `NA_character_`
    305 
    306 This may not matter in many contexts.
    307 
    308 Can matter for operations where types matter like `dplyr::if_else()`.
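
A short sketch that checks each typed missing value with `typeof()`, and shows a plain `NA` taking on the type of the vector it joins:

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# A plain (logical) NA is coerced to the type of its neighbours
typeof(c(1L, NA))      # "integer"
typeof(c("a", NA))     # "character"
```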
    309 </details>
    310 
    311 
    312 ## Testing (1/2)
    313 
    314 **What type of vector `is.*`() it?**
    315 
    316 Test data type:
    317 
    318 -   Logical: `is.logical()`
    319 -   Integer: `is.integer()`
    320 -   Double: `is.double()`
    321 -   Character: `is.character()`
    322 
    323 
    324 ## Testing (2/2)
    325 
    326 **What type of object is it?**
    327 
    328 Don't test objects with these tools:
    329 
    330 -   `is.vector()`
    331 -   `is.atomic()`
    332 -   `is.numeric()` 
    333 
    334 They don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do (preview: *attributes*)
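
A few calls that illustrate the surprises; the attribute name `unit` is made up for this sketch:

```r
x <- c(a = 1, b = 2)
is.vector(x)              # TRUE: names are the one attribute allowed
attr(x, "unit") <- "cm"
is.vector(x)              # FALSE: any other attribute fails the test

is.vector(list(1, "a"))   # TRUE: lists count as vectors here

f <- factor("a")
typeof(f)                 # "integer"
is.numeric(f)             # FALSE: factors are excluded despite their type
```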
    335 
    336 ## Side Quest: rlang `is_*()`
    337 
    338 <details>
    339 <summary>Maybe use `{rlang}`?</summary>
    340 
    341 -   `rlang::is_vector`
    342 -   `rlang::is_atomic`
    343 
    344 ```{r test_rlang}
    345 # vector
    346 rlang::is_vector(c(1, 2))
    347 rlang::is_vector(list(1, 2))
    348 
    349 # atomic
    350 rlang::is_atomic(c(1, 2))
    351 rlang::is_atomic(list(1, "a"))
    352 
    353 ```
    354 
    355 See more [here](https://rlang.r-lib.org/reference/type-predicates.html)
    356 </details>
    357 
    358 
    359 ## Coercion
    360 
* When types are mixed, R coerces to the most flexible type, in the fixed order character → double → integer → logical
    362 
    363 * R can coerce either automatically or explicitly
    364 
    365 #### **Automatic**
    366 
    367 Two contexts for automatic coercion:
    368 
    369 1.  Combination
    370 2.  Mathematical
    371 
    372 
    373 
    374 ## Coercion by Combination:
    375 
    376 ```{r coerce_c}
    377 str(c(TRUE, "TRUE"))
    378 ```
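
A brief sketch that uses `typeof()` to confirm which type wins for a few other combinations:

```r
typeof(c(FALSE, 2L))   # "integer":   logical gives way to integer
typeof(c(1L, 2.5))     # "double":    integer gives way to double
typeof(c(1.5, "a"))    # "character": everything gives way to character

c(FALSE, 2L)           # 0 2
c(1.5, "a")            # "1.5" "a"
```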
    379 
    380 ## Coercion by Mathematical operations:
    381 
    382 ```{r coerce_math}
    383 # imagine a logical vector about whether an attribute is present
    384 has_attribute <- c(TRUE, FALSE, TRUE, TRUE)
    385 
    386 # number with attribute
    387 sum(has_attribute)
    388 ```
    389 
    390 ## **Explicit**
    391 
    392 <!--
    393 
    394 Use `as.*()`
    395 
    396 -   Logical: `as.logical()`
    397 -   Integer: `as.integer()`
    398 -   Double: `as.double()`
    399 -   Character: `as.character()`
    400 
    401 -->
    402 
    403 ```{r explicit_coercion, echo = FALSE}
    404 # dbl_var
    405 # as.integer(dbl_var)
    406 # lgl_var
    407 # as.character(lgl_var)
    408 
    409 var_names <- c("lgl_var", "int_var", "dbl_var", "chr_var")
    410 var_values <- c("TRUE, FALSE", "1L, 6L, 10L", "1, 2.5, 4.5", "'these are', 'some strings'")
    411 as_logical <- c("TRUE FALSE", "TRUE TRUE TRUE", "TRUE TRUE TRUE", "NA NA")
as_integer <- c("1 0", "1 6 10", "1 2 4", 'NA_integer_ NA_integer_')
as_double <- c("1 0", "1 6 10", "1.0 2.5 4.5", 'NA_real_ NA_real_')
    414 as_character <- c("'TRUE' 'FALSE'", "'1' '6' '10'", "'1' '2.5' '4.5'", "'these are', 'some strings'")
    415 
    416 coercion_df <- data.frame(var_names, var_values, as_logical, as_integer, as_double, as_character)
    417 
    418 coercion_df |>
    419   gt() |>
    420   cols_align(align = "center") |>
    421   cols_label(
    422     var_names ~ "name",
    423     var_values ~ "value",
    424     as_logical ~ "as.logical()",
    425     as_integer ~ "as.integer()",
    426     as_double ~ "as.double()",
    427     as_character ~ "as.character()"
    428   ) |>
    429   tab_header(
    430     title = "Coercion of Atomic Vectors",
    431     subtitle = ""
    432   ) |>
    433   tab_footnote(
    434     footnote = "Source: https://adv-r.hadley.nz/index.html",
    435     locations = cells_title(groups = "title")
    436   ) |>
    437   tab_style(
    438     style = list(cell_fill(color = "#F9E3D6")),
    439     locations = cells_body(columns = c(as_logical, as_double))
    440   ) |>
    441   tab_style(
    442     style = list(cell_fill(color = "lightcyan")),
    443     locations = cells_body(columns = c(as_integer, as_character))
    444   )
    445 ```
    446 
    447 But note that coercion may fail in one of two ways, or both:
    448 
    449 -   With warning/error
    450 -   NAs
    451 
    452 ```{r coerce_error}
    453 as.integer(c(1, 2, "three"))
    454 ```
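
One way to find which inputs failed is to compare the `NA`s before and after coercion. This is a minimal sketch and assumes silencing the warning is acceptable for the check:

```r
x <- c("1", "2", "three")

# Silence the "NAs introduced by coercion" warning for this check only
out <- suppressWarnings(as.integer(x))
out                          # 1 2 NA

# Values that were not missing before coercion but are missing after it
x[is.na(out) & !is.na(x)]    # "three"
```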
    455 
    456 ## Exercises 1/5
    457 
    458 1. How do you create raw and complex scalars?
    459 
    460 <details><summary>Answer(s)</summary>
    461 ```{r, eval = FALSE}
    462 as.raw(42)
    463 #> [1] 2a
    464 charToRaw("A")
    465 #> [1] 41
    466 complex(length.out = 1, real = 1, imaginary = 1)
    467 #> [1] 1+1i
    468 ```
    469 </details>
    470 
    471 ## Exercises 2/5
    472 
    473 2. Test your knowledge of the vector coercion rules by predicting the output of the following uses of c():
    474 
    475 ```{r, eval = FALSE}
    476 c(1, FALSE)
    477 c("a", 1)
    478 c(TRUE, 1L)
    479 ```
    480 
    481 <details><summary>Answer(s)</summary>
    482 ```{r, eval = FALSE}
    483 c(1, FALSE)      # will be coerced to double    -> 1 0
    484 c("a", 1)        # will be coerced to character -> "a" "1"
    485 c(TRUE, 1L)      # will be coerced to integer   -> 1 1
    486 ```
    487 </details>
    488 
    489 ## Exercises 3/5
    490 
    491 3. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
    492 
    493 <details><summary>Answer(s)</summary>
These comparisons are carried out by operator functions (`==`, `<`), which coerce their arguments to a common type. In the examples above, these types are character, double, and character: `1` is coerced to `"1"`, `FALSE` is represented as `0`, and `2` turns into `"2"`. Numbers precede letters in lexicographic order, although this may depend on the locale.
    495 
    496 </details>
    497 
    498 ## Exercises 4/5
    499 
    500 4. Why is the default missing value, NA, a logical vector? What’s special about logical vectors?
    501 
    502 <details><summary>Answer(s)</summary>
    503 The presence of missing values shouldn’t affect the type of an object. Recall that there is a type-hierarchy for coercion from character → double → integer → logical. When combining `NA`s with other atomic types, the `NA`s will be coerced to integer (`NA_integer_`), double (`NA_real_`) or character (`NA_character_`) and not the other way round. If `NA` were a character and added to a set of other values all of these would be coerced to character as well.
    504 </details>
    505 
    506 ## Exercises 5/5
    507 
    508 5. Precisely what do `is.atomic()`, `is.numeric()`, and `is.vector()` test for?
    509 
    510 <details><summary>Answer(s)</summary>
    511 
* `is.atomic()` tests if an object is an atomic vector; in R versions before 4.4.0 it also returned `TRUE` for `NULL` (!). Atomic vectors are objects of type logical, integer, double, complex, character or raw.
    513 * `is.numeric()` tests if an object has type integer or double and is not of class `factor`, `Date`, `POSIXt` or `difftime`.
    514 * `is.vector()` tests if an object is a vector or an expression and has no attributes, apart from names. Vectors are atomic vectors or lists.
    515  
    516 </details>
    517 
    518 
    519 ## Attributes
    520 
    521 Attributes are name-value pairs that attach metadata to an object (vector).
    522 
    523 * **Name-value pairs**: attributes have a name and a value
    524 * **Metadata**: not data itself, but data about the data
    525  
    526 ## Getting and Setting
    527 
    528 Three functions:
    529 
    530 1. retrieve and modify single attributes with `attr()`
    531 2. retrieve en masse with `attributes()`
    532 3. set en masse with `structure()`
    533 
    534 ## Single attribute
    535 
    536 Use `attr()`
    537 
    538 ```{r attr_single}
    539 # some object
    540 a <- c(1, 2, 3)
    541 
    542 # set attribute
    543 attr(x = a, which = "attribute_name") <- "some attribute"
    544 
    545 # get attribute
    546 attr(a, "attribute_name")
    547 ```
    548 
    549 ## Multiple attributes
    550 
    551 `structure()`: set multiple attributes, `attributes()`: get multiple attributes
    552 
    553 :::: columns
    554 ::: column
    555 ```{r attr_multiple}
    556 a <- 1:3
    557 attr(a, "x") <- "abcdef"
    558 attr(a, "x")
    559 
    560 attr(a, "y") <- 4:6
    561 str(attributes(a))
    562 
    563 b <- structure(
    564   1:3, 
    565   x = "abcdef",
    566   y = 4:6
    567 )
    568 identical(a, b)
    569 ```
    570 :::
    571 
    572 ::: column
    573 ![Image Credit: Advanced R](images/vectors/attr.png) 
    574 :::
    575 ::::
    576 
    577 
    578 ## Why
    579 
    580 Three particularly important attributes: 
    581 
    582 1. **names** - a character vector giving each element a name
    583 2. **dimension** - (or dim) turns vectors into matrices and arrays 
    584 3. **class** - powers the S3 object system (we'll learn more about this in chapter 13)
    585 
    586 Most attributes are lost by most operations.  Only two attributes are routinely preserved: names and dimension.
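
A small sketch of this behavior; the attribute `note` is made up for illustration:

```r
a <- structure(1:3, names = c("x", "y", "z"), note = "custom metadata")
names(attributes(a))   # "names" "note"

# Subsetting keeps names but drops the custom attribute
attributes(a[1:2])     # only $names: "x" "y"

# Summaries drop everything
attributes(sum(a))     # NULL
```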
    587 
    588 ## Names
    589 
    590 ~~Three~~ Four ways to name:
    591 
    592 :::: columns
    593 
    594 ::: {.column width="50%"}
    595 ```{r names}
    596 # (1) On creation: 
    597 x <- c(A = 1, B = 2, C = 3)
    598 x
    599 
    600 # (2) Assign to names():
    601 y <- 1:3
    602 names(y) <- c("a", "b", "c")
    603 y
    604 
    605 # (3) Inline:
    606 z <- setNames(1:3, c("a", "b", "c"))
    607 z
    608 ```
    609 :::
    610 
    611 ::: {.column width="50%"}
    612 ![proper diagram](images/vectors/attr-names-1.png) 
    613 :::
    614 
    615 ::::
    616 
    617 ## rlang Names
    618 
    619 :::: columns
    620 
    621 ::: {.column width="50%"}
    622 
    623 ```{r names_via_rlang}
    624 # (4) Inline with {rlang}:
    625 a <- 1:3
    626 rlang::set_names(
    627   a,
    628   c("a", "b", "c")
    629 )
    630 ```
    631 
    632 :::
    633 
    634 ::: {.column width="50%"}
    635 ![simplified diagram](images/vectors/attr-names-2.png) 
    636 :::
    637 
    638 ::::
    639 
    640 
    641 ## Removing names
    642 
    643 * `x <- unname(x)` or `names(x) <- NULL`
    644 * Thematically but not directly related: labelled class vectors with `haven::labelled()`
    645 
    646 
    647 ## Dimensions: `matrix()` and `array()`
    648 
    649 ```{r dimensions}
    650 # Two scalar arguments specify row and column sizes
    651 x <- matrix(1:6, nrow = 2, ncol = 3)
    652 x
    653 # One vector argument to describe all dimensions
    654 y <- array(1:12, c(2, 3, 2)) # rows, columns, no of arrays
    655 y
    656 ```
    657 
    658 ## Dimensions: assign to `dim()`
    659 
    660 ```{r dimensions2}
    661 # You can also modify an object in place by setting dim()
    662 z <- 1:6
    663 dim(z) <- c(2, 3) # rows, columns
    664 z
    665 a <- 1:12
    666 dim(a) <- c(2, 3, 2) # rows, columns, no of arrays
    667 a
    668 ```
    669 
    670 
    671 ## Functions for working with vectors, matrices and arrays (1/2):
    672 
    673 Vector | Matrix	| Array
    674 :----- | :---------- | :-----
    675 `names()` | `rownames()`, `colnames()` | `dimnames()`
    676 `length()` | `nrow()`, `ncol()` | `dim()`
    677 `c()` | `rbind()`, `cbind()` | `abind::abind()`
    678 — | `t()` | `aperm()`
    679 `is.null(dim(x))` | `is.matrix()` | `is.array()`
    680 
    681 * **Caution**: Vector without `dim` set has `NULL` dimensions, not `1`.
* **One dimension?** A 1d array, a column vector, and a row vector are all distinct from a plain vector.
    683 
    684 ## Functions for working with vectors, matrices and arrays (2/2):
    685 
    686 ```{r examples_of_1D, eval = FALSE}
    687 str(1:3)                   # 1d vector
    688 #>  int [1:3] 1 2 3
    689 str(matrix(1:3, ncol = 1)) # column vector
    690 #>  int [1:3, 1] 1 2 3
    691 str(matrix(1:3, nrow = 1)) # row vector
    692 #>  int [1, 1:3] 1 2 3
    693 str(array(1:3, 3))         # "array" vector
    694 #>  int [1:3(1d)] 1 2 3
    695 ```
    696 
    697 
    698 ## Exercises 1/4
    699 
    700 1. How is `setNames()` implemented? Read the source code.
    701 
    702 <details><summary>Answer(s)</summary>
    703 
    704 ```{r, eval = FALSE}
    705 setNames <- function(object = nm, nm) {
    706   names(object) <- nm
    707   object
    708 }
    709 ```
    710 
    711 - Data arg 1st = works well with pipe.
    712 - 1st arg is optional
    713 
    714 ```{r, eval = FALSE}
    715 setNames( , c("a", "b", "c"))
    716 #>   a   b   c 
    717 #> "a" "b" "c"
    718 ```
    719 </details>
    720 
    721 ## Exercises 1/4 (cont)
    722 
    723 1. How is `unname()` implemented? Read the source code.
    724 
    725 <details><summary>Answer(s)</summary>
    726 
    727 ```{r, eval = FALSE}
    728 unname <- function(obj, force = FALSE) {
    729   if (!is.null(names(obj))) 
    730     names(obj) <- NULL
    731   if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj))) 
    732     dimnames(obj) <- NULL
    733   obj
    734 }
    735 ```
    736 `unname()` sets existing `names` or `dimnames` to `NULL`.
    737 </details>
    738 
    739 ## Exercises 2/4
    740 
    741 2. What does `dim()` return when applied to a 1-dimensional vector? When might you use `NROW()` or `NCOL()`?
    742 
    743 <details><summary>Answer(s)</summary>
    744 
    745 > `dim()` returns `NULL` when applied to a 1d vector.
    746 
`NROW()` and `NCOL()` treat a vector without dimensions as a one-column matrix:
    748 
    749 ```{r, eval = FALSE}
    750 x <- 1:10
    751 nrow(x)
    752 #> NULL
    753 ncol(x)
    754 #> NULL
    755 NROW(x)
    756 #> [1] 10
    757 NCOL(x)
    758 #> [1] 1
    759 ```
    760 
    761 </details>
    762 
    763 ## Exercises 3/4
    764 
    765 3. How would you describe the following three objects? What makes them different from `1:5`?
    766 
    767 ```{r}
    768 x1 <- array(1:5, c(1, 1, 5))
    769 x2 <- array(1:5, c(1, 5, 1))
    770 x3 <- array(1:5, c(5, 1, 1))
    771 ```
    772 
    773 <details><summary>Answer(s)</summary>
    774 ```{r, eval = FALSE}
    775 x1 <- array(1:5, c(1, 1, 5))  # 1 row,  1 column,  5 in third dim.
    776 x2 <- array(1:5, c(1, 5, 1))  # 1 row,  5 columns, 1 in third dim.
    777 x3 <- array(1:5, c(5, 1, 1))  # 5 rows, 1 column,  1 in third dim.
    778 ```
    779 </details>
    780 
    781 ## Exercises 4/4
    782 
    783 4. An early draft used this code to illustrate `structure()`:
    784 
    785 ```{r, eval = FALSE}
    786 structure(1:5, comment = "my attribute")
    787 #> [1] 1 2 3 4 5
    788 ```
    789 
    790 Why don't you see the comment attribute on print? Is the attribute missing, or is there something else special about it?
    791 
    792 <details><summary>Answer(s)</summary>
    793 The documentation states (see `?comment`):
    794 
> Contrary to other attributes, the comment is not printed (by print or print.default).

</details>
    796 
    797 ## Exercises 4/4 (cont)
    798 
    799 <details><summary>Answer(s)</summary>
Also, from `?attributes`:
    801 
    802 > Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.
    803 
    804 Retrieve comment attributes with `attr()`:
    805 
    806 ```{r, eval = FALSE}
    807 foo <- structure(1:5, comment = "my attribute")
    808 
    809 attributes(foo)
    810 #> $comment
    811 #> [1] "my attribute"
    812 attr(foo, which = "comment")
    813 #> [1] "my attribute"
    814 ```
    815 
    816 </details>
    817 
    818 
    819 
    820 ## **Class** - S3 atomic vectors
    821 
    822 ![](images/vectors/summary-tree-s3-1.png) 
    823 
    824 Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
    825 
    826 **Having a class attribute turns an object into an S3 object.**
    827 
    828 What makes S3 atomic vectors different?
    829 
    830 1. behave differently from a regular vector when passed to a generic function 
    831 2. often store additional information in other attributes
    832 
    833 
    834 ## Four important S3 vectors used in base R:
    835 
    836 1. **Factors** (categorical data)
    837 2. **Dates**
    838 3. **Date-times** (POSIXct)
    839 4. **Durations** (difftime)
    840 
    841 ## Factors
    842 
    843 A factor is a vector used to store categorical data that can contain only predefined values.
    844 
    845 Factors are integer vectors with:
    846 
    847 -   Class: "factor"
    848 -   Attributes: "levels", or the set of allowed values
    849 
    850 ## Factors examples
    851 
    852 ```{r factor}
    853 colors = c('red', 'blue', 'green','red','red', 'green')
    854 colors_factor <- factor(
    855   x = colors, levels = c('red', 'blue', 'green', 'yellow')
    856 )
    857 ```
    858 
    859 :::: columns
    860 
    861 ::: column
    862 
    863 ```{r factor_table}
    864 table(colors)
    865 table(colors_factor)
    866 ```
    867 :::
    868 
    869 ::: column
    870 ```{r factor_type}
    871 typeof(colors_factor)
    872 class(colors_factor)
    873 
    874 attributes(colors_factor)
    875 ```
    876 :::
    877 ::::
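
A short sketch of the integer codes behind a factor, and of what happens when a value outside the levels is assigned; the vector `f` is made up for illustration:

```r
f <- factor(c("red", "blue", "red"), levels = c("red", "blue", "green"))

as.integer(f)   # 1 2 1: positions in the levels attribute
levels(f)       # "red" "blue" "green"

# "purple" is not a level, so R warns and stores NA instead
suppressWarnings(f[2] <- "purple")
f               # red <NA> red
```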
    878 
    879 ## Custom Order
    880 
    881 Factors can be ordered. This can be useful for models or visualizations where order matters.
    882 
    883 ```{r factor_ordered}
    884 
    885 values <- c('high', 'med', 'low', 'med', 'high', 'low', 'med', 'high')
    886 ordered_factor <- ordered(
    887   x = values,
    888   levels = c('low', 'med', 'high') # in order
    889 )
    890 ordered_factor
    891 
    892 table(values)
    893 table(ordered_factor)
    894 ```
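
Because the levels carry an order, comparisons and sorting follow it rather than the alphabet. A minimal sketch with a made-up vector `sizes`:

```r
sizes <- ordered(c("low", "high", "med"), levels = c("low", "med", "high"))

sizes < "high"   # TRUE FALSE TRUE
sort(sizes)      # low med high (level order, not alphabetical)
max(sizes)       # high
```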
    895 
    896 ## Dates
    897 
    898 Dates are:
    899 
    900 -   Double vectors
    901 -   With class "Date"
    902 -   No other attributes
    903 
    904 ```{r dates}
    905 notes_date <- Sys.Date()
    906 
    907 # type
    908 typeof(notes_date)
    909 
    910 # class
    911 attributes(notes_date)
    912 ```
    913 
    914 ## Dates Unix epoch
    915 
The double component represents the number of days since the [Unix epoch](https://en.wikipedia.org/wiki/Unix_time) `1970-01-01`
    917 
    918 ```{r days_since_1970}
    919 date <- as.Date("1970-02-01")
    920 unclass(date)
    921 ```
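
The conversion also works in the other direction. A minimal sketch that rebuilds the date from its day count; passing `origin` explicitly is a choice made here so the call does not depend on the R version:

```r
d <- as.Date("1970-02-01")
unclass(d)                                  # 31 days after 1970-01-01

# Rebuild the date from the bare number
d2 <- as.Date(31, origin = "1970-01-01")
d2 == d                                     # TRUE

# A Date is a double with a class attribute
d3 <- structure(31, class = "Date")
d3 == d                                     # TRUE
```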
    922 
    923 ## Date-times
    924 
    925 There are 2 Date-time representations in base R:
    926 
    927 -   POSIXct, where "ct" denotes *calendar time*
    928 -   POSIXlt, where "lt" designates *local time*
    929 
    930 <!--
    931 
    932 Just for fun:
    933 "How to pronounce 'POSIXct'?"
    934 https://www.howtopronounce.com/posixct
    935 
    936 -->
    937 
## Date-times: POSIXct
    939 
    940 We'll focus on POSIXct because:
    941 
    942 -   Simplest
    943 -   Built on an atomic (double) vector
    944 -   Most appropriate for use in a data frame
    945 
    946 Let's now build and deconstruct a Date-time
    947 
    948 ```{r date_time}
    949 # Build
    950 note_date_time <- as.POSIXct(
    951   x = Sys.time(), # time
    952   tz = "America/New_York" # time zone, used only for formatting
    953 )
    954 
    955 # Inspect
    956 note_date_time
    957 
    958 # - type
    959 typeof(note_date_time)
    960 
    961 # - attributes
    962 attributes(note_date_time)
    963 
    964 structure(note_date_time, tzone = "Europe/Paris")
    965 ```
    966 
    967 ```{r date_time_format}
    968 date_time <- as.POSIXct("2024-02-22 12:34:56", tz = "EST")
    969 unclass(date_time)
    970 ```
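
The `tzone` attribute changes only how the instant is printed, not the stored number of seconds. A minimal sketch, assuming the system time zone database includes `Europe/Paris`:

```r
t_utc <- as.POSIXct("2024-02-22 12:34:56", tz = "UTC")

# Same instant, displayed in another time zone
t_paris <- structure(t_utc, tzone = "Europe/Paris")

as.numeric(t_utc) == as.numeric(t_paris)  # TRUE: same seconds since the epoch
format(t_utc, "%H:%M")                    # "12:34"
format(t_paris, "%H:%M")                  # one hour later on the Paris clock
```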
    971 
    972 
    973 ## Durations
    974 
    975 Durations represent the amount of time between pairs of dates or date-times.
    976 
    977 -   Double vectors
    978 -   Class: "difftime"
    979 -   Attributes: "units", or the unit of duration (e.g., weeks, hours, minutes, seconds, etc.)
    980 
    981 ```{r durations}
    982 # Construct
    983 one_minute <- as.difftime(1, units = "mins")
    984 # Inspect
    985 one_minute
    986 
    987 # Dissect
    988 # - type
    989 typeof(one_minute)
    990 # - attributes
    991 attributes(one_minute)
    992 ```
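
The `units` attribute controls how the underlying double is interpreted. A short sketch of converting between units with a made-up duration `one_hour`:

```r
one_hour <- as.difftime(1, units = "hours")

# Read the value in another unit without changing the object
as.numeric(one_hour, units = "mins")  # 60

# Change the stored unit; the underlying double is rescaled
units(one_hour) <- "secs"
as.numeric(one_hour)                  # 3600
units(one_hour)                       # "secs"
```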
    993 
    994 ```{r durations_math}
# `date` was set to 1970-02-01 above, so this is the time elapsed since then
time_since_1970_02_01 <- notes_date - date
time_since_1970_02_01
    997 ```
    998 
    999 
   1000 See also:
   1001 
   1002 -   [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html)
   1003 -   [`clock::date_time_build()`](https://clock.r-lib.org/reference/date_time_build.html)
   1004 
   1005 
   1006 ## Exercises 1/3
   1007 
   1008 1. What sort of object does `table()` return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
   1009 
   1010 <details><summary>Answer(s)</summary>
   1011 
   1012 `table()` returns a contingency table of its input variables. It is implemented as an integer vector with class table and dimensions (which makes it act like an array). Its attributes are dim (dimensions) and dimnames (one name for each input column). The dimensions correspond to the number of unique values (factor levels) in each input variable.
   1013 
   1014 ```{r, eval = FALSE}
   1015 x <- table(mtcars[c("vs", "cyl", "am")])
   1016 
   1017 typeof(x)
   1018 #> [1] "integer"
   1019 attributes(x)
   1020 #> $dim
   1021 #> [1] 2 3 2
   1022 #> 
   1023 #> $dimnames
   1024 #> $dimnames$vs
   1025 #> [1] "0" "1"
   1026 #> 
   1027 #> $dimnames$cyl
   1028 #> [1] "4" "6" "8"
   1029 #> 
   1030 #> $dimnames$am
   1031 #> [1] "0" "1"
   1032 #> 
   1033 #> 
   1034 #> $class
   1035 #> [1] "table"
   1036 ```
   1037 </details>
   1038 
   1039 ## Exercises 2/3
   1040 
   1041 2. What happens to a factor when you modify its levels?
   1042 
   1043 ```{r, eval = FALSE}
   1044 f1 <- factor(letters)
   1045 levels(f1) <- rev(levels(f1))
   1046 ```
   1047 
   1048 <details><summary>Answer(s)</summary>
   1049 The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.
   1050 
   1051 ```{r, eval = FALSE}
   1052 f1 <- factor(letters)
   1053 f1
   1054 #>  [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
   1055 #> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
   1056 as.integer(f1)
   1057 #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
   1058 #> [26] 26
   1059 
   1060 levels(f1) <- rev(levels(f1))
   1061 f1
   1062 #>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
   1063 #> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
   1064 as.integer(f1)
   1065 #>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
   1066 #> [26] 26
   1067 ```
   1068 </details>
   1069 
   1070 ## Exercises 3/3
   1071 
   1072 3. What does this code do? How do `f2` and `f3` differ from `f1`?
   1073 
   1074 ```{r, eval = FALSE}
   1075 f2 <- rev(factor(letters))
   1076 f3 <- factor(letters, levels = rev(letters))
   1077 ```
   1078 
   1079 <details><summary>Answer(s)</summary>
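`f2` reverses the order of the elements but keeps the levels in alphabetical order. `f3` keeps the element order but reverses the levels. In `f1` (after its levels were reassigned), both the printed elements and the levels appear reversed, although the underlying integer codes are unchanged.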
   1081 
   1082 ```{r, eval = FALSE}
   1083 # Reverse element order
   1084 (f2 <- rev(factor(letters)))
   1085 #>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
   1086 #> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
   1087 as.integer(f2)
   1088 #>  [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
   1089 #> [26]  1
   1090 
   1091 # Reverse factor levels (when creating factor)
   1092 (f3 <- factor(letters, levels = rev(letters)))
   1093 #>  [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
   1094 #> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
   1095 as.integer(f3)
   1096 #>  [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
   1097 #> [26]  1
   1098 ```
   1099 </details>
   1100 
   1101 
   1102 ## Lists
   1103 
* Sometimes called a generic vector or recursive vector
* Recall ([section 2.3.3](https://adv-r.hadley.nz/names-values.html#list-references)): each element is really a *reference* to another object
* Can be composed of elements of different types (as opposed to atomic vectors, which must be of only one type)
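
A minimal sketch of the last point, using base R only (the names `mixed_list` and `mixed_atomic` are made up for this illustration): `list()` keeps each element's own type, while `c()` coerces everything to one common type.

```{r}
# list() keeps every element's own type
mixed_list <- list(1L, "a", TRUE)
sapply(mixed_list, typeof)

# c() builds an atomic vector, so all elements are coerced to character
mixed_atomic <- c(1L, "a", TRUE)
typeof(mixed_atomic)
mixed_atomic
```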
   1107 
   1108 ## Constructing
   1109 
   1110 Simple lists:
   1111 
   1112 ```{r list_simple}
   1113 # Construct
   1114 simple_list <- list(
   1115   c(TRUE, FALSE),   # logicals
   1116   1:20,             # integers
   1117   c(1.2, 2.3, 3.4), # doubles
   1118   c("primo", "secundo", "tercio") # characters
   1119 )
   1120 
   1121 simple_list
   1122 
   1123 # Inspect
   1124 # - type
   1125 typeof(simple_list)
   1126 # - structure
   1127 str(simple_list)
   1128 
   1129 # Accessing
   1130 simple_list[1]
   1131 simple_list[2]
   1132 simple_list[3]
   1133 simple_list[4]
   1134 
   1135 simple_list[[1]][2]
   1136 simple_list[[2]][8]
   1137 simple_list[[3]][2]
   1138 simple_list[[4]][3]
   1139 ```
   1140 
   1141 ## Even Simpler List
   1142 
   1143 ```{r list_simpler}
   1144 # Construct
   1145 simpler_list <- list(TRUE, FALSE, 
   1146                     1, 2, 3, 4, 5, 
   1147                     1.2, 2.3, 3.4, 
   1148                     "primo", "secundo", "tercio")
   1149 
   1150 # Accessing
   1151 simpler_list[1]
   1152 simpler_list[5]
   1153 simpler_list[9]
   1154 simpler_list[11]
   1155 ```
   1156 
   1157 ## Nested lists:
   1158 
   1159 ```{r list_nested}
   1160 nested_list <- list(
   1161   # first level
   1162   list(
   1163     # second level
   1164     list(
   1165       # third level
   1166       list(1)
   1167     )
   1168   )
   1169 )
   1170 
   1171 str(nested_list)
   1172 ```
   1173 
   1174 Like JSON.
   1175 
   1176 ## Combined lists
   1177 
   1178 ```{r list_combined}
   1179 list_comb1 <- list(list(1, 2), list(3, 4)) # with list()
   1180 list_comb2 <- c(list(1, 2), list(3, 4)) # with c()
   1181 
   1182 # compare structure
   1183 str(list_comb1)
   1184 str(list_comb2)
   1185 
   1186 # does this work if they are different data types?
   1187 list_comb3 <- c(list(1, 2), list(TRUE, FALSE))
   1188 str(list_comb3)
   1189 ```
   1190 
   1191 ## Testing
   1192 
Check whether an object is a list:

-   `is.list()`
-   `rlang::is_list()`

The two do the same thing, except that the latter can also check the number of elements.
   1199 
   1200 ```{r list_test}
   1201 # is list
   1202 base::is.list(list_comb2)
   1203 rlang::is_list(list_comb2)
   1204 
   1205 # is list of 4 elements
   1206 rlang::is_list(x = list_comb2, n = 4)
   1207 
   1208 # is a vector (of a special type)
   1209 # remember the family tree?
   1210 rlang::is_vector(list_comb2)
   1211 ```
   1212 
   1213 ## Coercion
   1214 
   1215 Use `as.list()`
   1216 
   1217 ```{r list_coercion}
   1218 list(1:3)
   1219 as.list(1:3)
   1220 ```
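
The difference between the two calls is easy to miss in printed output. A small sketch (the variable names are illustrative) that compares their lengths:

```{r}
wrapped  <- list(1:3)    # one element that holds the whole integer vector
split_up <- as.list(1:3) # three elements, one per value

length(wrapped)
length(split_up)

# unlist() flattens either form back to the same atomic vector
identical(unlist(wrapped), unlist(split_up))
```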
   1221 
   1222 ## Matrices and arrays
   1223 
   1224 Although not often used, the dimension attribute can be added to create **list-matrices** or **list-arrays**.
   1225 
   1226 ```{r list_matrices_arrays}
   1227 l <- list(1:3, "a", TRUE, 1.0)
   1228 dim(l) <- c(2, 2); l
   1229 
   1230 l[[1, 1]]
   1231 ```
   1232 
   1233 
   1234 ## Exercises 1/3
   1235 
   1236 1. List all the ways that a list differs from an atomic vector.
   1237 
   1238 <details><summary>Answer(s)</summary>
   1239 
   1240 * Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types) as described in the introduction of the vectors chapter.
   1241 * Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the vectors and the names and values chapters.)
   1242 
   1243 ```{r, eval = FALSE}
   1244 lobstr::ref(1:2)
   1245 #> [1:0x7fcd936f6e80] <int>
   1246 lobstr::ref(list(1:2, 2))
   1247 #> █ [1:0x7fcd93d53048] <list> 
   1248 #> ├─[2:0x7fcd91377e40] <int> 
   1249 #> └─[3:0x7fcd93b41eb0] <dbl>
   1250 ```
   1251 
   1252 
* Subsetting with out-of-bounds and `NA` values leads to different output. For example, `[` returns `NA` for atomics and `NULL` for lists. (This is described in more detail within the subsetting chapter.)
   1254 
   1255 ```{r, eval = FALSE}
   1256 # Subsetting atomic vectors
   1257 (1:2)[3]
   1258 #> [1] NA
   1259 (1:2)[NA]
   1260 #> [1] NA NA
   1261 
   1262 # Subsetting lists
   1263 as.list(1:2)[3]
   1264 #> [[1]]
   1265 #> NULL
   1266 as.list(1:2)[NA]
   1267 #> [[1]]
   1268 #> NULL
   1269 #> 
   1270 #> [[2]]
   1271 #> NULL
   1272 ```
   1273 
   1274 
   1275 </details>
   1276 
   1277 ## Exercises 2/3
   1278 
   1279 2. Why do you need to use `unlist()` to convert a list to an atomic vector? Why doesn’t `as.vector()` work?
   1280 
   1281 <details><summary>Answer(s)</summary>
A list is already a vector, though not an atomic one! Note that `as.vector()` and `is.vector()` use different definitions of “vector”!
   1283 
   1284 ```{r, eval = FALSE}
   1285 is.vector(as.vector(mtcars))
   1286 #> [1] FALSE
   1287 ```
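
A short sketch of the difference, assuming a small example list (`num_list` is a made-up name): `as.vector()` hands the list back unchanged because a list already counts as a vector, whereas `unlist()` flattens it into an atomic vector.

```{r}
num_list <- list(1, 2, 3)

# as.vector() leaves the list as it is
is.list(as.vector(num_list))

# unlist() returns an atomic (double) vector
flat <- unlist(num_list)
is.atomic(flat)
typeof(flat)
```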
   1288 
   1289 </details>
   1290 
   1291 ## Exercises 3/3
   1292 
   1293 3. Compare and contrast `c()` and `unlist()` when combining a date and date-time into a single vector.
   1294 
   1295 <details><summary>Answer(s)</summary>
Date and date-time objects are both built upon doubles. Dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), while date-time objects (`POSIXct`) store the number of seconds since that date.
   1297 
   1298 ```{r, eval = FALSE}
   1299 date    <- as.Date("1970-01-02")
   1300 dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
   1301 
   1302 # Internal representations
   1303 unclass(date)
   1304 #> [1] 1
   1305 unclass(dttm_ct)
   1306 #> [1] 3600
   1307 #> attr(,"tzone")
   1308 #> [1] "UTC"
   1309 ```
   1310 
As the `c()` generic only dispatches on its first argument, combining date and date-time objects via `c()` could lead to surprising results in older R versions (before R 4.0.0):
   1312 
   1313 ```{r, eval = FALSE}
   1314 # Output in R version 3.6.2
   1315 c(date, dttm_ct)  # equal to c.Date(date, dttm_ct) 
   1316 #> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date)  # equal to c.POSIXct(dttm_ct, date)
   1318 #> [1] "1970-01-01 02:00:00 CET" "1970-01-01 01:00:01 CET"
   1319 ```
   1320 
In the first statement above `c.Date()` is executed, which incorrectly treats the underlying double of `dttm_ct` (3600) as days instead of seconds. Conversely, when `c.POSIXct()` is called on a date, one day is counted as one second only.
   1322 
   1323 We can highlight these mechanics by the following code:
   1324 
   1325 ```{r, eval = FALSE}
   1326 # Output in R version 3.6.2
   1327 unclass(c(date, dttm_ct))  # internal representation
   1328 #> [1] 1 3600
   1329 date + 3599
#> [1] "1979-11-10"
   1331 ```
   1332 
As of R 4.0.0 these issues have been resolved and both methods now convert their input first into `POSIXct` and `Date`, respectively.
   1334 
   1335 ```{r, eval = FALSE}
   1336 c(dttm_ct, date)
   1337 #> [1] "1970-01-01 01:00:00 UTC" "1970-01-02 00:00:00 UTC"
   1338 unclass(c(dttm_ct, date))
   1339 #> [1]  3600 86400
   1340 
   1341 c(date, dttm_ct)
   1342 #> [1] "1970-01-02" "1970-01-01"
   1343 unclass(c(date, dttm_ct))
   1344 #> [1] 1 0
   1345 ```
   1346 
However, as `c()` strips the time zone (and other attributes) of `POSIXct` objects, some caution is still recommended.
   1348 
   1349 ```{r, eval = FALSE}
   1350 (dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "HST"))
   1351 #> [1] "1970-01-01 01:00:00 HST"
   1352 attributes(c(dttm_ct))
   1353 #> $class
   1354 #> [1] "POSIXct" "POSIXt"
   1355 ```
   1356 
A package that deals with these kinds of problems in more depth and provides a structural solution for them is the {vctrs} package, which is also used throughout the tidyverse.
   1358 
Let’s look at `unlist()`, which operates on list input.
   1360 
   1361 ```{r, eval = FALSE}
   1362 # Attributes are stripped
   1363 unlist(list(date, dttm_ct))  
   1364 #> [1]     1 39600
   1365 ```
   1366 
We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with when `unlist()` strips the attributes of the list.

To summarise: `c()` coerces types and strips time zones. Errors may have occurred in older R versions because of inappropriate method dispatch or immature methods. `unlist()` strips attributes.
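
One possible workaround, sketched here with made-up names: when every element of a list has the same class, `do.call(c, ...)` passes the elements to `c()`, which dispatches on the first one and keeps the class, whereas `unlist()` returns bare doubles.

```{r}
date_list <- list(as.Date("1970-01-02"), as.Date("1970-01-03"))

# unlist() drops the class and leaves the underlying day counts
class(unlist(date_list))

# do.call() hands the elements to c(), which dispatches to c.Date()
kept <- do.call(c, date_list)
class(kept)
kept
```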
   1370 </details>
   1371 
   1372 
   1373 ## Data frames and tibbles
   1374 
   1375 ![](images/vectors/summary-tree-s3-2.png) 
   1376 
   1377 Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
   1378 
   1379 ## Data frame
   1380 
   1381 A data frame is a:
   1382 
-   Named list of vectors (the names are the column names)
-   Attributes:
    -   (column) `names`
    -   `row.names`
    -   `class`: `"data.frame"`
   1388 
   1389 ## Data frame, examples 1/2:
   1390 
   1391 ```{r data_frame}
   1392 # Construct
   1393 df <- data.frame(
   1394   col1 = c(1, 2, 3),              # named atomic vector
   1395   col2 = c("un", "deux", "trois") # another named atomic vector
  # ,stringsAsFactors = FALSE # the default since R 4.0.0
   1397 )
   1398 # Inspect
   1399 df
   1400 
   1401 # Deconstruct
   1402 # - type
   1403 typeof(df)
   1404 # - attributes
   1405 attributes(df)
   1406 ```
   1407 
   1408 
   1409 ## Data frame, examples 2/2:
   1410 
   1411 ```{r df_functions}
   1412 rownames(df)
   1413 colnames(df)
   1414 names(df) # Same as colnames(df)
   1415 
   1416 nrow(df) 
   1417 ncol(df)
   1418 length(df) # Same as ncol(df)
   1419 ```
   1420 
   1421 Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame).
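
A quick sketch of this constraint (the object and column names are illustrative): a plain `list()` accepts elements of different lengths, while `data.frame()` raises an error for the same input. `tryCatch()` is used only so that the error message is returned instead of stopping the chunk.

```{r}
# a plain list accepts elements of different lengths
ragged <- list(a = 1:3, b = 1:2)
lengths(ragged)

# data.frame() raises an error for the same input
ragged_msg <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
ragged_msg
```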
   1422 
   1423 ## Tibble
   1424 
   1425 Created to relieve some of the frustrations and pain points created by data frames, tibbles are data frames that are:
   1426 
   1427 -   Lazy (do less)
   1428 -   Surly (complain more)
   1429 
   1430 ## Lazy
   1431 
   1432 Tibbles do not:
   1433 
   1434 -   Coerce strings
   1435 -   Transform non-syntactic names
   1436 -   Recycle vectors of length greater than 1
   1437 
   1438 ## ! Coerce strings
   1439 
   1440 ```{r tbl_no_coerce}
   1441 chr_col <- c("don't", "factor", "me", "bro")
   1442 
   1443 # data frame
   1444 df <- data.frame(
   1445   a = chr_col,
  # before R 4.0.0, this was the default
   1447   stringsAsFactors = TRUE
   1448 )
   1449 
   1450 # tibble
   1451 tbl <- tibble::tibble(
   1452   a = chr_col
   1453 )
   1454 
   1455 # contrast the structure
   1456 str(df$a)
   1457 str(tbl$a)
   1458 
   1459 ```
   1460 
   1461 ## ! Transform non-syntactic names
   1462 
   1463 ```{r tbl_col_name}
   1464 # data frame
   1465 df <- data.frame(
   1466   `1` = c(1, 2, 3)
   1467 )
   1468 
   1469 # tibble
   1470 tbl <- tibble::tibble(
   1471   `1` = c(1, 2, 3)
   1472 )
   1473 
   1474 # contrast the names
   1475 names(df)
   1476 names(tbl)
   1477 ```
   1478 
   1479 ## ! Recycle vectors of length greater than 1
   1480 
   1481 ```{r tbl_recycle, error=TRUE}
   1482 # data frame
   1483 df <- data.frame(
   1484   col1 = c(1, 2, 3, 4),
   1485   col2 = c(1, 2)
   1486 )
   1487 
   1488 # tibble
   1489 tbl <- tibble::tibble(
   1490   col1 = c(1, 2, 3, 4),
   1491   col2 = c(1, 2)
   1492 )
   1493 ```
   1494 
   1495 ## Surly
   1496 
   1497 Tibbles do only what they're asked and complain if what they're asked doesn't make sense:
   1498 
   1499 -   Subsetting always yields a tibble
   1500 -   Complains if cannot find column
   1501 
   1502 ## Subsetting always yields a tibble
   1503 
   1504 ```{r tbl_subset}
   1505 # data frame
   1506 df <- data.frame(
   1507   col1 = c(1, 2, 3, 4)
   1508 )
   1509 
   1510 # tibble
   1511 tbl <- tibble::tibble(
   1512   col1 = c(1, 2, 3, 4)
   1513 )
   1514 
   1515 # contrast
   1516 df_col <- df[, "col1"]
   1517 str(df_col)
   1518 tbl_col <- tbl[, "col1"]
   1519 str(tbl_col)
   1520 
   1521 # to select a vector, do one of these instead
   1522 tbl_col_1 <- tbl[["col1"]]
   1523 str(tbl_col_1)
   1524 tbl_col_2 <- dplyr::pull(tbl, col1)
   1525 str(tbl_col_2)
   1526 ```
   1527 
   1528 ## Complains if cannot find column
   1529 
   1530 ```{r tbl_col_match, warning=TRUE}
   1531 names(df)
   1532 df$col
   1533 
   1534 names(tbl)
   1535 tbl$col
   1536 ```
   1537 
   1538 ## One more difference
   1539 
   1540 **`tibble()` allows you to refer to variables created during construction**
   1541 
   1542 ```{r df_tibble_diff}
   1543 tibble::tibble(
   1544   x = 1:3,
  y = x * 2 # x refers to the column defined on the line above
   1546 )
   1547 ```
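
For contrast, a sketch of the same pattern with `data.frame()`, which looks up its arguments outside the call and so cannot see a column defined in the same call. The names `side` and `area` are made up and are assumed not to exist elsewhere in the session.

```{r}
# tibble(): later columns can use earlier ones
tb_area <- tibble::tibble(side = 1:3, area = side^2)
tb_area

# data.frame(): `side` is looked up outside the call and is not found
df_msg <- tryCatch(
  data.frame(side = 1:3, area = side^2),
  error = function(e) conditionMessage(e)
)
df_msg
```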
   1548 
   1549 <details>
   1550 <summary>Side Quest: Row Names</summary>
   1551 
   1552 - character vector containing only unique values
   1553 - get and set with `rownames()`
   1554 - can use them to subset rows
   1555 
   1556 ```{r row_names}
   1557 df3 <- data.frame(
   1558   age = c(35, 27, 18),
   1559   hair = c("blond", "brown", "black"),
   1560   row.names = c("Bob", "Susan", "Sam")
   1561 )
   1562 df3
   1563 
   1564 rownames(df3)
   1565 df3["Bob", ]
   1566 
   1567 rownames(df3) <- c("Susan", "Bob", "Sam")
   1568 rownames(df3)
   1569 df3["Bob", ]
   1570 ```
   1571 
   1572 There are three reasons why row names are undesirable:
   1573 
1. Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea.
   1575 2. Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
   1576 3. Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.
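
If row names carry information worth keeping, one option is to move them into an ordinary column. A minimal sketch using `tibble::rownames_to_column()` and the `rownames` argument of `tibble::as_tibble()`; the data frame `people` and the column name `name` are made up for this example.

```{r}
people <- data.frame(
  age = c(35, 27, 18),
  row.names = c("Bob", "Susan", "Sam")
)

# keep a data frame, with the row names moved into a first column
people_col <- tibble::rownames_to_column(people, var = "name")
people_col

# or convert to a tibble in one step
people_tbl <- tibble::as_tibble(people, rownames = "name")
people_tbl
```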
   1577 
   1578 </details>
   1579 
   1580 
## Tibbles: Printing

Data frames and tibbles print differently:
   1584 
   1585 ```{r df_tibble_print}
   1586 df3
   1587 tibble::as_tibble(df3)
   1588 ```
   1589 
   1590 
## Tibbles: Subsetting
   1592 
   1593 Two undesirable subsetting behaviours:
   1594 
   1595 1. When you subset columns with `df[, vars]`, you will get a vector if vars selects one variable, otherwise you’ll get a data frame, unless you always remember to use `df[, vars, drop = FALSE]`.
   1596 2. When you attempt to extract a single column with `df$x` and there is no column `x`, a data frame will instead select any variable that starts with `x`. If no variable starts with `x`, `df$x` will return NULL.
   1597 
Tibbles tweak these behaviours so that `[` always returns a tibble, and `$` doesn’t do partial matching and warns if it can’t find a variable (*this is what makes tibbles surly*).
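
Both behaviours can be reproduced on a small example. A sketch with made-up object and column names; `suppressWarnings()` is used only to keep the tibble's warning out of the output.

```{r}
sub_df  <- data.frame(xyz = 1:3)
sub_tbl <- tibble::tibble(xyz = 1:3)

# `[` with a single column: vector for the data frame, tibble for the tibble
class(sub_df[, "xyz"])
class(sub_df[, "xyz", drop = FALSE])
class(sub_tbl[, "xyz"])

# `$` with a partial name: the data frame matches `xyz`, the tibble does not
sub_df$x
suppressWarnings(sub_tbl$x)
```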
   1599 
## Tibbles: Testing

Whether data frame: `is.data.frame()`. Note: both data frames and tibbles are data frames.

Whether tibble: `tibble::is_tibble()`. Note: only tibbles are tibbles. Vanilla data frames are not.
   1605 
## Tibbles: Coercion
   1607 
   1608 -   To data frame: `as.data.frame()`
   1609 -   To tibble: `tibble::as_tibble()`
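
A short sketch that ties the testing and coercion helpers together (the object names are illustrative):

```{r}
plain_df <- data.frame(x = 1:3)
as_tbl   <- tibble::as_tibble(plain_df)
back_df  <- as.data.frame(as_tbl)

# every tibble is a data frame, but not the other way round
is.data.frame(as_tbl)
tibble::is_tibble(as_tbl)
tibble::is_tibble(plain_df)

# coercing back drops the tibble classes
class(back_df)
```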
   1610 
## Tibbles: List Columns

List-columns are allowed in data frames, but you have to do a little extra work by either adding the list-column after creation or wrapping the list in `I()`.
   1614 
   1615 ```{r list_columns}
   1616 df4 <- data.frame(x = 1:3)
   1617 df4$y <- list(1:2, 1:3, 1:4)
   1618 df4
   1619 
   1620 df5 <- data.frame(
   1621   x = 1:3, 
   1622   y = I(list(1:2, 1:3, 1:4))
   1623 )
   1624 df5
   1625 ```
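
With a tibble, the list can be passed directly, with no `I()` wrapper. A minimal sketch (the name `tbl_list` is made up):

```{r}
tbl_list <- tibble::tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4) # list-column, one list element per row
)
tbl_list

# the column is an ordinary list
typeof(tbl_list$y)
lengths(tbl_list$y)
```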
   1626 
   1627 ## Tibbles: Matrix and data frame columns
   1628 
   1629 - As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
- Same as with list-columns, you must either add the column after creation or wrap it in `I()`.
   1631 
   1632 ```{r matrix_df_columns}
   1633 dfm <- data.frame(
   1634   x = 1:3 * 10,
   1635   y = I(matrix(1:9, nrow = 3))
   1636 )
   1637 
   1638 dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)
   1639 
   1640 str(dfm)
   1641 dfm$y
   1642 dfm$z
   1643 ```
   1644 
   1645 
   1646 ## Exercises 1/4
   1647 
   1648 1. Can you have a data frame with zero rows? What about zero columns?
   1649 
   1650 <details><summary>Answer(s)</summary>
   1651 Yes, you can create these data frames easily; either during creation or via subsetting. Even both dimensions can be zero. Create a 0-row, 0-column, or an empty data frame directly:
   1652 
   1653 ```{r, eval = FALSE}
   1654 data.frame(a = integer(), b = logical())
   1655 #> [1] a b
   1656 #> <0 rows> (or 0-length row.names)
   1657 
   1658 data.frame(row.names = 1:3)  # or data.frame()[1:3, ]
   1659 #> data frame with 0 columns and 3 rows
   1660 
   1661 data.frame()
   1662 #> data frame with 0 columns and 0 rows
   1663 ```
   1664 
   1665 Create similar data frames via subsetting the respective dimension with either 0, `NULL`, `FALSE` or a valid 0-length atomic (`logical(0)`, `character(0)`, `integer(0)`, `double(0)`). Negative integer sequences would also work. The following example uses a zero:
   1666 
   1667 ```{r, eval = FALSE}
   1668 mtcars[0, ]
   1669 #>  [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
   1670 #> <0 rows> (or 0-length row.names)
   1671 
   1672 mtcars[ , 0]  # or mtcars[0]
   1673 #> data frame with 0 columns and 32 rows
   1674 
   1675 mtcars[0, 0]
   1676 #> data frame with 0 columns and 0 rows
   1677 ```
   1678 
   1679 
   1680 </details>
   1681 
   1682 ## Exercises 2/4
   1683 
   1684 2. What happens if you attempt to set rownames that are not unique?
   1685 
   1686 <details><summary>Answer(s)</summary>
   1687 Matrices can have duplicated row names, so this does not cause problems.
   1688 
   1689 Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via `row.names()`, you get an error:
   1690 
   1691 ```{r, eval = FALSE}
   1692 data.frame(row.names = c("x", "y", "y"))
   1693 #> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y
   1694 
   1695 df <- data.frame(x = 1:3)
   1696 row.names(df) <- c("x", "y", "y")
   1697 #> Warning: non-unique value when setting 'row.names': 'y'
   1698 #> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
   1699 ```
   1700 
   1701 If you use subsetting, `[` automatically deduplicates:
   1702 
   1703 ```{r, eval = FALSE}
   1704 row.names(df) <- c("x", "y", "z")
   1705 df[c(1, 1, 1), , drop = FALSE]
   1706 #>     x
   1707 #> x   1
   1708 #> x.1 1
   1709 #> x.2 1
   1710 ```
   1711 
   1712 </details>
   1713 
   1714 ## Exercises 3/4
   1715 
   1716 3. If `df` is a data frame, what can you say about `t(df)`, and `t(t(df))`? Perform some experiments, making sure to try different column types.
   1717 
   1718 <details><summary>Answer(s)</summary>
   1719 Both of `t(df)` and `t(t(df))` will return matrices:
   1720 
   1721 ```{r, eval = FALSE}
   1722 df <- data.frame(x = 1:3, y = letters[1:3])
   1723 is.matrix(df)
   1724 #> [1] FALSE
   1725 is.matrix(t(df))
   1726 #> [1] TRUE
   1727 is.matrix(t(t(df)))
   1728 #> [1] TRUE
   1729 ```
   1730 
   1731 The dimensions will respect the typical transposition rules:
   1732 
   1733 ```{r, eval = FALSE}
   1734 dim(df)
   1735 #> [1] 3 2
   1736 dim(t(df))
   1737 #> [1] 2 3
   1738 dim(t(t(df)))
   1739 #> [1] 3 2
   1740 ```
   1741 
   1742 Because the output is a matrix, every column is coerced to the same type. (It is implemented within `t.data.frame()` via `as.matrix()` which is described below).
   1743 
   1744 ```{r, eval = FALSE}
   1745 df
   1746 #>   x y
   1747 #> 1 1 a
   1748 #> 2 2 b
   1749 #> 3 3 c
   1750 t(df)
   1751 #>   [,1] [,2] [,3]
   1752 #> x "1"  "2"  "3" 
   1753 #> y "a"  "b"  "c"
   1754 ```
   1755 
   1756 </details>
   1757 
   1758 ## Exercises 4/4
   1759 
   1760 4. What does `as.matrix()` do when applied to a data frame with columns of different types? How does it differ from `data.matrix()`?
   1761 
   1762 <details><summary>Answer(s)</summary>
The type of the result of `as.matrix()` depends on the types of the input columns (see `?as.matrix`):
   1764 
   1765 > The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
   1766 
On the other hand, `data.matrix()` will always return a numeric matrix (see `?data.matrix`).
   1768 
   1769 > Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. […] Character columns are first converted to factors and then to integers.
   1770 
   1771 We can illustrate and compare the mechanics of these functions using a concrete example. `as.matrix()` makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from `data.matrix()`’s output, we would need a lookup table for each column.
   1772 
   1773 ```{r, eval = FALSE}
   1774 df_coltypes <- data.frame(
   1775   a = c("a", "b"),
   1776   b = c(TRUE, FALSE),
   1777   c = c(1L, 0L),
   1778   d = c(1.5, 2),
   1779   e = factor(c("f1", "f2"))
   1780 )
   1781 
   1782 as.matrix(df_coltypes)
   1783 #>      a   b       c   d     e   
   1784 #> [1,] "a" "TRUE"  "1" "1.5" "f1"
   1785 #> [2,] "b" "FALSE" "0" "2.0" "f2"
   1786 data.matrix(df_coltypes)
   1787 #>      a b c   d e
   1788 #> [1,] 1 1 1 1.5 1
   1789 #> [2,] 2 0 0 2.0 2
   1790 ```
   1791 
   1792 </details>
   1793 
   1794 
   1795 ## `NULL`
   1796 
Special type of object that:

-   Has length 0
-   Cannot have attributes
   1801 
   1802 ```{r null, results = 'hide'}
   1803 typeof(NULL)
   1804 #> [1] "NULL"
   1805 
   1806 length(NULL)
   1807 #> [1] 0
   1808 ```
   1809 
   1810 ```{r null_attr, error=TRUE}
   1811 x <- NULL
   1812 attr(x, "y") <- 1
   1813 ```
   1814 
   1815 ```{r null_check}
   1816 is.null(NULL)
   1817 ```
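
Two common uses of `NULL` are to stand for an empty vector of arbitrary type and to stand for an absent value, for example as a function default. A small sketch; the function `describe()` and its argument `label` are hypothetical.

```{r}
# c() with no arguments returns NULL, and NULL vanishes when combined
is.null(c())
c(NULL, 1:2, NULL)

# NULL as a default that means "not supplied"
describe <- function(label = NULL) {
  if (is.null(label)) "no label" else paste("label:", label)
}
describe()
describe("penguin")
```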
   1818 
   1819 
   1820 ## Digestif
   1821 
Let's use some of this chapter's skills on the `penguins` data.
   1823 
   1824 ## Attributes
   1825 
   1826 ```{r}
   1827 str(penguins_raw)
   1828 ```
   1829 
   1830 ```{r}
   1831 str(penguins_raw, give.attr = FALSE)
   1832 ```
   1833 
   1834 ## Data Frames vs Tibbles
   1835 
   1836 ```{r}
   1837 penguins_df <- data.frame(penguins)
   1838 penguins_tb <- penguins #i.e. penguins was already a tibble
   1839 ```
   1840 
   1841 ## Printing
   1842 
   1843 * Tip: print out these results in RStudio under different editor themes
   1844 
   1845 ```{r, eval = FALSE}
   1846 print(penguins_df) #don't run this
   1847 ```
   1848 
   1849 ```{r}
   1850 head(penguins_df)
   1851 ```
   1852 
   1853 ```{r}
   1854 penguins_tb
   1855 ```
   1856 
   1857 ## Atomic Vectors
   1858 
   1859 ```{r}
   1860 species_vector_df <- penguins_df |> select(species)
   1861 species_unlist_df <- penguins_df |> select(species) |> unlist()
   1862 species_pull_df   <- penguins_df |> select(species) |> pull()
   1863 
   1864 species_vector_tb <- penguins_tb |> select(species)
   1865 species_unlist_tb <- penguins_tb |> select(species) |> unlist()
   1866 species_pull_tb   <- penguins_tb |> select(species) |> pull()
   1867 ```
   1868 
   1869 <details>
   1870 <summary>`typeof()` and `class()`</summary>
   1871 ```{r}
   1872 typeof(species_vector_df)
   1873 class(species_vector_df)
   1874 
   1875 typeof(species_unlist_df)
   1876 class(species_unlist_df)
   1877 
   1878 typeof(species_pull_df)
   1879 class(species_pull_df)
   1880 
   1881 typeof(species_vector_tb)
   1882 class(species_vector_tb)
   1883 
   1884 typeof(species_unlist_tb)
   1885 class(species_unlist_tb)
   1886 
   1887 typeof(species_pull_tb)
   1888 class(species_pull_tb)
   1889 ```
   1890 
   1891 </details>
   1892 
   1893 ## Column Names
   1894 
   1895 ```{r}
   1896 colnames(penguins_tb)
   1897 ```
   1898 
   1899 ```{r}
   1900 names(penguins_tb) == colnames(penguins_tb)
   1901 ```
   1902 
   1903 ```{r}
   1904 names(penguins_df) == names(penguins_tb)
   1905 ```
   1906 
   1907 ## What if we only invoke a partial name of a column of a tibble?
   1908 
   1909 ```{r, error = TRUE}
   1910 penguins_tb$y 
   1911 ```
   1912 
   1913 ![tibbles are surly!](images/vectors/surly_tibbles.png)
   1914 
   1915 * What if we only invoke a partial name of a column of a data frame?
   1916 
   1917 ```{r}
   1918 head(penguins_df$y) #instead of `year`
   1919 ```
   1920 
   1921 * Is this evaluation in alphabetical order or column order?
   1922 
   1923 ```{r}
   1924 penguins_df_se_sp <- penguins_df |> select(sex, species)
   1925 penguins_df_sp_se <- penguins_df |> select(species, sex)
   1926 ```
   1927 
   1928 ```{r}
   1929 head(penguins_df_se_sp$s)
   1930 ```
   1931 
   1932 ```{r}
   1933 head(penguins_df_sp_se$s)
   1934 ```
   1935 
   1936 
   1937 ## Chapter Quiz 1/5
   1938 
   1939 1. What are the four common types of atomic vectors? What are the two rare types?
   1940 
   1941 <details><summary>Answer(s)</summary>
   1942 The four common types of atomic vector are logical, integer, double and character. The two rarer types are complex and raw.
   1943 </details>
   1944 
   1945 ## Chapter Quiz 2/5
   1946 
   1947 2. What are attributes? How do you get them and set them?
   1948 
   1949 <details><summary>Answer(s)</summary>
   1950 Attributes allow you to associate arbitrary additional metadata to any object. You can get and set individual attributes with `attr(x, "y")` and `attr(x, "y") <- value`; or you can get and set all attributes at once with `attributes()`.
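
A minimal sketch of both approaches (the attribute names `note` and `unit` are arbitrary):

```{r}
# get and set a single attribute
v <- 1:3
attr(v, "note") <- "example"
attr(v, "note")

# get and set all attributes at once
w <- 1:3
attributes(w) <- list(note = "example", unit = "cm")
attributes(w)
```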
   1951 </details>
   1952 
   1953 ## Chapter Quiz 3/5
   1954 
   1955 3. How is a list different from an atomic vector? How is a matrix different from a data frame?
   1956 
   1957 <details><summary>Answer(s)</summary>
   1958 The elements of a list can be any type (even a list); the elements of an atomic vector are all of the same type. Similarly, every element of a matrix must be the same type; in a data frame, different columns can have different types.
   1959 </details>
   1960 
   1961 ## Chapter Quiz 4/5
   1962 
   1963 4. Can you have a list that is a matrix? Can a data frame have a column that is a matrix?
   1964 
   1965 <details><summary>Answer(s)</summary>
   1966 You can make a list-array by assigning dimensions to a list. You can make a matrix a column of a data frame with `df$x <- matrix()`, or by using `I()` when creating a new data frame `data.frame(x = I(matrix()))`.
   1967 </details>
   1968 
   1969 ## Chapter Quiz 5/5
   1970 
   1971 5. How do tibbles behave differently from data frames?
   1972 
   1973 <details><summary>Answer(s)</summary>
   1974 Tibbles have an enhanced print method, never coerce strings to factors, and provide stricter subsetting methods.
   1975 </details>