03.qmd (45023B)
---
engine: knitr
title: Vectors
---

## Learning objectives:

- Learn about different types of vectors and their attributes
- Navigate through vector types and their value types
- Venture into factors and date-time objects
- Discuss the differences between data frames and tibbles
- Do not get absorbed by the `NA` and `NULL` black hole


## Session Info

```{r ch9_setup, message = FALSE, warning = FALSE}
library("dplyr")
library("gt")
library("palmerpenguins")
```


<details>
<summary>Session Info</summary>
```{r}
utils::sessionInfo()
```
</details>

## Aperitif

![Palmer Penguins](images/vectors/lter_penguins.png)

## Counting Penguins

Consider this code to count the number of Gentoo penguins in the `penguins` data set. We see that there are 124 Gentoo penguins.

```{r, eval = FALSE}
sum("Gentoo" == penguins$species)
# output: 124
```

## In

One subtle error can arise in trying out `%in%` here instead.

```{r, results = 'hide'}
species_vector <- penguins |> select(species)
print("Gentoo" %in% species_vector)
# output: FALSE
```

![Where did the penguins go?](images/vectors/lter_penguins_no_gentoo.png)

## Fix: base R

```{r, results = 'hide'}
species_unlist <- penguins |> select(species) |> unlist()
print("Gentoo" %in% species_unlist)
# output: TRUE
```

## Fix: dplyr

```{r, results = 'hide'}
species_pull <- penguins |> select(species) |> pull()
print("Gentoo" %in% species_pull)
# output: TRUE
```

## Motivation

* What are the different types of vectors?
* How does this affect accessing vectors?

<details>
<summary>Side Quest: Looking up the `%in%` operator</summary>
If you want to look up the manual pages for the `%in%` operator with the `?`, use backticks:

```{r, eval = FALSE}
?`%in%`
```

and we find that `%in%` is a wrapper for the `match()` function.
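
A minimal sketch of that relationship: the help page defines `%in%` essentially as `match(x, table, nomatch = 0L) > 0L`. The vector `penguin_species` below is a made-up stand-in for `penguins$species`, not the real data.

```{r, eval = FALSE}
# hypothetical stand-in for penguins$species
penguin_species <- c("Adelie", "Gentoo", "Chinstrap", "Gentoo")

# match() returns the position of the first match
match("Gentoo", penguin_species)
#> [1] 2

# %in% only asks whether such a position exists
match("Gentoo", penguin_species, nomatch = 0L) > 0L
#> [1] TRUE
"Gentoo" %in% penguin_species
#> [1] TRUE
```

Both calls expect an atomic vector as the lookup table, which is why the type of the object on the right-hand side matters.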

</details>


## Types of Vectors

![](images/vectors/summary-tree.png)

Two main types:

- **Atomic**: Elements are all of the same type.
- **List**: Elements can be of different types.

Closely related but not technically a vector:

- **NULL**: Not a vector itself, but often serves as a generic zero-length vector.


## Types of Atomic Vectors (1/2)

![](images/vectors/summary-tree-atomic.png){width=50%}

## Types of Atomic Vectors (2/2)

- **Logical**: `TRUE`/`FALSE`
- **Integer**: Numeric (discrete, no decimals)
- **Double**: Numeric (continuous, decimals)
- **Character**: String

## Vectors of Length One

**Scalars** are vectors that consist of a single value.

## Logicals

```{r vec_lgl}
lgl1 <- TRUE
lgl2 <- T # abbreviation for TRUE
lgl3 <- FALSE
lgl4 <- F # abbreviation for FALSE
```

## Doubles

```{r vec_dbl}
# integer, decimal, scientific, or hexadecimal format
dbl1 <- 1
dbl2 <- 1.234 # decimal
dbl3 <- 1.234e0 # scientific format
dbl4 <- 0xcafe # hexadecimal format
```

## Integers

Integer literals must be followed by `L` and cannot have fractional values.

```{r vec_int}
int1 <- 1L
int2 <- 1234L
int3 <- 1234e0L
int4 <- 0xcafeL
```

<details>
<summary>Pop Quiz: Why "L" for integers?</summary>
Wickham notes that the use of `L` dates back to the **C** programming language and its "long int" type for memory allocation.
</details>

## Strings

Strings can use single or double quotes, and special characters are escaped with `\`.

```{r vec_str}
str1 <- "hello" # double quotes
str2 <- 'hello' # single quotes
str3 <- "مرحبًا" # Unicode
str4 <- "\U0001f605" # sweaty_smile 😅
```

## Longer 1/2

There are several ways to make longer vectors:

**1. With single values** inside `c()`, short for combine.

```{r long_single}
lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")
```

![](images/vectors/atomic.png)

## Longer 2/2

**2. With other vectors**

```{r long_vec}
c(c(1, 2), c(3, 4)) # output is not nested
```

## Type and Length

We can determine the type of a vector with `typeof()` and its length with `length()`.

```{r type_length, echo = FALSE}
# typeof(lgl_var)
# typeof(int_var)
# typeof(dbl_var)
# typeof(chr_var)
#
# length(lgl_var)
# length(int_var)
# length(dbl_var)
# length(chr_var)

var_names <- c("lgl_var", "int_var", "dbl_var", "chr_var")
var_values <- c("TRUE, FALSE", "1L, 6L, 10L", "1, 2.5, 4.5", "'these are', 'some strings'")
var_type <- c("logical", "integer", "double", "character")
var_length <- c(2, 3, 3, 2)

type_length_df <- data.frame(var_names, var_values, var_type, var_length)

# make gt table
type_length_df |>
  gt() |>
  cols_align(align = "center") |>
  cols_label(
    var_names ~ "name",
    var_values ~ "value",
    var_type ~ "typeof()",
    var_length ~ "length()"
  ) |>
  tab_header(
    title = "Types of Atomic Vectors",
    subtitle = ""
  ) |>
  tab_footnote(
    footnote = "Source: https://adv-r.hadley.nz/index.html",
    locations = cells_title(groups = "title")
  ) |>
  tab_style(
    style = list(cell_fill(color = "#F9E3D6")),
    locations = cells_body(columns = var_type)
  ) |>
  tab_style(
    style = list(cell_fill(color = "lightcyan")),
    locations = cells_body(columns = var_length)
  )
```

## Side Quest: Penguins

<details>
```{r}
typeof(penguins$species)
class(penguins$species)

typeof(species_unlist)
class(species_unlist)

typeof(species_pull)
class(species_pull)
```

</details>

## Missing values: Contagion

For most computations, an operation over values that includes a missing value yields a missing value (unless you're careful).

```{r na_contagion}
# contagion
5 * NA
sum(c(1, 2, NA, 3))
```

## Missing values: Contagion Exceptions

```{r na_exceptions, eval = FALSE}
NA ^ 0
#> [1] 1
NA | TRUE
#> [1] TRUE
NA & FALSE
#> [1] FALSE
```


#### Inoculation

```{r na_innoculation, eval = FALSE}
sum(c(1, 2, NA, 3), na.rm = TRUE)
# output: 6
```

To search for missing values use `is.na()`. Comparing with `== NA` does not work, because any comparison with `NA` is itself `NA`:

```{r na_search, eval = FALSE}
x <- c(NA, 5, NA, 10)
x == NA
# output: NA NA NA NA [BATMAN!]
```

```{r na_search_better, eval = FALSE}
is.na(x)
# output: TRUE FALSE TRUE FALSE
```

## Missing Values: NA Types

<details>
Each type has its own NA type:

- Logical: `NA`
- Integer: `NA_integer_`
- Double: `NA_real_`
- Character: `NA_character_`

This may not matter in many contexts.

It can matter for operations that are strict about types, such as `dplyr::if_else()`.
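
A small sketch of the typed missing values, using only base R. Note that the constants are spelled with a trailing underscore, and that the double version is `NA_real_`.

```{r, eval = FALSE}
typeof(NA)
#> [1] "logical"
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

# inside c(), a plain NA is coerced to the type of its neighbours
typeof(c(NA, 1L))
#> [1] "integer"
typeof(c(NA, "a"))
#> [1] "character"
```

Because a plain `NA` adapts to its neighbours in `c()`, the typed constants are mostly needed by functions that check types before combining.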
</details>


## Testing (1/2)

**What type of vector `is.*`() it?**

Test data type:

- Logical: `is.logical()`
- Integer: `is.integer()`
- Double: `is.double()`
- Character: `is.character()`


## Testing (2/2)

**What type of object is it?**

Don't test objects with these tools:

- `is.vector()`
- `is.atomic()`
- `is.numeric()`

They don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do (preview: *attributes*).

## Side Quest: rlang `is_*()`

<details>
<summary>Maybe use `{rlang}`?</summary>

- `rlang::is_vector`
- `rlang::is_atomic`

```{r test_rlang}
# vector
rlang::is_vector(c(1, 2))
rlang::is_vector(list(1, 2))

# atomic
rlang::is_atomic(c(1, 2))
rlang::is_atomic(list(1, "a"))

```

See more [here](https://rlang.r-lib.org/reference/type-predicates.html)
</details>


## Coercion

* R follows rules for coercion: character → double → integer → logical

* R can coerce either automatically or explicitly

#### **Automatic**

Two contexts for automatic coercion:

1. Combination
2. Mathematical



## Coercion by Combination:

```{r coerce_c}
str(c(TRUE, "TRUE"))
```

## Coercion by Mathematical operations:

```{r coerce_math}
# imagine a logical vector about whether an attribute is present
has_attribute <- c(TRUE, FALSE, TRUE, TRUE)

# number with attribute
sum(has_attribute)
```

## **Explicit**

<!--

Use `as.*()`

- Logical: `as.logical()`
- Integer: `as.integer()`
- Double: `as.double()`
- Character: `as.character()`

-->

```{r explicit_coercion, echo = FALSE}
# dbl_var
# as.integer(dbl_var)
# lgl_var
# as.character(lgl_var)

var_names <- c("lgl_var", "int_var", "dbl_var", "chr_var")
var_values <- c("TRUE, FALSE", "1L, 6L, 10L", "1, 2.5, 4.5", "'these are', 'some strings'")
as_logical <- c("TRUE FALSE", "TRUE TRUE TRUE", "TRUE TRUE TRUE", "NA NA")
as_integer <- c("1 0", "1 6 10", "1 2 4", "NA_integer_ NA_integer_")
as_double <- c("1 0", "1 6 10", "1.0 2.5 4.5", "NA_real_ NA_real_")
as_character <- c("'TRUE' 'FALSE'", "'1' '6' '10'", "'1' '2.5' '4.5'", "'these are', 'some strings'")

coercion_df <- data.frame(var_names, var_values, as_logical, as_integer, as_double, as_character)

coercion_df |>
  gt() |>
  cols_align(align = "center") |>
  cols_label(
    var_names ~ "name",
    var_values ~ "value",
    as_logical ~ "as.logical()",
    as_integer ~ "as.integer()",
    as_double ~ "as.double()",
    as_character ~ "as.character()"
  ) |>
  tab_header(
    title = "Coercion of Atomic Vectors",
    subtitle = ""
  ) |>
  tab_footnote(
    footnote = "Source: https://adv-r.hadley.nz/index.html",
    locations = cells_title(groups = "title")
  ) |>
  tab_style(
    style = list(cell_fill(color = "#F9E3D6")),
    locations = cells_body(columns = c(as_logical, as_double))
  ) |>
  tab_style(
    style = list(cell_fill(color = "lightcyan")),
    locations =
      cells_body(columns = c(as_integer, as_character))
  )
```

But note that coercion may fail in one of two ways, or both:

- With warning/error
- NAs

```{r coerce_error}
as.integer(c(1, 2, "three"))
```

## Exercises 1/5

1. How do you create raw and complex scalars?

<details><summary>Answer(s)</summary>
```{r, eval = FALSE}
as.raw(42)
#> [1] 2a
charToRaw("A")
#> [1] 41
complex(length.out = 1, real = 1, imaginary = 1)
#> [1] 1+1i
```
</details>

## Exercises 2/5

2. Test your knowledge of the vector coercion rules by predicting the output of the following uses of `c()`:

```{r, eval = FALSE}
c(1, FALSE)
c("a", 1)
c(TRUE, 1L)
```

<details><summary>Answer(s)</summary>
```{r, eval = FALSE}
c(1, FALSE) # will be coerced to double -> 1 0
c("a", 1) # will be coerced to character -> "a" "1"
c(TRUE, 1L) # will be coerced to integer -> 1 1
```
</details>

## Exercises 3/5

3. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?

<details><summary>Answer(s)</summary>
These comparisons are carried out by operator functions (`==`, `<`), which coerce their arguments to a common type. In the examples above, these types will be character, double and character: `1` will be coerced to `"1"`, `FALSE` is represented as `0`, and `2` turns into `"2"` (numbers precede letters in lexicographic order, although this may depend on locale).

</details>

## Exercises 4/5

4. Why is the default missing value, `NA`, a logical vector? What’s special about logical vectors?

<details><summary>Answer(s)</summary>
The presence of missing values shouldn’t affect the type of an object. Recall that there is a type-hierarchy for coercion from character → double → integer → logical.
When combining `NA`s with other atomic types, the `NA`s will be coerced to integer (`NA_integer_`), double (`NA_real_`) or character (`NA_character_`) and not the other way round. If `NA` were a character and added to a set of other values all of these would be coerced to character as well.
</details>

## Exercises 5/5

5. Precisely what do `is.atomic()`, `is.numeric()`, and `is.vector()` test for?

<details><summary>Answer(s)</summary>

* `is.atomic()` tests if an object is an atomic vector or is `NULL` (!). Atomic vectors are objects of type logical, integer, double, complex, character or raw.
* `is.numeric()` tests if an object has type integer or double and is not of class `factor`, `Date`, `POSIXt` or `difftime`.
* `is.vector()` tests if an object is a vector or an expression and has no attributes, apart from names. Vectors are atomic vectors or lists.

</details>


## Attributes

Attributes are name-value pairs that attach metadata to an object (vector).

* **Name-value pairs**: attributes have a name and a value
* **Metadata**: not data itself, but data about the data

## Getting and Setting

Three functions:

1. retrieve and modify single attributes with `attr()`
2. retrieve en masse with `attributes()`
3. set en masse with `structure()`

## Single attribute

Use `attr()`

```{r attr_single}
# some object
a <- c(1, 2, 3)

# set attribute
attr(x = a, which = "attribute_name") <- "some attribute"

# get attribute
attr(a, "attribute_name")
```

## Multiple attributes

`structure()`: set multiple attributes, `attributes()`: get multiple attributes

:::: columns
::: column
```{r attr_multiple}
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")

attr(a, "y") <- 4:6
str(attributes(a))

b <- structure(
  1:3,
  x = "abcdef",
  y = 4:6
)
identical(a, b)
```
:::

::: column
![](images/vectors/attr.png)
:::
::::


## Why

Three particularly important attributes:

1. **names** - a character vector giving each element a name
2. **dimension** - (or dim) turns vectors into matrices and arrays
3. **class** - powers the S3 object system (we'll learn more about this in chapter 13)

Most attributes are lost by most operations. Only two attributes are routinely preserved: names and dimension.
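
A quick sketch of that behaviour. The object `v` and its `note` attribute are invented for illustration.

```{r, eval = FALSE}
# a named vector with one extra, made-up attribute
v <- structure(1:3, names = c("a", "b", "c"), note = "made-up attribute")

# subsetting keeps names but drops `note`
attributes(v[1:2])
#> $names
#> [1] "a" "b"

# summarising drops everything
attributes(sum(v))
#> NULL
```

This is why anything that must survive ordinary operations is usually wrapped in an S3 class rather than stored as a loose attribute.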

## Names

~~Three~~ Four ways to name:

:::: columns

::: {.column width="50%"}
```{r names}
# (1) On creation:
x <- c(A = 1, B = 2, C = 3)
x

# (2) Assign to names():
y <- 1:3
names(y) <- c("a", "b", "c")
y

# (3) Inline:
z <- setNames(1:3, c("a", "b", "c"))
z
```
:::

::: {.column width="50%"}
![](images/vectors/attr-names-1.png)
:::

::::

## rlang Names

:::: columns

::: {.column width="50%"}

```{r names_via_rlang}
# (4) Inline with {rlang}:
a <- 1:3
rlang::set_names(
  a,
  c("a", "b", "c")
)
```

:::

::: {.column width="50%"}
![](images/vectors/attr-names-2.png)
:::

::::


## Removing names

* `x <- unname(x)` or `names(x) <- NULL`
* Thematically but not directly related: labelled class vectors with `haven::labelled()`


## Dimensions: `matrix()` and `array()`

```{r dimensions}
# Two scalar arguments specify row and column sizes
x <- matrix(1:6, nrow = 2, ncol = 3)
x
# One vector argument to describe all dimensions
y <- array(1:12, c(2, 3, 2)) # rows, columns, no of arrays
y
```

## Dimensions: assign to `dim()`

```{r dimensions2}
# You can also modify an object in place by setting dim()
z <- 1:6
dim(z) <- c(2, 3) # rows, columns
z
a <- 1:12
dim(a) <- c(2, 3, 2) # rows, columns, no of arrays
a
```


## Functions for working with vectors, matrices and arrays (1/2):

Vector | Matrix | Array
:----- | :---------- | :-----
`names()` | `rownames()`, `colnames()` | `dimnames()`
`length()` | `nrow()`, `ncol()` | `dim()`
`c()` | `rbind()`, `cbind()` | `abind::abind()`
— | `t()` | `aperm()`
`is.null(dim(x))` | `is.matrix()` | `is.array()`

* **Caution**: Vector without `dim` set has `NULL` dimensions, not `1`.
* One dimension?
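
A short sketch of a few functions from the table, applied to a plain vector and to a matrix built from the same values. The names `vec` and `mat` are made up for this example.

```{r, eval = FALSE}
vec <- 1:6
mat <- matrix(1:6, nrow = 2)

dim(vec)          # a plain vector has no dim attribute
#> NULL
is.null(dim(vec))
#> [1] TRUE
length(vec)
#> [1] 6

dim(mat)
#> [1] 2 3
is.matrix(mat)
#> [1] TRUE
is.array(mat)     # a matrix is a 2-dimensional array
#> [1] TRUE
dim(t(mat))       # transposing swaps the two dimensions
#> [1] 3 2
```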

## Functions for working with vectors, matrices and arrays (2/2):

```{r examples_of_1D, eval = FALSE}
str(1:3) # 1d vector
#> int [1:3] 1 2 3
str(matrix(1:3, ncol = 1)) # column vector
#> int [1:3, 1] 1 2 3
str(matrix(1:3, nrow = 1)) # row vector
#> int [1, 1:3] 1 2 3
str(array(1:3, 3)) # "array" vector
#> int [1:3(1d)] 1 2 3
```


## Exercises 1/4

1. How is `setNames()` implemented? Read the source code.

<details><summary>Answer(s)</summary>

```{r, eval = FALSE}
setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
```

- The data argument comes first, so it works well with the pipe.
- The first argument is optional and defaults to `nm`.

```{r, eval = FALSE}
setNames( , c("a", "b", "c"))
#> a b c
#> "a" "b" "c"
```
</details>

## Exercises 1/4 (cont)

1. How is `unname()` implemented? Read the source code.

<details><summary>Answer(s)</summary>

```{r, eval = FALSE}
unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj)))
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
    dimnames(obj) <- NULL
  obj
}
```
`unname()` sets existing `names` or `dimnames` to `NULL`.
</details>

## Exercises 2/4

2. What does `dim()` return when applied to a 1-dimensional vector? When might you use `NROW()` or `NCOL()`?

<details><summary>Answer(s)</summary>

> `dim()` returns `NULL` when applied to a 1d vector.

`NROW()` and `NCOL()` treat `NULL` and vectors as if they had dimensions:

```{r, eval = FALSE}
x <- 1:10
nrow(x)
#> NULL
ncol(x)
#> NULL
NROW(x)
#> [1] 10
NCOL(x)
#> [1] 1
```

</details>

## Exercises 3/4

3. How would you describe the following three objects? What makes them different from `1:5`?

```{r}
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))
```

<details><summary>Answer(s)</summary>
```{r, eval = FALSE}
x1 <- array(1:5, c(1, 1, 5)) # 1 row, 1 column, 5 in third dim.
x2 <- array(1:5, c(1, 5, 1)) # 1 row, 5 columns, 1 in third dim.
x3 <- array(1:5, c(5, 1, 1)) # 5 rows, 1 column, 1 in third dim.
```
</details>

## Exercises 4/4

4. An early draft used this code to illustrate `structure()`:

```{r, eval = FALSE}
structure(1:5, comment = "my attribute")
#> [1] 1 2 3 4 5
```

Why don't you see the comment attribute on print? Is the attribute missing, or is there something else special about it?

<details><summary>Answer(s)</summary>
The documentation states (see `?comment`):

> Contrary to other attributes, the comment is not printed (by print or print.default).

</details>

## Exercises 4/4 (cont)

<details><summary>Answer(s)</summary>
Also, from `?attributes`:

> Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.

Retrieve comment attributes with `attr()`:

```{r, eval = FALSE}
foo <- structure(1:5, comment = "my attribute")

attributes(foo)
#> $comment
#> [1] "my attribute"
attr(foo, which = "comment")
#> [1] "my attribute"
```

</details>



## **Class** - S3 atomic vectors

![](images/vectors/summary-tree-s3-1.png)

Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham

**Having a class attribute turns an object into an S3 object.**

What makes S3 atomic vectors different?

1. behave differently from a regular vector when passed to a generic function
2. often store additional information in other attributes


## Four important S3 vectors used in base R:

1. **Factors** (categorical data)
2. **Dates**
3. **Date-times** (POSIXct)
4. **Durations** (difftime)

## Factors

A factor is a vector used to store categorical data that can contain only predefined values.

Factors are integer vectors with:

- Class: "factor"
- Attributes: "levels", or the set of allowed values

## Factors examples

```{r factor}
colors <- c('red', 'blue', 'green', 'red', 'red', 'green')
colors_factor <- factor(
  x = colors, levels = c('red', 'blue', 'green', 'yellow')
)
```

:::: columns

::: column

```{r factor_table}
table(colors)
table(colors_factor)
```
:::

::: column
```{r factor_type}
typeof(colors_factor)
class(colors_factor)

attributes(colors_factor)
```
:::
::::

## Custom Order

Factors can be ordered. This can be useful for models or visualizations where order matters.

```{r factor_ordered}

values <- c('high', 'med', 'low', 'med', 'high', 'low', 'med', 'high')
ordered_factor <- ordered(
  x = values,
  levels = c('low', 'med', 'high') # in order
)
ordered_factor

table(values)
table(ordered_factor)
```

## Dates

Dates are:

- Double vectors
- With class "Date"
- No other attributes

```{r dates}
notes_date <- Sys.Date()

# type
typeof(notes_date)

# class
attributes(notes_date)
```

## Dates Unix epoch

The double component represents the number of days since the [Unix epoch](https://en.wikipedia.org/wiki/Unix_time) `1970-01-01`.

```{r days_since_1970}
date <- as.Date("1970-02-01")
unclass(date)
```

## Date-times

There are 2 Date-time representations in base R:

- POSIXct, where "ct" denotes *calendar time*
- POSIXlt, where "lt" designates *local time*

<!--

Just for fun:
"How
to pronounce 'POSIXct'?"
https://www.howtopronounce.com/posixct

-->

## Date-times: POSIXct

We'll focus on POSIXct because:

- Simplest
- Built on an atomic (double) vector
- Most appropriate for use in a data frame

Let's now build and deconstruct a date-time.

```{r date_time}
# Build
note_date_time <- as.POSIXct(
  x = Sys.time(), # time
  tz = "America/New_York" # time zone, used only for formatting
)

# Inspect
note_date_time

# - type
typeof(note_date_time)

# - attributes
attributes(note_date_time)

structure(note_date_time, tzone = "Europe/Paris")
```

```{r date_time_format}
date_time <- as.POSIXct("2024-02-22 12:34:56", tz = "EST")
unclass(date_time)
```


## Durations

Durations represent the amount of time between pairs of dates or date-times.

- Double vectors
- Class: "difftime"
- Attributes: "units", or the unit of duration (e.g., weeks, hours, minutes, seconds, etc.)

```{r durations}
# Construct
one_minute <- as.difftime(1, units = "mins")
# Inspect
one_minute

# Dissect
# - type
typeof(one_minute)
# - attributes
attributes(one_minute)
```

```{r durations_math}
# `date` was set to 1970-02-01 in the Dates section
time_since_1970_02_01 <- notes_date - date
time_since_1970_02_01
```


See also:

- [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html)
- [`clock::date_time_build()`](https://clock.r-lib.org/reference/date_time_build.html)


## Exercises 1/3

1. What sort of object does `table()` return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?

<details><summary>Answer(s)</summary>

`table()` returns a contingency table of its input variables.
It is implemented as an integer vector with class `table` and dimensions (which makes it act like an array). Its attributes are `dim` (dimensions) and `dimnames` (one name for each input column). The dimensions correspond to the number of unique values (factor levels) in each input variable.

```{r, eval = FALSE}
x <- table(mtcars[c("vs", "cyl", "am")])

typeof(x)
#> [1] "integer"
attributes(x)
#> $dim
#> [1] 2 3 2
#>
#> $dimnames
#> $dimnames$vs
#> [1] "0" "1"
#>
#> $dimnames$cyl
#> [1] "4" "6" "8"
#>
#> $dimnames$am
#> [1] "0" "1"
#>
#>
#> $class
#> [1] "table"
```
</details>

## Exercises 2/3

2. What happens to a factor when you modify its levels?

```{r, eval = FALSE}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
```

<details><summary>Answer(s)</summary>
The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.

```{r, eval = FALSE}
f1 <- factor(letters)
f1
#> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f1)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26

levels(f1) <- rev(levels(f1))
f1
#> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f1)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26
```
</details>

## Exercises 3/3

3. What does this code do? How do `f2` and `f3` differ from `f1`?

```{r, eval = FALSE}
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
```

<details><summary>Answer(s)</summary>
For `f2` and `f3` either the order of the factor elements or its levels are being reversed. For `f1` both transformations are occurring.

```{r, eval = FALSE}
# Reverse element order
(f2 <- rev(factor(letters)))
#> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
#> Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
as.integer(f2)
#> [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
#> [26] 1

# Reverse factor levels (when creating factor)
(f3 <- factor(letters, levels = rev(letters)))
#> [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
#> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
as.integer(f3)
#> [1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
#> [26] 1
```
</details>


## Lists

* Sometimes called a generic vector or recursive vector
* Recall ([section 2.3.3](https://adv-r.hadley.nz/names-values.html#list-references)): each element is really a *reference* to another object
* Can be composed of elements of different types (as opposed to atomic vectors, which must be of only one type)

## Constructing

Simple lists:

```{r list_simple}
# Construct
simple_list <- list(
  c(TRUE, FALSE), # logicals
  1:20, # integers
  c(1.2, 2.3, 3.4), # doubles
  c("primo", "secundo", "tercio") # characters
)

simple_list

# Inspect
# - type
typeof(simple_list)
# - structure
str(simple_list)

# Accessing
simple_list[1]
simple_list[2]
simple_list[3]
simple_list[4]

simple_list[[1]][2]
simple_list[[2]][8]
simple_list[[3]][2]
simple_list[[4]][3]
```

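The accessing lines above mix `[` and `[[`. A minimal sketch of the difference, using a made-up two-element list `demo_list`:

```{r, eval = FALSE}
# hypothetical list, smaller than simple_list
demo_list <- list(c(TRUE, FALSE), c("primo", "secundo"))

typeof(demo_list[1])    # `[` returns a shorter list
#> [1] "list"
typeof(demo_list[[1]])  # `[[` returns the element itself
#> [1] "logical"
demo_list[[2]][1]       # extract the element, then subset it
#> [1] "primo"
```
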
## Even Simpler List

```{r list_simpler}
# Construct
simpler_list <- list(TRUE, FALSE,
                     1, 2, 3, 4, 5,
                     1.2, 2.3, 3.4,
                     "primo", "secundo", "tercio")

# Accessing
simpler_list[1]
simpler_list[5]
simpler_list[9]
simpler_list[11]
```

## Nested lists:

```{r list_nested}
nested_list <- list(
  # first level
  list(
    # second level
    list(
      # third level
      list(1)
    )
  )
)

str(nested_list)
```

Like JSON.

## Combined lists

```{r list_combined}
list_comb1 <- list(list(1, 2), list(3, 4)) # with list()
list_comb2 <- c(list(1, 2), list(3, 4)) # with c()

# compare structure
str(list_comb1)
str(list_comb2)

# does this work if they are different data types?
list_comb3 <- c(list(1, 2), list(TRUE, FALSE))
str(list_comb3)
```

## Testing

Check that an object is a list:

- `is.list()`
- `rlang::is_list()`

The two do the same, except that the latter can also check the number of elements.

```{r list_test}
# is list
base::is.list(list_comb2)
rlang::is_list(list_comb2)

# is list of 4 elements
rlang::is_list(x = list_comb2, n = 4)

# is a vector (of a special type)
# remember the family tree?
rlang::is_vector(list_comb2)
```

## Coercion

Use `as.list()`

```{r list_coercion}
list(1:3)
as.list(1:3)
```

## Matrices and arrays

Although not often used, the dimension attribute can be added to create **list-matrices** or **list-arrays**.

```{r list_matrices_arrays}
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2); l

l[[1, 1]]
```


## Exercises 1/3

1. List all the ways that a list differs from an atomic vector.

<details><summary>Answer(s)</summary>

* Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types) as described in the introduction of the vectors chapter.
* Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the vectors and the names and values chapters.)

```{r, eval = FALSE}
lobstr::ref(1:2)
#> [1:0x7fcd936f6e80] <int>
lobstr::ref(list(1:2, 2))
#> █ [1:0x7fcd93d53048] <list>
#> ├─[2:0x7fcd91377e40] <int>
#> └─[3:0x7fcd93b41eb0] <dbl>
```


* Subsetting with out-of-bounds and `NA` values leads to different output. For example, `[` returns `NA` for atomics and `NULL` for lists. (This is described in more detail within the subsetting chapter.)

```{r, eval = FALSE}
# Subsetting atomic vectors
(1:2)[3]
#> [1] NA
(1:2)[NA]
#> [1] NA NA

# Subsetting lists
as.list(1:2)[3]
#> [[1]]
#> NULL
as.list(1:2)[NA]
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
```


</details>

## Exercises 2/3

2. Why do you need to use `unlist()` to convert a list to an atomic vector? Why doesn’t `as.vector()` work?

<details><summary>Answer(s)</summary>
A list is already a vector, though not an atomic one! Note that `as.vector()` and `is.vector()` use different definitions of “vector”!

```{r, eval = FALSE}
is.vector(as.vector(mtcars))
#> [1] FALSE
```

</details>

## Exercises 3/3

3. Compare and contrast `c()` and `unlist()` when combining a date and date-time into a single vector.

<details><summary>Answer(s)</summary>
Date and date-time objects are both built upon doubles.
While dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), date-time objects (POSIXct) store the time difference to this date in seconds.

```{r, eval = FALSE}
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

# Internal representations
unclass(date)
#> [1] 1
unclass(dttm_ct)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"
```

As the `c()` generic only dispatches on its first argument, combining date and date-time objects via `c()` could lead to surprising results in older R versions (pre R 4.0.0):

```{r, eval = FALSE}
# Output in R version 3.6.2
c(date, dttm_ct) # equal to c.Date(date, dttm_ct)
#> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date) # equal to c.POSIXct(dttm_ct, date)
#> [1] "1970-01-01 02:00:00 CET" "1970-01-01 01:00:01 CET"
```

In the first statement above `c.Date()` is executed, which incorrectly treats the underlying double of `dttm_ct` (3600) as days instead of seconds. Conversely, when `c.POSIXct()` is called on a date, one day is counted as one second only.

We can highlight these mechanics by the following code:

```{r, eval = FALSE}
# Output in R version 3.6.2
unclass(c(date, dttm_ct)) # internal representation
#> [1] 1 3600
date + 3599
#> [1] "1979-11-10"
```

As of R 4.0.0 these issues have been resolved and both methods now convert their input first into POSIXct and Date, respectively.

```{r, eval = FALSE}
c(dttm_ct, date)
#> [1] "1970-01-01 01:00:00 UTC" "1970-01-02 00:00:00 UTC"
unclass(c(dttm_ct, date))
#> [1] 3600 86400

c(date, dttm_ct)
#> [1] "1970-01-02" "1970-01-01"
unclass(c(date, dttm_ct))
#> [1] 1 0
```

However, as `c()` strips the time zone (and other attributes) of POSIXct objects, some caution is still recommended.

```{r, eval = FALSE}
(dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "HST"))
#> [1] "1970-01-01 01:00:00 HST"
attributes(c(dttm_ct))
#> $class
#> [1] "POSIXct" "POSIXt"
```

A package that deals with these kinds of problems in more depth and provides a structural solution for them is the {vctrs} package, which is also used throughout the tidyverse.

Let’s look at `unlist()`, which operates on list input.

```{r, eval = FALSE}
# Attributes are stripped
unlist(list(date, dttm_ct))
#> [1] 1 39600
```

We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with when `unlist()` strips the attributes of the list.

To summarise: `c()` coerces types and strips time zones. Errors may have occurred in older R versions because of inappropriate method dispatch/immature methods. `unlist()` strips attributes.
</details>


## Data frames and tibbles



Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham

## Data frame

A data frame is a:

- Named list of vectors (i.e., column names)
- Attributes:
  - (column) `names`
  - `row.names`
  - Class: `"data.frame"`

## Data frame, examples 1/2:

```{r data_frame}
# Construct
df <- data.frame(
  col1 = c(1, 2, 3),              # named atomic vector
  col2 = c("un", "deux", "trois") # another named atomic vector
  # ,stringsAsFactors = FALSE # default since R 4.0.0
)
# Inspect
df

# Deconstruct
# - type
typeof(df)
# - attributes
attributes(df)
```


## Data frame, examples 2/2:

```{r df_functions}
rownames(df)
colnames(df)
names(df) # Same as colnames(df)

nrow(df)
ncol(df)
length(df) # Same as ncol(df)
```

Unlike other lists, the length
of each vector must be the same (i.e. as many vector elements as rows in the data frame).

## Tibble

Created to relieve some of the frustrations and pain points created by data frames, tibbles are data frames that are:

- Lazy (do less)
- Surly (complain more)

## Lazy

Tibbles do not:

- Coerce strings
- Transform non-syntactic names
- Recycle vectors of length greater than 1

## ! Coerce strings

```{r tbl_no_coerce}
chr_col <- c("don't", "factor", "me", "bro")

# data frame
df <- data.frame(
  a = chr_col,
  # before R 4.0.0, this was the default
  stringsAsFactors = TRUE
)

# tibble
tbl <- tibble::tibble(
  a = chr_col
)

# contrast the structure
str(df$a)
str(tbl$a)

```

## ! Transform non-syntactic names

```{r tbl_col_name}
# data frame
df <- data.frame(
  `1` = c(1, 2, 3)
)

# tibble
tbl <- tibble::tibble(
  `1` = c(1, 2, 3)
)

# contrast the names
names(df)
names(tbl)
```

## ! Recycle vectors of length greater than 1

```{r tbl_recycle, error=TRUE}
# data frame
df <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)
)

# tibble
tbl <- tibble::tibble(
  col1 = c(1, 2, 3, 4),
  col2 = c(1, 2)
)
```

## Surly

Tibbles do only what they're asked and complain if what they're asked doesn't make sense:

- Subsetting always yields a tibble
- Complains if cannot find column

## Subsetting always yields a tibble

```{r tbl_subset}
# data frame
df <- data.frame(
  col1 = c(1, 2, 3, 4)
)

# tibble
tbl <- tibble::tibble(
  col1 = c(1, 2, 3, 4)
)

# contrast
df_col <- df[, "col1"]
str(df_col)
tbl_col <- tbl[, "col1"]
str(tbl_col)

# to select a vector, do one of these instead
tbl_col_1 <- tbl[["col1"]]
str(tbl_col_1)
tbl_col_2 <- dplyr::pull(tbl, col1)
str(tbl_col_2)
```

## Complains if cannot find column

```{r tbl_col_match, warning=TRUE}
names(df)
df$col

names(tbl)
tbl$col
```

## One more difference

**`tibble()` allows you to refer to variables created during construction**

```{r df_tibble_diff}
tibble::tibble(
  x = 1:3,
  y = x * 2 # x refers to the line above
)
```

<details>
<summary>Side Quest: Row Names</summary>

- character vector containing only unique values
- get and set with `rownames()`
- can use them to subset rows

```{r row_names}
df3 <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)
df3

rownames(df3)
df3["Bob", ]

rownames(df3) <- c("Susan", "Bob", "Sam")
rownames(df3)
df3["Bob", ]
```

There are three reasons why row names are undesirable:

1. Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea.
2. Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
3. Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.

</details>


## Tibbles: Printing

Data frames and tibbles print differently.

```{r df_tibble_print}
df3
tibble::as_tibble(df3)
```


## Tibbles: Subsetting

Two undesirable subsetting behaviours:

1. When you subset columns with `df[, vars]`, you will get a vector if `vars` selects one variable, otherwise you’ll get a data frame, unless you always remember to use `df[, vars, drop = FALSE]`.
2. When you attempt to extract a single column with `df$x` and there is no column `x`, a data frame will instead select any variable that starts with `x`. If no variable starts with `x`, `df$x` will return `NULL`.

Tibbles tweak these behaviours so that `[` always returns a tibble, and `$` doesn’t do partial matching and warns if it can’t find a variable (*this is what makes tibbles surly*).

## Tibbles: Testing

Whether data frame: `is.data.frame()`. Note: both data frames and tibbles are data frames.

Whether tibble: `tibble::is_tibble()`. Note: only tibbles are tibbles. Vanilla data frames are not.
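
The two tests above can be tried side by side. This is a minimal sketch, assuming the {tibble} package is installed; the object names `df_plain` and `tbl_plain` are made up for illustration.

```{r}
# Illustrative objects: a plain data frame and the same data as a tibble
df_plain  <- data.frame(x = 1:3)
tbl_plain <- tibble::as_tibble(df_plain)

# Both are data frames
is.data.frame(df_plain)
is.data.frame(tbl_plain)

# Only the tibble is a tibble
tibble::is_tibble(df_plain)
tibble::is_tibble(tbl_plain)

# The class vector shows why: a tibble adds classes on top of "data.frame"
class(tbl_plain)
```

Because a tibble keeps `"data.frame"` in its class vector, code written for data frames generally accepts tibbles, but not the other way around.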

## Tibbles: Coercion

- To data frame: `as.data.frame()`
- To tibble: `tibble::as_tibble()`

## Tibbles: List Columns

List-columns are allowed in data frames, but you have to do a little extra work by either adding the list-column after creation or wrapping the list in `I()`.

```{r list_columns}
df4 <- data.frame(x = 1:3)
df4$y <- list(1:2, 1:3, 1:4)
df4

df5 <- data.frame(
  x = 1:3,
  y = I(list(1:2, 1:3, 1:4))
)
df5
```

## Tibbles: Matrix and data frame columns

- As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
- As with list-columns, you must either add the column after creation or wrap it in `I()`.

```{r matrix_df_columns}
dfm <- data.frame(
  x = 1:3 * 10,
  y = I(matrix(1:9, nrow = 3))
)

dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)

str(dfm)
dfm$y
dfm$z
```


## Exercises 1/4

1. Can you have a data frame with zero rows? What about zero columns?

<details><summary>Answer(s)</summary>
Yes, you can create these data frames easily; either during creation or via subsetting. Even both dimensions can be zero. Create a 0-row, 0-column, or an empty data frame directly:

```{r, eval = FALSE}
data.frame(a = integer(), b = logical())
#> [1] a b
#> <0 rows> (or 0-length row.names)

data.frame(row.names = 1:3) # or data.frame()[1:3, ]
#> data frame with 0 columns and 3 rows

data.frame()
#> data frame with 0 columns and 0 rows
```

Create similar data frames via subsetting the respective dimension with either 0, `NULL`, `FALSE` or a valid 0-length atomic (`logical(0)`, `character(0)`, `integer(0)`, `double(0)`). Negative integer sequences would also work.
The following example uses a zero:

```{r, eval = FALSE}
mtcars[0, ]
#> [1] mpg cyl disp hp drat wt qsec vs am gear carb
#> <0 rows> (or 0-length row.names)

mtcars[ , 0] # or mtcars[0]
#> data frame with 0 columns and 32 rows

mtcars[0, 0]
#> data frame with 0 columns and 0 rows
```


</details>

## Exercises 2/4

2. What happens if you attempt to set rownames that are not unique?

<details><summary>Answer(s)</summary>
Matrices can have duplicated row names, so this does not cause problems.

Data frames, however, require unique rownames and you get different results depending on how you attempt to set them. If you set them directly or via `row.names()`, you get an error:

```{r, eval = FALSE}
data.frame(row.names = c("x", "y", "y"))
#> Error in data.frame(row.names = c("x", "y", "y")): duplicate row.names: y

df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
#> Warning: non-unique value when setting 'row.names': 'y'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
```

If you use subsetting, `[` automatically deduplicates:

```{r, eval = FALSE}
row.names(df) <- c("x", "y", "z")
df[c(1, 1, 1), , drop = FALSE]
#>     x
#> x   1
#> x.1 1
#> x.2 1
```

</details>

## Exercises 3/4

3. If `df` is a data frame, what can you say about `t(df)`, and `t(t(df))`? Perform some experiments, making sure to try different column types.

<details><summary>Answer(s)</summary>
Both of `t(df)` and `t(t(df))` will return matrices:

```{r, eval = FALSE}
df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
#> [1] FALSE
is.matrix(t(df))
#> [1] TRUE
is.matrix(t(t(df)))
#> [1] TRUE
```

The dimensions will respect the typical transposition rules:

```{r, eval = FALSE}
dim(df)
#> [1] 3 2
dim(t(df))
#> [1] 2 3
dim(t(t(df)))
#> [1] 3 2
```

Because the output is a matrix, every column is coerced to the same type. (It is implemented within `t.data.frame()` via `as.matrix()` which is described below).

```{r, eval = FALSE}
df
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
t(df)
#>   [,1] [,2] [,3]
#> x "1"  "2"  "3"
#> y "a"  "b"  "c"
```

</details>

## Exercises 4/4

4. What does `as.matrix()` do when applied to a data frame with columns of different types? How does it differ from `data.matrix()`?

<details><summary>Answer(s)</summary>
The type of the result of `as.matrix()` depends on the types of the input columns (see `?as.matrix`):

> The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.

On the other hand, `data.matrix()` will always return a numeric matrix (see `?data.matrix`).

> Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.
[…] Character columns are first converted to factors and then to integers.

We can illustrate and compare the mechanics of these functions using a concrete example. `as.matrix()` makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from `data.matrix()`’s output, we would need a lookup table for each column.

```{r, eval = FALSE}
df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

as.matrix(df_coltypes)
#>      a   b       c   d     e
#> [1,] "a" "TRUE"  "1" "1.5" "f1"
#> [2,] "b" "FALSE" "0" "2.0" "f2"
data.matrix(df_coltypes)
#>      a b c   d e
#> [1,] 1 1 1 1.5 1
#> [2,] 2 0 0 2.0 2
```

</details>


## `NULL`

A special type of object that:

- Has length 0
- Cannot have attributes

```{r null, results = 'hide'}
typeof(NULL)
#> [1] "NULL"

length(NULL)
#> [1] 0
```

```{r null_attr, error=TRUE}
x <- NULL
attr(x, "y") <- 1
```

```{r null_check}
is.null(NULL)
```


## Digestif

Let us use some of this chapter's skills on the `penguins` data.

## Attributes

```{r}
str(penguins_raw)
```

```{r}
str(penguins_raw, give.attr = FALSE)
```

## Data Frames vs Tibbles

```{r}
penguins_df <- data.frame(penguins)
penguins_tb <- penguins # i.e. penguins was already a tibble
```

## Printing

* Tip: print out these results in RStudio under different editor themes

```{r, eval = FALSE}
print(penguins_df) #don't run this
```

```{r}
head(penguins_df)
```

```{r}
penguins_tb
```

## Atomic Vectors

```{r}
species_vector_df <- penguins_df |> select(species)
species_unlist_df <- penguins_df |> select(species) |> unlist()
species_pull_df <- penguins_df |> select(species) |> pull()

species_vector_tb <- penguins_tb |> select(species)
species_unlist_tb <- penguins_tb |> select(species) |> unlist()
species_pull_tb <- penguins_tb |> select(species) |> pull()
```

<details>
<summary>`typeof()` and `class()`</summary>
```{r}
typeof(species_vector_df)
class(species_vector_df)

typeof(species_unlist_df)
class(species_unlist_df)

typeof(species_pull_df)
class(species_pull_df)

typeof(species_vector_tb)
class(species_vector_tb)

typeof(species_unlist_tb)
class(species_unlist_tb)

typeof(species_pull_tb)
class(species_pull_tb)
```

</details>

## Column Names

```{r}
colnames(penguins_tb)
```

```{r}
names(penguins_tb) == colnames(penguins_tb)
```

```{r}
names(penguins_df) == names(penguins_tb)
```

## What if we only invoke a partial name of a column of a tibble?

```{r, error = TRUE}
penguins_tb$y
```



* What if we only invoke a partial name of a column of a data frame?

```{r}
head(penguins_df$y) #instead of `year`
```

* Is this evaluation in alphabetical order or column order?

```{r}
penguins_df_se_sp <- penguins_df |> select(sex, species)
penguins_df_sp_se <- penguins_df |> select(species, sex)
```

```{r}
head(penguins_df_se_sp$s)
```

```{r}
head(penguins_df_sp_se$s)
```


## Chapter Quiz 1/5

1. What are the four common types of atomic vectors? What are the two rare types?

<details><summary>Answer(s)</summary>
The four common types of atomic vector are logical, integer, double and character. The two rarer types are complex and raw.
</details>

## Chapter Quiz 2/5

2. What are attributes? How do you get them and set them?

<details><summary>Answer(s)</summary>
Attributes allow you to associate arbitrary additional metadata to any object. You can get and set individual attributes with `attr(x, "y")` and `attr(x, "y") <- value`; or you can get and set all attributes at once with `attributes()`.
</details>

## Chapter Quiz 3/5

3. How is a list different from an atomic vector? How is a matrix different from a data frame?

<details><summary>Answer(s)</summary>
The elements of a list can be any type (even a list); the elements of an atomic vector are all of the same type. Similarly, every element of a matrix must be the same type; in a data frame, different columns can have different types.
</details>

## Chapter Quiz 4/5

4. Can you have a list that is a matrix? Can a data frame have a column that is a matrix?

<details><summary>Answer(s)</summary>
You can make a list-array by assigning dimensions to a list. You can make a matrix a column of a data frame with `df$x <- matrix()`, or by using `I()` when creating a new data frame `data.frame(x = I(matrix()))`.
</details>

## Chapter Quiz 5/5

5. How do tibbles behave differently from data frames?

<details><summary>Answer(s)</summary>
Tibbles have an enhanced print method, never coerce strings to factors, and provide stricter subsetting methods.
</details>
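
The stricter subsetting mentioned in the last answer can be checked with a short experiment. This is only a sketch, assuming the {tibble} package is installed; `df_quiz` and `tbl_quiz` are hypothetical names chosen for this example.

```{r}
# Illustrative objects: the same two columns as a data frame and as a tibble
df_quiz  <- data.frame(value = 1:3, label = c("a", "b", "c"))
tbl_quiz <- tibble::as_tibble(df_quiz)

# Single-column `[`: the data frame drops to a vector, the tibble stays a tibble
is.data.frame(df_quiz[, "value"])
is.data.frame(tbl_quiz[, "value"])

# `$` with a partial name: the data frame matches "val" to "value",
# while the tibble returns NULL (and warns, suppressed here)
df_quiz$val
suppressWarnings(tbl_quiz$val)
```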