commit 543d8954550390f0184229d51c4e6383dc2262bb
parent ec7c26d7f08256bcb7006b6189bd11cba6f5b5fc
Author: Betsy Rosalen <betsy@mylittleuniverse.com>
Date:   Thu, 22 Feb 2024 16:01:03 -0500
Cohort 8 chapter 3 (#57)
* Chapter 3 Vectors Notes Updated
* Add names to all code chunks in Chapter 3
* Remove Chapter 3 yaml header that got added by mistake
* Fixed a few minor formatting errors in Chapter 3
Diffstat:
| M | 03_Vectors.Rmd |  |  | 491 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------- | 
1 file changed, 369 insertions(+), 122 deletions(-)
diff --git a/03_Vectors.Rmd b/03_Vectors.Rmd
@@ -2,58 +2,73 @@
 
 **Learning objectives:**
 
--   Learn about different types of vectors
+-   Learn about different types of vectors and their attributes
 -   Learn how these types relate to one another
 
-## Types of vectors
+## Types of Vectors
 
-The family tree of vectors:
+ 
 
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
 
--   **Atomic.** Elements all the same type.
--   **List.** Elements are different Types.
--   **NULL** Null elements. Length zero.
+Two main types:
 
-## Atomic vectors
+-   **Atomic** Elements all the same type.
+-   **List** Elements are different Types.
 
-### Types
+Closely related but not technically a vector:
 
--   The vector family tree revisited.
--   Meet the children of atomic vectors
+-   **NULL** Null elements. Always length zero; often stands in for a zero-length vector.
 
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+## Atomic Vectors
 
-### Length one
+### Types of atomic vectors
 
-"Scalars" that consist of a single value.
+ 
+
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+
+-   **Logical**: `TRUE`/`FALSE`
+-   **Integer**: Numeric (discrete, no decimals)
+-   **Double**: Numeric (continuous, decimals)
+-   **Character**: String
+
+### Vectors of Length One
+
+**Scalars** are vectors that consist of a single value.
+
+#### Logicals:
 
 ```{r vec_lgl}
-# Logicals
 lgl1 <- TRUE
 lgl2 <- T
+lgl3 <- FALSE
+lgl4 <- F
 ```
 
+#### Doubles:
+
 ```{r vec_dbl}
-# Doubles
 # integer, decimal, scientific, or hexadecimal format
 dbl1 <- 1
-dbl2 <- 1.234
-dbl3 <- 1.234e0
-dbl4 <- 0xcafe
+dbl2 <- 1.234 # decimal
+dbl3 <- 1.234e0 # scientific format
+dbl4 <- 0xcafe # hexadecimal format
 ```
 
+#### Integers: must be followed by L and cannot have fractional values
+
 ```{r vec_int}
-# Integers
 # Note: L denotes an integer
 int1 <- 1L
-int2 <- 1.234L
-int3 <- 1.234e0L
+int2 <- 1234L
+int3 <- 1234e0L
 int4 <- 0xcafeL
 ```
 
+#### Strings: can use single or double quotes; special characters are escaped with `\`
+
 ```{r vec_str}
-# Strings
 str1 <- "hello" # double quotes
 str2 <- 'hello' # single quotes
 str3 <- "مرحبًا" # Unicode
@@ -62,13 +77,15 @@ str4 <- "\U0001f605" # sweaty_smile
 
 ### Longer
 
-Several ways to make longer:
+Several ways to make longer vectors:
 
-**1. With single values**
+**1. With single values** inside `c()`, short for combine.
 
 ```{r long_single}
-lgl_vec <- c(TRUE, FALSE)
-
+lgl_var <- c(TRUE, FALSE)
+int_var <- c(1L, 6L, 10L)
+dbl_var <- c(1, 2.5, 4.5)
+chr_var <- c("these are", "some strings")
 ```
 
 **2. With other vectors**
@@ -97,6 +114,22 @@ They look to do both more and less than `c()`.
 
 Note: currently has `questioning` lifecycle badge, since these constructors may get moved to `vctrs`
 
+#### Determine Type and Length
+
+Determine the type of a vector with `typeof()` and its length with `length()`.
+
+```{r type_length}
+typeof(lgl_var)
+typeof(int_var)
+typeof(dbl_var)
+typeof(chr_var)
+
+length(lgl_var)
+length(int_var)
+length(dbl_var)
+length(chr_var)
+```
+
 ### Missing values
 
 **Contagion**
@@ -110,7 +143,14 @@ sum(c(1, 2, NA, 3))
 
 # inoculate
 sum(c(1, 2, NA, 3), na.rm = TRUE)
+```
+
+To search for missing values use `is.na()`; comparing with `== NA` returns `NA` for every element, because `NA` is contagious.
 
+```{r na_search, error=TRUE}
+x <- c(NA, 5, NA, 10)
+x == NA
+is.na(x)
 ```
 
 **Types**
@@ -143,7 +183,9 @@ Don't test objects with these tools:
 
 -   `is.vector()`
 -   `is.atomic()`
--   `is.numeric()`
+-   `is.numeric()` 
+
+They don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do.
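+
+A minimal sketch of why these functions can surprise you, using a hypothetical vector `test_vec` that is not part of the original notes. It relies only on documented base R behaviour: `is.vector()` returns `TRUE` only when the object has no attributes other than names, and `is.numeric()` is `TRUE` for integers but `FALSE` for factors.
+
+```{r is_functions_sketch}
+# hypothetical example vector, for illustration only
+test_vec <- c(a = 1, b = 2)
+is.vector(test_vec)      # TRUE: names are the only attribute
+
+attr(test_vec, "note") <- "extra metadata"
+is.vector(test_vec)      # FALSE: any attribute besides names
+
+is.numeric(1L)           # TRUE: integers count as numeric
+is.numeric(factor("a"))  # FALSE: even though factors are built on integers
+```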
 
 Instead, maybe, use `{rlang}`
 
@@ -169,20 +211,20 @@ R follows rules for coercion: character → double → integer → logical
 
 R can coerce either automatically or explicitly
 
-**Automatic**
+#### **Automatic**
 
 Two contexts for automatic coercion:
 
 1.  Combination
 2.  Mathematical
 
-Combination:
+##### Coercion by Combination:
 
 ```{r coerce_c}
 str(c(TRUE, "TRUE"))
 ```
 
-Mathematical operations
+##### Coercion by Mathematical operations:
 
 ```{r coerce_math}
 # imagine a logical vector about whether an attribute is present
@@ -192,7 +234,7 @@ has_attribute <- c(TRUE, FALSE, TRUE, TRUE)
 sum(has_attribute)
 ```
 
-**Explicit**
+#### **Explicit**
 
 Use `as.*()`
 
@@ -201,7 +243,14 @@ Use `as.*()`
 -   Double: `as.double()`
 -   Character: `as.character()`
 
-But note that coercions may fail in one of two ways, or both:
+```{r explicit_coercion}
+dbl_var
+as.integer(dbl_var)
+lgl_var
+as.character(lgl_var)
+```
+
+But note that coercion may fail in one of two ways, or both:
 
 -   With warning/error
 -   NAs
@@ -212,37 +261,23 @@ as.integer(c(1, 2, "three"))
 
 ## Attributes
 
--   What
--   How
--   Why
-
-### What
-
-Two perspectives:
+Attributes are name-value pairs that attach metadata to an object (such as a vector).
 
--   Name-value pairs
--   Metadata
+### What?
 
-**Name-value pairs**
+**Name-value pairs** - attributes have a name and a value
 
-Formally, attributes have a name and a value.
+**Metadata** - Not data itself, but data about the data
 
-**Metadata**
+### How? 
 
--   Not data itself
--   But data about the data
+#### Getting and Setting
 
-### How
+Three functions:
 
-Two operations:
-
-1.  Get
-2.  Set
-
-Two cases:
-
-1.  Single attribute
-2.  Multiple attributes
+1. retrieve and modify single attributes with `attr()`
+2. retrieve en masse with `attributes()`
+3. set en masse with `structure()`
 
 **Single attribute**
 
@@ -253,10 +288,10 @@ Use `attr()`
 a <- c(1, 2, 3)
 
 # set attribute
-attr(x = a, which = "some_attribute_name") <- "some attribute"
+attr(x = a, which = "attribute_name") <- "some attribute"
 
 # get attribute
-attr(x = a, which = "some_attribute_name")
+attr(a, "attribute_name")
 ```
 
 **Multiple attributes**
@@ -269,114 +304,153 @@ b <- c(4, 5, 6)
 # set
 b <- structure(
   .Data = b,
-  attrib1 = "one",
-  attrib2 = "two"
+  attrib1_name = "first_attribute",
+  attrib2_name = "second_attribute"
 )
 
 # get
+attributes(b)
 str(attributes(b))
 ```
 
 ### Why
 
-Two common use cases:
+Three particularly important attributes: 
+
+1. **names** - a character vector giving each element a name
+2. **dim** - (short for dimensions) turns vectors into matrices and arrays 
+3. **class** - powers the S3 object system (we'll learn more about this in chapter 13)
 
--   Names
--   Dimensions
+Most attributes are lost by most operations. Only two attributes are routinely preserved: names and dim.
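+
+A short sketch of attributes being dropped, assuming a hypothetical vector `tagged` with a made-up `note` attribute. Subsetting keeps the names but drops `note`, and summarising drops everything.
+
+```{r attributes_lost_sketch}
+# hypothetical vector with names and one custom attribute
+tagged <- structure(1:3, names = c("a", "b", "c"), note = "hypothetical attribute")
+
+attributes(tagged[1])    # only names survives subsetting
+attributes(sum(tagged))  # NULL: nothing survives
+```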
 
-**Names**
+#### **Names**
 
 ~~Three~~ Four ways to name:
 
-```{r}
-# 1. At creation
-one <- c(one = 1, two = 2, three = 3)
+```{r names}
+# When creating it: 
+x <- c(A = 1, B = 2, C = 3)
+x
 
-# 2. By assigning a character vector of names
-two <- c(1, 2, 3)
-names(two) <- c("one", "two", "three")
+# By assigning a character vector to names()
+y <- 1:3
+names(y) <- c("a", "b", "c")
+y
 
-# 3. By setting names--with base R
-three <- c(1, 2, 3)
-stats::setNames(
-  object = three, 
-  nm = c("One", "Two", "Three")
-)
+# Inline, with setNames():
+z <- setNames(1:3, c("one", "two", "three"))
+z
 
 # 4. By setting names--with {rlang}
+a <- 1:3
 rlang::set_names(
-  x = three,
+  x = a,
   nm = c("One", "Two", "Three")
 )
+
 ```
+You can remove names from a vector by using `x <- unname(x)` or `names(x) <- NULL`.
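+
+As a quick illustration of both removal forms, here is a sketch using a hypothetical vector `named_vec` (the name is an assumption, chosen so that it does not clash with the vectors above):
+
+```{r names_remove_sketch}
+# hypothetical named vector
+named_vec <- c(a = 1, b = 2, c = 3)
+
+# unname() returns a copy without names; the original is untouched
+names(unname(named_vec))
+names(named_vec)
+
+# assigning NULL removes the names in place
+names(named_vec) <- NULL
+named_vec
+```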
 
 Thematically but not directly related: labelled class vectors with `haven::labelled()`
 
-**Dimensions**
+#### **Dimensions**
 
-Important for arrays and matrices.
+Create matrices and arrays with `matrix()` and `array()`, or by using the assignment form of `dim()`:
 
-```{r}
-# length 6 vector spread across 2 rows of 3 columns
-matrix(1:6, nrow = 2, ncol = 3)
+```{r dimensions}
+# Two scalar arguments specify row and column sizes
+x <- matrix(1:6, nrow = 2, ncol = 3)
+x
+
+# One vector argument to describe all dimensions
+y <- array(1:12, c(2, 3, 2))
+y
+
+# You can also modify an object in place by setting dim()
+z <- 1:6
+dim(z) <- c(3, 2)
+z
 ```
 
-## S3 atomic vectors
+##### Functions for working with vectors, matrices and arrays:
+
+Vector | Matrix | Array
+:----- | :---------- | :-----
+`names()` | `rownames()`, `colnames()` | `dimnames()`
+`length()` | `nrow()`, `ncol()` | `dim()`
+`c()` | `rbind()`, `cbind()` | `abind::abind()`
+— | `t()` | `aperm()`
+`is.null(dim(x))` | `is.matrix()` | `is.array()`
+
+## **Class** - S3 atomic vectors
 
--   The vector family tree revisited.
--   Meet the children of typed atomic vectors
+ 
 
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
 
-This list could (more easily) be expanded to new vector types with [`{vctrs}`](https://vctrs.r-lib.org/). See [rstudio::conf(2019) talk on the package around 18:27](https://www.rstudio.com/resources/rstudioconf-2019/vctrs-tools-for-making-size-and-type-consistent-functions/). See also [rstudio::conf(2020) talk on new vector types for dealing with non-decimal currencies](https://www.rstudio.com/resources/rstudioconf-2020/vctrs-creating-custom-vector-classes-with-the-vctrs-package/).
+**Having a class attribute turns an object into an S3 object.**
 
-What makes S3 atomic vectors different than their parents?
+What makes S3 atomic vectors different?
 
-Two things:
+1. behave differently from a regular vector when passed to a generic function 
+2. often store additional information in other attributes
 
-1.  Class
-2.  Attributes (typically)
+Four important S3 vectors used in base R:
+
+1. **Factors** (categorical data)
+2. **Dates**
+3. **Date-times** (POSIXct)
+4. **Durations** (difftime)
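+
+A minimal sketch of how the class attribute changes behaviour, assuming a hypothetical object `epoch_day` built by hand with `structure()`. The generic `format()` dispatches on the class, so the same underlying double is shown as a date with the class and as a plain number without it.
+
+```{r s3_class_sketch}
+# hypothetical Date built directly from a double
+epoch_day <- structure(0, class = "Date")
+
+format(epoch_day)            # "1970-01-01": the Date method is used
+format(unclass(epoch_day))   # "0": the default method is used
+inherits(epoch_day, "Date")  # TRUE
+```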
 
 ### Factors
 
+A factor is a vector used to store categorical data that can contain only predefined values.
+
 Factors are integer vectors with:
 
 -   Class: "factor"
 -   Attributes: "levels", or the set of allowed values
 
 ```{r factor}
+colors <- c('red', 'blue', 'green', 'red', 'red', 'green')
 # Build a factor
 a_factor <- factor(
   # values
-  x = c(1, 2, 3),
+  x = colors,
   # exhaustive list of values
-  levels = c(1, 2, 3, 4)
+  levels = c('red', 'blue', 'green', 'yellow')
 )
 
-# Inspect
-a_factor
+# Useful when some possible values are not present in the data
+table(colors)
+table(a_factor)
 
-# Dissect
 # - type
 typeof(a_factor)
+class(a_factor)
 
 # - attributes
 attributes(a_factor)
 ```
 
-Factors can be ordered. This can be useful for models or visaulations where order matters.
+Factors can be ordered. This can be useful for models or visualizations where order matters.
 
 ```{r factor_ordered}
-# Build
+
+values <- c('high', 'med', 'low', 'med', 'high', 'low', 'med', 'high')
+
 ordered_factor <- ordered(
   # values
-  x = c(1, 2, 3),
+  x = values,
   # levels in ascending order
-  levels = c(4, 3, 2, 1)
+  levels = c('low', 'med', 'high')
 )
 
 # Inspect
 ordered_factor
+
+table(values)
+table(ordered_factor)
 ```
 
 ### Dates
@@ -385,8 +459,7 @@ Dates are:
 
 -   Double vectors
 -   With class "Date"
-
-The double component represents the number of days since since `1970-01-01`
+-   No other attributes
 
 ```{r dates}
 notes_date <- Sys.Date()
@@ -398,6 +471,13 @@ typeof(notes_date)
 attributes(notes_date)
 ```
 
+The double component represents the number of days since `1970-01-01`
+
+```{r days_since_1970}
+date <- as.Date("1970-02-01")
+unclass(date)
+```
+
 ### Date-times
 
 There are 2 Date-time representations in base R:
@@ -405,11 +485,11 @@ There are 2 Date-time representations in base R:
 -   POSIXct, where "ct" denotes calendar time
 -   POSIXlt, where "lt" designates local time.
 
-Let's focus on POSIXct because:
+We'll focus on POSIXct because:
 
 -   Simplest
--   Built on an atomic vector
--   Most apt to be in a data frame
+-   Built on an atomic (double) vector
+-   Most appropriate for use in a data frame
 
 Let's now build and deconstruct a Date-time
 
@@ -419,22 +499,30 @@ note_date_time <- as.POSIXct(
   # time
   x = Sys.time(),
   # time zone, used only for formatting
-  tz = "EDT"
+  tz = "America/New_York"
 )
 
 # Inspect
 note_date_time
 
-# Dissect
 # - type
 typeof(note_date_time)
+
 # - attributes
 attributes(note_date_time)
+
+structure(note_date_time, tzone = "Europe/Paris")
 ```
 
+```{r date_time_format}
+date_time <- as.POSIXct("2024-02-22 12:34:56", tz = "EST")
+unclass(date_time)
+```
+
+
 ### Durations
 
-Durations are:
+Durations represent the amount of time between pairs of dates or date-times. They are:
 
 -   Double vectors
 -   Class: "difftime"
@@ -443,7 +531,6 @@ Durations are:
 ```{r durations}
 # Construct
 one_minute <- as.difftime(1, units = "mins")
-
 # Inspect
 one_minute
 
@@ -454,6 +541,12 @@ typeof(one_minute)
 attributes(one_minute)
 ```
 
+```{r durations_math}
+# `date` was set to 1970-02-01 above, so this is the time elapsed since that day
+time_since_1970_02_01 <- notes_date - date
+time_since_1970_02_01
+```
+
+
 See also:
 
 -   [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html)
@@ -461,7 +554,8 @@ See also:
 
 ## Lists
 
-Sometimes called a generic vector, a list can be composed of elements of different types.
+- Sometimes called a generic vector or recursive vector
+- Can be composed of elements of different types (as opposed to atomic vectors, whose elements must all be of one type)
 
 ### Constructing
 
@@ -480,12 +574,40 @@ simple_list <- list(
   c("primo", "secundo", "tercio")
 )
 
+simple_list
+
 # Inspect
 # - type
 typeof(simple_list)
 # - structure
 str(simple_list)
 
+# Accessing
+simple_list[1]
+simple_list[2]
+simple_list[3]
+simple_list[4]
+
+simple_list[[1]][2]
+simple_list[[2]][8]
+simple_list[[3]][2]
+simple_list[[4]][3]
+```
+
+Even Simpler List
+
+```{r list_simpler}
+# Construct
+simpler_list <- list(TRUE, FALSE, 
+                    1, 2, 3, 4, 5, 
+                    1.2, 2.3, 3.4, 
+                    "primo", "secundo", "tercio")
+
+# Accessing
+simpler_list[1]
+simpler_list[5]
+simpler_list[9]
+simpler_list[11]
 ```
 
 Nested lists:
@@ -518,6 +640,10 @@ list_comb2 <- c(list(1, 2), list(3, 4))
 # compare structure
 str(list_comb1)
 str(list_comb2)
+
+# does this work if they are different data types?
+list_comb3 <- c(list(1, 2), list(TRUE, FALSE))
+str(list_comb3)
 ```
 
 ### Testing
@@ -544,22 +670,40 @@ rlang::is_vector(list_comb2)
 
 ### Coercion
 
+Use `as.list()`
+
+```{r list_coercion}
+list(1:3)
+as.list(1:3)
+```
+
+## Matrices and arrays
+
+Although not often used, the dimension attribute can be added to create list-matrices or list-arrays.
+
+```{r list_matrices_arrays}
+l <- list(1:3, "a", TRUE, 1.0)
+dim(l) <- c(2, 2)
+l
+
+l[[1, 1]]
+```
+
 ## Data frames and tibbles
 
--   The vector family tree revisited.
--   Meet the children of lists
+ 
 
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
 
 ### Data frame
 
 A data frame is a:
 
 -   Named list of vectors (i.e., column names)
--   Class: "data frame"
 -   Attributes:
     -   (column) `names`
-    -   \`row.names\`\`
+    -   `row.names`
+    -   Class: "data.frame"
 
 ```{r data_frame}
 # Construct
@@ -567,9 +711,8 @@ df <- data.frame(
   # named atomic vector
   col1 = c(1, 2, 3),
   # another named atomic vector
-  col2 = c("un", "deux", "trois"),
-  # not necessary after R 4.1 (?)
-  stringsAsFactors = FALSE
+  col2 = c("un", "deux", "trois")
+  # ,stringsAsFactors = FALSE # the default since R 4.0.0
 )
 
 # Inspect
@@ -582,14 +725,24 @@ typeof(df)
 attributes(df)
 ```
 
+```{r df_functions}
+rownames(df)
+colnames(df)
+names(df) # Same as colnames(df)
+
+nrow(df) 
+ncol(df)
+length(df) # Same as ncol(df)
+```
+
 Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame).
 
 ### Tibble
 
-As compared to data frames, tibbles are data frames that are:
+Created to relieve some of the frustrations and pain points created by data frames, tibbles are data frames that are:
 
--   Lazy
--   Surly
+-   Lazy (do less)
+-   Surly (complain more)
 
 #### Lazy
 
@@ -699,6 +852,64 @@ names(tbl)
 tbl$col
 ```
 
+#### One more difference
+
+**`tibble()` allows you to refer to variables created during construction**
+
+```{r df_tibble_diff}
+tibble::tibble(
+  x = 1:3,
+  y = x * 2 # x refers to the line above
+)
+```
+
+### Row Names
+
+- character vector containing only unique values
+- get and set with `rownames()`
+- can use them to subset rows
+
+```{r row_names}
+df3 <- data.frame(
+  age = c(35, 27, 18),
+  hair = c("blond", "brown", "black"),
+  row.names = c("Bob", "Susan", "Sam")
+)
+df3
+
+rownames(df3)
+df3["Bob", ]
+
+rownames(df3) <- c("Susan", "Bob", "Sam")
+rownames(df3)
+df3["Bob", ]
+```
+
+There are three reasons why row names are undesirable:
+
+1. Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea. 
+2. Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
+3. Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.
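+
+The third point can be seen in a small sketch. The data frame `people` and the resample `resampled` are hypothetical names chosen for illustration. Repeating a row forces R to invent new row names to keep them unique.
+
+```{r row_names_duplicates_sketch}
+# hypothetical data frame with row names
+people <- data.frame(
+  age = c(35, 27, 18),
+  row.names = c("Bob", "Susan", "Sam")
+)
+
+# select the first row three times, as a bootstrap resample might
+resampled <- people[c(1, 1, 1), , drop = FALSE]
+rownames(resampled)  # "Bob" "Bob.1" "Bob.2"
+```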
+
+### Printing
+
+Data frames and tibbles print differently.
+
+```{r df_tibble_print}
+df3
+tibble::as_tibble(df3)
+```
+
+
+### Subsetting
+
+Two undesirable subsetting behaviours:
+
+1. When you subset columns with `df[, vars]`, you will get a vector if vars selects one variable, otherwise you’ll get a data frame, unless you always remember to use `df[, vars, drop = FALSE]`.
+2. When you attempt to extract a single column with `df$x` and there is no column `x`, a data frame will instead select any variable that starts with `x`. If no variable starts with `x`, `df$x` will return NULL.
+
+Tibbles tweak these behaviours so that `[` always returns a tibble, and `$` doesn’t do partial matching and warns if it can’t find a variable (*this is what makes tibbles surly*).
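+
+A sketch of the two base data frame behaviours described above, using a hypothetical data frame `partial_df` with made-up column names. It uses base R only, so it shows the data frame side, not the tibble side.
+
+```{r df_subsetting_sketch}
+# hypothetical data frame for illustration
+partial_df <- data.frame(xyz = c("a", "b"), n = 1:2)
+
+# 1. selecting one column with [ drops to a vector unless drop = FALSE
+str(partial_df[, "xyz"])
+str(partial_df[, "xyz", drop = FALSE])
+
+# 2. $ partially matches column names and returns NULL when nothing matches
+partial_df$x  # matches the column xyz
+partial_df$q  # NULL
+```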
+
 ### Testing
 
 Whether data frame: `is.data.frame()`. Note: both data frame and tibble are data frames.
@@ -710,6 +921,40 @@ Whether tibble: `tibble::is_tibble`. Note: only tibbles are tibbles. Vanilla dat
 -   To data frame: `as.data.frame()`
 -   To tibble: `tibble::as_tibble()`
 
+### List Columns
+
+List-columns are allowed in data frames, but you have to do a little extra work: either add the list-column after creation or wrap the list in `I()`.
+
+```{r list_columns}
+df4 <- data.frame(x = 1:3)
+df4$y <- list(1:2, 1:3, 1:4)
+df4
+
+df5 <- data.frame(
+  x = 1:3, 
+  y = I(list(1:2, 1:3, 1:4))
+)
+df5
+```
+
+### Matrix and data frame columns
+
+- As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
+- As with list-columns, you must either add the matrix or data frame column after creation or wrap it in `I()`.
+
+```{r matrix_df_columns}
+dfm <- data.frame(
+  x = 1:3 * 10,
+  y = I(matrix(1:9, nrow = 3))
+)
+
+dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)
+
+str(dfm)
+dfm$y
+dfm$z
+```
+
 ## `NULL`
 
 Special type of object that:
@@ -726,6 +971,8 @@ length(NULL)
 
 x <- NULL
 attr(x, "y") <- 1
+
+is.null(NULL)
 ```
 
 ## Meeting Videos