commit 543d8954550390f0184229d51c4e6383dc2262bb
parent ec7c26d7f08256bcb7006b6189bd11cba6f5b5fc
Author: Betsy Rosalen <betsy@mylittleuniverse.com>
Date: Thu, 22 Feb 2024 16:01:03 -0500
Cohort 8 chapter 3 (#57)
* Chapter 3 Vectors Notes Updated
* Add names to all code chunks in Chapter 3
* Remove Chapter 3 yaml header that got added by mistake
* Fixed a few minor formatting errors in Chapter 3
Diffstat:
M | 03_Vectors.Rmd | | | 491 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------- |
1 file changed, 369 insertions(+), 122 deletions(-)
diff --git a/03_Vectors.Rmd b/03_Vectors.Rmd
@@ -2,58 +2,73 @@
**Learning objectives:**
-- Learn about different types of vectors
+- Learn about different types of vectors and their attributes
- Learn how these types relate to one another
-## Types of vectors
+## Types of Vectors
-The family tree of vectors:
+
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
-- **Atomic.** Elements all the same type.
-- **List.** Elements are different Types.
-- **NULL** Null elements. Length zero.
+Two main types:
-## Atomic vectors
+- **Atomic**: elements are all the same type.
+- **List**: elements can be of different types.
-### Types
+Closely related but not technically a vector:
-- The vector family tree revisited.
-- Meet the children of atomic vectors
+- **NULL**: not a vector itself, but often serves as a generic zero-length vector.
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+## Atomic Vectors
-### Length one
+### Types of atomic vectors
-"Scalars" that consist of a single value.
+
+
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+
+- **Logical**: `TRUE`/`FALSE`
+- **Integer**: Numeric (discrete, no decimals)
+- **Double**: Numeric (continuous, decimals)
+- **Character**: String
+
+### Vectors of Length One
+
+**Scalars** are vectors that consist of a single value.
+
+#### Logicals:
```{r vec_lgl}
-# Logicals
lgl1 <- TRUE
lgl2 <- T
+lgl3 <- FALSE
+lgl4 <- F
```
+#### Doubles:
+
```{r vec_dbl}
-# Doubles
# integer, decimal, scientific, or hexadecimal format
dbl1 <- 1
-dbl2 <- 1.234
-dbl3 <- 1.234e0
-dbl4 <- 0xcafe
+dbl2 <- 1.234 # decimal
+dbl3 <- 1.234e0 # scientific format
+dbl4 <- 0xcafe # hexadecimal format
```
+#### Integers: must be followed by L and cannot have fractional values
+
```{r vec_int}
-# Integers
# Note: L denotes an integer
int1 <- 1L
-int2 <- 1.234L
-int3 <- 1.234e0L
+int2 <- 1234L
+int3 <- 1234e0L
int4 <- 0xcafeL
```
+#### Strings: can use single or double quotes; special characters are escaped with `\`
+
```{r vec_str}
-# Strings
str1 <- "hello" # double quotes
str2 <- 'hello' # single quotes
str3 <- "مرحبًا" # Unicode
@@ -62,13 +77,15 @@ str4 <- "\U0001f605" # sweaty_smile
### Longer
-Several ways to make longer:
+Several ways to make longer vectors:
-**1. With single values**
+**1. With single values** inside `c()` (short for combine).
```{r long_single}
-lgl_vec <- c(TRUE, FALSE)
-
+lgl_var <- c(TRUE, FALSE)
+int_var <- c(1L, 6L, 10L)
+dbl_var <- c(1, 2.5, 4.5)
+chr_var <- c("these are", "some strings")
```
**2. With other vectors**
@@ -97,6 +114,22 @@ They look to do both more and less than `c()`.
Note: currently has `questioning` lifecycle badge, since these constructors may get moved to `vctrs`
+#### Determine Type and Length
+
+Determine the type of a vector with `typeof()` and its length with `length()`.
+
+```{r type_length}
+typeof(lgl_var)
+typeof(int_var)
+typeof(dbl_var)
+typeof(chr_var)
+
+length(lgl_var)
+length(int_var)
+length(dbl_var)
+length(chr_var)
+```
+
### Missing values
**Contagion**
@@ -110,7 +143,14 @@ sum(c(1, 2, NA, 3))
# inoculate
sum(c(1, 2, NA, 3), na.rm = TRUE)
+```
+
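+To find missing values, use `is.na()`. Comparing with `== NA` does not work: any comparison with `NA` is itself unknown, so it returns `NA` for every element.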
+```{r na_search, error=TRUE}
+x <- c(NA, 5, NA, 10)
+x == NA
+is.na(x)
```
**Types**
@@ -143,7 +183,9 @@ Don't test objects with these tools:
- `is.vector()`
- `is.atomic()`
-- `is.numeric()`
+- `is.numeric()`
+
+They don’t test if you have a vector, atomic vector, or numeric vector; you’ll need to carefully read the documentation to figure out what they actually do.
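
A minimal sketch of why these tests can surprise, assuming base R only; the object names are illustrative and not part of the original notes. `is.vector()` checks for the absence of attributes other than names, and `is.numeric()` is `FALSE` for a factor even though a factor is built on an integer vector.

```{r is_vector_pitfalls}
# A plain named double vector passes is.vector()
x <- c(a = 1, b = 2)
plain_result <- is.vector(x)

# Adding any attribute other than names makes is.vector() return FALSE
attr(x, "note") <- "metadata"
with_attr_result <- is.vector(x)

# A factor is stored as an integer vector, yet is.numeric() says FALSE
fct_result <- is.numeric(factor(c("a", "b")))

c(plain = plain_result, with_attr = with_attr_result, factor = fct_result)
```
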
Instead, maybe, use `{rlang}`
@@ -169,20 +211,20 @@ R follows rules for coercion: character → double → integer → logical
R can coerce either automatically or explicitly
-**Automatic**
+#### **Automatic**
Two contexts for automatic coercion:
1. Combination
2. Mathematical
-Combination:
+##### Coercion by Combination:
```{r coerce_c}
str(c(TRUE, "TRUE"))
```
-Mathematical operations
+##### Coercion by Mathematical operations:
```{r coerce_math}
# imagine a logical vector about whether an attribute is present
@@ -192,7 +234,7 @@ has_attribute <- c(TRUE, FALSE, TRUE, TRUE)
sum(has_attribute)
```
-**Explicit**
+#### **Explicit**
Use `as.*()`
@@ -201,7 +243,14 @@ Use `as.*()`
- Double: `as.double()`
- Character: `as.character()`
-But note that coercions may fail in one of two ways, or both:
+```{r explicit_coercion}
+dbl_var
+as.integer(dbl_var)
+lgl_var
+as.character(lgl_var)
+```
+
+But note that coercion may fail in one of two ways, or both:
- With warning/error
- NAs
@@ -212,37 +261,23 @@ as.integer(c(1, 2, "three"))
## Attributes
-- What
-- How
-- Why
-
-### What
-
-Two perspectives:
+Attributes are name-value pairs that attach metadata to an object (such as a vector).
-- Name-value pairs
-- Metadata
+### What?
-**Name-value pairs**
+**Name-value pairs** - attributes have a name and a value
-Formally, attributes have a name and a value.
+**Metadata** - Not data itself, but data about the data
-**Metadata**
+### How?
-- Not data itself
-- But data about the data
+#### Getting and Setting
-### How
+Three functions:
-Two operations:
-
-1. Get
-2. Set
-
-Two cases:
-
-1. Single attribute
-2. Multiple attributes
+1. retrieve and modify single attributes with `attr()`
+2. retrieve en masse with `attributes()`
+3. set en masse with `structure()`
**Single attribute**
@@ -253,10 +288,10 @@ Use `attr()`
a <- c(1, 2, 3)
# set attribute
-attr(x = a, which = "some_attribute_name") <- "some attribute"
+attr(x = a, which = "attribute_name") <- "some attribute"
# get attribute
-attr(x = a, which = "some_attribute_name")
+attr(a, "attribute_name")
```
**Multiple attributes**
@@ -269,114 +304,153 @@ b <- c(4, 5, 6)
# set
b <- structure(
.Data = b,
- attrib1 = "one",
- attrib2 = "two"
+ attrib1_name = "first_attribute",
+ attrib2_name = "second_attribute"
)
# get
+attributes(b)
str(attributes(b))
```
### Why
-Two common use cases:
+Three particularly important attributes:
+
+1. **names** - a character vector giving each element a name
+2. **dim** - short for dimensions; turns vectors into matrices and arrays
+3. **class** - powers the S3 object system (we'll learn more about this in chapter 13)
-- Names
-- Dimensions
+Most attributes are lost by most operations. Only two attributes are routinely preserved: names and dimension.
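
A short sketch of attributes being dropped, assuming base R only; the vector `attr_demo` and its `note` attribute are made up for illustration. Subsetting keeps the names but drops the custom attribute, and a summary such as `sum()` returns a value with no attributes at all.

```{r attr_lost}
# A vector with names plus one custom attribute
attr_demo <- structure(1:3, names = c("x", "y", "z"), note = "custom attribute")
attributes(attr_demo)

# Subsetting keeps names but drops the custom attribute
after_subset <- attributes(attr_demo[1])
after_subset

# Summarising drops everything
after_sum <- attributes(sum(attr_demo))
after_sum
```
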
-**Names**
+#### **Names**
~~Three~~ Four ways to name:
-```{r}
-# 1. At creation
-one <- c(one = 1, two = 2, three = 3)
+```{r names}
+# 1. When creating it:
+x <- c(A = 1, B = 2, C = 3)
+x
-# 2. By assigning a character vector of names
-two <- c(1, 2, 3)
-names(two) <- c("one", "two", "three")
+# 2. By assigning a character vector to names()
+y <- 1:3
+names(y) <- c("a", "b", "c")
+y
-# 3. By setting names--with base R
-three <- c(1, 2, 3)
-stats::setNames(
- object = three,
- nm = c("One", "Two", "Three")
-)
+# 3. Inline, with setNames():
+z <- setNames(1:3, c("one", "two", "three"))
+z
# 4. By setting names--with {rlang}
+a <- 1:3
rlang::set_names(
- x = three,
+ x = a,
nm = c("One", "Two", "Three")
)
+
```
+You can remove names from a vector by using `x <- unname(x)` or `names(x) <- NULL`.
Thematically but not directly related: labelled class vectors with `haven::labelled()`
-**Dimensions**
+#### **Dimensions**
-Important for arrays and matrices.
+Create matrices and arrays with `matrix()` and `array()`, or by using the assignment form of `dim()`:
-```{r}
-# length 6 vector spread across 2 rows of 3 columns
-matrix(1:6, nrow = 2, ncol = 3)
+```{r dimensions}
+# Two scalar arguments specify row and column sizes
+x <- matrix(1:6, nrow = 2, ncol = 3)
+x
+
+# One vector argument to describe all dimensions
+y <- array(1:12, c(2, 3, 2))
+y
+
+# You can also modify an object in place by setting dim()
+z <- 1:6
+dim(z) <- c(3, 2)
+z
```
-## S3 atomic vectors
+##### Functions for working with vectors, matrices and arrays:
+
+Vector | Matrix | Array
+:----- | :---------- | :-----
+`names()` | `rownames()`, `colnames()` | `dimnames()`
+`length()` | `nrow()`, `ncol()` | `dim()`
+`c()` | `rbind()`, `cbind()` | `abind::abind()`
+— | `t()` | `aperm()`
+`is.null(dim(x))` | `is.matrix()` | `is.array()`
+
+## **Class** - S3 atomic vectors
-- The vector family tree revisited.
-- Meet the children of typed atomic vectors
+
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
-This list could (more easily) be expanded to new vector types with [`{vctrs}`](https://vctrs.r-lib.org/). See [rstudio::conf(2019) talk on the package around 18:27](https://www.rstudio.com/resources/rstudioconf-2019/vctrs-tools-for-making-size-and-type-consistent-functions/). See also [rstudio::conf(2020) talk on new vector types for dealing with non-decimal currencies](https://www.rstudio.com/resources/rstudioconf-2020/vctrs-creating-custom-vector-classes-with-the-vctrs-package/).
+**Having a class attribute turns an object into an S3 object.**
-What makes S3 atomic vectors different than their parents?
+What makes S3 atomic vectors different?
-Two things:
+1. behave differently from a regular vector when passed to a generic function
+2. often store additional information in other attributes
-1. Class
-2. Attributes (typically)
+Four important S3 vectors used in base R:
+
+1. **Factors** (categorical data)
+2. **Dates**
+3. **Date-times** (POSIXct)
+4. **Durations** (difftime)
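
The first point, different behaviour under a generic function, can be sketched with a Date. This is an illustrative example assuming base R only; `s3_date` is a made-up name. With the class attribute in place, `format()` dispatches to the Date method; once the class is removed, only the underlying double remains.

```{r s3_dispatch_sketch}
s3_date <- as.Date("1970-01-10")

# With the class attribute, generics use the Date method
class(s3_date)
formatted <- format(s3_date)
formatted

# Without the class attribute, it is just a double:
# the number of days since 1970-01-01
bare_value <- unclass(s3_date)
typeof(bare_value)
bare_value
```
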
### Factors
+A factor is a vector used to store categorical data that can contain only predefined values.
+
Factors are integer vectors with:
- Class: "factor"
- Attributes: "levels", or the set of allowed values
```{r factor}
+colors <- c('red', 'blue', 'green', 'red', 'red', 'green')
# Build a factor
a_factor <- factor(
# values
- x = c(1, 2, 3),
+ x = colors,
# exhaustive list of values
- levels = c(1, 2, 3, 4)
+ levels = c('red', 'blue', 'green', 'yellow')
)
-# Inspect
-a_factor
+# Useful when some possible values are not present in the data
+table(colors)
+table(a_factor)
-# Dissect
# - type
typeof(a_factor)
+class(a_factor)
# - attributes
attributes(a_factor)
```
-Factors can be ordered. This can be useful for models or visaulations where order matters.
+Factors can be ordered. This can be useful for models or visualizations where order matters.
```{r factor_ordered}
-# Build
+
+values <- c('high', 'med', 'low', 'med', 'high', 'low', 'med', 'high')
+
ordered_factor <- ordered(
# values
- x = c(1, 2, 3),
+ x = values,
# levels in ascending order
- levels = c(4, 3, 2, 1)
+ levels = c('low', 'med', 'high')
)
# Inspect
ordered_factor
+
+table(values)
+table(ordered_factor)
```
### Dates
@@ -385,8 +459,7 @@ Dates are:
- Double vectors
- With class "Date"
-
-The double component represents the number of days since since `1970-01-01`
+- No other attributes
```{r dates}
notes_date <- Sys.Date()
@@ -398,6 +471,13 @@ typeof(notes_date)
attributes(notes_date)
```
+The double component represents the number of days since `1970-01-01`.
+
+```{r days_since_1970}
+date <- as.Date("1970-02-01")
+unclass(date)
+```
+
### Date-times
There are 2 Date-time representations in base R:
@@ -405,11 +485,11 @@ There are 2 Date-time representations in base R:
- POSIXct, where "ct" denotes calendar time
- POSIXlt, where "lt" designates local time.
-Let's focus on POSIXct because:
+We'll focus on POSIXct because:
- Simplest
-- Built on an atomic vector
-- Most apt to be in a data frame
+- Built on an atomic (double) vector
+- Most appropriate for use in a data frame
Let's now build and deconstruct a Date-time
@@ -419,22 +499,30 @@ note_date_time <- as.POSIXct(
# time
x = Sys.time(),
# time zone, used only for formatting
- tz = "EDT"
+ tz = "America/New_York"
)
# Inspect
note_date_time
-# Dissect
# - type
typeof(note_date_time)
+
# - attributes
attributes(note_date_time)
+
+structure(note_date_time, tzone = "Europe/Paris")
```
+```{r date_time_format}
+date_time <- as.POSIXct("2024-02-22 12:34:56", tz = "EST")
+unclass(date_time)
+```
+
+
### Durations
-Durations are:
+Durations represent the amount of time between pairs of dates or date-times. They are:
- Double vectors
- Class: "difftime"
@@ -443,7 +531,6 @@ Durations are:
```{r durations}
# Construct
one_minute <- as.difftime(1, units = "mins")
-
# Inspect
one_minute
@@ -454,6 +541,12 @@ typeof(one_minute)
attributes(one_minute)
```
+```{r durations_math}
+# `date` was set to 1970-02-01 above, so name the result accordingly
+time_since_1970_02_01 <- notes_date - date
+time_since_1970_02_01
+```
+
+
See also:
- [`lubridate::make_difftime()`](https://lubridate.tidyverse.org/reference/make_difftime.html)
@@ -461,7 +554,8 @@ See also:
## Lists
-Sometimes called a generic vector, a list can be composed of elements of different types.
+- Sometimes called a generic vector or recursive vector
+- Can be composed of elements of different types (unlike atomic vectors, whose elements must all be of one type)
### Constructing
@@ -480,12 +574,40 @@ simple_list <- list(
c("primo", "secundo", "tercio")
)
+simple_list
+
# Inspect
# - type
typeof(simple_list)
# - structure
str(simple_list)
+# Accessing
+simple_list[1]
+simple_list[2]
+simple_list[3]
+simple_list[4]
+
+simple_list[[1]][2]
+simple_list[[2]][8]
+simple_list[[3]][2]
+simple_list[[4]][3]
+```
+
+Even simpler list:
+
+```{r list_simpler}
+# Construct
+simpler_list <- list(TRUE, FALSE,
+ 1, 2, 3, 4, 5,
+ 1.2, 2.3, 3.4,
+ "primo", "secundo", "tercio")
+
+# Accessing
+simpler_list[1]
+simpler_list[5]
+simpler_list[9]
+simpler_list[11]
```
Nested lists:
@@ -518,6 +640,10 @@ list_comb2 <- c(list(1, 2), list(3, 4))
# compare structure
str(list_comb1)
str(list_comb2)
+
+# does this work if they are different data types?
+list_comb3 <- c(list(1, 2), list(TRUE, FALSE))
+str(list_comb3)
```
### Testing
@@ -544,22 +670,40 @@ rlang::is_vector(list_comb2)
### Coercion
+Use `as.list()`
+
+```{r list_coercion}
+list(1:3)
+as.list(1:3)
+```
+
+### Matrices and arrays
+
+Although not often used, the dimension attribute can be added to create list-matrices or list-arrays.
+
+```{r list_matrices_arrays}
+l <- list(1:3, "a", TRUE, 1.0)
+dim(l) <- c(2, 2)
+l
+
+l[[1, 1]]
+```
+
## Data frames and tibbles
-- The vector family tree revisited.
-- Meet the children of lists
+
- Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
+Credit: [Advanced R](https://adv-r.hadley.nz/index.html) by Hadley Wickham
### Data frame
A data frame is a:
- Named list of vectors (i.e., column names)
-- Class: "data frame"
- Attributes:
- (column) `names`
- - \`row.names\`\`
+ - `row.names`
+ - Class: "data.frame"
```{r data_frame}
# Construct
@@ -567,9 +711,8 @@ df <- data.frame(
# named atomic vector
col1 = c(1, 2, 3),
# another named atomic vector
- col2 = c("un", "deux", "trois"),
- # not necessary after R 4.1 (?)
- stringsAsFactors = FALSE
+ col2 = c("un", "deux", "trois")
+ # stringsAsFactors = FALSE is omitted because it is the default since R 4.0.0
)
# Inspect
@@ -582,14 +725,24 @@ typeof(df)
attributes(df)
```
+```{r df_functions}
+rownames(df)
+colnames(df)
+names(df) # Same as colnames(df)
+
+nrow(df)
+ncol(df)
+length(df) # Same as ncol(df)
+```
+
Unlike other lists, the length of each vector must be the same (i.e. as many vector elements as rows in the data frame).
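
The equal-length rule can be seen directly. A minimal sketch, assuming base R only, with illustrative object names: `data.frame()` recycles a shorter input only when it fits a whole number of times, and otherwise stops with an error, whereas a plain list accepts elements of any length.

```{r df_length_rule}
# A plain list is happy with elements of different lengths
ragged_list <- list(x = 1:4, y = 1:3)
lengths(ragged_list)

# data.frame() recycles y because 2 divides 4 evenly
recycled <- data.frame(x = 1:4, y = 1:2)
nrow(recycled)

# Lengths that cannot be recycled raise an error, captured here as a message
mismatch <- tryCatch(
  data.frame(x = 1:4, y = 1:3),
  error = function(e) conditionMessage(e)
)
mismatch
```
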
### Tibble
-As compared to data frames, tibbles are data frames that are:
+Designed to relieve some of the frustrations and pain points of data frames, tibbles are data frames that are:
-- Lazy
-- Surly
+- Lazy (do less)
+- Surly (complain more)
#### Lazy
@@ -699,6 +852,64 @@ names(tbl)
tbl$col
```
+#### One more difference
+
+**`tibble()` allows you to refer to variables created during construction**
+
+```{r df_tibble_diff}
+tibble::tibble(
+ x = 1:3,
+ y = x * 2 # x refers to the line above
+)
+```
+
+### Row Names
+
+- character vector containing only unique values
+- get and set with `rownames()`
+- can use them to subset rows
+
+```{r row_names}
+df3 <- data.frame(
+ age = c(35, 27, 18),
+ hair = c("blond", "brown", "black"),
+ row.names = c("Bob", "Susan", "Sam")
+)
+df3
+
+rownames(df3)
+df3["Bob", ]
+
+rownames(df3) <- c("Susan", "Bob", "Sam")
+rownames(df3)
+df3["Bob", ]
+```
+
+There are three reasons why row names are undesirable:
+
+1. Metadata is data, so storing it in a different way to the rest of the data is fundamentally a bad idea.
+2. Row names are a poor abstraction for labelling rows because they only work when a row can be identified by a single string. This fails in many cases.
+3. Row names must be unique, so any duplication of rows (e.g. from bootstrapping) will create new row names.
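
The last point can be illustrated with a small sketch, assuming base R only; `people` and `resampled` are made-up names that mirror the example above. Repeating a row, as a bootstrap sample would, forces R to invent new row names to keep them unique.

```{r row_names_dup}
people <- data.frame(
  age = c(35, 27, 18),
  hair = c("blond", "brown", "black"),
  row.names = c("Bob", "Susan", "Sam")
)

# Select the first row three times, as resampling with replacement might
resampled <- people[c(1, 1, 1), ]

# The duplicated rows get suffixed names, so "Bob" no longer labels them all
rownames(resampled)
```
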
+
+### Printing
+
+Data frames and tibbles print differently
+
+```{r df_tibble_print}
+df3
+tibble::as_tibble(df3)
+```
+
+
+### Subsetting
+
+Two undesirable subsetting behaviours:
+
+1. When you subset columns with `df[, vars]`, you will get a vector if vars selects one variable, otherwise you’ll get a data frame, unless you always remember to use `df[, vars, drop = FALSE]`.
+2. When you attempt to extract a single column with `df$x` and there is no column `x`, a data frame will instead select any variable that starts with `x`. If no variable starts with `x`, `df$x` will return NULL.
+
+Tibbles tweak these behaviours so that `[` always returns a tibble, and `$` doesn’t do partial matching and warns if it can’t find a variable (*this is what makes tibbles surly*).
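
Both data frame behaviours can be reproduced with a small sketch, assuming base R only; `df_sub` and its columns are hypothetical. The tibble side is left out so the example runs without extra packages.

```{r df_subset_pitfalls}
df_sub <- data.frame(xyz = 1:3, abc = c("a", "b", "c"))

# 1. Selecting one column with [ , ] silently drops to a vector ...
one_col <- df_sub[, "xyz"]
class(one_col)

# ... unless drop = FALSE is given
kept_df <- df_sub[, "xyz", drop = FALSE]
class(kept_df)

# 2. $ partially matches: there is no column `x`, so `xyz` is returned
partial <- df_sub$x
partial

# With no match at all, $ quietly returns NULL
no_match <- df_sub$q
no_match
```
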
+
### Testing
Whether data frame: `is.data.frame()`. Note: both data frame and tibble are data frames.
@@ -710,6 +921,40 @@ Whether tibble: `tibble::is_tibble`. Note: only tibbles are tibbles. Vanilla dat
- To data frame: `as.data.frame()`
- To tibble: `tibble::as_tibble()`
+### List Columns
+
+List-columns are allowed in data frames but you have to do a little extra work by either adding the list-column after creation or wrapping the list in `I()`
+
+```{r list_columns}
+df4 <- data.frame(x = 1:3)
+df4$y <- list(1:2, 1:3, 1:4)
+df4
+
+df5 <- data.frame(
+ x = 1:3,
+ y = I(list(1:2, 1:3, 1:4))
+)
+df5
+```
+
+### Matrix and data frame columns
+
+- As long as the number of rows matches the data frame, it’s also possible to have a matrix or data frame as a column of a data frame.
+- As with list-columns, you must either add the column after creation or wrap it in `I()`.
+
+```{r matrix_df_columns}
+dfm <- data.frame(
+ x = 1:3 * 10,
+ y = I(matrix(1:9, nrow = 3))
+)
+
+dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)
+
+str(dfm)
+dfm$y
+dfm$z
+```
+
## `NULL`
Special type of object that:
@@ -726,6 +971,8 @@ length(NULL)
x <- NULL
attr(x, "y") <- 1
+
+is.null(NULL)
```
## Meeting Videos