bookclub-advr

DSLC Advanced R Book Club
git clone https://git.eamoncaddigan.net/bookclub-advr.git

09.Rmd (12426B)


      1 ---
      2 engine: knitr
      3 title: Functionals
      4 ---
      5 
      6 ## Learning objectives:
      7 
      8 - Define functionals.
      9 - Use the `purrr::map()` family of functionals.
     10 - Use the `purrr::walk()` family of functionals.
     11 - Use the `purrr::reduce()` and `purrr::accumulate()` family of functionals.
     12 - Use `purrr::safely()` and `purrr::possibly()` to deal with failure.
     13 
     14 9.1. **Introduction**
     15 
     16 9.2.  **map()**
     17 
     18 9.3. **purrr** style
     19 
     20 9.4. **map_** variants
     21 
9.5. **reduce()** and **accumulate()** family of functions
     23 
     24 - Some functions that weren't covered
     25 
     26 
     27 ## What are functionals {-}
     28 
     29 ## Introduction 
     30 
__Functionals__ are functions that take a function as input and return a vector as output. Functionals you have probably used before include `apply()`, `lapply()`, and `tapply()`.
     32 
     33 
     34 - alternatives to loops
     35 
     36 - a functional is better than a `for` loop is better than `while` is better than `repeat`
     37 
     38 
     39 ### Benefits {-}
     40 
     41 
     42 - encourages function logic to be separated from iteration logic
     43 
     44 - can collapse into vectors/data frames easily
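
As a small illustration of separating function logic from iteration logic, the sketch below computes the column means of `mtcars` twice, once with a `for` loop and once with a functional. The helper name `col_mean` is made up for this example.

```r
library(purrr)

# Function logic: what to do with one column (hypothetical helper)
col_mean <- function(x) mean(x, na.rm = TRUE)

# Iteration with a for loop: allocation and indexing are mixed in
loop_means <- numeric(ncol(mtcars))
names(loop_means) <- names(mtcars)
for (i in seq_along(mtcars)) {
  loop_means[[i]] <- col_mean(mtcars[[i]])
}

# Iteration with a functional: only the "what" remains
map_means <- map_dbl(mtcars, col_mean)

all.equal(loop_means, map_means)
```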
     45 
     46 
     47 ## Map
     48 
`map()` takes two main arguments, a vector and a function. It calls the function on each element of the vector and returns a list. Any additional arguments are passed along to the function via `...`.
     50 
     51 ```{r,echo=FALSE,warning=FALSE,message=FALSE}
     52 knitr::include_graphics(path = 'images/9_2_3_map-arg.png')
     53 ```
     54 
```{r}
simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}
```
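
To see how the extra arguments travel through `...`, here is a minimal usage sketch of `simple_map()` (repeated so the block runs on its own). The input list is invented for illustration, and the result is compared with `lapply()`.

```r
simple_map <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

# Made-up input: two numeric vectors, one containing a missing value
inputs <- list(c(1, NA, 3), c(4, 5, 6))

# na.rm = TRUE is not an argument of simple_map(); it is forwarded to mean()
res <- simple_map(inputs, mean, na.rm = TRUE)
str(res)

# Same result as the base R functional
identical(res, lapply(inputs, mean, na.rm = TRUE))
```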
     64 
     65 ## Benefit of using the map function in purrr {-}
     66 
     67 - `purrr::map()` is equivalent to `lapply()`
     68 
     69 - returns a list and is the most general
     70 
     71 - the length of the input == the length of the output
     72 
     73 - `map()` is more flexible, with additional arguments allowed
     74 
     75 - `map()` has a host of extensions
     76 
     77 
     78 
     79 ```{r load,echo=FALSE,warning=FALSE,message=FALSE}
     80 library(tidyverse)
     81 ```
     82 
     83 ## Atomic vectors {-}
     84 
     85 
- `map()` has 4 variants that return atomic vectors
     87     - `map_chr()`
     88     - `map_dbl()`
     89     - `map_int()`
     90     - `map_lgl()`
     91 
     92 ```{r}
     93 triple <- function(x) x * 3
     94 map(.x=1:3, .f=triple)
     95 
     96 map_dbl(.x=1:3, .f=triple)
     97 
     98 map_lgl(.x=c(1, NA, 3), .f=is.na)
     99 ```
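
The typed variants are strict about what `.f` returns: each call must produce a single value. The sketch below (with a hypothetical function `pair()` that returns two values) shows `map()` accepting the result while `map_dbl()` raises an error, which is captured here with `tryCatch()` so the block keeps running.

```r
library(purrr)

# Hypothetical function that returns two values per input
pair <- function(x) c(x, x)

# A list can hold results of any length
res_list <- map(1:2, pair)

# map_dbl() needs exactly one number per element, so this fails
res_dbl <- tryCatch(
  map_dbl(1:2, pair),
  error = function(e) "error"
)
res_dbl
```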
    100 
    101 ## Anonymous functions and shortcuts  {-}
    102 
    103  **Anonymous functions** 
    104 ```{r}
    105 map_dbl(.x=mtcars, .f=function(x) mean(x, na.rm = TRUE)) |> 
    106   head()
    107 ```
    108 
- the formula shortcut uses a "twiddle" (`~`) to create an anonymous function
- inside the formula, `.x` refers to the current element, as in `map(.x = ..., .f = ~ f(.x))`
    111 ```{r, eval=FALSE}
    112 map_dbl(.x=mtcars,  .f=~mean(.x, na.rm = TRUE))
    113 ```
    114 
    115 - can be simplified further as
    116 ```{r}
    117 map_dbl(.x=mtcars, .f=mean, na.rm = TRUE)
    118 ```
    119 
    120 - what happens when we try a handful of variants of the task at hand?  (how many unique values are there for each variable?)
    121 
    122 Note that `.x` is the **name** of the first argument in `map()` (`.f` is the name of the second argument).
    123 
    124 ```{r}
    125 #| error: true
    126 # the task
    127 map_dbl(mtcars, function(x) length(unique(x)))
    128 map_dbl(mtcars, function(unicorn) length(unique(unicorn)))
    129 map_dbl(mtcars, ~length(unique(.x)))
    130 map_dbl(mtcars, ~length(unique(..1)))
    131 map_dbl(mtcars, ~length(unique(.)))
    132 
    133 # not the task
    134 map_dbl(mtcars, length)
    135 map_dbl(mtcars, length(unique))
    136 map_dbl(mtcars, 1)
    137 ```
    138 
    139 ```{r}
    140 #| echo: false
    141 #| message: false
    142 #| warning: false
    143 rm(x)
    144 ```
    145 
    146 ```{r}
    147 #| error: true
    148 #error
    149 map_dbl(mtcars, length(unique()))
    150 map_dbl(mtcars, ~length(unique(x)))
    151 ```
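
For comparison, the same task can also be written with the `\(x)` lambda shorthand from base R (this assumes R 4.1 or later). The sketch below checks that the three spellings agree.

```r
library(purrr)

# Full anonymous function
n_unique_fun <- map_dbl(mtcars, function(x) length(unique(x)))

# Base R lambda shorthand (R >= 4.1)
n_unique_lambda <- map_dbl(mtcars, \(x) length(unique(x)))

# purrr formula shortcut
n_unique_tilde <- map_dbl(mtcars, ~ length(unique(.x)))

identical(n_unique_fun, n_unique_lambda)
identical(n_unique_fun, n_unique_tilde)
```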
    152 
    153 
    154 ## Modify {-}
    155 
Sometimes we want the output to have the same type as the input. In that case we can use `modify()` rather than `map()`.
    157 
    158 ```{r}
    159 df <- data.frame(x=1:3,y=6:4)
    160 
    161 map(df, .f=~.x*3)
    162 
    163 modify(.x=df,.f=~.x*3)
    164 ```
    165 
    166 Note that `modify()` always returns the same type of output (which is not necessarily true with `map()`).  Additionally, `modify()` does not actually change the value of `df`.
    167 
    168 ```{r}
    169 df
    170 ```
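
A quick way to check the difference is to inspect the class of each result. This sketch rebuilds the small data frame so that it runs on its own.

```r
library(purrr)

df <- data.frame(x = 1:3, y = 6:4)

mapped   <- map(df, ~ .x * 3)      # plain list
modified <- modify(df, ~ .x * 3)   # still a data frame

class(mapped)
class(modified)

# df itself is untouched unless the result is assigned back to it
df$x
```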
    171 
    172 
    173 ## `purrr` style
    174 
    175 ```{r}
    176 mtcars |> 
    177   map(head, 20) |> # pull first 20 of each column
    178   map_dbl(mean) |> # mean of each vector
    179   head()
    180 ```
    181 
    182 An example from `tidytuesday`
```{r, eval=FALSE}
#| warning: false
#| message: false

tt <- tidytuesdayR::tt_load("2020-06-30")

# filter data & exclude columns with lots of NAs
list_df <- 
  map(
    .x = tt[1:3], 
    .f = 
      ~ .x |> 
      filter(issue <= 152 | issue > 200) |> 
      mutate(timeframe = ifelse(issue <= 152, "first 5 years", "last 5 years")) |> 
      select_if(~mean(is.na(.x)) < 0.2) 
  )


# write to global environment
iwalk(
  .x = list_df,
  .f = ~ assign(x = .y, value = .x, envir = globalenv())
)
```
    207 
    208 ## `map_*()` variants 
    209 
    210 There are many variants
    211 
    212 ![](images/map_variants.png)
    213 
    214 
    215 ## `map2_*()` {-}
    216 
- raise each value `.x` to the power of 2
    218 
    219 ```{r}
    220 map_dbl(
    221   .x = 1:5, 
    222   .f = function(x) x ^ 2
    223 )
    224 ```
    225 
- raise each value `.x` to the power of another value `.y`
    227 
    228 ```{r}
    229 map2_dbl(
    230   .x = 1:5, 
    231   .y = 2:6, 
    232   .f = ~ (.x ^ .y)
    233 )
    234 ```
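
One detail worth checking is how `map2()` treats inputs of different lengths. In the sketch below a length-1 `.y` is recycled, while a length-2 `.y` against a length-5 `.x` raises an error (captured with `tryCatch()`).

```r
library(purrr)

# A single exponent is recycled across all of .x
squares <- map2_dbl(1:5, 2, ~ .x ^ .y)
squares

# Lengths 5 and 2 cannot be matched up, so this fails
mismatch <- tryCatch(
  map2_dbl(1:5, 1:2, ~ .x ^ .y),
  error = function(e) "error"
)
mismatch
```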
    235 
    236 
## Benefits of `map()` over the `apply()` family of functions {-}
    238 - It is written in C
    239 - It preserves names
    240 - We always know the return value type
    241 - We can apply the function for multiple input values
    242 - We can pass additional arguments into the function
    243 
    244 
    245 ## `walk()` {-}
    246 
    247 
- We use `walk()` when we want to call a function for its side effect(s) rather than its return value, like generating plots, `write.csv()`, or `ggsave()`. In those cases `map()` would also print a list of return values that you probably don't want.
    249 
    250 
    251 ```{r}
    252 map(1:3, ~cat(.x, "\n"))
    253 ```
    254 
    255 - for these cases, use `walk()` instead
    256 ```{r}
    257 walk(1:3, ~cat(.x, "\n"))
    258 ```
    259 
`cat()` does have a result (`NULL`); it's just returned invisibly.
    261 
    262 ```{r}
    263 cat("hello")
    264 
    265 (cat("hello"))
    266 ```
    267 
    268 
We can use `pwalk()` to save a list of plots to disk.  Note that the "p" in `pwalk()` means that we can supply any number of inputs (not just 1 or 2) to the function.  Also note that the name of the first argument in all of the "p" functions is `.l` (instead of `.x`), because the inputs are supplied as a list.
    270 
    271 ```{r}
    272 plots <- mtcars |>  
    273   split(mtcars$cyl) |>  
    274   map(~ggplot(.x, aes(mpg,wt)) +
    275         geom_point())
    276 
    277 paths <- stringr::str_c(names(plots), '.png')
    278 
    279 pwalk(.l = list(paths,plots), .f = ggsave, path = tempdir())
    280 pmap(.l = list(paths,plots), .f = ggsave, path = tempdir())
    281   
    282 ```
    283 
- `walk()`, `walk2()` and `pwalk()` all invisibly return their first argument (`.x` or `.l`). This makes them suitable for use in the middle of pipelines.

- note: it really is `.x` (or `.l`) that gets returned invisibly.  Hadley says:

> purrr provides the walk family of functions that ignore the return values of the `.f` and instead return `.x` invisibly.

The `NULL` values in the first `cat()` example come from `map()`, which collects the return value of every `cat()` call. `walk()` throws those away and hands back its input instead.
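
A short check of what `walk()` and `map()` each return when `.f` is `cat()`; assigning the results makes the invisible value easy to inspect.

```r
library(purrr)

# walk() prints via cat() and then returns its input, invisibly
out <- walk(1:3, ~ cat(.x, "\n"))
identical(out, 1:3)

# map() keeps the return value of each cat() call, which is NULL
nulls <- map(1:3, ~ cat(.x, "\n"))
map_lgl(nulls, is.null)
```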
    291 
    292 ## `imap()` {-}
    293 
- `imap()` is like `map2()` except that `.y` is derived from `names(.x)` if named or `seq_along(.x)` if not.
    295 
    296 - These two produce the same result
    297 
    298 ```{r}
    299 imap_chr(.x = mtcars, 
    300          .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
    301 head()
    302 
    303 map2_chr(.x = mtcars, 
    304          .y = names(mtcars),
    305          .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
    306 head()
    307 ```
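
The example above covers the named case. A minimal sketch of both cases, using made-up character vectors, shows what `.y` becomes when there are no names:

```r
library(purrr)

# Unnamed input: .y is the position
by_index <- imap_chr(c("a", "b", "c"), ~ paste0(.y, ": ", .x))
by_index

# Named input: .y is the name
by_name <- imap_chr(c(first = "a", second = "b"), ~ paste0(.y, ": ", .x))
by_name
```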
    308 
    309 
    310 ## `pmap()` {-}
    311 
    312 - you can pass a named list or dataframe as arguments to a function
    313 
    314 - for example `runif()` has the parameters `n`, `min` and `max`
    315 
    316 ```{r}
    317 params <- tibble::tribble(
    318   ~ n, ~ min, ~ max,
    319    1L,     1,    10,
    320    2L,    10,   100,
    321    3L,   100,  1000
    322 )
    323 
    324 pmap(params, runif)
    325 ```
    326 
    327 - could also be
    328 
    329 ```{r}
    330 list(
    331   n = 1:3, 
    332   min = 10 ^ (0:2), 
    333   max = 10 ^ (1:3)
    334 ) |> 
    335 pmap(runif)
    336 ```
    337 
    338 - I like to use `expand_grid()` when I want all possible parameter combinations.
    339 
    340 ```{r}
    341 expand_grid(n = 1:3,
    342             min = 10 ^ (0:1),
    343             max = 10 ^ (1:2))
    344 
    345 expand_grid(n = 1:3,
    346             min = 10 ^ (0:1),
    347             max = 10 ^ (1:2)) |> 
    348 pmap(runif)
    349 ```
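
Names matter with `pmap()`: a named list is matched to the function's arguments by name, an unnamed list by position. The sketch below uses a throwaway function of two arguments, `a` and `b`, to make the difference visible.

```r
library(purrr)

# Named list: matched by name, so list order does not matter
named <- pmap_dbl(
  list(b = c(1, 2), a = c(10, 20)),
  function(a, b) a - b
)
named

# Unnamed list: matched by position, so the first element becomes `a`
positional <- pmap_dbl(
  list(c(1, 2), c(10, 20)),
  function(a, b) a - b
)
positional
```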
    350 
    351 
    352 
    353 ## `reduce()` family
    354 
    355 The `reduce()` function is a powerful functional that allows you to abstract away from a sequence of functions that are applied in a fixed direction.
    356 
    357 `reduce()` takes a vector as its first argument, a function as its second argument, and an optional `.init` argument last.  It will then apply the function repeatedly to the vector until there is only a single element left.
    358 
    359 (Hint: start at the top of the image and read down.)
    360 
    361 ```{r,echo=FALSE,warning=FALSE,message=FALSE}
    362 knitr::include_graphics(path = 'images/reduce-init.png')
    363 ```
    364 
    365 
    366 Let me really quickly demonstrate `reduce()` in action.
    367 
    368 Say you wanted to add up the numbers 1 through 5 using only the plus operator `+`. You could do something like:
    369 
    370 ```{r}
    371 1 + 2 + 3 + 4 + 5
    372 
    373 ```
    374 
    375 Which is the same as:
    376 
    377 ```{r}
    378 reduce(1:5, `+`)
    379 ```
    380 
And if you want the start value to be something other than the first element of the vector, pass that value to the `.init` argument:
    382 
    383 ```{r}
    384 
    385 identical(
    386   0.5 + 1 + 2 + 3 + 4 + 5,
    387   reduce(1:5, `+`, .init = 0.5)
    388 )
    389 
    390 ```
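
`.init` also matters for empty input. As a small sketch: without `.init`, `reduce()` has nothing to return for a zero-length vector and raises an error (captured with `tryCatch()` here), while with `.init` it returns that start value.

```r
library(purrr)

# Empty input with a start value: the start value comes back
total_empty <- reduce(integer(), `+`, .init = 0L)
total_empty

# Empty input without a start value: an error
no_init <- tryCatch(
  reduce(integer(), `+`),
  error = function(e) "error"
)
no_init
```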
    391 
    392 ## ggplot2 example with reduce {-}
    393 
    394 ```{r}
    395 ggplot(mtcars, aes(hp, mpg)) + 
    396   geom_point(size = 8, alpha = .5, color = "yellow") +
    397   geom_point(size = 4, alpha = .5, color = "red") +
    398   geom_point(size = 2, alpha = .5, color = "blue")
    399 
    400 ```
    401 
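Let us use the `reduce2()` function.  Note that `reduce2()` iterates over two vectors, so `.f` receives three arguments: the accumulated value (`..1`, which starts out as the `.init` value) and the current elements of the two vectors (`..2` and `..3`).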
    403 
    404 ```{r}
    405 reduce2(
    406   c(8, 4, 2),
    407   c("yellow", "red", "blue"),
    408   ~ ..1 + geom_point(size = ..2, alpha = .5, color = ..3),
    409   .init = ggplot(mtcars, aes(hp, mpg))
    410 )
    411 
    412 ```
    413 
```{r}
df <- list(age=tibble(name='john',age=30),
    sex=tibble(name=c('john','mary'),sex=c('M','F')),
    trt=tibble(name='mary',treatment='A'))

df

df |> reduce(.f = full_join)

reduce(.x = df,.f = full_join)
```
    425 
    426 - to see all intermediate steps, use **accumulate()**
    427 
    428 ```{r}
    429 set.seed(1234)
    430 accumulate(1:5, `+`)
    431 ```
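
To connect the two functions: the last element of `accumulate()` is what `reduce()` returns, and for `+` the running results match `cumsum()`. A minimal sketch:

```r
library(purrr)

running <- accumulate(1:5, `+`)   # every intermediate sum
final   <- reduce(1:5, `+`)       # only the last one

running
final

# With .init the output has one more element than the input
with_init <- accumulate(1:5, `+`, .init = 0)
length(with_init)
```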
    432 
    433 ```{r}
    434 accumulate2(
    435   c(8, 4, 2),
    436   c("yellow", "red", "blue"),
    437   ~ ..1 + geom_point(size = ..2, alpha = .5, color = ..3),
    438   .init = ggplot(mtcars, aes(hp, mpg))
    439 )
    440 ```
    441 
    442 
    443 ## `map_df*()` variants {-}
    444 
    445 - `map_dfr()` = row bind the results
    446 
    447 - `map_dfc()` = column bind the results
    448 
    449 - Note that `map_dfr()` has been superseded by `map() |> list_rbind()` and `map_dfc()` has been superseded by `map() |> list_cbind()`
    450 
    451 ```{r}
    452 col_stats <- function(n) {
    453   head(mtcars, n) |> 
    454     summarise_all(mean) |> 
    455     mutate_all(floor) |> 
    456     mutate(n = paste("N =", n))
    457 }
    458 
    459 map((1:2) * 10, col_stats)
    460 
    461 map_dfr((1:2) * 10, col_stats)
    462 
    463 map((1:2) * 10, col_stats) |> list_rbind()
    464 ```
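
A smaller sketch of the newer spelling, assuming purrr 1.0.0 or later is installed. The data frames here are invented and built with base R so that nothing beyond purrr is needed.

```r
library(purrr)

# One small data frame per input value
pieces <- map(1:3, \(i) data.frame(id = i, square = i^2))

# Stack them by row
rows <- list_rbind(pieces)
rows

# Or glue data frames side by side
cols <- list_cbind(list(data.frame(a = 1:2), data.frame(b = 3:4)))
cols
```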
    465 
    466 ---
    467 
    468 ## `pluck()` {-}
    469 
    470 - `pluck()` will pull a single element from a list
    471 
    472 I like the example from the book because the starting object is not particularly easy to work with (as many JSON objects might not be).
    473 
    474 ```{r}
    475 my_list <- list(
    476   list(-1, x = 1, y = c(2), z = "a"),
    477   list(-2, x = 4, y = c(5, 6), z = "b"),
    478   list(-3, x = 8, y = c(9, 10, 11))
    479 )
    480 my_list
    481 ```
    482 
    483 Notice that the "first element" means something different in standard `pluck()` versus `map`ped `pluck()`.
    484 
    485 ```{r}
    486 pluck(my_list, 1)
    487 
    488 map(my_list, pluck, 1)
    489 
    490 map_dbl(my_list, pluck, 1)
    491 ```
    492 
    493 The `map()` functions also have shortcuts for extracting elements from vectors (powered by `purrr::pluck()`).  Note that `map(my_list, 3)` is a shortcut for `map(my_list, pluck, 3)`.
    494 
    495 ```{r}
    496 #| error: true
    497 
    498 # Select by name
    499 map_dbl(my_list, "x")
    500 
    501 # Or by position
    502 map_dbl(my_list, 1)
    503 
    504 # Or by both
    505 map_dbl(my_list, list("y", 1))
    506 
    507 # You'll get an error if you try to retrieve an inside item that doesn't have 
    508 # a consistent format and you want a numeric output
    509 map_dbl(my_list, list("y"))
    510 
    511 
    512 # You'll get an error if a component doesn't exist:
    513 map_chr(my_list, "z")
    514 #> Error: Result 3 must be a single string, not NULL of length 0
    515 
    516 # Unless you supply a .default value
    517 map_chr(my_list, "z", .default = NA)
    518 #> [1] "a" "b" NA
    519 ```
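
`pluck()` itself accepts several accessors in a row, which is what the `list("y", 1)` shortcut relies on. A brief sketch, with the list from the book repeated so that it runs on its own:

```r
library(purrr)

my_list <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)

# Second sub-list, then element "y", then its second value
deep <- pluck(my_list, 2, "y", 2)
deep

# A missing element gives NULL rather than an error...
missing_z <- pluck(my_list, 3, "z")

# ...unless a .default is supplied
default_z <- pluck(my_list, 3, "z", .default = NA)
```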
    520 
    521 
    522 ## Not covered: `flatten()` {-}
    523 
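- `flatten()` removes one level of hierarchy from a list; the typed variants such as `flatten_int()` return an atomic vector instead.

- Note that the `flatten()` family has been superseded in purrr 1.0.0 by `list_flatten()` and `list_c()`.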
    525 
    526 ```{r}
    527 my_list <-
    528   list(
    529     a = 1:3,
    530     b = list(1:3)
    531   )
    532 
    533 my_list
    534 
    535 map_if(my_list, is.list, pluck)
    536   
    537 map_if(my_list, is.list, flatten_int)
    538 
    539 map_if(my_list, is.list, flatten_int) |> 
    540   flatten_int()
    541 ```
    542 
    543 ## Dealing with Failures {-}
    544 
    545 ## Safely {-}
    546 
    547 `safely()` is an adverb.  It takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead it always returns a list with two elements.
    548 
    549 - `result` is the original result. If there is an error this will be NULL
    550 
    551 - `error` is an error object. If the operation was successful the "`error`" will be NULL.
    552 
    553 ```{r}
    554 A <- list(1, 10, "a")
    555 
    556 map(.x = A, .f = safely(log))
    557   
    558 ```
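
The list returned by a `safely()`-wrapped function usually needs a little post-processing. One possible approach, sketched below, is to flag the calls that succeeded and then pull out only their results.

```r
library(purrr)

A <- list(1, 10, "a")

res <- map(A, safely(log))

# TRUE where the call succeeded (no error object was recorded)
ok <- map_lgl(res, \(r) is.null(r$error))
ok

# Results of the successful calls only
values <- map_dbl(res[ok], "result")
values

# Inputs that failed
A[!ok]
```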
    559 
    560 ## Possibly {-}
    561 
    562 `possibly()` always succeeds. It is simpler than `safely()`, because you can give it a default value to return when there is an error.
    563 
    564 ```{r}
    565 A <- list(1,10,"a")
    566 
    567 map_dbl(.x = A, .f = possibly(log, otherwise = NA_real_) )
    568 
    569 ```