18.Rmd (19674B)
1 --- 2 engine: knitr 3 title: Expressions 4 --- 5 6 ## Learning objectives: 7 8 * Understand the idea of the abstract syntax tree (AST). 9 * Discuss the data structures that underlie the AST: 10 * Constants 11 * Symbols 12 * Calls 13 * Explore the idea behind parsing. 14 * Explore some details of R's grammar. 15 * Discuss the use or recursive functions to compute on the language. 16 * Work with three other more specialized data structures: 17 * Pairlists 18 * Missing arguments 19 * Expression vectors 20 21 ```{r, message = FALSE, warning = FALSE} 22 library(rlang) 23 library(lobstr) 24 ``` 25 26 ## Introduction 27 28 > To compute on the language, we first need to understand its structure. 29 30 * This requires a few things: 31 * New vocabulary. 32 * New tools to inspect and modify expressions. 33 * Approach the use of the language with new ways of thinking. 34 * One of the first new ways of thinking is the distinction between an operation and its result. 35 36 ```{r, error = TRUE} 37 y <- x * 10 38 ``` 39 40 * We can capture the intent of the code without executing it using the rlang package. 41 42 ```{r} 43 z <- rlang::expr(y <- x * 10) 44 45 z 46 ``` 47 48 * We can then evaluate the expression using the **base::eval** function. 49 50 ```{r} 51 x <- 4 52 53 base::eval(expr(y <- x * 10)) 54 55 y 56 ``` 57 58 ### Evaluating multiple expressions 59 60 * The function `expression()` allows for multiple expressions, and in some ways it acts similarly to the way files are `source()`d in. That is, we `eval()`uate all of the expressions at once. 61 62 * `expression()` returns a vector and can be passed to `eval()`. 63 64 ```{r} 65 z <- expression(x <- 4, x * 10) 66 67 eval(z) 68 is.atomic(z) 69 is.vector(z) 70 ``` 71 72 * `exprs()` does not evaluate everything at once. To evaluate each expression, the individual expressions must be evaluated in a loop. 73 74 ```{r} 75 for (i in exprs(x <- 4, x * 10)) { 76 print(i) 77 print(eval(i)) 78 } 79 ``` 80 81 ## Abstract Syntax Tree (AST) 82 83 * Expressions are objects that capture the structure of code without evaluating it. 84 * Expressions are also called abstract syntax trees (ASTs) because the structure of code is hierarchical and can be naturally represented as a tree. 85 * Understanding this tree structure is crucial for inspecting and modifying expressions. 86 * Branches = Calls 87 * Leaves = Symbols and constants 88 89 ```{r,eval=FALSE} 90 f(x, "y", 1) 91 ``` 92 93  94 95 ### With `lobstr::ast():` 96 97 ```{r} 98 lobstr::ast(f(x, "y", 1)) 99 ``` 100 101 * Some functions might also contain more calls like the example below: 102 103 ```{r,eval=FALSE} 104 f(g(1, 2), h(3, 4, i())): 105 ``` 106  107 108 ```{r} 109 lobstr::ast(f(g(1, 2), h(3, 4, i()))) 110 ``` 111 * Read the **hand-drawn diagrams** from left-to-right (ignoring vertical position) 112 * Read the **lobstr-drawn diagrams** from top-to-bottom (ignoring horizontal position). 113 * The depth within the tree is determined by the nesting of function calls. 114 * Depth also determines evaluation order, **as evaluation generally proceeds from deepest-to-shallowest, but this is not guaranteed because of lazy evaluation**. 115 116 ### Infix calls 117 118 > Every call in R can be written in tree form because any call can be written in prefix form. 119 120 An infix operator is a function where the function name is placed between its arguments. Prefix form is when then function name comes before the arguments, which are enclosed in parentheses. [Note that the name infix comes from the words prefix and suffix.] 121 122 ```{r} 123 y <- x * 10 124 `<-`(y, `*`(x, 10)) 125 ``` 126 127 * A characteristic of the language is that infix functions can always be written as prefix functions; therefore, all function calls can be represented using an AST. 128 129  130 131 ```{r} 132 lobstr::ast(y <- x * 10) 133 ``` 134 135 ```{r} 136 lobstr::ast(`<-`(y, `*`(x, 10))) 137 ``` 138 139 * There is no difference between the ASTs for the infix version vs the prefix version, and if you generate an expression with prefix calls, R will still print it in infix form: 140 141 ```{r} 142 rlang::expr(`<-`(y, `*`(x, 10))) 143 ``` 144 145 ## Expression 146 147 * Collectively, the data structures present in the AST are called expressions. 148 * These include: 149 1. Constants 150 2. Symbols 151 3. Calls 152 4. Pairlists 153 154 ### Constants 155 156 * Scalar constants are the simplest component of the AST. 157 * A constant is either **NULL** or a **length-1** atomic vector (or scalar) 158 * e.g., `TRUE`, `1L`, `2.5`, `"x"`, or `"hello"`. 159 * We can test for a constant with `rlang::is_syntactic_literal()`. 160 * Constants are self-quoting in the sense that the expression used to represent a constant is the same constant: 161 162 ```{r} 163 identical(expr(TRUE), TRUE) 164 identical(expr(1), 1) 165 identical(expr(2L), 2L) 166 identical(expr("x"), "x") 167 identical(expr("hello"), "hello") 168 ``` 169 170 ### Symbols 171 172 * A symbol represents the name of an object. 173 * `x` 174 * `mtcars` 175 * `mean` 176 * In base R, the terms symbol and name are used interchangeably (i.e., `is.name()` is identical to `is.symbol()`), but this book used symbol consistently because **"name"** has many other meanings. 177 * You can create a symbol in two ways: 178 1. by capturing code that references an object with `expr()`. 179 2. turning a string into a symbol with `rlang::sym()`. 180 181 ```{r} 182 expr(x) 183 ``` 184 185 ```{r} 186 sym("x") 187 ``` 188 189 * A symbol can be turned back into a string with `as.character()` or `rlang::as_string()`. 190 * `as_string()` has the advantage of clearly signalling that you’ll get a character vector of length 1. 191 192 ```{r} 193 as_string(expr(x)) 194 ``` 195 196 * We can recognize a symbol because it is printed without quotes 197 198 ```{r} 199 expr(x) 200 ``` 201 202 * `str()` tells you that it is a symbol, and `is.symbol()` is TRUE: 203 204 ```{r} 205 str(expr(x)) 206 ``` 207 208 ```{r} 209 is.symbol(expr(x)) 210 ``` 211 212 * The symbol type is not vectorised, i.e., a symbol is always length 1. 213 * If you want multiple symbols, you’ll need to put them in a list, using `rlang::syms()`. 214 215 Note that `as_string()` will not work on expressions which are not symbols. 216 217 ```{r} 218 #| error: true 219 as_string(expr(x+y)) 220 ``` 221 222 223 ### Calls 224 225 * A call object represents a captured function call. 226 * Call objects are a special type of list. 227 * The first component specifies the function to call (usually a symbol, i.e., the name fo the function). 228 * The remaining elements are the arguments for that call. 229 * Call objects create branches in the AST, because calls can be nested inside other calls. 230 * You can identify a call object when printed because it looks just like a function call. 231 * Confusingly `typeof()` and `str()` print language for call objects (where we might expect it to return that it is a "call" object), but `is.call()` returns TRUE: 232 233 ```{r} 234 lobstr::ast(read.table("important.csv", row.names = FALSE)) 235 ``` 236 237 ```{r} 238 x <- expr(read.table("important.csv", row.names = FALSE)) 239 ``` 240 241 ```{r} 242 typeof(x) 243 ``` 244 245 ```{r} 246 is.call(x) 247 ``` 248 249 ### Subsetting 250 251 * Calls generally behave like lists. 252 * Since they are list-like, you can use standard subsetting tools. 253 * The first element of the call object is the function to call, which is usually a symbol: 254 255 ```{r} 256 x[[1]] 257 ``` 258 259 ```{r} 260 is.symbol(x[[1]]) 261 ``` 262 * The remainder of the elements are the arguments: 263 264 ```{r} 265 is.symbol(x[-1]) 266 as.list(x[-1]) 267 ``` 268 * We can extract individual arguments with [[ or, if named, $: 269 270 ```{r} 271 x[[2]] 272 ``` 273 274 ```{r} 275 x$row.names 276 ``` 277 278 * We can determine the number of arguments in a call object by subtracting 1 from its length: 279 280 ```{r} 281 length(x) - 1 282 ``` 283 284 * Extracting specific arguments from calls is challenging because of R’s flexible rules for argument matching: 285 * It could potentially be in any location, with the full name, with an abbreviated name, or with no name. 286 287 * To work around this problem, you can use `rlang::call_standardise()` which standardizes all arguments to use the full name: 288 289 ```{r} 290 rlang::call_standardise(x) 291 ``` 292 293 * But If the function uses ... it’s not possible to standardise all arguments. 294 * Calls can be modified in the same way as lists: 295 296 ```{r} 297 x$header <- TRUE 298 x 299 ``` 300 301 ### Function position 302 303 * The first element of the call object is the function position. This contains the function that will be called when the object is evaluated, and is usually a symbol. 304 305 ```{r} 306 lobstr::ast(foo()) 307 ``` 308 309 * While R allows you to surround the name of the function with quotes, the parser converts it to a symbol: 310 311 ```{r} 312 lobstr::ast("foo"()) 313 ``` 314 315 * However, sometimes the function doesn’t exist in the current environment and you need to do some computation to retrieve it: 316 * For example, if the function is in another package, is a method of an R6 object, or is created by a function factory. In this case, the function position will be occupied by another call: 317 318 319 ```{r} 320 lobstr::ast(pkg::foo(1)) 321 ``` 322 323 ```{r} 324 lobstr::ast(obj$foo(1)) 325 ``` 326 327 ```{r} 328 lobstr::ast(foo(1)(2)) 329 ``` 330 331  332 333 ### Constructing 334 335 * You can construct a call object from its components using `rlang::call2()`. 336 * The first argument is the name of the function to call (either as a string, a symbol, or another call). 337 * The remaining arguments will be passed along to the call: 338 339 ```{r} 340 call2("mean", x = expr(x), na.rm = TRUE) 341 ``` 342 343 ```{r} 344 call2(expr(base::mean), x = expr(x), na.rm = TRUE) 345 ``` 346 347 * Infix calls created in this way still print as usual. 348 349 ```{r} 350 call2("<-", expr(x), 10) 351 ``` 352 353 ## Parsing and grammar 354 355 * **Parsing** - The process by which a computer language takes a string and constructs an expression. Parsing is governed by a set of rules known as a grammar. 356 * We are going to use `lobstr::ast()` to explore some of the details of R’s grammar, and then show how you can transform back and forth between expressions and strings. 357 * **Operator precedence** - Conventions used by the programming language to resolve ambiguity. 358 * Infix functions introduce two sources of ambiguity. 359 * The first source of ambiguity arises from infix functions: what does 1 + 2 * 3 yield? Do you get 9 (i.e., (1 + 2) * 3), or 7 (i.e., 1 + (2 * 3))? In other words, which of the two possible parse trees below does R use? 360 361  362 363 * Programming languages use conventions called operator precedence to resolve this ambiguity. We can use `ast()` to see what R does: 364 365 ```{r} 366 lobstr::ast(1 + 2 * 3) 367 ``` 368 369 * PEMDAS (or BEDMAS or BODMAS, depending on where in the world you grew up) is pretty clear on what to do. Other operator precedence isn't as clear. 370 * There’s one particularly surprising case in R: 371 * ! has a much lower precedence (i.e., it binds less tightly) than you might expect. 372 * This allows you to write useful operations like: 373 374 ```{r} 375 lobstr::ast(!x %in% y) 376 ``` 377 * **R has over 30 infix operators divided into 18 precedence** groups. 378 * While the details are described in `?Syntax`, very few people have memorized the complete ordering. 379 * If there’s any confusion, use parentheses! 380 381 ```{r} 382 # override PEMDAS 383 lobstr::ast((1 + 2) * 3) 384 ``` 385 386 ### Associativity 387 388 * The second source of ambiguity is introduced by repeated usage of the same infix function. 389 390 ```{r} 391 1 + 2 + 3 392 393 # What does R do first? 394 (1 + 2) + 3 395 396 # or 397 1 + (2 + 3) 398 ``` 399 400 * In this case it doesn't matter. Other places it might, like in `ggplot2`. 401 402 * In R, most operators are left-associative, i.e., the operations on the left are evaluated first: 403 404 ```{r} 405 lobstr::ast(1 + 2 + 3) 406 ``` 407 408 * There are two exceptions to the left-associative rule: 409 1. exponentiation 410 2. assignment 411 412 ```{r} 413 lobstr::ast(2 ^ 2 ^ 3) 414 ``` 415 416 ```{r} 417 lobstr::ast(x <- y <- z) 418 ``` 419 420 ### Parsing and deparsing 421 422 * Parsing - turning characters you've typed into an AST (i.e., from strings to expressions). 423 * R usually takes care of parsing code for us. 424 * But occasionally you have code stored as a string, and you want to parse it yourself. 425 * You can do so using `rlang::parse_expr()`: 426 427 ```{r} 428 x1 <- "y <- x + 10" 429 x1 430 is.call(x1) 431 ``` 432 433 ```{r} 434 x2 <- rlang::parse_expr(x1) 435 x2 436 is.call(x2) 437 ``` 438 439 * `parse_expr()` always returns a single expression. 440 * If you have multiple expression separated by `;` or `,`, you’ll need to use `rlang::parse_exprs()` which is the plural version of `rlang::parse_expr()`. It returns a list of expressions: 441 442 ```{r} 443 x3 <- "a <- 1; a + 1" 444 ``` 445 446 ```{r} 447 rlang::parse_exprs(x3) 448 ``` 449 450 * If you find yourself parsing strings into expressions often, **quasiquotation** may be a safer approach. 451 * More about quasiquaotation in Chapter 19. 452 * The inverse of parsing is deparsing. 453 * **Deparsing** - given an expression, you want the string that would generate it. 454 * Deparsing happens automatically when you print an expression. 455 * You can get the string with `rlang::expr_text()`: 456 * Parsing and deparsing are not symmetric. 457 * Parsing creates the AST which means that we lose backticks around ordinary names, comments, and whitespace. 458 459 ```{r} 460 cat(expr_text(expr({ 461 # This is a comment 462 x <- `x` + 1 463 }))) 464 ``` 465 466 ## Using the AST to solve more complicated problems 467 468 * Here we focus on what we learned to perform recursion on the AST. 469 * Two parts of a recursive function: 470 * Recursive case: handles the nodes in the tree. Typically, you’ll do something to each child of a node, usually calling the recursive function again, and then combine the results back together again. For expressions, you’ll need to handle calls and pairlists (function arguments). 471 * Base case: handles the leaves of the tree. The base cases ensure that the function eventually terminates, by solving the simplest cases directly. For expressions, you need to handle symbols and constants in the base case. 472 473 474 ### Two helper functions 475 476 * First, we need an `epxr_type()` function to return the type of expression element as a string. 477 478 ```{r} 479 expr_type <- function(x) { 480 if (rlang::is_syntactic_literal(x)) { 481 "constant" 482 } else if (is.symbol(x)) { 483 "symbol" 484 } else if (is.call(x)) { 485 "call" 486 } else if (is.pairlist(x)) { 487 "pairlist" 488 } else { 489 typeof(x) 490 } 491 } 492 ``` 493 494 ```{r} 495 expr_type(expr("a")) 496 expr_type(expr(x)) 497 expr_type(expr(f(1, 2))) 498 ``` 499 500 * Second, we need a wrapper function to handle exceptions. 501 502 ```{r} 503 switch_expr <- function(x, ...) { 504 switch(expr_type(x), 505 ..., 506 stop("Don't know how to handle type ", typeof(x), call. = FALSE) 507 ) 508 } 509 ``` 510 511 * Lastly, we can write a basic template that walks the AST using the `switch()` statement. 512 513 ```{r,} 514 recurse_call <- function(x) { 515 switch_expr(x, 516 # Base cases 517 symbol = , 518 constant = , 519 520 # Recursive cases 521 call = , 522 pairlist = 523 ) 524 } 525 ``` 526 527 ### Specific use cases for `recurse_call()` 528 529 ### Example 1: Finding F and T 530 531 * Using `F` and `T` in our code rather than `FALSE` and `TRUE` is bad practice. 532 * Say we want to walk the AST to find times when we use `F` and `T`. 533 * Start off by finding the type of `T` vs `TRUE`. 534 535 ```{r} 536 expr_type(expr(TRUE)) 537 538 expr_type(expr(T)) 539 ``` 540 541 * With this knowledge, we can now write the base cases of our recursive function. 542 * The logic is as follows: 543 * A constant is never a logical abbreviation and a symbol is an abbreviation if it is "F" or "T": 544 545 ```{r} 546 logical_abbr_rec <- function(x) { 547 switch_expr(x, 548 constant = FALSE, 549 symbol = as_string(x) %in% c("F", "T") 550 ) 551 } 552 ``` 553 554 ```{r} 555 logical_abbr_rec(expr(TRUE)) 556 logical_abbr_rec(expr(T)) 557 ``` 558 559 * It's best practice to write another wrapper, assuming every input you receive will be an expression. 560 561 ```{r} 562 logical_abbr <- function(x) { 563 logical_abbr_rec(enexpr(x)) 564 } 565 566 logical_abbr(T) 567 logical_abbr(FALSE) 568 ``` 569 570 #### Next step: code for the recursive cases 571 572 * Here we want to do the same thing for calls and for pairlists. 573 * Here's the logic: recursively apply the function to each subcomponent, and return `TRUE` if any subcomponent contains a logical abbreviation. 574 * This is simplified by using the `purrr::some()` function, which iterates over a list and returns `TRUE` if the predicate function is true for any element. 575 576 ```{r} 577 logical_abbr_rec <- function(x) { 578 switch_expr(x, 579 # Base cases 580 constant = FALSE, 581 symbol = as_string(x) %in% c("F", "T"), 582 # Recursive cases 583 call = , 584 # Are we sure this is the correct function to use? 585 # Why not logical_abbr_rec? 586 pairlist = purrr::some(x, logical_abbr_rec) 587 ) 588 } 589 590 logical_abbr(mean(x, na.rm = T)) 591 592 logical_abbr(function(x, na.rm = T) FALSE) 593 ``` 594 595 ### Example 2: Finding all variables created by assignment 596 597 * Listing all the variables is a little more complicated. 598 * Figure out what assignment looks like based on the AST. 599 600 ```{r} 601 ast(x <- 10) 602 ``` 603 604 * Now we need to decide what data structure we're going to use for the results. 605 * Easiest thing will be to return a character vector. 606 * We would need to use a list if we wanted to return symbols. 607 608 ### Dealing with the base cases 609 610 ```{r} 611 find_assign_rec <- function(x) { 612 switch_expr(x, 613 constant = , 614 symbol = character() 615 ) 616 } 617 find_assign <- function(x) find_assign_rec(enexpr(x)) 618 619 find_assign("x") 620 621 find_assign(x) 622 623 ``` 624 625 ### Dealing with the recursive cases 626 627 * Here is the function to flatten pairlists. 628 629 ```{r} 630 flat_map_chr <- function(.x, .f, ...) { 631 purrr::flatten_chr(purrr::map(.x, .f, ...)) 632 } 633 634 flat_map_chr(letters[1:3], ~ rep(., sample(3, 1))) 635 ``` 636 637 * Here is the code needed to identify calls. 638 639 ```{r} 640 find_assign_rec <- function(x) { 641 switch_expr(x, 642 # Base cases 643 constant = , 644 symbol = character(), 645 646 # Recursive cases 647 pairlist = flat_map_chr(as.list(x), find_assign_rec), 648 call = { 649 if (is_call(x, "<-")) { 650 as_string(x[[2]]) 651 } else { 652 flat_map_chr(as.list(x), find_assign_rec) 653 } 654 } 655 ) 656 } 657 658 find_assign(a <- 1) 659 660 find_assign({ 661 a <- 1 662 { 663 b <- 2 664 } 665 }) 666 667 ``` 668 669 ### Make the function more robust 670 671 * Throw cases at it that we think might break the function. 672 * Write a function to handle these cases. 673 674 ```{r} 675 find_assign_call <- function(x) { 676 if (is_call(x, "<-") && is_symbol(x[[2]])) { 677 lhs <- as_string(x[[2]]) 678 children <- as.list(x)[-1] 679 } else { 680 lhs <- character() 681 children <- as.list(x) 682 } 683 684 c(lhs, flat_map_chr(children, find_assign_rec)) 685 } 686 687 find_assign_rec <- function(x) { 688 switch_expr(x, 689 # Base cases 690 constant = , 691 symbol = character(), 692 693 # Recursive cases 694 pairlist = flat_map_chr(x, find_assign_rec), 695 call = find_assign_call(x) 696 ) 697 } 698 699 find_assign(a <- b <- c <- 1) 700 701 find_assign(system.time(x <- print(y <- 5))) 702 703 ``` 704 705 * This approach certainly is more complicated, but it's important to start simple and move up. 706 707 ## Specialised data structures 708 709 * Pairlists 710 * Missing arguments 711 * Expression vectors 712 713 ### Pairlists 714 715 * Pairlists are a remnant of R’s past and have been replaced by lists almost everywhere. 716 * The only place you are likely to see pairlists in R is when working with calls to the function, as the formal arguments to a function are stored in a pairlist: 717 718 ```{r} 719 f <- expr(function(x, y = 10) x + y) 720 ``` 721 722 ```{r} 723 args <- f[[2]] 724 args 725 ``` 726 727 ```{r} 728 typeof(args) 729 ``` 730 * Fortunately, whenever you encounter a pairlist, you can treat it just like a regular list: 731 732 ```{r} 733 pl <- pairlist(x = 1, y = 2) 734 ``` 735 736 ```{r} 737 length(pl) 738 ``` 739 740 ```{r} 741 pl$x 742 ``` 743 744 ### Missing arguments 745 746 * Empty symbols 747 * To create an empty symbol, you need to use `missing_arg()` or `expr()`. 748 749 ```{r} 750 missing_arg() 751 typeof(missing_arg()) 752 ``` 753 754 * Empty symbols don't print anything. 755 * To check, we need to use `rlang::is_missing()` 756 757 ```{r} 758 is_missing(missing_arg()) 759 ``` 760 761 * These are usually present in function formals: 762 763 ```{r} 764 f <- expr(function(x, y = 10) x + y) 765 766 args <- f[[2]] 767 768 769 is_missing(args[[1]]) 770 ``` 771 772 ### Expression vectors 773 774 * An expression vector is just a list of expressions. 775 * The only difference is that calling `eval()` on an expression evaluates each individual expression. 776 * Instead, it might be more advantageous to use a list of expressions. 777 778 * Expression vectors are only produced by two base functions: 779 `expression()` and `parse()`: 780 781 ```{r} 782 exp1 <- parse(text = c(" 783 x <- 4 784 x 785 ")) 786 exp1 787 ``` 788 789 ```{r} 790 exp2 <- expression(x <- 4, x) 791 exp2 792 ``` 793 794 ```{r} 795 typeof(exp1) 796 typeof(exp2) 797 ``` 798 799 800 - Like calls and pairlists, expression vectors behave like lists: 801 802 ```{r} 803 length(exp1) 804 exp1[[1]] 805 ```