---
title: "Checking Equality in an R Pipeline"
date: 2023-12-22T10:24:40-08:00
draft: false
categories:
- Programming
- Data Science
tags:
- R
---

I was wondering how to incorporate a test for equality into a pipeline using
R's new(_ish_) native forward pipe operator[^pipe], so I [asked the
Fediverse](https://social.coop/@eamon/111608458848988985) and got great
advice[^fedi].

## Why would I do this?

I approach scripts differently than notebooks (whether using Jupyter or
RMarkdown) since scripts are usually meant to be run non-interactively. I try
to keep the list of packages (or modules) fairly minimal; basically, when
deciding how to balance portability vs. code readability, I lean a bit more
toward portability in a script than I would in an interactive data analysis.

Nevertheless, a good script should include data quality checks and raise
errors when those checks fail. An approach I'd previously taken with the pipe
operator introduced by R's [magrittr
package](https://cran.r-project.org/web/packages/magrittr/index.html)[^magrittr]
might look like this:

```r
my_data_frame %>%
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) %>%
  nrow() %>%
  `==`(0) %>%
  stopifnot()
```

This code would cause the R script to exit and alert the calling process
that something went wrong, but it doesn't work as written with the native
pipe: swapping `%>%` for `|>` makes R reject the `` `==`(0) `` step.
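
To see the failure without aborting a script, one option is to hand the native-pipe version to `parse()` as text and catch the error. This is a minimal sketch in base R; the variable names are made up for illustration, and since the wording of the error message may differ between R versions, the code only checks that parsing failed.

```r
# The native-pipe equivalent of the `==`(0) step, kept as text so that the
# parse error can be caught instead of stopping the script.
native_pipe_code <- "0L |> `==`(0)"

parse_outcome <- tryCatch(
  parse(text = native_pipe_code),
  error = function(e) conditionMessage(e)
)

# parse() returns an expression on success, so a character result means it
# raised an error and we captured the message instead.
parse_failed <- is.character(parse_outcome)
print(parse_failed)
```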

## Avoid using `==`

I was convinced by a couple of responses[^responses1] that the best approach
is to forgo the `==` operator entirely, and this advice is echoed in [R's own
documentation](https://search.r-project.org/R/refmans/base/html/Comparison.html).

* When you know that the value you're testing will be an integer or a
  non-numeric value, the `identical()` function is better. In the example
  above, I'd write `identical(0L)`, since this differentiates integer from
  floating-point values, and I know `nrow()` will return an integer.
* For other numeric values, use `all.equal()` followed by `isTRUE()`; e.g.
  `sin(pi) |> all.equal(0) |> isTRUE()`. The `isTRUE()` is needed because
  `all.equal()` returns a description of the difference, rather than
  `FALSE`, when the values differ.

The first solution is what the scripts I wrote this week now use.
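
As a rough illustration of both bullet points, here is a minimal sketch using only base R. The `readings` data frame and its columns are invented for the example, and `subset()` stands in for whatever filtering step a real pipeline would use.

```r
# Hypothetical data, invented for this example.
readings <- data.frame(sensor = c("a", "b", "c"), value = c(0.1, 0.2, 0.3))

# identical() distinguishes the double 0 from the integer 0L ...
identical(0, 0L)     # FALSE

# ... and nrow() returns an integer, so compare against 0L.
readings |>
  subset(value < 0) |>
  nrow() |>
  identical(0L) |>
  stopifnot()

# Floating-point arithmetic is inexact, so `==` can be surprising ...
(0.1 + 0.2) == 0.3   # FALSE

# ... while all.equal() compares within a tolerance.
(0.1 + 0.2) |>
  all.equal(0.3) |>
  isTRUE() |>
  stopifnot()
```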

## You _can_ still use `==`

I often bristle when I see somebody ask "how do I do this?" on the internet
only to be told "don't do that". It only "works" when the respondent
_really_ knows what they're talking about, and internet commenters seem to
overestimate their expertise[^dk]. Fortunately, that wasn't the case this
time, and I think this says great things about the Fediverse's R community.

That said, it is still possible to use the equality operator by
naming the argument[^responses2]:

```r
my_data_frame |>
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) |>
  nrow() |>
  `==`(x = _, 0) |>
  stopifnot()
```

I believe this approach requires R 4.2.0 or greater, since it relies on the
`_` placeholder, but I haven't tested it in the 4.1.x family. It's nice to
know about this, because it applies in other contexts (e.g., when using
other operators as functions).
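
For instance, the same trick should carry over to the other comparison operators. This is a minimal sketch, assuming R 4.2.0 or later for the `_` placeholder; `row_count` is a made-up value standing in for the output of `nrow()`.

```r
# Hypothetical row count, standing in for the result of nrow().
row_count <- 12L

# `>=` called as a function: the piped value is passed through the named
# placeholder argument, just as with `==` above.
has_enough_rows <- row_count |>
  `>=`(x = _, 10L)

stopifnot(has_enough_rows)

# `!=` can be used the same way.
row_count |>
  `!=`(x = _, 0L) |>
  stopifnot()
```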

It's possibly worth calling out the first approach I had planned to use,
which calls an "anonymous function"[^anon]:

```r
my_data_frame |>
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) |>
  nrow() |>
  {\(x) x == 0}() |>
  stopifnot()
```
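
Parentheses can be used in place of the braces to wrap the anonymous function. Here is a minimal sketch of that spelling, assuming R 4.1.0 or later; the `orders` data frame is invented for the example, and base R's `subset()` stands in for `filter()`.

```r
# Hypothetical data, invented for this example.
orders <- data.frame(id = 1:3, quantity = c(2, 5, 1))

# Wrap the anonymous function in parentheses, then call it with ().
orders |>
  subset(quantity < 0) |>
  nrow() |>
  (\(n) n == 0)() |>
  stopifnot()
```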

## Rambling about pipes

Pipe operators can be found in several other languages[^python]. The idea
of "pipelining" operations probably "comes from" [concatenative
languages]({{< ref "postscript-graph-paper" >}}); pipe _operators_ bring
this ability to other paradigms, allowing one to chain a sequence of
functions without storing intermediate values or nesting the calls. Fans of
this style believe it can "decrease development time and improve readability
and maintainability of code"[^benefits].
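
To make the comparison concrete, here is the same small computation written three ways: nested, with intermediate variables, and as a pipeline. It is only a sketch in base R (4.1.0 or later for `|>`), and the `values` vector is made up for illustration.

```r
# Hypothetical input, chosen so the square roots come out exact.
values <- c(4, 9, 16)

# Nested calls: read from the inside out.
nested <- round(mean(sqrt(values)), 1)

# Intermediate variables: each step gets a name that is never used again.
roots <- sqrt(values)
average_root <- mean(roots)
stepwise <- round(average_root, 1)

# Pipeline: read from top to bottom, in the order the steps are applied.
piped <- values |>
  sqrt() |>
  mean() |>
  round(1)

# All three spellings compute the same value.
stopifnot(identical(nested, stepwise), identical(nested, piped))
```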

In the examples here, I don't want to create a bunch of variables to store
values that I won't use later. But also, I think I realize why I like
pipelining: being able to fit the sequence into a single pipeline _feels_
like writing a sentence describing the transformations I want to apply to
the data.

[^pipe]: `|>`, introduced in R 4.1.0 in May 2021.

[^fedi]: Seriously, thanks to everyone who replied, most of whom probably
    only saw my post because of its `#RStats` hashtag.

[^magrittr]: See [Differences between the base R and magrittr
    pipes](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/).

[^responses1]: This advice comes from [Josep
    Pueyo-Ros](https://github.com/jospueyo) and [Elio
    Campitelli](https://eliocamp.github.io/).

[^dk]: Dunning, D. (2011). The Dunning–Kruger effect: On being ignorant of
    one's own ignorance. In _Advances in experimental social psychology_
    (Vol. 44, pp. 247-296). Academic Press.

[^responses2]: Thanks go to [Matt Dray](https://www.matt-dray.com/) for this
    one.

[^anon]: This syntax was also introduced in R 4.1.0.

[^python]: Python is a notable exception here, and it's my opinion that it
    would be improved with the addition of one.

[^benefits]: <https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html>