---
title: "Checking Equality in an R Pipeline"
date: 2023-12-22T10:24:40-08:00
draft: false
categories:
  - Programming
  - Data Science
tags:
  - R
---

I was wondering how to incorporate a test for equality into a pipeline using
R's new(_ish_) native forward pipe operator[^pipe], so I [asked the
Fediverse](https://social.coop/@eamon/111608458848988985) and got great
advice[^fedi].

## Why would I do this?

I approach scripts differently than notebooks (whether using Jupyter or
RMarkdown) since they're usually meant to be run non-interactively. I try to
keep the list of packages (or modules) fairly minimal; basically, when
deciding how to balance portability vs. code readability, I lean a bit
more toward portability than I would for an interactive data analysis.

Nevertheless, a good script should include data quality checks and raise
errors when problems are found. An approach I'd previously taken with the pipe
operator introduced by R's [magrittr
package](https://cran.r-project.org/web/packages/magrittr/index.html)[^magrittr]
might look like this:

```r
library(dplyr)  # assumed here: provides filter() and re-exports magrittr's %>%

my_data_frame %>%
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) %>%
  nrow() %>%
  `==`(0) %>%
  stopifnot()
```

When the check fails, this code causes the R script to exit with an error,
alerting the calling process that something went wrong, but it doesn't work
as written with the native pipe.

## Avoid using `==`

I was convinced by a couple of responses[^responses1] that the best approach is
to forgo the `==` operator entirely, and this advice is echoed in [R's own
documentation](https://search.r-project.org/R/refmans/base/html/Comparison.html).

* When you know that the value you're testing will be an integer or a
  non-numeric value, the `identical()` function is better.
  In the example above, I'd write `identical(0L)`, since this differentiates
  integer from floating-point values, and I know `nrow()` will return an
  integer.
* For other numeric values, use `all.equal()` followed by `isTRUE()`; e.g.
  `sin(pi) |> all.equal(0) |> isTRUE()`

The first solution is what the scripts I wrote this week now use.

## You _can_ still use `==`

I often bristle when I see somebody ask "how do I do this?" on the internet
only to be told "don't do that". It only "works" when the respondent
_really_ knows what they're talking about, and internet commenters seem to
overestimate their expertise[^dk]. Fortunately, that wasn't the case this
time, and I think this says great things about the Fediverse's R community.

That said, it is still possible to use the equality operator by passing the
piped value through the `_` placeholder as a named argument[^responses2]:

```r
# assumes dplyr is attached, for filter()
my_data_frame |>
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) |>
  nrow() |>
  `==`(x = _, 0) |>
  stopifnot()
```

This approach requires R 4.2.0 or greater, the release that introduced the
`_` placeholder, so it won't parse in the 4.1.x family. It's nice to know
about this, because it applies in other contexts (e.g., when using other
operators as functions).

It's also worth mentioning the first approach I had planned to use,
which calls an "anonymous function"[^anon]:

```r
# assumes dplyr is attached, for filter()
my_data_frame |>
  # ( a long sequence of data munging operations would go here )
  filter(some_condition_that_should_never_be_true) |>
  nrow() |>
  {\(x) x == 0}() |>
  stopifnot()
```

## Rambling about pipes

Pipe operators can be found in several other languages[^python].
The idea of "pipelining" operations probably "comes from" [concatenative
languages]({{< ref "postscript-graph-paper" >}}); pipe _operators_ bring
this ability to other paradigms, allowing one to chain a sequence of
functions without storing intermediate values or nesting the calls. Fans of
this style believe it can "decrease development time and improve readability
and maintainability of code"[^benefits].

In the examples here, I don't want to create a bunch of variables to store
values that I won't use later. I also think I've realized why I like
pipelining: being able to fit the sequence into a single pipeline _feels_
like writing a sentence describing the transformations I want to apply to
the data.

[^pipe]: `|>`, introduced in R 4.1.0 in May 2021.

[^fedi]: Seriously, thanks to everyone who replied, most of whom probably
    only saw my post because of its `#RStats` hashtag.

[^magrittr]: See [Differences between the base R and magrittr
    pipes](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/).

[^responses1]: This advice comes from [Josep
    Pueyo-Ros](https://github.com/jospueyo) and [Elio
    Campitelli](https://eliocamp.github.io/).

[^dk]: Dunning, D. (2011). The Dunning–Kruger effect: On being ignorant of
    one's own ignorance. In _Advances in experimental social psychology_
    (Vol. 44, pp. 247-296). Academic Press.

[^responses2]: Thanks go to [Matt Dray](https://www.matt-dray.com/) for this
    one.

[^anon]: This syntax was also introduced in R 4.1.0.

[^python]: Python is a notable exception here, and it's my opinion that it
    would be improved with the addition of one.

[^benefits]: <https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html>
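
To make the "nesting vs. intermediate values vs. pipeline" comparison a bit more concrete, here's a minimal sketch of one data quality check written all three ways. The `scores` data frame and its columns are made up for illustration, and I'm assuming base R's `subset()` as a stand-in for `filter()` so the snippet needs no packages beyond base R (4.1.0 or later for `|>`).

```r
# Hypothetical data: `id` and `score` are invented for this sketch.
scores <- data.frame(id = 1:5, score = c(0.2, 0.9, 0.5, 0.7, 0.1))

# 1. Nested calls: has to be read from the inside out.
nested_ok <- identical(nrow(subset(scores, score > 1)), 0L)

# 2. Intermediate variables: readable, but leaves unused names behind.
bad_rows <- subset(scores, score > 1)
n_bad <- nrow(bad_rows)
stepwise_ok <- identical(n_bad, 0L)

# 3. Native pipe: reads top to bottom, with no leftover variables.
piped_ok <- scores |>
  subset(score > 1) |>
  nrow() |>
  identical(0L)

# In a script, the last step would feed straight into stopifnot().
stopifnot(nested_ok, stepwise_ok, piped_ok)
```

All three compute the same thing; the difference is only in how the steps read on the page, which is really what the argument for pipes comes down to.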