commit f167ba63f9681c14a50641b09d0c57779eb36476
parent 953f91399b4e966e9ea36f7dd895d24c5a629853
Author: Eamon Caddigan <eamon.caddigan@gmail.com>
Date: Fri, 22 Dec 2023 10:42:00 -0800
Post about R pipelines
Diffstat:
1 file changed, 135 insertions(+), 0 deletions(-)
diff --git a/content/posts/r-pipe-equality/index.md b/content/posts/r-pipe-equality/index.md
@@ -0,0 +1,135 @@
+---
+title: "Checking Equality in an R Pipeline"
+date: 2023-12-22T10:24:40-08:00
+draft: false
+categories:
+- Programming
+- Data Science
+tags:
+- R
+---
+
+I was wondering how to incorporate a test for equality into a pipeline using
+R's new(_ish_) native forward pipe operator[^pipe], so I [asked the
+Fediverse](https://social.coop/@eamon/111608458848988985) and got great
+advice[^fedi].
+
+## Why would I do this?
+
+I approach scripts differently than notebooks (whether using Jupyter or
+RMarkdown) since they're usually meant to be run non-interactively. I try to
+keep the list of packages (or modules) fairly minimal; basically, when
+deciding how to balance portability vs. code readability, I am biased a bit
+more toward portability than I am for an interactive data analysis.
+
+Nevertheless, a good script should include data quality checks and raise
+errors when problems are found. An approach I'd previously taken with the pipe
+operator introduced by R's [magrittr
+package](https://cran.r-project.org/web/packages/magrittr/index.html)[^magrittr]
+might look like this:
+
+```r
+my_data_frame %>%
+ # ( a long sequence of data munging operations would go here )
+ filter(some_condition_that_should_never_be_true) %>%
+ nrow() %>%
+ `==`(0) %>%
+ stopifnot()
+```
+
+This code would cause the R script to exit and alert the calling process
+that something went wrong, but it doesn't work as written with the native
+pipe.
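+
+A quick way to see the failure without running a whole pipeline is to hand
+the offending step to `parse()`. This is only a sketch: it assumes R 4.1.0 or
+later, and the `0L` literals are placeholders standing in for the real data.
+
+```r
+# Parse (without evaluating) the magrittr-style step rewritten for `|>`.
+# tryCatch() captures the condition instead of stopping the script.
+attempt <- tryCatch(
+  parse(text = "0L |> `==`(0L)"),
+  error = function(e) e
+)
+
+# TRUE when the native pipe rejects the backquoted operator call.
+inherits(attempt, "error")
+```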
+
+## Avoid using `==`
+
+I was convinced by a couple of responses[^responses1] that the best approach
+is to forgo the `==` operator entirely, and this advice is echoed in [R's own
+documentation](https://search.r-project.org/R/refmans/base/html/Comparison.html).
+
+* When you know that the value you're testing will be an integer or a
+ non-numeric value, the `identical()` function is better. In the example
+ above, I'd write `identical(0L)`, since this differentiates integer from
+ floating-point values, and I know `nrow()` will return an integer.
+* For other numeric values, use `all.equal()` followed by `isTRUE()`; e.g.
+ `sin(pi) |> all.equal(0) |> isTRUE()`
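+
+As a rough illustration of both bullets, here is a sketch using only base R
+(R 4.1.0 or later for `|>`). The `empty_rows` data frame is a made-up
+stand-in for a filtered data frame that has no rows left.
+
+```r
+# Hypothetical stand-in for the result of a filter() that matched nothing.
+empty_rows <- data.frame(id = integer(0))
+
+# nrow() returns an integer, so compare against the integer literal 0L.
+empty_rows |> nrow() |> identical(0L) |> stopifnot()
+
+# identical() distinguishes integer 0L from double 0, unlike `==`.
+int_vs_double <- identical(0L, 0)                   # FALSE
+
+# Floating-point results are rarely exactly equal to the expected value...
+exact_match <- sin(pi) == 0                         # FALSE
+
+# ...so compare within a tolerance, and wrap in isTRUE() because all.equal()
+# returns a character description (not FALSE) when the values differ.
+near_match <- sin(pi) |> all.equal(0) |> isTRUE()   # TRUE
+```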
+
+The first solution is what the scripts I wrote this week now use.
+
+## You _can_ still use `==`
+
+I often bristle when I see somebody ask "how do I do this?" on the internet
+only to be told "don't do that". It only "works" when the respondent
+_really_ knows what they're talking about, and internet commenters seem to
+overestimate their expertise[^dk]. Fortunately, that wasn't the case this
+time, and I think this says great things about the Fediverse's R community.
+
+That said, it is still possible to use the equality operator by
+naming the argument[^responses2]:
+
+```r
+my_data_frame |>
+ # ( a long sequence of data munging operations would go here )
+ filter(some_condition_that_should_never_be_true) |>
+ nrow() |>
+ `==`(x = _, 0) |>
+ stopifnot()
+```
+
+This approach requires R 4.2.0 or greater, the release that introduced the
+`_` placeholder (I haven't tested it in the 4.1.x family). It's nice to know
+about, because it applies in other contexts (e.g., when using other operators as functions).
+
+It's possibly worth calling out the first approach I had planned to use,
+which calls an "anonymous function"[^anon]:
+
+```r
+my_data_frame |>
+ # ( a long sequence of data munging operations would go here )
+ filter(some_condition_that_should_never_be_true) |>
+ nrow() |>
+ {\(x) x == 0}() |>
+ stopifnot()
+```
+
+## Rambling about pipes
+
+Pipe operators can be found in several other languages[^python]. The idea
+of "pipelining" operations probably "comes from" [concatenative
+languages]({{< ref "postscript-graph-paper" >}}); pipe _operators_ bring
+this ability to other paradigms, allowing one to chain a sequence of
+functions without storing intermediate values or nesting the calls. Fans of
+this style believe it can "decrease development time and improve readability
+and maintainability of code"[^benefits].
+
+In the examples here, I don't want to create a bunch of variables to store
+values that I won't use later. But also, I think I realize why I like
+pipelining: being able to fit the sequence into a single pipeline _feels_
+like writing a sentence describing the transformations I want to apply to
+the data.
+
+[^pipe]: `|>`, introduced in R 4.1.0 in May 2021.
+
+[^fedi]: Seriously, thanks to everyone who replied, most of whom probably
+ only saw my post because of its `#RStats` hashtag.
+
+[^magrittr]: See [Differences between the base R and magrittr
+ pipes](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/)
+
+[^responses1]: This advice comes from [Josep
+ Pueyo-Ros](https://github.com/jospueyo) and [Elio
+ Campitelli](https://eliocamp.github.io/).
+
+[^dk]: Dunning, D. (2011). The Dunning–Kruger effect: On being ignorant of
+ one's own ignorance. In _Advances in experimental social psychology_
+ (Vol. 44, pp. 247-296). Academic Press.
+
+[^responses2]: Thanks go to [Matt Dray](https://www.matt-dray.com/) for this
+ one.
+
+[^anon]: This syntax was also introduced in R 4.1.0.
+
+[^python]: Python is a notable exception here, and it's my opinion that it
+ would be improved with the addition of one.
+
+[^benefits]: <https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html>