www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git

commit f167ba63f9681c14a50641b09d0c57779eb36476
parent 953f91399b4e966e9ea36f7dd895d24c5a629853
Author: Eamon Caddigan <eamon.caddigan@gmail.com>
Date:   Fri, 22 Dec 2023 10:42:00 -0800

Post about R pipelines

Diffstat:
A content/posts/r-pipe-equality/index.md | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 135 insertions(+), 0 deletions(-)

diff --git a/content/posts/r-pipe-equality/index.md b/content/posts/r-pipe-equality/index.md
@@ -0,0 +1,135 @@
+---
+title: "Checking Equality in an R Pipeline"
+date: 2023-12-22T10:24:40-08:00
+draft: false
+categories:
+- Programming
+- Data Science
+tags:
+- R
+---
+
+I was wondering how to incorporate a test for equality into a pipeline using
+R's new(_ish_) native forward pipe operator[^pipe], so I [asked the
+Fediverse](https://social.coop/@eamon/111608458848988985) and got great
+advice[^fedi].
+
+## Why would I do this?
+
+I approach scripts differently than notebooks (whether using Jupyter or
+RMarkdown) since they're usually meant to be run non-interactively. I try to
+keep the list of packages (or modules) fairly minimal; basically, when
+deciding how to balance portability vs. code readability, I am biased a bit
+more toward portability than I am for an interactive data analysis.
+
+Nevertheless, a good script should include data quality checks and raise
+errors when they're found. An approach I'd previously taken with the pipe
+operator introduced by R's [magrittr
+package](https://cran.r-project.org/web/packages/magrittr/index.html)[^magrittr]
+might look like this:
+
+```r
+my_data_frame %>%
+  # ( a long sequence of data munging operations would go here )
+  filter(some_condition_that_should_never_be_true) %>%
+  nrow() %>%
+  `==`(0) %>%
+  stopifnot()
+```
+
+This code would cause the R script to exit and alert the calling process
+that something went wrong, but it doesn't work as written with the native
+pipe.
+
+## Avoid using `==`
+
+I was convinced by a couple responses[^responses1] that the best approach is
+to forego the `==` operator entirely, and this advice is echoed in [R's own
+documentation](https://search.r-project.org/R/refmans/base/html/Comparison.html).
+
+* When you know that the value you're testing will be an integer or a
+  non-numeric value, the `identical()` function is better. In the example
+  above, I'd write `identical(0L)`, since this differentiates integer from
+  floating-point values, and I know `nrow()` will return an integer.
+* For other numeric values, use `all.equal()` followed by `isTRUE()`; e.g.
+  `sin(pi) |> all.equal(0) |> isTRUE()`
+
+The first solution is what the scripts I wrote this week now use.
+
+## You _can_ still use `==`
+
+I often bristle when I see somebody ask "how do I do this?" on the internet
+only to be told "don't do that". It only "works" when the respondent
+_really_ knows what they're talking about, and internet commenters seem to
+overestimate their expertise[^dk]. Fortunately, that wasn't the case this
+time, and I think this says great things about the Fediverse's R community.
+
+That said, it is still possible to use the equality operator by
+naming the argument[^responses2]:
+
+```r
+my_data_frame |>
+  # ( a long sequence of data munging operations would go here )
+  filter(some_condition_that_should_never_be_true) |>
+  nrow() |>
+  `==`(x = _, 0) |>
+  stopifnot()
+```
+
+I believe this approach requires R 4.2.0 or greater, but I haven't tested it
+in the 4.1.x family. It's nice to know about this, because it applies in
+other contexts (e.g., when using other operators as functions).
+
+It's possibly worth calling out the first approach I had planned to use,
+which calls an "anonymous function"[^anon]:
+
+```r
+my_data_frame |>
+  # ( a long sequence of data munging operations would go here )
+  filter(some_condition_that_should_never_be_true) |>
+  nrow() |>
+  {\(x) x == 0}() |>
+  stopifnot()
+```
+
+## Rambling about pipes
+
+Pipe operators can be found in several other languages[^python]. The idea
+of "pipelining" operations probably "comes from" [concatenative
+languages]({{< ref "postscript-graph-paper" >}}); pipe _operators_ bring
+this ability to other paradigms, allowing one to chain a sequence of
+functions without storing intermediate values or nesting the calls. Fans of
+this style believe it can "decrease development time and improve readability
+and maintainability of code"[^benefits].
+
+In the examples here, I don't want to create a bunch of variables to store
+values that I won't use later. But also, I think I realize why I like
+pipelining: being able to fit the sequence into a single pipeline _feels_
+like writing a sentence describing the transformations I want to apply to
+the data.
+
+[^pipe]: `|>`, introduced in R 4.1.0 in May 2021.
+
+[^fedi]: Seriously, thanks to everyone who replied, most of whom probably
+  only saw my post because of its `#RStats` hashtag.
+
+[^magrittr]: See [Differences between the base R and magrittr
+  pipes](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/)
+
+[^responses1]: This advice comes from [Josep
+  Pueyo-Ros](https://github.com/jospueyo) and [Elio
+  Campitelli](https://eliocamp.github.io/).
+
+[^dk]: Dunning, D. (2011). The Dunning–Kruger effect: On being ignorant of
+  one's own ignorance. In _Advances in experimental social psychology_
+  (Vol. 44, pp. 247-296). Academic Press.
+
+[^responses2]: Thanks go to [Matt Dray](https://www.matt-dray.com/) for this
+  one.
+
+[^anon]: This syntax was also introduced in R 4.1.0.
+
+[^python]: Python is a notable exception here, and it's my opinion that it
+  would be improved with the addition of one.
+
+[^benefits]: <https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html>
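
The comparison advice in the post can be illustrated with a small base R sketch. It is not part of the commit; it assumes R 4.1.0 or later for the native pipe, and `n_bad` is a hypothetical stand-in for the integer that `nrow()` would return at the end of the pipeline.

```r
# Hypothetical stand-in for the result of nrow(); nrow() returns an integer.
n_bad <- 0L

# `==` compares values after coercion, so integer 0L equals double 0.
int_eq_dbl <- 0L == 0

# identical() also compares types, so integer and double zero differ.
int_identical_dbl <- identical(0L, 0)
int_identical_int <- identical(n_bad, 0L)

# sin(pi) is very close to zero but not exactly zero in floating point.
exact <- sin(pi) == 0

# all.equal() compares within a tolerance; isTRUE() is needed because
# all.equal() returns a character description rather than FALSE on mismatch.
approx <- sin(pi) |> all.equal(0) |> isTRUE()

# The pipeline form of the check: passes silently when n_bad is integer zero.
n_bad |> identical(0L) |> stopifnot()
```

Because `identical()` takes the piped value as its first argument, it drops into the native pipe without a placeholder or a named argument, which is part of why it is convenient here.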