commit 7428f60c4855d55a79819bb7e74a2f932931a804
Author: Eamon Caddigan <eamon.caddigan@gmail.com>
Date: Wed, 13 Jul 2022 15:50:16 -0400
Initial commit of all the R-based blog posts (that I can find)
Diffstat:
6 files changed, 1024 insertions(+), 0 deletions(-)
diff --git a/2015-09-26-violin-plots.Rmd b/2015-09-26-violin-plots.Rmd
@@ -0,0 +1,124 @@
+---
+title: "Violin plots are great"
+author: "Eamon Caddigan"
+date: '2015-09-26'
+output: html_document
+layout: post
+summary: Fiddling with a visualization technique that's not common in psychology (yet).
+categories: dataviz R psych
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(cache=TRUE, echo=TRUE, warning=FALSE, message=FALSE,
+ fig.width=8, fig.height=5, fig.align="center")
+
+source("geom_flat_violin.R")
+```
+
+Anybody who's used the ggplot2 package in R is probably familiar with `geom_violin`, which creates [violin plots](https://en.wikipedia.org/wiki/Violin_plot). I'm not going to reproduce the Wikipedia article here; just think of violin plots as sideways density plots (which themselves are basically smooth histograms). They look like this:
+
+```{r}
+library(ggplot2)
+ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
+ geom_violin()
+```
+
+I wasn't familiar with this type of plot until I started using ggplot2. Judging by a recent conversation with friends (because my friends and I are nerds), that's not uncommon among people coming from psychology (but check out Figure 1 in the [recent report on reproducability in psychology](https://osf.io/ezcuj/wiki/home/) by the Open Science Collaboration).
+
+That's a shame, because this is a useful technique; a quick glance shows viewers the shape of the distribution and lets them easily estimate its mode and range. If you also need the median and interquartile range, it's simple to display them by overlaying a box plot (violin plots are usually made this way).
+
+```{r}
+ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
+ geom_violin(aes(fill = factor(cyl))) +
+ geom_boxplot(width = 0.2)
+```
+
+### Violin plots vs. bar graphs
+
+Violin plots aren't popular in the psychology literature--at least among vision/cognition researchers. Instead, it's more common to see bar graphs, which throw away all of the information present in a violin plot.
+
+```{r}
+library(dplyr)
+mtcarsSummary <- mtcars %>%
+ group_by(cyl) %>%
+ summarize(mpg_mean = mean(mpg),
+ mpg_se = sqrt(var(mpg)/length(mpg)))
+
+ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
+ geom_bar(stat = "identity")
+```
+
+Bar graphs highlight a single statistic of the analyst's choosing. In psychology (and many other fields), researchers use bar graphs to visualize the mean of the data, and usually include error bars to show the standard error (or another confidence interval).
+
+However, when audiences see bar graphs of means, they erroneously judge values that fall *inside* a bar (i.e., below the mean) as more probable than values equidistant from the mean but outside a bar ([Newman & Scholl, 2012](http://perception.research.yale.edu/papers/12-Newman-Scholl-PBR.pdf)). This bias doesn't affect violin plots because the region inside the violin contains *all* of the observed data. I'd guess that observers will correctly intuit that values in the wider parts of the violin are more probable than those in narrower parts.
+
+The mean and standard error are only useful statistics when you assume that your data are normally distributed; bar graphs don't help you check that assumption. For small sample size studies, it's more effective to just plot every single observation ([Weissgerber et al., 2015](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128)). Till Bergmann [explored this approach](http://tillbergmann.com/blog/articles/barplots-are-pies.html) in a post that includes R code.
+
+If it's important to display the mean and standard error, these values can be overlaid on any visualization.
+
+```{r, fig.width=11, fig.height=5}
+library(gridExtra)
+
+plotBars <- ggplot(mtcarsSummary, aes(x = factor(cyl), y = mpg_mean, fill = factor(cyl))) +
+ geom_bar(aes(fill = factor(cyl)), stat = "identity") +
+ geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
+ color = "black", width = 0.4) +
+ ylim(0, 35) +
+ theme(legend.position = "none") +
+ ggtitle("Bar graph")
+
+plotPoints <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
+ geom_point(aes(y = mpg, color = factor(cyl)),
+ position = position_jitter(width = 0.25, height = 0.0),
+ alpha = 0.6) +
+ geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
+ geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
+ color = "black", width = 0.2, data = mtcarsSummary) +
+ ylim(0, 35) +
+ theme(legend.position = "none") +
+ ggtitle("Every observation")
+
+plotViolins <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
+ geom_violin(aes(y = mpg, fill = factor(cyl))) +
+ geom_point(aes(y = mpg_mean), color = "black", size = 2, data = mtcarsSummary) +
+ geom_errorbar(aes(y = mpg_mean, ymin = mpg_mean-mpg_se, ymax = mpg_mean+mpg_se),
+ color = "black", width = 0.2, data = mtcarsSummary) +
+ ylim(0, 35) +
+ theme(legend.position = "none") +
+ ggtitle("Violin plot")
+
+grid.arrange(plotBars, plotPoints, plotViolins, ncol = 3)
+```
+
+### Violin plots vs. density plots
+
+A violin plot shows the distribution's density using the width of the plot, which is symmetric about its axis, while traditional density plots use height from a common baseline. It may be easier to estimate relative differences in density plots, though I don't know of any research on the topic. More importantly, density plots (or histograms) readily display the density estimates, whereas violin plots typically don't present these.
+
+```{r}
+ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
+ geom_density(alpha = 0.6)
+```
+
+Figures like the one above quickly become crowded as the number of factors increases. It's easy to flip the coordinates and use faceting to construct figures similar to violin plots.
+
+```{r}
+# trim = TRUE trims the tails to the range of the data,
+# which is the default for geom_violin
+ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
+ geom_density(trim = TRUE) +
+ coord_flip() +
+ facet_grid(. ~ cyl)
+```
+
+I asked twitter about making plots like this without faceting. [David Robinson](http://varianceexplained.org/) came through with a [new geom that does it](https://gist.github.com/dgrtwo/eb7750e74997891d7c20). Like traditional violin plots, these toss out the density estimates--and currently only work with the development version of ggplot2--but they do the trick.
+
+```{r}
+ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
+ geom_flat_violin()
+```
+
+### Use violin plots
+
+If you're into R's base graphics (why?), it looks like the [vioplot package](http://www.inside-r.org/packages/cran/vioplot/docs/vioplot) can make violin plots without using ggplot2. Seaborn appears to bring [very powerful violin plots](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.violinplot.html) to python, but I haven't had much opportunity to explore the awesome pandas world that's emerged since I last used python for most of my analyses.
+
+Psychologists should use violin plots more often. They're ideal for displaying non-normal data, which we frequently encounter when looking at a single participant's performance (e.g., response times). More importantly, previous research--*in psychology*--has shown that viewers have difficulty interpreting bar graphs, and violin plots present a viable alternative.
diff --git a/2015-10-22-indexing-matrices.Rmd b/2015-10-22-indexing-matrices.Rmd
@@ -0,0 +1,98 @@
+---
+title: "Indexing matrix elements in R"
+author: "Eamon Caddigan"
+date: "2015-10-22"
+output: html_document
+layout: post
+summary: Why I prefer zero-based numbering.
+categories: R programming
+---
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(cache=TRUE)
+```
+
+I came to science with a background in engineering, but most of my scientist friends didn’t. I often have a hard time articulating why I’m so annoyed by one-based indexing--which R and MATLAB use, but most other programming languages don’t. Here’s a recent example that might help.
+
+For a simulation I’m running, I use the values in several of the columns of a data frame as indexes into separate vectors. Here’s an example of using indexes in one vector (instead of a column from a `data.frame`) to access the elements in another:
+
+```{r}
+valueVector <- c(2, 5, 8, 3, 0)
+indexVector <- c(2, 3, 3, 1, 4, 5, 5)
+valueVector[indexVector]
+```
+
+R veterans will point out that you can use factors for this, and that's definitely true here. However, when the values in the small vector are changing often but the indices are relatively stable, I prefer this approach. Either strategy works.
+
+Unfortunately, things aren't so easy when the data is in a matrix (a 2D vector) and you want to access its elements using two index vectors (i.e., one indexing the matrix’s rows, and the second indexing its columns). R’s default behavior might not be what you expect:
+
+```{r}
+valueMatrix <- matrix(LETTERS[1:15], ncol = 3)
+valueMatrix
+
+rowIndices <- c(4, 2, 2, 4, 4, 4)
+colIndices <- c(2, 3, 3, 2, 3, 2)
+valueMatrix[rowIndices, colIndices]
+```
+
+Instead of returning the six values associated with the six row/column pairs, it returns a 6 × 6 matrix with the elements of interest along the diagonal. When the number of elements is this small, it’s easy to wrap this in a call to `diag()`; that approach isn’t appropriate for long index vectors, since memory requirements are squared.
+
+Readers who’ve done data manipulation in C are probably familiar with pointer arithmetic. Zero-based numbering makes it easy to combine row and column indexes and make a one-dimensional array behave like a matrix:
+
+```
+/* When the code has to deal with matrices of varying size, you can’t allocate
+ an array right away. */
+double *valueMatrix;
+
+/* Stuff happened, and now you know how big your matrix will be (nrow x ncol),
+ so you allocate memory on the “heap” to store the data. */
+valueMatrix = (double *)malloc(sizeof(double) * nrow * ncol);
+
+/* This is how you’d access specific elements. */
+oldValueAtRowCol = valueMatrix[row + nrow*col];
+valueMatrix[row + nrow*col] = newValueAtRowCol;
+
+/* Can’t forget this when you’re done; I don’t miss C. */
+free(valueMatrix);
+valueMatrix = 0;
+
+/* NB: All programs work this way, but most scripting languages take care of
+ this stuff for you. Progress! */
+```
+
+To its credit, R makes this easy too; everything is stored as a one-dimensional vector behind the scenes, and can be accessed as such. R’s use of one-based indexing just makes it look a bit more awkward:
+
+```{r}
+valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)]
+```
+
+I hope this is useful for anybody who wants to access matrix elements using index vectors. More importantly, maybe some scientists-turned-programmers can gain a bit of insight into why zero-based numbering make sense to those of us who cut our teeth on C (or languages like it).
+
+***
+
+### Update
+
+After I posted this, [Andrie de Vries](http://rfordummies.com/) sent [a tweet](https://twitter.com/RevoAndrie/status/657187336833388545) sharing an alternate syntax that's more R-like:
+
+```{r}
+valueMatrix[cbind(rowIndices, colIndices)]
+```
+
+I definitely agree that it looks better, but there is a slight performance hit due to the call to `cbind()`. If you're repeatedly accessing a matrix with the same pair of indices, it might be worth it store the bound pair as a variable and reuse that. Here's the benchmark performance of each of the approaches I've discussed:
+
+```{r}
+library("microbenchmark")
+
+rowIndices <- sample(1:5, 1e4, replace=TRUE)
+colIndices <- sample(1:3, 1e4, replace=TRUE)
+boundIndices <- cbind(rowIndices, colIndices)
+
+op <- microbenchmark(diag_matrix = diag(valueMatrix[rowIndices, colIndices]),
+ pointer_math = valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)],
+ array_indexing = valueMatrix[cbind(rowIndices, colIndices)],
+ array_indexing_prebound = valueMatrix[boundIndices],
+ times = 100)
+print(op)
+```
+
+Indexing with the pre-bound pair is fastest, using arithmetic on the indexes is a close second, and calling `cbind()` inside the brackets is in third place. Creating the n × n matrix and extracting its diagonal is excessively slow (and uses up a lot of RAM), so don't ever do that.
diff --git a/2015-12-19-riddler-2.Rmd b/2015-12-19-riddler-2.Rmd
@@ -0,0 +1,101 @@
+---
+title: "Riddler #2"
+author: "Eamon Caddigan"
+date: "2015-12-19"
+output: html_document
+layout: post
+summary: Solving and simulating a probability puzzle.
+categories: R stats
+---
+
+FiveThirtyEight has a new puzzle feature called The Riddler. This week they posted [their second puzzle](http://fivethirtyeight.com/features/which-geyser-gushes-first/), which involves probability:
+
+> You arrive at the beautiful Three Geysers National Park. You read a placard
+> explaining that the three eponymous geysers — creatively named A, B and C —
+> erupt at intervals of precisely two hours, four hours and six hours,
+> respectively. However, you just got there, so you have no idea how the three
+> eruptions are staggered. Assuming they each started erupting at some
+> independently random point in history, what are the probabilities that A, B
+> and C, respectively, will be the first to erupt after your arrival?
+
+### Analytic solution
+
+Responses were due last night at midnight, so I hope I'm not spoiling anything by sharing mine.
+
+When you arrive at the park, the first eruption of geyser A is likely to occur at any time within the next two hours with a uniform probability. Denoting the number of hours until the first eruption of the geyser as “A”, A ~ Uniform(0, 2). Similarly, B ~ Uniform(0, 4) and C ~ Uniform(0, 6).
+
+To compute the probability of geyser A erupting first, separately consider the cases where geysers B and C do and do not erupt within the first two hours of waiting.
+
+Using Bayes Theorem:
+
+$$Pr(A first \cap B<2, C<2) = Pr(A first | B < 2, C < 2) Pr(B < 2, C < 2)$$
+
+Assuming independence of the geysers:
+
+$$= Pr(A first | B < 2, C < 2) Pr(B < 2) Pr(C < 2)$$
+
+Considering the case here, in which all three geysers erupt in the first two hours, the probability of any geyser erupting first is 1/3. The probability of geysers B and C erupting in the first two hours is easy to calculate.
+
+$$= (1/3) \times (1/2) \times (1/3) = 1/18$$
+
+Using similar logic:
+
+$$Pr(A first \cap B > 2, C < 2) = (1/2) \times (1/2) \times (1/3) = 1/12\\
+Pr(A first \cap B < 2, C > 2) = (1/2) \times (1/2) \times (2/3) = 1/6\\
+Pr(A first \cap B > 2, C > 2) = 1 \times (1/2) \times (2/3) = 1/3$$
+
+These disjoint events partition the sample space, so the law of total probability dictates:
+
+$$Pr(A first) = 1/18 + 1/12 + 1/6 + 1/3 = 23/36 \approx 0.6389$$
+
+Do the same thing to calculate the probability of geyser B erupting before the others:
+
+$$Pr(B first \cap B < 2, C < 2) = (1/3) \times (1/2) \times (1/3) = 1/18\\
+Pr(B first \cap B < 2, C > 2) = (1/2) \times (1/2) \times (2/3) = 1/6$$
+
+Note that when geyser B does not erupt within the first two hours, another geyser is guaranteed to erupt before it:
+
+$$Pr(B first \cap B > 2, C < 2) = 0\\
+Pr(B first \cap B > 2, C > 2) = 0$$
+
+Therefore:
+
+$$Pr(B first) = 1/18 + 1/6 = 4/18 \approx 0.2222$$
+
+And geyser C is similar:
+
+$$Pr(C first \cap B < 2, C < 2) = (1/3) \times (1/2) \times (1/3) = 1/18\\
+Pr(C first \cap B > 2, C < 2) = (1/2) \times (1/2) \times (1/3) = 1/12\\
+Pr(C first) = 1/18 + 1/12 = 5/36 \approx 0.1389$$
+
+### Quick simulation
+
+It never hurts to check results with a simulation. After all, math is hard and programming is easy.
+
+```{r, warning=FALSE, message=FALSE}
+library(dplyr)
+library(ggplot2)
+
+numSamples <- 1e6
+
+# Generate random wait times for each geyser's next eruption
+geysers <- data_frame(A = runif(numSamples, 0, 2),
+ B = runif(numSamples, 0, 4),
+ C = runif(numSamples, 0, 6))
+geyserNames <- colnames(geysers)
+
+# Identify the geyser with the smallest time-to eruption
+geysers <- geysers %>%
+ mutate(first_geyser = geyserNames[max.col(-geysers[, geyserNames])])
+
+# Plot the results
+geysers %>%
+ ggplot(aes(x = first_geyser)) + geom_bar()
+
+# Count observations and estimate probabilities
+geysers %>%
+ count(first_geyser) %>%
+ mutate(prob_first = n / numSamples)
+```
+
+Close enough!
diff --git a/2018-07-22-tufte_plot.Rmd b/2018-07-22-tufte_plot.Rmd
@@ -0,0 +1,185 @@
+---
+title: "Recreating a Tufte Slopegraph"
+author: "Eamon Caddigan"
+date: "2018-07-05"
+output: md_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+```{r load_libraries, include=FALSE, warning=FALSE, message=FALSE}
+#extrafont::loadfonts(device="win")
+suppressPackageStartupMessages(library(tibble))
+suppressPackageStartupMessages(library(dplyr))
+suppressPackageStartupMessages(library(ggplot2))
+suppressPackageStartupMessages(library(ggrepel))
+```
+
+Recently on Twitter, data visualization guru Edward R. Tufte wrote that graphics produced by R are not “publication ready”. His proposed workflow is to use statistical software to create an initial version of a plot, and then make final improvements in Adobe Illustrator.
+
+I disagree with this advice. First I’ll show the steps a data analyst might take to create a high-quality graphic entirely in R. Then, I’ll explain why I think this is a better approach.
+
+## Publication quality graphics in R
+
+Page 158 of Tufte’s classic book, [The Visual Display of Quantitative Information (2nd ed.)](https://www.edwardtufte.com/tufte/books_vdqi), features a “slope graph” that shows the change in government receipts for several countries between 1970 and 1979. Below are the first few rows of these data in a [tidy data frame](https://www.jstatsoft.org/article/view/v059i10), `receiptData`, along with a quick and dirty slopegraph.
+
+```{r receipt_data, echo=FALSE}
+receiptData <- tribble(
+ ~country, ~year, ~receipts,
+ "Belgium", 1970L, 35.2,
+ "Belgium", 1979L, 43.2,
+ "Britain", 1970L, 40.7,
+ "Britain", 1979L, 39.0,
+ "Canada", 1970L, 35.2,
+ "Canada", 1979L, 35.8,
+ "Finland", 1970L, 34.9,
+ "Finland", 1979L, 38.2,
+ "France", 1970L, 39.0,
+ "France", 1979L, 43.4,
+ "Germany", 1970L, 37.5,
+ "Germany", 1979L, 42.9,
+ "Greece", 1970L, 26.8,
+ "Greece", 1979L, 30.6,
+ "Italy", 1970L, 30.4,
+ "Italy", 1979L, 35.7,
+ "Japan", 1970L, 20.7,
+ "Japan", 1979L, 26.6,
+ "Netherlands", 1970L, 44.0,
+ "Netherlands", 1979L, 55.8,
+ "Norway", 1970L, 43.5,
+ "Norway", 1979L, 52.2,
+ "Spain", 1970L, 22.5,
+ "Spain", 1979L, 27.1,
+ "Sweden", 1970L, 46.9,
+ "Sweden", 1979L, 57.4,
+ "Switzerland", 1970L, 26.5,
+ "Switzerland", 1979L, 33.2,
+ "United States", 1970L, 30.3,
+ "United States", 1979L, 32.5
+ )
+
+receiptData %>%
+ head(6) %>%
+ knitr::kable()
+```
+```{r graph_1}
+ggplot(receiptData, aes(year, receipts, group = country)) +
+ geom_line() +
+ geom_text_repel(aes(label = country)) +
+ labs(x = "Year", y = "Government receipts as percentage of GDP")
+```
+
+This plot is not attractive, but it is useful for getting a handle on the data. Whether [iterating through an exploratory data analysis]({{< ref "/posts/2015-09-09-data-science.md" >}}) or preparing a graphic for publication, analysts will create many ugly graphics on the path to settling on a design and refining it.
+
+For our first round of improvements, we can change the aspect ratio of the graphic and arrange the country labels so they don’t overlap with the data. We should also remove the “chart junk” in the background, such as the background grid, and label only the years of interest on the x-axis.
+
+```{r graph_2, fig.height=8, fig.width=5}
+ggplot(receiptData, aes(year, receipts, group = country)) +
+ geom_line() +
+ geom_text_repel(aes(label = country),
+ data = filter(receiptData, year == 1970),
+ nudge_x = -0.5, hjust = 1,
+ direction = "y", size = 5) +
+ geom_text_repel(aes(label = country),
+ data = filter(receiptData, year == 1979),
+ nudge_x = 0.5, hjust = 0,
+ direction = "y", size = 5) +
+ scale_x_continuous(breaks = c(1970, 1979), limits = c(1966, 1983)) +
+ theme(panel.background = element_blank(),
+ axis.title = element_text(size = 16),
+ axis.text = element_text(size = 12)) +
+ labs(x = "Year", y = "Government receipts as percentage of GDP")
+```
+
+The [ggrepel](https://github.com/slowkow/ggrepel) package has done a great job preventing the labels from overlapping; I usually use `geom_text_repel()` instead of `geom_text()` during exploratory data analysis. Here, however, the segments connecting the labels to data points create confusing clutter. While these can be removed within the function, our final figure will be easier to understand if we nudge labels manually so that they’re as close to their data as possible.
+
+We can also drop the axes; the year labels will go right above the data, and we’ll show each country’s value next to its label. Since we’re losing the axis titles, we’ll also embed a caption in this version of the graphic.
+
+Finally, we’ll change the typeface. Since we’re trying to make something that would please Tufte, we’ll use a serif font. If you haven’t done so before, you may need to run `loadfonts()` from the [extrafont package](https://github.com/wch/extrafont) to tell R how to find system fonts. We’ll also further boost our “data-ink ratio” by making the lines thinner.
+
+```{r graph_3, fig.height=8, fig.width=8}
+labelAdjustments <- tribble(
+ ~country, ~year, ~nudge_y,
+ "Netherlands", 1970L, 0.3,
+ "Norway", 1970L, -0.2,
+ "Belgium", 1970L, 0.9,
+ "Canada", 1970L, -0.1,
+ "Finland", 1970L, -0.8,
+ "Italy", 1970L, 0.4,
+ "United States", 1970L, -0.5,
+ "Greece", 1970L, 0.4,
+ "Switzerland", 1970L, -0.3,
+ "France", 1979L, 0.8,
+ "Germany", 1979L, -0.7,
+ "Britain", 1979L, 0.1,
+ "Finland", 1979L, -0.1,
+ "Canada", 1979L, 0.4,
+ "Italy", 1979L, -0.5,
+ "Switzerland", 1979L, 0.2,
+ "United States", 1979L, -0.1,
+ "Spain", 1979L, 0.3,
+ "Japan", 1979L, -0.2
+ )
+
+receiptData <- left_join(receiptData, labelAdjustments,
+ by = c("country", "year")) %>%
+ mutate(receipts_nudged = ifelse(is.na(nudge_y), 0, nudge_y) + receipts)
+
+update_geom_defaults("text", list(family = "Georgia", size = 4.0))
+ggplot(receiptData, aes(year, receipts, group = country)) +
+ # Slope lines
+ geom_line(size = 0.1) +
+ # Country names (manually nudged)
+ geom_text(aes(year, receipts_nudged, label = country),
+ data = filter(receiptData, year == 1970),
+ hjust = 1, nudge_x = -3) +
+ geom_text(aes(year, receipts_nudged, label = country),
+ data = filter(receiptData, year == 1979),
+ hjust = 0, nudge_x = 3) +
+ # Data values
+ geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
+ data = filter(receiptData, year == 1970),
+ hjust = 1, nudge_x = -0.5) +
+ geom_text(aes(year, receipts_nudged, label = sprintf("%0.1f", receipts)),
+ data = filter(receiptData, year == 1979),
+ hjust = 0, nudge_x = 0.5) +
+ # Column labels
+ annotate("text", x = c(1970, 1979) + c(-0.5, 0.5),
+ y = max(receiptData$receipts) + 3,
+ label = c("1970", "1979"),
+ hjust = c(1, 0)) +
+ # Plot annotation
+ annotate("text", x = 1945, y = 58, hjust = 0, vjust = 1,
+ label = paste("Current Receipts of Government as a",
+ "Percentage of Gross Domestic\nProduct, 1970 and 1979",
+ sep = "\n"),
+ family = "Georgia", size = 4.0) +
+ coord_cartesian(xlim = c(1945, 1990)) +
+ theme(panel.background = element_blank(),
+ axis.title = element_blank(),
+ axis.text = element_blank(),
+ axis.ticks = element_blank())
+```
+
+## Why it’s worth it
+
+It took several iterations to get the aesthetics of this plot just right, while an adept user of a graphics editor would be able to recreate it in minutes. However, this initial time savings obscures several advantages to completing this process in code as opposed to switching to a tool like Illustrator.
+
+### R is free software
+
+All of the programs I used to create this graphic (and share it with you) are [free software](https://www.fsf.org/about/what-is-free-software), unlike Adobe’s terrific but expensive Illustrator. [Inkscape](https://inkscape.org/en/) is an excellent open source alternative, but it is not as powerful as its commercial competitor. On the other hand, [R](https://www.r-project.org/) is arguably the most advanced and well-supported statistical analysis tool. Even if the politics of the open source movement aren’t important, free is a better price-point.
+
+### Code can be reproduced
+
+Anybody who wants to recreate the final figure using a different set of countries, a new pair of years, or a favored economic indicator can run the code above on their own data. With minimal tweaking, they will have a graphic with aesthetics identical to those above. Reproducing this graphic using a multi-tool workflow is more complicated; a novice would first need to learn how to create a “rough draft” of it in software, and then learn how to use a graphics editor to improve its appearance.
+
+### Scripted graphics are easier to update
+
+Imagine that an analyst discovered a mistake in their analysis that needed to be corrected. Or imagine the less stressful situation where a colleague suggested replacing some of the countries in the figure. In such cases, the work of creating the graphic will need to be redone: the analyst will need to add data or correct errors, generate a new plot in software, and re-edit the plot in their graphics editor. Using a scripted, code-based workflow, the analyst would only need to do the first step, and possibly update some manual tweaks. The initial time savings afforded by a graphics editor disappears after the first time this happens.
+
+### Automation prevents errors
+
+Not only does scripted workflow make it easier to deal with errors, it guards against them. When analysts adjust graphic elements in a GUI, there’s a risk of mistakenly moving, deleting, or mislabeling data points. Such mistakes are difficult to detect, but code-based workflows avoid these risks; if the data source and analysis are error-free, then the graphic will also be.
+
+These considerations are among the reasons why scripted analyses are [considered best practice](https://peerj.com/preprints/3210/) in statistical analysis. Most of Tufte’s advice on creating graphics is excellent, but I recommend ignoring his suggestion to make the final edits in a program like Illustrator. Learning how to make polished graphics in software will ultimately save analysts time and help them avoid mistakes.
+\ No newline at end of file
diff --git a/2022-07-08-labeling-plots.Rmd b/2022-07-08-labeling-plots.Rmd
@@ -0,0 +1,92 @@
+---
+title: "Labeling bar charts (and other graphics) in ggplot2"
+description: "How to combine stats and geoms to simplify complex plots."
+date: 2022-07-08T19:32:57-04:00
+draft: False
+knit: (function(input, ...) {
+ rmarkdown::render(
+ input,
+ output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
+ )
+ })
+output:
+ md_document:
+ variant: markdown
+ preserve_yaml: true
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(
+ echo = TRUE,
+ fig.path = file.path("figs",
+ sub("\\.Rmd$", "",
+ basename(rstudioapi::getActiveDocumentContext()$path)),
+ "")
+)
+knitr::opts_knit$set(
+ base.dir = file.path(Sys.getenv("HUGO_ROOT"), "static"),
+ base.url = "/"
+)
+```
+
+Bar charts are [rightfully criticized for being potentially misleading](https://www.kickstarter.com/projects/1474588473/barbarplots), but they're still useful for some data graphics. They're a great way to display counts, for example, and non-technical audiences are comfortable interpreting them.
+
+One way that a typical bar plot can be improved is by removing unnecessary axis labels and directly labeling the bars themselves. For example, if we wanted to see how many counties are in some of the states comprising the Midwestern US, we could use the `midwest` data set that's packaged with ggplot2 and draw a simple bar plot.
+
+```{r simple_bar}
+library(ggplot2)
+
+ggplot(midwest, aes(x = state)) +
+ geom_bar() +
+ labs(x = "", y = "Number of counties") +
+ theme_minimal() +
+ theme(axis.text.x = element_text(size = 16),
+ axis.title.y = element_text(size = 14))
+```
+
+This is fine (provided that you can accept that Ohio is part of the Midwest and Iowa isn't). However, we can easily strip away some of the "chart junk" by removing unnecessary theme elements and labeling the bars using `geom_text`.
+
+```{r better_plot}
+ggplot(midwest, aes(state)) +
+ geom_bar() +
+ geom_text(aes(y = ..count.., label = ..count..), stat = "count",
+ vjust = 0, nudge_y = 2, size = 8) +
+ expand_limits(y = 110) +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ axis.line = element_blank(),
+ axis.ticks = element_blank(),
+ axis.title = element_blank(),
+ axis.text.y = element_blank(),
+ axis.text.x = element_text(size = 16)) +
+ labs(title = "Number of counties in Midwestern states")
+```
+
+There's still room for improvement—I'd consider rearranging the bars based on something more meaningful than alphabetic order, and possibly coloring them—but this simultaneously conveys more information than the previous plot (objectively) and is less "cluttered" (subjectively).
+
+## How this works
+
+I'm surprised I'm still writing blog posts about ggplot2 after seven years, but I continue to learn new things about it even after regular use. Recently, I decided that it bothered me that there were "stats" (like `stat_ecdf`) and "geoms" (like `geom_bar`), and I really didn't understand the difference between them—they seemed interchange but not identical. I finally looked into it and came across [this response on StackOverflow](https://stackoverflow.com/a/44226841), which contained the following quote from the [ggplot2 book](https://ggplot2-book.org/):
+
+> You only need to set one of stat and geom: every geom has a default stat, and every stat has a default geom.
+
+It turns out that "stats" and "geoms" _are_ largely interchangeable, because they're both wrappers around `layer()`, with different ways of handling defaults. The code above works because I changed the `stat` parameter of `geom_text` (which, according to [the documentation](https://ggplot2.tidyverse.org/reference/geom_text.html), defaults to `"identity"`) to `"count"`, which is the default `stat` of `geom_bar`. Looking at [the documentation for `stat_count`](https://ggplot2.tidyverse.org/reference/geom_bar.html), we see that it provides two _computed variables_, and we use use the computed `count` variable in our aesthetics for `geom_text` to provide both the vertical position and label.
+
+Getting my head around this has helped me iterate through analyses more quickly. For instance, I frequently use `stat_ecdf` to get a handle on distributions, and now I can label these plots easily. Sticking with data about the Midwest, here's the ECDF of population density for Midwestern counties, with a few values highlighted.
+
+```{r ecdf}
+ggplot(midwest, aes(popdensity)) +
+ stat_ecdf() + # This could also be `geom_step(stat = "ecdf")`
+ geom_text(aes(label = sprintf("(%.0f: %.0f%%)", ..x.., floor(100 * ..y..))),
+ stat = "ecdf", hjust = 1, vjust = 0, check_overlap = TRUE, size = 5) +
+ expand_limits(x = -15000) +
+ scale_y_continuous(labels = scales::label_percent()) +
+ theme_minimal() +
+ labs(x = "Population density", y = "CDF")
+```
+
+I wouldn't call this a _beautiful_ graphic, but visualizations like this are often _useful_, especially when getting a handle on a new data set. Here we see that over a third of Midwestern counties have population densities fully two orders of magnitude smaller than the most densely populated county, and we've only had to draw a single plot.
+
+It's hard to justify putting much effort into producing plots that aren't meant to be shared, and a typical data analysis will involve creating dozens of plots that are only useful for the person analyzing the data. Without understanding stats and geoms, I would never have bothered labeling an ECDF plot; "quick and dirty" needs to be quick after all. On the other hand, labeling the values in a bar chart is something that one should do when preparing publication-quality graphics. ggplot2 is useful in both phases of a project.
+
+In the words of Alan Kay, ["simple things should be simple, complex things should be possible."](https://www.quora.com/What-is-the-story-behind-Alan-Kay-s-adage-Simple-things-should-be-simple-complex-things-should-be-possible) ggplot2 does both of these consistently. Occasionally, it even makes complex things simple.
+\ No newline at end of file
diff --git a/2022-07-13-instagram.Rmd b/2022-07-13-instagram.Rmd
@@ -0,0 +1,421 @@
+---
+title: "Post types and sentiment from my Instagram feed"
+description: "Does Instagram spark joy? A look at what I'm seeing and how I feel while using this platform."
+date: 2022-07-13T12:28:22-04:00
+draft: False
+knit: (function(input, ...) {
+ rmarkdown::render(
+ input,
+ output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
+ )
+ })
+output:
+ md_document:
+ variant: markdown
+ preserve_yaml: true
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(
+ echo = FALSE,
+ fig.path = file.path("figs",
+ sub("\\.Rmd$", "",
+ basename(rstudioapi::getActiveDocumentContext()$path)),
+ "")
+)
+knitr::opts_knit$set(
+ base.dir = file.path(Sys.getenv("HUGO_ROOT"), "static"),
+ base.url = "/"
+)
+```
+```{r packages, include=FALSE}
+library(dplyr)
+library(ggplot2)
+library(forcats)
+```
+```{r data}
+ig_dat <- tibble::tribble(
+ ~date, ~medium, ~sentiment, ~poster_category, ~poster_followed,
+ "2022-07-09", "Video", 1L, "Business (Friend's)", TRUE,
+ "2022-07-09", "Photo", 2L, "Friend", TRUE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Image", 1L, "Meme account", TRUE,
+ "2022-07-09", "Photo", 2L, "Friend", TRUE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Photo", 1L, "Friend", TRUE,
+ "2022-07-09", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-09", "Video", 0L, "Meme account", TRUE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-09", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-09", "Video", 0L, "Ad", FALSE,
+ "2022-07-09", "Video", 1L, "Business (Friend's)", TRUE,
+ "2022-07-09", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-09", "Image", 1L, "Meme account", TRUE,
+ "2022-07-09", "Video", -1L, "Ad", FALSE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Image", 0L, "Meme account", TRUE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", -2L, "Ad", FALSE,
+ "2022-07-09", "Photo", 2L, "Organization", TRUE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Image", 1L, "Business (Friend's)", TRUE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", 0L, "Artist", TRUE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Photo", 2L, "Friend", TRUE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Image", 2L, "Artist", TRUE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", 1L, "Organization", TRUE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Video", -1L, "Ad", FALSE,
+ "2022-07-09", "Photo", 1L, "Organization", TRUE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Image", 2L, "Artist", TRUE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", 0L, "Organization", TRUE,
+ "2022-07-09", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-09", "Video", -2L, "Ad", FALSE,
+ "2022-07-09", "Photo", 2L, "Friend", TRUE,
+ "2022-07-09", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-09", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-10", "Image", 2L, "Artist", TRUE,
+ "2022-07-10", "Image", 0L, "Meme account", TRUE,
+ "2022-07-10", "Image", 1L, "Meme account", TRUE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Photo", 1L, "Friend", TRUE,
+ "2022-07-10", "Video", 1L, "Friend", TRUE,
+ "2022-07-10", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-10", "Image", 0L, "Meme account", TRUE,
+ "2022-07-10", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-10", "Image", 2L, "Friend", TRUE,
+ "2022-07-10", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Photo", 1L, "Friend", TRUE,
+ "2022-07-10", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-10", "Image", 1L, "Meme account", TRUE,
+ "2022-07-10", "Video", -1L, "Ad", FALSE,
+ "2022-07-10", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-10", "Image", 1L, "Friend", TRUE,
+ "2022-07-10", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Photo", 1L, "Organization", TRUE,
+ "2022-07-10", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 0L, "Business (Friend's)", TRUE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-10", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-10", "Image", -1L, "Ad", FALSE,
+ "2022-07-10", "Image", 1L, "Artist", TRUE,
+ "2022-07-10", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 2L, "Friend", TRUE,
+ "2022-07-10", "Image", -1L, "Ad", FALSE,
+ "2022-07-10", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 1L, "Organization", TRUE,
+ "2022-07-10", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Video", -1L, "Organization", TRUE,
+ "2022-07-10", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 2L, "Friend", TRUE,
+ "2022-07-10", "Video", -1L, "Ad", FALSE,
+ "2022-07-10", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-10", "Image", 1L, "Artist", TRUE,
+ "2022-07-10", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Video", 2L, "Friend", TRUE,
+ "2022-07-10", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-10", "Video", 1L, "Friend", TRUE,
+ "2022-07-10", "Video", -2L, "Ad", FALSE,
+ "2022-07-10", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-10", "Photo", 2L, "Friend", TRUE,
+ "2022-07-11", "Photo", 0L, "Friend", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Video", 0L, "Business (Friend's)", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Video", 2L, "Artist", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Photo", 0L, "Organization", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 1L, "Organization", TRUE,
+ "2022-07-11", "Photo", 2L, "Friend", TRUE,
+ "2022-07-11", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 1L, "Friend", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 2L, "Organization", TRUE,
+ "2022-07-11", "Video", -2L, "Ad", FALSE,
+ "2022-07-11", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-11", "Image", 2L, "Friend", TRUE,
+ "2022-07-11", "Image", 2L, "Artist", TRUE,
+ "2022-07-11", "Video", -2L, "Ad", FALSE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 2L, "Friend", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Video", -2L, "Ad", FALSE,
+ "2022-07-11", "Photo", 0L, "Organization", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Image", 0L, "Artist", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-11", "Video", -1L, "Friend", TRUE,
+ "2022-07-11", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-11", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 1L, "Friend", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 0L, "Organization", TRUE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-11", "Video", 0L, "Ad", FALSE,
+ "2022-07-11", "Photo", 2L, "Friend", TRUE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 0L, "Meme account", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-11", "Image", 0L, "Friend", TRUE,
+ "2022-07-11", "Video", -1L, "Ad", FALSE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-11", "Photo", 2L, "Friend", TRUE,
+ "2022-07-11", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 2L, "Friend", TRUE,
+ "2022-07-12", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-12", "Video", -1L, "Ad", FALSE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 0L, "Friend", TRUE,
+ "2022-07-12", "Photo", 1L, "Friend", TRUE,
+ "2022-07-12", "Image", 0L, "Meme account", TRUE,
+ "2022-07-12", "Photo", 0L, "Ad", FALSE,
+ "2022-07-12", "Image", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 2L, "Friend", TRUE,
+ "2022-07-12", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-12", "Video", -2L, "Ad", FALSE,
+ "2022-07-12", "Photo", 2L, "Friend", TRUE,
+ "2022-07-12", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Image", 1L, "Organization", TRUE,
+ "2022-07-12", "Video", -1L, "Ad", FALSE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Friend", TRUE,
+ "2022-07-12", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Image", -1L, "Ad", FALSE,
+ "2022-07-12", "Image", 0L, "Business (Friend's)", TRUE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-12", "Video", 0L, "Ad", FALSE,
+ "2022-07-12", "Image", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-12", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Video", -2L, "Ad", FALSE,
+ "2022-07-12", "Image", 0L, "Organization", TRUE,
+ "2022-07-12", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Friend", TRUE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Video", -2L, "Ad", FALSE,
+ "2022-07-12", "Photo", 1L, "Meme account", TRUE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Organization", TRUE,
+ "2022-07-12", "Image", -2L, "Ad", FALSE,
+ "2022-07-12", "Video", -1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 0L, "Artist", TRUE,
+ "2022-07-12", "Video", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Video", 1L, "Ad", FALSE,
+ "2022-07-12", "Image", 2L, "Artist", TRUE,
+ "2022-07-12", "Video", -2L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Friend", TRUE,
+ "2022-07-12", "Video", -2L, "Ad", FALSE,
+ "2022-07-12", "Image", 0L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 1L, "Friend", TRUE,
+ "2022-07-12", "Video", 1L, "Suggested post", FALSE,
+ "2022-07-12", "Photo", 0L, "Business (Friend's)", TRUE,
+ "2022-07-12", "Video", -2L, "Ad", FALSE
+ ) %>%
+ mutate(poster_category = sub(" \\(Friend's\\)", "", poster_category))
+```
+
+I've been critical of social media platforms (in spite of my continued use of them), but I've traditionally defended Instagram. I knew that it had major problems (most galling, [internal research showed that use of the platform was hurting teenage girls' body image, and Meta suppressed the findings](https://www.wsj.com/articles/facebook-knows-instagram-is-toxic-for-teen-girls-company-documents-show-11631620739?mod=article_inline)), but browsing IG has nevertheless been a pleasant experience for me. Photos of my friends' pets, children, and vacations are _nice things_, and being exposed to them _made me happy_.
+
+Recently, Instagram instituted changes to the app's main feed, which were [announed by Adam Mosseri, the head of Instagram, on Twitter](https://twitter.com/mosseri/status/1521589403671355392). I feel like Instagram stopped being as fun for me shortly after these changes were rolled out, so I collected a bit of data.
+
+## Method
+
+Over the course of four days (July 9–12, 2022), I only opened the app once per day, on my phone. I looked at the first 50 posts on my feed, and logged a couple features, described below, about each one. I also gave each post a subjective "sentiment score" using a five-point scale from -2 (for posts I actively disliked) to 2 (for posts I quite liked), with a score of 0 being neutral.
+
+## Results
+
+What does Instagram show me? Let's first look at "poster category" counts, which shows who the posts come from.
+
+```{r poster_category, fig.height=4, fig.width=7}
+caption_text <- "Data from 200 posts viewed over 4 days (50 posts/day)"
+
+ig_dat %>%
+ mutate(poster_category = fct_rev(fct_infreq(poster_category))) %>%
+ ggplot(aes(x = poster_category)) +
+ geom_bar(aes(fill = !poster_followed)) +
+ geom_text(aes(label = ..count..), stat = "count",
+ hjust = 0, nudge_y = 1, size = 6) +
+ scale_fill_brewer(palette = "Dark2") +
+ expand_limits(y = 75) +
+ coord_flip() +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ plot.title.position = "plot",
+ plot.caption.position = "plot",
+ legend.position = "none",
+ axis.line = element_blank(),
+ axis.ticks = element_blank(),
+ axis.title = element_blank(),
+ axis.text.x = element_blank(),
+ axis.text.y = element_text(size = 16,
+ margin = margin(r = -16))) +
+ labs(title = "Poster categories in my Instagram feed",
+ caption = caption_text)
+```
+
+Most of the 410 accounts I follow are my friends'. I also follow a few _artists_ (a combination of musicians, visual artists, and authors), _organizations_ (primarily political and community orgs), _businesses_ (all of which are run by friends—a couple bakers, a couple shop owners), and a small handful of _meme accounts_ (e.g., "dnddads"). Subjectively, meme accounts are over-represented on my feed; I follow hundreds of actual people and only around five meme accounts, but the latter account for over half as many posts as the former.
+
+The biggest issue here is that the majority of posts come from _accounts I don't follow_. I understand that Instagram is an ad-supported business, but I see more ads than posts from friends. Worse still, I see almost twice as many "suggested posts" as I do posts from friends.
+
+What types of content are represented by these posts?
+
+```{r medium, fig.height=3, fig.width=7}
+ig_dat %>%
+ mutate(medium = fct_rev(fct_infreq(medium))) %>%
+ ggplot(aes(x = medium)) +
+ geom_bar(aes(fill = medium)) +
+ geom_text(aes(label = ..count..), stat = "count",
+ hjust = 0, nudge_y = 1, size = 6) +
+ scale_fill_brewer(palette = "Paired") +
+ expand_limits(y = 120) +
+ coord_flip() +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ plot.title.position = "plot",
+ plot.caption.position = "plot",
+ legend.position = "none",
+ axis.line = element_blank(),
+ axis.ticks = element_blank(),
+ axis.title = element_blank(),
+ axis.text.x = element_blank(),
+ axis.text.y = element_text(size = 20,
+ margin = margin(r = -20))) +
+ labs(title = "Media present in my Instagram feed",
+ caption = caption_text)
+```
+
+Mosseri announced that Meta wanted to make video a bigger part of the Instagram experience, and they seem to have delivered. Of the 200 posts I logged, the majority were video. Photographs, which the app was originally designed for, account for a little over a quarter of the posts, with other types of image (drawings, illustrations, etc.) accounting for the rest.
+
+What's the relationship between _poster category_ and _post medium_?
+
+```{r poster_media, fig.height=4}
+ig_dat %>%
+ mutate(poster_category = fct_rev(fct_infreq(poster_category))) %>%
+ ggplot(aes(x = poster_category)) +
+ geom_bar(aes(fill = medium)) +
+ geom_text(aes(label = ..count..), stat = "count",
+ hjust = 0, nudge_y = 1, size = 6) +
+ scale_fill_brewer(palette = "Paired") +
+ expand_limits(y = 75) +
+ coord_flip() +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ plot.title.position = "plot",
+ plot.caption.position = "plot",
+ legend.title = element_text(size = 12),
+ legend.text = element_text(size = 12),
+ legend.position = c(0.8, 0.2),
+ axis.line = element_blank(),
+ axis.ticks = element_blank(),
+ axis.title = element_blank(),
+ axis.text.x = element_blank(),
+ axis.text.y = element_text(size = 16,
+ margin = margin(r = -16))) +
+ labs(title = "Post medium and poster category in my Instagram feed",
+ caption = caption_text,
+ fill = "Post medium")
+```
+
+There's a tight correspondence between what medium is employed for a post and who's posting it. Videos are being pushed by accounts I don't follow, while my friends continue to prioritize photographs.
+
+Since post medium and poster category are so tightly linked, I'll focus on looking at the sentiment by poster category—it would be difficult to tease apart how I feel about photos vs. videos and how I feel about posts from accounts I do vs. do not follow.
+
+```{r category_sentiment}
+ig_dat %>%
+ mutate(poster_category = fct_rev(fct_infreq(poster_category))) %>%
+ ggplot(aes(poster_category, sentiment)) +
+ geom_boxplot(aes(fill = !poster_followed)) +
+ scale_fill_brewer(palette = "Dark2") +
+ coord_flip() +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ plot.title.position = "plot",
+ plot.caption.position = "plot",
+ legend.position = "none",
+ axis.line = element_blank(),
+ axis.ticks = element_blank(),
+ axis.title.x = element_text(size = 16),
+ axis.title.y = element_blank(),
+ axis.text.x = element_text(size = 12),
+ axis.text.y = element_text(size = 16,
+ margin = margin(r = -16))) +
+ labs(title = "Instagram sentiment by poster category",
+ caption = caption_text,
+ y = "Subjective sentiment")
+```
+
+These box-and-whisker plots show the distribution of sentiment scores for each post category. The heavy vertical line indicates the median score, and the boxes extend from the first through third quartiles (the "interquartile range", or IQR). The "whiskers" cover the full range of scores, unless any observation is identified as a potential "outlier". Outliers are defined as observations with values more than 150% of the value of the IQR above or below the third or first quartile, respectively. For example, I was bothered by a photo of a dead bird which was posted by a friend, and gave it a score of -1. The first and third quartiles of sentiment scores for posts from friends are 1 and 2, providing an IQR of 1. Since -1 is less than 1 (the first quartile) - 1.5 (150% of 1, the IQR), that score is an outlier.
+
+I'm not surprised that I like posts from the accounts I follow more than those from accounts I don't. It's disappointing that suggested posts aren't much nicer to see than ads; these are supposed to make the platorm more "immersive" and "engaging" and fail to achieve that.
+
+In an attempt to quantify the "experience" of browsing Instagram, I looked at the cumulative sentiment scores across each of the four days. There's no doubt that the actual dynamics of my affect, to the extent that it can even be quantified, is more complicated than +2 + -2 = 0. However, this is a good first approximation of what it feels like to scroll through a feed.
+
+```{r cumulative_sentiment}
+ig_sentiment <- ig_dat %>%
+ group_by(date) %>%
+ mutate(seq = row_number(date),
+ cum_sent = cumsum(sentiment))
+
+ig_sentiment %>%
+ ggplot(aes(seq, cum_sent)) +
+ geom_hline(yintercept = 0, linetype = "dashed") +
+ geom_step(aes(color = date), size = 1) +
+ geom_text(aes(label = cum_sent), data = filter(ig_sentiment, seq == 50),
+ hjust = 0, nudge_x = 0.5, size = 5) +
+ scale_color_brewer(palette = "Set2") +
+ theme(panel.background = element_blank(),
+ panel.grid = element_blank(),
+ plot.title.position = "plot",
+ plot.caption.position = "plot",
+ legend.title = element_text(size = 12),
+ legend.text = element_text(size = 12),
+ legend.position = "none",
+ axis.line.y = element_blank(),
+ axis.ticks.y = element_blank(),
+ axis.title = element_blank(),
+ #axis.text.x = element_blank(),
+ axis.text.y = element_blank()) +
+ labs(title = "Cumulative sentiment while browsing Instagram",
+ caption = caption_text)
+```
+
+Each of the four days has a negative cumulative sentiment score, which indicates that I see more posts that I dislike than posts that I like. There's also a suggestion of a pattern, in which my sentiment stays relatively flat and positive over the first 20–30 posts, and then drops.
+
+I wonder if this is the result of an intentional design decision by Instagram. If they have an internal model of what posts I might like, they could certainly time them in such a way to encourage longer stretches of engagement, and they have the data to figure out what the optimal timing would look like. On the other hand, this is a purely subjective score that I didn't attempt to carefully calibrate, so it's worth considering that I might just get tired of my feed after a couple dozen posts.
+
+## Conclusions
+
+I don't know if it's worth an attempt to use my experience to formulate suggestions to Instagram (or their would-be competitors). I'm approaching middle age, and I don't spend much money on the internet; Instagram can get stronger engagement and much better ad conversion rates from people who aren't like me. Perhaps I'm an unprofitable user and this is their attempt to nudge me off the platform?
+
+However, this has been a useful exercise for me, because it suggests a few things I could change right now to improve my experience:
+
+* Stop following meme accounts and organizations (these posts provide a neutral-to-good experience, but they potentially crowd out posts from friends and artists that I would enjoy more)
+* Stop scrolling my feed after the first 20 or so posts (cumulative sentiment appears to begin dropping after this point)
+
+Most importantly, it calls into question whether I should even continue using the platform at all. The maximum cumulative sentiment that I experienced was only six (which corresponds to seeing just three things that I really like), and that was only achieved on two of the four days that I measured this. As much as I enjoy seeing photos from my friends (and the data show that I really do enjoy this), it's hard to conclude that the experience of using Instagram is joyful.
+\ No newline at end of file