commit f5a65b0d7b81897286e25fa962809aec20408356
parent f9f3b5d81b534535306775177cb89c8044e1f891
Author: eamoncaddigan <eamon.caddigan@gmail.com>
Date: Thu, 17 Sep 2015 15:01:11 -0400
Text and code cleanup.
Diffstat:
1 file changed, 26 insertions(+), 23 deletions(-)
diff --git a/antivax-bootstrap.Rmd b/antivax-bootstrap.Rmd
@@ -79,11 +79,11 @@ Some of my friends offered insightful comments, and one [pointed out](https://tw
![Posteror of final score differences](https://pbs.twimg.com/media/COE2e8bUkAEeu8b.png:large)
-# Bootstrapping
+### Bootstrapping
-Interpreting differences in parameter values isn't always straightforward, so I thought it'd be worthwhile to try a different approach. Instead fitting a generative model to the sampled data, we can use bootstrapping on the sample to estimate population parameters. [Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is conceptually simple; I feel it would have much wider adoption today had computers been around in the early days of statistics.
+Interpreting differences in parameter values isn't always straightforward, so I thought it'd be worthwhile to try a different approach. Instead fitting a generative model to the sample, we can use bootstrapping to estimate the unobserved population parameters. [Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is conceptually simple; I feel it would have much wider adoption today had computers been around in the early days of statistics.
-Here is code that uses bootstrapping to estimate the probability of each response on the pre-intervention survey (irrespective of question on intervention group assignment). The sample mean is already an unbiased estimator of the population mean, so bootstrapping isn't necessary in this first example. However, this provides a simple illustration of how the technique works: sample observations *with replacement* from the observed data, calculate a statistic on this new data, and repeat.
+Here is code that uses bootstrapping to estimate the probability of each response on the pre-intervention survey (irrespective of survey question or intervention group assignment). The sample mean is already an unbiased estimator of the population mean, so bootstrapping isn't necessary in this first example. However, this provides a simple illustration of how the technique works: sample observations *with replacement* from the data, calculate a statistic on this new data, and repeat.
```{r setup_bootstrap, dependson="setup_data", echo=TRUE}
numBootstraps <- 1e3 # Should be a big number
@@ -92,18 +92,18 @@ uniqueResponses <- sort(unique(questionnaireData$pretest_response))
interventionLevels <- levels(questionnaireData$intervention)
# The observed proportion of responses at each level
-obsPretestResponseProbabilities <- as.numeric(table(questionnaireData$pretest_response)) /
+pretestResponseProbs <- as.numeric(table(questionnaireData$pretest_response)) /
numObservations
```
```{r pretest_bootstrap, dependson="setup_bootstrap", echo=TRUE}
# Bootstrap to find the probability that each response will be given to pre-test
# questions.
-
-pretestData <- matrix(data = 0,
- nrow = numBootstraps,
- ncol = length(uniqueResponses))
-colnames(pretestData) <- paste(uniqueResponses)
+pretestData <- array(data = 0,
+ dim = c(numBootstraps,
+ length(uniqueResponses)),
+ dimnames = list(NULL,
+ paste(uniqueResponses)))
# Run the bootstrap
for (ii in seq_len(numBootstraps)) {
@@ -120,9 +120,10 @@ pretestData <- pretestData / numObservations
```{r pretest_plot, dependson="pretest_bootstrap", fig.width=4, fig.height=4}
pretestDF <- data_frame(response = uniqueResponses,
- bootstrap_prob = apply(pretestData, 2, mean),
- bootstrap_sd = apply(pretestData, 2, sd),
- observed_prob = obsPretestResponseProbabilities)
+ bootstrap_prob = apply(pretestData, 2, mean),
+ bootstrap_sd = apply(pretestData, 2, sd),
+ observed_prob = pretestResponseProbs)
+
ggplot(pretestDF, aes(x = response)) +
geom_bar(aes(y = observed_prob), stat = "identity",
color="white", fill="skyblue") +
@@ -141,10 +142,11 @@ As expected, the bootstrap estimates for the proportion of responses at each lev
## Changes in vaccination attitudes
-Due to chance alone, the three groups of participants (the control group, the "autism correction" group, and the "disease risk" group) had different patterns of responses to the pre-intervention survey. To mitigate this, the code below estimates the transition probabilities from each response on the pre-intervention survey to each response on the post-intervention survey, and does so separately for each group. These are conditional probabilities, e.g., P(post-intervention rating = 4 | pre-intervention rating = 3). These conditional probabilities are then combined with the observed pre-intervention response probabilities to calculate the joint probability of each response transition (e.g., P(post-intervention rating = 4 AND pre-intervention rating = 3)). Since the prior is agnostic to subjects' group assignment, these joint probability estimates are free from any biases due to a failure of random assignment.
+Due to chance alone, the three groups of participants (the control group, the "autism correction" group, and the "disease risk" group) had different patterns of responses to the pre-intervention survey. To mitigate this, the code below estimates the transition probabilities from each response on the pre-intervention survey to each response on the post-intervention survey, and does so separately for each group. These are conditional probabilities, e.g., P(post-intervention rating = 4 | pre-intervention rating = 3).
-```{r posttest_bootstrap, dependson="setup_bootstrap", echo=TRUE}
+The conditional probabilities are then combined with the observed pre-intervention response probabilities to calculate the joint probability of each response transition (e.g., P(post-intervention rating = 4 AND pre-intervention rating = 3)). Importantly, since the prior is agnostic to subjects' group assignment, these joint probability estimates are free from biases that would follow from a failure of random assignment.
+```{r posttest_bootstrap, dependson="setup_bootstrap", echo=TRUE}
# preintervention responses x intervention groups x bootstraps x postintervention responses
posttestData <- array(data = 0,
dim = c(length(uniqueResponses),
@@ -185,17 +187,18 @@ for (pretestResponse in seq_along(uniqueResponses)) {
# Convert the conditional probabilities to joint probabilities using the
# observed priors on each pretest response.
posttestData[pretestResponse, , , ] <- posttestData[pretestResponse, , , ] *
- obsPretestResponseProbabilities[pretestResponse]
+ pretestResponseProbs[pretestResponse]
}
```
-With the transition probabilities appropriately modeled, it's possible to test the hypothesis: **"participants in the 'disease risk' group are more likely to respond with a more-pro-vaccine attitude after the intervention than participants in the other groups."** We'll use the previously-run bootstraps to compute the probability of increased scores separately for each group.
+With the transition probabilities estimated, it's possible to test the hypothesis: **"participants in the 'disease risk' group are more likely to have a more pro-vaccine attitude after the intervention than participants in the other groups."** We'll use the previously-run bootstraps to compute the probability of increased scores separately for each group.
```{r posttest_shifts, dependson="posttest_bootstrap", echo=TRUE}
-posttestIncrease <- matrix(data = 0,
- nrow = numBootstraps,
- ncol = length(interventionLevels))
-colnames(posttestIncrease) <- interventionLevels
+posttestIncrease <- array(data = 0,
+ dim = c(numBootstraps,
+ length(interventionLevels)),
+ dimnames = list(NULL,
+ interventionLevels))
for (interventionLevel in seq_along(interventionLevels)) {
for (pretestResponse in seq_along(uniqueResponses)) {
@@ -209,7 +212,7 @@ for (interventionLevel in seq_along(interventionLevels)) {
}
```
-This estimates the probability that "disease risk" samples have a greater probability of making higher post-intervention responses than "autism correction" sample
+This estimates the probability that post-intervention responses from the "disease risk" group have a greater probability of being higher than their pre-intervention counterparts than responses from the "autism correction" group.
```{r posttest_stat, dependson="posttest_shifts", echo=TRUE}
sum(posttestIncrease[, which(interventionLevels == "Disease Risk")] >
@@ -223,10 +226,11 @@ Below is a visualization of the bootstrapped distributions of the probabilities
posttestDF <- gather(as.data.frame(posttestIncrease), "intervention", "prob_increase")
ggplot(posttestDF, aes(x = prob_increase, fill = intervention)) +
geom_density(alpha = 0.6) +
+ scale_fill_brewer(type = "qual", palette = "Dark2") +
ylab("Samples") + xlab("Probability of rating increase") +
theme_minimal()
```
## Conclusion
-Bootstrapping shows that the "disease risk" group is more likely than other groups to have an increase in pro-vaccination attitudes following the intervention. This analysis collapses across the five survey questions used by Horne and colleagues, although it would be straightforward to extend this approach to estimate attitude changes separately for each question.
-\ No newline at end of file
+Bootstrapping shows that the "disease risk" group is more likely than other groups to have an increase in pro-vaccination attitudes following the intervention. This analysis collapses across the five survey questions used by Horne and colleagues, although it would be straightforward to extend this approach to estimate attitude change probabilities separately for each question.