antivax-bootstrap.Rmd (13659B)
1 --- 2 layout: post 3 title: "Bootstrap analysis of anti-vaccination belief changes" 4 summary: Another way of looking at the antivaccination data of Horne, et al. 5 author: "Eamon Caddigan" 6 date: 2015-09-15 7 categories: psych R 8 output: html_document 9 --- 10 11 ```{r global_options, include=FALSE} 12 knitr::opts_chunk$set(cache=TRUE, echo=FALSE, warning=FALSE, message=FALSE, 13 fig.width=9, fig.align="center") 14 ``` 15 16 ```{r setup_data, results="hide"} 17 # Required librarys and external files ---------------------------------------- 18 19 library(readxl) 20 library(tidyr) 21 library(dplyr) 22 library(ggplot2) 23 library(gridExtra) 24 25 # Clean and process the data -------------------------------------------------- 26 27 # Generates warnings for the Ps who didn't do day 2 28 suppressWarnings(expData <- read_excel("Vacc_HPHH_publicDataset.xlsx", sheet = 2)) 29 30 # Exclude Ps who didn't do day 2 and failed the attention checks 31 expData.clean <- expData %>% 32 # It's good to add a subject number so we can go back to original data 33 mutate(subject_number = 1:nrow(.)) %>% 34 filter(Returned == 1, 35 `AttentionCheck_PostTest (if = 4 then include)` == 4, 36 `AttentionChecks_Sum(include if = 4)` == 4, 37 Paid_Attention == 1) 38 39 # Get all the dependent measures into a DF 40 questionnaireData <- expData.clean %>% 41 # Pull out the columns and use consistent names 42 select(subject_number, 43 intervention = Condition, 44 pretest.healthy = Healthy_VaxscalePretest, 45 posttest.healthy = Healthy_VaxscalePosttest, 46 pretest.diseases = Diseases_VaxScalePretest, 47 posttest.diseases = Diseases_VaxScalePosttest, 48 pretest.doctors = Doctors_VaxScalePreTest, 49 posttest.doctors = Doctors_VaxScalePostTest, 50 pretest.side_effects = Sideeffects_VaxScalePreTest, 51 posttest.side_effects = Sideeffects_VaxScalePostTest, 52 pretest.plan_to = Planto_VaxScalePreTest, 53 posttest.plan_to = Planto_VaxScalePostTest) %>% 54 # Reverse-code the approrpiate columns 55 mutate(pretest.diseases = 7 - pretest.diseases, 56 posttest.diseases = 7 - posttest.diseases, 57 pretest.side_effects = 7 - pretest.side_effects, 58 posttest.side_effects = 7 - posttest.side_effects) %>% 59 # Tidy the data 60 gather("question", "response", -subject_number, -intervention) %>% 61 separate(question, c("interval", "question"), sep = "\\.") %>% 62 mutate(intervention = factor(intervention, 63 c("Control", "Autism Correction", "Disease Risk")), 64 interval = factor(interval, 65 c("pretest", "posttest"), ordered = TRUE), 66 question = factor(question, 67 c("healthy", "diseases", "doctors", "side_effects", "plan_to"))) %>% 68 # Pre- and post-test get their own columns in these analyses 69 mutate(interval = paste0(interval, "_response")) %>% 70 spread(interval, response) 71 # ----------------------------------------------------------------------------- 72 ``` 73 74 ## Introduction 75 76 In a [previous post]({{ site.url }}/psych/bayes/2015/09/03/antivax-attitudes/) (I don't know why I'm linking it -- there are only two), I presented an analysis of data by [Horne, Powell, Hummel & Holyoak, (2015)](http://www.pnas.org/content/112/33/10321.abstract) that investigated changes in attitudes toward childhood vaccinations. The previous analysis used Bayesian estimation to show a credible increase in pro-vaccination attitudes following a "disease risk" intervention, but not an "autism correction" intervention. 77 78 Some of my friends offered insightful comments, and one [pointed out](https://twitter.com/johnclevenger/status/639795727439429632) what appeared to be a failure of random assignment. Participants in the "disease risk" group happened to have lower scores on the pre-intervention survey and therefore had more room for improvement. This is a fair criticism, but a subsequent analysis showed that post-intervention scores were also higher for the "disease risk" group, which addresses this issue. 79 80 ![Posterior of final score differences](bayesian_ending_scores.png) 81 82 ### Bootstrapping 83 84 Interpreting differences in parameter values isn't always straightforward, so I thought it'd be worthwhile to try a different approach. Instead of fitting a generative model to the sample, we can use bootstrapping to estimate the unobserved population parameters. [Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is conceptually simple; I feel it would have much wider adoption today had computers been around in the early days of statistics. 85 86 Here is code that uses bootstrapping to estimate the probability of each response on the pre-intervention survey (irrespective of survey question or intervention group assignment). The sample mean is already an unbiased estimator of the population mean, so bootstrapping isn't necessary in this first example. However, this provides a simple illustration of how the technique works: sample observations *with replacement* from the data, calculate a statistic on this new data, and repeat. The mean of the observed statistic values provides an estimate of the population statistic, and the distribution of statistic values provides a measure of certainty. 87 88 ```{r setup_bootstrap, dependson="setup_data", echo=TRUE} 89 numBootstraps <- 1e5 # Should be a big number 90 numObservations <- nrow(questionnaireData) 91 uniqueResponses <- sort(unique(questionnaireData$pretest_response)) 92 interventionLevels <- levels(questionnaireData$intervention) 93 94 # The observed proportion of responses at each level 95 pretestResponseProbs <- as.numeric(table(questionnaireData$pretest_response)) / 96 numObservations 97 ``` 98 99 ```{r pretest_bootstrap, dependson="setup_bootstrap", echo=TRUE} 100 # Bootstrap to find the probability that each response will be given to pre-test 101 # questions. 102 pretestData <- array(data = 0, 103 dim = c(numBootstraps, 104 length(uniqueResponses)), 105 dimnames = list(NULL, 106 paste(uniqueResponses))) 107 108 # Run the bootstrap 109 for (ii in seq_len(numBootstraps)) { 110 bootSamples <- sample(questionnaireData$pretest_response, 111 numObservations, 112 replace = TRUE) 113 bootSamplesTabulated <- table(bootSamples) 114 pretestData[ii, names(bootSamplesTabulated)] <- bootSamplesTabulated 115 } 116 117 # Convert the counts to probabilities 118 pretestData <- pretestData / numObservations 119 ``` 120 121 ```{r pretest_plot, dependson="pretest_bootstrap", fig.width=4, fig.height=4} 122 pretestDF <- data_frame(response = uniqueResponses, 123 bootstrap_prob = apply(pretestData, 2, mean), 124 bootstrap_sd = apply(pretestData, 2, sd), 125 observed_prob = pretestResponseProbs) 126 127 ggplot(pretestDF, aes(x = response)) + 128 geom_bar(aes(y = observed_prob), stat = "identity", 129 color="white", fill="skyblue") + 130 geom_point(aes(y = bootstrap_prob), size = 3, color = "red") + 131 geom_errorbar(aes(ymin = bootstrap_prob - bootstrap_sd/2, 132 ymax = bootstrap_prob + bootstrap_sd/2), 133 size = 2, color = "red", width = 0) + 134 scale_x_continuous(breaks = 1:length(uniqueResponses)) + 135 scale_y_continuous(limits = c(0, 1)) + 136 xlab("Response Level") + 137 ylab("Proportion") + 138 theme_classic() 139 ``` 140 141 As expected, the bootstrap estimates for the proportion of responses at each level almost exactly match the observed data. There are supposed to be error-bars around the points, which show the bootstrap estimates, but they're obscured by the points themselves. 142 143 ## Changes in vaccination attitudes 144 145 Due to chance alone, the three groups of participants (the control group, the "autism correction" group, and the "disease risk" group) showed different patterns of responses to the pre-intervention survey. To mitigate this issue, the code below estimates the transition probabilities from each response on the pre-intervention survey to each response on the post-intervention survey, and does so separately for the groups. These are conditional probabilities, e.g., P(post-intervention rating = 4 | pre-intervention rating = 3). 146 147 The conditional probabilities are then combined with the observed pre-intervention response probabilities to calculate the joint probability of each response transition (e.g., P(post-intervention rating = 4 AND pre-intervention rating = 3)). Importantly, since the prior is agnostic to subjects' group assignment, these joint probability estimates are less-affected by biases that would follow from a failure of random assignment. 148 149 ```{r posttest_bootstrap, dependson="setup_bootstrap", echo=TRUE} 150 # preintervention responses x intervention groups x bootstraps x postintervention responses 151 posttestData <- array(data = 0, 152 dim = c(length(uniqueResponses), 153 length(interventionLevels), 154 numBootstraps, 155 length(uniqueResponses)), 156 dimnames = list(paste(uniqueResponses), 157 interventionLevels, 158 NULL, 159 paste(uniqueResponses))) 160 161 for (pretestResponse in seq_along(uniqueResponses)) { 162 for (interventionLevel in seq_along(interventionLevels)) { 163 # Get the subset of data for each combination of intervention and 164 # pre-intervention response level. 165 questionnaireDataSubset <- filter(questionnaireData, 166 intervention == interventionLevels[interventionLevel], 167 pretest_response == pretestResponse) 168 numObservationsSubset <- nrow(questionnaireDataSubset) 169 170 # Run the bootstrap 171 for (ii in seq_len(numBootstraps)) { 172 bootSamples <- sample(questionnaireDataSubset$posttest_response, 173 numObservationsSubset, 174 replace = TRUE) 175 bootSamplesTabulated <- table(bootSamples) 176 posttestData[pretestResponse, 177 interventionLevel, 178 ii, 179 names(bootSamplesTabulated)] <- bootSamplesTabulated 180 } 181 182 # Convert the counts to probabilities 183 posttestData[pretestResponse, interventionLevel, , ] <- 184 posttestData[pretestResponse, interventionLevel, , ] / numObservationsSubset 185 } 186 187 # Convert the conditional probabilities to joint probabilities using the 188 # observed priors on each pretest response. 189 posttestData[pretestResponse, , , ] <- posttestData[pretestResponse, , , ] * 190 pretestResponseProbs[pretestResponse] 191 } 192 ``` 193 194 With the transition probabilities sampled, it's possible to test the hypothesis: **"participants are more likely to shift toward a more pro-vaccine attitude following a 'disease risk' intervention than participants in control and 'autism correction' groups."** We'll use the previously-run bootstrap samples to compute the each group's probability of increasing scores. 195 196 ```{r posttest_shifts, dependson="posttest_bootstrap", echo=TRUE} 197 posttestIncrease <- array(data = 0, 198 dim = c(numBootstraps, 199 length(interventionLevels)), 200 dimnames = list(NULL, 201 interventionLevels)) 202 203 for (interventionLevel in seq_along(interventionLevels)) { 204 for (pretestResponse in seq_along(uniqueResponses)) { 205 for (posttestResponse in seq_along(uniqueResponses)) { 206 if (posttestResponse > pretestResponse) { 207 posttestIncrease[, interventionLevel] <- posttestIncrease[, interventionLevel] + 208 posttestData[pretestResponse, interventionLevel, , posttestResponse] 209 } 210 } 211 } 212 } 213 ``` 214 215 This estimates the probability that post-intervention responses from the "disease risk" group have a greater probability of being higher than their pre-intervention counterparts than responses from the "autism correction" group. 216 217 ```{r posttest_stat, dependson="posttest_shifts", echo=TRUE} 218 sum(posttestIncrease[, which(interventionLevels == "Disease Risk")] > 219 posttestIncrease[, which(interventionLevels == "Autism Correction")]) / 220 nrow(posttestIncrease) 221 ``` 222 223 Below is a visualization of the bootstrap distributions. This illustrates the certainty of the estimates of the probability that participants would express stronger pro-vaccination attitudes after the interventions. 224 225 ```{r posttest_plot, dependson="posttest_shifts"} 226 posttestDF <- gather(as.data.frame(posttestIncrease), "intervention", "prob_increase") 227 ggplot(posttestDF, aes(x = prob_increase, fill = intervention)) + 228 geom_density(alpha = 0.6) + 229 scale_fill_brewer(type = "qual", palette = "Dark2") + 230 ylab("Samples") + xlab("Probability of rating increase") + 231 theme_minimal() 232 ``` 233 234 ## Conclusion 235 236 Bootstrapping shows that a "disease risk" intervention has a stronger effect than others in shifting participants' pro-vaccination attitudes. This analysis collapses across the five survey questions used by Horne and colleagues, but it would be straightforward to extend this code to estimate attitude change probabilities separately for each question. 237 238 Although there are benefits to analyzing data with nonparametric methods, the biggest shortcoming of the approach I've used here is that it can not estimate the size of the attitude changes. Instead, it estimates that probability of pro-vaccination attitude changes occurring, and the difference in these probabilities between the groups. This is a great example of why it's important to keep your question in mind while analyzing data.