www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git
Log | Files | Refs | Submodules | README

index.Rmd (13897B)


      1 ---
      2 layout: post
      3 title: "Bootstrap analysis of anti-vaccination belief changes"
      4 summary: Another way of looking at the antivaccination data of Horne, et al.
      5 author: "Eamon Caddigan"
      6 date: 2015-09-15
      7 categories: psych R
      8 output: html_document
      9 ---
     10 
     11 ```{r global_options, include=FALSE}
     12 knitr::opts_chunk$set(cache=TRUE, echo=FALSE, warning=FALSE, message=FALSE,
     13                       fig.width=9, fig.align="center")
     14 ```
     15 
     16 ```{r setup_data, results="hide"}
     17 # Required librarys and external files ----------------------------------------
     18 
     19 library(readxl)
     20 library(tidyr)
     21 library(dplyr)
     22 library(ggplot2)
     23 library(gridExtra)
     24 
     25 # Clean and process the data --------------------------------------------------
     26 
     27 # Generates warnings for the Ps who didn't do day 2
     28 suppressWarnings(expData <- read_excel("Vacc_HPHH_publicDataset.xlsx", sheet = 2))
     29 
     30 # Exclude Ps who didn't do day 2 and failed the attention checks
     31 expData.clean <- expData %>%
     32   # It's good to add a subject number so we can go back to original data
     33   mutate(subject_number = 1:nrow(.)) %>%
     34   filter(Returned == 1,
     35          `AttentionCheck_PostTest (if = 4 then include)` == 4,
     36          `AttentionChecks_Sum(include if = 4)` == 4,
     37          Paid_Attention == 1)
     38 
     39 # Get all the dependent measures into a DF
     40 questionnaireData <- expData.clean %>%
     41   # Pull out the columns and use consistent names
     42   select(subject_number,
     43          intervention = Condition,
     44          pretest.healthy = Healthy_VaxscalePretest,
     45          posttest.healthy = Healthy_VaxscalePosttest,
     46          pretest.diseases = Diseases_VaxScalePretest,
     47          posttest.diseases = Diseases_VaxScalePosttest,
     48          pretest.doctors = Doctors_VaxScalePreTest,
     49          posttest.doctors = Doctors_VaxScalePostTest,
     50          pretest.side_effects = Sideeffects_VaxScalePreTest,
     51          posttest.side_effects = Sideeffects_VaxScalePostTest,
     52          pretest.plan_to = Planto_VaxScalePreTest,
     53          posttest.plan_to = Planto_VaxScalePostTest) %>%
     54   # Reverse-code the approrpiate columns
     55   mutate(pretest.diseases = 7 - pretest.diseases,
     56          posttest.diseases = 7 - posttest.diseases,
     57          pretest.side_effects = 7 - pretest.side_effects,
     58          posttest.side_effects = 7 - posttest.side_effects) %>%
     59   # Tidy the data
     60   gather("question", "response", -subject_number, -intervention) %>%
     61   separate(question, c("interval", "question"), sep = "\\.") %>% 
     62   mutate(intervention = factor(intervention, 
     63                                c("Control", "Autism Correction", "Disease Risk")),
     64          interval = factor(interval, 
     65                            c("pretest", "posttest"), ordered = TRUE),
     66          question = factor(question, 
     67                            c("healthy", "diseases", "doctors", "side_effects", "plan_to"))) %>%
     68   # Pre- and post-test get their own columns in these analyses
     69   mutate(interval = paste0(interval, "_response")) %>%
     70   spread(interval, response)
     71 # -----------------------------------------------------------------------------
     72 ```
     73 
     74 ## Introduction
     75 
     76 In a [previous post]({{ site.url }}/psych/bayes/2015/09/03/antivax-attitudes/) (I don't know why I'm linking it -- there are only two), I presented an analysis of data by [Horne, Powell, Hummel & Holyoak, (2015)](http://www.pnas.org/content/112/33/10321.abstract) that investigated changes in attitudes toward childhood vaccinations. The previous analysis used Bayesian estimation to show a credible increase in pro-vaccination attitudes following a "disease risk" intervention, but not an "autism correction" intervention.
     77 
     78 Some of my friends offered insightful comments, and one [pointed out](https://twitter.com/johnclevenger/status/639795727439429632) what appeared to be a failure of random assignment. Participants in the "disease risk" group happened to have lower scores on the pre-intervention survey and therefore had more room for improvement. This is a fair criticism, but a subsequent analysis showed that post-intervention scores were also higher for the "disease risk" group, which addresses this issue.
     79 
     80 ![Posterior of final score differences](bayesian_ending_scores.png)
     81 
     82 ### Bootstrapping 
     83 
     84 Interpreting differences in parameter values isn't always straightforward, so I thought it'd be worthwhile to try a different approach. Instead of fitting a generative model to the sample, we can use bootstrapping to estimate the unobserved population parameters. [Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) is conceptually simple; I feel it would have much wider adoption today had computers been around in the early days of statistics.
     85 
     86 Here is code that uses bootstrapping to estimate the probability of each response on the pre-intervention survey (irrespective of survey question or intervention group assignment). The sample mean is already an unbiased estimator of the population mean, so bootstrapping isn't necessary in this first example. However, this provides a simple illustration of how the technique works: sample observations *with replacement* from the data, calculate a statistic on this new data, and repeat. The mean of the observed statistic values provides an estimate of the population statistic, and the distribution of statistic values provides a measure of certainty. 
     87 
     88 ```{r setup_bootstrap, dependson="setup_data", echo=TRUE}
     89 numBootstraps <- 1e5  # Should be a big number
     90 numObservations <- nrow(questionnaireData)
     91 uniqueResponses <- sort(unique(questionnaireData$pretest_response))
     92 interventionLevels <- levels(questionnaireData$intervention)
     93 
     94 # The observed proportion of responses at each level
     95 pretestResponseProbs <- as.numeric(table(questionnaireData$pretest_response)) / 
     96   numObservations
     97 ```
     98 
     99 ```{r pretest_bootstrap, dependson="setup_bootstrap", echo=TRUE}
    100 # Bootstrap to find the probability that each response will be given to pre-test
    101 # questions.
    102 pretestData <- array(data = 0,
    103                      dim = c(numBootstraps,
    104                              length(uniqueResponses)),
    105                      dimnames = list(NULL,
    106                                      paste(uniqueResponses)))
    107 
    108 # Run the bootstrap
    109 for (ii in seq_len(numBootstraps)) {
    110   bootSamples <- sample(questionnaireData$pretest_response, 
    111                         numObservations, 
    112                         replace = TRUE)
    113   bootSamplesTabulated <- table(bootSamples)
    114   pretestData[ii, names(bootSamplesTabulated)] <- bootSamplesTabulated
    115 }
    116 
    117 # Convert the counts to probabilities
    118 pretestData <- pretestData / numObservations
    119 ```
    120 
    121 ```{r pretest_plot, dependson="pretest_bootstrap", fig.width=4, fig.height=4}
    122 pretestDF <- data_frame(response = uniqueResponses, 
    123                         bootstrap_prob = apply(pretestData, 2, mean),
    124                         bootstrap_sd = apply(pretestData, 2, sd),
    125                         observed_prob = pretestResponseProbs)
    126 
    127 ggplot(pretestDF, aes(x = response)) + 
    128   geom_bar(aes(y = observed_prob), stat = "identity", 
    129            color="white", fill="skyblue") + 
    130   geom_point(aes(y = bootstrap_prob), size = 3, color = "red") + 
    131   geom_errorbar(aes(ymin = bootstrap_prob - bootstrap_sd/2, 
    132                     ymax = bootstrap_prob + bootstrap_sd/2),
    133                 size = 2, color = "red", width = 0) +
    134   scale_x_continuous(breaks = 1:length(uniqueResponses)) +
    135   scale_y_continuous(limits = c(0, 1)) +
    136   xlab("Response Level") + 
    137   ylab("Proportion") + 
    138   theme_classic()
    139 ```
    140 
    141 As expected, the bootstrap estimates for the proportion of responses at each level almost exactly match the observed data. There are supposed to be error-bars around the points, which show the bootstrap estimates, but they're obscured by the points themselves.
    142 
    143 ## Changes in vaccination attitudes
    144 
    145 Due to chance alone, the three groups of participants (the control group, the "autism correction" group, and the "disease risk" group) showed different patterns of responses to the pre-intervention survey. To mitigate this issue, the code below estimates the transition probabilities from each response on the pre-intervention survey to each response on the post-intervention survey, and does so separately for the groups. These are conditional probabilities, e.g., P(post-intervention rating = 4 | pre-intervention rating = 3). 
    146 
    147 The conditional probabilities are then combined with the observed pre-intervention response probabilities to calculate the joint probability of each response transition (e.g., P(post-intervention rating = 4 AND pre-intervention rating = 3)). Importantly, since the prior is agnostic to subjects' group assignment, these joint probability estimates are less-affected by biases that would follow from a failure of random assignment.
    148 
    149 ```{r posttest_bootstrap, dependson="setup_bootstrap", echo=TRUE}
    150 # preintervention responses x intervention groups x bootstraps x postintervention responses
    151 posttestData <- array(data = 0,
    152                       dim = c(length(uniqueResponses),
    153                               length(interventionLevels),
    154                               numBootstraps, 
    155                               length(uniqueResponses)),
    156                       dimnames = list(paste(uniqueResponses),
    157                                       interventionLevels,
    158                                       NULL,
    159                                       paste(uniqueResponses)))
    160 
    161 for (pretestResponse in seq_along(uniqueResponses)) {
    162   for (interventionLevel in seq_along(interventionLevels)) {
    163     # Get the subset of data for each combination of intervention and
    164     # pre-intervention response level.
    165     questionnaireDataSubset <- filter(questionnaireData,
    166                                       intervention == interventionLevels[interventionLevel],
    167                                       pretest_response == pretestResponse)
    168     numObservationsSubset <- nrow(questionnaireDataSubset)
    169     
    170     # Run the bootstrap
    171     for (ii in seq_len(numBootstraps)) {
    172       bootSamples <- sample(questionnaireDataSubset$posttest_response, 
    173                             numObservationsSubset, 
    174                             replace = TRUE)
    175       bootSamplesTabulated <- table(bootSamples)
    176       posttestData[pretestResponse, 
    177                    interventionLevel, 
    178                    ii, 
    179                    names(bootSamplesTabulated)] <- bootSamplesTabulated
    180     }
    181     
    182     # Convert the counts to probabilities
    183     posttestData[pretestResponse, interventionLevel, , ] <- 
    184       posttestData[pretestResponse, interventionLevel, , ] / numObservationsSubset
    185   }
    186   
    187   # Convert the conditional probabilities to joint probabilities using the 
    188   # observed priors on each pretest response.
    189   posttestData[pretestResponse, , , ] <- posttestData[pretestResponse, , , ] *
    190     pretestResponseProbs[pretestResponse]
    191 }
    192 ```
    193 
    194 With the transition probabilities sampled, it's possible to test the hypothesis: **"participants are more likely to shift toward a more pro-vaccine attitude following a 'disease risk' intervention than participants in control and 'autism correction' groups."** We'll use the previously-run bootstrap samples to compute the each group's probability of increasing scores.
    195 
    196 ```{r posttest_shifts, dependson="posttest_bootstrap", echo=TRUE}
    197 posttestIncrease <- array(data = 0, 
    198                           dim = c(numBootstraps,
    199                                   length(interventionLevels)),
    200                           dimnames = list(NULL,
    201                                           interventionLevels))
    202 
    203 for (interventionLevel in seq_along(interventionLevels)) {
    204   for (pretestResponse in seq_along(uniqueResponses)) {
    205     for (posttestResponse in seq_along(uniqueResponses)) {
    206       if (posttestResponse > pretestResponse) {
    207         posttestIncrease[, interventionLevel] <- posttestIncrease[, interventionLevel] +
    208           posttestData[pretestResponse, interventionLevel, , posttestResponse]
    209       }
    210     }
    211   }
    212 }
    213 ```
    214 
    215 This estimates the probability that post-intervention responses from the "disease risk" group have a greater probability of being higher than their pre-intervention counterparts than responses from the "autism correction" group.
    216 
    217 ```{r posttest_stat, dependson="posttest_shifts", echo=TRUE}
    218 sum(posttestIncrease[, which(interventionLevels == "Disease Risk")] > 
    219       posttestIncrease[, which(interventionLevels == "Autism Correction")]) / 
    220   nrow(posttestIncrease)
    221 ```
    222 
    223 Below is a visualization of the bootstrap distributions. This illustrates the certainty of the estimates of the probability  that participants would express stronger pro-vaccination attitudes after the interventions. 
    224 
    225 ```{r posttest_plot, dependson="posttest_shifts"}
    226 posttestDF <- gather(as.data.frame(posttestIncrease), "intervention", "prob_increase")
    227 ggplot(posttestDF, aes(x = prob_increase, fill = intervention)) + 
    228   geom_density(alpha = 0.6) + 
    229   scale_fill_brewer(type = "qual", palette = "Dark2") +
    230   ylab("Samples") + xlab("Probability of rating increase") + 
    231   theme_minimal()
    232 ```
    233 
    234 ## Conclusion
    235 
    236 Bootstrapping shows that a "disease risk" intervention has a stronger effect than others in shifting participants' pro-vaccination attitudes. This analysis collapses across the five survey questions used by Horne and colleagues, but it would be straightforward to extend this code to estimate attitude change probabilities separately for each question. 
    237 
    238 Although there are benefits to analyzing data with nonparametric methods, the biggest shortcoming of the approach I've used here is that it can not estimate the size of the attitude changes. Instead, it estimates that probability of pro-vaccination attitude changes occurring, and the difference in these probabilities between the groups. This is a great example of why it's important to keep your question in mind while analyzing data.