index.md (6960B)
1 --- 2 title: "Sister probability puzzle" 3 description: "My solution to a question about the probability of having a sister." 4 date: 2023-01-18T21:47:38-08:00 5 draft: False 6 katex: True 7 knit: (function(input, ...) { 8 rmarkdown::render( 9 input, 10 output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts") 11 ) 12 }) 13 output: 14 md_document: 15 variant: markdown 16 preserve_yaml: true 17 categories: 18 - Data Science 19 tags: 20 - R 21 - Statistics 22 --- 23 24 A smart and talented friend, who's a fellow data scientist by 25 profession, came across a probability puzzle that's supposedly given in 26 some job interviews for data science roles. The problem asks the 27 following: 28 29 > Assume the distribution of children per family is such that the 30 > probability of having 0 children is 0.30, 1 child is 0.25, 2 children 31 > is 0.20, 3 children is 0.15, 4 children is 0.10, and 5 or more 32 > children is 0. Consider a random girl in the population of children. 33 > What is the probability that she has a sister? 34 35 Checking her work, my friend disagreed with the answers she found 36 online, and wanted to know how I'd solve the problem. Since my solution 37 seems to be different than others I've found so far, I thought I'd share 38 it with the whole Internet, which will certainly tell me whether I'm 39 wrong. 40 41 ## Simulating the problem 42 43 I'll start with a simulation---I'm a bit more confident in my 44 programming than my probability theory (I've been doing it longer), and 45 it makes some of the tricky parts of the problem explicit. 46 47 First, we'll simulate a random sample of *families*, since the problem 48 gives us probabilities in terms of *children per family*. After 49 generating a pool of random family sizes, we can use the binomial 50 distribution to generate a random number of girls for each family 51 (naively assuming that the probability of a child being a girl is 0.5). 52 53 ``` r 54 set.seed(232636039) 55 num_families <- 100000 56 57 family_size <- seq(0, 5) 58 p_family_size <- c(0.3, 0.25, 0.20, 0.15, 0.1, 0.0) 59 60 girls_per_family <- sample(family_size, num_families, replace = TRUE, 61 prob = p_family_size) |> 62 (\(n) rbinom(num_families, size = n, prob = 0.5))() 63 ``` 64 65 If a family has more than two girls, these girls have sisters. The 66 probability that a randomly selected girl has a sister is approximated 67 by the number of girls with sisters divided by the total number of girls 68 in our simulation. 69 70 ``` r 71 # If there is more than one girl in a family, these girls have sisters 72 girls_with_sisters <- sum(girls_per_family[girls_per_family > 1]) 73 girls_total <- sum(girls_per_family) 74 75 (p_est <- girls_with_sisters / girls_total) 76 ``` 77 78 ## [1] 0.5878486 79 80 The code above estimates that the probability that a girl (sampled at 81 random from the *population of children*) has a sister is approximately 82 0.59. 83 84 ## Plotting the simulation 85 86 Just for fun, let's plot the estimate as it develops over the number of 87 simulated families. This also helps us get a sense of how trustworthy 88 the estimate is---has it converged to a stable value, or is it still 89 swinging around? For this, I'll finally use parts of the 90 [Tidyverse](https://www.tidyverse.org/). 91 92 ``` r 93 suppressPackageStartupMessages(library(dplyr)) 94 suppressPackageStartupMessages(library(ggplot2)) 95 ``` 96 97 ``` r 98 sister_dat <- tibble(girls_per_family) |> 99 mutate(families = seq(n()), 100 has_sister = girls_per_family > 1, 101 total_girls = cumsum(girls_per_family), 102 girls_with_sisters = cumsum(girls_per_family * has_sister), 103 p_est = girls_with_sisters / total_girls) 104 ``` 105 106 ``` r 107 sister_dat |> 108 # We'll skip any beginning rows that were generated before we sampled girls, 109 # since 0/0 is undefined. 110 filter(total_girls > 0) |> 111 ggplot(aes(families, p_est)) + 112 geom_step() + 113 geom_hline(yintercept = 71/120, linetype = "dashed") + 114 scale_x_log10() + 115 theme_minimal() + 116 labs(x = "Number of random families", y = "Observed proportion of girls with sisters") 117 ``` 118 119 ![A step graph shows the observed proportion of girls with sisters 120 converging to ~0.59 as the number of simulated families increases](plot_estimate-1.png) 121 122 ## Analytic solution 123 124 If you're wondering how I chose a value for the dashed line, here's an 125 analytic solution to the puzzle. 126 127 We'll let *S* represent the event that a girl has a sister, *Fₙ* 128 represent the event that a family has *n* children, and *Cₙ* represent 129 the event that a girl comes from a family of *n* children (i.e., the 130 girl has *n - 1* siblings). By the law of total probability: 131 132 $$ \mathbb{P}(S) = \sum_{n} \mathbb{P}(S|C_n) \mathbb{P}(C_n) $$ 133 134 The conditional probability that a girl from a family with *n* children 135 has a sister is the compliment of the probability that all of her 136 siblings are brothers. Therefore: 137 138 $$ \mathbb{P}(S|C_n) = 1 - \frac{1}{2^{n-1}} $$ 139 140 For the next component, it's important to call out the fact that the 141 problem text gives us P(*Fₙ*) (the probability that a family has *n* 142 children), which is **not the same** as P(*Cₙ*) (the probability that a 143 child comes from a family with *n* children). I suspect that this trips 144 some people up, and explains some of the answers that are different than 145 mine. 146 147 To illustrate the difference, let's imagine that the probabilities were 148 constructed from a population of 100 families. Twenty five of these 149 families have one child, while only ten have four children. However, 150 this arrangement results in 25 children with no siblings, while 40 151 children would have 3 each. In other words, a randomly selected *family* 152 is more likely to have one child than four children, but a randomly 153 selected *child* is more likely to come from a family with four children 154 than a family with one child. Formalizing this reasoning a bit, we find: 155 156 $$ \mathbb{P}(C_n) = \frac{n \mathbb{P}(F_n)}{\sum_{m} m \mathbb{P}(F_m)} $$ 157 158 With our terms defined, we'll use R once more to calculate *P(S)*: 159 160 ``` r 161 n <- family_size 162 fn <- p_family_size 163 s_given_cn <- 1 - 1/(2^(n-1)) 164 cn <- n * fn / sum(n * fn) 165 (s <- sum(s_given_cn * cn)) 166 ``` 167 168 ## [1] 0.5916667 169 170 This value is equal to `71/120`, and is quite close to what we found 171 through simulation---but only reliably so after generating around 10,000 172 families. 173 174 ## Concluding note 175 176 I'm ambivalent about the practice of asking job applicants to solve 177 probability puzzles during interviews, but it's true that sometimes our 178 intuitions can lead us astray (e.g, confusing P(*Fₙ*) and P(*Cₙ*)), and this 179 does come up in real-life data analyses. When I find myself dealing with 180 such situations, I'm flexible in the approach I take: [sometimes I try to 181 find an exact solution using probability theory first]({{< ref 182 "posts/riddler-2/index.md" >}}), but often I program a quick simulation 183 instead. Most of the time, I wind up doing both, because these approaches 184 complement one another: simulations are great because they force you to be 185 explicit about the assumptions that you're making about the problem, while 186 analytic solutions give exact answers and are insensitive to randomness.