index.Rmd (7208B)
1 --- 2 title: "Sister probability puzzle" 3 description: "My solution to a question about the probability of having a sister." 4 date: 2023-01-18T21:47:38-08:00 5 draft: False 6 katex: True 7 knit: (function(input, ...) { 8 rmarkdown::render( 9 input, 10 output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts") 11 ) 12 }) 13 output: 14 md_document: 15 variant: markdown 16 preserve_yaml: true 17 --- 18 19 ```{r setup, include=FALSE} 20 knitr::opts_chunk$set( 21 echo = TRUE, 22 fig.path = file.path("figs", 23 sub("\\.Rmd$", "", 24 basename(rstudioapi::getActiveDocumentContext()$path)), 25 "") 26 ) 27 knitr::opts_knit$set( 28 base.dir = file.path(Sys.getenv("HUGO_ROOT"), "static"), 29 base.url = "/" 30 ) 31 ``` 32 33 A smart and talented friend, who's a fellow data scientist by profession, came 34 across a probability puzzle that's supposedly given in some job interviews for 35 data science roles. The problem asks the following: 36 37 > Assume the distribution of children per family is such that the probability of 38 > having 0 children is 0.30, 1 child is 0.25, 2 children is 0.20, 3 children is 39 > 0.15, 4 children is 0.10, and 5 or more children is 0. Consider a random girl 40 > in the population of children. What is the probability that she has a sister? 41 42 Checking her work, my friend disagreed with the answers she found online, and 43 wanted to know how I'd solve the problem. Since my solution seems to be 44 different than others I've found so far, I thought I'd share it with the whole 45 Internet, which will certainly tell me whether I'm wrong. 46 47 ## Simulating the problem 48 49 I'll start with a simulation—I'm a bit more confident in my programming than my 50 probability theory (I've been doing it longer), and it makes some of the tricky 51 parts of the problem explicit. 52 53 First, we'll simulate a random sample of _families_, since the problem gives us 54 probabilities in terms of _children per family_. After generating a pool of 55 random family sizes, we can use the binomial distribution to generate a random 56 number of girls for each family (naively assuming that the probability of a 57 child being a girl is 0.5). 58 59 ```{r simulate_families} 60 set.seed(232636039) 61 num_families <- 100000 62 63 family_size <- seq(0, 5) 64 p_family_size <- c(0.3, 0.25, 0.20, 0.15, 0.1, 0.0) 65 66 girls_per_family <- sample(family_size, num_families, replace = TRUE, 67 prob = p_family_size) |> 68 (\(n) rbinom(num_families, size = n, prob = 0.5))() 69 ``` 70 71 If a family has more than two girls, these girls have sisters. The probability 72 that a randomly selected girl has a sister is approximated by the number of 73 girls with sisters divided by the total number of girls in our simulation. 74 75 ```{r estimate_p} 76 # If there is more than one girl in a family, these girls have sisters 77 girls_with_sisters <- sum(girls_per_family[girls_per_family > 1]) 78 girls_total <- sum(girls_per_family) 79 80 (p_est <- girls_with_sisters / girls_total) 81 ``` 82 83 The code above estimates that the probability that a girl (sampled at random 84 from the _population of children_) has a sister is approximately `r round(p_est, 85 2)`. 86 87 ## Plotting the simulation 88 89 Just for fun, let's plot the estimate as it develops over the number of 90 simulated families. This also helps us get a sense of how trustworthy the 91 estimate is—has it converged to a stable value, or is it still swinging around? 92 For this, I'll finally use parts of the [Tidyverse](https://www.tidyverse.org/). 93 94 ```{r load_tidyverse} 95 suppressPackageStartupMessages(library(dplyr)) 96 suppressPackageStartupMessages(library(ggplot2)) 97 ``` 98 99 ```{r construct_df, cache=TRUE, dependson="simulate_families"} 100 sister_dat <- tibble(girls_per_family) |> 101 mutate(families = seq(n()), 102 has_sister = girls_per_family > 1, 103 total_girls = cumsum(girls_per_family), 104 girls_with_sisters = cumsum(girls_per_family * has_sister), 105 p_est = girls_with_sisters / total_girls) 106 ``` 107 108 ```{r plot_estimate, cache=TRUE, dependson="construc_df"} 109 sister_dat |> 110 # We'll skip any beginning rows that were generated before we sampled girls, 111 # since 0/0 is undefined. 112 filter(total_girls > 0) |> 113 ggplot(aes(families, p_est)) + 114 geom_step() + 115 geom_hline(yintercept = 71/120, linetype = "dashed") + 116 scale_x_log10() + 117 theme_minimal() + 118 labs(x = "Number of random families", y = "Observed proportion of girls with sisters") 119 ``` 120 121 ## Analytic solution 122 123 If you're wondering how I chose a value for the dashed line, here's an analytic 124 solution to the puzzle. 125 126 We'll let _S_ represent the event that a girl has a sister, _Fₙ_ represent the 127 event that a family has _n_ children, and _Cₙ_ represent the event that a girl 128 comes from a family of _n_ children (i.e., the girl has _n - 1_ siblings). By 129 the law of total probability: 130 131 $$ \mathbb{P}(S) = \sum_{n} \mathbb{P}(S|C_n) \mathbb{P}(C_n) $$ 132 133 The conditional probability that a girl from a family with _n_ children has a 134 sister is the compliment of the probability that all of her siblings are 135 brothers. Therefore: 136 137 $$ \mathbb{P}(S|C_n) = 1 - \frac{1}{2^{n-1}} $$ 138 139 For the next component, it's important to call out the fact that the problem 140 text gives us P(_Fₙ_) (the probability that a family has _n_ children), which is 141 **not the same** as P(_Cₙ_) (the probability that a child comes from a family 142 with _n_ children). I suspect that this trips some people up, and explains some 143 of the answers that are different than mine. 144 145 To illustrate the difference, let's imagine that the probabilities were 146 constructed from a population of 100 families. Twenty five of these families 147 have one child, while only ten have four children. However, this arrangement 148 results in 25 children with no siblings, while 40 children would have 3 each. In 149 other words, a randomly selected _family_ is more likely to have one child than 150 four children, but a randomly selected _child_ is more likely to come from a 151 family with four children than a family with one child. Formalizing this 152 reasoning a bit, we find: 153 154 $$ \mathbb{P}(C_n) = \frac{n \mathbb{P}(F_n)}{\sum_{m} m \mathbb{P}(F_m)} $$ 155 156 With our terms defined, we'll use R once more to calculate _P(S)_: 157 158 ```{r calculate_p} 159 n <- family_size 160 fn <- p_family_size 161 s_given_cn <- 1 - 1/(2^(n-1)) 162 cn <- n * fn / sum(n * fn) 163 (s <- sum(s_given_cn * cn)) 164 ``` 165 166 This value is equal to `71/120`, and is quite close to what we found through 167 simulation—but only reliably so after generating around 10,000 families. 168 169 ## Concluding note 170 171 I'm ambivalent about the practice of asking job applicants to solve probability 172 puzzles during interviews, but it's true that sometimes our intuitions can lead 173 us astray (e.g, confusing P(_Fₙ_) and P(_Cₙ_)), and this does come up in 174 real-life data analyses. When I find myself dealing with such situations, I'm 175 flexible in the approach I take: sometimes I try to find an exact solution using 176 probability theory first, but often I program a quick simulation instead. Most 177 of the time, I wind up doing both, because these approaches complement one 178 another: simulations are great because they force you to be explicit about the 179 assumptions that you're making about the problem, while analytic solutions give 180 exact answers and are insensitive to randomness.