index.md - www.eamoncaddigan.net - Content and configuration for https://www.eamoncaddigan.net

index.md (6960B)
      1 ---
      2 title: "Sister probability puzzle"
      3 description: "My solution to a question about the probability of having a sister."
      4 date: 2023-01-18T21:47:38-08:00
      5 draft: False
      6 katex: True
      7 knit: (function(input, ...) {
      8     rmarkdown::render(
      9       input,
     10       output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
     11     )
     12   })
     13 output: 
     14   md_document:
     15     variant: markdown
     16     preserve_yaml: true
     17 categories:
     18 - Data Science
     19 tags:
     20 - R
     21 - Statistics
     22 ---
     23 
     24 A smart and talented friend, who's a fellow data scientist by
     25 profession, came across a probability puzzle that's supposedly given in
     26 some job interviews for data science roles. The problem asks the
     27 following:
     28 
     29 > Assume the distribution of children per family is such that the
     30 > probability of having 0 children is 0.30, 1 child is 0.25, 2 children
     31 > is 0.20, 3 children is 0.15, 4 children is 0.10, and 5 or more
     32 > children is 0. Consider a random girl in the population of children.
     33 > What is the probability that she has a sister?
     34 
     35 Checking her work, my friend disagreed with the answers she found
     36 online, and wanted to know how I'd solve the problem. Since my solution
     37 seems to be different than others I've found so far, I thought I'd share
     38 it with the whole Internet, which will certainly tell me whether I'm
     39 wrong.
     40 
     41 ## Simulating the problem
     42 
     43 I'll start with a simulation---I'm a bit more confident in my
     44 programming than my probability theory (I've been doing it longer), and
     45 it makes some of the tricky parts of the problem explicit.
     46 
     47 First, we'll simulate a random sample of *families*, since the problem
     48 gives us probabilities in terms of *children per family*. After
     49 generating a pool of random family sizes, we can use the binomial
     50 distribution to generate a random number of girls for each family
     51 (naively assuming that the probability of a child being a girl is 0.5).
     52 
     53 ``` r
     54 set.seed(232636039)
     55 num_families <- 100000
     56 
     57 family_size <- seq(0, 5)
     58 p_family_size <- c(0.3, 0.25, 0.20, 0.15, 0.1, 0.0)
     59 
     60 girls_per_family <- sample(family_size, num_families, replace = TRUE,
     61                            prob = p_family_size) |>
     62   (\(n) rbinom(num_families, size = n, prob = 0.5))()
     63 ```
     64 
     65 If a family has more than two girls, these girls have sisters. The
     66 probability that a randomly selected girl has a sister is approximated
     67 by the number of girls with sisters divided by the total number of girls
     68 in our simulation.
     69 
     70 ``` r
     71 # If there is more than one girl in a family, these girls have sisters
     72 girls_with_sisters <- sum(girls_per_family[girls_per_family > 1])
     73 girls_total <- sum(girls_per_family)
     74 
     75 (p_est <- girls_with_sisters / girls_total)
     76 ```
     77 
     78     ## [1] 0.5878486
     79 
     80 The code above estimates that the probability that a girl (sampled at
     81 random from the *population of children*) has a sister is approximately
     82 0.59.
     83 
     84 ## Plotting the simulation
     85 
     86 Just for fun, let's plot the estimate as it develops over the number of
     87 simulated families. This also helps us get a sense of how trustworthy
     88 the estimate is---has it converged to a stable value, or is it still
     89 swinging around? For this, I'll finally use parts of the
     90 [Tidyverse](https://www.tidyverse.org/).
     91 
     92 ``` r
     93 suppressPackageStartupMessages(library(dplyr))
     94 suppressPackageStartupMessages(library(ggplot2))
     95 ```
     96 
     97 ``` r
     98 sister_dat <- tibble(girls_per_family) |>
     99   mutate(families = seq(n()),
    100          has_sister = girls_per_family > 1,
    101          total_girls = cumsum(girls_per_family),
    102          girls_with_sisters = cumsum(girls_per_family * has_sister),
    103          p_est = girls_with_sisters / total_girls)
    104 ```
    105 
    106 ``` r
    107 sister_dat |>
    108   # We'll skip any beginning rows that were generated before we sampled girls,
    109   # since 0/0 is undefined.
    110   filter(total_girls > 0) |>
    111   ggplot(aes(families, p_est)) +
    112   geom_step() +
    113   geom_hline(yintercept = 71/120, linetype = "dashed") +
    114   scale_x_log10() +
    115   theme_minimal() +
    116   labs(x = "Number of random families", y = "Observed proportion of girls with sisters")
    117 ```
    118 
    119 ![A step graph shows the observed proportion of girls with sisters
    120 converging to ~0.59 as the number of simulated families increases](plot_estimate-1.png)
    121 
    122 ## Analytic solution
    123 
    124 If you're wondering how I chose a value for the dashed line, here's an
    125 analytic solution to the puzzle.
    126 
    127 We'll let *S* represent the event that a girl has a sister, *Fₙ*
    128 represent the event that a family has *n* children, and *Cₙ* represent
    129 the event that a girl comes from a family of *n* children (i.e., the
    130 girl has *n - 1* siblings). By the law of total probability:
    131 
    132 $$ \mathbb{P}(S) = \sum_{n} \mathbb{P}(S|C_n) \mathbb{P}(C_n) $$
    133 
    134 The conditional probability that a girl from a family with *n* children
    135 has a sister is the compliment of the probability that all of her
    136 siblings are brothers. Therefore:
    137 
    138 $$ \mathbb{P}(S|C_n) = 1 - \frac{1}{2^{n-1}} $$
    139 
    140 For the next component, it's important to call out the fact that the
    141 problem text gives us P(*Fₙ*) (the probability that a family has *n*
    142 children), which is **not the same** as P(*Cₙ*) (the probability that a
    143 child comes from a family with *n* children). I suspect that this trips
    144 some people up, and explains some of the answers that are different than
    145 mine.
    146 
    147 To illustrate the difference, let's imagine that the probabilities were
    148 constructed from a population of 100 families. Twenty five of these
    149 families have one child, while only ten have four children. However,
    150 this arrangement results in 25 children with no siblings, while 40
    151 children would have 3 each. In other words, a randomly selected *family*
    152 is more likely to have one child than four children, but a randomly
    153 selected *child* is more likely to come from a family with four children
    154 than a family with one child. Formalizing this reasoning a bit, we find:
    155 
    156 $$ \mathbb{P}(C_n) = \frac{n \mathbb{P}(F_n)}{\sum_{m} m \mathbb{P}(F_m)} $$
    157 
    158 With our terms defined, we'll use R once more to calculate *P(S)*:
    159 
    160 ``` r
    161 n <- family_size
    162 fn <- p_family_size
    163 s_given_cn <- 1 - 1/(2^(n-1))
    164 cn <- n * fn / sum(n * fn)
    165 (s <- sum(s_given_cn * cn))
    166 ```
    167 
    168     ## [1] 0.5916667
    169 
    170 This value is equal to `71/120`, and is quite close to what we found
    171 through simulation---but only reliably so after generating around 10,000
    172 families.
    173 
    174 ## Concluding note
    175 
    176 I'm ambivalent about the practice of asking job applicants to solve
    177 probability puzzles during interviews, but it's true that sometimes our
    178 intuitions can lead us astray (e.g, confusing P(*Fₙ*) and P(*Cₙ*)), and this
    179 does come up in real-life data analyses. When I find myself dealing with
    180 such situations, I'm flexible in the approach I take: [sometimes I try to
    181 find an exact solution using probability theory first]({{< ref
    182 "posts/riddler-2/index.md" >}}), but often I program a quick simulation
    183 instead. Most of the time, I wind up doing both, because these approaches
    184 complement one another: simulations are great because they force you to be
    185 explicit about the assumptions that you're making about the problem, while
    186 analytic solutions give exact answers and are insensitive to randomness.
	www.eamoncaddigan.net Content and configuration for https://www.eamoncaddigan.net
	git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git
	Log \| Files \| Refs \| Submodules \| README