www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git
Log | Files | Refs | Submodules | README

index.Rmd (7208B)


      1 ---
      2 title: "Sister probability puzzle"
      3 description: "My solution to a question about the probability of having a sister."
      4 date: 2023-01-18T21:47:38-08:00
      5 draft: False
      6 katex: True
      7 knit: (function(input, ...) {
      8     rmarkdown::render(
      9       input,
     10       output_dir = file.path(Sys.getenv("HUGO_ROOT"), "content/posts")
     11     )
     12   })
     13 output: 
     14   md_document:
     15     variant: markdown
     16     preserve_yaml: true
     17 ---
     18 
     19 ```{r setup, include=FALSE}
     20 knitr::opts_chunk$set(
     21   echo = TRUE,
     22   fig.path = file.path("figs", 
     23                        sub("\\.Rmd$", "",
     24                            basename(rstudioapi::getActiveDocumentContext()$path)),
     25                        "")
     26 )
     27 knitr::opts_knit$set(
     28   base.dir = file.path(Sys.getenv("HUGO_ROOT"), "static"),
     29   base.url = "/"
     30 )
     31 ```
     32 
     33 A smart and talented friend, who's a fellow data scientist by profession, came
     34 across a probability puzzle that's supposedly given in some job interviews for
     35 data science roles. The problem asks the following:
     36 
     37 > Assume the distribution of children per family is such that the probability of
     38 > having 0 children is 0.30, 1 child is 0.25, 2 children is 0.20, 3 children is
     39 > 0.15, 4 children is 0.10, and 5 or more children is 0. Consider a random girl 
     40 > in the population of children. What is the probability that she has a sister?
     41 
     42 Checking her work, my friend disagreed with the answers she found online, and
     43 wanted to know how I'd solve the problem. Since my solution seems to be
     44 different than others I've found so far, I thought I'd share it with the whole
     45 Internet, which will certainly tell me whether I'm wrong.
     46 
     47 ## Simulating the problem
     48 
     49 I'll start with a simulation—I'm a bit more confident in my programming than my
     50 probability theory (I've been doing it longer), and it makes some of the tricky
     51 parts of the problem explicit.
     52 
     53 First, we'll simulate a random sample of _families_, since the problem gives us
     54 probabilities in terms of _children per family_. After generating a pool of
     55 random family sizes, we can use the binomial distribution to generate a random
     56 number of girls for each family (naively assuming that the probability of a
     57 child being a girl is 0.5).
     58 
     59 ```{r simulate_families}
     60 set.seed(232636039)
     61 num_families <- 100000
     62 
     63 family_size <- seq(0, 5)
     64 p_family_size <- c(0.3, 0.25, 0.20, 0.15, 0.1, 0.0)
     65 
     66 girls_per_family <- sample(family_size, num_families, replace = TRUE,
     67                            prob = p_family_size) |>
     68   (\(n) rbinom(num_families, size = n, prob = 0.5))()
     69 ```
     70 
     71 If a family has more than two girls, these girls have sisters. The probability
     72 that a randomly selected girl has a sister is approximated by the number of
     73 girls with sisters divided by the total number of girls in our simulation.
     74 
     75 ```{r estimate_p}
     76 # If there is more than one girl in a family, these girls have sisters
     77 girls_with_sisters <- sum(girls_per_family[girls_per_family > 1])
     78 girls_total <- sum(girls_per_family)
     79 
     80 (p_est <- girls_with_sisters / girls_total)
     81 ```
     82 
     83 The code above estimates that the probability that a girl (sampled at random
     84 from the _population of children_) has a sister is approximately `r round(p_est,
     85 2)`.
     86 
     87 ## Plotting the simulation
     88 
     89 Just for fun, let's plot the estimate as it develops over the number of
     90 simulated families. This also helps us get a sense of how trustworthy the
     91 estimate is—has it converged to a stable value, or is it still swinging around?
     92 For this, I'll finally use parts of the [Tidyverse](https://www.tidyverse.org/).
     93 
     94 ```{r load_tidyverse}
     95 suppressPackageStartupMessages(library(dplyr))
     96 suppressPackageStartupMessages(library(ggplot2))
     97 ```
     98 
     99 ```{r construct_df, cache=TRUE, dependson="simulate_families"}
    100 sister_dat <- tibble(girls_per_family) |>
    101   mutate(families = seq(n()),
    102          has_sister = girls_per_family > 1,
    103          total_girls = cumsum(girls_per_family),
    104          girls_with_sisters = cumsum(girls_per_family * has_sister),
    105          p_est = girls_with_sisters / total_girls)
    106 ```
    107 
    108 ```{r plot_estimate, cache=TRUE, dependson="construc_df"}
    109 sister_dat |>
    110   # We'll skip any beginning rows that were generated before we sampled girls,
    111   # since 0/0 is undefined.
    112   filter(total_girls > 0) |>
    113   ggplot(aes(families, p_est)) +
    114   geom_step() +
    115   geom_hline(yintercept = 71/120, linetype = "dashed") +
    116   scale_x_log10() +
    117   theme_minimal() +
    118   labs(x = "Number of random families", y = "Observed proportion of girls with sisters")
    119 ```
    120 
    121 ## Analytic solution
    122 
    123 If you're wondering how I chose a value for the dashed line, here's an analytic
    124 solution to the puzzle.
    125 
    126 We'll let _S_ represent the event that a girl has a sister, _Fₙ_ represent the
    127 event that a family has _n_ children, and _Cₙ_ represent the event that a girl
    128 comes from a family of _n_ children (i.e., the girl has _n - 1_ siblings). By
    129 the law of total probability:
    130 
    131 $$ \mathbb{P}(S) = \sum_{n} \mathbb{P}(S|C_n) \mathbb{P}(C_n) $$
    132 
    133 The conditional probability that a girl from a family with _n_ children has a
    134 sister is the compliment of the probability that all of her siblings are
    135 brothers. Therefore:
    136 
    137 $$ \mathbb{P}(S|C_n) = 1 - \frac{1}{2^{n-1}} $$
    138 
    139 For the next component, it's important to call out the fact that the problem
    140 text gives us P(_Fₙ_) (the probability that a family has _n_ children), which is
    141 **not the same** as P(_Cₙ_) (the probability that a child comes from a family
    142 with _n_ children). I suspect that this trips some people up, and explains some
    143 of the answers that are different than mine.
    144 
    145 To illustrate the difference, let's imagine that the probabilities were
    146 constructed from a population of 100 families. Twenty five of these families
    147 have one child, while only ten have four children. However, this arrangement
    148 results in 25 children with no siblings, while 40 children would have 3 each. In
    149 other words, a randomly selected _family_ is more likely to have one child than
    150 four children, but a randomly selected _child_ is more likely to come from a
    151 family with four children than a family with one child. Formalizing this
    152 reasoning a bit, we find:
    153 
    154 $$ \mathbb{P}(C_n) = \frac{n \mathbb{P}(F_n)}{\sum_{m} m \mathbb{P}(F_m)} $$
    155 
    156 With our terms defined, we'll use R once more to calculate _P(S)_:
    157 
    158 ```{r calculate_p}
    159 n <- family_size
    160 fn <- p_family_size
    161 s_given_cn <- 1 - 1/(2^(n-1))
    162 cn <- n * fn / sum(n * fn)
    163 (s <- sum(s_given_cn * cn))
    164 ```
    165 
    166 This value is equal to `71/120`, and is quite close to what we found through
    167 simulation—but only reliably so after generating around 10,000 families.
    168 
    169 ## Concluding note
    170 
    171 I'm ambivalent about the practice of asking job applicants to solve probability
    172 puzzles during interviews, but it's true that sometimes our intuitions can lead
    173 us astray (e.g, confusing P(_Fₙ_) and P(_Cₙ_)), and this does come up in
    174 real-life data analyses. When I find myself dealing with such situations, I'm
    175 flexible in the approach I take: sometimes I try to find an exact solution using
    176 probability theory first, but often I program a quick simulation instead. Most
    177 of the time, I wind up doing both, because these approaches complement one
    178 another: simulations are great because they force you to be explicit about the
    179 assumptions that you're making about the problem, while analytic solutions give
    180 exact answers and are insensitive to randomness.