Using simulation as a substitute for analytical solutions

Suppose we want to know the probability of some event. There are two ways we can find that: use algebra, calculus, probability theory, etc. to find the answer mathematically, * or* we can write a program to find the answer through simulation.

For example, in the famous birthday problem (taught in all intro to probability theory courses) we are interested in the probability that, in a set of \(n\) randomly chosen people, some pair of them will have the same birthday (which we refer to as event \(A\)).

Using Kolmogorov axioms and conditional probability we can derive an analytical solution for \(P(A)\):

\[P(A) = 1 - \frac{n! \cdot {\binom {365}{n}}}{365^{n}}\]

Deriving an exact solution analytically is an interesting challenge * but we have computers now* and can find an approximate answer to this scenario and more. Using random number generation we can simulate an outcome and then count how many times a particular outcome occurred out of all the simulations.

For evaluating the accuracy of our simulation-based results, we will be using the following R(2019) code for calculating \(P(A)\):

```
pa <- function(n) {
1 - (factorial(n) * choose(365, n)) / (365^n)
}
```

For example, out of 23 people with uniformly distributed birthdays, the probability that two people will share a birthday is 50.7%.

To facilitate some of the work, I rely on the {purrr}(2019) R package, so let’s attach that:

```
library(purrr)
```

We start with a basic function that randomly generates \(n\) birthdays, assuming a year with 365 days and that each day is equally likely:

```
birthdays <- function(n) {
sample(1:365, n, replace = TRUE)
}
```

Then we have a function which simulates a single outcome: either 2 or more people share a birthday (there are duplicate values) or they were all born on different days:

```
simulate_a <- function(n) {
bdays <- birthdays(n)
any(duplicated(bdays))
}
```

Approximating \(P(A)\) involves simulating a bunch of outcomes and calculating the proportion of those outcomes where the event \(A\) was observed:

```
pa2 <- function(n, simulations) {
# output a TRUE/FALSE in each simulation:
simulation_outcomes <- map_lgl(1:simulations, ~ simulate_a(n))
# proportion of simulations that are TRUE:
sum(simulation_outcomes) / simulations
}
```

Lastly, we should do a few repetitions to get more approximations because we shouldn’t trust just one.

```
repeat_pa2 <- function(n, simulations, repeats) {
map_dbl(1:repeats, ~ pa2(n, simulations))
}
```

Now we can compare `pa(n)`

and `pa2(n)`

at different values of \(n\):

```
set.seed(23)
approx_pa <- map_dfr(
c(5, 10, 20, 23, 30, 40, 50),
function(n) {
pas <- repeat_pa2(n, simulations = 1e5, repeats = 10)
data.frame(
n = n,
analytical = pa(n),
simulated = sprintf("%.3f (%.3f-%.3f)", mean(pas), min(pas), max(pas)))
}
)
```

n | analytical | simulated |
---|---|---|

5 | 0.027 | 0.027 (0.026-0.028) |

10 | 0.117 | 0.117 (0.115-0.118) |

20 | 0.411 | 0.412 (0.409-0.415) |

23 | 0.507 | 0.507 (0.504-0.509) |

30 | 0.706 | 0.707 (0.705-0.709) |

40 | 0.891 | 0.892 (0.890-0.893) |

50 | 0.970 | 0.970 (0.970-0.971) |

Simulation gives us probabilities pretty close to the ones we get using the analytically-derived formula!

Suppose we were instead interested in a different question:

What is the probability of event \(B\), that in a group of \(n\) people, there will be a sequence of \(m\) consecutive days in which there is at least one birthday each day?

Good luck trying to derive a generalized solution analytically! Luckily, * we have computers*! Let’s start by defining a function that simulates a single outcome:

```
simulate_b <- function(n, m) {
bdays <- birthdays(n) # re-using from before
# two birthdays on same day doesn't matter anymore:
bdays <- unique(bdays)
# calculate how many days between each birthday:
diffs <- c(0, diff(sort(bdays))) # 33, 35, 36 => 0, 2, 1
for (i in m:length(diffs)) {
# event B is m differences of 1 days between birthdays:
if (sum(diffs[(i - m + 1):i] == 1) == m) {
return(TRUE) # stop checking at 1st observance of B
}
}
return(FALSE) # event B was not observed
}
```

We proceed as before: we run `simulate_b`

a bunch of times and calculate the proportion of those times where \(B\) occurred:

```
pb <- function(n, m, simulations) {
simulation_outcomes <- map_lgl(1:simulations, ~ simulate_b(n, m))
mean(simulation_outcomes)
}
```

And repeat it a few times to capture the uncertainty:

```
repeat_pa2 <- function(n, m, simulations, repeats) {
map_dbl(1:repeats, ~ pb(n, m, simulations))
}
```

So, what is the probability that in a group of, say, 77 people (with uniformly random birthdays across 365 days) we will find 6 consecutive days in which there is at least one birthday each day?

```
set.seed(42)
summary(approx_pb <- repeat_pa2(77, 6, 1e5, 10))
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002000 0.002093 0.002210 0.002197 0.002267 0.002460
```

For that particular scenario and event, the probability is approximately 0.220%. The way we’ve coded it means we can quickly approximate the probability for any number of people \(n\) and any number of days \(m\).

If you’re interested in practicing this skill, I encourage you to practice with the following:

- Verify to yourself, through simulation, that in the Monty Hall problem, contestants who switch doors have a \(\frac{2}{3}\) chance of winning but contestants who stick to their initial choice have a \(\frac{1}{3}\) chance of winning.
- Suppose two people are playing non-stop rock–paper–scissors (also called roshambo). Player 1 (
**P1**) always chooses one of the three shapes at random, but Player 2 (**P2**) always chooses between the two shapes they didn’t pick the previous round. Is one tactic better than the other – that is, which one has a higher probability of winning than the other one? (Or are they both about the same?) - Suppose
**P1**adjusts their strategy and always picks the shape that**P2**picked the previous round, while**P2**sticks to their original strategy. So, for example, if**P2**chose “paper” (a flat hand) in round 1, then**P1**will definitely pick “paper” in round 2, while**P2**will randomly pick between “scissors” and “rock”. Is this a better or worse strategy for**P1**than always randomly picking between the three shapes? How does the win probability of**P1**’s new strategy compare to the win probablity of**P2**’s strategy?

- Try different numbers of simulations (e.g. 1K, 10K, 100K, 1M, 10M, 100M) with the first example. How does the number of simulations affect the variability of approximations you get – that is, how wide the range is? How many simulations do you need to consistently see the first three significant digits of the approximated results be the same as the first three significant digits of the result from the closed-form solution?
- At a meeting of a dozen (12) sets of twins (assuming each pair was born on the same day), what is the probability that two sets of twins (4 people total) share a birthday?
- Two baseball teams (each composed of 9 players) arrive at a field to play each other. What is probability that nobody on either team shares a birthday with one of their teammates and that exactly one player from team 1 shares a birthday with exactly one player from team 2?

This should go without saying but feel free to use whatever language you’re most comfortable with (Python, Julia, etc.). Also maybe find a friend to try these exercises with you and then compare answers and approaches. (Perhaps you’ll learn something new from each other!)

If you enjoyed this tutorial, you may be interested in my other tutorials:

- Bayesian optimization in R: step-by-step demonstration of BayesOpt for derivative-free minimization of a noiseless, black-box function
- Solving a logic grid puzzle with integer programming: modeling a logic puzzle as a constraint satisfaction problem with {OMPR}

You might have noticed a few blue links with “W”s on this page. Those are links to the Wikipedia articles on those topics and if you hover over them, you will see a preview of the article. This is possible with the ContextCards library developed by my coworker Joaquin over at Wikimedia, based on the Popups extension for MediaWiki.

Henry, Lionel, and Hadley Wickham. 2019. *Purrr: Functional Programming Tools*. https://CRAN.R-project.org/package=purrr.

R Core Team. 2019. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/bearloga/approximating-probability, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".