Probability Foundations: From Counting to Continuous Distributions

Distribution Reference

Discrete

| Name | Range | $P(X = k)$ | Mean | Variance |
| --- | --- | --- | --- | --- |
| Bernoulli$(p)$ | $\{0, 1\}$ | $P(1) = p,\ P(0) = 1 - p$ | $p$ | $p(1-p)$ |
| Binomial$(n, p)$ | $\{0, 1, \ldots, n\}$ | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ |
| Hypergeometric$(n, N, G)$ | $\{0, \ldots, n\}$ | $\binom{G}{k}\binom{N-G}{n-k}\big/\binom{N}{n}$ | $nG/N$ | $n\frac{G}{N}\left(1-\frac{G}{N}\right)\frac{N-n}{N-1}$ |
| Geometric$(p)$, count trials | $\{1, 2, 3, \ldots\}$ | $(1-p)^{k-1}p$ | $1/p$ | $(1-p)/p^2$ |
| Geometric$(p)$, count failures | $\{0, 1, 2, \ldots\}$ | $(1-p)^k p$ | $(1-p)/p$ | $(1-p)/p^2$ |
| Neg. Binomial$(r, p)$ | $\{0, 1, 2, \ldots\}$ | $\binom{k+r-1}{r-1}p^r(1-p)^k$ | $r(1-p)/p$ | $r(1-p)/p^2$ |
| Poisson$(\mu)$ | $\{0, 1, 2, \ldots\}$ | $e^{-\mu}\mu^k/k!$ | $\mu$ | $\mu$ |

Continuous

| Name | Range | Density $f(x)$ | CDF $F(x)$ | Mean | Variance |
| --- | --- | --- | --- | --- | --- |
| Uniform$(a, b)$ | $(a, b)$ | $1/(b-a)$ | $(x-a)/(b-a)$ | $(a+b)/2$ | $(b-a)^2/12$ |
| Normal$(\mu, \sigma^2)$ | $(-\infty, \infty)$ | $(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}$ | $\Phi\!\left(\frac{x-\mu}{\sigma}\right)$ | $\mu$ | $\sigma^2$ |
| Exponential$(\lambda)$ | $(0, \infty)$ | $\lambda e^{-\lambda x}$ | $1 - e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ |
| Gamma$(r, \lambda)$ | $(0, \infty)$ | $\frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x}$ | $1 - e^{-\lambda x}\sum_{k=0}^{r-1}\frac{(\lambda x)^k}{k!}$ (integer $r$) | $r/\lambda$ | $r/\lambda^2$ |
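The closed-form Gamma CDF in the last row holds for integer $r$ (the Erlang case). As a quick sanity sketch, using arbitrary illustrative parameters, it can be compared against direct numerical integration of the density:

```python
import math

# Check the Gamma(r, lambda) CDF formula for integer r against a midpoint-rule
# integral of the density. r, lam, x below are arbitrary illustrative values.

def gamma_pdf(x, r, lam):
    return lam**r / math.gamma(r) * x**(r - 1) * math.exp(-lam * x)

def gamma_cdf_closed(x, r, lam):
    # 1 - e^{-lx} * sum_{k<r} (lx)^k / k!, valid for integer r
    return 1 - math.exp(-lam * x) * sum(
        (lam * x)**k / math.factorial(k) for k in range(r))

def gamma_cdf_numeric(x, r, lam, n=100_000):
    # simple midpoint rule on [0, x]
    h = x / n
    return sum(gamma_pdf((i + 0.5) * h, r, lam) for i in range(n)) * h

r, lam, x = 3, 2.0, 1.5
print(gamma_cdf_closed(x, r, lam))   # ≈ 0.5768
print(gamma_cdf_numeric(x, r, lam))  # should agree to several decimals
```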

1. Foundations of Probability

1.1 Core Rules

Complement rule. $P(A^c) = 1 - P(A)$. Any “at least one” problem is best attacked via the complement.

Inclusion-exclusion. For two events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

For three events:

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$

Independence. Events $A, B$ are independent iff $P(A \cap B) = P(A)P(B)$. Independence and disjointness are entirely different concepts: disjoint events with positive probability are never independent.

1.2 Conditional Probability and Bayes’ Rule

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$

Law of total probability. If $\{A_k\}$ partitions the sample space:

$$P(B) = \sum_k P(B \mid A_k)\,P(A_k).$$

A useful distinction from the review lectures: a forward conditional (cause given, find effect) needs no Bayes’ rule, while a backward conditional (effect given, find cause) does.
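As a minimal sketch of a backward conditional, consider the classic screening-test calculation; the numbers below are made up for illustration and are not from the notes:

```python
# Backward conditional via Bayes' rule: given the effect (a positive test),
# find the cause (disease). All probabilities here are illustrative values.
p_d = 0.01          # prior P(disease)
p_pos_d = 0.95      # forward conditional P(positive | disease)
p_pos_nd = 0.05     # false-positive rate P(positive | no disease)

# Denominator via the law of total probability
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes' rule for the backward conditional
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 3))   # ≈ 0.161: small, despite the accurate test
```

The forward conditionals are given directly; only the backward one needs Bayes' rule.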

1.3 Counting Techniques

Multinomial coefficient. The number of ways to partition $n$ objects into groups of sizes $n_1, \ldots, n_m$:

$$\binom{n}{n_1, n_2, \ldots, n_m} = \frac{n!}{n_1!\, n_2! \cdots n_m!}, \qquad \sum_i n_i = n.$$

Slot method. To count arrangements with adjacency constraints: fix one type first, then place the other into the resulting gaps. Example: 8 coin tosses, given that exactly 3 are heads. Placing the 5 tails creates 6 slots (including both ends); choosing 3 of them for heads produces exactly the arrangements with no two heads adjacent. The conditional probability of non-consecutive heads is therefore:

$$\frac{\binom{6}{3}}{\binom{8}{3}}.$$
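A brute-force enumeration, as a sketch, confirms the slot-method count for this example:

```python
from itertools import combinations
from math import comb

# Among all placements of 3 heads in 8 tosses, count those with no two
# heads in adjacent positions; compare with the slot-method answer C(6,3).
n, h = 8, 3
placements = list(combinations(range(n), h))
non_adjacent = [c for c in placements
                if all(b - a >= 2 for a, b in zip(c, c[1:]))]

print(len(non_adjacent), len(placements))   # 20 56, i.e. C(6,3) and C(8,3)
```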

Conditional sample space. When conditioning restricts the outcome space, recompute as if the restricted space is the full space. Example: a 20-sided die, conditioned on the roll not exceeding 12, behaves as a uniform 12-sided die.

1.4 Conditional Distributions

Poisson conditional on sum. If $X_1 \sim \text{Pois}(\mu_1)$ and $X_2 \sim \text{Pois}(\mu_2)$ are independent, then:

$$X_1 \mid X_1 + X_2 = m \;\sim\; \text{Bin}\!\left(m,\; \frac{\mu_1}{\mu_1 + \mu_2}\right).$$

Interpretation: given $m$ total arrivals in a Poisson process with two types, the count of type 1 is Binomial with success probability equal to the rate proportion.
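The identity can be verified numerically, as a sketch with arbitrary parameter choices:

```python
from math import exp, factorial, comb

# Check that P(X1 = k | X1 + X2 = m) equals the Binomial(m, mu1/(mu1+mu2))
# pmf. mu1, mu2, m are arbitrary illustrative values.
mu1, mu2, m = 2.0, 3.0, 5

def pois_pmf(mu, k):
    return exp(-mu) * mu**k / factorial(k)

p = mu1 / (mu1 + mu2)
for k in range(m + 1):
    # conditional pmf = joint / marginal, using X1 + X2 ~ Pois(mu1 + mu2)
    lhs = pois_pmf(mu1, k) * pois_pmf(mu2, m - k) / pois_pmf(mu1 + mu2, m)
    rhs = comb(m, k) * p**k * (1 - p)**(m - k)
    assert abs(lhs - rhs) < 1e-12
print("identity verified for m =", m)
```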


2. Discrete Distribution Patterns

2.1 Distribution Selection

| Scenario | Distribution |
| --- | --- |
| Fixed $n$ independent trials, count successes | Binomial$(n, p)$ |
| Without replacement from a finite population, count successes | Hypergeometric$(n, N, G)$ |
| Independent trials until first success | Geometric$(p)$ |
| Independent trials until $r$th success (count failures) | Neg. Binomial$(r, p)$ |
| Count of events in a Poisson process interval | Poisson$(\lambda t)$ |
| $n$ large, $p$ small, $np$ moderate | Poisson$(\mu = np)$ approximation to Binomial |
| $N \gg n$ | Binomial$(n, G/N)$ approximation to Hypergeometric |

2.2 Approximation Methods

Normal approximation to Binomial. When $np$ and $n(1-p)$ are both large:

$$X \sim \text{Bin}(n, p) \;\approx\; N(np,\; np(1-p)).$$

Continuity correction for integer-valued $X$:

$$P(X = k) \approx \Phi\!\left(\frac{k + 0.5 - \mu}{\sigma}\right) - \Phi\!\left(\frac{k - 0.5 - \mu}{\sigma}\right).$$
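A short sketch of the continuity correction at a single point, using $\Phi$ built from the standard library's `erf` and arbitrary illustrative values of $n$, $p$, $k$:

```python
from math import comb, erf, sqrt

# Exact Binomial pmf vs. continuity-corrected normal approximation.
def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, k = 100, 0.5, 50
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = comb(n, k) * p**k * (1 - p)**(n - k)
approx = Phi((k + 0.5 - mu) / sigma) - Phi((k - 0.5 - mu) / sigma)
print(exact, approx)   # both ≈ 0.0796
```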

Poisson approximation to Binomial. When $n$ is large, $p$ is small, and $\mu = np$ is moderate:

$$\text{Bin}(n, p) \approx \text{Pois}(\mu).$$
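A side-by-side comparison in the rare-event regime, as a sketch with illustrative values $n = 1000$, $p = 0.002$ (so $\mu = 2$):

```python
from math import comb, exp, factorial

# Compare Binomial and Poisson pmfs when n is large and p is small.
n, p = 1000, 0.002
mu = n * p

rows = []
for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    pois = exp(-mu) * mu**k / factorial(k)
    rows.append((k, binom, pois))
    print(k, round(binom, 5), round(pois, 5))   # agree to ~3 decimals
```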

2.3 Coupon Collector Pattern

Collecting $n$ distinct types drawn uniformly at random. After $i$ types have been collected, the wait for type $i+1$ is:

$$W_i \sim \text{Geom}\!\left(\frac{n-i}{n}\right), \qquad i = 0, 1, \ldots, n-1.$$

The waiting times $W_0, W_1, \ldots, W_{n-1}$ are mutually independent, so:

$$\text{Var}\!\left(\sum W_i\right) = \sum \text{Var}(W_i).$$
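For a concrete instance, the exact mean and variance for $n = 6$ (rolling a die until every face appears) can be computed from the geometric waits; a sketch with exact rational arithmetic:

```python
from fractions import Fraction

# Total coupon-collector time for n = 6 types: sum the geometric means and,
# by independence, the geometric variances.
n = 6
ps = [Fraction(n - i, n) for i in range(n)]   # success probabilities (n-i)/n
mean = sum(1 / p for p in ps)                 # sum of 1/p_i
var = sum((1 - p) / p**2 for p in ps)         # sum of (1-p_i)/p_i^2

print(mean, float(mean))   # 147/10 14.7
print(var, float(var))     # 3899/100 38.99
```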


3. Expectation and Variance

3.1 Expectation

$$E[X] = \sum_x x\,P(X = x).$$

LOTUS (Law of the Unconscious Statistician):

$$E[g(X)] = \sum_x g(x)\,P(X = x).$$

Note that $E[g(X)] \neq g(E[X])$ in general.

Linearity. For any random variables (no independence required):

$$E\!\left[\sum_{i=1}^n a_i X_i + b\right] = \sum_{i=1}^n a_i E[X_i] + b.$$

3.2 LOTUS with Poisson Series Manipulation

For $X \sim \text{Pois}(\mu)$ and $g(x) = \frac{1}{(x+1)(x+2)\cdots(x+n)}$, find $E[g(X)]$.

Step 1. Apply LOTUS:

$$E[g(X)] = \sum_{k=0}^{\infty} \frac{1}{(k+1)(k+2)\cdots(k+n)} \cdot \frac{e^{-\mu}\mu^k}{k!}.$$

Step 2. Combine denominators: $(k+1)(k+2)\cdots(k+n) \cdot k! = (k+n)!$, so:

$$= \sum_{k=0}^{\infty} \frac{e^{-\mu}\mu^k}{(k+n)!}.$$

Step 3. Multiply and divide by $\mu^n$, then substitute $j = k + n$:

$$= \frac{1}{\mu^n}\sum_{j=n}^{\infty}\frac{e^{-\mu}\mu^j}{j!} = \frac{1}{\mu^n}\,P(X \geq n).$$

Result:

$$\boxed{E[g(X)] = \frac{1}{\mu^n}\left[1 - \sum_{j=0}^{n-1}\frac{e^{-\mu}\mu^j}{j!}\right]}$$

The key recognition: $\sum_{j=n}^{\infty} e^{-\mu}\mu^j/j!$ is the Poisson upper-tail probability $P(X \geq n)$.

3.3 Indicator Method

If $X$ counts how many of the events $A_1, \ldots, A_n$ occur, decompose $X = I_1 + \cdots + I_n$ with $I_j = \mathbf{1}(A_j)$. Then:

$$E[X] = \sum_{j=1}^n P(A_j).$$

No independence is required. This is the single most common technique on exams.

3.4 Variance

$$\text{Var}(X) = E[X^2] - (E[X])^2, \qquad \text{Var}(aX + b) = a^2\,\text{Var}(X).$$

For independent $X, Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.

For general (possibly dependent) $X, Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$.

3.5 Variance via Indicators

For $X = \sum_i I_i$:

$$E[X^2] = \sum_i E[I_i^2] + \sum_{i \neq j} E[I_i I_j] = \sum_i P(A_i) + \sum_{i \neq j} P(A_i \cap A_j),$$

since $I_i^2 = I_i$ and $I_i I_j = \mathbf{1}(A_i \cap A_j)$. The cross terms $P(A_i \cap A_j)$ may differ depending on the structural relationship between indicators (e.g., adjacent vs. non-adjacent pairs).
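A standard worked instance (not from the notes, added as a sketch) is the number of fixed points of a uniform random permutation, where $P(A_i) = 1/n$ and $P(A_i \cap A_j) = 1/(n(n-1))$:

```python
from itertools import permutations
from fractions import Fraction

# Mean and variance of the fixed-point count via indicators, checked by
# enumerating all n! permutations for a small n.
n = 5
mean = n * Fraction(1, n)                                   # sum of P(A_i)
ex2 = n * Fraction(1, n) + n * (n - 1) * Fraction(1, n * (n - 1))
var = ex2 - mean**2

counts = [sum(p[i] == i for i in range(n)) for p in permutations(range(n))]
bf_mean = Fraction(sum(counts), len(counts))
bf_ex2 = Fraction(sum(c * c for c in counts), len(counts))

print(mean, var)   # 1 1: both mean and variance equal 1
```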


4. Probability Inequalities

4.1 Markov’s Inequality

For $X \geq 0$:

$$P(X \geq a) \leq \frac{E[X]}{a}.$$

When $X$ takes values in $[c, d]$, apply Markov to $X - c \geq 0$:

$$P(X \geq a) = P(X - c \geq a - c) \leq \frac{E[X] - c}{a - c}.$$

4.2 Chebyshev’s Inequality

For any random variable $X$ with finite variance:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}.$$

Equivalently: $P(|X - \mu| \geq a) \leq \text{Var}(X)/a^2$.

One-sided bound. $\{X \geq a\} \subseteq \{|X - \mu| \geq a - \mu\}$ for $a > \mu$, so:

$$P(X \geq a) \leq \frac{\text{Var}(X)}{(a - \mu)^2}.$$

Tighter bound with side information. Chebyshev bounds both tails simultaneously: with $a = \mu + k\sigma$ and $b = \mu - k\sigma$,

$$P(X \geq a) + P(X \leq b) \leq \frac{1}{k^2},$$

so if a lower bound on $P(X \leq b)$ is known, it can be subtracted to tighten the bound on $P(X \geq a)$.

Symmetry refinement. If $X$ is symmetric about $\mu$, then $P(X \geq \mu + a) = P(X \leq \mu - a)$, and the Chebyshev bound can be halved:

$$P(X \geq \mu + a) \leq \frac{1}{2} \cdot \frac{\text{Var}(X)}{a^2}.$$
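These bounds can be loose; a sketch comparing them with an exact tail for $X \sim \text{Bin}(20, 0.5)$ (an arbitrary illustrative choice) shows the typical gap:

```python
from math import comb

# Exact tail P(X >= 15) for X ~ Bin(20, 0.5) vs. the Markov bound and the
# one-sided Chebyshev bound {X >= a} ⊆ {|X - mu| >= a - mu}.
n, p, a = 20, 0.5, 15
mu, var = n * p, n * p * (1 - p)

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))
markov = mu / a
cheb_one_sided = var / (a - mu) ** 2

print(round(exact, 4), round(markov, 4), round(cheb_one_sided, 4))
# exact ≈ 0.0207, Markov 0.6667, Chebyshev 0.2
```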


5. Continuous Distributions

5.1 PDF and CDF

For a continuous random variable $X$ with density $f$ and CDF $F$:

$$P(a \leq X \leq b) = \int_a^b f(x)\,dx = F(b) - F(a), \qquad f(x) = F'(x).$$

Note that $f(x)$ is a density, not a probability; its units are $1/(\text{units of } X)$. Also, $P(X = a) = 0$ for any specific value $a$.

5.2 Exponential Distribution

The exponential distribution $\text{Exp}(\lambda)$ has the survival function:

$$P(X > t) = e^{-\lambda t}, \qquad t > 0.$$

Memoryless property:

$$P(X > s + t \mid X > s) = P(X > t).$$

This is the defining characteristic: the conditional distribution of the remaining lifetime, given survival to time $s$, is the same as the original distribution.

5.3 Minimum and Maximum of Independent Variables

Minimum. For $M = \min(X_1, \ldots, X_n)$ with the $X_i$ independent:

$$P(M > t) = \prod_{i=1}^n P(X_i > t).$$

Competing exponentials. If $X_i \sim \text{Exp}(\lambda_i)$ independently:

$$\min(X_1, \ldots, X_n) \sim \text{Exp}(\lambda_1 + \cdots + \lambda_n).$$

Maximum. For $Y = \max(X_1, \ldots, X_n)$ with the $X_i$ independent:

$$P(Y \leq t) = \prod_{i=1}^n P(X_i \leq t).$$
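A Monte Carlo sketch of the competing-exponentials fact, with arbitrary illustrative rates:

```python
import random
from math import exp

# Estimate P(min > t) for independent exponentials and compare with the
# survival function of Exp(sum of rates). Rates and t are illustrative.
random.seed(0)
rates = [1.0, 2.0, 3.0]
t = 0.2
trials = 200_000

hits = sum(min(random.expovariate(r) for r in rates) > t
           for _ in range(trials))
print(hits / trials, exp(-sum(rates) * t))   # both ≈ e^{-1.2} ≈ 0.301
```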


6. Poisson Process and Gamma Distribution

6.1 Poisson Process

For a Poisson process with rate $\lambda$:

$$N(t) \sim \text{Pois}(\lambda t), \qquad T_r \sim \text{Gamma}(r, \lambda),$$

where $N(t)$ is the count in $[0, t]$ and $T_r$ is the $r$th arrival time.

Month indexing convention. January $= [0, 1]$, February $= [1, 2]$, …, so month $k$ occupies $[k-1, k]$.

6.2 Gamma-Poisson Duality

The Gamma CDF equals the Poisson upper tail:

$$P(T_r \leq t) = P(N(t) \geq r) = 1 - \sum_{k=0}^{r-1}\frac{e^{-\lambda t}(\lambda t)^k}{k!}.$$

This identity converts between arrival time probabilities and count probabilities.
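The duality can be checked by simulating $T_r$ as a sum of $r$ independent $\text{Exp}(\lambda)$ interarrival times; a sketch with illustrative parameters:

```python
import random
from math import exp, factorial

# Monte Carlo estimate of P(T_r <= t) vs. the Poisson upper tail P(N(t) >= r).
random.seed(1)
lam, r, t = 2.0, 3, 1.5
trials = 100_000

sim = sum(sum(random.expovariate(lam) for _ in range(r)) <= t
          for _ in range(trials)) / trials
tail = 1 - sum(exp(-lam * t) * (lam * t)**k / factorial(k) for k in range(r))
print(sim, tail)   # both ≈ 0.577
```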

6.3 Memoryless Property in Poisson Processes

After observing the process up to time $s$, the remaining time until the next arrival is $\text{Exp}(\lambda)$, independent of the history. This follows from the memoryless property of the exponential and the independent increments of the Poisson process.

6.4 Poisson Superposition and Thinning

If $X_1 \sim \text{Pois}(\mu_1)$ and $X_2 \sim \text{Pois}(\mu_2)$ are independent:

$$X_1 + X_2 \sim \text{Pois}(\mu_1 + \mu_2).$$

Conversely, in a Poisson process with rate $\lambda$ where each arrival is independently classified as type 1 with probability $p$, the type 1 arrivals form a Poisson process with rate $\lambda p$.


Pattern Recognition Table

| Signal | Technique |
| --- | --- |
| “Upper/lower bound”, “at least/most” with limited info | Markov or Chebyshev |
| $E[W]$ counting pairs, matches, or occurrences | Indicator decomposition + linearity |
| $\text{Var}(X)$ with indicator structure | $E[X^2]$ via $\sum E[I_i^2] + \sum_{i \neq j} E[I_i I_j]$ |
| “With replacement” + count | Binomial |
| “Without replacement” + count | Hypergeometric |
| Sequential random experiment (urn $\to$ draw) | Law of total probability |
| “First to achieve” alternating game | Geometric series |
| Poisson process, arrival time of $r$th event | Gamma-Poisson duality |
| Min of independent lifetimes | Multiply survival functions |
| $g(x)$ with factorial-like denominator, $X \sim \text{Pois}$ | LOTUS $\to$ combine denominators $\to$ Poisson tail |
| $n$ large, $p$ moderate | Normal approximation + continuity correction |
| $n$ large, $p$ small, $np$ moderate | Poisson approximation |

References

[1] J. Pitman, Probability, corrected 6th printing. New York, NY: Springer, 1997.

[2] A. Lucas, STAT 134 Spring 2026 Lecture Notes 1–22, UC Berkeley.