Random Variables, Expectation, Variance, Normal Approximation, Discrete Distributions, Poisson, Symmetry



Distribution Reference (Chapter 3 Scope)

| Name | Range | $P(X = k)$ | Mean | Variance |
|---|---|---|---|---|
| Bernoulli $(p)$ | $\{0, 1\}$ | $P(1) = p,\; P(0) = 1 - p$ | $p$ | $p(1-p)$ |
| Binomial $(n, p)$ | $\{0, 1, \ldots, n\}$ | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ |
| Hypergeometric $(n, N, G)$ | $\{0, \ldots, n\}$ | $\binom{G}{k}\binom{N-G}{n-k} / \binom{N}{n}$ | $nG/N$ | $n(G/N)(1 - G/N)(N-n)/(N-1)$ |
| Geometric $(p)$ on $\{1, 2, 3, \ldots\}$ | $\{1, 2, 3, \ldots\}$ | $(1-p)^{k-1}p$ | $1/p$ | $(1-p)/p^2$ |
| Geometric $(p)$ on $\{0, 1, 2, \ldots\}$ | $\{0, 1, 2, \ldots\}$ | $(1-p)^{k}p$ | $(1-p)/p$ | $(1-p)/p^2$ |
| Neg. Binomial $(r, p)$ (failures $F_r$) | $\{0, 1, 2, \ldots\}$ | $\binom{k+r-1}{r-1}p^r(1-p)^k$ | $r(1-p)/p$ | $r(1-p)/p^2$ |
| Poisson $(\mu)$ | $\{0, 1, 2, \ldots\}$ | $e^{-\mu}\mu^k / k!$ | $\mu$ | $\mu$ |
| Uniform on $\{a, \ldots, b\}$ | $\{a, a+1, \ldots, b\}$ | $1/(b - a + 1)$ | $(a+b)/2$ | $[(b-a+1)^2 - 1]/12$ |

The Geometric distribution has two conventions. Pitman primarily uses $\{1, 2, \ldots\}$ (waiting time to first success). The $\{0, 1, 2, \ldots\}$ version counts failures before the first success. Always check the problem statement.

The Negative Binomial $(r, p)$ row above counts failures $F_r$ before the $r$th success. The waiting time $T_r = F_r + r$ lives on $\{r, r+1, \ldots\}$ with $E(T_r) = r/p$ and $\text{SD}(T_r) = \sqrt{r(1-p)}/p$.

Convention used throughout these notes: $q = 1 - p$ wherever $p$ is a success probability.
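The table entries can be sanity-checked numerically. A minimal sketch (the $n = 10$, $p = 0.3$ values are assumed for illustration): compute the binomial mean and variance by direct enumeration of the pmf and compare with $np$ and $np(1-p)$.

```python
import math

def binomial_pmf(n, p, k):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def mean_and_var(pmf, support):
    # E(X) = sum x P(X=x);  Var(X) = E(X^2) - [E(X)]^2
    m = sum(x * pmf(x) for x in support)
    v = sum(x**2 * pmf(x) for x in support) - m**2
    return m, v

n, p = 10, 0.3
m, v = mean_and_var(lambda k: binomial_pmf(n, p, k), range(n + 1))
assert abs(m - n * p) < 1e-9            # mean np
assert abs(v - n * p * (1 - p)) < 1e-9  # variance np(1-p)
```

The same `mean_and_var` helper works for any row of the table once its pmf and support are supplied.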


Section 3.1: Introduction to Random Variables

Core idea: a random variable is a numerical summary of a random outcome, and its distribution tells you the probability of each possible value.

1. Random Variable and Its Distribution

A random variable $X$ is denoted by a capital letter. A lowercase letter $x$ is a dummy variable representing a specific possible value.

Range: the set of all possible values of $X$. Example: the number of heads in 4 coin tosses has range $\{0, 1, 2, 3, 4\}$.

The distribution of $X$ is the collection of probabilities

$$P(X = x), \quad x \in \text{range of } X,$$

satisfying two conditions:

$$P(X = x) \geq 0 \quad \text{(nonnegativity)}, \qquad \sum_{x} P(X = x) = 1 \quad \text{(sums to 1)}.$$

For any subset $B$ of the range:

$$P(X \in B) = \sum_{x \in B} P(X = x).$$

This follows because the events $(X = x)$ for distinct $x$ are mutually exclusive and exhaustive.

2. Functions of a Random Variable

If $X = g(W)$, then the distribution of $X$ is derived from that of $W$:

$$P(X = x) = P(g(W) = x) = \sum_{w:\, g(w) = x} P(W = w).$$

The function $g$ may send multiple values of $w$ to the same $x$, so all such probabilities must be summed.

Transforming events. Convert the event about $g(X)$ back to a condition on $X$:

$$P(2X \leq 5) = P(X \leq 5/2), \qquad P(X^2 \leq 5) = P(-\sqrt{5} \leq X \leq \sqrt{5}).$$

Watch for: sign flips with negative multipliers, and splitting into two branches when removing squares or absolute values.
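The push-forward sum above is mechanical enough to code. A sketch, with an assumed example ($W$ uniform on $\{-2, \ldots, 2\}$ and $g(w) = w^2$, chosen so that two values of $w$ collide at the same $x$):

```python
from collections import defaultdict

# Distribution of X = g(W): sum P(W = w) over all w with g(w) = x.
p_W = {w: 1 / 5 for w in (-2, -1, 0, 1, 2)}  # assumed uniform example

def push_forward(p_W, g):
    p_X = defaultdict(float)
    for w, pw in p_W.items():
        p_X[g(w)] += pw  # several w may map to the same x
    return dict(p_X)

p_X = push_forward(p_W, lambda w: w * w)
assert abs(p_X[4] - 2 / 5) < 1e-9   # P(X=4) = P(W=-2) + P(W=2)
assert abs(sum(p_X.values()) - 1) < 1e-9
```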

3. Joint Distribution

For two random variables $X, Y$ defined on the same setting:

$$P(x, y) = P(X = x,\, Y = y), \qquad P(x, y) \geq 0, \qquad \sum_{\text{all } (x,y)} P(x,y) = 1.$$

Marginal distributions are obtained by summing out the other variable:

$$P(X = x) = \sum_{y} P(x,y), \qquad P(Y = y) = \sum_{x} P(x,y).$$

Event probabilities from the joint distribution:

$$P(X < Y) = \sum_{(x,y):\, x < y} P(x,y), \qquad P(X + Y = z) = \sum_{x} P(x,\, z - x).$$

You cannot determine the distribution of $X + Y$ from the marginal distributions alone; the joint distribution is required. (Pitman's Example 3 shows that the same marginals can produce different distributions of $X + Y$ under different dependence structures.)
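Both marginalization and the convolution-style sum for $X + Y$ operate directly on the joint table. A sketch with an assumed toy joint table (two independent fair bits, kept simple so the answers are obvious):

```python
from collections import defaultdict

# Assumed joint table P(X=x, Y=y): two independent fair 0/1 variables.
joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

def marginal_X(joint):
    m = defaultdict(float)
    for (x, y), p in joint.items():
        m[x] += p  # P(X=x) = sum over y of P(x, y)
    return dict(m)

def dist_of_sum(joint):
    s = defaultdict(float)
    for (x, y), p in joint.items():
        s[x + y] += p  # P(X+Y=z) = sum over x of P(x, z-x)
    return dict(s)

assert abs(marginal_X(joint)[0] - 0.5) < 1e-9
assert abs(dist_of_sum(joint)[1] - 0.5) < 1e-9  # P(X+Y=1) = 1/2
```

Swapping in a dependent joint table with the same marginals changes `dist_of_sum` but not `marginal_X`, which is exactly the point made above.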

4. Same Distribution vs. Equality

Same distribution: $P(X = v) = P(Y = v)$ for all $v$ in the common range.

Equality: $P(X = Y) = 1$.

Equality implies same distribution, but the converse is false. Classic example: sampling from $\{1, 2, 3\}$ without replacement, the first draw $X$ and second draw $Y$ have the same uniform distribution, but $P(X = Y) = 0$.

Change of variable principle: if $X$ and $Y$ have the same distribution, then $g(X)$ and $g(Y)$ have the same distribution for any function $g$.

5. Conditional Distribution

Given an event $A$ with $P(A) > 0$:

$$P(Y = y \mid A) = \frac{P(Y = y \text{ and } A)}{P(A)}.$$

Given $X = x$:

$$P(Y = y \mid X = x) = \frac{P(X = x,\, Y = y)}{P(X = x)}.$$

This is the column of the joint table at $X = x$, renormalized by the column sum.

Multiplication rule:

$$P(X = x,\, Y = y) = P(X = x) \cdot P(Y = y \mid X = x).$$

6. Independence

$X$ and $Y$ are independent iff

$$P(X = x,\, Y = y) = P(X = x) \cdot P(Y = y) \quad \text{for all } x, y.$$

Equivalently: the conditional distribution of $Y$ given $X = x$ does not depend on $x$ (and vice versa).

To check independence from a joint table: verify the product rule for every cell. A single failure means the variables are dependent.

For $n$ variables: $X_1, \ldots, X_n$ are independent iff

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) \quad \text{for all } (x_1, \ldots, x_n).$$

Three consequences of independence:

  1. Functions of independent variables are independent: if the $X_j$ are independent, so are the $f_j(X_j)$.
  2. Disjoint blocks are independent: if $X_1, \ldots, X_6$ are independent, then $(X_1, X_2)$ and $(X_3, X_4, X_5)$ are independent.
  3. Combining 1 and 2: functions of disjoint blocks are independent.
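The cell-by-cell check is easy to automate. A sketch (the two example tables are assumed: one genuinely independent, one perfectly correlated):

```python
import itertools

# Independence check: P(x, y) must equal P(X=x) P(Y=y) in every cell.
def is_independent(joint, tol=1e-9):
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    px = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0) - px[x] * py[y]) < tol
               for x, y in itertools.product(xs, ys))

indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # product table
dep = {(0, 0): 0.5, (1, 1): 0.5}                        # X = Y, dependent
assert is_independent(indep)
assert not is_independent(dep)
```

Note that `dep` fails at a zero cell: $P(0, 1) = 0$ while the marginal product is $0.25$, a single failure and hence dependence.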

7. Multinomial Distribution

A generalization of the binomial. With $m$ categories, probability $p_i$ for category $i$ ($\sum p_i = 1$), and $n$ independent trials:

$$P(N_1 = n_1, \ldots, N_m = n_m) = \frac{n!}{n_1! \cdots n_m!}\, p_1^{n_1} \cdots p_m^{n_m}, \qquad \sum n_i = n.$$

The factor $p_1^{n_1} \cdots p_m^{n_m}$ is the probability of one specific ordering; $n!/(n_1! \cdots n_m!)$ counts the number of such orderings.

Since $N_1 + \cdots + N_m = n$ is a constraint, the $N_i$ are not independent.
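The pmf translates directly into code. A sketch, using an assumed example (a fair die rolled 6 times, each face appearing exactly once):

```python
import math

def multinomial_pmf(counts, probs):
    # n! / (n1! ... nm!) * p1^n1 ... pm^nm
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = 1.0
    for c, p in zip(counts, probs):
        prob *= p**c
    return coef * prob

# Fair die, 6 rolls, each face once: 6! orderings out of 6^6 outcomes.
p = multinomial_pmf([1] * 6, [1 / 6] * 6)
assert abs(p - math.factorial(6) / 6**6) < 1e-12
```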

8. Symmetry of Distributions

$X$ is symmetric about 0 iff $P(X = -x) = P(X = x)$ for all $x$; equivalently, $-X$ has the same distribution as $X$. Consequence: $P(X \geq a) = P(X \leq -a)$.

$X$ is symmetric about $b$ iff $X - b$ is symmetric about 0.

Sums: if $X_1, \ldots, X_n$ are independent and each $X_i$ is symmetric about $b_i$, then $S_n = \sum X_i$ is symmetric about $\sum b_i$.

Parity trick: if $S_n$ takes integer values and is symmetric about a half-integer (e.g., $454.5$), then $P(S_n \leq 454) = P(S_n \geq 455) = 1/2$. This works because no probability mass sits at the center.


Section 3.2: Expectation

Core idea: the expected value is a probability-weighted average, and linearity of expectation is the single most powerful computational tool in this course.

1. Definition

$$E(X) = \sum_{x} x \cdot P(X = x).$$

Interpretation: the balance point (center of gravity) of the distribution histogram.

Key special cases:

$$E(I_A) = P(A) \quad \text{(indicator of event } A\text{)}, \qquad E(c) = c \quad \text{(constant)}.$$

2. Long Run Average

If $X$ represents a repeated measurement, then $E(X)$ approximates the average over many repetitions. (Made rigorous by the Law of Large Numbers in Section 3.3.)

3. Addition Rule (Linearity)

For any random variables $X, Y$ defined on the same setting:

$$E(X + Y) = E(X) + E(Y).$$

This holds regardless of dependence. More generally:

$$E(a_1 X_1 + \cdots + a_n X_n + b) = a_1 E(X_1) + \cdots + a_n E(X_n) + b.$$

4. Method of Indicators

If $X$ counts how many of the events $A_1, \ldots, A_n$ occur, write $X = I_1 + \cdots + I_n$ where $I_j = \mathbf{1}(A_j)$. Then:

$$E(X) = P(A_1) + P(A_2) + \cdots + P(A_n).$$

No information about dependence among the $A_j$ is needed.

Binomial mean: for $S_n \sim \text{Binomial}(n, p)$, write $S_n = I_1 + \cdots + I_n$ with each $E(I_j) = p$:

$$E(S_n) = np.$$

Complement indicator pattern. Sometimes $I_j = 1$ means "at least one item from group $j$ is selected." It is easier to compute $P(I_j = 1) = 1 - P(\text{none from group } j)$. Then:

$$E(X) = \sum_{j=1}^{n} \bigl[1 - P(\text{none from group } j \text{ selected})\bigr].$$

This pattern appeared on Practice Quiz 2 Q3: $n$ families with 3 kids each, $k$ kids selected from the $3n$, and $X$ = number of families with at least one kid selected. Here $P(\text{none from family } j) = \binom{3n-3}{k} / \binom{3n}{k}$, so $E(X) = n\bigl[1 - \binom{3n-3}{k}/\binom{3n}{k}\bigr]$.
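The quiz formula can be confirmed by brute force on a small case (the values $n = 3$, $k = 4$ are assumed for the check; the quiz itself left $n$ and $k$ general):

```python
import itertools
import math

# n families of 3 kids; choose k of the 3n kids uniformly.
# X counts families with at least one selected kid.
n, k = 3, 4
kids = [(fam, i) for fam in range(n) for i in range(3)]
total, acc = 0, 0
for chosen in itertools.combinations(kids, k):
    acc += len({fam for fam, _ in chosen})  # number of families hit
    total += 1
exact = acc / total
formula = n * (1 - math.comb(3 * n - 3, k) / math.comb(3 * n, k))
assert abs(exact - formula) < 1e-9
```

The agreement is exact, as it must be: the formula is just linearity of expectation applied to the family indicators.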

5. Tail Sum Formula

If $X$ takes values in $\{0, 1, \ldots, n\}$:

$$E(X) = \sum_{j=1}^{n} P(X \geq j).$$

This is a special case of the method of indicators with $A_j = (X \geq j)$.

Useful when $P(X \geq j)$ is simpler than $P(X = k)$. Example: $E(\min \text{ of 4 dice}) = \sum_{j=1}^{6} [(7-j)/6]^4$.
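The dice example is small enough to check both ways: tail sum versus brute-force enumeration of all $6^4$ outcomes.

```python
import itertools

# Tail-sum formula for M = min of 4 dice: E(M) = sum_{j=1}^{6} P(M >= j),
# where P(M >= j) = ((7 - j)/6)^4 since all four dice must show >= j.
tail_sum = sum(((7 - j) / 6) ** 4 for j in range(1, 7))

# Brute force over all 6^4 equally likely rolls.
brute = sum(min(roll)
            for roll in itertools.product(range(1, 7), repeat=4)) / 6**4
assert abs(tail_sum - brute) < 1e-9
```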

6. Expectation of a Function

$$E[g(X)] = \sum_{x} g(x) \cdot P(X = x).$$

Trap: in general, $E[g(X)] \neq g(E(X))$. Equality holds only when $g$ is linear (affine). The most common mistake: assuming $E(X^2) = [E(X)]^2$.

$k$th moment:

$$E(X^k) = \sum_{x} x^k \, P(X = x).$$

Two-variable version:

$$E[g(X, Y)] = \sum_{(x,y)} g(x,y) \, P(X = x,\, Y = y).$$

7. Multiplication Rule for Expectation

If $X$ and $Y$ are independent:

$$E(XY) = E(X) \cdot E(Y).$$

This requires independence; contrast with the addition rule, which requires nothing.

Counterexample: if $X = Y$, then $E(XY) = E(X^2) \neq [E(X)]^2 = E(X)E(Y)$ (unless $X$ is constant).

8. Variance of a Product of Independent Variables

If $X$ and $Y$ are independent:

$$\text{Var}(XY) = \text{Var}(X)\,\text{Var}(Y) + \text{Var}(X)\,[E(Y)]^2 + [E(X)]^2\,\text{Var}(Y).$$

Equivalently:

$$\text{Var}(XY) = E(X^2)\,E(Y^2) - [E(X)]^2\,[E(Y)]^2.$$

Derivation. By independence, $E[(XY)^2] = E(X^2)\,E(Y^2)$ and $[E(XY)]^2 = [E(X)]^2[E(Y)]^2$, so $\text{Var}(XY) = E(X^2)E(Y^2) - [E(X)]^2[E(Y)]^2$. Expanding $E(X^2) = \text{Var}(X) + [E(X)]^2$, and similarly for $Y$, gives the first form.

This formula appeared on Practice Quiz 2 Q2. With $\text{Var}(X) = 3$, $E(X) = 1$, $\text{Var}(Y) = 2$, $E(Y) = 0$: $\text{Var}(XY) = (3)(2) + (3)(0) + (1)(2) = 8$.
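The formula can be checked against an exact product distribution. A sketch with two assumed two-point distributions (chosen so the moments are easy to read off):

```python
import itertools

pX = {0: 0.5, 2: 0.5}   # E(X) = 1, Var(X) = 1
pY = {-1: 0.5, 1: 0.5}  # E(Y) = 0, Var(Y) = 1

def moments(p):
    m = sum(x * q for x, q in p.items())
    v = sum(x * x * q for x, q in p.items()) - m * m
    return m, v

mX, vX = moments(pX)
mY, vY = moments(pY)

# Exact distribution of XY under independence.
pXY = {}
for (x, px), (y, py) in itertools.product(pX.items(), pY.items()):
    pXY[x * y] = pXY.get(x * y, 0) + px * py
_, vXY = moments(pXY)

# Var(XY) = Var(X)Var(Y) + Var(X)E(Y)^2 + E(X)^2 Var(Y)
assert abs(vXY - (vX * vY + vX * mY**2 + mX**2 * vY)) < 1e-9
```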

9. Boole’s and Markov’s Inequalities

Boole’s inequality: if $X$ counts how many of $A_1, \ldots, A_n$ occur:

$$P(X \geq 1) = P\!\left(\bigcup_j A_j\right) \leq \sum_j P(A_j) = E(X).$$

Markov’s inequality: for $X \geq 0$ and $a > 0$:

$$P(X \geq a) \leq \frac{E(X)}{a}.$$

The condition $X \geq 0$ is essential.

10. Properties of Expectation (Summary Box, p.181)

| Property | Formula | Condition |
|---|---|---|
| Definition | $E(X) = \sum_x x\,P(X=x)$ | |
| Constants | $E(c) = c$ | |
| Indicators | $E(I_A) = P(A)$ | |
| Functions | $E[g(X)] = \sum_x g(x)\,P(X=x)$ | $\neq g(E(X))$ in general |
| Constant factors | $E(cX) = cE(X)$ | |
| Addition | $E(X+Y) = E(X) + E(Y)$ | Always (dependent or not) |
| Multiplication | $E(XY) = E(X)E(Y)$ | Independent only |

Section 3.3: Standard Deviation and Normal Approximation

Core idea: variance measures spread around the mean, and the square root law governs how sums and averages of independent variables behave.

1. Variance and Standard Deviation

$$\text{Var}(X) = E[(X - \mu)^2], \qquad \text{SD}(X) = \sqrt{\text{Var}(X)}, \qquad \mu = E(X).$$

$\text{SD}(X)$ measures the typical size of the deviation $|X - \mu|$.

$\text{Var}(X) \geq 0$ always, with equality iff $X$ is constant (i.e., $P(X = \mu) = 1$).

2. Computational Formula

$$\text{Var}(X) = E(X^2) - [E(X)]^2.$$

"Mean of the square minus square of the mean." This always gives

$$E(X^2) \geq [E(X)]^2,$$

with equality iff $X$ is constant.

Indicator: $I_A^2 = I_A$, so $E(I_A^2) = p$, and thus $\text{Var}(I_A) = p - p^2 = p(1-p)$, $\text{SD}(I_A) = \sqrt{p(1-p)}$.

Fair die: $E(X) = 3.5$, $E(X^2) = 91/6$, $\text{Var}(X) = 35/12$, $\text{SD}(X) \approx 1.71$.
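The fair-die numbers are quick to verify with the computational formula:

```python
# Fair die: check E(X) = 3.5, E(X^2) = 91/6, Var(X) = 35/12.
faces = range(1, 7)
m = sum(faces) / 6                    # mean
m2 = sum(x * x for x in faces) / 6    # second moment
var = m2 - m * m                      # "mean of square minus square of mean"
assert m == 3.5
assert abs(m2 - 91 / 6) < 1e-9
assert abs(var - 35 / 12) < 1e-9
```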

3. Scaling and Shifting

$$E(aX + b) = aE(X) + b, \qquad \text{SD}(aX + b) = |a| \cdot \text{SD}(X).$$

Adding a constant shifts the mean but does not change the SD. Multiplying by $a$ scales the SD by $|a|$.

4. Standardization

For $\mu = E(X)$ and $\sigma = \text{SD}(X) > 0$:

$$X^* = \frac{X - \mu}{\sigma}, \qquad E(X^*) = 0, \qquad \text{SD}(X^*) = 1.$$

$X^*$ measures how many standard deviations $X$ is from its mean.

5. Chebyshev’s Inequality

For any random variable $X$ with $E(X) = \mu$ and $\text{SD}(X) = \sigma$:

$$P\!\left(|X - \mu| \geq k\sigma\right) \leq \frac{1}{k^2}.$$

Proof: apply Markov’s inequality to $(X - \mu)^2$ with threshold $k^2\sigma^2$.

This bound is crude but universal (it works for any distribution).

One-sided application template. To bound $P(X \geq a)$ where $a > \mu$:

$$P(X \geq a) \leq P(|X - \mu| \geq a - \mu) \leq \frac{\sigma^2}{(a - \mu)^2}.$$

Set $k = (a - \mu)/\sigma$ and apply the $1/k^2$ bound.

Inverse problem template (Practice Quiz 2 Q1b). Given a target bound $1/k^2 = c$, solve $k = 1/\sqrt{c}$. Then $a - \mu = k\sigma$ gives $\mu = a - k\sigma$.

| Deviation | Chebyshev bound | Normal actual |
|---|---|---|
| $\geq 1\sigma$ | $\leq 1$ (trivial) | 0.3173 |
| $\geq 2\sigma$ | $\leq 1/4$ | 0.0455 |
| $\geq 3\sigma$ | $\leq 1/9$ | 0.0027 |
| $\geq 4\sigma$ | $\leq 1/16$ | 0.00006 |
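The "crude but universal" comparison in the table can be reproduced with the standard normal CDF, which is expressible through the error function `math.erf`:

```python
import math

def normal_two_sided_tail(k):
    # P(|Z| >= k) for standard normal Z, via Phi(z) = (1 + erf(z/sqrt(2)))/2.
    return 2 * (1 - 0.5 * (1 + math.erf(k / math.sqrt(2))))

for k in (2, 3, 4):
    cheb = 1 / k**2
    assert normal_two_sided_tail(k) <= cheb  # Chebyshev is valid but loose

assert abs(normal_two_sided_tail(2) - 0.0455) < 5e-4
```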

6. Variance Addition Rule

If $X$ and $Y$ are independent:

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y).$$

For $n$ independent variables:

$$\text{Var}(X_1 + \cdots + X_n) = \text{Var}(X_1) + \cdots + \text{Var}(X_n).$$

This requires independence. The cross term $2E[(X - E(X))(Y - E(Y))]$ vanishes under independence but not in general. If $X = Y$: $\text{Var}(2X) = 4\,\text{Var}(X) \neq 2\,\text{Var}(X)$.

7. Square Root Law

For i.i.d. $X_1, \ldots, X_n$ with $E(X_i) = \mu$ and $\text{SD}(X_i) = \sigma$, let $S_n = \sum X_i$ and $\bar{X}_n = S_n / n$:

$$E(S_n) = n\mu, \qquad \text{SD}(S_n) = \sigma\sqrt{n},$$

$$E(\bar{X}_n) = \mu, \qquad \text{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}.$$

The sum’s SD grows as $\sqrt{n}$ (not $n$) because positive and negative deviations partially cancel. The average’s SD shrinks as $1/\sqrt{n}$.

Binomial SD: for $S_n \sim \text{Binomial}(n, p)$, each indicator has $\text{SD} = \sqrt{pq}$, so $\text{SD}(S_n) = \sqrt{npq}$.

8. Law of Averages (Weak Law of Large Numbers)

For i.i.d. $X_1, X_2, \ldots$ with $E(X_i) = \mu$ and finite variance:

$$P(|\bar{X}_n - \mu| < \epsilon) \to 1 \quad \text{as } n \to \infty, \quad \text{for every } \epsilon > 0.$$

Proof via Chebyshev: $P(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2 / (n\epsilon^2) \to 0$.

9. Central Limit Theorem (Normal Approximation)

For i.i.d. $X_i$ with $E(X_i) = \mu$, $\text{SD}(X_i) = \sigma$, and $n$ large:

$$P\!\left(a \leq \frac{S_n - n\mu}{\sigma\sqrt{n}} \leq b\right) \approx \Phi(b) - \Phi(a).$$

Equivalently, $S_n$ is approximately $\text{Normal}(n\mu,\; n\sigma^2)$.

Procedure: (1) compute $\mu$ and $\sigma$ of one $X_i$; (2) get $E(S_n) = n\mu$ and $\text{SD}(S_n) = \sigma\sqrt{n}$; (3) standardize; (4) use the $\Phi$ table.

Continuity correction: if the $X_i$ take consecutive integer values, replace $b$ with $b + 1/2$ and $a$ with $a - 1/2$ for better accuracy.
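The four-step procedure, with the continuity correction, can be sketched end to end. The example numbers (100 fair-coin tosses, $P(45 \leq S_n \leq 55)$) are assumed for illustration:

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binom_normal_approx(n, p, a, b):
    # P(a <= S_n <= b) with continuity correction, S_n ~ Binomial(n, p).
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    return phi((b + 0.5 - mu) / sd) - phi((a - 0.5 - mu) / sd)

approx = binom_normal_approx(100, 0.5, 45, 55)
exact = sum(math.comb(100, k) * 0.5**100 for k in range(45, 56))
assert abs(approx - exact) < 0.01  # approximation is close for n = 100
```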


Section 3.4: Discrete Distributions

Core idea: everything from Sections 3.1 through 3.3 extends to infinite ranges by replacing finite sums with convergent series. The geometric distribution is the central new object.

1. Discrete Distribution on Infinite Sets

A distribution on $\{0, 1, 2, \ldots\}$ is a sequence $p_0, p_1, p_2, \ldots$ with

$$p_i \geq 0 \quad \text{for all } i, \qquad \sum_{i=0}^{\infty} p_i = 1.$$

2. Infinite Sum Rule

If $A_1, A_2, \ldots$ are mutually exclusive with $A = \bigcup A_i$:

$$P(A) = \sum_{i=1}^{\infty} P(A_i).$$

All finite rules (conditional probability, independence, Bayes, expectation, variance) extend to infinite ranges by replacing finite sums with convergent series.

3. Geometric Distribution

Definition. Waiting time $T$ to the first success in Bernoulli$(p)$ trials:

$$P(T = k) = (1-p)^{k-1}\,p, \qquad k = 1, 2, 3, \ldots$$

Sums to 1: $\sum_{k=1}^{\infty} q^{k-1}p = p/(1-q) = 1$.

Tail probability: $P(T > n) = (1-p)^n$.

Moments:

$$E(T) = \frac{1}{p}, \qquad \text{Var}(T) = \frac{1-p}{p^2}, \qquad \text{SD}(T) = \frac{\sqrt{1-p}}{p}.$$

Derivation of $E(T)$ via shift and subtract. Let $\Sigma_1 = \sum_{n=1}^{\infty} n\,q^{n-1} = 1 + 2q + 3q^2 + \cdots$. Then:

$$q\,\Sigma_1 = q + 2q^2 + 3q^3 + \cdots$$

$$(1-q)\,\Sigma_1 = 1 + q + q^2 + \cdots = \frac{1}{1-q} = \frac{1}{p}$$

$$\Sigma_1 = \frac{1}{p^2}, \qquad E(T) = p \cdot \Sigma_1 = \frac{1}{p}.$$
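The moment formulas can be confirmed numerically by truncating the series (the truncation at 2000 terms is an assumption; the tail $q^{2000}$ is negligible for the $p = 0.3$ used here):

```python
# Numerical check of E(T) = 1/p and Var(T) = q/p^2 for T ~ Geometric(p)
# on {1, 2, ...}, via truncated series.
p = 0.3
q = 1 - p
mean = sum(k * q**(k - 1) * p for k in range(1, 2000))
var = sum(k * k * q**(k - 1) * p for k in range(1, 2000)) - mean**2
assert abs(mean - 1 / p) < 1e-9
assert abs(var - q / p**2) < 1e-9
```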

4. The Craps Principle

Condition: each round is independent, with the same probabilities $a, b, d$ throughout.

Repeated game with outcomes: A wins (prob $a$), B wins (prob $b$), draw (prob $d = 1 - a - b$). The overall winner is determined by the first decisive game:

$$P(\text{A wins overall}) = \frac{a}{a + b}.$$

The total number of games $G \sim \text{Geometric}(1 - d)$, and $G$ is independent of who wins.

5. Negative Binomial Distribution

Waiting time to the $r$th success: $T_r$ on $\{r, r+1, \ldots\}$:

$$P(T_r = t) = \binom{t-1}{r-1}\,p^r\,(1-p)^{t-r}.$$

Key decomposition: $T_r = W_1 + W_2 + \cdots + W_r$ where the $W_i$ are independent Geometric$(p)$ on $\{1, 2, \ldots\}$:

$$E(T_r) = \frac{r}{p}, \qquad \text{SD}(T_r) = \frac{\sqrt{r(1-p)}}{p}.$$

6. The Collector’s Problem

Collecting all $n$ types, where each draw is uniform over the types:

$$E(\text{total draws}) = n\!\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right) = n\,H_n.$$

Decompose into geometric waiting times: after collecting $k$ types, the next new type requires a Geometric$((n-k)/n)$ number of draws, with mean $n/(n-k)$.
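The decomposition can be verified directly: summing the stage-by-stage geometric means reproduces $n H_n$ (the case $n = 10$ is assumed for the check):

```python
# Coupon collector: sum of geometric means n/(n-k) equals n * H_n.
n = 10
by_stages = sum(n / (n - k) for k in range(n))      # k types already collected
harmonic = n * sum(1 / j for j in range(1, n + 1))  # n * H_n
assert abs(by_stages - harmonic) < 1e-9
```

This is just the same sum read in two orders: $\sum_{k=0}^{n-1} n/(n-k) = n \sum_{j=1}^{n} 1/j$.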


Section 3.5: The Poisson Distribution

Core idea: the Poisson distribution arises as the limit of the binomial when $n$ is large and $p$ is small with $np = \mu$ held fixed. It also arises exactly from random scatter models.

1. Definition

$$P(N = k) = e^{-\mu}\,\frac{\mu^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

Parameter $\mu > 0$.

2. Mean and Variance

$$E(N) = \mu, \qquad \text{Var}(N) = \mu, \qquad \text{SD}(N) = \sqrt{\mu}.$$

Mean derivation:

$$E(N) = \sum_{k=1}^{\infty} k \cdot e^{-\mu}\frac{\mu^k}{k!} = e^{-\mu}\,\mu \sum_{k=1}^{\infty} \frac{\mu^{k-1}}{(k-1)!} = e^{-\mu}\,\mu\,e^{\mu} = \mu.$$

Variance derivation via factorial moment:

$$E[N(N-1)] = \sum_{k=2}^{\infty} k(k-1)\,e^{-\mu}\frac{\mu^k}{k!} = e^{-\mu}\,\mu^2 \sum_{k=2}^{\infty} \frac{\mu^{k-2}}{(k-2)!} = \mu^2.$$

$$E(N^2) = E[N(N-1)] + E(N) = \mu^2 + \mu.$$

$$\text{Var}(N) = \mu^2 + \mu - \mu^2 = \mu.$$

Trap: $E[N(N-1)] = \mu^2$, but $E(N^2) = \mu^2 + \mu$. Do not confuse these.
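Both factorial-moment identities can be checked numerically. A sketch, with $\mu = 4$ assumed and the pmf built recursively ($p_k = p_{k-1}\,\mu/k$) to avoid overflowing factorials:

```python
import math

mu, K = 4.0, 200  # truncation at K = 200 terms; the tail is negligible
pmf = [math.exp(-mu)]
for k in range(1, K):
    pmf.append(pmf[-1] * mu / k)  # p_k = p_{k-1} * mu / k

fact_moment = sum(k * (k - 1) * p for k, p in enumerate(pmf))
second_moment = sum(k * k * p for k, p in enumerate(pmf))
assert abs(fact_moment - mu**2) < 1e-9          # E[N(N-1)] = mu^2
assert abs(second_moment - (mu**2 + mu)) < 1e-9  # E(N^2) = mu^2 + mu
```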

3. Sum of Independent Poissons

If $N_1, \ldots, N_j$ are independent with $N_i \sim \text{Poisson}(\mu_i)$:

$$N_1 + \cdots + N_j \sim \text{Poisson}(\mu_1 + \cdots + \mu_j).$$

Independent Poisson variables sum to Poisson; the parameters add.

Independence is required. Counts on overlapping regions of a Poisson scatter are not independent.

4. Poisson Scaling Template

When a Poisson process has rate $\lambda$ per unit of time (or space), the count over duration $t$ (or area $A$) is:

$$N \sim \text{Poisson}(\lambda t) \quad \text{or} \quad N \sim \text{Poisson}(\lambda A).$$

Practice Quiz 2 Q5: 2 moths per 12-hour night, so the rate is 2 per night. Over 3 nights: $\mu = 2 \times 3 = 6$. Then $P(N > 7) = 1 - \sum_{k=0}^{7} e^{-6}\,6^k/k!$.
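The moth calculation evaluates to a concrete number:

```python
import math

# P(N > 7) for N ~ Poisson(6), as in the moth example.
mu = 6.0
p_le_7 = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(8))
p_gt_7 = 1 - p_le_7
assert 0.25 < p_gt_7 < 0.26  # about 0.256
```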

5. “Approximately” vs. “Exactly” Poisson

Binomial approximation: write "approximately Poisson$(\mu)$." Poisson scatter/process model: write "has Poisson$(\mu)$ distribution." This wording distinction affects grading.

6. Poisson Random Scatter

Two assumptions: (1) no multiple hits at the same point; (2) for each partition of the unit square into $n$ equal subsquares, each subsquare is hit independently with the same probability.

Theorem. Under these assumptions, there exists an intensity $\lambda > 0$ such that:

(i) For any region $B$: $N(B) \sim \text{Poisson}(\lambda \cdot \text{area}(B))$.

(ii) For disjoint regions $B_1, \ldots, B_j$: $N(B_1), \ldots, N(B_j)$ are mutually independent.

7. Thinning

Start with a Poisson scatter of intensity $\lambda$ and keep each point independently with probability $p$.

Kept points: a Poisson scatter of intensity $\lambda p$. Removed points: a Poisson scatter of intensity $\lambda(1 - p)$.

The kept and removed scatters are independent.

Superposition (the reverse of thinning): two independent Poisson scatters with intensities $\alpha$ and $\beta$ combine to give a Poisson scatter with intensity $\alpha + \beta$.

8. Small μ\mu Approximations

When $\mu$ is very small:

$$P(N = 0) \approx 1 - \mu, \qquad P(N = 1) \approx \mu, \qquad P(N \geq 2) \approx 0.$$

9. Normal Approximation for Poisson

For large $\mu$:

$$\frac{N - \mu}{\sqrt{\mu}} \;\xrightarrow{d}\; \text{standard normal}.$$

Convergence is slower than for the binomial because the Poisson skewness $1/\sqrt{\mu}$ decays slowly.


Section 3.6: Symmetry (Exchangeability)

Core idea: exchangeability lets you replace "the $k$th draw" with "the first draw" in any calculation, massively simplifying problems involving sampling without replacement.

1. Exchangeability

$(X, Y)$ has a symmetric joint distribution iff $P(x, y) = P(y, x)$ for all $(x, y)$; equivalently, $(X, Y)$ and $(Y, X)$ have the same joint distribution. We then say $X$ and $Y$ are exchangeable.

For $n$ variables: $X_1, \ldots, X_n$ are exchangeable iff

$$P(x_1, \ldots, x_n) = P(x_{\pi(1)}, \ldots, x_{\pi(n)})$$

for every permutation $\pi$.

Consequences: all $X_i$ share the same marginal distribution, and every subset of size $m$ has the same joint distribution.

Exchangeable does NOT imply independent. Sampling without replacement is exchangeable but not independent.

2. Independent Trials Are Exchangeable

If $X_1, \ldots, X_n$ are i.i.d., then $P(x_1, \ldots, x_n) = \prod p(x_i)$, which is symmetric because multiplication is commutative.

3. Sampling Without Replacement Is Exchangeable

Theorem. Drawing $n$ items without replacement from $\{b_1, \ldots, b_N\}$ produces an exchangeable sequence $X_1, \ldots, X_n$.

In particular, any subset $X_{i_1}, \ldots, X_{i_m}$ has the same joint distribution as the first $m$ draws $X_1, \ldots, X_m$.

Application pattern: to compute $P(\text{event about the } k\text{th draw})$, replace it with the equivalent event about the first draw.

Example (Pitman 3.6.1). 52 cards, deal 5.

$P(\text{5th card is a King}) = P(\text{1st card is a King}) = 4/52$.

$P(\text{3rd and 5th are black}) = P(\text{1st and 2nd are black}) = (26/52)(25/51)$.

Example (Pitman 3.6.2). 100 balls (50 red, 50 black), draw 20.

$$P(X_{10} = r \mid X_{18} = r,\, X_{19} = r) = P(X_3 = r \mid X_1 = r,\, X_2 = r) = \frac{48}{98}.$$

On exams, write "by symmetry of sampling without replacement" as a one-line justification, then proceed with the simpler first/second draw calculation.
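Exchangeability of draws can be seen by brute force on a small deck. A sketch with an assumed six-card "deck" containing two kings: every position is equally likely to hold a king.

```python
import itertools

# Enumerate all orderings of a 6-card deck with 2 kings; by exchangeability,
# P(position j holds a king) = 2/6 for every j.
deck = ['K', 'K', 'x', 'x', 'x', 'x']
perms = list(itertools.permutations(deck))
for pos in range(6):
    p = sum(perm[pos] == 'K' for perm in perms) / len(perms)
    assert abs(p - 2 / 6) < 1e-9
```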

4. Hypergeometric Mean and Variance

Population of size $N$ with $G$ good and $B = N - G$ bad. Sample $n$ without replacement; $S_n$ = number of good in the sample. Let $p = G/N$, $q = 1 - p$.

Mean via indicators + symmetry:

$$S_n = I_1 + \cdots + I_n, \qquad I_j = \mathbf{1}(\text{draw } j \text{ is good}).$$

By exchangeability, each $I_j$ has the Bernoulli$(G/N)$ distribution:

$$E(S_n) = n \cdot \frac{G}{N} = np.$$

This is the same as the binomial mean.

Variance derivation. Expand $E(S_n^2) = E[(\sum_j I_j)^2]$:

$$E(S_n^2) = \sum_j E(I_j^2) + 2\sum_{j < k} E(I_j I_k).$$

Since $I_j^2 = I_j$: $\sum_j E(I_j^2) = n \cdot G/N$.

By exchangeability, every pair $(I_j, I_k)$ has the same joint distribution as $(I_1, I_2)$:

$$E(I_j I_k) = P(\text{1st good}) \cdot P(\text{2nd good} \mid \text{1st good}) = \frac{G}{N} \cdot \frac{G-1}{N-1}.$$

There are $\binom{n}{2}$ such pairs, so:

$$E(S_n^2) = n\frac{G}{N} + n(n-1)\frac{G}{N}\cdot\frac{G-1}{N-1}.$$

After algebra, $\text{Var}(S_n) = E(S_n^2) - (np)^2$ simplifies to

$$\text{Var}(S_n) = npq \cdot \frac{N - n}{N - 1}, \qquad \text{SD}(S_n) = \sqrt{npq} \cdot \sqrt{\frac{N-n}{N-1}}.$$
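Both formulas can be checked against the hypergeometric pmf directly (the case $N = 20$, $G = 8$, $n = 5$ is assumed for the check):

```python
import math

# Check E(S_n) = np and Var(S_n) = npq (N-n)/(N-1) against the pmf.
N, G, n = 20, 8, 5
p, q = G / N, 1 - G / N
pmf = {k: math.comb(G, k) * math.comb(N - G, n - k) / math.comb(N, n)
       for k in range(max(0, n - (N - G)), min(n, G) + 1)}
mean = sum(k * pk for k, pk in pmf.items())
var = sum(k * k * pk for k, pk in pmf.items()) - mean**2
assert abs(mean - n * p) < 1e-9
assert abs(var - n * p * q * (N - n) / (N - 1)) < 1e-9
```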

5. Finite Population Correction Factor

$$\sqrt{\frac{N - n}{N - 1}}.$$

This factor multiplies the binomial SD $\sqrt{npq}$ to give the hypergeometric SD.

Key properties:

| Condition | Factor value | Meaning |
|---|---|---|
| $N \gg n$ | $\approx 1$ | Nearly identical to binomial |
| $n = N$ | $= 0$ | No randomness (entire population sampled) |
| Always | $\leq 1$ | Without replacement has less variability |

Common mistake: the denominator is $N - 1$, not $N$.

6. Normal Approximation for Hypergeometric

$$S_n \approx \text{Normal}\!\left(np,\; npq \cdot \frac{N-n}{N-1}\right)$$

when $\text{SD}(S_n)$ is sufficiently large. Same procedure as the binomial normal approximation, but use the corrected SD.


Appendix 2: Series Identities

All geometric series must be simplified to closed form. Leaving unsimplified series loses points.

Geometric series ($|R| < 1$ for the infinite sum):

$$\sum_{k=0}^{\infty} R^k = \frac{1}{1 - R}, \qquad \sum_{k=0}^{n} R^k = \frac{1 - R^{n+1}}{1 - R}.$$

First derivative (shift and subtract):

$$\sum_{k=1}^{\infty} k\,R^{k-1} = \frac{1}{(1-R)^2}.$$

Second derivative:

$$\sum_{k=2}^{\infty} k(k-1)\,R^{k-2} = \frac{2}{(1-R)^3}.$$

Exponential series:

$$\sum_{k=0}^{\infty} \frac{\mu^k}{k!} = e^{\mu}.$$

Arithmetic series:

$$1 + 2 + \cdots + n = \frac{n(n+1)}{2}.$$

General properties of sums (from Pitman Appendix 2):

$$\sum_i c\,x_i = c \sum_i x_i, \qquad \sum_i (x_i + y_i) = \sum_i x_i + \sum_i y_i.$$

If $x_i \leq y_i$ for every $i$, then $\sum_i x_i \leq \sum_i y_i$.
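The identities above can be spot-checked numerically (the value $R = 0.4$ and the truncation points are assumed; the omitted tails are negligible at these settings):

```python
import math

R, K = 0.4, 400  # truncated at K terms
assert abs(sum(R**k for k in range(K)) - 1 / (1 - R)) < 1e-9
assert abs(sum(k * R**(k - 1) for k in range(1, K))
           - 1 / (1 - R)**2) < 1e-9
assert abs(sum(k * (k - 1) * R**(k - 2) for k in range(2, K))
           - 2 / (1 - R)**3) < 1e-9
assert abs(sum(2.0**k / math.factorial(k) for k in range(40))
           - math.exp(2)) < 1e-9
```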


Exam Templates

Template 1: Chebyshev One Sided Bound

Given: $E(X) = \mu$, $\text{SD}(X) = \sigma$. Find: an upper bound on $P(X \geq a)$ where $a > \mu$.

$$P(X \geq a) \leq P(|X - \mu| \geq a - \mu) \leq \frac{\sigma^2}{(a - \mu)^2}.$$

Inverse: given a target bound $c$, find $\mu$ such that the bound equals $c$:

$$\frac{\sigma^2}{(a - \mu)^2} = c \implies a - \mu = \frac{\sigma}{\sqrt{c}} \implies \mu = a - \frac{\sigma}{\sqrt{c}}.$$

Template 2: Var(XY) for Independent Variables

$$\text{Var}(XY) = \text{Var}(X)\,\text{Var}(Y) + \text{Var}(X)\,[E(Y)]^2 + [E(X)]^2\,\text{Var}(Y).$$

If $E(Y) = 0$, this simplifies to $\text{Var}(XY) = \text{Var}(X)\,\text{Var}(Y) + [E(X)]^2\,\text{Var}(Y)$.

Template 3: Indicator Complement Method

To find $E(\text{number of groups with at least one member selected})$:

$$E(X) = n \cdot \bigl[1 - P(\text{no member from a single group is selected})\bigr].$$

Use hypergeometric or combinatorial counting for the “none selected” probability.

Template 4: Geometric Waiting Time

Condition: trials must be independent with the same success probability $p$ on each trial.

For repeated independent trials until a desired outcome with probability $p$:

$$E(\text{number of trials}) = \frac{1}{p}.$$

First find $p$ (often by counting), then apply the formula.

Template 5: Poisson Scaling

Condition: arrivals must be independent of one another, and conditions must be uniform across the time/space interval (the Poisson process assumption).

Rate $\lambda$ per unit $\times$ duration/area $t$ gives parameter $\mu = \lambda t$.

$$P(N > c) = 1 - \sum_{k=0}^{c} e^{-\mu}\frac{\mu^k}{k!}.$$

Exam Trap Checklist

Expectation traps:

  1. $E[g(X)] \neq g(E(X))$ unless $g$ is linear. Most common: $E(X^2) \neq [E(X)]^2$.
  2. The addition rule $E(X + Y) = E(X) + E(Y)$ requires NO independence.
  3. The multiplication rule $E(XY) = E(X)E(Y)$ requires independence.
  4. Markov’s inequality requires $X \geq 0$.

Variance traps:

  1. $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ requires independence. Do not use it without checking.
  2. $\text{SD}(aX + b) = |a|\,\text{SD}(X)$: the absolute value on $a$ matters; shifts do not affect the SD.
  3. $\text{Var}$ has squared units; $\text{SD}$ has the original units.
  4. The sum’s SD grows as $\sqrt{n}$; the average’s SD shrinks as $1/\sqrt{n}$. The directions are opposite.

Distribution traps:

  1. Geometric: check whether the support starts at 1 (waiting time) or 0 (failure count). The mean is $1/p$ vs. $(1-p)/p$.
  2. Negative binomial: $T_r \geq r$. If $t < r$, the probability is 0.
  3. Poisson: "approximately Poisson" (binomial approximation) vs. "has a Poisson distribution" (scatter model).
  4. Poisson: $E[N(N-1)] = \mu^2$ but $E(N^2) = \mu^2 + \mu$.
  5. A sum of independent Poissons is Poisson (parameters add). Independence is required.

Hypergeometric traps:

  1. Finite population correction: the denominator is $N - 1$, not $N$.
  2. The mean is the same as the binomial’s ($np$); only the SD differs.
  3. "By symmetry" is a valid one-line justification for replacing draw $k$ with draw 1.

Series traps:

  1. Always simplify geometric series to closed form. Leaving $\sum q^k$ unsimplified loses points.
  2. Same distribution $\neq$ equality. The joint distribution is needed for $P(X + Y = z)$; marginals alone are insufficient.

References

[1] J. Pitman, “Ch. 3: Random Variables,” in Probability, corrected 6th printing. New York, NY: Springer, 1997, pp. 139–257.

[2] J. Pitman, “Distribution Summaries,” in Probability, corrected 6th printing. New York, NY: Springer, 1997, pp. 475–487.

[3] J. Pitman, “Appendix 2: Sums,” in Probability, corrected 6th printing. New York, NY: Springer, 1997, pp. 515–516.