Random Variables, Expectation, Variance, Normal Approximation, Discrete Distributions, Poisson, Symmetry
Distribution Reference (Chapter 3 Scope)
| Name | Range | P(X = k) | Mean | Variance |
|---|---|---|---|---|
| Bernoulli(p) | {0, 1} | P(1) = p; P(0) = 1 − p | p | p(1 − p) |
| Binomial(n, p) | {0, 1, …, n} | C(n, k) p^k (1 − p)^(n−k) | np | np(1 − p) |
| Hypergeometric(n, N, G) | {0, …, n} | C(G, k) C(N − G, n − k) / C(N, n) | nG/N | n(G/N)(1 − G/N)(N − n)/(N − 1) |
| Geometric(p) | {1, 2, 3, …} | (1 − p)^(k−1) p | 1/p | (1 − p)/p² |
| Geometric(p) | {0, 1, 2, …} | (1 − p)^k p | (1 − p)/p | (1 − p)/p² |
| Neg. Binomial(r, p) (failures F_r) | {0, 1, 2, …} | C(k + r − 1, r − 1) p^r (1 − p)^k | r(1 − p)/p | r(1 − p)/p² |
| Poisson(μ) | {0, 1, 2, …} | e^(−μ) μ^k / k! | μ | μ |
| Uniform on {a, …, b} | {a, a+1, …, b} | 1/(b − a + 1) | (a + b)/2 | [(b − a + 1)² − 1]/12 |

Here C(n, k) denotes the binomial coefficient "n choose k".
The Geometric distribution has two conventions. Pitman primarily uses {1,2,…} (waiting time to first success). The {0,1,2,…} version counts failures before first success. Always check the problem statement.
The Negative Binomial(r, p) table above counts failures F_r before the rth success. The waiting time T_r = F_r + r lives on {r, r+1, …} with E(T_r) = r/p and SD(T_r) = √(r(1 − p))/p.
Convention used throughout these notes: q = 1 − p wherever p is a success probability.
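The mean and variance columns of the table can be checked by direct summation of the pmf. This is a standalone sketch using only the standard library; the parameter values are arbitrary choices, and the infinite Poisson range is truncated far into the tail.

```python
from math import comb, exp, factorial, isclose

def moments(pmf):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

# Binomial(10, 0.3): mean np, variance np(1 - p)
n, p = 10, 0.3
binom = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
m, v = moments(binom)
assert isclose(m, n * p) and isclose(v, n * p * (1 - p))

# Poisson(4): mean = variance = mu (range truncated at 150; the tail is negligible)
mu = 4.0
pois = {k: exp(-mu) * mu**k / factorial(k) for k in range(150)}
m, v = moments(pois)
assert isclose(m, mu) and isclose(v, mu)
```

The same `moments` helper works for any row of the table once its pmf is written as a dictionary.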
Section 3.1: Introduction to Random Variables
Core idea: a random variable is a numerical summary of a random outcome, and its distribution tells you the probability of each possible value.
1. Random Variable and Its Distribution
A random variable X is denoted by a capital letter. A lowercase letter x is a dummy variable representing a specific possible value.
Range: the set of all possible values of X. Example: the number of heads in 4 coin tosses has range {0,1,2,3,4}.
Distribution of X is the collection of probabilities
P(X=x),x∈range of X,
satisfying two conditions:
P(X = x) ≥ 0 (nonnegativity), ∑_x P(X = x) = 1 (sums to 1).
For any subset B of the range:
P(X ∈ B) = ∑_{x ∈ B} P(X = x).
This follows because the events (X=x) for distinct x are mutually exclusive and exhaustive.
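The four-toss example above can be worked out exactly by enumerating the 16 equally likely outcomes. A standalone sketch, using exact fractions:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 2^4 equally likely toss sequences; X = number of heads.
outcomes = list(product("HT", repeat=4))
dist = {}
for w in outcomes:
    x = w.count("H")
    dist[x] = dist.get(x, Fraction(0)) + Fraction(1, len(outcomes))

assert dist == {0: Fraction(1, 16), 1: Fraction(4, 16), 2: Fraction(6, 16),
                3: Fraction(4, 16), 4: Fraction(1, 16)}

# P(X in B) = sum of P(X=x) over x in B, since the events (X=x) are disjoint
B = {0, 4}
assert sum(dist[x] for x in B) == Fraction(1, 8)
```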
2. Functions of a Random Variable
If X=g(W), then the distribution of X is derived from W:
P(X = x) = P(g(W) = x) = ∑_{w: g(w) = x} P(W = w).
The function g may send multiple values of w to the same x, so all such probabilities must be summed.
Transforming events. Convert the event about g(X) back to a condition on X:
P(2X ≤ 5) = P(X ≤ 5/2), P(X² ≤ 5) = P(−√5 ≤ X ≤ √5).
Watch for: sign flips with negative multipliers, splitting into two branches when removing squares or absolute values.
3. Joint Distribution
For two random variables X,Y defined on the same setting:
P(x, y) = P(X = x, Y = y), P(x, y) ≥ 0, ∑_{all (x,y)} P(x, y) = 1.
Marginal distributions are obtained by summing out the other variable:
P(X = x) = ∑_y P(x, y), P(Y = y) = ∑_x P(x, y).
Event probabilities from joint:
P(X < Y) = ∑_{(x,y): x < y} P(x, y), P(X + Y = z) = ∑_x P(x, z − x).
You cannot determine the distribution of X+Y from the marginal distributions alone. The joint distribution is required. (Pitman Example 3 shows that the same marginals can produce different distributions of X+Y under different dependence structures.)
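The claim that marginals alone do not pin down X + Y can be demonstrated concretely. A standalone sketch (the two joint tables below are my own illustrative choice, in the spirit of Pitman's Example 3): both have uniform marginals on {0, 1}, yet their sums are distributed differently.

```python
from fractions import Fraction as F

# Two joint distributions on {0,1} x {0,1}
indep = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)}
coupled = {(0, 0): F(1, 2), (0, 1): F(0), (1, 0): F(0), (1, 1): F(1, 2)}

def marginal_x(joint):
    out = {}
    for (x, y), p in joint.items():
        out[x] = out.get(x, F(0)) + p
    return out

def dist_sum(joint):
    """Distribution of X + Y: sum P(x, z - x) over x for each total z."""
    out = {}
    for (x, y), p in joint.items():
        out[x + y] = out.get(x + y, F(0)) + p
    return out

assert marginal_x(indep) == marginal_x(coupled)   # identical marginals
assert dist_sum(indep) != dist_sum(coupled)       # different X + Y
```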
4. Same Distribution vs. Equality
Same distribution: P(X = v) = P(Y = v) for all v in the common range.
Equality: P(X = Y) = 1.
Equality implies same distribution, but the converse is false. Classic example: sampling {1,2,3} without replacement, the first draw X and second draw Y have the same uniform distribution, but P(X=Y)=0.
Change of variable principle: if X and Y have the same distribution, then g(X) and g(Y) have the same distribution for any function g.
5. Conditional Distribution
Given event A with P(A)>0:
P(Y = y ∣ A) = P(Y = y and A) / P(A).
Given X=x:
P(Y = y ∣ X = x) = P(X = x, Y = y) / P(X = x).
This is the column of the joint table at X=x, renormalized by the column sum.
Multiplication rule:
P(X=x,Y=y)=P(X=x)⋅P(Y=y∣X=x).
6. Independence
X and Y are independent iff
P(X=x,Y=y)=P(X=x)⋅P(Y=y)for all x,y.
Equivalent: the conditional distribution of Y given X=x does not depend on x (and vice versa).
To check from a joint table: verify the product rule for every cell. One failure means dependent.
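The cell-by-cell check can be automated. A standalone sketch: compute both marginals from the joint table, then verify the product rule for every cell (one failure means dependent).

```python
def is_independent(joint, tol=1e-12):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every cell of a joint table."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return all(abs(p - px[x] * py[y]) < tol for (x, y), p in joint.items())

indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
dep = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
assert is_independent(indep)
assert not is_independent(dep)       # fails already at cell (0, 0)
```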
n variables:X1,…,Xn are independent iff
P(X_1 = x_1, …, X_n = x_n) = ∏_{i=1}^n P(X_i = x_i) for all (x_1, …, x_n).
Three consequences of independence:
Functions of independent variables are independent: if Xj are independent, so are fj(Xj).
Disjoint blocks are independent: if X1,…,X6 are independent, then (X1,X2) and (X3,X4,X5) are independent.
Combine 1 and 2: functions of disjoint blocks are independent.
7. Multinomial Distribution
Generalization of the binomial. With m categories, probability p_i for category i (∑ p_i = 1), and n independent trials, let N_i count the trials landing in category i. Then for n_1 + ⋯ + n_m = n:
P(N_1 = n_1, …, N_m = n_m) = [n! / (n_1! ⋯ n_m!)] p_1^(n_1) ⋯ p_m^(n_m).
The factor p_1^(n_1) ⋯ p_m^(n_m) is the probability of one specific ordering; n!/(n_1! ⋯ n_m!) counts the number of such orderings.
Since N1+⋯+Nm=n is a constraint, the Ni are not independent.
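A minimal sketch of the multinomial pmf (parameters are arbitrary choices), checking that the probabilities over all valid count vectors sum to 1:

```python
from math import factorial, isclose

def multinomial_pmf(counts, probs):
    """P(N_1=n_1, ..., N_m=n_m) = [n!/(n_1!...n_m!)] p_1^n_1 ... p_m^n_m."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)       # stays exact: each step divides evenly
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

n, probs = 5, (0.5, 0.3, 0.2)
# Sum over every (a, b, n-a-b) with a + b <= n: must equal 1
total = sum(multinomial_pmf((a, b, n - a - b), probs)
            for a in range(n + 1) for b in range(n + 1 - a))
assert isclose(total, 1.0)
```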
8. Symmetry of Distributions
X is symmetric about 0 iff P(X=−x)=P(X=x) for all x, equivalently −X has the same distribution as X. Consequence: P(X≥a)=P(X≤−a).
X is symmetric about b iff X−b is symmetric about 0.
Sums: if X1,…,Xn are independent and each Xi is symmetric about bi, then Sn=∑Xi is symmetric about ∑bi.
Parity trick: if Sn takes integer values and is symmetric about a half integer (e.g., 454.5), then P(Sn≤454)=P(Sn≥455)=1/2. This works because no probability mass sits on the center.
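The parity trick can be verified exactly on a small case. A standalone sketch: Binomial(9, 1/2) is symmetric about the half integer 4.5, so all the mass below the center is exactly 1/2.

```python
from fractions import Fraction
from math import comb

# S ~ Binomial(9, 1/2) is symmetric about 4.5; no mass sits at the center,
# so P(S <= 4) = P(S >= 5) = 1/2 exactly.
n = 9
pmf = {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}
assert all(pmf[k] == pmf[n - k] for k in pmf)      # symmetry about n/2 = 4.5
assert sum(pmf[k] for k in range(5)) == Fraction(1, 2)
```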
Section 3.2: Expectation
Core idea: the expected value is a probability weighted average, and the linearity of expectation is the single most powerful computational tool in this course.
1. Definition
E(X) = ∑_x x · P(X = x).
Interpretation: the balance point (center of gravity) of the distribution histogram.
Key special cases:
E(I_A) = P(A) (indicator of event A), E(c) = c (constant).
2. Long Run Average
If X represents a repeated measurement, then E(X) approximates the average over many repetitions. (Made rigorous by the Law of Large Numbers in Section 3.3.)
3. Addition Rule (Linearity)
For any random variables X,Y defined on the same setting:
E(X+Y)=E(X)+E(Y).
This holds regardless of dependence. More generally:
E(a1X1+⋯+anXn+b)=a1E(X1)+⋯+anE(Xn)+b.
4. Method of Indicators
If X counts how many of the events A1,…,An occur, write X=I1+⋯+In where Ij=1(Aj). Then:
E(X)=P(A1)+P(A2)+⋯+P(An).
No information about dependence among the Aj is needed.
Binomial mean: S_n ∼ Binomial(n, p); write S_n = I_1 + ⋯ + I_n, each E(I_j) = p:
E(Sn)=np.
Complement indicator pattern. Sometimes Ij=1 means “at least one item from group j is selected.” It is easier to compute P(Ij=1)=1−P(none from group j). Then:
E(X) = ∑_{j=1}^n [1 − P(none from group j selected)].
This pattern appeared on Practice Quiz 2 Q3: n families with 3 kids, k kids selected from 3n, X = number of families with at least one kid selected. Here P(none from family j) = C(3n − 3, k)/C(3n, k), so E(X) = n[1 − C(3n − 3, k)/C(3n, k)].
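The indicator formula for the families problem can be confirmed by brute force on a small instance. A standalone sketch (n = 3 families, k = 4 kids selected — my own choice of small numbers), comparing exact enumeration against the formula:

```python
from itertools import combinations
from math import comb
from fractions import Fraction

n, k = 3, 4                                  # 3 families x 3 kids = 9 kids, pick 4
kids = [(fam, i) for fam in range(n) for i in range(3)]

# Exact E(X) by enumerating every k-subset of kids
num_subsets = comb(3 * n, k)
total = Fraction(0)
for chosen in combinations(kids, k):
    fams = {fam for fam, _ in chosen}        # families with >= 1 kid selected
    total += Fraction(len(fams), num_subsets)

# Indicator formula: E(X) = n [1 - C(3n-3, k)/C(3n, k)]
formula = n * (1 - Fraction(comb(3 * n - 3, k), comb(3 * n, k)))
assert total == formula
```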
5. Tail Sum Formula
If X takes values in {0,1,…,n}:
E(X) = ∑_{j=1}^n P(X ≥ j).
This is a special case of the method of indicators with Aj=(X≥j).
Useful when P(X ≥ j) is simpler than P(X = k). Example: since P(min of 4 dice ≥ j) = [(7 − j)/6]^4, we get E(min of 4 dice) = ∑_{j=1}^{6} [(7 − j)/6]^4.
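The dice example is small enough to verify both ways. A standalone sketch: direct enumeration of all 6⁴ outcomes against the tail-sum formula, in exact arithmetic.

```python
from itertools import product
from fractions import Fraction

# E(min of 4 dice) two ways: direct enumeration vs the tail-sum formula
rolls = list(product(range(1, 7), repeat=4))
direct = sum(Fraction(min(r), len(rolls)) for r in rolls)

# P(min >= j) = ((7-j)/6)^4, so E(min) = sum over j = 1..6
tail_sum = sum(Fraction(7 - j, 6) ** 4 for j in range(1, 7))
assert direct == tail_sum
```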
6. Expectation of a Function
E[g(X)] = ∑_x g(x) · P(X = x).
Trap: in general, E[g(X)] ≠ g(E(X)). Equality holds only when g is linear (affine). The most common mistake: assuming E(X²) = [E(X)]².
kth moment:
E(X^k) = ∑_x x^k P(X = x).
Two variable version:
E[g(X, Y)] = ∑_{(x,y)} g(x, y) P(X = x, Y = y).
7. Multiplication Rule for Expectation
If X and Y are independent:
E(XY)=E(X)⋅E(Y).
This requires independence. Contrast with the addition rule which requires nothing.
Counterexample: if X = Y, then E(XY) = E(X²) ≠ [E(X)]² = E(X)E(Y) (unless X is constant).
8. Variance of a Product of Independent Variables
If X and Y are independent:
Var(XY) = Var(X)Var(Y) + Var(X)[E(Y)]² + [E(X)]²Var(Y).
Equivalently:
Var(XY) = E(X²)E(Y²) − [E(X)]²[E(Y)]².
Derivation. By independence, E[(XY)²] = E(X²)E(Y²) and [E(XY)]² = [E(X)]²[E(Y)]². Then Var(XY) = E(X²)E(Y²) − [E(X)]²[E(Y)]². Expanding E(X²) = Var(X) + [E(X)]² (and similarly for Y) gives the first form.
This formula appeared on Practice Quiz 2 Q2. With Var(X)=3, E(X)=1, Var(Y)=2, E(Y)=0: Var(XY)=(3)(2)+(3)(0)+(1)(2)=8.
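Both forms of the product-variance formula can be checked on a concrete pair of independent discrete variables. A standalone sketch (the two distributions are arbitrary choices, not the quiz's), in exact arithmetic:

```python
from fractions import Fraction as F

X = {0: F(1, 2), 2: F(1, 2)}                 # E(X) = 1, Var(X) = 1
Y = {-1: F(1, 3), 1: F(1, 3), 3: F(1, 3)}    # E(Y) = 1, Var(Y) = 8/3

def E(dist, g=lambda v: v):
    return sum(g(v) * p for v, p in dist.items())

# Under independence, P(X=x, Y=y) = P(X=x) P(Y=y); enumerate the product table.
EXY2 = sum(p * q * (x * y) ** 2 for x, p in X.items() for y, q in Y.items())
EXY = sum(p * q * x * y for x, p in X.items() for y, q in Y.items())
var_xy = EXY2 - EXY ** 2                     # second form of the formula

mx, my = E(X), E(Y)
vx = E(X, lambda v: v * v) - mx ** 2
vy = E(Y, lambda v: v * v) - my ** 2
assert var_xy == vx * vy + vx * my ** 2 + mx ** 2 * vy   # first form
```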
9. Boole’s and Markov’s Inequalities
Boole’s inequality: if X counts how many of A1,…,An occur:
P(X ≥ 1) = P(⋃_j A_j) ≤ ∑_j P(A_j) = E(X).
Markov’s inequality: for X≥0 and a>0:
P(X ≥ a) ≤ E(X)/a.
The condition X≥0 is essential.
10. Properties of Expectation (Summary Box, p.181)
| Property | Formula | Condition |
|---|---|---|
| Definition | E(X) = ∑_x x P(X=x) | |
| Constants | E(c) = c | |
| Indicators | E(I_A) = P(A) | |
| Functions | E[g(X)] = ∑_x g(x) P(X=x) | ≠ g(E(X)) in general |
| Constant factors | E(cX) = c E(X) | |
| Addition | E(X+Y) = E(X) + E(Y) | always (dependent or not) |
| Multiplication | E(XY) = E(X)E(Y) | independent only |
Section 3.3: Standard Deviation and Normal Approximation
Core idea: variance measures spread around the mean, and the square root law governs how sums and averages of independent variables behave.
1. Variance and Standard Deviation
Var(X) = E[(X − μ)²], SD(X) = √Var(X), where μ = E(X).
SD(X) measures the typical size of the deviation ∣X−μ∣.
Var(X)≥0 always, with equality iff X is constant (i.e., P(X=μ)=1).
2. Computational Formula
Var(X) = E(X²) − [E(X)]².
“Mean of the square minus square of the mean.” This always gives:
E(X²) ≥ [E(X)]²,
with equality iff X is constant.
Indicator: I_A² = I_A, so E(I_A²) = p; thus Var(I_A) = p − p² = p(1 − p) and SD(I_A) = √(p(1 − p)).
3. Scaling and Shifting
Adding a constant shifts the mean but does not change the SD: SD(X + b) = SD(X). Multiplying by a scales the SD by |a|: SD(aX) = |a| SD(X).
4. Standardization
For μ=E(X) and σ=SD(X)>0:
X* = (X − μ)/σ, E(X*) = 0, SD(X*) = 1.
X∗ measures how many standard deviations X is from its mean.
5. Chebyshev’s Inequality
For any random variable X with E(X)=μ and SD(X)=σ:
P(∣X − μ∣ ≥ kσ) ≤ 1/k².
Proof: apply Markov’s inequality to (X−μ)2 with threshold k2σ2.
This bound is crude but universal (works for any distribution).
One sided application template. To bound P(X≥a) where a>μ:
P(X ≥ a) ≤ P(∣X − μ∣ ≥ a − μ) ≤ σ²/(a − μ)².
Set k = (a − μ)/σ and apply the 1/k² bound.
Inverse problem template (Practice Quiz 2 Q1b). Given target bound 1/k² = c, solve k = 1/√c. Then a − μ = kσ gives μ = a − kσ.
| Deviation | Chebyshev bound | Normal actual |
|---|---|---|
| ≥ 1σ | ≤ 1 (trivial) | 0.3173 |
| ≥ 2σ | ≤ 1/4 | 0.0455 |
| ≥ 3σ | ≤ 1/9 | 0.0027 |
| ≥ 4σ | ≤ 1/16 | 0.00006 |
6. Variance Addition Rule
If X and Y are independent:
Var(X+Y)=Var(X)+Var(Y).
For n independent variables:
Var(X1+⋯+Xn)=Var(X1)+⋯+Var(Xn).
This requires independence. The cross term 2E[(X − E(X))(Y − E(Y))] vanishes under independence but not in general. If X = Y: Var(2X) = 4Var(X) ≠ 2Var(X).
7. Square Root Law
For i.i.d. X_1, …, X_n with E(X_i) = μ and SD(X_i) = σ, let S_n = ∑ X_i and X̄_n = S_n/n:
E(S_n) = nμ, SD(S_n) = σ√n, E(X̄_n) = μ, SD(X̄_n) = σ/√n.
The sum's SD grows as √n (not n) because positive and negative deviations partially cancel. The average's SD shrinks as 1/√n.
Binomial SD: S_n ∼ Binomial(n, p); each indicator has SD = √(pq), so SD(S_n) = √(npq).
8. Law of Averages (Weak Law of Large Numbers)
For i.i.d. X1,X2,… with E(Xi)=μ and finite variance:
P(∣X̄_n − μ∣ < ε) → 1 as n → ∞, for every ε > 0.
Proof via Chebyshev: P(∣X̄_n − μ∣ ≥ ε) ≤ σ²/(nε²) → 0.
9. Central Limit Theorem (Normal Approximation)
For i.i.d. Xi with E(Xi)=μ, SD(Xi)=σ, and n large:
P(a ≤ (S_n − nμ)/(σ√n) ≤ b) ≈ Φ(b) − Φ(a).
Equivalently, S_n is approximately Normal(nμ, nσ²).
Procedure: (1) compute μ and σ of one X_i, (2) get E(S_n) = nμ and SD(S_n) = σ√n, (3) standardize, (4) use the Φ table.
Continuity correction: if Xi takes consecutive integer values, replace b with b+1/2 and a with a−1/2 for better accuracy.
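The four-step procedure, with continuity correction, can be sketched for a binomial sum (n = 100 fair coin tosses is my own choice of example), comparing the normal approximation against the exact probability:

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF via the error function."""
    return (1 + erf(z / sqrt(2))) / 2

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))       # E(S_n) = 50, SD(S_n) = 5

# Exact P(45 <= S_100 <= 55) vs normal approximation with continuity correction
exact = sum(comb(n, k) for k in range(45, 56)) / 2**n
approx = phi((55.5 - mu) / sigma) - phi((44.5 - mu) / sigma)
assert abs(exact - approx) < 0.005             # correction makes it very close
```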
Section 3.4: Discrete Distributions
Core idea: everything from Sections 3.1 through 3.3 extends to infinite ranges by replacing finite sums with convergent series. The geometric distribution is the central new object.
1. Discrete Distribution on Infinite Sets
A distribution on {0,1,2,…} is a sequence p0,p1,p2,… with
p_i ≥ 0 for all i, ∑_{i=0}^∞ p_i = 1.
2. Infinite Sum Rule
If A1,A2,… are mutually exclusive with A=⋃Ai:
P(A) = ∑_{i=1}^∞ P(A_i).
All finite rules (conditional probability, independence, Bayes, expectation, variance) extend to infinite ranges by replacing finite sums with convergent series.
3. Geometric Distribution
Definition. Waiting time T to first success in Bernoulli(p) trials:
P(T = k) = (1 − p)^(k−1) p, k = 1, 2, 3, …
Sum to 1: ∑_{k=1}^∞ q^(k−1) p = p/(1 − q) = 1.
Tail probability: P(T > n) = (1 − p)^n.
Moments:
E(T) = 1/p, Var(T) = (1 − p)/p², SD(T) = √(1 − p)/p.
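The moment formulas can be checked numerically by summing the series far enough into the tail that the remainder is negligible. A standalone sketch with p = 0.3 (an arbitrary choice):

```python
from math import isclose

p = 0.3
q = 1 - p
# Truncate at k = 400: the remaining tail is of order q^400, utterly negligible.
mean = sum(k * q ** (k - 1) * p for k in range(1, 401))
second = sum(k * k * q ** (k - 1) * p for k in range(1, 401))

assert isclose(mean, 1 / p)                    # E(T) = 1/p
assert isclose(second - mean ** 2, q / p ** 2) # Var(T) = (1-p)/p^2
```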
Derivation of E(T) via shift and subtract. Let Σ₁ = ∑_{n=1}^∞ n q^(n−1) = 1 + 2q + 3q² + ⋯. Then Σ₁ − qΣ₁ = 1 + q + q² + ⋯ = 1/(1 − q), so Σ₁ = 1/(1 − q)² = 1/p², and E(T) = p · Σ₁ = 1/p.
6. Poisson Scatter
Two assumptions: (1) No multiple hits at the same point. (2) For each partition of the unit square into n equal subsquares, each subsquare is hit independently with the same probability.
Theorem. Under these assumptions, there exists an intensity λ>0 such that:
(i) For any region B: N(B)∼Poisson(λ⋅area(B)).
(ii) For disjoint regions B1,…,Bj: N(B1),…,N(Bj) are mutually independent.
7. Thinning
Start with a Poisson scatter of intensity λ and keep each point independently with probability p. The result is again a Poisson scatter, with intensity pλ.
Superposition (reverse of thinning): two independent Poisson scatters with intensities α,β combine to give Poisson scatter with intensity α+β.
8. Small μ Approximations
When μ is very small:
P(N=0)≈1−μ,P(N=1)≈μ,P(N≥2)≈0.
9. Normal Approximation for Poisson
For large μ:
(N − μ)/√μ → Standard Normal in distribution.
Convergence is slower than for the binomial because the Poisson skewness, 1/√μ, decays slowly.
Section 3.6: Symmetry (Exchangeability)
Core idea: exchangeability lets you replace “the kth draw” with “the first draw” in any calculation, massively simplifying problems involving sampling without replacement.
1. Exchangeability
(X,Y) has a symmetric joint distribution iff P(x,y)=P(y,x) for all (x,y). Equivalently, (X,Y) and (Y,X) have the same joint distribution. We say X and Y are exchangeable.
For n variables: X1,…,Xn are exchangeable iff
P(x1,…,xn)=P(xπ(1),…,xπ(n))
for every permutation π.
Consequences: all Xi share the same marginal distribution, and every subset of size m has the same joint distribution.
Exchangeable does NOT imply independent. Sampling without replacement is exchangeable but not independent.
2. Independent Trials Are Exchangeable
If X1,…,Xn are i.i.d., then P(x1,…,xn)=∏p(xi), which is symmetric because multiplication is commutative.
3. Sampling Without Replacement Is Exchangeable
Theorem. Drawing n items without replacement from {b1,…,bN} produces an exchangeable sequence X1,…,Xn.
In particular, any subset Xi1,…,Xim has the same joint distribution as the first m draws X1,…,Xm.
Application pattern: to compute P(event about the kth draw), replace it with the equivalent event about the 1st draw.
Example (Pitman 3.6.1). 52 cards, deal 5.
P(5th card is King)=P(1st card is King)=4/52.
P(3rd and 5th are black)=P(1st and 2nd are black)=(26/52)(25/51).
On exams, write “by symmetry of sampling without replacement” as a one line justification, then proceed with the simpler first/second draw calculation.
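The card example can be sanity-checked by simulation. A standalone sketch: shuffle a deck repeatedly and estimate P(5th card is a king), which by exchangeability should match P(1st card is a king) = 4/52.

```python
import random

random.seed(0)
deck = ["K"] * 4 + ["x"] * 48           # 4 kings in a 52-card deck
trials = 100_000
hits = 0
for _ in range(trials):
    random.shuffle(deck)
    if deck[4] == "K":                  # the 5th card dealt is a king
        hits += 1

# By exchangeability this matches P(1st card is a king) = 4/52 ~ 0.0769
assert abs(hits / trials - 4 / 52) < 0.01
```

The simulation never uses the position of the card; any fixed index would give the same answer, which is exactly the point of the symmetry argument.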
4. Hypergeometric Mean and Variance
Population N with G good, B=N−G bad. Sample n without replacement. Sn = number of good. Let p=G/N, q=1−p.
Mean via indicators + symmetry:
Sn=I1+⋯+In,Ij=1(draw j is good).
By exchangeability, each Ij has Bernoulli(G/N) distribution:
E(S_n) = n · G/N = np.
This is the same as the binomial mean.
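The hypergeometric mean nG/N can be confirmed directly from the pmf in the reference table. A standalone sketch (N = 20, G = 8, n = 5 are arbitrary choices), in exact arithmetic:

```python
from math import comb
from fractions import Fraction

N, G, n = 20, 8, 5                       # population, good items, sample size
pmf = {k: Fraction(comb(G, k) * comb(N - G, n - k), comb(N, n))
       for k in range(n + 1)}

assert sum(pmf.values()) == 1
mean = sum(k * p for k, p in pmf.items())
assert mean == Fraction(n * G, N)        # E(S_n) = nG/N, same as binomial np
```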
Variance derivation. Expand E(S_n²) = E[(∑ I_j)²]:
E(S_n²) = ∑_j E(I_j²) + 2 ∑_{j<k} E(I_j I_k).
Since I_j² = I_j: ∑_j E(I_j²) = n · G/N.
By exchangeability, every pair (I_j, I_k) has the same joint distribution as (I_1, I_2):