Notes: Statistics Refresher for Machine Learning

Feb 13, 2026

Basics

$$\begin{align*} ^nP_{r} &= \frac{n!}{(n-r)!} \\ ^nC_{r} &= \frac{n!}{r!(n-r)!} \end{align*}$$
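A quick numeric sanity check (a minimal sketch; `math.perm` and `math.comb` are in Python's standard library since 3.8, and the values of `n` and `r` are arbitrary):

```python
import math

n, r = 5, 2

# nPr = n! / (n-r)!  -> ordered selections
print(math.perm(n, r))                             # 20
print(math.factorial(n) // math.factorial(n - r))  # 20, same thing

# nCr = n! / (r! (n-r)!)  -> unordered selections
print(math.comb(n, r))                             # 10
print(math.perm(n, r) // math.factorial(r))        # 10: divide out the r! orderings
```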

LOOK, we divide by $r!$ because how the chosen items sit among themselves does NOT matter! See this visualiser

$$\begin{align*} P(\text{event A, GIVEN, event B}) &= \frac{P(\text{A and B})}{P(B)} \\ \implies P(A|B) &= \frac{P(A \cap B)}{P(B)} \end{align*}$$
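A toy example of the definition, counting over two dice rolls (the events A and B here are made up for illustration):

```python
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

# A: the sum is 8;  B: the first die shows an even number.
A = {o for o in outcomes if sum(o) == 8}
B = {o for o in outcomes if o[0] % 2 == 0}

p_b = len(B) / len(outcomes)
p_a_and_b = len(A & B) / len(outcomes)

# P(A|B) = P(A and B) / P(B)
print(p_a_and_b / p_b)  # 3/18 ~ 0.167
```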

Random Variables & Distributions

$$E[X] = \sum_{x \in X} x\,P(x)$$

$$\begin{align} \text{Var}(X) &= E[(X-E[X])^2] \\ \text{or, } \text{Var}(X) &= E[X^2] - (E[X])^2 \end{align}$$

$$\text{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])]$$
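A minimal numpy sketch confirming that the two variance formulas agree, and that the covariance definition matches numpy's estimator (the simulated data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)

# Var(X) = E[(X - E[X])^2]
var_def = np.mean((x - x.mean()) ** 2)
# Var(X) = E[X^2] - (E[X])^2
var_alt = np.mean(x ** 2) - x.mean() ** 2
print(var_def, var_alt)  # identical up to float error

# Cov(X,Y) = E[(X - E[X])(Y - E[Y])]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_def, np.cov(x, y, bias=True)[0, 1])  # matches numpy's (biased) estimate
```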

Discrete Distributions

1. Bernoulli
$$\text{Ber}(p) \sim f(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases}$$

Mean is $p$, Variance is $p(1-p)$.
2. Binomial
The sum of $n$ independent Bernoulli random variables.

$$\text{Bin}(n, p) \sim f(x) = \binom{n}{x}p^x(1-p)^{n-x}$$

Mean is $np$, Variance is $np(1-p)$.
3. Geometric
It counts the number of trials required to observe a single success.

$$f(x) = (1-p)^{x-1}p$$

Mean is $1/p$, Variance is $\frac{1-p}{p^2}$.
4. Poisson
It counts the number of events occurring in a fixed interval of time or space with a given average rate $\lambda$.

$$P(\lambda) \sim f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

Mean and Variance are both $\lambda$.
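To sanity-check the means and variances listed above, a minimal numpy simulation (parameters `n`, `p`, `lam` are arbitrary choices; note that numpy's `geometric` counts trials up to and including the first success, matching the pmf above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, p, lam = 1_000_000, 10, 0.3, 4.0

for name, draws, mean, var in [
    ("Bernoulli", rng.binomial(1, p, N), p, p * (1 - p)),
    ("Binomial",  rng.binomial(n, p, N), n * p, n * p * (1 - p)),
    ("Geometric", rng.geometric(p, N), 1 / p, (1 - p) / p**2),
    ("Poisson",   rng.poisson(lam, N), lam, lam),
]:
    print(f"{name:9s} mean {draws.mean():.3f} (exp {mean:.3f})  "
          f"var {draws.var():.3f} (exp {var:.3f})")
```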

Continuous Distributions

1. Uniform
When all intervals of the same length are equally likely.
$$U(a,b) \sim f(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in [a,b] \\ 0 & \text{otherwise} \end{cases}$$

Mean is $\frac{a+b}{2}$, Variance is $\frac{(b-a)^2}{12}$.
2. Normal
The normal (or Gaussian) is the distribution assumed to be produced additively by many, many small independent effects (cf. the central limit theorem).

$$N(\mu,\sigma^2) \sim f(x) = \frac{1}{\sqrt{ 2\pi \sigma^2 }}\, e^{ -\frac{(x-\mu)^2}{2\sigma^2} }$$

Mean is $\mu$, Variance is $\sigma^2$.
3. Exponential
It is the continuous analogue of the geometric distribution.

$$\text{Exp}(\lambda) \sim f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Mean is $1/\lambda$, Variance is $1/\lambda^2$.
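The same sanity check for the continuous distributions (parameters are arbitrary; note that numpy's `exponential` takes the scale $1/\lambda$, not the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, b, mu, sigma, lam = 1_000_000, 2.0, 5.0, 1.0, 2.0, 0.5

for name, draws, mean, var in [
    ("Uniform",     rng.uniform(a, b, N), (a + b) / 2, (b - a) ** 2 / 12),
    ("Normal",      rng.normal(mu, sigma, N), mu, sigma ** 2),
    ("Exponential", rng.exponential(1 / lam, N), 1 / lam, 1 / lam ** 2),
]:
    print(f"{name:11s} mean {draws.mean():.3f} (exp {mean:.3f})  "
          f"var {draws.var():.3f} (exp {var:.3f})")
```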

Estimation

Bayesian

$$\begin{align} P(A|B) &= \frac{P(B|A)P(A)}{P(B)} \\ \implies P(A|B)\times P(B) &= P(B|A) \times P(A) \end{align}$$

Intuitively: the proportion of the time that B happens and A happens as well is the same whichever event we condition on first; both sides equal $P(A \cap B)$.
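A worked example of Bayes' rule with hypothetical numbers for a diagnostic test (the prevalence, sensitivity, and false-positive rate below are invented for illustration):

```python
# Hypothetical: a disease with 1% prevalence, a test with
# 95% sensitivity P(+|D) and a 10% false-positive rate P(+|H).
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_h = 0.10

# Total probability of a positive test: P(+) = P(+|D)P(D) + P(+|H)P(H)
p_pos = p_pos_given_d * p_d + p_pos_given_h * (1 - p_d)

# Bayes: P(D|+) = P(+|D) P(D) / P(+)
print(p_pos_given_d * p_d / p_pos)  # ~0.088: still unlikely despite a positive test
```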

Regression

$$\begin{align} \sum y_{i} &= m\sum x_{i} + nb \\ \sum x_{i}y_{i} &= m\sum x_{i}^2 + b\sum x_{i} \end{align}$$

where $n$ is the sample size, $x_{i}$ and $y_{i}$ are the values of the two measurements, $m$ is the slope, and $b$ is the intercept.
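A minimal sketch solving these two normal equations directly as a $2\times 2$ linear system, then checking against numpy's own least-squares fit (the simulated data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)  # true slope 2, intercept 1
n = len(x)

# The two normal equations as a linear system in (m, b):
#   m*sum(x)   + b*n      = sum(y)
#   m*sum(x^2) + b*sum(x) = sum(x*y)
A = np.array([[x.sum(), n],
              [(x**2).sum(), x.sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
m, b = np.linalg.solve(A, rhs)

print(m, b)
print(np.polyfit(x, y, 1))  # same slope and intercept
```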

$$r=\frac{s_{xy}}{\sqrt{ s_{xx} }\sqrt{ s_{yy} }}$$

and $s_{xy}$, $s_{xx}$, $s_{yy}$ are defined as:

$$\begin{align} s_{xy} &= \sum(x_{i}-\bar{x})(y_{i}-\bar{y}) \\ s_{xx} &= \sum(x_{i}-\bar{x})^2 \\ s_{yy} &= \sum(y_{i}-\bar{y})^2 \end{align}$$

Note: $r$ lies between -1 and 1, and 0 means no correlation. Geometrically, $r$ is the cosine of the angle between the centered data vectors $(x_{i}-\bar{x})$ and $(y_{i}-\bar{y})$; when $r=\pm 1$ the two least-squares regression lines $Y_{xy}$ (the line of $x$ w.r.t. $y$) and $Y_{yx}$ (the line of $y$ w.r.t. $x$) coincide.
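A quick check that this formula matches numpy's built-in correlation (the data is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(size=200)

# r = s_xy / (sqrt(s_xx) * sqrt(s_yy))
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)
r = s_xy / (np.sqrt(s_xx) * np.sqrt(s_yy))

print(r, np.corrcoef(x, y)[0, 1])  # identical
```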

[Diagram: correlation visualiser from Seeing Theory]

Comparison

$$\begin{align} d &= \frac{\mu_{1}-\mu_{2}}{\text{pooled } SD} \\ \text{pooled } SD &= \sqrt{ \frac{(n_{1}-1)\times SD_{1}^2 + (n_{2}-1)\times SD_{2}^2}{n_{1}+n_{2}-2} } \end{align}$$
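This standardized difference is Cohen's d. A minimal sketch (the two groups below are simulated, with a true effect of about 0.5):

```python
import numpy as np

def cohens_d(g1, g2):
    """Effect size d = (mean1 - mean2) / pooled SD."""
    n1, n2 = len(g1), len(g2)
    sd1, sd2 = np.std(g1, ddof=1), np.std(g2, ddof=1)  # sample SDs
    pooled = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (np.mean(g1) - np.mean(g2)) / pooled

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, 40)  # hypothetical group 1
b = rng.normal(9.0, 2.0, 35)   # hypothetical group 2
print(cohens_d(a, b))          # ~0.5: a "medium" effect
```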

Hypothesis Testing

$$\hat{\theta} \sim N\left( 0,\; p(1-p)\left( \frac{1}{n_{1}}+\frac{1}{n_{0}} \right) \right)$$

$$T = \frac{\hat{\theta}}{SE(\hat{\theta})} = \frac{\hat{\theta}}{\sqrt{ p(1-p)\left( \frac{1}{n_{1}}+\frac{1}{n_{0}} \right) }}$$

Then we see $E(T) = 0$ and $\text{Var}(T) = 1$, i.e. $T \sim N(0,1)$. The critical value, i.e. the value at the edge of the level of significance, is usually given; with our sample we only calculate $t$ to test whether $H_{0}$ can be rejected or not.

$$\begin{align} p &= P(|T| \geq |t_{\text{obs}}| \mid H_{0}) \\ \text{or, } p &= 2P(Z \geq |t_{\text{obs}}|) \end{align}$$

$$\begin{align} P(-1.96 < Z < 1.96) &= 0.95 \\ \text{or, } P\left(-1.96 < \frac{\hat{\theta}-\theta}{SE} < 1.96\right) &= 0.95 \\ \text{or, } P(\hat{\theta}-1.96\,SE < \theta < \hat{\theta}+1.96\,SE) &= 0.95 \end{align}$$

i.e. the 95% confidence interval is $\hat{\theta} \pm 1.96\,SE$.
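A minimal sketch of the whole test using only the standard library, with a hypothetical estimate and standard error (`theta_hat` and `se` are invented numbers):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

theta_hat, se = 0.08, 0.03  # hypothetical estimate and standard error
t_obs = theta_hat / se

# Two-sided p-value: p = 2 P(Z >= |t_obs|)
p = 2 * (1 - phi(abs(t_obs)))
print(p)  # ~0.0077 < 0.05, so reject H0 at the 5% level

# 95% confidence interval: theta_hat +/- 1.96 SE
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```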

| | Accept $H_{0}$ | Reject $H_{0}$ |
| --- | --- | --- |
| $H_{0}$ is true | True -ve, $1-\alpha$ | False +ve, $\alpha$ (Type I error) |
| $H_{0}$ is false | False -ve, $\beta$ (Type II error) | True +ve, $1-\beta$ (Power) |

[Diagram: z statistic]