Mathematics — Lessesity

Calculus

The mathematics of continuous change — how quantities vary, accumulate, and relate to one another through limits, derivatives, and integrals.

Limits & Derivatives

The foundation of calculus — measuring instantaneous rates of change via the limiting process.

The Limit

A limit describes the value a function approaches as its input approaches a given point — even if the function is not defined at that point.

$$\lim_{x \to a} f(x) = L$$

In Plain Terms

Imagine walking toward a wall but never quite touching it — you can get arbitrarily close. A limit asks: what value is the function heading toward as the input closes in on a target, even if it never arrives?

Example: $\lim_{x \to 2} (x^2) = 4$ — as $x$ approaches 2, $x^2$ approaches 4.

The Derivative

The derivative of a function at a point measures the instantaneous rate of change — defined as the limit of the difference quotient as the interval shrinks to zero.

$$f'(x) = \lim_{\Delta x \to 0}\frac{f(x+\Delta x)-f(x)}{\Delta x}$$

In Plain Terms

The derivative is the slope of a curve at a single point — the "steepness" of the function right there. If $f(x)$ is your position over time, $f'(x)$ is your speed at each instant.

Example: If $f(x) = x^2$, then $f'(x) = 2x$. At $x = 3$, the slope is $6$.
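The limit definition can be watched numerically. The sketch below is illustrative only: the helper names `difference_quotient` and `square` are made up for this example, and it evaluates the difference quotient of $f(x) = x^2$ at $x = 3$ for shrinking step sizes.

```python
def difference_quotient(f, x, dx):
    """Slope of the secant line through (x, f(x)) and (x + dx, f(x + dx))."""
    return (f(x + dx) - f(x)) / dx


def square(x):
    return x * x


# As dx shrinks, the secant slope moves toward the derivative f'(3) = 6.
for dx in (1.0, 0.1, 0.001, 1e-6):
    print(dx, difference_quotient(square, 3.0, dx))
```

With `dx = 1.0` the secant slope is 7; the later values settle near 6, the slope given by $f'(x) = 2x$.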

Power Rule

The power rule gives a quick formula for differentiating any power of $x$ without computing the full limit definition each time.

$$\frac{d}{dx} x^n = n x^{n-1}$$

In Plain Terms

Bring the exponent down as a coefficient and reduce the power by one. It's the most-used shortcut in calculus.

Example: $\frac{d}{dx} x^5 = 5x^4$.

Product Rule

When differentiating the product of two functions, the derivative is not simply the product of their derivatives — the product rule gives the correct formula.

$$(uv)' = u'v + uv'$$

In Plain Terms

Take the derivative of the first function times the second, then add the first times the derivative of the second. Often remembered as "first times derivative of second, plus second times derivative of first."

Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.

Common Derivatives

A reference table of frequently encountered derivatives.

Function	Derivative
$\sin x$	$\cos x$
$\cos x$	$-\sin x$
$e^x$	$e^x$
$\ln x$	$\dfrac{1}{x}$
$\tan x$	$\sec^2 x$

In Plain Terms

These are the atomic building blocks. Memorise them and, together with the power, product, and chain rules, you can differentiate most functions built from them.

Integrals

The counterpart to differentiation — computing accumulated quantities such as areas, totals, and averages.

The Definite Integral

The definite integral of $f$ from $a$ to $b$ computes the net signed area between the curve and the $x$-axis over that interval.

$$\int_a^b f(x)\, dx = \lim_{\|\Delta\| \to 0} \sum_{i} f(c_i)\,\Delta x_i$$

In Plain Terms

Slice the area under a curve into thin rectangles, add up their areas, and take the limit as the rectangles become arbitrarily thin. The result is the total accumulated quantity: area, distance, charge, or whatever $f(x)$ represents.

Example: $\int_0^1 x^2\, dx = \frac{1}{3}$ — the area under $y = x^2$ from $0$ to $1$.
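The Riemann-sum definition can be approximated directly. This is a minimal sketch, assuming equal-width rectangles sampled at their midpoints (one of many valid choices of $c_i$); `riemann_sum` is a name invented for this example.

```python
def riemann_sum(f, a, b, n):
    """Midpoint Riemann sum of f over [a, b] using n equal-width rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


# More rectangles bring the sum closer to the exact value 1/3.
for n in (10, 100, 1000):
    print(n, riemann_sum(lambda x: x * x, 0.0, 1.0, n))
```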

The Indefinite Integral (Antiderivative)

The indefinite integral finds a function whose derivative equals $f(x)$ — the reverse operation of differentiation, expressed up to an arbitrary constant $C$.

$$\int f(x)\, dx = F(x) + C \quad \text{where } F'(x) = f(x)$$

In Plain Terms

If differentiation is "finding speed from position," antidifferentiation is "finding position from speed." The constant $C$ reflects that many different position functions have the same speed.

Example: $\int x^2\, dx = \frac{x^3}{3} + C$.

The Chain Rule

How to differentiate a composition of two functions.

Chain Rule

When one function is nested inside another, the chain rule states that the derivative of the composite is the derivative of the outer function (evaluated at the inner) multiplied by the derivative of the inner.

$$\frac{d}{dx}\, g(f(x)) = g'(f(x))\cdot f'(x)$$

In Plain Terms

Think of gears: the chain rule multiplies the rate of the outer gear by the rate of the inner gear. "Outside derivative times inside derivative."

Example: $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$.
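A quick numerical sanity check of the example above: the sketch compares the chain-rule formula $\cos(x^2) \cdot 2x$ with a central-difference estimate. The function names and the test point `x = 1.3` are arbitrary choices for illustration.

```python
import math


def central_difference(f, x, h=1e-6):
    """Symmetric numerical estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def composite(x):
    # Outer function sin, inner function x^2.
    return math.sin(x ** 2)


def chain_rule_derivative(x):
    # Outer derivative evaluated at the inner function, times the inner derivative.
    return math.cos(x ** 2) * 2 * x


x = 1.3
numeric = central_difference(composite, x)
exact = chain_rule_derivative(x)
print(numeric, exact)
```

The two printed values should agree to several decimal places.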

Fundamental Theorem of Calculus

The deep result unifying differentiation and integration as inverse operations.

FTC Part I

If $f$ is continuous and $F(x)$ is defined as the integral of $f$ from a fixed point (here $0$) up to $x$, then $F$ is differentiable and its derivative is $f(x)$ itself.

$$\frac{d}{dx}\int_0^x f(t)\, dt = f(x)$$

In Plain Terms

Integration and differentiation undo each other. If you accumulate area under a curve and then ask "how fast is that area growing right now?", the answer is just the height of the curve at that point.

FTC Part II

To evaluate a definite integral, find any antiderivative $F$ and compute $F(b) - F(a)$. This replaces infinite sums with simple algebra.

$$\int_a^b f(x)\, dx = F(b) - F(a) \quad \text{where } F'(x) = f(x)$$

In Plain Terms

You don't need to add up infinite rectangles by hand. Find the antiderivative, plug in the endpoints, subtract. The FTC is why calculus is practically useful.

Example: $\int_1^3 2x\, dx = [x^2]_1^3 = 9 - 1 = 8$.

Linear Algebra

The algebra of vectors, matrices, and linear transformations — the language of data, geometry, and machine learning.

Vectors & Spaces

The building blocks: what vectors are, how to combine them, and what spaces they form.

Vectors & Vector Spaces

A vector space $V$ is a set of vectors that can be added together and multiplied by scalars, satisfying a set of axioms. Elements like arrows, polynomials, and functions all qualify.

$$\mathbf{u}, \mathbf{v} \in V \implies \mathbf{u} + \mathbf{v} \in V, \quad c\mathbf{u} \in V$$

In Plain Terms

A vector space is any collection of objects that you can add and scale — and the results stay in the collection. Arrows in 3D space are the classic example, but the concept is far more general.

Example: $\mathbb{R}^n$ — tuples of $n$ real numbers — is the most common vector space.

Dot Product

The dot product measures how much two vectors point in the same direction, returning a scalar. It connects algebraic computation to geometric angle.

$$\mathbf{u} \cdot \mathbf{v} = \sum_{i} u_i v_i = \|\mathbf{u}\|\|\mathbf{v}\|\cos\theta$$

In Plain Terms

Multiply corresponding components and sum. If the result is zero, the vectors are perpendicular. If it is positive, they point in broadly the same direction (the angle between them is less than $90^\circ$); if negative, in broadly opposite directions.

Example: $[1,2] \cdot [3,4] = 1\cdot3 + 2\cdot4 = 11$.
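Both sides of the dot-product identity can be computed in a few lines. The sketch below uses plain Python lists as vectors; `dot`, `norm`, and `angle_between` are illustrative helper names, not part of any library.

```python
import math


def dot(u, v):
    """Sum of products of corresponding components."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def angle_between(u, v):
    """Angle in radians, recovered from u . v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


print(dot([1, 2], [3, 4]))            # 11
print(angle_between([1, 0], [0, 1]))  # perpendicular vectors: pi / 2
```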

Linear Maps & Transformations

Functions between vector spaces that preserve addition and scalar multiplication.

Linear Map

A linear map $T: V \to W$ satisfies additivity and homogeneity — it maps the zero vector to zero and preserves the structure of the space.

$$T \in \mathcal{L}(V, W) \implies T(\mathbf{u}+\mathbf{v}) = T\mathbf{u} + T\mathbf{v}, \quad T(c\mathbf{v}) = c\,T\mathbf{v}$$

In Plain Terms

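Linear maps are the "straight" transformations: rotations, reflections, scalings, shears, and projections. They keep the origin fixed and cannot curve, bend, or translate space. Between finite-dimensional spaces, every linear map can be represented as multiplication by a matrix once bases are chosen.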

Rank–Nullity Theorem

For a linear map $T: V \to W$, the dimension of the domain equals the sum of the dimensions of the null space (kernel) and the range (image).

$$\dim(V) = \dim(\operatorname{nul}(T)) + \dim(\operatorname{ran}(T))$$

In Plain Terms

Every dimension of the input space is either "crushed to zero" (null space) or "mapped somewhere useful" (range). The theorem says those two parts must add up to the whole input dimension — nothing is unaccounted for.

Example: A $3 \times 3$ matrix with rank 2 has a null space of dimension 1.
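The example can be checked by computing a rank directly. This is a sketch under stated assumptions: `rank` is a hand-written Gaussian elimination using exact fractions, and the matrix `A` is a made-up $3 \times 3$ example whose third row is a combination of the first two.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) via Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a non-zero entry in this column.
        pivot = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return pivot_row


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]  # third row = 2 * second row - first row

r = rank(A)
nullity = len(A[0]) - r  # rank-nullity: dim(domain) - rank
print(r, nullity)        # 2 1
```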

Eigenvalues & Eigenvectors

Special vectors that a transformation merely stretches or flips — the "axes" of a linear map.

Eigenvalue Equation

An eigenvector $\mathbf{v}_i$ of a linear operator $T$ is a non-zero vector that is only scaled (not rotated) by $T$. The scaling factor is the corresponding eigenvalue $\lambda_i$.

$$T\mathbf{v}_i = \lambda_i \mathbf{v}_i$$

In Plain Terms

Most vectors change direction when a matrix acts on them. Eigenvectors are special: they stay on the same line through the origin and are only stretched, shrunk, or (for a negative eigenvalue) flipped. They reveal the natural "axes" of a transformation.

Example: For a stretch-in-x matrix, the $x$-axis vector is an eigenvector. Eigenvalues are found by solving $\det(T - \lambda I) = 0$.

Determinant as Product of Eigenvalues

The determinant of a linear operator on a finite-dimensional space equals the product of all its eigenvalues, counted with multiplicity and allowing complex values. This connects the volume-scaling factor of a transformation to its spectral structure.

$$\det(T) = \prod_{i} \lambda_i$$

In Plain Terms

The determinant measures how much a transformation stretches or squishes volumes. If any eigenvalue is zero, the determinant is zero — meaning the transformation collapses space and is not invertible.
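For a $2 \times 2$ matrix the characteristic equation $\det(T - \lambda I) = 0$ is the quadratic $\lambda^2 - \operatorname{tr}(T)\,\lambda + \det(T) = 0$, so the eigenvalues and their product can be checked by hand. The sketch below is illustrative: `eigenvalues_2x2` is a made-up helper, and `cmath` is used so that complex eigenvalues (as for a rotation) are handled too.

```python
import cmath


def eigenvalues_2x2(m):
    """Roots of lambda^2 - trace * lambda + det = 0 for a 2x2 matrix."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


stretch_x = [[3, 0], [0, 1]]   # stretches the x-axis by a factor of 3
l1, l2 = eigenvalues_2x2(stretch_x)
print(l1, l2, l1 * l2)         # eigenvalues 3 and 1; product equals det = 3

rotation = [[0, -1], [1, 0]]   # quarter turn: no real eigenvectors
r1, r2 = eigenvalues_2x2(rotation)
print(r1, r2, r1 * r2)         # eigenvalues i and -i; product equals det = 1
```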

Matrix Operations

How matrices are multiplied, inverted, and composed to represent sequences of linear maps.

Matrix Multiplication

The product of two matrices $A$ ($m \times n$) and $B$ ($n \times p$) is a new matrix $C$ ($m \times p$) where each entry is the dot product of a row of $A$ with a column of $B$.

$$(AB)_{ij} = \sum_{k=1}^{n} A_{ik}\, B_{kj}$$

In Plain Terms

Matrix multiplication represents composing two transformations: first apply $B$, then $A$. Note that $AB \neq BA$ in general — order matters.
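The entry formula translates almost word for word into code. A minimal sketch, assuming matrices are stored as lists of rows; `matmul` and the two sample matrices are invented for illustration.

```python
def matmul(A, B):
    """(AB)[i][j] = sum over k of A[i][k] * B[k][j]."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must match rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]  # swaps the two coordinates

AB = matmul(A, B)  # [[2, 1], [4, 3]]: columns of A swapped
BA = matmul(B, A)  # [[3, 4], [1, 2]]: rows of A swapped
print(AB, BA, AB == BA)
```

The two products differ, which is the order dependence noted above.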

Invertibility

A square matrix $A$ is invertible (non-singular) if there exists a matrix $A^{-1}$ such that $AA^{-1} = A^{-1}A = I$. This requires $\det(A) \neq 0$.

$$A\,A^{-1} = A^{-1}A = I \iff \det(A) \neq 0$$

In Plain Terms

An invertible matrix is a transformation you can undo. If $\det(A) = 0$, the transformation squashes space into a lower dimension — there is no way back, and the matrix is singular.

Probability Theory

The mathematics of uncertainty — rigorously quantifying chance, belief, and the behaviour of random variables.

Foundations

The axioms and basic rules that all probability theory rests on.

Kolmogorov Axioms

Probability is a function $P$ from events to real numbers satisfying: non-negativity, the total probability of the sample space is 1, and countable additivity for disjoint events.

$$P(A) \geq 0, \quad P(\Omega) = 1, \quad P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i) \text{ if the } A_i \text{ are pairwise disjoint}$$

In Plain Terms

Probabilities are never negative, everything adds up to 1, and if two outcomes can't happen at the same time their individual chances just add. Three simple rules — all of probability theory follows from them.

Independence

Two events $A$ and $B$ are independent if knowing that one occurred gives no information about the other — their joint probability factors into the product of their marginals.

$$A \perp B \iff P(A,B) = P(A)\,P(B)$$

In Plain Terms

Flipping a fair coin and rolling a die are independent: the coin result tells you nothing about the die. If events are not independent, they are dependent, meaning that learning one occurred changes the probability of the other.

Law of Total Probability

If $\{B_i\}$ is a partition of the sample space, the probability of any event $A$ can be computed by summing its conditional probabilities over all cases, each weighted by the probability of that case.

$$P(A) = \sum_{i} P(A \mid B_i)\, P(B_i)$$

In Plain Terms

Break the problem into exhaustive, mutually exclusive scenarios, compute the probability of $A$ within each scenario, then average them weighted by how likely each scenario is.

Conditional Probability & Bayes' Theorem

How probabilities update in light of new information.

Conditional Probability

The probability of event $A$ given that event $B$ has occurred is the joint probability of both divided by the probability of the conditioning event.

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

In Plain Terms

You've learned that $B$ happened — now you're restricted to the world where $B$ is true. Conditional probability asks: given that restricted world, what fraction of it contains $A$?

Example: The probability of drawing a king given the card is a face card: $P(\text{king} \mid \text{face}) = \frac{4/52}{12/52} = \frac{1}{3}$.

Bayes' Theorem

Bayes' theorem inverts conditional probability — it lets you update a prior belief $P(A)$ in light of observed evidence $B$, yielding the posterior $P(A \mid B)$.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In Plain Terms

You start with a belief (prior). You observe evidence. Bayes' theorem tells you exactly how to revise your belief. It is the mathematical engine of rational learning and forms the backbone of Bayesian statistics and machine learning.

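Example: Suppose a medical test is 99% accurate, in the sense that it gives the correct result for 99% of people with the disease and 99% of people without it. If the disease is rare (1 in 1000), a positive result is still far more likely to be a false positive than a true positive: only about 9% of positives are real. Bayes' theorem quantifies this exactly.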
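The medical-test example can be worked through numerically. This sketch assumes that "99% accurate" means both 99% sensitivity and 99% specificity, which the text does not spell out; `posterior` and its parameter names are hypothetical.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_positive = sensitivity * prior                # P(B | A) P(A)
    false_positive = (1 - specificity) * (1 - prior)   # P(B | not A) P(not A)
    # The denominator P(B) comes from the law of total probability.
    return true_positive / (true_positive + false_positive)


# Assumed figures: prevalence 1 in 1000, 99% sensitivity, 99% specificity.
p = posterior(prior=0.001, sensitivity=0.99, specificity=0.99)
print(round(p, 4))  # roughly 0.09
```

Under these assumptions only about 9% of positive results correspond to actual disease, because healthy people vastly outnumber sick ones.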

Expectation & Variance

Summarising the centre and spread of random variables.

Expected Value

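The expected value (mean) of a random variable $X$ is the probability-weighted average of all possible outcomes, which is also the long-run average over many independent trials. For a continuous variable with density $p(x)$ it is an integral; for a discrete variable the integral becomes a sum over outcomes.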

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, p(x)\, dx$$

In Plain Terms

Roll a die many times and average the results — you approach 3.5. The expected value is that long-run average. It's the "centre of gravity" of the probability distribution.

Linearity: $\mathbb{E}[aX + b] = a\,\mathbb{E}[X] + b$ — a hugely useful property.
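For a discrete variable such as a die, the integral becomes a sum over outcomes. The sketch below assumes a fair six-sided die and uses exact fractions; `expected_value` is an illustrative helper that takes a mapping from outcome to probability.

```python
from fractions import Fraction


def expected_value(dist):
    """Probability-weighted average; dist maps each outcome to its probability."""
    return sum(x * p for x, p in dist.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}
mean = expected_value(die)
print(mean)  # 7/2

# Linearity check with a = 2, b = 1: transform each outcome, keep its probability.
a, b = 2, 1
transformed = {a * x + b: p for x, p in die.items()}
print(expected_value(transformed), a * mean + b)  # both equal 8
```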

Variance

Variance measures how spread out a distribution is — the expected squared deviation from the mean. The square root is the standard deviation.

$$\operatorname{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

In Plain Terms

Variance answers: "how unpredictable is this?" A fair die roll (variance $35/12 \approx 2.92$) has more variance than a fair coin flip mapped to 3 or 4 (variance $1/4$), even though both have mean 3.5. High variance means outcomes are widely scattered around the mean.
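The die-versus-coin comparison can be computed exactly from $\operatorname{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$. A minimal sketch, assuming a fair die and a fair coin; the helper names are made up for this example.

```python
from fractions import Fraction


def mean(dist):
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """E[X^2] - (E[X])^2 for a mapping of outcome -> probability."""
    mu = mean(dist)
    return sum(p * x * x for x, p in dist.items()) - mu * mu


die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {3: Fraction(1, 2), 4: Fraction(1, 2)}  # heads -> 3, tails -> 4

print(mean(die), variance(die))    # 7/2 35/12
print(mean(coin), variance(coin))  # 7/2 1/4
```

Both variables share the mean 3.5, but the die's outcomes are spread much more widely around it.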

Covariance & Correlation

Covariance measures how two random variables move together. Correlation normalises this to the range $[-1, 1]$ for easier interpretation.

$$\operatorname{Cov}[X,Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

$$\operatorname{Cor}[X,Y] = \frac{\operatorname{Cov}[X,Y]}{\sqrt{\operatorname{Var}[X]\,\operatorname{Var}[Y]}}$$

In Plain Terms

Positive covariance means that when $X$ is high, $Y$ tends to be high. Correlation of $+1$ means a perfect increasing linear relationship; $-1$ means a perfect decreasing linear relationship; $0$ means no linear relationship, although a non-linear dependence may still exist.
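The two formulas can be applied to a small paired dataset. This sketch treats each listed pair as equally likely, so expectations become plain averages; the data values and helper names are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """E[XY] - E[X]E[Y], treating the paired data as equally likely outcomes."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def correlation(xs, ys):
    # Var[X] is Cov[X, X], so one helper covers both.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1, 2, 3, 4]
ys_up = [2, 4, 6, 8]    # rises in lockstep with xs
ys_down = [8, 6, 4, 2]  # falls in lockstep with xs

print(covariance(xs, ys_up), correlation(xs, ys_up))      # 2.5 1.0
print(covariance(xs, ys_down), correlation(xs, ys_down))  # -2.5 -1.0
```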

Common Distributions

Named probability distributions that arise throughout science and statistics.

Bernoulli Distribution

The simplest distribution: a single trial with two outcomes — success (probability $p$) or failure (probability $1-p$).

$$X \sim \text{Bernoulli}(p), \quad \mathbb{E}[X] = p, \quad \operatorname{Var}[X] = p(1-p)$$

In Plain Terms

A biased coin flip. Heads with probability $p$, tails with probability $1-p$. Many other discrete models are built from it: for example, the binomial distribution counts successes across repeated independent Bernoulli trials.

Normal (Gaussian) Distribution

The bell curve — the most important distribution in statistics, arising naturally from sums of many independent random variables (Central Limit Theorem).

$$X \sim \mathcal{N}(\mu, \sigma^2), \quad p(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

In Plain Terms

Heights, measurement errors, and test scores are often approximately bell-shaped. The mean $\mu$ sets the centre; $\sigma$ (standard deviation) controls the width. About 68% of values fall within one standard deviation of the mean, and about 95% within two.
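The 68% figure can be reproduced from the normal cumulative distribution function. A short sketch using `statistics.NormalDist` from the Python standard library (3.8 or later); `probability_within` is a hypothetical helper name.

```python
from statistics import NormalDist


def probability_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


within_one = probability_within(1)  # about 0.683
within_two = probability_within(2)  # about 0.954
print(within_one, within_two)
```

The result does not depend on the particular $\mu$ and $\sigma$ chosen, since the interval is measured in standard deviations.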

Statistics

The science of collecting, analysing, and drawing conclusions from data — applying probability to the real world of samples and experiments.

Mean, Variance & Standard Deviation

Summarising a dataset with a handful of numbers.

Sample Mean & Variance

The sample mean $\bar{x}$ is the arithmetic average of observations. The sample variance $s^2$ measures their spread, using $n-1$ (Bessel's correction) for an unbiased estimate.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

In Plain Terms

Add up all values and divide by how many there are to get the mean. Variance tells you how spread out those values are. The standard deviation $s = \sqrt{s^2}$ is in the same units as the data — more intuitive than variance.
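The formulas above are short enough to implement directly and compare against the standard library. The dataset is made up for illustration; `sample_variance` is a hand-written helper, while `statistics.mean`, `statistics.variance`, and `statistics.stdev` are the standard-library equivalents (the latter two also divide by $n-1$).

```python
import statistics


def sample_variance(xs):
    """Unbiased sample variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

print(statistics.mean(data))      # 5.0
print(sample_variance(data))      # 32/7, about 4.571
print(statistics.variance(data))  # same value from the standard library
print(statistics.stdev(data))     # square root of the variance
```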

Hypothesis Testing

A formal framework for deciding whether data provides evidence against a default assumption.

Null & Alternative Hypothesis

The null hypothesis $H_0$ represents the default or "no effect" claim. The alternative $H_1$ is what we seek evidence for. We test whether data is surprising enough under $H_0$ to reject it.

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0$$

In Plain Terms

A court starts by assuming innocence (null hypothesis). Evidence either stays insufficient to convict or is strong enough to "reject" innocence. Statistics does the same: presume no effect, then see if the data says otherwise.

The p-value

The p-value is the probability of observing data at least as extreme as what was seen, assuming the null hypothesis is true. A small p-value is evidence against $H_0$.

$$p = P(\text{data as extreme or more} \mid H_0 \text{ true})$$

Reject $H_0$ if $p < \alpha$ (significance level, typically $0.05$)

In Plain Terms

If $p = 0.03$, then data at least this extreme would occur only 3% of the time if the null were true. That's unlikely enough to raise suspicion. The p-value is not the probability that the null is true, which is a common misconception.
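As one concrete instance, a two-sided p-value for $H_0: \mu = \mu_0$ can be computed with a z-test. This sketch assumes a known population standard deviation and a normally distributed sample mean, which the text does not specify; the numbers and the name `two_sided_p_value` are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu_0, sigma, n):
    """P(|Z| >= |z observed|) under H0, assuming known sigma (z-test)."""
    z = (sample_mean - mu_0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical study: H0 says mu = 100; 36 observations average 105, sigma = 15.
p = two_sided_p_value(sample_mean=105, mu_0=100, sigma=15, n=36)
print(round(p, 4))  # z = 2.0, p is about 0.0455
print(p < 0.05)     # True: reject H0 at the 5% level
```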

Confidence Intervals

Expressing estimation uncertainty as a range rather than a single number.

95% Confidence Interval

A 95% confidence interval is constructed so that 95% of intervals built this way (across repeated experiments) contain the true parameter. For large samples with known $\sigma$:

$$\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$

For 95%: $z_{0.025} \approx 1.96$

In Plain Terms

A 95% CI does not mean "there's a 95% chance the true value is in this interval." The true value is fixed — it is either in this interval or it isn't. The 95% describes the procedure: repeat the experiment many times, and 95% of those intervals will contain the truth.
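The interval formula with known $\sigma$ can be evaluated directly. A minimal sketch using `statistics.NormalDist().inv_cdf` to obtain $z_{\alpha/2}$; the sample figures and the function name `confidence_interval` are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(sample_mean, sigma, n, level=0.95):
    """x_bar +/- z_{alpha/2} * sigma / sqrt(n), assuming sigma is known."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical sample: mean 105 from 36 observations, known sigma = 15.
low, high = confidence_interval(sample_mean=105, sigma=15, n=36)
print(round(low, 2), round(high, 2))  # about 100.1 and 109.9
```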

Regression

Modelling the relationship between variables — predicting one from another.

Simple Linear Regression

Models the relationship between a response $y$ and a predictor $x$ as a straight line, estimated by minimising the sum of squared residuals (ordinary least squares).

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \hat{\beta}_1 = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}$$

In Plain Terms

Draw the "best fit" line through a scatter plot: the one that minimises the sum of the squared vertical distances between the points and the line. $\beta_0$ is where the line crosses the $y$-axis; $\beta_1$ is how steep it is.
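The slope estimate $\hat{\beta}_1 = \operatorname{Cov}(x,y) / \operatorname{Var}(x)$ can be computed in a few lines. The sketch also uses the standard least-squares intercept $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, which is not stated in the text above; `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # The ratio Cov(x, y) / Var(x); the 1/n (or 1/(n-1)) factors cancel.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # lies exactly on y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1)      # 1.0 2.0
```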

Set Theory & Logic

The formal foundations of mathematics — sets define collections, and logic provides the rules of valid inference.

Sets

Collections of objects — the atoms of modern mathematics.

Set Operations

The fundamental operations on sets — union, intersection, and complement — combine collections in well-defined ways.

$$A \cup B = \{x : x \in A \text{ or } x \in B\}$$

$$A \cap B = \{x : x \in A \text{ and } x \in B\}$$

$$A^c = \{x \in U : x \notin A\} \quad \text{(relative to a universal set } U\text{)}$$

In Plain Terms

Union is "or" — everything in either set. Intersection is "and" — only what's in both. Complement is "not" — everything outside the set. These mirror the logical connectives $\vee$, $\wedge$, $\neg$.

De Morgan's Laws: $(A \cup B)^c = A^c \cap B^c$ and $(A \cap B)^c = A^c \cup B^c$.
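Python's built-in `set` type mirrors these operations, which makes De Morgan's laws easy to spot-check. The universal set `U` and the sets `A` and `B` are arbitrary choices for this sketch; a check on one example illustrates the laws but does not prove them.

```python
U = set(range(10))  # hypothetical universal set
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}


def complement(s):
    """Everything in the universal set U that is not in s."""
    return U - s


union = A | B         # {1, 2, 3, 4, 5, 6}
intersection = A & B  # {3, 4}

# De Morgan's laws on this example.
print(complement(union) == complement(A) & complement(B))         # True
print(complement(intersection) == complement(A) | complement(B))  # True
```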

Subsets & Power Sets

$A \subseteq B$ means every element of $A$ is also in $B$. The power set $\mathcal{P}(A)$ is the set of all subsets of $A$ — if $|A| = n$, then $|\mathcal{P}(A)| = 2^n$.

$$A \subseteq B \iff \forall x:\, x \in A \Rightarrow x \in B$$

$$|\mathcal{P}(A)| = 2^{|A|}$$

In Plain Terms

The power set of $\{1, 2\}$ contains four subsets: $\emptyset$, $\{1\}$, $\{2\}$, $\{1,2\}$. Each element is either included or not — giving $2^n$ possibilities for $n$ elements.
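The $2^n$ count can be confirmed by enumerating subsets. A short sketch using `itertools.combinations`; `power_set` is an illustrative helper name.

```python
from itertools import combinations


def power_set(items):
    """All subsets of items, from the empty set up to the full set."""
    items = list(items)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]


subsets = power_set({1, 2})
print(subsets)                    # the four subsets of {1, 2}
print(len(power_set(range(5))))   # 32, i.e. 2 ** 5
```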

Logical Connectives

The building blocks of formal propositions — how to combine statements with precise meaning.

Propositional Connectives

Logical connectives combine propositions into compound statements. The five essential connectives are: negation, conjunction, disjunction, implication, and biconditional.

Symbol	Name	Meaning
$\neg P$	Negation	not $P$
$P \wedge Q$	Conjunction	$P$ and $Q$
$P \vee Q$	Disjunction	$P$ or $Q$
$P \Rightarrow Q$	Implication	if $P$ then $Q$
$P \iff Q$	Biconditional	$P$ if and only if $Q$

In Plain Terms

These are the grammar of mathematics. Every theorem is ultimately a statement built from these connectives, and every proof checks that the connectives hold between the premises and conclusion.

Key equivalence: $P \Rightarrow Q \equiv \neg P \vee Q$, so implication is equivalent to "not P, or Q."
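Because there are only four combinations of truth values, the equivalence between $P \Rightarrow Q$ and $\neg P \vee Q$ can be verified exhaustively. The sketch writes out the truth table of implication by hand in `IMPLIES` and compares it with Python's `not` and `or`; the names are invented for illustration.

```python
from itertools import product

# Truth table of implication: false only when P is true and Q is false.
IMPLIES = {
    (True, True): True,
    (True, False): False,
    (False, True): True,
    (False, False): True,
}


def matches_not_p_or_q():
    """Check P => Q against (not P) or Q for every truth assignment."""
    return all(IMPLIES[(p, q)] == ((not p) or q)
               for p, q in product([True, False], repeat=2))


print(matches_not_p_or_q())  # True
```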

Quantifiers

Expressing statements about all elements of a set, or the existence of at least one.

Universal & Existential Quantifiers

The universal quantifier $\forall$ ("for all") and the existential quantifier $\exists$ ("there exists") are the two quantifiers of first-order logic.

$$\forall x \in S:\, P(x) \quad \text{(}P\text{ holds for every element of }S\text{)}$$

$$\exists x \in S:\, P(x) \quad \text{(at least one element of }S\text{ satisfies }P\text{)}$$

In Plain Terms

"All swans are white" is a universal claim — one black swan disproves it. "There exists a prime larger than 1000" is existential — one example proves it. Understanding the difference is fundamental to reading and writing proofs.

Negation: $\neg(\forall x\, P(x)) \equiv \exists x\, \neg P(x)$ and $\neg(\exists x\, P(x)) \equiv \forall x\, \neg P(x)$.

Direct Proof & Proof by Contradiction

Two fundamental proof strategies: direct proof chains implications from hypotheses to conclusion; proof by contradiction assumes the negation of the conclusion and derives a contradiction.

Direct: $P \Rightarrow Q_1 \Rightarrow Q_2 \Rightarrow \cdots \Rightarrow Q$

Contradiction: assume $\neg Q$, derive $\bot$ (falsehood), conclude $Q$

In Plain Terms

Euclid proved that infinitely many primes exist by contradiction: assume there are only finitely many, multiply them all together, and add one. The result leaves remainder 1 when divided by any prime on the list, so it is either prime itself or has a prime factor missing from the list. Either way the list was incomplete, which is a contradiction. Therefore infinitely many primes exist.
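The key step of Euclid's argument can be tried on a concrete list. The sketch below pretends that the first six primes are all the primes there are; `smallest_prime_factor` is a simple trial-division helper written for this example, and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def smallest_prime_factor(n):
    """Smallest prime dividing n (n >= 2), found by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


primes = [2, 3, 5, 7, 11, 13]  # pretend this list is complete
candidate = prod(primes) + 1   # 30031

# Dividing by any listed prime leaves remainder 1.
print([candidate % p for p in primes])
# The candidate is not prime here, but its prime factors are new.
print(smallest_prime_factor(candidate))  # 59, which is not on the list
```

Here $30031 = 59 \times 509$, so the number is composite, yet neither factor appears in the assumed list, which is exactly the second case of the argument.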