Lagrangians for the Amnesiac

For a while I’ve been meaning to do some more advanced posts on optimization problems of all flavors. One technique that comes up over and over again is Lagrange multipliers, so this post is going to be a leisurely reminder of that technique. I often forget how to do these basic calculus-type things, so it’s good practice.

We will assume something about the reader’s knowledge, but it’s a short list: know how to operate with vectors and the dot product, know how to take a partial derivative, and know that in single-variable calculus the local maxima and minima of a differentiable function $ f(x)$ occur when the derivative $ f'(x)$ vanishes. All of the functions we’ll work with in this post will have infinitely many derivatives (i.e. smooth). So things will be nice.

The Gradient

The gradient of a multivariable function is the natural extension of the derivative of a single-variable function. If $ f(x_1, \dots, x_n)$ is a differentiable function, the data of the gradient of $ f$ consists of all of the partial derivatives $ \partial f / \partial x_i$. It’s usually written as a vector

$ \displaystyle \nabla f = \left ( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right )$

To make things easier for ourselves, we’ll just call $ f$ a function $ f(x)$ and understand $ x$ to be a vector in $ \mathbb{R}^n$.

We can also think of $ \nabla f$ as a function which takes in vectors and spits out vectors, by plugging in the input vector into each $ \partial f / \partial x_i$. And the reason we do this is because it lets us describe the derivative of $ f$ at a point as a linear map based on the gradient. That is, if we want to know how fast $ f$ is growing along a particular vector $ v$ and at a particular point $ (x, f(x))$, we can just take a dot product of $ v$ with $ \nabla f(x)$. I like to call dot products inner products, and use the notation $ \left \langle \nabla f(x), v \right \rangle$. Here $ v$ is a vector in $ \mathbb{R}^n$ which we think of as “tangent vectors” to the surface defined by $ f$. And if we scale $ v$ bigger or smaller, the value of the derivative scales with it (of course, because the derivative is a linear map!). Usually we use unit vectors to represent directions, but there’s no reason we have to. Calculus textbooks often require this to define a “directional derivative,” but perhaps it is better to understand the linear algebra over memorizing these arbitrary choices.

For example, let $ f(x,y,z) = xyz$. Then $ \nabla f = (yz, xz, xy)$, and $ \nabla f(1,2,1) = (2, 1, 2)$. Now if we pick a vector to go along, say, $ v = (0,-1,1)$, we get the derivative of $ f$ along $ v$ is $ \left \langle (2,1,2), (0,-1,1) \right \rangle = 1$.

As importantly as computing derivatives is finding where the derivative is zero, and the geometry of the gradient can help us here. Specifically, if we think of our function $ f$ as a surface sitting in $ \mathbb{R}^{n+1}$ (as in the picture below), it’s not hard to see that the gradient vector points in the direction of steepest ascent of $ f$. How do we know this? Well if you fix a point $ (x_1, \dots, x_n)$ and you’re forced to use a vector $ v$ of the same magnitude as $ \nabla f(x)$, how can you maximize the inner product $ \left \langle \nabla f(x), v \right \rangle$? Well, you just pick $ v$ to be equal to $ \nabla f(x)$, of course! This will turn the dot product into the square norm of $ \nabla f(x)$.

The gradient points in the direction of steepest ascent

The gradient points in the direction of steepest ascent. (image source)

More generally, the operation of an inner product $ \left \langle -, v \right \rangle$ is geometrically the size of the projection of the argument onto $ v$ (scaled by the size of $ v$), and projections of a vector $ w$ onto different directions than $ w$ can only be smaller in magnitude than $ w$. Another way to see this is to know the “alternative” formula for the dot product

$ \displaystyle \left \langle v,w \right \rangle = \left \| v \right \| \left \| w \right \| \cos(\theta)$

where $ \theta$ is the angle between the vectors (in $ \mathbb{R}^n$). We might not know how to get that angle, and in this post we don’t care, but we do know that $ \cos(\theta)$ is between -1 and 1. And so if $ v$ is fixed and we can’t change the norm of $ w$ but only its direction, we will maximize the dot product when the two vectors point in the same direction, when $ \theta$ is zero.

All of this is just to say that the gradient at a point can be interpreted as having a specific direction. It’s the direction of steepest ascent of the surface $ f$, and it’s size tells you how steep $ f$ is at that point. The opposite direction is the direction of steepest descent, and the orthogonal directions (when $ \theta = \pi /2$) have derivative zero.

Now what happens if we’re at a local minimum or maximum? Well it’s necessary that $ f$ is flat, and so by our discussion above the derivatives in all directions must be zero. It’s a basic linear algebra proof to show that this means the gradient is the zero vector. You can prove this by asking what sorts of vectors $ w$ have a dot product of zero with all other vectors $ v$?

Now once we have a local max or a local min, how do we tell which? The answer is actually a bit complicated, and it requires you to inspect the eigenvalues of the Hessian of $ f$. We won’t dally on eigenvalues except to explain the idea in brief: for an $ n$ variable function $ f$ the Hessian of $ f$ at $ x$ is an $ n$-by-$ n$ matrix where the $ i,j$ entry is the value of $ (\partial f / \partial x_i \partial x_j )(x)$. It just so turns out that if this matrix has only positive eigenvalues, then $ x$ is a local minimum. If the eigenvalues are all negative, it’s a local max. If some are negative and some are positive, then it’s a saddle point. And if zero is an eigenvalue then we’re screwed and can’t conclude anything without more work.

But all of this Hessian business isn’t particularly important for us, because most of our applications of the Lagrangian will work with functions where we already know that there is a unique global maximum or minimum. Finding where the gradient is zero is enough. As much as this author stresses the importance of linear algebra, we simply won’t need to compute any eigenvalues for this one.

What we will need to do is look at optimizing functions which are constrained by some equality conditions. This is where Lagrangians come into play.

Constrained by Equality

Often times we will want to find a minimum or maximum of a function $ f(x)$, but we will have additional constraints. The simplest kind is an equality constraint.

For example, we might want to find the maximum of the function $ f(x, y, z) = xyz$ requiring that the point $ (x,y,z)$ lies on the unit circle. One could write this in a “canonical form”

maximize $ xyz$
subject to $ x^2 + y^2 + z^2 = 1$

Way back in the scientific revolution, Fermat discovered a technique to solve such problems that was later generalized by Lagrange. The idea is to combine these constraints into one function whose gradient provides enough information to find a maximum. Clearly such information needs to include two things: that the gradient of $ xyz$ is zero, and that the constraint is satisfied.

First we rewrite the constraints as $ g(x,y,z) = x^2 + y^2 + x^2 – 1 = 0$, because when we’re dealing with gradients we want things to be zero. Then we form the Lagrangian of the problem. We’ll give a precise definition in a minute, but it looks like this:

$ L(x,y,z,\lambda) = xyz + \lambda(x^2 + y^2 + z^2 – 1)$

That is, we’ve added a new variable $ \lambda$ and added the two functions together. Let’s see what happens when we take a gradient:

$ \displaystyle \frac{\partial L}{\partial x} = yz + \lambda 2x$

$ \displaystyle \frac{\partial L}{\partial y} = xz + \lambda 2y$

$ \displaystyle \frac{\partial L}{\partial z} = xy + \lambda 2z$

$ \displaystyle \frac{\partial L}{\partial \lambda} = x^2 + y^2 + z^2 – 1$

Now if we require the gradient to be zero, the last equation is simply the original constraint, and the first three equations say that $ \nabla f (x,y,z) = \lambda \nabla g (x,y,z)$. In other words, we’re saying that the two gradients must point in the same direction for the function to provide a maximum. Solving for where these equations vanish gives some trivial solutions (one variable is $ \pm 1$ and the rest zero, and $ \lambda = 0$), and a solution defined by $ x^2 = y^2 = z^2 = 1/3$ which is clearly the maximal of the choices.

Indeed, this will work in general, and you can see a geometric and analytic proof in these notes.

Specifically, if we have an optimization problem defined by an objective function $ f(x)$ to optimize, and a set of $ k$ equality constraints $ g_i(x) = 0$, then we can form the Lagrangian

$ \displaystyle L(x, \lambda_1, \dots, \lambda_k) = f(x) + \sum_{i=1}^k \lambda_i g_i(x)$

And then a theorem of Lagrange is that all optimal solutions $ x^*$ to the problem satisfy $ \nabla L(x^*, \lambda_1, \dots, \lambda_k) = 0$ for some choice of $ \lambda _i$. But then you have to go solve the system and figure out which of the solutions gives you your optimum.


As it turns out, there are some additional constraints you can add to your problem to guarantee your system has a solution. One nice condition is that $ f(x)$ is convexA function is convex if any point on a line segment between two points $ (x,f(x))$ and $ (y,f(y))$ has a value greater than $ f$. In other words, for all $ 0 \leq t \leq 1$:

$ \displaystyle f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$

Some important examples of convex functions: exponentials, quadratics whose leading coefficient is positive, square norms of a vector variable, and linear functions.

Convex functions have this nice property that they have a unique local minimum value, and hence it must also be the global minimum. Why is this? Well if you have a local minimum $ x$, and any other point $ y$, then by virtue of being a local minimum there is some $ t$ sufficiently close to 1 so that:

$ \displaystyle f(x) \leq f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$

And rearranging we get

$ \displaystyle (1-t)f(x) \leq (1-t)f(y)$

So $ f(x) \leq f(y)$, and since $ y$ was arbitrary then $ x$ is the global minimum.

This alleviates our problem of having to sort through multiple solutions, and in particular it helps us to write programs to solve optimization problems: we know that techniques like gradient descent will never converge to a false local minimum.

That’s all for now! The next question we might shadowily ask: what happens if we add inequality constraints?

Rings — A Primer

Previously on this blog, we’ve covered two major kinds of algebraic objects: the vector space and the group. There are at least two more fundamental algebraic objects every mathematician should something know about. The first, and the focus of this primer, is the ring. The second, which we’ve mentioned briefly in passing on this blog, is the field. There are a few others important to the pure mathematician, such as the $ R$-module (here $ R$ is a ring). These do have some nice computational properties, but in order to even begin to talk about them we need to know about rings.

A Very Special Kind of Group

Recall that an abelian group $ (G, +)$ is a set $ G$ paired with a commutative binary operation $ +$, where $ G$ has a special identity element called 0 which acts as an identity for $ +$. The archetypal example of an abelian group is, of course, the integers $ \mathbb{Z}$ under addition, with zero playing the role of the identity element.

The easiest way to think of a ring is as an abelian group with more structure. This structure comes in the form of a multiplication operation which is “compatible” with the addition coming from the group structure.

Definition:ring $ (R, +, \cdot)$ is a set $ R$ which forms an abelian group under $ +$ (with additive identity 0), and has an additional associative binary operation $ \cdot$ with an element 1 serving as a (two-sided) multiplicative identity. Furthermore, $ \cdot$ distributes over $ +$ in the sense that for all $ x,y,z \in R$

$ x(y+z) = xy + xz$ and $ (y+z)x = yx + zx$

The most important thing to note is that multiplication is not commutative both in general rings and for most rings in practice. If multiplication is commutative, then the ring is called commutative. Some easy examples of commutative rings include rings of numbers like $ \mathbb{Z}, \mathbb{Z}/n\mathbb{Z}, \mathbb{Q}, \mathbb{R}$, which are just the abelian groups we know and love with multiplication added on.

If the reader takes anything away from this post, it should be the following:

Rings generalize arithmetic with integers.

Of course, this would imply that all rings are commutative, but this is not the case. More meaty and tempestuous examples of rings are very visibly noncommutative. One of the most important examples are rings of matrices. In particular, denote by $ M_n(\mathbb{R})$ the set of all $ n \times n$ matrices with real valued entries. This forms a ring under addition and multiplication of matrices, and has as a multiplicative identity the $ n \times n$ identity matrix $ I_n$.

Commutative rings are much more well-understood than noncommutative rings, and the study of the former is called commutative algebra. This is the main prerequisite for fields like algebraic geometry, which (in the simplest examples) associate commutative rings to geometric objects.

For us, all rings will have an identity, but many ring theorists will point out that one can just as easily define a ring to not have a multiplicative identity. We will call these non-unital rings, and will rarely, if ever, see them on this blog.

Another very important example of a concrete ring is the polynomial ring in $ n$ variables with coefficients in $ \mathbb{Q}$ or $ \mathbb{R}$. This ring is denoted with square brackets denoting the variables, e.g. $ \mathbb{R}[x_1, x_2, \dots , x_n]$. We rest assured that the reader is familiar with addition and multiplication of polynomials, and that this indeed forms a ring.

Kindergarten Math

Let’s start with some easy properties of rings. We will denote our generic ring by $ R$.

First, the multiplicative identity of a ring is unique. The proof is exactly the same as it was for groups, but note that identities must be two-sided for this to work. If $ 1, 1’$ are two identities, then $ 1 = 1 \cdot 1′ = 1’$.

Next, we prove that $ 0a = a0 = 0$ for all $ a \in R$. Indeed, by the fact that multiplication distributes across addition, $ 0a = (0 + 0)a = 0a + 0a$, and additively canceling $ 0a$ from both sides gives $ 0a = 0$. An identical proof works for $ a0$.

In fact, pretty much any “obvious property” from elementary arithmetic is satisfied for rings. For instance, $ -(-a) = a$ and $ (-a)b = a(-b) = -(ab)$ and $ (-1)^2 = 1$ are all trivial to prove. Here is a list of these and more properties which we invite the reader to prove.

Zero Divisors, Integral Domains, and Units

One thing that is very much not automatically given in the general theory of rings is multiplicative cancellation. That is, if I have $ ac = bc$ then it is not guaranteed to be the case that $ a = b$. It is quite easy to come up with examples in modular arithmetic on integers; if $ R = \mathbb{Z}/8\mathbb{Z}$ then $ 2*6 = 6*6 = 4 \mod 8$, but $ 2 \neq 6$.

The reason for this phenomenon is that many rings have elements that lack multiplicative inverses. In $ \mathbb{Z}/8\mathbb{Z}$, for instance, $ 2$ has no multiplicative inverse (and neither does 6). Indeed, one is often interested in determining which elements are invertible in a ring and which elements are not. In a seemingly unrelated issue, one is interested in determining whether one can multiply any given element $ x \in R$ by some $ y$ to get zero. It turns out that these two conditions are disjoint, and closely related to our further inspection of special classes of rings.

Definition: An element $ x$ of a ring $ R$ is said to be a left zero-divisor if there is some $ y \neq 0$ such that $ xy = 0$. Similarly, $ x$ is a right zero-divisor if there is a $ z$ for which $ zx = 0$. If $ x$ is a left and right zero-divisor (e.g. if $ R$ is commutative), it is just called a zero-divisor.

Definition: Let $ x,y \in R$. The element $ y$ is said to be a left inverse to $ x$ if $ yx = 1$, and a right inverse if $ xy = 1$. If there is some $ z \neq 0$ for which $ xz = zx = 1$, then $ x$ is said to be a two-sided inverse and $ z$ is called the inverse of $ x$, and $ x$ is called a unit.

As a quick warmup, we prove that if $ x$ has a left  and a right inverse then it has a two-sided inverse. Indeed, if $ zx = 1 = xy$, then $ z = z(xy) = (zx)y = y$, so in fact the left and right inverses are the same.

The salient fact here is that having a (left- or right-) inverse allows one to do (left- or right-) cancellation, since obviously when $ ac = bc$ and $ c$ has a right inverse, we can multiply $ acc^{-1} = bcc^{-1}$ to get $ a=b$. We will usually work with two-sided inverses and zero-divisors (since we will usually work in a commutative ring). But in non-commutative rings, like rings of matrices, one-sided phenomena do run rampant, and one must distinguish between them.

The right way to relate these two concepts is as follows. If $ c$ has a right inverse, then define the right-multiplication function $ (- \cdot c) : R \to R$ which takes $ x$ and spits out $ xc$. In fact, this function is an injection. Indeed, we already proved that (because $ c$ has a right inverse) if $ xc = yc$ then $ x = y$. In particular, there is a unique preimage of $ 0$ under this map. Since $ 0c = 0$ is always true, then it must be the case that the only way to left-multiply $ c$ times something to get zero is $ 0c$. That is, $ c$ is not a right zero-divisor if right-multiplication by $ c$ is injective. On the other hand, if the map is not injective, then there are some $ x \neq y$ such that $ xc = yc$, implying $ (x-y)c = 0$, and this proves that $ c$ is a right zero-divisor. We can do exactly the same argument with left-multiplication.

But there is one minor complication: what if right-multiplication is injective, but $ c$ has no inverses? It’s not hard to come up with an example: 2 as an element of the ring of integers $ \mathbb{Z}$ is a perfectly good one. It’s neither a zero-divisor nor a unit.

This basic study of zero-divisors gives us some natural definitions:

Definition: A division ring is a ring in which every element has a two-sided inverse.

If we allow that $ R$ is commutative, we get something even better (more familiar: $ \mathbb{Q}, \mathbb{R}, \mathbb{C}$ are the standard examples of fields).

Definition: field is a nonzero commutative division ring.

The “nonzero” part here is just to avoid the case when the ring is the trivial ring (sometimes called the zero ring) with one element. i.e., the set $ \left \{ 0 \right \}$ is a ring in which zero satisfies both the additive and multiplicative identities. The zero ring is excluded from being a field for silly reasons: elegant theorems will hold for all fields except the zero ring, and it would be messy to require every theorem to add the condition that the field in question is nonzero.

We will have much more to say about fields later on this blog, but for now let’s just state one very non-obvious and interesting result in non-commutative algebra, known as Wedderburn’s Little Theorem.

Theorem: Every finite divison ring is a field.

That is, simply having finitely many elements in a division ring is enough to prove that multiplication is commutative. Pretty neat stuff. We will actually see a simpler version of this theorem in a moment.

Now as we saw units and zero-divisors are disjoint, but not quite opposites of each other. Since we have defined a division ring as a ring where all (non-zero) elements are units, it is natural to define a ring in which the only zero-divisor is zero. This is considered a natural generalization of our favorite ring $ \mathbb{Z}$, hence the name “integral.”

Definition: An integral domain is a commutative ring in which zero is the only zero-divisor.

Note the requirement that the ring is commutative. Often we will simply call it a domain, although most authors allow domains to be noncommutative.

Already we can prove a very nice theorem:

Theorem: Every finite integral domain is a field.

Proof. Integral domains are commutative by definition, and so it suffices to show that every non-zero element has an inverse. Let $ R$ be our integral domain in question, and $ x \in R$ the element whose inverse we seek. By our discussion of above, right multiplication by $ x$ is an injective map $ R \to R$, and since $ R$ is finite this map must be a bijection. Hence $ x$ must have some $ y \neq 0$ so that $ yx = 1$. And so $ y$ is the inverse of $ x$.

$ \square$

We could continue traveling down this road of studying special kinds of rings and their related properties, but we won’t often use these ideas on this blog. We do think the reader should be familiar with the names of these special classes of rings, and we will state the main theorems relating them.

Definition: A nonzero element $ p \in R$ is called prime if whenever $ p$ divides a product $ ab$ it either divides $ a$ or $ b$ (or both). A unique factorization domain (abbreviated UFD) is an integral domain in which every element can be written uniquely as a product of primes.

Definition:Euclidean domain is a ring in which the division algorithm can be performed. That is, there is a norm function $ | \cdot | : R \ \left \{ 0 \right \} \to \mathbb{N}$, for which every pair $ a,b \neq 0$ can be written as $ a = bq + r$ with r satisfying $ |r| < |b|$.

Paolo Aluffi has a wonderful diagram showing the relations among the various special classes of integral domains. This image comes from his book, Algebra: Chapter 0, which is a must-have for the enterprising mathematics student interested in algebra.


In terms of what we have already seen, this diagram says that every field is a Euclidean domain, and in turn every Euclidean domain is a unique factorization domain. These are standard, but non-trivial theorems. We will not prove them here.

The two big areas in this diagram we haven’t yet mentioned on this blog are PIDs and Noetherian domains. The reason for that is because they both require a theory of ideals in rings (perhaps most briefly described as a generalization of the even numbers). We will begin next time with a discussion of ideals, and their important properties in studying rings, but before we finish we want to focus on one main example that will show up later on this blog.

Polynomial Rings

Let us formally define the polynomial ring.

Definition: Let $ R$ be a commutative ring. Define the ring $ R[x]$, to be the set of all polynomials in $ x$ with coefficients in $ R$, where addition and multiplication are the usual addition and multiplication of polynomials. We will often call $ R[x]$ the polynomial ring in one variable over $ R$.

We will often replace $ x$ by some other letter representing an “indeterminate” variable, such as $ t$, or $ y$, or multiple indexed variables as in the following definition.

Definition: Let $ R$ be a commutative ring. The ring $ R[x_1, x_2, \dots, x_n]$ is the set of all polynomials in the $ n$ variables with the usual addition and multiplication of polynomials.

What can we say about the polynomial ring in one variable $ R[x]$? It’s additive and multiplicative identities are clear: the constant 0 and 1 polynomials, respectively. Other than that, we can’t quite get much more. There are some very bizarre features of polynomial rings with bizarre coefficient rings, such as multiplication decreasing degree.

However, when we impose additional conditions on $ R$, the situation becomes much nicer.

Theorem: If $ R$ is a unique factorization domain, then so is $ R[x]$.

Proof. As we have yet to discuss ideals, we refer the reader to this proof, and recommend the reader return to it after our next primer.

$ \square$

On the other hand, we will most often be working with polynomial rings over a field. And here the situation is even better:

Theorem: If $ k$ is a field, then $ k[x]$ is a Euclidean domain.

Proof. The norm function here is precisely the degree of the polynomial (the highest power of a monomial in the polynomial). Then given $ f,g$, the usual algorithm for polynomial division gives a quotient and a remainder $ q, r$ so that $ f = qg + r$. In following the steps of the algorithm, one will note that all multiplication and division operations are performed in the field $ k$, and the remainder always has a smaller degree than the quotient. Indeed, one can explicitly describe the algorithm and prove its correctness, and we will do so in full generality in the future of this blog when we discuss computational algebraic geometry.

$ \square$

For multiple variables, things are a bit murkier. For instance, it is not even the case that $ k[x,y]$ is a euclidean domain. One of the strongest things we can say originates from this simple observation:

Lemma: $ R[x,y]$ is isomorphic to $ R[x][y]$.

We haven’t quite yet talked about isomorphisms of rings (we will next time), but the idea is clear: every polynomial in two variables $ x,y$ can be thought of as a polynomial in $ y$ where the coefficients are polynomials in $ x$ (gathered together by factoring out common factors of $ y^k$). Similarly, $ R[x_1, \dots, x_n]$ is the same thing as $ R[x_1, \dots, x_{n-1}][x_n]$ by induction. This allows us to prove that any polynomial ring is a unique factorization domain:

Theorem: If $ R$ is a UFD, so is $ R[x_1, \dots, x_n]$.

Proof. $ R[x]$ is a UFD as described above. By the lemma, $ R[x_1, \dots, x_n] = R[x_1, \dots, x_{n-1}][x_n]$ so by induction $ R[x_1, \dots, x_{n-1}]$ is a UFD implies $ R[x_1, \dots, x_n]$ is as well.

$ \square$

We’ll be very interested in exactly how to compute useful factorizations of polynomials into primes when we start our series on computational algebraic geometry. Some of the applications include robot motion planning, and automated theorem proving.

Next time we’ll visit the concept of an ideal, see quotient rings, and work toward proving Hilbert’s Nullstellensatz, a fundamental result in algebraic geometry.

Until then!

Introducing Categories

For a list of all the posts on Category Theory, see the Main Content page.

It is time for us to formally define what a category is, to see a wealth of examples. In our next post we’ll see how the definitions laid out here translate to programming constructs. As we’ve said in our soft motivational post on categories, the point of category theory is to organize mathematical structures across various disciplines into a unified language. As such, most of this post will be devote to laying down the definition of a category and the associated notation. We will be as clear as possible to avoid a notational barrier for newcomers, so if anything is unclear we will clarify it in the comments.

Definition of a Category

Let’s recall some examples of categories we’ve seen on this blog that serve to motivate the abstract definition of a category. We expect the reader to be comfortable with sets, and to absorb or glaze over the other examples as comfort dictates. The reader who is uncomfortable with sets and functions on sets should stop here. Instead, visit our primers on proof techniques, which doubles as a primer on set theory (or our terser primer on set theory from a two years ago).

The go-to example of a category is that of sets: sets together with functions between sets form a category. We will state exactly what this means momentarily, but first some examples of categories of “sets with structure” and “structure-preserving maps.”

Groups together with group homomorphisms form a category, as do rings and fields with their respective kinds of homomorphisms. Topological spaces together with continuous functions form a category, and metric spaces with distance-nonincreasing maps (“short” functions) form a sub-category. Vector spaces and linear maps, smooth manifolds and smooth maps, and algebraic varieties with rational maps all form categories. We could continue but the essential idea is clear: a category is some way to specify a collection of objects and “structure-preserving” mappings between those objects. There are three main features common to all of these examples:

  1. Composition of structure-preserving maps produces structure-preserving maps.
  2. Composition is associative.
  3. There is an identity map for each object.

The main abstraction is that forgetting what the objects and mappings are and only considering how they behave allows one to deviate from the examples above in useful ways. For instance, once we see the formal definition below, it will become clear that mathematical (say, first-order logical) statements, together with proofs of implication, form a category. Even though a “proof” isn’t strictly a structure-preserving map, it still fits with the roughly stated axioms above. One can compose proofs by laying the implications out one after another, this composition is trivially associative, and there is an identity proof. Thus, proofs provide a way to “transform” true statements into true statements, preserving the structure of boolean-valued truth.

Another example is the category of ML types and computable functions in ML. Computable functions can be quite wild in their behavior, but they are nevertheless composable, associative, and equipped with an identity.

And so the definition of a category seems to come as a natural consequence to think of all of these examples as special cases of one concept.

Before we state the definition we should note that, for abstruse technical reasons, we cannot phrase the definition of a category as a “set” of objects and mappings between objects. This is already impossible for the category of sets, because there is no “set of all sets.” Somehow (as illustrated by Russell’s paradox) there are “too many” sets to do this. Likewise, there is no “set” of all groups, topological spaces, vector spaces, etc.

This apparent difficulty requires some sidestepping. One possibility is to define a universe of non-paradoxical sets, and define categories by way of a universe. Another is to define a class, which bypasses set theory in another way. We won’t deliberate on the differences between these methods of avoiding Russell’s paradox. The reader need only know that it can be done. For our official definition, we will use the terminology of classes.

Definition: A category $ \textbf{C}$ consists of the following data:

  • A class of objects, denoted $ \textup{Obj}(\mathbf{C})$.
  • For each pair of objects $ A,B$, a set $ \textup{Hom}(A,B)$ of morphisms. Sometimes these are called hom-sets.

The morphisms satisfy the following conditions:

  • For all objects $ A,B,C$ and morphisms $ f \in \textup{Hom}(A,B), g \in \textup{Hom}(B,C)$ there is a composition operation $ \circ$ and $ g \circ f \in \textup{Hom}(A,C)$ is a morphism. We will henceforth drop the $ \circ$ and write $ gf$.
  • Composition is associative.
  • For all objects $ A$, there is an identity morphism $ 1_A \in \textup{Hom}(A,A)$. For all $ A,B$, and $ f \in \textup{Hom}(A,B)$, we have $ f 1_A = f$ and $ 1_B f = f$.

Some additional notation and terminology: we denote a morphism $ f \in \textup{Hom}(A,B)$ in three ways. Aside from “as an element of a set,” the most general way is as a diagram,

$ \displaystyle A \xrightarrow{\hspace{.5cm} f \hspace{.5cm}} B$

Although we will often shorten a single morphism to the standard function notation $ f: A \to B$. Given a morphism $ f: A \to B$, we call $ A$ the source of $ f$, and $ B$ the target of $ f$.

We will often name our categories, and we will do so in bold. So the category of sets with set-functions is denoted $ \mathbf{Set}$. When working with multiple categories, we will give subscripts on the hom-sets to avoid ambiguity, as in $ \textup{Hom}_{\mathbf{Set}}(A,B)$. A morphism from an object to itself is called an endomorphism, and the set of all endomorphisms of an object $ A$ is denoted $ \textup{End}_{\mathbf{C}}(A)$.

Note that in the definition above we require that morphisms form a set. This is important because hom-sets will have additional algebraic structure (e.g., for certain categories $ \textup{End}_{\mathbf{C}}(A)$ will form a ring).

Examples of Categories

Lets now formally define some of our simple examples of categories. Defining a category amounts to specifying the objects and morphisms, and verifying the conditions in the definition hold.


Define the category $ \mathbf{Set}$ whose objects are all sets, and whose morphisms are functions on sets. By now it should hopefully be clear that sets form a category, but let us go through the motions explicitly. Every set $ A$ has an identity function $ 1_A(x) = x$, and as we already know, functions on sets compose associatively. To verify this in complete detail would amount to writing down a general function as a relation, and using the definitions from elementary set theory. We have more pressing matters, but a reader who has not seen this before should consult our set theory primer. One can also define the category of finite sets $ \mathbf{FinSet}$ whose objects are finite sets and whose morphisms are again set-functions. As every object and morphism of $ \mathbf{FinSet}$ is one of $ \mathbf{Set}$, we call the former a subcategory of the latter.

Finite categories

The most trivial possible categories are those with only finitely many objects. For instance, define the category $ \mathbf{1}$ to have a single object $ \ast$ and a single morphism $ 1_{\ast}$, which must of course be the identity. The composition $ 1_{\ast} 1_{\ast}$ is forced to be $ 1_{\ast}$, and so this is a (rather useless) category. One can also imagine a category $ \mathbf{2}$ which has one non-identity morphism $ A \to B$ as well as examples of categories with any number of finite objects. Nothing interesting is going on here; we just completely specify the structure of the category object by object and morphism by morphism. Sometimes they come in handy, though.

Poset Categories

Here is an elementary example of a category that is nonetheless fundamental to modern discussions in topology and algebraic geometry. It will show up again in our work with persistent homology. Let $ X$ be any set, and define the category $ \mathbf{X}_\subset$ as follows. The objects of $ \mathbf{X}_\subset$ are all the subsets of $ X$. If $ A \subset B$ are subsets of $ X$, then the set $ \textup{Hom}(A,B)$ is defined to be the (unique, unnamed) singleton $ A \to B$. Otherwise, $ \textup{Hom}(A,B)$ is the empty set. Identities exist since every set is a subset of itself. The property of being a subset is transitive, so composition of morphisms makes sense and associativity is trivial since there is at most one morphism between any two objects. If the reader doesn’t believe this, we state what composition is rigorously: define each morphism as a pair $ (A,B)$ and define the composition operation as a set-function $ \textup{Hom}(A,B) \times \textup{Hom}(B,C) \to \textup{Hom}(A,C)$, which sends $ ((A,B), (B,C)) \mapsto (A,C)$. Because this operation is so trivial, it seems more appropriate to state it with a diagram:

first-diagramWe say this diagram commutes if all ways to compose morphisms (travel along arrows) have equal results. That is, in the diagram above, we assert that the morphism $ A \to B$ composed with the morphism $ B \to C$ is exactly the one morphism $ A \to C$. Usually one must prove that a diagram commutes, but in this case we are defining the composition operation so that commutativity holds. The reader can now directly verify that composition is associative:

$ (a,b)((b,c)(c,d)) = (a,b)(b,d) = (a,d) = (a,c)(c,d) = ((a,b)(b,c))(c,d)$

More generally, it is not hard to see how any transitive reflexive relation on a set (including partial orders) can be used to form a category: objects are elements of the set, and morphisms are unique arrows which exist when the objects are (asymmetrically) related. The subset category above is a special case where the set in question is the power set of $ X$, and the relation is $ \subset$. The familiar reader should note that the most prominent example of this in higher mathematics is to have $ X$ be the topology of a topological space (the set of open subsets).


Next, define the category $ \mathbf{Grp}$ whose objects are groups and whose morphisms are group homomorphisms. Recall briefly that a group is a set $ G$ endowed with a sensible (associative) binary operation denoted by multiplication, which has an identity and with respect to which every element has an inverse. For the uninitiated reader, just replace any abstract “group” by the set of all nonzero rational numbers with usual multiplication. It will suffice for this example.

A group homomorphism is a set-function $ f: A \to B$ which satisfies $ f(xy) = f(x)f(y)$ (here the binary operations on the left and right side of the equal sign are the operations in the two respective groups). Being set-functions, group homomorphisms are composable as functions and associatively so. Given any homomorphism $ g: B \to C$, we need to show that the composite is again a homomorphism:

$ \displaystyle gf(xy) = g(f(xy)) = g(f(x)f(y)) = g(f(x))g(f(y)) = gf(x) gf(y)$

Note that there are three multiplication operations floating around in this equation. Groups (as all sets) have identity maps, and the identity map is a perfectly good group homomorphism. This verifies that $ \mathbf{Grp}$ is indeed a category. While we could have stated all of these equalities via commutative diagrams, the pictures are quite large and messy so we will avoid them until later.

A similar derivation will prove that rings form a category, as do vector spaces, topological spaces, and fields. We are unlikely to use these categories in great detail in this series, so we refrain from giving them names for now.

One special example of a category of groups is the category $ \mathbf{Ab}$ of abelian groups (for which the multiplication operation is commutative). This category shows up as the prototypical example of a so-called “abelian category,” which means it has enough structure to do homology.


In a more discrete domain, define by $ \mathbf{Graph}$ as follows. The objects in this category are triples $ (V,E, \varphi)$ where $ \varphi: E \to V \times V$ represents edge adjacency. We will usually supress $ \varphi$ by saying vertices $ v,w$ are adjacent instead of $ \varphi(e) = (v,w)$. The morphisms in this category are graph homomorphisms. That is, if $ G = (V,E,\varphi)$ and $ G’ = (V’, E’,\varphi’)$, then $ f \in \textup{Hom}_{\mathbf{Graph}}(G,G’)$ is a pair of set functions on $ f_V: V \to V’, f_E: E \to E’$, satisfying the following commutative diagram. Here we denote by $ (f,f)$ the map which sends $ (u,v) \to (f(u), f(v))$.


This diagram is quite a mouthful, but in words it requires that whenever $ v,w$ are adjacent in $ G$, then $ f(v), f(w)$ are adjacent in $ G’$. Rewriting the diagram as an equation more explicitly, we are saying that if $ e \in E$ is an edge with $ \varphi(e) = (v,w)$, then it must be the case that $ (f_V(v), f_V(w)) = \varphi'(f_E(e))$. This is how one “preserves” the structure of a graph.

To prove this is a category, we can observe that composition makes sense: given two pairs

$ \displaystyle f_V:V \to V’$
$ \displaystyle f_E: E \to E’$


$ \displaystyle g_{V’}: V’ \to V”$
$ \displaystyle g_{E’}: E’ \to E”$

we can compose each morphism individually, getting $ gf_V: V \to V”$ and $ gf_E: E \to E”$, and the following hefty-looking commutative diagram.


Let’s verify commutativity together: we already know the two squares on the left and right commute (by hypothesis, they are morphisms in this category). So we have two things left to check, that the three ways to get from $ E$ to $ V” \times V”$ are all the same. This amounts to verifying the two equalities

$ \displaystyle (g_{V’}, g_{V’}) (f_V, f_V) \varphi = (g_{V’}, g_{V’}) \varphi’ f_E = \varphi” g_{E’} f_E$

But indeed, going left to right in the above equation, each transition from one expression to another only swaps two morphisms within one of the commutative squares. In other words, the first equality is already enforced by the commutativity of the left-hand square, and the second by the right. We are literally only substituting what we already know to be equal!

If it feels like we didn’t actually do any work there (aside from unravelling exactly what the diagrams mean), then you’re already starting to see the benefits of category theory. It can often feel like a cycle: commutative diagrams make it easy to argue about other commutative diagrams, and one can easily get lost in the wilderness of arrows. But more often than not, devoting a relatively small amount of time up front to show one diagram commutes will make a wealth of facts and theorems follow with no extra work. This is part of the reason category theory is affectionately referred to as “abstract nonsense.”

Diagram Categories

Speaking of abstract nonsense: next up is a very abstract example, but thinking about it will reinforce the ideas we’ve put forth in this post while giving a sneak peek to our treatment of universal properties. Fix an object $ A$ in an arbitrary category $ \mathbf{C}$. Define the category $ \mathbf{C}_A$ whose objects are morphisms with source $ A$. That is, an object of $ \mathbf{C}_A$ looks like

vert-arrowWhere $ B$ ranges over all objects of $ \mathbf{C}$. The morphisms of $ \mathbf{C}_A$ are commutative diagrams of the form


where as we said earlier the stipulation asserted by the diagram is that $ f’ = gf$. Let’s verify that the axioms for a category hold. Suppose we have two such commutative diagrams with matching target and source, say,


Note that the arrows $ g: A \to C$ must match up in both diagrams, or else composition does not make sense! Then we can combine them into a single commutative diagram:


If it is not obvious that this diagram commutes, we need only write down the argument explicitly: by the first diagram $ \beta f = g$ and $ \gamma g = h$, and piecing these together we have $ \gamma \beta f = \gamma g = h$. Associativity of this “piecing together” follows from the associativity of the morphisms in $ \mathbf{C}$. The identity morphism is a diagram whose two “legs” are both $ f: A \to B$ and whose connecting morphism is just $ 1_B$.

This kind of “diagram” category is immensely important, and we will revisit it and many variants of it in the future. A quick sneak peek: this category is closely related to the universal properties of polynomial rings, free objects, and limits.

As a slight generalization, you can define a category whose objects consist of pairs of morphisms with a shared source (or target), i.e.,


We challenge the reader to write down what a morphism of this category would look like, and prove the axioms for a category hold (it’s not hard, but a bit messy). We will revisit categories like this one in our post on universal properties; this particular one is intimately related to products.

Next Time

Next time we’ll jump straight into some code, and realize the definition of a category as a type in ML. We’ll see how some of our examples of categories here can be implemented using the code, and inspect the pro’s and con’s of the computational version of our definition.

Probabilistic Bounds — A Primer

Probabilistic arguments are a key tool for the analysis of algorithms in machine learning theory and probability theory. They also assume a prominent role in the analysis of randomized and streaming algorithms, where one imposes a restriction on the amount of storage space an algorithm is allowed to use for its computations (usually sublinear in the size of the input).

While a whole host of probabilistic arguments are used, one theorem in particular (or family of theorems) is ubiquitous: the Chernoff bound. In its simplest form, the Chernoff bound gives an exponential bound on the deviation of sums of random variables from their expected value.

This is perhaps most important to algorithm analysis in the following mindset. Say we have a program whose output is a random variable $ X$. Moreover suppose that the expected value of $ X$ is the correct output of the algorithm. Then we can run the algorithm multiple times and take a median (or some sort of average) across all runs. The probability that the algorithm gives a wildly incorrect answer is the probability that more than half of the runs give values which are wildly far from their expected value. Chernoff’s bound ensures this will happen with small probability.

So this post is dedicated to presenting the main versions of the Chernoff bound that are used in learning theory and randomized algorithms. Unfortunately the proof of the Chernoff bound in its full glory is beyond the scope of this blog. However, we will give short proofs of weaker, simpler bounds as a straightforward application of this blog’s previous work laying down the theory.

If the reader has not yet intuited it, this post will rely heavily on the mathematical formalisms of probability theory. We will assume our reader is familiar with the material from our first probability theory primer, and it certainly wouldn’t hurt to have read our conditional probability theory primer, though we won’t use conditional probability directly. We will refrain from using measure-theoretic probability theory entirely (some day my colleagues in analysis will like me, but not today).

Two Easy Bounds of Markov and Chebyshev

The first bound we’ll investigate is almost trivial in nature, but comes in handy. Suppose we have a random variable $ X$ which is non-negative (as a function). Markov’s inequality is the statement that, for any constant $ a > 0$,

$ \displaystyle \textup{P}(X \geq a) \leq \frac{\textup{E}(X)}{a}$

In words, the probability that $ X$ grows larger than some fixed constant is bounded by a quantity that is inversely proportional to the constant.

The proof is quite simple. Let $ \chi_a$ be the indicator random variable for the event that $ X \geq a$ ($ \chi_a = 1$ when $ X \geq a$ and zero otherwise). As with all indicator random variables, the expected value of $ \chi_a$ is the probability that the event happens (if this is mysterious, use the definition of expected value). So $ \textup{E}(\chi_a) = \textup{P}(X \geq a)$, and linearity of expectation allows us to include a factor of $ a$:

$ \textup{E}(a \chi_a) = a \textup{P}(X \geq a)$

The rest of the proof is simply the observation that $ \textup{E}(a \chi_a) \leq \textup{E}(X)$. Indeed, as random variables we have the inequality $ a \chi_a \leq X$. Whenever $ X < a$, the value of $ a \chi_a = 0$ while $ X$ is nonnegative by definition. And whenever $ a \leq X$,the value of $ a \chi_a = a$ while $ X$ is by assumption at least $ a$. It follows that $ \textup{E}(a \chi_a) \leq \textup{E}(X)$.

This last point is a simple property of expectation we omitted from our first primer. It usually goes by monotonicity of expectation, and we prove it here. First, if $ X \geq 0$ then $ \textup{E}(X) \geq 0$ (this is trivial). Second, if $ 0 \leq X \leq Y$, then define a new random variable $ Z = Y-X$. Since $ Z \geq 0$ and using linearity of expectation, it must be that $ \textup{E}(Z) = \textup{E}(Y) – \textup{E}(X) \geq 0$. Hence $ \textup{E}(X) \leq \textup{E}(Y)$. Note that we do require that $ X$ has a finite expected value for this argument to work, but if this is not the case then Markov’s inequality is nonsensical anyway.

Markov’s inequality by itself is not particularly impressive or useful. For example, if $ X$ is the number of heads in a hundred coin flips, Markov’s inequality ensures us that the probability of getting at least 99 heads is at most 50/99, which is about 1/2. Shocking. We know that the true probability is much closer to $ 2^{-100}$, so Markov’s inequality is a bust.

However, it does give us a more useful bound as a corollary. This bound is known as Chebyshev’s inequality, and its use is sometimes referred to as the second moment method because it gives a bound based on the variance of a random variable (instead of the expected value, the “first moment”).

The statement is as follows.

Chebyshev’s Inequality: Let $ X$ be a random variable with finite expected value and positive variance. Then we can bound the probability that $ X$ deviates from its expected value by a quantity that is proportional to the variance of $ X$. In particular, for any $ \lambda > 0$,

$ \displaystyle \textup{P}(|X – \textup{E}(X)| \geq \lambda) \leq \frac{\textup{Var}(X)}{\lambda^2}$

And without any additional assumptions on $ X$, this bound is sharp.

Proof. The proof is a simple application of Markov’s inequality. Let $ Y = (X – \textup{E}(X))^2$, so that $ \textup{E}(Y) = \textup{Var}(X)$. Then by Markov’s inequality

$ \textup{P}(Y \geq \lambda^2) \leq \frac{\textup{E}(Y)}{\lambda^2}$

Since $ Y$ is nonnegative $ |X – \textup{E}(X)| = \sqrt(Y)$, and $ \textup{P}(Y \geq \lambda^2) = \textup{P}(|X – \textup{E}(X)| \geq \lambda)$. The theorem is proved. $ \square$

Chebyshev’s inequality shows up in so many different places (and usually in rather dry, technical bits), that it’s difficult to give a good example application.  Here is one that shows up somewhat often.

Say $ X$ is a nonnegative integer-valued random variable, and we want to argue about when $ X = 0$ versus when $ X > 0$, given that we know $ \textup{E}(X)$. No matter how large $ \textup{E}(X)$ is, it can still be possible that $ \textup{P}(X = 0)$ is arbitrarily close to 1. As a colorful example, let $ X$ is the number of alien lifeforms discovered in the next ten years. We might debate that $ \textup{E}(X)$ can arbitrarily large: if some unexpected scientific and technological breakthroughs occur tomorrow, we could discover an unbounded number of lifeforms. On the other hand, we are very likely not to discover any, and probability theory allows for such a random variable to exist.

If we know everything about $ \textup{Var}(X)$, however, we can get more informed bounds.

Theorem: If $ \textup{E}(X) \neq 0$, then $ \displaystyle \textup{P}(X = 0) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}$.

Proof. Simply choose $ \lambda = \textup{E}(X)$ and apply Chebyshev’s inequality.

$ \displaystyle \textup{P}(X = 0) \leq \textup{P}(|X – \textup{E}(X)| \geq \textup{E}(X)) \leq \frac{\textup{Var}(X)}{\textup{E}(X)^2}$

The first inequality follows from the fact that the only time $ X$ can ever be zero is when $ |X – \textup{E}(X)| = \textup{E}(X)$, and $ X=0$ only accounts for one such possibility. $ \square$

This theorem says more. If we know that $ \textup{Var}(X)$ is significantly smaller than $ \textup{E}(X)^2$, then $ X > 0$ is more certain to occur. More precisely, and more computationally minded, suppose we have a sequence of random variables $ X_n$ so that $ \textup{E}(X_n) \to \infty$ as $ n \to \infty$. Then the theorem says that if $ \textup{Var}(X_n) = o(\textup{E}(X_n)^2)$, then $ \textup{P}(X_n > 0) \to 1$. Remembering one of our very early primers on asymptotic notation, $ f = o(g)$ means that $ f$ grows asymptotically slower than $ g$, and in terms of this fraction $ \textup{Var}(X) / \textup{E}(X)^2$, this means that the denominator dominates the fraction so that the whole thing tends to zero.

The Chernoff Bound

The Chernoff bound takes advantage of an additional hypothesis: our random variable is a sum of independent coin flips. We can use this to get exponential bounds on the deviation of the sum. More rigorously,

Theorem: Let $ X_1 , \dots, X_n$ be independent random $ \left \{ 0,1 \right \}$-valued variables, and let $ X = \sum X_i$. Suppose that $ \mu = \textup{E}(X)$. Then the probability that $ X$ deviates from $ \mu$ by more than a factor of $ \lambda > 0$ is bounded from above:

$ \displaystyle \textup{P}(X > (1+\lambda)\mu) \leq \frac{e^{\lambda \mu}}{(1+\lambda)^{(1+\lambda)\mu}}$

The proof is beyond the scope of this post, but we point the interested reader to these lecture notes.

We can apply the Chernoff bound in an easy example. Say all $ X_i$ are fair coin flips, and we’re interested in the probability of getting more than 3/4 of the coins heads. Here $ \mu = n/2$ and $ \lambda = 1/2$, so the probability is bounded from above by

$ \displaystyle \left ( \frac{e}{(3/2)^3} \right )^{n/4} \approx \frac{1}{5^n}$

So as the number of coin flips grows, the probability of seeing such an occurrence diminishes extremely quickly to zero. This is important because if we want to test to see if, say, the coins are biased toward flipping heads, we can simply run an experiment with $ n$ sufficiently large. If we observe that more than 3/4 of the flips give heads, then we proclaim the coins are biased and we can be assured we are correct with high probability. Of course, after seeing 3/4 of more heads we’d be really confident that the coin is biased. A more realistic approach is to define some $ \varepsilon$ that is small enough so as to say, “if some event occurs whose probability is smaller than $ \varepsilon$, then I call shenanigans.” Then decide how many coins and what bound one would need to make the bad event have probability approximately $ \varepsilon$. Finding this balance is one of the more difficult aspects of probabilistic algorithms, and as we’ll see later all of these quantities are left as variables and the correct values are discovered in the course of the proof.

Chernoff-Hoeffding Inequality

The Hoeffding inequality (named after the Finnish statistician, Wassily Høffding) is a variant of the Chernoff bound, but often the bounds are collectively known as Chernoff-Hoeffding inequalities. The form that Hoeffding is known for can be thought of as a simplification and a slight generalization of Chernoff’s bound above.

Theorem: Let $ X_1, \dots, X_n$ be independent random variables whose values are within some range $ [a,b]$. Call $ \mu_i = \textup{E}(X_i)$, $ X = \sum_i X_i$, and $ \mu = \textup{E}(X) = \sum_i \mu_i$. Then for all $ t > 0$,

$ \displaystyle \textup{P}(|X – \mu| > t) \leq 2e^{-2t^2 / n(b-a)^2}$

For example, if we are interested in the sum of $ n$ rolls of a fair six-sided die, then the probability that we deviate from $ (7/2)n$ by more than $ 5 \sqrt{n \log n}$ is bounded by $ 2e^{(-2 \log n)} = 2/n^2$. Supposing we want to know how many rolls we need to guarantee with probability 0.01 that we don’t deviate too much, we just do the algebra:

$ 2n^{-2} < 0.01$
$ n^2 > 200$
$ n > \sqrt{200} \approx 14$

So with 15 rolls we can be confident that the sum of the rolls will lie between 20 and 85. It’s not the best possible bound we could come up with, because we’re completely ignoring the known structure on dice rolls (that they follow a uniform distribution!). The benefit is that it’s a quick and easy bound that works for any kind of random variable with that expected value.

Another version of this theorem concerns the average of the $ X_i$, and is only a minor modification of the above.

Theorem: If $ X_1, \dots, X_n$ are as above, and $ X = \frac{1}{n} \sum_i X_i$, with $ \mu = \frac{1}{n}(\sum_i \mu_i)$, then for all $ t > 0$, we get the following bound

$ \displaystyle \textup{P}(|X – \mu| > t) \leq 2e^{-2nt^2/(b-a)^2}$

The only difference here is the extra factor of $ n$ in the exponent. So the deviation is exponential both in the amount of deviation ($ t^2$), and in the number of trials.

This theorem comes up very often in learning theory, in particular to prove Boosting works. Mathematicians will joke about how all theorems in learning theory are just applications of Chernoff-Hoeffding-type bounds. We’ll of course be seeing it again as we investigate boosting and the PAC-learning model in future posts, so we’ll see the theorems applied to their fullest extent then.

Until next time!