For a while I’ve been meaning to do some more advanced posts on optimization problems of all flavors. One technique that comes up over and over again is Lagrange multipliers, so this post is going to be a leisurely reminder of that technique. I often forget how to do these basic calculus-type things, so it’s good practice.
We will assume something about the reader’s knowledge, but it’s a short list: know how to operate with vectors and the dot product, know how to take a partial derivative, and know that in single-variable calculus the local maxima and minima of a differentiable function occur when the derivative
vanishes. All of the functions we’ll work with in this post will have infinitely many derivatives (i.e. smooth). So things will be nice.
The Gradient
The gradient of a multivariable function is the natural extension of the derivative of a single-variable function. If $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, the data of the gradient of $f$ consists of all of the partial derivatives $\partial f / \partial x_i$. It's usually written as a vector
$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$$
To make things easier for ourselves, we'll just call $f$ a function $f(x_1, \dots, x_n)$ and understand $\nabla f$ to be a vector in $\mathbb{R}^n$.
We can also think of $\nabla f$ as a function $\mathbb{R}^n \to \mathbb{R}^n$ which takes in vectors and spits out vectors, by plugging the input vector into each partial derivative $\partial f / \partial x_i$. And the reason we do this is that it lets us describe the derivative of $f$ at a point $p$ as a linear map based on the gradient. That is, if we want to know how fast $f$ is growing along a particular vector $v$ at a particular point $p$, we can just take the dot product of $v$ with $\nabla f(p)$. I like to call dot products inner products, and use the notation $\langle \nabla f(p), v \rangle$. Here $v$ is a vector in $\mathbb{R}^n$ which we think of as a “tangent vector” to the surface defined by $f$. And if we scale $v$ bigger or smaller, the value of the derivative scales with it (of course, because the derivative is a linear map!). Usually we use unit vectors to represent directions, but there’s no reason we have to. Calculus textbooks often require this to define a “directional derivative,” but perhaps it is better to understand the linear algebra than to memorize these arbitrary choices.
For example, let $f(x, y, z) = x^2 y + z^3$. Then $\nabla f = (2xy, x^2, 3z^2)$, and $\nabla f(1, 2, 1) = (4, 1, 3)$. Now if we pick a vector to go along, say, $v = (0, 1, 1)$, we get that the derivative of $f$ along $v$ is $\langle \nabla f(1, 2, 1), v \rangle = 0 \cdot 4 + 1 \cdot 1 + 1 \cdot 3 = 4$.
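If you like to double-check this sort of computation with a computer, here is a quick sketch using sympy (any computer algebra system would do; the function, point, and direction are just small illustrative choices):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = x**2 * y + z**3

# The gradient is just the vector of partial derivatives.
grad_f = sp.Matrix([sp.diff(f, var) for var in (x, y, z)])

# Evaluate it at the point p = (1, 2, 1).
grad_at_p = grad_f.subs({x: 1, y: 2, z: 1})   # (4, 1, 3)

# The derivative of f along v is the inner product <grad f(p), v>.
v = sp.Matrix([0, 1, 1])
print(grad_at_p.dot(v))   # 0*4 + 1*1 + 1*3 = 4
```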
Just as important as computing derivatives is finding where the derivative is zero, and the geometry of the gradient can help us here. Specifically, if we think of our function as a surface sitting in $\mathbb{R}^{n+1}$ (as in the picture below), it’s not hard to see that the gradient vector points in the direction of steepest ascent of $f$. How do we know this? Well, if you fix a point $p$ and you’re forced to use a vector $v$ of the same magnitude as $\nabla f(p)$, how can you maximize the inner product $\langle \nabla f(p), v \rangle$? You just pick $v$ to be equal to $\nabla f(p)$, of course! This turns the dot product into the squared norm of $\nabla f(p)$.

The gradient points in the direction of steepest ascent. (image source)
More generally, the inner product $\langle \nabla f(p), v \rangle$ is geometrically the size of the projection of $v$ onto $\nabla f(p)$ (scaled by the size of $\nabla f(p)$), and the projection of a vector onto any direction other than its own can only be smaller in magnitude than the vector itself. Another way to see this is to know the “alternative” formula for the dot product,
$$\langle u, v \rangle = \|u\| \, \|v\| \cos(\theta)$$
where $\theta$ is the angle between the vectors (in the plane they span). We might not know how to get that angle, and in this post we don’t care, but we do know that $\cos(\theta)$ is between -1 and 1. And so if $\nabla f(p)$ is fixed and we can’t change the norm of $v$ but only its direction, we will maximize the dot product when the two vectors point in the same direction, when $\theta$ is zero.
All of this is just to say that the gradient at a point can be interpreted as having a specific direction. It’s the direction of steepest ascent of the surface, and its size tells you how steep $f$ is at that point. The opposite direction is the direction of steepest descent, and the orthogonal directions (when $\cos(\theta) = 0$) have derivative zero.
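If you prefer a numerical sanity check, here is a tiny sketch (the sample function and the random sampling are arbitrary choices): no direction of the same magnitude as the gradient ever beats the gradient itself.

```python
import numpy as np

# Gradient of f(x, y, z) = x^2 y + z^3, evaluated at a fixed point p.
def grad_f(p):
    x, y, z = p
    return np.array([2 * x * y, x**2, 3 * z**2])

g = grad_f(np.array([1.0, 2.0, 1.0]))

# Sample many vectors with the same magnitude as the gradient and compare
# their directional derivatives <g, v> to the one along the gradient itself.
rng = np.random.default_rng(0)
vs = rng.normal(size=(10_000, 3))
vs *= np.linalg.norm(g) / np.linalg.norm(vs, axis=1, keepdims=True)

print(np.max(vs @ g), np.dot(g, g))   # the sampled max never exceeds <g, g>
```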
Now what happens if we’re at a local minimum or maximum? Well it’s necessary that is flat, and so by our discussion above the derivatives in all directions must be zero. It’s a basic linear algebra proof to show that this means the gradient is the zero vector. You can prove this by asking what sorts of vectors
have a dot product of zero with all other vectors
?
Now once we have a critical point, how do we tell whether it’s a local max or a local min? The answer is actually a bit complicated, and it requires you to inspect the eigenvalues of the Hessian of $f$. We won’t dally on eigenvalues except to explain the idea in brief: for an $n$-variable function $f$ the Hessian of $f$ at $x$ is the $n$-by-$n$ matrix whose $(i, j)$ entry is the second partial derivative $\frac{\partial^2 f}{\partial x_i \partial x_j}(x)$. It just so turns out that if this matrix has only positive eigenvalues, then $x$ is a local minimum. If the eigenvalues are all negative, it’s a local max. If some are negative and some are positive, then it’s a saddle point. And if zero is an eigenvalue then we’re screwed and can’t conclude anything without more work.
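In practice you’d let a computer find those eigenvalues. Here is a sketch with sympy on two throwaway examples, a bowl and a saddle:

```python
import sympy as sp

x, y = sp.symbols('x y')

# A bowl: the Hessian of x^2 + y^2 at its critical point (0, 0).
bowl = sp.hessian(x**2 + y**2, (x, y)).subs({x: 0, y: 0})
print(bowl.eigenvals())    # {2: 2} -- all positive, so (0, 0) is a local min

# A saddle: x^2 - y^2 has eigenvalues of both signs at the origin.
saddle = sp.hessian(x**2 - y**2, (x, y)).subs({x: 0, y: 0})
print(saddle.eigenvals())  # {2: 1, -2: 1} -- a saddle point
```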
But all of this Hessian business isn’t particularly important for us, because most of our applications of the Lagrangian will work with functions where we already know that there is a unique global maximum or minimum. Finding where the gradient is zero is enough. As much as this author stresses the importance of linear algebra, we simply won’t need to compute any eigenvalues for this one.
What we will need to do is look at optimizing functions which are constrained by some equality conditions. This is where Lagrangians come into play.
Constrained by Equality
Oftentimes we will want to find a minimum or maximum of a function $f$, but we will have additional constraints. The simplest kind is an equality constraint.
For example, we might want to find the maximum of the function $f(x, y, z) = xyz$ requiring that the point $(x, y, z)$ lies on the unit sphere. One could write this in a “canonical form”:
maximize $xyz$
subject to $x^2 + y^2 + z^2 = 1$
Way back in the scientific revolution, Fermat discovered a technique to solve such problems that was later generalized by Lagrange. The idea is to combine the objective and the constraints into one function whose gradient provides enough information to find a maximum. Clearly such information needs to include two things: that the constraint is satisfied, and that $f$ cannot be improved in any direction that keeps it satisfied.
First we rewrite the constraint as $g(x, y, z) = x^2 + y^2 + z^2 - 1 = 0$, because when we’re dealing with gradients we want things to be zero. Then we form the Lagrangian of the problem. We’ll give a precise definition in a minute, but it looks like this:
$$L(x, y, z, \lambda) = f(x, y, z) - \lambda g(x, y, z) = xyz - \lambda (x^2 + y^2 + z^2 - 1)$$
That is, we’ve added a new variable $\lambda$ and combined the two functions into one. Let’s see what happens when we take the gradient:
$$\nabla L = \left( yz - 2\lambda x, \;\; xz - 2\lambda y, \;\; xy - 2\lambda z, \;\; -(x^2 + y^2 + z^2 - 1) \right)$$
Now if we require the gradient to be zero, the last equation is simply the original constraint, and the first three equations say that $\nabla f = \lambda \nabla g$. In other words, we’re saying that the two gradients must point in the same direction for the function to provide a maximum. Solving for where these equations vanish gives some trivial solutions (one variable is $\pm 1$ and the rest zero, with $\lambda = 0$), and a solution defined by $x = y = z = \frac{1}{\sqrt{3}}$, which is clearly the maximal of the choices.
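If you’d rather let the computer chew through the algebra, here is a sketch using sympy: form the Lagrangian, set all of its partial derivatives to zero, and compare $f$ at the resulting critical points.

```python
import sympy as sp

x, y, z, lam = sp.symbols('x y z lambda', real=True)

f = x * y * z
g = x**2 + y**2 + z**2 - 1
L = f - lam * g

# Setting every partial derivative of the Lagrangian to zero encodes both
# grad f = lambda * grad g and the original constraint g = 0.
equations = [sp.diff(L, var) for var in (x, y, z, lam)]
solutions = sp.solve(equations, (x, y, z, lam), dict=True)

# Evaluate f at each critical point and keep the best one.
best = max(solutions, key=lambda s: f.subs(s))
print(best, f.subs(best))   # x = y = z = 1/sqrt(3), with f = sqrt(3)/9
```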
Indeed, this will work in general, and you can see a geometric and analytic proof in these notes.
Specifically, if we have an optimization problem defined by an objective function $f(x)$ to optimize, and a set of $k$ equality constraints $g_i(x) = 0$, then we can form the Lagrangian
$$L(x, \lambda_1, \dots, \lambda_k) = f(x) - \sum_{i=1}^k \lambda_i g_i(x)$$
And then a theorem of Lagrange is that all optimal solutions $x^*$ to the problem satisfy $\nabla L(x^*, \lambda_1, \dots, \lambda_k) = 0$ for some choice of the $\lambda_i$ (provided the constraint gradients $\nabla g_i(x^*)$ are linearly independent). But then you have to go solve the system and figure out which of the solutions gives you your optimum.
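For problems too messy to solve by hand, a numerical solver will happily do the searching for you. Here is a sketch using scipy on the example from above (the solver and starting point are arbitrary choices, and scipy does not literally form the Lagrangian for you):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize xyz on the unit sphere by minimizing its negation subject to
# the equality constraint x^2 + y^2 + z^2 - 1 = 0.
objective = lambda p: -(p[0] * p[1] * p[2])
constraint = {'type': 'eq', 'fun': lambda p: np.sum(p**2) - 1}

result = minimize(objective, x0=np.array([0.5, 0.5, 0.5]),
                  constraints=[constraint], method='SLSQP')
print(result.x, -result.fun)   # roughly (0.577, 0.577, 0.577) and 0.192
```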
Convexity
As it turns out, there are some additional conditions you can add to your problem to guarantee your system has a solution. One nice condition is that $f$ is convex. A function $f$ is convex if its value at any point on a line segment between two points $x$ and $y$ is no greater than the corresponding point on the line segment between $f(x)$ and $f(y)$. In other words, for all $0 \leq \lambda \leq 1$:
$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$
Some important examples of convex functions: exponentials, quadratics whose leading coefficient is positive, square norms of a vector variable, and linear functions.
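To make the inequality concrete, here is a throwaway numerical check on one of those examples (the function and endpoints are arbitrary choices):

```python
import numpy as np

f = lambda t: t**2          # a convex function
x, y = -1.0, 3.0

for lam in np.linspace(0, 1, 5):
    chord = lam * f(x) + (1 - lam) * f(y)
    print(f(lam * x + (1 - lam) * y) <= chord + 1e-12)   # always True
```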
Convex functions have this nice property that they have a unique local minimum value, and hence it must also be the global minimum. Why is this? Well if you have a local minimum $x$, and any other point $y$, then by virtue of $x$ being a local minimum there is some $\lambda$ sufficiently close to 1 so that (the first inequality is local minimality, the second is convexity):
$$f(x) \leq f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$
And rearranging we get
$$(1 - \lambda) f(x) \leq (1 - \lambda) f(y)$$
So $f(x) \leq f(y)$, and since $y$ was arbitrary, $x$ is the global minimum.
This alleviates our problem of having to sort through multiple solutions, and in particular it helps us to write programs to solve optimization problems: we know that techniques like gradient descent will never converge to a false local minimum.
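For instance, a bare-bones gradient descent on a convex quadratic marches straight to the one minimum there is (the function, step size, and iteration count below are arbitrary choices for the sketch):

```python
import numpy as np

center = np.array([1.0, -2.0])

def grad(p):
    return 2 * (p - center)   # gradient of the convex function |p - center|^2

p, step = np.zeros(2), 0.1
for _ in range(200):
    p -= step * grad(p)

print(p)   # converges to the unique (hence global) minimum at (1, -2)
```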
That’s all for now! The next question we might shadowily ask: what happens if we add inequality constraints?