Regression and Linear Combinations

Recently I’ve been helping out with a linear algebra course organized by Tai-Danae Bradley and Jack Hidary, and one of the questions that came up a few times was, “why should programmers care about the concept of a linear combination?”

For those who don’t know, given vectors v_1, \dots, v_n, a linear combination of the vectors is a choice of some coefficients a_i with which to weight the vectors in a sum v = \sum_{i=1}^n a_i v_i.

I must admit, math books do a poor job of painting the concept as more than theoretical—perhaps linear combinations are only needed for proofs, while the real meat is in matrix multiplication and cross products. But no, linear combinations truly lie at the heart of many practical applications.

In some cases, the entire goal of an algorithm is to find a “useful” linear combination of a set of vectors. The vectors are the building blocks (often a vector space or subspace basis), and the set of linear combinations are the legal ways to combine the blocks. Simpler blocks admit easier and more efficient algorithms, but their linear combinations are less expressive. Hence, a tradeoff.

A concrete example is regression. Most people think of regression in terms of linear regression. You’re looking for a linear function like y = mx+b that approximates some data well. For multiple variables, you have, e.g., \mathbf{x} = (x_1, x_2, x_3) as a vector of input variables, and \mathbf{w} = (w_1, w_2, w_3) as a vector of weights, and the function is y = \mathbf{w}^T \mathbf{x} + b.

To avoid the shift by b (which makes the function affine instead of purely linear; formulas of purely linear functions are easier to work with because the shift is like a pesky special case you have to constantly account for), authors often add a fake input variable x_0 which is always fixed to 1, and relabel b as w_0 to get y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i as the final form. The optimization problem to solve becomes the following, where your data set to approximate is \{ \mathbf{x}_1, \dots, \mathbf{x}_k \}.

\displaystyle \min_w \sum_{i=1}^k (y_i - \mathbf{w}^T \mathbf{x}_i)^2

In this case, the function being learned—the output of the regression—doesn’t look like a linear combination. Technically it is, just not in an interesting way.

It becomes more obviously related to linear combinations when you try to model non-linearity. The idea is to define a class of functions called basis functions B = \{ f_1, \dots, f_m \mid f_i: \mathbb{R}^n \to \mathbb{R} \}, and allow your approximation to be any linear combination of functions in B, i.e., any function in the span of B.

\displaystyle \hat{f}(\mathbf{x}) = \sum_{i=1}^m w_i f_i(\mathbf{x})

Again, instead of weighting each coordinate of the input vector with a w_i, we’re weighting each basis function’s contribution (when given the whole input vector) to the output. If the basis functions were to output a single coordinate (f_i(\mathbf{x}) = x_i), we would be back to linear regression.

Then the optimization problem is to choose the weights to minimize the error of the approximation.

\displaystyle \min_w \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

As an example, let’s say that we wanted to do regression with a basis of quadratic polynomials. Our basis for three input variables might look like

\displaystyle \{ 1, x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3, x_1^2, x_2^2, x_3^2 \}

Any quadratic polynomial in three variables can be written as a linear combination of these basis functions. Also note that if we treat this as the basis of a vector space, then a vector is a tuple of 10 numbers—the ten coefficients in the polynomial. It’s the same as \mathbb{R}^{10}, just with a different interpretation of what the vector’s entries mean. With that, we can see how we would compute dot products, projections, and other nice things, though they may not have quite the same geometric sensibility.

These are not the usual basis functions used for polynomial regression in practice (see the note at the end of this article), but we can already do some damage in writing regression algorithms.

A simple stochastic gradient descent

Although there is a closed form solution to many regression problems (including the quadratic regression problem, though with a slight twist), gradient descent is a simple enough solution to showcase how an optimization solver can find a useful linear combination. This code will be written in Python 3.9. It’s on Github.

First we start with some helpful type aliases

from typing import Callable, Tuple, List

Input = Tuple[float, float, float]
Coefficients = List[float]
Gradient = List[float]
Hypothesis = Callable[[Input], float]
Dataset = List[Tuple[Input, float]]

Then define a simple wrapper class for our basis functions

class QuadraticBasisPolynomials:
    def __init__(self):
        self.basis_functions = [
            lambda x: 1,
            lambda x: x[0],
            lambda x: x[1],
            lambda x: x[2],
            lambda x: x[0] * x[1],
            lambda x: x[0] * x[2],
            lambda x: x[1] * x[2],
            lambda x: x[0] * x[0],
            lambda x: x[1] * x[1],
            lambda x: x[2] * x[2],

    def __getitem__(self, index):
        return self.basis_functions[index]

    def __len__(self):
        return len(self.basis_functions)

    def linear_combination(self, weights: Coefficients) -> Hypothesis:
        def combined_function(x: Input) -> float:
            return sum(
                w * f(x)
                for (w, f) in zip(weights, self.basis_functions)

        return combined_function

basis = QuadraticBasisPolynomials()

The linear_combination function returns a function that computes the weighted sum of the basis functions. Now we can define the error on a dataset, as well as for a single point

def total_error(weights: Coefficients, data: Dataset) -> float:
    hypothesis = basis.linear_combination(weights)
    return sum(
        (actual_output - hypothesis(example)) ** 2
        for (example, actual_output) in data

def single_point_error(
        weights: Coefficients, point: Tuple[Input, float]) -> float:
    return point[1] - basis.linear_combination(weights)(point[0])

We can then define the gradient of the error function with respect to the weights and a single data point. Recall, the error function is defined as

\displaystyle E(\mathbf{w}) = \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

where \hat{f} is a linear combination of basis functions

\hat{f}(\mathbf{x}_j) = \sum_{s=1}^n w_s f_s(\mathbf{x}_j)

Since we’ll do stochastic gradient descent, the error formula is a bit simpler. We compute it not for the whole data set but only a single random point at a time. So the error is

\displaystyle E(\mathbf{w}) = (y_j - \hat{f}(\mathbf{x}_j))^2

Then we compute the gradient with respect to the individual entries of \mathbf{w}, using the chain rule and noting that the only term of the linear combination that has a nonzero contribution to the gradient for \frac{\partial E}{\partial w_i} is the term containing w_i. This is one of the major benefits of using linear combinations: the gradient computation is easy.

\displaystyle \frac{\partial E}{\partial w_i} = -2 (y_j - \hat{f}(\mathbf{x}_j)) \frac{\partial \hat{f}}{\partial w_i}(\mathbf{x}_j) = -2 (y_j - \hat{f}(\mathbf{x}_j)) f_i(\mathbf{x}_j)

Another advantage to being linear is that this formula is agnostic to the content of the underlying basis functions. This will hold so long as the weights don’t show up in the formula for the basis functions. As an exercise: try changing the implementation to use radial basis functions around each data point. (see the note at the end for why this would be problematic in real life)

def gradient(weights: Coefficients, data_point: Tuple[Input, float]) -> Gradient:
    error = single_point_error(weights, data_point)
    dE_dw = [0] * len(weights)

    for i, w in enumerate(weights):
        dE_dw[i] = -2 * error * basis[i](data_point[0])

    return dE_dw

Finally, the gradient descent core with a debugging helper.

import random

def print_debug_info(step, grad_norm, error, progress):
    print(f"{step}, {progress:.4f}, {error:.4f}, {grad_norm:.4f}")

def gradient_descent(
        data: Dataset,
        learning_rate: float,
        tolerance: float,
        training_callback = None,
) -> Hypothesis:
    weights = [random.random() * 2 - 1 for i in range(len(basis))]
    last_error = total_error(weights, data)
    step = 0
    progress = tolerance * 2
    grad_norm = 1.0

    if training_callback:
        training_callback(step, 0.0, last_error, 0.0)

    while abs(progress) > tolerance or grad_norm > tolerance:
        grad = gradient(weights, random.choice(data))
        grad_norm = sum(x**2 for x in grad)
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad[i]

        error = total_error(weights, data)
        progress = error - last_error
        last_error = error
        step += 1

        if training_callback:
            training_callback(step, grad_norm, error, progress)

    return basis.linear_combination(weights)

Next create some sample data and run the optimization

def example_quadratic_data(num_points: int):
    def fn(x, y, z):
        return 2 - 4*x*y + z + z**2

    data = []
    for i in range(num_points):
        x, y, z = random.random(), random.random(), random.random()
        data.append(((x, y, z), fn(x, y, z)))

    return data

if __name__ == "__main__":
    data = example_quadratic_data(30)

Depending on the randomness, it may take a few thousand steps, but it typically converges to an error of < 1. Here’s the plot of error against gradient descent steps.

Gradient descent showing log(total error) vs number of steps, for the quadratic regression problem.

Kernels and Regularization

I’ll finish with explanations of the parentheticals above.

The real polynomial kernel. We chose a simple set of polynomial functions. This is closely related to the concept of a “kernel”, but the “real” polynomial kernel uses slightly different basis functions. It scales some of the basis functions by \sqrt{2}. This is OK because a linear combination can compensate by using coefficients that are appropriately divided by \sqrt{2}. But why would one want to do this? The answer boils down to a computational efficiency technique called the “Kernel trick.” In short, it allows you to compute the dot product between two linear combinations of vectors in this vector space without explicitly representing the vectors in the space to begin with. If your regression algorithm uses only dot products in its code (as is true of the closed form solution for regression), you get the benefits of nonlinear feature modeling without the cost of computing the features directly. There’s a lot more mathematical theory to discuss here (cf. Reproducing Kernel Hilbert Space) but I’ll have to leave it there for now.

What’s wrong with the radial basis function exercise? This exercise asked you to create a family of basis functions, one for each data point. The problem here is that having so many basis functions makes the linear combination space too expressive. The optimization will overfit the data. It’s like a lookup table: there’s one entry dedicated to each data point. New data points not in the training would be rarely handled well, since they aren’t in the “lookup table” the optimization algorithm found. To get around this, in practice one would add an extra term to the error corresponding to the L1 or L2 norm of the weight vector. This allows one to ensure that the total size of the weights is small, and in the L1 case that usually corresponds to most weights being zero, and only a few weights (the most important) being nonzero. The process of penalizing the “magnitude” of the linear combination is called regularization.

A parlor trick for SET

Tai-Danae Bradley is one of the hosts of PBS Infinite Series, a delightful series of vignettes into fun parts of math. The video below is about the same of SET, a favorite among mathematicians. Specifically, Tai-Danae explains how SET cards lie in (using more technical jargon) a vector space over a finite field, and that valid sets correspond to lines. If you don’t immediately know how this would work, watch the video.

In this post I want to share a parlor trick for SET that I originally heard from Charlotte Chan. It uses the same ideas from the video above, which I’ll only review briefly.

In the game of SET you see a board of cards like the following, and players look for sets.


Image source:

A valid set is a triple of cards where, feature by feature, the characteristics on the cards are either all the same or all different. A valid set above is {one empty blue oval, two solid blue ovals, three shaded blue ovals}. The feature of “fill” is different on all the cards, but the feature of “color” is the same, etc.

In a game of SET, the cards are dealt in order from a shuffled deck, players race to claim sets, removing the set if it’s valid, and three cards are dealt to replace the removed set. Eventually the deck is exhausted and the game is over, and the winner is the player who collected the most sets.

There are a handful of mathematical tricks you can use to help you search for sets faster, but the parlor trick in this post adds a fun variant to the end of the game.

Play the game of SET normally, but when you get down to the last card in the deck, don’t reveal it. Keep searching for sets until everyone agrees no visible sets are left. Then you start the variant: the first player to guess the last un-dealt card in the deck gets a bonus set.

The math comes in when you discover that you don’t need to guess, or remember anything about the game that was just played! A clever stranger could walk into the room at the end of the game and win the bonus point.

Theorem: As long as every player claimed a valid set throughout the game, the information on the remaining board uniquely determines the last (un-dealt) card.

Before we get to the proof, some reminders. Recall that there are four features on a SET card, each of which has three options. Enumerate the options for each feature (e.g., {Squiggle, Oval, Diamond} = {0, 1, 2}).

While we will not need the geometry induced by this, this implies each card is a vector in the vector space \mathbb{F}_3^4, where \mathbb{F}_3 = \mathbb{Z}/3\mathbb{Z} is the finite field of three elements, and the exponent means “dimension 4.” As Tai-Danae points out in the video, each SET is an affine line in this vector space. For example, if this is the enumeration:


Source: “The Joy of Set

Then using the enumeration, a set might be given by

\displaystyle \{ (1, 1, 1, 1), (1, 2, 0, 1), (1, 0, 2, 1) \}

The crucial feature for us is that the vector-sum (using the modular field arithmetic on each entry) of the cards in a valid set is the zero vector (0, 0, 0, 0). This is because 1+1+1 = 0, 2+2+2 = 0, and 1+2+3=0 are all true mod 3.

Proof of Theorem. Consider the vector-valued invariant S_t equal to the sum of the remaining cards after t sets have been taken. At the beginning of the game the deck has 81 cards that can be partitioned into valid sets. Because each valid set sums to the zero vector, S_0 = (0, 0, 0, 0). Removing a valid set via normal play does not affect the invariant, because you’re subtracting a set of vectors whose sum is zero. So S_t = 0 for all t.

At the end of the game, the invariant still holds even if there are no valid sets left to claim. Let x be the vector corresponding to the last un-dealt card, and c_1, \dots, c_n be the remaining visible cards. Then x + \sum_{i=1}^n c_i = (0,0,0,0), meaning x = -\sum_{i=1}^n c_i.


I would provide an example, but I want to encourage everyone to play a game of SET and try it out live!

Charlotte, who originally showed me this trick, was quick enough to compute this sum in her head. So were the other math students we played SET with. It’s a bit easier than it seems since you can do the sum feature by feature. Even though I’ve known about this trick for years, I still require a piece of paper and a few minutes.

Because this is Math Intersect Programming, the reader is encouraged to implement this scheme as an exercise, and simulate a game of SET by removing randomly chosen valid sets to verify experimentally that this scheme works.

Until next time!

The Inner Product as a Decision Rule

The standard inner product of two vectors has some nice geometric properties. Given two vectors x, y \in \mathbb{R}^n, where by x_i I mean the i-th coordinate of x, the standard inner product (which I will interchangeably call the dot product) is defined by the formula

\displaystyle \langle x, y \rangle = x_1 y_1 + \dots + x_n y_n

This formula, simple as it is, produces a lot of interesting geometry. An important such property, one which is discussed in machine learning circles more than pure math, is that it is a very convenient decision rule.

In particular, say we’re in the Euclidean plane, and we have a line L passing through the origin, with w being a unit vector perpendicular to L (“the normal” to the line).


If you take any vector x, then the dot product \langle x, w \rangle is positive if x is on the same side of L as w, and negative otherwise. The dot product is zero if and only if x is exactly on the line L, including when x is the zero vector.


Left: the dot product of w and x is positive, meaning they are on the same side of w. Right: The dot product is negative, and they are on opposite sides.

Here is an interactive demonstration of this property. Click the image below to go to the demo, and you can drag the vector arrowheads and see the decision rule change.


Click above to go to the demo

The code for this demo is available in a github repository.

It’s always curious, at first, that multiplying and summing produces such geometry. Why should this seemingly trivial arithmetic do anything useful at all?

The core fact that makes it work, however, is that the dot product tells you how one vector projects onto another. When I say “projecting” a vector x onto another vector w, I mean you take only the components of x that point in the direction of w. The demo shows what the result looks like using the red (or green) vector.

In two dimensions this is easy to see, as you can draw the triangle which has x as the hypotenuse, with w spanning one of the two legs of the triangle as follows:


If we call a the (vector) leg of the triangle parallel to w, while b is the dotted line (as a vector, parallel to L), then as vectors x = a + b. The projection of x onto w is just a.

Another way to think of this is that the projection is x, modified by removing any part of x that is perpendicular to w. Using some colorful language: you put your hands on either side of x and w, and then you squish x onto w along the line perpendicular to w (i.e., along b).

And if w is a unit vector, then the length of a—that is, the length of the projection of x onto w—is exactly the inner product product \langle x, w \rangle.

Moreover, if the angle between x and w is larger than 90 degrees, the projected vector will point in the opposite direction of w, so it’s really a “signed” length.


Left: the projection points in the same direction as w. Right: the projection points in the opposite direction.

And this is precisely why the decision rule works. This 90-degree boundary is the line perpendicular to w.

More technically said: Let x, y \in \mathbb{R}^n be two vectors, and \langle x,y \rangle their dot product. Define by \| y \| the length of y, specifically \sqrt{\langle y, y \rangle}. Define by \text{proj}_{y}(x) by first letting y' = \frac{y}{\| y \|}, and then let \text{proj}_{y}(x) = \langle x,y' \rangle y'. In words, you scale y to a unit vector y', use the result to compute the inner product, and then scale y so that it’s length is \langle x, y' \rangle. Then

Theorem: Geometrically, \text{proj}_y(x) is the projection of x onto the line spanned by y.

This theorem is true for any n-dimensional vector space, since if you have two vectors you can simply apply the reasoning for 2-dimensions to the 2-dimensional plane containing x and y. In that case, the decision boundary for a positive/negative output is the entire n-1 dimensional hyperplane perpendicular to y (the projected vector).

In fact, the usual formula for the angle between two vectors, i.e. the formula \langle x, y \rangle = \|x \| \cdot \| y \| \cos \theta, is a restatement of the projection theorem in terms of trigonometry. The \langle x, y' \rangle part of the projection formula (how much you scale the output) is equal to \| x \| \cos \theta. At the end of this post we have a proof of the cosine-angle formula above.

Part of why this decision rule property is so important is that this is a linear function, and linear functions can be optimized relatively easily. When I say that, I specifically mean that there are many known algorithms for optimizing linear functions, which don’t have obscene runtime or space requirements. This is a big reason why mathematicians and statisticians start the mathematical modeling process with linear functions. They’re inherently simpler.

In fact, there are many techniques in machine learning—a prominent one is the so-called Kernel Trick—that exist solely to take data that is not inherently linear in nature (cannot be fruitfully analyzed by linear methods) and transform it into a dataset that is. Using the Kernel Trick as an example to foreshadow some future posts on Support Vector Machines, the idea is to take data which cannot be separated by a line, and transform it (usually by adding new coordinates) so that it can. Then the decision rule, computed in the larger space, is just a dot product. Irene Papakonstantinou neatly demonstrates this with paper folding and scissors. The tradeoff is that the size of the ambient space increases, and it might increase so much that it makes computation intractable. Luckily, the Kernel Trick avoids this by remembering where the data came from, so that one can take advantage of the smaller space to compute what would be the inner product in the larger space.

Next time we’ll see how this decision rule shows up in an optimization problem: finding the “best” hyperplane that separates an input set of red and blue points into monochromatic regions (provided that is possible). Finding this separator is core subroutine of the Support Vector Machine technique, and therein lie interesting algorithms. After we see the core SVM algorithm, we’ll see how the Kernel Trick fits into the method to allow nonlinear decision boundaries.

Proof of the cosine angle formula

Theorem: The inner product \langle v, w \rangle is equal to \| v \| \| w \| \cos(\theta), where \theta is the angle between the two vectors.

Note that this angle is computed in the 2-dimensional subspace spanned by v, w, viewed as a typical flat plane, and this is a 2-dimensional plane regardless of the dimension of v, w.

Proof. If either v or w is zero, then both sides of the equation are zero and the theorem is trivial, so we may assume both are nonzero. Label a triangle with sides v,w and the third side v-w. Now the length of each side is \| v \|, \| w\|, and \| v-w \|, respectively. Assume for the moment that \theta is not 0 or 180 degrees, so that this triangle is not degenerate.

The law of cosines allows us to write

\displaystyle \| v - w \|^2 = \| v \|^2 + \| w \|^2 - 2 \| v \| \| w \| \cos(\theta)

Moreover, The left hand side is the inner product of v-w with itself, i.e. \| v - w \|^2 = \langle v-w , v-w \rangle. We’ll expand \langle v-w, v-w \rangle using two facts. The first is trivial from the formula, that inner product is symmetric: \langle v,w \rangle = \langle w, v \rangle. Second is that the inner product is linear in each input. In particular for the first input: \langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle and \langle cx, z \rangle = c \langle x, z \rangle. The same holds for the second input by symmetry of the two inputs. Hence we can split up \langle v-w, v-w \rangle as follows.

\displaystyle \begin{aligned} \langle v-w, v-w \rangle &= \langle v, v-w \rangle - \langle w, v-w \rangle \\ &= \langle v, v \rangle - \langle v, w \rangle - \langle w, v \rangle +  \langle w, w \rangle \\ &= \| v \|^2 - 2 \langle v, w \rangle + \| w \|^2 \\ \end{aligned}

Combining our two offset equations, we can subtract \| v \|^2 + \| w \|^2 from each side and get

\displaystyle -2 \|v \| \|w \| \cos(\theta) = -2 \langle v, w \rangle,

Which, after dividing by -2, proves the theorem if \theta \not \in \{0, 180 \}.

Now if \theta = 0 or 180 degrees, the vectors are parallel, so we can write one as a scalar multiple of the other. Say w = cv for c \in \mathbb{R}. In that case, \langle v, cv \rangle = c \| v \| \| v \|. Now \| w \| = | c | \| v \|, since a norm is a length and is hence non-negative (but c can be negative). Indeed, if v, w are parallel but pointing in opposite directions, then c < 0, so \cos(\theta) = -1, and c \| v \| = - \| w \|. Otherwise c > 0 and \cos(\theta) = 1. This allows us to write c \| v \| \| v \| = \| w \| \| v \| \cos(\theta), and this completes the final case of the theorem.


Singular Value Decomposition Part 1: Perspectives on Linear Algebra

The singular value decomposition (SVD) of a matrix is a fundamental tool in computer science, data analysis, and statistics. It’s used for all kinds of applications from regression to prediction, to finding approximate solutions to optimization problems. In this series of two posts we’ll motivate, define, compute, and use the singular value decomposition to analyze some data. (Jump to the second post)

I want to spend the first post entirely on motivation and background. As part of this, I think we need a little reminder about how linear algebra equivocates linear subspaces and matrices. I say “I think” because what I’m going to say seems rarely spelled out in detail. Indeed, I was confused myself when I first started to read about linear algebra applied to algorithms, machine learning, and data science, despite having a solid understanding of linear algebra from a mathematical perspective. The concern is the connection between matrices as transformations and matrices as a “convenient” way to organize data.

Data vs. maps

Linear algebra aficionados like to express deep facts via statements about matrix factorization. That is, they’ll say something opaque like (and this is the complete statement for SVD we’ll get to in the post):

The SVD of an m \times n matrix A with real values is a factorization of A as U\Sigma V^T, where U is an m \times m orthogonal matrix, V is an n \times n orthogonal matrix, and \Sigma is a diagonal matrix with nonnegative real entries on the diagonal.

Okay, I can understand the words individually, but what does it mean in terms of the big picture? There are two seemingly conflicting interpretations of matrices that muddle our vision.

The first is that A is a linear map from some n-dimensional vector space to an m-dimensional one. Let’s work with real numbers and call the domain vector space \mathbb{R}^n and the codomain \mathbb{R}^m. In this interpretation the factorization expresses a change of basis in the domain and codomain. Specifically, V expresses a change of basis from the usual basis of \mathbb{R}^n to some other basis, and U does the same for the co-domain \mathbb{R}^m.

That’s fine and dandy, so long as the data that makes up A is the description of some linear map we’d like to learn more about. Years ago on this blog we did exactly this analysis of a linear map modeling a random walk through the internet, and we ended up with Google’s PageRank algorithm. However, in most linear algebra applications A actually contains data in its rows or columns. That’s the second interpretation. That is, each row of A is a data point in \mathbb{R}^n, and there are m total data points, and they represent observations of some process happening in the world. The data points are like people rating movies or weather sensors measuring wind and temperature. If you’re not indoctrinated with the linear algebra world view, you might not naturally think of this as a mapping of vectors from one vector space to another.

Most of the time when people talk about linear algebra (even mathematicians), they’ll stick entirely to the linear map perspective or the data perspective, which is kind of frustrating when you’re learning it for the first time. It seems like the data perspective is just a tidy convenience, that it just “makes sense” to put some data in a table. In my experience the singular value decomposition is the first time that the two perspectives collide, and (at least in my case) it comes with cognitive dissonance.

The way these two ideas combine is that the data is thought of as the image of the basis vectors of \mathbb{R}^n under the linear map specified by A. Here is an example to make this concrete. Let’s say I want to express people rating movies. Each row will correspond to the ratings of a movie, and each column will correspond to a person, and the i,j entry of the matrix A is the rating person j gives to movie i.


In reality they’re rated on a scale from 1 to 5 stars, but to keep things simple we’ll just say that the ratings can be any real numbers (they just happened to pick integers). So this matrix represents a linear map. The domain is \mathbb{R}^3, and the basis vectors are called people, and the codomain is \mathbb{R}^8, whose basis vectors are movies.


Now the data set is represented by A(\vec e_{\textup{Aisha}}),A(\vec e_{\textup{Bob}}),A(\vec e_{\textup{Chandrika}}), and by the definition of how a matrix represents a linear map, the entires of these vectors are exactly the columns of A. If the codomain is really big, then the image of A is a small-dimensional linear subspace of the codomain. This is an important step, that we’ve increased our view from just the individual data points to all of their linear combinations as a subspace.

Why is this helpful at all? This is where we start to see the modeling assumptions of linear algebra show through. If we’re trying to use this matrix to say something about how people rate movies (maybe we want to predict how a new person will rate these movies), we would need to be able to represent that person as a linear combination of Aisha, Bob, and Chandrika. Likewise, if we had a new movie and we wanted to use this matrix to say anything about it, we’d have to represent the movie as a linear combination of the existing movies.

Of course, I don’t literally mean that a movie (as in, the bits comprising a file containing a movie) can be represented as a linear combination of other movies. I mean that we can represent a movie formally as a linear combination in some abstract vector space for the task at hand. In other words, we’re representing those features of the movie that influence its rating abstractly as a vector. We don’t have a legitimate mathematical way to understand that, so the vector is a proxy.

It’s totally unclear what this means in terms of real life, except that you can hope (or hypothesize, or verify), that if the rating process of movies is “linear” in nature then this formal representation will accurately reflect the real world. It’s like how physicists all secretly know that mathematics doesn’t literally dictate the laws of nature, because humans made up math in their heads and if you poke nature too hard the math breaks down, but it’s so damn convenient to describe hypotheses (and so damn accurate), that we can’t avoid using it to design airplanes. And we haven’t found anything better than math for this purpose.

Likewise, movie ratings aren’t literally a linear map, but if we pretend they are we can make algorithms that predict how people rate movies with pretty good accuracy. So if you know that Skyfall gets ratings 1,2, and 1 from Aisha, Bob, and Chandrika, respectively, then a new person would rate Skyfall based on a linear combination of how well they align with these three people. In other words, up to a linear combination, in this example Aisha, Bob, and Chandrika epitomize the process of rating movies.

And now we get to the key: factoring the matrix via SVD provides an alternative and more useful way to represent the process of people rating movies. By changing the basis of one or both vector spaces involved, we isolate the different (orthogonal) characteristics of the process. In the context of our movie example, “factorization” means the following:

  1. Come up with a special list of vectors v_1, v_2, \dots, v_8 so that every movie can be written as a linear combination of the v_i.
  2. Do the analogous thing for people to get p_1, p_2, p_3.
  3. Do (1) and (2) in such a way that the map A is diagonal with respect to both new bases simultaneously.

One might think of the v_i as “idealized movies” and the p_j as “idealized critics.” If you want to use this data to say things about the world, you’d be making the assumption that any person can be written as a linear combination of the p_j and any movie can be written as a linear combination of the v_i. These are the rows/columns of U, V from the factorization. To reiterate, these linear combinations are only with respect to the task of rating movies. And they’re “special” because they make the matrix diagonal.

If the world was logical (and I’m not saying it is) then maybe v_1 would correspond to some idealized notion of “action movie,” and p_1 would correspond to some idealized notion of “action movie lover.” Then it makes sense why the mapping would be diagonal in this basis: an action movie lover only loves action movies, so p_1 gives a rating of zero to everything except v_1. A movie is represented by how it decomposes (linearly) into “idealized” movies. To make up some arbitrary numbers, maybe Skyfall is 2/3 action movie, 1/5 dystopian sci-fi, and -6/7 comedic romance. Likewise a person would be represented by how they decompose (via linear combination) into a action movie lover, rom-com lover, etc.

To be completely clear, the singular value decomposition does not find the ideal sci-fi movie. The “ideal”ness of the singular value decomposition is with respect to the inherent geometric structure of the data coupled with the assumptions of linearity. Whether this has anything at all to do with how humans classify movies is a separate question, and the answer is almost certainly no.

With this perspective we’re almost ready to talk about the singular value decomposition. I just want to take a moment to write down a list of the assumptions that we’d need if we want to ensure that, given a data set of movie ratings, we can use linear algebra to make exact claims about world.

  1. All people rate movies via the same linear map.
  2. Every person can be expressed (for the sole purpose of movie ratings) as linear combinations of “ideal” people. Likewise for movies.
  3. The “idealized” movies and people can be expressed as linear combinations of the movies/people in our particular data set.
  4. There are no errors in the ratings.

One could have a deep and interesting discussion about the philosophical (or ethical, or cultural) aspects of these assumptions. But since the internet prefers to watch respectful discourse burn, we’ll turn to algorithms instead.

Approximating subspaces

In our present context, the singular value decomposition (SVD) isn’t meant to be a complete description of a mapping in a new basis, as we said above. Rather, we want to use it to approximate the mapping A by low-dimensional linear things. When I say “low-dimensional linear things” I mean that given A, we’d want to find another matrix B which is measurably similar to A in some way, and has low rank compared to A.

How do we know that A isn’t already low rank? The reasons is that data with even the tiniest bit of noise is full rank with overwhelming probability. A concrete way to say this is that the space of low-rank matrices has small dimension (in the sense of a manifold) inside the space of all matrices. So perturbing even a single entry by an infinitesimally small amount would increase the rank.

We don’t need to understand manifolds to understand the SVD, though. For our example of people rating movies the full-rank property should be obvious. The noise and randomness and arbitrariness in human preferences certainly destroys any “perfect” linear structure we could hope to find, and in particular that means the data set itself, i.e. the image of A, is a large-dimensional subspace of the codomain.

Finding a low-rank approximation can be thought of as “smoothing” the noise out of the data. And this works particularly well when the underlying process is close to a linear map. That is, when the data is close to being contained entirely in a single subspace of relatively low-dimension. One way to think of why this might be the case is that if the process you’re observing is truly linear, but the data you get is corrupted by small amounts of noise. Then A will be close to low rank in a measurable sense (to be defined mathematically in the sequel post) and the low-rank approximation B will be a more efficient, accurate, and generalizable surrogate for A.

In terms of our earlier list of assumptions about when you can linear algebra to solve problems, for the SVD we can add “approximately” to the first three assumptions, and “not too many errors” to the fourth. If those assumptions hold, SVD will give us a matrix B which accurately represents the process being measured. Conversely, if SVD does well, then you have some evidence that the process is linear-esque.

To be more specific with notation, if A is a matrix representing some dataset via the image of A and you provide a small integer k, then the singular value decomposition computes the rank k matrix B_k which best approximates A. And since now we’re comfortable identifying a data matrix with the subspace defined by its image, this is the same thing as finding the k-dimensional subspace of the image of A which is the best approximation of the data (i.e., the image of B_k). We’ll quantify what we mean by “best approximation” in the next post.

That’s it, as far as intuitively understanding what the SVD is. I should add that the SVD doesn’t only allow one to compute a rank k approximation, it actually allows you to set k=n and get an exact representation of A. We just won’t use it for that purpose in this series.

The second bit of intuition is the following. It’s only slightly closer to rigor, but somehow this little insight really made SVD click for me personally:

The SVD is what you get when you iteratively solve the greedy optimization problem of fitting data to a line.

By that I mean, you can compute the SVD by doing the following:

  1. What’s the best line fitting my data?
  2. Okay, ignoring that first line, what’s the next best line?
  3. Okay, ignoring all the lines in the span of those first two lines, what’s the next best line?
  4. Ignoring all the lines in the span of the first three lines, what’s the next best line?
  5. (repeat)

It should be shocking that this works. For most problems, in math and in life, the greedy algorithm is far from optimal. When it happens, once every blue moon, that the greedy algorithm is the best solution to a natural problem (and not obviously so, or just approximately so), it’s our intellectual duty to stop what we’re doing, sit up straight, and really understand and appreciate it. These wonders transcend political squabbles and sports scores. And we’ll start the next post immediately by diving into this greedy optimization problem.

The geometric perspective

There are two other perspectives I want to discuss here, though it may be more appropriate for a reader who is not familiar with the SVD to wait to read this after the sequel to this post. I’m just going to relate my understanding (in terms of the greedy algorithm and data approximations) to the geometric and statistical perspectives on the SVD.

Michael Nielsen wrote a long and detailed article presenting some ideas about a “new medium” in which to think about mathematics. He demonstrates is framework by looking at the singular value decomposition for 2×2 matrices. His explanation for the intuition behind the SVD is that you can take any matrix (linear map) and break it up into three pieces: a rotation about the origin, a rescaling of each coordinate, followed by another rotation about the origin. While I have the utmost respect for Nielsen (his book on quantum mechanics is the best text in the field), this explanation never quite made SVD click for me personally. It seems like a restatement of the opaque SVD definition (as a matrix factorization) into geometric terms. Indeed, an orthogonal matrix is a rotation, and a diagonal matrix is a rescaling of each coordinate.

To me, the key that’s missing from this explanation is the emphasis on the approximation. What makes the SVD so magical isn’t that the factorization exists in the first place, but rather that the SVD has these layers of increasingly good approximation. Though the terminology will come in the next post, these layers are the (ordered) singular vectors and singular values. And moreover, that the algorithmic process of constructing these layers necessarily goes in order from strongest approximation to weakest.

Another geometric perspective that highlights this is that the rank-k approximation provided by the SVD is a geometric projection of a matrix onto the space of rank at-most-k matrices with respect to the “spectral norm” on matrices (the spectral norm of A is the largest eigenvalue of A^TA). The change of basis described above makes this projection very easy: given a singular value decomposition you just take the top k singular vectors. Indeed, the Eckart-Young theorem formalizes this with the statement that the rank-k SVD B_k minimizes the distance (w.r.t. the spectral norm) between the original matrix A and any rank k matrix. So you can prove that SVD gives you the best rank k approximation of A by some reasonable measure.

Next time: algorithms

Next time we’ll connect all this to the formal definitions and rigor. We’ll study the greedy algorithm approach, and then we’ll implement the SVD and test it on some data.

Until then!