Regression and Linear Combinations

Recently I’ve been helping out with a linear algebra course organized by Tai-Danae Bradley and Jack Hidary, and one of the questions that came up a few times was, “why should programmers care about the concept of a linear combination?”

For those who don’t know, given vectors v_1, \dots, v_n, a linear combination of the vectors is a choice of some coefficients a_i with which to weight the vectors in a sum v = \sum_{i=1}^n a_i v_i.

I must admit, math books do a poor job of painting the concept as more than theoretical—perhaps linear combinations are only needed for proofs, while the real meat is in matrix multiplication and cross products. But no, linear combinations truly lie at the heart of many practical applications.

In some cases, the entire goal of an algorithm is to find a “useful” linear combination of a set of vectors. The vectors are the building blocks (often a vector space or subspace basis), and the set of linear combinations are the legal ways to combine the blocks. Simpler blocks admit easier and more efficient algorithms, but their linear combinations are less expressive. Hence, a tradeoff.

A concrete example is regression. Most people think of regression in terms of linear regression. You’re looking for a linear function like y = mx+b that approximates some data well. For multiple variables, you have, e.g., \mathbf{x} = (x_1, x_2, x_3) as a vector of input variables, and \mathbf{w} = (w_1, w_2, w_3) as a vector of weights, and the function is y = \mathbf{w}^T \mathbf{x} + b.

To avoid the shift by b (which makes the function affine instead of purely linear; formulas of purely linear functions are easier to work with because the shift is like a pesky special case you have to constantly account for), authors often add a fake input variable x_0 which is always fixed to 1, and relabel b as w_0 to get y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i as the final form. The optimization problem to solve becomes the following, where your data set to approximate is \{ \mathbf{x}_1, \dots, \mathbf{x}_k \}.

\displaystyle \min_w \sum_{i=1}^k (y_i - \mathbf{w}^T \mathbf{x}_i)^2

In this case, the function being learned—the output of the regression—doesn’t look like a linear combination. Technically it is, just not in an interesting way.

It becomes more obviously related to linear combinations when you try to model non-linearity. The idea is to define a class of functions called basis functions B = \{ f_1, \dots, f_m \mid f_i: \mathbb{R}^n \to \mathbb{R} \}, and allow your approximation to be any linear combination of functions in B, i.e., any function in the span of B.

\displaystyle \hat{f}(\mathbf{x}) = \sum_{i=1}^m w_i f_i(\mathbf{x})

Again, instead of weighting each coordinate of the input vector with a w_i, we’re weighting each basis function’s contribution (when given the whole input vector) to the output. If the basis functions were to output a single coordinate (f_i(\mathbf{x}) = x_i), we would be back to linear regression.

Then the optimization problem is to choose the weights to minimize the error of the approximation.

\displaystyle \min_w \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

As an example, let’s say that we wanted to do regression with a basis of quadratic polynomials. Our basis for three input variables might look like

\displaystyle \{ 1, x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3, x_1^2, x_2^2, x_3^2 \}

Any quadratic polynomial in three variables can be written as a linear combination of these basis functions. Also note that if we treat this as the basis of a vector space, then a vector is a tuple of 10 numbers—the ten coefficients in the polynomial. It’s the same as \mathbb{R}^{10}, just with a different interpretation of what the vector’s entries mean. With that, we can see how we would compute dot products, projections, and other nice things, though they may not have quite the same geometric sensibility.

These are not the usual basis functions used for polynomial regression in practice (see the note at the end of this article), but we can already do some damage in writing regression algorithms.

A simple stochastic gradient descent

Although there is a closed form solution to many regression problems (including the quadratic regression problem, though with a slight twist), gradient descent is a simple enough solution to showcase how an optimization solver can find a useful linear combination. This code will be written in Python 3.9. It’s on Github.

First we start with some helpful type aliases

from typing import Callable, Tuple, List

Input = Tuple[float, float, float]
Coefficients = List[float]
Gradient = List[float]
Hypothesis = Callable[[Input], float]
Dataset = List[Tuple[Input, float]]

Then define a simple wrapper class for our basis functions

class QuadraticBasisPolynomials:
    def __init__(self):
        self.basis_functions = [
            lambda x: 1,
            lambda x: x[0],
            lambda x: x[1],
            lambda x: x[2],
            lambda x: x[0] * x[1],
            lambda x: x[0] * x[2],
            lambda x: x[1] * x[2],
            lambda x: x[0] * x[0],
            lambda x: x[1] * x[1],
            lambda x: x[2] * x[2],

    def __getitem__(self, index):
        return self.basis_functions[index]

    def __len__(self):
        return len(self.basis_functions)

    def linear_combination(self, weights: Coefficients) -> Hypothesis:
        def combined_function(x: Input) -> float:
            return sum(
                w * f(x)
                for (w, f) in zip(weights, self.basis_functions)

        return combined_function

basis = QuadraticBasisPolynomials()

The linear_combination function returns a function that computes the weighted sum of the basis functions. Now we can define the error on a dataset, as well as for a single point

def total_error(weights: Coefficients, data: Dataset) -> float:
    hypothesis = basis.linear_combination(weights)
    return sum(
        (actual_output - hypothesis(example)) ** 2
        for (example, actual_output) in data

def single_point_error(
        weights: Coefficients, point: Tuple[Input, float]) -> float:
    return point[1] - basis.linear_combination(weights)(point[0])

We can then define the gradient of the error function with respect to the weights and a single data point. Recall, the error function is defined as

\displaystyle E(\mathbf{w}) = \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

where \hat{f} is a linear combination of basis functions

\hat{f}(\mathbf{x}_j) = \sum_{s=1}^n w_s f_s(\mathbf{x}_j)

Since we’ll do stochastic gradient descent, the error formula is a bit simpler. We compute it not for the whole data set but only a single random point at a time. So the error is

\displaystyle E(\mathbf{w}) = (y_j - \hat{f}(\mathbf{x}_j))^2

Then we compute the gradient with respect to the individual entries of \mathbf{w}, using the chain rule and noting that the only term of the linear combination that has a nonzero contribution to the gradient for \frac{\partial E}{\partial w_i} is the term containing w_i. This is one of the major benefits of using linear combinations: the gradient computation is easy.

\displaystyle \frac{\partial E}{\partial w_i} = -2 (y_j - \hat{f}(\mathbf{x}_j)) \frac{\partial \hat{f}}{\partial w_i}(\mathbf{x}_j) = -2 (y_j - \hat{f}(\mathbf{x}_j)) f_i(\mathbf{x}_j)

Another advantage to being linear is that this formula is agnostic to the content of the underlying basis functions. This will hold so long as the weights don’t show up in the formula for the basis functions. As an exercise: try changing the implementation to use radial basis functions around each data point. (see the note at the end for why this would be problematic in real life)

def gradient(weights: Coefficients, data_point: Tuple[Input, float]) -> Gradient:
    error = single_point_error(weights, data_point)
    dE_dw = [0] * len(weights)

    for i, w in enumerate(weights):
        dE_dw[i] = -2 * error * basis[i](data_point[0])

    return dE_dw

Finally, the gradient descent core with a debugging helper.

import random

def print_debug_info(step, grad_norm, error, progress):
    print(f"{step}, {progress:.4f}, {error:.4f}, {grad_norm:.4f}")

def gradient_descent(
        data: Dataset,
        learning_rate: float,
        tolerance: float,
        training_callback = None,
) -> Hypothesis:
    weights = [random.random() * 2 - 1 for i in range(len(basis))]
    last_error = total_error(weights, data)
    step = 0
    progress = tolerance * 2
    grad_norm = 1.0

    if training_callback:
        training_callback(step, 0.0, last_error, 0.0)

    while abs(progress) > tolerance or grad_norm > tolerance:
        grad = gradient(weights, random.choice(data))
        grad_norm = sum(x**2 for x in grad)
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad[i]

        error = total_error(weights, data)
        progress = error - last_error
        last_error = error
        step += 1

        if training_callback:
            training_callback(step, grad_norm, error, progress)

    return basis.linear_combination(weights)

Next create some sample data and run the optimization

def example_quadratic_data(num_points: int):
    def fn(x, y, z):
        return 2 - 4*x*y + z + z**2

    data = []
    for i in range(num_points):
        x, y, z = random.random(), random.random(), random.random()
        data.append(((x, y, z), fn(x, y, z)))

    return data

if __name__ == "__main__":
    data = example_quadratic_data(30)

Depending on the randomness, it may take a few thousand steps, but it typically converges to an error of < 1. Here’s the plot of error against gradient descent steps.

Gradient descent showing log(total error) vs number of steps, for the quadratic regression problem.

Kernels and Regularization

I’ll finish with explanations of the parentheticals above.

The real polynomial kernel. We chose a simple set of polynomial functions. This is closely related to the concept of a “kernel”, but the “real” polynomial kernel uses slightly different basis functions. It scales some of the basis functions by \sqrt{2}. This is OK because a linear combination can compensate by using coefficients that are appropriately divided by \sqrt{2}. But why would one want to do this? The answer boils down to a computational efficiency technique called the “Kernel trick.” In short, it allows you to compute the dot product between two linear combinations of vectors in this vector space without explicitly representing the vectors in the space to begin with. If your regression algorithm uses only dot products in its code (as is true of the closed form solution for regression), you get the benefits of nonlinear feature modeling without the cost of computing the features directly. There’s a lot more mathematical theory to discuss here (cf. Reproducing Kernel Hilbert Space) but I’ll have to leave it there for now.

What’s wrong with the radial basis function exercise? This exercise asked you to create a family of basis functions, one for each data point. The problem here is that having so many basis functions makes the linear combination space too expressive. The optimization will overfit the data. It’s like a lookup table: there’s one entry dedicated to each data point. New data points not in the training would be rarely handled well, since they aren’t in the “lookup table” the optimization algorithm found. To get around this, in practice one would add an extra term to the error corresponding to the L1 or L2 norm of the weight vector. This allows one to ensure that the total size of the weights is small, and in the L1 case that usually corresponds to most weights being zero, and only a few weights (the most important) being nonzero. The process of penalizing the “magnitude” of the linear combination is called regularization.

Parameterizing the Vertex Cover Problem

I’m presenting a paper later this week at the Matheamtical Foundations of Computer Science 2014 in Budapest, Hungary. This conference is an interesting mix of logic and algorithms that aims to bring together researchers from these areas to discuss their work. And right away the first session on the first day focused on an area I know is important but have little experience with: fixed parameter complexity. From what I understand it’s not that popular of a topic at major theory conferences in the US (there appears to be only one paper on it at this year’s FOCS conference), but the basic ideas are worth knowing.

The basic idea is pretty simple: some hard computational problems become easier (read, polynomial-time solvable) if you fix some parameters involved to constants. Preferably small constants. For example, finding cliques of size k in a graph is NP-hard if k is a parameter, but if you fix k to a constant then you can check all possible subsets of size k in O(n^k) time. This is kind of a silly example because there are much faster ways to find triangles than checking all O(n^3) subsets of vertices, but part of the point of fixed-parameter complexity is to find the fastest algorithms in these fixed-parameter settings. Since in practice parameters are often small [citation needed], this analysis can provide useful practical algorithmic alternatives to heuristics or approximate solutions.

One important tool in the theory of fixed-parameter tractability is the idea of a kernel. I think it’s an unfortunate term because it’s massively overloaded in mathematics, but the idea is to take a problem instance with the parameter k, and carve out “easy” regions of the instance (often reducing k as you go) until the runtime of the trivial brute force algorithm only depends on k and not on the size of the input. The point is that the solution you get on this “carved out” instance is either the same as the original, or can be extended back to the original with little extra work. There is a more formal definition we’ll state, but there is a canonical example that gives a great illustration.

Consider the vertex cover problem. That is, you give me a graph G = (V,E) and a number k and I have to determine if there is a subset of \leq k vertices of G that touch all of the edges in E. This problem is fixed-parameter tractable because, as with k-clique one can just check all subsets of size k. The kernel approach we’ll show now is much smarter.

What you do is the following. As long as your graph has a vertex of degree > k, you remove it and reduce k by 1. This is because a vertex of degree > k will always be chosen for a vertex cover. If it’s not, then you need to include all of its neighbors to cover its edges, but there are > k neighbors and your vertex cover is constrained by size k. And so you can automatically put this high-degree vertex in your cover, and use induction on the smaller graph.

Once you can’t remove any more vertices there are two cases. In the case that there are more than k^2 edges, you output that there is no vertex cover. Indeed, if you only get k vertices in your cover and you removed all vertices of degree > k, then each can cover at most k edges, giving a total of at most k^2. Otherwise, if there are at most k^2 edges, then you can remove all the isolated vertices and show that there are only \leq 2k^2 vertices left. This is because each edge touches only two vertices, so in the worst case they’re all distinct. This smaller subgraph is called a kernel of the vertex cover, and the fact that its size depends only on k is the key. So you can look at all 2^{2k^2} = O(1) subsets to determine if there’s a cover of the size you want. If you find a cover of the kernel, you add back in all the large-degree vertices you deleted and you’re done.

Now, even for small k this is a pretty bad algorithm (k=5 gives 2^{50} subsets to inspect), but with more detailed analysis you can do significantly better. In particular, the best known bound reduces vertex cover to a kernel of size 2k - c \log(k) vertices for any constant c you specify. Getting \log(k) vertices is known to imply P = NP, and with more detailed complexity assumptions it’s even hard to get a graph with fewer than O(k^{2-\varepsilon}) edges for any \varepsilon > 0. These are all relatively recent results whose associated papers I have not read.

Even with these hardness results, there are two reasons why this kind of analysis is useful. The first is that it gives us a clearer picture of the complexity of these problems. In particular, the reduction we showed for vertex cover gives a time O(2^{2k^2} + n + m)-time algorithm, which you can then compare directly to the trivial O(n^k) time brute force algorithm and measure the difference. Indeed, if k = o(\sqrt{(k/2) log(n)}) then the kernelized approach is faster.

The second reason is that the kernel approach usually results in simple and quick checks for negative answers to a problem. In particular, if you want to check for k-sized set covers in a graph in the real world, this analysis shows that the first thing you should do is check if the kernel has size > k^2. If so, you can immediately give a “no” answer. So useful kernels can provide insight into the structure of a problem that can be turned into heuristic tools even when it doesn’t help you solve the problem exactly.

So now let’s just see the prevailing definition of a “kernelization” of a problem. This comes from the text of Downey and Fellows.

Definition: kernelization of a parameterized problem L (formally, a language where each string x is paired with a positive integer k) is a \textup{poly}(|x|, k)-time algorithm that converts instances (x,k) into instances (x', k') with the following three properties.

  • (x,k) is a yes instance of L if and only if (x', k') is.
  • |x'| \leq f(k) for some computable function f: \mathbb{N} \to \mathbb{N}.
  • k' \leq g(k) for some computable function g: \mathbb{N} \to \mathbb{N}.

The output (x', k') is called a kernel, and the problem is said to admit a polynomial kernel if f(k) = O(k^c) for some constant c.

So we showed that vertex cover admits a polynomial kernel (in fact, a quadratic one).

Now the nice theorem is that a problem is fixed-parameter tractable if and only if it admits a polynomial kernel. Finding a kernel is conceptually easier because, like in vertex cover, it allows you to introduce additional assumptions on the structure of the instances you’re working with. But more importantly from a theoretical standpoint, measuring the size and complexity of kernels for NP-hard problems gives us a way to discriminate among problems within NP. That and the chance to get some more practical tools for NP-hard problems makes parameterized complexity more interesting than it sounds at first.

Until next time!