Regression and Linear Combinations

Recently I’ve been helping out with a linear algebra course organized by Tai-Danae Bradley and Jack Hidary, and one of the questions that came up a few times was, “why should programmers care about the concept of a linear combination?”

For those who don’t know, given vectors $ v_1, \dots, v_n$, a linear combination of the vectors is a choice of some coefficients $ a_i$ with which to weight the vectors in a sum $ v = \sum_{i=1}^n a_i v_i$.

I must admit, math books do a poor job of painting the concept as more than theoretical—perhaps linear combinations are only needed for proofs, while the real meat is in matrix multiplication and cross products. But no, linear combinations truly lie at the heart of many practical applications.

In some cases, the entire goal of an algorithm is to find a “useful” linear combination of a set of vectors. The vectors are the building blocks (often a vector space or subspace basis), and the set of linear combinations are the legal ways to combine the blocks. Simpler blocks admit easier and more efficient algorithms, but their linear combinations are less expressive. Hence, a tradeoff.

A concrete example is regression. Most people think of regression in terms of linear regression. You’re looking for a linear function like $ y = mx+b$ that approximates some data well. For multiple variables, you have, e.g., $ \mathbf{x} = (x_1, x_2, x_3)$ as a vector of input variables, and $ \mathbf{w} = (w_1, w_2, w_3)$ as a vector of weights, and the function is $ y = \mathbf{w}^T \mathbf{x} + b$.

To avoid the shift by $ b$ (which makes the function affine instead of purely linear; formulas of purely linear functions are easier to work with because the shift is like a pesky special case you have to constantly account for), authors often add a fake input variable $ x_0$ which is always fixed to 1, and relabel $ b$ as $ w_0$ to get $ y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i$ as the final form. The optimization problem to solve becomes the following, where your data set to approximate is $ \{ \mathbf{x}_1, \dots, \mathbf{x}_k \}$.

$ \displaystyle \min_w \sum_{i=1}^k (y_i – \mathbf{w}^T \mathbf{x}_i)^2$

In this case, the function being learned—the output of the regression—doesn’t look like a linear combination. Technically it is, just not in an interesting way.

It becomes more obviously related to linear combinations when you try to model non-linearity. The idea is to define a class of functions called basis functions $ B = \{ f_1, \dots, f_m \mid f_i: \mathbb{R}^n \to \mathbb{R} \}$, and allow your approximation to be any linear combination of functions in $ B$, i.e., any function in the span of B.

$ \displaystyle \hat{f}(\mathbf{x}) = \sum_{i=1}^m w_i f_i(\mathbf{x})$

Again, instead of weighting each coordinate of the input vector with a $ w_i$, we’re weighting each basis function’s contribution (when given the whole input vector) to the output. If the basis functions were to output a single coordinate ($ f_i(\mathbf{x}) = x_i$), we would be back to linear regression.

Then the optimization problem is to choose the weights to minimize the error of the approximation.

$ \displaystyle \min_w \sum_{j=1}^k (y_j – \hat{f}(\mathbf{x}_j))^2$

As an example, let’s say that we wanted to do regression with a basis of quadratic polynomials. Our basis for three input variables might look like

$ \displaystyle \{ 1, x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3, x_1^2, x_2^2, x_3^2 \}$

Any quadratic polynomial in three variables can be written as a linear combination of these basis functions. Also note that if we treat this as the basis of a vector space, then a vector is a tuple of 10 numbers—the ten coefficients in the polynomial. It’s the same as $ \mathbb{R}^{10}$, just with a different interpretation of what the vector’s entries mean. With that, we can see how we would compute dot products, projections, and other nice things, though they may not have quite the same geometric sensibility.

These are not the usual basis functions used for polynomial regression in practice (see the note at the end of this article), but we can already do some damage in writing regression algorithms.

A simple stochastic gradient descent

Although there is a closed form solution to many regression problems (including the quadratic regression problem, though with a slight twist), gradient descent is a simple enough solution to showcase how an optimization solver can find a useful linear combination. This code will be written in Python 3.9. It’s on Github.

First we start with some helpful type aliases

from typing import Callable, Tuple, List

Input = Tuple[float, float, float]
Coefficients = List[float]
Gradient = List[float]
Hypothesis = Callable[[Input], float]
Dataset = List[Tuple[Input, float]]

Then define a simple wrapper class for our basis functions

class QuadraticBasisPolynomials:
    def __init__(self):
        self.basis_functions = [
            lambda x: 1,
            lambda x: x[0],
            lambda x: x[1],
            lambda x: x[2],
            lambda x: x[0] * x[1],
            lambda x: x[0] * x[2],
            lambda x: x[1] * x[2],
            lambda x: x[0] * x[0],
            lambda x: x[1] * x[1],
            lambda x: x[2] * x[2],
        ]

    def __getitem__(self, index):
        return self.basis_functions[index]

    def __len__(self):
        return len(self.basis_functions)

    def linear_combination(self, weights: Coefficients) -> Hypothesis:
        def combined_function(x: Input) -> float:
            return sum(
                w * f(x)
                for (w, f) in zip(weights, self.basis_functions)
            )

        return combined_function

basis = QuadraticBasisPolynomials()

The linear_combination function returns a function that computes the weighted sum of the basis functions. Now we can define the error on a dataset, as well as for a single point

def total_error(weights: Coefficients, data: Dataset) -> float:
    hypothesis = basis.linear_combination(weights)
    return sum(
        (actual_output - hypothesis(example)) ** 2
        for (example, actual_output) in data
    )


def single_point_error(
        weights: Coefficients, point: Tuple[Input, float]) -> float:
    return point[1] - basis.linear_combination(weights)(point[0])

We can then define the gradient of the error function with respect to the weights and a single data point. Recall, the error function is defined as

$ \displaystyle E(\mathbf{w}) = \sum_{j=1}^k (y_j – \hat{f}(\mathbf{x}_j))^2$

where $ \hat{f}$ is a linear combination of basis functions

$ \hat{f}(\mathbf{x}_j) = \sum_{s=1}^n w_s f_s(\mathbf{x}_j)$

Since we’ll do stochastic gradient descent, the error formula is a bit simpler. We compute it not for the whole data set but only a single random point at a time. So the error is

$ \displaystyle E(\mathbf{w}) = (y_j – \hat{f}(\mathbf{x}_j))^2$

Then we compute the gradient with respect to the individual entries of $ \mathbf{w}$, using the chain rule and noting that the only term of the linear combination that has a nonzero contribution to the gradient for $ \frac{\partial E}{\partial w_i}$ is the term containing $ w_i$. This is one of the major benefits of using linear combinations: the gradient computation is easy.

$ \displaystyle \frac{\partial E}{\partial w_i} = -2 (y_j – \hat{f}(\mathbf{x}_j)) \frac{\partial \hat{f}}{\partial w_i}(\mathbf{x}_j) = -2 (y_j – \hat{f}(\mathbf{x}_j)) f_i(\mathbf{x}_j)$

Another advantage to being linear is that this formula is agnostic to the content of the underlying basis functions. This will hold so long as the weights don’t show up in the formula for the basis functions. As an exercise: try changing the implementation to use radial basis functions around each data point. (see the note at the end for why this would be problematic in real life)

def gradient(weights: Coefficients, data_point: Tuple[Input, float]) -> Gradient:
    error = single_point_error(weights, data_point)
    dE_dw = [0] * len(weights)

    for i, w in enumerate(weights):
        dE_dw[i] = -2 * error * basis[i](data_point[0])

    return dE_dw

Finally, the gradient descent core with a debugging helper.

import random

def print_debug_info(step, grad_norm, error, progress):
    print(f"{step}, {progress:.4f}, {error:.4f}, {grad_norm:.4f}")


def gradient_descent(
        data: Dataset,
        learning_rate: float,
        tolerance: float,
        training_callback = None,
) -> Hypothesis:
    weights = [random.random() * 2 - 1 for i in range(len(basis))]
    last_error = total_error(weights, data)
    step = 0
    progress = tolerance * 2
    grad_norm = 1.0

    if training_callback:
        training_callback(step, 0.0, last_error, 0.0)

    while abs(progress) > tolerance or grad_norm > tolerance:
        grad = gradient(weights, random.choice(data))
        grad_norm = sum(x**2 for x in grad)
        
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad[i]

        error = total_error(weights, data)
        progress = error - last_error
        last_error = error
        step += 1

        if training_callback:
            training_callback(step, grad_norm, error, progress)

    return basis.linear_combination(weights)

Next create some sample data and run the optimization

def example_quadratic_data(num_points: int):
    def fn(x, y, z):
        return 2 - 4*x*y + z + z**2

    data = []
    for i in range(num_points):
        x, y, z = random.random(), random.random(), random.random()
        data.append(((x, y, z), fn(x, y, z)))

    return data


if __name__ == "__main__":
    data = example_quadratic_data(30)
    gradient_descent(
        data, 
        learning_rate=0.01, 
        tolerance=1e-06, 
        training_callback=print_debug_info
    )

Depending on the randomness, it may take a few thousand steps, but it typically converges to an error of < 1. Here’s the plot of error against gradient descent steps.

Gradient descent showing log(total error) vs number of steps, for the quadratic regression problem.

Kernels and Regularization

I’ll finish with explanations of the parentheticals above.

The real polynomial kernel. We chose a simple set of polynomial functions. This is closely related to the concept of a “kernel”, but the “real” polynomial kernel uses slightly different basis functions. It scales some of the basis functions by $ \sqrt{2}$. This is OK because a linear combination can compensate by using coefficients that are appropriately divided by $ \sqrt{2}$. But why would one want to do this? The answer boils down to a computational efficiency technique called the “Kernel trick.” In short, it allows you to compute the dot product between two linear combinations of vectors in this vector space without explicitly representing the vectors in the space to begin with. If your regression algorithm uses only dot products in its code (as is true of the closed form solution for regression), you get the benefits of nonlinear feature modeling without the cost of computing the features directly. There’s a lot more mathematical theory to discuss here (cf. Reproducing Kernel Hilbert Space) but I’ll have to leave it there for now.

What’s wrong with the radial basis function exercise? This exercise asked you to create a family of basis functions, one for each data point. The problem here is that having so many basis functions makes the linear combination space too expressive. The optimization will overfit the data. It’s like a lookup table: there’s one entry dedicated to each data point. New data points not in the training would be rarely handled well, since they aren’t in the “lookup table” the optimization algorithm found. To get around this, in practice one would add an extra term to the error corresponding to the L1 or L2 norm of the weight vector. This allows one to ensure that the total size of the weights is small, and in the L1 case that usually corresponds to most weights being zero, and only a few weights (the most important) being nonzero. The process of penalizing the “magnitude” of the linear combination is called regularization.

Linear Regression

Machine learning is broadly split into two camps, statistical learning and non-statistical learning. The latter we’ve started to get a good picture of on this blog; we approached Perceptrons, decision trees, and neural networks from a non-statistical perspective. And generally “statistical” learning is just that, a perspective. Data is phrased in terms of independent and dependent variables, and statistical techniques are leveraged against the data. In this post we’ll focus on the simplest example of this, linear regression, and in the sequel see it applied to various learning problems.

As usual, all of the code presented in this post is available on this blog’s Github page.

The Linear Model, in Two Variables

And so given a data set we start by splitting it into independent variables and dependent variables. For this section, we’ll focus on the case of two variables, $ X, Y$. Here, if we want to be completely formal, $ X,Y$ are real-valued random variables on the same probability space (see our primer on probability theory to keep up with this sort of terminology, but we won’t rely on it heavily in this post), and we choose one of them, say $ X$, to be the independent variable and the other, say $ Y$, to be the dependent variable. All that means in is that we are assuming there is a relationship between $ X$ and $ Y$, and that we intend to use the value of $ X$ to predict the value of $ Y$. Perhaps a more computer-sciencey terminology would be to call the variables features and have input features and output features, but we will try to stick to the statistical terminology.

As a quick example, our sample space might be the set of all people, $ X$ could be age, and $ Y$ could be height. Then by calling age “independent,” we’re asserting that we’re trying to use age to predict height.

One of the strongest mainstays of statistics is the linear model. That is, when there aren’t any known relationships among the observed data, the simplest possible relationship one could discover is a linear one. A change in $ X$ corresponds to a proportional change in $ Y$, and so one could hope there exist constants $ a,b$ so that (as random variables) $ Y = aX + b$.  If this were the case then we could just draw many pairs of sample values of $ X$ and $ Y$, and try to estimate the value of $ a$ and $ b$.

If the data actually lies on a line, then two sample points will be enough to get a perfect prediction. Of course, nothing is exact outside of mathematics. And so if we were to use data coming from the real world, and even if we were to somehow produce some constants $ a, b$, our “predictor” would almost always be off by a bit. In the diagram below, where it’s clear that the relationship between the variables is linear, only a small fraction of the data points appear to lie on the line itself.

An example of a linear model for a set of points (credit Wikipedia).

In such scenarios it would be hopelessly foolish to wish for a perfect predictor, and so instead we wish to summarize the trends in the data using a simple description mechanism. In this case, that mechanism is a line. Now the computation required to find the “best” coefficients of the line is quite straightforward once we pick a suitable notion of what “best” means.

Now suppose that we call our (presently unknown) prediction function $ \hat{f}$. We often call the function we’re producing as a result of our learning algorithm the hypothesis, but in this case we’ll stick to calling it a prediction function. If we’re given a data point $ (x,y)$ where $ x$ is a value of $ X$ and $ y$ of $ Y$, then the error of our predictor on this example is $ |y – \hat{f}(x)|$. Geometrically this is the vertical distance from the actual $ y$ value to our prediction for the same $ x$, and so we’d like to minimize this error. Indeed, we’d like to minimize the sum of all the errors of our linear predictor over all data points we see. We’ll describe this in more detail momentarily.

The word “minimize” might evoke long suppressed memories of torturous Calculus lessons, and indeed we will use elementary Calculus to find the optimal linear predictor. But one small catch is that our error function, being an absolute value, is not differentiable! To mend this we observe that minimizing the absolute value of a number is the same as minimizing the square of a number. In fact, $ |x| = \sqrt(x^2)$, and the square root function and its inverse are both increasing functions; they preserve minima of sets of nonnegative numbers.  So we can describe our error as $ (y – \hat{f}(x))^2$, and use calculus to our heart’s content.

To explicitly formalize the problem, given a set of data points $ (x_i, y_i)_{i=1}^n$ and a potential prediction line $ \hat{f}(x) = ax + b$, we define the error of $ \hat{f}$ on the examples to be

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – \hat{f}(x_i))^2$

Which can also be written as

$ \displaystyle S(a,b) = \sum_{i=1}^n (y_i – ax_i – b)^2$

Note that since we’re fixing our data sample, the function $ S$ is purely a function of the variables $ a,b$. Now we want to minimize this quantity with respect to $ a,b$, so we can take a gradient,

$ \displaystyle \frac{\partial S}{\partial a} = -2 \sum_{i=1}^n (y_i – ax_i – b) x_i$

$ \displaystyle \frac{\partial S}{\partial b} = -2 \sum_{i=1}^n (y_i -ax_i – b)$

and set them simultaneously equal to zero. In the first we solve for $ b$:

$ \displaystyle 0 = -2 \sum_{i=1}^n y_i – ax_i – b = -2 \left (-nb + \sum_{i=1}^n y_i – ax_i \right )$

$ \displaystyle b = \frac{1}{n} \sum_{i=1}^n y_i – ax_i$

If we denote by $ x_{\textup{avg}} = \frac{1}{n} \sum_i x_i$ this is just $ b = y_{\textup{avg}} – ax_{\textup{avg}}$. Substituting $ b$ into the other equation we get

$ \displaystyle -2 \sum_{i=1}^n (y_ix_i – ax_i^2 – y_{\textup{avg}}x_i – ax_{\textup{avg}}x_i ) = 0$

Which, by factoring out $ a$, further simplifies to

$ \displaystyle 0 = \sum_{i=1}^n y_ix_i – y_{\textup{avg}}x_i – a \sum_{i=1}^n (x_i^2 – x_{\textup{avg}}x_i)$

And so

$ \displaystyle a = \frac{\sum_{i=1}^n (y_i – y_{\textup{avg}})x_i }{\sum_{i=1}^n(x_i – x_{\textup{avg}})x_i}$

And it’s not hard to see (by taking second partials, if you wish) that this corresponds to a minimum of the error function. This closed form gives us an immediate algorithm to compute the optimal linear estimator. In Python,

avg = lambda L: 1.0* sum(L)/len(L)

def bestLinearEstimator(points):
   xAvg, yAvg = map(avg, zip(*points))

   aNum = 0
   aDenom = 0
   for (x,y) in points:
      aNum += (y - yAvg) * x
      aDenom += (x - xAvg) * x

   a = float(aNum) / aDenom
   b = yAvg - a * xAvg
   return (a, b), lambda x: a*x + b

and a quick example of its use on synthetic data points:

>>> import random
>>> a = 0.5
>>> b = 7.0
>>> points = [(x, a*x + b + (random.random() * 0.4 - 0.2)) for x in [random.random() * 10 for _ in range(100)]]
>>> bestLinearEstimator(points)[0]
(0.49649543577814137, 6.993035962110321)

Many Variables and Matrix Form

If we take those two variables $ x,y$ and tinker with them a bit, we can represent the solution to our regression problem in a different (a priori strange) way in terms of matrix multiplication.

First, we’ll transform the prediction function into matrixy style. We add in an extra variable $ x_0$ which we force to be 1, and then we can write our prediction line in a vector form as $ \mathbf{w} = (a,b)$. What is the benefit of such an awkward maneuver? It allows us to write the evaluation of our prediction function as a dot product

$ \displaystyle \hat{f}(x_0, x) = \left \langle (x_0, x), (b, a) \right \rangle = x_0b + ax = ax + b$

Now the notation is starting to get quite ugly, so let’s rename the coefficients of our line $ \mathbf{w} = (w_0, w_1)$, and the coefficients of the input data $ \mathbf{x} = (x_0, x_1)$. The output is still $ y$. Here we understand implicitly that the indices line up: if $ w_0$ is the constant term, then that makes $ x_0 = 1$ our extra variable (often called a bias variable by statistically minded folks), and $ x_1$ is the linear term with coefficient $ w_1$. Now we can just write the prediction function as

$ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$

We still haven’t really seen the benefit of this vector notation (and we won’t see it’s true power until we extend this to kernel ridge regression in the next post), but we do have at least one additional notational convenience: we can add arbitrarily many input variables without changing our notation.

If we expand our horizons to think of the random variable $ Y$ depending on the $ n$ random variables $ X_1, \dots, X_n$, then our data will come in tuples of the form $ (\mathbf{x}, y) = ((x_0, x_1, \dots, x_n), y)$, where again the $ x_0$ is fixed to 1. Then expanding our line $ \mathbf{w} = (w_0 , \dots, w_n)$, our evaluation function is still $ \hat{f}(\mathbf{x}) = \left \langle \mathbf{w}, \mathbf{x} \right \rangle$. Excellent.

Now we can write our error function using the same style of compact notation. In this case, we will store all of our input data points $ \mathbf{x}_j$ as rows of a matrix $ X$ and the output values $ y_j$ as entries of a vector $ \mathbf{y}$. Forgetting the boldface notation and just understanding everything as a vector or matrix, we can write the deviation of the predictor (on all the data points) from the true values as

$ y – Xw$

Indeed, each entry of the vector $ Xw$ is a dot product of a row of $ X$ (an input data point) with the coefficients of the line $ w$. It’s just $ \hat{f}$ applied to all the input data and stored as the entries of a vector. We still have the sign issue we did before, and so we can just take the square norm of the result and get the same effect as before:

$ \displaystyle S(w) = \| y – Xw \|^2$

This is just taking a dot product of $ y-Xw$ with itself. This form is awkward to differentiate because the variable $ w$ is nested in the norm. Luckily, we can get the same result by viewing $ y – Xw$ as a 1-by-$ n$ matrix, transposing it, and multiplying by $ y-Xw$.

$ \displaystyle S(w) = (y – Xw)^{\textup{T}}(y-Xw) = y^{\textup{T}}y -2w^{\textup{T}}X^{\textup{T}}y + w^{\textup{T}}X^{\textup{T}}Xw$

This notation is widely used, in particular because we have nice formulas for calculus on such forms. And so we can compute a gradient of $ S$ with respect to each of the $ w_i$ variables in $ w$ at the same time, and express the result as a vector. This is what taking a “partial derivative” with respect to a vector means: we just represent the system of partial derivates with respect to each entry as a vector. In this case, and using formula 61 from page 9 and formula 120 on page 13 of The Matrix Cookbook, we get

$ \displaystyle \frac{\partial S}{\partial w} = -2X^{\textup{T}}y + 2X^{\textup{T}}Xw$

Indeed, it’s quite trivial to prove the latter formula, that for any vector $ x$, the partial $ \frac{\partial x^{\textup{T}}x}{\partial x} = 2x$. If the reader feels uncomfortable with this, we suggest taking the time to unpack the notation (which we admittedly just spent so long packing) and take a classical derivative entry-by-entry.

Solving the above quantity for $ w$ gives $ w = (X^{\textup{T}}X)^{-1}X^{\textup{T}}y$, assuming the inverse of $ X^{\textup{T}}X$ exists. Again, we’ll spare the details proving that this is a minimum of the error function, but inspecting second derivatives provides this.

Now we can have a slightly more complicated program to compute the linear estimator for one input variable and many output variables. It’s “more complicated” in that much more mathematics is happening behind the code, but just admire the brevity!

from numpy import array, dot, transpose
from numpy.linalg import inv

def bestLinearEstimatorMV(points):
   # input points are n+1 tuples of n inputs and 1 output
   X = array([[1] + list(p[:-1]) for p in points]) # add bias as x_0
   y = array([p[-1] for p in points])

   Xt = transpose(X)
   theInverse = inv(dot(Xt, X))
   w = dot(dot(theInverse, Xt), y)
   return w, lambda x: dot(w, x)

Here are some examples of its use. First we check consistency by verifying that it agrees with the test used in the two-variable case (note the reordering of the variables):

>>> print(bestLinearEstimatorMV(points)[0])
[ 6.97687136  0.50284939]

And a more complicated example:

>>> trueW = array([-3,1,2,3,4,5])
>>> bias, linearTerms = trueW[0], trueW[1:]
>>> points = [tuple(v) + (dot(linearTerms, v) + bias + noise(),) for v in [numpy.random.random(5) for _ in range(100)]]
>>> print(bestLinearEstimatorMV(points)[0])
[-3.02698484  1.03984389  2.01999929  3.0046756   4.01240348  4.99515123]

As a quick reminder, all of the code used in this post is available on this blog’s Github page.

Bias and Variance

There is a deeper explanation of the linear model we’ve been studying. In particular, there is a general technique in statistics called maximum likelihood estimation. And, to be as concise as possible, the linear regression formulas we’ve derived above provide the maximum likelihood estimator for a line with symmetric “Gaussian noise.” Rather than go into maximum likelihood estimation in general, we’ll just describe what it means to be a “line with Gaussian noise,” and measure the linear model’s bias and variance with respect to such a model. We saw this very briefly in the test cases for the code in the past two sections. Just a quick warning: the proofs we present in this section will use the notation and propositions of basic probability theory we’ve discussed on this blog before.

So what we’ve done so far in this post is describe a computational process that accepts as input some points and produces as output a line. We have said nothing about the quality of the line, and indeed we cannot say anything about its quality without some assumptions on how the data was generated.  In usual statistical fashion, we will assume that the true data is being generated by an actual line, but with some added noise.

Specifically, let’s return to the case of two random variables $ X,Y$. If we assume that $ Y$ is perfectly determined by $ X$ via some linear equation $ Y = aX + b$, then as we already mentioned we can produce a perfect estimator using a mere two examples. On the other hand, what if every time we take a new $ x$ example, its corresponding $ y$ value is perturbed by some random coin flip (flipped at the time the example is produced)? Then the value of $ y_i$ would be $ y_i = ax_i + b + \eta_i$, and we say all the $ \eta_i$ are drawn independently and uniformly at random from the set $ \left \{ -1,1 \right \}$. In other words, with probability 1/2 we get -1, and otherwise 1, and none of the $ \eta_i$ depend on each other. In fact, we just want to make the blanket assumption that the noise doesn’t depend on anything (not the data drawn, the method we’re using to solve the problem, what our favorite color is…). In the notation of random variables, we’d call $ H$ the random variable producing the noise (in Greek $ H$ is the capital letter for $ \eta$), and write $ Y = aX + b + H$.

More realistically, the noise isn’t chosen uniformly from $ \pm 1$, but is rather chosen to be Gaussian with mean $ 0$ and some variance $ \sigma$. We’d denote this by $ \eta_i \sim N(\mu, \sigma)$, and say the $ \eta_i$ are drawn independently from this normal distribution. If the reader is uncomfortable with Gaussian noise (it’s certainly a nontrivial problem to generate it computationally), just stick to the noise we defined in the previous paragraph. For the purpose of this post, any symmetric noise will result in the same analysis (and the code samples above use uniform noise over an interval anyway).

Moving back to the case of many variables, we assume our data points $ y$ are given by $ y = Xw + H$ where $ X$ is the observed data and $ H$ is Gaussian noise with mean zero and some (unknown) standard deviation $ \sigma$. Then if we call $ \hat{w}$ our predicted linear coefficients (randomly depending on which samples are drawn), then its expected value conditioned on the data is

$ \displaystyle \textup{E}(\hat{w} | X) = \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}y | X)$

Replacing $ y$ by $ Xw + H$,

$ \displaystyle \begin{array} {lcl} \textup{E}(\hat{w} | X) & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}(Xw + H) | X) \\ & = & \textup{E}((X^{\textup{T}}X)^{-1}X^{\textup{T}}Xw + (X^{\textup{T}}X)^{-1}X^{\textup{T}}H | X) \end{array}$

Notice that the first term is a fat matrix ($ X^{\textup{T}}X$) multiplied by its own inverse, so that cancels to 1. By linearity of expectation, we can split the resulting expression up as

$ \textup{E}(w | X) + (X^{\textup{T}}X)^{-1}X^{\textup{T}}\textup{E}(H | X)$

but $ w$ is constant (so its expected value is just itself) and $ \textup{E}(H | X) = 0$ by assumption that the noise is symmetric. So then the expected value of $ \hat{w}$ is just $ w$. Because this is true for all choices of data $ X$, the bias of our estimator is zero.

The question of variance is a bit trickier, because the variance of the entries of $ \hat{w}$ actually do depend on which samples are drawn. Briefly, to compute the covariance matrix of the $ w_i$ as variables depending on $ X$, we apply the definition:

$ \textup{Var}(\hat{w} | X) = \textup{E}(\| w – \textup{E}(w) \|^2 | X)$

And after some tedious expanding and rewriting and recalling that the covariance matrix of $ H$ is just the diagonal matrix $ \sigma^2 I_n$, we get that

$ \textup{Var}(\hat{w} | X) = \sigma^2 (X^{\textup{T}}X)^{-1}$

This means that if we get unlucky and draw some sample which makes some entries of $ (X^{\textup{T}}X)^{-1}$ big, then our estimator will vary a lot from the truth. This can happen for a variety of reasons, one of which is including irrelevant input variables in the computation. Unfortunately a deeper discussion of the statistical issues that arise when trying to make hypotheses in such situations. However, the concept of a bias-variance tradeoff is quite relevant. As we’ll see next time, a technique called ridge-regression sacrifices some bias in this standard linear regression model in order to dampen the variance. Moreover, a “kernel trick” allows us to make non-linear predictions, turning this simple model for linear estimation into a very powerful learning tool.

Until then!