The standard inner product of two vectors has some nice geometric properties. Given two vectors , where by I mean the -th coordinate of , the standard inner product (which I will interchangeably call the dot product) is defined by the formula

This formula, simple as it is, produces a lot of interesting geometry. An important such property, one which is discussed in machine learning circles more than pure math, is that it is a very convenient decision rule.

In particular, say we’re in the Euclidean plane, and we have a line passing through the origin, with being a unit vector perpendicular to (“the normal” to the line).

If you take any vector , then the dot product is positive if is on the same side of as , and negative otherwise. The dot product is zero if and only if is exactly on the line , including when is the zero vector.

Here is an interactive demonstration of this property. Click the image below to go to the demo, and you can drag the vector arrowheads and see the decision rule change.

The code for this demo is available in a github repository.

It’s always curious, at first, that multiplying and summing produces such geometry. Why should this seemingly trivial arithmetic do anything useful at all?

The core fact that makes it work, however, is that the dot product tells you how one vector *projects* onto another. When I say “projecting” a vector onto another vector , I mean you take only the components of that point in the direction of . The demo shows what the result looks like using the red (or green) vector.

In two dimensions this is easy to see, as you can draw the triangle which has as the hypotenuse, with spanning one of the two legs of the triangle as follows:

If we call the (vector) leg of the triangle parallel to , while is the dotted line (as a vector, parallel to ), then as vectors . The projection of onto is just .

Another way to think of this is that the projection is , modified by removing any part of that is perpendicular to . Using some colorful language: you put your hands on either side of and , and then you squish onto along the line perpendicular to (i.e., along ).

And if is a unit vector, then the length of —that is, the length of the projection of onto —is exactly the inner product product .

Moreover, if the angle between and is larger than 90 degrees, the projected vector will point in the opposite direction of , so it’s really a “signed” length.

And this is precisely why the decision rule works. This 90-degree boundary is the line perpendicular to .

More technically said: Let be two vectors, and their dot product. Define by the length of , specifically . Define by by first letting , and then let . In words, you scale to a unit vector , use the result to compute the inner product, and then scale so that it’s length is . Then

**Theorem: **Geometrically, is the projection of onto the line spanned by .

This theorem is true for any -dimensional vector space, since if you have two vectors you can simply apply the reasoning for 2-dimensions to the 2-dimensional plane containing and . In that case, the decision boundary for a positive/negative output is the entire dimensional *hyperplane* perpendicular to (the projected vector).

In fact, the usual formula for the angle between two vectors, i.e. the formula , is a restatement of the projection theorem in terms of trigonometry. The part of the projection formula (how much you scale the output) is equal to . At the end of this post we have a proof of the cosine-angle formula above.

Part of why this decision rule property is so important is that this is a *linear* function, and linear functions can be optimized relatively easily. When I say that, I specifically mean that there are many known algorithms for optimizing linear functions, which don’t have obscene runtime or space requirements. This is a big reason why mathematicians and statisticians start the mathematical modeling process with linear functions. They’re inherently simpler.

In fact, there are many techniques in machine learning—a prominent one is the so-called Kernel Trick—that exist solely to take data that is not inherently linear in nature (cannot be fruitfully analyzed by linear methods) and transform it into a dataset that is. Using the Kernel Trick as an example to foreshadow some future posts on Support Vector Machines, the idea is to take data which cannot be separated by a line, and transform it (usually by adding new coordinates) so that it can. Then the decision rule, computed in the larger space, is just a dot product. Irene Papakonstantinou neatly demonstrates this with paper folding and scissors. The tradeoff is that the size of the ambient space increases, and it might increase so much that it makes computation intractable. Luckily, the Kernel Trick avoids this by remembering where the data came from, so that one can take advantage of the smaller space to compute what would be the inner product in the larger space.

Next time we’ll see how this decision rule shows up in an optimization problem: finding the “best” hyperplane that separates an input set of red and blue points into monochromatic regions (provided that is possible). Finding this separator is core subroutine of the Support Vector Machine technique, and therein lie interesting algorithms. After we see the core SVM algorithm, we’ll see how the Kernel Trick fits into the method to allow nonlinear decision boundaries.

Proof of the cosine angle formula

**Theorem: **The inner product is equal to , where is the angle between the two vectors.

Note that this angle is computed in the 2-dimensional subspace spanned by , viewed as a typical flat plane, and this is a 2-dimensional plane regardless of the dimension of .

*Proof. *If either or is zero, then both sides of the equation are zero and the theorem is trivial, so we may assume both are nonzero. Label a triangle with sides and the third side . Now the length of each side is and , respectively. Assume for the moment that is not 0 or 180 degrees, so that this triangle is not degenerate.

The law of cosines allows us to write

Moreover, The left hand side is the inner product of with itself, i.e. . We’ll expand using two facts. The first is trivial from the formula, that inner product is symmetric: . Second is that the inner product is linear in each input. In particular for the first input: and . The same holds for the second input by symmetry of the two inputs. Hence we can split up as follows.

Combining our two offset equations, we can subtract from each side and get

Which, after dividing by , proves the theorem if .

Now if or 180 degrees, the vectors are parallel, so we can write one as a scalar multiple of the other. Say for . In that case, . Now , since a norm is a length and is hence non-negative (but can be negative). Indeed, if are parallel but pointing in opposite directions, then , so , and . Otherwise and . This allows us to write , and this completes the final case of the theorem.

This is nice ,

LikeLiked by 1 person

That’s a skillful answer to a diclfiuft question

LikeLike

If you project a vector (data vector) over (1,1,1,1,1….,1) then the proyection is the mean of the data, this illustrates that projecting is giving you some feature of you data that can be constructed from linear algorithms.

LikeLike

Good post, but I hate the title. It should be something like “Computing a Decision Statistic from the Inner Product”. After you take the inner product, your decision rule is whether the decision statistic is positive or negative.

LikeLike

Sorry to disappoint, but I don’t feel nearly as strongly as you do about this 🙂 Maybe if “decision statistic” was a common term… but either way the point of the post is to highlight a particular role. To suggest this is inherently about statistics would be misleading and detract from the focus, which is geometry.

LikeLike