P vs. NP, A Primer (And a Proof Written in Racket)

Decidability Versus Efficiency

In the early days of computing theory, the important questions were primarily about decidability. What sorts of problems are beyond the power of a Turing machine to solve? As we saw in our last primer on Turing machines, the halting problem is such an example: it can never be solved a finite amount of time by a Turing machine. However, more recently (in the past half-century) the focus of computing theory has shifted away from possibility in favor of determining feasibility. In particular, we want to know how fast we can solve a particular problem on a Turing machine. The goal is to design efficient algorithms for important real-world problems, and know when it is impossible to design a more efficient algorithm than what we already have. From a mathematical perspective, that often means we would like to find the boundary problems: those which are “just barely” too hard to admit an efficient algorithm. Of course, this requires us to first define what it means for an algorithm to be “efficient.”

These questions are intimately intertwined with big-O analysis, which we presented in our very first primer on this blog, and the definitions we investigate in the sequel require fluency in such notation. That being said, we try to give colloquial analogues to the most important ideas.

The Class P

The general goal here is to describe classes of problems that roughly correspond to how efficiently one can solve them. The first such class (and by all rights, the smallest class we will consider) is called P, which stands for “polynomial.”

Definition: P is the set of languages which can be decided by a Turing machine in O(n^k) for some k, where n is the input size of the problem. In the common notation:

\displaystyle \textup{P} = \bigcup_{k \in \mathbb{N}} \textup{TIME}(n^k)

Where \textup{TIME}(n^k) is the set of problems decidable in O(n^k) time. Recall that a language is any set of strings (any subset of \Sigma^*, as we have been investigating in our past computing theory primers and our post about metrics on words).

We don’t have too much to say about P itself, except that many common problems are in P. The real difficulty is in proving that some problem does not lie in P.

Here is an example this author has used to explain to elementary school children what’s it’s like to not be in P. Consider a deck of 52 playing cards, and your goal is to sort the cards by suit and then by rank (as it is when you open a fresh pack of cards). There is one obvious way to do this that children teach themselves quite readily: just look for the Ace of spades, put it on the bottom of a new stack. Then find the two of spades, and continue in this manner until the deck is sorted. This algorithm takes O(n^2), since in the very worst case the deck is sorted in reverse order, and if you’re too dim to realize it, you look through the entire deck at every step.

Of course, there are much faster ways to do it than the approach we detailed above, and it is a well-known theorem that sorting is \Theta(n \log(n)). That is, there is both a lower bound and an upper bound of n \log(n) for the general sorting problem.

Now let’s investigate a ridiculous algorithm for sorting cards. What if at each step, we shuffled the deck, and then checked to see if it magically became sorted due to our shuffling. Forgetting for a moment that randomness comes into play, we would find the correct sorting once in 52! trials, and it would take 52 steps to check if it was sorted. For a deck of n cards, this would hence take time O(n! \cdot n), which would take so long even for 52 cards, that the sun would go out before you could finish. But what if we didn’t know a better way?

For an example of a problem where we don’t know a better way, consider the problem of graph coloring. Is there a Turing machine which, when given as input a graph G and a number k, can determine in polynomial time whether G admits a k-coloring? There is an obvious algorithm to decide the problem: there are only finitely many choices of assignments of vertices to colors, and we can simply check them all. In fact, there are k^n of them, where n is the number of vertices of G.

Unfortunately, this algorithm is not polynomial in runtime: we would have to enumerate all of the different colorings, and check whether each is a valid coloring; this process is easily seen to be o(nk^n), which is far from polynomial for arbitrary k > 2.

But the true challenge is this: how do we know there is no “faster” algorithm? Of all the crazy wild ideas one could have to solve a problem, how do we know none can be so clever that they reduce the running time to be a polynomial in n?

In fact, we don’t know for sure that there isn’t! This is the heart of the open problem which is succinctly called “P vs. NP”.

The Class NP

While P is a class of “easy” problems from one perspective (problems that can be solved quickly, even in the worst case), being a member of NP is another measure of “easiness,” but from a different perspective.

Definition: NP is the class of problems which have a polynomial-time verifier. That is, given an input w to a problem A \in \textup{NP}, there is a string c called a certificate and a Turing machine M for which M verifies that c proves w \in A and runs in polynomial time.

This definition is a bit hard to swallow, but examples clarify the situation greatly. For the problem of graph coloring, we note that a certificate would simply be a list of pairs (v_i, n_i) which give a coloring of the graph G. It is quite trivial to define a polynomial-time Turing machine that ensures the coloring of G is valid. Hence, graph coloring is in NP. This is the case with most problems in NP: a proof that w \in A is hard to find, but easy to verify once you have it.

There is another definition of NP which is often useful, and it gives a reason for prefixing the “polynomial” part of P with an N.

Definition: NP is the set of problems which are solvable by a nondeterministic Turing machine in polynomial time.

For the motivating picture behind “nondeterministic Turing machines,” we turn to an analogy. Imagine you have an infinite number of computers running in parallel, and they can communicate instantaneously. What sorts of problems could we solve in polynomial time with such a machine? We could certainly solve graph coloring: simply have each machine try one of the k^n different colorings, and have the entire machine halt when one coloring is found (or when they all finish, we can safely reject that the graph is k-colorable).

So we can reformulate the definition in set notation as:

\displaystyle \textup{NP} = \bigcup_{k \in \mathbb{N}} \textup{NTIME}(n^k)

Here \textup{NTIME}(f(n)) is the set of all languages which can be solved in time f(n) on a nondeterministic Turing machine.

In other words, “NP” stands for “nondeterministic polynomial-time” problems. And in fact this definition is equivalent to existence of a polynomial-time verifier, as in our first definition. To see this, note that we can construct a nondeterministic machine that enumerates all possible certificates (lending to our analogy, one on each of the infinitely numerous parallel computers), and then tests them using the polynomial-time verifier. Since each branch uses a polynomial-time algorithm, the whole Turing machine runs in (nondeterministic) polynomial time. On the other hand, if some branch of computation halts in deterministic time, then the sequence of configurations of the tape for that branch has polynomial length, and so a Turing machine can simply run that computation to ensure it follows the rules of Turing machines and ends in an accepting state. This clearly represents a certificate.

One might ask why we care about infinitely parallel computers. In reality, we can only have finitely many computations going at once, so why bother with this silly mathematical uselessness? As it often turns out in mathematics, it is useful to think about such structures simply because they capture the essence of what we wish to study in a concise and formal manner. For complexity theory, nondeterministic Turing machines capture the level of complexity we wish to understand: in a sense, it’s the “next level up” from polynomial-time decidable problems.

K-Clique and 3-Sat

We have two more examples of problems in NP which will be important later in this post: the problems of 3-Satisfiability and finding a k-Clique.

Definition: Let \varphi(x_1, \dots, x_n) be a propositional formula in n boolean variables. \varphi is satisfiable if there exists an assignment of the variables x_1, \dots, x_n \to \left \{ \textup{T}, \textup{F} \right \} that makes \varphi true.

For example, we could have the formula

\displaystyle (x \vee y \vee \overline{z}) \wedge (\overline{x} \vee \overline{y} \vee z)

And this is satisfiable by the assignment of x = \textup{T}, y = \textup{F}, z = \textup{T}.

It should be obvious to the reader now that determining whether a formula is satisfiable is in NP: a certificate is just a list of variables mapping to true or false, and checking that the formula is satisfied by a given assignment can be done in polynomial time.

In 3-Satisfiability (usually shortened to 3-Sat), we simply restrict the form of \varphi and ask the same question. The form is called conjunctive normal form, and colloquially it is a bunch of clauses joined with “and,” where each clause is a bunch of variables (and their negations) connected by “or”. Moreover, the “3” in 3-Sat requires that each clause contain exactly three literals. For example, the equation given above would be a valid input to 3-Sat.

k-Clique, on the other hand, is a question about graphs. Given a graph G and a positive integer k, determine whether G contains a complete subgraph of k vertices. (In a complete graph, there is an edge connecting each pair of vertices; the name “clique” is motivated by the party problem, in the sense that there is a “clique” of friends at the party who are all mutually friends with each other.)

As expected, a certificate that a graph has a k-Clique is just a list of the vertices in the clique, and checking that all pairs of vertices listed have a connecting edge is easily seen to take O(|G|k^2) = O(n^3), which is polynomial in the size of the input.


As it turns out, the problems of 3-Satisfiability and k-Clique are quite special (as is graph coloring). They belong to a special sub-class of NP called NP-complete. Before we can define what NP-complete means, we have to be able to compare problems in NP.

Definition: Given two languages A, B, we say A \leq_p B, or A is polynomial-time reducible to B if there exists a computable function f: \Sigma^* \to \Sigma^* such that w \in A if and only if f(w) \in B, and f can be computed in polynomial time.

We have seen this same sort of idea with mapping reducibility in our last primer on Turing machines. Given a language B that we wanted to show as undecidable, we could show that if we had a Turing machine which decided B, we could solve the halting problem. This is precisely the same idea: given a solution for B and an input for A, we can construct in polynomial time an input for B, use a decider for B to solve it, and then output accordingly. The only new thing is that the conversion of the input must happen in polynomial time.

Of course, this discussion was the proof of a clarifying lemma:

Lemma: If B \in \textup{P} and A \leq_p B, then A \in \textup{P}.

The proof is immediate, as B can be solved in polynomial time, and the conversion function runs in polynomial time. We leave the construction of an explicit Turing machine to decide A as an exercise to the reader.

To phrase things more colloquially, A \leq_p B is true if A is an easier” problem than B, hence justifying the less-than notation.

And now for the amazing part: there are problems in NP which are harder than all other problems in NP.

Definition: A language A \in \textup{NP} is called NP-complete if for all problems B \in \textup{NP}, B \leq_p A.

In other words, all problems in NP reduce to an NP-complete problem in polynomial time. In fact, we get another nice fact about NP-completeness that mirrors our observation about P above:

Lemma: If A is NP-complete and B \in \textup{NP} with A \leq_p B, then B is NP-complete.

Obviously the composition of two polynomial-time reductions is a polynomial-time reduction, so we can conclude that all problems in NP which reduce to A also reduce to B.

The cautious reader should be rather hesitant to believe that NP-complete problems should even exist. There is no reason we can’t come up with harder and harder problems, so why should there be a point after which we can’t quickly verify a solution?

Well, Stephen Cook proved in 1971 that there is an NP-complete problem, and shortly thereafter many more were found. Today, there are thousands of known NP-complete problems.

Perhaps unsurprisingly, Cook’s original NP-complete problem was N-Satisfiability (i.e., 3-Satisfiability without a constraint on the number of clauses or the form). Unfortunately the proof is quite difficult. We point the reader to the relevant Wikipedia page, and briefly mention the outline of a proof.

Given a nondeterministic polynomial-time Turing machine, we can bound the number of parallel computations and the length of each computation by n^k for some fixed k. Then we create a n^k by n^k table of the configurations of the Turing machine (the i,j cell for the i-th branch of computation and the j-th step). From this, we can construct a monstrously enormous (yet polynomial in size) formula which has a satisfying assignment if and only if the Turing machine halts on some branch of computation in an accept state. Here is a table of the formulas needed to do this. In short, the formula traces the computation of the machine at each step, and ensures the transition function is honestly followed, the tape is reliably updated, and the head of each tape moves in the correct direction.

The reason we say it’s unsurprising that Satisfiability is NP-complete is because it’s commonly believed that every aspect of mathematics boils down to pure logic, although the process of doing so is entrenched in gory detail every step of the way. So it’s understandable that all problems in NP reduce to a problem about logic which is also in NP. We stipulate that other complexity classes likely have “complete” problems that are essentially questions about logical formulas.

A New Way to Find NP-Complete Problems

Now that we have established the NP-completeness of Satisfiability, we can do the same for other problems by reduction from a known NP-complete problem. First, we claim that 3-Satisfiability is NP-complete, and we leave the proof as an exercise to the reader (hint: reduce from regular Satisfiability by putting the formula into the right form).

Now given that 3-Sat is NP-complete, we will prove that k-Clique is NP-complete, by reduction from 3-Sat (in fact our conversion function will work for any formulas in conjunctive normal form, but 3 is enough).

Theorem: k-Clique is NP-complete.

Proof. Given a formula \varphi in conjunctive normal form, we construct an instance of k-Clique as follows. First, let k be the number of clauses in \varphi. Construct a graph G_{\varphi} by creating a vertex for each literal term in \varphi, and (to help visualization) organize them into columns by their originating clause, and label the vertex with its corresponding literal. Introduce an edge connecting two terms a, b in different columns when the formula a \wedge b is not a contradiction. In other words, it cannot be of the form x \wedge \overline{x} for some variable x.

As a concrete example, the formula

\displaystyle (x \vee \overline{y} \vee z) \wedge (\overline{x} \vee \overline{z} \vee w)

converts to the graph

We claim that \varphi has a satisfying assignment of variables if and only if G_{\varphi} has a k-clique. Supposing there is a valid assignment of variables in \varphi, then there must be one variable in each clause which is true (and hence k variables). This translates to k vertices in G_{\varphi}, one vertex in each column which is true, and none of these vertices conflict with each other, so G_{\varphi} has an edge connecting each pair of the k vertices. Conversely, suppose G_{\varphi} has a k-clique. By our construction, two edges in the same column cannot be connected by and edge, and hence this k-clique must have one vertex in every column. If the vertex is labeled with a negation, assign it to the value F (so that the literal evaluates to true), and otherwise assign it the value T. This gives a satisfying assignment of the variables of \varphi, since each clause will evaluate to true under this assignment.

The final part of the proof is that the conversion function runs in polynomial time, and we claim this is obvious from the construction: if \varphi has n literals, then we create O(n) vertices and O(n^2) edges. The creation of each vertex and edge can be done in constant time, as can the verification that two literals do not conflict. \square

Of course, this is a question about the possibilities of computers. Instead of giving a theoretical proof, why not just write a program to compute the conversion? Well we did just that, and the main body of the code turned out to be quite tidy:

;; n-sat-to-clique: formula -> (listof edge)
;; transform an input to n-sat to an input for clique
;; assume the input expression is in CNF, and that
(define (n-sat-to-clique expr)
  (let* ([conjuncts (∧-conjuncts expr)]
         [columns (map (λ (x) (∨-disjuncts x)) conjuncts)]
         [labeled-columns (label-graph columns 1)]
         [possible-edges (all-pairs-in-distinct-lists labeled-columns)])
     (list->set (filter no-conflict? possible-edges))))

We have a data structure for a general formula (and provide a function to compute the conjunctive normal form of any expression), and a data structure for a graph (essentially, a list of pairs of labelled vertices), and so the process of checking all possible edges, and filtering out those which have no conflict, clearly takes O(n^2) time.

The rest of the code required to run the function above is available on this blog’s Github page.

Other NP-complete Problems, and P =? NP

In the real world NP-complete problems show up everywhere. Application domains include cryptography, financial portfolio and investment management, scheduling and operation dynamics, packing problems, chem- and bioinformatics, guarding art galleries, circuit design, compiler design, and even modelling social networks.

There are even many questions one can ask about games that turn out to be NP-complete in complexity. For instance, many questions about the classic game of Tetris are NP-complete, along with Minesweeper, FreeCell, Lemmings, Battleship, and Mastermind.

Now the big unsolved question is does P = NP? In other words, can any of these seemingly hard problems be solved quickly? The simple fact is, if the answer is yes to one such problem, then the answer is yes not only to all NP-complete problems, but to all problems in NP (as we saw in our lemma earlier). This is the heart of the million-dollar question that has been crowned the most difficult open problem in computer science to date. Almost everyone agrees that P and NP should not be equal, but nobody can prove it.

Of course, common people love to talk about P and NP because of all of the wild things that would happen if we suddenly discovered that P = NP. All widely used security systems would fail, internet transactions would no longer be safe, efficiency in industry would increase by orders of magnitude, we’d unlock the secrets of the human genome, we’d quickly solve open mathematical problems and find staggeringly ginormicon primes (ginormicon = gigantic + enormous + decepticon), governments will topple, rioting in the streets, the artificial intelligence singularity will occur, etc., etc.

But all of this is just media hype. The likely reality is that some problems are simply too hard to be solved in polynomial time in general, just as there are probably problems which have no polynomial-time verifiers (i.e., problems outside of NP), excluding the trivial problems which are undecidable. In the end, it’s just a matter of time until mathematics sorts everything out, or proves that it’s impossible to do so. Indeed, just two years ago this author remembers waking up to the news that there was a 100 page proof that P is not equal to NP, and moreover that Stephen Cook himself considered it a serious attempt. Unfortunately it turned out to have irreparable flaws, but failure made it no less exciting: this is how great mathematics is made.

On the other hand, there are cranks out there who have, for the last few years, been convinced that they have proved P = NP, but are ignorant of their own flaws and the rest of the world’s criticism. May Gauss have mercy on their mathematical souls.

Until next time!

Inner Product Spaces – A Primer

This primer is a precursor to a post on eigenfaces, a technique for facial recognition. For a more basic introduction to linear algebra, see our first primer on the subject.


Vector spaces alone are not enough to do a lot of the interesting things we’d like them to do. Since a vector space is a generalization of Euclidean space, it is natural for us to investigate more specific types of vector spaces which are more akin to Euclidean space. In particular, we want to include the notion of a dot product. By admitting additional structure to a vector space, we may perform more computations, and hopefully get more interesting results.

Again, since we are developing mathematics for use in computing, all vector spaces in this post are finite-dimensional, and the field under question is always \mathbb{R} or \mathbb{C}, but usually the former. It suffices to recognize the vast amount of work and results in infinite-dimensional spaces, but we will not need it here.

Inner Product Spaces

The standard dot product operation in Euclidean space is defined as

\left \langle (x_1, \dots , x_n), (y_1, \dots, y_n) \right \rangle = \sum \limits_{i = 1}^n x_iy_i

So we take the dot product and generalize it to an operation on an arbitrary vector space.

Definition: An inner product on a vector space V over a field F is a function \left \langle \cdot, \cdot \right \rangle : V \times V \to F which satisfies the following properties for all x,y,z \in V, c \in F:

  • \left \langle x,y \right \rangle = \overline{ \left \langle y, x\right \rangle }, where the bar denotes complex conjugate. For real fields, this is just symmetry in the arguments.
  • \left \langle cx, y \right \rangle = c \left \langle x,y \right \rangle
  • \left \langle x+y, z \right \rangle = \left \langle x,z \right \rangle + \left \langle y,z \right \rangle
  • \left \langle x,x \right \rangle \geq 0 and \left \langle x,x \right \rangle = 0 if and only if x=0.

We recommend the novice reader invent some simple and wild operations on two vectors, and confirm or deny that they are inner products.

Notice that the second and third conditions imply that if the second argument of an inner product is fixed, then the resulting map V \to F is a linear map (since it maps vectors to the underlying field, it has a special name: a linear functional). We leave it as an exercise to the reader to investigate linearity facts about the map resulting from fixing the first argument (hint: things get conjugated).

We call a vector space with an associated inner product and inner product space. Of course, the most natural example of an inner product space is any Euclidean space \mathbb{R}^n with the dot product. However, there are many other interesting inner products, including ones which involve matrix multiplication, integration, and random variables. As interesting as they may be (and though the results we develop here hold for them), we have no need for them at this point in time. We will stick entirely to inner product spaces embedded in \mathbb{R}^n, and the standard dot product will suffice.

Now from any inner product we may induce a norm on V, which is colloquially the “distance” of a vector from the origin.

Definition: A norm on V is a function \left \| \cdot \right \| : V \to R which satisfies the following for all x,y \in V, c \in F:

  • \left \| x \right \| \geq 0, with equality if and only if x=0
  • \left \| cx \right \| = |c| \left \| x \right \|
  • \left \| x+y \right \| \leq \left \| x \right \| + \left \| y \right \| (the infamous triangle inequality)

If we recall the standard Euclidean norm, we see that it is just \left \| x \right \| = \sqrt{\left \langle x,x \right \rangle}. Indeed, for any inner product this definition satisfies the axioms of a norm, and so it is a natural generalization of “distance” between points in any inner product space.

In particular, those readers familiar with topology (or at least metric space analysis), will immediately recognize that a norm induces a metric on V, defined by d(x,y)= \left \| x-y \right \| , where of course x-y is the vector between x and y. Hence, every inner product space has a very rigid (metrized) topology and a fruitful geometry.

Additionally, any vector of norm 1 is called a unit vector, or a normal vector.

Now we look at vectors which have interesting relationships under inner products: specifically, when \left \langle x,y \right \rangle = 0.

Orthogonality and Orthonormal Bases

Definition: Two vectors x,y \in V are orthogonal if \left \langle x,y \right \rangle = 0. A set S of vectors is called orthogonal if the vectors in S are pairwise orthogonal.

Orthogonal vectors naturally generalize perpendicularity in Euclidean space. In particular, two vectors in \mathbb{R}^n are orthogonal if and only if the subspaces (lines) spanned by them are perpendicular.

The simplest examples of orthogonal vectors are the standard basis vectors (1,0, \dots, 0) \dots (0, \dots ,1), with respect to the usual dot product. Although, any scalar multiple of a basis vector \lambda e_i may replace e_i while still preserving the orthogonality of the set.

Orthogonality gives a new insight into the nature of inner products. Specifically, \left \langle x,y \right \rangle gives (almost) the length of the projection of x onto y. In other words, \left \langle x,y \right \rangle is the component of x that points in the direction of y (scaled by the length of y).

Now we may define projection maps to get “projection onto y” more faithfully than a plain inner product:

Definition: The projection of x onto y, denoted p_y(x), is the map V \to V defined by p_y(x) = \left \langle x, y \right \rangle \frac{y}{\left \| y \right \|^2}.

The reader may easily verify that the vectors p_y(x) and x-p_y(x) are indeed orthogonal, though the computation is awkward. This will come in handy later when we want to build orthogonal sets of vectors.

In addition, we may obviously decompose any vector v into its two orthogonal components with respect to another vector w via this projection: v = (v-p_w(v)) + p_w(v).

In addition to orthogonality, the standard basis vectors have norm 1. We call such a basis for an inner product space an orthonormal basis (a portmanteau of orthogonal and normal). We commonly denote an orthonormal set of vectors (e_1, \dots, e_n), perhaps a ritualistic attempt to summon the power of the standard basis vectors.

Given an orthonormal basis for an inner product space V (e_1, \dots, e_n), we may decompose any vector into its basis representation rather easily:

\displaystyle v = p_{e_1}(v) + \dots + p_{e_n}(v) = \left \langle v,e_1 \right \rangle e_1 + \dots + \left \langle v,e_n \right \rangle e_n,

Since the norm of each e_i is 1, we may skip the division by the square norm of e_i in the projection maps. Now, recalling that every vector can be written uniquely as a linear combination of the basis vectors, we see that the inner products \left \langle v, e_i \right \rangle are precisely those coefficients. The recognition of this fact leads us to an important theorem, with a necessary preliminary definition:

Definition: Two inner product spaces V, W are isometric if there exists a linear isomorphism f:V \to W which preserves the inner products in the respective spaces. i.e., \left \langle x, y \right \rangle_V = \left \langle f(x), f(y) \right \rangle_W for all x,y \in V.

Whereas, linear isomorphisms between vector spaces are the mathematically rigorous way of saying the two vector spaces are identical, isometric vector spaces give the “sameness” of the inner products. Hence, isometric vector spaces have equivalent metrics, topologies, and geometries, with merely a different choice of basis and representation of the vectors. Now, as we had that every finite-dimensional vector space was isomorphic to \mathbb{R}^n, we will soon see that every finite-dimensional inner product space is isometric to \mathbb{R}^n with the usual dot product.

Theorem: Any finite dimensional real inner product space V which has an orthonormal basis is isometric to \mathbb{R}^n with the usual dot product.

Proof: Define f: \mathbb{R}^n \to V by f((x_1, \dots, x_n)) = x_1e_1 + \dots + x_ne_n. Now, by computation, we see that

\left \langle f((a_1, \dots , a_n)), f((b_1, \dots , b_n)) \right \rangle
= \left \langle a_1e_1 + \dots + a_ne_n, b_1e_1 + \dots + b_ne_n \right \rangle
= \sum \limits_{i=1}^n a_i \left \langle e_i, b_1e_1 + \dots + b_ne_n \right \rangle
= \sum \limits_{i=1}^n \sum \limits_{j=1}^n a_i b_j \left \langle e_i, e_j \right \rangle
= a_1b_1 + \dots a_nb_n = \left \langle (a_1, \dots a_n), (b_1, \dots b_n) \right \rangle \square

Now, all we must prove is that every finite-dimensional inner product space has an orthonormal basis.

Theorem (Gram-Schmidt): Every basis of a finite-dimensional inner product space may be transformed into an orthonormal basis.

Proof: We do so by construction, using the previously introduced projection maps p_y(x). Given a basis (x_1, \dots , x_n), we compute the following algorithm:

  1. \displaystyle e_1 = \frac{x_1}{\left \| x_1 \right \|}
  2. For each i = 2, \dots , n:

    1. Let z_i = \sum \limits_{j=1}^{i-1} p_{e_j}(x_i)
    2. \displaystyle e_i = \frac{x_i - z_i}{\left \| x_i - z_i \right \|}

The reader may verify computationally that this process produces orthonormal vectors, but the argument is better phrased geometrically: each z_i is the projection of some new vector onto the subspace generated by all previously computed orthonormal vectors. By subtracting x_i - z_i, we take the part of x_i that is orthogonal to all vectors in that subspace. This is guaranteed to be nonzero because of the linear independence of our original list. And so, e_i is orthogonal to every vector preceding it in the algorithm (with indices j < i). Finally, we normalize each e_i to make them unit vectors, and we are done. \square

We note that this algorithm takes O(n^3), since we may compute the O(n^2) needed inner products ahead of time, and then there remains O(n) steps to compute each of the e_i.

This result now proves that every real finite-dimensional inner product space is isometric to \mathbb{R}^n. With this new insight, we may effectively do all our work in \mathbb{R}^n with the usual dot product, realizing that the results there hold for all relevant inner product spaces. In other words, our “simplification” at the beginning of this post, restricting our work to \mathbb{R}^n, was not at all a simplification. Proving statements about \mathbb{R}^n gives us equivalent statements about all real finite-dimensional inner product spaces. Wonderful!

Bases of Eigenvectors

There is one more important topic we wish to discuss here: the importance of bases of eigenvectors. In particular, given a linear operator A: V \to V, if one has a basis of eigenvectors for A, then A has a diagonal representation.

In particular, if A has a basis of eigenvectors (v_1, \dots , v_n), then the expansion of Av_i in terms of the basis vectors is just \lambda_i v_i, where \lambda_i is the corresponding eigenvalue. Thus, the matrix corresponding to A looks like:

\begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n  \end{pmatrix}

Here we count multiplicity of eigenvalues.

The existence of a diagonal representation for a matrix has interesting computational implications. For instance, we often wish to take high powers of a matrix, such as in counting paths in a graph, or working with graphical computations. Unfortunately, each successive matrix multiplication takes O(n^2) computations. If we wish to compute A^m, this takes O(mn^2) time. However, if the matrix has a diagonal representation, we may spend the O(n^2) it takes to convert the matrix to its diagonal form, take powers of the diagonal matrix by simply taking the powers of the diagonal entries, and then convert it back. Indeed, multiplying two diagonal matrices together is just as easy as multiplying the diagonal entries together, as the reader may verify. This optimization reduces computation to O(n^2 + mn) = O(nm), since we are assuming m is very large.

Of course, then we are left with the problem of quickly computing eigenvectors. What worries us even more is that we might not have a basis of eigenvectors (some matrices don’t have any!). We instead take a slightly different route, which serves our purposes better. Specifically, we will be using this information to compute eigenvectors of symmetric matrices (A = A^{\textup{T}}). For this, we refer to a grand theorem:

The Spectral Theorem: Every real symmetric matrix has an orthonormal basis consisting of eigenvectors.

The proof goes beyond the scope of this post (see: characteristic polynomials and self-adjoint operators), but it is very useful for us. In particular, by finding these eigenvectors we may both have a diagonal representation for our matrix, and also compute projections in a jiffy! We will see the astounding applications of this quite soon.

So Even with two primers on linear algebra, we have still only scratched the surface of this wonderful subject. In the future we may continue this series of primers by discussing the linear algebra inherent in many optimization problems. Be sure to look out for it.

Until next time!

Big-O Notation – A Primer

The Quest to Capture Speed

Companies and researchers spend hundreds of millions of dollars for the fruits of their algorithms. Whether one is indexing websites on the internet for search, folding proteins, or figuring out which warehouse is the most cost-effective to ship a product from, improvements in algorithm speed save immense amounts of money.

It’s no surprise then, that a similarly immense amount of money has gone into the mathematics behind algorithm analysis. Arguably the roughest and most succinct measurement of how fast an algorithm runs is called order of complexity of an algorithm. There are two commonly used measures of order of complexity, namely Big-O notation and the more nuanced Big-Theta notation. We begin with the former.

Because an algorithm runs in a discrete number of steps, we call f(n) the number of steps it takes an algorithm to complete for any input of size n, and then analyze it for real input x \in \mathbb{R}. In doing so, we may utilize the vast wealth of knowledge in real analysis, and describe runtime in terms of elementary continuous functions and their limits.

Definition: We say a real-valued function f(x) is O(g(x)) if for some (potentially large) input value x_0, there exists a positive number M with |f(x)| \leq M |g(x)| for all x > x_0. Equivalently, \lim \limits_{x \rightarrow \infty} \dfrac{|f(x)|}{|g(x)|} < \infty.

Here is a simple, pedantic example:

Consider an algorithm which accepts as its input a list of numbers and adds together all of the numbers in that list. Given a list of size n, we might count up the number of steps taken by our algorithm to arrive at f(n) = 5(n-1), because there are n-1 additions being performed, and for each add there is an instruction fetch, a pointer reference to follow, a value in memory to load, the addition to perform, and the resulting value to store in a register, each for simplicity taking one step. We characterize this algorithm as O(n), because for any n > 1, we have M = 5 giving |f(n)| = |5n-5| < 5|n| = M|g(n)|.

Of course, such a detailed analysis is rarely performed. Generally, we’d say there is an addition to be performed for each element in the list, giving n additions and an order of complexity of O(n). Typically, the goal is to have the g function be as simple as possible, giving a small number of commonly used orders which facilitate over-the-table discussions about algorithms:

  • O(1): the constant time algorithm which always takes a fixed number of steps for any input. For example, looking up a key in a hash table.
  • O(\log n): logarithmic time, where at each step of the algorithm the problem size is reduced by a fixed factor. This is common to “divide and conquer” algorithms.
  • O(n^c) for c > 0: polynomial time. If f(n) is not a monomial, we conventionally drop all terms but the one of highest degree, because under the limit the highest degree term dominates the runtime.
  • O(a^n) for a > 1: exponential time, where with each increase of the input size, the runtime increases by a factor of a. This is extremely slow for any reasonable input size, even when a is very close to 1. Many brute-force algorithms (attempting every possibility) result in exponential runtime.
  • O(n!): factorial time, which is often clumped with exponential time, is actually strictly slower than any exponential time algorithm. Considering s_n = a^n / n!, as n surpasses a, we have each s_n = s_{n-1} (\frac{a}{n}), and since a < n, we decrease s_n at each step, and the limit goes to 0 (i.e., factorials grow much faster). Travelling Salesman is a famous problem whose brute-force solution is factorial, but even with cutting-edge methods for optimization, the worst case runtime is still exponential.

And there are additional classes that fall in between these, such as O(n \log n) and O(\log \log n). And it is easy to prove that constants, and additions of smaller terms are irrelevant under the O. Specifically, O(kg(n)) = O(g(n)) for any constant k, and similarly if f(n) is O(g(n)), then O(f(n) + g(n)) = O(g(n)).

In this way we encapsulate the notion of worst-case runtime. If f(n) is O(g(n)), we say the algorithm which has exact runtime f(n) runs no worse than g(n). Unfortunately, these bounds do not need to be tight. We could say, for instance, that an algorithm runs in O(n^{100}) as well as O(n^2), and be right on both counts. That is why we need Big-Theta notation, to give tighter bounds on runtime:

Definition: We say f(x) is \Theta(g(x)) if there exist x_0, M, N such that whenever x > x_0, we have N |g(x)| < |f(x)| < M |g(x)|. In other words, for an appropriate choice of constants, g bounds f both from above and below for all sufficiently large inputs.

Now we may only use \Theta notation when the bound is reasonable (not too high, not too low). We gain more information by knowing a function is \Theta(g(n)) than knowing it is O(g(n)), because we know it cannot grow slower than g.

Of course, for more complex algorithms with multiple input variables, the asymptotic runtimes are necessarily phrased in terms of continuous functions of multiple variables. While it’s not much harder to formally develop these characterizations, it’s beyond the scope of this post.

There is a wealth of knowledge out there on the orders of complexity of various algorithms; one simply cannot write a new algorithm without asking how fast it is. For the interested reader, I highly recommend the mathematically inclined book Introduction to Algorithms, by Cormen, et. al. It’s been the standard text on the subject for many years, and it is a wealth of knowledge and a valuable reference on any shelf.