Number Theory – A Primer

This primer exists for the background necessary to read our post on RSA encryption, but it also serves as a general primer to number theory.

Oh, Numbers, Numbers, Numbers

We start with some easy definitions.

Definition: The set of integers, denoted \mathbb{Z}, is the set \left \{ \dots -2, -1, 0, 1, 2, \dots \right \}.

Definition: Let a,b be integers, then a divides b, denoted a \mid b, if there exists an integer n such that na = b. We also less commonly say b is divisible by a. A composite number has a divisor greater than 1, and hence two strictly smaller divisors.

This definition of “dividing” allows us to bypass the more complicated world of fractions and rational numbers, and keep most of what we do in the integers. The novice reader is encouraged to prove the following propositions about divisibility:

  • If a \mid b then a \mid kb for any k \in \mathbb{N}.
  • If a \mid b and a \mid c then a \mid b + c, a \mid bc.
  • If a \mid b and b \mid a then a = \pm b.
  • Find a counterexample where a \mid bc but neither a \mid b nor a \mid c holds.

Definition: A positive integer p is prime if it has exactly two distinct divisors: 1 and p. For the remainder of this post, we will always use p,q to denote primes.

It follows immediately from the definition that if p,n are positive integers and p is prime, then n \mid p if and only if n = 1 or n=p.

We could work toward the fundamental theorem of arithmetic, that every positive integer factors uniquely as a product of primes, but this would lead us down the road of proving Bezout’s theorem, and we want to work quickly toward other things. Instead, we simply prove the existence of such a factoring, and take its uniqueness for granted:

Theorem: Every positive integer factors as a product of primes.

Proof. Let S be the set of positive integers which do not have a factoring as a product of primes. In particular, 1 \notin S since 1 is the product of zero primes, and no prime is in S. So S is entirely composed of composite numbers. Take the smallest element x \in S, and since it is composite it may be written as x = ab, where a,b are both strictly smaller than x. But since each of a,b are not in S, they factor into products of primes. By combining these factorings, we achieve a factoring of x, a contradiction. \square

Definition: The greatest common divisor of two numbers a, b, abbreviated gcd and denoted (a,b) or less commonly \gcd(a,b), is the largest divisor of both a and b. We say two numbers are relatively prime if (a,b) = 1.

Note that two relatively prime numbers might not be prime. In fact, (8,9) = 1, but neither 8 nor 9 are prime. We also sometimes say (though it is probably grammatically incorrect) a is relatively prime to b.

We may prove some interesting facts about greatest common divisors, which we leave as exercises to the ambitious reader:

  • (a + cb, b) = (a,b) for any c \in \mathbb{Z}.
  • (a,b) is the smallest positive linear combination of a,b with integer coefficients.
  • (Bezout’s theorem, a corollary to the above two statements) There is a linear combination of a,b with integer coefficients that equals (a,b).

The last theorem shows up in group theory as a statement about generators of additive integer groups n\mathbb{Z}.

Congruences and Modular Arithmetic

Definition: a is congruent to b modulo n, denoted a \equiv b \mod n, if n \mid b-a.

This definition has a more familiar form for computer scientists, namely the line of code:

a % n == b % n

In plain language, two numbers are congruent modulo n if they have the same remainder when divided by n. Usually, however, we make 0 \leq b < n, so that b is the remainder of a when divided by n.

It is a cheap fact that the relation \cdot \equiv \cdot \mod n is an equivalence relation on integers for any fixed n. In other words:

  • a \equiv a \mod n
  • If a \equiv b \mod n then b \equiv a \mod n
  • If a \equiv b \mod n and b \equiv c \mod n then a \equiv c \mod n.

This allows us to partition the integers into their congruence classes. In other words, the fact that this is an equivalence relation allows us to identify a number with its remainder modulo n. For the sake of consistency, when stating the set of all congruence classes mod n, we stick to the classes represented by the positive integers 0, \dots, n-1.

There is a vast amount of work on solving systems of congruences. The interested reader should investigate the Chinese Remainder Theorem and Hensel’s Lemma.

Definition: A number b is an inverse to a number a modulo n if ab \equiv 1 \mod n.

Inverses help to reduce computations by pairing a number with its inverse modulo n. Therefore, it is interesting to know when numbers have inverses, and whether one is unique up to congruence.

Proposition: An inverse for a \mod n exists if and only if (a,n) = 1 (a and n are relatively prime).

Proof. Suppose such an inverse b exists. Then by the definition of congruence, n \mid ab - 1, and hence cn = ab-1, so ab - cn = 1. In particular, 1 is a linear combination of a and n, and hence it must be the smallest positive linear combination. This is equivalent to (a,n), which is proved by the exercise above.

On the other hand, suppose (a,n)=1. We may reverse the computation above to find b. Specifically, b is the coefficient of a in the linear combination of a,n that makes 1. This proves the theorem \square.

It would be great if we could determine how many numbers have inverses mod n. In fact, we will, but first we’d like to investigate a few interesting theorems.

Theorem: (Wilson) (p-1)! \equiv -1 \mod p for any prime p.

Proof. We immediately see that two of these factors are easy: 1 \equiv 1 \mod p and p-1 \equiv -1 \mod p. We claim that the product of the remaining numbers is always 1.

Since each number 1 < n < p-1 is relatively prime to p (indeed, all numbers are relatively prime to a prime), each number n has an inverse modulo p. As long as this inverse is not n itself, we may pair off each number with its inverse, and see that the entire product is just 1.

To prove that n^{-1} \neq n, suppose n^2 \equiv 1 \mod p. Then (n+1)(n-1) \equiv 0 \mod p. But since 1 < n < p-1, we must have two numbers n+1, n-1 whose product is divisible by p. But neither of n+1, n-1 has p as a factor. So we conclude that either n+1 = 0 or n-1 = 0, giving n \equiv \pm 1 \mod p, a contradiction.

So we may indeed pair the numbers off as we described above, and see that (p-1)! \equiv -1 \mod p. \square

Here is another fundamental result that uses Wilson’s theorem as a stepping stone:

Theorem: (Fermat’s Little Theorem) If p is prime and 0 < a < p, then a^{p-1} \equiv 1 \mod p.

Proof. The set \left \{ 1, 2, \dots, p-1 \right \} are the nonzero remainders mod p. If we multiply each by a, we get another complete set of nonzero residues mod p, namely \left \{ a, 2a, \dots, (p-1)a \right \}.

Since both of these sets are all the nonzero residues mod p, their products are congruent. In particular,

\displaystyle \prod \limits_{k=1}^{p-1} k \equiv \prod \limits_{k=1}^{p-1}ak \mod n

Since we may factor out the a from each term, and there are p-1 terms, we arrive at (p-1)! \equiv a^{p-1}(p-1)! \mod p. But Wilson’s theorem proved that (p-1)! is nonzero mod p, and hence has a multiplicative inverse. Multiplying both sides of the equation by its inverse (which is obviously -1), we get a^{p-1} \equiv 1 \mod p, as desired. \square

Fermat’s Little Theorem allows us to quickly compute large powers of integers modulo any prime. Specifically, we may break 3^{143} \mod 11 into 3^{10*14 + 3} = (3^{10})^{14} \cdot 3^3 By Fermat’s Little Theorem, this is just 1^{14} \cdot 27 \mod 11, and from here we may compute 3^{143} \equiv 5 \mod 11. Certainly this is much faster than incrementally multiplying 3 to itself 143 times and every so often dividing by 11. We subtly allude to the usefulness of number theory for computing large exponents, which is an important theme in our post on RSA encryption.

Euler’s Phi Function

Our final topic for this primer is Euler’s phi function (also called the totient function), which counts the number of positive integers which are relatively prime to n.

Definition: \varphi(n) is the number of positive integers between 1 and n which are relatively prime to n.

Here we have another, more general version of Fermat’s Little Theorem, which uses an almost identical proof:

Theorem: For any positive integer n, if (a,n) = 1 then a^{\varphi(n)} \equiv 1 \mod n.

Proof. Again we use the argument from Fermat’s Little Theorem, except here the sets are not all integers from 1 to n, but only those which are relatively prime. Then we notice that if a, b are relatively prime to n, then so is their product.

Using the product trick once more, we label the relatively prime integers r_i, and see that

\displaystyle \prod \limits_{k=1}^{\varphi(n)} r_k \equiv a^{\varphi(n)} \prod \limits_{k=1}^{\varphi(n)} r_k \mod n

Since the product of all relatively prime integers is again relatively prime, it has a multiplicative inverse mod n. While we might not know what it is, it certainly exists, so we may cancel both sides of the congruence to get a^{\varphi(n)} \equiv 1 \mod n. Therefore, we win. \square

Now we would like to have a good way to compute \varphi(n) for a general n. First, we see that for powers of primes this is easy:

Proposition: \varphi(p^k) = p^{k-1}(p-1) = p^k - p^{k-1} for any prime p and positive integer k. In particular, \varphi(p) = p-1.

Proof. The only numbers which are not relatively prime to p^k are all the multiples of p. Since every pth number is a multiple of p, there are hence p^k / p numbers which are not relatively prime to p. Subtracting this from p^k gives our desired formula. \square

Finally, we have a theorem that lets us compute \varphi(n) for an arbitrary n, from \varphi of its prime power factors.

Proposition: \varphi(nm) = \varphi(n)\varphi(m) if n and m are relatively prime.

Proof. To each number a relatively prime to n, and each number b relatively prime to m, we see that am+bn is relatively prime to mn. Supposing to the contrary that some prime p divides (an+bm, mn), we see that it must divide one of m, n but not both, since (m,n) = 1. Suppose without loss of generality that p \mid n. Then since p \mid an+bm, we may see that p \mid m, a contradiction.

In other words, we have shown that a,b \mapsto am+bn is a function from the set of pairs of relatively prime numbers of n, m to the set of relatively prime numbers to nm. It suffices to show this map is injective and surjective.

For injectivity, we require that no two distinct am+bn are congruent. Supposing we have two distinct a,a' and two b, b', with am+bn \equiv a'm+b'n \mod mn. Then rearranging terms we get m(a-a') + n(b-b') \equiv 0 \mod mn. Since latex m$ divides both m(a-a') and 0 under the modulus, we see that m divides n(b-b'). But since (m,n) = 1, we get that m \mid b-b', so by definition b \equiv b' \mod m, contradicting our assumption that the b‘s were distinct. We get a similar result for the a‘s, and this proves injectivity.

For surjectivity, let k be a positive integer relatively prime to nm. Since (m,n) = 1, we may write am+bn = k for some a,b \in \mathbb{Z}. This is achieved by finding a linear combination of m,n equal to 1 (their gcd), and then multiplying through by k. We claim that (a,n) = 1. This is true since if some prime divided both of a,n, then it would also divide am+bn = k, and also nm. But k was assumed to be relatively prime to nm, so this cannot happen. Hence, a is relatively prime to m. An identical argument gives that b is relatively prime to n, so the pair (a,b) \mapsto am+bn, proving surjectivity. \square

This, we may compute \varphi(n) multiplicatively, by first finding its prime factorization, and then computing \varphi(p^k) for each prime factor, which is easy.

Unfortunately, finding prime factorizations quickly is very hard. Unless we know the factorization of a large n ahead of time (large as in hundreds-of-digits long), computing \varphi(n) is effectively impossible. We cover the implications of this in more detail in our post on RSA encryption.

Until next time!

Encryption & RSA

This post assumes working knowledge of elementary number theory. Luckily for the non-mathematicians, we cover all required knowledge and notation in our number theory primer.

So Three Thousand Years of Number Theory Wasn’t Pointless

It’s often tough to come up with concrete applications of pure mathematics. In fact, before computers came along mathematics was used mostly for navigation, astronomy, and war. In the real world it almost always coincided with the physical sciences. Certainly the esoteric field of number theory didn’t help to track planets or guide ships. It was just for the amusement and artistic expression of mathematicians.

Despite number theory’s apparent uselessness, mathematicians invested a huge amount of work in it, searching for distributions of primes and inventing ring theory in the pursuit of algebraic identities. Indeed some of the greatest open problems in mathematics today are still number theoretical: the infamous Goldbach Conjecture, the Twin Prime Conjecture, and the Collatz Conjecture all have simple statements, but their proofs or counterexamples have eluded mathematicians for hundreds of years. Solutions to these problems, which are generally deemed beyond the grasp of an average mathematician, would certainly bring with them large prizes and international fame.

Putting aside its inherent beauty, until recently there was no use for number theory at all. But nowadays we have complex computer simulated models, statistical analysis, graphics, computing theory, signal processing, and optimization problems. So even very complex mathematics finds its way into most of what we do on a daily basis.

And, of course, number theory also has its place: in cryptography.

The history of cryptography is long and fascinating. The interested reader will find a wealth of information through the article and subsequent links on Wikipedia. We focus on one current method whose security is mathematically sound.

The Advent of Public Keys

Until 1976 (two years before the RSA method was born), all encryption methods followed the same pattern:

  1. At an earlier date, the sender and recipient agree on a secret parameter called a key, which is used both to encrypt and decrypt the message.
  2. The message is encrypted by the sender, sent to the recipient, and then decrypted in privacy.

This way, any interceptor could not read the message without knowing the key and the encryption method. Of course, there were various methods of attacking the ciphers, but for the most part this was a safe method.

The problem is protecting the key. Since the two communicating parties had to agree on a key that nobody else could know, they either had to meet in person or trust an aide to communicate the key separately. Risky business for leaders of distant allied nations.

Then, in 1976, two researchers announced a breakthrough: the sender and recipient need not share the same key! Instead, everybody who wanted private communication has two keys: one private, and one public. The public key is published in a directory, while the private key is kept secret, so that only the recipient need know it.

Anyone wishing to send a secure message would then encrypt the message with the recipient’s public key. The message could only be decrypted with the recipient’s private key. Even the sender couldn’t decrypt his own message!

The astute reader might question whether such an encryption method is possible: certainly every deterministic computation is reversible. Indeed, in theory it is possible to reverse the encryption method. However, as we will see it is computationally unfeasible. With the method we are about to investigate (disregarding any future mathematical or quantum breakthroughs), it would take a mind-bogglingly long time to do so. And, of course, the method works through the magic of number theory.

RSA

Rivest, Shamir, and Adleman. They look like pretty nice guys.

RSA, an acronym which stands for the algorithm’s inventors, Rivest, Shamir, and Adleman, is such a public-key encryption system. It is one of the most widely-used ciphers, and it depends heavily on the computational intractability of two problems in number theory: namely factoring integers and taking modular roots.

But before we get there, let us develop the method. Recall Euler’s totient function, \varphi(n).

Definition: Let n be a positive integer. \varphi(n) is the number of integers between 1 and n relatively prime to n.

There is a famous theorem due to Euler that states if a, n are relatively prime integers, then

\displaystyle a^{\varphi(n)} \equiv 1 \mod{n}

In other words, if we raise a to that power, its remainder after dividing by n is 1. Group theorists will recognize this immediately from Lagrange’s Theorem. While it is possible to prove it with elementary tools, we will not do so here. We cover the full proof of this theorem in our number theory primer.

In particular, we notice the natural next result that a^{k \varphi(n) + 1} \equiv a \mod{n} for any k, since this is just

\displaystyle (a^{\varphi(n)})^k \cdot a \equiv 1^ka \equiv a \mod{n}.

If we could break up k \varphi(n) + 1 into two smaller numbers, say e,d, then we could use exponentiation as our encryption and decryption method. While that is the entire idea of RSA in short, it requires a bit more detail:

Let M be our message, encoded as a single number less than n. We call n the modulus, and for the sake of argument let us say M and n are relatively prime. Then by Euler’s theorem, M^{\varphi(n)} \equiv 1 \mod{n}. In particular, let us choose a public key e (for encryption), and raise M^e \mod{n}. This is the encrypted message. Note that both e and n are known to the encryptor, and hence the general public. Upon receiving such a message, the recipient may use his private key d = e^{-1} \mod{\varphi(n)} to decrypt the message. We may pick e to be relatively prime to \varphi(n), to ensure that such a d exists. Then ed \equiv 1 \mod{\varphi(n)}, and so by Euler’s theorem

\displaystyle (M^e)^d = M^{ed} = M^{k \varphi(n) + 1} \equiv M \mod{n}

By exponentiating the encrypted text with the right private key, we recover the original message, and our secrets are safe from prying eyes.

Now for the messy detail: Where did n come from? And how we can actually compute all this junk?

First, in order to ensure M < n for a reasonably encoded message M, we require that n is large. Furthermore, since we make both n and e public, we have to ensure that \varphi(n) is hard to compute, for if an attacker could determine \varphi(n) from n, then e^{-1} \mod{\varphi(n)} would be trivial to compute. In addition, one could theoretically compute all the eth roots of M^e modulo n.

We solve these problems by exploiting their computational intractability. We find two enormous primes p,q, and set n = pq. First, recall that the best known way to compute \varphi(n) is by the following theorem:

Theorem: For p,q primes, \varphi(p^k) = p^k - p^{k-1}, and \varphi(p^j q^k) = \varphi(p^j)\varphi(q^k).

In this way, we can compute \varphi(n) easily if we know it’s prime factorization. Therein lies the problem and the solution: factorizing large numbers is hard. Indeed, it is an unsolved problem in computer science as to whether integers can be factored by a polynomial-time algorithm. Quickly finding arbitrary roots mod n is a similarly hard problem.

To impress the difficulty of integer factorization, we visit its world record. In 2009, a team of researchers successfully factored a 678-bit (232-digit) integer, and it required a network of hundreds of computers and two years to do. The algorithms were quite sophisticated and at some times fickle, failing when one node in the network went down. On the other hand, our p,q will each be 2048-bit numbers, and so their product is astronomical in comparison. In fact, even 1024-bit numbers are thousands of times harder to factor than 678-bit numbers, meaning that with the hugest of networks, it would take far longer than our lifespans just to factor a “weak” RSA modulus with the methods known today. In this respect, and for the foreseeable future, RSA is watertight.

Since we constructed n as the product of two primes, we know

\varphi(n) = \varphi(p)\varphi(q) = (p-1)(q-1),

so we can compute \varphi(n) trivially. Then, if we pick any e < \varphi(n) which is relatively prime to \varphi(n) (for instance, e itself could be prime), then we may compute the public key d via the extended Euclidean algorithm.

For a clean-cut worked example of RSA key generation and encryption, see the subsection on Wikipedia. We admit that an example couldn’t be done much better than theirs, and we use the same notation here as the writers do there.

Big Random Primes

There is one remaining problem that requires our attention if we wish to implement an RSA encryption scheme. We have to generate huge primes.

To do so, we note that we don’t actually care what the primes are, only how big they are. Generating large random odd numbers is easy: we can simply randomly generate each of its 2,048 bits, ensuring the smallest bit is a 1. Since we recall that primes are distributed roughly according to x / \log(x), we see that the chance of getting a prime at random is roughly 2 / \log(2^{2048}), which is about 1 / 710. Thus, on average we can expect to generate 710 random numbers before we get a prime.

Now that we know we’ll probably find a prime number fast, we just have to determine which is prime. There is essentially only one sure-fire primality test: the Sieve of Eratosthenes, in which we simply test all the primes from 2 to the square root of n. If none divide n, then n is prime.

Unfortunately, this is far too slow, and would require us to generate a list of primes that is unreasonably large (indeed, if we already had that list of primes we wouldn’t need to generate any more!). So we turn to probabilistic tests. In other words, there are many algorithms which determine the likelihood of a candidate being composite (not prime), and then repeat the test until that likelihood is sufficiently close to 0, and hence a certainty. Generally this bound is 2^{-100}, and the existing algorithms achieve it in polynomial time.

Unfortunately an in-depth treatment of one such primality test is beyond the scope of this post. In addition, most contemporary programming languages come equipped with one such primality test, so we put their implementations aside for a later date. To read more about probabilistic primality tests, see the list of them on Wikipedia. They are all based on special cases of Euler’s theorem, and the distribution of multiplicative inverses modulo n.

Implementation

In a wild and unprecedented turn of events, we did not use Mathematica to implement RSA! The reason for this is so that anyone (especially the author’s paranoid father) can run it. So we implemented it in Java. As always, the entire source code (and this time, an executable jar file) is available on this blog’s Github page.

Despite its relative verbosity, Java has a few advantages. The first of these is the author’s familiarity with its GUI (graphical user interface) libraries. The second benefit is that all of the important functions we need are part of the BigInteger class. BigInteger is a built-in Java class that allows us to work with numbers of unbounded size. Recall that in Mathematica unbounded arithmetic is built into the language, but older and more general-purpose languages like Java and C adhere to fixed-length arithmetic. Disregarding the debates over which is better, we notice that BigInteger has the functions:

static BigInteger probablePrime(int bitLength, Random rnd)
BigInteger modPow(BigInteger exponent, BigInteger m)
BigInteger modInverse(BigInteger m)

For clarity, the first function generates numbers which are not prime with probability at most 2^{-100}, the second computes exponents modulo “m”, and the third computes the multiplicative inverse modulo “m”. The “modPow” and “modInverse” functions operate on the context object, or the implicit “this” argument (recall Java is object-oriented [see upcoming primer on object-oriented programming]).

Indeed, this is all we need to write our program! But there are a few more specifics:

First, we need a good random number generator to feed BigInteger’s “probablePrime” function. It turns out that Java’s built-in random number generator is just not secure enough. To remedy this, we could use the “java.security.secureRandom” class, part of Java’s cryptography package; but for the sake of brevity, we instead import an implementation of the Mersenne Twister, a fast prime number generator which is not secure.

Second, there are known factoring methods for n=pq if p \pm 1 or q \pm 1 has only small prime factors. These are due to Pollard and Williams. So we include a special method called “isDivisibleByLargePrime”, and screen our candidate prime numbers against its negation. The largest prime we test for is 65537, the 6543rd prime. The details of this function are in the source code, which again is available on this blog’s Github page. It is not very interesting.

Third, we notice that the choice of public key is arbitrary. Since everyone is allowed to know it, it shouldn’t matter what we pick. Of course, this is a bit too naive, and it has been proven that if the public key e is small (say, 3), then the RSA encryption is less secure. After twenty years of attacks trying to break RSA, it has been generally accepted that public keys with moderate bit-length and small Hamming weight (few 1’s in their binary expansion) are secure. The most commonly used public key is 65537, which is the prime 2^{16} +1 = \textup{0x10001}. So in our implementation, we fix the public key at 65537.

Finally, in order to make our String representations of moduli, public keys, and private keys slightly shorter, we use alphadecimal notation (base 36) for inputting and outputting our numbers. This has the advantage that it uses all numerals and characters, thus maximizing expression without getting funky symbols involved.

Results

Here is a snapshot of the resulting Java application:


As you can see, the encrypted messages are quite long, but very copy-pasteable. Also, generating the keys can take up to a minute, so have patience when pressing the “Generate Keys” button under the tab of the same name.

There you have it! Enjoy the applet; it’s for everyone to use, but despite all my due diligence in writing the software, I wouldn’t recommend anyone to rely on it for national security.

Feel free to leave me a comment with a super-secret RSA-encoded message! Here is my encryption modulus and my public key:

public key: 1ekh

encryption modulus:
5a1msbciqctepss5dfpp76rdb5selhybzhzerklbomyvwlohasuy81r9rbd3scvujpqn9e7m5hp52qv4fli9f4bupg23060wmaf94zq94s4j22hwyi764kk0k6w8is05tyg7029ub6ux4vb4bq9xxzw9nkfs0pfteq7wnm6hya7b2l2i1w8jh25913qye67gz8z4xrep1dcj8qpyxyi56ltukqyv366hei4nme6h9x00z16cbf8g76me1keccicsgd268u3iocvb6c5lw00j234270608f24qelu8xfjcddc7n9u0w2tf46bl1yzsjb8kb5ys9gh51l0ryetge1lwh1ontenraq35wv5f4ea57zae1ojcsxp3cxpx84mbg0duoln2vny7uixl3ti4n2flvfats4wz0h1c34cgxdyixb7ylci6t4dk8raqcbdi3yxclktvb7yv1sb61nxk1kylfp0h7wqtrogf8c039mc6bqe8b7eixb72pfz4sajw1rf170ck2vysy1z6bgyngrhyn8kpepd0btcyuyj0scdshlexlg4uolls8p6frxj8dt4ykcnddeuvf7dfz1qyqpjk7ljwr1hdgaecyqa6gsxryodrup1qpwgieot6v8c5bbizxm45qj4nez5cpe9q12m0pyeszic5dtb1yc0rm7lzzddg254uf390rk4irecfxibpv2lnk9zper3kg729w32n7u0qn8mamtmh80fo6hd0n5a50d9kzft1g3bky2t2d2f1a574ukigx1dfgqrtehnmx5nd

Until next time!

False Proof – All Numbers are Describable in at Most Twenty Words

Problem: Show that every natural number can be unambiguously described in fewer than twenty words.

“Solution”: Suppose to the contrary that not every natural number can be so described. Let S be the set of all natural numbers which are describable in fewer than twenty words. Consider R = \mathbb{N}-S, the set of all words which cannot be described in fewer than twenty words.

Since R is a subset of the natural numbers, which is well-ordered, it has a unique smallest element which we call r. Now, we may describe r unambiguously with the sentence: “the smallest natural number which cannot be unambiguously described in fewer than twenty words.” Since this description uses only fourteen words, we see that r \in S, a contradiction. Hence, R must be the empty set. \square

Explanation: Let’s analyze this a bit further. Suppose that every number were indeed unambiguously describable in fewer than twenty words. It should be obvious that this admits only finitely many descriptions of infinitely many numbers!

In particular, let there be n words in the English language, which we for the sake of argument say is the number of words in the Oxford English Dictionary. Then there are \displaystyle \sum \limits_{k=1}^{20} n^k different phrases. This is a large number, but it is indeed finite. Even if every phrase describes a natural number, there’s no way that we could get them all! This proof is clearly nonsense.

This apparent paradox is in the same vein as Russell’s paradox, which we cover at the end of our set theory primer. Indeed, in its paradox form, this problem is sometimes called the Richard-Berry paradox. The problem is with what kinds of sets we construct. While with Russell’s paradox, the problem was with elements of our set, here it is with the propositions used to determine membership. This proof gives evidence that the English language (and any human language, really), is not rigorous enough for the purposes of mathematics. It also convinces us that naive set theory is problematic.

We find the meat of the problem when we finally get down to the definition of a description. In short (and I don’t want to spoil upcoming content on this blog), a description of a number is a program which computes the number (on some fixed universal Turing machine U). Note that this assumes that whatever a human can compute a computer can too, which is a controversial statement better known as the Church-Turing Thesis. If we further define a number’s “definition” as the shortest program which computes it, then this statement transforms into “x is the smallest integer whose definition has fewer than 100 characters.” As it turns out, when this English description is appropriately reformulated on a Turing machine, it is not an effective description: the statement is undeciable. In fact, once we get into the theory of Kolmogorov complexity, we will find that Berry’s paradox can be used to prove Gödel’s incompleteness theorem!

The problem of description is a very deep and historically debated one. The various approaches and sub-fields are encapsulated in the study of information theory. One such framework for descriptions is provided by Kolmogorov complexity, which we are diligently working toward understanding. Keep an eye out for our future primer on the subject, which we will write as extra background for the already-written post on low-complexity art and unknown future topics.

Eigenfaces, for Facial Recognition

This post assumes familiarity with the terminology and notation of linear algebra, particularly inner product spaces. Fortunately, we have both a beginner’s primer on linear algebra and a follow-up primer on inner products.

The Quest

We are on a quest to write a program which recognizes images of faces. The general algorithm should be as follows.

  1. Get a bunch of sample images of people we want to recognize.
  2. Train our recognition algorithm on those samples.
  3. Classify new images of people from the sample images.

We will eventually end up with a mathematical object called an eigenface. In short, an eigenface measures variability within a set of images, and we will use them to classify new faces in terms of the ones we’ve already seen. But before we get there, let us investigate the different mathematical representations of images.

God has given you one face, and you make yourself a vector

Most naturally, we think of an image as a matrix of pixel values. For simplicity, we restrict our attention to grayscale images. Recall that a pixel value in the standard grayscale model is simply an unsigned byte representing pixel intensity. In other words, each pixel is an integer ranging from 0 to 255, where 0 is black and 255 is white. So every n \times m image corresponds bijectively to a matrix with integer entries between 0 and 255.

Representing an image as a matrix reminds us of the ubiquitous applicability of linear algebra. Indeed, we may learn a great deal about our image by representing it with different bases or querying pixel neighbors. We can find frequencies, detect edges, and do a whole host of other fascinating things. However, we aren’t only concerned with the properties of one picture. In fact, individual pictures are useless to us! We only care about the relationships between a set of pictures, and we want to be able to compute “how similar” two faces are to each other.

In other words, we want a face space.

So instead of representing our images as a matrix, let’s represent them as points. Given an n \times m matrix A with entries a_{i,j}, we simply “collapse” the rows of a matrix into a single row like so:

v = (a_{1,1}, \dots , a_{1,m}, a_{2,1}, \dots , a_{2,m}, \dots , a_{n,1}, \dots , a_{n,m})

Now we have our face space: \mathbb{R}^{n m}. Note that even for small images, this space is huge. For example, with images of size 100 \times 100, this space has ten-thousand dimensions! Certainly, reasoning about such objects would be impossible without a computer, but even then, we will run into trouble.

In order to translate back and forth between the standard matrix representation and the vector representation, we present the following Mathematica code. As always, the full notebook including all the code we provide here is available on this blog’s Github page.

imageToVector[img_] := Flatten[ImageData[img]];
vectorToImage[vec_, {n_, m_}] := Image[Partition[vec, m]];

Now all we need is a set of example faces. We found such an example from a facial recognition research group, and we posted the images on our Google code page. It contains a large number of images of size 200 \times 180 pixels. Here are a few samples:

This bounty of data is excellent. Now we may begin to play.

Mean Faces

In order to classify faces we would like to investigate the distribution of faces in face space. We start by computing the “mean face,” which represents the center of gravity of our sample of faces. We can do this simply by averaging the values of each pixel across all our face images.

We start by selecting a sample of male faces, at most one for each person in our database, and transforming them to grayscale:

(* This directory is particular to my file system. Change it
appropriately. *)

files = Import["~/downloads/faces94/male"][[1 ;; -1 ;; 30]];
faces = Map[Import["~/downloads/faces94/male/" <> #] &, files];

grayFaces = Map[ColorConvert[#, "Grayscale"] &, faces];

Then, we construct a vector with the average pixel intensity values for each pixel:

meanImage = Image[
  Apply[Plus, Map[ImageData, grayFaces]] / Length[grayFaces]
]

ImageData accepts an image object (as represented in Mathematica) and returns the matrix of pixel values. Note that for this particular operation, which is just adding and dividing, it is unnecessary to translate faces from matrices to vectors.

The result is the following image:

Honestly, for a mean face I expected something more sinister. Just for kicks, let’s watch the averaging process incrementally:

This is a nice (seemingly fast) convergence. We casually wonder how much this particular image is subject to change with data coming from different cultures, as most of our data are twenty- or thirty-something white males.

Now that we have a mean, we may calculate the deviations of each image from our mean. Again, this is a simple subtraction of pixel values, which does not require special representation.

differenceFaces = Map[ImageSubtract[#, meanImage]&, grayFaces];

And get some pictures that look like these:

Don’t they look much nicer now that we’ve subtracted off the mean face? In all seriousness, we call these difference faces.

As one can see, some of these difference faces are darker than others. For whatever reason (perhaps some faces are in slightly more agreeable positions), the lighter faces are more “unique” in this sample of faces. We will see this notion come up again when we compute the actual eigenfaces: some will resemble the more variable faces.

While this is fine and dandy, we don’t yet have a way to recognize anybody. For this, we turn to statistics.

Covariance

We may interpret our face vectors in face space as a distribution of n random variables which are precisely the people in the training sample. Specifically, when we try to classify a new face, we want to calculate the distance of that face from each of our training faces, and infer how likely it is to be a specific person.

In reality, however, since our faces are points in a space of 180 \times 200 = 36,000 dimensions, we have 36,000 random variables: one for each coordinate axis. While we recognize that this is far too many variables for us to do any computations with, we play along for now.

Taking our cue from statistics, we want to investigate the variability of the set of face vectors. To do this we use a special matrix called the covariance matrix of our distribution.

Definition: The covariance of two variables, denoted \textup{Cov}(x,y), is the expected value of the product of their deviations from their means: \textup{E}[(x-\textup{E}[x])(y-\textup{E}[y])].

Colloquially, the covariance measures the relationship between how two variables change. A high positive covariance means that a positive change in one variable correlates to a positive change in the other. A highly negative covariance implies a positive change in one variable correlates to a negative change in the other. If the two variables are independent, then their covariance is zero, though the reverse implication is not true in general.

As an example, consider the following distribution of points in the plane:

This distribution clearly has a nontrivial covariance in the x and y variables. It looks like an ellipse on a tilted axis signified by the two black arrows. In fact, we will soon be able to calculate those arrows; they have very special significance for us.

So just finding the variance of x and y (as in the coordinate axes) is not enough to fully describe this distribution. We need three numbers: the variance of x, the variance of y, and the covariance of x,y. As a side note, the variance of a variable is equivalently the covariance of that variable with itself.

We represent these three pieces of data in a covariance matrix, which for the two variables above has the form:

\begin{pmatrix} \textup{Cov}(x,x) & \textup{Cov}(x,y) \\ \textup{Cov}(y,x) & \textup{Cov}(y,y) \end{pmatrix}

Since the “Cov” as a function is symmetric in its arguments, we see that every covariance matrix is symmetric. This will have important implications for us, as symmetric matrices have a rich theory for us to exploit.

Once we compute the covariance matrix of our two variables, we want to describe the variance in terms of the “axes” as above. In other words, if we could just tilt, stretch, and move the data the right way, we could put it back on top of the normal unit coordinate axes (1,0) and (0,1). The process we just described is precisely a linear map that performs a change of basis. In particular, those two black arrows constitute the best basis to describe the data. So if we could just compute the right basis, we’d know everything about our distribution!

Now it’s no coincidence that the particular basis above is so nice. Specifically, if we consider our random variables as unit measurements along those axes (remember, one axis is stretched, so a unit measurement is longer there), then the two variables have zero covariance. In particular, the covariance matrix of the random variables in that basis is diagonal! Recalling that if a matrix has a diagonal representation, then the basis for that representation is a basis of eigenvectors. Hence, we see that computing any basis of eigenvectors will give us our optimal basis, should one exist.

And now, the coup de grâce, we recall that every symmetric real-valued matrix, and hence every covariance matrix, has an orthonormal basis of eigenvectors! This is a special case of the spectral theorem, which we discuss in our primer on inner product spaces.

So now we have proven that for any distribution of data in n random variables, we may describe them with a basis of eigenvectors, such that the variables are pairwise uncorrelated. Let’s apply this to our face space.

Methinks no eigenface so gracious is as mine

In one mindset, we may compute the covariance matrix of our difference faces quite easily:

differenceVectors = Map[imageToVector, differenceFaces];
covarianceMatrix = Transpose[differenceVectors].differenceVectors;

For the Java coders, note that the dot notation is for vector products; Mathematica is not object oriented, but rather it is functional with built in hash maps for mutation. Furthermore, in it’s raw form the “differenceVectors” variable is a 76-element list of face vectors.

These dot products, however, are precisely the computations of covariance, since each dot product computed is between two difference face vectors, each component of which is one of our random variables. Perform a few of the computations by hand (on a smaller matrix, obviously!) to convince yourself that it is so. The i,j entry of our resulting covariance matrix is just \textup{Cov}(x_i,x_j).

Here we are multiplying a 36,000 \times 76 matrix (76 face vectors of length 36,000) by a 76 \times 36,000 matrix, resulting an a 36,000 \times 36,000 matrix. While we could theoretically compute the eigenvalues and eigenvectors of this matrix directly, in reality even the matrix multiplication to construct the matrix uses up all available memory and crashes the Mathematica kernel (how wimpy my netbook is that it can’t handle a billion-entry matrix!). In any case, we need another way to get our eigenvectors.

Again going back to linear algebra, we have a few useful propositions:

Proposition: If A is a real matrix, then AA^{\textup{T}} and A^{\textup{T}}A are symmetric square matrices.

Proof. The squareness of these products is trivial. We prove symmetry of the first, and note the argument is identical for the second.

Here symmetry is equivalent to A^{\textup{T}}A = (A^{\textup{T}}A)^{\textup{T}}. We can see this immediately by using properties of transposition. Alternatively, we can note that the i,j entry of the left side is the dot product of the ith row of A^{\textup{T}} with the jth row of A. On the other hand, the j,i entry is the dot product of the jth column of A^{\textup{T}} and the ith column of A. Since the two factors are transposes of each other, the ith row of A^{\textup{T}} is equal to the ith column of A, and similarly for the js. In other words, the dot products described for the i,j and j,i entries are dot products of the same two vectors, and are hence equal. \square

In particular, we will use the symmetry about the smaller AA^{\textup{T}} to get information about the eigenvectors of the larger A^{\textup{T}}A. We just need the following proposition:

Proposition: Let a real matrix A have dimension n \times m, where n \leq m (here a transposition of A maintains generality). Then the number of eigenvectors with nonzero eigenvalue of A^{\textup{T}}A \ (m \times m) is no greater than the number of eigenvectors with nonzero eigenvalue of AA^{\textup{T}} \ (n \times n)

In particular, the eigenvectors with zero eigenvalues are removable; they correspond to zero variability of a face in that particular direction. So for our purposes, it suffices to find the eigenvalues with nonzero eigenvalue. As we are about to see, this is generally much smaller than the total number of eigenvectors. In fact, the result above follows from a stronger statement:

Proposition: If v is an eigenvector with nonzero eigenvalue of A^{\textup{T}}A, then Av is an eigenvector with the same eigenvalue of AA^{\textup{T}}.

Proof. Letting v be such an eigenvector, with corresponding eigenvalue \lambda \neq 0, we have A^{\textup{T}}Av = \lambda v, and hence

(AA^{\textup{T}})Av = A(A^{\textup{T}}A)v = A \lambda v = \lambda Av

This completes the proof. \square

By the same reasoning, if v is an eigenvector with nonzero eigenvalue of AA^{\textup{T}}, then A^{\textup{T}}v is an eigenvector of A^{\textup{T}}A.

Now we have just given a bijection between the eigenvectors with nonzero eigenvalue of AA^{\textup{T}} and the much smaller A^{\textup{T}}A! This is great, because if we compute in our face problem, the former is a mere 76 \times 76 matrix! Computing the eigenvectors is then a one-line affair, utilizing Mathematica’s fast library for it:

eigenfaceSystem = Eigensystem[covarianceMatrix];

This returns the eigenvalues in decreasing order, with their corresponding eigenvectors attached. Note that we still need to multiply them by the transpose of our “differenceVector” matrix.

The astute reader will notice that we named these eigenvectors eigenfaces. Though they are indeed just plain old eigenvectors, they have a special interpretation as images. Reforming them as 200 \times 180 grayscale pictures, and scaling the values to [0,255], we reveal truly haunting faces:

As creepy as they look, one must recall their astonishing interpretation: each ghostly face represents a random variable and the largest change of that random variable along its axis. Specifically, by finding these eigenfaces, we translated our notion of dimension from having one for each pixel to having one for each person in our training set, and these eigenfaces represent shared variability among the faces of those people. In other words, these faces represent the largest similarities between some faces, and the most drastic differences between others. This is one of the most stunning visualizations of dimension this mathematician has ever seen.

This Face is Your Face, This Face is My Face

Though we only display a few above, we computed a basis of 76 different eigenfaces. Call the subspace spanned by these basis vectors (which is certainly a small subspace of \mathbb{R}^{36,000}) the eigenface subspace. For the purpose of learning new faces, we may reduce face space to the eigenface subspace, and hence represent any face as a linear combination of the eigenfaces.

To do this, we recall once again that the spectral theorem not only provided us a basis of eigenvectors, but it guaranteed that this basis is orthonormal. Recalling our primer, the projections of a vector onto elements of an orthonormal basis give us the needed coefficients for that vector’s expansion. If we label our eigenfaces f_i, then any face c can be decomposed as

c = \left \langle c, f_1 \right \rangle f_1 + \dots + \left \langle c,f_{76} \right \rangle f_{76}

Here since we are working within \mathbb{R}^{36,000}, we may simply use the standard dot product as our inner product. Hence, in Mathematica the coefficients are easy to compute:

projectImageToFaceSpace[image_, meanImage_, eigenfaces_]:=
 Module[{imageVec, diffVec, meanVec},
  imageVec = imageToVector[image];
  meanVec = imageToVector[meanImage];
  diffVec = imageVec - meanVec;
  Map[(diffVec.#)&, eigenfaces]
 ];

For the following face, we present its coefficients when projected onto each of the 76 eigenfaces, in order by decreasing eigenvalues.

coefficients =
 {-6.85693, 23.7498, -11.4515, -3.43352, 5.24749, -7.1615,
  8.09015, -9.7205, -0.660834, -2.4148, -10.3942, 3.33424,
  2.94988, -2.75981, 3.02687, -2.4499, -2.09885, -5.98832,
  -4.22564, -0.65014, 2.20144, -5.43782, -9.61821, -3.25227,
  7.49413, -0.145002, 7.61483, -0.696994, -3.7731, 3.23569,
  -1.78853, 0.0400116, -3.86804, -2.02456, 2.20949, -1.86902,
  1.23445, 0.140996, 0.698304, -0.420466, 2.30691, 3.70434,
  1.02417, 0.382809, 0.413049, -0.994902, 0.754145, 0.363418,
  -0.383865, 1.46379, 1.96381, -2.90388, -2.33381, -0.438939,
  -0.30523, -0.105925, 0.665962, -0.729409, -1.28977, 0.150497,
  0.645343, 0.30724, -1.04942, 1.0462, -0.60808, 0.333288,
  1.09659, -1.38876, 0.33875, 0.278604, 1.0632, -0.0446148,
  0.24526, -0.283482, -0.236843, 0.312122};

And with the following piece of code, we can reconstruct the image exactly from the eigenfaces:

vectorToImage[
  imageToVector[meanImage] + Apply[Plus, coefficients*eigenfaces],
{200, 180}]

While this is fine and dandy, we can do better computationally. As the magnitude of an eigenface’s corresponding eigenvalue decreases, we see, both visually and mathematically, that the eigenface contributes less to the overall variability of the distribution of faces. In other words, we can shed some of the shorter coordinate axes and still retain most of our description of face space!

Here is an animation of the face reconstruction where we choose the first k eigenfaces (i.e., with the largest k corresponding eigenvalues), and increase k:

We notice some “pauses” in the animation, which correspond to either very small coefficients or eigenfaces which only have small regions of variability.

Unfortunately, and this is in the author’s opinion an inherent wishy-washiness of applied mathematics, how many eigenfaces to use is an empirically determined variable. Exercising our best judgement, we find that the image reconstruction is easily recognizable after 30 eigenfaces are used (about seven seconds into the animation above), so we use those for our reduced eigenface subspace.

Recognition Procedure

Now, to polish off our discussion, we have the quite simple recognition procedure. Once we have projected every sample image into the eigenface subspace, we have a bunch of points in \mathbb{R}^{76}.

When given a new image, all we need do is project it into face space, and find the smallest distance between the new point and each of the projections of the sample images. We do so in the Mathematica notebook (available from this blog’s Github page), and also run the recognition procedure on some pictures of the training subjects not used in training.

The reader is invited to look at the results, and see what sorts of changes in appearance mess up recognition the most. As a sneak preview, it seems that head-tilting, eye-closing, and smiling account for a lot of variability. If the reader is a fugitive on the run (and somehow find’s time to read this blog; I’m so honored!), and wishes to have his photograph not recognized, he should smile, tilt his head, and blink vigorously while the photograph is being taken. Additionally, to avoid detection, he should split some of his heist money with the author.

We notice one additional problem: in our haphazard way of selecting individuals for training and evaluation, we mistakenly attempted to classify some individuals that were never in a training photo. What’s worse, is that some of these men look strikingly similar to different people who were used in the training photos, with obvious differences like ears which stick out or thicker noses.

To handle subjects who were not in training, we need a category for those individuals who could not be classified. This is straightforward enough; we just have a minimum threshold. If an image is not within, say 25 units of any of our training faces, then it cannot be classified. This threshold can only be chosen empirically, which is again an unfortunate consequence of moving from pure mathematics to the real world.

With this modification, we see that our final evaluation sample of thirty faces, three of which were not in the original training set, results in only two false positive matches, and one false negative. This method appear to be relatively robust, considering its simplicity.

For more precise results, the reader is encouraged to play with the data and run larger tests. The faces used are organized by gender and name in a downloadable zip file on this blog’s Github page. The archive contains about twenty photos for each individual, and about eighty individuals. Have a blast!

An Alternative Metric

While Euclidean distance is fine and dandy, there are a whole host of other methods for classifying points in Euclidean space. One could potentially train a neural network to perform classification, or dig deeper down the linear algebra rabbit hole and run the points through a support vector machine. Of course, with the rich ocean of literature on classification problems, there are likely many other applicable methods that the reader could explore. [Note: we intend to cover both neural networks and support vector machines on this blog in due time.]

However, there is one less drastic change we can make in our recognition procedure: ditch Euclidean distance. We may want to do so because our axes (the eigenfaces, which recall are eigenvectors of random variables), are at different scales. Euclidean distance hence treats one unit along a short axis as importantly as it treats a fraction of a longer axis. But the distances along shorter axes are more variable, and hence a small change there should mean more than it does elsewhere. This simply cannot do.

Our remedy is the Mahalanobis metric. It is specifically designed to utilize the covariance matrix to measure weighted distances.

Definition: Given a covariance matrix S, the Mahalanobis metric m is defined as

\displaystyle m(x,y) = \sqrt{(x-y)^{\textup{T}} S^{-1} (x-y) }

However, since our covariance matrix is diagonal (using our basis of eigenvectors, the entries are just the corresponding eigenvalues), S^{-1} is easy to compute: we just invert each diagonal entry. We encourage the reader to compare the two distance metrics’ performance in eigenface recognition. Augmenting our Mathematica notebook to do so should not require more than a few lines of code.

Finally, a few extensions and modifications of the eigenface method have popped up since their conception in the late 1980’s. In particular, they seek to compensate for the eigenface approach’s weakness to changes in lighting, orientation, and scaling. One such method is called “fisherfaces,” another “eigenfeatures,” and yet another the “active appearance model.” The latter two appear to map out “landmarks” on the image being processed, combining traditional facial metrics or anomalies with the eigenface decomposition technique. This author may return to an investigation of other facial recognition systems in the future, but for now we have too many other ideas.

Additional Uses

The eigenface method for facial recognition hints at a far more general technique in mathematics: breaking an object down into components which yield information about the whole. This is everywhere in mathematics: group theory, Fourier analysis, and of course linear algebra, to name a few. For linear algebra, the method is called principal component analysis, and this method has wide application in both statistics and machine learning methods in general (which are nowadays largely statistical procedures anyway).

Aside from using eigenfaces to classify faces or other objects, they could be used simply for facial detection. The projection of a facial image into face space, whether the image is used for training or not, will almost always be relatively close to some training image. While the distance might not be small enough to determine a person’s identity, it is usually close enough to determine whether the image is of a face. We encourage the reader to use our code to write a function for facial detection. We anticipate it require less than twenty lines of code, and it should be very similar to our function for classification.

Another simplified application is to classify gender. We casually wonder if it would be more robust, and point the reader to this paper on it. The researcher here has a shockingly high-quality data set, and he also tries his hand on some general, low-quality data sets. It’s a good, quick read for one who has absorbed the content here.

Finally, eigenfaces can also be used as a method for image compression. Since we reduce an image to the 76 coefficients used to rebuild it from the eigenfaces, we see that an image is compressed from being 36,000 bytes to 76 floats. With a large collection of thousands of images and an agreeable tolerance for image approximation, eigenfaces could save quite a lot of space.

So we see that even though our original task was to classify faces, we have stumbled upon a whole host of other solutions to problems in computer science. Indeed, a method for image compression is quite far from the pure mathematics of orthonormal eigenvector bases. We attribute the connections to the beauty of mathematics and computer science, and look forward to the next time we may witness such a connection.

Until then!