Let me explain. The vertical axis represents the error of a hypothesis. The horizontal axis represents the complexity of the hypothesis. The blue curve represents the error of a machine learning algorithm’s output on its training data, and the red curve represents the *generalization* of that hypothesis to the real world. The overfitting phenomenon is marked in the middle of the graph, before which the training error and generalization error both go down, but after which the training error continues to fall while the generalization error rises.

The explanation is a sort of numerical version of Occam’s Razor that says more complex hypotheses can model a *fixed* data set better and better, but at some point a simpler hypothesis better models the underlying phenomenon that generates the data. To optimize a particular learning algorithm, one wants to set parameters of their model to hit the minimum of the red curve.

This is where things get juicy. Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting **doesn’t overfit** on many datasets. In fact, and this is a shocking fact, researchers observed that boosting would hit **zero training error**, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty.

Schapire, Freund, Bartlett, and Lee proposed a theoretical explanation for this based on the notion of a *margin*, and the goal of this post is to go through the details of their theorem and proof. Remember that the standard AdaBoost algorithm produces a set of *weak hypotheses* $h_1, \dots, h_T$ and a corresponding weight $\alpha_t$ for each round $t = 1, \dots, T$. The classifier at the end is a weighted majority vote of all the weak learners (roughly: weak learners with high error on “hard” data points get less weight).

**Definition:** The *signed confidence* of a labeled example $(x, y)$ is the weighted sum:

$$\text{conf}(x) = \sum_{t=1}^T \alpha_t h_t(x)$$

The *margin* of $(x, y)$ is the quantity $\text{margin}(x, y) = y \cdot \text{conf}(x)$. The notation implicitly depends on the outputs of the AdaBoost algorithm via “conf.”
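To make the definition concrete, here is a minimal Python sketch of the signed confidence and the margin of a weighted vote. The three threshold learners, their weights, and the function names `confidence` and `margin` are made up for illustration; they are not the output of an actual AdaBoost run.

```python
def confidence(x, learners, weights):
    """Signed confidence: the weighted sum of the weak learners' votes on x."""
    return sum(w * h(x) for h, w in zip(learners, weights))


def margin(x, y, learners, weights):
    """Margin of the labeled example (x, y): the label times the signed confidence."""
    return y * confidence(x, learners, weights)


# Three hypothetical threshold "stumps" on the real line, each voting -1 or +1.
learners = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > -1 else -1,
]
weights = [0.5, 0.25, 0.25]  # hypothetical weights, chosen to sum to 1

print(margin(1.0, +1, learners, weights))  # 0.5: correct, but only two of three votes agree
print(margin(1.0, -1, learners, weights))  # -0.5: same vote, wrong label
print(margin(3.0, +1, learners, weights))  # 1.0: unanimous and correct
```

A negative margin means the vote got the label wrong, and a margin near 1 means the vote was both correct and nearly unanimous.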

We use the product of the label and the confidence for the observation that $y \cdot \text{conf}(x) \leq 0$ if and only if the classifier is incorrect. The theorem we’ll prove in this post is

**Theorem:** With high probability over a random choice of training data, for any $\theta > 0$ the generalization error of boosting is bounded from above by

$$\Pr_{\text{train}}[\text{margin}(x, y) \leq \theta] + O\left(\frac{1}{\theta} \cdot (\text{typical error terms})\right)$$

In words, the generalization error of the boosting hypothesis is bounded by the distribution of margins observed *on the training data*. To state and prove the theorem more generally we have to return to the details of PAC-learning. Here and in the rest of this post, $\Pr_D$ denotes $\Pr_{(x, y) \sim D}$, the probability over a random example drawn from the distribution $D$, and $\Pr_S$ denotes the probability over a random (training) set $S$ of examples drawn from $D$.

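**Theorem:** Let $S$ be a set of $m$ random examples chosen from the distribution $D$ generating the data. Assume the weak learner corresponds to a finite hypothesis space $H$ of size $|H|$, and let $\delta > 0$. Then with probability at least $1 - \delta$ (over the choice of $S$), **every** weighted-majority vote function $f$ satisfies the following generalization bound for every $\theta > 0$.

$$\Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O\left(\frac{1}{\sqrt{m}} \sqrt{\frac{\log m \log |H|}{\theta^2} + \log(1/\delta)}\right)$$

In other words, this phenomenon is a fact about voting schemes, not boosting in particular. From now on, a “majority vote” function $f(x)$ will mean to take the sign of a sum of the form $\sum_i a_i h_i(x)$, where $h_i \in H$, $a_i \geq 0$, and $\sum_i a_i = 1$. In the bounds, $f(x)$ stands for the sum itself, so that $y f(x)$ is the margin. This is the “convex hull” of the set of weak learners $H$. If $H$ is infinite (in our proof it will be finite, but we’ll state a generalization afterward), then only finitely many of the $a_i$ in the sum may be nonzero.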

To prove the theorem, we’ll start by defining a class of functions corresponding to “unweighted majority votes with duplicates:”

**Definition:** Let $C_N$ be the set of functions $g(x)$ of the form $\frac{1}{N} \sum_{i=1}^N h_i(x)$ where $h_i \in H$ and the $h_i$ may contain duplicates (some of the $h_i$ may be equal to some other of the $h_j$).

Now every majority vote function $f$ can be written as a weighted sum of $h_i$ with weights $a_i$ (I’m using $a$ instead of $\alpha$ to distinguish arbitrary weights from those weights arising from Boosting). So any such $f(x)$ defines a natural distribution over $H$ where you draw function $h_i$ with probability $a_i$. I’ll call this distribution $A_f$. If we draw from this distribution $N$ times and take an unweighted sum (divided by $N$), we’ll get a function $g(x) \in C_N$. Call the random process (distribution) generating functions in this way $Q_f$. In diagram form, the logic goes

$f \to$ weights $a_i \to$ distribution over $H \to$ function in $C_N$ by drawing $N$ times according to $A_f$.

The main fact about the relationship between $f$ and $Q_f$ is that each is completely determined by the other. Obviously $Q_f$ is determined by $f$ because we defined it that way, but $f$ is also completely determined by $Q_f$ as follows:

$$f(x) = \mathbb{E}_{g \sim Q_f}[g(x)]$$

Proving the equality is an exercise for the reader.
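If you’d like to see the claim in action before proving it, here is a small sanity check (not a proof). It assumes the setup described above: draw $N$ weak learners independently, each with probability equal to its weight, average their votes to get $g$, and compare the expected value of $g(x)$ with the weighted vote $f(x)$. The learners, weights, and function names are hypothetical, and the expectation is computed exactly by enumerating every possible sequence of draws.

```python
from fractions import Fraction
from itertools import product


def vote(x, learners, weights):
    """f(x): the weighted sum a_1 h_1(x) + ... + a_n h_n(x)."""
    return sum(a * h(x) for h, a in zip(learners, weights))


def expected_sampled_vote(x, learners, weights, N):
    """E[g(x)], where g averages N learners drawn i.i.d. with probabilities `weights`."""
    total = Fraction(0)
    for draw in product(range(len(learners)), repeat=N):  # every sequence of N draws
        prob = Fraction(1)
        for i in draw:
            prob *= weights[i]
        g_x = Fraction(sum(learners[i](x) for i in draw), N)
        total += prob * g_x
    return total


# Hypothetical weak learners and weights (the weights sum to 1).
learners = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: -1 if x > 1 else 1,
]
weights = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

for x in (-1, 0.5, 1.5, 3):
    print(x, vote(x, learners, weights), expected_sampled_vote(x, learners, weights, N=4))
```

Exact fractions are used so the two columns can be compared with `==` rather than up to rounding error.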

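*Proof of Theorem.* First we’ll split the probability into two pieces, and then bound each piece.

First a probability reminder. If we have two events $A$ and $B$ (in what’s below, these will be $y f(x) \leq 0$ and $y g(x) \leq \theta/2$), we can split up $\Pr[A]$ into $\Pr[A \text{ and } B] + \Pr[A \text{ and } \overline{B}]$ (where $\overline{B}$ is the opposite of $B$). This is called the law of total probability. Moreover, because $\Pr[A \text{ and } B] = \Pr[A \mid B] \Pr[B]$ and because these quantities are all at most 1, it’s true that $\Pr[A \text{ and } B] \leq \Pr[A \mid B]$ (the conditional probability) and that $\Pr[A \text{ and } B] \leq \Pr[B]$.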

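Back to the proof. Notice that for any $g \in C_N$ and any $\theta > 0$, we can write $\Pr_D[y f(x) \leq 0]$ as a sum:

$$\Pr_D[y f(x) \leq 0] = \Pr_D[y g(x) \leq \theta/2 \text{ and } y f(x) \leq 0] + \Pr_D[y g(x) > \theta/2 \text{ and } y f(x) \leq 0]$$

Now I’ll loosen the first term by removing the second event (that only makes the whole probability bigger) and loosen the second term by relaxing it to a conditional:

$$\Pr_D[y f(x) \leq 0] \leq \Pr_D[y g(x) \leq \theta/2] + \Pr_D[y g(x) > \theta/2 \mid y f(x) \leq 0]$$

Now because the inequality is true for every $g \in C_N$, it’s also true if we take an expectation of the RHS over any distribution we choose. We’ll choose the distribution $Q_f$ to get

$$\Pr_D[y f(x) \leq 0] \leq T_1 + T_2$$

And $T_1$ (term 1) is

$$T_1 = \Pr_{x \sim D, g \sim Q_f}[y g(x) \leq \theta/2] = \mathbb{E}_{g \sim Q_f}\left[\Pr_D[y g(x) \leq \theta/2]\right]$$

And $T_2$ is

$$T_2 = \Pr_{x \sim D, g \sim Q_f}[y g(x) > \theta/2 \mid y f(x) \leq 0] = \mathbb{E}_D\left[\Pr_{g \sim Q_f}[y g(x) > \theta/2 \mid y f(x) \leq 0]\right]$$

We can rewrite the probabilities using expectations because (1) the variables being drawn in the distributions are independent, and (2) the probability of an event is the expectation of the indicator function of the event.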

Now we’ll bound the terms $T_1, T_2$ separately. We’ll start with $T_2$.

Fix $(x, y)$ and look at the quantity inside the expectation of $T_2$.

$$\Pr_{g \sim Q_f}[y g(x) > \theta/2 \mid y f(x) \leq 0]$$

This should intuitively be very small for the following reason. We’re sampling $g$ according to a distribution whose expectation is $f$, and we *know* that $y f(x) \leq 0$. Of course $y g(x)$ is unlikely to be large.

Mathematically we can prove this by transforming the thing inside the probability to a form suitable for the Chernoff bound. Since $\mathbb{E}[y g(x)] = y f(x) \leq 0$, saying $y g(x) > \theta/2$ implies $|y g(x) - \mathbb{E}[y g(x)]| > \theta/2$, i.e. that some random variable which is a sum of independent random variables (the $h_i$) deviates from its expectation by at least $\theta/2$. Since the $y$’s are all $\pm 1$ and constant inside the expectation, they can be removed from the absolute value to get

$$\leq \Pr_{g \sim Q_f}\left[|g(x) - \mathbb{E}[g(x)]| > \theta/2\right]$$

The Chernoff bound allows us to bound this by an exponential in the number of random variables in the sum, i.e. $N$. It turns out the bound is $e^{-N \theta^2 / 8}$.
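Here is a quick numerical check of that step, under the assumption that the bound in question is $e^{-N\theta^2/8}$. The sketch models $y g(x)$ as the average of $N$ independent $\pm 1$ votes whose mean `mu` (playing the role of $y f(x)$) is at most zero, computes the exact probability that the average exceeds $\theta/2$ from the binomial distribution, and prints it next to the bound. The function names and parameter values are just for illustration.

```python
from math import comb, exp


def tail_probability(N, mu, theta):
    """Exact Pr[average of N i.i.d. +/-1 votes with mean mu exceeds theta / 2]."""
    q = (1 + mu) / 2  # probability that a single vote is +1
    total = 0.0
    for plus in range(N + 1):
        average = (2 * plus - N) / N  # average when `plus` votes are +1
        if average > theta / 2:
            total += comb(N, plus) * q**plus * (1 - q) ** (N - plus)
    return total


def chernoff_bound(N, theta):
    """The bound e^{-N theta^2 / 8} used for the term T_2."""
    return exp(-N * theta**2 / 8)


for N in (10, 50, 200):
    for theta in (0.2, 0.5, 1.0):
        exact = tail_probability(N, mu=0.0, theta=theta)
        print(f"N={N:3d} theta={theta:.1f} exact={exact:.3e} bound={chernoff_bound(N, theta):.3e}")
```

The exact tail is always below the bound, usually by a wide margin, and making `mu` more negative only shrinks the exact tail further.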

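Now recall $T_1$

$$T_1 = \Pr_{x \sim D, g \sim Q_f}[y g(x) \leq \theta/2] = \mathbb{E}_{g \sim Q_f}\left[\Pr_D[y g(x) \leq \theta/2]\right]$$

For $T_1$, we don’t want to bound it absolutely like we did for $T_2$, because there is nothing stopping the classifier $f$ from being a bad classifier and having lots of error. Rather, we want to bound it in terms of the probability that $y f(x) \leq \theta$. We’ll do this in two steps. In step 1, we’ll go from $\Pr_D$ of the $g$’s to $\Pr_S$ of the $g$’s.

**Step 1:** For any fixed $g, \theta$, if we take a sample $S$ of size $m$, then consider the event in which the sample probability deviates from the true distribution by some value $\varepsilon_N$, i.e. the event

$$\Pr_D[y g(x) \leq \theta/2] > \Pr_{x \sim S}[y g(x) \leq \theta/2] + \varepsilon_N$$

The claim is this happens with probability at most $e^{-2m\varepsilon_N^2}$. This is again the Chernoff bound in disguise, because the expected value of $\Pr_S$ is $\Pr_D$, and the probability over $S$ is an average of random variables (it’s a slightly different form of the Chernoff bound; see this post for more). From now on we’ll drop the $x \sim S$ when writing $\Pr_S$.

The bound above holds true for any fixed $g, \theta$, but we want a bound over all $g$ and $\theta$. To do that we use the union bound. Note that there are only $(N+1)$ possible choices for a nonnegative $\theta$ because $g(x)$ is a sum of $N$ values each of which is either $\pm 1$. And there are only $|C_N| \leq |H|^N$ possibilities for $g(x)$. So the union bound says the above event will occur with probability at most $(N+1) |H|^N e^{-2m\varepsilon_N^2}$.

If we want the event to occur with probability at most $\delta_N$, we can judiciously pick

$$\varepsilon_N = \sqrt{\frac{1}{2m} \log\left(\frac{(N+1) |H|^N}{\delta_N}\right)}$$

And since the bound holds in general, we can take expectation with respect to $Q_f$ and nothing changes. This means that for any $\delta_N$, our chosen $\varepsilon_N$ ensures that the following is true with probability at least $1 - \delta_N$:

$$\Pr_{D, g \sim Q_f}[y g(x) \leq \theta/2] \leq \Pr_{S, g \sim Q_f}[y g(x) \leq \theta/2] + \varepsilon_N$$

Now for step 2, we bound the probability that $y g(x) \leq \theta/2$ on a sample to the probability that $y f(x) \leq \theta$ on a sample.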

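**Step 2:** The first claim is that

$$\Pr_{S, g \sim Q_f}[y g(x) \leq \theta/2] \leq \Pr_S[y f(x) \leq \theta] + \mathbb{E}_S\left[\Pr_{g \sim Q_f}[y g(x) \leq \theta/2 \mid y f(x) > \theta]\right]$$

What we did was break up the LHS into two “and”s, when $y f(x) \leq \theta$ and $y f(x) > \theta$ (this was still an equality). Then we loosened the first term to $\Pr_S[y f(x) \leq \theta]$ since that is only more likely than both $y g(x) \leq \theta/2$ and $y f(x) \leq \theta$. Then we loosened the second term again using the fact that a probability of an “and” is bounded by the conditional probability.

Now we have the probability of $y g(x) \leq \theta/2$ bounded by the probability that $y f(x) \leq \theta$ plus some stuff. We just need to bound the “plus some stuff” absolutely and then we’ll be done. The argument is the same as our previous use of the Chernoff bound: we assume $y f(x) > \theta$, and yet $y g(x) \leq \theta/2$. So the deviation of $y g(x)$ from its expectation is large, and the probability that happens is exponentially small in the amount of deviation. The bound you get is

$$\Pr_{g \sim Q_f}[y g(x) \leq \theta/2 \mid y f(x) > \theta] \leq e^{-N\theta^2/8}$$

And again we use the union bound to ensure the failure of this bound for any $N$ will be very small. Specifically, if we want the total failure probability to be at most $\delta$, then we need to pick some $\delta_N$’s so that $\delta = \sum_{N \geq 1} \delta_N$. Choosing $\delta_N = \frac{\delta}{N(N+1)}$ works.

Putting everything together, we get that with probability at least $1 - \delta$ for every $\theta$ and every $N$, this bound on the failure probability of $f(x)$:

$$\Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + 2e^{-N\theta^2/8} + \sqrt{\frac{1}{2m} \log\left(\frac{N(N+1)^2 |H|^N}{\delta}\right)}$$

This claim is true for *every* $N$, so we can pick $N$ that minimizes it. Doing a little bit of behind-the-scenes calculus that is left as an exercise to the reader, a tight choice of $N$ is $(4/\theta^2) \log(m / \log |H|)$. And this gives the statement of the theorem.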
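To get a feel for the sizes involved, here is a rough Python sketch that evaluates the extra slack added on top of the training margin term. It assumes the final bound has the form $2e^{-N\theta^2/8} + \sqrt{\frac{1}{2m}\log\left(N(N+1)^2|H|^N/\delta\right)}$ with $N$ chosen as $\lceil (4/\theta^2)\log(m/\log|H|) \rceil$; the function name and the sample parameter values are mine, picked only for illustration.

```python
from math import ceil, exp, log, sqrt


def margin_bound_slack(m, H_size, theta, delta):
    """Slack term 2 e^{-N theta^2 / 8} + sqrt(log(N (N+1)^2 |H|^N / delta) / (2m))."""
    # Assumed choice of N; requires m > log|H| so that the logarithm is positive.
    N = max(1, ceil((4 / theta**2) * log(m / log(H_size))))
    # Work with logarithms so that |H|^N never has to be computed directly.
    log_term = log(N) + 2 * log(N + 1) + N * log(H_size) - log(delta)
    return 2 * exp(-N * theta**2 / 8) + sqrt(log_term / (2 * m))


for m in (10**4, 10**6, 10**8):
    print(m, margin_bound_slack(m, H_size=100, theta=0.5, delta=0.01))
```

The slack only becomes small once the number of examples $m$ is large compared to $\log |H| / \theta^2$, which is one way to see why large margins on the training data matter so much.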

We proved this for finite hypothesis classes, and if you know what VC-dimension is, you’ll know that it’s a central tool for reasoning about the complexity of infinite hypothesis classes. An analogous theorem can be proved in terms of the VC dimension. In that case, calling $d$ the VC-dimension of the weak learner’s output hypothesis class, the bound is

$$\Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O\left(\frac{1}{\sqrt{m}} \sqrt{\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)}\right)$$

How can we interpret these bounds with so many parameters floating around? That’s where asymptotic notation comes in handy. If we fix $\theta$ and $\delta$, then the big-O part of the theorem simplifies to $\sqrt{(\log |H| \cdot \log m) / m}$, which is easier to think about since $(\log m)/m$ goes to zero very fast.

Now the theorem we just proved was about any weighted majority function. The question still remains: why is AdaBoost good? That follows from another theorem, which we’ll state and leave as an exercise (it essentially follows by unwrapping the definition of the AdaBoost algorithm from last time).

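**Theorem:** Suppose that during AdaBoost the weak learners produce hypotheses with training errors $\varepsilon_1, \dots, \varepsilon_T$. Then for any $\theta$,

$$\Pr_{(x, y) \sim S}[y f(x) \leq \theta] \leq 2^T \prod_{t=1}^T \sqrt{\varepsilon_t^{(1-\theta)} (1 - \varepsilon_t)^{(1+\theta)}}$$

Let’s interpret this for some concrete numbers. Say that $\theta = 0$ and $\varepsilon_t$ is any fixed value $\varepsilon$ less than $1/2$. In this case the term inside the product becomes $\sqrt{\varepsilon(1-\varepsilon)} < 1/2$ and the whole bound tends exponentially quickly to zero in the number of rounds $T$. On the other hand, if we raise $\theta$ to about 1/3, then in order to maintain the LHS tending to zero we would need $\varepsilon < \frac{1}{4}(3 - \sqrt{5})$ which is about 20% error.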
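These numbers are easy to play with. The sketch below assumes the bound has the form $2^T \prod_t \sqrt{\varepsilon_t^{1-\theta}(1-\varepsilon_t)^{1+\theta}}$ and that every round has the same training error, so each round contributes a single multiplicative factor. The function names are mine.

```python
from math import sqrt


def per_round_factor(eps, theta):
    """The factor 2 * sqrt(eps^(1 - theta) * (1 - eps)^(1 + theta)) contributed by one round."""
    return 2 * sqrt(eps ** (1 - theta) * (1 - eps) ** (1 + theta))


def margin_fraction_bound(eps, theta, T):
    """Bound on the fraction of training examples with margin <= theta after T rounds,
    assuming every round has the same training error eps."""
    return per_round_factor(eps, theta) ** T


threshold = (3 - sqrt(5)) / 4  # roughly 0.19, the "about 20% error" above

print(per_round_factor(0.40, 0.0))          # below 1, so the bound decays with T
print(per_round_factor(0.15, 1 / 3))        # below 1
print(per_round_factor(0.25, 1 / 3))        # above 1, so the bound says nothing
print(per_round_factor(threshold, 1 / 3))   # numerically 1, the break-even point
print(margin_fraction_bound(0.15, 1 / 3, T=200))
```

When the per-round factor is below 1 the bound shrinks geometrically in the number of rounds, and once it exceeds 1 the bound is bigger than 1 and carries no information.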

If you’re interested in learning more about Boosting, there is an excellent book by Schapire and Freund (the inventors of boosting) called *Boosting: Foundations and Algorithms*. There they include a tighter analysis based on the idea of Rademacher complexity. The bound I presented in this post is nice because the proof doesn’t require any machinery past basic probability, but if you want to reach the cutting edge of knowledge about boosting you need to invest in the technical stuff.

Until next time!

]]>

The main intuition behind Reed-Solomon codes (and basically all the historically major codes) is

Error correction is about adding redundancy, and polynomials are a really efficient way to do that.

Here’s an example of what we’ll do in the post. Say you have a space probe flying past Mars taking photographs like this one

Unfortunately you know that if you send the images back to Earth via radio waves, the signal will get corrupted by cosmic something-or-other and you’ll end up with an image like this.

How can you recover from errors like this? You could do something like repeat each pixel twice in the message so that if one is corrupted the other will get through. But still, every now and then both pixels in a row will be corrupted and it’s twice as inefficient.

The idea of error-correcting codes is to find a way to encode a message so that it adds a lot of redundancy without adding too much extra information to the message. The name of the game is to optimize the tradeoff between how much redundancy you get and how much longer the message needs to be, while still being able to efficiently decode the encoded message.

A solid technique turns out to be: use polynomials. Even though you’d think polynomials are too simple (we teach them starting in the 7th grade these days!) they turn out to have remarkable properties. The most important of which is:

if you give me a bunch of points in the plane with different $x$ coordinates, they *uniquely* define a polynomial of a certain degree.

This fact is called *polynomial interpolation*. We used it in a previous post to share secrets, if you’re interested.
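As a small illustration of interpolation, here is a sketch of Lagrange interpolation with all arithmetic done modulo a prime, using plain Python integers. The function names are mine and the coefficient lists are stored lowest degree first; this is a toy version for intuition, not the finite field library used later in the post.

```python
def poly_mul(a, b, p):
    """Multiply two coefficient lists (lowest degree first) modulo the prime p."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out


def interpolate(points, p):
    """Coefficients of the unique polynomial of degree < len(points) passing through
    `points` (which must have distinct x values), working modulo the prime p."""
    result = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                basis = poly_mul(basis, [(-xj) % p, 1], p)  # multiply by (x - xj)
                denom = denom * (xi - xj) % p
        scale = yi * pow(denom, -1, p) % p  # modular inverse, needs Python 3.8+
        for d, c in enumerate(basis):
            result[d] = (result[d] + scale * c) % p
    return result


# Three points on the polynomial 2 + 3x + 2x^2, taken modulo 7.
print(interpolate([(0, 2), (1, 0), (2, 2)], 7))  # [2, 3, 2]
```

Three points with distinct $x$ coordinates pin down a polynomial of degree at most two, which is why the coefficients come back exactly.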

What makes polynomials great for error correction is that you can take a fixed polynomial (think, the message) and “encode” it as a list of points on that polynomial. If you include enough, then you can get back the original polynomial from the points alone. And the best part, for each two additional points you include above the minimum, you get resilience to one additional error *no matter where it happens in the message*. Another way to say this is, even if some of the points in your encoded message are wrong (the numbers are modified by an adversary or random noise), as long as there aren’t too many errors there is an algorithm that can correct the errors.

That’s what makes polynomials so much better than the naive idea of repeating every pixel twice: once you allow for three errors you run the risk of losing a pixel, but you had to double your communication costs. With a polynomial-based approach you’d only need to store around six extra pixels worth of data to get resilience to three errors that can happen anywhere. What a bargain!

Here’s the official theorem about Reed-Solomon codes:

**Theorem:** There is an efficient algorithm which, when given points with distinct has the following property. If there is a polynomial of degree that passes through at least of the given points, then the algorithm will output the polynomial.

So let’s implement the encoder, decoder, and turn the theorem into code!

The way you write a message of length $k$ as a polynomial is easy. Pick a large prime integer $p$ and from now on we’ll do all our arithmetic modulo $p$. Then encode each character $c_0, c_1, \dots, c_{k-1}$ in the message as an integer between 0 and $p-1$ (this is why $p$ needs to be large enough), and the polynomial representing the message is

$$m(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_{k-1} x^{k-1}$$

If the message has length $k$ then the polynomial will have degree $k-1$.

Now to encode the message we just pick a bunch of $x$ values and plug them into the polynomial, and record the (input, output) pairs as the encoded message. If we want to make things simple we can just require that you always pick the $x$ values $0, 1, \dots, n-1$ for some choice of $n \leq p$.

A quick skippable side-note: we need $p$ to be prime so that our arithmetic happens in a field. Otherwise, we won’t necessarily get unique decoded messages.

Back when we discussed elliptic curve cryptography (ironically sharing an acronym with error correcting codes), we actually wrote a little library that lets us seamlessly represent polynomials with “modular arithmetic coefficients” in Python, which in math jargon is a “finite field.” Rather than reinvent the wheel we’ll just use that code as a black box (full source in the Github repo). Here are some examples of using it.

    >>> from finitefield.finitefield import FiniteField
    >>> F13 = FiniteField(p=13)
    >>> a = F13(7)
    >>> a+9
    3 (mod 13)
    >>> a*a
    10 (mod 13)
    >>> 1/a
    2 (mod 13)

A programming aside: once you construct an instance of your finite field, all arithmetic operations involving instances of that type will automatically lift integers to the appropriate type. Now to make some polynomials:

    >>> from finitefield.polynomial import polynomialsOver
    >>> F = FiniteField(p=13)
    >>> P = polynomialsOver(F)
    >>> g = P([1,3,5])
    >>> g
    1 + 3 t^1 + 5 t^2
    >>> g*g
    1 + 6 t^1 + 6 t^2 + 4 t^3 + 12 t^4
    >>> g(100)
    4 (mod 13)

Now to fix an encoding/decoding scheme we’ll call $k$ the size of the unencoded message, $n$ the size of the encoded message, and $p$ the modulus, and we’ll fix these programmatically when the encoder and decoder are defined so we don’t have to keep carrying these data around.

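    from finitefield.finitefield import FiniteField
    from finitefield.polynomial import polynomialsOver

    def makeEncoderDecoder(n, k, p):
        Fp = FiniteField(p)          # the integers modulo the prime p
        Poly = polynomialsOver(Fp)   # polynomials with coefficients in Fp

        def encode(message):
            ...

        def decode(encodedMessage):
            ...

        return encode, decode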

Encode is the easier of the two.

    def encode(message):
        thePoly = Poly(message)
        # evaluate the message polynomial at 0, 1, ..., n-1
        return [(Fp(i), thePoly(Fp(i))) for i in range(n)]

Technically we could remove the leading `Fp(i)` from each tuple, since the decoder algorithm can assume we’re using the first $n$ integers in order. But we’ll leave it in and define the decode function more generically.

After we define how the decoder should work in theory we’ll run through a simple example step by step. Now on to the decoder.

There are a lot of different decoding algorithms for various error correcting codes. The one we’ll implement is called the Berlekamp-Welch algorithm, but before we get to it we should mention a much simpler algorithm that will work when there are only a few errors.

To remind us of notation, call $k$ the length of the message, so that $k-1$ is the degree of the polynomial we used to encode it. And $n$ is the number of points we used in the encoding. Call the encoded message $r$ as it’s received (as a list of points, possibly with errors).

In the simple method what you do is just randomly pick $k$ points from $r$, do polynomial interpolation on the chosen points to get some polynomial $g$, and see if $g$ agrees with most of the points in $r$. If there really are few errors, then there’s a good chance the randomly chosen points won’t have any errors in them and you’ll win. If you get unlucky and pick some points with errors, then the $g$ you get won’t agree with most of $r$ and you can throw it out and try again. If you get *really* unlucky and a bad $g$ does agree with most of $r$, then you just run this procedure a few hundred times and take the $g$ you get most often. But again, this only works with a small number of errors and while it could be good enough for many applications, don’t bet your first-born child’s life on it working. Or even your favorite pencil, for that matter. We’re going to implement Berlekamp-Welch so you can win someone *else’s* favorite pencil. You’re welcome.

**Exercise: **Implement the simple decoding algorithm and test it on some data.

Suppose we are guaranteed that there are exactly $e$ errors in our received message $r = [(a_1, b_1), \dots, (a_n, b_n)]$. Call the polynomial that represents the original message $m$. In other words, we have that $m(a_i) = b_i$ for all but $e$ of the points in $r$.

There are two key ingredients in the algorithm. The first is called the *error locator polynomial*. We’ll call this polynomial $E(x)$, and it’s just defined by being zero wherever the errors occurred. In symbols, $E(a_i) = 0$ whenever $m(a_i) \neq b_i$. If we knew where the errors occurred, we could write out $E(x)$ explicitly as a product of terms like $(x - a_i)$. And if we knew $E$ we’d also be done, because it would tell us where the errors were and we could do interpolation on all the non-error points in $r$.

So we’re going to have to study $E$ indirectly and use it to get $m$. One nice property of $E$ is the following

$$b_i E(a_i) = m(a_i) E(a_i)$$

which is true for *every* pair $(a_i, b_i) \in r$. Indeed, by definition when $m(a_i) \neq b_i$ then $E(a_i) = 0$ so both sides are zero. Now we can use a technique called *linearization*. It goes like this. The product $m(x) E(x)$, i.e. the right-hand-side of the above equation, is a polynomial, say $Q(x)$, of larger degree ($e + k - 1$). We get the equation for all $i$:

$$b_i E(a_i) = Q(a_i)$$

Now $E$, $Q$, and $m$ are all unknown, but it turns out that we can actually find $E$ and $Q$ efficiently. Or rather, we can’t guarantee we’ll find $E$ and $Q$ *exactly*, instead we’ll find two polynomials that have the same quotient as $Q(x)/E(x) = m(x)$. Here’s how that works.

Say we wrote out $E(x)$ as a generic polynomial of degree $e$ and $Q(x)$ as a generic polynomial of degree $e + k - 1$. So their coefficients are unspecified variables. Now we can plug in all the points $(a_i, b_i)$ to the equations $b_i E(a_i) = Q(a_i)$, and this will form a *linear* system of $2e + k + 1$ unknowns ($e + 1$ unknowns come from $E(x)$ and $e + k$ come from $Q(x)$).

Now we know that this system has *a* good solution, because if we take the true error locator polynomial and $Q = E \cdot m$ with the true $m$ we win. The worry is that we’ll solve this system and get two different polynomials $Q', E'$ whose quotient will be something crazy and unrelated to $m$. But as it turns out this will never happen, and any solution will give the quotient $m$. Here’s a proof you can skip if you hate proofs.

*Proof.* Say you have two pairs of solutions to the system, $(Q_1, E_1)$ and $(Q_2, E_2)$, and you want to show that $Q_1/E_1 = Q_2/E_2$. Well, they might not be divisible, but we can multiply the previous equation through to get $Q_1 E_2 = Q_2 E_1$. Now we show two polynomials are equal in the same way as always: subtract and show there are too many roots. Define $R(x) = Q_1(x) E_2(x) - Q_2(x) E_1(x)$. The claim is that $R(x)$ has $n$ roots, one for every point $(a_i, b_i)$. Indeed,

$$R(a_i) = Q_1(a_i) E_2(a_i) - Q_2(a_i) E_1(a_i) = b_i E_1(a_i) E_2(a_i) - b_i E_2(a_i) E_1(a_i) = 0$$

But the degree of $R(x)$ is at most $2e + k - 1$, which is less than $n$ by the assumption that the number of errors is small enough (that is, $2e + k - 1 < n$). So $R(x)$ has too many roots and must be the zero polynomial, and the two quotients are equal.

So the core Python routine is just two steps: solve the linear system, and then divide two polynomials. However, it turns out that no Python module has any decent support for solving linear systems of equations over finite fields. Luckily, I wrote a linear solver way back when and so we’ll adapt it to our purposes. I’ll leave out the gory details of the solver itself, but you can see them in the source for this post. Here is the code that sets up the system

    # Both functions live inside makeEncoderDecoder, so n, k, Poly are in scope.
    # maxE (the largest number of errors to try) and someSolution (the linear
    # solver) are not shown here; see the source for this post.
    def solveSystem(encodedMessage):
        for e in range(maxE, 0, -1):
            ENumVars = e+1
            QNumVars = e+k

            def row(i, a, b):
                return ([b * a**j for j in range(ENumVars)] +
                        [-1 * a**j for j in range(QNumVars)] +
                        [0])  # the "extended" part of the linear system

            system = ([row(i, a, b) for (i, (a,b)) in enumerate(encodedMessage)] +
                      [[0] * (ENumVars-1) + [1] + [0] * (QNumVars) + [1]])
                      # ensure coefficient of x^e in E(x) is 1

            solution = someSolution(system, freeVariableValue=1)
            E = Poly([solution[j] for j in range(e + 1)])
            Q = Poly([solution[j] for j in range(e + 1, len(solution))])

            P, remainder = Q.__divmod__(E)
            if remainder == 0:
                return Q, E

        raise Exception("found no divisors!")

    def decode(encodedMessage):
        Q,E = solveSystem(encodedMessage)

        P, remainder = Q.__divmod__(E)
        if remainder != 0:
            raise Exception("Q is not divisible by E!")

        return P.coefficients

Now let’s go through an extended example with small numbers. Let’s work modulo 7 and say that our message is

2, 3, 2 (mod 7)

In particular, $k = 3$ is the length of the message. We’ll encode it as a polynomial in the way we described:

$$m(x) = 2 + 3x + 2x^2$$

If we pick $n = 5$, then we will encode the message as a sequence of five points on $m(x)$, namely $m(0)$ through $m(4)$.

[[0, 2], [1, 0], [2, 2], [3, 1], [4, 4]] (mod 7)
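You can reproduce these five points without the finite field library. The following sketch uses plain integers and Horner’s rule; the function signature `encode(message, n, p)` is my own variation for this illustration and differs from the closure-based `encode` defined earlier.

```python
def encode(message, n, p):
    """Evaluate the message polynomial (coefficients lowest degree first)
    at x = 0, 1, ..., n-1, with all arithmetic modulo the prime p."""
    def evaluate(x):
        value = 0
        for coefficient in reversed(message):  # Horner's rule
            value = (value * x + coefficient) % p
        return value

    return [(x, evaluate(x)) for x in range(n)]


print(encode([2, 3, 2], n=5, p=7))  # [(0, 2), (1, 0), (2, 2), (3, 1), (4, 4)]
```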

Now let’s add a single error. First remember that our theoretical guarantee says that we can correct any number of errors up to , which in this case is , so we can definitely correct one error. We’ll add 1 to the third point, giving the received corrupted message as

[[0, 2], [1, 0], [2, 3], [3, 1], [4, 4]] (mod 7)

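Now we set up the system of equations $b_i E(a_i) = Q(a_i)$ for all $(a_i, b_i)$ above. Rewriting the equations as $b_i E(a_i) - Q(a_i) = 0$, and adding as the last equation the constraint that the coefficient of $x^e$ in $E(x)$ is 1 (here $e_1 = 1$), we get the following matrix. The columns represent the variables, with the last column being the right-hand-side of the equality as is the standard for Gaussian elimination.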

    # e0 e1 q0 q1 q2 q3
    [
      [2, 0, 6, 0, 0, 0, 0],
      [0, 0, 6, 6, 6, 6, 0],
      [3, 6, 6, 5, 3, 6, 0],
      [1, 3, 6, 4, 5, 1, 0],
      [4, 2, 6, 3, 5, 6, 0],
      [0, 1, 0, 0, 0, 0, 1],
    ]

Then we do row-reduction to get

    [
      [1, 0, 0, 0, 0, 0, 5],
      [0, 1, 0, 0, 0, 0, 1],
      [0, 0, 1, 0, 0, 0, 3],
      [0, 0, 0, 1, 0, 0, 3],
      [0, 0, 0, 0, 1, 0, 6],
      [0, 0, 0, 0, 0, 1, 2]
    ]

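And reading off the solution gives $E(x) = 5 + x$ and $Q(x) = 3 + 3x + 6x^2 + 2x^3$. In this small example $E(x) = x + 5 = x - 2 \pmod 7$ vanishes at $x = 2$, exactly where we introduced the error, so it happens to be the error locator polynomial. In general, though, the $E(x)$ you read off need not be the true error locator; what is guaranteed is the quotient. Here the quotient of the two polynomials is exactly $m(x) = 2 + 3x + 2x^2$, which gives back the original message.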

There is one catch here: how does one determine the value of $e$ to use in setting up the system of linear equations? It turns out that an upper bound on $e$ will work just fine, so long as the upper bound you use agrees with the theoretical maximum number of errors allowed (see the Singleton bound from last time). The effect of doing this is that the linear system ends up with some number of free variables that you can set to arbitrary values, and these will correspond to additional shared roots of $E(x)$ and $Q(x)$ that cancel out upon dividing.
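If you want to replay the worked example without the finite field library, here is a self-contained sketch of the whole decoder using plain integers modulo $p$. It is not the implementation from the repository: the helper names are mine, it always uses the upper bound $e = \lfloor (n-k)/2 \rfloor$ (an assumption chosen so that $2e + k - 1 < n$), and it sets free variables to 0 rather than 1, relying on the fact that any solution with monic $E$ gives the same quotient.

```python
def solve_mod_p(rows, p):
    """Gauss-Jordan elimination on an augmented matrix over the integers mod p.
    Returns one solution (free variables set to 0), or None if inconsistent."""
    rows = [[v % p for v in row] for row in rows]
    num_vars = len(rows[0]) - 1
    pivots, r = [], 0
    for c in range(num_vars):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pivot is None:
            continue  # no pivot in this column: it is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        inv = pow(rows[r][c], -1, p)  # modular inverse, needs Python 3.8+
        rows[r] = [v * inv % p for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                f = rows[i][c]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    if any(row[-1] for row in rows[r:]):
        return None  # a row reads 0 = nonzero
    solution = [0] * num_vars
    for i, c in enumerate(pivots):
        solution[c] = rows[i][-1]
    return solution


def poly_divmod(num, den, p):
    """Divide coefficient lists (lowest degree first) mod p; den[-1] must be nonzero."""
    num = num[:]
    quotient = [0] * (len(num) - len(den) + 1)
    inv = pow(den[-1], -1, p)
    for shift in range(len(num) - len(den), -1, -1):
        factor = num[shift + len(den) - 1] * inv % p
        quotient[shift] = factor
        for i, d in enumerate(den):
            num[shift + i] = (num[shift + i] - factor * d) % p
    return quotient, num  # num now holds the remainder


def berlekamp_welch(received, k, p):
    """Recover the k message coefficients from a list of (a, b) points mod p."""
    n = len(received)
    e = (n - k) // 2  # assumed upper bound on the number of errors
    rows = [[b * pow(a, j, p) for j in range(e + 1)] +
            [-pow(a, j, p) for j in range(e + k)] + [0]
            for a, b in received]
    rows.append([0] * e + [1] + [0] * (e + k) + [1])  # force E to be monic of degree e
    solution = solve_mod_p(rows, p)
    if solution is None:
        raise ValueError("too many errors: the linear system has no solution")
    E, Q = solution[:e + 1], solution[e + 1:]
    message, remainder = poly_divmod(Q, E, p)
    if any(remainder):
        raise ValueError("too many errors: Q is not divisible by E")
    return message


corrupted = [(0, 2), (1, 0), (2, 3), (3, 1), (4, 4)]
print(berlekamp_welch(corrupted, k=3, p=7))  # [2, 3, 2]
```

Because the argument above shows every solution of the system has the same quotient, the choice of value for the free variables doesn’t matter here, and the uncorrupted message decodes the same way.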

Now it’s time for a sad fact. I tried running Berlekamp-Welch on an encoded version of the following tiny image:

And it didn’t finish after running all night.

Berlekamp-Welch is a slow algorithm for decoding Reed-Solomon codes because it requires one to solve a large system of equations. There’s at least one equation for each pixel in a black and white image! To get around this one typically encodes blocks of pixels together into one message character (since is larger than there is lots of space), and apparently one can balance it to minimize the number of equations. And finally, a nontrivial inefficiency comes from our implementation of everything in Python without optimizations. If we rewrote everything in C++ or Go and fixed the prime modulus, we would likely see reasonable running times. There are also asymptotically *much* faster methods based on the fast Fourier transform, and in the future we’ll try implementing some of these. For the dedicated reader, these are all good follow-up projects.

For now we’ll just demonstrate that it works by running it on a larger sample of text, the introductory paragraphs of To Kill a Mockingbird:

    def tkamTest():
        message = '''When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow. When it healed, and Jem's fears of never being able to play football were assuaged, he was seldom self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He couldn't have cared less, so long as he could pass and punt.'''

        k = len(message)
        n = len(message) * 2
        p = 2087
        integerMessage = [ord(x) for x in message]

        # this test expects makeEncoderDecoder to also return solveSystem
        enc, dec, solveSystem = makeEncoderDecoder(n, k, p)
        print("encoding...")
        encoded = enc(integerMessage)

        e = int(k/2)
        print("corrupting...")
        # corrupt is a helper that is not shown in this post
        corrupted = corrupt(encoded[:], e, 0, p)

        print("decoding...")
        Q,E = solveSystem(corrupted)
        P, remainder = (Q.__divmod__(E))

        recovered = ''.join([chr(x) for x in P.coefficients])
        print(recovered)

Running this with unix `time` produces the following:

    encoding...
    corrupting...
    decoding...
    When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow. When it healed, and Jem's fears of never being able to play football were assuaged, he was seldom self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He couldn't have cared less, so long as he could pass and punt.

    real 82m9.813s
    user 81m18.891s
    sys 0m27.404s

So it finishes in “only” about an hour and twenty minutes.

In any case, the decoding algorithm is an interesting one. In future posts we’ll explore more efficient algorithms and faster implementations.

Until then!

]]>

The way Patreon works is that you, dear reader, sign up for a monthly donation of any amount you please (as little as $1/month). There are at least three benefits for you doing this:

- You show your support of mathematics and programming. You provide documented evidence that you’re a good person. You help to ensure that Math ∩ Programming stays a high quality resource for everyone. You feel good.
- There are aggregate milestone goals that make Math ∩ Programming a better place. The first, for example, is that when Math ∩ Programming reaches $200/month I will **permanently remove ads**. See below for a detailed description of how ads currently support the blog (spoiler: it’s not much).

- There are individual benefits if you decide to pledge $5 or more per month. This includes a monthly Google hangout I’ll host, and a **sneak peek for every new post** and private discussion with me. I’m still thinking about how exactly I’ll implement the member preview, but my current plan is to have a private subreddit. But even if you don’t use reddit there’ll be another way to get access. The highest tier of rewards involves physical merchandise.

I’m excited about Patreon because it seems like an excellent platform. For example, the popular Numberphile YouTube channel makes almost $3k USD per month from patrons. To put that into perspective it’s $36k per year, and my graduate student stipend is only about $17k per year. Assuming Math ∩ Programming could get even *half* the success of Numberphile, I could have potentially funded my entire graduate work just from blogging!

And channels like Numberphile are purely for entertainment’s sake. Math ∩ Programming has the additional benefit of providing working code for algorithms that are directly applicable to business. So if you have ever used the code at Math ∩ Programming as the start of a project or feature, or even if you just have fun reading about math and seeing cool applications, consider becoming a patron to say thank you!

Here are some other minor funding changes.

- One-time donations are now preferred through Square, due to the lower (1.9%) transaction fee. Square requires a debit card, so if you don’t have one or don’t want to use one you can still use PayPal to donate.
- People rarely buy merchandise. I made a total of $147 on merchandise since 2013. So I’m going to stop doing that for now. Maybe I just have to come up with better merchandise (comments are welcome).
- When I link to textbooks on Amazon I’m going to use Amazon Affiliate. Amazon pays me a little bit of money if you use the link and then end up buying something.

As of August 1, 2015 I have made a total of exactly $2,847.55 USD from ads and donations. About $320 of that is from 2015. It works out to about $70 per month since I first asked for money in 2013 and set up ads. It’s a nice little chunk of change, but nothing to get too excited about. Here is a chart of my ad income:

Donations have provided the rest of the funding, but donations appear to follow a Poisson distribution and the median monthly revenue is zero. By far most of the donations were in the few months after I first asked for donations.

I started my blog with pretty low expectations: I learned a lot of cool things and I wanted to share them, while understanding them better by writing code and filling in proof details. That’s still the core dream, and it will always be the core of Math ∩ Programming. So while it’s pretty cool that I can make any money at all from my blog, and I’m interested to see if I can grow it into a viable side business, you can rest assured that Math ∩ Programming will stay true to its core.

]]>

What we really want to do is talk about the *inherent shape of data*. Homology allows us to compute some qualitative features of a given shape, i.e., find and count the number of connected components of a given shape, or the number of “2-dimensional holes” it has. This is great, but data doesn’t come in a form suitable for computing homology. Though they may have *originated* from some underlying process that follows nice rules, data points are just floating around in space with no obvious connection between them.

Here is a cool example of Thom Yorke, the lead singer of the band Radiohead, whose face was scanned with a laser scanner for their music video “House of Cards.”

Given a point cloud such as the one above, our long term goal (we’re just getting started in this post) is to algorithmically discover what the characteristic topological features are in the data. Since homology is pretty coarse, we might detect the fact that the point cloud above looks like a hollow sphere with some holes in it corresponding to nostrils, ears, and the like. The hope is that if the data set isn’t too corrupted by noise, then it’s a good approximation to the underlying space it is sampled from. By computing the topological features of a point cloud we can understand the process that generated it, and Science can proceed.

But it’s not always as simple as Thom Yorke’s face. It turns out the producers of the music video had to actually *degrade* the data to get what you see above, because their lasers were too precise and didn’t look artistic enough! But you can imagine that if your laser is mounted on a car on a bumpy road, or tracking some object in the sky, or your data comes from acoustic waves traveling through earth, you’re bound to get noise. Or more realistically, if your data comes from thousands of stock market prices then the process *generating* the data is super mysterious. It changes over time, it may not follow any discernible pattern (though speculators may hope it does), and you can’t hope to visualize the entire dataset in any useful way.

But with persistent homology, so the claim goes, you’d get a good qualitative understanding of the dataset. Your results would be resistant to noise inherent in the data. They also wouldn’t be sensitive to the details of your data cleaning process. And with a dash of ingenuity, you can come up with a reasonable mathematical model of the underlying generative process. You could use that model to design algorithms, make big bucks, discover new drugs, recognize pictures of cats, or whatever tickles your fancy.

But our first problem is to resolve the input data type error. We want to use homology to describe data, but our data is a point cloud and homology operates on simplicial complexes. In this post we’ll see two ways one can do this, and see how they’re related.

Let’s start with the Čech complex. Given a point set $X$ in some metric space and a number $\varepsilon > 0$, the *Čech complex* $C_\varepsilon$ is the simplicial complex whose simplices are formed as follows. For each subset $S \subset X$ of points, form an $\varepsilon$-ball around each point in $S$, and include $S$ as a simplex (of dimension $|S| - 1$) if there is a common point contained in all of the balls in $S$. This structure obviously satisfies the definition of a simplicial complex: any sub-subset of a simplex will also be a simplex. Here is an example of the epsilon balls.

A topologist will have a minor protest here: the simplicial complex is supposed to resemble the structure inherent in the underlying points, but how do we know that this abstract simplicial complex (which is really hard to visualize!) resembles the topological space we used to make it? That is, $X$ was sitting in some metric space, and the union of these epsilon-balls forms some topological space $X(\varepsilon)$ that is close in structure to $X$. But is the Čech complex $C_\varepsilon$ close to $X(\varepsilon)$? Do they have the same topological structure? It’s not a trivial theorem to prove, but it turns out to be true.

**The Nerve Theorem:** The homotopy types of $X(\varepsilon)$ and $C_\varepsilon$ are the same.

We won’t remind the readers about homotopy theory, but suffice it to say that when two topological spaces have the same homotopy type, then homology can’t distinguish them. In other words, if homotopy type is too coarse a discriminator for our dataset, then persistent homology will fail us for sure.

So this theorem is a good sanity check. If we want to learn about our point cloud, we can pick an $\varepsilon$ and study the topology of the corresponding Čech complex $C_\varepsilon$. The reason this is called the “Nerve Theorem” is that one can generalize it to an arbitrary family of convex sets. Given some family $F$ of convex sets, the *nerve* is the complex obtained by adding simplices for mutually overlapping subfamilies in the same way. The nerve theorem is actually more general: it says that with sufficient conditions on the family $F$ being “nice,” the resulting Čech complex has the same topological structure as the union of the sets in $F$.

The problem is that Čech complexes are tough to compute. To tell whether there are any 10-simplices (without additional knowledge) you have to inspect all subsets of size 11, since a 10-simplex has 11 vertices. In general computing the entire complex requires exponential time in the size of $X$, which is extremely inefficient. So we need a different kind of complex, or at least a different representation to compensate.
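To get a feel for the blow-up, here is a quick count using only Python’s standard library. It is just a sketch: the function names are made up for this illustration, and it counts candidate vertex sets rather than computing any complex.

```python
from math import comb


def candidate_simplices(num_points, dimension):
    # A simplex of dimension d has d + 1 vertices, so the candidates
    # are the subsets of size d + 1.
    return comb(num_points, dimension + 1)


def total_candidates(num_points):
    # Every nonempty subset of the points is a candidate simplex.
    return 2 ** num_points - 1


if __name__ == "__main__":
    # Number of candidate edges, triangles, and 10-simplices on 100 points.
    for d in (1, 2, 10):
        print(d, candidate_simplices(100, d))
```

Even for a hundred points the count of candidate 10-simplices is astronomically large, which is the sense in which the full scan is hopeless.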

The *Vietoris-Rips complex* is essentially the same as the Čech complex, except instead of adding a $d$-simplex when there is a common point of intersection of *all* the $\varepsilon$-balls, we just do so when all the balls have *pairwise* intersections. We’ll denote the Vietoris-Rips complex with parameter $\varepsilon$ as $VR_\varepsilon$.

Here is an example to illustrate: if you give me three points that are the vertices of an equilateral triangle of side length 1, and I draw closed $(1/2)$-balls around each point, then they will have all three pairwise intersections (each pair touches at the midpoint of a side) but no common point of intersection.

So in this example the Vietoris-Rips complex is a graph with a 2-simplex, while the Čech complex is just a graph.
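The triangle example is easy to check numerically. The following is a small sketch under stated assumptions: the balls are closed, and the common-intersection test uses the fact that three balls of equal radius around the vertices of an equilateral triangle share a point exactly when the radius reaches the circumradius, which is the side length divided by $\sqrt{3}$. The function names are made up for this illustration.

```python
from itertools import combinations
from math import dist, sqrt

# Vertices of an equilateral triangle with side length 1.
TRIANGLE = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]


def pairwise_overlap(points, radius):
    # Closed balls of the given radius around x and y meet exactly when
    # the centers are at most 2 * radius apart (small tolerance for floats).
    return all(dist(x, y) <= 2 * radius + 1e-12
               for x, y in combinations(points, 2))


def common_point_equilateral(side, radius):
    # The smallest ball containing all three vertices is the circumscribed
    # one, so a common point exists only once radius >= side / sqrt(3).
    return radius >= side / sqrt(3)


if __name__ == "__main__":
    print(pairwise_overlap(TRIANGLE, 0.5))     # Vietoris-Rips condition
    print(common_point_equilateral(1.0, 0.5))  # Čech condition
```

With radius 1/2 the first check passes and the second fails, so the triangle gets filled in by the Vietoris-Rips complex but not by the Čech complex.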

One obvious question is: do we still get the benefits of the nerve theorem with Vietoris-Rips complexes? The answer is no, obviously, because the Vietoris-Rips complex and Čech complex in this triangle example have totally different topology! But everything’s not lost. What we can do instead is compare Vietoris-Rips and Čech complexes with related parameters.

**Theorem:** For all $\varepsilon > 0$, the following inclusions hold

$$C_\varepsilon \subset VR_\varepsilon \subset C_{2\varepsilon}$$

So if the Čech complexes for both $\varepsilon$ and $2\varepsilon$ are good approximations of the underlying data, then so is the Vietoris-Rips complex. In fact, you can make this chain of inclusions slightly tighter, and if you’re interested you can see Theorem 2.5 in this recent paper of de Silva and Ghrist.
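Both inclusions follow almost directly from the definitions. Here is a short sketch, writing $C_\varepsilon$ and $VR_\varepsilon$ for the Čech and Vietoris-Rips complexes with parameter $\varepsilon$, $B_\varepsilon(x)$ for the closed $\varepsilon$-ball around $x$, $d$ for the metric, and $S$ for a finite set of data points. This notation is introduced here only for the sketch, and the same argument works with open balls and strict inequalities.

```latex
\begin{aligned}
S \in C_\varepsilon
  &\implies \bigcap_{x \in S} B_\varepsilon(x) \neq \emptyset \\
  &\implies B_\varepsilon(x) \cap B_\varepsilon(y) \neq \emptyset
     \quad \text{for all } x, y \in S \\
  &\implies S \in VR_\varepsilon, \\[6pt]
S \in VR_\varepsilon
  &\implies d(x, y) \le 2\varepsilon
     \quad \text{for all } x, y \in S
     \quad \text{(triangle inequality)} \\
  &\implies x_0 \in \bigcap_{x \in S} B_{2\varepsilon}(x)
     \quad \text{for any fixed } x_0 \in S \\
  &\implies S \in C_{2\varepsilon}.
\end{aligned}
```

The second chain is the interesting one: once every pair of points is within $2\varepsilon$, any one of the points serves as the common point of the doubled balls.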

Now your first objection should be that computing a Vietoris-Rips complex *still* requires exponential time, because you have to scan all subsets for the possibility that they form a simplex. It’s true, but one nice thing about the Vietoris-Rips complex is that it can be represented implicitly as a graph. You just include an edge between two points if their corresponding balls overlap. Once we want to compute the actual simplices in the complex we have to scan for cliques in the graph, so that sucks. But it turns out that computing the graph is the first step in other more efficient methods for computing (or approximating) the VR complex.
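To make the clique scan concrete, here is a brute-force sketch. It assumes the graph is given as a list of index pairs over vertices labeled `0` through `num_vertices - 1`, which is a representation chosen just for this illustration, and it makes no attempt to be efficient.

```python
from itertools import combinations


def cliques_of_size(num_vertices, edges, size):
    """Return all vertex subsets of the given size that are pairwise adjacent.

    Each such subset of size k + 1 is a k-simplex of the complex
    represented by the graph.
    """
    adjacent = {frozenset(edge) for edge in edges}
    result = []
    # Brute force: test every subset of the requested size.
    for subset in combinations(range(num_vertices), size):
        if all(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
            result.append(subset)
    return result


if __name__ == "__main__":
    # A triangle on vertices 0, 1, 2 with an extra edge hanging off vertex 2.
    example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    print(cliques_of_size(4, example_edges, 3))
```

This is exactly the exponential scan the paragraph above complains about; the point of the graph representation is that you only pay this cost when you actually need the higher-dimensional simplices.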

Let’s go ahead and write a (trivial) program that computes the graph representation of the Vietoris-Rips complex of a given data set.

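```python
from itertools import combinations

import numpy
from numpy.linalg import norm


def naiveVR(points, epsilon):
    points = [numpy.array(x) for x in points]
    # Two epsilon-balls overlap when their centers are less than 2*epsilon apart.
    vrComplex = [(x, y) for (x, y) in combinations(points, 2) if norm(x - y) < 2 * epsilon]
    return numpy.array(vrComplex)
```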

Let’s try running it on a modestly large example: the first frame of the Radiohead music video. It’s got about 12,000 points in (x,y,z,intensity), and sadly it takes about twenty minutes. There are a couple of ways to make it more efficient. One is to use specially-crafted data structures for computing threshold queries (i.e., find all points within $\varepsilon$ of this point). But those are only useful for small thresholds, and we’re interested in sweeping over a range of thresholds. Another is to invoke approximations of the data structure which give rise to “approximate” Vietoris-Rips complexes.
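As a rough illustration of the first idea, here is a sketch of a threshold query built on a uniform grid. To be clear, this is one simple choice of data structure picked for the example, not the method used for the timings above. The function name `grid_vr_edges` is made up, and it returns index pairs rather than coordinate pairs.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def grid_vr_edges(points, epsilon):
    """Index pairs (i, j), i < j, whose points are less than 2*epsilon apart."""
    threshold = 2 * epsilon
    dim = len(points[0])
    # Bucket each point by the grid cell (of side `threshold`) containing it.
    cells = defaultdict(list)
    for index, p in enumerate(points):
        cells[tuple(floor(c / threshold) for c in p)].append(index)
    offsets = list(product((-1, 0, 1), repeat=dim))
    edges = set()
    for cell, members in cells.items():
        # Points closer than `threshold` lie in the same or an adjacent cell,
        # so only those cells need to be compared.
        for offset in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and dist(points[i], points[j]) < threshold:
                        edges.add((i, j))
    return sorted(edges)
```

The sketch also shows why this only pays off for small thresholds: as the threshold grows, the cells get bigger, more points land in each neighborhood, and the work drifts back toward comparing all pairs.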

In a future post we’ll implement a method for speeding up the computation of the Vietoris-Rips complex, since this is the primary bottleneck for topological data analysis. But for now we have the conceptual idea of how Čech complexes and Vietoris-Rips complexes can be used to turn point clouds into simplicial complexes in reasonable ways.

Before we close we should mention that there are other ways to do this. I’ve chosen the algebraic flavor of topological data analysis due to my familiarity with algebra and the work based on this approach. The other approaches have a more geometric flavor, and are based on the Delaunay triangulation, a hallmark of computational geometry algorithms. The two approaches I’ve heard of are called the alpha complex and the flow complex. The downside of these approaches is that, because they are based on the Delaunay triangulation, they have poor scaling in the dimension of the data. Because high dimensional data is crucial, many researchers have been spending their time figuring out how to speed up approximations of the V-R complex. See these slides of Afra Zomorodian for an example.

Until next time!

]]>

**Warning: algorithms can facilitate illegal discrimination!**

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and *even* if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the *New York Times*.

There are many facets to the algorithm fairness problem, one that has not even been widely acknowledged as a problem, despite the *Times* article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.
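To see how a removed feature can leak back in, here is a toy sketch. Everything in it is made up for illustration: the proxy values `"p"` and `"q"` stand for some innocuous-looking feature, the groups `"a"` and `"b"` stand for a protected attribute, and the counts are invented. The function measures how well one can guess the protected attribute from the proxy alone.

```python
from collections import Counter, defaultdict


def proxy_accuracy(records):
    """Accuracy of guessing the protected attribute from a proxy feature alone.

    `records` is a list of (proxy_value, protected_value) pairs.
    """
    by_proxy = defaultdict(Counter)
    for proxy, protected in records:
        by_proxy[proxy][protected] += 1
    # Guess the most common protected value within each proxy value.
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(records)


# Invented toy data: each proxy value is shared by 90% one group, 10% the other.
TOY_RECORDS = (
    [("p", "a")] * 90 + [("p", "b")] * 10
    + [("q", "b")] * 90 + [("q", "a")] * 10
)

if __name__ == "__main__":
    print(proxy_accuracy(TOY_RECORDS))
```

In this toy data the protected attribute never appears as an input, yet the proxy recovers it for 180 of the 200 records. A learning algorithm given the proxy has, in effect, been given most of the protected attribute.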

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it *means* for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise *legal* definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an *individual* was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.
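For concreteness, the 80% rule as it is commonly stated compares selection rates: a practice is flagged if the rate at which the protected group receives the favorable outcome is less than 80% of the rate for the group with the highest rate. Here is a minimal sketch of that check. The function names and the numbers in the example are hypothetical, and this is the arithmetic only, not legal advice on how the rule is applied.

```python
def selection_rate(selected, total):
    # Fraction of a group that received the favorable outcome.
    return selected / total


def violates_eighty_percent_rule(protected_rate, reference_rate):
    """True if the protected group's rate is below 80% of the reference rate."""
    return protected_rate < 0.8 * reference_rate


if __name__ == "__main__":
    # Hypothetical numbers: 30 of 100 protected applicants approved,
    # versus 50 of 100 applicants in the reference group.
    protected = selection_rate(30, 100)
    reference = selection_rate(50, 100)
    print(violates_eighty_percent_rule(protected, reference))
```

In the hypothetical example the protected group’s rate is 60% of the reference rate, so the check flags it. Had 45 of the 100 protected applicants been approved, the ratio would be 90% and the check would pass.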

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem that stupid people aren’t listening to smart people, it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” has fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know *what kinds of problems* deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results for deep learning. This means they spent a ton of time tinkering with the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability are the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

]]>