# Searching for RH Counterexamples — Exploring Data

We’re ironically searching for counterexamples to the Riemann Hypothesis.

In the last article we added a menagerie of “production readiness” features like continuous integration tooling (automating test running and static analysis), alerting, and a simple deployment automation. Then I let it loose on AWS, got extremely busy with buying a house, forgot about this program for a few weeks (no alerts means it worked flawlessly!), and then saw my AWS bill.

So I copied the database off AWS using pg_dump (piped to gzip), terminated the instances, and inspected the results. A copy of the database is here. You may need git-lfs to clone it. If I wanted to start it back up again, I could spin them back up, and use gunzip | psql to restore the database, and it would start back up from where it left off. A nice benefit of all the software engineering work done thus far.

This article will summarize some of the data, show plots, and try out some exploratory data analysis techniques.

## Summary

We stopped the search mid-way through the set of numbers with 136 prime divisors.

The largest number processed was

1255923956750926940807079376257388805204
00410625719434151527143279285143764977392
49474111379646103102793414829651500824447
17178682617437033476033026987651806835743
3694669721205424205654368862231754214894
07691711699791787732382878164959602478352
11435434547040000

Which in factored form is the product of these terms

  2^8   3^7   5^4   7^4  11^3  13^3  17^2  19^2  23^2  29^2
31^2  37^2  41^2  43^1  47^1  53^1  59^1  61^1  67^1  71^1
73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1
547^1 557^1 563^1 569^1 571^1 577^1


The best witness—the number with the largest witness value—was

38824169178385494306912668979787078930475
9208283469279319659854547822438432284497
11994812030251439907246255647505123032869
03750131483244222351596015366602420554736
87070007801035106854341150889235475446938
52188272225341139870856016797627204990720000

which has witness value 1.7707954880001586, which is still significantly smaller than the needed 1.782 to disprove RH.

The factored form of the best witness is

 2^11   3^7   5^4   7^3  11^3  13^2  17^2  19^2  23^2  29^2
31^2  37^1  41^1  43^1  47^1  53^1  59^1  61^1  67^1  71^1
73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1
547^1 557^1 563^1


The average search block took 4m15s to compute, while the max took 7m7s and the min took 36s.

The search ran for about 55 days (hiccups included), starting at 2021-03-05 05:47:53 and stopping at 2021-04-28 15:06:25. The total AWS bill—including development, and periods where the application was broken but the instances still running, and including instances I wasn’t using but forgot to turn off—was $380.25. When the application was running at its peak, the bill worked out to about$100/month, though I think I could get it much lower by deploying fewer instances, after we made the performance optimizations that reduced the need for resource-heavy instances. There is also the possibility of using something that integrates more tightly with AWS, such as serverless jobs for the cleanup, generate, and process worker jobs.

## Plots

When in doubt, plot it out. I started by writing an export function to get the data into a simpler CSV, which for each $n$ only stored $\log(n)$ and the witness value.

I did this once for the final computation results. I’ll call this the “small” database because it only contains the largest witness value in each block. I did it again for an earlier version of the database before we introduced optimizations (I’ll call this the “large” database), which had all witness values for all superabundant numbers processed up to 80 prime factors.. The small database was only a few dozen megabytes in size, and the large database was ~40 GiB, so I had to use postgres cursors to avoid loading the large database into memory. Moreover, then generated CSV was about 8 GiB in size, and so it required a few extra steps to sort it, and get it into a format that could be plotted in a reasonable amount of time.

First, using GNU sort to sort the file by the first column, $\log(n)$

sort -t , -n -k 1 divisor_sums.csv -o divisor_sums_sorted.csv


Then, I needed to do some simple operations on massive CSV files, including computing a cumulative max, and filtering down to a subset of rows that are sufficient for plotting. After trying to use pandas and vaex, I realized that the old awk command line tool would be great at this job. So I wrote a simple awk script to process the data, and compute data used for the cumulative max witness value plots below.

Then finally we can use vaex to create two plots. The first is a heatmap of witness value counts. The second is a plot of the cumulative max witness value. For the large database:

And for the small database

Note, the two ridges disagree slightly (the large database shows a longer flat line than the small database for the same range), because of the way that the superabundant enumeration doesn’t go in increasing order of $n$. So larger witness values in the range 400-500 are found later.

## Estimating the max witness value growth rate

The next obvious question is whether we can fit the curves above to provide an estimate of how far we might have to look to find the first witness value that exceeds the desired 1.782 threshold. Of course, this will obviously depend on the appropriateness of the underlying model.

A simple first guess would be split between two options: the real data is asymptotic like $a + b / x$ approaching some number less than 1.782 (and hence this approach cannot disprove RH), or the real data grows slowly (perhaps doubly-logarithmic) like $a + b \log \log x$, but eventually surpasses 1.782 (and RH is false). For both cases, we can use scipy’s curve fitting routine as in this pull request.

For the large database (roughly using log n < 400 since that’s when the curve flatlines due to the enumeration order), we get a reciprocal fit of

$\displaystyle f(x) \approx 1.77579122 - 2.72527824 / x$

and a logarithmic fit of

$\displaystyle f(x) \approx 1.65074314 + 0.06642373 \log(\log(x))$

The estimated asymptote is around 1.7757 in the first case, and the second case estimates we’d find an RH counterexample at around $log(n) = 1359$.

For the small database of only sufficiently large witness values, this time going up to about $log(n) \approx 575$, the asymptotic approximation is

$\displaystyle 1.77481154 -2.31226382 / x$

And the logarithmic approximation is

$\displaystyle 1.70825262 + 0.03390312 \log(\log(x))$

Now the asymptote is slightly lower, at 1.7748, and the logarithmic model approximates the counterexample can be found at approximately $\log(n) = 6663$.

Both of the logarithmic approximations suggest that if we want to find an RH counterexample, we would need to look at numbers with thousands of prime factors. The first estimate puts a counterexample at about $2^{1960}$, the second at $2^{9612}$, so let’s say between 1k and 10k prime factors.

Luckily, we can actually jump forward in the superabundant enumeration to exactly the set of candidates with $m$ prime factors. So it might make sense to jump ahead to, say, 5k prime factors and search in that region. However, the size of a level set of the superabundant enumeration still grows exponentially in $m$. Perhaps we should (heuristically) narrow down the search space by looking for patterns in the distribution of prime factors for the best witness values we’ve found thus far. Perhaps the values of $n$ with the best witness values tend to have a certain concentration of prime factors.

## Exploring prime factorizations

At first, my thought was to take the largest witness values, look at their prime factorizations, and try to see a pattern when compared to smaller witness values. However, other than the obvious fact that the larger witness values correspond to larger numbers (more and larger prime factors), I didn’t see an obvious pattern from squinting at plots.

To go in a completely different direction, I wanted to try out the UMAP software package, a very nice and mathematically sophisticated for high dimensional data visualization. It’s properly termed a dimensionality reduction technique, meaning it takes as input a high-dimensional set of data, and produces as output a low-dimensional embedding of that data that tries to maintain the same shape as the input, where “shape” is in the sense of a certain Riemannian metric inferred from the high dimensional data. If there is structure among the prime factorizations, then UMAP should plot a pretty picture, and perhaps that will suggest some clearer approach.

To apply this to the RH witness value dataset, we can take each pair $(n, \sigma(n)/(n \log \log n))$, and associate that with a new (high dimensional) data point corresponding to the witness value paired with the number’s prime factorization

$\displaystyle (\sigma(n)/(n \log \log n), k_1, k_2, \dots, k_d)$,

where $n = 2^{k_1} 3^{k_2} 5^{k_3} \dots p_d^{k_d}$, with zero-exponents included so that all points have the same dimension. This pull request adds the ability to factorize and export the witness values to a CSV file as specified, and this pull request adds the CSV data (using git-lfs), along with the script to run UMAP, the resulting plots shown below, and the saved embeddings as .npy files (numpy arrays).

When we do nothing special to the data and run it through UMAP we see this plot.

It looks cool, but if you stare at it for long enough (and if you zoom in when you generate the plot yourself in matplotlib) you can convince yourself that it’s not finding much useful structure. The red dots dominate (lower witness values) and the blue dots are kind of spread haphazardly throughout the red regions. The “ridges” along the chart are probably due to how the superabundant enumeration skips lots of numbers, and that’s why it thins out on one end: the thinning out corresponds to fewer numbers processed that are that large since the enumeration is not uniform.

It also seemed like there is too much data. The plot above has some 80k points on it. After filtering down to just those points whose witness values are bigger than 1.769, we get a more manageable plot.

This is a bit more reasonable. You can see a stripe of blue dots going through the middle of the plot.

Before figuring out how that blue ridge might relate to the prime factor patterns, let’s take this a few steps further. Typically in machine learning contexts, it helps to normalize your data, i.e., to transform each input dimension into a standard Z-score with respect to the set of values seen in that dimension, subtracting the mean and dividing by the standard deviation. Since the witness values are so close to each other, they’re a good candidate for such normalization. Here’s what UMAP plots when we normalize the witness value column only.

Now this is a bit more interesting! Here the colormap on the right is in units of standard deviation of witness values. You can see a definite bluest region, and it appears that the data is organized into long brushstrokes, where the witness values increase as you move from one end of the stroke to the other. At worst, this suggests that the dataset has structure that a learning algorithm could discover.

Going even one step further, what if we normalize all the columns? Well, it’s not as interesting.

If you zoom in, you can see that the same sort of “brushstroke” idea is occurring here too, with blue on one end and red on the other. It’s just harder to see.

We would like to study the prettiest picture and see if we can determine what pattern of prime numbers the blue region has, if any. The embedding files are stored on github, and I put up (one version of) the UMAP visualization as an interactive plot via this pull request.

I’ve been sitting on this draft for a while, and while this article didn’t make a ton of headway, the pictures will have to do while I’m still dealing with my new home purchase.

Some ideas for next steps:

• Squint harder at the distributions of primes for the largest witness values in comparison to the rest.
• See if a machine learning algorithm can regress witness values based on their prime factorizations (and any other useful features I can derive). Study the resulting hypothesis to determine which features are the most important. Use that to refine the search strategy.
• Try searching randomly in the superabundant enumeration around 1k and 10k prime factors, and see if the best witness values found there match the log-log regression.
• Since witness values above a given threshold seem to be quite common, and because the UMAP visualization shows some possible “locality” structure for larger witness values, it suggests if there is a counterexample to RH then there are probably many. So a local search method (e.g., local neighborhood search/discrete gradient ascent with random restarts) might allow us to get a better sense for whether we are on the right track.

Until next time!

# The Inner Product as a Decision Rule

The standard inner product of two vectors has some nice geometric properties. Given two vectors $x, y \in \mathbb{R}^n$, where by $x_i$ I mean the $i$-th coordinate of $x$, the standard inner product (which I will interchangeably call the dot product) is defined by the formula

$\displaystyle \langle x, y \rangle = x_1 y_1 + \dots + x_n y_n$

This formula, simple as it is, produces a lot of interesting geometry. An important such property, one which is discussed in machine learning circles more than pure math, is that it is a very convenient decision rule.

In particular, say we’re in the Euclidean plane, and we have a line $L$ passing through the origin, with $w$ being a unit vector perpendicular to $L$ (“the normal” to the line).

If you take any vector $x$, then the dot product $\langle x, w \rangle$ is positive if $x$ is on the same side of $L$ as $w$, and negative otherwise. The dot product is zero if and only if $x$ is exactly on the line $L$, including when $x$ is the zero vector.

Left: the dot product of $w$ and $x$ is positive, meaning they are on the same side of $w$. Right: The dot product is negative, and they are on opposite sides.

Here is an interactive demonstration of this property. Click the image below to go to the demo, and you can drag the vector arrowheads and see the decision rule change.

Click above to go to the demo

The code for this demo is available in a github repository.

It’s always curious, at first, that multiplying and summing produces such geometry. Why should this seemingly trivial arithmetic do anything useful at all?

The core fact that makes it work, however, is that the dot product tells you how one vector projects onto another. When I say “projecting” a vector $x$ onto another vector $w$, I mean you take only the components of $x$ that point in the direction of $w$. The demo shows what the result looks like using the red (or green) vector.

In two dimensions this is easy to see, as you can draw the triangle which has $x$ as the hypotenuse, with $w$ spanning one of the two legs of the triangle as follows:

If we call $a$ the (vector) leg of the triangle parallel to $w$, while $b$ is the dotted line (as a vector, parallel to $L$), then as vectors $x = a + b$. The projection of $x$ onto $w$ is just $a$.

Another way to think of this is that the projection is $x$, modified by removing any part of $x$ that is perpendicular to $w$. Using some colorful language: you put your hands on either side of $x$ and $w$, and then you squish $x$ onto $w$ along the line perpendicular to $w$ (i.e., along $b$).

And if $w$ is a unit vector, then the length of $a$—that is, the length of the projection of $x$ onto $w$—is exactly the inner product product $\langle x, w \rangle$.

Moreover, if the angle between $x$ and $w$ is larger than 90 degrees, the projected vector will point in the opposite direction of $w$, so it’s really a “signed” length.

Left: the projection points in the same direction as $w$. Right: the projection points in the opposite direction.

And this is precisely why the decision rule works. This 90-degree boundary is the line perpendicular to $w$.

More technically said: Let $x, y \in \mathbb{R}^n$ be two vectors, and $\langle x,y \rangle$ their dot product. Define by $\| y \|$ the length of $y$, specifically $\sqrt{\langle y, y \rangle}$. Define by $\text{proj}_{y}(x)$ by first letting $y' = \frac{y}{\| y \|}$, and then let $\text{proj}_{y}(x) = \langle x,y' \rangle y'$. In words, you scale $y$ to a unit vector $y'$, use the result to compute the inner product, and then scale $y$ so that it’s length is $\langle x, y' \rangle$. Then

Theorem: Geometrically, $\text{proj}_y(x)$ is the projection of $x$ onto the line spanned by $y$.

This theorem is true for any $n$-dimensional vector space, since if you have two vectors you can simply apply the reasoning for 2-dimensions to the 2-dimensional plane containing $x$ and $y$. In that case, the decision boundary for a positive/negative output is the entire $n-1$ dimensional hyperplane perpendicular to $y$ (the projected vector).

In fact, the usual formula for the angle between two vectors, i.e. the formula $\langle x, y \rangle = \|x \| \cdot \| y \| \cos \theta$, is a restatement of the projection theorem in terms of trigonometry. The $\langle x, y' \rangle$ part of the projection formula (how much you scale the output) is equal to $\| x \| \cos \theta$. At the end of this post we have a proof of the cosine-angle formula above.

Part of why this decision rule property is so important is that this is a linear function, and linear functions can be optimized relatively easily. When I say that, I specifically mean that there are many known algorithms for optimizing linear functions, which don’t have obscene runtime or space requirements. This is a big reason why mathematicians and statisticians start the mathematical modeling process with linear functions. They’re inherently simpler.

In fact, there are many techniques in machine learning—a prominent one is the so-called Kernel Trick—that exist solely to take data that is not inherently linear in nature (cannot be fruitfully analyzed by linear methods) and transform it into a dataset that is. Using the Kernel Trick as an example to foreshadow some future posts on Support Vector Machines, the idea is to take data which cannot be separated by a line, and transform it (usually by adding new coordinates) so that it can. Then the decision rule, computed in the larger space, is just a dot product. Irene Papakonstantinou neatly demonstrates this with paper folding and scissors. The tradeoff is that the size of the ambient space increases, and it might increase so much that it makes computation intractable. Luckily, the Kernel Trick avoids this by remembering where the data came from, so that one can take advantage of the smaller space to compute what would be the inner product in the larger space.

Next time we’ll see how this decision rule shows up in an optimization problem: finding the “best” hyperplane that separates an input set of red and blue points into monochromatic regions (provided that is possible). Finding this separator is core subroutine of the Support Vector Machine technique, and therein lie interesting algorithms. After we see the core SVM algorithm, we’ll see how the Kernel Trick fits into the method to allow nonlinear decision boundaries.

Proof of the cosine angle formula

Theorem: The inner product $\langle v, w \rangle$ is equal to $\| v \| \| w \| \cos(\theta)$, where $\theta$ is the angle between the two vectors.

Note that this angle is computed in the 2-dimensional subspace spanned by $v, w$, viewed as a typical flat plane, and this is a 2-dimensional plane regardless of the dimension of $v, w$.

Proof. If either $v$ or $w$ is zero, then both sides of the equation are zero and the theorem is trivial, so we may assume both are nonzero. Label a triangle with sides $v,w$ and the third side $v-w$. Now the length of each side is $\| v \|, \| w\|,$ and $\| v-w \|$, respectively. Assume for the moment that $\theta$ is not 0 or 180 degrees, so that this triangle is not degenerate.

The law of cosines allows us to write

$\displaystyle \| v - w \|^2 = \| v \|^2 + \| w \|^2 - 2 \| v \| \| w \| \cos(\theta)$

Moreover, The left hand side is the inner product of $v-w$ with itself, i.e. $\| v - w \|^2 = \langle v-w , v-w \rangle$. We’ll expand $\langle v-w, v-w \rangle$ using two facts. The first is trivial from the formula, that inner product is symmetric: $\langle v,w \rangle = \langle w, v \rangle$. Second is that the inner product is linear in each input. In particular for the first input: $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle cx, z \rangle = c \langle x, z \rangle$. The same holds for the second input by symmetry of the two inputs. Hence we can split up $\langle v-w, v-w \rangle$ as follows.

\displaystyle \begin{aligned} \langle v-w, v-w \rangle &= \langle v, v-w \rangle - \langle w, v-w \rangle \\ &= \langle v, v \rangle - \langle v, w \rangle - \langle w, v \rangle + \langle w, w \rangle \\ &= \| v \|^2 - 2 \langle v, w \rangle + \| w \|^2 \\ \end{aligned}

Combining our two offset equations, we can subtract $\| v \|^2 + \| w \|^2$ from each side and get

$\displaystyle -2 \|v \| \|w \| \cos(\theta) = -2 \langle v, w \rangle,$

Which, after dividing by $-2$, proves the theorem if $\theta \not \in \{0, 180 \}$.

Now if $\theta = 0$ or 180 degrees, the vectors are parallel, so we can write one as a scalar multiple of the other. Say $w = cv$ for $c \in \mathbb{R}$. In that case, $\langle v, cv \rangle = c \| v \| \| v \|$. Now $\| w \| = | c | \| v \|$, since a norm is a length and is hence non-negative (but $c$ can be negative). Indeed, if $v, w$ are parallel but pointing in opposite directions, then $c < 0$, so $\cos(\theta) = -1$, and $c \| v \| = - \| w \|$. Otherwise $c > 0$ and $\cos(\theta) = 1$. This allows us to write $c \| v \| \| v \| = \| w \| \| v \| \cos(\theta)$, and this completes the final case of the theorem.

$\square$

# One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

## Statistical parity

Here is one way to define fairness.

Your population is a set $X$ and there is some known subset $S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $X$ is people and $S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $D$ over $X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $D$ takes that into account. Generally we impose no restrictions on $D$, and the definition of fairness will have to work no matter what $D$ is.

Now suppose we have a (possibly randomized) classifier $h:X \to \{-1,1\}$ giving labels to $X$. When given a person $x$ as input $h(x)=1$ if $x$ gets a loan and $-1$ otherwise. The bias, or statistical imparity, of $h$ on $S$ with respect to $X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $S$ is labeled 1 and the probability that a random individual from the complement $S^C$ is labeled 1.

$\textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] - \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $h:X \to \{-1,1\}$ is said to have statistical parity on $D$ with respect to $S$ up to bias $\varepsilon$ if $|\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -> float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):
protectedClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] == protectedValue]
elseClass = [(x,l) for (x,l) in zip(data, labels)
if x[protectedIndex] != protectedValue]

if len(protectedClass) == 0 or len(elseClass) == 0:
raise Exception("One of the classes is empty!")
else:
protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

return elseProb - protectedProb


Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -> float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)


Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == "__main__":

# [(test name, (index, value))]
tests = [('gender', (1,0)),
('private employment', (2,1)),
('asian race', (33,1)),
('divorced', (12, 1))]

for (name, (index, value)) in tests:
print("'%s' bias in training data: %.4f" %
(name, labelBias(train[0], train[1], index, value)))


(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582


Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

>>> indian = lambda x: x[47] == 1
>>> len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
>>> signedBias(train[0], indian, 1, 0)
0.0030631816119030884


So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least \$50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $T$ in place of $S^C$, or looking at discrepancies among $k$ different sub-populations or with $m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $S$. I mean two things by this. First, you could have some historical data you want to train a classifier $h$ on, and usually you’ll be given training labels for the data that tell you whether $h(x)$ should be $1$ or $-1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $S$ necessarily reduces the accuracy of $h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

## Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!

# The Boosting Margin, or Why Boosting Doesn’t Overfit

There’s a well-understood phenomenon in machine learning called overfitting. The idea is best shown by a graph:

Let me explain. The vertical axis represents the error of a hypothesis. The horizontal axis represents the complexity of the hypothesis. The blue curve represents the error of a machine learning algorithm’s output on its training data, and the red curve represents the generalization of that hypothesis to the real world. The overfitting phenomenon is marker in the middle of the graph, before which the training error and generalization error both go down, but after which the training error continues to fall while the generalization error rises.

The explanation is a sort of numerical version of Occam’s Razor that says more complex hypotheses can model a fixed data set better and better, but at some point a simpler hypothesis better models the underlying phenomenon that generates the data. To optimize a particular learning algorithm, one wants to set parameters of their model to hit the minimum of the red curve.

This is where things get juicy. Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t overfit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty.

Schapire, Freund, Bartlett, and Lee proposed a theoretical explanation for this based on the notion of a margin, and the goal of this post is to go through the details of their theorem and proof. Remember that the standard AdaBoost algorithm produces a set of weak hypotheses $h_i(x)$ and a corresponding weight $\alpha_i \in [-1,1]$ for each round $i=1, \dots, T$. The classifier at the end is a weighted majority vote of all the weak learners (roughly: weak learners with high error on “hard” data points get less weight).

Definition: The signed confidence of a labeled example $(x,y)$ is the weighted sum:

$\displaystyle \textup{conf}(x) = \sum_{i=1}^T \alpha_i h_i(x)$

The margin of $(x,y)$ is the quantity $\textup{margin}(x,y) = y \textup{conf}(x)$. The notation implicitly depends on the outputs of the AdaBoost algorithm via “conf.”

We use the product of the label and the confidence for the observation that $y \cdot \textup{conf}(x) \leq 0$ if and only if the classifier is incorrect. The theorem we’ll prove in this post is

Theorem: With high probability over a random choice of training data, for any $0 < \theta < 1$ generalization error of boosting is bounded from above by

$\displaystyle \Pr_{\textup{train}}[\textup{margin}(x) \leq \theta] + O \left ( \frac{1}{\theta} (\textup{typical error terms}) \right )$

In words, the generalization error of the boosting hypothesis is bounded by the distribution of margins observed on the training data. To state and prove the theorem more generally we have to return to the details of PAC-learning. Here and in the rest of this post, $\Pr_D$ denotes $\Pr_{x \sim D}$, the probability over a random example drawn from the distribution $D$, and $\Pr_S$ denotes the probability over a random (training) set of examples drawn from $D$.

Theorem: Let $S$ be a set of $m$ random examples chosen from the distribution $D$ generating the data. Assume the weak learner corresponds to a finite hypothesis space $H$ of size $|H|$, and let $\delta > 0$. Then with probability at least $1 - \delta$ (over the choice of $S$), every weighted-majority vote function $f$ satisfies the following generalization bound for every $\theta > 0$.

$\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_S[y f(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{\log m \log |H|}{\theta^2} + \log(1/\delta)} \right )$

In other words, this phenomenon is a fact about voting schemes, not boosting in particular. From now on, a “majority vote” function $f(x)$ will mean to take the sign of a sum of the form $\sum_{i=1}^N a_i h_i(x)$, where $a_i \geq 0$ and $\sum_i a_i = 1$. This is the “convex hull” of the set of weak learners $H$. If $H$ is infinite (in our proof it will be finite, but we’ll state a generalization afterward), then only finitely many of the $a_i$ in the sum may be nonzero.

To prove the theorem, we’ll start by defining a class of functions corresponding to “unweighted majority votes with duplicates:”

Definition: Let $C_N$ be the set of functions $f(x)$ of the form $\frac{1}{N} \sum_{i=1}^N h_i(x)$ where $h_i \in H$ and the $h_i$ may contain duplicates (some of the $h_i$ may be equal to some other of the $h_j$).

Now every majority vote function $f$ can be written as a weighted sum of $h_i$ with weights $a_i$ (I’m using $a$ instead of $\alpha$ to distinguish arbitrary weights from those weights arising from Boosting). So any such $f(x)$ defines a natural distribution over $H$ where you draw function $h_i$ with probability $a_i$. I’ll call this distribution $A_f$. If we draw from this distribution $N$ times and take an unweighted sum, we’ll get a function $g(x) \in C_N$. Call the random process (distribution) generating functions in this way $Q_f$. In diagram form, the logic goes

$f \to$ weights $a_i \to$ distribution over $H \to$ function in $C_N$ by drawing $N$ times according to $H$.

The main fact about the relationship between $f$ and $Q_f$ is that each is completely determined by the other. Obviously $Q_f$ is determined by $f$ because we defined it that way, but $f$ is also completely determined by $Q_f$ as follows:

$\displaystyle f(x) = \mathbb{E}_{g \sim Q_f}[g(x)]$

Proving the equality is an exercise for the reader.

Proof of Theorem. First we’ll split the probability $\Pr_D[y f(x) \leq 0]$ into two pieces, and then bound each piece.

First a probability reminder. If we have two events $A$ and $B$ (in what’s below, this will be $yg(x) \leq \theta/2$ and $yf(x) \leq 0$, we can split up $\Pr[A]$ into $\Pr[A \textup{ and } B] + \Pr[A \textup{ and } \overline{B}]$ (where $\overline{B}$ is the opposite of $B$). This is called the law of total probability. Moreover, because $\Pr[A \textup{ and } B] = \Pr[A | B] \Pr[B]$ and because these quantities are all at most 1, it’s true that $\Pr[A \textup{ and } B] \leq \Pr[A \mid B]$ (the conditional probability) and that $\Pr[A \textup{ and } B] \leq \Pr[B]$.

Back to the proof. Notice that for any $g(x) \in C_N$ and any $\theta > 0$, we can write $\Pr_D[y f(x) \leq 0]$ as a sum:

$\displaystyle \Pr_D[y f(x) \leq 0] =\\ \Pr_D[yg(x) \leq \theta/2 \textup{ and } y f(x) \leq 0] + \Pr_D[yg(x) > \theta/2 \textup{ and } y f(x) \leq 0]$

Now I’ll loosen the first term by removing the second event (that only makes the whole probability bigger) and loosen the second term by relaxing it to a conditional:

$\displaystyle \Pr_D[y f(x) \leq 0] \leq \Pr_D[y g(x) \leq \theta / 2] + \Pr_D[yg(x) > \theta/2 \mid yf(x) \leq 0]$

Now because the inequality is true for every $g(x) \in C_N$, it’s also true if we take an expectation of the RHS over any distribution we choose. We’ll choose the distribution $Q_f$ to get

$\displaystyle \Pr_D[yf(x) \leq 0] \leq T_1 + T_2$

And $T_1$ (term 1) is

$\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

And $T_2$ is

$\displaystyle \Pr_{x \sim D, g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0] = \mathbb{E}_D[\Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]]$

We can rewrite the probabilities using expectations because (1) the variables being drawn in the distributions are independent, and (2) the probability of an event is the expectation of the indicator function of the event.

Now we’ll bound the terms $T_1, T_2$ separately. We’ll start with $T_2$.

Fix $(x,y)$ and look at the quantity inside the expectation of $T_2$.

$\displaystyle \Pr_{g \sim Q_f}[yg(x) > \theta/2 \mid yf(x) \leq 0]$

This should intuitively be very small for the following reason. We’re sampling $g$ according to a distribution whose expectation is $f$, and we know that $yf(x) \leq 0$. Of course $yg(x)$ is unlikely to be large.

Mathematically we can prove this by transforming the thing inside the probability to a form suitable for the Chernoff bound. Saying $yg(x) > \theta / 2$ is the same as saying $|yg(x) - \mathbb{E}[yg(x)]| > \theta /2$, i.e. that some random variable which is a sum of independent random variables (the $h_i$) deviates from its expectation by at least $\theta/2$. Since the $y$‘s are all $\pm 1$ and constant inside the expectation, they can be removed from the absolute value to get

$\displaystyle \leq \Pr_{g \sim Q_f}[g(x) - \mathbb{E}[g(x)] > \theta/2]$

The Chernoff bound allows us to bound this by an exponential in the number of random variables in the sum, i.e. $N$. It turns out the bound is $e^{-N \theta^2 / 8}$.

Now recall $T_1$

$\displaystyle T_1 = \Pr_{x \sim D, g \sim Q_f} [yg(x) \leq \theta /2] = \mathbb{E}_{g \sim Q_f}[\Pr_D[yg(x) \leq \theta/2]]$

For $T_1$, we don’t want to bound it absolutely like we did for $T_2$, because there is nothing stopping the classifier $f$ from being a bad classifier and having lots of error. Rather, we want to bound it in terms of the probability that $yf(x) \leq \theta$. We’ll do this in two steps. In step 1, we’ll go from $\Pr_D$ of the $g$‘s to $\Pr_S$ of the $g$‘s.

Step 1: For any fixed $g, \theta$, if we take a sample $S$ of size $m$, then consider the event in which the sample probability deviates from the true distribution by some value $\varepsilon_N$, i.e. the event

$\displaystyle \Pr_D[yg(x) \leq \theta /2] > \Pr_{S, x \sim S}[yg(x) \leq \theta/2] + \varepsilon_N$

The claim is this happens with probability at most $e^{-2m\varepsilon_N^2}$. This is again the Chernoff bound in disguise, because the expected value of $\Pr_S$ is $\Pr_D$, and the probability over $S$ is an average of random variables (it’s a slightly different form of the Chernoff bound; see this post for more). From now on we’ll drop the $x \sim S$ when writing $\Pr_S$.

The bound above holds true for any fixed $g,\theta$, but we want a bound over all $g$ and $\theta$. To do that we use the union bound. Note that there are only $(N+1)$ possible choices for a nonnegative $\theta$ because $g(x)$ is a sum of $N$ values each of which is either $\pm1$. And there are only $|C_N| \leq |H|^N$ possibilities for $g(x)$. So the union bound says the above event will occur with probability at most $(N+1)|H|^N e^{-2m\varepsilon_N^2}$.

If we want the event to occur with probability at most $\delta_N$, we can judiciously pick

$\displaystyle \varepsilon_N = \sqrt{(1/2m) \log ((N+1)|H|^N / \delta_N)}$

And since the bound holds in general, we can take expectation with respect to $Q_f$ and nothing changes. This means that for any $\delta_N$, our chosen $\varepsilon_N$ ensures that the following is true with probability at least $1-\delta_N$:

$\displaystyle \Pr_{D, g \sim Q_f}[yg(x) \leq \theta/2] \leq \Pr_{S, g \sim Q_f}[yg(x) \leq \theta/2] + \varepsilon_N$

Now for step 2, we bound the probability that $yg(x) \leq \theta/2$ on a sample to the probability that $yf(x) \leq \theta$ on a sample.

Step 2: The first claim is that

$\displaystyle \Pr_{S, g \sim Q_f}[yg(x) \leq \theta / 2] \leq \Pr_{S} [yf(x) \leq \theta] + \mathbb{E}_{S}[\Pr_{g \sim Q_f}[yg(x) \leq \theta/2 \mid yf(x) \geq \theta]]$

What we did was break up the LHS into two “and”s, when $yf(x) > \theta$ and $yf(x) \leq \theta$ (this was still an equality). Then we loosened the first term to $\Pr_{S}[yf(x) \leq \theta]$ since that is only more likely than both $yg(x) \leq \theta/2$ and $yf(x) \leq \theta$. Then we loosened the second term again using the fact that a probability of an “and” is bounded by the conditional probability.

Now we have the probability of $yg(x) \leq \theta / 2$ bounded by the probability that $yf(x) \leq 0$ plus some stuff. We just need to bound the “plus some stuff” absolutely and then we’ll be done. The argument is the same as our previous use of the Chernoff bound: we assume $yf(x) \geq \theta$, and yet $yg(x) \leq \theta / 2$. So the deviation of $yg(x)$ from its expectation is large, and the probability that happens is exponentially small in the amount of deviation. The bound you get is

$\displaystyle \Pr_{g \sim Q}[yg(x) \leq \theta/2 \mid yf(x) > \theta] \leq e^{-N\theta^2 / 8}.$

And again we use the union bound to ensure the failure of this bound for any $N$ will be very small. Specifically, if we want the total failure probability to be at most $\delta$, then we need to pick some $\delta_j$‘s so that $\delta = \sum_{j=0}^{\infty} \delta_j$. Choosing $\delta_N = \frac{\delta}{N(N+1)}$ works.

Putting everything together, we get that with probability at least $1-\delta$ for every $\theta$ and every $N$, this bound on the failure probability of $f(x)$:

$\displaystyle \Pr_{x \sim D}[yf(x) \leq 0] \leq \Pr_{S, x \sim S}[yf(x) \leq \theta] + 2e^{-N \theta^2 / 8} + \sqrt{\frac{1}{2m} \log \left ( \frac{N(N+1)^2 |H|^N}{\delta} \right )}.$

This claim is true for every $N$, so we can pick $N$ that minimizes it. Doing a little bit of behind-the-scenes calculus that is left as an exercise to the reader, a tight choice of $N$ is $(4/ \theta)^2 \log(m/ \log |H|)$. And this gives the statement of the theorem.

$\square$

We proved this for finite hypothesis classes, and if you know what VC-dimension is, you’ll know that it’s a central tool for reasoning about the complexity of infinite hypothesis classes. An analogous theorem can be proved in terms of the VC dimension. In that case, calling $d$ the VC-dimension of the weak learner’s output hypothesis class, the bound is

$\displaystyle \Pr_D[yf(x) \leq 0] \leq \Pr_S[yf(x) \leq \theta] + O \left ( \frac{1}{\sqrt{m}} \sqrt{\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)} \right )$

How can we interpret these bounds with so many parameters floating around? That’s where asymptotic notation comes in handy. If we fix $\theta \leq 1/2$ and $\delta = 0.01$, then the big-O part of the theorem simplifies to $\sqrt{(\log |H| \cdot \log m) / m}$, which is easier to think about since $(\log m)/m$ goes to zero very fast.

Now the theorem we just proved was about any weighted majority function. The question still remains: why is AdaBoost good? That follows from another theorem, which we’ll state and leave as an exercise (it essentially follows by unwrapping the definition of the AdaBoost algorithm from last time).

Theorem: Suppose that during AdaBoost the weak learners produce hypotheses with training errors $\varepsilon_1, \dots , \varepsilon_T$. Then for any $\theta$,

$\displaystyle \Pr_{(x,y) \sim S} [yf(x) \leq \theta] \leq 2^T \prod_{t=1}^T \sqrt{\varepsilon_t^{(1-\theta)} (1-\varepsilon_t)^{(1+\theta)}}$

Let’s interpret this for some concrete numbers. Say that $\theta = 0$ and $\varepsilon_t$ is any fixed value less than $1/2$. In this case the term inside product becomes $\sqrt{\varepsilon (1-\varepsilon)} < 1/2$ and the whole bound tends exponentially quickly to zero in the number of rounds $T$. On the other hand, if we raise $\theta$ to about 1/3, then in order to maintain the LHS tending to zero we would need $\varepsilon < \frac{1}{4} ( 3 - \sqrt{5} )$ which is about 20% error.

If you’re interested in learning more about Boosting, there is an excellent book by Freund and Schapire (the inventors of boosting) called Boosting: Foundations and Algorithms. There they include a tighter analysis based on the idea of Rademacher complexity. The bound I presented in this post is nice because the proof doesn’t require any machinery past basic probability, but if you want to reach the cutting edge of knowledge about boosting you need to invest in the technical stuff.

Until next time!