Featured Posts

Making Hybrid Images Neural Networks
and Backpropagation
Elliptic Curves
and Cryptography
bezier-dog-image-small circle-wedge-sphere baby_programmers
Bezier Curves and Picasso Computing Homology Probably Approximately
Correct – A Formal Theory
of Learning

Carnival of Mathematics #197

Welcome to the 197th Carnival of Mathematics!

197 is an unseemly number, as you can tell by the Wikipedia page which currently says that it has “indiscriminate, excessive, or irrelevant examples.” How deviant. It’s also a Repfigit, which means if you start a fibonacci-type sequence with the digits 1, 9, 7, and then continue with a_n = a_{i-3} + a_{i-2} + a_{i-1}, then 197 shows up in the sequence. Indeed: 1, 9, 7, 17, 33, 57, 107, 197, …

Untangling the unknot

Kennan Crane et al showcased a new paper that can untangle tangled curves quickly, and can do things like generate Hilbert-type space-filling curves on surfaces. It’s a long thread with tons of links to videos and reading materials, covering energy functions, functional analysis, Sobolev methods, and a custom inner product.

Folding equilateral triangles without measuring

Dave Richeson shows off a neat technique for folding equilateral triangles using just paper and no measurements. Replies in the thread show the geometric series that converges to the right 60 degree angle.

Shots fired at UMAP and t-SNE

Lior Pachter et al. study what sorts of structure are preserved by dimensionality reduction techniques like UMAP (which I have also used in a previous article) by comparing it against a genomics dataset with understood structure. They make some big claims about how UMAP and t-SNE destroy important structure, and they show how to contrive the dimensionality reduction plot to look like an elephant even when there’s no elephantine structure in the data.

I’m not expert, but perhaps one best case scenario for UMAP enthusiasts would be that their analysis only applies when you go from very high dimensions down to 2 just so you can plot a picture. But if you stop at, say, \sqrt{n} dimensions, you might still preserve a lot of the meaningful structure. Either way, they make a convincing pitch for Johnson-Lindenstrauss’s random linear reductions, which I’ve also covered here. Their paper is on biorXiv.

Studying the Sieve

Ben Peters Jones took up Grant Sanderson’s math video challenge and released a series of videos studying the Sieve of Eratosthenes.

Additional submissions

Be sure to submit fun math you find in September to the next carvinal host!

Searching for RH Counterexamples — Exploring Data

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded integers
  5. Deploying with Docker
  6. Performance Profiling
  7. Scaling up
  8. Productionizing

In the last article we added a menagerie of “production readiness” features like continuous integration tooling (automating test running and static analysis), alerting, and a simple deployment automation. Then I let it loose on AWS, got extremely busy with buying a house, forgot about this program for a few weeks (no alerts means it worked flawlessly!), and then saw my AWS bill.

So I copied the database off AWS using pg_dump (piped to gzip), terminated the instances, and inspected the results. A copy of the database is here. You may need git-lfs to clone it. If I wanted to start it back up again, I could spin them back up, and use gunzip | psql to restore the database, and it would start back up from where it left off. A nice benefit of all the software engineering work done thus far.

This article will summarize some of the data, show plots, and try out some exploratory data analysis techniques.

Summary

We stopped the search mid-way through the set of numbers with 136 prime divisors.

The largest number processed was

1255923956750926940807079376257388805204
00410625719434151527143279285143764977392
49474111379646103102793414829651500824447
17178682617437033476033026987651806835743
3694669721205424205654368862231754214894
07691711699791787732382878164959602478352
11435434547040000

Which in factored form is the product of these terms

  2^8   3^7   5^4   7^4  11^3  13^3  17^2  19^2  23^2  29^2
 31^2  37^2  41^2  43^1  47^1  53^1  59^1  61^1  67^1  71^1
 73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1
547^1 557^1 563^1 569^1 571^1 577^1

The best witness—the number with the largest witness value—was

38824169178385494306912668979787078930475
9208283469279319659854547822438432284497
11994812030251439907246255647505123032869
03750131483244222351596015366602420554736
87070007801035106854341150889235475446938
52188272225341139870856016797627204990720000

which has witness value 1.7707954880001586, which is still significantly smaller than the needed 1.782 to disprove RH.

The factored form of the best witness is

 2^11   3^7   5^4   7^3  11^3  13^2  17^2  19^2  23^2  29^2 
 31^2  37^1  41^1  43^1  47^1  53^1  59^1  61^1  67^1  71^1 
 73^1  79^1  83^1  89^1  97^1 101^1 103^1 107^1 109^1 113^1 
127^1 131^1 137^1 139^1 149^1 151^1 157^1 163^1 167^1 173^1 
179^1 181^1 191^1 193^1 197^1 199^1 211^1 223^1 227^1 229^1 
233^1 239^1 241^1 251^1 257^1 263^1 269^1 271^1 277^1 281^1 
283^1 293^1 307^1 311^1 313^1 317^1 331^1 337^1 347^1 349^1 
353^1 359^1 367^1 373^1 379^1 383^1 389^1 397^1 401^1 409^1 
419^1 421^1 431^1 433^1 439^1 443^1 449^1 457^1 461^1 463^1 
467^1 479^1 487^1 491^1 499^1 503^1 509^1 521^1 523^1 541^1 
547^1 557^1 563^1 

The average search block took 4m15s to compute, while the max took 7m7s and the min took 36s.

The search ran for about 55 days (hiccups included), starting at 2021-03-05 05:47:53 and stopping at 2021-04-28 15:06:25. The total AWS bill—including development, and periods where the application was broken but the instances still running, and including instances I wasn’t using but forgot to turn off—was $380.25. When the application was running at its peak, the bill worked out to about $100/month, though I think I could get it much lower by deploying fewer instances, after we made the performance optimizations that reduced the need for resource-heavy instances. There is also the possibility of using something that integrates more tightly with AWS, such as serverless jobs for the cleanup, generate, and process worker jobs.

Plots

When in doubt, plot it out. I started by writing an export function to get the data into a simpler CSV, which for each n only stored \log(n) and the witness value.

I did this once for the final computation results. I’ll call this the “small” database because it only contains the largest witness value in each block. I did it again for an earlier version of the database before we introduced optimizations (I’ll call this the “large” database), which had all witness values for all superabundant numbers processed up to 80 prime factors.. The small database was only a few dozen megabytes in size, and the large database was ~40 GiB, so I had to use postgres cursors to avoid loading the large database into memory. Moreover, then generated CSV was about 8 GiB in size, and so it required a few extra steps to sort it, and get it into a format that could be plotted in a reasonable amount of time.

First, using GNU sort to sort the file by the first column, \log(n)

sort -t , -n -k 1 divisor_sums.csv -o divisor_sums_sorted.csv

Then, I needed to do some simple operations on massive CSV files, including computing a cumulative max, and filtering down to a subset of rows that are sufficient for plotting. After trying to use pandas and vaex, I realized that the old awk command line tool would be great at this job. So I wrote a simple awk script to process the data, and compute data used for the cumulative max witness value plots below.

Then finally we can use vaex to create two plots. The first is a heatmap of witness value counts. The second is a plot of the cumulative max witness value. For the large database:

Witness value heatmap for the large database
The cumulative maximum witness value for the large database.

And for the small database

A heatmap for the witness values for the small database
The cumulative maximum witness value for the small database.

Note, the two ridges disagree slightly (the large database shows a longer flat line than the small database for the same range), because of the way that the superabundant enumeration doesn’t go in increasing order of n. So larger witness values in the range 400-500 are found later.

Estimating the max witness value growth rate

The next obvious question is whether we can fit the curves above to provide an estimate of how far we might have to look to find the first witness value that exceeds the desired 1.782 threshold. Of course, this will obviously depend on the appropriateness of the underlying model.

A simple first guess would be split between two options: the real data is asymptotic like a + b / x approaching some number less than 1.782 (and hence this approach cannot disprove RH), or the real data grows slowly (perhaps doubly-logarithmic) like a + b \log \log x, but eventually surpasses 1.782 (and RH is false). For both cases, we can use scipy’s curve fitting routine as in this pull request.

For the large database (roughly using log n < 400 since that’s when the curve flatlines due to the enumeration order), we get a reciprocal fit of

\displaystyle f(x) \approx 1.77579122 - 2.72527824 / x

and a logarithmic fit of

\displaystyle f(x) \approx 1.65074314 + 0.06642373 \log(\log(x))

The fit of the large database to a + b/x. Note the asymptote of 1.7757 suggests this will not disprove RH.
The fit of the large database to a + b log log x. If this is accurate, we would find the counterexample around log(n) = 1359.

The estimated asymptote is around 1.7757 in the first case, and the second case estimates we’d find an RH counterexample at around log(n) = 1359.

For the small database of only sufficiently large witness values, this time going up to about log(n) \approx 575, the asymptotic approximation is

\displaystyle 1.77481154 -2.31226382 / x

And the logarithmic approximation is

\displaystyle 1.70825262 + 0.03390312 \log(\log(x))

The reciprocal approximation of the small database with asymptote 1.77481154
The logarithmic approximation of the small database with RH counterexample estimate at log(n) = 6663

Now the asymptote is slightly lower, at 1.7748, and the logarithmic model approximates the counterexample can be found at approximately \log(n) = 6663.

Both of the logarithmic approximations suggest that if we want to find an RH counterexample, we would need to look at numbers with thousands of prime factors. The first estimate puts a counterexample at about 2^{1960}, the second at 2^{9612}, so let’s say between 1k and 10k prime factors.

Luckily, we can actually jump forward in the superabundant enumeration to exactly the set of candidates with m prime factors. So it might make sense to jump ahead to, say, 5k prime factors and search in that region. However, the size of a level set of the superabundant enumeration still grows exponentially in m. Perhaps we should (heuristically) narrow down the search space by looking for patterns in the distribution of prime factors for the best witness values we’ve found thus far. Perhaps the values of n with the best witness values tend to have a certain concentration of prime factors.

Exploring prime factorizations

At first, my thought was to take the largest witness values, look at their prime factorizations, and try to see a pattern when compared to smaller witness values. However, other than the obvious fact that the larger witness values correspond to larger numbers (more and larger prime factors), I didn’t see an obvious pattern from squinting at plots.

To go in a completely different direction, I wanted to try out the UMAP software package, a very nice and mathematically sophisticated for high dimensional data visualization. It’s properly termed a dimensionality reduction technique, meaning it takes as input a high-dimensional set of data, and produces as output a low-dimensional embedding of that data that tries to maintain the same shape as the input, where “shape” is in the sense of a certain Riemannian metric inferred from the high dimensional data. If there is structure among the prime factorizations, then UMAP should plot a pretty picture, and perhaps that will suggest some clearer approach.

To apply this to the RH witness value dataset, we can take each pair (n, \sigma(n)/(n \log \log n)), and associate that with a new (high dimensional) data point corresponding to the witness value paired with the number’s prime factorization

\displaystyle (\sigma(n)/(n \log \log n), k_1, k_2, \dots, k_d),

where n = 2^{k_1} 3^{k_2} 5^{k_3} \dots p_d^{k_d}, with zero-exponents included so that all points have the same dimension. This pull request adds the ability to factorize and export the witness values to a CSV file as specified, and this pull request adds the CSV data (using git-lfs), along with the script to run UMAP, the resulting plots shown below, and the saved embeddings as .npy files (numpy arrays).

When we do nothing special to the data and run it through UMAP we see this plot.

UMAP plotted on the raw prime factorization and witness value dataset.

It looks cool, but if you stare at it for long enough (and if you zoom in when you generate the plot yourself in matplotlib) you can convince yourself that it’s not finding much useful structure. The red dots dominate (lower witness values) and the blue dots are kind of spread haphazardly throughout the red regions. The “ridges” along the chart are probably due to how the superabundant enumeration skips lots of numbers, and that’s why it thins out on one end: the thinning out corresponds to fewer numbers processed that are that large since the enumeration is not uniform.

It also seemed like there is too much data. The plot above has some 80k points on it. After filtering down to just those points whose witness values are bigger than 1.769, we get a more manageable plot.

Witness values and prime factors processed with UMAP, where the witness value is at least 1.769.

This is a bit more reasonable. You can see a stripe of blue dots going through the middle of the plot.

Before figuring out how that blue ridge might relate to the prime factor patterns, let’s take this a few steps further. Typically in machine learning contexts, it helps to normalize your data, i.e., to transform each input dimension into a standard Z-score with respect to the set of values seen in that dimension, subtracting the mean and dividing by the standard deviation. Since the witness values are so close to each other, they’re a good candidate for such normalization. Here’s what UMAP plots when we normalize the witness value column only.

UMAP applied to the (normalized) witness values and prime factorizations. Applied to all witness values.

Now this is a bit more interesting! Here the colormap on the right is in units of standard deviation of witness values. You can see a definite bluest region, and it appears that the data is organized into long brushstrokes, where the witness values increase as you move from one end of the stroke to the other. At worst, this suggests that the dataset has structure that a learning algorithm could discover.

Going even one step further, what if we normalize all the columns? Well, it’s not as interesting.

UMAP when normalizing all columns, not just the witness value.

If you zoom in, you can see that the same sort of “brushstroke” idea is occurring here too, with blue on one end and red on the other. It’s just harder to see.

The previous image, zoomed in around a cluster of data

We would like to study the prettiest picture and see if we can determine what pattern of prime numbers the blue region has, if any. The embedding files are stored on github, and I put up (one version of) the UMAP visualization as an interactive plot via this pull request.

I’ve been sitting on this draft for a while, and while this article didn’t make a ton of headway, the pictures will have to do while I’m still dealing with my new home purchase.

Some ideas for next steps:

  • Squint harder at the distributions of primes for the largest witness values in comparison to the rest.
  • See if a machine learning algorithm can regress witness values based on their prime factorizations (and any other useful features I can derive). Study the resulting hypothesis to determine which features are the most important. Use that to refine the search strategy.
  • Try searching randomly in the superabundant enumeration around 1k and 10k prime factors, and see if the best witness values found there match the log-log regression.
  • Since witness values above a given threshold seem to be quite common, and because the UMAP visualization shows some possible “locality” structure for larger witness values, it suggests if there is a counterexample to RH then there are probably many. So a local search method (e.g., local neighborhood search/discrete gradient ascent with random restarts) might allow us to get a better sense for whether we are on the right track.

Until next time!

Regression and Linear Combinations

Recently I’ve been helping out with a linear algebra course organized by Tai-Danae Bradley and Jack Hidary, and one of the questions that came up a few times was, “why should programmers care about the concept of a linear combination?”

For those who don’t know, given vectors v_1, \dots, v_n, a linear combination of the vectors is a choice of some coefficients a_i with which to weight the vectors in a sum v = \sum_{i=1}^n a_i v_i.

I must admit, math books do a poor job of painting the concept as more than theoretical—perhaps linear combinations are only needed for proofs, while the real meat is in matrix multiplication and cross products. But no, linear combinations truly lie at the heart of many practical applications.

In some cases, the entire goal of an algorithm is to find a “useful” linear combination of a set of vectors. The vectors are the building blocks (often a vector space or subspace basis), and the set of linear combinations are the legal ways to combine the blocks. Simpler blocks admit easier and more efficient algorithms, but their linear combinations are less expressive. Hence, a tradeoff.

A concrete example is regression. Most people think of regression in terms of linear regression. You’re looking for a linear function like y = mx+b that approximates some data well. For multiple variables, you have, e.g., \mathbf{x} = (x_1, x_2, x_3) as a vector of input variables, and \mathbf{w} = (w_1, w_2, w_3) as a vector of weights, and the function is y = \mathbf{w}^T \mathbf{x} + b.

To avoid the shift by b (which makes the function affine instead of purely linear; formulas of purely linear functions are easier to work with because the shift is like a pesky special case you have to constantly account for), authors often add a fake input variable x_0 which is always fixed to 1, and relabel b as w_0 to get y = \mathbf{w}^T \mathbf{x} = \sum_i w_i x_i as the final form. The optimization problem to solve becomes the following, where your data set to approximate is \{ \mathbf{x}_1, \dots, \mathbf{x}_k \}.

\displaystyle \min_w \sum_{i=1}^k (y_i - \mathbf{w}^T \mathbf{x}_i)^2

In this case, the function being learned—the output of the regression—doesn’t look like a linear combination. Technically it is, just not in an interesting way.

It becomes more obviously related to linear combinations when you try to model non-linearity. The idea is to define a class of functions called basis functions B = \{ f_1, \dots, f_m \mid f_i: \mathbb{R}^n \to \mathbb{R} \}, and allow your approximation to be any linear combination of functions in B, i.e., any function in the span of B.

\displaystyle \hat{f}(\mathbf{x}) = \sum_{i=1}^m w_i f_i(\mathbf{x})

Again, instead of weighting each coordinate of the input vector with a w_i, we’re weighting each basis function’s contribution (when given the whole input vector) to the output. If the basis functions were to output a single coordinate (f_i(\mathbf{x}) = x_i), we would be back to linear regression.

Then the optimization problem is to choose the weights to minimize the error of the approximation.

\displaystyle \min_w \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

As an example, let’s say that we wanted to do regression with a basis of quadratic polynomials. Our basis for three input variables might look like

\displaystyle \{ 1, x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3, x_1^2, x_2^2, x_3^2 \}

Any quadratic polynomial in three variables can be written as a linear combination of these basis functions. Also note that if we treat this as the basis of a vector space, then a vector is a tuple of 10 numbers—the ten coefficients in the polynomial. It’s the same as \mathbb{R}^{10}, just with a different interpretation of what the vector’s entries mean. With that, we can see how we would compute dot products, projections, and other nice things, though they may not have quite the same geometric sensibility.

These are not the usual basis functions used for polynomial regression in practice (see the note at the end of this article), but we can already do some damage in writing regression algorithms.

A simple stochastic gradient descent

Although there is a closed form solution to many regression problems (including the quadratic regression problem, though with a slight twist), gradient descent is a simple enough solution to showcase how an optimization solver can find a useful linear combination. This code will be written in Python 3.9. It’s on Github.

First we start with some helpful type aliases

from typing import Callable, Tuple, List

Input = Tuple[float, float, float]
Coefficients = List[float]
Gradient = List[float]
Hypothesis = Callable[[Input], float]
Dataset = List[Tuple[Input, float]]

Then define a simple wrapper class for our basis functions

class QuadraticBasisPolynomials:
    def __init__(self):
        self.basis_functions = [
            lambda x: 1,
            lambda x: x[0],
            lambda x: x[1],
            lambda x: x[2],
            lambda x: x[0] * x[1],
            lambda x: x[0] * x[2],
            lambda x: x[1] * x[2],
            lambda x: x[0] * x[0],
            lambda x: x[1] * x[1],
            lambda x: x[2] * x[2],
        ]

    def __getitem__(self, index):
        return self.basis_functions[index]

    def __len__(self):
        return len(self.basis_functions)

    def linear_combination(self, weights: Coefficients) -> Hypothesis:
        def combined_function(x: Input) -> float:
            return sum(
                w * f(x)
                for (w, f) in zip(weights, self.basis_functions)
            )

        return combined_function

basis = QuadraticBasisPolynomials()

The linear_combination function returns a function that computes the weighted sum of the basis functions. Now we can define the error on a dataset, as well as for a single point

def total_error(weights: Coefficients, data: Dataset) -> float:
    hypothesis = basis.linear_combination(weights)
    return sum(
        (actual_output - hypothesis(example)) ** 2
        for (example, actual_output) in data
    )


def single_point_error(
        weights: Coefficients, point: Tuple[Input, float]) -> float:
    return point[1] - basis.linear_combination(weights)(point[0])

We can then define the gradient of the error function with respect to the weights and a single data point. Recall, the error function is defined as

\displaystyle E(\mathbf{w}) = \sum_{j=1}^k (y_j - \hat{f}(\mathbf{x}_j))^2

where \hat{f} is a linear combination of basis functions

\hat{f}(\mathbf{x}_j) = \sum_{s=1}^n w_s f_s(\mathbf{x}_j)

Since we’ll do stochastic gradient descent, the error formula is a bit simpler. We compute it not for the whole data set but only a single random point at a time. So the error is

\displaystyle E(\mathbf{w}) = (y_j - \hat{f}(\mathbf{x}_j))^2

Then we compute the gradient with respect to the individual entries of \mathbf{w}, using the chain rule and noting that the only term of the linear combination that has a nonzero contribution to the gradient for \frac{\partial E}{\partial w_i} is the term containing w_i. This is one of the major benefits of using linear combinations: the gradient computation is easy.

\displaystyle \frac{\partial E}{\partial w_i} = -2 (y_j - \hat{f}(\mathbf{x}_j)) \frac{\partial \hat{f}}{\partial w_i}(\mathbf{x}_j) = -2 (y_j - \hat{f}(\mathbf{x}_j)) f_i(\mathbf{x}_j)

Another advantage to being linear is that this formula is agnostic to the content of the underlying basis functions. This will hold so long as the weights don’t show up in the formula for the basis functions. As an exercise: try changing the implementation to use radial basis functions around each data point. (see the note at the end for why this would be problematic in real life)

def gradient(weights: Coefficients, data_point: Tuple[Input, float]) -> Gradient:
    error = single_point_error(weights, data_point)
    dE_dw = [0] * len(weights)

    for i, w in enumerate(weights):
        dE_dw[i] = -2 * error * basis[i](data_point[0])

    return dE_dw

Finally, the gradient descent core with a debugging helper.

import random

def print_debug_info(step, grad_norm, error, progress):
    print(f"{step}, {progress:.4f}, {error:.4f}, {grad_norm:.4f}")


def gradient_descent(
        data: Dataset,
        learning_rate: float,
        tolerance: float,
        training_callback = None,
) -> Hypothesis:
    weights = [random.random() * 2 - 1 for i in range(len(basis))]
    last_error = total_error(weights, data)
    step = 0
    progress = tolerance * 2
    grad_norm = 1.0

    if training_callback:
        training_callback(step, 0.0, last_error, 0.0)

    while abs(progress) > tolerance or grad_norm > tolerance:
        grad = gradient(weights, random.choice(data))
        grad_norm = sum(x**2 for x in grad)
        
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad[i]

        error = total_error(weights, data)
        progress = error - last_error
        last_error = error
        step += 1

        if training_callback:
            training_callback(step, grad_norm, error, progress)

    return basis.linear_combination(weights)

Next create some sample data and run the optimization

def example_quadratic_data(num_points: int):
    def fn(x, y, z):
        return 2 - 4*x*y + z + z**2

    data = []
    for i in range(num_points):
        x, y, z = random.random(), random.random(), random.random()
        data.append(((x, y, z), fn(x, y, z)))

    return data


if __name__ == "__main__":
    data = example_quadratic_data(30)
    gradient_descent(
        data, 
        learning_rate=0.01, 
        tolerance=1e-06, 
        training_callback=print_debug_info
    )

Depending on the randomness, it may take a few thousand steps, but it typically converges to an error of < 1. Here’s the plot of error against gradient descent steps.

Gradient descent showing log(total error) vs number of steps, for the quadratic regression problem.

Kernels and Regularization

I’ll finish with explanations of the parentheticals above.

The real polynomial kernel. We chose a simple set of polynomial functions. This is closely related to the concept of a “kernel”, but the “real” polynomial kernel uses slightly different basis functions. It scales some of the basis functions by \sqrt{2}. This is OK because a linear combination can compensate by using coefficients that are appropriately divided by \sqrt{2}. But why would one want to do this? The answer boils down to a computational efficiency technique called the “Kernel trick.” In short, it allows you to compute the dot product between two linear combinations of vectors in this vector space without explicitly representing the vectors in the space to begin with. If your regression algorithm uses only dot products in its code (as is true of the closed form solution for regression), you get the benefits of nonlinear feature modeling without the cost of computing the features directly. There’s a lot more mathematical theory to discuss here (cf. Reproducing Kernel Hilbert Space) but I’ll have to leave it there for now.

What’s wrong with the radial basis function exercise? This exercise asked you to create a family of basis functions, one for each data point. The problem here is that having so many basis functions makes the linear combination space too expressive. The optimization will overfit the data. It’s like a lookup table: there’s one entry dedicated to each data point. New data points not in the training would be rarely handled well, since they aren’t in the “lookup table” the optimization algorithm found. To get around this, in practice one would add an extra term to the error corresponding to the L1 or L2 norm of the weight vector. This allows one to ensure that the total size of the weights is small, and in the L1 case that usually corresponds to most weights being zero, and only a few weights (the most important) being nonzero. The process of penalizing the “magnitude” of the linear combination is called regularization.

Searching for RH Counterexamples — Productionizing

We’re ironically searching for counterexamples to the Riemann Hypothesis.

  1. Setting up Pytest
  2. Adding a Database
  3. Search Strategies
  4. Unbounded integers
  5. Deploying with Docker
  6. Performance Profiling
  7. Scaling up

In the last article we rearchitected the application so that we could run as many search instances as we want in parallel, and speed up the application by throwing more compute resources at the problem.

This is good, but comes with a cost. The complexity of this new architecture requires us to manage many different containers and AWS instances. Let’s take a step back. In this article, we’ll focus on improving the “production worthiness” of the application. In particular, we will:

  • Automate running tests on every pull request
  • Add some extra error handling code
  • Clean up stale blocks when worker jobs fail
  • Automate checking Python type hints and test coverage
  • Add static analysis checks
  • Add alerting to tell us when jobs fail
  • Automate the process of updating the application with new code

Automating test running

The main benefit of writing tests is that you can run them. It’s even better when the tests are run automatically on a pull request. It guards buggy changes from breaking the main code branch.

There are many systems that work with GitHub to automate running tests. For this project I used CircleCI, which has a nice free tier. “CI” stands for continuous integration, which is an idea that if you guard your main branch well enough, you can ensure that the main branch is always in a state that can be released or deployed to production servers. And if this is the case, then you might as well have the computer manage regular releases for you (provided all tests pass).

Since we don’t yet have a way to automate releases, we’ll start by running tests on every pull request. This pull request was my first failed attempt and configuring CircleCI based on their tutorial, and this commit fixed it (which was really 15 attempts squashed into one commit post hoc). Originally I thought I could use CircleCI’s built in python testing jobs, but since this project requires a compiled component—the gmp library for unbounded integer arithmetic, and the associated Postgres extension—we need to manually install some other packages. Thankfully, CircleCI uses containers, so the configuration looks a lot like a Dockerfile. Unfortunately, CircleCI’s containers use a different Ubuntu operating system for their base image, which means the packages I had to install have different names than I expected.

So after an hour or two of trial and error, now it works. When you open a new pull request you see the tests run.

And when the tests are finished and passing you see green (otherwise red).

I must admit, most of these tools (see Coveralls below) are primarily designed for Python projects without compiled components. The defaults all tend to fail and you fall back to running arbitrary shell scripts. That’s one good reason to get familiar with standard linux package managers.

Better error handling

A few times the block processing job ran into some errors, and the container died. I don’t have a means to alert when the jobs fail, so I manually checked on them every once in a while. One error was an overflow error as described in this issue. While I might want to dig in a bit more to solve that problem, the effect was that the job died and the block was left in the IN_PROGRESS state. I want errors to result in marking the block as failed, engaging the retry loop, to exercise the possibility that the issue is somehow transient or specific to one broken worker job.

It’s relatively simple, and this pull request does the trick. This commit is the main part.

I added a nice test that demonstrates (again) the benefit of having interfaces. To test this behavior, we need to inject a fake error into the search_strategy.process_block function. One way to do that is to use mocks, and “monkey patch” the process_block method to operate differently during that test. In my opinion, this is bad design, because those tests can quickly become stale, and monkey patching results in odd behavior when you don’t set up the patch just right. Instead, writing a test-specific implementation of the search strategy interface is a safer approach. This commit does that.

Once that was working, I turned to the overflow error, which I fixed in this pull request. Turns out, the integers involved in the computation are so large that their logarithms don’t fit in a single floating point number. So I had to switch to using the log function from the gmp library, which outputs a multi-precision floating point number.

Handle stale blocks

Due to the jobs failing in the previous section, and as an extra safety measure against further failures—including when a job is killed in order to relaunch it with new code—I decided to add an extra job devoted to looking for stale blocks (which have been in_progress for more than an hour) and marking them as failed. This issue describes the idea. Once deployed, the current production system will automatically patch up its past failures. If only it were so easy for we humans. This pull request adds the new job, a new Dockerfile, and an end-to-end test. I deployed the job on the same server as the generator job, since both jobs spend most of their time sleeping.

Type checker tests

So far in the project, I’ve been adding type hints to most of the functions. Type hints are optional in Python, but if you use them, you can add an automated test to ensure that your code matches what the type hints say. This helps you have more confidence that your code is correct. There’s a lot more to say on this subject (software engineers often argue about the value of type checking), but I’ll refrain from providing my opinion for now, assume it’s worthwhile, and show what it’s like to use it.

For this project I’ll use mypy. Since I haven’t been running a type checker on this project yet, I expect the first run will find a lot of errors. When I run mypy riemann/*.py it reports 29 errors. Notably, these two errors suggest problems with our dependencies.

riemann/superabundant.py:6: error: Cannot find implementation or library stub for module named 'gmpy2'
riemann/superabundant.py:7: error: Cannot find implementation or library stub for module named 'numba'

The problem here is that if a dependency doesn’t use type hints, then the type checker won’t know what to do with functions from the dependent module. The options are to create a “stub” for those types, or to tell the type checker to ignore them. Stubbing is supposed to be easy, but for project that have compiled dependencies it can be considerably harder. For now I went with ignoring missing stubs. This pull request does the extra work to run the type checker in the automated test suite, and fix some minor type errors (one of which spotted a real bug).

Test Coverage

Test coverage (or “code coverage”) tools report which lines of your program are exercised by some test. Like type hints, requiring test coverage has advantages and disadvantages. Briefly, I feel code coverage is a good baseline understanding of what’s tested, but coverage does not imply adequate testing, nor does lack of coverage imply poor testing. Rather, it’s a rough guide that can show you glaring holes in what you thought to write tests for. There’s more to say, but not today.

Pytest has plugins for coverage measuring, and when you install them you can just run pytest --cov --cov-report html:cov_html to generate html pages that show line-by-line and file-by-file coverage stats. Mine looked like this.

This project’s coverage report (after fixing up issues described below)

A few obstacles, again coming from the fact that the project has compiled parts. In particular, Numba does not provide coverage support yet. So to get coverage statistics, you have to disable the jit compiler during the test execution. This simply requires setting the environment variable NUMBA_DISABLE_JIT=1. Disabling this made some of the tests run extremely slowly, but some minor changes restored a reasonable runtime (2 minutes, mostly end-to-end tests).

The other problem was the end to end tests. You need to add a special hook to allow the Python coverage tool to see the lines of code executed by subprocesses, which is used heavily by the end to end tests. This pull request shows the steps to overcoming both of these problems.

Also in that pull request is the configuration of Coveralls. This allows code coverage to be computed during the continuous integration test run, and shown on the pull request before merging, with the option to “fail” the code coverage test if the changes in the pull request reduce code coverage too much.

The configuration together between CircleCI and Coveralls required setting the Coveralls project token as an environment variable during CircleCI workflows, and running coveralls after the pytest command finished. Now pull requests look like this.

Now Pull Requests show a coverage report and warn if the coverage percentage is too low.

Static Analysis

Static analysis refers to tools that look at your code and point out problems or suggest improvements. In Python there are scant few good static analysis tools, in part because Python is designed to be very dynamic. It’s not possible for an automated system to be 100% sure what a piece of code does, because even importing a module can do weird things like change the behavior of arithmetic operations. Static analysis plummets in value as false positives increase, but being conservative in making recommendations also limits the value.

Static analysis also refers to the ability to do autocomplete, or reformatting source code to match a particular style, as well as type checking. But I just want to focus on semantic code improvements, not “prettifying” code. A simple example is dead code, which is unused by the application and can be safely deleted. Another is SQL injection vulnerabilities.

For this project I will configure lgtm.com, which is a nice general-purpose static analysis framework that hooks into any GitHub repository without configuration and supports a few important languages. Here is an example run, which resulted in 11 alerts like this one

An example alert found by LGTM

And it’s right! I didn’t need that entire statement because the variable it defined was unused. This one was low-stakes, but another alert pointed to a more serious and subtle bug, related to how the SuperabundantSearchStrategy class mutates its internal state. I can clean that up later, but for now let’s marvel at how the static analysis tool caught the bug with no work on my part. Fantastic!

This pull request addresses the 11 alerts generated by LGTM, most of which are unused imports. LGTM is even nice enough to post comments showing the improvements by PR.

The nice comment left by LGTM when merging a PR that fixes alerts.

Badges

A little bit of flair: I added badges to the project README, so I could show off the fact that tests are passing and I have decent code coverage and code that passes static analysis checks.

Alerting on job failure

Next, I wanted to set up some rudimentary alerting so that I can know when my docker containers fail. If I don’t have this, I have to manually check up on them, and if they fail and I forget, I end up paying money to run the servers for nothing.

There are many approaches to do monitoring and alerting, many software packages, and many opinions. This series is supposed to show you the ideas and problems that software engineers care about, but not go overboard so that we still have time to do math. So I’ll implement what I consider the simplest solution, describe what it lacks, and briefly mention some other approaches.

We need two components: the ability to tell if a container has failed, and the ability to send an email. The former is easy, because the docker runtime provides a pleasant CLI.

docker ps -a --format="{{.Image}}" --filter="status=exited"

docker ps -a shows the status of all running containers, the --filter flag makes the command show only the containers that have exited, and the --format flag makes the command print a restricted subset of the information, in this case just the image name. This makes it suitable for providing as input to a program, since there is no special text added to the output just for human-readability.

The second part, sending an email, is more complicated because popular email services like Gmail have stringent rules. In short, Gmail blocks email from any unknown sender by default (because spam), and so in order to send an email successfully from a program, you need to jump through a bunch of hoops to authenticate the system to send on behalf of an account with a popular provider. You cant just use sendmail. Since I use Gmail, we’ll use it for authentication.

To authenticate a program to send from a Gmail account, you have to generate an “app password” for your Google account, and then configure the mail sender program to use it. The program I chose is the CLI ssmtp, and you can configure it to use your Google account and password via a configuration file typically stored at /etc/ssmtp/ssmtp.conf. The configuration file looks like this (with sensitive data censored with x’s)

root=xxxx@gmail.com
mailhub=smtp.gmail.com:465
FromLineOverride=YES
AuthUser=xxxx@gmail.com
AuthPass=xxxxxxxxxxxxx
TLS_CA_FILE=/etc/ssl/certs/ca-certificates.crt
UseTLS=Yes
rewriteDomain=gmail.com
hostname=xx.xx.xx.xx.xx

Once you have this, you can send email using a command-line invocation like

echo "Email message body goes here" | ssmtp recipient@gmail.com

However, this poses another small problem. While we could manually configure each of these config files on each of our servers, we would like to automate it eventually (next section), and automating it would seem to require us to store the sensitive information in a script that goes into our version-controlled (public!) code repository.

To deal with this, we need a simple way to manage secrets. The typical way to do this is via unix environment variables. In short, environment variables are variables and values you can store in the execution context of any script or terminal session. The most common example is $PATH, which stores a list of directories to search for programs in when you run a program. E.g., if you run docker ps, it will look to see if there’s a program called docker in /user/bin/, and if not try /usr/local/bin/, and so on until exhausting all entries in $PATH. We used environment variables to pass the postgres host ip to our docker containers, but it can also be used for passing secrets to programs to avoid storing passwords in text files that might become public accidentally.

You can set an environment variable using export VARIABLE="any value here". All values are strings. In our case, we will define two environment variables GMAIL_APP_USER—for the email address of the Google account to authenticate as—and GMAIL_APP_PASS for the app password. Then this bash script (run with sudo -E to ensure environment variables are passed from the calling context) will set up the configuration in one go, and this python script will orchestrate the loop of checking docker ps -a and sending an email if it sees any failed containers, passing the password via an environment variable as well. An improvement I discovered later allows us to also avoid storing the password in the ssmtp configuration file, and instead pass it directly from the environment to the python script.

Note this uses the python-dotenv library to easily import environment variables stored in a .env file. We must be careful not to check the .env file into version control, so we update the .gitignore and add an .env.template as a hint to show what environment variables are needed in .env.

Finally, though the deploy script is not quite useful as a standalone script (the docker containers live on different machines), it’s still useful to remind us how to manually deploy. So after exporting the right environment variables we have a few simple new steps to follow to run the monitoring script.

sudo apt install -y python3-pip ssmtp
pip3 install -r alerts/requirements.txt
sudo -E alerts/configure_ssmtp.sh
nohup python3 -m alerts.monitor_docker &

The last line may be new. nohup tells the operating system to “disconnect” the program from the shell that launched it (it’s actually a bit more specific, but has this effect), and the & has the program run in the background. This effectively allows us to launch the monitoring job in an ssh session, and then close the ssh session without terminating the program.

That raises an interesting point: how can we be sure that the monitoring program doesn’t fail? Who watches the watcher? For us, nobody. There is always more engineering you can do if you want a higher level of safety and security, but for us the returns diminish quickly. Eventually we will check the program manually and notice if it broke.

Automating deployment

The final step of our “productionization” process will be to set up some mechanism to deploy our application when we make updates. Like monitoring and alerting, there are quite a few frameworks out there. The main one for us to consider is called a container orchestration system, and the biggest one for Docker is called Kubernetes. If you read through the Kubernetes homepage, you’ll see all kinds of benefits that align with what we want: automatic deployment and container management, including monitoring, restarting jobs, and changing compute resource limits like RAM limit and CPU. These are all things which have taken up our time while building this project! AWS has support for Kubernetes, as well as their own homegrown solutions.

The difficulty around these topics is that you can spend as much (or more) of your time learning the details of the framework than you would making a functional-but-suboptimal version yourself. Such frameworks can also extra layers of frustration because of the additional jargon and learning curve, the mess of configuration, and the tendency for corporate products to change and force you to change in turn. Docker was already a decent investment, it’s asking a lot to invest more!

If I had all the time in the world, I’d go with Kubernetes. But in the next article I want to get back to doing some math and visualization with the data we’ve collected. So let’s pretend we can’t use a framework, and write a basic script that will automate our deployment for us. That is, the script will ssh into the EC2 instances that run each piece of the application, run git pull on each, run docker stop on the containers and remove them, build new container images from the new code, launch those images, and do it all in the right dependency order. The approach won’t be super resilient or have all the engineering ribbons and bows, but in the worst case we can manually fix any problems.

This script does just that. It’s simple enough for starters, and didn’t take me more than an hour to get working. We’ll leave Kubernetes for future work.

That said, in the process of building and testing the deployment script I hit a disappointing problem: docker was deleting the database when I removed the container. Or rather, it started over and I couldn’t find where the data was stored on the host machine. This appears to be because I was letting the Postgres base Dockerfile define a “volume,” which wasn’t saved when I removed the built Docker image (docker prune). So I read up about docker volumes and created a named volume and attached it to the docker container at startup time. The volume won’t disappear when the container stops now, and I can automatically back up the volume (coming soon), so that further blunders don’t result in losing all the data again.

Next time we’ll get back to the mathematical aspects and look for patterns in the data we’ve found so far. I, for one, am optimistic we’ll find the secrets we need to disprove RH.