Binary Search on Graphs

Binary search is one of the most basic algorithms I know. Given a sorted list of comparable items and a target item being sought, binary search looks at the middle of the list, and compares it to the target. If the target is larger, we repeat on the smaller half of the list, and vice versa.

With each comparison the binary search algorithm cuts the search space in half. The result is a guarantee of no more than $ \log(n)$ comparisons, for a total runtime of $ O(\log n)$. Neat, efficient, useful.

There’s always another angle.

What if we tried to do binary search on a graph? Most graph search algorithms, like breadth- or depth-first search, take linear time, and they were invented by some pretty smart cookies. So if binary search on a graph is going to make any sense, it’ll have to use more information beyond what a normal search algorithm has access to.

For binary search on a list, it’s the fact that the list is sorted, and we can compare against the sought item to guide our search. But really, the key piece of information isn’t related to the comparability of the items. It’s that we can eliminate half of the search space at every step. The “compare against the target” step can be thought of a black box that replies to queries of the form, “Is this the thing I’m looking for?” with responses of the form, “Yes,” or, “No, but look over here instead.”


As long as the answers to your queries are sufficiently helpful, meaning they allow you to cut out large portions of your search space at each step, then you probably have a good algorithm on your hands. Indeed, there’s a natural model for graphs, defined in a 2015 paper of Emamjomeh-Zadeh, Kempe, and Singhal that goes as follows.

You’re given as input an undirected, weighted graph $ G = (V,E)$, with weights $ w_e$ for $ e \in E$. You can see the entire graph, and you may ask questions of the form, “Is vertex $ v$ the target?” Responses will be one of two things:

  • Yes (you win!)
  • No, but $ e = (v, w)$ is an edge out of $ v$ on a shortest path from $ v$ to the true target.

Your goal is to find the target vertex with the minimum number of queries.

Obviously this only works if $ G$ is connected, but slight variations of everything in this post work for disconnected graphs. (The same is not true in general for directed graphs)

When the graph is a line, this “reduces” to binary search in the sense that the same basic idea of binary search works: start in the middle of the graph, and the edge you get in response to a query will tell you in which half of the graph to continue.


And if we make this example only slightly more complicated, the generalization should become obvious:


Here, we again start at the “center vertex,” and the response to our query will eliminate one of the two halves. But then how should we pick the next vertex, now that we no longer have a linear order to rely on? It should be clear, choose the “center vertex” of whichever half we end up in. This choice can be formalized into a rule that works even when there’s not such obvious symmetry, and it turns out to always be the right choice.

Definition: median of a weighted graph $ G$ with respect to a subset of vertices $ S \subset V$ is a vertex $ v \in V$ (not necessarily in $ S$) which minimizes the sum of distances to vertices in $ S$. More formally, it minimizes

$ \Phi_S(v) = \sum_{u \in S} d(v, u)$,

where $ d(u,v)$ is the sum of the edge weights along a shortest path from $ v$ to $ u$.

And so generalizing binary search to this query-model on a graph results in the following algorithm, which whittles down the search space by querying the median at every step.

Algorithm: Binary search on graphs. Input is a graph $ G = (V,E)$.

  • Start with a set of candidates $ S = V$.
  • While we haven’t found the target and $ |S| > 1$:
    • Query the median $ v$ of $ S$, and stop if you’ve found the target.
    • Otherwise, let $ e = (v, w)$ be the response edge, and compute the set of all vertices $ x \in V$ for which $ e$ is on a shortest path from $ v$ to $ x$. Call this set $ T$.
    • Replace $ S$ with $ S \cap T$.
  • Output the only remaining vertex in $ S$

Indeed, as we’ll see momentarily, a python implementation is about as simple. The meat of the work is in computing the median and the set $ T$, both of which are slight variants of Dijkstra’s algorithm for computing shortest paths.

The theorem, which is straightforward and well written by Emamjomeh-Zadeh et al. (only about a half page on page 5), is that this algorithm requires only $ O(\log(n))$ queries, just like binary search.

Before we dive into an implementation, there’s a catch. Even though we are guaranteed only $ \log(n)$ many queries, because of our Dijkstra’s algorithm implementation, we’re definitely not going to get a logarithmic time algorithm. So in what situation would this be useful?

Here’s where we use the “theory” trick of making up a fanciful problem and only later finding applications for it (which, honestly, has been quite successful in computer science). In this scenario we’re treating the query mechanism as a black box. It’s natural to imagine that the queries are expensive, and a resource we want to optimize for. As an example the authors bring up in a followup paper, the graph might be the set of clusterings of a dataset, and the query involves a human looking at the data and responding that a cluster should be split, or that two clusters should be joined. Of course, for clustering the underlying graph is too large to process, so the median-finding algorithm needs to be implicit. But the essential point is clear: sometimes the query is the most expensive part of the algorithm.

Alright, now let’s implement it! The complete code is on Github as always.

Always be implementing

We start with a slight variation of Dijkstra’s algorithm. Here we’re given as input a single “starting” vertex, and we produce as output a list of all shortest paths from the start to all possible destination vertices.

We start with a bare-bones graph data structure.

from collections import defaultdict
from collections import namedtuple

Edge = namedtuple('Edge', ('source', 'target', 'weight'))

class Graph:
    # A bare-bones implementation of a weighted, undirected graph
    def __init__(self, vertices, edges=tuple()):
        self.vertices = vertices
        self.incident_edges = defaultdict(list)

        for edge in edges:
                1 if len(edge) == 2 else edge[2]  # optional weight

    def add_edge(self, u, v, weight=1):
        self.incident_edges[u].append(Edge(u, v, weight))
        self.incident_edges[v].append(Edge(v, u, weight))

    def edge(self, u, v):
        return [e for e in self.incident_edges[u] if == v][0]

And then, since most of the work in Dijkstra’s algorithm is tracking information that you build up as you search the graph, we define the “output” data structure, a dictionary of edge weights paired with back-pointers for the discovered shortest paths.

class DijkstraOutput:
    def __init__(self, graph, start):
        self.start = start
        self.graph = graph

        # the smallest distance from the start to the destination v
        self.distance_from_start = {v: math.inf for v in graph.vertices}
        self.distance_from_start[start] = 0

        # a list of predecessor edges for each destination
        # to track a list of possibly many shortest paths
        self.predecessor_edges = {v: [] for v in graph.vertices}

    def found_shorter_path(self, vertex, edge, new_distance):
        # update the solution with a newly found shorter path
        self.distance_from_start[vertex] = new_distance

        if new_distance < self.distance_from_start[vertex]:
            self.predecessor_edges[vertex] = [edge]
        else:  # tie for multiple shortest paths

    def path_to_destination_contains_edge(self, destination, edge):
        predecessors = self.predecessor_edges[destination]
        if edge in predecessors:
            return True
        return any(self.path_to_destination_contains_edge(e.source, edge)
                   for e in predecessors)

    def sum_of_distances(self, subset=None):
        subset = subset or self.graph.vertices
        return sum(self.distance_from_start[x] for x in subset)

The actual Dijkstra algorithm then just does a “breadth-first” (priority-queue-guided) search through $ G$, updating the metadata as it finds shorter paths.

def single_source_shortest_paths(graph, start):
    Compute the shortest paths and distances from the start vertex to all
    possible destination vertices. Return an instance of DijkstraOutput.
    output = DijkstraOutput(graph, start)
    visit_queue = [(0, start)]

    while len(visit_queue) > 0:
        priority, current = heapq.heappop(visit_queue)

        for incident_edge in graph.incident_edges[current]:
            v =
            weight = incident_edge.weight
            distance_from_current = output.distance_from_start[current] + weight

            if distance_from_current <= output.distance_from_start[v]:
                output.found_shorter_path(v, incident_edge, distance_from_current)
                heapq.heappush(visit_queue, (distance_from_current, v))

    return output

Finally, we implement the median-finding and $ T$-computing subroutines:

def possible_targets(graph, start, edge):
    Given an undirected graph G = (V,E), an input vertex v in V, and an edge e
    incident to v, compute the set of vertices w such that e is on a shortest path from
    v to w.
    dijkstra_output = dijkstra.single_source_shortest_paths(graph, start)
    return set(v for v in graph.vertices
               if dijkstra_output.path_to_destination_contains_edge(v, edge))

def find_median(graph, vertices):
    Compute as output a vertex in the input graph which minimizes the sum of distances
    to the input set of vertices
    best_dijkstra_run = min(
         (single_source_shortest_paths(graph, v) for v in graph.vertices),
         key=lambda run: run.sum_of_distances(vertices)
    return best_dijkstra_run.start

And then the core algorithm

QueryResult = namedtuple('QueryResult', ('found_target', 'feedback_edge'))

def binary_search(graph, query):
    Find a target node in a graph, with queries of the form "Is x the target?"
    and responses either "You found the target!" or "Here is an edge on a shortest
    path to the target."
    candidate_nodes = set(x for x in graph.vertices)  # copy

    while len(candidate_nodes) > 1:
        median = find_median(graph, candidate_nodes)
        query_result = query(median)

        if query_result.found_target:
            return median
            edge = query_result.feedback_edge
            legal_targets = possible_targets(graph, median, edge)
            candidate_nodes = candidate_nodes.intersection(legal_targets)

    return candidate_nodes.pop()

Here’s an example of running it on the example graph we used earlier in the post:

Graph looks like this tree, with uniform weights

    a       k
     b     j
     d     l
    e       m
G = Graph(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
           'j', 'k', 'l', 'm'],
               ('a', 'b'),
               ('b', 'c'),
               ('c', 'd'),
               ('d', 'e'),
               ('c', 'f'),
               ('f', 'g'),
               ('g', 'h'),
               ('h', 'i'),
               ('i', 'j'),
               ('j', 'k'),
               ('i', 'l'),
               ('l', 'm'),

def simple_query(v):
    ans = input("is '%s' the target? [y/N] " % v)
    if ans and ans.lower()[0] == 'y':
        return QueryResult(True, None)
        print("Please input a vertex on the shortest path between"
              " '%s' and the target. The graph is: " % v)
        for w in G.incident_edges:
            print("%s: %s" % (w, G.incident_edges[w]))

        target = None
        while target not in G.vertices:
            target = input("Input neighboring vertex of '%s': " % v)

    return QueryResult(
        G.edge(v, target)

output = binary_search(G, simple_query)
print("Found target: %s" % output)

The query function just prints out a reminder of the graph and asks the user to answer the query with a yes/no and a relevant edge if the answer is no.

An example run:

is 'g' the target? [y/N] n
Please input a vertex on the shortest path between 'g' and the target. The graph is:
e: [Edge(source='e', target='d', weight=1)]
i: [Edge(source='i', target='h', weight=1), Edge(source='i', target='j', weight=1), Edge(source='i', target='l', weight=1)]
g: [Edge(source='g', target='f', weight=1), Edge(source='g', target='h', weight=1)]
l: [Edge(source='l', target='i', weight=1), Edge(source='l', target='m', weight=1)]
k: [Edge(source='k', target='j', weight=1)]
j: [Edge(source='j', target='i', weight=1), Edge(source='j', target='k', weight=1)]
c: [Edge(source='c', target='b', weight=1), Edge(source='c', target='d', weight=1), Edge(source='c', target='f', weight=1)]
f: [Edge(source='f', target='c', weight=1), Edge(source='f', target='g', weight=1)]
m: [Edge(source='m', target='l', weight=1)]
d: [Edge(source='d', target='c', weight=1), Edge(source='d', target='e', weight=1)]
h: [Edge(source='h', target='g', weight=1), Edge(source='h', target='i', weight=1)]
b: [Edge(source='b', target='a', weight=1), Edge(source='b', target='c', weight=1)]
a: [Edge(source='a', target='b', weight=1)]
Input neighboring vertex of 'g': f
is 'c' the target? [y/N] n
Please input a vertex on the shortest path between 'c' and the target. The graph is:
Input neighboring vertex of 'c': d
is 'd' the target? [y/N] n
Please input a vertex on the shortest path between 'd' and the target. The graph is:
Input neighboring vertex of 'd': e
Found target: e

A likely story

The binary search we implemented in this post is pretty minimal. In fact, the more interesting part of the work of Emamjomeh-Zadeh et al. is the part where the response to the query can be wrong with some unknown probability.

In this case, there can be many shortest paths that are valid responses to a query, in addition to all the invalid responses. In particular, this rules out the strategy of asking the same query multiple times and taking the majority response. If the error rate is 1/3, and there are two shortest paths to the target, you can get into a situation in which you see three responses equally often and can’t choose which one is the liar.

Instead, the technique Emamjomeh-Zadeh et al. use is based on the Multiplicative Weights Update Algorithm (it strikes again!). Each query gives a multiplicative increase (or decrease) on the set of nodes that are consistent targets under the assumption that query response is correct. There are a few extra details and some postprocessing to avoid unlikely outcomes, but that’s the basic idea. Implementing it would be an excellent exercise for readers interested in diving deeper into a recent research paper (or to flex their math muscles).

But even deeper, this model of “query and get advice on how to improve” is a classic  learning model first formally studied by Dana Angluin (my academic grand-advisor). In her model, one wants to design an algorithm to learn a classifier. The allowed queries are membership and equivalence queries. A membership is essentially, “What’s its label of this element?” and an equivalence query has the form, “Is this the right classifier?” If the answer is no, a mislabeled example is provided.

This is different from the usual machine learning assumption, because the learning algorithm gets to construct an example it wants to get more information about, instead of simply relying on a randomly generated subset of data. The goal is to minimize the number of queries before the target hypothesis is learned exactly. And indeed, as we saw in this post, if you have a little extra time to analyze the problem space, you can craft queries that extract quite a lot of information.

Indeed, the model we presented here for binary search on graphs is the natural analogue of an equivalence query for a search problem: instead of a mislabeled counterexample, you get a nudge in the right direction toward the target. Pretty neat!

There are a few directions we could take from here: (1) implement the Multiplicative Weights version of the algorithm, (2) apply this technique to a problem like ranking or clustering, or (3) cover theoretical learning models like membership and equivalence queries in more detail. What interests you?

Until next time!

Learning a single-variable polynomial, or the power of adaptive queries

Problem: Alice chooses a secret polynomial $ p(x)$ with nonnegative integer coefficients. Bob wants to discover this polynomial by querying Alice for the value of $ p(x)$ for some integer $ x$ of Bob’s choice. What is the minimal number of queries Bob needs to determine $ p(x)$ exactly?

Solution: Two queries. The first is $ p(1)$, and if we call $ N = p(1) + 1$, then the second query is $ p(N)$.

To someone who is familiar with polynomials, this may seem shocking, and I’ll explain why it works in a second. After all, it’s very easy to prove that if Bob gives Alice all of his queries at the same time (if the queries are not adaptive), then it’s impossible to discover what $ p(x)$ is using fewer than $ \textup{deg}(p) + 1$ queries. This is due to a fact called polynomial interpolation, which we’ve seen on this blog before in the context of secret sharing. Specifically, there is a unique single-variable degree $ d$ polynomial passing through $ d+1$ points (with distinct $ x$-values). So if you knew the degree of $ p$, you could determine it easily. But Bob doesn’t know the degree of the polynomial, and there’s no way he can figure it out without adaptive queries! Indeed, if Bob tries and gives a set of $ d$ queries, Alice could have easily picked a polynomial of degree $ d+1$. So it’s literally impossible to solve this problem without adaptive queries.

The lovely fact is that once you allow adaptiveness, the number of queries you need doesn’t even depend on the degree of the secret polynomial!

Okay let’s get to the solution. It was crucial that our polynomial had nonnegative integer coefficients, because we’re going to do a tiny bit of number theory. Let $ p(x) = a_0 + a_1 x + \dots + a_d x^d$. First, note that $ p(1)$ is exactly the sum of the coefficients $ \sum_i a_i$, and in particular $ p(1) + 1$ is larger than any single coefficient. So call this $ N$, and query $ p(N)$. This gives us a number $ y_0$ of the form

$ \displaystyle y_0 = a_0 + a_1N + a_2N^2 + \dots + a_dN^d$

And because $ N$ is so big, we can compute $ a_0$ easily by computing $ y_0 \mod N$. Now set $ y_1 = (y_0 – a_0) / N$, and this has the form $ a_1 + a_2N + \dots + a_dN^{d-1}$. We can compute modulus again to get $ a_1$, and repeat until we have all the coefficients. We’ll stop once we get a $ y_i$ that is zero.

[Addendum 2018-02-14: implementation on github]

As a small technical note, this is a polynomial-time algorithm in the number of bits needed to write down $ p(x)$. So this demonstrates the power of adaptive queries: we get from something which is uncomputable with any number of queries to something which is efficiently computable with a constant number of queries.

The obvious follow-up question is: can you come up with an efficient algorithm if we allow the coefficients to be negative integers?

Probably Approximately Correct — a Formal Theory of Learning

In tackling machine learning (and computer science in general) we face some deep philosophical questions. Questions like, “What does it mean to learn?” and, “Can a computer learn?” and, “How do you define simplicity?” and, “Why does Occam’s Razor work? (Why do simple hypotheses do well at modelling reality?)” In a very deep sense, learning theorists take these philosophical questions — or at least aspects of them — give them fleshy mathematical bodies, and then answer them with theorems and proofs. These fleshy bodies might have imperfections or they might only address one small part of a big question, but the more we think about them the closer we get to robust answers and, as a reader of this blog might find relevant, useful applications. But the glamorous big-picture stuff is an important part of the allure of learning theory.

But before we jump too far ahead of ourselves, we need to get through the basics. In this post we’ll develop the basic definitions of the theory of PAC-learning. It will be largely mathematical, but fear not: we’ll mix in a wealth of examples to clarify the austere symbols.

Leslie Valiant

Leslie Valiant

Some historical notes: PAC learning was invented by Leslie Valiant in 1984, and it birthed a new subfield of computer science called computational learning theory and won Valiant some of computer science’s highest awards. Since then there have been numerous modifications of PAC learning, and also models that are entirely different from PAC learning. One other goal of learning theorists (as with computational complexity researchers) is to compare the power of different learning models. We’ll discuss this more later once we have more learning models under our belts.

If you’re interested in following along with a book, the best introduction to the subject is the first few chapters of An Introduction to Computational Learning Theory.

So let’s jump right in and see what this award-winning definition is all about.

Learning Intervals

The core idea of PAC-learnability is easy to understand, and we’ll start with a simple example to explain it. Imagine a game between two players. Player 1 generates numbers $ x$ at random in some fixed way, and in Player 1’s mind he has an interval $ [a,b]$. Whenever Player 1 gives out an $ x$, he must also say whether it’s in the interval (that is, whether $ a \leq x \leq b$). Let’s say that Player 1 gives reports a 1 if $ x$ is in the interval, and a 0 otherwise. We’ll call this number the label of $ x$, and call the pair of ($ x$, label) a sample, or an example. We recognize that the zero and one correspond to “yes” and “no” answers to some question (Is this email spam? Does the user click on my ad? etc.), and so sometimes the labels are instead $ \pm 1$, and referred to as “positive” or “negative” examples. We’ll use the positive/negative terminology here, so positive is a 1 and negative is a 0.

Player 2 (we’re on her side) sees a bunch of samples and her goal is to determine $ a$ and $ b$. Of course Player 2 can’t guess the interval exactly if the endpoints are real numbers, because Player 1 only gives out finitely many samples. But whatever interval Player 2 does guess at the end can be tested against Player 1’s number-producing scheme. That is, we can compute the probability that Player 2’s interval will give an incorrect label if Player 1 were to continue giving out numbers indefinitely. If this error is small (taking into account how many samples were given), then Player 2 has “learned” the interval. And if Player 2 plays this game over and over and usually wins (no matter what strategy or interval Player 1 decides to use!), then we say this problem is PAC-learnable.

PAC stands for Probably Approximately Correct, and our number guessing game makes it clear what this means. Approximately correct means the interval is close enough to the true interval that the error will be small on new samples, and Probably means that if we play the game over and over we’ll usually be able to get a good approximation. That is, we’ll find an approximately good interval with high probability

Indeed, one might already have a good algorithm in mind to learn intervals. Simply take the largest and smallest positive examples and use those as the endpoints of your interval. It’s not hard to see why this works, but if we want to prove it (or anything) is PAC-learnable, then we need to solidify these ideas with mathematical definitions.

Distributions and Hypotheses

First let’s settle the random number generation scheme. In full generality, rather than numbers we’ll just have some set $ X$. Could be finite, could be infinite, no restrictions. And we’re getting samples randomly generated from $ X$ according to some fixed but arbitrary and unknown distribution $ D$. To be completely rigorous, the samples are independent and identically distributed (they’re all drawn from the same $ D$ and independently so). This is Player 1’s dastardly decision: how should he pick his method to generate random numbers so as to bring Player 2’s algorithm to the most devastating ruin?

So then Player 1 is reduced to a choice of distribution $ D$ over $ X$, and since we said that Player 2’s algorithm has to do well with high probability no matter what $ D$ is, then the definition becomes something like this:

A problem is PAC-learnable if there is an algorithm $ A$ which will likely win the game for all distributions $ D$ over $ X$.

Now we have to talk about how “intervals” fit in to the general picture. Because if we’re going to talk about learning in general, we won’t always be working with intervals to make decisions. So we’re really saying that Player 1 picks some function $ c$ for classifying points in $ X$ as a 0 or a 1. We’ll call this a concept, or a target, and it’s the thing Player 2 is trying to learn. That is, Player 2 is producing her own function $ h$ that also labels points in $ X$, and we’re comparing it to $ c$. We call a function generated by Player 2 a hypothesis (hence the use of the letter h).

And how can we compare them? Well, we can compute the probability that they differ. We call this the error:

$ \displaystyle \textup{err}_{c,D}(h) = \textup{P}_D(h(x) \neq c(x))$

One would say this aloud: “The error of the hypothesis $ h$ with respect to the concept $ c$ and the distribution $ D$ is the probability over $ x$ drawn via $ D$ that $ h(x)$ and $ c(x)$ differ”. Some might write the “differ” part as the symmetric difference of the two functions as sets. And then it becomes a probability density, if that’s your cup of tea (it’s not mine).

So now for a problem to be PAC-learnable we can say something like,

A problem is PAC-learnable if there is an algorithm $ A$ which for any distribution $ D$ and any concept $ c$ will, when given some independently drawn samples and with high probability, produce a hypothesis whose error is small.

There are still a few untrimmed hedges in this definition (like “some,” “small,” and “high”), but there’s still a more important problem: there’s just too many possible concepts! Even for finite sets: there are $ 2^n$ $ \left \{ 0,1 \right \}-$valued functions on a set of $ n$ elements, and if we hope to run in polynomial time we can only possible express a miniscule fraction of those functions. Going back to the interval game, it’d be totally unreasonable to expect Player 2 to be able to get a reasonable hypothesis (using intervals or not!) if Player 1’s chosen concept is arbitrary. (The mathematician in me is imaging some crazy rule using non-measurable sets, but just suffice it to say: you might think you know everything about the real numbers, but you don’t.)

So we need to boil down what possibilities there are for the concepts $ c$ and the allowed expressive power of the learner. This is what concept classes are for.

Concept Classes

concept class $ \mathsf{C}$ over $ X$ is a family of functions $ X \to \left \{ 0,1 \right \}$. That’s all.

No, okay, there’s more to the story, but for now it’s just a shift of terminology. Now we can define the class of labeling functions induced by a choice of intervals. One might do this by taking $ \mathsf{C}$ to be the set of all characteristic functions of intervals, $ \chi_{[a,b]}(x) = 1$ if $ a \leq x \leq b$ and 0 otherwise. Now the concept class becomes the sole focus of our algorithm. That is, the algorithm may use knowledge of the concept class to produce its hypotheses. So our working definition becomes:

A concept class $ \mathsf{C}$ is PAC-learnable if there is an algorithm $ A$ which, for any distribution $ D$ of samples and any concept $ c \in \mathsf{C}$, will with high probability produce a hypothesis $ h \in \mathsf{C}$ whose error is small.

As a short prelude to future posts: we’ll be able to prove that, if the concept class is sufficiently simple (think, “low dimension”) then any algorithm that does something reasonable will be able to learn the concept class. But that will come later. Now we turn to polishing the rest of this definition.

Probably Approximately Correct Learning

We don’t want to phrase the definition in terms of games, so it’s time to remove the players from the picture. What we’re really concerned with is whether there’s an algorithm which can produce good hypotheses when given random data. But we have to solidify the “giving” process and exactly what limits are imposed on the algorithm.

It sounds daunting, but the choices are quite standard as far as computational complexity goes. Rather than say the samples come as a big data set as they might in practice, we want the algorithm to be able to decide how much data it needs. To do this, we provide it with a query function which, when accessed, spits out a sample in unit time. Then we’re interested in learning the concept with a reasonable number of calls to the query function.

And now we can iron out those words like “some” and “small” and “high” in our working definition. Since we’re going for small error, we’ll introduce a parameter $ 0 < \varepsilon < 1/2$ to represent our desired error bound. That is, our goal is to find a hypothesis $ h$ such that $ \textup{err}_{c,D}(h) \leq \varepsilon $ with high probability. And as $ \varepsilon$ gets smaller and smaller (as we expect more and more of it), we want to allow our algorithm more time to run, so we limit our algorithm to run in time and space polynomial in $ 1/\varepsilon$.

We need another parameter to control the “high probability” part as well, so we’ll introduce $ 0 < \delta < 1/2$ to represent the small fraction of the time we allow our learning algorithm to have high error. And so our goal becomes to, with probability at least $ 1-\delta$, produce a hypothesis whose error is less than $ \varepsilon$. In symbols, we want

$ \textup{P}_D (\textup{err}_{c,D}(h) \leq \varepsilon) > 1 – \delta$

Note that the $ \textup{P}_D$ refers to the probability over which samples you happen to get when you call the query function (and any other random choices made by the algorithm). The “high probability” hence refers to the unlikely event that you get data which is unrepresentative of the distribution generating it. Note that this is not the probability over which distribution is chosen; an algorithm which learns must still learn no matter what $ D$ is.

And again as we restrict $ \delta$ more and more, we want the algorithm to be allowed more time to run. So we require the algorithm runs in time polynomial in both $ 1/\varepsilon, 1/\delta$.

And now we have all the pieces to state the full definition.

Definition: Let $ X$ be a set, and $ \mathsf{C}$ be a concept class over $ X$. We say that $ \mathsf{C}$ is PAC-learnable if there is an algorithm $ A(\varepsilon, \delta)$ with access to a query function for $ \mathsf{C}$ and runtime $ O(\textup{poly}(\frac{1}{\varepsilon}, \frac{1}{\delta}))$, such that for all $ c \in \mathsf{C}$, all distributions $ D$ over $ X$, and all inputs $ \varepsilon, \delta$ between 0 and $ 1/2$, the probability that $ A$ produces a hypothesis $ h$ with error at most $ \varepsilon$ is at least $ 1- \delta$. In symbols,

$ \displaystyle \textup{P}_{D}(\textup{P}_{x \sim D}(h(x) \neq c(x)) \leq \varepsilon) \geq 1-\delta$

where the first $ \textup{P}_D$ is the probability over samples drawn from $ D$ during the execution of the program to produce $ h$. Equivalently, we can express this using the error function,

$ \displaystyle \textup{P}_{D}(\textup{err}_{c,D}(h) \leq \varepsilon) \geq 1-\delta$


Intervals are PAC-Learnable

Now that we have this definition we can return to our problem of learning intervals on the real line. Our concept class is the set of all characteristic functions of intervals (and we’ll add in the empty set for the default case). And the algorithm we proposed to learn these intervals was quite simple: just grab a bunch of sample points, take the biggest and smallest positive examples, and use those as the endpoints of your hypothesis interval.

Let’s now prove that this algorithm can learn any interval with any distribution over real numbers. This proof will have the following form:

  • Leave the number of samples you pick arbitrary, say $ m$.
  • Figure out the probability that the total error of our produced hypothesis is $ > \varepsilon$ in terms of $ m$.
  • Pick $ m$ to be sufficiently large that this event (failing to achieve low error) happens with small probability.

So fix any distribution $ D$ over the real line and say we have our $ m$ samples, we picked the max and min, and our interval is $ I = [a_1,b_1]$ when the target concept is $ J = [a_0, b_0]$. We can notice one thing, that our hypothesis is contained in the true interval, $ I \subset J$. That’s because the sample never lie, so the largest sample we saw must be smaller than the largest possible positive example, and vice versa. In other words $ a_0 < a_1 < b_1 < b_0$. And so the probability of our hypothesis producing an error is just the probability that $ D$ produces a positive example in the two intervals $ A = [a_0, a_1], B = [b_1, b_0]$.

This is all setup for the second bullet point above. The total error is at most the sum of the probabilities that a positive sample shows up in each of $ A, B$ separately.

$ \displaystyle \textup{err}_{J, D} \leq \textup{P}_{x \sim D}(x \in A) + \textup{P}_{x \sim D}(x \in B)$

Here’s a picture.

The two green intervals are our regions where error can occur.

The two green intervals are the regions where error can occur.

If we can guarantee that each of the green pieces is smaller than $ \varepsilon / 2$ with high probability, then we’ll be done. Let’s look at $ A$, and the same argument will hold for $ B$. Define $ A’$ to be the interval $ [a_0, y]$ which is so big that the probability that a positive example is drawn from $ A’$ under $ D$ is exactly $ \varepsilon / 2$. Here’s another picture to clarify that.

The pink region A' has total probability weight epsilon/2, and if the green region A is larger, we risk too much error to be PAC-learnable.

The pink region A’ has total probability weight epsilon/2, and if the green region A is larger, we risk too much error to be PAC-learnable.

We’ll be in great shape if it’s already the case that $ A \subset A’$, because that implies the probability we draw a positive example from $ A$ is at most $ \varepsilon / 2$. So we’re worried about the possibility that $ A’ \subset A$. But this can only happen if we never saw a point from $ A’$ as a sample during the run of our algorithm. Since we had $ m$ samples, we can compute in terms of $ m$ the probability of never seeing a sample from $ A’$.

The probability of a single sample not being in $ A’$ is just $ 1 – \varepsilon/2$ (by definition!). Recalling our basic probability theory, two draws are independent events, and so the probability of missing $ A’$ $ m$ times is equal to the product of the probabilities of each individual miss. That is, the probability that our chosen $ A$ contributes error greater than $ \varepsilon / 2$ is at most

$ \displaystyle \textup{P}_D(A’ \subset A) \leq (1 – \varepsilon / 2)^m$

The same argument applies to $ B$, so we know by the union bound that the probability of error $ > \varepsilon / 2$ occurring in either $ A$ or $ B$ is at most the sum of the probabilities of large error in each piece, so that

$ \displaystyle \textup{P}_D(\textup{err}_{J,D}(I) > \varepsilon) \leq 2(1 – \varepsilon / 2)^m$

Now for the third bullet. We want the chance that the error is big to be smaller than $ \delta$, so that we’ll have low error with probability $ > 1 – \delta$. So simply set

$ \displaystyle 2(1 – \varepsilon / 2)^m \leq \delta$

And solve for $ m$. Using the fact that $ (1-x) \leq e^{-x}$ (which is proved by Taylor series), it’s enough to solve

$ \displaystyle 2e^{-\varepsilon m/2} \leq \delta$,

And a fine solution is $ m \geq (2 / \varepsilon \log (2 / \delta))$.

Now to cover all our bases: our algorithm simply computes $ m$ for its inputs $ \varepsilon, \delta$, queries that many samples, and computes the tightest-fitting interval containing the positive examples. Since the number of samples is polynomial in $ 1/\varepsilon, 1/\delta$ (and our algorithm doesn’t do anything complicated), we comply with the time and space bounds. And finally we just proved that the chance our algorithm will misclassify an $ \varepsilon$ fraction of new points drawn from $ D$ is at most $ \delta$. So we have proved the theorem:

Theorem: Intervals on the real line are PAC-learnable.

$ \square$

As an exercise, see if you can generalize the argument to axis-aligned rectangles in the plane. What about to arbitrary axis-aligned boxes in $ d$ dimensional space? Where does $ d$ show up in the number of samples needed? Is this still efficient?

Comments and Previews

There are a few more technical details we’ve ignored in the course of this post, but the important idea are clear. We have a formal model of learning which allows for certain pre-specified levels of imperfection, and we proved that one can learn how to recognize intervals in this model. It’s a far cry from decision trees and neural networks, but it’s a solid foundation to build upon.

However, the definition we presented here for PAC-learning is not quite complete. It turns out, as we’ll see in the next post, that forcing the PAC-learning algorithm to produce hypotheses from the same concept class it’s trying to learn makes some problems that should be easy hard. This is just because it could require the algorithm to represent some simple hypothesis in a convoluted form, and in the next post we’ll see that this is not an idle threat, and we’ll make a slight modification to the PAC definition.

However PAC-learning is far from sacred. In particular, the choice that we require a single algorithm to succeed no matter what the distribution $ D$ was a deliberate choice, and it’s quite a strong requirement for a learning algorithm. There are also versions of PAC that remove other assumptions from the definition, such that the oracle for the target concept is honest (noise-free) and that there is any available hypothesis that is actually true (realizability). These all give rise to interesting learning models and discovering the relationship between the models is the end goal.

And so the kinds of questions we ask are: can we classify all PAC-learnable problems? Can we find a meta-algorithm that would work on any PAC-learnable concept class given some assumptions? How does PAC-learning relate to other definitions of learning? Say, one where we don’t require it to work for every distribution; would that really allow us to solve more problems?

It’s a question of finding out the deep truths of mathematics now, but we promise that this series will soon come back around to practical applications, for learning theory naturally entails the design and analysis of fascinating algorithms.

Until next time!