“Practical Math” Preview: Collect Sensitive Survey Responses Privately

This is a draft of a chapter from my in-progress book, Practical Math for Programmers: A Tour of Mathematics in Production Software.

Tip: Determine an aggregate statistic about a sensitive question, when survey respondents do not trust that their responses will be kept secret.

Solution:

import random

def respond_privately(true_answer: bool) -> bool:
    '''Respond to a survey with plausible deniability about your answer.'''
    be_honest = random.random() < 0.5
    random_answer = random.random() < 0.5
    return true_answer if be_honest else random_answer

def aggregate_responses(responses: List[bool]) -> Tuple[float, float]:
    '''Return the estimated fraction of survey respondents that have a truthful
    Yes answer to the survey question.
    '''
    yes_response_count = sum(responses)
    n = len(responses)
    mean = 2 * yes_response_count / n - 0.5
    # Use n-1 when estimating variance, as per Bessel's correction.
    variance = 3 / (4 * (n - 1))
    return (mean, variance)

In the late 1960’s, most abortions were illegal in the United States. Daniel G. Horvitz, a statistician at The Research Triangle Institute in North Carolina and a leader in survey design for social sciences, was tasked with estimating how many women in North Carolina were receiving illegal abortions. The goal was to inform state and federal policymakers about the statistics around abortions, many of which were unreported, even when done legally.

The obstacles were obvious. As Horvitz put it, “a prudent woman would not divulge to a stranger the fact that she was party to a crime for which she could be prosecuted.” [Abernathy70] This resulted in a strong bias in survey responses. Similar issues had plagued surveys of illegal activity of all kinds, including drug abuse and violent crime. Lack of awareness into basic statistics about illegal behavior led to a variety of misconceptions, such as that abortions were not frequently sought out.

Horvitz worked with biostatisticians James Abernathy and Bernard Greenberg to test out a new method to overcome this obstacle, without violating the respondent’s privacy or ability to plausibly deny illegal behavior. The method, called randomized response, was invented by Stanley Warner in 1965, just a few years earlier. [Warner65] Warner’s method was a bit different from what we present in this Tip, but both Warner’s method and the code sample above use the same strategy of adding randomization to the survey.

The mechanism, as presented in the code above, requires respondents to start by flipping a coin. If heads, they answer the sensitive question truthfully. If tails, they flip a second coin to determine how to answer the question—heads resulting in a “yes” answer, tails in a “no” answer. Naturally, the coin flips are private and controlled by the respondent. And so if a respondent answers “Yes” to the question, they may plausibly claim the “Yes” was determined by the coin, preserving their privacy. The figure below describes this process as a diagram.

A branching diagram showing the process a survey respondent takes to record their response.

Another way to describe the outcome is to say that each respondent’s answer is a single bit of information that is flipped with probability 1/4. This is half way between two extremes on the privacy/accuracy tradeoff curve. The first extreme is a “perfectly honest” response, where the bit is never flipped and all information is preserved. The second extreme has the bit flipped with probability 1/2, which is equivalent to ignoring the question and choosing your answer completely at random, losing all information in the aggregate responses. In this perspective, the aggregate survey responses can be thought of as a digital signal, and the privacy mechanism adds noise to that signal.

It remains to determine how to recover the aggregate signal from these noisy responses. In other words, the surveyor cannot know any individual’s true answer, but they can, with some extra work, estimate statistics about the underlying population by correcting for the statistical bias. This is possible because the randomization is well understood. The expected fraction of “Yes” answers can be written as a function of the true fraction of “Yes” answers, and hence the true fraction can be solved for. In this case, where the random coin is fair, that formula is as follows (where $ \mathbf{P}$ stands for “the probability of”).

$ \displaystyle \mathbf{P}(\textup{Yes answer}) = \frac{1}{2} \mathbf{P}(\textup{Truthful yes answer}) + \frac{1}{4}$

And so we solve for $ \mathbf{P}(\textup{Truthful yes answer})$

$ \displaystyle \mathbf{P}(\textup{Truthful yes answer}) = 2 \mathbf{P}(\textup{Yes answer}) – \frac{1}{2}$

We can replace the true probability $ \mathbf{P}(\textup{Yes answer})$ above with our fraction of “Yes” responses from the survey, and the result is an estimate $ \hat{p}$ of $ \mathbf{P}(\textup{Truthful yes answer})$. This estimate is unbiased, but has additional variance—beyond the usual variance caused by picking a finite random sample from the population of interest—introduced by the randomization mechanism.

With a bit of effort, one can calculate that the variance of the estimate is

$ \displaystyle \textup{Var}(\hat{p}) = \frac{3}{4n}$

And via Chebyshev’s inequality, which bounds the likelihood that an estimator is far away from its expectation, we can craft a confidence interval and determine the needed sample sizes. Specifically, the estimate $ \hat{p}$ has additive error at most $ q$ with probability at most $ \textup{Var}(\hat{p}) / q^2$. This implies that for a confidence of $ 1-c$, one requires at least $ n \geq 3 / (4 c q^2)$ samples. For example, to achieve error 0.01 with 90 percent confidence ($ c=0.1$), one requires 7,500 responses.

Horvitz’s randomization mechanism didn’t use coin flips. Instead they used an opaque box with red or blue colored balls which the respondent, who was in the same room as the surveyor, would shake and privately reveal a random color through a small window facing away from the surveyor. The statistical principle is the same. Horvitz and his associates surveyed the women about their opinions of the privacy protections of this mechanism. When asked whether their friends would answer a direct question about abortion honestly, over 80% either believed their friends would lie, or were unsure. [footnote: A common trick in survey methodology when asking someone if they would be dishonest is to instead ask if their friends would be dishonest. This tends to elicit more honesty, because people are less likely to uphold a false perception of the moral integrity of others, and people also don’t realize that their opinion of their friends correlates with their own personal behavior and attitudes. In other words, liars don’t admit to lying, but they think lying is much more common than it really is.] But 60% were convinced there was no trick involved in the randomization, while 20% were unsure and 20% thought there was a trick. This suggests many people were convinced that Horvitz’s randomization mechanism provided the needed safety guarantees to answer honestly.

Horvitz’s survey was a resounding success, both for randomized response as a method and for measuring abortion prevalence. [Abernathy70] They estimated the abortion rate at about 22 per 100 conceptions, with a distinct racial bias—minorities were twice as likely as whites to receive an abortion. Comparing their findings to a prior nationwide study from 1955—the so-called Arden House estimate—which gave a range of between 200,000 and 1.2 million abortions per year, Horvitz’s team estimated more precisely that there were 699,000 abortions in 1955 in the United States, with a reported standard deviation of about 6,000, less than one percent. For 1967, the year of their study, they estimated 829,000.

Their estimate was referenced widely in the flurry of abortion law and court cases that followed due to a surging public interest in the topic. For example, it is cited in the 1970 California Supreme Court opinion for the case Ballard v. Anderson, which concerned whether a minor needs parental consent to receive an otherwise legal abortion. [Ballard71, Roemer71] It was also cited in amici curiae briefs submitted to the United States Supreme Court in 1971 for Roe v. Wade, the famous case that invalidated most U.S. laws making abortion illegal. One such brief was filed jointly by the country’s leading women’s rights organizations like the National Organization for Women. Citing Horvitz for this paragraph, it wrote, [Womens71]

While the realities of law enforcement, social and public health problems posed by abortion laws have been openly discussed […] only within a period of not more than the last ten years, one fact appears undeniable, although unverifiable statistically. There are at least one million illegal abortions in the United States each year. Indeed, studies indicate that, if the local law still has qualifying requirements, the relaxation in the law has not diminished to any substantial extent the numbers in which women procure illegal abortions.

It’s unclear how the authors got this one million number (Horvitz’s estimate was 20% less for 1967), nor what they meant by “unverifiable statistically.” It may have been a misinterpretation of the randomized response technique. In any event, randomized response played a crucial role in providing a foundation for political debate.

Despite Horvitz’s success, and decades of additional research on crime, drug use, and other sensitive topics, randomized response mechanisms have been applied poorly. In some cases, the desired randomization is inextricably complex, such as when requiring a continuous random number. In these cases, a manual randomization mechanism is too complex for a respondent to use accurately. Trying to use software-assisted devices can help, but can also produce mistrust in the interviewee. See [Rueda16] for additional discussion of these pitfalls and what software packages exist for assisting in using randomized response. See [Fox16] for an analysis of the statistical differences between the variety of methods used between 1970 and 2010.

In other contexts, analogues to randomized response may not elicit the intended effect. In the 1950’s, Utah used death by firing squad as capital punishment. To avoid a guilty conscience of the shooters, one of five marksmen was randomly given a blank, providing him some plausible deniability that he knew he had delivered the killing shot. However, this approach failed on two counts. First, once a shot was fired the marksman could tell whether the bullet was real based on the recoil. Second, a 20% chance of a blank was not enough to dissuade a guilty marksman from purposely missing. In the 1951 execution of Elisio Mares, all four real bullets missed the condemned man’s heart, hitting his chest, stomach, and hip. He died, but it was neither painless nor instant.

Of many lessons one might draw from the botched execution, one is that randomization mechanisms must take into account both the psychology of the participants as well as the severity of a failed outcome.

References

@book{Fox16,
  title = {{Randomized Response and Related Methods: Surveying Sensitive Data}},
  author = {James Alan Fox},
  edition = {2nd},
  year = {2016},
  doi = {10.4135/9781506300122},
}

@article{Abernathy70,
  author = {Abernathy, James R. and Greenberg, Bernard G. and Horvitz, Daniel G.
            },
  title = {{Estimates of induced abortion in urban North Carolina}},
  journal = {Demography},
  volume = {7},
  number = {1},
  pages = {19-29},
  year = {1970},
  month = {02},
  issn = {0070-3370},
  doi = {10.2307/2060019},
  url = {https://doi.org/10.2307/2060019},
}

@article{Warner65,
  author = {Stanley L. Warner},
  journal = {Journal of the American Statistical Association},
  number = {309},
  pages = {63--69},
  publisher = {{American Statistical Association, Taylor \& Francis, Ltd.}},
  title = {Randomized Response: A Survey Technique for Eliminating Evasive
           Answer Bias},
  volume = {60},
  year = {1965},
}

@article{Ballard71,
  title = {{Ballard v. Anderson}},
  journal = {California Supreme Court L.A. 29834},
  year = {1971},
  url = {https://caselaw.findlaw.com/ca-supreme-court/1826726.html},
}

@misc{Womens71,
  title = {{Motion for Leave to File Brief Amici Curiae on Behalf of Women’s
           Organizations and Named Women in Support of Appellants in Each Case,
           and Brief Amici Curiae.}},
  booktitle = {{Appellate Briefs for the case of Roe v. Wade}},
  number = {WL 128048},
  year = {1971},
  publisher = {Supreme Court of the United States},
}

@article{Roemer71,
  author = {R. Roemer},
  journal = {Am J Public Health},
  pages = {500--509},
  title = {Abortion law reform and repeal: legislative and judicial developments
           },
  volume = {61},
  number = {3},
  year = {1971},
}

@incollection{Rueda16,
  title = {Chapter 10 - Software for Randomized Response Techniques},
  editor = {Arijit Chaudhuri and Tasos C. Christofides and C.R. Rao},
  series = {Handbook of Statistics},
  publisher = {Elsevier},
  volume = {34},
  pages = {155-167},
  year = {2016},
  booktitle = {Data Gathering, Analysis and Protection of Privacy Through
               Randomized Response Techniques: Qualitative and Quantitative Human
               Traits},
  doi = {https://doi.org/10.1016/bs.host.2016.01.009},
  author = {M. Rueda and B. Cobo and A. Arcos and R. Arnab},
}

Silent Duels—Parsing the Construction

Last time we discussed the setup for the silent duel problem: two players taking actions in $ [0,1]$, player 1 gets $ n$ chances to act, player 2 gets $ m$, and each knows their probability of success when they act.

The solution is in a paper of Rodrigo Restrepo from the 1950s. In this post I’ll start detailing how I study this paper, and talk through my thought process for approaching a bag of theorems and proofs. If you want to follow along, I re-typeset the paper on Github.

Game Theory Basics

The Introduction starts with a summary of the setting of game theory. I remember most of this so I will just summarize the basics of the field. Skip ahead if you already know what the minimax theorem is, and what I mean when I say the “value” of a game.

A two-player game consists of a set of actions for each player—which may be finite or infinite, and need not be the same for both players—and a payoff function for each possible choice of actions. The payoff function is interpreted as the “utility” that player 1 gains and player 2 loses. If the payoff is negative, you interpret it as player 1 losing utility to player 2. Utility is just a fancy way of picking a common set of units for what each player treasures in their heart of hearts. Often it’s stated as money and we assume both players value cash the same way. Games in which the utility is always “one player gains exactly the utility lost by the other player” are called zero-sum.

With a finite set of actions, the payoff function is a table. For rock-paper-scissors the table is:

Rock, paper: -1
Rock, scissors: 1
Rock, rock: 0
Paper, paper: 0
Paper, scissors: -1
Paper, rock: 1
Scissors, paper: 1
Scissors, scissors: 0
Scissors, rock: -1

You could arrange this in a matrix and analyze the structure of the matrix, but we won’t. It doesn’t apply to our forthcoming setting where the players have infinitely many strategies.

A strategy is a possibly-randomized algorithm (whose inputs are just the data of the game, not including any past history of play) that outputs an action. In some games, the optimal strategy is to choose a single action no matter what your opponent does. This is sometimes called a pure, dominating strategy, not because it dominates your opponent, but because it’s better than all of your other options no matter what your opponent does. The output action is deterministic.

However, as with rock-paper-scissors, the optimal strategy for most interesting games requires each player to act randomly according to a fixed distribution. Such strategies are called mixed or randomized. For rock-paper-scissors, the optimal strategy is to choose rock, paper, and scissors with equal probability.  Computers are only better than humans at rock-paper-scissors because humans are bad at behaving consistently and uniformly random.

The famous minimax theorem says that every two-player zero-sum game has an optimal strategy for each player, which is possibly randomized. This strategy is optimal in the sense that it maximizes your expected winnings no matter what your opponent does. However, if your opponent is playing a particularly suboptimal strategy, the minimax solution might not be as good as a solution that takes advantage of the opponent’s dumb choices. A uniform random rock-paper-scissors strategy is not optimal if your opponent always plays “rock.”  However, the optimal strategy doesn’t need special knowledge or space to store information about past play. If you played against God, you would blindly use the minimax strategy and God would have no upper hand. I wonder if the pope would have excommunicated me for saying that in the 1600’s.

The expected winnings for player 1 when both players play a minimax-optimal strategy is called the value of the game, and this number is unique (even if there are possibly multiple optimal strategies). If a game is symmetric—meaning both players have the same actions and the payoff function is symmetric—then the value is guaranteed to be zero. The game is fair.

The version of the minimax theorem that most people use (in particular, the version that often comes up in theoretical computer science) shows that finding an optimal strategy is equivalent to solving a linear program. This is great because it means that any such (finite) game is easy to solve. You don’t need insight; just compile and run. The minimax theorem is also true for sufficiently well-behaved continuous action spaces. The silent duel is well-behaved, so our goal is to compute an explicit, easy-to-implement strategy that the minimax theorem guarantees exists. As a side note, here is an example of a poorly-behaved game with no minimax optimum.

While the minimax theorem guarantees optimal strategies and a value, the concept of the “value” of the game has an independent definition:

Let $ X, Y$ be finite sets of actions for players 1, 2 respectively, and $ p(x), q(y)$ be strategies, i.e., probability distributions over $ X$ and $ Y$ so that $ p(x)$ is the probability that $ x$ is chosen. Let $ \Psi(x, y)$ be the payoff function for the game. The value of the game is a real number $ v$ such that there exist two strategies $ p, q$ with the two following properties. First, for every fixed $ y \in Y$,

$ \displaystyle \sum_{x \in X} p(x) \Psi(x, y) \geq v$

(no matter what player 2 does, player 1’s strategy guarantees at least $ v$ payoff), and for every fixed $ x \in X$,

$ \displaystyle \sum_{y \in Y} q(y) \Psi(x, y) \leq v$

(no matter what player 1 does, player 2’s strategy prevents a loss of more than $ v$).

Since silent duels are continuous, Restrepo opens the paper with the corresponding definition for continuous games. Here a probability distribution is the same thing as a “positive measure with total measure 1.” Restrepo uses $ F$ and $ G$ for the strategies, and the corresponding statement of expected payoff for player 1 is that, for all fixed actions $ y \in Y$,

$ \displaystyle \int \Psi(x, y) dF(x) \geq v$

And likewise, for all $ x \in X$,

$ \displaystyle \int \Psi(x, y) dG(y) \leq v$

All of this background gets us through the very first paragraph of the Restrepo paper. As I elaborate in my book, this is par for the course for math papers, because written math is optimized for experts already steeped in the context. Restrepo assumes the reader knows basic game theory so we can get on to the details of his construction, at which point he slows down considerably to focus on the details.

Description of the Optimal Strategies

Starting in section 2, Restrepo describes the construction of the optimal strategy, but first he explains the formal details of the setting of the game. We already know the two players are taking $ n$ and $ m$ actions between $ 0 \leq t \leq 1$, but we also fix the probability of success. Player 1 knows a distribution $ P(t)$ on $ [0,1]$ for which $ P(t)$ is the probability of success when acting at time $ t$. Likewise, player 2 has a possibly different distribution $ Q(t)$, and (crucially) $ P(t), Q(t)$ both increase continuously on $ [0,1]$. (In section 3 he clarifies further that $ P$ satisfies $ P(0) = 0, P(1) = 1$, and $ P'(t) > 0$, likewise for $ Q(t)$.) Moreover, both players know both $ P, Q$. One could say that each player has an estimate of their opponent’s firing accuracy, and wants to be optimal compared to that estimate.

The payoff function $ \Psi(x, y)$ is defined informally as: 1 if Player one succeeds before Player 2, -1 if Player 2 succeeds first, and 0 if both players exhaust their actions before the end and none succeed. Though Restrepo does not state it, if the players act and succeed at the same time—say both players fire at time $ t=1$—the payoff should also be zero. We’ll see how this is converted to a more formal (and cumbersome!) mathematical definition in a future post.

Next we’ll describe the statement of the fully general optimal strategy (which will be essentially meaningless, but have some notable features we can infer information from), and get a sneak peek at how to build this strategy algorithmically. Then we’ll see a simplified example of the optimal strategy.

The optimal strategy presented depends only on the values $ n, m$ (the number of actions each player gets) and their success probability distributions $ P, Q$. For player 1, the strategy splits up $ [0,1]$ into subintervals

$ \displaystyle [a_i, a_{i+1}] \qquad 0 < a_1 < a_2, < \cdots < a_n < a_{n+1} = 1$

Crucially, this strategy ignores the initial interval $ [0, a_1]$. In each other subinterval Player 1 attempts an action at a time chosen by a probability distribution specific to that interval, independently of previous attempts. But no matter what, there is some initial wait time during which no action will ever be taken. This makes sense: if player 1 fired at time 0, it is a guaranteed wasted shot. Likewise, firing at time 0.000001 is basically wasted (due to continuity, unless $ P(t)$ is obnoxiously steep early on).

Likewise for player 2, the optimal strategy is determined by numbers $ b_1, \dots, b_m$ resulting in $ m$ intervals $ [b_j, b_{j+1}]$ with $ b_{m+1} = 1$.

The difficult part of the construction is describing the distributions dictating when a player should act during an interval. It’s difficult because an interval for player 1 and player 2 can overlap partially. Maybe $ a_2 = 0.5, a_3 = 0.75$ and $ b_1 = 0.25, b_2 = 0.6$. Player 1 knows that Player 2 (using their corresponding minimax strategy) must act before time $ t = 0.6$, and gets another chance after that time. This suggests that the distribution determining when Player 1 should act within $ [a_2, a_3]$ may have a discontinuous jump at $ t = 0.6$.

Call $ F_i$ the distribution for Player 1 to act in the interval $ [a_i, a_{i+1}]$. Since it is a continuous distribution, Restrepo uses $ F_i$ for the cumulative distribution function and $ dF_i$ for the probability density function. Then these functions are defined by (note this should be mostly meaningless for the moment)

$ \displaystyle dF_i(x_i) = \begin{cases} h_i f^*(x_i) dx_i & \textup{ if } a_i < x_i < a_{i+1} \\ 0 & \textup{ if } x_i \not \in [a_i, a_{i+1}] \\ \end{cases}$

where $ f^*$ is defined as

$ \displaystyle f^*(t) = \prod_{b_j > t} \left [ 1 – Q(b_j) \right ] \frac{Q'(t)}{Q^2(t) P(t)}.$

The constants $ h_i$ and $ h_{i+1}$ are related by the equation

$ \displaystyle h_i = [1 – D_i] h_{i+1},$

where

$ \displaystyle D_i = \int_{a_i}^{a_{i+1}} P(t) dF_i(t)$

What can we glean from this mashup of symbols? The first is that (obviously) the distribution is zero outside the interval $ [a_i, a_{i+1}]$. Within it, there is this mysterious $ h_i$ that is related to the $ h_{i+1}$ used to define the next interval’s probability. This suggests we will likely build up the strategy in reverse starting with $ F_n$ as the “base case” (if $ n=1$, then it is the only one).

Next, we notice the curious definition of $ f^*$. It unsurprisingly requires knowledge of both $ P$ and $ Q$, but the coefficient is strangely chosen: it’s a product over all failure probabilities ($ 1 – Q(b_j)$) of all interval-starts happening later for the opponent.

[Side note: it’s very important that this is a constant; when I first read this, I thought that it was $ \prod_{b_j > t}[1 – Q(t)]$, which makes the eventual task of integrating $ f^*$ much harder.]

Finally, the last interval (the one ending at $ t=1$) may include the option to simply “wait for a guaranteed hit,” which Restrepo calls a “discrete mass of $ \alpha$ at $ t=1$.” That is, $ F_n$ may have a different representation than the rest. Indeed, at the end of the paper we will find that Restrepo gives a base-case definition for $ h_n$ that allows us to bootstrap the construction.

Player 2’s strategy is the same as Player 1’s, but replacing the roles of $ P, Q, n, m, a_i, b_j$ in the obvious way.

The symmetric example

As with most math research, the best way to parse a complicated definition or construction is to simplify the different aspects of the problem until they become tractable. One way to do this is to have only a single action for both players, with $ P = Q$. Restrepo provides a more general example to demonstrate, which results in the five most helpful lines in the paper. I’ll reproduce them here verbatim:

EXAMPLE. Symmetric Game: $ P(t) = Q(t),$ and $ n = m$. In this case the two
players have the same optimal strategies; $ \alpha = 0$, and $ a_k = b_k, k=1,
\dots, n$. Furthermore

$ \displaystyle \begin{aligned} P(a_{n-k}) &= \frac{1}{2k+3} & k = 0, 1, \dots, n-1, \\ dF_{n-k}(t) &= \frac{1}{4(k+1)} \frac{P'(t)}{P^3(t)} dt & a_{n-k} < t < a_{n-k+1}. \end{aligned}$

Saying $ \alpha = 0$ means there is no “wait until $ t=1$ to guarantee a hit”, which makes intuitive sense. You’d only want to do that if your opponent has exhausted all their actions before the end, which is only likely to happen if they have fewer actions than you do.

When Restrepo writes $ P(a_{n-k}) = \frac{1}{2k+3}$, there are a few things happening. First, we confirm that we’re working backwards from $ a_n$. Second, he’s implicitly saying “choose $ a_{n-k}$ such that $ P(a_{n-k})$ has the desired cumulative density.” After a bit of reflection, there’s no other way to specify the $ a_i$ except implicitly: we don’t have a formula for $ P$ to lean on.

Finally, the definition of the density function $ dF_{n-k}(t)$ helps us understand under what conditions the probability function would be increasing or decreasing from the start of the interval to the end. Looking at the expression $ P'(t) / P^3(t)$, we can see that polynomials will result in an expression dominated by $ 1/t^k$ for some $ k$, which is decreasing. By taking the derivative, an increasing density would have to be built from a $ P$ satisfying $ P”(t) P(t) – 3(P'(t))^2 > 0$. However, I wasn’t able to find any examples that satisfy this. Polynomials, square roots, logs and exponentials, all seem to result in decreasing density functions.

Finally, we’ll plot two examples. The first is the most reductive: $ P(t) = Q(t) = t$, and $ n = m = 1$. In this case $ n=1$, and there is only one term $ k=0$, for which $ a_n = 1/3$. Then $ dF_1(t) = 1/4t^3$. (For verification, note the integral of $ dF_1$ on $ [1/3, 1]$ is indeed 1).

restrepo-1-over-4tcubed.png

With just one action and P(t) = Q(t) = t, the region before t=1/3 has zero probability, and the probability decreases from 6.75 to 1/4.

Note that the reason $ a_n = 1/3$ is so nice is that $ P(t)$ is so simple. If $ P(t)$ were, say, $ t^2$, then $ a_n$ should shift to being $ \sqrt{1/3}$. If $ P(t)$ were more complicated, we’d have to invert it (or use an approximate search) to find the location $ a_n$ for which $ P(a_n) = 1/3$.

Next, we loosen the example to let $ n=m=4$, still with $ P(t) = Q(t) = t$. In this case, we have the same final interval $ [1/3,1]$. The new actions all occur in the time before $ t=1/3$, in the intervals $ [1/5, 1/3], [1/7, 1/5], [1/9,1/7].$ If there were more actions, we’d get smaller inverse-of-odd-spaced intervals approaching zero. The probability densities are now steeper versions of the same $ 1/4t^3$, with the constant getting smaller to compensate for the fact that $ 1/t^3$ gets larger and maintain the normalized distribution. For example, the earliest interval results in $ \int_{1/9}^{1/7} \frac{1}{16t^3} dt = 1$. Closer to zero the densities are somewhat shallower compared to the size of the interval; for example in $ [1/9, 1/7],$ the density toward the beginning of the interval is only about twice as large as the density toward the end.

restrepo-four-actions.png

The combination of the four F_i’s for the four intervals in which actions are taken. This is a complete description of the optimal strategy for our simple symmetric version of the silent duel.

Since the early intervals are getting smaller and smaller as we add more actions, the optimal strategy will resemble a burst of action at the beginning, gradually tapering off as the accuracy increases and we work through our budget. This is an explicit tradeoff between the value of winning (lots of early, low probability attempts) and keeping some actions around for the end where you’re likely to succeed.

Next step: get to the example from the general theorem

At this point, we’ve parsed the general statement of the theorem, and while much of it is still mysterious, we extracted some useful qualitative information from the statement, and tinkered with some simple examples.

At this point, I have confidence that the simple symmetric example Restrepo provided is correct; it passed some basic unit tests, like that each $ dF_i$ is normalized. My next task in fully understanding the paper is to be able to derive the symmetric example from the general construction. We’ll do this next time, and include a program that constructs the optimal solution for any input.

Until then!

 

Earthmover Distance

Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).

For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?

example-points.png

It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of masses of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.

same-centers.png

All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.

Solution (Earthmover distance): Treat each sample set $ A$ corresponding to a “point” as a discrete probability distribution, so that each sample $ x \in A$ has probability mass $ p_x = 1 / |A|$. The distance between $ A$ and $ B$ is the optional solution to the following linear program.

Each $ x \in A$ corresponds to a pile of dirt of height $ p_x$, and each $ y \in B$ corresponds to a hole of depth $ p_y$. The cost of moving a unit of dirt from $ x$ to $ y$ is the Euclidean distance $ d(x, y)$ between the points (or whatever hipster metric you want to use).

Let $ z_{x, y}$ be a real variable corresponding to an amount of dirt to move from $ x \in A$ to $ y \in B$, with cost $ d(x, y)$. Then the constraints are:

  • Each $ z_{x, y} \geq 0$, so dirt only moves from $ x$ to $ y$.
  • Every pile $ x \in A$ must vanish, i.e. for each fixed $ x \in A$, $ \sum_{y \in B} z_{x,y} = p_x$.
  • Likewise, every hole $ y \in B$ must be completely filled, i.e. $ \sum_{y \in B} z_{x,y} = p_y$.

The objective is to minimize the cost of doing this: $ \sum_{x, y \in A \times B} d(x, y) z_{x, y}$.

In python, using the ortools library (and leaving out a few docstrings and standard import statements, full code on Github):

from ortools.linear_solver import pywraplp

def earthmover_distance(p1, p2):
    dist1 = {x: count / len(p1) for (x, count) in Counter(p1).items()}
    dist2 = {x: count / len(p2) for (x, count) in Counter(p2).items()}
    solver = pywraplp.Solver('earthmover_distance', pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

    variables = dict()

    # for each pile in dist1, the constraint that says all the dirt must leave this pile
    dirt_leaving_constraints = defaultdict(lambda: 0)

    # for each hole in dist2, the constraint that says this hole must be filled
    dirt_filling_constraints = defaultdict(lambda: 0)

    # the objective
    objective = solver.Objective()
    objective.SetMinimization()

    for (x, dirt_at_x) in dist1.items():
        for (y, capacity_of_y) in dist2.items():
            amount_to_move_x_y = solver.NumVar(0, solver.infinity(), 'z_{%s, %s}' % (x, y))
            variables[(x, y)] = amount_to_move_x_y
            dirt_leaving_constraints[x] += amount_to_move_x_y
            dirt_filling_constraints[y] += amount_to_move_x_y
            objective.SetCoefficient(amount_to_move_x_y, euclidean_distance(x, y))

    for x, linear_combination in dirt_leaving_constraints.items():
        solver.Add(linear_combination == dist1[x])

    for y, linear_combination in dirt_filling_constraints.items():
        solver.Add(linear_combination == dist2[y])

    status = solver.Solve()
    if status not in [solver.OPTIMAL, solver.FEASIBLE]:
        raise Exception('Unable to find feasible solution')

    return objective.Value()

Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.

One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves

Screen Shot 2018-03-03 at 6.11.00 PM.png

Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.

This kind of attention to detail can be very important in certain situations. One that I’ve been paying close attention to recently is the problem of studying gerrymandering from a mathematical perspective. Justin Solomon of MIT is a champion of the Earthmover distance (see his fascinating talk here for more, with slides) which is just one topic in a field called “optimal transport.”

This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.

Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends from computing distances to computing important geometric values like the barycenter, computational advantages like parallelism.

Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.

Screen Shot 2018-03-03 at 6.15.50 PM.png

Finally, work of Shirdhonkar and Jacobs provide approximation algorithms that allow linear-time computation, instead of the worst-case-cubic runtime of a linear solver.

Bayesian Ranking for Rated Items

Problem: You have a catalog of items with discrete ratings (thumbs up/thumbs down, or 5-star ratings, etc.), and you want to display them in the “right” order.

Solution: In Python

'''
  score: [int], [int], [float] -&amp;gt; float

  Return the expected value of the rating for an item with known
  ratings specified by `ratings`, prior belief specified by
  `rating_prior`, and a utility function specified by `rating_utility`,
  assuming the ratings are a multinomial distribution and the prior
  belief is a Dirichlet distribution.
'''
def score(self, ratings, rating_prior, rating_utility):
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    score = sum(r * u for (r, u) in zip(ratings, rating_utility))
    return score / sum(ratings)

Discussion: This deceptively short solution can lead you on a long and winding path into the depths of statistics. I will do my best to give a short, clear version of the story.

As a working example I chose merely because I recently listened to a related podcast, say you’re selling mass-market romance novels—which, by all accounts, is a predictable genre. You have a list of books, each of which has been rated on a scale of 0-5 stars by some number of users. You want to display the top books first, so that time-constrained readers can experience the most titillating novels first, and newbies to the genre can get the best first time experience and be incentivized to buy more.

The setup required to arrive at the above code is the following, which I’ll phrase as a story.

Users’ feelings about a book, and subsequent votes, are independent draws from a known distribution (with unknown parameters). I will just call these distributions “discrete” distributions. So given a book and user, there is some unknown list $ (p_0, p_1, p_2, p_3, p_4, p_5)$ of probabilities ($ \sum_i p_i = 1$) for each possible rating a user could give for that book.

But how do users get these probabilities? In this story, the probabilities are the output of a randomized procedure that generates distributions. That modeling assumption is called a “Dirichlet prior,” with Dirichlet meaning it generates discrete distributions, and prior meaning it encodes domain-specific information (such as the fraction of 4-star ratings for a typical romance novel).

So the story is you have a book, and that book gets a Dirichlet distribution (unknown to us), and then when a user comes along they sample from the Dirichlet distribution to get a discrete distribution, which they then draw from to choose a rating. We observe the ratings, and we need to find the book’s underlying Dirichlet. We start by assigning it some default Dirichlet (the prior) and update that Dirichlet as we observe new ratings. Some other assumptions:

  1. Books are indistinguishable except in the parameters of their Dirichlet distribution.
  2. The parameters of a book’s Dirichlet distribution don’t change over time, and inherently reflect the book’s value.

So a Dirichlet distribution is a process that produces discrete distributions. For simplicity, in this post we will say a Dirichlet distribution is parameterized by a list of six integers $ (n_0, \dots, n_5)$, one for each possible star rating. These values represent our belief in the “typical” distribution of votes for a new book. We’ll discuss more about how to set the values later. Sampling a value (a book’s list of probabilities) from the Dirichlet distribution is not trivial, but we don’t need to do that for this program. Rather, we need to be able to interpret a fixed Dirichlet distribution, and update it given some observed votes.

The interpretation we use for a Dirichlet distribution is its expected value, which, recall, is the parameters of a discrete distribution. In particular if $ n = \sum_i n_i$, then the expected value is a discrete distribution whose probabilities are

$ \displaystyle \left (  \frac{n_0}{n}, \frac{n_1}{n}, \dots, \frac{n_5}{n} \right )$

So you can think of each integer in the specification of a Dirichlet as “ghost ratings,” sometimes called pseudocounts, and we’re saying the probability is proportional to the count.

This is great, because if we knew the true Dirichlet distribution for a book, we could compute its ranking without a second thought. The ranking would simply be the expected star rating:

def simple_score(distribution):
   return sum(i * p for (i, p) in enumerate(distribution))

Putting books with the highest score on top would maximize the expected happiness of a user visiting the site, provided that happiness matches the user’s voting behavior, since the simple_score is just the expected vote.

Also note that all the rating system needs to make this work is that the rating options are linearly ordered. So a thumbs up/down (heaving bosom/flaccid member?) would work, too. We don’t need to know how happy it makes them to see a 5-star vs 4-star book. However, because as we’ll see next we have to approximate the distribution, and hence have uncertainty for scores of books with only a few ratings, it helps to incorporate numerical utility values (we’ll see this at the end).

Next, to update a given Dirichlet distribution with the results of some observed ratings, we have to dig a bit deeper into Bayes rule and the formulas for sampling from a Dirichlet distribution. Rather than do that, I’ll point you to this nice writeup by Jonathan Huang, where the core of the derivation is in Section 2.3 (page 4), and remark that the rule for updating for a new observation is to just add it to the existing counts.

Theorem: Given a Dirichlet distribution with parameters $ (n_1, \dots, n_k)$ and a new observation of outcome $ i$, the updated Dirichlet distribution has parameters $ (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_k)$. That is, you just update the $ i$-th entry by adding $ 1$ to it.

This particular arithmetic to do the update is a mathematical consequence (derived in the link above) of the philosophical assumption that Bayes rule is how you should model your beliefs about uncertainty, coupled with the assumption that the Dirichlet process is how the users actually arrive at their votes.

The initial values $ (n_0, \dots, n_5)$ for star ratings should be picked so that they represent the average rating distribution among all prior books, since this is used as the default voting distribution for a new, unknown book. If you have more information about whether a book is likely to be popular, you can use a different prior. For example, if JK Rowling wrote a Harry Potter Romance novel that was part of the canon, you could pretty much guarantee it would be popular, and set $ n_5$ high compared to $ n_0$. Of course, if it were actually popular you could just wait for the good ratings to stream in, so tinkering with these values on a per-book basis might not help much. On the other hand, most books by unknown authors are bad, and $ n_5$ should be close to zero. Selecting a prior dictates how influential ratings of new items are compared to ratings of items with many votes. The more pseudocounts you add to the prior, the less new votes count.

This gets us to the following code for star ratings.

def score(self, ratings, rating_prior):
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    score = sum(i * u for (i, u) in enumerate(ratings))
    return score / sum(ratings)

The only thing missing from the solution at the beginning is the utilities. The utilities are useful for two reasons. First, because books with few ratings encode a lot of uncertainty, having an idea about how extreme a feeling is implied by a specific rating allows one to give better rankings of new books.

Second, for many services, such as taxi rides on Lyft, the default star rating tends to be a 5-star, and 4-star or lower mean something went wrong. For books, 3-4 stars is a default while 5-star means you were very happy.

The utilities parameter allows you to weight rating outcomes appropriately. So if you are in a Lyft-like scenario, you might specify utilities like [-10, -5, -3, -2, 1] to denote that a 4-star rating has the same negative impact as two 5-star ratings would positively contribute. On the other hand, for books the gap between 4-star and 5-star is much less than the gap between 3-star and 4-star. The utilities simply allow you to calibrate how the votes should be valued in comparison to each other, instead of using their literal star counts.