“Practical Math” Preview: Collect Sensitive Survey Responses Privately

This is a draft of a chapter from my in-progress book, Practical Math for Programmers: A Tour of Mathematics in Production Software.

Tip: Determine an aggregate statistic about a sensitive question, when survey respondents do not trust that their responses will be kept secret.

Solution:

import random

def respond_privately(true_answer: bool) -> bool:
    '''Respond to a survey with plausible deniability about your answer.'''
    be_honest = random.random() < 0.5
    random_answer = random.random() < 0.5
    return true_answer if be_honest else random_answer

def aggregate_responses(responses: List[bool]) -> Tuple[float, float]:
    '''Return the estimated fraction of survey respondents that have a truthful
    Yes answer to the survey question.
    '''
    yes_response_count = sum(responses)
    n = len(responses)
    mean = 2 * yes_response_count / n - 0.5
    # Use n-1 when estimating variance, as per Bessel's correction.
    variance = 3 / (4 * (n - 1))
    return (mean, variance)

In the late 1960’s, most abortions were illegal in the United States. Daniel G. Horvitz, a statistician at The Research Triangle Institute in North Carolina and a leader in survey design for social sciences, was tasked with estimating how many women in North Carolina were receiving illegal abortions. The goal was to inform state and federal policymakers about the statistics around abortions, many of which were unreported, even when done legally.

The obstacles were obvious. As Horvitz put it, “a prudent woman would not divulge to a stranger the fact that she was party to a crime for which she could be prosecuted.” [Abernathy70] This resulted in a strong bias in survey responses. Similar issues had plagued surveys of illegal activity of all kinds, including drug abuse and violent crime. Lack of awareness into basic statistics about illegal behavior led to a variety of misconceptions, such as that abortions were not frequently sought out.

Horvitz worked with biostatisticians James Abernathy and Bernard Greenberg to test out a new method to overcome this obstacle, without violating the respondent’s privacy or ability to plausibly deny illegal behavior. The method, called randomized response, was invented by Stanley Warner in 1965, just a few years earlier. [Warner65] Warner’s method was a bit different from what we present in this Tip, but both Warner’s method and the code sample above use the same strategy of adding randomization to the survey.

The mechanism, as presented in the code above, requires respondents to start by flipping a coin. If heads, they answer the sensitive question truthfully. If tails, they flip a second coin to determine how to answer the question—heads resulting in a “yes” answer, tails in a “no” answer. Naturally, the coin flips are private and controlled by the respondent. And so if a respondent answers “Yes” to the question, they may plausibly claim the “Yes” was determined by the coin, preserving their privacy. The figure below describes this process as a diagram.

A branching diagram showing the process a survey respondent takes to record their response.

Another way to describe the outcome is to say that each respondent’s answer is a single bit of information that is flipped with probability 1/4. This is half way between two extremes on the privacy/accuracy tradeoff curve. The first extreme is a “perfectly honest” response, where the bit is never flipped and all information is preserved. The second extreme has the bit flipped with probability 1/2, which is equivalent to ignoring the question and choosing your answer completely at random, losing all information in the aggregate responses. In this perspective, the aggregate survey responses can be thought of as a digital signal, and the privacy mechanism adds noise to that signal.

It remains to determine how to recover the aggregate signal from these noisy responses. In other words, the surveyor cannot know any individual’s true answer, but they can, with some extra work, estimate statistics about the underlying population by correcting for the statistical bias. This is possible because the randomization is well understood. The expected fraction of “Yes” answers can be written as a function of the true fraction of “Yes” answers, and hence the true fraction can be solved for. In this case, where the random coin is fair, that formula is as follows (where \mathbf{P} stands for “the probability of”).

\displaystyle \mathbf{P}(\textup{Yes answer}) = \frac{1}{2} \mathbf{P}(\textup{Truthful yes answer}) + \frac{1}{4}

And so we solve for \mathbf{P}(\textup{Truthful yes answer})

\displaystyle \mathbf{P}(\textup{Truthful yes answer}) = 2 \mathbf{P}(\textup{Yes answer}) - \frac{1}{2}

We can replace the true probability \mathbf{P}(\textup{Yes answer}) above with our fraction of “Yes” responses from the survey, and the result is an estimate \hat{p} of \mathbf{P}(\textup{Truthful yes answer}). This estimate is unbiased, but has additional variance—beyond the usual variance caused by picking a finite random sample from the population of interest—introduced by the randomization mechanism.

With a bit of effort, one can calculate that the variance of the estimate is

\displaystyle \textup{Var}(\hat{p}) = \frac{3}{4n}

And via Chebyshev’s inequality, which bounds the likelihood that an estimator is far away from its expectation, we can craft a confidence interval and determine the needed sample sizes. Specifically, the estimate \hat{p} has additive error at most q with probability at most \textup{Var}(\hat{p}) / q^2. This implies that for a confidence of 1-c, one requires at least n \geq 3 / (4 c q^2) samples. For example, to achieve error 0.01 with 90 percent confidence (c=0.1), one requires 7,500 responses.

Horvitz’s randomization mechanism didn’t use coin flips. Instead they used an opaque box with red or blue colored balls which the respondent, who was in the same room as the surveyor, would shake and privately reveal a random color through a small window facing away from the surveyor. The statistical principle is the same. Horvitz and his associates surveyed the women about their opinions of the privacy protections of this mechanism. When asked whether their friends would answer a direct question about abortion honestly, over 80% either believed their friends would lie, or were unsure. [footnote: A common trick in survey methodology when asking someone if they would be dishonest is to instead ask if their friends would be dishonest. This tends to elicit more honesty, because people are less likely to uphold a false perception of the moral integrity of others, and people also don’t realize that their opinion of their friends correlates with their own personal behavior and attitudes. In other words, liars don’t admit to lying, but they think lying is much more common than it really is.] But 60% were convinced there was no trick involved in the randomization, while 20% were unsure and 20% thought there was a trick. This suggests many people were convinced that Horvitz’s randomization mechanism provided the needed safety guarantees to answer honestly.

Horvitz’s survey was a resounding success, both for randomized response as a method and for measuring abortion prevalence. [Abernathy70] They estimated the abortion rate at about 22 per 100 conceptions, with a distinct racial bias—minorities were twice as likely as whites to receive an abortion. Comparing their findings to a prior nationwide study from 1955—the so-called Arden House estimate—which gave a range of between 200,000 and 1.2 million abortions per year, Horvitz’s team estimated more precisely that there were 699,000 abortions in 1955 in the United States, with a reported standard deviation of about 6,000, less than one percent. For 1967, the year of their study, they estimated 829,000.

Their estimate was referenced widely in the flurry of abortion law and court cases that followed due to a surging public interest in the topic. For example, it is cited in the 1970 California Supreme Court opinion for the case Ballard v. Anderson, which concerned whether a minor needs parental consent to receive an otherwise legal abortion. [Ballard71, Roemer71] It was also cited in amici curiae briefs submitted to the United States Supreme Court in 1971 for Roe v. Wade, the famous case that invalidated most U.S. laws making abortion illegal. One such brief was filed jointly by the country’s leading women’s rights organizations like the National Organization for Women. Citing Horvitz for this paragraph, it wrote, [Womens71]

While the realities of law enforcement, social and public health problems posed by abortion laws have been openly discussed […] only within a period of not more than the last ten years, one fact appears undeniable, although unverifiable statistically. There are at least one million illegal abortions in the United States each year. Indeed, studies indicate that, if the local law still has qualifying requirements, the relaxation in the law has not diminished to any substantial extent the numbers in which women procure illegal abortions.

It’s unclear how the authors got this one million number (Horvitz’s estimate was 20% less for 1967), nor what they meant by “unverifiable statistically.” It may have been a misinterpretation of the randomized response technique. In any event, randomized response played a crucial role in providing a foundation for political debate.

Despite Horvitz’s success, and decades of additional research on crime, drug use, and other sensitive topics, randomized response mechanisms have been applied poorly. In some cases, the desired randomization is inextricably complex, such as when requiring a continuous random number. In these cases, a manual randomization mechanism is too complex for a respondent to use accurately. Trying to use software-assisted devices can help, but can also produce mistrust in the interviewee. See [Rueda16] for additional discussion of these pitfalls and what software packages exist for assisting in using randomized response. See [Fox16] for an analysis of the statistical differences between the variety of methods used between 1970 and 2010.

In other contexts, analogues to randomized response may not elicit the intended effect. In the 1950’s, Utah used death by firing squad as capital punishment. To avoid a guilty conscience of the shooters, one of five marksmen was randomly given a blank, providing him some plausible deniability that he knew he had delivered the killing shot. However, this approach failed on two counts. First, once a shot was fired the marksman could tell whether the bullet was real based on the recoil. Second, a 20% chance of a blank was not enough to dissuade a guilty marksman from purposely missing. In the 1951 execution of Elisio Mares, all four real bullets missed the condemned man’s heart, hitting his chest, stomach, and hip. He died, but it was neither painless nor instant.

Of many lessons one might draw from the botched execution, one is that randomization mechanisms must take into account both the psychology of the participants as well as the severity of a failed outcome.

References

@book{Fox16,
  title = {{Randomized Response and Related Methods: Surveying Sensitive Data}},
  author = {James Alan Fox},
  edition = {2nd},
  year = {2016},
  doi = {10.4135/9781506300122},
}

@article{Abernathy70,
  author = {Abernathy, James R. and Greenberg, Bernard G. and Horvitz, Daniel G.
            },
  title = {{Estimates of induced abortion in urban North Carolina}},
  journal = {Demography},
  volume = {7},
  number = {1},
  pages = {19-29},
  year = {1970},
  month = {02},
  issn = {0070-3370},
  doi = {10.2307/2060019},
  url = {https://doi.org/10.2307/2060019},
}

@article{Warner65,
  author = {Stanley L. Warner},
  journal = {Journal of the American Statistical Association},
  number = {309},
  pages = {63--69},
  publisher = {{American Statistical Association, Taylor \& Francis, Ltd.}},
  title = {Randomized Response: A Survey Technique for Eliminating Evasive
           Answer Bias},
  volume = {60},
  year = {1965},
}

@article{Ballard71,
  title = {{Ballard v. Anderson}},
  journal = {California Supreme Court L.A. 29834},
  year = {1971},
  url = {https://caselaw.findlaw.com/ca-supreme-court/1826726.html},
}

@misc{Womens71,
  title = {{Motion for Leave to File Brief Amici Curiae on Behalf of Women’s
           Organizations and Named Women in Support of Appellants in Each Case,
           and Brief Amici Curiae.}},
  booktitle = {{Appellate Briefs for the case of Roe v. Wade}},
  number = {WL 128048},
  year = {1971},
  publisher = {Supreme Court of the United States},
}

@article{Roemer71,
  author = {R. Roemer},
  journal = {Am J Public Health},
  pages = {500--509},
  title = {Abortion law reform and repeal: legislative and judicial developments
           },
  volume = {61},
  number = {3},
  year = {1971},
}

@incollection{Rueda16,
  title = {Chapter 10 - Software for Randomized Response Techniques},
  editor = {Arijit Chaudhuri and Tasos C. Christofides and C.R. Rao},
  series = {Handbook of Statistics},
  publisher = {Elsevier},
  volume = {34},
  pages = {155-167},
  year = {2016},
  booktitle = {Data Gathering, Analysis and Protection of Privacy Through
               Randomized Response Techniques: Qualitative and Quantitative Human
               Traits},
  doi = {https://doi.org/10.1016/bs.host.2016.01.009},
  author = {M. Rueda and B. Cobo and A. Arcos and R. Arnab},
}

Bayesian Ranking for Rated Items

Problem: You have a catalog of items with discrete ratings (thumbs up/thumbs down, or 5-star ratings, etc.), and you want to display them in the “right” order.

Solution: In Python

'''
  score: [int], [int], [float] -&gt; float

  Return the expected value of the rating for an item with known
  ratings specified by `ratings`, prior belief specified by
  `rating_prior`, and a utility function specified by `rating_utility`,
  assuming the ratings are a multinomial distribution and the prior
  belief is a Dirichlet distribution.
'''
def score(self, ratings, rating_prior, rating_utility):
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    score = sum(r * u for (r, u) in zip(ratings, rating_utility))
    return score / sum(ratings)

Discussion: This deceptively short solution can lead you on a long and winding path into the depths of statistics. I will do my best to give a short, clear version of the story.

As a working example I chose merely because I recently listened to a related podcast, say you’re selling mass-market romance novels—which, by all accounts, is a predictable genre. You have a list of books, each of which has been rated on a scale of 0-5 stars by some number of users. You want to display the top books first, so that time-constrained readers can experience the most titillating novels first, and newbies to the genre can get the best first time experience and be incentivized to buy more.

The setup required to arrive at the above code is the following, which I’ll phrase as a story.

Users’ feelings about a book, and subsequent votes, are independent draws from a known distribution (with unknown parameters). I will just call these distributions “discrete” distributions. So given a book and user, there is some unknown list (p_0, p_1, p_2, p_3, p_4, p_5) of probabilities (\sum_i p_i = 1) for each possible rating a user could give for that book.

But how do users get these probabilities? In this story, the probabilities are the output of a randomized procedure that generates distributions. That modeling assumption is called a “Dirichlet prior,” with Dirichlet meaning it generates discrete distributions, and prior meaning it encodes domain-specific information (such as the fraction of 4-star ratings for a typical romance novel).

So the story is you have a book, and that book gets a Dirichlet distribution (unknown to us), and then when a user comes along they sample from the Dirichlet distribution to get a discrete distribution, which they then draw from to choose a rating. We observe the ratings, and we need to find the book’s underlying Dirichlet. We start by assigning it some default Dirichlet (the prior) and update that Dirichlet as we observe new ratings. Some other assumptions:

  1. Books are indistinguishable except in the parameters of their Dirichlet distribution.
  2. The parameters of a book’s Dirichlet distribution don’t change over time, and inherently reflect the book’s value.

So a Dirichlet distribution is a process that produces discrete distributions. For simplicity, in this post we will say a Dirichlet distribution is parameterized by a list of six integers (n_0, \dots, n_5), one for each possible star rating. These values represent our belief in the “typical” distribution of votes for a new book. We’ll discuss more about how to set the values later. Sampling a value (a book’s list of probabilities) from the Dirichlet distribution is not trivial, but we don’t need to do that for this program. Rather, we need to be able to interpret a fixed Dirichlet distribution, and update it given some observed votes.

The interpretation we use for a Dirichlet distribution is its expected value, which, recall, is the parameters of a discrete distribution. In particular if n = \sum_i n_i, then the expected value is a discrete distribution whose probabilities are

\displaystyle \left (  \frac{n_0}{n}, \frac{n_1}{n}, \dots, \frac{n_5}{n} \right )

So you can think of each integer in the specification of a Dirichlet as “ghost ratings,” sometimes called pseudocounts, and we’re saying the probability is proportional to the count.

This is great, because if we knew the true Dirichlet distribution for a book, we could compute its ranking without a second thought. The ranking would simply be the expected star rating:

def simple_score(distribution):
   return sum(i * p for (i, p) in enumerate(distribution))

Putting books with the highest score on top would maximize the expected happiness of a user visiting the site, provided that happiness matches the user’s voting behavior, since the simple_score is just the expected vote.

Also note that all the rating system needs to make this work is that the rating options are linearly ordered. So a thumbs up/down (heaving bosom/flaccid member?) would work, too. We don’t need to know how happy it makes them to see a 5-star vs 4-star book. However, because as we’ll see next we have to approximate the distribution, and hence have uncertainty for scores of books with only a few ratings, it helps to incorporate numerical utility values (we’ll see this at the end).

Next, to update a given Dirichlet distribution with the results of some observed ratings, we have to dig a bit deeper into Bayes rule and the formulas for sampling from a Dirichlet distribution. Rather than do that, I’ll point you to this nice writeup by Jonathan Huang, where the core of the derivation is in Section 2.3 (page 4), and remark that the rule for updating for a new observation is to just add it to the existing counts.

Theorem: Given a Dirichlet distribution with parameters (n_1, \dots, n_k) and a new observation of outcome i, the updated Dirichlet distribution has parameters (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_k). That is, you just update the i-th entry by adding 1 to it.

This particular arithmetic to do the update is a mathematical consequence (derived in the link above) of the philosophical assumption that Bayes rule is how you should model your beliefs about uncertainty, coupled with the assumption that the Dirichlet process is how the users actually arrive at their votes.

The initial values (n_0, \dots, n_5) for star ratings should be picked so that they represent the average rating distribution among all prior books, since this is used as the default voting distribution for a new, unknown book. If you have more information about whether a book is likely to be popular, you can use a different prior. For example, if JK Rowling wrote a Harry Potter Romance novel that was part of the canon, you could pretty much guarantee it would be popular, and set n_5 high compared to n_0. Of course, if it were actually popular you could just wait for the good ratings to stream in, so tinkering with these values on a per-book basis might not help much. On the other hand, most books by unknown authors are bad, and n_5 should be close to zero. Selecting a prior dictates how influential ratings of new items are compared to ratings of items with many votes. The more pseudocounts you add to the prior, the less new votes count.

This gets us to the following code for star ratings.

def score(self, ratings, rating_prior):
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    score = sum(i * u for (i, u) in enumerate(ratings))
    return score / sum(ratings)

The only thing missing from the solution at the beginning is the utilities. The utilities are useful for two reasons. First, because books with few ratings encode a lot of uncertainty, having an idea about how extreme a feeling is implied by a specific rating allows one to give better rankings of new books.

Second, for many services, such as taxi rides on Lyft, the default star rating tends to be a 5-star, and 4-star or lower mean something went wrong. For books, 3-4 stars is a default while 5-star means you were very happy.

The utilities parameter allows you to weight rating outcomes appropriately. So if you are in a Lyft-like scenario, you might specify utilities like [-10, -5, -3, -2, 1] to denote that a 4-star rating has the same negative impact as two 5-star ratings would positively contribute. On the other hand, for books the gap between 4-star and 5-star is much less than the gap between 3-star and 4-star. The utilities simply allow you to calibrate how the votes should be valued in comparison to each other, instead of using their literal star counts.

The Blum-Blum-Shub Pseudorandom Generator

Problem: Design a random number generator that is computationally indistinguishable from a truly random number generator.

Solution (in Python): note this solution uses the Miller-Rabin primality tester, though any primality test will do. See the github repository for the referenced implementation.

from randomized.primality import probablyPrime
import random

def goodPrime(p):
    return p % 4 == 3 and probablyPrime(p, accuracy=100)

def findGoodPrime(numBits=512):
    candidate = 1
    while not goodPrime(candidate):
        candidate = random.getrandbits(numBits)
    return candidate

def makeModulus():
    return findGoodPrime() * findGoodPrime()

def parity(n):
    return sum(int(x) for x in bin(n)[2:]) % 2

class BlumBlumShub(object):
    def __init__(self, seed=None):
        self.modulus = makeModulus()
        self.state = seed if seed is not None else random.randint(2, self.modulus - 1)
        self.state = self.state % self.modulus

    def seed(self, seed):
        self.state = seed

    def bitstream(self):
        while True:
            yield parity(self.state)
            self.state = pow(self.state, 2, self.modulus)

    def bits(self, n=20):
        outputBits = ''
        for bit in self.bitstream():
            outputBits += str(bit)
            if len(outputBits) == n:
                break

        return outputBits

Discussion:

An integer x is called a quadratic residue of another integer N if it can be written as x = a^2 \mod N for some a. That is, if it’s the remainder when dividing a perfect square by N. Some numbers, like N=8, have very special patterns in their quadratic residues, only 0, 1, and 4 can occur as quadratic residues.

The core idea behind this random number generator is that, for a specially chosen modulus N, telling whether a number x is a quadratic residue mod N is hard. In fact, one can directly convert an algorithm that can predict the next bit of this random number generator (by even a slight edge) into an arbitrarily accurate quadratic-residue-decider. So if computing quadratic residues is even mildly hard, then predicting the next bit in this random number generator is very hard.

More specifically, the conjectured guarantee about this random number generator is the following: if you present a polynomial time adversary with two sequences:

  1. A truly random sequence of bits of length k,
  2. k bits from the output of the pseudorandom generator when seeded with a starting state shorter than k bits.

Then the adversary can’t distinguish between the two sequences with probability “significantly” more than 1/2, where by “significantly” I mean 1/k^c for any c>0 (i.e., the edge over randomness vanishes faster than any inverse polynomial). It turns out, due to a theorem of Yao, that this is equivalent to not being able to guess the next bit in a pseudorandom sequence with a significant edge over a random guess, even when given the previous \log(N)^{10} bits in the sequence (or any \textup{poly}(\log N) bits in the sequence).

This emphasizes a deep philosophical viewpoint in theoretical computer science, that whether some object has a property (randomness) really only depends on the power of a computationally limited observer to identify that property. If nobody can tell the difference between fake randomness and real randomness, then the fake randomness is random. Offhand I wonder whether you can meaningfully apply this view to less mathematical concepts like happiness and status.

Anyway, the modulus N is chosen in such a way that every quadratic residue of N has a unique square root which is also a quadratic residue. This makes the squaring function a bijection on quadratic residues. In other words, with a suitably chosen N, there’s no chance that we’ll end up with N=8 where there are very few quadratic residues and the numbers output by the Blum-Blum-Shub generator have a short cycle. Moreover, the assumption that detecting quadratic residues mod N is hard makes the squaring function a one-way permutation.

Here’s an example of how this generator might be used:

generator = BlumBlumShub()

hist = [0] * 2**6
for i in range(10000):
    value = int(generator.bits(6), 2)
    hist[value] += 1

print(hist)

This produces random integers between 0 and 64, with the following histogram:

bbs-hist

See these notes of Junod for a detailed exposition of the number theory behind this random number generator, with full definitions and proofs.

Learning a single-variable polynomial, or the power of adaptive queries

Problem: Alice chooses a secret polynomial p(x) with nonnegative integer coefficients. Bob wants to discover this polynomial by querying Alice for the value of p(x) for some integer x of Bob’s choice. What is the minimal number of queries Bob needs to determine p(x) exactly?

Solution: Two queries. The first is p(1), and if we call N = p(1) + 1, then the second query is p(N).

To someone who is familiar with polynomials, this may seem shocking, and I’ll explain why it works in a second. After all, it’s very easy to prove that if Bob gives Alice all of his queries at the same time (if the queries are not adaptive), then it’s impossible to discover what p(x) is using fewer than \textup{deg}(p) + 1 queries. This is due to a fact called polynomial interpolation, which we’ve seen on this blog before in the context of secret sharing. Specifically, there is a unique single-variable degree d polynomial passing through d+1 points (with distinct x-values). So if you knew the degree of p, you could determine it easily. But Bob doesn’t know the degree of the polynomial, and there’s no way he can figure it out without adaptive queries! Indeed, if Bob tries and gives a set of d queries, Alice could have easily picked a polynomial of degree d+1. So it’s literally impossible to solve this problem without adaptive queries.

The lovely fact is that once you allow adaptiveness, the number of queries you need doesn’t even depend on the degree of the secret polynomial!

Okay let’s get to the solution. It was crucial that our polynomial had nonnegative integer coefficients, because we’re going to do a tiny bit of number theory. Let p(x) = a_0 + a_1 x + \dots + a_d x^d. First, note that p(1) is exactly the sum of the coefficients \sum_i a_i, and in particular p(1) + 1 is larger than any single coefficient. So call this N, and query p(N). This gives us a number y_0 of the form

\displaystyle y_0 = a_0 + a_1N + a_2N^2 + \dots + a_dN^d

And because N is so big, we can compute a_0 easily by computing y_0 \mod N. Now set y_1 = (y_0 - a_0) / N, and this has the form a_1 + a_2N + \dots + a_dN^{d-1}. We can compute modulus again to get a_1, and repeat until we have all the coefficients. We’ll stop once we get a y_i that is zero.

[Addendum 2018-02-14: implementation on github]

As a small technical note, this is a polynomial-time algorithm in the number of bits needed to write down p(x). So this demonstrates the power of adaptive queries: we get from something which is uncomputable with any number of queries to something which is efficiently computable with a constant number of queries.

The obvious follow-up question is: can you come up with an efficient algorithm if we allow the coefficients to be negative integers?