“Practical Math” Preview: Collect Sensitive Survey Responses Privately

This is a draft of a chapter from my in-progress book, Practical Math for Programmers: A Tour of Mathematics in Production Software.

Tip: Determine an aggregate statistic about a sensitive question, when survey respondents do not trust that their responses will be kept secret.

Solution:

import random

def respond_privately(true_answer: bool) -> bool:
    '''Respond to a survey with plausible deniability about your answer.'''
    be_honest = random.random() < 0.5
    random_answer = random.random() < 0.5
    return true_answer if be_honest else random_answer

def aggregate_responses(responses: List[bool]) -> Tuple[float, float]:
    '''Return the estimated fraction of survey respondents that have a truthful
    Yes answer to the survey question.
    '''
    yes_response_count = sum(responses)
    n = len(responses)
    mean = 2 * yes_response_count / n - 0.5
    # Use n-1 when estimating variance, as per Bessel's correction.
    variance = 3 / (4 * (n - 1))
    return (mean, variance)

In the late 1960’s, most abortions were illegal in the United States. Daniel G. Horvitz, a statistician at The Research Triangle Institute in North Carolina and a leader in survey design for social sciences, was tasked with estimating how many women in North Carolina were receiving illegal abortions. The goal was to inform state and federal policymakers about the statistics around abortions, many of which were unreported, even when done legally.

The obstacles were obvious. As Horvitz put it, “a prudent woman would not divulge to a stranger the fact that she was party to a crime for which she could be prosecuted.” [Abernathy70] This resulted in a strong bias in survey responses. Similar issues had plagued surveys of illegal activity of all kinds, including drug abuse and violent crime. Lack of awareness into basic statistics about illegal behavior led to a variety of misconceptions, such as that abortions were not frequently sought out.

Horvitz worked with biostatisticians James Abernathy and Bernard Greenberg to test out a new method to overcome this obstacle, without violating the respondent’s privacy or ability to plausibly deny illegal behavior. The method, called randomized response, was invented by Stanley Warner in 1965, just a few years earlier. [Warner65] Warner’s method was a bit different from what we present in this Tip, but both Warner’s method and the code sample above use the same strategy of adding randomization to the survey.

The mechanism, as presented in the code above, requires respondents to start by flipping a coin. If heads, they answer the sensitive question truthfully. If tails, they flip a second coin to determine how to answer the question—heads resulting in a “yes” answer, tails in a “no” answer. Naturally, the coin flips are private and controlled by the respondent. And so if a respondent answers “Yes” to the question, they may plausibly claim the “Yes” was determined by the coin, preserving their privacy. The figure below describes this process as a diagram.

A branching diagram showing the process a survey respondent takes to record their response.

Another way to describe the outcome is to say that each respondent’s answer is a single bit of information that is flipped with probability 1/4. This is half way between two extremes on the privacy/accuracy tradeoff curve. The first extreme is a “perfectly honest” response, where the bit is never flipped and all information is preserved. The second extreme has the bit flipped with probability 1/2, which is equivalent to ignoring the question and choosing your answer completely at random, losing all information in the aggregate responses. In this perspective, the aggregate survey responses can be thought of as a digital signal, and the privacy mechanism adds noise to that signal.

It remains to determine how to recover the aggregate signal from these noisy responses. In other words, the surveyor cannot know any individual’s true answer, but they can, with some extra work, estimate statistics about the underlying population by correcting for the statistical bias. This is possible because the randomization is well understood. The expected fraction of “Yes” answers can be written as a function of the true fraction of “Yes” answers, and hence the true fraction can be solved for. In this case, where the random coin is fair, that formula is as follows (where $ \mathbf{P}$ stands for “the probability of”).

$ \displaystyle \mathbf{P}(\textup{Yes answer}) = \frac{1}{2} \mathbf{P}(\textup{Truthful yes answer}) + \frac{1}{4}$

And so we solve for $ \mathbf{P}(\textup{Truthful yes answer})$

$ \displaystyle \mathbf{P}(\textup{Truthful yes answer}) = 2 \mathbf{P}(\textup{Yes answer}) – \frac{1}{2}$

We can replace the true probability $ \mathbf{P}(\textup{Yes answer})$ above with our fraction of “Yes” responses from the survey, and the result is an estimate $ \hat{p}$ of $ \mathbf{P}(\textup{Truthful yes answer})$. This estimate is unbiased, but has additional variance—beyond the usual variance caused by picking a finite random sample from the population of interest—introduced by the randomization mechanism.

With a bit of effort, one can calculate that the variance of the estimate is

$ \displaystyle \textup{Var}(\hat{p}) = \frac{3}{4n}$

And via Chebyshev’s inequality, which bounds the likelihood that an estimator is far away from its expectation, we can craft a confidence interval and determine the needed sample sizes. Specifically, the estimate $ \hat{p}$ has additive error at most $ q$ with probability at most $ \textup{Var}(\hat{p}) / q^2$. This implies that for a confidence of $ 1-c$, one requires at least $ n \geq 3 / (4 c q^2)$ samples. For example, to achieve error 0.01 with 90 percent confidence ($ c=0.1$), one requires 7,500 responses.

Horvitz’s randomization mechanism didn’t use coin flips. Instead they used an opaque box with red or blue colored balls which the respondent, who was in the same room as the surveyor, would shake and privately reveal a random color through a small window facing away from the surveyor. The statistical principle is the same. Horvitz and his associates surveyed the women about their opinions of the privacy protections of this mechanism. When asked whether their friends would answer a direct question about abortion honestly, over 80% either believed their friends would lie, or were unsure. [footnote: A common trick in survey methodology when asking someone if they would be dishonest is to instead ask if their friends would be dishonest. This tends to elicit more honesty, because people are less likely to uphold a false perception of the moral integrity of others, and people also don’t realize that their opinion of their friends correlates with their own personal behavior and attitudes. In other words, liars don’t admit to lying, but they think lying is much more common than it really is.] But 60% were convinced there was no trick involved in the randomization, while 20% were unsure and 20% thought there was a trick. This suggests many people were convinced that Horvitz’s randomization mechanism provided the needed safety guarantees to answer honestly.

Horvitz’s survey was a resounding success, both for randomized response as a method and for measuring abortion prevalence. [Abernathy70] They estimated the abortion rate at about 22 per 100 conceptions, with a distinct racial bias—minorities were twice as likely as whites to receive an abortion. Comparing their findings to a prior nationwide study from 1955—the so-called Arden House estimate—which gave a range of between 200,000 and 1.2 million abortions per year, Horvitz’s team estimated more precisely that there were 699,000 abortions in 1955 in the United States, with a reported standard deviation of about 6,000, less than one percent. For 1967, the year of their study, they estimated 829,000.

Their estimate was referenced widely in the flurry of abortion law and court cases that followed due to a surging public interest in the topic. For example, it is cited in the 1970 California Supreme Court opinion for the case Ballard v. Anderson, which concerned whether a minor needs parental consent to receive an otherwise legal abortion. [Ballard71, Roemer71] It was also cited in amici curiae briefs submitted to the United States Supreme Court in 1971 for Roe v. Wade, the famous case that invalidated most U.S. laws making abortion illegal. One such brief was filed jointly by the country’s leading women’s rights organizations like the National Organization for Women. Citing Horvitz for this paragraph, it wrote, [Womens71]

While the realities of law enforcement, social and public health problems posed by abortion laws have been openly discussed […] only within a period of not more than the last ten years, one fact appears undeniable, although unverifiable statistically. There are at least one million illegal abortions in the United States each year. Indeed, studies indicate that, if the local law still has qualifying requirements, the relaxation in the law has not diminished to any substantial extent the numbers in which women procure illegal abortions.

It’s unclear how the authors got this one million number (Horvitz’s estimate was 20% less for 1967), nor what they meant by “unverifiable statistically.” It may have been a misinterpretation of the randomized response technique. In any event, randomized response played a crucial role in providing a foundation for political debate.

Despite Horvitz’s success, and decades of additional research on crime, drug use, and other sensitive topics, randomized response mechanisms have been applied poorly. In some cases, the desired randomization is inextricably complex, such as when requiring a continuous random number. In these cases, a manual randomization mechanism is too complex for a respondent to use accurately. Trying to use software-assisted devices can help, but can also produce mistrust in the interviewee. See [Rueda16] for additional discussion of these pitfalls and what software packages exist for assisting in using randomized response. See [Fox16] for an analysis of the statistical differences between the variety of methods used between 1970 and 2010.

In other contexts, analogues to randomized response may not elicit the intended effect. In the 1950’s, Utah used death by firing squad as capital punishment. To avoid a guilty conscience of the shooters, one of five marksmen was randomly given a blank, providing him some plausible deniability that he knew he had delivered the killing shot. However, this approach failed on two counts. First, once a shot was fired the marksman could tell whether the bullet was real based on the recoil. Second, a 20% chance of a blank was not enough to dissuade a guilty marksman from purposely missing. In the 1951 execution of Elisio Mares, all four real bullets missed the condemned man’s heart, hitting his chest, stomach, and hip. He died, but it was neither painless nor instant.

Of many lessons one might draw from the botched execution, one is that randomization mechanisms must take into account both the psychology of the participants as well as the severity of a failed outcome.

References

@book{Fox16,
  title = {{Randomized Response and Related Methods: Surveying Sensitive Data}},
  author = {James Alan Fox},
  edition = {2nd},
  year = {2016},
  doi = {10.4135/9781506300122},
}

@article{Abernathy70,
  author = {Abernathy, James R. and Greenberg, Bernard G. and Horvitz, Daniel G.
            },
  title = {{Estimates of induced abortion in urban North Carolina}},
  journal = {Demography},
  volume = {7},
  number = {1},
  pages = {19-29},
  year = {1970},
  month = {02},
  issn = {0070-3370},
  doi = {10.2307/2060019},
  url = {https://doi.org/10.2307/2060019},
}

@article{Warner65,
  author = {Stanley L. Warner},
  journal = {Journal of the American Statistical Association},
  number = {309},
  pages = {63--69},
  publisher = {{American Statistical Association, Taylor \& Francis, Ltd.}},
  title = {Randomized Response: A Survey Technique for Eliminating Evasive
           Answer Bias},
  volume = {60},
  year = {1965},
}

@article{Ballard71,
  title = {{Ballard v. Anderson}},
  journal = {California Supreme Court L.A. 29834},
  year = {1971},
  url = {https://caselaw.findlaw.com/ca-supreme-court/1826726.html},
}

@misc{Womens71,
  title = {{Motion for Leave to File Brief Amici Curiae on Behalf of Women’s
           Organizations and Named Women in Support of Appellants in Each Case,
           and Brief Amici Curiae.}},
  booktitle = {{Appellate Briefs for the case of Roe v. Wade}},
  number = {WL 128048},
  year = {1971},
  publisher = {Supreme Court of the United States},
}

@article{Roemer71,
  author = {R. Roemer},
  journal = {Am J Public Health},
  pages = {500--509},
  title = {Abortion law reform and repeal: legislative and judicial developments
           },
  volume = {61},
  number = {3},
  year = {1971},
}

@incollection{Rueda16,
  title = {Chapter 10 - Software for Randomized Response Techniques},
  editor = {Arijit Chaudhuri and Tasos C. Christofides and C.R. Rao},
  series = {Handbook of Statistics},
  publisher = {Elsevier},
  volume = {34},
  pages = {155-167},
  year = {2016},
  booktitle = {Data Gathering, Analysis and Protection of Privacy Through
               Randomized Response Techniques: Qualitative and Quantitative Human
               Traits},
  doi = {https://doi.org/10.1016/bs.host.2016.01.009},
  author = {M. Rueda and B. Cobo and A. Arcos and R. Arnab},
}

One definition of algorithmic fairness: statistical parity

If you haven’t read the first post on fairness, I suggest you go back and read it because it motivates why we’re talking about fairness for algorithms in the first place. In this post I’ll describe one of the existing mathematical definitions of “fairness,” its origin, and discuss its strengths and shortcomings.

Before jumping in I should remark that nobody has found a definition which is widely agreed as a good definition of fairness in the same way we have for, say, the security of a random number generator. So this post is intended to be exploratory rather than dictating The Facts. Rather, it’s an idea with some good intuitive roots which may or may not stand up to full mathematical scrutiny.

Statistical parity

Here is one way to define fairness.

Your population is a set $ X$ and there is some known subset $ S \subset X$ that is a “protected” subset of the population. For discussion we’ll say $ X$ is people and $ S$ is people who dye their hair teal. We are afraid that banks give fewer loans to the teals because of hair-colorism, despite teal-haired people being just as creditworthy as the general population on average.

Now we assume that there is some distribution $ D$ over $ X$ which represents the probability that any individual will be drawn for evaluation. In other words, some people will just have no reason to apply for a loan (maybe they’re filthy rich, or don’t like homes, cars, or expensive colleges), and so $ D$ takes that into account. Generally we impose no restrictions on $ D$, and the definition of fairness will have to work no matter what $ D$ is.

Now suppose we have a (possibly randomized) classifier $ h:X \to \{-1,1\}$ giving labels to $ X$. When given a person $ x$ as input $ h(x)=1$ if $ x$ gets a loan and $ -1$ otherwise. The bias, or statistical imparity, of $ h$ on $ S$ with respect to $ X,D$ is the following quantity. In words, it is the difference between the probability that a random individual drawn from $ S$ is labeled 1 and the probability that a random individual from the complement $ S^C$ is labeled 1.

$ \textup{bias}_h(X,S,D) = \Pr[h(x) = 1 | x \in S^{C}] – \Pr[h(x) = 1 | x \in S]$

The probability is taken both over the distribution $ D$ and the random choices made by the algorithm. This is the statistical equivalent of the legal doctrine of adverse impact. It measures the difference that the majority and protected classes get a particular outcome. When that difference is small, the classifier is said to have “statistical parity,” i.e. to conform to this notion of fairness.

Definition: A hypothesis $ h:X \to \{-1,1\}$ is said to have statistical parity on $ D$ with respect to $ S$ up to bias $ \varepsilon$ if $ |\textup{bias}_h(X,S,D)| < \varepsilon$.

So if a hypothesis achieves statistical parity, then it treats the general population statistically similarly to the protected class. So if 30% of normal-hair-colored people get loans, statistical parity requires roughly 30% of teals to also get loans.

It’s pretty simple to write a program to compute the bias. First we’ll write a function that computes the bias of a given set of labels. We’ll determine whether a data point $ x \in X$ is in the protected class by specifying a specific value of a specific index. I.e., we’re assuming the feature selection has already happened by this point.

# labelBias: [[float]], [int], int, obj -&gt; float
# compute the signed bias of a set of labels on a given dataset
def labelBias(data, labels, protectedIndex, protectedValue):   
   protectedClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] == protectedValue]   
   elseClass = [(x,l) for (x,l) in zip(data, labels) 
      if x[protectedIndex] != protectedValue]

   if len(protectedClass) == 0 or len(elseClass) == 0:
      raise Exception(&quot;One of the classes is empty!&quot;)
   else:
      protectedProb = sum(1 for (x,l) in protectedClass if l == 1) / len(protectedClass)
      elseProb = sum(1 for (x,l) in elseClass  if l == 1) / len(elseClass)

   return elseProb - protectedProb

Then generalizing this to an input hypothesis is a one-liner.

# signedBias: [[float]], int, obj, h -&gt; float
# compute the signed bias of a hypothesis on a given dataset
def signedBias(data, h, protectedIndex, protectedValue):
   return labelBias(pts, [h(x) for x in pts], protectedIndex, protectedValue)

Now we can load the census data from the UCI machine learning repository and compute some biases in the labels. The data points in this dataset correspond to demographic features of people from a census survey, and the labels are +1 if the individual’s salary is at least 50k, and -1 otherwise. I wrote some helpers to load the data from a file (which you can see in this post’s Github repo).

if __name__ == &quot;__main__&quot;:
   from data import adult
   train, test = adult.load(separatePointsAndLabels=True)

   # [(test name, (index, value))]
   tests = [('gender', (1,0)), 
            ('private employment', (2,1)), 
            ('asian race', (33,1)),
            ('divorced', (12, 1))]

   for (name, (index, value)) in tests:
      print(&quot;'%s' bias in training data: %.4f&quot; %
         (name, labelBias(train[0], train[1], index, value)))

(I chose ‘asian race’ instead of just ‘asian’ because there are various ‘country of origin’ features that are for countries in Asia.)

Running this gives the following.

anti-'female' bias in training data: 0.1963
anti-'private employment' bias in training data: 0.0731
anti-'asian race' bias in training data: -0.0256
anti-'divorced' bias in training data: 0.1582

Here a positive value means it’s biased against the quoted thing, a negative value means it’s biased in favor of the quoted thing.

Now let me define a stupidly trivial classifier that predicts 1 if the country of origin is India and zero otherwise. If I do this and compute the gender bias of this classifier on the training data I get the following.

&gt;&gt;&gt; indian = lambda x: x[47] == 1
&gt;&gt;&gt; len([x for x in train[0] if indian(x)]) / len(train[0]) # fraction of Indians
0.0030711587481956942
&gt;&gt;&gt; signedBias(train[0], indian, 1, 0)
0.0030631816119030884

So this says that predicting based on being of Indian origin (which probably has very low accuracy, since many non-Indians make at least $50k) does not bias significantly with respect to gender.

We can generalize statistical parity in various ways, such as using some other specified set $ T$ in place of $ S^C$, or looking at discrepancies among $ k$ different sub-populations or with $ m$ different outcome labels. In fact, the mathematical name for this measurement (which is a measurement of a set of distributions) is called the total variation distance. The form we sketched here is a simple case that just works for the binary-label two-class scenario.

Now it is important to note that statistical parity says nothing about the truth about the protected class $ S$. I mean two things by this. First, you could have some historical data you want to train a classifier $ h$ on, and usually you’ll be given training labels for the data that tell you whether $ h(x)$ should be $ 1$ or $ -1$. In the absence of discrimination, getting high accuracy with respect to the training data is enough. But if there is some historical discrimination against $ S$ then the training labels are not trustworthy. As a consequence, achieving statistical parity for $ S$ necessarily reduces the accuracy of $ h$. In other words, when there is bias in the data accuracy is measured in favor of encoding the bias. Studying fairness from this perspective means you study the tradeoff between high accuracy and low statistical disparity. However, and this is why statistical parity says nothing about whether the individuals $ h$ behaves differently on (differently compared to the training labels) were the correct individuals to behave differently on. If the labels alone are all we have to work with, and we don’t know the true labels, then we’d need to apply domain-specific knowledge, which is suddenly out of scope of machine learning.

Second, nothing says optimizing for statistical parity is the correct thing to do. In other words, it may be that teal-haired people are truly less creditworthy (jokingly, maybe there is a hidden innate characteristic causing both uncreditworthiness and a desire to dye your hair!) and by enforcing statistical parity you are going against a fact of Nature. Though there are serious repercussions for suggesting such things in real life, my point is that statistical parity does not address anything outside the desire for an algorithm to exhibit a certain behavior. The obvious counterargument is that if, as a society, we have decided that teal-hairedness should be protected by law regardless of Nature, then we’re defining statistical parity to be correct. We’re changing our optimization criterion and as algorithm designers we don’t care about anything else. We care about what guarantees we can prove about algorithms, and the utility of the results.

The third side of the coin is that if all we care about is statistical parity, then we’ll have a narrow criterion for success that can be gamed by an actively biased adversary.

Statistical parity versus targeted bias

Statistical parity has some known pitfalls. In their paper “Fairness Through Awareness” (Section 3.1 and Appendix A), Dwork, et al. argue convincingly that these are primarily issues of individual fairness and targeted discrimination. They give six examples of “evils” including a few that maintain statistical parity while not being fair from the perspective of an individual. Here are my two favorite ones to think about (using teal-haired people and loans again):

  1. Self-fulfilling prophecy: The bank intentionally gives a few loans to teal-haired people who are (for unrelated reasons) obviously uncreditworthy, so that in the future they can point to these examples to justify discriminating against teals. This can appear even if the teals are chosen uniformly at random, since the average creditworthiness of a random teal-haired person is lower than a carefully chosen normal-haired person.
  2. Reverse tokenism: The bank intentionally does not give loans to some highly creditworthy normal-haired people, let’s call one Martha, so that when a teal complains that they are denied a loan, the bank can point to Martha and say, “Look how qualified she is, and we didn’t even give her a loan! You’re much less qualified.” Here Martha is the “token” example used to justify discrimination against teals.

I like these two examples for two reasons. First, they illustrate how hard coming up with a good definition is: it’s not clear how to encapsulate both statistical parity and resistance to this kind of targeted discrimination. Second, they highlight that discrimination can both be unintentional and intentional. Since computer scientists tend to work with worst-case guarantees, this makes we think the right definition will be resilient to some level of adversarial discrimination. But again, these two examples are not formalized, and it’s not even clear to what extent existing algorithms suffer from manipulations of these kinds. For instance, many learning algorithms are relatively resilient to changing the desired label of a single point.

In any case, the thing to take away from this discussion is that there is not yet an accepted definition of “fairness,” and there seems to be a disconnect between what it means to be fair for an individual versus a population. There are some other proposals in the literature, and I’ll just mention one: Dwork et al. propose that individual fairness mean that “similar individuals are treated similarly.” I will cover this notion (and what’s know about it) in a future post.

Until then!