“Practical Math” Preview: Collect Sensitive Survey Responses Privately

This is a draft of a chapter from my in-progress book, Practical Math for Programmers: A Tour of Mathematics in Production Software.

Tip: Determine an aggregate statistic about a sensitive question, when survey respondents do not trust that their responses will be kept secret.

Solution:

import random
from typing import List, Tuple

def respond_privately(true_answer: bool) -> bool:
    '''Respond to a survey with plausible deniability about your answer.'''
    be_honest = random.random() < 0.5
    random_answer = random.random() < 0.5
    return true_answer if be_honest else random_answer

def aggregate_responses(responses: List[bool]) -> Tuple[float, float]:
    '''Return the estimated fraction of survey respondents that have a truthful
    Yes answer to the survey question.
    '''
    yes_response_count = sum(responses)
    n = len(responses)
    mean = 2 * yes_response_count / n - 0.5
    # Use n-1 when estimating variance, as per Bessel's correction.
    variance = 3 / (4 * (n - 1))
    return (mean, variance)
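
As a quick sanity check of the Solution, a simulation along the following lines (the helper name and the true rate of 0.23 are made up for illustration, and the Solution code above, including its imports, is assumed to be in scope) should produce a mean estimate within a few percentage points of the true rate:

def simulate_survey(true_fraction: float, num_respondents: int) -> Tuple[float, float]:
    '''Simulate a survey in which true_fraction of respondents would truthfully
    answer Yes, passing every response through respond_privately.'''
    true_answers = [random.random() < true_fraction for _ in range(num_respondents)]
    responses = [respond_privately(answer) for answer in true_answers]
    return aggregate_responses(responses)

# With a true Yes rate of 0.23 and 10,000 respondents, the returned mean
# should typically land within two or three percentage points of 0.23.
print(simulate_survey(0.23, 10000))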

In the late 1960’s, most abortions were illegal in the United States. Daniel G. Horvitz, a statistician at The Research Triangle Institute in North Carolina and a leader in survey design for social sciences, was tasked with estimating how many women in North Carolina were receiving illegal abortions. The goal was to inform state and federal policymakers about the statistics around abortions, many of which were unreported, even when done legally.

The obstacles were obvious. As Horvitz put it, “a prudent woman would not divulge to a stranger the fact that she was party to a crime for which she could be prosecuted.” [Abernathy70] This resulted in a strong bias in survey responses. Similar issues had plagued surveys of illegal activity of all kinds, including drug abuse and violent crime. Lack of insight into basic statistics about illegal behavior led to a variety of misconceptions, such as the belief that abortions were rarely sought out.

Horvitz worked with biostatisticians James Abernathy and Bernard Greenberg to test out a new method to overcome this obstacle, without violating the respondent’s privacy or ability to plausibly deny illegal behavior. The method, called randomized response, was invented by Stanley Warner in 1965, just a few years earlier. [Warner65] Warner’s method was a bit different from what we present in this Tip, but both Warner’s method and the code sample above use the same strategy of adding randomization to the survey.

The mechanism, as presented in the code above, requires respondents to start by flipping a coin. If heads, they answer the sensitive question truthfully. If tails, they flip a second coin to determine how to answer the question—heads resulting in a “yes” answer, tails in a “no” answer. Naturally, the coin flips are private and controlled by the respondent. And so if a respondent answers “Yes” to the question, they may plausibly claim the “Yes” was determined by the coin, preserving their privacy. The figure below describes this process as a diagram.

A branching diagram showing the process a survey respondent takes to record their response.

Another way to describe the outcome is to say that each respondent’s answer is a single bit of information that is flipped with probability 1/4. This is halfway between two extremes on the privacy/accuracy tradeoff curve. The first extreme is a “perfectly honest” response, where the bit is never flipped and all information is preserved. The second extreme has the bit flipped with probability 1/2, which is equivalent to ignoring the question and choosing your answer completely at random, losing all information in the aggregate responses. In this perspective, the aggregate survey responses can be thought of as a digital signal, and the privacy mechanism adds noise to that signal.

It remains to determine how to recover the aggregate signal from these noisy responses. In other words, the surveyor cannot know any individual’s true answer, but they can, with some extra work, estimate statistics about the underlying population by correcting for the statistical bias. This is possible because the randomization is well understood. The expected fraction of “Yes” answers can be written as a function of the true fraction of “Yes” answers, and hence the true fraction can be solved for. In this case, where the random coin is fair, that formula is as follows (where $ \mathbf{P}$ stands for “the probability of”).

$ \displaystyle \mathbf{P}(\textup{Yes answer}) = \frac{1}{2} \mathbf{P}(\textup{Truthful yes answer}) + \frac{1}{4}$

And so we solve for $ \mathbf{P}(\textup{Truthful yes answer})$

$ \displaystyle \mathbf{P}(\textup{Truthful yes answer}) = 2 \mathbf{P}(\textup{Yes answer}) - \frac{1}{2}$

We can replace the true probability $ \mathbf{P}(\textup{Yes answer})$ above with our fraction of “Yes” responses from the survey, and the result is an estimate $ \hat{p}$ of $ \mathbf{P}(\textup{Truthful yes answer})$. This estimate is unbiased, but has additional variance—beyond the usual variance caused by picking a finite random sample from the population of interest—introduced by the randomization mechanism.

With a bit of effort, one can calculate that the variance of the estimate is

$ \displaystyle \textup{Var}(\hat{p}) = \frac{3}{4n}$

And via Chebyshev’s inequality, which bounds the likelihood that an estimator is far away from its expectation, we can craft a confidence interval and determine the needed sample sizes. Specifically, the estimate $ \hat{p}$ has additive error exceeding $ q$ with probability at most $ \textup{Var}(\hat{p}) / q^2$. This implies that for a confidence of $ 1-c$, one requires at least $ n \geq 3 / (4 c q^2)$ samples. For example, to achieve error 0.01 with 90 percent confidence ($ c=0.1$), one requires 75,000 responses.
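
A short helper makes that calculation concrete (the function name is ours, just for illustration):

def chebyshev_sample_size(error: float, confidence: float) -> float:
    '''Lower bound on the number of responses needed so that the estimate is
    within `error` of the truth with probability at least `confidence`,
    using Chebyshev's inequality with Var = 3 / (4n).'''
    c = 1 - confidence
    return 3 / (4 * c * error ** 2)

# Round up to get a whole number of respondents.
print(chebyshev_sample_size(error=0.01, confidence=0.9))  # about 75000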

Horvitz’s randomization mechanism didn’t use coin flips. Instead, the respondent, who was in the same room as the surveyor, would shake an opaque box of red and blue colored balls and privately observe a random color through a small window facing away from the surveyor. The statistical principle is the same. Horvitz and his associates surveyed the women about their opinions of the privacy protections of this mechanism. When asked whether their friends would answer a direct question about abortion honestly, over 80% either believed their friends would lie, or were unsure. [footnote: A common trick in survey methodology when asking someone if they would be dishonest is to instead ask if their friends would be dishonest. This tends to elicit more honesty, because people are less likely to uphold a false perception of the moral integrity of others, and people also don’t realize that their opinion of their friends correlates with their own personal behavior and attitudes. In other words, liars don’t admit to lying, but they think lying is much more common than it really is.] But 60% were convinced there was no trick involved in the randomization, while 20% were unsure and 20% thought there was a trick. This suggests many people were convinced that Horvitz’s randomization mechanism provided the needed safety guarantees to answer honestly.

Horvitz’s survey was a resounding success, both for randomized response as a method and for measuring abortion prevalence. [Abernathy70] They estimated the abortion rate at about 22 per 100 conceptions, with a distinct racial bias—minorities were twice as likely as whites to receive an abortion. Comparing their findings to a prior nationwide study from 1955—the so-called Arden House estimate—which gave a range of between 200,000 and 1.2 million abortions per year, Horvitz’s team estimated more precisely that there were 699,000 abortions in 1955 in the United States, with a reported standard deviation of about 6,000, less than one percent. For 1967, the year of their study, they estimated 829,000.

Their estimate was referenced widely in the flurry of abortion law and court cases that followed due to a surging public interest in the topic. For example, it is cited in the 1970 California Supreme Court opinion for the case Ballard v. Anderson, which concerned whether a minor needs parental consent to receive an otherwise legal abortion. [Ballard71, Roemer71] It was also cited in amici curiae briefs submitted to the United States Supreme Court in 1971 for Roe v. Wade, the famous case that invalidated most U.S. laws making abortion illegal. One such brief was filed jointly by the country’s leading women’s rights organizations like the National Organization for Women. Citing Horvitz for this paragraph, it wrote, [Womens71]

While the realities of law enforcement, social and public health problems posed by abortion laws have been openly discussed […] only within a period of not more than the last ten years, one fact appears undeniable, although unverifiable statistically. There are at least one million illegal abortions in the United States each year. Indeed, studies indicate that, if the local law still has qualifying requirements, the relaxation in the law has not diminished to any substantial extent the numbers in which women procure illegal abortions.

It’s unclear how the authors arrived at this one million figure (Horvitz’s estimate was about 20% lower for 1967), or what they meant by “unverifiable statistically.” It may have been a misinterpretation of the randomized response technique. In any event, randomized response played a crucial role in providing a foundation for political debate.

Despite Horvitz’s success, and decades of additional research on crime, drug use, and other sensitive topics, randomized response mechanisms have been applied poorly. In some cases, the desired randomization is inherently complex, such as when it requires a continuous random number. In these cases, a manual randomization mechanism is too complex for a respondent to use accurately. Software-assisted devices can help, but can also provoke mistrust in the interviewee. See [Rueda16] for additional discussion of these pitfalls and of the software packages that exist to assist in using randomized response. See [Fox16] for an analysis of the statistical differences among the variety of methods used between 1970 and 2010.

In other contexts, analogues to randomized response may not elicit the intended effect. In the 1950’s, Utah used death by firing squad as capital punishment. To spare the shooters a guilty conscience, one of five marksmen was randomly given a blank, providing each some plausible deniability about having delivered the killing shot. However, this approach failed on two counts. First, once a shot was fired the marksman could tell whether the bullet was real based on the recoil. Second, a 20% chance of a blank was not enough to dissuade a guilty marksman from purposely missing. In the 1951 execution of Elisio Mares, all four real bullets missed the condemned man’s heart, hitting his chest, stomach, and hip. He died, but it was neither painless nor instant.

Of many lessons one might draw from the botched execution, one is that randomization mechanisms must take into account both the psychology of the participants as well as the severity of a failed outcome.

References

@book{Fox16,
  title = {{Randomized Response and Related Methods: Surveying Sensitive Data}},
  author = {James Alan Fox},
  edition = {2nd},
  year = {2016},
  doi = {10.4135/9781506300122},
}

@article{Abernathy70,
  author = {Abernathy, James R. and Greenberg, Bernard G. and Horvitz, Daniel G.
            },
  title = {{Estimates of induced abortion in urban North Carolina}},
  journal = {Demography},
  volume = {7},
  number = {1},
  pages = {19-29},
  year = {1970},
  month = {02},
  issn = {0070-3370},
  doi = {10.2307/2060019},
  url = {https://doi.org/10.2307/2060019},
}

@article{Warner65,
  author = {Stanley L. Warner},
  journal = {Journal of the American Statistical Association},
  number = {309},
  pages = {63--69},
  publisher = {{American Statistical Association, Taylor \& Francis, Ltd.}},
  title = {Randomized Response: A Survey Technique for Eliminating Evasive
           Answer Bias},
  volume = {60},
  year = {1965},
}

@article{Ballard71,
  title = {{Ballard v. Anderson}},
  journal = {California Supreme Court L.A. 29834},
  year = {1971},
  url = {https://caselaw.findlaw.com/ca-supreme-court/1826726.html},
}

@misc{Womens71,
  title = {{Motion for Leave to File Brief Amici Curiae on Behalf of Women’s
           Organizations and Named Women in Support of Appellants in Each Case,
           and Brief Amici Curiae.}},
  booktitle = {{Appellate Briefs for the case of Roe v. Wade}},
  number = {WL 128048},
  year = {1971},
  publisher = {Supreme Court of the United States},
}

@article{Roemer71,
  author = {R. Roemer},
  journal = {Am J Public Health},
  pages = {500--509},
  title = {Abortion law reform and repeal: legislative and judicial developments
           },
  volume = {61},
  number = {3},
  year = {1971},
}

@incollection{Rueda16,
  title = {Chapter 10 - Software for Randomized Response Techniques},
  editor = {Arijit Chaudhuri and Tasos C. Christofides and C.R. Rao},
  series = {Handbook of Statistics},
  publisher = {Elsevier},
  volume = {34},
  pages = {155-167},
  year = {2016},
  booktitle = {Data Gathering, Analysis and Protection of Privacy Through
               Randomized Response Techniques: Qualitative and Quantitative Human
               Traits},
  doi = {10.1016/bs.host.2016.01.009},
  author = {M. Rueda and B. Cobo and A. Arcos and R. Arnab},
}

Why Theoretical Computer Scientists Aren’t Worried About Privacy

There has been a lot of news recently about government surveillance of citizens. The two biggest stories pervading my news feeds are the protests in Turkey, which have resulted in particular oppression of social media users, and the recent light shed on the US National Security Agency’s widespread “backdoor” into industry databases at Google, Verizon, Facebook, and others. It appears that the facts are in flux, as some companies have denied their involvement in this program, but regardless of the truth, the eye of the public has landed firmly on questions of privacy.

Barack Obama weighed in on the controversy as well, being quoted as saying,

You can’t have 100% security and 100% privacy, and also zero inconvenience.

I don’t know what balance the US government hopes to strike, but what I do know is that privacy and convenience are technologically possible, and we need not relinquish security to attain them.

Before I elaborate, let me get my personal beliefs out of the way. I consider the threat of terrorism low compared to the hundreds of other ways I can die. I should know, as I personally have been within an $ \varepsilon$ fraction of my life for all $ \varepsilon > 0$ (when I was seven I was hit by a bus, proclaimed dead, and revived). So I take traffic security much more seriously than terrorism, and the usual statistics will back me up in claiming one would be irrational to do otherwise. On the other hand, I also believe that I only need so much privacy. So I don’t mind making much of my personal information public, and I opt in to every one of Google’s tracking services in the hopes that my user experience can be improved. Indeed it has, as services like Google Now will, e.g., track my favorite bands for me based on my Google Play listening and purchasing habits, and alert me when there are concerts in my area. If only it could go one step further and alert me of trending topics in theoretical computer science! I have much more utility for timely knowledge of these sorts of things than I do for the privacy of my Facebook posts. Of course, ideologically I’m against violating privacy as a matter of policy, but this is a different matter. One can personally loathe a specific genre of music and still recognize its value and one’s right to enjoy it.

But putting my personal beliefs aside, I want to make it clear that there is no technological barrier to maintaining privacy and utility. This may sound shocking, but it rings true to the theoretical computer scientist. Researchers in cryptography have experienced this feeling many times, that their wildest cryptographic dreams are not only possible but feasible! Public-key encryption and digital signatures, secret sharing on a public channel, zero-knowledge verification, and many other protocols have been realized quite soon after being imagined. There are still some engineering barriers to implementing these technologies efficiently in large-scale systems, but with demand and a few years of focused work there is nothing stopping them from being used by the public. I want to use this short post to describe two of the more recent ideas that have pervaded the crypto community and provide references for further reading.

Differential Privacy and Fully Homomorphic Encryption

There are two facts which are well known in theoretical computer science that the general public is not aware of. The first is about the privacy of databases:

There is a way to mine information from a database without the ability to inspect individual entries in the database.

This is known as differential privacy. The second is no less magical:

There are secure encryption schemes which allow one to run programs on encrypted data and produce encrypted results, without the ability to decrypt the data.

This is known as fully homomorphic encryption. 

The implications of these two facts should be obvious: search engines need not know our queries but can still fetch us search results and mine our information to serve ads, Facebook need not have access to our personal data but may still accurately predict new friends, grocery stores can even know what products to place side by side without knowing what any individual customer has purchased. Banks could process our transactions without knowing the amounts involved, or even the parties involved. Perhaps most importantly, governments can have access to databases (in the form of differentially private queries) and mine for the existence of threats without violating any individual user’s privacy. If they get an indication of a terrorist threat, then they can use the usual channels (court orders) to get access to specific individual data.

It’s easy to argue that these techniques will never become mainstream enough for individuals to benefit from them. Indeed, we’ve had cryptography for many years, but few average users actively encrypt their communication because it is inconvenient. And then there are questions of policy: why would any company relinquish the ability to work directly with user data? And the cost of rearchitecting existing services to utilize these technologies would be enough to dissuade most business leaders.

But the point of all this is that these are problems of policy that could in principle be solved without waiting for governments and corporations to get their act together. With enough demand for such services and with enough technologically-minded entrepreneurs (I’m looking at you, Silicon Valley), it would be just a matter of time before the world was differentially private. Mathematics cannot be revoked or legislated away.

Fully Homomorphic Encryption

A fully homomorphic encryption scheme is a normal encryption scheme (two functions “enc” and “dec” to encrypt and decrypt) with one additional function, which we’ll call “eval.” Very roughly, eval accepts as input the text of a program and a ciphertext, and produces as output a ciphertext such that the following diagram commutes:

A commutative diagram for homomorphic evaluation: encrypting the message, running eval with a function $ f$, and then decrypting yields the same result as applying $ f$ to the message directly.

That is, $ m$ is our message, and $ \textup{eval}$ runs $ f$ on the encrypted version of our message. In practice this happens by lifting two operations, multiplication and addition, from plaintexts (which are usually number-representations of letters) to ciphertexts (again usually numbers). Once this is done one can simulate the functionality of an arbitrary circuit on the encrypted data without decrypting it. Those readers who have been following our category theory series will recognize these sorts of diagrams as being functorial. [Actually, at the time of this writing we have yet to look at functors, but we will soon!] So perhaps a better term would be “functorial encryption.”
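
To pin down exactly what the diagram promises, here is the property written as a one-line check; enc, dec, and eval_fn are placeholders for whatever concrete scheme you plug in, not a real implementation:

def eval_commutes(enc, dec, eval_fn, f, m):
    '''The defining property of the diagram: evaluating f on the ciphertext
    and then decrypting agrees with applying f to the plaintext directly.'''
    return dec(eval_fn(f, enc(m))) == f(m)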

I should emphasize: a fully homomorphic encryption scheme has the ability to run any computable function on the encrypted data. There is no loss of functionality in keeping the data private from whoever runs the program. The main use of this is to maintain privacy while deferring large computations to the cloud. We defer computations all the time, e.g. with every search query, and so do big websites like Reddit, which operates entirely on Amazon Web Services.

Fully homomorphic encryption was first envisaged by Rivest, Adleman (two of the inventors of RSA), and Dertouzos in the late seventies, mainly because the RSA encryption scheme is close to being homomorphic (one can multiply ciphertexts, but not add them). In 2009, Craig Gentry released the first working fully-homomorphic scheme based on the mathematical theory of ideal lattices, and later that year he (with a group of other researchers) came up with a second system that is arguably as simple as RSA; it operates on integers with modular arithmetic.
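
The sense in which RSA is “close” is easy to demonstrate: in unpadded (“textbook”) RSA, multiplying two ciphertexts gives a ciphertext of the product of the plaintexts, while adding ciphertexts corresponds to nothing meaningful. The toy key below is purely illustrative and wildly insecure.

p, q, e = 61, 53, 17
n = p * q                           # public modulus
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (requires Python 3.8+)

def enc(m): return pow(m, e, n)
def dec(c): return pow(c, d, n)

m1, m2 = 7, 12
assert dec(enc(m1) * enc(m2) % n) == (m1 * m2) % n   # multiplication carries over
assert dec(enc(m1) + enc(m2)) != m1 + m2             # addition does not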

Gentry has produced a lot of research since then in homomorphic encryption, but the interested reader should probably start with his tutorial paper describing his arithmetic-based system. From there, one can find existing implementations in Python (using Sage) and C++, both of which are freely available on Github.

Differential Privacy

The main idea of differential privacy is that one can add noise to statistical data to protect the identities of individual records. Slightly more rigorously, a randomized algorithm $ f$ is said to be $ \varepsilon$-differentially private if for all possible datasets (inputs) $ D_1, D_2$ which differ on a single record, and all possible collections of outputs $ y$ of $ f$, the probability that $ f(D_1)$ produces an output in $ y$ is not significantly different from the probability that $ f(D_2)$ does. In particular, the quotient of the two probabilities is at most $ e^{\varepsilon}$ (this choice of using $ e$ is arbitrary, but makes the analysis nicer).
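
As a minimal concrete example (not the only mechanism, and the function names are ours), the classic Laplace mechanism makes a counting query $ \varepsilon$-differentially private by adding noise scaled to the query’s sensitivity; a counting query has sensitivity 1, since adding or removing one record changes the count by at most 1.

import random

def laplace_noise(scale):
    '''A Laplace(0, scale) sample: the difference of two independent
    exponential samples with mean `scale` is Laplace-distributed.'''
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon):
    '''Answer "how many records satisfy predicate?" with epsilon-differential
    privacy by adding Laplace noise of scale 1/epsilon.'''
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# e.g., private_count(ages, lambda age: age > 40, epsilon=0.1)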

The motivation for differential privacy came from two notable events in which companies released “anonymized” data which was partially de-anonymized because it was too specific. The first was the million-dollar Netflix Prize contest to develop a better recommendation algorithm, and the second was the release of the Massachusetts Group Insurance Commission medical database. As such, many companies are very strict with how they handle their user data, and information sharing in the medical community is practically nonexistent.

There are many known differentially private algorithms, and they’re much stronger than one would imagine at first. One can run random forests of decision trees, network trace analysis, query-click analysis, certain forms of clustering, and a whole host of combinatorial optimization algorithms. For a gentle introduction to differential privacy, see Christine Task’s lecture video, a Practical Beginner’s Guide to Differential Privacy. There is also an influential survey by Cynthia Dwork of Microsoft Research. These go into much more detail about the abilities and inabilities of differential privacy than I could do here.

If there’s one thing to take away from this discussion, it’s that efficient protocols for ensuring privacy are out there waiting to be implemented in software. So while we complain and listen to others complain about governments violating our liberties (and indeed, this discussion is extremely important to have), let’s do a little mathematics, do a little computer science, and figure out how to make privacy the standard of measure in software.

Until next time!

Decision Trees and Political Party Classification

Last time we investigated the k-nearest-neighbors algorithm and the underlying idea that one can learn a classification rule by copying the known classification of nearby data points. This required that we view our data as sitting inside a metric space; that is, we imposed a kind of geometric structure on our data. One glaring problem is that there may be no reasonable way to do this. While we mentioned scaling issues and provided a number of possible metrics in our primer, a more common problem is that the data simply isn’t numeric.

For instance, a poll of US citizens might ask the respondent to select which of a number of issues he cares most about. There could be 50 choices, and there is no reasonable way to assign numerical values to these choices so that all are equidistant in the resulting metric space.

Another issue is that the quality of the data could be bad. For instance, there may be missing values for some attributes (e.g., a respondent may neglect to answer one or more questions). Alternatively, the attributes or the classification label could be wrong; that is, the data could exhibit noise. For k-nearest-neighbors, missing an attribute means we can’t (or can’t accurately) compute the distance function. And noisy data can interfere with our choice of $ k$. In particular, certain regions might be better with a smaller value of $ k$, while regions with noisier data might require a larger $ k$ to achieve the same accuracy rate.  Without making the algorithm sufficiently more complicated to vary $ k$ when necessary, our classification accuracy will suffer.

In this post we’ll see how decision trees can alleviate these issues, and we’ll test the decision tree on an imperfect data set of congressional voting records. We’ll implement the algorithm in Python, and test it on the problem of predicting the US political party affiliation of members of Congress based on their votes for a number of resolutions. As usual, we post the entire code and data set on this blog’s Github page.

Before going on, the reader is encouraged to read our primer on trees. We will assume familiarity with the terminology defined there.

Intuition

Imagine we have a data set where each record is a list of categorical weather conditions on a number of randomly selected days, and the labels correspond to whether a girl named Arya went for a horse ride on that day. Let’s also assume she would like to go for a ride every day, and the only thing that might prohibit her from doing so is adverse weather. In this case, the input variables will be the condition in the sky (sunny, cloudy, rainy, and snow), the temperature (cold, warm, and hot), the relative humidity (low, medium, and high), and the wind speed (low and high). The output variable will be whether Arya goes on a horse ride that day. Some entries in this data set might look like:

                 Arya's Riding Data
Sky     Temperature    Humidity    Wind    Horse Ride
Cloudy  Warm           Low         Low     Yes
Rainy   Cold           Medium      Low     No
Sunny   Warm           Medium      Low     Yes
Sunny   Hot            High        High    No
Snow    Cold           Low         High    No
Rainy   Warm           High        Low     Yes

In this case, one might reasonably guess that certain weather features are more important than others in determining whether Arya can go for a horse ride. For instance, the difference between sun and rain/snow should be a strong indicator, although it is not always correct in this data set. In other words, we’re looking for a weather feature that best separates the data into its respective classes. Of course, we’ll need a rigorous way to measure how good that separation is, but intuitively we can continue.

For example, we might split based on the wind speed feature. In this case, we have two smaller data sets corresponding to the entries where the wind is high and low. The corresponding table might look like:

     Arya's Riding Data, Wind = High
Sky     Temperature    Humidity    Horse Ride
Sunny   Hot            High        No
Snow    Cold           Low         No

     Arya's Riding Data, Wind = Low
Sky     Temperature    Humidity    Horse Ride
Cloudy  Warm           Low         Yes
Rainy   Cold           Medium      No
Sunny   Warm           Medium      Yes
Rainy   Warm           High        Yes

In this case, Arya is never known to ride a horse when the wind speed is high, and there is only one occasion when she doesn’t ride a horse and the wind speed is low. Taking this one step further, we can repeat the splitting process on the “Wind = Low” data in search of a complete split between the two output classes. We can see by visual inspection that the only “no” instance occurs when the temperature is cold. Hence, we should split on the temperature feature.

It is not useful to write out another set of tables (one feels the pain already when imagining a data set with a thousand entries), because in fact there is a better representation. The astute reader will have already recognized that our process of picking particular values for the weather features is just the process of traversing a tree.

Let’s investigate this idea closer. Imagine we have a tree where the root node corresponds to the Wind feature, and it has two edges connected to child nodes; one edge corresponds to the value “Low” and the other to “High.” That is, the process of traveling from the root to a child along an edge is the process of selecting only those data points whose “Wind” feature is that edge’s label. We can take the child corresponding to “Low” wind and have it represent the Temperature feature, further adding three child nodes with edges corresponding to the “Cold,” “Warm,” and “Hot” values.

We can stop this process once the choice of features completely splits our data set. Pictorially, our tree would look like this:
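
         Arya's Riding Decision Tree

                  Wind
                 /    \
             High      Low
              |          |
           All No   Temperature
                    /    |     \
                Cold   Warm     Hot
                 |       |       |
              All No  All Yes   None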

We reasonably decide to stop the traversal when all of the examples in the split are in the same class. Moreover, we would not want to include the option for the temperature to be Hot in the right subtree, because the data tells us nothing about such a scenario (as indicated by the “None” in the corresponding leaf).

Now that we have the data organized as a tree, we can try to classify new data with it. Suppose the new example is:

Sky     Temperature    Humidity    Wind    Horse Ride
Rainy   Cold           Low         Low     ?

We first inspect the wind speed feature, and seeing that it is “Low,” we follow the edge to the right subtree and repeat. Seeing that the temperature feature is “Cold,” we further descend down the “Cold” branch, reaching the “All No” leaf. Since this leaf corresponds to examples we’ve seen which are all in the “No” class, we should classify the new data as “No” as well.

Summarizing, given a new piece of data, we can traverse the tree according to the values of its features until we reach a leaf node. If the leaf node is “All No,” then we classify the new set of weather conditions as a “No,” and if it is “All Yes,” we classify as “Yes.”

Of course, this tree makes it clear that this toy data set is much too small to be useful, and the rules we’ve extrapolated from it are ridiculous. In particular, surely some people ride horses when the wind speed is high, and they would be unlikely to do so in a warm, low-wind thunderstorm. Nevertheless, we might expect a larger data set to yield a more sensible tree, as the data would more precisely reflect the true reasons one might refrain from riding a horse.

Before we generalize this example to any data set, we should note that there is yet another form for our classification rule. In particular, we can write the traversal from the root to the rightmost leaf as a boolean expression of the form:

$ \displaystyle \textup{Wind = “Low”} \wedge \textup{Temp = “Warm”}$

An example will be classified as “Yes” if and only if the wind feature is “Low” and the temperature feature is “Warm” (here the wedge symbol $ \wedge$ means “and,” and is called a conjunction). If we had multiple such routes leading to leaves labeled with “All Yes,” say a branch for wind being “High” and sky being “Sunny,” we could expand our expression as a disjunction (an “or,” denoted $ \vee$) of the two corresponding conjunctions as follows:

 $ \displaystyle (\textup{Wind = “Low”} \wedge \textup{Temp = “Warm”}) \vee (\textup{Wind = “High”} \wedge \textup{Sky = “Sunny”})$

In the parlance of formal logic, this kind of expression is called the disjunctive normal form, that is, a disjunction of conjunctions. It’s an easy exercise to prove that every propositional statement (in our case, using only and, or, and parentheses for grouping) can be put into disjunctive normal form. That is, any boolean condition that can be used to classify the data can be expressed in a disjunctive normal form, and hence as a decision tree.

Such a “boolean condition” is an example of a hypothesis, which is the formal term for the rule an algorithm uses to classify new data points. We call the set of all possible hypotheses expressible by our algorithm the hypothesis space of our algorithm. What we’ve just shown above is that decision trees have a large and well-defined hypothesis space. On the other hand, it is much more difficult to describe the hypothesis space for an algorithm like k-nearest-neighbors. This is one argument in favor of decision trees: they have a well-understood hypothesis space, and that makes them analytically tractable and interpretable.

Using Entropy to Find Optimal Splits

The real problem here is not in using a decision tree, but in constructing one from data alone. At any step in the process we outlined in the example above, we need to determine which feature is the right one to split the data on. That is, we need to choose the labels for the interior nodes so that the resulting data subsets are as homogeneous as possible. In particular, it would be nice to have a quantitative way to measure the quality of a split. Then at each step we could simply choose the feature whose split yields the highest value under this measurement.

While we won’t derive such a measurement in this post, we will use one that has an extensive history of applications: Shannon entropy.

Definition: Let $ D$ be a discrete probability distribution $ (p_1, p_2, \dots, p_n)$. Then the Shannon entropy of $ D$, denoted $ E(p_1, \dots, p_n)$ is

$ \displaystyle E(p_1, \dots , p_n) = - \sum_{i=1}^n p_i \log(p_i)$

where the logarithms are taken in base 2.

In English, there are $ n$ possible outcomes numbered 1 to $ n$, and the probability that an instance drawn from $ D$ results in the outcome $ k$ is $ p_k$. Then Shannon’s entropy function computes a numerical quantity describing how “dispersed” the outcomes are.

While there are many other useful interpretations of Shannon entropy, we only need it to describe how well the data is split into its classes. For our purposes, the probability distribution will simply be the observed proportions of data with respect to their class labels. In the case of Arya’s horse riding, the initial distribution would be $ (1/2, 1/2)$, giving an entropy of $ 1$.

Let’s verify that Shannon’s entropy function makes sense for our problem. Specifically, the best scenario for splitting the data on a feature is a perfect split; that is, each subset only has data from one class. On the other hand, the worst case would be where each subset is uniformly distributed across all classes (if there are $ n$ classes, then each subset has $ 1/n$ of its data from each class).

Indeed, if we adopt the convention that $ \log(0) = 0$, then the entropy of $ (1,0, \dots, 0)$ consists of a single term $ -1 \log(1) = 0$. It is clear that this does not depend on the position of the 1 within the probability distribution. On the other hand, the entropy of $ (1/n, \dots, 1/n)$ is

$ \displaystyle -\sum_{i=1}^n\frac{1}{n}\log \left (\frac{1}{n} \right ) = -\log \left (\frac{1}{n} \right ) = -(0 - \log(n)) = \log(n)$

A well-known property of the entropy function tells us that this is in fact the maximum value for this function.

Summarizing this, in the best case entropy is minimized after the split, and in the worst case entropy is maximized. But we can’t simply look at the entropy of each subset after splitting. We need a sensible way to combine these entropies and to compare them with the entropy of the data before splitting. In particular, we would quantify the “decrease” in entropy caused by a split, and maximize that quantity.

Definition: Let $ S$ be a data set and $ A$ a feature with values $ v \in V$, and let $ E$ denote Shannon’s entropy function. Moreover, let $ S_v$ denote the subset of $ S$ for which the feature $ A$ has the value $ v$. The gain of a split along the feature $ A$, denoted $ G(S,A)$ is

$ \displaystyle G(S,A) = E(S) - \sum_{v \in V} \frac{|S_v|}{|S|} E(S_v)$

That is, we are taking the difference of the entropy before the split, and subtracting off the entropies of each part after splitting, with an appropriate weight depending on the size of each piece. Indeed, if the entropy grows after the split (that is if the data becomes more mixed), then this number will be small. On the other hand if the split separates the classes nicely, each subset $ S_v$ will have small entropy, and hence the value will be large.
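
For example, splitting Arya’s six-row table above on the Wind feature: the whole table has entropy $ 1$ (three Yes and three No), the High subset (two No’s) has entropy $ 0$, and the Low subset (three Yes, one No) has entropy about $ 0.811$, so

$ \displaystyle G(S, \textup{Wind}) = 1 - \left( \frac{2}{6} \cdot 0 + \frac{4}{6} \cdot 0.811 \right) \approx 0.46$

Splitting on Temperature, by contrast, happens to separate the two classes of this toy table perfectly, so its gain is the maximum possible value of $ 1$.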

It requires a bit of mathematical tinkering to be completely comfortable that this function actually does what we want it to (for instance, it is not obvious that this function is non-negative; does it make sense to have a negative gain?). We won’t tarry in those details (this author has spent at least a day or two ironing them out), but we can rest assured that this function has been studied extensively, and nothing unexpected happens.

So now the algorithm for building trees is apparent: at each stage, simply pick the feature for which the gain function is maximized, and split the data on that feature. Create a child node for each of the subsets in the split, and connect them via edges with labels corresponding to the chosen feature value for that piece.

This algorithm is classically called ID3, and we’ll implement it in the next section.

Building a Decision Tree in Python

As with our primer on trees, we can use a quite simple data structure to represent the tree, but here we need a few extra pieces of data associated with each node.

class Tree:
   def __init__(self, parent=None):
      self.parent = parent
      self.children = []
      self.splitFeature = None
      self.splitFeatureValue = None
      self.label = None

In particular, now that features can have more than two possible values, we need to allow for an arbitrarily long list of child nodes. In addition, we add three pieces of data (with default values None): the splitFeature is the feature for which each of its children assumes a separate value; the splitFeatureValue is the value of the parent’s split feature along the edge leading to this node; and the label (which is None for all interior nodes) is the final classification label for a leaf.

We also need to nail down our representations for the data. In particular, we will represent a data set as a list of pairs of the form (point, label), where the point is itself a list of the feature values, and the label is a string.
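
For example, Arya’s table from the beginning of the post becomes the following (the variable name is ours, just for illustration):

aryaData = [
   (['Cloudy', 'Warm', 'Low', 'Low'], 'Yes'),
   (['Rainy', 'Cold', 'Medium', 'Low'], 'No'),
   (['Sunny', 'Warm', 'Medium', 'Low'], 'Yes'),
   (['Sunny', 'Hot', 'High', 'High'], 'No'),
   (['Snow', 'Cold', 'Low', 'High'], 'No'),
   (['Rainy', 'Warm', 'High', 'Low'], 'Yes'),
]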

Now given a data set the first thing we need to do is compute its entropy. For that we can first convert it to a distribution (in the sense defined above, a list of probabilities which sum to 1):

def dataToDistribution(data):
   ''' Turn a dataset which has n possible classification labels into a
       probability distribution with n entries. '''
   allLabels = [label for (point, label) in data]
   numEntries = len(allLabels)
   possibleLabels = set(allLabels)

   dist = []
   for aLabel in possibleLabels:
      dist.append(float(allLabels.count(aLabel)) / numEntries)

   return dist

And we can compute the entropy of such a distribution in the obvious way:

import math

def entropy(dist):
   ''' Compute the Shannon entropy (in base 2) of the given probability distribution. '''
   return -sum([p * math.log(p, 2) for p in dist])

Now in order to compute the gain of a data set by splitting on a particular value, we need to be able to split the data set. To do this, we identify features with their index in the list of feature values of a given data point, enumerate all possible values of that feature, and generate the needed subsets one at a time. In particular, we use a Python generator object:

def splitData(data, featureIndex):
   ''' Iterate over the subsets of data corresponding to each value
       of the feature at the index featureIndex. '''

   # get possible values of the given feature
   attrValues = [point[featureIndex] for (point, label) in data]

   for aValue in set(attrValues):
      dataSubset = [(point, label) for (point, label) in data
                    if point[featureIndex] == aValue]

      yield dataSubset

So to compute the gain, we simply need to iterate over the set of all splits, and compute the entropy of each split. In code:

def gain(data, featureIndex):
   ''' Compute the expected gain from splitting the data along all possible
       values of the feature at featureIndex. '''

   entropyGain = entropy(dataToDistribution(data))

   for dataSubset in splitData(data, featureIndex):
      # weight each subset's entropy by the fraction of the data it contains,
      # as in the definition of gain above
      entropyGain -= (float(len(dataSubset)) / len(data)) * entropy(dataToDistribution(dataSubset))

   return entropyGain

Of course, the best split (represented as the best feature to split on) is given by such a line of code as:

bestFeature = max(range(n), key=lambda index: gain(data, index))

We can’t quite use this line exactly though, because while we’re building up the decision tree (which will of course be a recursive process) we need to keep track of which features have been split on previously and which have not; this data is different for each possible traversal of the tree. In the end, our function to build a decision tree requires three pieces of data: the current subset of the data to investigate, the root of the current subtree that we are in the process of building, and the set of features we have yet to split on.

Of course, this raises the obvious question about the base cases. One base case will be when we run out of data to split; that is, when our input data all have the same classification label. To check for this we implement a function called “homogeneous”

def homogeneous(data):
   ''' Return True if the data have the same label, and False otherwise. '''
   return len(set([label for (point, label) in data])) <= 1

The other base case is when we run out of good features to split on. Of course, if the true classification function is actually a decision tree then this won’t be the case. But now that we’re in the real world, we can imagine there may be two data points with identical features but different classes. Perhaps the simplest way to remedy this situation is to terminate the tree at that point (when we run out of features to split on, or no split gives positive gain), and use a simple majority vote to label the new leaf. In a sense, this strategy is a sort of nearest-neighbors vote as a default. To implement this, we have a function which simply patches up the leaf appropriately:

def majorityVote(data, node):
   ''' Label node with the majority of the class labels in the given data set. '''
   labels = [label for (pt, label) in data]
   choice = max(set(labels), key=labels.count)
   node.label = choice
   return node

The base cases show up rather plainly in the code to follow, so let us instead focus on the inductive step. We declare our function to accept the data set in question, the root of the subtree to be built, and a list of the remaining allowable features to split on. The function begins with:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''

   if homogeneous(data):
      root.label = data[0][1]
      return root

   if len(remainingFeatures) == 0:
      return majorityVote(data, root)

   # find the index of the best feature to split on
   bestFeature = max(remainingFeatures, key=lambda index: gain(data, index))

   if gain(data, bestFeature) == 0:
      return majorityVote(data, root)

   root.splitFeature = bestFeature

Here we see the base cases, and the selection of the best feature to split on. As a side remark, we observe this is not the most efficient implementation. We admittedly call the gain function and splitData functions more often than necessary, but we feel what is lost in runtime speed is gained in code legibility.

Once we bypass the three base cases, and we have determined the right split, we just do it:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''

   ...

   # add child nodes and process recursively
   for dataSubset in splitData(data, bestFeature):
      aChild = Tree(parent=root)
      aChild.splitFeatureValue = dataSubset[0][0][bestFeature]
      root.children.append(aChild)

      buildDecisionTree(dataSubset, aChild, remainingFeatures - set([bestFeature]))

   return root

Here we iterate over the subsets of data after the split, and create a child node for each. We then assign the child its corresponding feature value in the splitFeatureValue variable, and append the child to the root’s list of children. Next is where we first see the remainingFeatures set come into play, and in particular we note the overloaded minus sign as an operation on sets. This is a feature of Python sets, and in particular it behaves exactly like set difference in mathematics. The astute programmer will note that the minus operation generates a new set, so that further recursive calls to buildDecisionTree will not be affected by what happens in this recursive call.

Now the first call to this function requires some initial parameter setup, so we define a convenience function that only requires a single argument: the data.

def decisionTree(data):
   return buildDecisionTree(data, Tree(), set(range(len(data[0][0]))))

Classifying New Data

The last piece of the puzzle is to classify a new piece of data once we’ve constructed the decision tree. This is a considerably simpler recursive process. If the current node is a leaf, output its label. Otherwise, recursively search the subtree (the child of the current node) whose splitFeatureValue matches the new data’s choice of the feature being split. In code,

def classify(tree, point):
   ''' Classify a data point by traversing the given decision tree. '''

   if tree.children == []:
      return tree.label
   else:
      matchingChildren = [child for child in tree.children
         if child.splitFeatureValue == point[tree.splitFeature]]

      return classify(matchingChildren[0], point)

And we can use this function to naturally test a dataset:

def testClassification(data, tree):
   actualLabels = [label for point, label in data]
   predictedLabels = [classify(tree, point) for point, label in data]

   correctLabels = [(1 if a == b else 0) for a,b in zip(actualLabels, predictedLabels)]
   return float(sum(correctLabels)) / len(actualLabels)
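
As a quick sanity check on the toy example, a snippet like the following (using the aryaData list from earlier) should agree with the classification we worked out by hand. Note that the greedy algorithm may pick a different first split than we did by eye (on this tiny table, Temperature alone happens to separate the classes perfectly), but the prediction for the new day comes out the same:

tree = decisionTree(aryaData)
print(classify(tree, ['Rainy', 'Cold', 'Low', 'Low']))  # classified as 'No'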

But now we run into the issue of noisy data. What if one wants to classify a point for which one of the feature values used in the tree is unknown? One can take many approaches to remedy this, and we choose a simple one: search every matching branch, and use a majority vote when reaching the leaves. This requires us to add one additional piece of information to the leaf nodes: the total number of labels in each class used to build that leaf (recall, one of our stopping conditions resulted in a leaf having heterogeneous data). We omit the details here, but the reader is invited to read them on this blog’s Github page, where as usual we have provided all of the code used in this post.

Classifying Political Parties Based on Congressional Votes

We now move to a concrete application of decision trees. The data set we will work with comes from the UCI machine learning repository, and it records the votes cast by the US House of Representatives during a particular session of Congress in 1984. The data set has 16 features; that is, there were 16 different measures considered “key” votes during this session. So each point in the dataset represents the 16 votes of a single House member in that session. With a bit of reformatting, we provide the complete data set on this blog’s Github page.

Our goal is to learn party membership based on the voting records. This data set is rife with missing values; roughly half of the members abstained from voting on some of these measures. So we constructed a decision tree from the clean portion of the data, and used it to classify the remainder of the data.

Indeed, this data fits precisely into the algorithm we designed above. The code to construct a tree is almost trivial:

   with open('house-votes-1984.txt', 'r') as inputFile:
      lines = inputFile.readlines()

   data = [line.strip().split(',') for line in lines]
   data = [(x[1:], x[0]) for x in data]

   cleanData = [x for x in data if '?' not in x[0]]
   noisyData = [x for x in data if '?' in x[0]]

   tree = decisionTree(cleanData)

Indeed, the classification accuracy in doing this is around 90%. In addition (though we will revisit the concept of overfitting later), this is stable despite variation in the size of the subset of data used to build the tree. The graph below shows this.

The size of the subset used to construct the tree versus its accuracy in classifying the remainder of the data. Note that the subsets were chosen uniformly at random without replacement. The x-axis is the number of points used to construct the tree, and the y-axis is the proportion of labels correctly classified.

Inspecting the trees generated in this process, it appears that the most prominent feature to split on is the adoption of a new budget resolution. Very few Democrats voted in favor of this, so for many of the random subsets of the data, a split on this feature left one side homogeneously Republican.

Overfitting, Pruning, and Other Issues

Now there are some obvious shortcomings to the method in general. If the data set used to build the decision tree is enormous (in dimension or in number of points), then the resulting decision tree can be arbitrarily large in size. In addition, there is the pitfall of overfitting to this particular data set. For the party classification problem above, the point is to extend the classification to any population of people who might vote on these issues (or, more narrowly, to any politician who might vote on these issues). If we make our decision tree very large, then the hypothesis may be overly specific to the people in the sample used, and hence will not generalize well.

This problem is called overfitting to the data, and it’s a prevalent concern among all machine learning algorithms. There are a number of ways to avoid it for decision trees. Perhaps the most common is the idea of pruning: one temporarily removes all possible proper subtrees and reevaluates the classification accuracy for that removal. Whichever subtree results in the greatest increase in accuracy is actually removed, and it is replaced with a single leaf whose label corresponds to the majority label of the data points used to create the entire subtree. This process is then repeated until there are no possible improvements, or the gain is sufficiently small.

From a statistical point of view, one could say this process amounts to ignoring outliers. Any points which do not generally agree with the whole trend of the data set (and hence create their own branches in the decision tree) are simply removed. From a theoretical point of view, a smaller decision tree satisfies the principle of Occam’s razor: among hypotheses that fit the data comparably well, the simpler one should be preferred.

We won’t implement a pruning method here (indeed, we didn’t detect any overfitting in the congressional voting example), but it would be a nice exercise for the reader to get their feet wet with the code given above. Finally, there are other algorithms to build decision trees that we haven’t mentioned here. You can see a list of such algorithms on the relevant Wikipedia page. Because of the success of ID3, there is a large body of research on such algorithms.

In any event, next time we’ll investigate yet another machine learning method: that of neural networks. We’ll also start to look at more general frameworks for computational learning theory. That is, we’ll exercise the full might of theoretical mathematics to reason about how hard certain problems are to learn (or whether they can be learned at all).

Until then!