What does it mean for an algorithm to be fair?

In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.

Warning: algorithms can facilitate illegal discrimination!

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

Payday-Loans

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.

There are many facets to the algorithm fairness problem one that has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem that stupid people aren’t listening to smart people, it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” have fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results for deep learning. This means they spent a ton of time tinkering the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability is the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

Advertisements

On the Computational Complexity of MapReduce

I recently wrapped up a fun paper with my coauthors Ben Fish, Adam Lelkes, Lev Reyzin, and Gyorgy Turan in which we analyzed the computational complexity of a model of the popular MapReduce framework. Check out the preprint on the arXiv. Update: this paper is now published in the proceedings of DISC2015.

As usual I’ll give a less formal discussion of the research here, and because the paper is a bit more technically involved than my previous work I’ll be omitting some of the more pedantic details. Our project started after Ben Moseley gave an excellent talk at UI Chicago. He presented a theoretical model of MapReduce introduced by Howard Karloff et al. in 2010, and discussed his own results on solving graph problems in this model, such as graph connectivity. You can read Karloff’s original paper here, but we’ll outline his model below.

Basically, the vast majority of the work on MapReduce has been algorithmic. What I mean by that is researchers have been finding more and cleverer algorithms to solve problems in MapReduce. They have covered a huge amount of work, implementing machine learning algorithms, algorithms for graph problems, and many others. In Moseley’s talk, he posed a question that caught our eye:

Is there a constant-round MapReduce algorithm which determines whether a graph is connected?

After we describe the model below it’ll be clear what we mean by “solve” and what we mean by “constant-round,” but the conjecture is that this is impossible, particularly for the case of sparse graphs. We know we can solve it in a logarithmic number of rounds, but anything better is open.

In any case, we started thinking about this problem and didn’t make much progress. To the best of my knowledge it’s still wide open. But along the way we got into a whole nest of more general questions about the power of MapReduce. Specifically, Karloff proved a theorem relating MapReduce to a very particular class of circuits. What I mean is he proved a theorem that says “anything that can be solved in MapReduce with so many rounds and so much space can be solved by circuits that are yae big and yae complicated, and vice versa.”

But this question is so specific! We wanted to know: is MapReduce as powerful as polynomial time, our classical notion of efficiency (does it equal P)? Can it capture all computations requiring logarithmic space (does it contain L)? MapReduce seems to be somewhere in between, but it’s exact relationship to these classes is unknown. And as we’ll see in a moment the theoretical model uses a novel communication model, and processors that never get to see the entire input. So this led us to a host of natural complexity questions:

  1. What computations are possible in a model of parallel computation where no processor has enough space to store even one thousandth of the input?
  2. What computations are possible in a model of parallel computation where processors can’t request or send specific information from/to other processors?
  3. How the hell do you prove that something can’t be done under constraints of this kind?
  4. How do you measure the increase of power provided by giving MapReduce additional rounds or additional time?

These questions are in the domain of complexity theory, and so it makes sense to try to apply the standard tools of complexity theory to answer them. Our paper does this, laying some brick for future efforts to study MapReduce from a complexity perspective.

In particular, one of our theorems is the following:

Theorem: Any problem requiring sublogarithmic space o(\log n) can be solved in MapReduce in two rounds.

This theorem is nice for two reasons. First is it’s constructive. If you give me a problem and I know it classically takes less than logarithmic space, then this gives an automatic algorithm to implement it in MapReduce, and if you were so inclined you could even automate the process of translating a classical algorithm to a MapReduce algorithm (it’s not pretty, but it’s possible). One example of such a useful problem is string searching: if you give me a fixed regex, I can search a large corpus for matches to that regex in actually constant space, and hence in MapReduce in two rounds.

Our other results are a bit more esoteric. We relate the questions about MapReduce to old, deep questions about computations that have simultaneous space and time bounds. In the end we give a (conditional, nonconstructive) answer to question 4 above, which I’ll sketch without getting too deep in the details. It’s interesting despite the conditionalness and nonconstructiveness because it’s the first result of its kind for MapReduce.

So let’s start with the model of Karloff et al., which they named “MRC” for MapReduce Class.

The MRC Model of Karloff et al.

MapReduce is a programming paradigm which is intended to make algorithm design for distributed computing easier. Specifically, when you’re writing massively distributed algorithms by hand, you spend a lot of time and effort dealing with really low-level questions. Like, what do I do if a processor craps out in the middle of my computation? Or, what’s the most efficient way to broadcast a message to all the processors, considering the specific topology of my network layout means the message will have to be forwarded? These are questions that have nothing to do with the actual problem you’re trying to solve.

So the designers of MapReduce took a step back, analyzed the class of problems they were most often trying to solve (turns out, searching, sorting, and counting), and designed their framework to handle all of the low-level stuff automatically. The catch is that the algorithm designer now has to fit their program into the MapReduce paradigm, which might be hard.

In designing a MapReduce algorithm, one has to implement two functions: a mapper and a reducer. The mapper takes a list of key-value pairs, and applies some operation to each element independently (just like the map function in most functional programming languages). The reducer takes a single key and a list of values for that key, and outputs a new list of values with the same key. And then the MapReduce protocol iteratively applies mappers and reducers in rounds to the input data. There are a lot of pictures of this protocol on the internet. Here’s one

Image source: IBM.com

Image source: ibm.com

An interesting part of this protocol is that the reducers never get to communicate with each other, except indirectly through the mappers in the next round. MapReduce handles the implied grouping and message passing, as well as engineering issues like fault-tolerance and load balancing. In this sense, the mappers and reducers are ignorant cogs in the machine, so it’s interesting to see how powerful MapReduce is.

The model of Karloff, Suri, and Vassilvitskii is a relatively straightforward translation of this protocol to mathematics. They make a few minor qualifications, though, the most important of which is that they encode a bound on the total space used. In particular, if the input has size n, they impose that there is some \varepsilon > 0 for which every reducer uses space O(n^{1 - \varepsilon}). Moreover, the number of reducers in any round is also bounded by O(n^{1 - \varepsilon}).

We can formalize all of this with a few easy definitions.

Definition: An input to a MapReduce algorithm is a list of key-value pairs \langle k_i,v_i \rangle_{i=1}^N of total size n = \sum_{i=1}^N |k_i| + |v_i|.

For binary languages (e.g., you give me a binary string x and you want to know if there are an odd number of 1’s), we encode a string x \in \{ 0,1 \}^m as the list \langle x \rangle := \langle i, x_i \rangle_{i=1}^n. This won’t affect any of our space bounds, because the total blowup from m = |x| to n = |\langle x \rangle | is logarithmic.

Definition: A mapper \mu is a Turing machine which accepts as input a single key-value pair \langle k, v \rangle and produces as output a list of key-value pairs \langle k_1', v'_1 \rangle, \dots, \langle k'_s, v'_s \rangle.

Definition: reducer \rho is a Turing machine which accepts as input a key k and a list of values v_1, \dots, v_m and produces as output the same key and a new list of values v'_1, \dots, v'_M.

Now we can describe the MapReduce protocol, i.e. what a program is and how to run it. I copied this from our paper because the notation is the same so far.

MRC definition

The MRC protocol

All we’ve done here is just describe the MapReduce protocol in mathematics. It’s messier than it is complicated. The last thing we need is the space bounds.

Definition: A language L is in \textup{MRC}[f(n), g(n)] if there is a constant 0 < c < 1 and a sequence of mappers and reducers \mu_1, \rho_1, \mu_2, \rho_2, \dots such that for all x \in \{ 0,1 \}^n the following is satisfied:

  1. Letting R = O(f(n)) and M = (\mu_1, \rho_1, \dots, \mu_R, \rho_R), M accepts x if and only if x \in L.
  2. For all 1 \leq r \leq R, \mu_r, \rho_r use O(n^c) space and O(g(n)) time.
  3. Each \mu_r outputs O(n^c) keys in round r.

To be clear, the first argument to \textup{MRC}[f(n), g(n)] is the round bound, and the second argument is the time bound. The last point implies the bound on the number of processors. Since we are often interested in a logarithmic number of rounds, we can define

\displaystyle \textup{MRC}^i = \textup{MRC}[\log^i(n), \textup{poly}(n)]

So we can restate the question from the beginning of the post as,

Is graph connectivity in \textup{MRC}^0?

Here there’s a bit of an issue with representation. You have to show that if you’re just given a binary string representing a graph, that you can translate that into a reasonable key-value description of a graph in a constant number of rounds. This is possible, and gives evidence that the key-value representation is without loss of generality for all intents and purposes.

A Pedantic Aside

If our goal is to compare MRC with classes like polynomial time (P) and logarithmic space (L), then the definition above has a problem. Specifically the definition above allows one to have an infinite list of reducers, where each one is potentially different, and the actual machine that is used depends on the input size. In formal terminology, MRC as defined above is a nonuniform model of computation.

Indeed, we give a simple proof that this is a problem by showing any unary language is in \textup{MRC}^1 (which is where many of the MRC algorithms in recent years have been). For those who don’t know, this is a problem because you can encode versions of the Halting problem as a unary language, and the Halting problem is undecidable. We don’t want our model to be able to solve undecidable problems!

While this level of pedantry might induce some eye-rolling, it is necessary in order to make statements like “MRC is contained in P.” It’s simply not true for the nonuniform model above. To fix this problem we refined the MRC model to a uniform version. The details are not trivial, but also not that interesting. Check out the paper if you want to know exactly how we do it. It doesn’t have that much of an effect on the resulting analysis. The only important detail is that now we’re representing the entire MRC machine as a single Turing machine. So now, unlike before, any MRC machine can be written down as a finite string, and there are infinitely many strings representing a specific MRC machine. We call our model MRC, and Karloff’s model nonuniform MRC.

You can also make randomized versions of MRC, but we’ll stick to the deterministic version here.

Sublogarithmic Space

Now we can sketch the proof that sublogarithmic space is in \textup{MRC}^0. In fact, the proof is simpler for regular languages (equivalent to constant space algorithms) being in \textup{MRC}^0. The idea is as follows:

A regular language is one that can be decided by a DFA, say G (a graph representing state transitions as you read a stream of bits). And the DFA is independent of the input size, so every mapper and reducer can have it encoded into them. Now what we’ll do is split the input string up in contiguous chunks of size O(\sqrt{n}) among the processors (mappers can do this just fine). Now comes the trick. We have each reducer compute, for each possible state s in G, what the ending state would be if the DFA had started in state s after processing their chunk of the input. So the output of reducer j would be an encoding of a table of the form:

\displaystyle s_1 \to T_j(s_1) \\ s_2 \to T_j(s_2) \\ \vdots \\ s_{|S|} \to T_j(s_{|S|})

And the result would be a lookup table of intermediate computation results. So each reducer outputs their part of the table (which has constant size). Since there are only O(\sqrt{n}) reducers, all of the table pieces can fit on a single reducer in the second round, and this reducer can just chain the functions together from the starting state and output the answer.

The proof for sublogarithmic space has the same structure, but is a bit more complicated because one has to account for the tape of a Turing machine which has size o(\log n). In this case, you split up the tape of the Turing machine among the processors as well. And then you have to keep track of a lot more information. In particular, each entry of your function has to look like

“if my current state is A and my tape contents are B and the tape head starts on side C of my chunk of the input”

then

“the next time the tape head would leave my chunk, it will do so on side C’ and my state will be A’ and my tape contents will be B’.”

As complicated as it sounds, the size of one of these tables for one reducer is still less than n^c for some c < 1/2. And so we can do the same trick of chaining the functions together to get the final answer. Note that this time the chaining will be adaptive, because whether the tape head leaves the left or right side will determine which part of the table you look at next. And moreover, we know the chaining process will stop in polynomial time because we can always pick a Turing machine to begin with that will halt in polynomial time (i.e., we know that L is contained in P).

The size calculations are also just large enough to stop us from doing the same trick with logarithmic space. So that gives the obvious question: is \textup{L} \subset \textup{MRC}^0? The second part of our paper shows that even weaker containments are probably very hard to prove, and they relate questions about MRC to the problem of L vs P.

Hierarchy Theorems

This part of the paper is where we go much deeper into complexity theory than I imagine most of my readers are comfortable with. Our main theorem looks like this:

Theorem: Assume some conjecture is true. Then for every a, b > 0, there is are bigger c > a, d > b such that the following hold:

\displaystyle \textup{MRC}[n^a, n^b] \subsetneq \textup{MRC}[n^c, n^b],
\displaystyle \textup{MRC}[n^a, n^b] \subsetneq \textup{MRC}[n^a, n^d].

This is a kind of hierarchy theorem that one often likes to prove in complexity theory. It says that if you give MRC enough extra rounds or time, then you’ll get strictly more computational power.

The conjecture we depend on is called the exponential time hypothesis (ETH), and it says that the 3-SAT problem can’t be solved in 2^{cn} time for any 0 < c < 1. Actually, our theorem depends on a weaker conjecture (implied by ETH), but it’s easier to understand our theorems in the context of the ETH. Because 3-SAT is this interesting problem: we believe it’s time-intensive, and yet it’s space efficient because we can solve it in linear space given 2^n time. This time/space tradeoff is one of the oldest questions in all of computer science, but we still don’t have a lot of answers.

For instance, define \textup{TISP}(f(n), g(n)) to the the class of languages that can be decided by Turing machines that have simultaneous bounds of O(f(n)) time and O(g(n)) space. For example, we just said that \textup{3-SAT} \in \textup{TISP}(2^n, n), and there is a famous theorem that says that SAT is not in \textup{TISP}(n^{1.1} n^{0.1}); apparently this is the best we know. So clearly there are some very basic unsolved problems about TISP. An important one that we address in our paper is whether there are hierarchies in TISP for a fixed amount of space. This is the key ingredient in proving our hierarchy theorem for MRC. In particular here is an open problem:

Conjecture: For every two integers 0 < a < b, the classes \textup{TISP}(n^a, n) and \textup{TISP}(n^b, n) are different.

We know this is true of time and space separately, i.e., that \textup{TIME}(n^a) \subsetneq \textup{TIME}(n^b) and \textup{SPACE}(n^a) \subsetneq \textup{SPACE}(n^b). but for TISP we only know that you get more power if you increase both parameters.

So we prove a theorem that address this, but falls short in two respects: it depends on a conjecture like ETH, and it’s not for every a, b.

Theorem: Suppose ETH holds, then for every a > 0 there is a b > a for which \textup{TIME}(n^a) \not \subset \textup{TISP}(n^b, n).

In words, this suggests that there are problems that need both n^2 time and n^2 space, but can be solved with linear space if you allow enough extra time. Since \textup{TISP}(n^a, n) \subset \textup{TIME}(n^a), this gives us the hierarchy we wanted.

To prove this we take 3-SAT and give it exponential padding so that it becomes easy enough to do in polynomial TISP (and it still takes linear space, in fact sublinear), but not so easy that you can do it in n^a time. It takes some more work to get from this TISP hierarchy to our MRC hierarchy, but the details are a bit much for this blog. One thing I’d like to point out is that we prove that statements that are just about MRC have implications beyond MapReduce. In particular, the last corollary of our paper is the following.

Corollary: If \textup{MRC}[\textup{poly}(n), 1] \subsetneq \textup{MRC}[\textup{poly}(n), \textup{poly}(n)], then L is different from P.

In other words, if you’re afforded a polynomial number of rounds in MRC, then showing that constant time per round is weaker than polynomial time is equivalently hard to separating L from P. The theorem is true because, as it turns out, \textup{L} \subset \textup{MRC}[textup{poly}(n), 1], by simulating one step of a TM across polynomially many rounds. The proof is actually not that complicated (and doesn’t depend on ETH at all), and it’s a nice demonstration that studying MRC can have implications beyond parallel computing.

The other side of the coin is also interesting. Our first theorem implies the natural question of whether \textup{L} \subset \textup{MRC}^0. We’d like to say that this would imply the separation of L from P, but we don’t quite get that. In particular, we know that

\displaystyle \textup{MRC}[1, n] \subsetneq \textup{MRC}[n, n] \subset \textup{MRC}[1, n^2] \subsetneq \textup{MRC}[n^2, n^2] \subset \dots

But at the same time we could live in a world where

\displaystyle \textup{MRC}[1, \textup{poly}(n)] = \textup{MRC}[\textup{poly}(n), \textup{poly}(n)]

It seems highly unlikely, but to the best of our knowledge none of our techniques prove this is not the case. If we could rule this out, then we could say that \textup{L} \subset \textup{MRC}^0 implies the separation of L and P. And note this would not depend on any conjectures.

Open Problems

Our paper has a list of open problems at the end. My favorite is essentially: how do we prove better lower bounds in MRC? In particular, it would be great if we could show that some problems need a lot of rounds simply because the communication model is too restrictive, and nobody has true random access to the entire input. For example, this is why we think graph connectivity needs a logarithmic number of rounds. But as of now nobody really knows how to prove it, and it seems like we’ll need some new and novel techniques in order to do it. I only have the wisps of ideas in that regard, and it will be fun to see which ones pan out.

Until next time!

Community Detection in Graphs — a Casual Tour

Graphs are among the most interesting and useful objects in mathematics. Any situation or idea that can be described by objects with connections is a graph, and one of the most prominent examples of a real-world graph that one can come up with is a social network.

Recall, if you aren’t already familiar with this blog’s gentle introduction to graphs, that a graph G is defined by a set of vertices V, and a set of edges E, each of which connects two vertices. For this post the edges will be undirected, meaning connections between vertices are symmetric.

One of the most common topics to talk about for graphs is the notion of a community. But what does one actually mean by that word? It’s easy to give an informal definition: a subset of vertices C such that there are many more edges between vertices in C than from vertices in C to vertices in V - C (the complement of C). Try to make this notion precise, however, and you open a door to a world of difficult problems and open research questions. Indeed, nobody has yet come to a conclusive and useful definition of what it means to be a community. In this post we’ll see why this is such a hard problem, and we’ll see that it mostly has to do with the word “useful.” In future posts we plan to cover some techniques that have found widespread success in practice, but this post is intended to impress upon the reader how difficult the problem is.

The simplest idea

The simplest thing to do is to say a community is a subset of vertices which are completely connected to each other. In the technical parlance, a community is a subgraph which forms a clique. Sometimes an n-clique is also called a complete graph on n vertices, denoted K_n. Here’s an example of a 5-clique in a larger graph:

"Where's Waldo" for graph theorists: a clique hidden in a larger graph.

“Where’s Waldo” for graph theorists: a clique hidden in a larger graph.

Indeed, it seems reasonable that if we can reliably find communities at all, then we should be able to find cliques. But as fate should have it, this problem is known to be computationally intractable. In more detail, the problem of finding the largest clique in a graph is NP-hard. That essentially means we don’t have any better algorithms to find cliques in general graphs than to try all possible subsets of the vertices and check to see which, if any, form cliques. In fact it’s much worse, this problem is known to be hard to approximate to any reasonable factor in the worst case (the error of the approximation grows polynomially with the size of the graph!). So we can’t even hope to find a clique half the size of the biggest, or a thousandth the size!

But we have to take these impossibility results with a grain of salt: they only say things about the worst case graphs. And when we’re looking for communities in the real world, the worst case will never show up. Really, it won’t! In these proofs, “worst case” means that they encode some arbitrarily convoluted logic problem into a graph, so that finding the clique means solving the logic problem. To think that someone could engineer their social network to encode difficult logic problems is ridiculous.

So what about an “average case” graph? To formulate this typically means we need to consider graphs randomly drawn from a distribution.

Random graphs

The simplest kind of “randomized” graph you could have is the following. You fix some set of vertices, and then run an experiment: for each pair of vertices you flip a coin, and if the coin is heads you place an edge and otherwise you don’t. This defines a distribution on graphs called G(n, 1/2), which we can generalize to G(n, p) for a coin with bias p. With a slight abuse of notation, we call G(n, p) the Erdős–Rényi random graph (it’s not a graph but a distribution on graphs). We explored this topic form a more mathematical perspective earlier on this blog.

So we can sample from this distribution and ask questions like: what’s the probability of the largest clique being size at least 20? Indeed, cliques in Erdős–Rényi random graphs are so well understood that we know exactly how they work. For example, if p=1/2 then the size of the largest clique is guaranteed (with overwhelming probability as n grows) to have size k(n) or k(n)+1, where k(n) is about 2 \log n. Just as much is known about other values of p as well as other properties of G(n,p), see Wikipedia for a short list.

In other words, if we wanted to find the largest clique in an Erdős–Rényi random graph, we could check all subsets of size roughly 2\log(n), which would take about (n / \log(n))^{\log(n)} time. This is pretty terrible, and I’ve never heard of an algorithm that does better (contrary to the original statement in this paragraph that showed I can’t count). In any case, it turns out that the Erdős–Rényi random graph, and using cliques to represent communities, is far from realistic. There are many reasons why this is the case, but here’s one example that fits with the topic at hand. If I thought the world’s social network was distributed according to G(n, 1/2) and communities were cliques, then I would be claiming that the largest community is of size 65 or 66. Estimated world population: 7 billion, 2 \log(7 \cdot 10^9) \sim 65. Clearly this is ridiculous: there are groups of larger than 66 people that we would want to call “communities,” and there are plenty of communities that don’t form bona-fide cliques.

Another avenue shows that things are still not as easy as they seem in Erdős–Rényi land. This is the so-called planted clique problem. That is, you draw a graph G from G(n, 1/2). You give G to me and I pick a random but secret subset of r vertices and I add enough edges to make those vertices form an r-clique. Then I ask you to find the r-clique. Clearly it doesn’t make sense when r < 2 \log (n) because you won’t be able to tell it apart from the guaranteed cliques in G. But even worse, nobody knows how to find the planted clique when r is even a little bit smaller than \sqrt{n} (like, r = n^{9/20} even). Just to solidify this with some numbers, we don’t know how to reliably find a planted clique of size 60 in a random graph on ten thousand vertices, but we do when the size of the clique goes up to 100. The best algorithms we know rely on some sophisticated tools in spectral graph theory, and their details are beyond the scope of this post.

So Erdős–Rényi graphs seem to have no hope. What’s next? There are a couple of routes we can take from here. We can try to change our random graph model to be more realistic. We can relax our notion of communities from cliques to something else. We can do both, or we can do something completely different.

Other kinds of random graphs

There is an interesting model of Barabási and Albert, often called the “preferential attachment” model, that has been described as a good model of large, quickly growing networks like the internet. Here’s the idea: you start off with a two-clique G = K_2, and at each time step t you add a new vertex v to G, and new edges so that the probability that the edge (v,w) is added to G is proportional to the degree of w (as a fraction of the total number of edges in G). Here’s an animation of this process:

Image source: Wikipedia

The significance of this random model is that it creates graphs with a small number of hubs, and a large number of low-degree vertices. In other words, the preferential attachment model tends to “make the rich richer.” Another perspective is that the degree distribution of such a graph is guaranteed to fit a so-called power-law distribution. Informally, this means that the overall fraction of small-degree vertices gives a significant contribution to the total number of edges. This is sometimes called a “fat-tailed” distribution. Since power-law distributions are observed in a wide variety of natural settings, some have used this as justification for working in the preferential attachment setting. On the other hand, this model is known to have no significant community structure (by any reasonable definition, certainly not having cliques of nontrivial size), and this has been used as evidence against the model. I am not aware of any work done on planting dense subgraphs in graphs drawn from a preferential attachment model, but I think it’s likely to be trivial and uninteresting. On the other hand, Bubeck et al. have looked at changing the initial graph (the “seed”) from a 2-clique to something else, and seeing how that affects the overall limiting distribution.

Another model that often shows up is a model that allows one to make a random graph starting with any fixed degree distribution, not just a power law. There are a number of models that do this to some fashion, and you’ll hear a lot of hyphenated names thrown around like Chung-Lu and Molloy-Reed and Newman-Strogatz-Watts. The one we’ll describe is quite simple. Say you start with a set of vertices V, and a number d_v for each vertex v, such that the sum of all the d_v is even. This condition is required because in any graph the sum of the degrees of a vertex is twice the number of edges. Then you imagine each vertex v having d_v “edge-stubs.” The name suggests a picture like the one below:

Each node has a prescribed number of "edge stubs," which are randomly connected to form a graph.

Each node has a prescribed number of “edge stubs,” which are randomly connected to form a graph.

Now you pick two edge stubs at random and connect them. One usually allows self-loops and multiple edges between vertices, so that it’s okay to pick two edge stubs from the same vertex. You keep doing this until all the edge stubs are accounted for, and this is your random graph. The degrees were fixed at the beginning, so the only randomization is in which vertices are adjacent. The same obvious biases apply, that any given vertex is more likely to be adjacent to high-degree vertices, but now we get to control the biases with much more precision.

The reason such a model is useful is that when you’re working with graphs in the real world, you usually have statistical information available. It’s simple to compute the degree of each vertex, and so you can use this random graph as a sort of “prior” distribution and look for anomalies. In particular, this is precisely how one of the leading measures of community structure works: the measure of modularity. We’ll talk about this in the next section.

Other kinds of communities

Here’s one easy way to relax our notion of communities. Rather than finding complete subgraphs, we could ask about finding very dense subgraphs (ignoring what happens outside the subgraph). We compute density as the average degree of vertices in the subgraph.

If we impose no bound on the size of the subgraph an algorithm is allowed to output, then there is an efficient algorithm for finding the densest subgraph in a given graph. The general exact solution involves solving a linear programming problem and a little extra work, but luckily there is a greedy algorithm that can get within half of the optimal density. You start with all the vertices S_n = V, and remove any vertex of minimal degree to get S_{n-1}. Continue until S_0, and then compute the density of all the S_i. The best one is guaranteed to be at least half of the optimal density. See this paper of Moses Charikar for a more formal analysis.

One problem with this is that the size of the densest subgraph might be too big. Unfortunately, if you fix the size of the dense subgraph you’re looking for (say, you want to find the densest subgraph of size at most k where k is an input), then the problem once again becomes NP-hard and suffers from the same sort of inapproximability theorems as finding the largest clique.

A more important issue with this is that a dense subgraph isn’t necessarily a community. In particular, we want communities to be dense on the inside and sparse on the outside. The densest subgraph analysis, however, might rate the following graph as one big dense subgraph instead of two separately dense communities with some modest (but not too modest) amount of connections between them.

What should be identified as communities?

What are the correct communities here?

Indeed, we want a quantifiable a notion of “dense on the inside and sparse on the outside.” One such formalization is called modularity. Modularity works as follows. If you give me some partition of the vertices of G into two sets, modularity measures how well this partition reflects two separate communities. It’s the definition of “community” here that makes it interesting. Rather than ask about densities exactly, you can compare the densities to the expected densities in a given random graph model.

In particular, we can use the fixed-degree distribution model from the last section. If we know the degrees of all the vertices ahead of time, we can compute the probability that we see some number of edges going between the two pieces of the partition relative to what we would see at random. If the difference is large (and largely biased toward fewer edges across the partition and more edges within the two subsets), then we say it has high modularity. This involves a lot of computations  — the whole measure can be written as a quadratic form via one big matrix — but the idea is simple enough. We intend to write more about modularity and implement the algorithm on this blog, but the excited reader can see the original paper of M.E.J. Newman.

Now modularity is very popular but it too has shortcomings. First, even though you can compute the modularity of a given partition, there’s still the problem of finding the partition that globally maximizes modularity. Sadly, this is known to be NP-hard. Mover, it’s known to be NP-hard even if you’re just trying to find a partition into two pieces that maximizes modularity, and even still when the graph is regular (every vertex has the same degree).

Still worse, while there are some readily accepted heuristics that often “do well enough” in practice, we don’t even know how to approximate modularity very well. Bhaskar DasGupta has a line of work studying approximations of maximum modularity, and he has proved that for dense graphs you can’t even approximate modularity to within any constant factor. That is, the best you can do is have an approximation that gets worse as the size of the graph grows. It’s similar to the bad news we had for finding the largest clique, but not as bad. For example, when the graph is sparse it’s known that one can approximate modularity to within a \log(n) factor of the optimum, where n is the number of vertices of the graph (for cliques the factor was like n^c for some c, and this is drastically worse).

Another empirical issue is that modularity seems to fail to find small communities. That is, if your graph has some large communities and some small communities, strictly maximizing the modularity is not the right thing to do. So we’ve seen that even the leading method in the field has some issues.

Something completely different

The last method I want to sketch is in the realm of “something completely different.” The notion is that if we’re given a graph, we can run some experiment on the graph, and the results of that experiment can give us insight into where the communities are.

The experiment I’m going to talk about is the random walk. That is, say you have a vertex v in a graph G and you want to find some vertices that are “closest” to v. That is, those that are most likely to be in the same community as v. What you can do is run a random walk starting at v. By a “random walk” I mean you start at v, you pick a neighbor at random and move to it, then repeat. You can compute statistics about the vertices you visit in a sample of such walks, and the vertices that you visit most often are those you say are “in the same community as v. One important parameter is how long the walk is, but it’s generally believed to be best if you keep it between 3-6 steps.

Of course, this is not a partition of the vertices, so it’s not a community detection algorithm, but you can turn it into one. Run this process for each vertex, and use it to compute a “distance” between all the pairs of vertices. Then you compute a tree of partitions by lumping the closest pairs of vertices into the same community, one at a time, until you’ve got every vertex. At each step of the way, you compute the modularity of the partition, and when you’re done you choose the partition that maximizes modularity. This algorithm as a whole is called the walktrap clustering algorithm, and was introduced by Pons and Latapy in 2005.

This sounds like a really great idea, because it’s intuitive: there’s a relatively high chance that the friends of your friends are also your friends. It’s also really great because there is an easily measurable tradeoff between runtime and quality: you can tune down the length of the random walk, and the number of samples you take for each vertex, to speed up the runtime but lower the quality of your statistical estimates. So if you’re working on huge graphs, you get a lot of control and a clear idea of exactly what’s going on inside the algorithm (something which is not immediately clear in a lot of these papers).

Unfortunately, I’m not aware of any concrete theoretical guarantees for walktrap clustering. The one bit of theoretical justification I’ve read over the last year is that you can relate the expected distances you get to certain spectral properties of the graph that are known to be related to community structure, but the lower bounds on maximizing modularity already suggest (though they do not imply) that walktrap won’t do that well in the worst case.

So many algorithms, so little time!

I have only brushed the surface of the literature on community detection, and the things I have discussed are heavily biased toward what I’ve read about and used in my own research. There are methods based on information theory, label propagation, and obscure physics processes like “spin glass” (whatever that is, it sounds frustrating).

And we have only been talking about perfect community structure. What if you want to allow people to be in multiple communities, or have communities at varying levels of granularity (e.g. a sports club within a school versus the whole student body of that school)? What if we want to allow people to be “members” of a community at varying degrees of intensity? How do we deal with noisy signals in our graphs? For example, if we get our data from observing people talk, are two people who have heated arguments considered to be in the same community? Since a lot social network data comes from sources like Twitter and Facebook where arguments are rampant, how do we distinguish between useful and useless data? More subtly, how do we determine useful information if a group within the social network are trying to mask their discovery? That is, how do we deal with adversarial noise in a graph?

And all of this is just on static graphs! What about graphs that change over time? You can keep making the problem more and more complicated as it gets more realistic.

With the huge wealth of research that has already been done just on the simplest case, and the difficult problems and known barriers to success even for the simple problems, it seems almost intimidating to even begin to try to answer these questions. But maybe that’s what makes them fascinating, not to mention that governments and big businesses pour many millions of dollars into this kind of research.

In the future of this blog we plan to derive and implement some of the basic methods of community detection. This includes, as a first outline, the modularity measure and the walktrap clustering algorithm. Considering that I’m also going to spend a large part of the summer thinking about these problems (indeed, with some of the leading researchers and upcoming stars under the sponsorship of the American Mathematical Society), it’s unlikely to end there.

Until next time!

Bandits and Stocks

So far in this series we’ve seen two nontrivial algorithms for bandit learning in two different settings. The first was the UCB1 algorithm, which operated under the assumption that the rewards for the trials were independent and stochastic. That is, each slot machine was essentially a biased coin flip, and the algorithm was trying to find the machine with the best odds. The second was the Exp3 algorithm, which held the belief that the payoffs were arbitrary. In particular this includes the possibility that an adversary is setting the payoffs against you, and so we measured the success of an algorithm in terms of how it fares against the best single action (just as we did with UCB1, but with Exp3 it’s a nontrivial decision).

Before we move on to other bandit settings it’s natural to try to experiment with the ones we have on real world data. On one hand it’s interesting to see how they fare outside academia. And more relevantly to the design of the future bandit algorithms we’ll see on this blog, we need to know what worldly problems actually provide in terms of inputs to our learning algorithm in each round.

But another interesting issue goes like this. In the real world we can’t ever really know whether the rewards of the actions are stochastic or adversarial. Many people believe that adversarial settings are far too pathological to be realistic, while others claim that the assumptions made by stochastic models are too strict. To weigh in on this dispute, we’ll dip into a bit of experimental science and see which of the two algorithms performs better on the problem of stock trading. The result is then evidence that stocks behave stochastically or adversarially. But we don’t want to stir up too many flames, so we can always back up behind the veil of applied mathematics (“this model is too simple anyway”).

Indeed the model we use in this post is rather simplistic. I don’t know as much as I should (or as my father would have me know) about stock markets. In fact, I’m more partial to not trade stocks on principle. But I must admit that average-quality stock data is easy to come by, and the basic notions of market interactions lend themselves naturally to many machine learning problems. If the reader has any ideas about how to strengthen the model, I welcome suggestions in the comments (or a fork on github).

A fair warning to the reader, we do not solve the problem of trading stocks by any means. We use a model that’s almost entirely unrealistic, and the results aren’t even that good. I’m quite nervous to publish this at all, just because above all else it reveals my gaping ignorance on how stock markets work. But this author believes in revealing ignorance as learning, if for nothing else than that it provides extremely valuable insight into the nature of a problem and an appreciation of its complexity. So criticize away, dear readers.

As usual, all of the code and data we use in this post is available on this blog’s Github page. Our language of choice for this post is Python.

This little trader got lucky.

This little trader got lucky. Could it be because he’s got TEN MONITORS?!

Stocks for Dummies (me)

A quick primer on stocks, which is only as detailed as it needs to be for this post: a stock is essentially the sum of the value of all the assets of a company. A publicly traded company divides their stock into a number of “shares,” and owning a share represents partial ownership of the company. If you own 50% of the shares, you own 50% of the company. Companies sell shares or give them to employees as benefits (or options), and use the money gained through their sale for whatever they see fit. The increase in the price of a stock generally signifies the company is successful and growing; for example, stocks generally rise when a hotly anticipated product is announced.

The stock of a company is traded through one of a number of markets called stock exchanges. The buying and selling interactions are recorded and public, and there are many people in the world who monitor the interactions as they happen (via television, or programmatically) in the hopes of noticing opportunities before others and capitalizing on them. Each interaction induces a change on the price of a share of stock: whenever a share is bought at a certain price, that is the established and recorded price of a share (up to some fudging by brokers which is entirely mysterious to me). In any case, the prices go up and down, and they’re often bundled into “bars” which summarize the data over a certain period of time. The bars we use in this post are daily, and consist of four numbers: the open, the price at the beginning of the day, the high and low, which are self-explanatory, and the close, which is the price at the end of the day.

Bandits and Daily Stock Trading

Now let’s simplify things as much as possible. Our bandit learning algorithm will interact with the market as follows: each day it chooses whether or not to buy a single dollar’s worth of a stock, and at the end of the day it sells the stock and observes the profit. There are no brokers involved, and the price the algorithm sees is the price it gets. In bandit language: the stocks represent actions, and the amount of profit at the end of a day constitutes the payoff of an action in one round. Since small-scale stock price movement is generally very poorly understood, it makes some level of sense to assume the price movements within a given day are adversarial. On the other hand, since we understand them so poorly, we might be tempted to just call them “random” fluctuations, i.e. stochastic. So this is a nice little testbed for seeing which assumption yields a more successful algorithm.

Unlike the traditional image of stock trading where an individual owns shares of a stock over a long period of time, our program will operate on a daily time scale, and hence cannot experience the typical kinds of growth. Nevertheless, we can try to make some money over time, and if it’s a good strategy, we could scale up the single dollar to whatever we’re willing to risk. Specifically, the code we used to compute the payoff is

def payoff(stockTable, t, stock, amountToInvest=1.0):
   openPrice, closePrice = stockTable[stock][t]

   sharesBought = amountToInvest / openPrice
   amountAfterSale = sharesBought * closePrice

   return amountAfterSale - amountToInvest

The remainder of the code is interfacing with the Exp3 and UCB1 functions we gave in previous posts, and data shuffling. We got our data from Google Finance, and we provide it, along with all of the code used in the making of this post, on this blog’s Github page. Before we run our experiments, let’s give a few reasons why this model is unrealistic.

  1. We assume we can buy/sell fractional shares of a stock, which to my knowledge is not possible. Though this experiment could be redone where you buy a single share of a stock, or with mutual funds/currency exchange/whatever replacing stocks, we didn’t do it this way.
  2. Brokerage fees can drastically change the success of an algorithm which trades frequently.
  3. Open and close prices are not typical prices. People will often make decisions based on the time of day, but then again we might expect this to be just another reason that Exp3 would perform better than UCB1.
  4. We’re not actually trading in the stock market, and so we’re ignoring the effects of our own algorithm on the prices in the market.
  5. It’s impossible to guarantee you get to use the opening price and closing price in your transactions.
  6. UCB1 and Exp3 don’t use all of the information available. Indeed, they assume that they would not be able to see the outcome of an action they did not take, but with stocks you can get a good estimate of how much money you would have made had you chosen a different stock.
  7. Each trial in a bandit learning problem is identical from the learner’s perspective, but one often keeps a stock around while making other decisions.

We’ll come back to #6 after seeing the raw experiments for an unaltered UCB1 and Exp3, because there is a natural extension of the algorithm to handle additional information. I’m sure there are other glaring issues with the experimental setup, and the reader should feel free to rant about it in the comments. It won’t stop me from running the algorithm and seeing what happens just for fun.

Data Sets

We ran the experiment on two sets of stocks. The first set consisted of nine random stocks, taken from the random stocks twitter feed, with 5 years of past data. The stocks are:

lxrx, keg, cuba, tdi, brks, mux, cadx, belfb, htr

And you can view more information about these particular stocks via Google Finance. The second set was a non-random choice of nine Fortune 500 companies with 10 years of past data. The stocks were

amzn, cost, jpm, gs, wfc, msft, tgt, aapl, wmt

And again more information about these stocks is available via Google Finance. For the record, here were the cumulative payoffs of each of the nine Fortune 500 companies:

f500-cumulative-rewards

The cumulative rewards for the nine Fortune 500 companies over the last ten years of data.

Interestingly, the company which started off with the best prospects (Apple), turned out to have the worst cumulative reward by the end. The long-term winners in our little imaginary world happen to be Amazon, Costco, and Goldman Sachs. Perhaps this gives credence to the assumption that payoffs are adversarial. A learner can easily get tricked into putting too much faith in one action early on.

And for the random stocks:

The cumulative payoff for the nine randomly chosen stocks.

The cumulative payoff for the nine randomly chosen stocks for the last five years of data.

The random stocks clearly perform worse and more variably overall (although HTR surpasses most of the Fortune 500 companies, despite its otherwise relatively modest stock growth over the last five years). To my untrained eyes these movements look more like a stochastic model than an adversarial one.

Experiments

Here is a typical example of a run of Exp3 on the Fortune 500 data set (using \gamma = 0.33, recall \gamma measures the amount of uniform exploration performed):

(Expected payoff, variance) over 1000 trials is (1.122463919564572, 0.5518037498918705)
For a single run: 
Payoff was 1.12
Regret was 2.91
Best stock was amzn at 4.02
weights: '0.00, 0.00, 0.00, 0.46, 0.52, 0.00, 0.00, 0.00, 0.01'

And one for UCB1:

(Expected payoff, variance) over 1000 trials is (1.1529891576139333, 0.5012825847001482)
For a single run: 
Payoff was 1.73
Regret was 2.29
Best stock was amzn at 4.02
ucbs: '0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234'

The results are quite curious. Indeed, the expected payoff seems to be a whopping 110% return! The variance of these results is quite high, and so it’s not at all impossible that the algorithm could have a negative return. But just as often it would return around 200% profit. 

Before we go risking all our money on this strategy, let’s take a closer look at what’s happening in the algorithm. It appears that for UCB1 the upper confidence bounds assigned to each action are the same! In other words, even after ten years of trials, no single stock “shined” above the others in the eyes of UCB1. It may seem that Exp3 has a leg up on UCB1 in this respect, because it’s clear that it gives higher weights to some stocks over others. However, running the algorithm multiple times shows drastically different weight distributions, and if we average the resulting weights over a thousand rounds, we see that they all have roughly the same mean and variance (the mean being first in the pair):

weight stats for msft: (0.107, 0.025)
weight stats for jpm: (0.109, 0.027)
weight stats for tgt: (0.110, 0.029)
weight stats for gs: (0.112, 0.025)
weight stats for wmt: (0.110, 0.027)
weight stats for aapl: (0.111, 0.027)
weight stats for amzn: (0.120, 0.029)
weight stats for cost: (0.113, 0.026)
weight stats for wfc: (0.107, 0.023)
Indeed, the best stock, Amazon, had an average weight just barely larger (and more variable) than any of the other stocks. So this evidence points to the conclusion that neither EXP3 nor UCB1 has any clue which stock is better. Pairing this with the fact that both algorithms nevertheless perform well would suggest that a random choice of action at each step is equally likely to do well. Indeed, when we run with a “random bandit” that just chooses actions uniformly at random, we get the following results:
(Expected payoff, variance) over 1000 trials is (1.1094227056931132, 0.4403783017367529)
For a single run: 
Payoff was 3.13
Regret was 0.90
Best stock was amzn at 4.02

It’s not quite as good as either Exp3 or UCB1, but it’s close and less variable, which means a lot to an investor. In other words, it’s starting to look like Exp3 and UCB1 aren’t doing significantly better than random at all, and that a monkey would do well in this system (for these particular stocks).

Of course, Fortune 500 companies are pretty successful by definition, so let’s turn our attention to the random stocks:

For the random bandit learner:

(Expected payoff, variance) over 1000 trials is (-0.23952295977625776, 1.0787311145181104)
For a single run: 
Payoff was -2.01
Regret was 3.92
Best stock was htr at 1.91

For UCB1:

(Expected payoff, variance) over 1000 trials is (-0.3503593899029112, 1.1136234992964154)
For a single run: 
Payoff was 0.26
Regret was 1.65
Best stock was htr at 1.91
ucbs: '0.315, 0.315, 0.315, 0.316, 0.315, 0.315, 0.315, 0.315, 0.316'

And for Exp3:

(Expected payoff, variance) over 1000 trials is (-0.25827976810345593, 1.2946101887058519)
For a single run: 
Payoff was -0.34
Regret was 2.25
Best stock was htr at 1.91
weights: '0.08, 0.00, 0.14, 0.06, 0.48, 0.00, 0.00, 0.04, 0.19'

But again Exp3 has no idea what stocks are actually best, with the average, variance over 1000 trials being:

weight stats for lxrx: '0.11, 0.02'
weight stats for keg: '0.11, 0.02'
weight stats for htr: '0.12, 0.02'
weight stats for cadx: '0.10, 0.02'
weight stats for belfb: '0.11, 0.02'
weight stats for tdi: '0.11, 0.02'
weight stats for cuba: '0.11, 0.02'
weight stats for mux: '0.11, 0.02'
weight stats for brks: '0.11, 0.02'

The long and short of it is that the choice of Fortune 500 stocks was inherently so biased toward success than a monkey could have made money investing in them, while the average choice of stocks had, if anything, a bias toward loss. And unfortunately using an algorithm like UCB1 or Exp3 straight out of the box doesn’t produce anything better than a monkey.

Issues and Improvements

There are two glaring theoretical issues here that we haven’t yet addressed. One of these goes back to issue #5 in that list we gave at the beginning of the post: the bandit algorithms are assuming they have less information than they actually have! Indeed, at the end of a day of stock trading, you have a good idea what would have happened to you had you bought a different stock, and in our simplified world you can know exactly what your profit would have been. Recalling that UCB1 and Exp3 both maintained some numbers representing the strength of an action (Exp3 had a “weight” and UCB1 an upper confidence bound), the natural extension to both UCB1 and Exp3 is simply to modify the beliefs about all actions after any given round. This is a pretty simple improvement to make in our implementation, since it just changes a single weight update to a loop. For Exp3:

for choice in range(numActions):
   rewardForUpdate = reward(choice, t)
   scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
   estimatedReward = 1.0 * scaledReward / probabilityDistribution[choice]
   weights[choice] *= math.exp(estimatedReward * gamma / numActions)

With a similar loop for UCB1. This code should be familiar from our previous posts on bandits. We then rerun the new algorithms on the same data sets, and the results are somewhat surprising. First, UCB1 on Fortune 500:

(Expected payoff, variance) over 1000 trials is (3.530670654982728, 0.007713190816014095)
For a single run: 
Payoff was 3.56
Regret was 0.47

This is clearly outperforming the random bandit learning algorithm, with an average return of 350%! In fact, it does almost as well as the best stock, and the variance is quite low. UCB1 also outperforms Exp3, which fares comparably to its pre-improved self. That is, it’s still not much better than random:

(Expected payoff, variance) over 1000 trials is (1.1424797906901956, 0.434335471375294)
For a single run: 
Payoff was 1.24
Regret was 2.79

And also for the random stocks, UCB1 with improvements outperforms Exp3 and UCB1 without improvements. UCB1:

(Expected payoff, variance) over 1000 trials is (0.680211923900068, 0.04226672915962647)
For a single run:
Payoff was 0.82
Regret was 1.09

And Exp3:

(Expected payoff, variance) over 1000 trials is (-0.2242152508929378, 1.1312843329929194)
For a single run: 
Payoff was -0.16
Regret was 2.07

We might wonder why this is the case, and there is a plausible explanation. See, Exp3 has a difficult life: it has to assume that at any time the adversary can completely change the game. And so Exp3 must remain vigilant, continuing to try options it knows to be terrible for fear that they may spontaneously do well. Exp3 is the grandfather who, after 75 years of not winning the lotto, continues to buy tickets every week. A better analogy might be a lioness who, even after being moved to the zoo, stays up all night to protect a cub from predators. This gives us quite a new perspective on Exp3: the world really has to be that messed up for Exp3 to be useful. As we saw, UCB1 is much more eager to jump on a winning bandwagon, and it paid off in both the good (Fortune 500) and bad (random stock) scenarios. All in all, this experiment would provide some minor evidence that the stock market (or just this cheesy version of it) is more stochastic than adversarial.

The second problem is that we’re treating these stocks as if they were isolated from the rest of the world. Indeed, along with each stock comes some kind of context in the form of information about that stock. Historical prices, corporate announcements, cyclic boom and bust, what the talking heads think, all of this may be relevant to the price fluctuations of a stock on any given day. While Exp3 and UCB1 are ill-equipped to handle such a rich landscape, researchers in bandit learning have recognized the importance of context in decision making. So much so, in fact, that an entire subfield of “Contextual Bandits” was born. John Langford, perhaps the world’s leading expert on bandit learning, wrote on his blog in 2007,

I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.

I tend to agree with him. Bandit problems almost always come with some inherent additional structure in the real world, and the best algorithms will always take advantage of that structure. A “context” associated with each round is perhaps the weakest kind of structure, so it’s a natural place to look for better algorithms.

So that’s what we’ll do in the future of this series. But before then we might decide to come up with another experiment to run Exp3 and UCB1 on. It would be nice to see an instance in which Exp3 seriously outperforms UCB1, but maybe the real world is just stochastic and there’s nothing we can do about it.

Until next time!