# Bandits and Stocks

So far in this series we’ve seen two nontrivial algorithms for bandit learning in two different settings. The first was the UCB1 algorithm, which operated under the assumption that the rewards for the trials were independent and stochastic. That is, each slot machine was essentially a biased coin flip, and the algorithm was trying to find the machine with the best odds. The second was the Exp3 algorithm, which held the belief that the payoffs were arbitrary. In particular this includes the possibility that an adversary is setting the payoffs against you, and so we measured the success of an algorithm in terms of how it fares against the best single action (just as we did with UCB1, but with Exp3 it’s a nontrivial decision).

Before we move on to other bandit settings it’s natural to try to experiment with the ones we have on real world data. On one hand it’s interesting to see how they fare outside academia. And more relevantly to the design of the future bandit algorithms we’ll see on this blog, we need to know what worldly problems actually provide in terms of inputs to our learning algorithm in each round.

But another interesting issue goes like this. In the real world we can’t ever really know whether the rewards of the actions are stochastic or adversarial. Many people believe that adversarial settings are far too pathological to be realistic, while others claim that the assumptions made by stochastic models are too strict. To weigh in on this dispute, we’ll dip into a bit of experimental science and see which of the two algorithms performs better on the problem of stock trading. The outcome then gives a rough piece of evidence as to whether stocks behave stochastically or adversarially. But we don’t want to stir up too many flames, so we can always back up behind the veil of applied mathematics (“this model is too simple anyway”).

Indeed the model we use in this post is rather simplistic. I don’t know as much as I should (or as my father would have me know) about stock markets. In fact, I’m partial to not trading stocks on principle. But I must admit that average-quality stock data is easy to come by, and the basic notions of market interactions lend themselves naturally to many machine learning problems. If the reader has any ideas about how to strengthen the model, I welcome suggestions in the comments (or a fork on github).

A fair warning to the reader: we do not solve the problem of trading stocks by any means. We use a model that’s almost entirely unrealistic, and the results aren’t even that good. I’m quite nervous to publish this at all, just because above all else it reveals my gaping ignorance of how stock markets work. But this author believes that revealing ignorance is part of learning, if for no other reason than that it provides extremely valuable insight into the nature of a problem and an appreciation of its complexity. So criticize away, dear readers.

As usual, all of the code and data we use in this post is available on this blog’s Github page. Our language of choice for this post is Python.

This little trader got lucky. Could it be because he’s got TEN MONITORS?!

## Stocks for Dummies (me)

A quick primer on stocks, which is only as detailed as it needs to be for this post: a stock is essentially the sum of the value of all the assets of a company. A publicly traded company divides their stock into a number of “shares,” and owning a share represents partial ownership of the company. If you own 50% of the shares, you own 50% of the company. Companies sell shares or give them to employees as benefits (or options), and use the money gained through their sale for whatever they see fit. The increase in the price of a stock generally signifies the company is successful and growing; for example, stocks generally rise when a hotly anticipated product is announced.

The stock of a company is traded through one of a number of markets called stock exchanges. The buying and selling interactions are recorded and public, and there are many people in the world who monitor the interactions as they happen (via television, or programmatically) in the hopes of noticing opportunities before others and capitalizing on them. Each interaction induces a change in the price of a share of stock: whenever a share is bought at a certain price, that is the established and recorded price of a share (up to some fudging by brokers which is entirely mysterious to me). In any case, the prices go up and down, and they’re often bundled into “bars” which summarize the data over a certain period of time. The bars we use in this post are daily, and consist of four numbers: the open, the price at the beginning of the day, the high and low, which are self-explanatory, and the close, which is the price at the end of the day.

## Bandits and Daily Stock Trading

Now let’s simplify things as much as possible. Our bandit learning algorithm will interact with the market as follows: each day it chooses whether or not to buy a single dollar’s worth of a stock, and at the end of the day it sells the stock and observes the profit. There are no brokers involved, and the price the algorithm sees is the price it gets. In bandit language: the stocks represent actions, and the amount of profit at the end of a day constitutes the payoff of an action in one round. Since small-scale stock price movement is generally very poorly understood, it makes some level of sense to assume the price movements within a given day are adversarial. On the other hand, since we understand them so poorly, we might be tempted to just call them “random” fluctuations, i.e. stochastic. So this is a nice little testbed for seeing which assumption yields a more successful algorithm.

Unlike the traditional image of stock trading where an individual owns shares of a stock over a long period of time, our program will operate on a daily time scale, and hence cannot experience the typical kinds of growth. Nevertheless, we can try to make some money over time, and if it’s a good strategy, we could scale up the single dollar to whatever we’re willing to risk. Specifically, the code we used to compute the payoff is

def payoff(stockTable, t, stock, amountToInvest=1.0):
    openPrice, closePrice = stockTable[stock][t]

    sharesBought = amountToInvest / openPrice
    amountAfterSale = sharesBought * closePrice

    return amountAfterSale - amountToInvest
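
For concreteness, here is a tiny hypothetical example of the data layout this function assumes: stockTable maps a ticker symbol to a list of (open, close) price pairs indexed by trading day (the numbers below are made up).

stockTable = {'amzn': [(35.50, 36.10), (36.00, 35.20)]}  # hypothetical prices, not real data
print(payoff(stockTable, 0, 'amzn'))  # buy $1 at the open of day 0; profit is about 0.017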


The remainder of the code is interfacing with the Exp3 and UCB1 functions we gave in previous posts, and data shuffling. We got our data from Google Finance, and we provide it, along with all of the code used in the making of this post, on this blog’s Github page. Before we run our experiments, let’s give a few reasons why this model is unrealistic.

1. We assume we can buy/sell fractional shares of a stock, which to my knowledge is not possible. Though this experiment could be redone where you buy a single share of a stock, or with mutual funds/currency exchange/whatever replacing stocks, we didn’t do it this way.
2. Brokerage fees can drastically change the success of an algorithm which trades frequently.
3. Open and close prices are not typical prices. People will often make decisions based on the time of day, but then again we might expect this to be just another reason that Exp3 would perform better than UCB1.
4. We’re not actually trading in the stock market, and so we’re ignoring the effects of our own algorithm on the prices in the market.
5. It’s impossible to guarantee you get to use the opening price and closing price in your transactions.
6. UCB1 and Exp3 don’t use all of the information available. Indeed, they assume that they would not be able to see the outcome of an action they did not take, but with stocks you can get a good estimate of how much money you would have made had you chosen a different stock.
7. Each trial in a bandit learning problem is identical from the learner’s perspective, but one often keeps a stock around while making other decisions.

We’ll come back to #6 after seeing the raw experiments for an unaltered UCB1 and Exp3, because there is a natural extension of the algorithm to handle additional information. I’m sure there are other glaring issues with the experimental setup, and the reader should feel free to rant about it in the comments. It won’t stop me from running the algorithm and seeing what happens just for fun.

## Data Sets

We ran the experiment on two sets of stocks. The first set consisted of nine random stocks, taken from the random stocks twitter feed, with 5 years of past data. The stocks are:

lxrx, keg, cuba, tdi, brks, mux, cadx, belfb, htr

And you can view more information about these particular stocks via Google Finance. The second set was a non-random choice of nine Fortune 500 companies with 10 years of past data. The stocks were

amzn, cost, jpm, gs, wfc, msft, tgt, aapl, wmt

And again more information about these stocks is available via Google Finance. For the record, here were the cumulative payoffs of each of the nine Fortune 500 companies:

The cumulative rewards for the nine Fortune 500 companies over the last ten years of data.

Interestingly, the company which started off with the best prospects (Apple), turned out to have the worst cumulative reward by the end. The long-term winners in our little imaginary world happen to be Amazon, Costco, and Goldman Sachs. Perhaps this gives credence to the assumption that payoffs are adversarial. A learner can easily get tricked into putting too much faith in one action early on.

And for the random stocks:

The cumulative payoff for the nine randomly chosen stocks for the last five years of data.

The random stocks clearly perform worse and more variably overall (although HTR surpasses most of the Fortune 500 companies, despite its otherwise relatively modest stock growth over the last five years). To my untrained eyes these movements look more like a stochastic model than an adversarial one.

## Experiments

Here is a typical example of a run of Exp3 on the Fortune 500 data set (using $\gamma = 0.33$, recall $\gamma$ measures the amount of uniform exploration performed):

(Expected payoff, variance) over 1000 trials is (1.122463919564572, 0.5518037498918705)
For a single run:
Payoff was 1.12
Regret was 2.91
Best stock was amzn at 4.02
weights: '0.00, 0.00, 0.00, 0.46, 0.52, 0.00, 0.00, 0.00, 0.01'

And one for UCB1:

(Expected payoff, variance) over 1000 trials is (1.1529891576139333, 0.5012825847001482)
For a single run:
Payoff was 1.73
Regret was 2.29
Best stock was amzn at 4.02
ucbs: '0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234'

The results are quite curious. Indeed, the expected payoff seems to be a whopping 110% return! The variance of these results is quite high, and so it’s not at all impossible that the algorithm could have a negative return. But just as often it would return around 200% profit.

Before we go risking all our money on this strategy, let’s take a closer look at what’s happening in the algorithm. It appears that for UCB1 the upper confidence bounds assigned to each action are the same! In other words, even after ten years of trials, no single stock “shined” above the others in the eyes of UCB1. It may seem that Exp3 has a leg up on UCB1 in this respect, because it’s clear that it gives higher weights to some stocks over others. However, running the algorithm multiple times shows drastically different weight distributions, and if we average the resulting weights over a thousand rounds, we see that they all have roughly the same mean and variance (the mean being first in the pair):

weight stats for msft: (0.107, 0.025)
weight stats for jpm: (0.109, 0.027)
weight stats for tgt: (0.110, 0.029)
weight stats for gs: (0.112, 0.025)
weight stats for wmt: (0.110, 0.027)
weight stats for aapl: (0.111, 0.027)
weight stats for amzn: (0.120, 0.029)
weight stats for cost: (0.113, 0.026)
weight stats for wfc: (0.107, 0.023)

Indeed, the best stock, Amazon, had an average weight just barely larger (and more variable) than any of the other stocks. So this evidence points to the conclusion that neither Exp3 nor UCB1 has any clue which stock is better. Pairing this with the fact that both algorithms nevertheless perform well suggests that a random choice of action at each step is equally likely to do well. Indeed, when we run with a “random bandit” that just chooses actions uniformly at random, we get the following results:

(Expected payoff, variance) over 1000 trials is (1.1094227056931132, 0.4403783017367529)
For a single run:
Payoff was 3.13
Regret was 0.90
Best stock was amzn at 4.02

It’s not quite as good as either Exp3 or UCB1, but it’s close and less variable, which means a lot to an investor. In other words, it’s starting to look like Exp3 and UCB1 aren’t doing significantly better than random at all, and that a monkey would do well in this system (for these particular stocks).

Of course, Fortune 500 companies are pretty successful by definition, so let’s turn our attention to the random stocks:

For the random bandit learner:

(Expected payoff, variance) over 1000 trials is (-0.23952295977625776, 1.0787311145181104)
For a single run:
Payoff was -2.01
Regret was 3.92
Best stock was htr at 1.91

For UCB1:

(Expected payoff, variance) over 1000 trials is (-0.3503593899029112, 1.1136234992964154)
For a single run:
Payoff was 0.26
Regret was 1.65
Best stock was htr at 1.91
ucbs: '0.315, 0.315, 0.315, 0.316, 0.315, 0.315, 0.315, 0.315, 0.316'

And for Exp3:

(Expected payoff, variance) over 1000 trials is (-0.25827976810345593, 1.2946101887058519)
For a single run:
Payoff was -0.34
Regret was 2.25
Best stock was htr at 1.91
weights: '0.08, 0.00, 0.14, 0.06, 0.48, 0.00, 0.00, 0.04, 0.19'

But again Exp3 has no idea which stocks are actually best, with the average and variance of the weights over 1000 trials being:

weight stats for lxrx: '0.11, 0.02'
weight stats for keg: '0.11, 0.02'
weight stats for htr: '0.12, 0.02'
weight stats for cadx: '0.10, 0.02'
weight stats for belfb: '0.11, 0.02'
weight stats for tdi: '0.11, 0.02'
weight stats for cuba: '0.11, 0.02'
weight stats for mux: '0.11, 0.02'
weight stats for brks: '0.11, 0.02'

The long and short of it is that the choice of Fortune 500 stocks was inherently so biased toward success that a monkey could have made money investing in them, while the average choice of stocks had, if anything, a bias toward loss. And unfortunately using an algorithm like UCB1 or Exp3 straight out of the box doesn’t produce anything better than a monkey.

## Issues and Improvements

There are two glaring theoretical issues here that we haven’t yet addressed. One of these goes back to issue #6 in that list we gave at the beginning of the post: the bandit algorithms are assuming they have less information than they actually have! Indeed, at the end of a day of stock trading, you have a good idea what would have happened to you had you bought a different stock, and in our simplified world you can know exactly what your profit would have been. Recalling that UCB1 and Exp3 both maintained some numbers representing the strength of an action (Exp3 had a “weight” and UCB1 an upper confidence bound), the natural extension to both UCB1 and Exp3 is simply to modify the beliefs about all actions after any given round. This is a pretty simple improvement to make in our implementation, since it just changes a single weight update to a loop. For Exp3:

for choice in range(numActions):
    rewardForUpdate = reward(choice, t)
    # rewardMin and rewardMax are the bounds used to rescale payoffs into [0,1]
    scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
    estimatedReward = 1.0 * scaledReward / probabilityDistribution[choice]
    weights[choice] *= math.exp(estimatedReward * gamma / numActions)
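
A similar loop works for UCB1. Here is a minimal sketch of what it might look like, assuming the implementation keeps running totals payoffSums and numPlays for each action, as a typical UCB1 implementation does (the names here are illustrative):

for choice in range(numActions):
    rewardForUpdate = reward(choice, t)
    scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
    numPlays[choice] += 1               # treat every action as if it were played this round
    payoffSums[choice] += scaledReward  # update the empirical mean used in the confidence bound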


This code should be familiar from our previous posts on bandits. We then rerun the new algorithms on the same data sets, and the results are somewhat surprising. First, UCB1 on Fortune 500:

(Expected payoff, variance) over 1000 trials is (3.530670654982728, 0.007713190816014095)
For a single run:
Payoff was 3.56
Regret was 0.47

This is clearly outperforming the random bandit learning algorithm, with an average return of 350%! In fact, it does almost as well as the best stock, and the variance is quite low. UCB1 also outperforms Exp3, which fares comparably to its pre-improved self. That is, it’s still not much better than random:

(Expected payoff, variance) over 1000 trials is (1.1424797906901956, 0.434335471375294)
For a single run:
Payoff was 1.24
Regret was 2.79

And also for the random stocks, UCB1 with improvements outperforms Exp3 and UCB1 without improvements. UCB1:

(Expected payoff, variance) over 1000 trials is (0.680211923900068, 0.04226672915962647)
For a single run:
Payoff was 0.82
Regret was 1.09

And Exp3:

(Expected payoff, variance) over 1000 trials is (-0.2242152508929378, 1.1312843329929194)
For a single run:
Payoff was -0.16
Regret was 2.07

We might wonder why this is the case, and there is a plausible explanation. See, Exp3 has a difficult life: it has to assume that at any time the adversary can completely change the game. And so Exp3 must remain vigilant, continuing to try options it knows to be terrible for fear that they may spontaneously do well. Exp3 is the grandfather who, after 75 years of not winning the lotto, continues to buy tickets every week. A better analogy might be a lioness who, even after being moved to the zoo, stays up all night to protect a cub from predators. This gives us quite a new perspective on Exp3: the world really has to be that messed up for Exp3 to be useful. As we saw, UCB1 is much more eager to jump on a winning bandwagon, and it paid off in both the good (Fortune 500) and bad (random stock) scenarios. All in all, this experiment would provide some minor evidence that the stock market (or just this cheesy version of it) is more stochastic than adversarial.

The second problem is that we’re treating these stocks as if they were isolated from the rest of the world. Indeed, along with each stock comes some kind of context in the form of information about that stock. Historical prices, corporate announcements, cyclic boom and bust, what the talking heads think, all of this may be relevant to the price fluctuations of a stock on any given day. While Exp3 and UCB1 are ill-equipped to handle such a rich landscape, researchers in bandit learning have recognized the importance of context in decision making. So much so, in fact, that an entire subfield of “Contextual Bandits” was born. John Langford, perhaps the world’s leading expert on bandit learning, wrote on his blog in 2007,

> I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.

I tend to agree with him. Bandit problems almost always come with some inherent additional structure in the real world, and the best algorithms will always take advantage of that structure. A “context” associated with each round is perhaps the weakest kind of structure, so it’s a natural place to look for better algorithms.

So that’s what we’ll do in the future of this series. But before then we might decide to come up with another experiment to run Exp3 and UCB1 on. It would be nice to see an instance in which Exp3 seriously outperforms UCB1, but maybe the real world is just stochastic and there’s nothing we can do about it.

Until next time!

# Adversarial Bandits and the Exp3 Algorithm

In the last twenty years there has been a lot of research in a subfield of machine learning called Bandit Learning. The name comes from the problem of being faced with a large sequence of slot machines (once called one-armed bandits) each with a potentially different payout scheme. The problems in this field all focus on one central question:

If I have many available actions with uncertain outcomes, how should I act to maximize the quality of my results over many trials?

The deep question here is how to balance exploitation, the desire to choose an action which has paid off well in the past, with exploration, the desire to try options which may produce even better results. The ideas are general enough that it’s hard not to find applications: choosing which drug to test in a clinical study, choosing which companies to invest in, choosing which ads or news stories to display to users, and even (as Richard Feynman once wondered) how to maximize your dining enjoyment.

Herbert Robbins, one of the first to study bandit learning algorithms. Image credit

In less recent times (circa 1960’s), this problem was posed and considered in the case where the payoff mechanisms had a very simple structure: each slot machine is a coin flip with a different probability $p$ of winning, and the player’s goal is to find the best machine as quickly as possible. We called this the “stochastic” setting, and last time we saw a modern strategy called UCB1 which maintained statistical estimates on the payoffs of the actions and chose the action with the highest estimate. The underlying philosophy was “optimism in the face of uncertainty,” and it gave us something provably close to optimal.
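
For reference, the upper confidence bound that standard UCB1 assigns to an action $i$ that has been played $n_i$ times over $t$ total rounds is its empirical mean payoff plus an exploration bonus:

$\displaystyle \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}}$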

Unfortunately payoff structures are more complex than coin flips in the real world. Having “optimism” is arguably naive, especially when it comes to competitive scenarios like stock trading. Indeed the algorithm we’ll analyze in this post will take the polar opposite stance, that payoffs could conceivably operate in any manner. This is called the adversarial model, because even though the payoffs are fixed in advance of the game beginning, it can always be the case that the next choice you make results in the worst possible payoff.

One might wonder how we can hope to do anything in such a pessimistic model. As we’ll see, our notion of performing well is relative to the best single slot machine, and we will argue that this is the only reasonable notion of success. On the other hand, one might argue that real world payoffs are almost never entirely adversarial, and so we would hope that algorithms which do well theoretically in the adversarial model excel beyond their minimal guarantees in practice.

In this post we’ll explore and implement one algorithm for adversarial bandit learning, called Exp3, and in the next post we’ll see how it fares against UCB1 in some applications. Some prerequisites: since the main algorithm presented in this post is randomized, its analysis requires some familiarity with techniques and notation from probability theory. Specifically, we will assume that the reader is familiar with the content of this blog’s basic probability theory primers (1, 2), though the real difficulty in the analysis will be keeping up with all of the notation.

In case the reader is curious, Exp3 was invented in 2001 by Auer, Cesa-Bianchi, Freund, and Schapire. Here is their original paper, which contains lots of other mathematical goodies.

As usual, all of the code and data produced in the making of this blog post is available for download on this blog’s Github page.

## Model Formalization and Notions of Regret

Before we describe and analyze the algorithm, we have to set up the problem formally. The first few paragraphs of our last post give a high-level picture of general bandit learning, so we won’t repeat that here. Recall, however, that we have to describe both the structure of the payoffs and how success is measured. So let’s describe the former first.

Definition: An adversarial bandit problem is a pair $(K, \mathbf{x})$, where $K$ represents the number of actions (henceforth indexed by $i$), and $\mathbf{x}$ is an infinite sequence of payoff vectors $\mathbf{x} = \mathbf{x}(1), \mathbf{x}(2), \dots$, where $\mathbf{x}(t) = (x_1(t), \dots, x_K(t))$ is a vector of length $K$ and $x_i(t) \in [0,1]$ is the reward of action $i$ on step $t$.

In English, the game is played in rounds (or “time steps”) indexed by $t = 1, 2, \dots$, and the payoffs are fixed for each action and time before the game even starts. Note that we assume the reward of an action is a number in the interval $[0,1]$, but all of our arguments in this post can be extended to payoffs in some range $[a,b]$ by shifting by $a$ and dividing by $b-a$.
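
Concretely, if the raw payoffs lie in $[a,b]$, we would run the algorithm on the rescaled rewards

$\displaystyle x_i'(t) = \frac{x_i(t) - a}{b - a} \in [0,1],$

and any regret bound proved for the rescaled game translates back to the original payoffs after multiplying by $b - a$.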

Let’s specify what the player (algorithm designer) knows during the course of the game. First, the value of $K$ is given, and the total number of rounds is kept secret. In each round, the player has access to the history of rewards for the actions that were chosen by the algorithm in previous rounds, but not the rewards of unchosen actions. In other words, it will only ever know one $x_i(t)$ for each $t$. To set up some notation, if we call $i_1, \dots, i_t$ the list of chosen actions over $t$ rounds, then at step $t+1$ the player has access to the values of $x_{i_1}(1), \dots, x_{i_t}(t)$ and must pick $i_{t+1}$ to continue.

So to be completely clear, the game progresses as follows:

1. The player is given access to $K$.
2. For each time step $t$:
   1. The player must pick an action $i_t$.
   2. The player observes the reward $x_{i_t}(t) \in [0,1]$, which he may save for future use.

The problem gives no explicit limit on the amount of computation performed during each step, but in general we want it to run in polynomial time and not depend on the round number $t$. If runtime even logarithmically depended on $t$, then we’d have a big problem using it for high-frequency applications. For example in ad serving, Google processes on the order of $10^9$ ads per day; so a logarithmic dependence wouldn’t be that bad, but at some point in the distant future Google wouldn’t be able to keep up (and we all want long-term solutions to our problems).

Note that the reward vectors $\mathbf{x}(t)$ must be fixed in advance of the algorithm running, but this still allows a lot of counterintuitive things. For example, the payoffs can depend adversarially on the algorithm the player decides to use. If, say, the player chooses the stupid strategy of always picking the first action, then the adversary can just make that the worst possible action to choose. However, the rewards cannot depend on the random choices made by the player during the game.

So now let’s talk about measuring success. For an algorithm $A$ which chooses the sequence $i_1, \dots, i_t$ of actions, define $G_A(t)$ to be the sum of the observed rewards

$\displaystyle G_A(t) = \sum_{s=1}^t x_{i_s}(s)$.

And because $A$ will often be randomized, this value is a random variable depending on the decisions made by $A$. As such, we will often only consider the payoff up to expectation. That is, we’ll be interested in how $\mathbb{E}(G_A(t))$ relates to other possible courses of action. To be completely rigorous, the randomization is not over “choices made by an algorithm,” but rather the probability distribution over sequences of actions that the algorithm induces. It’s a fine distinction but a necessary one. In other words, we could define any sequence of actions $\mathbf{j} = (j_1, \dots, j_t)$ and define $G_{\mathbf{j}}(t)$ analogously as above:

$\displaystyle G_{\mathbf{j}}(t) = \sum_{s=1}^t x_{j_s}(s)$.

Any algorithm and choice of reward vectors induces a probability distribution over sequences of actions in a natural way (if you want to draw from the distribution, just run the algorithm). So instead of conditioning our probabilities and expectations on previous choices made by the algorithm, we do it over histories of actions $i_1, \dots, i_t$.

An obvious question we might ask is: why can’t the adversary just make all the payoffs zero? (or negative!) In this event the player won’t get any reward, but he can emotionally and psychologically accept this fate. If he never stood a chance to get any reward in the first place, why should he feel bad about the inevitable result? What a truly cruel adversary wants is, at the end of the game, to show the player what he could have won, and have it far exceed what he actually won. In this way the player feels regret for not using a more sensible strategy, and likely turns to binge eating cookie dough ice cream. Or more likely he returns to the casino to lose more money. The trick that the player has up his sleeve is precisely the randomness in his choice of actions, and he can use its objectivity to partially overcome even the nastiest of adversaries.

The adversary would love to show you this bluff after you choose to fold your hand. What a jerk. Image credit

Sadism aside, this thought brings us to a few mathematical notions of regret that the player algorithm may seek to minimize. The first, most obvious, and least reasonable is the worst-case regret. Given a stopping time $T$ and a sequence of actions $\mathbf{j} = (j_1, \dots, j_T)$, the expected regret of algorithm $A$ with respect to $\mathbf{j}$ is the difference $G_{\mathbf{j}}(T) - \mathbb{E}(G_A(T))$. This notion of regret measures the regret of a player if he knew what would have happened had he played $\mathbf{j}$.  The expected worst-case regret of $A$ is then the maximum over all sequences $\mathbf{j}$ of the regret of $A$ with respect to $\mathbf{j}$. This notion of regret seems particularly unruly, especially considering that the payoffs are adversarial, but there are techniques to reason about it.

However, the focus of this post is on a slightly easier notion of regret, called weak regret, which instead compares the results of $A$ to the best single action over all rounds. That is, this quantity is just

$\displaystyle \left ( \max_{j} \sum_{t=1}^T x_j(t) \right ) - \mathbb{E}(G_A(T))$

We call the parenthetical term $G_{\textup{max}}(T)$. This kind of regret is a bit easier to analyze, and the main theorem of this post will give an upper bound on it for Exp3. The reader who read our last post on UCB1 will wonder why we make a big distinction here just to arrive at the same definition of regret that we had in the stochastic setting. But with UCB1 the best sequence of actions to take just happened to be to play the best action over and over again. Here, the payoff difference between the best sequence of actions and the best single action can be arbitrarily large.

## Exp3 and an Upper Bound on Weak Regret

We now describe the Exp3 algorithm.

Exp3 stands for Exponential-weight algorithm for Exploration and Exploitation. It works by maintaining a list of weights for each of the actions, using these weights to decide randomly which action to take next, and increasing (decreasing) the relevant weights when a payoff is good (bad). We further introduce an egalitarianism factor $\gamma \in [0,1]$ which tunes the desire to pick an action uniformly at random. That is, if $\gamma = 1$, the weights have no effect on the choices at any step.

The algorithm is readily described in Python code, but we need to set up some notation used in the proof of the theorem. The pseudocode for the algorithm is as follows.

Exp3

1. Given $\gamma \in [0,1]$, initialize the weights $w_i(1) = 1$ for $i = 1, \dots, K$.
2. In each round $t$:
   1. Set $\displaystyle p_i(t) = (1-\gamma)\frac{w_i(t)}{\sum_{j=1}^K w_j(t)} + \frac{\gamma}{K}$ for each $i$.
   2. Draw the next action $i_t$ randomly according to the distribution of $p_i(t)$.
   3. Observe reward $x_{i_t}(t)$.
   4. Define the estimated reward $\hat{x}_{i_t}(t)$ to be $x_{i_t}(t) / p_{i_t}(t)$.
   5. Set $\displaystyle w_{i_t}(t+1) = w_{i_t}(t) e^{\gamma \hat{x}_{i_t}(t) / K}$.
   6. Set all other $w_j(t+1) = w_j(t)$.

The choices of these particular mathematical quantities (in steps 1, 4, and 5) are a priori mysterious, but we will explain them momentarily. In the proof that follows, we will extend $\hat{x}_{i_t}(t)$ to indices other than $i_t$ and define those values to be zero.

The Python implementation is perhaps more legible, and implements the possibly infinite loop as a generator:

def exp3(numActions, reward, gamma):
    weights = [1.0] * numActions

    t = 0
    while True:
        probabilityDistribution = distr(weights, gamma)
        choice = draw(probabilityDistribution)
        theReward = reward(choice, t)

        estimatedReward = 1.0 * theReward / probabilityDistribution[choice]
        weights[choice] *= math.exp(estimatedReward * gamma / numActions)  # important that we use estimated reward here!

        yield choice, theReward, estimatedReward, weights
        t = t + 1


Here the “rewards” variable refers to a callable which accepts as input the action chosen in round $t$ (keeps track of $t$, assuming we’ll play nice), and returns as output the reward for that choice. The distr and draw functions are also easily defined, with the former depending on the gamma parameter as follows:

def distr(weights, gamma=0.0):
    theSum = float(sum(weights))
    return tuple((1.0 - gamma) * (w / theSum) + (gamma / len(weights)) for w in weights)
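
The draw function is not shown here; a minimal sketch of one standard way to sample an index from such a distribution (roulette-wheel selection; the exact implementation in the repository may differ) is:

import random

def draw(probabilityDistribution):
    # pick an index at random, weighted by the given probabilities
    choice = random.uniform(0, sum(probabilityDistribution))
    runningTotal = 0.0

    for index, probability in enumerate(probabilityDistribution):
        runningTotal += probability
        if choice <= runningTotal:
            return index

    return len(probabilityDistribution) - 1  # guard against floating point rounding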


There is one odd part of the algorithm above, and that’s the “estimated reward” $\hat{x}_{i_t}(t) = x_{i_t}(t) / p_{i_t}(t)$. The intuitive reason to do this is to compensate for a potentially small probability of getting the observed reward. More formally, it ensures that the conditional expectation of the “estimated reward” is the actual reward. We will explore this formally during the proof of the main theorem.
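
As a quick sanity check: conditioned on the history of plays, action $i$ is chosen with probability $p_i(t)$, and the estimate is defined to be zero when $i$ is not chosen, so

$\displaystyle \mathbb{E} \left [ \hat{x}_i(t) \mid i_1, \dots, i_{t-1} \right ] = p_i(t) \cdot \frac{x_i(t)}{p_i(t)} + (1 - p_i(t)) \cdot 0 = x_i(t).$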

As usual, the programs we write in this post are available on this blog’s Github page.

We can now state and prove the upper bound on the weak regret of Exp3. Note all logarithms are base $e$.

Theorem: For any $K > 0, \gamma \in (0, 1]$, and any stopping time $T \in \mathbb{N}$

$\displaystyle G_{\textup{max}}(T) - \mathbb{E}(G_{\textup{Exp3}}(T)) \leq (e-1)\gamma G_{\textup{max}}(T) + \frac{K \log K}{\gamma}$.

This is a purely analytical result because we don’t actually know what $G_{\textup{max}}(T)$ is ahead of time. Also note how the factor of $\gamma$ occurs: in the first term, having a large $\gamma$ will result in a poor upper bound because it occurs in the numerator of that term: too much exploration means not enough exploitation. But it occurs in the denominator of the second term, meaning that not enough exploration can also produce an undesirably large regret. This theorem then provides a quantification of the tradeoff being made, although it is just an upper bound.

Proof.

We present the proof in two parts. Part 1:

We made a notable mistake in part 1, claiming that $e^x \leq 1 + x + (e-2)x^2$ when $x \leq 1$. In fact, this does follow from the Taylor series expansion of $e^x$, but it’s not as straightforward as I made it sound. In particular, note that $e^x = 1 + x + \frac{x^2}{2!} + \dots$, and so $e^1 = 2 + \sum_{k=2}^\infty \frac{1}{k!}$. Using $(e-2)$ in place of $\frac{1}{2}$ gives

$\displaystyle 1 + x + \left ( \sum_{k=2}^{\infty} \frac{x^2}{k!} \right )$

And since $0 < x \leq 1$, each term in the sum will decrease when replaced by $\frac{x^k}{k!}$, and we’ll be left with exactly $e^x$. In other words, this is the tightest possible quadratic upper bound on $e^x$. Pretty neat! On to part 2:

As usual, here is the entire canvas made over the course of both videos.

$\square$

We can get a version of this theorem that is easier to analyze by picking a suitable choice of $\gamma$.

Corollary: Assume that $G_{\textup{max}}(T)$ is bounded by $g$, and that Exp3 is run with

$\displaystyle \gamma = \min \left ( 1, \sqrt{\frac{K \log K}{(e-1)g}} \right )$

Then the weak regret of Exp3 is bounded by $2.63 \sqrt{g K \log K}$ for any reward vector $\mathbf{x}$.

Proof. Simply plug this choice of $\gamma$ into the bound in the theorem above, and note that $2 \sqrt{e-1} < 2.63$. $\square$
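
In a bit more detail: with this choice of $\gamma$ (assuming the minimum is attained by the square root term) and $G_{\textup{max}}(T) \leq g$, the two terms of the bound become

$\displaystyle (e-1)\gamma G_{\textup{max}}(T) + \frac{K \log K}{\gamma} \leq (e-1) g \sqrt{\frac{K \log K}{(e-1)g}} + K \log K \sqrt{\frac{(e-1)g}{K \log K}} = 2\sqrt{(e-1) g K \log K},$

and $2\sqrt{e-1} < 2.63$.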

## A Simple Test Against Coin Flips

Now that we’ve analyzed the theoretical guarantees of the Exp3 algorithm, let’s use our implementation above and see how it fares in practice. Our first test will use 10 coin flips (Bernoulli trials) for our actions, with the probabilities of winning (and the actual payoff vectors) defined as follows:

biases = [1.0 / k for k in range(2,12)]
rewardVector = [[1 if random.random() < bias else 0 for bias in biases] for _ in range(numRounds)]
rewards = lambda choice, t: rewardVector[t][choice]


If we are to analyze the regret of Exp3 against the best action, we must compute the payoffs for all actions ahead of time, and compute which is the best. It will almost certainly be the one with the largest probability of winning (the first in the list generated above), but because the payoffs are random it might not be, so we have to compute it. Specifically, it’s the following argmax:

bestAction = max(range(numActions), key=lambda action: sum([rewardVector[t][action] for t in range(numRounds)]))


Where the max function is used as “argmax” would be in mathematics.

We also have to pick a good choice of $\gamma$, and the corollary from the previous section gives us a good guide to the optimal $\gamma$: simply find a good upper bound on the reward of the best action, and use that. We can cheat a little here: we know the best action has a probability of 1/2 of paying out, and so the expected reward if we always did the best action is half the number of rounds. If we use, say, $g = 2T / 3$ and compute $\gamma$ using the formula from the corollary, this will give us a reasonable (but perhaps not perfectly correct) upper bound.

Then we just run the exp3 generator for $T = \textup{10,000}$ rounds, and compute some statistics as we go:

bestUpperBoundEstimate = 2 * numRounds / 3
gamma = math.sqrt(numActions * math.log(numActions) / ((math.e - 1) * bestUpperBoundEstimate))
gamma = 0.07  # override the computed value by hand (we experiment with different gammas below)

cumulativeReward = 0
bestActionCumulativeReward = 0
weakRegret = 0

t = 0
for (choice, reward, est, weights) in exp3(numActions, rewards, gamma):
    cumulativeReward += reward
    bestActionCumulativeReward += rewardVector[t][bestAction]

    weakRegret = (bestActionCumulativeReward - cumulativeReward)
    regretBound = (math.e - 1) * gamma * bestActionCumulativeReward + (numActions * math.log(numActions)) / gamma

    t += 1
    if t >= numRounds:
        break


At the end of one run of ten thousand rounds, the weights are overwhelmingly in favor of the best arm. The cumulative regret is 723, compared to the theoretical upper bound of 897. It’s not too shabby, but by tinkering with the value of $\gamma$ we see that we can get regrets lower than 500 (when $\gamma$ is around 0.7). Considering that the cumulative reward for the player is around 4,500 in this experiment, that means we spent only about 500 rounds out of ten thousand exploring non-optimal options (and also getting unlucky during said exploration). Not too shabby at all.

Here is a graph of a run of this experiment.

A run of Exp3 against Bernoulli rewards. The first graph represents the simple regret of the player algorithm against the best action; the blue line is the actual simple regret, and the green line is the theoretical O(sqrt(k log k)) upper bound. The second graph shows the weights of each action evolving over time. The blue line is the weight of the best action, while the green and red lines are the weights of the second and third best actions.

Note how the Exp3 algorithm never stops increasing its regret. This is in part because of the adversarial model; even if Exp3 finds the absolutely perfect action to take, it just can’t get over the fact that the world might try to screw it over. As long as the $\gamma$ parameter is greater than zero, Exp3 will explore bad options just in case they turn out to be good. The benefit of this is that if the model changes over time Exp3 will adapt, but the downside is that the pessimism inherent in this worldview generally results in lower payoffs than other algorithms.

## More Variations, and Future Plans

Right now we have two contesting models of how the world works: is it stochastic and independent, like the UCB1 algorithm would optimize for? Or does it follow Exp3’s world view that the payoffs are adversarial? Next time we’ll run some real-world tests to see how each fares.

But before that, we should note that there are still more models we haven’t discussed. One extremely significant model is that of contextual bandits. That is, the real world settings we care about often come with some “context” associated with each trial. Ads being displayed to users have probabilities that should take into account the information known about the user, and medical treatments should take into account past medical history. While we will not likely investigate any contextual bandit algorithms on this blog in the near future, the reader who hopes to apply this work to his or her own exploits (no pun intended) should be aware of the additional reading.

Until next time!

Postscript: years later, a cool post by Tim Vieira shows a neat data structure that asymptotically speeds up the update/sample step of the EXP3 algorithm from linear to logarithmic (among others). The weights are stored in a heap of partial sums (the leaves are the individual weights), and sampling is a binary search. See the original post and the accompanying gist for an implementation. Exercise: implement the data structure for use with our EXP3 implementation.

# Anti-Coordination Games and Stable Graph Colorings

## My First Paper

I’m pleased to announce that my first paper, titled “Anti-Coordination Games and Stable Colorings,” has been accepted for publication! The venue is the Symposium on Algorithmic Game Theory, which will take place in Aachen, Germany this October. A professor of mine once told me that everyone puts their first few publications on a pedestal, so I’ll do my best to keep things down to earth by focusing on the contents of the paper and not my swirling cocktail of pride. The point of this post is to explain the results of my work to a non-research-level audience; the research level folks will likely feel more comfortable reading the paper itself. So here we’ll spend significantly longer explaining the proofs and the concepts, and significantly less time on previous work.

I will assume familiarity with basic graph theory (we have a gentle introduction to that here) and NP-completeness proofs (again, see our primer). We’ll give a quick reminder about the latter when we get to it.

## Anti-Coordination Games on Graphs

The central question in the paper is how to find stable strategy profiles for anti-coordination games played on graphs. This section will flesh out exactly what all of that means.

The easiest way to understand the game is in terms of fashion. Imagine there is a group of people. Every day they choose their outfits individually and interact with their friends. If any pair of friends happens to choose the same clothing, then they both suffer some embarrassment. We can alternatively say that whenever two friends anti-coordinate their outfits, they each get some kind of reward. If not being embarrassed is your kind of reward, then these really are equivalent. Not every pair of people are friends, so perhaps the most important aspect of this problem is how the particular friendship network considered affects their interactions. This kind of game is called an anti-coordination game, and the network of friends makes it a “game on a graph.” We’ll make this more rigorous shortly.

We can ask questions like, if everyone is acting independently and greedily will their choices converge over time to a single choice of outfit? If so how quickly? How much better could a centralized fashion-planner who knows the entire friendship network fare in choosing outfits? Is the problem of finding a best strategy for picking outfits computationally hard? What if some pairs of people want to coordinate their outfits and others don’t? What if caring about another’s fashion is only one-sided in some cases?

Already this problem is rooted in the theory of social networks, but the concept of an anti-coordination game played on a graph is quite broad, and the relevance of this model to the real world comes from the generality of a graph. For example, one may consider the trading networks of various countries; in this case not all countries are trading partners, and it is beneficial to produce different commodities than your trading partners so that you actually benefit from the interaction. Likewise, neighboring radio towers want to emit signals on differing wavelengths to minimize interference, and commuters want to pick different roadways to minimize traffic. These are all examples of this model which we’re about to formalize.

In place of our “network of friends,” the game entails a graph $G = (V,E)$ in which each player is represented by a vertex, and there is an edge between two vertices whenever the corresponding players are trying to anti-coordinate. We will use the terms player and vertex interchangeably. For now the graph is undirected, but later we will relax this assumption and work with directed graphs. In place of “outfits” we’ll have a generic set of strategies denoted by the numbers $1, \dots, k$, and each vertex will choose a strategy from this set. In one round of the game, each vertex $v$ chooses a strategy, and this defines a function $f : V \to \left \{ 1, \dots, k \right \}$ from the set of vertices to the set of strategies. Then the payoff of a vertex $v$ in a round, which we denote $\mu_f(v)$, is the number of neighbors of $v$ which have chosen a different strategy than $v$. That is, it is

$\displaystyle \mu_f(v) = \sum_{(v,w) \in E} \mathbf{1}_{\left \{ f(v) \neq f(w) \right \}}$

Where $\mathbf{1}_{A}$ denotes the indicator function for the event $A$, which assumes a value of 1 when the event occurs and 0 otherwise. Here is an example of an instance of the game. We have three strategies, denoted by colors, and the payoff for the vertex labeled $v$ is three.

If this game is played over many many rounds, we can ask if there is a so-called Nash equilibrium. That is, is there a choice of strategies for the players so that no single player will have an incentive to change in future rounds, assuming nobody else changes? We restrict even further to thinking about pure strategy Nash equilibria, which means there are no probabilistic choices made in choosing a strategy. Equivalently, a pure strategy equilibrium is just a choice of a strategy for each vertex which doesn’t change across rounds. In terms of the graph, we call a strategy function $f$ which is a Nash equilibrium a stable equilibrium (or, as will be made clear in the next paragraph, a stable coloring). It must satisfy the property that no vertex can increase its payoff by switching to a different strategy. So our question now becomes: how can we find a stable coloring which is as good as possible for all players involved? Slightly more generally, we call a Nash equilibrium a strictly stable equilibrium (or a strictly stable coloring) if every vertex would strictly decrease its payoff by switching to another strategy. In a plain old stable coloring a player might keep the same payoff by switching strategies, but in a strictly stable coloring any switch necessarily yields a worse payoff. Though it’s not at all clear now, we will see that this distinction is the difference between computational tractability and infeasibility.

We can see a very clear connection between this game and graph coloring. Here an edge produces a payoff of 1 for each of its two vertices if and only if it’s properly colored. And so if the strategy choice function $f$ is also a proper coloring, this will produce the largest possible payoff for all vertices in the graph. But it may not be the case that (for a fixed set of strategies) the graph is properly colorable, and we already know that finding a proper coloring with more than two colors is a computationally hard problem. So this isn’t a viable avenue for solving our fashion game. In any case, the connection confuses us enough to interchangeably call the strategy choice function $f$ a coloring of $G$.

As an interesting side note, a slight variation of this game was actually tested on humans (with money as payoff!) to see how well they could do. Each human player was only shown the strategies of their neighbors, and received $5 for every round in which they collectively arrived at a proper coloring of the graph. See this article in Science for more details.

Since our game allows for the presence of improperly colored edges, we could instead propose to find an assignment of colors to vertices which maximizes the sum of the payoffs of all players. In this vein, we define the social welfare of a graph and a coloring, denoted $W(G,f)$, to be the sum of the payoffs for all vertices $\sum_v \mu_f(v)$. This is a natural quantity one wants to analyze. Unfortunately, even in the case of two strategies, this quantity is computationally difficult (NP-hard) to maximize. It’s a version of the MAX-CUT problem, in which we try to separate the graph into two sets $X, Y$ such that the largest number of edges crosses from $X$ to $Y$. The correspondence between the two problems is seen by having $X$ represent those vertices which get strategy 1 and $Y$ represent strategy 2. So we can’t hope to find an efficient algorithm maximizing social welfare.

The next natural question is: can we find stable or strictly stable colorings at all? Will they even necessarily exist? The answers to these questions form the main results of our paper.

## An Algorithm for Stable Colorings, and the Price of Anarchy

It turns out that there is a very simple greedy algorithm for finding stable colorings of a graph. We state it in the form of a proposition. By a stable $k$-coloring we mean a stable coloring of the graph with $k$ colors (strategies).

Proposition: For every graph $G$ and every $k \geq 1$, $G$ admits a stable $k$-coloring, and such a coloring can be found in polynomial time.

Proof. The proof operates by using the social welfare function as a so-called potential function. That is, a change in a player’s strategy which results in a higher payoff results in a higher value of the social welfare function. It is easy to see why: if a player $v$ changes to a color that helps him, then it will result in more properly colored edges (adjacent to $v$) than there were before. This means that more of $v$‘s neighbors receive an additional 1 unit of payoff than those that lost 1 as a result of $v$‘s switch.

We call a vertex which has the potential to improve its payoff unhappy, and one which cannot improve its payoff happy. And so our algorithm to find a stable coloring simply finds some unhappy vertex, switches its color to the most uncommon color among its neighbors, and repeats the process until all vertices are happy. Each switch strictly increases the social welfare, which is an integer bounded above by $2|E|$, so the process halts after polynomially many switches. The result is a local maximum of the social welfare function, which is the very definition of a stable coloring. $\square$
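
To make the procedure concrete, here is a minimal Python sketch of this greedy local search (my own illustration, not code from the paper); the graph is assumed to be given as a dictionary mapping each vertex to a list of its neighbors:

from collections import Counter

def stableColoring(neighbors, k):
    # neighbors: dict mapping each vertex to a list of adjacent vertices
    # returns a dict mapping each vertex to a color in range(k)
    coloring = dict((v, 0) for v in neighbors)

    changed = True
    while changed:
        changed = False
        for v in neighbors:
            counts = Counter(coloring[w] for w in neighbors[v])
            leastCommon = min(range(k), key=lambda c: counts[c])

            if counts[leastCommon] < counts[coloring[v]]:
                # v is unhappy: switching to the least common neighboring color strictly helps
                coloring[v] = leastCommon
                changed = True

    return coloring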

So that was nice, but we might ask: how much worse is the social welfare arrived at by this algorithm than the optimal social welfare? How much do we stand to lose by accepting the condemnation of NP-hardness and settling for the greedy solution we found? More precisely, if we call $Q$ the set of stable colorings and $C$ the set of all possible colorings, what is the value of

$\displaystyle \frac{\max_{c' \in C} W(G, c')}{\min_{c \in Q} W(G, c)}$

This is a well-studied quantity in many games, called the price of anarchy. The name comes from the thought: what do we stand to gain by having a central authority, who can see the entire network topology and decide what is best for us, manage our strategies? The alternative (anarchy) is to have each player in the game act as selfishly and rationally as possible without complete information. It turns out that as the number of strategies grows large in our anti-coordination game, there is no price of anarchy. For our game this obviously depends on the choice of graph, but we know what it is in the worst case, and we formally state the result as a proposition:

Proposition: For any graph, the price of anarchy for the $k$-strategy anti-coordination game is at most $k/(k-1)$, and this value is actually achieved by some instances of the game.

Proof. The pigeonhole principle says that every vertex can always achieve at least a $(k-1)/k$ fraction of its maximum possible payoff. Specifically, if a vertex $v_i$ can’t achieve a proper coloring, then every color must be accounted for among $v_i$‘s neighbors. Choosing the minimally occurring color will give $v_i$ at least a payoff of $d_i(k-1)/k$, where $d_i$ is the number of neighbors of $v_i$. Since every stable coloring has to satisfy the condition that no vertex can do any better than the strategy it already has, even in the worst stable coloring every vertex has already chosen such a minority color. Since the maximum possible social welfare is twice the number of edges $2|E|$, and the sum of the degrees is $\sum_i d_i = 2|E|$, we have that the price of anarchy is at most

$\displaystyle \frac{2|E|}{\frac{k-1}{k} \sum_i d_i} = \frac{k}{k-1}$

Indeed, we can’t do any better than this in general, because the following graph gives an example where the price of anarchy exactly meets this bound.

An instance of the anti-coordination game with 5 strategies which meets the upper bound on price of anarchy.

This example can easily be generalized to work with arbitrary $k$. We leave the details as an exercise to the reader. $\square$

## Strictly Stable Colorings are Hard to Find

Perhaps surprisingly, the relatively minor change of adding strictness is enough to make the problem computationally intractable. We’ll give an explicit proof of this, but first let’s recall what it means to be intractable.

Recall that a problem is in NP if there is an efficient (read, polynomial-time) algorithm which can verify a solution to the problem is actually a solution. For example, the problem of proper graph $k$-coloring is in NP, because if someone gives you a purported coloring all you have to do is verify that each of the $O(n^2)$ edges is properly colored. Similarly, the problem of strictly stable coloring is in NP; all one need do is verify that no choice of a different color for any vertex improves its payoff, and it is trivial to come up with an algorithm which checks this.

Next, call a problem $A$ NP-hard if a solution to $A$ allows you to solve any problem in NP. More formally, $A$ being NP-hard means that there is a polynomial-time reduction from any problem $B$ in NP to $A$ in the following (rough) sense: there is a polynomial-time computable function (i.e. deterministic program) $f$ which takes inputs for $B$ and transforms them into inputs for $A$ such that $w$ is a solvable instance of $B$ if and only if $f(w)$ is solvable for $A$. This is not a completely formal definition (see this primer on NP-completeness for a more serious treatment), but it’s good enough for this post.

In order to prove a problem $C$ is NP-hard, all you need to do is come up with a polynomial-time reduction from a known NP-hard problem $A$ to your new problem $C$. The reduction used to prove $A$ is NP-hard can then be composed with the reduction from $A$ to $C$ to get a new reduction proving $C$ is NP-hard. Finally, we call a problem NP-complete if it is both in NP and NP-hard. One natural question to ask is: if we don’t already know of any NP-hard problems, how can we prove anything is NP-hard? The answer is: it’s very hard, but it was done once and we don’t need to do it again (but if you really want to, see these notes). As a result, we have generated a huge list of problems that are NP-complete, and unless P = NP none of these problems have polynomial-time algorithms to solve them.

We need two examples of NP-hard problems for this paper: graph coloring, and boolean satisfiability. Since we assume the reader is familiar with the former, we recall the latter. Given a set of variables $x_i$, we can form a boolean formula over those variables of the form $\varphi = C_1 \wedge C_2 \wedge \dots \wedge C_m$ where each clause $C_i$ is a disjunction of three literals (negated or unnegated variables). For example, $C_i = (x_2 \vee \bar{x_5} \vee \bar{x_9})$ might be one clause. Here we interpret a formula by assigning each $x_i$ the value true or false, with the horizontal bars denoting negation, the wedges $\wedge$ meaning “and,” and the vees $\vee$ meaning “or.” We call this particular form conjunctive normal form. A formula $\varphi$ is called satisfiable if there is a choice of true and false assignments to the variables which makes the entire logical formula true. The problem of determining whether there is any satisfying assignment of such a formula, called 3-SAT, is NP-hard.

Going back to strictly stable equilibria and anti-coordination games, we will prove that the problem of determining whether a graph has a strictly stable coloring with $k$ colors is NP-hard. As a consequence, finding such an equilibrium is NP-hard. Since the problem is also in NP, it is in fact NP-complete.

Theorem: For all $k \geq 2$, the problem of determining whether a graph $G$ has a strictly stable coloring with $k$ colors is NP-complete.

Proof. The hardest case is $k = 2$, but $k \geq 3$ is a nice warmup to understand how a reduction works, so we start there. The $k \geq 3$ part works by reducing from graph coloring. That is, our reduction will take an input to the graph $k$-coloring problem (a graph $G$ whose $k$-colorability is in question) and we produce a graph $G'$ such that $G$ is $k$-colorable if and only if $G'$ has a strictly stable coloring with $k$ colors. Since graph coloring is hard for $k \geq 3$, this will prove our problem is NP-hard. More specifically, we will construct $G'$ in such a way that the strictly stable colorings also happen to be proper colorings! So finding a strictly stable coloring of $G'$ will immediately give us a proper coloring of $G$.

The construction of $G'$ is quite straightforward. We start with $G' = G$, and then for each edge $e = (u,v)$ we add a new subgraph which we call $H_e$ that looks like:

By $K_{k-2}$ we mean the complete graph on $k-2$ vertices (all possible edges are present), and the vertices $u,v$ are adjacent to all vertices of the $H_e = K_{k-2}$ part. That is, the graph $H_e \cup \left \{ u,v \right \}$ is the complete graph on $k$ vertices.
Now if the original graph $G$ was $k$-colorable, then we can use the same colors for the corresponding vertices in $G'$ and extend to a proper coloring (and hence a strictly stable equilibrium) of all of $G'$. Indeed, for any $H_e$ we can use a different color for each vertex of the $K_{k-2}$ part; as long as we avoid the two colors used for $u,v$, we'll have a proper coloring.

On the other hand, if $G'$ has a strictly stable equilibrium, then no edge $e$ which originally came from $G$ can be improperly colored. If some edge $e = (u,v)$ were improperly colored, then there would be some vertex in the corresponding $H_e$ which is not strictly stable. To see this, notice that among the $k$ vertices of $H_e \cup \left \{ u,v \right \}$ at most $k-1$ colors are used, so some color is missing, and any vertex can switch to the missing color without hurting its payoff. That is, the coloring might be stable, but it won't be strictly so. So strictly stable colorings of $G'$ are the same as proper colorings, and in particular the subgraph $G \subset G'$ is properly $k$-colored, completing the reduction.

Well, that was a bit of a cheap trick, but it shows the difficulty of working with strictly stable equilibria: preventing vertices from changing colors with no penalty is hard! What's worse is that it's still hard even if there are only two colors.

The reduction for two colors is a lot more complicated, so we'll give a sketch of how it works. The reduction is from 3-SAT. So given a boolean formula $\varphi = C_1 \wedge \dots \wedge C_m$, we produce a graph $G$ so that $\varphi$ has a satisfying assignment if and only if $G$ has a strictly stable coloring with two colors. The principal part of the reduction is the following gadget, which represents the logic inherent in a clause. We pulled the figure directly from our paper, since the caption gives a good explanation of how it works. To reiterate, the two "appendages" labeled by $x$ correspond to the literal $x$, and the choice of colors for these vertices corresponds to a truth assignment in $\varphi$. In particular, if the two vertices have the same color, then the literal is assigned true.

Of course, we have to ensure that any $x$'s showing up in other clause gadgets agree, and that any $\bar{x}$'s take the opposite truth value. That's what the following two gadgets do:

The gadget on the left enforces $x$'s to have the same truth assignment across gadgets (otherwise the center vertex won't be in strict equilibrium). The gadget on the right enforces two literals to be opposites.

And if we stitch the clauses together in a particular way (using the two gadgets above), then we guarantee consistency across all of the literals. All that's left to check is that the clauses do what they're supposed to. That is, we need it to be the case that if all of the literals in a clause gadget are "false," then we can't complete the coloring to a strictly stable one, and otherwise we can. Indeed, the following diagram gives all possible cases of this, up to symmetry.

The last figure deserves an explanation: if the three literals are all false, then we can pick any color we wish for $v_1$, and its two remaining neighbors must both have the same color (or else $v_1$ is not in strict equilibrium). Call this color $a$, and using the same reasoning call the colors of the corresponding neighbors of $v_2$ and $v_3$ $b$ and $c$, respectively. Now by the pigeonhole principle, either $a=b$, $b=c$, or $a=c$.
Suppose without loss of generality that $a=b$; then the edge labeled $(a,b)$ will have its $a$ endpoint not in strict equilibrium (it will have two neighbors of its own color and only one of the other color). This shows that no strict equilibrium can exist.

The reduction then works by taking a satisfying assignment for the variables, coloring the literals in $G$ appropriately, and extending to a strictly stable equilibrium of all of $G$. Conversely, if $G$ has a strictly stable coloring, then the literals must be consistent and each clause must be fully colorable, which the above diagram shows is the same as the clauses being satisfiable. So all of $\varphi$ is satisfiable and we're done (excluding a few additional details we describe in the paper). $\square$

## Directed Graphs and Cooperation

That was the main result of our paper, but we go on to describe some interesting generalizations. Since this post is getting quite long, we'll just give a quick overview of the interesting parts.

The rest of the paper is dedicated to directed graphs, where we define the payoff of a directed edge $(u,v)$ to go to the $u$ player if $u$ and $v$ anti-coordinate, while $v$ gets nothing. Here the computational picture is even worse than in the undirected case, but the structure is noticeably more interesting. For the former point: not only is it NP-hard to compute strictly stable colorings, it's even NP-hard to do so in the non-strict case! A large part of the reason for this is that stable colorings might not even exist: a directed 3-cycle has no stable equilibrium with two colors. We use this fact as a tool in our reductions to prove the following theorem.

Theorem: For all $k \geq 2$, determining whether a directed graph has a stable $k$-coloring is NP-complete.

See section 5 of our paper for a full proof.
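As a sanity check on the claim that the directed 3-cycle has no stable coloring with two colors, here is a small brute-force search. This is only a sketch; the explicit edge-list representation and function names are my own.

```python
from itertools import product

def directed_payoff(edges, coloring, v, color):
    # In the directed game, v earns a point for each out-edge (v, w) whose head
    # w has a different color; edges pointing into v pay v nothing.
    return sum(1 for (u, w) in edges if u == v and coloring[w] != color)

def find_stable_coloring(vertices, edges, colors):
    # Brute force over all colorings; return a stable one if it exists, else None.
    for assignment in product(colors, repeat=len(vertices)):
        coloring = dict(zip(vertices, assignment))
        if all(directed_payoff(edges, coloring, v, coloring[v])
               >= max(directed_payoff(edges, coloring, v, c) for c in colors)
               for v in vertices):
            return coloring
    return None

# The directed 3-cycle with two colors has no stable coloring.
print(find_stable_coloring([0, 1, 2], [(0, 1), (1, 2), (2, 0)], ['red', 'blue']))  # None
```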

To address the interesting structure that arises in the directed case, we observe that we can use a directed graph to simulate the desire of one vertex to actually cooperate with another. To see this for two colors, instead of adding an edge $(u,v)$ we add a proxy vertex $u'$ and directed edges $(u,u'), (u',v)$. To be in equilibrium, the proxy has no choice but to anti-coordinate with $v$, and this gives $u$ an incentive to cooperate with $v$ (that is, to take the same color) by anti-coordinating with its proxy. This can be extended to $k$ colors by using an appropriately (acyclically) directed copy of $K_{k-1}$.
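Here is what the two-color proxy construction might look like as a tiny sketch, with the digraph stored as a dict mapping each vertex to its set of out-neighbors (the function and proxy names are hypothetical, chosen for illustration):

```python
def add_cooperation_edge(digraph, u, v, proxy):
    # Simulate "u wants to cooperate with v" using only anti-coordination:
    # add a proxy vertex and the directed edges (u, proxy) and (proxy, v).
    # In equilibrium the proxy must anti-coordinate with v, so u profits
    # from taking v's color, i.e., from cooperating with v.
    digraph.setdefault(u, set())
    digraph.setdefault(v, set())
    digraph[proxy] = {v}      # the proxy's only out-edge points at v
    digraph[u].add(proxy)     # u's payoff here comes from differing with the proxy
    return digraph

# Example: make vertex 'a' want to agree with vertex 'b'.
g = add_cooperation_edge({}, 'a', 'b', ('proxy', 'a', 'b'))
```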

## Thoughts, and Future Work

While the results in this paper are nice, and I'm particularly proud that I came up with a novel NP-hardness reduction, it is unfortunate that they are all hardness results. Because of the ubiquity of NP-hard problems, it's far more impressive to have an algorithm which actually does something (approximates a good solution, does well under some relaxed assumption, does well in expectation with some randomness) than to prove something is NP-hard. To get an idea of the tone set by researchers, NP-hardness results are often called "negative" results (in the sense that they give a "no" answer to the question of whether there is an efficient algorithm), while finding an algorithm that does something is called a positive result. That being said, the technique of using two separate vertices to represent a single literal in a reduction proof is a nice trick, and I have since used it in another research paper, so I'm happy with my work.

On the positive side, though, there is some interesting work to be done. We could look at varying types of payoff structures, where instead of a binary payoff the payoff is a function of the colors involved (say, $|i - j|$). Another interesting direction is to consider distributed algorithms (where each player operates independently and in parallel) and see what kinds of approximations of the optimal payoff can be achieved in that setting. Yet another direction favored by a combinatorialist is to generalize the game to hypergraphs, which makes me wonder what type of payoff structure is appropriate (a payoff of 1 for a rainbow edge? for a non-monochromatic edge?). There is also some more work that can be done in inspecting the relationship between cooperation and anti-coordination in the directed version. Though I don't have any immediate open questions about it, it's a very interesting phenomenon.

In any event, I’m currently scheduled to give three talks about the results in this paper (one at the conference venue in Germany, and two at my department’s seminars). Here’s to starting off my research career!