Bandits and Stocks

So far in this series we’ve seen two nontrivial algorithms for bandit learning in two different settings. The first was the UCB1 algorithm, which operated under the assumption that the rewards for the trials were independent and stochastic. That is, each slot machine was essentially a biased coin flip, and the algorithm was trying to find the machine with the best odds. The second was the Exp3 algorithm, which held the belief that the payoffs were arbitrary. In particular this includes the possibility that an adversary is setting the payoffs against you, and so we measured the success of an algorithm in terms of how it fares against the best single action (just as we did with UCB1, but with Exp3 it’s a nontrivial decision).

Before we move on to other bandit settings it’s natural to try to experiment with the ones we have on real world data. On one hand it’s interesting to see how they fare outside academia. And more relevantly to the design of the future bandit algorithms we’ll see on this blog, we need to know what worldly problems actually provide in terms of inputs to our learning algorithm in each round.

But another interesting issue goes like this. In the real world we can’t ever really know whether the rewards of the actions are stochastic or adversarial. Many people believe that adversarial settings are far too pathological to be realistic, while others claim that the assumptions made by stochastic models are too strict. To weigh in on this dispute, we’ll dip into a bit of experimental science and see which of the two algorithms performs better on the problem of stock trading. The result is then evidence that stocks behave stochastically or adversarially. But we don’t want to stir up too many flames, so we can always back up behind the veil of applied mathematics (“this model is too simple anyway”).

Indeed the model we use in this post is rather simplistic. I don’t know as much as I should (or as my father would have me know) about stock markets. In fact, I’m partial to not trading stocks on principle. But I must admit that average-quality stock data is easy to come by, and the basic notions of market interactions lend themselves naturally to many machine learning problems. If the reader has any ideas about how to strengthen the model, I welcome suggestions in the comments (or a fork on github).

A fair warning to the reader, we do not solve the problem of trading stocks by any means. We use a model that’s almost entirely unrealistic, and the results aren’t even that good. I’m quite nervous to publish this at all, just because above all else it reveals my gaping ignorance on how stock markets work. But this author believes in revealing ignorance as learning, if for nothing else than that it provides extremely valuable insight into the nature of a problem and an appreciation of its complexity. So criticize away, dear readers.

As usual, all of the code and data we use in this post is available on this blog’s Github page. Our language of choice for this post is Python.


This little trader got lucky. Could it be because he’s got TEN MONITORS?!

Stocks for Dummies (me)

A quick primer on stocks, which is only as detailed as it needs to be for this post: a stock is essentially the sum of the value of all the assets of a company. A publicly traded company divides their stock into a number of “shares,” and owning a share represents partial ownership of the company. If you own 50% of the shares, you own 50% of the company. Companies sell shares or give them to employees as benefits (or options), and use the money gained through their sale for whatever they see fit. The increase in the price of a stock generally signifies the company is successful and growing; for example, stocks generally rise when a hotly anticipated product is announced.

The stock of a company is traded through one of a number of markets called stock exchanges. The buying and selling interactions are recorded and public, and there are many people in the world who monitor the interactions as they happen (via television, or programmatically) in the hopes of noticing opportunities before others and capitalizing on them. Each interaction induces a change on the price of a share of stock: whenever a share is bought at a certain price, that is the established and recorded price of a share (up to some fudging by brokers which is entirely mysterious to me). In any case, the prices go up and down, and they’re often bundled into “bars” which summarize the data over a certain period of time. The bars we use in this post are daily, and consist of four numbers: the open, the price at the beginning of the day, the high and low, which are self-explanatory, and the close, which is the price at the end of the day.

Bandits and Daily Stock Trading

Now let’s simplify things as much as possible. Our bandit learning algorithm will interact with the market as follows: each day it chooses whether or not to buy a single dollar’s worth of a stock, and at the end of the day it sells the stock and observes the profit. There are no brokers involved, and the price the algorithm sees is the price it gets. In bandit language: the stocks represent actions, and the amount of profit at the end of a day constitutes the payoff of an action in one round. Since small-scale stock price movement is generally very poorly understood, it makes some level of sense to assume the price movements within a given day are adversarial. On the other hand, since we understand them so poorly, we might be tempted to just call them “random” fluctuations, i.e. stochastic. So this is a nice little testbed for seeing which assumption yields a more successful algorithm.

Unlike the traditional image of stock trading where an individual owns shares of a stock over a long period of time, our program will operate on a daily time scale, and hence cannot experience the typical kinds of growth. Nevertheless, we can try to make some money over time, and if it’s a good strategy, we could scale up the single dollar to whatever we’re willing to risk. Specifically, the code we used to compute the payoff is

def payoff(stockTable, t, stock, amountToInvest=1.0):
   # stockTable[stock][t] holds the (open, close) prices for the given day
   openPrice, closePrice = stockTable[stock][t]

   # buy amountToInvest dollars' worth at the open, sell it all at the close
   sharesBought = amountToInvest / openPrice
   amountAfterSale = sharesBought * closePrice

   return amountAfterSale - amountToInvest
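
For concreteness, here is a minimal sketch of how one might call this function. The shape of stockTable (a dictionary mapping a ticker to a list of (open, close) pairs, one per trading day) and the prices are made up for illustration; they are not necessarily the exact format used in the repository.

stockTable = {
   'amzn': [(10.00, 10.20), (10.20, 10.10), (10.10, 10.50)],
   'msft': [(25.00, 24.80), (24.80, 25.10), (25.10, 25.05)],
}

# invest one dollar in amzn on day 0: buy at 10.00, sell at 10.20
print(payoff(stockTable, 0, 'amzn'))   # about 0.02, a 2% gain on the day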

The remainder of the code is interfacing with the Exp3 and UCB1 functions we gave in previous posts, and data shuffling. We got our data from Google Finance, and we provide it, along with all of the code used in the making of this post, on this blog’s Github page. Before we run our experiments, let’s give a few reasons why this model is unrealistic.

  1. We assume we can buy/sell fractional shares of a stock, which to my knowledge is not possible. Though this experiment could be redone where you buy a single share of a stock, or with mutual funds/currency exchange/whatever replacing stocks, we didn’t do it this way.
  2. Brokerage fees can drastically change the success of an algorithm which trades frequently.
  3. Open and close prices are not typical prices. People will often make decisions based on the time of day, but then again we might expect this to be just another reason that Exp3 would perform better than UCB1.
  4. We’re not actually trading in the stock market, and so we’re ignoring the effects of our own algorithm on the prices in the market.
  5. It’s impossible to guarantee you get to use the opening price and closing price in your transactions.
  6. UCB1 and Exp3 don’t use all of the information available. Indeed, they assume that they would not be able to see the outcome of an action they did not take, but with stocks you can get a good estimate of how much money you would have made had you chosen a different stock.
  7. Each trial in a bandit learning problem is identical from the learner’s perspective, but one often keeps a stock around while making other decisions.

We’ll come back to #6 after seeing the raw experiments for an unaltered UCB1 and Exp3, because there is a natural extension of the algorithm to handle additional information. I’m sure there are other glaring issues with the experimental setup, and the reader should feel free to rant about it in the comments. It won’t stop me from running the algorithm and seeing what happens just for fun.

Data Sets

We ran the experiment on two sets of stocks. The first set consisted of nine random stocks, taken from the random stocks twitter feed, with 5 years of past data. The stocks are:

lxrx, keg, cuba, tdi, brks, mux, cadx, belfb, htr

And you can view more information about these particular stocks via Google Finance. The second set was a non-random choice of nine Fortune 500 companies with 10 years of past data. The stocks were

amzn, cost, jpm, gs, wfc, msft, tgt, aapl, wmt

And again more information about these stocks is available via Google Finance. For the record, here were the cumulative payoffs of each of the nine Fortune 500 companies:


The cumulative rewards for the nine Fortune 500 companies over the last ten years of data.

Interestingly, the company which started off with the best prospects (Apple) turned out to have the worst cumulative reward by the end. The long-term winners in our little imaginary world happen to be Amazon, Costco, and Goldman Sachs. Perhaps this gives credence to the assumption that payoffs are adversarial. A learner can easily get tricked into putting too much faith in one action early on.

And for the random stocks:


The cumulative payoff for the nine randomly chosen stocks for the last five years of data.

The random stocks clearly perform worse and more variably overall (although HTR surpasses most of the Fortune 500 companies, despite its otherwise relatively modest stock growth over the last five years). To my untrained eyes these movements look more like a stochastic model than an adversarial one.

Experiments

Here is a typical example of a run of Exp3 on the Fortune 500 data set (using \gamma = 0.33, recall \gamma measures the amount of uniform exploration performed):

(Expected payoff, variance) over 1000 trials is (1.122463919564572, 0.5518037498918705)
For a single run: 
Payoff was 1.12
Regret was 2.91
Best stock was amzn at 4.02
weights: '0.00, 0.00, 0.00, 0.46, 0.52, 0.00, 0.00, 0.00, 0.01'

And one for UCB1:

(Expected payoff, variance) over 1000 trials is (1.1529891576139333, 0.5012825847001482)
For a single run: 
Payoff was 1.73
Regret was 2.29
Best stock was amzn at 4.02
ucbs: '0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234'

The results are quite curious. Indeed, the expected payoff seems to be a whopping 110% return! The variance of these results is quite high, and so it’s not at all impossible that the algorithm could have a negative return. But just as often it would return around 200% profit. 

Before we go risking all our money on this strategy, let’s take a closer look at what’s happening in the algorithm. It appears that for UCB1 the upper confidence bounds assigned to each action are the same! In other words, even after ten years of trials, no single stock “shined” above the others in the eyes of UCB1. It may seem that Exp3 has a leg up on UCB1 in this respect, because it’s clear that it gives higher weights to some stocks over others. However, running the algorithm multiple times shows drastically different weight distributions, and if we average the resulting weights over a thousand rounds, we see that they all have roughly the same mean and variance (the mean being first in the pair):

weight stats for msft: (0.107, 0.025)
weight stats for jpm: (0.109, 0.027)
weight stats for tgt: (0.110, 0.029)
weight stats for gs: (0.112, 0.025)
weight stats for wmt: (0.110, 0.027)
weight stats for aapl: (0.111, 0.027)
weight stats for amzn: (0.120, 0.029)
weight stats for cost: (0.113, 0.026)
weight stats for wfc: (0.107, 0.023)

Indeed, the best stock, Amazon, had an average weight just barely larger (and more variable) than any of the other stocks. So this evidence points to the conclusion that neither Exp3 nor UCB1 has any clue which stock is better. Pairing this with the fact that both algorithms nevertheless perform well would suggest that a random choice of action at each step is equally likely to do well. Indeed, when we run with a “random bandit” that just chooses actions uniformly at random, we get the following results:

(Expected payoff, variance) over 1000 trials is (1.1094227056931132, 0.4403783017367529)
For a single run: 
Payoff was 3.13
Regret was 0.90
Best stock was amzn at 4.02

It’s not quite as good as either Exp3 or UCB1, but it’s close and less variable, which means a lot to an investor. In other words, it’s starting to look like Exp3 and UCB1 aren’t doing significantly better than random at all, and that a monkey would do well in this system (for these particular stocks).

Of course, Fortune 500 companies are pretty successful by definition, so let’s turn our attention to the random stocks:

For the random bandit learner:

(Expected payoff, variance) over 1000 trials is (-0.23952295977625776, 1.0787311145181104)
For a single run: 
Payoff was -2.01
Regret was 3.92
Best stock was htr at 1.91

For UCB1:

(Expected payoff, variance) over 1000 trials is (-0.3503593899029112, 1.1136234992964154)
For a single run: 
Payoff was 0.26
Regret was 1.65
Best stock was htr at 1.91
ucbs: '0.315, 0.315, 0.315, 0.316, 0.315, 0.315, 0.315, 0.315, 0.316'

And for Exp3:

(Expected payoff, variance) over 1000 trials is (-0.25827976810345593, 1.2946101887058519)
For a single run: 
Payoff was -0.34
Regret was 2.25
Best stock was htr at 1.91
weights: '0.08, 0.00, 0.14, 0.06, 0.48, 0.00, 0.00, 0.04, 0.19'

But again Exp3 has no idea which stocks are actually best, with the (average, variance) of its weights over 1000 trials being:

weight stats for lxrx: '0.11, 0.02'
weight stats for keg: '0.11, 0.02'
weight stats for htr: '0.12, 0.02'
weight stats for cadx: '0.10, 0.02'
weight stats for belfb: '0.11, 0.02'
weight stats for tdi: '0.11, 0.02'
weight stats for cuba: '0.11, 0.02'
weight stats for mux: '0.11, 0.02'
weight stats for brks: '0.11, 0.02'

The long and short of it is that the choice of Fortune 500 stocks was inherently so biased toward success that a monkey could have made money investing in them, while the average choice of stocks had, if anything, a bias toward loss. And unfortunately using an algorithm like UCB1 or Exp3 straight out of the box doesn’t produce anything better than a monkey.

Issues and Improvements

There are two glaring theoretical issues here that we haven’t yet addressed. One of these goes back to issue #6 in that list we gave at the beginning of the post: the bandit algorithms are assuming they have less information than they actually have! Indeed, at the end of a day of stock trading, you have a good idea what would have happened to you had you bought a different stock, and in our simplified world you can know exactly what your profit would have been. Recalling that UCB1 and Exp3 both maintained some numbers representing the strength of an action (Exp3 had a “weight” and UCB1 an upper confidence bound), the natural extension to both UCB1 and Exp3 is simply to modify the beliefs about all actions after any given round. This is a pretty simple improvement to make in our implementation, since it just changes a single weight update to a loop. For Exp3:

# full information: update the weight of every action, not just the one we played
for choice in range(numActions):
   rewardForUpdate = reward(choice, t)
   scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
   estimatedReward = 1.0 * scaledReward / probabilityDistribution[choice]
   weights[choice] *= math.exp(estimatedReward * gamma / numActions)
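
A similar loop works for UCB1: simply fold every action’s observed payoff into its empirical mean. Here is a minimal sketch, reusing the payoffSums and numPlays lists from our UCB1 implementation in the previous post; this particular loop is an illustration of the idea, not necessarily the exact code in the repository.

# full information for UCB1: every action's payoff counts toward its empirical mean
for choice in range(numActions):
   theReward = reward(choice, t)
   numPlays[choice] += 1
   payoffSums[choice] += theReward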

This code should be familiar from our previous posts on bandits. We then rerun the new algorithms on the same data sets, and the results are somewhat surprising. First, UCB1 on Fortune 500:

(Expected payoff, variance) over 1000 trials is (3.530670654982728, 0.007713190816014095)
For a single run: 
Payoff was 3.56
Regret was 0.47

This is clearly outperforming the random bandit learning algorithm, with an average return of 350%! In fact, it does almost as well as the best stock, and the variance is quite low. UCB1 also outperforms Exp3, which fares comparably to its pre-improved self. That is, it’s still not much better than random:

(Expected payoff, variance) over 1000 trials is (1.1424797906901956, 0.434335471375294)
For a single run: 
Payoff was 1.24
Regret was 2.79

And also for the random stocks, UCB1 with improvements outperforms Exp3 and UCB1 without improvements. UCB1:

(Expected payoff, variance) over 1000 trials is (0.680211923900068, 0.04226672915962647)
For a single run:
Payoff was 0.82
Regret was 1.09

And Exp3:

(Expected payoff, variance) over 1000 trials is (-0.2242152508929378, 1.1312843329929194)
For a single run: 
Payoff was -0.16
Regret was 2.07

We might wonder why this is the case, and there is a plausible explanation. See, Exp3 has a difficult life: it has to assume that at any time the adversary can completely change the game. And so Exp3 must remain vigilant, continuing to try options it knows to be terrible for fear that they may spontaneously do well. Exp3 is the grandfather who, after 75 years of not winning the lotto, continues to buy tickets every week. A better analogy might be a lioness who, even after being moved to the zoo, stays up all night to protect a cub from predators. This gives us quite a new perspective on Exp3: the world really has to be that messed up for Exp3 to be useful. As we saw, UCB1 is much more eager to jump on a winning bandwagon, and it paid off in both the good (Fortune 500) and bad (random stock) scenarios. All in all, this experiment would provide some minor evidence that the stock market (or just this cheesy version of it) is more stochastic than adversarial.

The second problem is that we’re treating these stocks as if they were isolated from the rest of the world. Indeed, along with each stock comes some kind of context in the form of information about that stock. Historical prices, corporate announcements, cyclic boom and bust, what the talking heads think, all of this may be relevant to the price fluctuations of a stock on any given day. While Exp3 and UCB1 are ill-equipped to handle such a rich landscape, researchers in bandit learning have recognized the importance of context in decision making. So much so, in fact, that an entire subfield of “Contextual Bandits” was born. John Langford, perhaps the world’s leading expert on bandit learning, wrote on his blog in 2007,

I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.

I tend to agree with him. Bandit problems almost always come with some inherent additional structure in the real world, and the best algorithms will always take advantage of that structure. A “context” associated with each round is perhaps the weakest kind of structure, so it’s a natural place to look for better algorithms.

So that’s what we’ll do in the future of this series. But before then we might decide to come up with another experiment to run Exp3 and UCB1 on. It would be nice to see an instance in which Exp3 seriously outperforms UCB1, but maybe the real world is just stochastic and there’s nothing we can do about it.

Until next time!

Optimism in the Face of Uncertainty: the UCB1 Algorithm


The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post is available for download on this blog’s Github page.

Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of K actions labeled by the integers \left \{ 1, 2, \dots, K \right \}. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.


Exploit away, you lucky ladies.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action i be drawn from a fixed distribution D_i (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But it turns out we can design good algorithms for this case, though “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm A and a set of actions \left \{1, 2, \dots, K \right \}, the cumulative regret of A in rounds 1, \dots, T is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of A for the first T rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers \left \{ 1, \dots, K \right \}. The reward of action i is a [0,1]-valued random variable X_i distributed according to an unknown distribution and possessing an unknown expected value \mu_i. The game progresses in rounds t = 1, 2, \dots so that in each round we have different random variables X_{i,t} for the reward of action i in round t (in particular, X_{i,t} and X_{i,s} are identically distributed). The X_{i,t} are independent as both t and i vary, although when i varies the distribution changes.

So if we were to play action 2 over and over for T rounds, then the total payoff would be the random variable G_2(T) = \sum_{t=1}^T X_{2,t}. But by independence across rounds and the linearity of expectation, the expected payoff is just \mu_2 T. So we can describe the best action as the action with the highest expected payoff. Define

\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i

We call the action which achieves the maximum i^*.

A policy is a randomized algorithm A which picks an action in each round based on the history of chosen actions and observed rewards so far. Define I_t to be the action played by A in round t and P_i(n) to be the number of times we’ve played action i in rounds 1 \leq t \leq n. These are both random variables. Then the cumulative payoff for the algorithm A over the first T rounds, denoted G_A(T), is just

\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}

and its expected value is simply

\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T)).

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy A after the first T steps, denoted R_A(T) can be written as

\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T)).
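
As a tiny sanity check of these definitions, suppose K = 2 with \mu_1 = 0.5 and \mu_2 = 0.7, and suppose our policy happens to play each action half the time in expectation over T = 100 rounds. Then

\displaystyle \mathbb{E}(G_A(T)) = 0.5 \cdot 50 + 0.7 \cdot 50 = 60 \qquad \textup{and} \qquad \mathbb{E}(R_A(T)) = \mu^* T - \mathbb{E}(G_A(T)) = 70 - 60 = 10,

so the policy forfeits ten units of expected reward relative to always playing action 2.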

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than \Omega(\sqrt{KT}). It is also known that there are algorithms that guarantee no worse than O(\sqrt{KT}) expected regret. The algorithm we’ll see in the next section, however, only guarantees O(\sqrt{KT \log T}). We present it on this blog because of its simplicity and ubiquity in the field.

The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action \mu_i is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose Y_1, \dots, Y_n are independent random variables whose values lie in [0,1] and whose expected values are \mu_i. Call Y = \frac{1}{n}\sum_{i}Y_i and \mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of Y deviates from its mean. Specifically,

\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}

For us, the Y_i will be the payoff variables for a single action j in the rounds for which we choose action j. Then the variable Y is just the empirical average payoff for action j over all the times we’ve tried it. Moreover, a is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for a to find an upper bound big enough to be confident that we’re within a of the true mean.

Indeed, if we call n_j the number of times we played action j thus far, then n = n_j in the equation above, and using a = a(j,T) = \sqrt{2 \log(T) / n_j} we get that \textup{P}(Y > \mu + a) \leq T^{-4}, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action j, its upper bound a grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound a(j, T) is getting bigger for actions we haven’t played; any round in which we play an action j, it must be that a(j, T+1) = a(j,T), although the empirical mean will likely change.
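
For the record, the T^{-4} above comes from plugging a(j,T) into the Chernoff-Hoeffding bound with n = n_j:

\displaystyle \textup{P}(Y > \mu + a(j,T)) \leq e^{-2 n_j a(j,T)^2} = e^{-2 n_j \cdot 2 \log(T) / n_j} = e^{-4 \log T} = T^{-4}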

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the K actions once, giving initial values for empirical mean payoffs \overline{x}_i of each action i.
For each round t = K, K+1, \dots:

Let n_j represent the number of times action j was played so far.
Play the action j maximizing \overline{x}_j + \sqrt{2 \log t / n_j}.
Observe the reward X_{j,t} and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means \overline{x}_j change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with K actions, each of whose reward distribution X_{i,t} has values in [0,1]. Then its expected cumulative regret after T rounds is at most O(\sqrt{KT \log T}).

Actually, we’ll prove a more specific theorem. Let \Delta_i be the difference \mu^* - \mu_i, where \mu^* is the expected payoff of the best action, and let \Delta be the minimal nonzero \Delta_i. That is, \Delta_i represents how suboptimal an action is and \Delta is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret \mathbb{E}(R_{\textup{UCB1}}(T)) is at most

\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if \Delta_i is small we will require more tries to know that action i is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the 1 + \pi^2 / 3 part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how to derive the O(\sqrt{KT \log T}) bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function R(\Delta_1, \dots, \Delta_K) (excluding of course the optimal action, whose \Delta_i is zero), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem whose parameters screw our bound as badly as possible. The gradient of the regret function is given by

\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}

and it’s zero if and only if for each i, \Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T}). However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the \Delta_i = O(\sqrt{\log T}) (which are all the same) gives a total bound of O(K \sqrt{\log T}). If we look at the only possible endpoint (the \Delta_i = 1), then we get a local maximum of O(K \log T). But this isn’t the O(\sqrt{KT \log T}) we promised. What gives? Well, this upper bound grows arbitrarily large as the \Delta_i go to zero. But at the same time, if all the \Delta_i are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the \Delta_i = \Delta are the same, then another trivial regret bound is \Delta T (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at \Delta = O(\sqrt{(K \log T) / T}). Hence the regret bound at the switch is \Delta T = O(\sqrt{KT \log T}), as desired.
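
To spell out the switch point: treating the UCB1 bound as roughly 8K \log T / \Delta when all the \Delta_i = \Delta (the second term is lower order), the two bounds meet when

\displaystyle \Delta T = \frac{8 K \log T}{\Delta} \iff \Delta^2 = \frac{8 K \log T}{T} \iff \Delta = O \left ( \sqrt{\frac{K \log T}{T}} \right ),

and at that point \Delta T = O(\sqrt{KT \log T}).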

Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on \mathbb{E}(P_i(T)), the expected number of times UCB1 chooses action i up to round T. Using the \Delta notation, the regret is then just \sum_i \Delta_i \mathbb{E}(P_i(T)), and bounding the \mathbb{E}(P_i(T)) will bound the regret.

Recall the notation for our upper bound a(j, T) = \sqrt{2 \log T / P_j(T)} and let’s loosen it a bit to a(y, T) = \sqrt{2 \log T / y} so that we’re allowed to “pretend” an action has been played y times. Recall further that the random variable I_t has as its value the index of the machine chosen. We denote by \chi(E) the indicator random variable for the event E. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., \overline{x}^* is the empirical mean of the optimal action).

Indeed for any action i, the only way we know how to write down P_i(T) is as

\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where we just count the number of rounds in which we pick action i. Now we’re just going to pull some number m-1 of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than m times overall, this requires an inequality.

P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)

These indicator functions should be read as sentences: we’re just saying that we’re picking action i in round t and we’ve already played i at least m times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action i in round t means that the upper bound for action i exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and i might coincide with the best action, but that’s fine). In notation this event is

\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(t), t-1)

Denote the upper bound \overline{x}_i + a(i,t) for action i in round t by U_i(t). Since this event must occur every time we pick action i (though not necessarily vice versa), we have

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)

We’ll do this process again but with a slightly more complicated event. If the upper bound of action i exceeds that of the optimal machine, it is also the case that the maximum upper bound for action i we’ve seen after the first m trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round t we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine i (except that it’s more than m). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by \overline{x}_{i,s} the random variable for the empirical mean after playing action i a total of s times, and \overline{x}^*_s the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )

Indeed, at each t for which the max is greater than the min, there will be at least one pair s,s' for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs s, s' for which it happens. That is, we can expand the event above into the double sum which is at least as large:

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )

We can make one other odd inequality by increasing the sum to go from t=1 to \infty. This will become clear later, but it means we can replace t-1 with t and thus have

\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t). Then what can we say? Well, consider the following three events:

(1) \displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)
(2) \displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)
(3) \displaystyle \mu^* < \mu_i + 2a(s, t)

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability at most t^{-4}. Likewise, (2) is the event that the empirical mean payoff of action i is larger than the upper confidence bound, which also occurs with probability at most t^{-4}. We will see momentarily that (3) is impossible for a well-chosen m (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}

and

\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}

But putting these two inequalities together gives us precisely that (3) is true:

\mu^* < \mu_i + 2a(s,t)

This proves the claim.

By the union bound, the probability that at least one of these events happens is 2t^{-4} plus whatever the probability of (3) being true is. But as we said, we’ll pick m to make (3) always false. Indeed m depends on which action i is being played, and if s \geq m > 8 \log T / \Delta_i^2 then 2a(s,t) \leq \Delta_i, and by the definition of \Delta_i we have

\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

\square

Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

import math

def upperBound(step, numPlays):
   return math.sqrt(2 * math.log(step + 1) / numPlays)

And the full algorithm is quite short as well. We define a function ucb1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ucb1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
   payoffSums = [0] * numActions
   numPlays = [1] * numActions
   ucbs = [0] * numActions

   # initialize empirical sums
   for t in range(numActions):
      payoffSums[t] = reward(t,t)
      yield t, payoffSums[t], ucbs

   t = numActions

   while True:
      ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
      action = max(range(numActions), key=lambda i: ucbs[i])
      theReward = reward(action, t)
      numPlays[action] += 1
      payoffSums[action] += theReward

      yield action, theReward, ucbs
      t = t + 1

The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.
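
For example, one might drive the generator on synthetic rewards like this. The reward function here is a hypothetical stand-in (a uniform draw plus a small per-action bias, clipped to [0,1]), meant only to show the generator interface; it is not necessarily the exact synthetic data described below.

import random

numActions = 10
biases = [1.0 / k for k in range(5, 5 + numActions)]

# a made-up reward function: uniform on [0,1], shifted up by a per-action bias
def reward(choice, t):
   return min(1.0, random.random() + biases[choice])

cumulativeReward = 0.0
bandit = ucb1(numActions, reward)

for t in range(10000):
   action, theReward, ucbs = next(bandit)
   cumulativeReward += theReward

print(cumulativeReward)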

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from [0,1], biased by 1/k for some 5 \leq k \leq 15. The regret and theoretical regret bound are given in the graph below.


The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image is available on this blog’s Github page.

Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

Neural Networks and the Backpropagation Algorithm

Neurons, as an Extension of the Perceptron Model

In a previous post in this series we investigated the Perceptron model for determining whether some data was linearly separable. That is, given a data set where the points are labelled in one of two classes, we were interested in finding a hyperplane that separates the classes. In the case of points in the plane, this just reduced to finding lines which separated the points like this:

A hyperplane (the slanted line) separating the blue data points (class -1) from the red data points (class +1)

As we saw last time, the Perceptron model is particularly bad at learning data. More accurately, the Perceptron model is very good at learning linearly separable data, but most kinds of data just happen to be more complicated. Even with those disappointing results, there are two interesting generalizations of the Perceptron model that have exploded into huge fields of research. The two generalizations can roughly be described as

  • Use a number of Perceptron models in some sort of conjunction.
  • Use the Perceptron model on some non-linear transformation of the data.

The point of both of these is to introduce some sort of non-linearity into the decision boundary. The first generalization leads to the neural network, and the second leads to the support vector machine. Obviously this post will focus entirely on the first idea, but we plan to cover support vector machines in the near future. Recall further that the separating hyperplane was itself defined by a single vector (a normal vector to the plane) \mathbf{w} = (w_1, \dots, w_n). To “decide” what class the new point \mathbf{x} is in, we check the sign of an inner product with an added constant shifting term:

\displaystyle f(\mathbf{x}) = \textup{sign}(\left \langle \mathbf{w}, \mathbf{x} \right \rangle + b) = \textup{sign} \left ( b + \sum_{i=1}^n w_i x_i \right )
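
In code this decision rule is a one-liner; here is a minimal sketch (the function name is ours, for illustration only):

# the sign of the inner product plus the bias decides the class (+1 or -1)
def classify(w, b, x):
   return 1 if b + sum(w_i * x_i for w_i, x_i in zip(w, x)) > 0 else -1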

The class of a point is just the value of this function, and as we saw with the Perceptron this corresponds geometrically to which side of the hyperplane the point lies on. Now we can design a “neuron” based on this same formula. We consider a point \mathbf{x} = (x_1, \dots, x_n) to be an input to the neuron, and the output will be the sign of the above sum for some coefficients w_i. In picture form it would look like this:

It is quite useful to literally think of this picture as a directed graph (see this blog’s gentle introduction to graph theory if you don’t know what a graph is). The edges corresponding to the coordinates of the input vector \mathbf{x} have weights w_i, and the output edge corresponds to the sign of the linear combination. If we further enforce the inputs to be binary (that is, \mathbf{x} \in \left \{ 0, 1 \right \}^n), then we get a very nice biological interpretation of the system. If we think of the unit as a neuron, then the input edges correspond to nerve impulses, which can either be on or off (identically to an electrical circuit: there is high current or low current). The weights w_i correspond to the strength of the neuronal connection. The neuron transmits or does not transmit a pulse as output depending on whether the inputs are strong enough.

We’re not quite done, though, because in this interpretation the output of the neuron will either fire or not fire. However, neurons in real life are somewhat more complicated. Specifically, neurons do not fire signals according to a discontinuous function. In addition, we want to use the usual tools from classical calculus to analyze our neuron, but we cannot do that unless the activation function is differentiable, and a prerequisite for that is to be continuous. In plain words, we need to allow our neurons to be able to “partially fire.” We need a small range at which the neuron ramps up quickly from not firing to firing, so that the activation function as a whole is differentiable.

This raises the obvious question: what function should we pick? It turns out that there are a number of possible functions we could use, ranging from polynomial to exponential in nature. But before we pick one in particular, let’s outline the qualities we want such a function to have.

Definition: A function \sigma: \mathbb{R} \to [0,1] is an activation function if it satisfies the following properties:

  • It has a first derivative \sigma'.
  • \sigma is non-decreasing, that is \sigma'(x) \geq 0 for all x
  • \sigma has horizontal asymptotes at both 0 and 1 (and as a consequence, \lim_{x \to \infty} \sigma(x) = 1, and \lim_{x \to -\infty} \sigma(x) = 0).
  • \sigma and \sigma' are both computable functions.

With appropriate shifting and normalizing, there are a few reasonable (and time-tested) activation functions. The two main ones are the hyperbolic tangent \tanh(x) and the sigmoid curve 1/ (1+e^{-t}). They both look (more or less) like this:

A sigmoid function (source: Wikipedia)

And it is easy to see visually that this is what we want.

As a side note, the sigmoid function is actually not used very often in practice for a good reason: it gets too “flat” when the function value approaches 0 or 1. The reason this is bad is because how “flat” the function is (the gradient) will guide the learning process. If the function is very flat, then the network won’t learn as quickly. This will manifest itself in our test later in this post, when we see that a neural network struggles to learn the sine function. It struggles specifically at those values of the function that are close to 1 or -1. Though I don’t want to go into too much detail about this, one alternative that has found a lot of success in deep learning is the “rectified linear unit.” This also breaks the assumption of having a derivative everywhere, so one needs a bit more work to deal with that.

Withholding any discussion of why one would pick one specific activation over another, there is one more small detail. In the Perceptron model we allowed a “bias” b which translated the separating hyperplane so that it need not pass through the origin, hence allowing the set of all pairs (\mathbf{w}, b) to represent every possible hyperplane. Perhaps the simplest way to incorporate the bias into this model is to add another input x_0 which is fixed to 1. Then we add a weight w_0, and it is easy to see that the constant b can just be replaced with the weight w_0. In other words, the inner product b \cdot 1 + \left \langle \mathbf{w}, \mathbf{x} \right \rangle is the same as the inner product of two new vectors \mathbf{w}', \mathbf{x}' where we set w'_0 = b and x'_0 = 1 and w'_i = w_i, x'_i = x_i for all other i.

The updated picture is now:

Now the specification of a single neuron is complete:

Definition: A neuron N is a pair (W, \sigma), where W is a list of weights (w_0, w_1, \dots, w_k), and \sigma is an activation function. The impulse function of a neuron N, which we will denote f_N: \left \{ 0,1 \right \}^{k+1} \to [0,1], is defined as

\displaystyle f_N(\mathbf{x}) = \sigma(\left \langle \mathbf{w}, \mathbf{x} \right \rangle) = \sigma \left (\sum_{i=0}^k w_i x_i \right )

We call w_0 the bias weight, and by convention the first input coordinate x_0 is fixed to 1 for all inputs \mathbf{x}.

(Since we always fix the first input to 1, f_N is technically a function \left \{ 0,1 \right \}^k \to [0,1], but the reader will forgive us for blurring these details.)
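
To make the definition concrete, here is a minimal sketch of a single neuron’s impulse function with the sigmoid activation. The representation (a plain list of weights, with the bias weight in position 0) is our own choice for illustration, not the implementation used later in this series.

import math

def sigmoid(z):
   return 1.0 / (1.0 + math.exp(-z))

# weights[0] is the bias weight; x[0] is assumed to be the constant 1
def impulse(weights, x):
   return sigmoid(sum(w_i * x_i for w_i, x_i in zip(weights, x)))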

Combining Neurons into a Network

The question of how to “train” a single neuron is just a reformulation of the Perceptron problem. If we have a data set \mathbf{x}_i with class labels y_i = \pm 1, we want to update the weights of a neuron so that the outputs agree with their class labels; that is, f_N(\mathbf{x}_i) = y_i for all i. And we saw in the Perceptron how to do this: it’s fast and efficient, given that the data are linearly separable. And in fact training a neuron in this model (accounting for the new activation function) will give us identical decision functions as in the Perceptron model. All we have done so far is change our perspective from geometry to biology. But as we mentioned originally, we want to form a mathematical army of neurons, all working together to form a more powerful decision function.

The question is what form should this army take? Since we already thought of a single neuron as a graph, let’s generalize this. Instead of having a bunch of “input” vertices, a single “output” vertex, and one neuron doing the computation, we now have the same set of input vertices, the same output vertex, but now a number of intermediate neurons connected arbitrarily to each other. That is, the edges that are outputs of some neurons are connected to the inputs of other neurons, and the very last neuron’s output is the final output. We call such a construction a neural network.

For example, the following graph gives a neural network with 5 neurons

To compute the output of any neuron N, we need to compute the values of the impulse functions for each neuron whose output feeds into N. This in turn requires computing the values of the impulse functions for each of the inputs to those neurons, and so on. If we imagine electric current flowing through such a structure, we can view it as a kind of network flow problem, which is where the name “neural networks” comes from. This structure is also called a dependency graph, and (in the parlance of graph theory) a directed acyclic graph. Though nothing technical about these structures will show up in this particular post, we plan in the future to provide primers on their basic theories.

We remark that we view the above picture as a directed graph with the directed edges going upwards. And as in the picture, the incidence structure (which pairs of neurons are connected or not connected) of the graph is totally arbitrary, as long as it has no cycles. Note that this is in contrast to the classical idea of a neural network as “layered” with one or more intermediate layers, such that all neurons in neighboring layers are completely connected to one another.  Hence we will take a slightly more general approach in this post.

Now the question of training a network of interconnected neurons is significantly more complicated than that of training a single neuron. The algorithm to do so is called backpropagation, because we will check to see if the final output is an error, and if it is we will propagate the error backward through the network, updating weights as we go. But before we get there, let’s explore some motivation for the algorithm.

The Backpropagation Algorithm – Single Neuron

Let us return to the case of a single neuron N with weights \mathbf{w} = (w_0, \dots , w_k) and an input \mathbf{x} = (x_0, \dots, x_k). And momentarily, let us remove the activation function \sigma from the picture (so that f_N just computes the summation part). In this simplified world it is easy to take a given set of training inputs \mathbf{x}_j with labels y_j \in \left \{ 0,1 \right \} and compute the error of our neuron N on the entire training set. A standard mathematical way to compute error is by sum of the squared deviations of our neuron’s output from the actual label.

\displaystyle E(\mathbf{w}) = \sum_j (y_j - f_N(\mathbf{x}_j))^2

The important part is that E is a function just of the weights w_i. In other words, the set of weights completely specifies the behavior of a single neuron.

Enter calculus. Any time we have a multivariate function (here, each of the weights w_i is a variable), then we can speak of its minima and maxima. In our case we strive to find a global minimum of the error function E, for then we would have learned our target classification function as well as possible. Indeed, to improve upon our current set of weights, we can use the standard gradient-descent algorithm. We have discussed versions of the gradient-descent algorithm on this blog before, as in our posts on decrypting substitution ciphers with n-grams and finding optimal stackings in Texas Hold ‘Em. We didn’t work with calculus there because the spaces involved were all discrete. But here we will eventually extend this error function to allow the inputs x_i to be real-valued instead of binary, and so we need the full power of calculus. Luckily for the uninformed reader, the concept of gradient descent is the same in both cases. Since E gives us a real number for each possible neuron (each choice of weights), we can take our current neuron and make it better by changing the weights slightly, and ensuring our change gives us a smaller value under E. If we cannot ensure this, then we have reached a minimum.

Here are the details. For convenience we add a factor of 1/2 to E and drop the subscript N from f_N. Since minimizing E is the same as minimizing \frac{1}{2}E, this changes nothing about the minima of the function. That is, we will henceforth work with

\displaystyle E(\mathbf{w}) = \frac{1}{2} \sum_j (y_j - f(\mathbf{x}_j))^2

Then we compute the gradient \nabla E of E. For fixed values of the variables w_i (our current set of weights) this is a vector in \mathbb{R}^{k+1}, and as we know from calculus it points in the direction of steepest ascent of the function E. That is, if we subtract some sufficiently small multiple of this vector from our current weight vector, we will be closer to a minimum of the error function than we were before. If we were to add instead, we’d move toward a maximum.

Note that E is never negative, and so it has a global minimum value at or near 0 (if it is possible for the neuron to represent the target function perfectly, the minimum is exactly zero). Our update rule is then

\displaystyle \mathbf{w}_\textup{current} = \mathbf{w}_\textup{current} - \eta \nabla E(\mathbf{w}_\textup{current})

where \eta is some fixed parameter between 0 and 1 that represents the “learning rate.” We will not mention \eta too much except to say that as long as it is sufficiently small and we allow ourselves enough time to learn, we are guaranteed to get a good approximation of some local minimum (though it might not be a global one).

With this update rule it suffices to compute \nabla E explicitly.

\displaystyle \nabla E = \left ( \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_k} \right )

In each partial \partial E / \partial w_i we consider every variable besides w_i to be constant, and combining this with the chain rule gives

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) \frac{\partial f}{\partial w_i}

Since in the summation formula for f the variable w_i only shows up in the product w_i x_{j,i} (where x_{j,i} is the i-th entry of the vector \mathbf{x}_j), the last factor \partial f / \partial w_i is simply x_{j,i}. That is, we have

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) x_{j,i}

Noting that the negatives cancel, this makes our update rule just

\displaystyle w_i = w_i + \eta \sum_j (y_j - f(\mathbf{x}_j))x_{j,i}

There is an alternative form of an update rule that allows one to update the weights after each individual input is tested (as opposed to checking the outputs of the entire training set). This is called the stochastic update rule, and is given identically as above but without summing over all j:

\mathbf{w} = \mathbf{w} + \eta (y_j - f(\mathbf{x}_j)) \mathbf{x}_j

For our purposes in this post, the stochastic and non-stochastic update rules will give identical results.
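To make the two rules concrete, here is a minimal standalone sketch of both updates for the activation-free neuron (the function names batchUpdate and stochasticUpdate are ours, and this is not the implementation we build later in the post):

def dotProduct(weights, inputVector):
   return sum(w * x for (w, x) in zip(weights, inputVector))

def batchUpdate(weights, trainingSet, learningRate):
   ''' One pass of the non-stochastic rule: sum the errors over the whole
       training set before adjusting any weight. '''
   newWeights = []
   for i in range(len(weights)):
      errorSum = sum((label - dotProduct(weights, example)) * example[i]
                     for (example, label) in trainingSet)
      newWeights.append(weights[i] + learningRate * errorSum)
   return newWeights

def stochasticUpdate(weights, example, label, learningRate):
   ''' The stochastic rule: adjust the weights after a single example. '''
   error = label - dotProduct(weights, example)
   return [w + learningRate * error * x for (w, x) in zip(weights, example)]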

Adding in the activation function \sigma is not hard, but we will choose our \sigma so that it has particularly nice computability properties. Specifically, we will pick the sigmoid function \sigma(x) = 1/(1+e^{-x}), because it satisfies the identity

\sigma'(x) = \sigma(x)(1-\sigma(x))
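This identity follows from a direct computation:

\displaystyle \sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1 - \sigma(x))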

So instead of \partial f / \partial w_i in the formula above we need \partial (\sigma \circ f) / \partial w_i, and this requires the chain rule once again:

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma'(f(\mathbf{x}_j))x_{j,i}

And using the identity for \sigma' gives us

 \displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma(f(\mathbf{x}_j))(1-\sigma(f(\mathbf{x}_j)))x_{j,i}

This gives an update rule similar to the one before. If we denote by o_j the output value \sigma(f(\mathbf{x}_j)), then the stochastic version of this update rule is

\mathbf{w} = \mathbf{w} + \eta o_j (1-o_j)(y_j - o_j)\mathbf{x}_j
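As a quick sanity check of this rule, here is a sketch of a single sigmoid neuron learning the two-input OR function with the stochastic update. The training loop, its parameters, and the convention of fixing the first input coordinate to 1 to play the role of a bias are our own choices:

import math
import random

def sigmoid(x):
   return 1.0 / (1.0 + math.exp(-x))

# training examples for OR; the leading 1 in each input plays the role of a bias
examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
weights = [random.uniform(-1, 1) for _ in range(3)]
learningRate = 0.5

for _ in range(10000):
   example, label = random.choice(examples)
   output = sigmoid(sum(w * x for (w, x) in zip(weights, example)))
   weights = [w + learningRate * output * (1 - output) * (label - output) * x
              for (w, x) in zip(weights, example)]

# the output on (0, 0) should now be noticeably smaller than the other three
for example, label in examples:
   print(sigmoid(sum(w * x for (w, x) in zip(weights, example))))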

Now that we have motivated an update rule for a single neuron, let’s see how to apply this to an entire network of neurons.

The Backpropagation Algorithm – Entire Network

There is a glaring problem in training a neural network using the update rule above. We don’t know what the “expected” output of any of the internal edges in the graph is. In order to compute the error we need to know what the correct output should be, but we don’t immediately have this information.

We don’t know the error value for a non-output node in the network.

In the picture above, we know the expected value of the edge leaving the node N_2, but not that of N_1. In order to compute the error for N_1, we need to derive some kind of error value for nodes in the middle of the network.

It seems reasonable that the error for N_1 should depend on the errors of the nodes for which N_1 provides an input. That is, in the following picture the error should come from all of the neurons M_1, \dots M_n.

The error for N_1 as a weighted sum of the errors of M_1, \dots, M_n.

In particular, one possible error value for a particular input \mathbf{x} to the entire network would be a weighted sum over the errors of the M_i, where the weights are the weights of the edges from N_1 to the M_i. In other words, if N_1 has little effect on the output of one particular M_i, it shouldn’t assume too much responsibility for that error. That is, if w_i denotes the weight of the edge from N_1 to M_i, the error for N_1 is

\displaystyle \sum_i w_i E_{M_i}

where E_{M_i} is the error computed for the node M_i.

It turns out that there is a nice theoretical justification for using this quantity as well. In particular, if we think of the entire network as a single function, we can imagine the error E(\mathbf{w}) as being a very convoluted function of all the weights in the network. But no matter how confusing the function may be to write down, we know that it only involves addition, multiplication, and composition of differentiable functions. So if we want to know how to update the error with respect to a weight that is hidden very far down in the network, in theory it just requires enough applications of the chain rule to find it.

To see this, let’s say we have nodes N_{1,k} connected forward to nodes N_{2,j} connected forward to nodes N_{3,i}, such that the weights w_{k,j} represent weights going from N_{1,k} \to N_{2,j}, and the weights w_{j,i} go from N_{2,j} \to N_{3,i}.

If we want to know the partial derivative of E with respect to the deeply nested weight w_{k,j}, then we can just compute it:

\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\frac{\partial o_i}{\partial w_{k,j}}

where o_i = f_{N_{3,i}}(\dots) = \sigma(f(\dots)) represents the value of the impulse function at each of the output neurons, in terms of a bunch of crazy summations we omit for clarity.

But after applying the chain rule, the inner summation for N_{3,i}, namely f = \sum_j w_{j,i} z_j (where z_j is the output of N_{2,j}), only depends on w_{k,j} via the coefficient w_{j,i}. That is, the weight w_{k,j} only affects the node N_{3,i} through the output of N_{2,j} passing along the edge labeled w_{j,i}. So we get a sum

\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\sigma'(f(\dots)) w_{j,i}

That is, it’s simply a sum of the final errors y_i - o_i weighted by the appropriate edge weights. The stuff inside the f(\dots) is simply the output of that node, which is again a sum over its inputs. In stochastic form, this makes our update rule (for the weights of the edges into N_{2,j}) just

\displaystyle \mathbf{w} = \mathbf{w} + \eta o_j(1-o_j) \left ( \sum_i w_{j,i} (y_i - o_i) \right ) \mathbf{z}

where by \mathbf{z} we denote the vector of inputs to the neuron in question (these may be the original input \mathbf{x} if this neuron is the first in the network and all of the inputs are connected to it, or it may be the outputs of other neurons feeding into it).

The argument we gave only really holds for a network in which every path from an input to the output crosses just two edges. But the reader who has mastered the art of juggling notation may easily generalize this via induction to prove it in general. This really is a sensible weight update for any neuron in the network.

And now that we have established our update rule, the backpropagation algorithm for training a neural network becomes relatively straightforward. Start by initializing the weights in the network at random. Evaluate an input \mathbf{x} by feeding it forward through the network and recording at each internal node the output value o_j, and call the final output o. Then compute the error for that output value, propagate the error back to each of the nodes feeding into the output node, and update the weights for the output node using our update rule. Repeat this error propagation followed by a weight update for each of the nodes feeding into the output node in the same way, compute the updates for the nodes feeding into those nodes, and so on until the weights of the entire network are updated. Then repeat with a new input \mathbf{x}.

One minor issue is when to stop. Specifically, it won’t be the case that we only need to evaluate each input \mathbf{x} exactly once. Depending on how the learning parameter \eta is set, we may need to evaluate the entire training set many times! Indeed, we should only stop when the gradient for all of our examples is small, or we have run it for long enough to exhaust our patience. For simplicity we will ignore checking for a small gradient, and we will simply fix a number of iterations. We leave the gradient check as an exercise to the reader.

Then the result is a trained network, which we can further use to evaluate the labels for unknown inputs.

Python Implementation

We now turn to implementing a neural network. As usual, all of the source code used in this post (and then some) is available on this blog’s Github page.

The first thing we need to implement all of this is a data structure for a network. That is, we need to represent nodes and edges connecting nodes. Moreover each edge needs to have an associated value, and each node needs to store multiple values (the error that has been propagated back, and the output produced at that node). So we will represent this via two classes:

import random

class Node:
   def __init__(self):
      self.lastOutput = None
      self.lastInput = None
      self.error = None
      self.outgoingEdges = []
      self.incomingEdges = []

class Edge:
   def __init__(self, source, target):
      # initialize the weight to a random value in [0, 1]
      self.weight = random.uniform(0, 1)
      self.source = source
      self.target = target

      # attach the edge to its endpoints
      source.outgoingEdges.append(self)
      target.incomingEdges.append(self)

Then a neural network is represented by the set of input and output nodes.

class Network:
   def __init__(self):
      self.inputNodes = []
      self.outputNode = None

In particular, each Node needs to know about its most recent output, input, and error in order to update its weights. So any time we evaluate some input, we need to store these values in the Node. We will progressively fill in these classes with the methods needed to evaluate and train the network on data. But before we can do anything, we need to be able to distinguish between an input node and a node internal to the network. For this we create a subclass of Node called InputNode:

class InputNode(Node):
   def __init__(self, index):
      Node.__init__(self)
      self.index = index # the index of the input vector corresponding to this node

And now here is a function which evaluates a given input when provided a neural network:

class Network:
   ...

   def evaluate(self, inputVector):
      return self.outputNode.evaluate(inputVector)

class Node:
   ...

   def evaluate(self, inputVector):
      self.lastInput = []
      weightedSum = 0

      for e in self.incomingEdges:
         theInput = e.source.evaluate(inputVector)
         self.lastInput.append(theInput)
         weightedSum += e.weight * theInput

      self.lastOutput = activationFunction(weightedSum)
      return self.lastOutput

class InputNode(Node):
   ...

   def evaluate(self, inputVector):
      self.lastOutput = inputVector[self.index]
      return self.lastOutput

A network calls the evaluate function on its output node, and each node recursively calls evaluate on the sources of each of its incoming edges. An InputNode simply returns the corresponding entry in the inputVector (which requires us to pass the input vector along through the recursive calls). Since our graph structure is arbitrary, we note that some nodes may be “evaluated” more than once per evaluation. As such, we need to store the node’s output for the duration of the evaluation. We also need to store this value for use in training, and so before a call to evaluate we must clear this value. We omit the details here for brevity.
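One small piece not shown in the snippet above is activationFunction itself. Per our earlier choice it is the sigmoid, so a minimal definition consistent with the rest of the code would be the following (the output-caching bookkeeping just described is likewise left to the full source):

import math

def activationFunction(x):
   ''' The sigmoid function chosen earlier in the post. '''
   return 1.0 / (1.0 + math.exp(-x))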

We will use the evaluate() function for training as well as for evaluating unknown inputs. As usual, examples of using these classes (and tests) are available in the full source code on this blog’s Github page.

In addition, we need to automatically add bias nodes and corresponding edges to the non-input nodes. This results in a new subclass of InputNode whose evaluate() always returns 1. Because of the way we organized things, the existence of this class changes nothing about the training algorithm.

class BiasNode(InputNode):
   def __init__(self):
      Node.__init__(self)

   def evaluate(self, inputVector):
      return 1.0

We simply add a function to the Node class (which is overridden in the InputNode class) which adds a bias node and edge to every non-input node. The details are trivial; the reader may see them in the full source code.

The training algorithm will come in a loop consisting of three steps: first, evaluate an input example. Second, go through the network updating the error values of each node using backpropagation. Third, go through the network again to update the weights of the edges appropriately. This can be written as a very short function on a Network, which then requires a number of longer functions on the Node classes:

class Network:
   ...

   def propagateError(self, label):
      for node in self.inputNodes:
         node.getError(label)

   def updateWeights(self, learningRate):
      for node in self.inputNodes:
         node.updateWeights(learningRate)

   def train(self, labeledExamples, learningRate=0.9, maxIterations=10000):
      while maxIterations > 0:
         for example, label in labeledExamples:
            output = self.evaluate(example)
            self.propagateError(label)
            self.updateWeights(learningRate)

            maxIterations -= 1

That is, the network class just makes the first call for each of these recursive processes. We then have the following functions on Nodes to implement getError and updateWeights:

class InputNode(Node):
   ...

   def updateWeights(self, learningRate):
      for edge in self.outgoingEdges:
         edge.target.updateWeights(learningRate)

   def getError(self, label):
      for edge in self.outgoingEdges:
         edge.target.getError(label)

class Node:
   ...

   def getError(self, label):
      if self.error is not None:
         return self.error

      if self.outgoingEdges == []: # this is an output node
         self.error = label - self.lastOutput
      else:
         self.error = sum([edge.weight * edge.target.getError(label)
                           for edge in self.outgoingEdges])

      return self.error

   def updateWeights(self, learningRate):
      if (self.error is not None and self.lastOutput is not None
            and self.lastInput is not None):

         for i, edge in enumerate(self.incomingEdges):
            edge.weight += (learningRate * self.lastOutput * (1 - self.lastOutput) *
                           self.error * self.lastInput[i])

         for edge in self.outgoingEdges:
            edge.target.updateWeights(learningRate)

         self.error = None
         self.lastInput = None
         self.lastOutput = None

These are simply the formulas we derived in the previous sections translated into code. The propagated error is computed as a weighted sum in getError(), and the previous input and output values were saved from the call to evaluate().

A Sine Curve Example, and Issues

One simple example we can use to illustrate this is actually not a decision problem, per se, but a function estimation problem. In the course of all of this calculus, we implicitly allowed our neural network to output any values between 0 and 1 (indeed, the activation function did this for us). And so we can use a neural network to approximate any function which has values in [0,1]. In particular we will try this on

\displaystyle f(x) = \frac{1}{2}(\sin(x) + 1)

on the domain [0,4 \pi ].

Our network is simple: we have a single layer of twenty neurons, each of which is connected to a single input neuron and a single output neuron. The learning rate is set to 0.25, the number of iterations is set to a hundred thousand, and the training set is randomly sampled from the domain.
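For concreteness, here is a rough sketch of how this experiment might be wired up with the classes above. The training-set size, the makeSineNetwork helper, and the omission of bias nodes are our own simplifications; the actual experiment lives in the full source on Github.

import math
import random

def f(x):
   return 0.5 * (math.sin(x) + 1)

def makeSineNetwork(hiddenNodes=20):
   ''' One input node, a single layer of hidden sigmoid nodes, one output node. '''
   network = Network()
   inputNode = InputNode(0)
   outputNode = Node()
   network.inputNodes = [inputNode]
   network.outputNode = outputNode

   for _ in range(hiddenNodes):
      hidden = Node()
      Edge(inputNode, hidden)
      Edge(hidden, outputNode)

   return network

trainingSet = [([x], f(x)) for x in
               [random.uniform(0, 4 * math.pi) for _ in range(200)]]

network = makeSineNetwork()
network.train(trainingSet, learningRate=0.25, maxIterations=100000)

# average absolute error on a fresh random sample
testPoints = [random.uniform(0, 4 * math.pi) for _ in range(100)]
print(sum(abs(network.evaluate([x]) - f(x)) for x in testPoints) / 100.0)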

After training (which takes around fifteen seconds), the average error (when tested against a new random sample) is between 0.03 and 0.06. Here is an example of one such output:

An example of a 20-node neural network approximating two periods of a sine function.

This picture hints at an important shortcoming of our algorithm. Note how the neural network’s approximation of the sine function does particularly poorly close to 0 and 1. This is not a coincidence, but rather a side effect of our activation function \sigma. In particular, the sigmoid function achieves the values 0 and 1 only in the limit. That is, it never actually attains 0 or 1, and in order to get close we require prohibitively large weights (which in turn correspond to rather large values being fed to the activation function). One potential solution is to modify our sine function slightly, by scaling and translating it so that its values lie in [0.1, 0.9], say. We leave this as an exercise to the reader.
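As a hint, one way to do that is to squash the target values into [0.1, 0.9] before training and undo the transformation when reading off the network’s output:

def squash(y):
   # map a target value in [0, 1] into [0.1, 0.9]
   return 0.1 + 0.8 * y

def unsquash(output):
   # map the network's output back to the original scale
   return (output - 0.1) / 0.8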

As one might expect, the neural network also does better when we test it on a single period instead of two (since the sine function is less “complicated” on a single period). We also constructed a data set of binary numbers whose labels were 1 if the number was even and 0 if the number was odd. A similar layout to the sine example with three internal nodes again gave good results.

The issues arise on larger datasets. One big problem with training a neural network is that it’s nearly impossible to determine the “correct” structure of the network ahead of time. The success of our sine function example, for instance, depended much more than we anticipated on the number of nodes used. Of course, this also depends on the choice of learning rate and the number of iterations allowed, but the point is the same: the neural network is fraught with arbitrary choices. What’s worse is that it’s just as difficult to tell whether your choices are justified. All you have is an empirical number measuring how well your network does on one training set, and inspecting the values of the various weights will tell you nothing in all but the most trivial of examples.

There are a number of researchers who have attempted to alleviate this problem in some way. One prominent example is the Cascade Correlation algorithm, which dynamically builds the network structure depending on the data. Other avenues include dynamically updating the learning rate and using a variety of other activation and error functions, based on information theory and game theory (adding penalties for various undesirable properties). Still other methods involve alternative weight updates based on more advanced optimization techniques (such as the conjugate gradient method). Part of the benefit of the backpropagation algorithm is that the choice of error function is irrelevant, as long as it is differentiable. This gives us a lot of flexibility to customize the neural network for our own application domain.

These sorts of questions are what have caused neural networks to become such a huge field of research in machine learning. As such, this blog post has only given the reader a small taste of what is out there. This is the bread and butter in a world of fine cuisine: it’s proven to be a solid choice, but it leaves a cornucopia of flavors unrealized and untapped.

The future of this machine learning series, however, will deviate from the neural network path. We will instead investigate the other extension of the Perceptron model, the Support Vector Machine. We will also lay down some formal theories of learning, because as of yet we have simply been exploring algorithms without the ability to give guarantees on their performance. In order to do that, we must needs formalize the notion of an algorithm “learning” a task. This is no small feat, and nobody has quite agreed on the best formalization. We will nevertheless explore these frameworks, and see what kinds of theorems we can prove in them.

Until then!

Decision Trees and Political Party Classification

Last time we investigated the k-nearest-neighbors algorithm and the underlying idea that one can learn a classification rule by copying the known classification of nearby data points. This required that we view our data as sitting inside a metric space; that is, we imposed a kind of geometric structure on our data. One glaring problem is that there may be no reasonable way to do this. While we mentioned scaling issues and provided a number of possible metrics in our primer, a more common problem is that the data simply isn’t numeric.

For instance, a poll of US citizens might ask the respondent to select which of a number of issues he cares most about. There could be 50 choices, and there is no reasonable way to assign numerical values to them so that all are equidistant in the resulting metric space.

Another issue is that the quality of the data could be bad. For instance, there may be missing values for some attributes (e.g., a respondent may neglect to answer one or more questions). Alternatively, the attributes or the classification label could be wrong; that is, the data could exhibit noise. For k-nearest-neighbors, missing an attribute means we can’t (or can’t accurately) compute the distance function. And noisy data can interfere with our choice of k. In particular, certain regions might be better with a smaller value of k, while regions with noisier data might require a larger k to achieve the same accuracy rate.  Without making the algorithm sufficiently more complicated to vary k when necessary, our classification accuracy will suffer.

In this post we’ll see how decision trees can alleviate these issues, and we’ll test the decision tree on an imperfect data set of congressional voting records. We’ll implement the algorithm in Python, and test it on the problem of predicting the US political party affiliation of members of Congress based on their votes for a number of resolutions. As usual, we post the entire code and data set on this blog’s Github page.

Before going on, the reader is encouraged to read our primer on trees. We will assume familiarity with the terminology defined there.

Intuition

Imagine we have a data set where each record is a list of categorical weather conditions on a number of randomly selected days, and the labels correspond to whether a girl named Arya went for a horse ride on that day. Let’s also assume she would like to go for a ride every day, and the only thing that might prohibit her from doing so is adverse weather. In this case, the input variables will be the condition in the sky (sunny, cloudy, rainy, and snow), the temperature (cold, warm, and hot), the relative humidity (low, medium, and high), and the wind speed (low and high). The output variable will be whether Arya goes on a horse ride that day. Some entries in this data set might look like:

                 Arya's Riding Data
Sky     Temperature    Humidity    Wind    Horse Ride
Cloudy  Warm           Low         Low     Yes
Rainy   Cold           Medium      Low     No
Sunny   Warm           Medium      Low     Yes
Sunny   Hot            High        High    No
Snow    Cold           Low         High    No
Rainy   Warm           High        Low     Yes

In this case, one might reasonably guess that certain weather features are more important than others in determining whether Arya can go for a horse ride. For instance, the difference between sun and rain/snow should be a strong indicator, although it is not always correct in this data set. In other words, we’re looking for a weather feature that best separates the data into its respective classes. Of course, we’ll need a rigorous way to measure how good that separation is, but intuitively we can continue.

For example, we might split based on the wind speed feature. In this case, we have two smaller data sets corresponding to the entries where the wind is high and low. The corresponding table might look like:

     Arya's Riding Data, Wind = High
Sky     Temperature    Humidity    Horse Ride
Sunny   Hot            High        No
Snow    Cold           Low         No

     Arya's Riding Data, Wind = Low
Sky     Temperature    Humidity    Horse Ride
Cloudy  Warm           Low         Yes
Rainy   Cold           Medium      No
Sunny   Warm           Medium      Yes
Rainy   Warm           High        Yes

In this case, Arya is never known to ride a horse when the wind speed is high, and there is only one occasion when she doesn’t ride a horse and the wind speed is low. Taking this one step further, we can repeat the splitting process on the “Wind = Low” data in search of a complete split between the two output classes. We can see by visual inspection that the only “no” instance occurs when the temperature is cold. Hence, we should split on the temperature feature.

It is not useful to write out another set of tables (one feels the pain already when imagining a data set with a thousand entries), because in fact there is a better representation. The astute reader will have already recognized that our process of picking particular values for the weather features is just the process of traversing a tree.

Let’s investigate this idea closer. Imagine we have a tree where the root node corresponds to the Wind feature, and it has two edges connected to child nodes; one edge corresponds to the value “Low” and the other to “High.” That is, the process of traveling from the root to a child along an edge is the process of selecting only those data points whose “Wind” feature is that edge’s label. We can take the child corresponding to “Low” wind and have it represent the Temperature feature, further adding three child nodes with edges corresponding to the “Cold,” “Warm,” and “Hot” values.

We can stop this process once the choice of features completely splits our data set. Pictorially, our tree would look like this:

We reasonably decide to stop the traversal when all of the examples in the split are in the same class. More so, we would not want to include the option for the temperature to be Hot in the right subtree, because the data tells us nothing about such a scenario (as indicated by the “None” in the corresponding leaf).

Now that we have the data organized as a tree, we can try to classify new data with it. Suppose the new example is:

Sky     Temperature    Humidity    Wind    Horse Ride
Rainy   Cold           Low         Low     ?

We first inspect the wind speed feature, and seeing that it is “Low,” we follow the edge to the right subtree and repeat. Seeing that the temperature feature is “Cold,” we further descend down the “Cold” branch, reaching the “All No” leaf. Since this leaf corresponds to examples we’ve seen which are all in the “No” class, we should classify the new data as “No” as well.

Summarizing, given a new piece of data, we can traverse the tree according to the values of its features until we reach a leaf node. If the leaf node is “All No,” then we classify the new set of weather conditions as a “No,” and if it is “All Yes,” we classify as “Yes.”

Of course, this tree makes it clear that this toy data set is much too small to be useful, and the rules we’ve extrapolated from it are ridiculous. In particular, surely some people ride horses when the wind speed is high, and they would be unlikely to do so in a warm, low-wind thunderstorm. Nevertheless, we might expect a larger data set to yield a more sensible tree, as the data would more precisely reflect the true reasons one might refrain from riding a horse.

Before we generalize this example to any data set, we should note that there is yet another form for our classification rule. In particular, we can write the traversal from the root to the rightmost leaf as a boolean expression of the form:

\displaystyle \textup{Wind = ``Low''} \wedge \textup{Temp = ``Warm''}

An example will be classified as “Yes” if and only if the wind feature is “Low” and the temperature feature is “Warm” (here the wedge symbol \wedge means “and,” and is called a conjunction). If we had multiple such routes leading to leaves labeled with “All Yes,” say a branch for wind being “High” and sky being “Sunny,” we could expand our expression as a disjunction (an “or,” denoted \vee) of the two corresponding conjunctions as follows:

 \displaystyle (\textup{Wind = ``Low''} \wedge \textup{Temp = ``Warm''}) \vee (\textup{Wind = ``High''} \wedge \textup{Sky = ``Sunny''})

In the parlance of formal logic, this kind of expression is called the disjunctive normal form, that is, a disjunction of conjunctions. It’s an easy exercise to prove that every propositional statement (in our case, using only and, or, and parentheses for grouping) can be put into disjunctive normal form. That is, any boolean condition that can be used to classify the data can be expressed in a disjunctive normal form, and hence as a decision tree.

Such a “boolean condition” is an example of a hypothesis, which is the formal term for the rule an algorithm uses to classify new data points. We call the set of all possible hypotheses expressible by our algorithm the hypothesis space of our algorithm. What we’ve just shown above is that decision trees have a large and well-defined hypothesis space. On the other hand, it is much more difficult to describe the hypothesis space for an algorithm like k-nearest-neighbors. This is one argument in favor of decision trees: they have a well-understood hypothesis space, and that makes them analytically tractable and interpretable.

Using Entropy to Find Optimal Splits

The real problem here is not in using a decision tree, but in constructing one from data alone. At any step in the process we outlined in the example above, we need to determine which feature is the right one to split the data on. That is, we need to choose the labels for the interior nodes so that the resulting data subsets are as homogeneous as possible. In particular, it would be nice to have a quantitative way to measure the quality of a split. Then at each step we could simply choose the feature whose split yields the highest value under this measurement.

While we won’t derive such a measurement in this post, we will use one that has an extensive history of applications: Shannon entropy.

Definition: Let D be a discrete probability distribution (p_1, p_2, \dots, p_n). Then the Shannon entropy of D, denoted E(p_1, \dots, p_n) is

\displaystyle E(p_1, \dots , p_n) = - \sum_{i=1}^n p_i \log(p_i)

where the logarithms are taken in base 2.

In English, there are n possible outcomes numbered 1 to n, and the probability that an instance drawn from D results in the outcome k is p_k. Then Shannon’s entropy function computes a numerical quantity describing how “dispersed” the outcomes are.

While there are many other useful interpretations of Shannon entropy, we only need it to describe how well the data is split into its classes. For our purposes, the probability distribution will simply be the observed proportions of data with respect to their class labels. In the case of Arya’s horse riding, the initial distribution would be (1/2, 1/2), giving an entropy of 1.
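Indeed, plugging the distribution (1/2, 1/2) into the definition gives

\displaystyle E(1/2, 1/2) = -\frac{1}{2} \log \left ( \frac{1}{2} \right ) - \frac{1}{2} \log \left ( \frac{1}{2} \right ) = \frac{1}{2} + \frac{1}{2} = 1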

Let’s verify that Shannon’s entropy function makes sense for our problem. Specifically, the best scenario for splitting the data on a feature is a perfect split; that is, each subset only has data from one class. On the other hand, the worst case would be where each subset is uniformly distributed across all classes (if there are n classes, then each subset has 1/n of its data from each class).

Indeed, if we adopt the convention that \log(0) = 0, then the entropy of (1,0, \dots, 0) consists of a single term -1 \log(1) = 0. It is clear that this does not depend on the position of the 1 within the probability distribution. On the other hand, the entropy of (1/n, \dots, 1/n) is

\displaystyle -\sum_{i=1}^n\frac{1}{n}\log \left (\frac{1}{n} \right ) = -\log \left (\frac{1}{n} \right ) = -(0 - \log(n)) = \log(n)

A well-known property of the entropy function tells us that this is in fact the maximum value for this function.

Summarizing this, in the best case entropy is minimized after the split, and in the worst case entropy is maximized. But we can’t simply look at the entropy of each subset after splitting. We need a sensible way to combine these entropies and to compare them with the entropy of the data before splitting. In particular, we want to quantify the “decrease” in entropy caused by a split, and then maximize that quantity over the choices of feature.

Definition: Let S be a data set and A a feature with values v \in V, and let E denote Shannon’s entropy function. Moreover, let S_v denote the subset of S for which the feature A has the value v. The gain of a split along the feature A, denoted G(S,A) is

\displaystyle G(S,A) = E(S) - \sum_{v \in V} \frac{|S_v|}{|S|} E(S_v)

That is, we are taking the difference of the entropy before the split, and subtracting off the entropies of each part after splitting, with an appropriate weight depending on the size of each piece. Indeed, if the split leaves the data nearly as mixed as it was before, then this number will be small. On the other hand, if the split separates the classes nicely, each subset S_v will have small entropy, and hence the gain will be large.
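As a worked example, consider splitting Arya’s six-example riding data on the Wind feature. Before the split the class distribution is (1/2, 1/2), with entropy 1. The “High” subset has two examples, both labeled No, so its entropy is 0; the “Low” subset has three Yes examples and one No, with entropy E(3/4, 1/4) \approx 0.811. The gain is therefore

\displaystyle G(S, \textup{Wind}) = 1 - \frac{2}{6} \cdot 0 - \frac{4}{6} \cdot 0.811 \approx 0.46

(In this toy data set, splitting on Temperature instead separates the classes perfectly, since every “Warm” day is a Yes and every “Cold” or “Hot” day is a No; that split achieves the maximum possible gain of 1 and would be chosen first by the algorithm below.)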

It requires a bit of mathematical tinkering to be completely comfortable that this function actually does what we want it to (for instance, it is not obvious that this function is non-negative; does it make sense to have a negative gain?). We won’t tarry in those details (this author has spent at least a day or two ironing them out), but we can rest assured that this function has been studied extensively, and nothing unexpected happens.

So now the algorithm for building trees is apparent: at each stage, simply pick the feature for which the gain function is maximized, and split the data on that feature. Create a child node for each of the subsets in the split, and connect them via edges with labels corresponding to the chosen feature value for that piece.

This algorithm is classically called ID3, and we’ll implement it in the next section.

Building a Decision Tree in Python

As with our primer on trees, we can use a quite simple data structure to represent the tree, but here we need a few extra pieces of data associated with each node.

class Tree:
   def __init__(self, parent=None):
      self.parent = parent
      self.children = []
      self.splitFeature = None
      self.splitFeatureValue = None
      self.label = None

In particular, now that features can have more than two possible values, we need to allow for an arbitrarily long list of child nodes. In addition, we add three pieces of data (with default values None): the splitFeature is the feature whose values distinguish this node’s children; the splitFeatureValue is the value of the parent’s split feature that this node corresponds to; and the label (which is None for all interior nodes) is the final classification label for a leaf.

We also need to nail down our representations for the data. In particular, we will represent a data set as a list of pairs of the form (point, label), where the point is itself a list of the feature values, and the label is a string.

Now given a data set the first thing we need to do is compute its entropy. For that we can first convert it to a distribution (in the sense defined above, a list of probabilities which sum to 1):

def dataToDistribution(data):
   ''' Turn a dataset which has n possible classification labels into a
       probability distribution with n entries. '''
   allLabels = [label for (point, label) in data]
   numEntries = len(allLabels)
   possibleLabels = set(allLabels)

   dist = []
   for aLabel in possibleLabels:
      dist.append(float(allLabels.count(aLabel)) / numEntries)

   return dist

And we can compute the entropy of such a distribution in the obvious way:

import math

def entropy(dist):
   ''' Compute the Shannon entropy of the given probability distribution. '''
   return -sum([p * math.log(p, 2) for p in dist])
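As a quick sanity check, Arya’s riding data from the earlier example has three “Yes” and three “No” labels (the points themselves are irrelevant to the distribution, so we leave them empty here):

data = [([], 'Yes'), ([], 'No'), ([], 'Yes'),
        ([], 'No'), ([], 'No'), ([], 'Yes')]

print(dataToDistribution(data))            # [0.5, 0.5]
print(entropy(dataToDistribution(data)))   # 1.0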

Now in order to compute the gain of a data set by splitting on a particular value, we need to be able to split the data set. To do this, we identify features with their index in the list of feature values of a given data point, enumerate all possible values of that feature, and generate the needed subsets one at a time. In particular, we use a Python generator object:

def splitData(data, featureIndex):
   ''' Iterate over the subsets of data corresponding to each value
       of the feature at the index featureIndex. '''

   # get possible values of the given feature
   attrValues = [point[featureIndex] for (point, label) in data]

   for aValue in set(attrValues):
      dataSubset = [(point, label) for (point, label) in data
                    if point[featureIndex] == aValue]

      yield dataSubset

So to compute the gain, we simply need to iterate over the set of all splits, and compute the entropy of each split. In code:

def gain(data, featureIndex):
   ''' Compute the expected gain from splitting the data along all possible
       values of the feature at featureIndex. '''

   entropyGain = entropy(dataToDistribution(data))

   for dataSubset in splitData(data, featureIndex):
      # weight each subset's entropy by its relative size, as in the definition of gain
      entropyGain -= (float(len(dataSubset)) / len(data)) * entropy(dataToDistribution(dataSubset))

   return entropyGain

Of course, the best split (represented as the best feature to split on) is given by such a line of code as:

bestFeature = max(range(n), key=lambda index: gain(data, index))

We can’t quite use this line exactly though, because while we’re building up the decision tree (which will of course be a recursive process) we need to keep track of which features have been split on previously and which have not; this data is different for each possible traversal of the tree. In the end, our function to build a decision tree requires three pieces of data: the current subset of the data to investigate, the root of the current subtree that we are in the process of building, and the set of features we have yet to split on.

Of course, this raises the obvious question about the base cases. One base case will be when we run out of data to split; that is, when our input data all have the same classification label. To check for this we implement a function called “homogeneous”

def homogeneous(data):
   ''' Return True if the data have the same label, and False otherwise. '''
   return len(set([label for (point, label) in data])) <= 1

The other base case is when we run out of good features to split on. Of course, if the true classification function is actually a decision tree then this won’t be the case. But now that we’re in the real world, we can imagine there may be two data points with identical features but different classes. Perhaps the simplest way to remedy this situation is to terminate the tree at that point (when we run out of features to split on, or no split gives positive gain), and use a simple majority vote to label the new leaf. In a sense, this strategy is a sort of nearest-neighbors vote as a default. To implement this, we have a function which simply patches up the leaf appropriately:

def majorityVote(data, node):
   ''' Label node with the majority of the class labels in the given data set. '''
   labels = [label for (pt, label) in data]
   choice = max(set(labels), key=labels.count)
   node.label = choice
   return node

The base cases show up rather plainly in the code to follow, so let us instead focus on the inductive step. We declare our function to accept the data set in question, the root of the subtree to be built, and a list of the remaining allowable features to split on. The function begins with:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''

   if homogeneous(data):
      root.label = data[0][1]
      return root

   if len(remainingFeatures) == 0:
      return majorityVote(data, root)

   # find the index of the best feature to split on
   bestFeature = max(remainingFeatures, key=lambda index: gain(data, index))

   if gain(data, bestFeature) == 0:
      return majorityVote(data, root)

   root.splitFeature = bestFeature

Here we see the base cases, and the selection of the best feature to split on. As a side remark, we observe this is not the most efficient implementation. We admittedly call the gain function and splitData functions more often than necessary, but we feel what is lost in runtime speed is gained in code legibility.

Once we bypass the three base cases, and we have determined the right split, we just do it:

def buildDecisionTree(data, root, remainingFeatures):
   ''' Build a decision tree from the given data, appending the children
       to the given root node (which may be the root of a subtree). '''

   ...

   # add child nodes and process recursively
   for dataSubset in splitData(data, bestFeature):
      aChild = Tree(parent=root)
      aChild.splitFeatureValue = dataSubset[0][0][bestFeature]
      root.children.append(aChild)

      buildDecisionTree(dataSubset, aChild, remainingFeatures - set([bestFeature]))

   return root

Here we iterate over the subsets of data after the split, and create a child node for each. We then assign the child its corresponding feature value in the splitFeatureValue variable, and append the child to the root’s list of children. Next is where we first see the remainingFeatures set come into play, and in particular we note the overloaded minus sign as an operation on sets. This is a feature of Python sets, and it behaves exactly like set difference in mathematics. The astute programmer will note that the minus operation generates a new set, so that further recursive calls to buildDecisionTree will not be affected by what happens in this recursive call.

Now the first call to this function requires some initial parameter setup, so we define a convenience function that only requires a single argument: the data.

def decisionTree(data):
   return buildDecisionTree(data, Tree(), set(range(len(data[0][0]))))

Classifying New Data

The last piece of the puzzle is to classify a new piece of data once we’ve constructed the decision tree. This is a considerably simpler recursive process. If the current node is a leaf, output its label. Otherwise, recursively search the subtree (the child of the current node) whose splitFeatureValue matches the new data’s choice of the feature being split. In code,

def classify(tree, point):
   ''' Classify a data point by traversing the given decision tree. '''

   if tree.children == []:
      return tree.label
   else:
      matchingChildren = [child for child in tree.children
         if child.splitFeatureValue == point[tree.splitFeature]]

      return classify(matchingChildren[0], point)

And we can use this function to test our classification accuracy on a dataset:

def testClassification(data, tree):
   actualLabels = [label for point, label in data]
   predictedLabels = [classify(tree, point) for point, label in data]

   correctLabels = [(1 if a == b else 0) for a,b in zip(actualLabels, predictedLabels)]
   return float(sum(correctLabels)) / len(actualLabels)
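As a small end-to-end check (this snippet is ours, not part of the post’s test suite), we can run Arya’s riding data from the intuition section through these functions and classify the unknown day we considered there:

# each point is a list of feature values in the order (Sky, Temperature, Humidity, Wind)
aryaData = [(['Cloudy', 'Warm', 'Low',    'Low'],  'Yes'),
            (['Rainy',  'Cold', 'Medium', 'Low'],  'No'),
            (['Sunny',  'Warm', 'Medium', 'Low'],  'Yes'),
            (['Sunny',  'Hot',  'High',   'High'], 'No'),
            (['Snow',   'Cold', 'Low',    'High'], 'No'),
            (['Rainy',  'Warm', 'High',   'Low'],  'Yes')]

tree = decisionTree(aryaData)
print(classify(tree, ['Rainy', 'Cold', 'Low', 'Low']))   # 'No', as we reasoned earlier
print(testClassification(aryaData, tree))                # 1.0 on the training data itself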

But now we run into the issue of noisy data. What if one wants to classify a point where one of the feature values which is used in the tree is unknown? One can take many approaches to remedy this, and we choose a simple one: simply search both routes, and use a majority vote when reaching a leaf. This requires us to add one additional piece of information to the leaf nodes: the total number of labels in each class used to build that leaf (recall, one of our stopping conditions resulted in a leaf having heterogeneous data). We omit the details here, but the reader is invited to read them on this blog’s Github page, where as usual we have provided all of the code used in this post.
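For illustration only, here is a simpler (and cruder) variant of classify than the one just described: when it encounters an unknown value it descends every branch and takes an unweighted majority vote over the leaf labels it reaches, rather than using the per-leaf counts of the actual implementation.

def classifyNoisy(tree, point):
   ''' Return a list of leaf labels reachable from this node, following the
       matching branch when the feature value is known and every branch
       when it is unknown (or was never seen in training). '''
   if tree.children == []:
      return [tree.label]

   matchingChildren = [child for child in tree.children
                       if child.splitFeatureValue == point[tree.splitFeature]]

   if point[tree.splitFeature] == '?' or matchingChildren == []:
      votes = []
      for child in tree.children:
         votes.extend(classifyNoisy(child, point))
      return votes

   return classifyNoisy(matchingChildren[0], point)

def classifyWithUnknowns(tree, point):
   votes = classifyNoisy(tree, point)
   return max(set(votes), key=votes.count)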

Classifying Political Parties Based on Congressional Votes

We now move to a concrete application of decision trees. The data set we will work with comes from the UCI machine learning repository, and it records the votes cast by the US House of Representatives during a particular session of Congress in 1984. The data set has 16 features; that is, there were 16 measures considered “key” votes during this session. So each point in the dataset represents the 16 votes of a single House member in that session. With a bit of reformatting, we provide the complete data set on this blog’s Github page.

Our goal is to learn party membership based on the voting records. This data set is rife with missing values; roughly half of the members abstained from voting on some of these measures. So we constructed a decision tree from the clean portion of the data, and used it to classify the remainder of the data.

Indeed, this data fits precisely into the algorithm we designed above. The code to construct a tree is almost trivial:

   with open('house-votes-1984.txt', 'r') as inputFile:
      lines = inputFile.readlines()

   data = [line.strip().split(',') for line in lines]
   data = [(x[1:], x[0]) for x in data]

   cleanData = [x for x in data if '?' not in x[0]]
   noisyData = [x for x in data if '?' in x[0]]

   tree = decisionTree(cleanData)

Indeed, the classification accuracy in doing this is around 90%. In addition (though we will revisit the concept of overfitting later), this is stable despite variation in the size of the subset of data used to build the tree. The graph below shows this.

The size of the subset used to construct the tree versus its accuracy in classifying the remainder of the data. Note that the subsets were chosen uniformly at random without replacement. The x-axis is the number of points used to construct the tree, and the y-axis is the proportion of labels correctly classified.

Inspecting the trees generated in this process, it appears that the most prominent feature to split on is the adoption of a new budget resolution. Very few Democrats voted in favor of this, so for many of the random subsets of the data, a split on this feature left one side homogeneously Republican.

Overfitting, Pruning, and Other Issues

Now there are some obvious shortcomings to the method in general. If the data set used to build the decision tree is enormous (in dimension or in number of points), then the resulting decision tree can be arbitrarily large in size. In addition, there is the pitfall of overfitting to this particular data set. For the party classification problem above, the point is to extend the classification to any population of people who might vote on these issues (or, more narrowly, to any politician who might vote on these issues). If we make our decision tree very large, then the hypothesis may be overly specific to the people in the sample used, and hence will not generalize well.

This problem is called overfitting to the data, and it’s a prevalent concern among all machine learning algorithms. There are a number of ways to avoid it for decision trees. Perhaps the most common is the idea of pruning: one temporarily removes all possible proper subtrees and reevaluates the classification accuracy for that removal. Whichever subtree results in the greatest increase in accuracy is actually removed, and it is replaced with a single leaf whose label corresponds to the majority label of the data points used to create the entire subtree. This process is then repeated until there are no possible improvements, or the gain is sufficiently small.
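To make the idea concrete, here is a rough sketch of one pruning pass in the style of reduced-error pruning. It is not part of this post’s source: it assumes a hypothetical subtreeData attribute holding, at each node, the training examples used to build that subtree, and for simplicity it keeps the first replacement that does not hurt accuracy on a held-out validation set (rather than the single best replacement described above).

def internalNodes(tree):
   ''' Yield every non-leaf node of the tree. '''
   if tree.children:
      yield tree
      for child in tree.children:
         for node in internalNodes(child):
            yield node

def pruneOnce(tree, validationData):
   ''' Try replacing subtrees with majority-vote leaves, keeping the first
       replacement that does not hurt validation accuracy. Returns True if
       a subtree was pruned. '''
   baseline = testClassification(validationData, tree)

   for node in internalNodes(tree):
      savedChildren, savedLabel = node.children, node.label
      node.children = []
      majorityVote(node.subtreeData, node)  # subtreeData is the hypothetical attribute noted above

      if testClassification(validationData, tree) >= baseline:
         return True   # keep the pruned tree; the caller may repeat the pass

      # the pruning hurt accuracy, so undo it
      node.children, node.label = savedChildren, savedLabel

   return False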

From a statistical point of view one could say this process is that of ignoring outliers. Any points which do not generally agree with the whole trend of the data set (hence, create their own branches in the decision tree) are simply removed. From a theoretical point of view, a smaller decision tree satisfies the principle of Occam’s razor: among hypotheses that explain the data comparably well, the simpler one is to be preferred.

While we won’t implement a pruning method here (indeed, we didn’t detect any overfitting in the congressional voting example), it would be a nice exercise for the reader to get their feet wet with the code given above. Finally, there are other algorithms to build decision trees that we haven’t mentioned here. You can see a list of such algorithms on the relevant Wikipedia page. Because of the success of ID3, there is a large body of research on such algorithms.

In any event, next time we’ll investigate yet another machine learning method: that of neural networks. We’ll also start to look at more general frameworks for computational learning theory. That is, we’ll exercise the full might of theoretical mathematics to reason about how hard certain problems are to learn (or whether they can be learned at all).

Until then!