What does it mean for an algorithm to be fair?

In 2014 the White House commissioned a 90-day study that culminated in a report (pdf) on the state of “big data” and related technologies. The authors give many recommendations, including this central warning.

Warning: algorithms can facilitate illegal discrimination!

Here’s a not-so-imaginary example of the problem. A bank wants people to take loans with high interest rates, and it also serves ads for these loans. A modern idea is to use an algorithm to decide, based on the sliver of known information about a user visiting a website, which advertisement to present that gives the largest chance of the user clicking on it. There’s one problem: these algorithms are trained on historical data, and poor uneducated people (often racial minorities) have a historical trend of being more likely to succumb to predatory loan advertisements than the general population. So an algorithm that is “just” trying to maximize clickthrough may also be targeting black people, de facto denying them opportunities for fair loans. Such behavior is illegal.

Payday-Loans

On the other hand, even if algorithms are not making illegal decisions, by training algorithms on data produced by humans, we naturally reinforce prejudices of the majority. This can have negative effects, like Google’s autocomplete finishing “Are transgenders” with “going to hell?” Even if this is the most common question being asked on Google, and even if the majority think it’s morally acceptable to display this to users, this shows that algorithms do in fact encode our prejudices. People are slowly coming to realize this, to the point where it was recently covered in the New York Times.

There are many facets to the algorithm fairness problem one that has not even been widely acknowledged as a problem, despite the Times article. The message has been echoed by machine learning researchers but mostly ignored by practitioners. In particular, “experts” continually make ignorant claims such as, “equations can’t be racist,” and the following quote from the above linked article about how the Chicago Police Department has been using algorithms to do predictive policing.

Wernick denies that [the predictive policing] algorithm uses “any racial, neighborhood, or other such information” to assist in compiling the heat list [of potential repeat offenders].

Why is this ignorant? Because of the well-known fact that removing explicit racial features from data does not eliminate an algorithm’s ability to learn race. If racial features disproportionately correlate with crime (as they do in the US), then an algorithm which learns race is actually doing exactly what it is designed to do! One needs to be very thorough to say that an algorithm does not “use race” in its computations. Algorithms are not designed in a vacuum, but rather in conjunction with the designer’s analysis of their data. There are two points of failure here: the designer can unwittingly encode biases into the algorithm based on a biased exploration of the data, and the data itself can encode biases due to human decisions made to create it. Because of this, the burden of proof is (or should be!) on the practitioner to guarantee they are not violating discrimination law. Wernick should instead prove mathematically that the policing algorithm does not discriminate.

While that viewpoint is idealistic, it’s a bit naive because there is no accepted definition of what it means for an algorithm to be fair. In fact, from a precise mathematical standpoint, there isn’t even a precise legal definition of what it means for any practice to be fair. In the US the existing legal theory is called disparate impact, which states that a practice can be considered illegal discrimination if it has a “disproportionately adverse” effect on members of a protected group. Here “disproportionate” is precisely defined by the 80% rule, but this is somehow not enforced as stated. As with many legal issues, laws are broad assertions that are challenged on a case-by-case basis. In the case of fairness, the legal decision usually hinges on whether an individual was treated unfairly, because the individual is the one who files the lawsuit. Our understanding of the law is cobbled together, essentially through anecdotes slanted by political agendas. A mathematician can’t make progress with that. We want the mathematical essence of fairness, not something that can be interpreted depending on the court majority.

The problem is exacerbated for data mining because the practitioners often demonstrate a poor understanding of statistics, the management doesn’t understand algorithms, and almost everyone is lulled into a false sense of security via abstraction (remember, “equations can’t be racist”). Experts in discrimination law aren’t trained to audit algorithms, and engineers aren’t trained in social science or law. The speed with which research becomes practice far outpaces the speed at which anyone can keep up. This is especially true at places like Google and Facebook, where teams of in-house mathematicians and algorithm designers bypass the delay between academia and industry.

And perhaps the worst part is that even the world’s best mathematicians and computer scientists don’t know how to interpret the output of many popular learning algorithms. This isn’t just a problem that stupid people aren’t listening to smart people, it’s that everyone is “stupid.” A more politically correct way to say it: transparency in machine learning is a wide open problem. Take, for example, deep learning. A far-removed adaptation of neuroscience to data mining, deep learning has become the flagship technique spearheading modern advances in image tagging, speech recognition, and other classification problems.

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

A typical example of how a deep neural network learns to tag images. Image source: http://engineering.flipboard.com/2015/05/scaling-convnets/

The picture above shows how low level “features” (which essentially boil down to simple numerical combinations of pixel values) are combined in a “neural network” to more complicated image-like structures. The claim that these features represent natural concepts like “cat” and “horse” have fueled the public attention on deep learning for years. But looking at the above, is there any reasonable way to say whether these are encoding “discriminatory information”? Not only is this an open question, but we don’t even know what kinds of problems deep learning can solve! How can we understand to what extent neural networks can encode discrimination if we don’t have a deep understanding of why a neural network is good at what it does?

What makes this worse is that there are only about ten people in the world who understand the practical aspects of deep learning well enough to achieve record results for deep learning. This means they spent a ton of time tinkering the model to make it domain-specific, and nobody really knows whether the subtle differences between the top models correspond to genuine advances or slight overfitting or luck. Who is to say whether the fiasco with Google tagging images of black people as apes was caused by the data or the deep learning algorithm or by some obscure tweak made by the designer? I doubt even the designer could tell you with any certainty.

Opacity and a lack of interpretability is the rule more than the exception in machine learning. Celebrated techniques like Support Vector Machines, Boosting, and recent popular “tensor methods” are all highly opaque. This means that even if we knew what fairness meant, it is still a challenge (though one we’d be suited for) to modify existing algorithms to become fair. But with recent success stories in theoretical computer science connecting security, trust, and privacy, computer scientists have started to take up the call of nailing down what fairness means, and how to measure and enforce fairness in algorithms. There is now a yearly workshop called Fairness, Accountability, and Transparency in Machine Learning (FAT-ML, an awesome acronym), and some famous theory researchers are starting to get involved, as are social scientists and legal experts. Full disclosure, two days ago I gave a talk as part of this workshop on modifications to AdaBoost that seem to make it more fair. More on that in a future post.

From our perspective, we the computer scientists and mathematicians, the central obstacle is still that we don’t have a good definition of fairness.

In the next post I want to get a bit more technical. I’ll describe the parts of the fairness literature I like (which will be biased), I’ll hypothesize about the tension between statistical fairness and individual fairness, and I’ll entertain ideas on how someone designing a controversial algorithm (such as a predictive policing algorithm) could maintain transparency and accountability over its discriminatory impact. In subsequent posts I want to explain in more detail why it seems so difficult to come up with a useful definition of fairness, and to describe some of the ideas I and my coauthors have worked on.

Until then!

Bandits and Stocks

So far in this series we’ve seen two nontrivial algorithms for bandit learning in two different settings. The first was the UCB1 algorithm, which operated under the assumption that the rewards for the trials were independent and stochastic. That is, each slot machine was essentially a biased coin flip, and the algorithm was trying to find the machine with the best odds. The second was the Exp3 algorithm, which held the belief that the payoffs were arbitrary. In particular this includes the possibility that an adversary is setting the payoffs against you, and so we measured the success of an algorithm in terms of how it fares against the best single action (just as we did with UCB1, but with Exp3 it’s a nontrivial decision).

Before we move on to other bandit settings it’s natural to try to experiment with the ones we have on real world data. On one hand it’s interesting to see how they fare outside academia. And more relevantly to the design of the future bandit algorithms we’ll see on this blog, we need to know what worldly problems actually provide in terms of inputs to our learning algorithm in each round.

But another interesting issue goes like this. In the real world we can’t ever really know whether the rewards of the actions are stochastic or adversarial. Many people believe that adversarial settings are far too pathological to be realistic, while others claim that the assumptions made by stochastic models are too strict. To weigh in on this dispute, we’ll dip into a bit of experimental science and see which of the two algorithms performs better on the problem of stock trading. The result is then evidence that stocks behave stochastically or adversarially. But we don’t want to stir up too many flames, so we can always back up behind the veil of applied mathematics (“this model is too simple anyway”).

Indeed the model we use in this post is rather simplistic. I don’t know as much as I should (or as my father would have me know) about stock markets. In fact, I’m more partial to not trade stocks on principle. But I must admit that average-quality stock data is easy to come by, and the basic notions of market interactions lend themselves naturally to many machine learning problems. If the reader has any ideas about how to strengthen the model, I welcome suggestions in the comments (or a fork on github).

A fair warning to the reader, we do not solve the problem of trading stocks by any means. We use a model that’s almost entirely unrealistic, and the results aren’t even that good. I’m quite nervous to publish this at all, just because above all else it reveals my gaping ignorance on how stock markets work. But this author believes in revealing ignorance as learning, if for nothing else than that it provides extremely valuable insight into the nature of a problem and an appreciation of its complexity. So criticize away, dear readers.

As usual, all of the code and data we use in this post is available on this blog’s Github page. Our language of choice for this post is Python.

This little trader got lucky.

This little trader got lucky. Could it be because he’s got TEN MONITORS?!

Stocks for Dummies (me)

A quick primer on stocks, which is only as detailed as it needs to be for this post: a stock is essentially the sum of the value of all the assets of a company. A publicly traded company divides their stock into a number of “shares,” and owning a share represents partial ownership of the company. If you own 50% of the shares, you own 50% of the company. Companies sell shares or give them to employees as benefits (or options), and use the money gained through their sale for whatever they see fit. The increase in the price of a stock generally signifies the company is successful and growing; for example, stocks generally rise when a hotly anticipated product is announced.

The stock of a company is traded through one of a number of markets called stock exchanges. The buying and selling interactions are recorded and public, and there are many people in the world who monitor the interactions as they happen (via television, or programmatically) in the hopes of noticing opportunities before others and capitalizing on them. Each interaction induces a change on the price of a share of stock: whenever a share is bought at a certain price, that is the established and recorded price of a share (up to some fudging by brokers which is entirely mysterious to me). In any case, the prices go up and down, and they’re often bundled into “bars” which summarize the data over a certain period of time. The bars we use in this post are daily, and consist of four numbers: the open, the price at the beginning of the day, the high and low, which are self-explanatory, and the close, which is the price at the end of the day.

Bandits and Daily Stock Trading

Now let’s simplify things as much as possible. Our bandit learning algorithm will interact with the market as follows: each day it chooses whether or not to buy a single dollar’s worth of a stock, and at the end of the day it sells the stock and observes the profit. There are no brokers involved, and the price the algorithm sees is the price it gets. In bandit language: the stocks represent actions, and the amount of profit at the end of a day constitutes the payoff of an action in one round. Since small-scale stock price movement is generally very poorly understood, it makes some level of sense to assume the price movements within a given day are adversarial. On the other hand, since we understand them so poorly, we might be tempted to just call them “random” fluctuations, i.e. stochastic. So this is a nice little testbed for seeing which assumption yields a more successful algorithm.

Unlike the traditional image of stock trading where an individual owns shares of a stock over a long period of time, our program will operate on a daily time scale, and hence cannot experience the typical kinds of growth. Nevertheless, we can try to make some money over time, and if it’s a good strategy, we could scale up the single dollar to whatever we’re willing to risk. Specifically, the code we used to compute the payoff is

def payoff(stockTable, t, stock, amountToInvest=1.0):
   openPrice, closePrice = stockTable[stock][t]

   sharesBought = amountToInvest / openPrice
   amountAfterSale = sharesBought * closePrice

   return amountAfterSale - amountToInvest

The remainder of the code is interfacing with the Exp3 and UCB1 functions we gave in previous posts, and data shuffling. We got our data from Google Finance, and we provide it, along with all of the code used in the making of this post, on this blog’s Github page. Before we run our experiments, let’s give a few reasons why this model is unrealistic.

  1. We assume we can buy/sell fractional shares of a stock, which to my knowledge is not possible. Though this experiment could be redone where you buy a single share of a stock, or with mutual funds/currency exchange/whatever replacing stocks, we didn’t do it this way.
  2. Brokerage fees can drastically change the success of an algorithm which trades frequently.
  3. Open and close prices are not typical prices. People will often make decisions based on the time of day, but then again we might expect this to be just another reason that Exp3 would perform better than UCB1.
  4. We’re not actually trading in the stock market, and so we’re ignoring the effects of our own algorithm on the prices in the market.
  5. It’s impossible to guarantee you get to use the opening price and closing price in your transactions.
  6. UCB1 and Exp3 don’t use all of the information available. Indeed, they assume that they would not be able to see the outcome of an action they did not take, but with stocks you can get a good estimate of how much money you would have made had you chosen a different stock.
  7. Each trial in a bandit learning problem is identical from the learner’s perspective, but one often keeps a stock around while making other decisions.

We’ll come back to #6 after seeing the raw experiments for an unaltered UCB1 and Exp3, because there is a natural extension of the algorithm to handle additional information. I’m sure there are other glaring issues with the experimental setup, and the reader should feel free to rant about it in the comments. It won’t stop me from running the algorithm and seeing what happens just for fun.

Data Sets

We ran the experiment on two sets of stocks. The first set consisted of nine random stocks, taken from the random stocks twitter feed, with 5 years of past data. The stocks are:

lxrx, keg, cuba, tdi, brks, mux, cadx, belfb, htr

And you can view more information about these particular stocks via Google Finance. The second set was a non-random choice of nine Fortune 500 companies with 10 years of past data. The stocks were

amzn, cost, jpm, gs, wfc, msft, tgt, aapl, wmt

And again more information about these stocks is available via Google Finance. For the record, here were the cumulative payoffs of each of the nine Fortune 500 companies:

f500-cumulative-rewards

The cumulative rewards for the nine Fortune 500 companies over the last ten years of data.

Interestingly, the company which started off with the best prospects (Apple), turned out to have the worst cumulative reward by the end. The long-term winners in our little imaginary world happen to be Amazon, Costco, and Goldman Sachs. Perhaps this gives credence to the assumption that payoffs are adversarial. A learner can easily get tricked into putting too much faith in one action early on.

And for the random stocks:

The cumulative payoff for the nine randomly chosen stocks.

The cumulative payoff for the nine randomly chosen stocks for the last five years of data.

The random stocks clearly perform worse and more variably overall (although HTR surpasses most of the Fortune 500 companies, despite its otherwise relatively modest stock growth over the last five years). To my untrained eyes these movements look more like a stochastic model than an adversarial one.

Experiments

Here is a typical example of a run of Exp3 on the Fortune 500 data set (using \gamma = 0.33, recall \gamma measures the amount of uniform exploration performed):

(Expected payoff, variance) over 1000 trials is (1.122463919564572, 0.5518037498918705)
For a single run: 
Payoff was 1.12
Regret was 2.91
Best stock was amzn at 4.02
weights: '0.00, 0.00, 0.00, 0.46, 0.52, 0.00, 0.00, 0.00, 0.01'

And one for UCB1:

(Expected payoff, variance) over 1000 trials is (1.1529891576139333, 0.5012825847001482)
For a single run: 
Payoff was 1.73
Regret was 2.29
Best stock was amzn at 4.02
ucbs: '0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234, 0.234'

The results are quite curious. Indeed, the expected payoff seems to be a whopping 110% return! The variance of these results is quite high, and so it’s not at all impossible that the algorithm could have a negative return. But just as often it would return around 200% profit. 

Before we go risking all our money on this strategy, let’s take a closer look at what’s happening in the algorithm. It appears that for UCB1 the upper confidence bounds assigned to each action are the same! In other words, even after ten years of trials, no single stock “shined” above the others in the eyes of UCB1. It may seem that Exp3 has a leg up on UCB1 in this respect, because it’s clear that it gives higher weights to some stocks over others. However, running the algorithm multiple times shows drastically different weight distributions, and if we average the resulting weights over a thousand rounds, we see that they all have roughly the same mean and variance (the mean being first in the pair):

weight stats for msft: (0.107, 0.025)
weight stats for jpm: (0.109, 0.027)
weight stats for tgt: (0.110, 0.029)
weight stats for gs: (0.112, 0.025)
weight stats for wmt: (0.110, 0.027)
weight stats for aapl: (0.111, 0.027)
weight stats for amzn: (0.120, 0.029)
weight stats for cost: (0.113, 0.026)
weight stats for wfc: (0.107, 0.023)
Indeed, the best stock, Amazon, had an average weight just barely larger (and more variable) than any of the other stocks. So this evidence points to the conclusion that neither EXP3 nor UCB1 has any clue which stock is better. Pairing this with the fact that both algorithms nevertheless perform well would suggest that a random choice of action at each step is equally likely to do well. Indeed, when we run with a “random bandit” that just chooses actions uniformly at random, we get the following results:
(Expected payoff, variance) over 1000 trials is (1.1094227056931132, 0.4403783017367529)
For a single run: 
Payoff was 3.13
Regret was 0.90
Best stock was amzn at 4.02

It’s not quite as good as either Exp3 or UCB1, but it’s close and less variable, which means a lot to an investor. In other words, it’s starting to look like Exp3 and UCB1 aren’t doing significantly better than random at all, and that a monkey would do well in this system (for these particular stocks).

Of course, Fortune 500 companies are pretty successful by definition, so let’s turn our attention to the random stocks:

For the random bandit learner:

(Expected payoff, variance) over 1000 trials is (-0.23952295977625776, 1.0787311145181104)
For a single run: 
Payoff was -2.01
Regret was 3.92
Best stock was htr at 1.91

For UCB1:

(Expected payoff, variance) over 1000 trials is (-0.3503593899029112, 1.1136234992964154)
For a single run: 
Payoff was 0.26
Regret was 1.65
Best stock was htr at 1.91
ucbs: '0.315, 0.315, 0.315, 0.316, 0.315, 0.315, 0.315, 0.315, 0.316'

And for Exp3:

(Expected payoff, variance) over 1000 trials is (-0.25827976810345593, 1.2946101887058519)
For a single run: 
Payoff was -0.34
Regret was 2.25
Best stock was htr at 1.91
weights: '0.08, 0.00, 0.14, 0.06, 0.48, 0.00, 0.00, 0.04, 0.19'

But again Exp3 has no idea what stocks are actually best, with the average, variance over 1000 trials being:

weight stats for lxrx: '0.11, 0.02'
weight stats for keg: '0.11, 0.02'
weight stats for htr: '0.12, 0.02'
weight stats for cadx: '0.10, 0.02'
weight stats for belfb: '0.11, 0.02'
weight stats for tdi: '0.11, 0.02'
weight stats for cuba: '0.11, 0.02'
weight stats for mux: '0.11, 0.02'
weight stats for brks: '0.11, 0.02'

The long and short of it is that the choice of Fortune 500 stocks was inherently so biased toward success than a monkey could have made money investing in them, while the average choice of stocks had, if anything, a bias toward loss. And unfortunately using an algorithm like UCB1 or Exp3 straight out of the box doesn’t produce anything better than a monkey.

Issues and Improvements

There are two glaring theoretical issues here that we haven’t yet addressed. One of these goes back to issue #5 in that list we gave at the beginning of the post: the bandit algorithms are assuming they have less information than they actually have! Indeed, at the end of a day of stock trading, you have a good idea what would have happened to you had you bought a different stock, and in our simplified world you can know exactly what your profit would have been. Recalling that UCB1 and Exp3 both maintained some numbers representing the strength of an action (Exp3 had a “weight” and UCB1 an upper confidence bound), the natural extension to both UCB1 and Exp3 is simply to modify the beliefs about all actions after any given round. This is a pretty simple improvement to make in our implementation, since it just changes a single weight update to a loop. For Exp3:

for choice in range(numActions):
   rewardForUpdate = reward(choice, t)
   scaledReward = (rewardForUpdate - rewardMin) / (rewardMax - rewardMin)
   estimatedReward = 1.0 * scaledReward / probabilityDistribution[choice]
   weights[choice] *= math.exp(estimatedReward * gamma / numActions)

With a similar loop for UCB1. This code should be familiar from our previous posts on bandits. We then rerun the new algorithms on the same data sets, and the results are somewhat surprising. First, UCB1 on Fortune 500:

(Expected payoff, variance) over 1000 trials is (3.530670654982728, 0.007713190816014095)
For a single run: 
Payoff was 3.56
Regret was 0.47

This is clearly outperforming the random bandit learning algorithm, with an average return of 350%! In fact, it does almost as well as the best stock, and the variance is quite low. UCB1 also outperforms Exp3, which fares comparably to its pre-improved self. That is, it’s still not much better than random:

(Expected payoff, variance) over 1000 trials is (1.1424797906901956, 0.434335471375294)
For a single run: 
Payoff was 1.24
Regret was 2.79

And also for the random stocks, UCB1 with improvements outperforms Exp3 and UCB1 without improvements. UCB1:

(Expected payoff, variance) over 1000 trials is (0.680211923900068, 0.04226672915962647)
For a single run:
Payoff was 0.82
Regret was 1.09

And Exp3:

(Expected payoff, variance) over 1000 trials is (-0.2242152508929378, 1.1312843329929194)
For a single run: 
Payoff was -0.16
Regret was 2.07

We might wonder why this is the case, and there is a plausible explanation. See, Exp3 has a difficult life: it has to assume that at any time the adversary can completely change the game. And so Exp3 must remain vigilant, continuing to try options it knows to be terrible for fear that they may spontaneously do well. Exp3 is the grandfather who, after 75 years of not winning the lotto, continues to buy tickets every week. A better analogy might be a lioness who, even after being moved to the zoo, stays up all night to protect a cub from predators. This gives us quite a new perspective on Exp3: the world really has to be that messed up for Exp3 to be useful. As we saw, UCB1 is much more eager to jump on a winning bandwagon, and it paid off in both the good (Fortune 500) and bad (random stock) scenarios. All in all, this experiment would provide some minor evidence that the stock market (or just this cheesy version of it) is more stochastic than adversarial.

The second problem is that we’re treating these stocks as if they were isolated from the rest of the world. Indeed, along with each stock comes some kind of context in the form of information about that stock. Historical prices, corporate announcements, cyclic boom and bust, what the talking heads think, all of this may be relevant to the price fluctuations of a stock on any given day. While Exp3 and UCB1 are ill-equipped to handle such a rich landscape, researchers in bandit learning have recognized the importance of context in decision making. So much so, in fact, that an entire subfield of “Contextual Bandits” was born. John Langford, perhaps the world’s leading expert on bandit learning, wrote on his blog in 2007,

I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.

I tend to agree with him. Bandit problems almost always come with some inherent additional structure in the real world, and the best algorithms will always take advantage of that structure. A “context” associated with each round is perhaps the weakest kind of structure, so it’s a natural place to look for better algorithms.

So that’s what we’ll do in the future of this series. But before then we might decide to come up with another experiment to run Exp3 and UCB1 on. It would be nice to see an instance in which Exp3 seriously outperforms UCB1, but maybe the real world is just stochastic and there’s nothing we can do about it.

Until next time!

Optimism in the Face of Uncertainty: the UCB1 Algorithm

startups

The software world is always atwitter with predictions on the next big piece of technology. And a lot of chatter focuses on what venture capitalists express interest in. As an investor, how do you pick a good company to invest in? Do you notice quirky names like “Kaggle” and “Meebo,” require deep technical abilities, or value a charismatic sales pitch?

This author personally believes we’re not thinking as big as we should be when it comes to innovation in software engineering and computer science, and that as a society we should value big pushes forward much more than we do. But making safe investments is almost always at odds with innovation. And so every venture capitalist faces the following question. When do you focus investment in those companies that have proven to succeed, and when do you explore new options for growth? A successful venture capitalist must strike a fine balance between this kind of exploration and exploitation. Explore too much and you won’t make enough profit to sustain yourself. Narrow your view too much and you will miss out on opportunities whose return surpasses any of your current prospects.

In life and in business there is no correct answer on what to do, partly because we just don’t have a good understanding of how the world works (or markets, or people, or the weather). In mathematics, however, we can meticulously craft settings that have solid answers. In this post we’ll describe one such scenario, the so-called multi-armed bandit problem, and a simple algorithm called UCB1 which performs close to optimally. Then, in a future post, we’ll analyze the algorithm on some real world data.

As usual, all of the code used in the making of this post are available for download on this blog’s Github page.

Multi-Armed Bandits

The multi-armed bandit scenario is simple to describe, and it boils the exploration-exploitation tradeoff down to its purest form.

Suppose you have a set of K actions labeled by the integers \left \{ 1, 2, \dots, K \right \}. We call these actions in the abstract, but in our minds they’re slot machines. We can then play a game where, in each round, we choose an action (a slot machine to play), and we observe the resulting payout. Over many rounds, we might explore the machines by trying some at random. Assuming the machines are not identical, we naturally play machines that seem to pay off well more frequently to try to maximize our total winnings.

Exploit away, you lucky ladies.

Exploit away, you lucky ladies.

This is the most general description of the game we could possibly give, and every bandit learning problem has these two components: actions and rewards. But in order to get to a concrete problem that we can reason about, we need to specify more details. Bandit learning is a large tree of variations and this is the point at which the field ramifies. We presently care about two of the main branches.

How are the rewards produced? There are many ways that the rewards could work. One nice option is to have the rewards for action i be drawn from a fixed distribution D_i (a different reward distribution for each action), and have the draws be independent across rounds and across actions. This is called the stochastic setting and it’s what we’ll use in this post. Just to pique the reader’s interest, here’s the alternative: instead of having the rewards be chosen randomly, have them be adversarial. That is, imagine a casino owner knows your algorithm and your internal beliefs about which machines are best at any given time. He then fixes the payoffs of the slot machines in advance of each round to screw you up! This sounds dismal, because the casino owner could just make all the machines pay nothing every round. But actually we can design good algorithms for this case, but “good” will mean something different than absolute winnings. And so we must ask:

How do we measure success? In both the stochastic and the adversarial setting, we’re going to have a hard time coming up with any theorems about the performance of an algorithm if we care about how much absolute reward is produced. There’s nothing to stop the distributions from having terrible expected payouts, and nothing to stop the casino owner from intentionally giving us no payout. Indeed, the problem lies in our measurement of success. A better measurement, which we can apply to both the stochastic and adversarial settings, is the notion of regret. We’ll give the definition for the stochastic case, and investigate the adversarial case in a future post.

Definition: Given a player algorithm A and a set of actions \left \{1, 2, \dots, K \right \}, the cumulative regret of A in rounds 1, \dots, T is the difference between the expected reward of the best action (the action with the highest expected payout) and the expected reward of A for the first T rounds.

We’ll add some more notation shortly to rephrase this definition in symbols, but the idea is clear: we’re competing against the best action. Had we known it ahead of time, we would have just played it every single round. Our notion of success is not in how well we do absolutely, but in how well we do relative to what is feasible.

Notation

Let’s go ahead and draw up some notation. As before the actions are labeled by integers \left \{ 1, \dots, K \right \}. The reward of action i is a [0,1]-valued random variable X_i distributed according to an unknown distribution and possessing an unknown expected value \mu_i. The game progresses in rounds t = 1, 2, \dots so that in each round we have different random variables X_{i,t} for the reward of action i in round t (in particular, X_{i,t} and X_{i,s} are identically distributed). The X_{i,t} are independent as both t and i vary, although when i varies the distribution changes.

So if we were to play action 2 over and over for T rounds, then the total payoff would be the random variable G_2(T) = \sum_{t=1}^T X_{2,t}. But by independence across rounds and the linearity of expectation, the expected payoff is just \mu_2 T. So we can describe the best action as the action with the highest expected payoff. Define

\displaystyle \mu^* = \max_{1 \leq i \leq K} \mu_i

We call the action which achieves the maximum i^*.

A policy is a randomized algorithm A which picks an action in each round based on the history of chosen actions and observed rewards so far. Define I_t to be the action played by A in round t and P_i(n) to be the number of times we’ve played action i in rounds 1 \leq t \leq n. These are both random variables. Then the cumulative payoff for the algorithm A over the first T rounds, denoted G_A(T), is just

\displaystyle G_A(T) = \sum_{t=1}^T X_{I_t, t}

and its expected value is simply

\displaystyle \mathbb{E}(G_A(T)) = \mu_1 \mathbb{E}(P_1(T)) + \dots + \mu_K \mathbb{E}(P_K(T)).

Here the expectation is taken over all random choices made by the policy and over the distributions of rewards, and indeed both of these can affect how many times a machine is played.

Now the cumulative regret of a policy A after the first T steps, denoted R_A(T) can be written as

\displaystyle R_A(T) = G_{i^*}(T) - G_A(T)

And the goal of the policy designer for this bandit problem is to minimize the expected cumulative regret, which by linearity of expectation is

\mathbb{E}(R_A(T)) = \mu^*T - \mathbb{E}(G_A(T)).

Before we continue, we should note that there are theorems concerning lower bounds for expected cumulative regret. Specifically, for this problem it is known that no algorithm can guarantee an expected cumulative regret better than \Omega(\sqrt{KT}). It is also known that there are algorithms that guarantee no worse than O(\sqrt{KT}) expected regret. The algorithm we’ll see in the next section, however, only guarantees O(\sqrt{KT \log T}). We present it on this blog because of its simplicity and ubiquity in the field.

The UCB1 Algorithm

The policy we examine is called UCB1, and it can be summed up by the principle of optimism in the face of uncertainty. That is, despite our lack of knowledge in what actions are best we will construct an optimistic guess as to how good the expected payoff of each action is, and pick the action with the highest guess. If our guess is wrong, then our optimistic guess will quickly decrease and we’ll be compelled to switch to a different action. But if we pick well, we’ll be able to exploit that action and incur little regret. In this way we balance exploration and exploitation.

The formalism is a bit more detailed than this, because we’ll need to ensure that we don’t rule out good actions that fare poorly early on. Our “optimism” comes in the form of an upper confidence bound (hence the acronym UCB). Specifically, we want to know with high probability that the true expected payoff of an action \mu_i is less than our prescribed upper bound. One general (distribution independent) way to do that is to use the Chernoff-Hoeffding inequality.

As a reminder, suppose Y_1, \dots, Y_n are independent random variables whose values lie in [0,1] and whose expected values are \mu_i. Call Y = \frac{1}{n}\sum_{i}Y_i and \mu = \mathbb{E}(Y) = \frac{1}{n} \sum_{i} \mu_i. Then the Chernoff-Hoeffding inequality gives an exponential upper bound on the probability that the value of Y deviates from its mean. Specifically,

\displaystyle \textup{P}(Y + a < \mu) \leq e^{-2na^2}

For us, the Y_i will be the payoff variables for a single action j in the rounds for which we choose action j. Then the variable Y is just the empirical average payoff for action j over all the times we’ve tried it. Moreover, a is our one-sided upper bound (and as a lower bound, sometimes). We can then solve this equation for a to find an upper bound big enough to be confident that we’re within a of the true mean.

Indeed, if we call n_j the number of times we played action j thus far, then n = n_j in the equation above, and using a = a(j,T) = \sqrt{2 \log(T) / n_j} we get that \textup{P}(Y > \mu + a) \leq T^{-4}, which converges to zero very quickly as the number of rounds played grows. We’ll see this pop up again in the algorithm’s analysis below. But before that note two things. First, assuming we don’t play an action j, its upper bound a grows in the number of rounds. This means that we never permanently rule out an action no matter how poorly it performs. If we get extremely unlucky with the optimal action, we will eventually be convinced to try it again. Second, the probability that our upper bound is wrong decreases in the number of rounds independently of how many times we’ve played the action. That is because our upper bound a(j, T) is getting bigger for actions we haven’t played; any round in which we play an action j, it must be that a(j, T+1) = a(j,T), although the empirical mean will likely change.

With these two facts in mind, we can formally state the algorithm and intuitively understand why it should work.

UCB1:
Play each of the K actions once, giving initial values for empirical mean payoffs \overline{x}_i of each action i.
For each round t = K, K+1, \dots:

Let n_j represent the number of times action j was played so far.
Play the action j maximizing \overline{x}_j + \sqrt{2 \log t / n_j}.
Observe the reward X_{j,t} and update the empirical mean for the chosen action.

And that’s it. Note that we’re being super stateful here: the empirical means x_j change over time, and we’ll leave this update implicit throughout the rest of our discussion (sorry, functional programmers, but the notation is horrendous otherwise).

Before we implement and test this algorithm, let’s go ahead and prove that it achieves nearly optimal regret. The reader uninterested in mathematical details should skip the proof, but the discussion of the theorem itself is important. If one wants to use this algorithm in real life, one needs to understand the guarantees it provides in order to adequately quantify the risk involved in using it.

Theorem: Suppose that UCB1 is run on the bandit game with K actions, each of whose reward distribution X_{i,t} has values in [0,1]. Then its expected cumulative regret after T rounds is at most O(\sqrt{KT \log T}).

Actually, we’ll prove a more specific theorem. Let \Delta_i be the difference \mu^* - \mu_i, where \mu^* is the expected payoff of the best action, and let \Delta be the minimal nonzero \Delta_i. That is, \Delta_i represents how suboptimal an action is and \Delta is the suboptimality of the second best action. These constants are called problem-dependent constants. The theorem we’ll actually prove is:

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret \mathbb{E}(R_{\textup{UCB1}}(T)) is at most

\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left ( 1 + \frac{\pi^2}{3} \right ) \left ( \sum_{j=1}^K \Delta_j \right )

Okay, this looks like one nasty puppy, but it’s actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if \Delta_i is small we will require more tries to know that action i is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the 1 + \pi^2 / 3 part) that caps the number of times we’ll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let’s see how derive the O(\sqrt{KT \log T}) bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret as a function R(\Delta_1, \dots, \Delta_K) (excluding of course \Delta^*), and let’s look at the worst case bound by maximizing it. In particular, we’re just finding the problem with the parameters which screw our bound as badly as possible, The gradient of the regret function is given by

\displaystyle \frac{\partial R}{\partial \Delta_i} = - \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}

and it’s zero if and only if for each i, \Delta_i = \sqrt{\frac{8 \log T}{1 + \pi^2/3}} = O(\sqrt{\log T}). However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the \Delta_i = O(\sqrt{\log T}) (which are all the same) gives a total bound of O(K \sqrt{\log T}). If we look at the only possible endpoint (the \Delta_i = 1), then we get a local maximum of O(K \sqrt{\log T}). But this isn’t the O(\sqrt{KT \log T}) we promised, what gives? Well, this upper bound grows arbitrarily large as the \Delta_i go to zero. But at the same time, if all the \Delta_i are small, then we shouldn’t be incurring much regret because we’ll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the \Delta_i = \Delta are the same, then another trivial regret bound is \Delta T (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at \Delta = O(\sqrt{(K \log T) / T}). Hence the regret bound at the switch is \Delta T = O(\sqrt{KT \log T}), as desired.

Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on P_i(T), the expected number of times UCB chooses an action up to round T. Using the \Delta notation, the regret is then just \sum_i \Delta_i \mathbb{E}(P_i(T)), and bounding the P_i‘s will bound the regret.

Recall the notation for our upper bound a(j, T) = \sqrt{2 \log T / P_j(T)} and let’s loosen it a bit to a(y, T) = \sqrt{2 \log T / y} so that we’re allowed to “pretend” a action has been played y times. Recall further that the random variable I_t has as its value the index of the machine chosen. We denote by \chi(E) the indicator random variable for the event E. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., \overline{x}^* is the empirical mean of the optimal action).

Indeed for any action i, the only way we know how to write down P_i(T) is as

\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where just count the number of rounds in which we pick action i. Now we’re just going to pull some number m-1 of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than m times overall, this requires an inequality.

P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)

These indicator functions should be read as sentences: we’re just saying that we’re picking action i in round t and we’ve already played i at least m times. Now we’re going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we’ve picked action i in round t means that the upper bound for action i exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and i might coincide with the best action, but that’s fine). In notation this event is

\displaystyle \overline{x}_i + a(P_i(t), t-1) \geq \overline{x}^* + a(P^*(T), t-1)

Denote the upper bound \overline{x}_i + a(i,t) for action i in round t by U_i(t). Since this event must occur every time we pick action i (though not necessarily vice versa), we have

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)

We’ll do this process again but with a slightly more complicated event. If the upper bound of action i exceeds that of the optimal machine, it is also the case that the maximum upper bound for action i we’ve seen after the first m trials exceeds the minimum upper bound we’ve seen on the optimal machine (ever). But on round t we don’t know how many times we’ve played the optimal machine, nor do we even know how many times we’ve played machine i (except that it’s more than m). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by \overline{x}_{i,s} the random variable for the empirical mean after playing action i a total of s times, and \overline{x}^*_s the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )

Indeed, at each t for which the max is greater than the min, there will be at least one pair s,s' for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs s, s' for which it happens. That is, we can expand the event above into the double sum which is at least as large:

\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )

We can make one other odd inequality by increasing the sum to go from t=1 to \infty. This will become clear later, but it means we can replace t-1 with t and thus have

\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )

Now that we’ve slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t). Then what can we say? Well, consider the following three events:

(1) \displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)
(2) \displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)
(3) \displaystyle \mu^* < \mu_i + 2a(s, t)

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability t^{-4}. Likewise, (2) is the event that the empirical mean payoff of action i is larger than the upper confidence bound, which also occurs with probability t^{-4}. We will see momentarily that (3) is impossible for a well-chosen m (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \\ \end{matrix}

and

\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \\ \end{matrix}

But putting these two inequalities together gives us precisely that (3) is true:

\mu^* < \mu_i + 2a(s,t)

This proves the claim.

By the union bound, the probability that at least one of these events happens is 2t^{-4} plus whatever the probability of (3) being true is. But as we said, we’ll pick m to make (3) always false. Indeed m depends on which action i is being played, and if s \geq m > 8 \log T / \Delta_i^2 then 2a(s,t) \leq \Delta_i, and by the definition of \Delta_i we have

\mu^* - \mu_i - 2a(s,t) \geq \mu^* - \mu_i - \Delta_i = 0.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \\ \end{aligned}

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to Basel’s problem. Plugging this upper bound in to the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

\square

Implementation and an Experiment

The algorithm is about as simple to write in code as it is in pseudocode. The confidence bound is trivial to implement (though note we index from zero):

def upperBound(step, numPlays):
   return math.sqrt(2 * math.log(step + 1) / numPlays)

And the full algorithm is quite short as well. We define a function ub1, which accepts as input the number of actions and a function reward which accepts as input the index of the action and the time step, and draws from the appropriate reward distribution. Then implementing ub1 is simply a matter of keeping track of empirical averages and an argmax. We implement the function as a Python generator, so one can observe the steps of the algorithm and keep track of the confidence bounds and the cumulative regret.

def ucb1(numActions, reward):
   payoffSums = [0] * numActions
   numPlays = [1] * numActions
   ucbs = [0] * numActions

   # initialize empirical sums
   for t in range(numActions):
      payoffSums[t] = reward(t,t)
      yield t, payoffSums[t], ucbs

   t = numActions

   while True:
      ucbs = [payoffSums[i] / numPlays[i] + upperBound(t, numPlays[i]) for i in range(numActions)]
      action = max(range(numActions), key=lambda i: ucbs[i])
      theReward = reward(action, t)
      numPlays[action] += 1
      payoffSums[action] += theReward

      yield action, theReward, ucbs
      t = t + 1

The heart of the algorithm is the second part, where we compute the upper confidence bounds and pick the action maximizing its bound.

We tested this algorithm on synthetic data. There were ten actions and a million rounds, and the reward distributions for each action were uniform from [0,1], biased by 1/k for some 5 \leq k \leq 15. The regret and theoretical regret bound are given in the graph below.

ucb1-simple-example

The regret of ucb1 run on a simple example. The blue curve is the cumulative regret of the algorithm after a given number of steps. The green curve is the theoretical upper bound on the regret.

Note that both curves are logarithmic, and that the actual regret is quite a lot smaller than the theoretical regret. The code used to produce the example and image are available on this blog’s Github page.

Next Time

One interesting assumption that UCB1 makes in order to do its magic is that the payoffs are stochastic and independent across rounds. Next time we’ll look at an algorithm that assumes the payoffs are instead adversarial, as we described earlier. Surprisingly, in the adversarial case we can do about as well as the stochastic case. Then, we’ll experiment with the two algorithms on a real-world application.

Until then!

Neural Networks and the Backpropagation Algorithm

Neurons, as an Extension of the Perceptron Model

In a previous post in this series we investigated the Perceptron model for determining whether some data was linearly separable. That is, given a data set where the points are labelled in one of two classes, we were interested in finding a hyperplane that separates the classes. In the case of points in the plane, this just reduced to finding lines which separated the points like this:

A hyperplane (the slanted line) separating the blue data points (class -1) from the red data points (class +1)

As we saw last time, the Perceptron model is particularly bad at learning data. More accurately, the Perceptron model is very good at learning linearly separable data, but most kinds of data just happen to more complicated. Even with those disappointing results, there are two interesting generalizations of the Perceptron model that have exploded into huge fields of research. The two generalizations can roughly be described as

  • Use a number of Perceptron models in some sort of conjunction.
  • Use the Perceptron model on some non-linear transformation of the data.

The point of both of these is to introduce some sort of non-linearity into the decision boundary. The first generalization leads to the neural network, and the second leads to the support vector machine. Obviously this post will focus entirely on the first idea, but we plan to cover support vector machines in the near future. Recall further that the separating hyperplane was itself defined by a single vector (a normal vector to the plane) \mathbf{w} = (w_1, \dots, w_n). To “decide” what class the new point \mathbf{x} is in, we check the sign of an inner product with an added constant shifting term:

\displaystyle f(\mathbf{x}) = \textup{sign}(\left \langle \mathbf{w}, \mathbf{x} \right \rangle + b) = \textup{sign} \left ( b + \sum_{i=1}^n w_i x_i \right )

The class of a point is just the value of this function, and as we saw with the Perceptron this corresponds geometrically to which side of the hyperplane the point lies on. Now we can design a “neuron” based on this same formula. We consider a point \mathbf{x} = (x_1, \dots, x_n) to be an input to the neuron, and the output will be the sign of the above sum for some coefficients w_i. In picture form it would look like this:

It is quite useful to literally think of this picture as a directed graph (see this blog’s gentle introduction to graph theory if you don’t know what a graph is). The edges corresponding to the coordinates of the input vector \mathbf{x} have weights w_i, and the output edge corresponds to the sign of the linear combination. If we further enforce the inputs to be binary (that is, \mathbf{x} \in \left \{ 0, 1 \right \}^n), then we get a very nice biological interpretation of the system. If we think of the unit as a neuron, then the input edges correspond to nerve impulses, which can either be on or off (identically to an electrical circuit: there is high current or low current). The weights w_i correspond to the strength of the neuronal connection. The neuron transmits or does not transmit a pulse as output depending on whether the inputs are strong enough.

We’re not quite done, though, because in this interpretation the output of the neuron will either fire or not fire. However, neurons in real life are somewhat more complicated. Specifically, neurons do not fire signals according to a discontinuous function. In addition, we want to use the usual tools from classical calculus to analyze our neuron, but we cannot do that unless the activation function is differentiable, and a prerequisite for that is to be continuous. In plain words, we need to allow our neurons to be able to “partially fire.” We need a small range at which the neuron ramps up quickly from not firing to firing, so that the activation function as a whole is differentiable.

This raises the obvious question: what function should we pick? It turns out that there are a number of possible functions we could use, ranging from polynomial to exponential in nature. But before we pick one in particular, let’s outline the qualities we want such a function to have.

Definition: A function \sigma: \mathbb{R} \to [0,1] is an activation function if it satisfies the following properties:

  • It has a first derivative \sigma'.
  • \sigma is non-decreasing, that is \sigma'(x) \geq 0 for all x
  • \sigma has horizontal asymptotes at both 0 and 1 (and as a consequence, \lim_{x \to \infty} \sigma(x) = 1, and \lim_{x \to -\infty} \sigma(x) = 0).
  • \sigma and \sigma' are both computable functions.

With appropriate shifting and normalizing, there are a few reasonable (and time-tested) activation functions. The two main ones are the hyperbolic tangent \tanh(x) and the sigmoid curve 1/ (1+e^{-t}). They both look (more or less) like this:

A sigmoid function (source: Wikipedia)

And it is easy to see visually that this is what we want.

As a side note, the sigmoid function is actually not used very often in practice for a good reason: it gets too “flat” when the function value approaches 0 or 1. The reason this is bad is because how “flat” the function is (the gradient) will guide the learning process. If the function is very flat, then the network won’t learn as quickly. This will manifest itself in our test later in this post, when we see that a neural network struggles to learn the sine function. It struggles specifically at those values of the function that are close to 1 or -1. Though I don’t want to go into too much detail about this, one alternative that has found a lot of success in deep learning is the “rectified linear unit.” This also breaks the assumption of having a derivative everywhere, so one needs a bit more work to deal with that.

Withholding any discussion of why one would pick one specific activation over another, there is one more small detail. In the Perceptron model we allowed a “bias” b which translated the separating hyperplane so that it need not pass through the origin, hence allowing a the set of all pairs (\mathbf{w}, b) to represent every possible hyperplane. Perhaps the simplest way to incorporate the bias into this model is to add another input x_0 which is fixed to 1. Then we add a weight w_0, and it is easy to see that the constant b can just be replaced with the weight w_0. In other words, the inner product b \cdot 1 + \left \langle \mathbf{w}, \mathbf{x} \right \rangle is the same as the inner product of two new vectors \mathbf{w}', \mathbf{x}' where we set w'_0 = b and x'_0 = 1 and w'_i = w_i, x'_i = x_i for all other i.

The updated picture is now:

Now the specification of a single neuron is complete:

Definition: neuron N is a pair (W, \sigma), where W is a list of weights (w_0, w_1, \dots, w_k), and \sigma is an activation function. The impulse function of a neuron N, which we will denote f_N: \left \{ 0,1 \right \}^{k+1} \to [0,1], is defined as

\displaystyle f_N(\mathbf{x}) = \sigma(\left \langle \mathbf{w}, \mathbf{x} \right \rangle) = \sigma \left (\sum_{i=0}^k w_i x_i \right )

We call w_0 the bias weight, and by convention the first input coordinate x_0 is fixed to 1 for all inputs \mathbf{x}.

(Since we always fix the first input to 1, f_N is technically a function \left \{ 0,1 \right \}^k \to \left \{ 0,1 \right \}, but the reader will forgive us for blurring these details.)

Combining Neurons into a Network

The question of how to “train” a single neuron is just a reformulation of the Perceptron problem. If we have a data set \mathbf{x}_i with class labels y_i = \pm 1, we want to update the weights of a neuron so that the outputs agree with their class labels; that is, f_N(\mathbf{x}_i) = y_i for all i. And we saw in the Perceptron how to do this: it’s fast and efficient, given that the data are linearly separable. And in fact training a neuron in this model (accounting for the new activation function) will give us identical decision functions as in the Perceptron model. All we have done so far is change our perspective from geometry to biology. But as we mentioned originally, we want to form a mathematical army of neurons, all working together to form a more powerful decision function.

The question is what form should this army take? Since we already thought of a single neuron as a graph, let’s generalize this. Instead of having a bunch of “input” vertices, a single “output” vertex, and one neuron doing the computation, we now have the same set of input vertices, the same output vertex, but now a number of intermediate neurons connected arbitrarily to each other. That is, the edges that are outputs of some neurons are connected to the inputs of other neurons, and the very last neuron’s output is the final output. We call such a construction a neural network.

For example, the following graph gives a neural network with 5 neurons

To compute the output of any neuron N, we need to compute the values of the impulse functions for each neuron whose output feeds into N. This in turn requires computing the values of the impulse functions for each of the inputs to those neurons, and so on. If we imagine electric current flowing through such a structure, we can view it as a kind of network flow problem, which is where the name “neural networks” comes from. This structure is also called a dependency graph, and (in the parlance of graph theory) a directed acyclic graph. Though nothing technical about these structures will show up in this particular post, we plan in the future to provide primers on their basic theories.

We remark that we view the above picture as a directed graph with the directed edges going upwards. And as in the picture, the incidence structure (which pairs of neurons are connected or not connected) of the graph is totally arbitrary, as long as it has no cycles. Note that this is in contrast to the classical idea of a neural network as “layered” with one or more intermediate layers, such that all neurons in neighboring layers are completely connected to one another.  Hence we will take a slightly more general approach in this post.

Now the question of training a network of interconnected neurons is significantly more complicated than that of training a single neuron. The algorithm to do so is called backpropagation, because we will check to see if the final output is an error, and if it is we will propagate the error backward through the network, updating weights as we go. But before we get there, let’s explore some motivation for the algorithm.

The Backpropagation Algorithm – Single Neuron

Let us return to the case of a single neuron N with weights \mathbf{w} = (w_0, \dots , w_k) and an input \mathbf{x} = (x_0, \dots, x_k). And momentarily, let us remove the activation function \sigma from the picture (so that f_N just computes the summation part). In this simplified world it is easy to take a given set of training inputs \mathbf{x}_j with labels y_j \in \left \{ 0,1 \right \} and compute the error of our neuron N on the entire training set. A standard mathematical way to compute error is by sum of the squared deviations of our neuron’s output from the actual label.

\displaystyle E(\mathbf{w}) = \sum_j (y_j - f_N(\mathbf{x}_j))^2

The important part is that E is a function just of the weights w_i. In other words, the set of weights completely specifies the behavior of a single neuron.

Enter calculus. Any time we have a multivariate function (here, each of the weights w_i is a variable), then we can speak of its minima and maxima. In our case we strive to find a global minimum of the error function E, for then we would have learned our target classification function as well as possible. Indeed, to improve upon our current set of weights, we can use the standard gradient-descent algorithm. We have discussed versions of the gradient-descent algorithm on this blog before, as in our posts on decrypting substitution ciphers with n-grams and finding optimal stackings in Texas Hold ‘Em. We didn’t work with calculus there because the spaces involved were all discrete. But here we will eventually extend this error function to allow the inputs x_i to be real-valued instead of binary, and so we need the full power of calculus. Luckily for the uninformed reader, the concept of gradient descent is the same in both cases. Since E gives us a real number for each possible neuron (each choice of weights), we can take our current neuron and make it better it by changing the weights slightly, and ensuring our change gives us a smaller value under E. If we cannot ensure this, then we have reached a minimum.

Here are the details. For convenience we add a factor of 1/2 to E and drop the subscript N from f_N. Since minimizing E is the same as minimizing \frac{1}{2}E, this changes nothing about the minima of the function. That is, we will henceforth work with

\displaystyle E(\mathbf{w}) = \frac{1}{2} \sum_j (y_j - f(\mathbf{x}_j))^2

Then we compute the gradient \nabla E of E. For fixed values of the variables w_i (our current set of weights) this is a vector in \mathbb{R}^n, and as we know from calculus it points in the direction of steepest ascent of the function E. That is, if we subtract some sufficiently small multiple of this vector from our current weight vector, we will be closer to a minimum of the error function than we were before. If we were to add, we’d go toward a maximum.

Note that E is never negative, and so it will have a global minimum value at or near 0 (if it is possible for the neuron to represent the target function perfectly, it will be zero). That is, our update rule should be

\displaystyle \mathbf{w}_\textup{current} = \mathbf{w}_\textup{current} - \eta \nabla E(\mathbf{w}_\textup{current})

where \eta is some fixed parameter between 0 and 1 that represent the “learning rate.” We will not mention \eta too much except to say that as long as it is sufficiently small and we allow ourselves enough time to learn, we are guaranteed to get a good approximation of some local minimum (though it might not be a global one).

With this update rule it suffices to compute \nabla E explicitly.

\displaystyle \nabla E = \left ( \frac{\partial E}{\partial w_0}, \dots, \frac{\partial E}{\partial w_n} \right )

In each partial \partial E / \partial w_i we consider each other variable beside w_i to be constant, and combining this with the chain rule gives

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) \frac{\partial f}{\partial w_i}

Since in the summation formula for f the w_i variable only shows up in the product w_i x_{j,i} (where x_{j,i} is the i-th term of the vector \mathbf{x}_j), the last part expands as x_{j,i}. i.e. we have

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - f(\mathbf{x}_j)) x_{j,i}

Noting the negatives cancelling, this makes our update rule just

\displaystyle w_i = w_i + \eta \sum_j (y_j - f(\mathbf{x}_j))x_{j,i}

There is an alternative form of an update rule that allows one to update the weights after each individual input is tested (as opposed to checking the outputs of the entire training set). This is called the stochastic update rule, and is given identically as above but without summing over all j:

\mathbf{w} = \mathbf{w} + \eta (y_j - f(\mathbf{x}_j)) \mathbf{x}_j

For our purposes in this post, the stochastic and non-stochastic update rules will give identical results.

Adding in the activation function \sigma is not hard, but we will choose our \sigma so that it has particularly nice computability properties. Specifically, we will pick \sigma = 1/(1+e^{-x}) the sigmoid function, because it satisfies the identity

\sigma'(x) = \sigma(x)(1-\sigma(x))

So instead of \partial f in the formula above we need \partial (\sigma \circ f) / \partial w_i, and this requires the chain rule once again:

\displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma'(f(\mathbf{x}_j))x_{j,i}

And using the identity for \sigma' gives us

 \displaystyle \frac{\partial E}{\partial w_i} = - \sum_j (y_j - \sigma(f(\mathbf{x}_j))) \sigma(f(\mathbf{x}_j))(1-\sigma(f(\mathbf{x}_j)))x_{j,i}

And a similar update rule as before. If we denote by o_j the output value \sigma(f(\mathbf{x}_j)), then the stochastic version of this update rule is

\mathbf{w} = \mathbf{w} + \eta o_j (1-o_j)(y_j - o_j)\mathbf{x}_j

Now that we have motivated an update rule for a single neuron, let’s see how to apply this to an entire network of neurons.

The Backpropagation Algorithm – Entire Network

There is a glaring problem in training a neural network using the update rule above. We don’t know what the “expected” output of any of the internal edges in the graph are. In order to compute the error we need to know what the correct output should be, but we don’t immediately have this information.

We don't know the error value for a non-output node in the network.

We don’t know the error value for a non-output node in the network.

In the picture above, we know the expected value of the edge leaving the node N_2, but not that of N_1. In order to compute the error for N_1, we need to derive some kind of error value for nodes in the middle of the network.

It seems reasonable that the error for N_1 should depend on the errors of the nodes for which N_1 provides an input. That is, in the following picture the error should come from all of the neurons M_1, \dots M_n.

weightedsumerror

In particular, one possible error value for a particular input to the entire network \mathbf{x} would be a weighted sum over the errors of M_i, where the weights are the weights of the edges from N_1 to M_i. In other words, if N_1 has little effect on the output of one particular M_i, it shouldn’t assume too much responsibility for that error. That is, using the above picture, the error for N_1 (in terms of the input weights v_i) is

\displaystyle \sum_i w_i E_{M_i}

where E_{M_i} is the error computed for the node M_i.

It turns out that there is a nice theoretical justification for using this quantity as well. In particular, if we think of the entire network as a single function, we can imagine the error E(\mathbf{w}) as being a very convoluted function of all the weights in the network. But no matter how confusing the function may be to write down, we know that it only involves addition, multiplication, and composition of differentiable functions. So if we want to know how to update the error with respect to a weight that is hidden very far down in the network, in theory it just requires enough applications of the chain rule to find it.

To see this, let’s say we have a nodes N_{1,k} connected forward to nodes N_{2,j} connected forward to nodes N_{3,i}, such that the weights w_{k,j} represent weights going from N_{1,k} \to N_{2,j}, and weights w_{j,i} are N_{2,j} \to N_{3,i}.

If we want to know the partial derivative of E with respect to the deeply nested weight w_{k,j}, then we can just compute it:

\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\frac{\partial o_i}{\partial w_{k,j}}

where o_i = f_{N_{1,j}}(\dots) = \sigma(f(\dots)) represents the value of the impulse function at each of the output neurons, in terms of a bunch of crazy summations we omit for clarity.

But after applying the chain rule, the partial of the inner summation f = \sum_j w_{j,i}x_k only depends on w_{k,j} via the coefficient w_{j,i}. i.e., the weight w_{k,j} only affects node N_{3,i} by the output of N_{2,j} passing through the edge labeled w_{j,i}. So we get a sum

\displaystyle \frac{\partial E}{\partial w_{k,j}} = -\sum_i (y_i - o_i)\sigma'(f(\dots)) w_{j,i}

That is, it’s simply a weighted sum of the final errors y_i - o_i by the right weights. The stuff inside the f(\dots) is simply the output of that node, which is again a sum over its inputs. In stochastic form, this makes our update rule (for the weights of N_j) just

\displaystyle \mathbf{w} = \mathbf{w} + \eta o_j(1-o_j) \left ( \sum_i w_{j,i} (y_i - o_i) \right ) \mathbf{z}

where by \mathbf{z} we denote the vector of inputs to the neuron in question (these may be the original input \mathbf{x} if this neuron is the first in the network and all of the inputs are connected to it, or it may be the outputs of other neurons feeding into it).

The argument we gave only really holds for a network where there are only two edges from the input to the output.  But the reader who has mastered the art of juggling notation may easily generalize this via induction to prove it in general. This really is a sensible weight update for any neuron in the network.

And now that we have established our update rule, the backpropagation algorithm for training a neural network becomes relatively straightforward. Start by initializing the weights in the network at random. Evaluate an input \mathbf{x} by feeding it forward through the network and recording at each internal node the output value o_j, and call the final output o. Then compute the error for that output value, propagate the error back to each of the nodes feeding into the output node, and update the weights for the output node using our update rule. Repeat this error propagation followed by a weight update for each of the nodes feeding into the output node in the same way, compute the updates for the nodes feeding into those nodes, and so on until the weights of the entire network are updated. Then repeat with a new input \mathbf{x}.

One minor issue is when to stop. Specifically, it won’t be the case that we only need to evaluate each input \mathbf{x} exactly once. Depending on how the learning parameter \eta is set, we may need to evaluate the entire training set many times! Indeed, we should only stop when the gradient for all of our examples is small, or we have run it for long enough to exhaust our patience. For simplicity we will ignore checking for a small gradient, and we will simply fix a number of iterations. We leave the gradient check as an exercise to the reader.

Then the result is a trained network, which we can further use to evaluate the labels for unknown inputs.

Python Implementation

We now turn to implementing a neural network. As usual, all of the source code used in this post (and then some) is available on this blog’s Github page.

The first thing we need to implement all of this is a data structure for a network. That is, we need to represent nodes and edges connecting nodes. Moreover each edge needs to have an associated value, and each node needs to store multiple values (the error that has been propagated back, and the output produced at that node). So we will represent this via two classes:

class Node:
   def __init__(self):
      self.lastOutput = None
      self.lastInput = None
      self.error = None
      self.outgoingEdges = []
      self.incomingEdges = []

class Edge:
   def __init__(self, source, target):
      self.weight = random.uniform(0,1)
      self.source = source
      self.target = target

      # attach the edges to its nodes
      source.outgoingEdges.append(self)
      target.incomingEdges.append(self)

Then a neural network is represented by the set of input and output nodes.

class Network:
   def __init__(self):
      self.inputNodes = []
      self.outputNode = None

In particular, each Node needs to know about its most recent output, input, and error in order to update its weights. So any time we evaluate some input, we need to store these values in the Node. We will progressively fill in these classes with the methods needed to evaluate and train the network on data. But before we can do anything, we need to be able to distinguish between an input node and a node internal to the network. For this we create a subclass of Node called InputNode:

class InputNode(Node):
   def __init__(self, index):
      Node.__init__(self)
      self.index = index; # the index of the input vector corresponding to this node

And now here is a function which evaluates a given input when provided a neural network:

class Network:
   ...

   def evaluate(self, inputVector):
      return self.outputNode.evaluate(inputVector)

class Node:
   ...

   def evaluate(self, inputVector):
      self.lastInput = []
      weightedSum = 0

      for e in self.incomingEdges:
         theInput = e.source.evaluate(inputVector)
         self.lastInput.append(theInput)
         weightedSum += e.weight * theInput

      self.lastOutput = activationFunction(weightedSum)
      return self.lastOutput

class InputNode(Node):
   ...

   def evaluate(self, inputVector):
      self.lastOutput = inputVector[self.index]
      return self.lastOutput

A network calls the evaluate function on its output node, and each node recursively calls evaluate on the sources of each of its incoming edges. An InputNode simply returns the corresponding entry in the inputVector (which requires us to pass the input vector along through the recursive calls). Since our graph structure is arbitrary, we note that some nodes may be “evaluated” more than once per evaluation. As such, we need to store the node’s output for the duration of the evaluation. We also need to store this value for use in training, and so before a call to evaluate we must clear this value. We omit the details here for brevity.

We will use the evaluate() function for training as well as for evaluating unknown inputs. As usual, examples of using these classes (and tests) are available in the full source code on this blog’s Github page.

In addition, we need to automatically add bias nodes and corresponding edges to the non-input nodes. This results in a new subclass of Node which has a default evaluate() value of 1. Because of the way we organized things, the existence of this class changes nothing about the training algorithm.

class BiasNode(InputNode):
   def __init__(self):
      Node.__init__(self)

   def evaluate(self, inputVector):
      return 1.0

We simply add a function to the Node class (which is overridden in the InputNode class) which adds a bias node and edge to every non-input node. The details are trivial; the reader may see them in the full source code.

The training algorithm will come in a loop consisting of three steps: first, evaluate an input example. Second, go through the network updating the error values of each node using backpropagation. Third, go through the network again to update the weights of the edges appropriately. This can be written as a very short function on a Network, which then requires a number of longer functions on the Node classes:

class Network:
   ...

   def propagateError(self, label):
      for node in self.inputNodes:
         node.getError(label)

   def updateWeights(self, learningRate):
      for node in self.inputNodes:
         node.updateWeights(learningRate)

   def train(self, labeledExamples, learningRate=0.9, maxIterations=10000):
      while maxIterations > 0:
         for example, label in labeledExamples:
            output = self.evaluate(example)
            self.propagateError(label)
            self.updateWeights(learningRate)

            maxIterations -= 1

That is, the network class just makes the first call for each of these recursive processes. We then have the following functions on Nodes to implement getError and updateWeights:

class InputNode(Node):
   ...

   def updateWeights(self, learningRate):
      for edge in self.outgoingEdges:
         edge.target.updateWeights(learningRate)

   def getError(self, label):
      for edge in self.outgoingEdges:
         edge.target.getError(label)

class Node:
   ...

   def getError(self, label):
      if self.error is not None:
         return self.error

      if self.outgoingEdges == []: # this is an output node
         self.error = label - self.lastOutput
      else:
         self.error = sum([edge.weight * edge.target.getError(label)
                           for edge in self.outgoingEdges])

      return self.error

   def updateWeights(self, learningRate):
      if (self.error is not None and self.lastOutput is not None
            and self.lastInput is not None):

         for i, edge in enumerate(self.incomingEdges):
            edge.weight += (learningRate * self.lastOutput * (1 - self.lastOutput) *
                           self.error * self.lastInput[i])

         for edge in self.outgoingEdges:
            edge.target.updateWeights(learningRate)

         self.error = None
         self.lastInput = None
         self.lastOutput = None

These are simply the formulas we derived in the previous sections translated into code. The propagated error is computed as a weighted sum in getError(), and the previous input and output values were saved from the call to evaluate().

A Sine Curve Example, and Issues

One simple example we can use to illustrate this is actually not a decision problem, per se, but a function estimation problem. In the course of all of this calculus, we implicitly allowed our neural network to output any values between 0 and 1 (indeed, the activation function did this for us). And so we can use a neural network to approximate any function which has values in [0,1]. In particular we will try this on

\displaystyle f(x) = \frac{1}{2}(\sin(x) + 1)

on the domain [0,4 \pi ].

Our network is simple: we have a single layer of twenty neurons, each of which is connected to a single input neuron and a single output neuron. The learning rate is set to 0.25, the number of iterations is set to a hundred thousand, and the training set is randomly sampled from the domain.

After training (which takes around fifteen seconds), the average error (when tested against a new random sample) is between 0.03 and 0.06. Here is an example of one such output:

An example of a 20-node neural network approximating two periods of a sine function.

An example of a 20-node neural network approximating two periods of a sine function.

This picture hints at an important shortcoming of our algorithm. Note how the neural network’s approximation of the sine function does particularly poorly close to 0 and 1. This is not a coincidence, but rather a side effect of our activation function \sigma. In particular, because the sigmoid function achieves the values 0 and 1 only in the limit. That is, they never actually achieve 0 and 1, and in order to get close we require prohibitively large weights (which in turn correspond to rather large values to be fed to the activation function). One potential solution is to modify our sine function slightly more, by scaling it and translating it so that its values lie in [0.1, 0.9], say. We leave this as an exercise to the reader.

As one might expect, the neural network also does better when we test it on a single period instead of two (since the sine function is less “complicated” on a single period). We also constructed a data set of binary numbers whose labels were 1 if the number was even and 0 if the number was odd. A similar layout to the sine example with three internal nodes again gave good results.

The issues arise on larger datasets. One big problem with training a neural network is that it’s near impossible to determine the “correct” structure of the network ahead of time. The success of our sine function example, for instance, depended much more than we anticipated on the number of nodes used. Of course, this also depends on the choice of learning rate and the number of iterations allowed, but the point is the same: the neural network is fraught with arbitrary choices. What’s worse is that it’s just as impossible to tell if your choices are justified. All you have is an empirical number to determine how well your network does on one training set, and inspecting the values of the various weights will tell you nothing in all but the most trivial of examples.

There are a number of researchers who have attempted to alleviate this problem in some way. One prominent example is the Cascade Correlation algorithm, which dynamically builds the network structure depending on the data. Other avenues include dynamically updating the learning rate and using a variety of other activation and error functions, based on information theory and game theory (adding penalties for various undesirable properties). Still other methods involve alternative weight updates based on more advanced optimization techniques (such as the conjugate gradient method). Part of the benefit of the backpropagation algorithm is that the choice of error function is irrelevant, as long as it is differentiable. This gives us a lot of flexibility to customize the neural network for our own application domain.

These sorts of questions are what have caused neural networks to become such a huge field of research in machine learning. As such, this blog post has only given the reader a small taste of what is out there. This is the bread and butter in a world of fine cuisine: it’s proven to be a solid choice, but it leaves a cornucopia of flavors unrealized and untapped.

The future of this machine learning series, however, will deviate from the neural network path. We will instead investigate the other extension of the Perceptron model, the Support Vector Machine. We will also lay down some formal theories of learning, because as of yet we have simply been exploring algorithms without the ability to give guarantees on their performance. In order to do that, we must needs formalize the notion of an algorithm “learning” a task. This is no small feat, and nobody has quite agreed on the best formalization. We will nevertheless explore these frameworks, and see what kinds of theorems we can prove in them.

Until then!