**Problem:** You have a catalog of items with discrete ratings (thumbs up/thumbs down, or 5-star ratings, etc.), and you want to display them in the “right” order.

**Solution:** In Python

```python
def score(ratings, rating_prior, rating_utility):
    '''
    score: [int], [int], [float] -> float

    Return the expected value of the rating for an item with known
    ratings specified by `ratings`, prior belief specified by
    `rating_prior`, and a utility function specified by `rating_utility`,
    assuming the ratings are a multinomial distribution and the prior
    belief is a Dirichlet distribution.
    '''
    # Add the prior's pseudocounts to the observed counts.
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    # Weight each combined count by the utility of its rating.
    score = sum(r * u for (r, u) in zip(ratings, rating_utility))
    return score / sum(ratings)
```

**Discussion:** This deceptively short solution can lead you on a long and winding path into the depths of statistics. I will do my best to give a short, clear version of the story.

As a working example I chose merely because I recently listened to a related podcast, say you’re selling mass-market romance novels—which, by all accounts, is a predictable genre. You have a list of books, each of which has been rated on a scale of 0-5 stars by some number of users. You want to display the top books first, so that time-constrained readers can experience the most titillating novels first, and newbies to the genre can get the best first time experience and be incentivized to buy more.

The setup required to arrive at the above code is the following, which I’ll phrase as a story.

Users’ feelings about a book, and subsequent votes, are independent draws from a known distribution (with unknown parameters). I will just call these distributions “discrete” distributions. So given a book and user, there is some unknown list $ (p_0, p_1, p_2, p_3, p_4, p_5)$ of probabilities ($ \sum_i p_i = 1$) for each possible rating a user could give for that book.

But how do users get these probabilities? In this story, the probabilities are the output of a randomized procedure that generates distributions. That modeling assumption is called a “Dirichlet prior,” with *Dirichlet* meaning it generates discrete distributions, and *prior* meaning it encodes domain-specific information (such as the fraction of 4-star ratings for a typical romance novel).

So the story is you have a book, and that book gets a Dirichlet distribution (unknown to us), and then when a user comes along they sample from the Dirichlet distribution to get a discrete distribution, which they then draw from to choose a rating. We observe the ratings, and we need to find the book’s underlying Dirichlet. We start by assigning it some default Dirichlet (the prior) and update that Dirichlet as we observe new ratings. Some other assumptions:

- Books are indistinguishable except in the parameters of their Dirichlet distribution.
- The parameters of a book’s Dirichlet distribution don’t change over time, and inherently reflect the book’s value.

So a Dirichlet distribution is a process that produces discrete distributions. For simplicity, in this post we will say a Dirichlet distribution is parameterized by a list of six integers $ (n_0, \dots, n_5)$, one for each possible star rating. These values represent our belief in the “typical” distribution of votes for a new book. We’ll discuss more about how to set the values later. Sampling a value (a book’s list of probabilities) from the Dirichlet distribution is not trivial, but we don’t need to do that for this program. Rather, we need to be able to interpret a fixed Dirichlet distribution, and update it given some observed votes.
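We won’t need sampling for the ranking code, but if you want to play with the story above, here is a minimal sketch of one standard way to do it: draw an independent Gamma value for each parameter and normalize. The function names, the pseudocounts, and the fixed seed are all made up for illustration.

```python
import random


def sample_dirichlet(pseudocounts, rng):
    """Draw one discrete distribution from a Dirichlet.

    Standard construction: draw independent Gamma(n_i, 1) values and
    normalize them to sum to 1. Every parameter must be positive.
    """
    draws = [rng.gammavariate(n_i, 1.0) for n_i in pseudocounts]
    total = sum(draws)
    return [d / total for d in draws]


def sample_rating(distribution, rng):
    """Draw one star rating (an index, 0 through 5) from a discrete distribution."""
    return rng.choices(range(len(distribution)), weights=distribution, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical Dirichlet parameters for 0 through 5 stars.
book_probabilities = sample_dirichlet([1, 2, 4, 8, 6, 3], rng)

# Ten simulated users voting on the book.
votes = [sample_rating(book_probabilities, rng) for _ in range(10)]
```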

The interpretation we use for a Dirichlet distribution is its expected value, which, recall, is the parameters of a discrete distribution. In particular if $ n = \sum_i n_i$, then the expected value is a discrete distribution whose probabilities are

$ \displaystyle \left ( \frac{n_0}{n}, \frac{n_1}{n}, \dots, \frac{n_5}{n} \right )$

So you can think of each integer in the specification of a Dirichlet as “ghost ratings,” sometimes called *pseudocounts*, and we’re saying the probability is proportional to the count.
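As a quick sanity check, the expected distribution is just the pseudocounts divided by their total. A minimal sketch, with a hypothetical function name and made-up pseudocounts:

```python
def expected_distribution(pseudocounts):
    """Return the expected discrete distribution of a Dirichlet.

    Each probability is a pseudocount divided by the total count.
    """
    total = sum(pseudocounts)
    return [n_i / total for n_i in pseudocounts]


# Hypothetical pseudocounts for 0 through 5 stars (20 ghost ratings in total).
print(expected_distribution([1, 1, 2, 4, 8, 4]))
# [0.05, 0.05, 0.1, 0.2, 0.4, 0.2]
```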

This is great, because if we knew the true Dirichlet distribution for a book, we could compute its ranking without a second thought. The ranking would simply be the expected star rating:

```python
def simple_score(distribution):
    return sum(i * p for (i, p) in enumerate(distribution))
```

Putting books with the highest score on top would maximize the expected happiness of a user visiting the site, provided that happiness matches the user’s voting behavior, since the simple_score *is* just the expected vote.

Also note that all the rating system needs to make this work is that the rating options are linearly ordered. So a thumbs up/down (heaving bosom/flaccid member?) would work, too. We don’t need to know *how* happy it makes them to see a 5-star vs 4-star book. However, as we’ll see next, we have to approximate the distribution, and so the scores of books with only a few ratings are uncertain. For that reason it helps to incorporate numerical utility values (we’ll see this at the end).

Next, to update a given Dirichlet distribution with the results of some observed ratings, we have to dig a bit deeper into Bayes rule and the formulas for sampling from a Dirichlet distribution. Rather than do that, I’ll point you to this nice writeup by Jonathan Huang, where the core of the derivation is in Section 2.3 (page 4), and remark that the rule for updating for a new observation is to just add it to the existing counts.

**Theorem:** Given a Dirichlet distribution with parameters $ (n_1, \dots, n_k)$ and a new observation of outcome $ i$, the updated Dirichlet distribution has parameters $ (n_1, \dots, n_{i-1}, n_i + 1, n_{i+1}, \dots, n_k)$. That is, you just update the $ i$-th entry by adding $ 1$ to it.
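In code the update is a one-liner. Here is a minimal sketch, using a hypothetical `update` function and made-up counts. Note that the theorem indexes outcomes from 1, while the list below is indexed from 0 so that the index equals the star rating.

```python
def update(pseudocounts, observed_rating):
    """Return new Dirichlet parameters after observing one rating.

    `observed_rating` is the index of the outcome (0-5 for star ratings).
    The input list is left unchanged.
    """
    updated = list(pseudocounts)
    updated[observed_rating] += 1
    return updated


prior = [1, 1, 2, 4, 8, 4]  # hypothetical prior for 0 through 5 stars
posterior = prior
for rating in [5, 5, 3]:  # three observed votes
    posterior = update(posterior, rating)

print(posterior)  # [1, 1, 2, 5, 8, 6]
```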

This particular arithmetic to do the update is a mathematical consequence (derived in the link above) of the *philosophical* assumption that Bayes rule is how you should model your beliefs about uncertainty, coupled with the assumption that the Dirichlet process is how the users actually arrive at their votes.
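If you’d rather not follow the link, here is a short sketch of the standard argument, stated in the pseudocount parameterization of this post. The notation $ p = (p_1, \dots, p_k)$ for the unknown discrete distribution is my own choice here, not taken from the writeup. The Dirichlet density is proportional to $ \prod_j p_j^{n_j - 1}$, and the probability of observing outcome $ i$ given $ p$ is just $ p_i$, so Bayes rule gives

$ \displaystyle \Pr(p \mid \text{outcome } i) \propto p_i \cdot \prod_{j=1}^k p_j^{n_j - 1} = p_i^{(n_i + 1) - 1} \prod_{j \neq i} p_j^{n_j - 1}$

The right-hand side has the same form as a Dirichlet density, with $ n_i$ replaced by $ n_i + 1$ and every other parameter unchanged.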

The initial values $ (n_0, \dots, n_5)$ for star ratings should be picked so that they represent the average rating distribution among all prior books, since this is used as the default voting distribution for a new, unknown book. If you have more information about whether a book is likely to be popular, you can use a different prior. For example, if JK Rowling wrote a Harry Potter Romance novel that was part of the canon, you could pretty much guarantee it would be popular, and set $ n_5$ high compared to $ n_0$. Of course, if it were actually popular you could just wait for the good ratings to stream in, so tinkering with these values on a per-book basis might not help much. On the other hand, most books by unknown authors are bad, and $ n_5$ should be close to zero. Selecting a prior dictates how influential ratings of new items are compared to ratings of items with many votes. The more pseudocounts you add to the prior, the less new votes count.
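One possible way to build such a prior is to pool the ratings of all previously rated books and rescale them to a chosen total number of pseudocounts. This is a sketch under my own assumptions: `build_prior`, `strength`, and the tiny catalog are hypothetical, and pooling the raw counts is only one reading of “average rating distribution.” The pseudocounts come out fractional, which is fine since Dirichlet parameters only need to be positive.

```python
def build_prior(catalog_ratings, strength):
    """Build pseudocounts from the pooled ratings of earlier books.

    `catalog_ratings` holds one list of rating counts per book, and
    `strength` is the total number of pseudocounts to hand out.
    """
    pooled = [sum(counts) for counts in zip(*catalog_ratings)]
    total = sum(pooled)
    return [strength * count / total for count in pooled]


def expected_stars(ratings, prior):
    """Expected star rating after adding the prior's pseudocounts."""
    counts = [r + p for r, p in zip(ratings, prior)]
    return sum(i * c for i, c in enumerate(counts)) / sum(counts)


# Hypothetical catalog: rating counts for 0 through 5 stars, one row per book.
catalog = [
    [0, 1, 2, 5, 8, 4],
    [2, 2, 4, 6, 4, 2],
]

weak_prior = build_prior(catalog, strength=4)
strong_prior = build_prior(catalog, strength=40)

# A new book with a single 5-star vote: the weak prior lets that one
# vote move the score further away from the catalog average.
one_five_star = [0, 0, 0, 0, 0, 1]
weak_score = expected_stars(one_five_star, weak_prior)
strong_score = expected_stars(one_five_star, strong_prior)
```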

This gets us to the following code for star ratings.

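```python
def score(ratings, rating_prior):
    # Add the prior's pseudocounts to the observed counts.
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    # Expected star rating: each star value weighted by its combined count.
    score = sum(i * u for (i, u) in enumerate(ratings))
    return score / sum(ratings)
```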
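To see what the prior buys us, here is a small sketch that ranks two made-up books. The function is restated as a plain function so the snippet runs on its own; the titles, vote counts, and prior are all hypothetical. A raw average would put the book with a single perfect vote on top, while the prior-smoothed score does not.

```python
def score(ratings, rating_prior):
    """Expected star rating after adding the prior's pseudocounts."""
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    total = sum(i * count for (i, count) in enumerate(ratings))
    return total / sum(ratings)


# Hypothetical prior and rating counts for 0 through 5 stars.
prior = [1, 1, 2, 4, 8, 4]
books = {
    "one_perfect_vote": [0, 0, 0, 0, 0, 1],
    "many_good_votes": [2, 3, 10, 30, 80, 60],
}

# Highest score first.
ranked = sorted(books, key=lambda title: score(books[title], prior), reverse=True)
print(ranked)  # ['many_good_votes', 'one_perfect_vote']
```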

The only thing missing from the solution at the beginning is the utilities. The utilities are useful for two reasons. First, because books with few ratings encode a lot of uncertainty, having an idea about how extreme a feeling is implied by a specific rating allows one to give better rankings of new books.

Second, for many services, such as taxi rides on Lyft, the default star rating tends to be 5 stars, and a rating of 4 stars or lower means something went wrong. For books, 3-4 stars is the default, while 5 stars means you were very happy.

The utilities parameter allows you to weight rating outcomes appropriately. So if you are in a Lyft-like scenario, you might specify utilities like [-10, -5, -3, -2, 1] to denote that a 4-star rating has the same negative impact as two 5-star ratings would positively contribute. On the other hand, for books the gap between 4-star and 5-star is much less than the gap between 3-star and 4-star. The utilities simply allow you to calibrate how the votes should be valued in comparison to each other, instead of using their literal star counts.
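Here is a minimal sketch of the Lyft-like case, using the utilities above with the three-argument `score` restated as a plain function so the snippet runs on its own. The prior and the two drivers’ rating counts (for 1 through 5 stars) are made up for illustration. The driver whose ratings are 40% 4-star has a 4.6-star raw average, yet ends up with a negative score.

```python
def score(ratings, rating_prior, rating_utility):
    """Expected utility after adding the prior's pseudocounts."""
    ratings = [r + p for (r, p) in zip(ratings, rating_prior)]
    total = sum(r * u for (r, u) in zip(ratings, rating_utility))
    return total / sum(ratings)


# Utilities for 1 through 5 stars: one 4-star vote cancels two 5-star votes.
lyft_utility = [-10, -5, -3, -2, 1]

# Hypothetical prior: most rides get 5 stars by default.
prior = [1, 1, 1, 2, 15]

# Hypothetical rating counts for 1 through 5 stars.
mostly_fours = [0, 0, 0, 40, 60]
mostly_fives = [0, 0, 0, 5, 95]

fours_score = score(mostly_fours, prior, lyft_utility)
fives_score = score(mostly_fives, prior, lyft_utility)
```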

So, basically, for a distribution $(n_0, \dots, n_5)$ of ratings, with $n = n_0 + \dots + n_5$, we just say that the new user will give it a score of $(0 * n_0 + \dots + 5 * n_5) / n$? I’m not really sure the Dirichlet distributions add any insight to this. Did I miss something?

I think this post is about explaining that the Dirichlet distribution is a conjugate prior, that is, the prior and the posterior distribution are both Dirichlet, and also that when the parameters are interpreted as counts of occurrences, the Bayes update rule is a simple recount. Perhaps the example of the beta distribution would be clearer. I think the first step is to introduce the reader to Bayesian statistics, in which parameters are not numbers but random variables. I think that conjugate priors like the beta and Dirichlet are a very good way to introduce readers to the Bayesian point of view, but once the reader considers the parameters as counts of occurrences, the application seems trivial. Perhaps there is more content that I didn’t understand.

That’s because it *is* trivial.

The point of this post is to cleanly separate the algorithm, mathematical model, and philosophy in the context of a concrete example. Spouting “the algorithm works because the Dirichlet is a conjugate prior” is the quickest way to blur the lines. And my personal opinion is that (albeit in the interest of generality) the clear, trivial ideas in statistics are obfuscated by the jargon.

In other words, this post is not for you, and I’m ignoring your advice on purpose.

I learned about conjugate priors by reading the free book Think Stats, a book about statistics for programmers that uses Python. Later I read more about conjugate priors, for example in a post on John Cook’s page. When I read your post I looked up the Dirichlet distribution on Wikipedia and discovered it was a generalization of the beta distribution and also a good and simple example of a conjugate prior. That is my humble voyage to learn that stuff. Maybe this post is not for me, but I hope to keep enjoying your posts.

Sorry, the book I was referring to is Think Bayes, http://greenteapress.com/wp/think-bayes/

Thanks for the recommendation!

From a pedagogical standpoint, wouldn’t it be nice to work out the beta-binomial model as a special case (also, Beta(0.5, 0.5) is the simplest Jeffreys prior)?

Btw, your blog is fantastic! I know little to nothing about algorithms and your expositions are very helpful.