## A Series on Machine Learning

These days an absolutely staggering amount of research and development work goes into the very coarsely defined field of “machine learning.” Part of the reason why it’s so coarsely defined is because it borrows techniques from so many different fields. Many problems in machine learning can be phrased in different but equivalent ways. While they are often purely optimization problems, such techniques can be expressed in terms of statistical inference, have biological interpretations, or have a distinctly geometric and topological flavor. As a result, machine learning has come to be understood as a toolbox of techniques as opposed to a unified theory.

It is unsurprising, then, that such a multitude of mathematics supports this diversified discipline. Practitioners (that is, algorithm designers) rely on statistical inference, linear algebra, convex optimization, and dabble in graph theory, functional analysis, and topology. Of course, above all else machine learning focuses on *algorithms* and *data.*

The general pattern, which we’ll see over and over again as we derive and implement various techniques, is to develop an algorithm or mathematical model, test it on datasets, and refine the model based on specific domain knowledge. The first step usually involves a leap of faith based on some mathematical intuition. The second step commonly involves a handful of established and well understood datasets (often taken from the University of California at Irvine’s machine learning database, and there is some controversy over how ubiquitous this practice is). The third step often seems to require some voodoo magic to tweak the algorithm and the dataset to complement one another.

It is this author’s personal belief that the most important part of machine learning is the mathematical foundation, followed closely by efficiency in implementation details. The thesis is that natural data has inherent structure, and that the goal of machine learning is to represent this and utilize it. To make true progress, one must represent and analyze structure abstractly. And so this blog will focus predominantly on mathematical underpinnings of the algorithms and the mathematical structure of data.

## General Plans

While we do intend to cover the classical topics of machine learning, such as neural networks and decision trees, we would like to quickly approach more sophisticated modern techniques such as support vector machines and methods based on Kolmogorov complexity. And so we put forth the ambitious list of topics (in no particular order). **[update: it’s been a while since this initial post, and we’ve covered some of the topics listed below, as indicated by the links]**

- K nearest neighbors
- Decision trees
- Centroid-based and density-based clustering
- Neural networks
- Support vector machines
- Regression

- Bandit learning (UCB1, Exp3)
- Bayesian inference and networks
- Methods based on Kolmogorov complexity
- Manifold learning and persistence homology

This long and circuitous journey will inevitably require arbitrarily large but finite detours to cover the mathematical background. We’ll cover metric spaces, functional analysis, mathematical statistics and probability theory, abstract algebra, topology, and even some category theory. Note that some of the more esoteric (i.e., advanced) topics will have their own series as well (for instance, we’ve had an itch to do computational category theory but having covered none of the typical concrete applications of category theory, the jump into extreme abstraction would come off as pointlessly complicated).

Of course, as we’ve mentioned before, while the mathematics is motivated by our desire to connect ideas, programming is motivated by what we can *do*. And so we’re interested in using machine learning methods to perform cool tasks. Some ideas we plan to implement on this blog include social network analysis, machine vision and optical character recognition, spam classification, natural language processing, speech recognition, and content classification and recommendation.

Finally, we are interested in the theoretical boundaries on what is possible for a computer to learn. Aside from its practical use, this area of study would require us to rigorously define what it means for a machine to “learn.” This field is known as computational learning theory, and a good deal of it is devoted to the typical complexity-theory-type questions such as “can this class of classification functions be learned in polynomial time?” In addition, this includes learning theories of a more statistical flavor, such as the “Probably Approximately Correct” model. We plan to investigate each of these models in turn as they come up in our explorations.

If any readers have suggestions for additional machine learning topics (to add to this already gargantuan list), feel free to pipe in with a comment! We’ll begin with an exploration of the simplest algorithm on the above list, k nearest neighbors, and a more rigorous exploration of metric spaces.

Until then!

Thanks for starting this cool series . I am interested in “theory” part which I know little about it ,mostly hand waving knowledge basically. Some questions/suggestions :

What will be the pace of this series ?.

Set of exercise for each topic in series will be great .

Extra notes(pdf) , references, will be also good.

Can you also list prerequisites for series ( I don’t have formal background in maths, but can pick up to some extent)

Thanks again and looking forward to learning.

I’ll definitely be leaving some exercises (mathematical ones and programming ones) and give references to notes written by others. I’m going to try not to assume anything outside of what I cover in my primers, but occasionally I’ll exercise the right to omit an arduously technical proof or wave my hand at certain pedantic distinctions.

I’ll definitely have a variety of complexity. I think every reader should have an easy time with the post on k nearest neighbors and k means but, say, manifold learning will be among the most abstract topics on this blog to date.

Thanks for reading!

This sounds like a place I’ll be returning to..

Could you list some of the methods/algorithms next to the bullets?

What are the methods based on Kolmogorov complexity? Isn’t it pretty much anything that uses regularization/smoothing in any shape way or form?

Do persistence homologies relate to metric-preserving mappings, such as MVU?

Manifold learning methods AFAIK imply the Riemannian manifold assumption.. will there be methods that are based on, say, GPs which can generalize to datasets with highly nonlinear mappings to the latent variables?

For Kolmogorov complexity we’ll derive and prove the universality of the Kolmogorov similarity metric. While theoretically uncomputable, it also encapsulates every possible similarity metric (which we’ll make rigorous later). From there one could use a number of different metric-based techniques to learn.

Persistence homology relates specifically to approximating the topological features of data, so yes, it’s assumed to be sampled from a manifold. Of course, there will always be more methods than I’m familiar with, so I’ll expand as time and knowledge permits.

Would it be possible for this blog series to pull from your post on facial recognition for application? Maybe a program that can learn to recognize different objects over time with minimally guided input? Or would that be outside the scope of this series?

So the post on facial recognition is essentially a combination of k-nearest-neighbors and principal component analysis. I’ll be sure to highlight this in my post on k-nearest-neighbors, and throughout the series I’ll describe different ways to manipulate data to become good inputs for the various methods.

I think what you’re describing, at least, could be k-means clustering with principal component analysis dimensionality reduction.

Very ambitious plan.

I would like to skip to manifold learning and methods oriented around Kolmogorov complexity, both of which I have been meaning to pick up for awhile. Where did you go to learn about these things?

One paper that covers it is “On Using SVM and Kolmogorov Complexity for Spam Filtering”

For both of those topics I go straight to the original papers. As far as I know both ideas are bleeding-edge research topics, so there is certainly no other way to find out about it (except, hopefully soon, to read my blog). Kolmogorov complexity is certainly the easier topic, and the resulting programs are entirely trivial, so most of that time will be spent on proving the universality of the Kolmogorov similarity metric.

On the other hand manifold learning requires quite a lot of background. There are many different flavors of manifold learning, and the one I’m most interested in right now is persistent homology. Unfortunately that’s probably the hardest one to describe to the uninitiated. I’m still looking into exactly how I’ll cover those topics, and in the mean time I think I can get some of the background covered by doing the easier topics on the list.

Hmm. Do you think the prior material, as listed here, will cover a lot that’s useful for persistent homology? I’m not sure I do. Finding homology that’s persistent over changes in tuning parameters does have deep connection to more “normal” methods of analyzing the structure of data, but prima facie, it seems to be quite different. Overall it seems that the community agrees, because the biggest question is whether the added complexity actually buys us anything in terms of results.

Everyone agrees it’s awesome though.

The effectiveness of persistent homology is an important question, but it’s a very young idea with many avenues for potential progress. For now, I’m just interested in learning, synthesizing my knowledge, and presenting it here.

As to the prior material, it’s certainly not enough. My own understanding of homology took many long hours of study, the completion of many exercises, and long discussions with friends and mentors. I find it immeasurably valuable, however, to revisit the basics and synthesize my own understanding in the form of a blog post. My hope is that this will benefit someone else as well 🙂

Really looking forward to this series. Do you recommend any prior reading? Just so we absorb material better?

I’m going to try extremely hard not to cover anything I don’t at least mention in my primers. That, and I try to give a high-level and a mathematically detailed explanation where I can. One thing I rarely get on this blog is readers asking me to clarify or expand on something I describe in a post. I have the feeling most readers either get it or don’t try. So if you’re having trouble absorbing a point or a proof, ask and I’ll do my best to give!

There doesn’t appear to be a way to subscribe to your blog. I’d really like to follow this series.

The standard wordpress “Follow” button at the top should work, and every wordpress blog has the same extension for its rss feed: i.e., http://jeremykun.wordpress.com/feed/.

i’m so excited!

Quite looking forward to more of your posts. You’ve already influenced me to code up some rugged eigenfaces and decision tree implementations (already in Python, soon in Haskell), and I’m looking forward to expanding the decision tree implementation out to some ensemble methods.

A fascination of mine is with spectral methods and manifold learning for big, sparse term-document matrices (and it’s my job!). Persistent homology is fascinating, and I’m looking forward to seeing how you’ll explain Betti numbers and bar codes (and what takeaways this’ll give us in the models we build.)

I’ve already implemented the original algorithm for computing persistent homology. The difficult part of presenting it here is that it requires *so* much background to fully understand the derivation and what it’s actually computing. Despite all of this, I think I’ll do it by first presenting the algorithm from my level (which is admittedly in the clouds relative to most of my readers) and then swing back around with a series of topology and algebra primers aimed at covering just enough background to understand what is going on.

Unfortunately I just don’t think it will be possible to get as much algebra and topology across as I would like. But then again at some point I plan to do some computational algebraic geometry here, and that requires quite a bit of ring theory.

when / if you get to explaining Bayes theorem – a Venn Pie Chart could be something fresh and useful

http://dl.dropbox.com/u/133074120/venn_pie_solver.html

Neat app. I will be doing Bayes’s Theorem, but I always felt like it was obvious… Maybe I should take a closer look at what confuses people about it.

Hi, I am interested in SVMs because its the focus of my work right now. Please if uyou dont mind educate me more on it with a clear cut examples to drive home your points. Cheers. You are doing a great job.

Where can I find the online course for machine learning, manifold learning, pattern recognizing? I could not see any link so far. Thanks!

How about deep learning? I’m really enjoy your posts. It helps me a lot. Thank you.