A professor at Stanford once said,

> If you really want to impress your friends and confound your enemies, you can invoke tensor products… People run in terror from the symbol.

He was explaining some aspects of multidimensional Fourier transforms, but this comment is only half in jest; people get confused by tensor products, and it’s often for good reason. People who really understand tensors feel obligated to explain them using abstract language (specifically, universal properties). And the people who explain them in elementary terms don’t really understand tensors.

This post is an attempt to bridge the gap between the elementary and advanced understandings of tensors. We’ll start with the elementary (axiomatic) approach, just to get a good feel for the objects we’re working with and their essential properties. Then we’ll transition to the “universal” mode of thought, with the express purpose of enlightening us as to why the properties are both necessary and natural.

But above all, we intend to be sufficiently friendly so as to not make anybody run in fear. This means lots of examples and preferring words over symbols. Unfortunately, we simply can’t get by without the reader knowing the very basics of linear algebra (the content of our first two primers on linear algebra (1) (2), though the only important part of the second is the definition of an inner product).

So let’s begin.

## Tensors as a Bunch of Axioms

Before we get into the thick of things I should clarify some basic terminology. *Tensors* are just vectors in a special vector space. We’ll see that such a vector space comes about by combining two smaller vector spaces via a *tensor product*. So the tensor product is an operation combining vector spaces, and tensors are the elements of the resulting vector space.

Now the use of the word *product* is quite suggestive, and it may lead one to think that a tensor product is similar or related to the usual *direct product* of vector spaces. In fact they are related (in a very precise sense), but they are far from *similar*. If you were pressed, however, you could start with the direct product of two vector spaces and take a mathematical machete to it until it’s so disfigured that you have to give it a new name (the tensor product).

With that image in mind let’s see how that is done. For the sake of generality we’ll talk about two arbitrary finite-dimensional vector spaces $V, W$ of dimensions $n, m$. Recall that the *direct product* $V \times W$ is the vector space of pairs $(v, w)$ where $v$ comes from $V$ and $w$ from $W$. Recall that addition in this vector space is defined componentwise ($(v_1, w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2)$) and scalar multiplication scales both components ($\lambda (v, w) = (\lambda v, \lambda w)$).

To get the tensor product space $V \otimes W$, we make the following modifications. First, we *redefine* what it means to do scalar multiplication. In this brave new tensor world, scalar multiplication of the whole vector-pair is *declared* to be the same as scalar multiplication of *any component you want*. In symbols,

$\lambda (v, w) = (\lambda v, w) = (v, \lambda w)$

for all choices of scalars $\lambda$ and vectors $v, w$. Second, we change the addition operation so that it only works if one of the two components is the same. In symbols, we declare that

$(v, w) + (v', w) = (v + v', w)$

only works because $w$ is the same in both pieces, and with the same rule applying if we switch the positions of $v, w$ above. All other additions are simply declared to be *new* vectors. I.e. $(v, w) + (v', w')$ is simply itself. It’s a valid addition — we need to be able to add stuff to be a vector space — but you just can’t combine it any further unless you can use the scalar multiplication to factor out some things so that $v = v'$ or $w = w'$. To say it still one more time, a general element of the tensor space is a sum of these pairs that can or can’t be combined by addition (in general things can’t always be combined).

Finally, we *rename* the pair $(v, w)$ to $v \otimes w$, to distinguish it from the old vector space $V \times W$ that we’ve totally butchered and reanimated, and we call the tensor product space as a whole $V \otimes W$. Those familiar with this kind of abstract algebra will recognize *quotient spaces* at work here, but we won’t use that language except to note that we cover quotients and free spaces elsewhere on this blog, and that’s the formality we’re ignoring.

As an example, say we’re taking the tensor product of two copies of $\mathbb{R}^3$. This means that our space $\mathbb{R}^3 \otimes \mathbb{R}^3$ is comprised of vectors like $(1,2,3) \otimes (4,5,6)$, and moreover that the following operations are completely legitimate.

$3 \left( (1,2,3) \otimes (4,5,6) \right) = (3,6,9) \otimes (4,5,6) = (1,2,3) \otimes (12,15,18)$

$(1,2,3) \otimes (4,5,6) + (1,2,3) \otimes (7,8,9) = (1,2,3) \otimes (11,13,15)$

Cool. This seemingly innocuous change clearly has huge implications for the structure of the space. We’ll get to specifics about how different tensors are from regular products later in this post, but for now we haven’t even proved this thing is a vector space. It might not be obvious, but if you go and do the formalities and write the thing as a quotient of a free vector space (as we mentioned we wouldn’t do) then you know that quotients of vector spaces are again vector spaces. So we get that one for free. But even without that it should be pretty obvious: we’re essentially just *declaring* that all the axioms of a vector space hold when we want them to. So if you were wondering whether

$v \otimes 0 = 0 = 0 \otimes w$

The answer is yes, by force of will.

So just to recall, the axioms of a tensor space $V \otimes W$ are

- The “basic” vectors are $v \otimes w$ for $v \in V, w \in W$, and they’re used to build up all other vectors.
- Addition is symbolic, unless one of the components is the same in both addends, in which case $v_1 \otimes w + v_2 \otimes w = (v_1 + v_2) \otimes w$ and $v \otimes w_1 + v \otimes w_2 = v \otimes (w_1 + w_2)$.
- You can freely move scalar multiples around the components of $v \otimes w$, i.e. $\lambda (v \otimes w) = (\lambda v) \otimes w = v \otimes (\lambda w)$.
- The rest of the vector space axioms (distributivity, additive inverses, etc) are assumed with extreme prejudice.

Naturally, one can extend this definition to $d$-fold tensor products, like $V_1 \otimes V_2 \otimes \dots \otimes V_d$. Here we write the vectors as sums of things like $v_1 \otimes v_2 \otimes \dots \otimes v_d$, and we enforce that addition can only be combined if *all but one* of the coordinates are the same in the addends, and scalar multiples move around to all coordinates equally freely.
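The axioms above can be checked numerically in a concrete model. A standard (and here assumed, not proved) coordinate model identifies the pure tensor $v \otimes w$ of two vectors in $\mathbb{R}^n$ with their outer-product matrix; under that identification the tensor axioms become familiar matrix identities. A minimal sketch with NumPy:

```python
import numpy as np

# Concrete model: identify the pure tensor v ⊗ w with the outer-product
# matrix (v_i * w_j).  This is not the abstract definition above, just a
# coordinate realization of it.
v, w = np.array([1., 2., 3.]), np.array([4., 5., 6.])

# Scalar multiples slide freely between the two components:
#   3(v ⊗ w) = (3v) ⊗ w = v ⊗ (3w)
assert np.allclose(3 * np.outer(v, w), np.outer(3 * v, w))
assert np.allclose(3 * np.outer(v, w), np.outer(v, 3 * w))

# Addition combines only when one component is shared:
#   v ⊗ w1 + v ⊗ w2 = v ⊗ (w1 + w2)
w2 = np.array([7., 8., 9.])
assert np.allclose(np.outer(v, w) + np.outer(v, w2), np.outer(v, w + w2))

# A generic sum of pure tensors cannot be combined further: its matrix
# has rank greater than 1, so it is not itself a pure tensor.
u = np.array([1., 0., 0.])
t = np.outer(v, w) + np.outer(u, w2)
print(np.linalg.matrix_rank(t))  # 2
```

The last check previews the rank discussion at the end of the post: sums of pure tensors are genuinely new vectors unless a shared component lets them collapse.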

## So where does it come from?!

By now we have this definition and we can play with tensors, but any sane mathematically minded person would protest, “What the hell would cause anyone to come up with such a definition? I thought mathematics was supposed to be elegant!”

It’s an understandable position, but let me now try to convince you that tensor products are very natural. The main intrinsic motivation for the rest of this section will be this:

We have all these interesting mathematical objects, but over the years we have discovered that the *maps between objects* are the truly interesting things.

A fair warning: although we’ll maintain a gradual pace and informal language in what follows, by the end of this section you’ll be reading more or less mature 20th-century mathematics. It’s quite alright to stop with the elementary understanding (and skip to the last section for some cool notes about computing), but we trust that the intrepid readers will push on.

So with that understanding we turn to multilinear maps. Of course, the first substantive thing we study in linear algebra is the notion of a *linear map* between vector spaces. That is, a map $f$ that factors through addition and scalar multiplication (i.e. $f(v + w) = f(v) + f(w)$ and $f(\lambda v) = \lambda f(v)$).

But it turns out that lots of maps we work with have much stronger properties worth studying. For example, if we think of matrix multiplication as an operation, call it $m$, then $m$ takes in two matrices and spits out their product

$m(A, B) = AB$

Now what would be an appropriate notion of linearity for this map? Certainly it is linear in the first coordinate, because if we fix $B$ then

$m(A_1 + A_2, B) = (A_1 + A_2)B = A_1 B + A_2 B = m(A_1, B) + m(A_2, B)$

And for the same reason it’s linear in the second coordinate. But it is most definitely not linear in both coordinates *simultaneously*. In other words,

$m(A + B, C + D) = (A + B)(C + D) = AC + AD + BC + BD \neq AC + BD = m(A, C) + m(B, D)$

In fact, there is only *one* function that satisfies both “linearity in its two coordinates separately” and also “linearity in both coordinates simultaneously,” and it’s the zero map! (Try to prove this as an exercise.) So the strongest kind of linearity we could reasonably impose is that $m$ is linear in each coordinate when **all else is fixed**. Note that this property allows us to shift around scalar multiples, too. For example,

$m(\lambda A, B) = \lambda AB = m(A, \lambda B) = \lambda m(A, B)$

Starting to see the wispy strands of a connection to tensors? Good, but hold it in for a bit longer. This single-coordinate-wise-linear property is called *bilinearity* when we only have two coordinates, and *multilinearity* when we have more.
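As a quick numerical sanity check of the discussion above (a sketch assuming NumPy, with matrix multiplication as the bilinear map $m$):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((2, 2)) for _ in range(4))

def m(X, Y):
    """Matrix multiplication viewed as a two-argument map."""
    return X @ Y

# Linear in the first coordinate with the second fixed...
assert np.allclose(m(A + B, C), m(A, C) + m(B, C))
# ...and in the second coordinate with the first fixed.
assert np.allclose(m(A, C + D), m(A, C) + m(A, D))

# But NOT linear in both coordinates simultaneously:
# m(A+B, C+D) expands to all four cross terms, not just m(A,C) + m(B,D).
lhs = m(A + B, C + D)
assert np.allclose(lhs, m(A, C) + m(A, D) + m(B, C) + m(B, D))
assert not np.allclose(lhs, m(A, C) + m(B, D))
```

The last two assertions are exactly the distinction drawn above: bilinearity gives the four-term expansion, while joint linearity would demand the (false) two-term one.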

Here are some examples of nice multilinear maps that show up everywhere:

- If $V$ is an inner product space over $\mathbb{R}$, then the inner product $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ is bilinear.
- The determinant of a matrix is a multilinear map if we view the columns of the matrix as vector arguments.
- The cross product of vectors in $\mathbb{R}^3$ is bilinear.

There are many other examples, but you should have at least passing familiarity with these notions, and it’s enough to convince us that multilinearity is worth studying abstractly.

And so what tensors do is give a sort of *classification* of multilinear maps. The idea is that every multilinear map $f$ from a product vector space $V_1 \times \dots \times V_d$ to any vector space $Z$ can be written first as a multilinear map to the tensor space

$\alpha : V_1 \times \dots \times V_d \to V_1 \otimes \dots \otimes V_d$

followed by a linear map to $Z$,

$\hat{f} : V_1 \otimes \dots \otimes V_d \to Z$

And the important part is that $\alpha$ doesn’t depend on the original $f$ (but $\hat{f}$ does). One usually draws this as a single diagram:

[commutative diagram: $f$ factors through $V_1 \otimes \dots \otimes V_d$ as $\hat{f} \circ \alpha$]

And to say this diagram *commutes* is to say that all possible ways to get from one point to another are equivalent (the compositions of the corresponding maps you follow are equal, i.e. $f = \hat{f} \circ \alpha$).

In fuzzy words, the tensor product is like the gatekeeper of all multilinear maps, and $\alpha$ is the gate. Yet another way to say this is that $\alpha$ is the most general possible multilinear map that can be constructed from $V_1 \times \dots \times V_d$. Moreover, the tensor product itself is *uniquely* defined by having a “most-general” $\alpha$ (up to isomorphism). This notion is often referred to by mathematicians as the “universal property” of the tensor product. And they might say something like “the tensor product is initial with respect to multilinear mappings from the standard product.” We discuss language like this in detail in this blog’s series on category theory, but it’s essentially a super-compact (and almost too vague) way of saying what the diagram says.

Let’s explore this definition when we specialize to a tensor of two vector spaces, and it will give us a good understanding of $\alpha$ (which is really incredibly simple, but people like to muck it up with choices of coordinates and summations). So fix $V, W$ as vector spaces and look at the diagram

[commutative diagram: $f : V \times W \to Z$ factors as $\hat{f} \circ \alpha$ through $V \otimes W$]

What is $\alpha$ in this case? Well it just sends $(v, w) \mapsto v \otimes w$. Is this map multilinear? Well if we fix $w$ then

$\alpha(v_1 + v_2, w) = (v_1 + v_2) \otimes w = v_1 \otimes w + v_2 \otimes w = \alpha(v_1, w) + \alpha(v_2, w)$

and

$\alpha(\lambda v, w) = (\lambda v) \otimes w = \lambda (v \otimes w) = \lambda \alpha(v, w)$

And our familiarity with tensors now tells us that the other side holds too. Actually, rather than say this is a result of our “familiarity with tensors,” the truth is that this is how we know that we need to define the properties of tensors as we did. It’s all because we *designed* tensors to be the gatekeepers of multilinear maps!

So now let’s prove that all multilinear maps $f$ can be decomposed into an $\alpha$ part and a $\hat{f}$ part. To do this we need to know what data uniquely defines a multilinear map. For usual linear maps, all we had to do was define the effect of the map on each element of a basis (the rest was uniquely determined by the linearity property). We know what a basis of $V \times W$ is, it’s just the union of the bases of the pieces. Say that $V$ has a basis $v_1, \dots, v_n$ and $W$ has $w_1, \dots, w_m$, then a basis for the product is just $\lbrace (v_1, 0), \dots, (v_n, 0), (0, w_1), \dots, (0, w_m) \rbrace$.

But bilinear maps are more nuanced, because they have two arguments. In order to say “what they do on a basis” we really need to know how they act on *all possible pairs* of basis elements. For how else could we determine $f(v_i, w_j)$? If there are $n$ of the $v_i$’s and $m$ of the $w_j$’s, then there are $nm$ such pairs $(v_i, w_j)$.

Uncoincidentally, as $V \otimes W$ is a vector space, *its* basis can also be constructed in terms of the bases of $V$ and $W$. You simply take all possible tensors $v_i \otimes w_j$. Since every $v \in V$ and $w \in W$ can be written in terms of their bases, it’s clear that any tensor $v \otimes w$ can also be written in terms of the basis tensors (by simply expanding each of $v, w$ in terms of their respective bases, and getting a larger sum of more basic tensors).

Just to drive this point home, if $v_1, v_2, v_3$ is a basis for $V$, and $w_1, w_2$ a basis for $W$, then the tensor space $V \otimes W$ has basis

$\lbrace v_1 \otimes w_1, v_1 \otimes w_2, v_2 \otimes w_1, v_2 \otimes w_2, v_3 \otimes w_1, v_3 \otimes w_2 \rbrace$

It’s a theorem that finite-dimensional vector spaces of equal dimension are isomorphic, so the length of this basis (6) tells us that $\dim(V \otimes W) = nm = 6$ here, and $V \otimes W$ is isomorphic to any other vector space of that dimension over the same field.
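Modeling $V = \mathbb{R}^3$ and $W = \mathbb{R}^2$ concretely (again identifying $v \otimes w$ with the outer-product matrix, flattened to a vector), we can verify that the six basis tensors really are linearly independent and span a six-dimensional space:

```python
import numpy as np

# Model V = R^3, W = R^2; identify v ⊗ w with the 3x2 outer-product matrix,
# flattened into a vector in R^6.
basis_V = np.eye(3)  # rows v_1, v_2, v_3
basis_W = np.eye(2)  # rows w_1, w_2

# All 6 basis tensors v_i ⊗ w_j, as the rows of a 6x6 matrix.
tensors = np.array([np.outer(v, w).ravel() for v in basis_V for w in basis_W])

# Rank 6 means they are linearly independent and span R^6,
# confirming dim(V ⊗ W) = 3 * 2 = 6.
print(np.linalg.matrix_rank(tensors))  # 6
```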

So fine, back to decomposing $f$. All we have left to do is use the data given by $f$ (the effect on pairs of basis elements) to define $\hat{f}$. The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space ($v_i \otimes w_j$) and the definition of $f(v_i, w_j)$ are essentially the same.

That is, just take $\hat{f}(v_i \otimes w_j) = f(v_i, w_j)$. Note that this is just defined on the basis elements, and so we extend $\hat{f}$ to all other vectors in the tensor space by imposing linearity (defining $\hat{f}$ to split across sums of tensors as needed). Is this well defined? Well, multilinearity of $f$ forces it to be so. For if we had two equal tensors, say, $\lambda v \otimes w = v \otimes \lambda w$, then we know that $f$ has to respect their equality, because $f(\lambda v, w) = f(v, \lambda w)$, so $\hat{f}$ will take the same value on equal tensors regardless of which representative we pick (where we decide to put the $\lambda$). The same idea works for sums, so everything checks out, and $\hat{f} \circ \alpha$ is equal to $f$, as desired. Moreover, we didn’t make any choices in constructing $\hat{f}$. If you retrace our steps in the argument, you’ll see that everything was essentially decided for us once we fixed a choice of a basis (by our wise decision to define $\hat{f}$ via the values $f(v_i, w_j)$). Since the construction would be isomorphic if we changed the basis, our choice of $\hat{f}$ is unique.
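The whole factorization $f = \hat{f} \circ \alpha$ can be played out numerically. The sketch below (assuming NumPy, with $V = \mathbb{R}^3$, $W = \mathbb{R}^2$, $Z = \mathbb{R}$, and a bilinear map built from an arbitrary matrix $F$, since every bilinear map to $\mathbb{R}$ has the form $f(v, w) = v^T F w$) shows $\hat{f}$ being read off from the values of $f$ on basis pairs, exactly as in the argument above:

```python
import numpy as np

# A bilinear map f : R^3 x R^2 -> R, determined by F[i, j] = f(e_i, e_j).
F = np.array([[1., 2.], [3., 4.], [5., 6.]])

def f(v, w):
    return v @ F @ w

def alpha(v, w):
    # α sends (v, w) to v ⊗ w, modeled as the flattened outer product in R^6.
    return np.outer(v, w).ravel()

def fhat(t):
    # fhat is LINEAR on R^6; its coordinates are f's values on basis pairs:
    # fhat(e_i ⊗ e_j) = f(e_i, e_j) = F[i, j].
    return F.ravel() @ t

# The diagram commutes: f(v, w) = fhat(α(v, w)) for any inputs.
v, w = np.array([1., -1., 2.]), np.array([0.5, 3.])
assert np.isclose(f(v, w), fhat(alpha(v, w)))
```

Note that `fhat` never looks at the pair $(v, w)$, only at the tensor $\alpha(v, w)$, which is the content of the universal property.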

There is a lot more to say about tensors, and indeed there are some other useful ways to think about tensors that we’ve completely ignored. But this discussion should make it clear *why* we define tensors the way we do. Hopefully it eliminates most of the mystery in tensors, although there is still a lot of mystery in trying to compute stuff using tensors. So we’ll wrap up this post with a short discussion about that.

## Computability and Stuff

It should be clear by now that plain product spaces and tensor product spaces are extremely different. In fact, they’re only related in that their underlying sets of vectors are built from pairs of vectors in $V$ and $W$. Avid readers of this blog will also know that operations involving matrices (like row reduction, eigenvalue computations, etc.) are generally efficient, or at least they run in polynomial time, so they’re not impractically slow for modest inputs.

On the other hand, it turns out that almost every question you might want to ask about tensors is difficult to answer computationally. As with the definition of the tensor product, this is no mere coincidence. There is something deep going on with tensors, and it has serious implications regarding quantum computing. More on that in a future post, but for now let’s just focus on one hard problem to answer for tensors.

As you know, the most general way to write an element of a tensor space $V_1 \otimes \dots \otimes V_d$ is as a sum of the basic-looking tensors

$\sum_k v_{1,k} \otimes v_{2,k} \otimes \dots \otimes v_{d,k}$

where the $v_{i,k}$ are linear combinations of basis vectors in the $V_i$. But as we saw with our examples above, there can be lots of different ways to write a tensor. If you’re lucky, you can write the entire tensor as a one-term sum, that is, as a single tensor $v_1 \otimes \dots \otimes v_d$. If you can do this we call the tensor a *pure tensor*, or a *rank 1 tensor*. We then have the following natural definition and problem:

**Definition:** The *rank* of a tensor $v$ is the minimum number of terms in any representation of $v$ as a sum of pure tensors. The one exception is the zero element, which has rank zero by convention.

**Problem:** Given a tensor $v \in V_1 \otimes \dots \otimes V_d$, where the $V_i$ are vector spaces over a field $k$, compute its rank.

Of course this isn’t possible in standard computing models unless you can represent the elements of the field (and hence the elements of the vector space in question) in a computer program. So we restrict $k$ to be either the rational numbers $\mathbb{Q}$ or a finite field $\mathbb{F}_q$.
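It’s worth pausing on the case $d = 2$, where the problem is easy: an element of $k^n \otimes k^m$ is just an $n \times m$ matrix, a pure tensor is an outer product, and tensor rank coincides with ordinary matrix rank, computable in polynomial time by Gaussian elimination. The hardness below kicks in at $d = 3$. A quick illustration (assuming NumPy):

```python
import numpy as np

# d = 2: tensor rank is just matrix rank.
v, w = np.array([1., 2.]), np.array([3., 4.])

pure = np.outer(v, w)                      # a pure (rank-1) tensor
print(np.linalg.matrix_rank(pure))         # 1

# A sum of two "independent" pure tensors has rank 2.
mixed = pure + np.outer(np.array([1., 0.]), np.array([0., 1.]))
print(np.linalg.matrix_rank(mixed))        # 2
```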

Even though the problem is simple to state, it was proved in 1990 (a result of Johan Håstad) that tensor rank is hard to compute. Specifically, the theorem is that

**Theorem:** Computing tensor rank is NP-hard when $k = \mathbb{Q}$ and NP-complete when $k$ is a finite field.

The details are given in Håstad’s paper, but the important work that followed essentially showed that most problems involving tensors are hard to compute (many of them by reduction from computing rank). This is unfortunate, but also displays the power of tensors. In fact, tensors are so powerful that many believe understanding them better will lead to insight in some very important problems, like finding faster matrix multiplication algorithms or proving circuit lower bounds (which is closely related to P vs NP). Finding low-rank tensor approximations is also a key technique in a lot of recent machine learning and data mining algorithms.

With this in mind, the enterprising reader will probably agree that understanding tensors is both valuable and useful. In the future of this blog we’ll hope to see some of these techniques, but at the very least we’ll see the return of tensors when we delve into quantum computing.

Until next time!

Thank you for a great article, Jeremy! By the way, I have little question about notation for basis of $V \times W$: shouldn’t it be more correct to write it as $\lbrace (v_1, 0), \dots, (v_n, 0), (0, w_1), \dots, (0, w_m) \rbrace$?

Yes, definitely. This is a bit of sloppiness that I overlooked.

didn’t Einstein use tensors in relativity?

Yes, and he was dumbfounded when he first started working with them.

This is all much easier to understand in Einstein summation notation.

Maybe if you’re a physicist?

I was going to ask if the Hastad result holds specifically for tensor products of finite-dimensional vector spaces, but then I realised that every finitely representable tensor (in the obvious representation) is embedded in a tensor product of finite-dimensional vector spaces by cutting the space off at the highest-numbered coordinate.

For infinite families of vector spaces, there’s this difference between direct sum (where all elements have only finitely many nonzero components) and direct product (where this is not the case). Do you know if there’s a similar distinction for tensor products?

Shoveling my driveway gave me the insight (obvious in hindsight) that you always need infinitely many nonzero vectors to make a nonzero element of the tensor product of infinitely many vector spaces, because even one zero component makes the tensor product zero. So there is no distinction similar to the difference between direct sum and direct product.

Yes, that is correct. You need to work with products, and the universal property can be reformulated nicely in terms of an arbitrary index set.

Thanks for a wonderful post, as always! I’d like to add something, too:

Recently in machine learning and statistics there’s been a surge of interest in a certain class of tensors which CAN be decomposed efficiently (poly-time).

These are orthogonally-decomposable tensors, and they have an amazing array of newly discovered applications in fitting complex statistical models. Examples include simple Gaussian Mixture Models, Hidden Markov Models, Latent Dirichlet Allocation, Independent Component Analysis and more.

I seem to see papers on this topic crop up all over arXiv. I intend to read some of them at some point for my own research.

Thanks j2kun, urishin! I’m now reading “Nonnegative Tensor Factorization with Applications to Statistics and Computer Vision”. I always considered tensors a bit like category theory, perhaps fun for a Sunday afternoon when it’s raining and I want to hone my conceptual skills a bit with presheafs, fiber products, and what not, but this is actually quite useful! Nonnegative matrix factorization shows “ghost structures” compared to Nonnegative tensor factorization. And their treatment of expectation-maximization as a nonnegative tensor factorization (and not the best one) is of course superb!

Tim Gowers article “Lose your fear of tensor products” may also be of interest.

https://www.dpmms.cam.ac.uk/~wtg10/tensors3.html

This is good, I haven’t seen it before. It covers some aspects I omit, like the associativity of the tensor product.

“But above all, we intend to be sufficiently friendly so as to not make anybody run in fear.”

That you accomplished, but you lost me a short distance in. I know a lot and can use all of it involving vectors, but you don’t form bridges – there is no “extension’ or superset. Just something not quite completely different, but using whatever strange symbols for – I’m not quite sure if whatever operator I can’t find the unicode character for to drop in here is a cross product, or means something identical, analogous, similar, or utterly and completely different for vectors and tensors.

I won’t run in fear. I will just shrug and walk away. you may understand something, but cannot reduce it and explain it to me.

Tensors are vectors in a particular vector space. I’m not sure what your question is. It need not be related to cross products for one to understand it…

Personally I’m not really seeing where the “bridge” between elementary to advanced terms is (seems more like a roundabout set firmly in advanced).

Think the problem is that you’re thinking of the axiomatic description as “elementary” but by then us laymen are starting to amble off …

Could you comment on the relation between tensor products and Kronecker product in matrices (and why the latter is useful)

The Kronecker product is just a particular way to write linear maps between tensor spaces that arise as the tensor product of two linear *maps* on the pieces. (It should make sense that if you can tensor two vector spaces, then you can tensor two linear operators on those spaces. It’s just $(f \otimes g)(v \otimes w) = f(v) \otimes g(w)$.)

The Kronecker product is just a way of “picking coordinates” or “fixing a basis” so that this “tensored” linear map can be represented as a single matrix. The basis is quite obvious; it’s just the lexicographic ordering on the bases of the two pieces (i.e., $v_1 \otimes w_1, v_1 \otimes w_2, \dots, v_1 \otimes w_m, v_2 \otimes w_1, \dots$).

It’s useful if you want to do things like row-reduction on maps between tensor spaces, because you don’t need to change any of your existing algorithms.
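[A quick numerical check of the identity in the reply above, assuming NumPy: `np.kron` computes the Kronecker product, and the “tensor of maps” rule shows up as the mixed-product property.]

```python
import numpy as np

# If A acts on V and B acts on W, then kron(A, B) is the matrix of A ⊗ B
# in the lexicographic basis, and (A ⊗ B)(v ⊗ w) = (Av) ⊗ (Bw).
rng = np.random.default_rng(1)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((2, 2))
v, w = rng.standard_normal(3), rng.standard_normal(2)

# v ⊗ w in coordinates is itself a Kronecker product (of column vectors).
assert np.allclose(np.kron(A, B) @ np.kron(v, w), np.kron(A @ v, B @ w))
```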

Thank you, great article, Wow this was some reading ..

$b_1 \otimes \ldots \otimes b_d$ perhaps?

Yes, of course. Fixed it.

Jeremy, I wanted to say something different: suppose, according to your setup, , then where . A tensor of rank 1 would then simply be and not (or as in the article).

You’re right I tried to fix it without thinking and confused myself as to what I was doing. All in a day’s work 🙂

Thank you. Excellent article.

I was interested to find that your use of “rank” differs substantially from its usage as I have encountered it elsewhere. For example, see http://mathworld.wolfram.com/TensorRank.html

To be fair, I have principally used tensors mainly for their applications in physics/smooth manifold theory where they are defined as multilinear functionals on a product of copies of a vector space and its dual — so, not very heavy on the algebraic way of thinking.

Cool article! I’ve been enjoying these primers.

There must certainly be some connection between the two…

Thanks, great post! Minor suggestions for clarity:

“So now let’s prove that all maps” => “… all multilinear maps”

“But multilinear maps are more nuanced, because they have two arguments.” => “But bilinear maps…”

What a fantastic post! I’ll definitely be coming back to this blog in the future. Thanks!

I’m new to the concept of a tensor product, this is a great explanation. Thanks!

Nice article. One question, though: is it too obvious to require justification that all pairs of basis elements of the component vector spaces form a linearly independent set in the tensor product of two vector spaces?

Thanks a lot for this article! This site is super interesting and I just started browsing it recently, and I have heard of tensors and the tensor product before, but I had never properly understood it until reading this article.

You said:

m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD

m(A+B, C+D) = m(A,C) + m(B,D).

I am totally confused as to why you would make that statement, specifically, why would we think that m(A+B, C+D) = m(A,C) + m(B,D) and what does that have to do with m(A+B, C+D) = m(A,C) + m(A,D) + m(B,C) + m(B,D)?

It seems to me that:

m(A+B, C+D) = (A+B)(C+D) = AC + AD + BC + BD

= m(A, C+D) + m(B, C+D)

= m(A,C) + m(A,D) + m(B,C) + m(B,D)

= AC + AD + BC + BD

I tried the above with some sample matrices and I did find that

(A+B)(C+D) = AC + AD + BC + BD.

Then you state that the one and only function that satisfies both linearity conditions is the zero map, but the derivation I gave above says otherwise.

So, what am I not understanding? I understand that if the A, B, C, and D were some abstract thing like you talk about elsewhere that the properties of those things might be such as to make the derivation invalid, but m is matrix multiplication and A, B, C, and D are matrices, so I don’t see why m is not linear in both arguments simultaneously. What am I missing?

Thanks.

The point of saying that is to say that the “m” operator is _not_ linear in both coordinates simultaneously. The part that “I said” has a “not equal” symbol in between the two statements.

The question was “what could it mean to call m(A,B) linear?” because there are two coordinates. Saying it’s linear in just one coordinate is fine, but saying it’s linear in both coordinates at the same time (m(A+B,C+D) = m(A,C) + m(B,D)) is too strong. It’s not a particularly interesting statement.

Thanks for your prompt response, I appreciate it very much. I tried to put a “not equal” symbol between the two statements, but something went wrong.

It seems to me that you are defining linearity in (m(A+B,C+D) = m(A,C) + m(B,D)) as pairing A with C and B with D, whereas I’m defining it as breaking out the two arguments of the LHS into a fully distributed multiplication over the full addition: m(A+B, C+D) = m(A,C) + m(A,D) + m(B,C) + m(B,D). I think I understand what you are saying about the differences between the “direct product” vector space with its componentwise addition and scalar multiplication on both components versus the “tensor product” space with its different rules of addition & multiplication, but the example you used was the matrix multiplication of matrices. I’m assuming these are ordinary matrices from elementary linear algebra. From what I’m seeing in my example, they work by the fully distributed procedure I used. I guess that’s what’s confusing me. Unless I picked four matrices that just happened to work, it seems to me that matrices behave according to the full distribution rather than the partial, tensor product space-like rules.

Another way in which I may be confused is that perhaps you’re saying that we just define “your” definition of linearity of matrix multiplication, as you said, “by force of will”. Perhaps this is a way of defining a truly interesting map between objects. That is, just forget everything we’ve ever learned about elementary matrix and real number algebra and just define linearity “your” way for the purpose of providing some rationale and insight into why “your” definition of linearity is important in tensor product spaces, but don’t worry that matrices don’t really work this way. If that’s the case, then I’ll just accept what you said and see where it leads. If that’s not the case, then I’m definitely confused.

So I guess in my mind, the question boils down to these points:

1. Am I correct that elementary matrix multiplication works by m(A+B, C+D) = m(A,C) + m(A,D) + m(B,C) + m(B,D) or not?

2. Is the definition m(A+B,C+D) = m(A,C) + m(B,D) something we’ve defined by force of will just for pedagogical purposes related to understanding tensor product spaces, but is not intended to be how matrix multiplication really behaves?

If what I’m asking makes any sense to you and you have the time and feel inclined to response again, I’ll certainly appreciate it, but I know you’re very busy and I don’t want to waste your time & energy trying to train this particular chimpanzee.

Either way, I’ll keep reading and perhaps I’ll stumble upon whatever it is that’s confusing me so much at the moment.

Thanks,

Greg.

” The definition is rather straightforward, as we have already made the suggestive move of showing that the basis for the tensor space (v_i \otimes w_j) and the definition of f(v_i, w_j) are essentially the same.”

In the final f(v_i,w_j) above, should the f be an \alpha ?
