MLIR — Canonicalizers and Declarative Rewrite Patterns

In a previous article we defined folding functions, and used them to enable some canonicalization and the sccp constant propagation pass for the poly dialect. This time we’ll see how to add more general canonicalization patterns.

The code for this article is in this pull request, and as usual the commits are organized to be read in order.

Why is Canonicalization Needed?

MLIR provides folding as a mechanism to simplify an IR, which can result in simpler, more efficient ops (e.g., replacing multiplication by a power of 2 with a shift) or removing ops entirely (e.g., deleting $y = x+0$ and using $x$ for $y$ downstream). It also has a special hook for constant materialization.

General canonicalization differs from folding in MLIR in that it can transform many operations at once. The MLIR docs call this “DAG-to-DAG” rewriting, where DAG stands for “directed acyclic graph” in the graph-theory sense, but here it really just means a section of the IR.

Canonicalizers can be written in the standard way: declare the op has a canonicalizer in tablegen and then implement a generated C++ function declaration. The official docs for that are here. Or you can do it all declaratively in tablegen, the docs for that are here. We’ll do both in this article.

Aside: there is a third way, to use a new system called PDLL, but I haven’t figured out how to use that yet. It should be noted that PDLL is under active development, and in the meantime the tablegen-based approach in this article (called “DRR” for Declarative Rewrite Rules in the MLIR docs) is considered to be in maintenance mode, but not yet deprecated. I’ll try to cover PDLL in a future article.

Canonicalizers in C++

Reusing our poly dialect, we’ll start with the binary polynomial operations, adding let hasCanonicalizer = 1; to the op base class in this commit, which generates the following method headers on each of the binary op classes:

// PolyOps.h.inc
static void getCanonicalizationPatterns(
  ::mlir::RewritePatternSet &results, ::mlir::MLIRContext *context);

In the body of this method we add custom rewrite patterns to the input results set, and we can define those patterns however we like in C++.
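
For instance, a minimal implementation of that hook (a sketch; the body in the actual commit may differ slightly) just registers the pattern we define next:

// PolyOps.cpp (sketch)
void SubOp::getCanonicalizationPatterns(::mlir::RewritePatternSet &results,
                                        ::mlir::MLIRContext *context) {
  // Register the DifferenceOfSquares pattern defined below.
  results.add<DifferenceOfSquares>(context);
}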

The first canonicalization pattern we’ll write in this commit is for the simple identity $x^2 - y^2 = (x+y)(x-y)$, which is useful because it trades a multiplication for an addition. The only caveat is that this canonicalization is only more efficient if the squares have no other downstream uses.

// Rewrites (x^2 - y^2) as (x+y)(x-y) if x^2 and y^2 have no other uses.
struct DifferenceOfSquares : public OpRewritePattern<SubOp> {
  DifferenceOfSquares(mlir::MLIRContext *context)
      : OpRewritePattern<SubOp>(context, /*benefit=*/1) {}

  LogicalResult matchAndRewrite(SubOp op, 
                                PatternRewriter &rewriter) const override {
    Value lhs = op.getOperand(0);
    Value rhs = op.getOperand(1);
    if (!lhs.hasOneUse() || !rhs.hasOneUse()) { 
      return failure();
    }

    auto rhsMul = rhs.getDefiningOp<MulOp>();
    auto lhsMul = lhs.getDefiningOp<MulOp>();
    if (!rhsMul || !lhsMul) {
      return failure();
    }

    bool rhsMulOpsAgree = rhsMul.getLhs() == rhsMul.getRhs();
    bool lhsMulOpsAgree = lhsMul.getLhs() == lhsMul.getRhs();

    if (!rhsMulOpsAgree || !lhsMulOpsAgree) {
      return failure();
    }

    auto x = lhsMul.getLhs();
    auto y = rhsMul.getLhs();

    AddOp newAdd = rewriter.create<AddOp>(op.getLoc(), x, y);
    SubOp newSub = rewriter.create<SubOp>(op.getLoc(), x, y);
    MulOp newMul = rewriter.create<MulOp>(op.getLoc(), newAdd, newSub);

    rewriter.replaceOp(op, {newMul});
    // We don't need to remove the original ops because MLIR already has
    // canonicalization patterns that remove unused ops.

    return success();
  }
};

The test in the same commit shows the impact:

// Input:
func.func @test_difference_of_squares(
  %0: !poly.poly<3>, %1: !poly.poly<3>) -> !poly.poly<3> {
  %2 = poly.mul %0, %0 : !poly.poly<3>
  %3 = poly.mul %1, %1 : !poly.poly<3>
  %4 = poly.sub %2, %3 : !poly.poly<3>
  %5 = poly.add %4, %4 : !poly.poly<3>
  return %5 : !poly.poly<3>
}

// Output:
// bazel run tools:tutorial-opt -- --canonicalize $FILE
func.func @test_difference_of_squares(%arg0: !poly.poly<3>, %arg1: !poly.poly<3>) -> !poly.poly<3> {
  %0 = poly.add %arg0, %arg1 : !poly.poly<3>
  %1 = poly.sub %arg0, %arg1 : !poly.poly<3>
  %2 = poly.mul %0, %1 : !poly.poly<3>
  %3 = poly.add %2, %2 : !poly.poly<3>
  return %3 : !poly.poly<3>
}

Other than this pattern being used in the getCanonicalizationPatterns function, there is nothing new here compared to the previous article on rewrite patterns.

Canonicalizers in Tablegen

The above canonicalization is really a simple kind of optimization attached to the canonicalization pass. It seems that this is how many minor optimizations end up being implemented, which makes --canonicalize a heavyweight and powerful pass. However, the name “canonicalize” also suggests that it should be used to put the IR into a canonical form, so that later passes don’t have to check as many cases.

So let’s implement a canonicalization like that as well. We’ll add one that has poly interact with the complex dialect, and implement a canonicalization that ensures complex conjugation always comes after polynomial evaluation. This works because of the identity $f(\overline{z}) = \overline{f(z)}$, which holds for any polynomial $f$ with real coefficients and any complex number $z$.

This commit adds support for complex inputs in the poly.eval op. Note that in MLIR, complex types are forced to be floating point because all the op verifiers that construct complex numbers require it. The complex type itself, however, suggests a complex<i32> is perfectly legal, so it seems nobody in the MLIR community needs Gaussian integers yet.

The rule itself is implemented in tablegen in this commit. The main tablegen code is:

include "PolyOps.td"
include "mlir/Dialect/Complex/IR/ComplexOps.td"
include "mlir/IR/PatternBase.td"

def LiftConjThroughEval : Pat<
  (Poly_EvalOp $f, (ConjOp $z)),
  (ConjOp (Poly_EvalOp $f, $z))
>;

The two new pieces here are the Pat class (source) that defines a rewrite pattern, and the parenthetical notation that defines sub-trees of the IR being matched and rewritten. The source documentation on the Pattern parent class is quite well written, so read that for extra detail, and the normal docs provide a higher level view with additional semantics.

But the short story here is that the inputs to Pat are two “IR tree” objects (MLIR calls them “DAG nodes”), and each node in the tree is specified by parentheses ( ) with the first thing in the parentheses being the name of an operation (the tablegen name, e.g., Poly_EvalOp which comes from PolyOps.td), and the remaining arguments being the op’s arguments or attributes. Naturally, the nodes can nest, and that corresponds to a match applied to the argument. I.e., (Poly_EvalOp $f, (ConjOp $z)) means “an eval op whose first argument is anything (bind that to $f) and whose second argument is the output of a ConjOp whose input is anything (bind that to $z).”

When running tablegen with the -gen-rewriters option, this generates this code, which is not much more than a thorough version of the pattern we’d write manually. Then in this commit we show how to include it in the codebase. We still have to tell MLIR which pass to add the generated patterns to. You can add each pattern by name, or use the populateWithGenerated function to add them all.
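
As a sketch of what that hookup can look like (the include name below is illustrative, and I’m assuming the generated patterns are registered in EvalOp’s hook; adjust to wherever the commit actually registers them):

// PolyOps.cpp (sketch; include whatever .inc file tablegen -gen-rewriters emits)
#include "PolyCanonicalize.cpp.inc"

void EvalOp::getCanonicalizationPatterns(::mlir::RewritePatternSet &results,
                                         ::mlir::MLIRContext *context) {
  // Adds every DRR-generated pattern, including LiftConjThroughEval.
  populateWithGenerated(results);
}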

As another example, this commit reimplements the difference of squares pattern in tablegen. This one uses three additional features: a pattern that generates multiple new ops (which uses Pattern instead of Pat), binding the ops to names, and constraints that control when the pattern may be run.

// PolyPatterns.td
def HasOneUse: Constraint<CPred<"$_self.hasOneUse()">, "has one use">;

// Rewrites (x^2 - y^2) as (x+y)(x-y) if x^2 and y^2 have no other uses.
def DifferenceOfSquares : Pattern<
  (Poly_SubOp (Poly_MulOp:$lhs $x, $x), (Poly_MulOp:$rhs $y, $y)),
  [
    (Poly_AddOp:$sum $x, $y),
    (Poly_SubOp:$diff $x, $y),
    (Poly_MulOp:$res $sum, $diff),
  ],
  [(HasOneUse:$lhs), (HasOneUse:$rhs)]
>;

The HasOneUse constraint merely injects the quoted C++ code into a generated if guard, with $_self as a magic string to substitute in the argument when it’s used.

Then notice the syntax of (Poly_MulOp:$lhs $x, $x): the colon binds $lhs to refer to the op as a whole (or, via method overloads, its result value), so that it can be passed to the constraint. Similarly, the generated ops are all given names so they can be fed as the arguments of other generated ops. Finally, the second argument of Pattern is a list of generated ops to replace the matched input IR, rather than the single node that Pat takes.

The benefit of doing this is significantly less boilerplate related to type casting, checking for nulls, and emitting error messages. But because you still occasionally need to inject random C++ code, and inspect the generated C++ to debug, it helps to be fluent in both styles. I don’t know how to check a constraint like “has a single use” in pure DRR tablegen.

MLIR — Verifiers

Last time we defined folders and used them to enable some canonicalization and the sccp constant propagation pass for the poly dialect. This time we’ll add some additional safety checks to the dialect in the form of verifiers.

The code for this article is in this pull request, and as usual the commits are organized to be read in order.

Purpose of a verifier

Verifiers ensure the types and operations in a concrete MLIR program are well-formed. Verifiers are run before and after every pass, and help to ensure that individual passes, folders, rewrite patterns, etc., emit proper IR. This allows you to enforce invariants of each op, and it makes passes simpler because they can rely on the invariants to avoid edge case checking.

The official docs for verifiers of attributes and types are here, and for operations here.

That said, most common kinds of verification code are implemented as Traits (mixins, recall the earlier article). So we’ll start with those.

Trait-based verifiers

In the last article we added SameOperandsAndResultElementType to enable poly.add to have a mixed poly and tensor-of-poly input semantics. This technically did add a verifier to the IR, but to demonstrate this more clearly I want to restrict that behavior to assert that the input and output types must all agree (all tensors-of-polys or all polys).

This commit shows the work to do this, which is mainly changing the trait name to SameOperandsAndResultType. As a result, we get a few new generated things for free. First, the verification engine uses verifyTrait to check that the types agree. Here, verifyInvariants is an Operation base class method that the generated code overrides when traits inject verification; it’s the same mechanism that checks the type constraints on an operation’s types. (Note: a custom verifier instead gets a method named verify, to distinguish it from verifyInvariants.) Since SameOperandsAndResultType is a generic check, this doesn’t affect the generated code.

Next, an inferReturnTypes function is generated, below shown for AddOp.

::mlir::LogicalResult AddOp::inferReturnTypes(::mlir::MLIRContext *context, ::std::optional<::mlir::Location> location, ::mlir::ValueRange operands, ::mlir::DictionaryAttr attributes, ::mlir::OpaqueProperties properties, ::mlir::RegionRange regions, ::llvm::SmallVectorImpl<::mlir::Type>&inferredReturnTypes) {
  inferredReturnTypes.resize(1);
  ::mlir::Builder odsBuilder(context);
  ::mlir::Type odsInferredType0 = operands[0].getType();
  inferredReturnTypes[0] = odsInferredType0;
  return ::mlir::success();
}

With a type inference hook present, we can simplify the operation’s assembly format, so that the type needs to be specified only once instead of three times, as in (type, type) -> type. If we had tried to simplify it before adding this trait, tablegen would complain that it can’t infer the types needed to build a parser.

let assemblyFormat = "$lhs `,` $rhs attr-dict `:` qualified(type($output))";

This also requires updating all the tests to use the new assembly format (I did not try to find a way to make both the functional and abbreviated forms allowed at the same time, no big deal).
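
For concreteness, here is the before/after syntax for an add op (illustrative; the exact test lines in the repo may differ):

// Before: the type is spelled out three times via the functional form
%2 = poly.add %0, %1 : (!poly.poly<3>, !poly.poly<3>) -> !poly.poly<3>

// After: a single type suffices, and the operand types are inferred
%2 = poly.add %0, %1 : !poly.poly<3>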

Finally, this trait adds builders that don’t need you to specify the return type. Another example for AddOp:

void AddOp::build(::mlir::OpBuilder &odsBuilder, ::mlir::OperationState &odsState, ::mlir::Value lhs, ::mlir::Value rhs) {
  odsState.addOperands(lhs);
  odsState.addOperands(rhs);

  ::llvm::SmallVector<::mlir::Type, 2> inferredReturnTypes;
  if (::mlir::succeeded(AddOp::inferReturnTypes(odsBuilder.getContext(),
          odsState.location, odsState.operands,
          odsState.attributes.getDictionary(odsState.getContext()),
          odsState.getRawProperties(),
          odsState.regions, inferredReturnTypes)))
    odsState.addTypes(inferredReturnTypes);
  else
    ::llvm::report_fatal_error("Failed to infer result type(s).");
}

For another example, the EvalOp can’t use SameOperandsAndResultType, because its operands require different types, but we can use AllTypesMatch which generates similar code, but restricts the verification to a subset of types. This is added in this commit.

def Poly_EvalOp : Op<Poly_Dialect, "eval", [AllTypesMatch<["point", "output"]>]> {
  let summary = "Evaluates a Polynomial at a given input value.";
  let arguments = (ins Polynomial:$input, AnyInteger:$point);
  let results = (outs AnyInteger:$output);
  let assemblyFormat = "$input `,` $point attr-dict `:` `(` qualified(type($input)) `,` type($point) `)` `->` type($output)";
}

You can see many similar sorts of type-inference traits here and their corresponding verifiers here.

Aside: the nomenclature for verification with regard to traits is a bit confusing. There is a concept called constraint in MLIR, and the docs describe traits as subclasses of a Constraint base class. But at the time of this writing (2023-09-11) that particular claim is wrong. There’s a TraitBase base class for traits, and the Constraint base class appears to be used for verification on type declarations and attribute declarations (the let arguments = (ins ...) stuff). These go by the names “type constraints” and “attribute constraints.” AnyInteger is an example of a type constraint because it can match multiple types, and it does inherit (indirectly) from the Constraint base class. I think type constraints are a bit more complicated because the examples I see in MLIR all involve injecting C++ code through the tablegen (you’ll see stuff like CPred) and I haven’t explored how that materializes as generated code yet.

A custom verifier

I couldn’t think of any legitimate custom verification I wanted to have in poly, so to make one up arbitrarily, I will assert that EvalOp’s point operand must be a 32-bit integer type. I could do this with a type constraint in tablegen, but I will do it in a custom verifier instead for demonstration purposes.

Repeating our routine, we start by adding let hasVerifier = 1; to the op’s tablegen, and inspect the generated signature in the header, in this commit.

class EvalOp ... {
  ...
  ::mlir::LogicalResult verify();
};

And the implementation

LogicalResult EvalOp::verify() {
  return getPoint().getType().isInteger(32)
             ? success()
             : emitOpError("argument point must be a 32-bit integer");
}

The new thing here is emitOpError, which is required because if you just return failure(), the verifier will fail without emitting any diagnostic, leaving you with no output to explain what went wrong.

And then to test for failure, the lit RUN command should redirect stderr to a file and have FileCheck operate on that:

// tests/poly_verifier.mlir
// RUN: tutorial-opt %s 2>%t; FileCheck %s < %t
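// The test body can then contain a CHECK line matching the diagnostic text,
// for example (illustrative):
// CHECK: argument point must be a 32-bit integer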

A trait-based custom verifier

We can combine these two ideas together by defining a custom trait that includes a verification hook.

Each trait in MLIR has an optional verifyTrait hook (which is checked before the custom verifier created via hasVerifier), and we can use this to define generic verifiers that apply to many ops. We’ll do this by making a verifier that extends the previous section by asserting—generically for any op—that all integer-like operands must be 32-bit. Again, this is a silly and arbitrary constraint to enforce, but it’s a demonstration.

The process of defining this is almost entirely C++, and contained in this commit. We start by defining a so-called NativeOpTrait subclass in tablegen:

def Has32BitArguments:  NativeOpTrait<"Has32BitArguments"> {
  let cppNamespace = "::mlir::tutorial::poly";
}

This has an almost trivial effect: add a template argument called ::mlir::tutorial::poly::Has32BitArguments to the generated header class for an op that has this trait. E.g., for EvalOp,

def Poly_EvalOp : Op<Poly_Dialect, "eval", [
    AllTypesMatch<["point", "output"]>, 
    Has32BitArguments
]> {
  ...
}

Generates

// PolyOps.h.inc
class EvalOp : public ::mlir::Op<
    EvalOp, ::mlir::OpTrait::ZeroRegions, 
    ...,
    ::mlir::tutorial::poly::Has32BitArguments,
    ...
> {
  ...
}

The rest is up to you in C++. Define an implementation of Has32BitArguments, following the curiously-recurring-template pattern required of OpTrait::TraitBase, and then implement a verifyTrait hook. In our case, iterate over all of the op’s operand types, check which ones are integer-like, and then assert the width of those.

// PolyTraits.h
template <typename ConcreteType>
class Has32BitArguments : public OpTrait::TraitBase<ConcreteType, Has32BitArguments> {
 public:
  static LogicalResult verifyTrait(Operation *op) {
    for (auto type : op->getOperandTypes()) {
      // OK to skip non-integer operand types
      if (!type.isIntOrIndex()) continue;

      if (!type.isInteger(32)) {
        return op->emitOpError()
               << "requires each numeric operand to be a 32-bit integer";
      }
    }

    return success();
  }
};

This has the upside of being more generic, but the downside of requiring awkward casting to support specific ops and their named arguments. I.e., here we can’t refer to getPoint unless we do a dynamic cast to EvalOp. Doing so would make the trait less general, so a custom op-specific verifier is more appropriate for that.
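
For instance, the generic trait would have to do something like the following sketch to get at EvalOp’s named accessor:

// Inside verifyTrait (sketch): op-specific accessors require a cast.
if (auto evalOp = llvm::dyn_cast<EvalOp>(op)) {
  if (!evalOp.getPoint().getType().isInteger(32))
    return evalOp.emitOpError("argument point must be a 32-bit integer");
}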

Tensorphobia and the Outer Product

Variations on a theme

Back in 2014 I wrote a post called How to Conquer Tensorphobia that should end up on Math $ \cap$ Programming’s “greatest hits” album. One aspect of tensors I neglected to discuss was the connection between the modern views of tensors and the practical views of linear algebra. I feel I need to write this because every year or two I forget why it makes sense.

The basic question is:

What the hell is going on with the outer product of vectors?

The simple answer, the one that has never satisfied me, is that the outer product of $ v,w$ is the matrix $ vw^T$ whose $ i,j$ entry is the product $ v_i w_j$. This doesn’t satisfy me because it’s an explanation by fiat. It lacks motivation and you’re supposed to trust (or verify) things magically work out. To me this definition is like the definition of matrix multiplication: having it dictated to you before you understand why it makes sense is a cop out. Math isn’t magic, it needs to make perfect sense.

The answer I like, and the one I have to re-derive every few years because I never wrote it down, is a little bit longer.

To borrow a programming term, the basic confusion is a type error. We start with two vectors, $ v,w$ in some vector space $ V$ (let’s say everything is finite dimensional), and we magically turn them into a matrix. Let me reiterate for dramatic effect: we start with vectors, which I have always thought of as objects, things you can stretch, rotate, or give as a gift to your cousin Mauricio. A matrix, on the other hand, is a mapping, a thing that takes vectors as input and spits out new vectors. Sure, you can play with the mappings, feed them kibble and wait for them to poop or whatever. And sure, sometimes vectors are themselves maps, like in the vector space of real-valued functions (where the word “matrix” is a stretch, since it’s infinite dimensional).

But to take two vectors and *poof* get a mapping of all vectors, it’s a big jump. And part of the discomfort is that it feels random. To be happy we have to understand that the construction is natural, or even better canonical, meaning this should be the only way to turn two vectors into a linear map. Then the definition would make sense.

So let’s see how we can do that. Just to be clear, everything we do in this post will be for finite-dimensional vector spaces over $ \mathbb{R}$, but we’ll highlight the caveats when they come up.

Dual vector spaces

The first step is understanding how to associate a vector with a linear map in a “natural” or “canonical” way. There is one obvious candidate: if you give me a vector $ v$, I can make a linear map by taking the dot product of $ v$ with the input. So I can associate

$ \displaystyle v \mapsto \langle v,- \rangle$

The dash is a placeholder for the input. Another way to say it is to define $ \varphi_v(w) = \langle v,w \rangle$ and say the association takes $ v \mapsto \varphi_v$. So this “association,” taking $ v$ to the inner product, is itself  a mapping from $ V$ to “maps from $ V$ to $ \mathbb{R}$.” Note that $ \varphi_v(w)$ is linear in $ v$ and $ w$ because that’s part of the definition of an inner product.

To avoid saying “maps from $ V$ to $ \mathbb{R}$” all the time, we’ll introduce some notation.

Definition: Let $ V$ be a vector space over a field $ k$. The set of $ k$-linear maps from $ V \to k$ is called $ \textup{Hom}(V,k)$. More generally, the set of $ k$-linear maps from $ V$ to another $ k$-vector space $ W$ is called $ \textup{Hom}(V,W)$.

“Hom” stands for “homomorphism,” and in general it just means “maps with the structure I care about.” For this post $ k = \mathbb{R}$, but most of what we say here will be true for any field. If you go deeply into this topic, it matters whether $ k$ is algebraically closed, or has finite characteristic, but for simplicity we’ll ignore all of that. We’ll also ignore the fact that these maps are called linear functionals and this is where the name “functional analysis” comes from. All we really want to do is understand the definition of the outer product.

Another bit of notation for brevity:

Definition: Let $ V$ be a $ \mathbb{R}$-vector space. The dual vector space for $ V$, denoted $ V^*$, is $ \textup{Hom}(V, \mathbb{R})$.

So the “vector-to-inner-product” association we described above is a map $ V \to V^*$. It takes in $ v \in V$ and spits out $ \varphi_v \in V^*$.

Now here’s where things start to get canonical (interesting). First, $ V^*$ is itself a vector space. This is an easy exercise, and the details are not too important for us, but I’ll say the key: if you want to add two functions, you just add their (real number) outputs. In fact we can say more:

Theorem: $ V$ and $ V^*$ are isomorphic as vector spaces, and the map $ v \mapsto \varphi_v$ is the canonical isomorphism.

Confessions of a mathematician: we’re sweeping some complexity under the rug. When we upgraded our vector space to an inner product space, we fixed a specific (but arbitrary) inner product on $ V$. For finite dimensional vector spaces it makes no difference, because every finite-dimensional $ \mathbb{R}$-inner product space is isomorphic to $ \mathbb{R}^n$ with the usual inner product. But the theorem is devastatingly false for infinite-dimensional vector spaces. There are two reasons: (1) there are many (non-canonical) choices of inner products and (2) the mapping for any given inner product need not span $ V^*$. Luckily we’re in finite dimensions so we can ignore all that. [Edit: see Emilio’s comments for a more detailed discussion of what’s being swept under the rug, and how we’re ignoring the categorical perspective when we say “natural” and “canonical.”]

Before we make sense of the isomorphism let’s talk more about $ V^*$. First off, it’s not even entirely obvious that $ V^*$ is finite-dimensional. On one hand, if $ v_1, \dots, v_n$ is a basis of $ V$ then we can quickly prove that $ \varphi_{v_1}, \dots ,\varphi_{v_n}$ are linearly independent in $ V^*$. Indeed, if they weren’t then there’d be some linear combination $ a_1 \varphi_{v_1} + \dots + a_n \varphi_{v_n}$ that is the zero function, meaning that for every vector $ w$, the following is zero

$ \displaystyle a_1 \langle v_1, w \rangle + \dots + a_n \langle v_n, w \rangle = 0.$

But since the inner product is linear in both arguments we get that $ \langle a_1 v_1 + \dots + a_n v_n , w \rangle = 0$ for every $ w$. And this can only happen when $ a_1v_1 + \dots + a_nv_n$ is the zero vector (prove this).

One consequence is that the linear map $ v \mapsto \varphi_v$ is injective. So we can think of $ V$ as “sitting inside” $ V^*$. Now here’s a very slick way to show that the $ \varphi_{v_i}$ span all of $ V^*$. First we can assume our basis $ v_1, \dots, v_n$ is actually an orthonormal basis with respect to our inner product (this is without loss of generality). Then we write any linear map $ f \in V^*$ as

$ f = f(v_1)\varphi_{v_1} + \dots + f(v_n) \varphi_{v_n}$

To show these two are actually equal, it’s enough to show they agree on a basis for $ V$. That is, if you plug in $ v_1$ to the function on the left- and right-hand side of the above, you’ll get the same thing. The orthonormality of the basis makes it work, since all the irrelevant inner products are zero.
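
For example, plugging $ v_1$ into the right-hand side gives

$ \displaystyle f(v_1)\varphi_{v_1}(v_1) + \dots + f(v_n)\varphi_{v_n}(v_1) = f(v_1)\langle v_1, v_1 \rangle = f(v_1),$

since $ \langle v_i, v_1 \rangle = 0$ for $ i \neq 1$ and $ \langle v_1, v_1 \rangle = 1$.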

In case you missed it, that completes the proof that $ V$ and $ V^*$ are isomorphic. Now when I say that the isomorphism is “canonical,” I mean that if you’re willing to change the basis of $ V$ and $ V^*$, then $ v \mapsto \varphi_v$ is represented by the identity matrix, i.e. the only isomorphism between any two finite-dimensional vector spaces (up to a change of basis).

Tying in tensors

At this point we have a connection between single vectors $ v$ and linear maps in $ V^*$, whose codomain has dimension 1. If we want to understand the outer product, we need a connection between pairs of vectors and matrices, i.e. $ \textup{Hom}(V,V)$. In other words, we’d like to find a canonical isomorphism between $ V \times V$ and $ \textup{Hom}(V,V)$. But already it’s not possible because the spaces have different dimensions. If $ \textup{dim}(V) = n$ then the former has dimension $ 2n$ and the latter has dimension $ n^2$. So any “natural” relation between these spaces has to be a way to embed $ V \times V \subset \textup{Hom}(V,V)$ as a subspace via some injective map.

There are two gaping problems with this approach. First, the outer product is not linear as a map from $ V \times V \to \textup{Hom}(V,V)$. To see this, take any $ v,w \in V$, pick any scalar $ \lambda \in \mathbb{R}$. Scaling the pair $ (v,w)$ means scaling both components to $ (\lambda v, \lambda w)$, and so the outer product is the matrix $ (\lambda v)(\lambda w^T) = \lambda^2 vw^T$.

The second problem is that the only way to make $ V \times V$ a subspace of $ \textup{Hom}(V,V)$ (up to a change of basis) is to map $ v,w$ to the first two rows of a matrix with zeros elsewhere. This is canonical but it doesn’t have the properties that the outer product promises us. Indeed, the outer product lets us uniquely decompose a matrix as a “sum of rank 1 matrices,” but we don’t get a unique decomposition of a matrix as a sum of these two-row things. We also don’t even get a well-defined rank by decomposing into a sum of two-row matrices (you can get cancellation by staggering the sum). This injection is decisively useless.

It would seem like we’re stuck, until we think back to our association between $ V$ and $ V^*$. If we take one of our two vectors, say $ v$, and pair it with $ w$, we can ask how $ (\varphi_v, w)$ could be turned into a linear map in $ \textup{Hom}(V,V)$. A few moments of guessing and one easily discovers the map

$ \displaystyle x \mapsto \varphi_v(x) w = \langle v,x \rangle w$

In words, we’re scaling $ w$ by the inner product of $ x$ and $ v$. In geometric terms, we project onto $ v$ and scale $ w$ by the signed length of that projection. Let’s call this map $ \beta_{v,w}(x)$, so that the association maps $ (v,w) \mapsto \beta_{v,w}$. The thought process of “easily discovering” this is to think, “What can you do with a function $ \varphi_v$ and an input $ x$? Plug it in. Then what can you do with the resulting number and a vector $ w$? Scale $ w$.”

If you look closely you’ll see we’ve just defined the outer product. This is because the outer product works by saying $ uv^T$ is a matrix, which acts on a vector $ x$ by doing $ (uv^T)x = u(v^T x)$. But the important thing is that, because $ V$ and $ V^*$ are canonically isomorphic, this is a mapping

$ \displaystyle V \times V = V^* \times V \to \textup{Hom}(V,V)$

Now again, this mapping is not linear. In fact, it’s bilinear, and if there’s one thing we know about bilinear maps, it’s that tensors are their gatekeepers. If you recall our previous post on tensorphobia, this means that this bilinear map “factors through” the tensor product in a canonical way. So the true heart of this association $ (v,w) \mapsto \beta_{v,w}$ is a map $ B: V \otimes V \to \textup{Hom}(V,V)$ defined by

$ \displaystyle B(v \otimes w) = \beta_{v,w}$

And now the punchline,

Theorem: $ B$ is an isomorphism of vector spaces.

Proof. If $ v_1, \dots, v_n$ is a basis for $ V$ then it’s enough to show that $ \beta_{v_i, v_j}$ forms a basis for $ \textup{Hom}(V,V)$. Since we already know $ \dim(\textup{Hom}(V,V)) = n^2$ and there are $ n^2$ of the $ \beta_{v_i, v_j}$, all we need to do is show that the $ \beta$’s are linearly independent. For brevity let me remove the $ v$’s and call $ \beta_{i,j} =\beta_{v_i, v_j}$.

Suppose they are not linearly independent. Then there is some choice of scalars $ a_{i,j}$ so that the linear combination below is the identically zero function

$ \displaystyle \sum_{i,j=1}^n a_{i,j}\beta_{i,j} = 0$

In other words, if I plug in any $ v_i$ from my (orthonormal) basis, the result is zero. So let’s plug in $ v_1$.

$ \displaystyle \begin{aligned} 0 &= \sum_{i,j=1}^n a_{i,j} \beta_{i,j}(v_1) \\ &= \sum_{i,j=1}^n a_{i,j} \langle v_i, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} \langle v_1, v_1 \rangle v_j \\ &= \sum_{j=1}^n a_{1,j} v_j \end{aligned}$

The orthonormality makes all of the $ \langle v_i ,v_1 \rangle = 0$ when $ i \neq 1$, so we get a linear combination of the $ v_j$ being zero. Since the $ v_i$ form a basis, it must be that all the $ a_{1,j} = 0$. The same thing happens when you plug in $ v_2$ or any other $ v_k$, and so all the $ a_{i,j}$ are zero, proving linear independence.

$ \square$

This theorem immediately implies some deep facts, such as that every matrix can be uniquely decomposed as a sum of the $ \beta_{i,j}$’s. Moreover, facts like the $ \beta_{i,j}$’s being rank 1 are immediate: by definition the maps scale a single vector by some number. So of course the image will be one-dimensional. Finding a useful basis $ \{ v_i \}$ with which to decompose a matrix is where things get truly fascinating, and we’ll see that next time when we study the singular value decomposition.

In the mean time, this understanding generalizes nicely (via induction/recursion) to higher dimensional tensors. And you don’t need to talk about slices or sub-tensors or lose your sanity over $ n$-tuples of indices.

Lastly, all of this duality stuff provides a “coordinate-free” way to think about the transpose of a linear map. We can think of the “transpose” operation as a linear map $ -^T : \textup{Hom}(V,W) \to \textup{Hom}(W^*,V^*)$ which (even in infinite dimensions) has the following definition. If $ f \in \textup{Hom}(V,W)$ then $ f^T$ is a linear map taking $ h \in W^*$ to $ f^T(h) \in V^*$. The latter is a function in $ V^*$, so we need to say what it does on inputs $ v \in V$. The only definition that doesn’t introduce any type errors is $ (f^T(h))(v) = h(f(v))$. A more compact way to say this is that $ f^T(h) = h \circ f$.

Until next time!

Martingales and the Optional Stopping Theorem

This is a guest post by my colleague Adam Lelkes.

The goal of this primer is to introduce an important and beautiful tool from probability theory, a model of fair betting games called martingales. In this post I will assume that the reader is familiar with the basics of probability theory. For those that need to refresh their knowledge, Jeremy’s excellent primers (1, 2) are a good place to start.

The Geometric Distribution and the ABRACADABRA Problem

Before we start playing with martingales, let’s start with an easy exercise. Consider the following experiment: we throw an ordinary die repeatedly until the first time a six appears. How many throws will this take in expectation? The reader might recognize immediately that this exercise can be easily solved using the basic properties of the geometric distribution, which models this experiment exactly. We have independent trials, every trial succeeding with some fixed probability $ p$. If $ X$ denotes the number of trials needed to get the first success, then clearly $ \Pr(X = k) = (1-p)^{k-1} p$ (since first we need $ k-1$ failures which occur independently with probability $ 1-p$, then we need one success which happens with probability $ p$). Thus the expected value of $ X$ is

$ \displaystyle E(X) = \sum_{k=1}^\infty k P(X = k) = \sum_{k=1}^\infty k (1-p)^{k-1} p = \frac1p$

by basic calculus. In particular, if success is defined as getting a six, then $ p=1/6$ thus the expected time is $ 1/p=6$.
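
(For the record, the calculus step is just differentiating the geometric series: since $ \sum_{k=0}^\infty x^k = \frac{1}{1-x}$ for $ |x| < 1$, we get $ \sum_{k=1}^\infty k x^{k-1} = \frac{1}{(1-x)^2}$, and substituting $ x = 1-p$ gives $ E(X) = p \cdot \frac{1}{p^2} = \frac{1}{p}$.)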

Now let us move on to a somewhat similar, but more interesting and difficult problem, the ABRACADABRA problem. Here we need two things for our experiment, a monkey and a typewriter. The monkey is asked to start bashing random keys on a typewriter. For simplicity’s sake, we assume that the typewriter has exactly 26 keys corresponding to the 26 letters of the English alphabet and the monkey hits each key with equal probability. There is a famous theorem in probability, the infinite monkey theorem, that states that given infinite time, our monkey will almost surely type the complete works of William Shakespeare. Unfortunately, according to astronomers the sun will begin to die in a few billion years, and the expected time we need to wait until a monkey types the complete works of William Shakespeare is orders of magnitude longer, so it is not feasible to use monkeys to produce works of literature.

So let’s scale down our goals, and let’s just wait until our monkey types the word ABRACADABRA. What is the expected time we need to wait until this happens? The reader’s first idea might be to use the geometric distribution again. ABRACADABRA is eleven letters long, the probability of getting one letter right is $ \frac{1}{26}$, thus the probability of a random eleven-letter word being ABRACADABRA is exactly $ \left(\frac{1}{26}\right)^{11}$. So if typing 11 letters is one trial, the expected number of trials is

$ \displaystyle \frac1{\left(\frac{1}{26}\right)^{11}}=26^{11}$

which means $ 11\cdot 26^{11}$ keystrokes, right?

Well, not exactly. The problem is that we broke up our random string into eleven-letter blocks and waited until one block was ABRACADABRA. However, this word can start in the middle of a block. In other words, we considered a string a success only if the starting position of the word ABRACADABRA was divisible by 11. For example, FRZUNWRQXKLABRACADABRA would be recognized as success by this model but the same would not be true for AABRACADABRA. However, it is at least clear from this observation that $ 11\cdot 26^{11}$ is a strict upper bound for the expected waiting time. To find the exact solution, we need one very clever idea, which is the following:

Let’s Open a Casino!

Do I mean that abandoning our monkey and typewriter and investing our time and money in a casino is a better idea, at least in financial terms? This might indeed be the case, but here we will use a casino to determine the expected wait time for the ABRACADABRA problem. Unfortunately we won’t make any money along the way (in expectation) since our casino will be a fair one.

Let’s do the following thought experiment: let’s open a casino next to our typewriter. Before each keystroke, a new gambler comes to our casino and bets $1 that the next letter will be A. If he loses, he goes home disappointed. If he wins, he bets all the money he won on the event that the next letter will be B. Again, if he loses, he goes home disappointed. (This won’t wreak havoc on his financial situation, though, as he only loses $1 of his own money.) If he wins again, he bets all the money on the event that the next letter will be R, and so on.

If a gambler wins, how much does he win? We said that the casino would be fair, i.e. the expected outcome should be zero. That means that if the gambler bets $1, he should receive $26 if he wins, since the probability of getting the next letter right is exactly $ \frac{1}{26}$ (thus the expected value of the change in the gambler’s fortune is $ \frac{25}{26}\cdot (-1) + \frac{1}{26}\cdot (+25) = 0$).

Let’s keep playing this game until the word ABRACADABRA first appears and let’s denote the number of keystrokes up to this time as $ T$. As soon as we see this word, we close our casino. How much was the revenue of our casino then? Remember that before each keystroke, a new gambler comes in and bets $1, and if he wins, he will only bet the money he has received so far, so our revenue will be exactly $ T$ dollars.

How much will we have to pay for the winners? Note that the only winners in the last round are the players who bet on A. How many of them are there? There is one that just came in before the last keystroke and this was his first bet. He wins $26. There was one who came three keystrokes earlier and he made four successful bets (ABRA). He wins $ \$26^4$. Finally there is the luckiest gambler who went through the whole ABRACADABRA sequence, his prize will be $ \$26^{11}$. Thus our casino will have to give out $ 26^{11}+26^4+26$ dollars in total, which is just under the price of 200,000 WhatsApp acquisitions.

Now we will make one crucial observation: even at the time when we close the casino, the casino is fair! Thus in expectation our expenses will be equal to our income. Our income is $ T$ dollars, the expected value of our expenses is $ 26^{11}+26^4+26$ dollars, thus $ E(T)=26^{11}+26^4+26$. A beautiful solution, isn’t it? So if our monkey types at 150 characters per minute on average, we will have to wait around 47 million years until we see ABRACADABRA. Oh well.
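
(To sanity check that last figure: $ 26^{11} \approx 3.67 \times 10^{15}$ keystrokes, and at 150 keystrokes per minute that is about $ 2.4 \times 10^{13}$ minutes, or roughly $ 4.7 \times 10^{7}$ years.)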

Time to be More Formal

After giving an intuitive outline of the solution, it is time to formalize the concepts that we used, to translate our fairy tales into mathematics. The mathematical model of the fair casino is called a martingale, named after a class of betting strategies that enjoyed popularity in 18th century France. The gambler’s fortune (or the casino’s, depending on our viewpoint) can be modeled with a sequence of random variables. $ X_0$ will denote the gambler’s fortune before the game starts, $ X_1$ the fortune after one round and so on. Such a sequence of random variables is called a stochastic process. We will require the expected value of the gambler’s fortune to be always finite.

How can we formalize the fairness of the game? Fairness means that the gambler’s fortune does not change in expectation, i.e. the expected value of $ X_n$, given $ X_1, X_2, \ldots, X_{n-1}$ is the same as $ X_{n-1}$. This can be written as $ E(X_n | X_1, X_2, \ldots, X_{n-1}) = X_{n-1}$ or, equivalently, $ E(X_n - X_{n-1} | X_1, X_2, \ldots, X_{n-1}) = 0$.

The reader might be less comfortable with the first formulation. What does it mean, after all, that the conditional expected value of a random variable is another random variable? Shouldn’t the expected value be a number? The answer is that in order to have solid theoretical foundations for the definition of a martingale, we need a more sophisticated notion of conditional expectations. Such sophistication involves measure theory, which is outside the scope of this post. We will instead naively accept the definition above, and the reader can look up all the formal details in any serious probability text (such as [1]).

Clearly the fair casino we constructed for the ABRACADABRA exercise is an example of a martingale. Another example is the simple symmetric random walk on the number line: we start at 0, toss a coin in each step, and move one step in the positive or negative direction based on the outcome of our coin toss.

The Optional Stopping Theorem

Remember that we closed our casino as soon as the word ABRACADABRA appeared and we claimed that our casino was also fair at that time. In mathematical language, the closed casino is called a stopped martingale. The stopped martingale is constructed as follows: we wait until our martingale $ X$ exhibits a certain behaviour (e.g. the word ABRACADABRA is typed by the monkey), and we define a new martingale $ X'$ as follows: let $ X'_n = X_n$ if $ n < T$ and $ X'_n = X_T$ if $ n \ge T$, where $ T$ denotes the stopping time, i.e. the time at which the desired event occurs. Notice that $ T$ itself is a random variable.

We require our stopping time $ T$ to depend only on the past, i.e. that at any time we should be able to decide whether the event that we are waiting for has already happened or not (without looking into the future). This is a very reasonable requirement. If we could look into the future, we could obviously cheat by closing our casino just before some gambler would win a huge prize.

We said that the expected wealth of the casino at the stopping time is the same as the initial wealth. This is guaranteed by Doob’s optional stopping theorem, which states that under certain conditions, the expected value of a martingale at the stopping time is equal to its expected initial value.

Theorem: (Doob’s optional stopping theorem) Let $ X_n$ be a martingale stopped at step $ T$, and suppose one of the following three conditions hold:

  1. The stopping time $ T$ is almost surely bounded by some constant;
  2. The stopping time $ T$ is almost surely finite and every step of the stopped martingale $ X_n$ is almost surely bounded by some constant; or
  3. The expected stopping time $ E(T)$ is finite and the absolute value of the martingale increments $ |X_n-X_{n-1}|$ are almost surely bounded by a constant.

Then $ E(X_T) = E(X_0).$

We omit the proof because it requires measure theory, but the interested reader can see it in these notes.

For applications, (1) and (2) are the trivial cases. In the ABRACADABRA problem, the third condition holds: the expected stopping time is finite (in fact, we showed using the geometric distribution that it is less than $ 26^{12}$) and the absolute value of a martingale increment is either 1 or a net payoff which is bounded by $ 26^{11}+26^4+26$. This shows that our solution is indeed correct.

Gambler’s Ruin

Another famous application of martingales is the gambler’s ruin problem. This problem models the following game: there are two players, the first player has $ a$ dollars, the second player has $ b$ dollars. In each round they toss a coin and the loser gives one dollar to the winner. The game ends when one of the players runs out of money. There are two obvious questions: (1) what is the probability that the first player wins and (2) how long will the game take in expectation?

Let $ X_n$ denote the change in the second player’s fortune, and set $ X_0 = 0$. Let $ T_k$ denote the first time $ s$ when $ X_s = k$. Then our first question can be formalized as trying to determine $ \Pr(T_{-b} < T_a)$. Let $ t = \min \{ T_{-b}, T_a\}$. Clearly $ t$ is a stopping time. By the optional stopping theorem we have that

$ \displaystyle 0=E(X_0)=E(X_t)=-b\Pr(T_{-b} < T_a)+a(1-\Pr(T_{-b} < T_a))$

thus $ \Pr(T_{-b} < T_a)=\frac{a}{a+b}$.

I would like to ask the reader to try to answer the second question. It is a little bit trickier than the first one, though, so here is a hint: $ X_n^2-n$ is also a martingale (prove it), and applying the optional stopping theorem to it leads to the answer.

A Randomized Algorithm for 2-SAT

The reader is probably familiar with 3-SAT, the first problem shown to be NP-complete. Recall that 3-SAT is the following problem: given a boolean formula in conjunctive normal form with at most three literals in each clause, decide whether there is a satisfying truth assignment. It is natural to ask if or why 3 is special, i.e. why don’t we work with $ k$-SAT for some $ k \ne 3$ instead? Clearly the hardness of the problem is monotone increasing in $ k$ since $ k$-SAT is a special case of $ (k+1)$-SAT. On the other hand, SAT (without any bound on the number of literals per clause) is clearly in NP, thus 3-SAT is just as hard as $ k$-SAT for any $ k>3$. So the only question is: what can we say about 2-SAT?

It turns out that 2-SAT is easier than satisfiability in general: 2-SAT is in P. There are many algorithms for solving 2-SAT. Here is one deterministic algorithm: associate a graph to the 2-SAT instance such that there is one vertex for each variable and each negated variable and the literals $ x$ and $ y$ are connected by a directed edge if there is a clause $ (\bar x \lor y)$. Recall that $ \bar x \lor y$ is equivalent to $ x \implies y$, so the edges show the implications between the variables. Clearly the 2-SAT instance is not satisfiable if there is a variable $ x$ such that there are directed paths $ x \to \bar x$ and $ \bar x \to x$ (since $ x \Leftrightarrow \bar x$ is always false). It can be shown that this is not only a sufficient but also a necessary condition for unsatisfiability, hence the 2-SAT instance is satisfiable if and only if there are no such paths. If there are directed paths from one vertex of a graph to another and vice versa then they are said to belong to the same strongly connected component. There are several graph algorithms for finding strongly connected components of directed graphs, the most well-known algorithms are all based on depth-first search.
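
As a minimal sketch of that graph construction (my own encoding, not code from the post): number the literals so that literal $ i$ is the variable $ x_i$ and literal $ i+n$ is its negation, and add the two implication edges for each clause.

// 2-SAT implication graph (sketch). Literal i is x_i, literal i + n is its negation.
#include <utility>
#include <vector>

std::vector<std::vector<int>> buildImplicationGraph(
    int n, const std::vector<std::pair<int, int>> &clauses) {
  auto negate = [n](int lit) { return lit < n ? lit + n : lit - n; };
  std::vector<std::vector<int>> adj(2 * n);
  for (auto [a, b] : clauses) {
    adj[negate(a)].push_back(b);  // clause (a or b) gives the implication (not a) => b
    adj[negate(b)].push_back(a);  // ... and the implication (not b) => a
  }
  return adj;
}
// The instance is unsatisfiable exactly when some x_i and its negation land in
// the same strongly connected component of this graph.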

Now we give a very simple randomized algorithm for 2-SAT (due to Christos Papadimitriou in a ’91 paper): start with an arbitrary truth assignment and while there are unsatisfied clauses, pick one and flip the truth value of a random literal in it. Stop after $ O(n^2)$ rounds where $ n$ denotes the number of variables. Clearly if the formula is not satisfiable then nothing can go wrong, we will never find a satisfying truth assignment. If the formula is satisfiable, we want to argue that with high probability we will find a satisfying truth assignment in $ O(n^2)$ steps.
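
Here is a minimal sketch of that randomized walk (again my own encoding, not code from the post); a literal is a pair of a variable index and the truth value that satisfies it:

// Papadimitriou's randomized 2-SAT walk (sketch).
#include <cstdlib>
#include <optional>
#include <utility>
#include <vector>

using Literal = std::pair<int, bool>;  // (variable index, satisfying value)
using Clause = std::pair<Literal, Literal>;

std::optional<std::vector<bool>> randomWalk2SAT(
    int n, const std::vector<Clause> &clauses) {
  std::vector<bool> assignment(n, false);  // arbitrary starting assignment
  auto satisfied = [&](const Clause &c) {
    return assignment[c.first.first] == c.first.second ||
           assignment[c.second.first] == c.second.second;
  };
  for (long step = 0; step < 2L * n * n; ++step) {  // O(n^2) rounds
    const Clause *unsat = nullptr;
    for (const Clause &c : clauses)
      if (!satisfied(c)) { unsat = &c; break; }
    if (!unsat) return assignment;  // every clause satisfied: done
    // Flip the truth value of a uniformly random literal of the unsatisfied clause.
    const Literal &lit = (std::rand() % 2 == 0) ? unsat->first : unsat->second;
    assignment[lit.first] = !assignment[lit.first];
  }
  return std::nullopt;  // no assignment found; report "probably unsatisfiable"
}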

The idea of the proof is the following: fix an arbitrary satisfying truth assignment and consider the Hamming distance of our current assignment from it. The Hamming distance of two truth assignments (or in general, of two binary vectors) is the number of coordinates in which they differ. Since we flip one bit in every step, this Hamming distance changes by $ \pm 1$ in every round. It is also easy to see that in every step the distance is at least as likely to be decreased as to be increased (since we pick an unsatisfied clause, which means at least one of the two literals in the clause differs in value from the satisfying assignment).

Thus this is an unfair “gambler’s ruin” problem where the gambler’s fortune is the Hamming distance from the solution, and it decreases with probability at least $ \frac{1}{2}$. Such a stochastic process is called a supermartingale — and this is arguably a better model for real-life casinos. (If we flip the inequality, the stochastic process we get is called a submartingale.) Also, in this case the gambler’s fortune (the Hamming distance) cannot increase beyond $ n$. We can also think of this process as a random walk on the set of integers: we start at some number and in each round we make one step to the left or to the right with some probability. If we use random walk terminology, 0 is called an absorbing barrier since we stop the process when we reach 0. The number $ n$, on the other hand, is called a reflecting barrier: we cannot reach $ n+1$, and whenever we get close we always bounce back.

There is an analogous version of the optional stopping theorem for supermartingales and submartingales, where the conditions are the same but the consequence holds with an inequality instead of equality. It follows from the optional stopping theorem that the gambler will be ruined (i.e. a satisfying truth assignment will be found) in $ O(n^2)$ steps with high probability.

[1] For a reference on stochastic processes and martingales, see the text of Durrett.