Group Actions and Hashing Unordered Multisets

I learned of a neat result due to Kevin Ventullo that uses group actions to study the structure of hash functions for unordered sets and multisets. This piqued my interest because a while back a colleague asked me if I could think of any applications of “pure” group theory to practical computer programming that were not cryptographic in nature. He meant, not including rings, fields, or vector spaces whose definitions happen to be groups when you forget the extra structure. I couldn’t think of any at the time, and years later Kevin has provided me with a tidy example. I took the liberty of rephrasing his argument to be a bit simpler (I hope), but I encourage you to read Kevin’s post to see the reasoning in its original form.

Hashes are useful in programming, and occasionally one wants to hash an unordered set or multiset in such a way that the hash does not depend on the order the elements were added to the set. Since collection types are usually generic, one often has to craft a hash function for a set<T> or multiset<T> that relies on a hash function hash(t) of an unknown type T. To make things more efficient, rather than requiring one to iterate over the entire set each time a hash is computed, one might seek out a hash function that can be incrementally updated as new elements are added, and provably does not depend on the order of insertion.

For example, having a starting hash of zero, and adding hashes of elements as they are added (modulo 2^{64}) has this incremental order-ignorant property, because addition is commutative and sums can be grouped. XOR-ing the bits of the hashes is similar. However, both of these strategies have downsides.

For example, if you adopt the XOR strategy for a multiset hash, then any element that has an even quantity in the multiset will be the same as if it were not in the set at all (or if it were in the set with some other even quantity). This is because x XOR x == 0. On the other hand, if you use the addition approach, if an element hashes to zero, its inclusion in any set has no effect on the hash. In Java the integer hash is the identity, so zero would be undetectable as a member of a multiset of ints in any quantity. Less drastically, a multiset with all even counts of elements will always hash to a multiple of 2, and this makes it easier to find hash collisions.

A natural question arises: given the constraint that the hash function is accumulative and commutative, can we avoid such degenerate situations? In principle the answer should obviously be no, just by counting. I.e., the set of all unordered sets of 64-bit integers has size 2^{2^{64}}, while the set of 64-bit hashes has size merely 2^{64}. You will have many many hash collisions, and would need a much longer hash to avoid them in principle.

More than just “no,” we can characterize the structure of such hash functions. They impose an abelian group structure on the set of hashes. And due to the classification theorem of finite abelian groups, up to isomorphism (and for 64-bit hashes) that structure consists of addition on blocks of bits with various power-of-2 moduli, and the blocks are XOR’ed together at the end.

To give more detail, we need to review some group theory, write down the formal properties of an accumulative hash function, and prove the structure theorem.

Group actions, a reminder

This post will assume basic familiarity with group theory as covered previously on this blog. Basically, this introductory post defining groups and actions, and this followup post describing the classification theorem for commutative (abelian) groups. But I will quickly review group actions since they take center stage today.

A group G defines some abstract notion of symmetries in a way that’s invertible. But a group is really meaningless by itself. They’re only useful when they “act” upon a set. For example, a group of symmetries of the square acts upon the square to actually rotate its points. When you have multiple group structures to consider, it makes sense to more formally separate the group structure from the set.

So a group action is formally a triple of a group G, a set X, and a homomorphism f:G \to S_X, where S_X is the permutation group (or symmetry group) of X, i.e., the set of all bijections X \to X. The permutation group of a set encompasses every possible group that can act on X. In other words, every group is a subgroup of a permutation group. In our case, G and f define a subgroup of symmetries on X via the image of f. If f is not injective, some of the structure in G is lost. The homomorphism determines which parts of G are kept and how they show up in the codomain. The first isomorphism theorem says how: G / \textup{ker} f \cong \textup{im} f.

This relates to our hashing topic because an accumulative hash—and a nicely behaved hash, as we’ll make formal shortly—creates a group action on the set of possible hash values. The image of that group action is the “group structure” imposed by the hash function, and the accumulation function defines the group operation in that setting.

Multisets as a group, and nice hash functions

An appropriate generalization of multisets whose elements come from a base set X forms a group. This generalization views a multiset as a “counting function” T: X \to \mathbb{Z}. The empty set is the function that assigns zero to each element. A positive value of k implies the entry shows up in the multiset k times. And a negative value is like membership “debt,” which is how we represent inverses, or equivalently set difference operations. The inverse of a multiset T, denoted -T, is the multiset with all counts negated elementwise. Since integer-valued functions generally form a group under point-wise addition, these multisets also do. Call this group \textup{MSet}(X). We will freely use the suggestive notation T \cup \{ x \} to denote the addition of T and the function that is 1 on x and 0 elsewhere. Similarly for T - \{ x \}.

\textup{MSet}(X) is isomorphic to the free abelian group on X (because an instance of a multiset only has finitely many members). Now we can define a hash function with three pieces:

  • An arbitrary base hash function \textup{hash}: X \to \mathbb{Z} / 2^n \mathbb{Z}.
  • An arbitrary hash accumulator \textup{h}: \mathbb{Z} / 2^n \mathbb{Z} \times \mathbb{Z} / 2^n \mathbb{Z} \to \mathbb{Z} / 2^n \mathbb{Z}
  • A seed, i.e., a choice for the hash of the empty multiset s \in \mathbb{Z} / 2^n \mathbb{Z}

With these three data we want to define a multiset hash function h^*: \textup{MSet}(X) \to \mathbb{Z} / 2^n \mathbb{Z} recursively as

  • h^*(\left \{ \right \}) = s
  • h^*(T \cup \left \{ x \right \}) = h(h^*(T), \textup{hash}(x))
  • h^*(T - \left \{ x \right \}) = \dots

In order for the second bullet to lead to a well-defined hash, we need the property that the accumulation order of individual elements does not matter. Call a hash accumulator commutative if, for all a, b, c \in \mathbb{Z} / 2^n \mathbb{Z},

\displaystyle h(h(a,b), c) = h(h(a,c), b)

This extends naturally to being able to reorder any sequence of hashes being accumulated.

The third is a bit more complicated. We need to be able to use the accumulator to “de-accumulate” the hash of an element x, even when the set that gave rise to that hash didn’t have x in it to start.

Call a hash accumulator invertible if for a fixed hash z = \textup{hash}(x), the map a \mapsto h(a, z) is a bijection. That is, accumulating the hash z to two sets with different hashes under h^* will not cause a hash collision. This defines the existence of an inverse map (even if it’s not easily computable). This allows us to finish the third bullet point.

  • Fix z = \textup{hash}(x) and let g be the inverse of the map a \mapsto h(a, z). Then h^*(T - \left \{ x \right \}) = g(h^*(T))

Though we don’t have a specific definition for the inverse above, we don’t need it because we’re just aiming to characterize the structure of this soon-to-emerge group action. Though, in all likelihood, if you implement a hash for a multiset, it should support incremental hash updates when removing elements, and that formula would apply here.

This gives us the well-definition of h^*. Commutativity allows us to define h^*(T \cup S) by decomposing S arbitrarily into its constituent elements (with multiplicity), and applying h^*(T \cup \{ x \}) or h^*(T - \{ x \}) in any order.

A group action emerges

Now we have a well-defined hash function on a free abelian group.

\displaystyle h^*: \textup{MSet}(X) \to \mathbb{Z} / 2^n \mathbb{Z}

However, h^* is not a homomorphism. There’s no reason hash accumulation needs to mesh well with addition of hashes. Instead, the family of operations “combine a hash with the hash of some fixed set” defines a group action on the set of hashes. Let’s suppose for simplicity that h^* is surjective, i.e., that every hash value can be achieved as the hash of some set. Kevin’s post gives the more nuanced case when this fails, and in that case you work within S_{\textup{im}(h^*)} instead of all of S_{\mathbb{Z} / 2^n \mathbb{Z}}.

The group action is defined formally as a homomorphism

\displaystyle \varphi : \textup{MSet}(X) \to S_{\mathbb{Z} / 2^n \mathbb{Z}}

where \varphi(T) is the permutation a \mapsto h(a, h^*(T)). Equivalently, we start from a, pick some set S with h^*(S) = a, and output h^*(T \cup S).

The map \varphi is a homomorphism. The composition of two accumulations is order-independent because h is commutative. This is how we view h as “the binary operation” in \textup{im} \varphi, because combining two permutations a \mapsto h(a, h^*(T)) and a \mapsto h(a, h^*(S)) is the permutation latex a \mapsto h(a, h^*(S \cup T)).

And now we can apply the first isomorphism theorem, that

\displaystyle \textup{MSet}(X) / \textup{ker} \varphi \cong \textup{im} \varphi \subset S_{\mathbb{Z} / 2^n \mathbb{Z}}

This is significant because any quotient of an abelian group is abelian, and this quotient is finite because S_{\mathbb{Z} / 2^n \mathbb{Z}} is finite. This means that the group \textup{im} \varphi is isomorphic to

\displaystyle \textup{im} \varphi \cong \bigoplus_{i=1}^k \mathbb{Z}/2^{n_i} \mathbb{Z}

where n = \sum_i n_i, and where the operation in each component is the usual addition modulo n_i. The i-th summand corresponds to a block of n_i bits of the hash, and within that block the operation is addition modulo 2^{n_i}. Here the “block” structure is where XOR comes in. Each block can be viewed as a bitmask with zeros outside the block, and two members are XOR’ed together, which allows the operations to apply to each block independently.

For example, the group might be \mathbb{Z} / 2^{4} \mathbb{Z} \times \mathbb{Z} / 2^{26} \mathbb{Z}\times \mathbb{Z} / 2^{2} \mathbb{Z} for a 32-bit hash. The first block corresponds to 32-bit unsigned integers whose top 4 bits may be set but all other bits are zero. Addition is done within those four bits modulo 16, leaving the other bits unchanged. Likewise, the second component has the top four bits zero and the bottom two bits zero, but the remaining 26 bits are summed mod 2^{24}. XOR combines the bits from different blocks.

In one extreme case, you only have one block, and your group is just \mathbb{Z} / 2^n \mathbb{Z} and the usual addition combines hashes. In the other extreme, each bit is its own block, your group is (\mathbb{Z} / 2 \mathbb{Z})^n, the operation is a bitwise XOR.

Note, if instead of 2^n we used a hash of some other length m, then in the direct sum decomposition above, m would be the product of the sizes of the components. The choice m = 2^n maximizes the number of different structures you can have.

Implications for hash function designers

Here’s the takeaway.

First, if you’re trying to design a hash function that avoids the degeneracies mentioned at the beginning of this article, then it will have to break one of the properties listed. This could happen, say, by maintaining additional state.

Second, if you’re resigned to use a commutative, invertible, accumulative hash, then you might as well make this forced structure explicit, and just pick the block structure you want to use in advance. Since no clever bit shifting will allow you to outrun this theorem, you might as well make it simple.

Until next time!