Understanding Union Find from Robert Endre Tarjan

14 minute read Published: 2024-04-22

Union-Find (or Disjoint Set) is a data structure used to manage disjoint sets of elements through two primary operations: find(x) and union(A, B). The find(x) operation determines which set the element $x$ belongs to, while the union(A, B, C) operation merges the sets $A$ and $B$ into a single set $C$. In this article, we will explore Robert Endre Tarjan’s work on determining the upper bound of the time complexity for these Union-Find operations.

Table of Contents

Who this article is for:

Who this article is NOT for:

Note:

The idea discussed in this article is a slight variation of the algorithm presented in the paper “Efficiency of a Good But Not Linear Set Union Algorithm” by Robert Endre Tarjan. I highly recommend reading the original paper in the reference section.

In the paper, the size heuristic (also known as union by size) is used. However, in this article, we explore another variation known as "union by depth" of the trees.


Introduction

Union Find (or Disjoint Set) is a data structure to manipulate disjoint set of elements using the following operations.

  1. find(x): This function is used to determine which set the element $x$ belongs to. It locates the node containing $x$ and follows pointers up to the root of the corresponding tree to identify the name of the set. The collapsing rule may be applied to flatten the tree structure, optimizing future queries.

  2. union(A, B, C): This function is used to combine the sets $A$ and $B$ into a new set named $C$. It locates the roots of the sets $A$ and $B$, makes one root a child of the other, and names the new root as $C$. The choice of which root becomes the child can be made either randomly or by applying the weighted union rule, where the smaller tree is attached under the larger tree to keep the resulting structure balanced.

Collapsing Rule (aka. path compression)

After a find operation, make all the vertices traversed during the operation direct children of the root of the tree. (Refer to below figure)

Your browser doesn't support the img tag, which I use for images
Fig 1:Collapsing rule

Weighted Union Rule ("union by depth" for the context of this article):

If depth of the tree formed by set A is greater than that of set B, make B son of A. Otherwise make A son of B.

Your browser doesn't support the img tag, which I use for images
Fig 2: Weighted union rule

Note: The depth of the tree rooted at r1 is less than that of the tree rooted at r2. Therefore, r1 is made a child of r2. In the original paper, the size-based variation was discussed.

The Claim

Suppose we are given $n$ elements, and we want to perform $m \geq n$ find operations and $n-1$ union operations. Let $t(m, n)$ denote the maximum time required for a sequence of $m$ intermixed find operations and $n$ union operations.

An Upper Bound (Proof)

Let’s build a tree (called $T$) by performing all $n-1$ union operations first. Assign a rank to each vertex, equal to the height of the tree. This rank will remain unchanged in the future. (Refer to the figure below.)

Your browser doesn't support the img tag, which I use for images
Fig 3: Building a tree using all the union operations.

Let the given sequence of operations are as follows:

1union(N2, N5)
2union(N3, N6)
3union(N6, N7)
4union(N6, N8)
5find(N8) <-- We are here
6union(N1, N2)
7union(N1, N3)
8union(N2, N4)
9... // other find operations

To compute the find operation at line number 5 using tree $T$, we follow the path from node $N_8$ to its furthest ancestor, but only using the edges formed by union operations prior to this find operation (operations from line 1 to 4). We apply the collapsing rule to flatten the tree during this process. Let’s call this a partial-find. (Refer to the figure below.)

Your browser doesn't support the img tag, which I use for images
Fig 4: Edges adjust using collapsing rule

Note: To get an upper bound on $t(m, n)$ by only using collapsing rule, we just need to bound sum of path lengths of all such partial findings on the tree $T$.

Using only Collapsing Rule

Let’s attempt to bound the number of edges traversed during all partial find operations on the tree $T$. To do this, we define a few classes containing sequences of positive integers for some positive constants $b$ and $z$, as follows:

Classnum1num2num3num4...
Class 0$0$$1$$2$$3$...
Class 1$0$$b$$2b$$3b$...
Class 2$0$$b^2$$2b^2$$3b^2$...
...$..$$..$$..$$..$...
Class i$0$$b^i$$2b^i$$3b^i$...
...$..$$..$$..$$..$...
Class z$0$$b^z$$2b^z$$3b^z$...

Let’s assign classes to the edges $x \rightarrow y$ using following rules.

  1. Choose the minimum class number $i \in {1, 2, .., z}$, if $rank(x)$ and $rank(y)$ can be fit in between two consecutive values.
  2. If some edges can not be put to any class from rule 1, put them in class $z+1$.

For all the edges, we encounter during the $m$ partial-find, can be divided into following three categories. (refer figure below)

  1. Final edge from each class. (blue colour)
  2. Non-final edge belongs to class 0 till class z. (black colour)
  3. Non-final edge belongs to class z+1. (red colour)
Your browser doesn't support the img tag, which I use for images
Fig 5: Various class of edges, after $m$ partial-find. $c(x)$ means, edge belongs to class $x$.

Bounding the edges

  1. Final edges. (Let’s denote the count as $c_1$)

    For each partial-find, we can encounter at most $z + 1$ final edges, since we only have $z + 1$ classes. Therefore, for $m$ partial-find operations, the total number of final edges is bounded by $(z + 1) \cdot m$.

  2. Non-final edges belong to class 0 till class z. (Let’s denote the count as $c_2$)

    Each such edge comes from a vertex (bounded by $n$). Let’s say such an edge $x \rightarrow y$ coming out of vertex $x$, belongs to two consecutive value, $p$ and $q$, of class i (represented as two red circle, under column class i of Figure 6). Since this is a non final edge in the partial-find, there could be another edge $u \rightarrow v$, belongs to class i, in between $p$ and $q$. After applying the collapsing rule, $x \rightarrow y'$ (the green edge) in Figure 6 is formed and $x \rightarrow y$ is removed. With this operation, the class of $x \rightarrow y'$ either remains the same as $x \rightarrow y$ or increases, since the rank difference (between $y'$ and $x$) can only increase due to the collapsing rule.

    Let’s see the case where the class stays the same. In Figure 6, $x \rightarrow y$ and $u \rightarrow v$ present between two consecutive values of class i. After applying collapsing rule finally $x$ will be attached to $y’$ via the green edge. Let’s say after applying the rule, the class of the edge $x \rightarrow y'$ which is coming out of $x$ does not change. Since $u \rightarrow v$ and $x \rightarrow y$ are present in class $i$, they could not have been assigned to class $i-1$ by the definition of how classes are assigned to edges. This implies that there must be at least one value from class $i-1$ between $r(u)$ and $r(v)$, as well as between $r(x)$ and $r(y)$, as illustrated in Figure 6 (red circle under class $i-1$). Hence we have at least two values from class $i-1$ in between the ranks of $x$ and $y'$.

    Your browser doesn't support the img tag, which I use for images
    Fig 6: Applying collapsing rule. Red circle being numbers from the table for the class. Green edge was not present in original $m$ partial-find, but is formed after applying collapsing rule.

    Observe that between two consecutive values of class $i$, there can be at most $b-1$ values in class $i-1$, based on the way we have defined the sequence number for each class. After applying collapsing rule one time to the edge coming from $x$, we have at least two values from class $i-1$ in between $x \rightarrow y'$. So after applying such operation $b$ times, the edge have to move to the next class. We can move at most $z$ times up in the class, so at most $b \cdot z$ times edge traversal for one vertex. For $n$ vertices, it would be $n \cdot b \cdot z$.

  3. Non final edge belongs to class $z+1$. (Let’s denote the count as $c_3$)

    Let’s calculate total number of values in class $z$, where the value is less than or equal to the max rank $n$. Total number of such values is $n/b^z$

    $$ k.b^z \leq n \Rightarrow k \leq n/b^z $$

    Therefore, for an edge from a vertex in class $z+1$, we can apply the collapsing rule at most $n/b^z + 1$ times. Since, there are $n$ vertices, total number of such application would be $(n/b^z + 1) \cdot n$ which equals to $n^2/b^z + n$.

Total number of edges traversed (equation-1) $$ = c_1 + c_2 + c_3 $$ $$ = (z+1)m + bzn + \frac{n^2}{b^z} + n $$

We can pick suitable value for $b$ and $z$ to minimise the number of edge traversed. Let’s pick the following values.

$$ \begin{align*} &b = \frac{2m}{n} \ &z = max \Big(1, \frac{log(n^2/m)}{log(b)}\Big) \end{align*} $$

By putting the values of $b$ and $z$ in equation-1, we get the following result, for a suitable constant $k$. $$ (z+1)m + bzn + \frac{n^2}{b^z} + n $$ $$ = 3mz + m + n + \frac{n^2}{b^z} $$ $$ \leq kmz + n^2 / max(b, b^{{log(n^2/m)}/logb}) $$ $$ \leq kmz + n^2 / max(b, 2^{log(n^2/m)}) $$ $$ \leq kmz + n^2 / max(2m/n, n^2/m) $$ $$ \leq kmz $$ $$ \leq k \cdot m \cdot max\Big(1, \frac{log(n^2/m)}{log(2m/n)}\Big) $$

Using weighted union rule with collapsing rule

We will again build the tree similar to what we have discussed here, but now we will use weighted union rule, while building the tree. Observe that, If $rank(x) = k$, then number of vertices in the tree rooted at $x$ is at least $2^k$. This implies there are no more than $n/2^k$ vertices of rank $k$ in the tree.

Let’s redefine the classes (for the below table, $i$ represents the class $i$). Instead of using simple geometric progression as we did previously, we will use a variation of Ackermann function.

The Ackermann function $A(i, j)$ is defined as follows.

1A(0, x) = 2x
2A(i, 0) = 0 for i≥1
3A(i,2) = 2 for i ≥ 1
4A(i, x) = A(i-1, A(i, x-1)) for i ≥ 1, x ≥ 2
i\x01234...x
00$2$$4$$6$$8$...$2x$
10$2$$4$$8$$16$...$2^x$
20$2$$4$$16$$2^{16}$...$2^{2^{2^{...}}}$ x-times
30$2$$4$$2^{16}$$2^{2^{2^{...}}}$ $2^{16}$-times...
40$2$$4$$2^{2^{2^{...}}}$ $2^{16}$-times......
....................

Let’s take the first $z$ sequence from the above table and try to assign the class to the edges in the same way we did when only applying the collapsing rule.

Bounding the edges

To calculate the total time complexity, we will use the same approach as the previous edge bounding technique.

  1. Final edges (Let’s denote the count as $c_4$)

    Since there are at most $z+1$ final classes, there are atmost $z+1$ final edges present. For $m$ queries, $c_4$ is bound by $(z+1) \cdot m$.

  2. Non-final edges belong to class 0 till class z (Let’s denote the count as $c_5$)

    We will follow the same argument as before. But, since the sequence of the table has changed, the number of element in the previous class (class $i-1$) has also changed. Observe that, now between two values of class $i$, let’s say $A(i, j)$ and $A(i, j+1)$, there are at least $A(i, j)$ values from the class $i-1$.

    The number of node $x$, whose rank $r(x)$ lies between $A(i, j)$ and $A(i, j+1)$ (Remember that, rank are related to height of the tree)

    $$ \leq \sum_{r=A(i, j)}^{A(i, j+1)} \frac{n}{2^r} \leq \sum_{r=A(i, j)}^{\infty} \frac{n}{2^r} \leq \frac{2n}{2^{A(i, j)}} $$

    For each nodes from class $i$, we can perform at most $A(i, j) - 1$ operations while keeping the edge in the same class, since there are at most $A(i, j)$ values in class $i-1$. So for all the nodes in class $i$, we can bound the operation by the following summation to move all the nodes from $i^{th}$ class to ${i+1}^{th}$ class. $$ \sum_{j=0}^{\infty} \Big( \frac{2n}{2^{A(i, j)}} \cdot A(i, j) \Big) $$

    To move all the edges from class $1$ to $z+1$, we need to perform this operation $z$ times. Hence all such operations can be bound by the following summation for some constant $k$ and $k'$.

    $$ \sum_{i=0}^{z} \Big( \sum_{j=0}^{\infty} \frac{2n \cdot A(i, j)}{2^{A(i, j)}} \Big) $$

    $$ = 2n \sum_{i=0}^{z} \sum_{j=0}^{\infty} \frac{ A(i, j)}{2^{A(i, j)}} $$

    $$ \leq 2n \sum_{i=0}^{z} k' $$

    $$ = 2 \cdot k' \cdot z \cdot n $$

    $$ \leq k \cdot z \cdot n $$ $$ = O(z \cdot n) $$

  3. Non final edge belongs to class z+1 (Let’s denote the count as $c_6$)

    The total number of operations we can perform on class $z+1$ can be bound by bounding the number of elements in class $z$.

    The max rank in the tree is bound by $log(n)$, since the height of the tree can be at most $log(n)$ by following the weighted union rule.

    Let’s define inverse to the above Ackermann function $\alpha(m, n)$

    $$ \alpha(m,n) = min\{i \geq 1 | A(i, m/n) > log_2(n)\} $$

    Max rank in the class $z$ is $log(n)$. Now if we choose $z=\alpha(m,n)$, then we can observe that we can have at most $m/n$ element in the class $z$. This is obvious from the way we define the inverse ackermann function. So for $n$ nodes, we can do at most $n \cdot \frac{m}{n} = m$ operations.

Total time, by considering $z = \alpha(m, n)$ is (for some constant $k$)

$$ = c_4 + c_5 + c_6 $$ $$ \leq (z+1) \cdot m + z \cdot n + m $$ $$ \leq k \cdot m \cdot z $$ $$ \leq k \cdot m \cdot \alpha(m, n) $$

Therefore, the total time is bound by $O(m \cdot \alpha(m, n))$. Hence amortized time bound per operation is $O(\alpha(m, n))$. And since $A(3, 4)$ is such a huge number, as we see from the above Ackermann function calculation table, for all practical purpose $\alpha(m, n) \leq 3$.


Reference / Credits

  1. [Research Paper] Efficiency of a Good But Not Linear Set Union Algorithm
  2. [Youtube] A&DS. Time Complexity of Union-Find (inverse Ackermann function)