## Sunday, July 5, 2020

### Fun with Trees

Anyway, so in that spirit, while poking around general algorithms reading, I delved into some CS data structures, binary trees, and got a chance to get some TOTALLY AWESOME visualizations in. It's one of those subjects I thought was completely old hat and elementary. But armed with new analytic tools, I never knew how interesting it could get. My interest, in particular, is in is getting a sense of the variety of tree shapes, and exploring that variety probabilistically (we've done more than one probability post). It completely fits into one of the major threads of this blog, namely, exploring state space. So good that they are worthy of a first blog update in 3 years. It is an excellent example of a space that has a couple of distinct, natural probability distributions (in particular, it can be seen as explorations of two different parametrizations of the space). Plus, throw in a bit about realizing trees as monomials in noncommutative polynomial rings in infinitely many variables... and only after thinking about it that way, I managed to come up with a sensible encoding of the possible shapes of trees that I could use as dictionary keys to verify subsequent calculations with simulation... I hope y'all get my drift about the essential unity of math. Anyway, one shout-out; I got on this thread by poking around the algorithms literature, and I have to say, Prof. Robert Sedgewick's course on analytic combinatorics (and generally, the approach of analyzing algorithms by way of real/complex-analytic asymptotics) kicks super ass (or as the young whippersnappers these days say: it is totally sick, but in the time of The Rona, probably not the best terminology...)

So we begin with binary trees. What they are, how to count them, one of its most common uses in CS, and how the theory of permutations that surrounds this common use causes a nontrivial difference in probability distributions, from which we may obtain different definitions for what it means to be "a random tree". So what's a binary tree? Actually, surprisingly, this is also not a 100% transparent and straightforward answer, because different texts define things differently, and there are subtle variations. Sometimes it's not apparent what the difference in definitions can be. At its most basic, and the canonical introductory CS book definition, is that a binary tree is either empty, or a root node together with a left subtree and a right subtree (recursively defined). This distinction between left and right is what forces it to be different from a rooted acyclic graph, or algebraic topologists' simply connected graph, or, perhaps surprisingly, even an ordered tree with at most 2 children per node. Here's a couple of examples describing such subtleties: These two rooted trees are the same (isomorphic) as rooted trees, but are different as ordered rooted trees, because the left-right order matters in that case. These trees are isomorphic as ordered rooted trees, because there are no other nodes in each level, but not isomorphic as binary trees.

The reason for the last one is that a tree with two lefts is distinct from a tree with two rights, whereas, they'd be identical as ordered trees. But, if one designates that all nodes must have either two or zero children, and we call the ones with zero children "external", and the others "internal", then the tree that we care about is contained in the internal nodes, and it is sufficient to say they are ordered trees (because the left-right distinction is now realized as which node is getting the two terminal children). This actually is close to what is implemented in practice, because external nodes can be considered to be the null pointers that signify the lack of a child in that direction. Here's a picture of what we mean, adding external nodes to the two above distinct trees (the unfilled circles correspond to external nodes): After adding the external nodes (in cyan with empty holes) to the two binary trees above, they are no longer isomorphic as ordered rooted trees, because now there's more than one node on a given level, so there's now a choice of where to attach subtrees. The external nodes are attached differently for the left-leaning vs. right-leaning trees. Again, in CS terms, you should think of the cyan nodes being null.
How do we get a sense of what varieties of binary trees exist? First on the todo-list is to count them. Actually, it may be because of trees that one of the most famous sequences of numbers in combinatorics, the Catalan numbers, is so famous—many things can be modeled using binary trees, so that naturally leads to those things being counted using Catalan numbers $c_n = 1, 1, 2, 5, 14, 42, 132,...$. They are defined by the recurrence relation $c_{n+1} = \sum_{k=0}^n c_k c_{n-k}$ with $c_0=1$. Using the "internal node" definition, there are $c_n$ binary trees that have $n$ internal nodes. This can be seen by showing that it satisfied that recurrence relation with the same initial conditions. Let's temporarily write $T_n$ for the number of trees of size $n$. A single external node (root node being null, in the CS terminology) corresponds to $n=0$, and it is obviously only one tree; similarly, a single internal node (root node having no children) is $n=1$, also only one tree, so $T_0=1$ and $T_1=1$. For the general case, we use the recursive definition: a tree is a root node plus two subtrees. Given a tree of size $n+1$, the root counts as one, and for each $k$, $0 \leq k \leq n$, we can consider a left subtree of size $k$, and the right subtree has to be of size $n-k$. Since there are $T_k$ possibilities for the left subtree, and $T_{n-k}$ possibilities for the right subtree, the total number of possibilities of trees with left subtree of size $k$ is $T_k T_{n-k}$. Adding up over all possible $k$ makes $T_{n+1} = \sum_{k=0}^{n} T_k T_{n-k}$, exactly the Catalan recurrence relation. Using a lot of algebraic trickery (or generating functions, which are also totally awesome, even though they really are not a visual thing), we can derive an explicit formula and asymptotic expansion
$c_n = \frac{1}{n+1} {2n \choose n} \sim \frac{4^{n}}{\sqrt{\pi n^3}}$ (yes, that $\pi$, another completely random connection; who knew you'd find $\pi$ studying trees??). As $n\to\infty$, the asymptotic formula comes closer and closer to approximating the number of nodes in a tree (it's related to Stirling's formula, which we've encountered in our Stat Mech post).

Ok, now that we got that count, the next natural thing to do is to explore the variety of trees that exist. One natural way of constructing trees is to consider binary search trees, often the first example of a tree given in elementary CS. Every node stores a key, keys are comparable, and smaller keys go into the left subtree and larger keys go into the right subtree. Keys are inserted by recursively traversing until a suitable null node is found and creating a new node there. Random input is modeled by considering random permutations. Say, an input of size $n$ is a random permutation of $0, 1, \dots, n-1$. Inserting in that order creates some tree. Now, there are $n!$ random permutations of $0, 1, \dots, n -1$, but $c_n$ trees, and $n!$ grows much faster than $c_n$ ($c_n$ is bounded above by an exponential $4^n$, whereas $n!$ grows like $\sqrt{2\pi n}(n/e)^n$, Stirling's approximation), so clearly, there must be distinct permutations that create the same tree, and they don't do so uniformly. In fact, as it turns out, random permutations heavily bias towards forming balanced trees, trees with roughly equal-sized left and right subtrees for a given root. This is, in practice, a very good thing, because balanced trees also exhibit very good performance for actual searching, and it's nice to know that typical cases that occur in practice in fact are also the best performing. Visualizing trees after a random permutation (here, $n=800$) is what created this post's title image (the tree nodes on each level are centered). Here's a couple more:

The wide middle seems to be characteristic of what balanced trees look like. They look like nice lanterns, or maybe acorns. It turns out that we can precisely quantify (using again, a recurrence relation) how many permutations will correspond to a given tree shape. It is also a very interesting decomposition. The first element of a permutation (represented as a scrambled array of $0, 1, 2, \dots. n-1$) corresponds to the root node, and all the elements less than it are in the left subtree, and all the elements greater than it are in the right subtree. Now the key observation here is: if you shuffle the permutation corresponding two subtrees (now, perhaps, visualizing the root node as the top card in a deck of cards and the two subtrees as two piles of the remaining cards), namely, interleave them with one another in an arbitrary way, but keep the same ordering within each group, this will lead to the same tree. This is because each insertion of a node greater than the root affects only the right subtree and completely ignores the left subtree, and vice versa. For an example, the permutation $[3, 1, 4, 6, 0, 2, 5]$, the root is $3$, and the left subtree consists of nodes $[1, 0, 2]$, and the right subtree corresponds to $[4, 6, 5]$. Then a shuffle is scanning both those lists from left to right and picking from one or the other in an arbitrary way, so long as the order for the left remains $1$, then possibly a bunch of things from the other tree, then $0$, then stuff, then $2$, and similarly for the right, $4$ then stuff then $6$ then stuff then $5$. So $[3, 4, 6, 1, 5, 0, 2]$, $[3, 4, 1, 6, 5, 0, 2]$, $[3, 1, 0, 4, 2, 6, 5]$ are examples of shuffles that lead to the same tree, but $[3, 4, 0, 6, 5, 1, 2]$ does not (because the $1$ and $0$ are no longer in the same order). It's possible to count the replications; namely a shuffle of $k$ elements is choosing $k$ positions among the $n-1$ leftover spots (since the root is always going to take the first spot). So the total number of replications is ${n-1 \choose k}$. This and we have to multiply by the replications of the left and right subtrees as well.

Anyway, given this, what does this mean for random trees? The more balanced the trees are, the larger the number of shuffles; for an exact split, $n-1 = 2k$, and thus the number of shuffles is ${2k \choose k}$. The middle binomial coefficient (or the middle of two, if odd) is always the biggest one. So most random permutations, on average, will probably land in this huge space of possibilities. Ok, but what if we want to sample trees randomly so that every tree of size $n$ has equal probability? We can do this if we skew the probabilities involved in a random permutation. Basically, the key is to weight extreme values more, to purposely make it more likely to form unbalanced trees. What weight to choose? Since the number of $(n+1)$-node trees with a left subtree of size $k$ is $c_k c_{n-k}$, to get an even representation of that tree, such a tree should appear with probability $c_k c_{n-k}/c_{n+1}$. In other words, we should choose our root node to have value $k$ with probability $c_k c_{n-k}/c_{n+1}$, rather than the ordinary $1/(n+1)$. Once the root node is chosen, we can recursively get a random subtree for the left and right (both skewed in the same way). Finally, we shuffle the remaining. Now let's see what trees we get with this algorithm (using the same technique of generating a permutation, but according to this skewed way, and inserting it into a tree):

Very different! They look a lot more like Christmas trees. 'Tis the season! (Har, it's about as far from the Christmas season as we're gonna get...). Anyway, there's loads of other fun I had with this, playing around in a SageMath notebook (beware, it's not super cleaned up). Also SageMath has taken the place of Mathematica as my main symbolic math package of choice. I'll leave you with two additional pictures. One of them is an interesting formula that enumerates trees using a polynomial ring with noncommuting variables $x_i$ (the subscript denotes the level the node is in the tree, and distinct occurrences of a variable correspond to distinct nodes; the power of $z$ is how many nodes in the tree, and substituting $1$ for all the $x_i$ gives the generating function of the Catalan numbers). Truly a non geometric visualization, but pretty to look at anyway (or, maybe terrifying; I could print it on a poster to make sure people keep social distancing)!

and finally trees visualized a different way, with inorder traversal as opposed to just centering each level: