## Projections of the Mahalanobis norm

In this post $\langle \cdot,\cdot \rangle$ is the standard inner product on $\mathbb{R}^n$ for some $n \geq 2$ and $A$ is a fixed symmetric positive definite operator. The squared Mahalanobis norm $m$ of a vector $v \in \mathbb{R}^n$ for this operator is defined by

$\displaystyle m(v) = \langle v, A^{-1}v \rangle$.

The results in this post were found while looking for ways to approximate this Mahalanobis norm without the need to invert $A$. (Later I realised that using the Cholesky factorisation suited me better. Nice results can be found by looking in the wrong places!) The idea is to use projections onto some lower-dimensional subspace to get estimates of the actual Mahalanobis norm. To be precise, let $V \subseteq \mathbb{R}^n$ be a subspace of dimension $1 \leq d \leq n$ and let $\pi_V$ be the orthogonal projection onto $V$. The operator $\pi_V A$ is non-singular on the subspace $V$. Let $A_V^{-1}:\mathbb{R}^n\to V$ be its pseudo-inverse, characterised by $A_V^{-1}A\pi_V = \pi_V A A_V^{-1} = \pi_V$. The projected Mahalanobis norm $m_V$ on $V$ is defined by

$\displaystyle m_V(v) = \langle A_V^{-1} v, v \rangle$.

Let’s take the one-dimensional case as an example. Let $v \in \mathbb{R}^n$ be non-zero and denote the span of $v$ by $\llbracket v \rrbracket$. Then the norm $m_{\llbracket v \rrbracket}$ is given by

$\displaystyle m_{\llbracket v \rrbracket}(v) = \frac{\langle v,v \rangle^2}{\langle A v, v \rangle}$.

Note that this expression does not involve the inverse of $A$. The basic property of the projected Mahalanobis norm is the following:

The inequality $m_V \leq m$ holds throughout $V$. Equality $m_V(v) = m(v)$ occurs if and only if $A_V^{-1}v = A^{-1}v$.

This property follows from the Cauchy-Schwarz inequality for the inner product $\langle \cdot, A^{-1} \cdot \rangle$:

$\displaystyle \langle A_V^{-1}v, v\rangle^2 = \langle A A_V^{-1}v, A^{-1} v\rangle^2 \leq \langle A A_V^{-1}v, A_V^{-1} v \rangle \langle v, A^{-1}v\rangle = \langle v, A_V^{-1} v \rangle \langle v, A^{-1}v\rangle$.

This is an equality if and only if $AA_V^{-1}v$ and $v$ are linearly dependent. Combined with $\pi_VAA_V^{-1}v = \pi_V v = v$ (recall that $v \in V$) it follows that in that case in fact $AA_V^{-1}v = v$, that is, $A_V^{-1}v = A^{-1}v$.
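As a quick sanity check, here is a small Python sketch (all names are mine, not part of the post) that verifies the inequality $m_{\llbracket v \rrbracket}(v) \leq m(v)$ for random $2 \times 2$ symmetric positive definite matrices, using the inverse-free one-dimensional formula above:

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

random.seed(0)
for _ in range(100):
    # build a random symmetric positive definite matrix A = B^T B + I
    b = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    A = [[sum(b[k][i] * b[k][j] for k in range(2)) + (i == j)
          for j in range(2)] for i in range(2)]
    v = [random.uniform(-1, 1) + 1.0, random.uniform(-1, 1)]  # non-zero

    # m(v) = <v, A^{-1} v> via the explicit 2x2 inverse
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    Ainv_v = [(A[1][1] * v[0] - A[0][1] * v[1]) / det,
              (-A[1][0] * v[0] + A[0][0] * v[1]) / det]
    m = dot(v, Ainv_v)

    # projected norm on the span of v: <v,v>^2 / <Av,v>
    m_proj = dot(v, v) ** 2 / dot(mat_vec(A, v), v)
    assert m_proj <= m + 1e-12
```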

The following realisation came as a surprise. It shows that projections onto two-dimensional subspaces suffice to get an exact value for the Mahalanobis norm:

Let $v\in \mathbb{R}^n$ be a non-zero vector and let $V \subseteq \mathbb{R}^n$ be the span of $\{v, A^{-1}v \}$ (so $\dim(V) \leq 2$). Then $m_V(v) = m(v)$.

The projected norm for a two-dimensional subspace also has a simple explicit form. Let $w \in \mathbb{R}^n$ be a non-zero vector orthogonal to $v$ and let $V$ be the span of $\{v, w\}$. The norm $m_V$ is given by

$\displaystyle m_V(v) = \frac{\langle v, v \rangle^2}{\langle A v, v\rangle - \frac{\langle A v, w \rangle^2}{\langle A w, w\rangle}}$.
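The exactness result can be checked numerically. The sketch below (assuming numpy is available; the setup is mine) builds $V$ as the span of $\{v, A^{-1}v\}$ by orthogonalising $A^{-1}v$ against $v$, and then evaluates the explicit two-dimensional formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)          # random symmetric positive definite matrix
v = rng.standard_normal(4)

u = np.linalg.solve(A, v)        # u = A^{-1} v
m = v @ u                        # exact squared Mahalanobis norm m(v)

# w = component of A^{-1}v orthogonal to v, so span{v, w} = span{v, A^{-1}v}
w = u - (u @ v) / (v @ v) * v

# explicit formula for the projected norm on span{v, w}
m_V = (v @ v) ** 2 / (A @ v @ v - (A @ v @ w) ** 2 / (A @ w @ w))
assert abs(m_V - m) < 1e-9 * m
```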

## The quantum harmonic oscillator in the plane

This post investigates some solutions of the (or rather a) quantum harmonic oscillator in the plane. Here the plane is $\mathbb{C}$ which will be identified with $\mathbb{R}^2$ through $z=x+\textrm{i} y$. Let $\partial_x$ and $\partial_y$ denote the partial derivative operators with respect to $x$ and $y$ respectively. Then the Laplace operator $\nabla^2$ is $\partial_x^2 + \partial_y^2$. The time independent form of the Schrödinger equation for the quantum harmonic oscillator considered here is

$\displaystyle (-\tfrac{1}{4} \nabla^2 + |z|^2) \varphi = E \varphi$

where $\varphi$ is a complex function and $E$ a real eigenvalue (related to energy in physical terms). The operator $-\tfrac{1}{4} \nabla^2 + |z|^2$ is called the Hamiltonian and will be denoted by $H$. Only radial solutions $\varphi$ will be considered. A complex function is called radial when it is of the form

$\displaystyle \varphi(z) = f(|z|^2) z^m$ or $\displaystyle \varphi(z) = f(|z|^2) \overline{z}^{\,m}$

for some real analytic function $f$ and some $m \in \mathbb{N}$. Radial functions of the first and second form are said to be of degree $m$ and $-m$ respectively. If $\varphi$ is a radial function of degree $m \in \mathbb{Z}$ and $|\omega| = 1$ then $\varphi(\omega z) = \omega^m \varphi(z)$.

Working over $\mathbb{C}$ it is convenient to rewrite the Laplace operator as follows. Let $\partial = \tfrac{1}{2}(\partial_x - \textrm{i}\partial_y)$ and $\overline{\partial} = \tfrac{1}{2}(\partial_x + \textrm{i} \partial_y)$. If $f$ is a complex differentiable function then $\overline{\partial} f = 0$ and $\partial f = f'$ while $\overline{\partial}\, \overline{f} = \overline{f'}$ and $\partial \overline{f} = 0$. (These are the Cauchy-Riemann equations.) The operators $\partial$ and $\overline{\partial}$ commute and are related to the Laplace operator by

$\displaystyle \partial \overline{\partial} = \overline{\partial} \partial = \tfrac{1}{4} \nabla^2.$

The Hamiltonian is therefore $-\partial \overline{\partial} + |z|^2$ and we will use it in this form. For a radial function $f(|z|^2) z^m$

$\displaystyle \partial \overline{\partial} f(|z|^2) z^m = \partial f'(|z|^2) z^{m+1} = |z|^2 f''(|z|^2) z^m + (m+1) f'(|z|^2) z^m$.

Applying the Hamiltonian results in

$H f(|z|^2) z^m = |z|^2 \left(f(|z|^2) - f''(|z|^2)\right) z^m - (m+1) f'(|z|^2) z^m$

and in particular if $f(x) = e^{-x}$ then $H e^{-|z|^2} z^m = (m+1) e^{-|z|^2} z^m$. So for each degree $m \geq 0$ the function $e^{-|z|^2} z^m$ is an eigenfunction of $H$ with eigenvalue $E = m + 1$. Complex conjugation shows that the same holds for $e^{-|z|^2} \overline{z}^{\,m}$ of degree $-m$. So for each degree $m \in \mathbb{Z}$ we have found a radial solution of degree $m$ with eigenvalue $|m| + 1$. These examples do not exhaust all such solutions however. Others can be found by a clever trick that was already known in this context to Schrödinger and Dirac.

This trick is known as the factorisation or algebraic method and the auxiliary operators that appear are called ladder operators or annihilation and creation operators. The following facts can be readily verified:

1. The operators $-\partial+\overline{z}$ and $\partial + \overline{z}$ lower the degree of a radial function by $1$.
2. The operators $-\overline{\partial} + z$ and $\overline{\partial} + z$ raise the degree of a radial function by $1$.
3. If $\varphi$ is a radial function of degree $m \in \mathbb{Z}$ then $(z \partial - \overline{z} \, \overline{\partial}) \varphi = m \varphi$ and

$\displaystyle (-\partial + \overline{z})(\overline{\partial} + z) \varphi = (H - 1 - m) \varphi$
$\displaystyle (\overline{\partial} + z)(-\partial + \overline{z}) \varphi = (H + 1 - m) \varphi$
$\displaystyle (-\overline{\partial} + z)(\partial + \overline{z}) \varphi = (H - 1 + m) \varphi$
$\displaystyle (\partial + \overline{z})(-\overline{\partial} + z) \varphi = (H + 1 + m) \varphi$

Combining these observations we find for a radial function $\varphi$ of degree $m$:

$\displaystyle \begin{matrix} H(-\partial + \overline{z})\varphi &=& ((-\partial + \overline{z})(\overline{\partial} + z) + m)(-\partial + \overline{z})\varphi\\ &=& (-\partial + \overline{z})((\overline{\partial} + z)(-\partial + \overline{z}) + m)\varphi\\ &=&(-\partial + \overline{z})(H + 1)\varphi. \end{matrix}$

In particular if $\varphi \neq 0$ is a solution with eigenvalue $E$ then $(-\partial + \overline{z})\varphi$ is a solution of degree $m-1$ and eigenvalue $E+1$. In this case $(\overline{\partial} + z)(-\partial + \overline{z})\varphi = (E + 1 - m) \varphi$ and so $(-\partial + \overline{z})\varphi \neq 0$ if $E \neq m - 1$. Under the assumption that $E \geq |m| + 1$ we can assemble the following table of solutions based on $\varphi$:

$\displaystyle \begin{matrix} \textrm{solution} & \textrm{degree} & \textrm{eigenvalue} & \textrm{remark}\\[1ex] (-\partial + \overline{z})\varphi & m - 1 & E + 1 & \neq 0\\ (-\overline{\partial} + z)\varphi & m + 1 & E + 1 & \neq 0\\ (\partial + \overline{z})\varphi & m - 1 & E - 1 & \neq 0 \textrm{ unless } E = 1 - m\\ (\overline{\partial} + z)\varphi & m + 1 & E - 1 & \neq 0 \textrm{ unless } E = 1 + m \end{matrix}$

The first two operators in this table are called raising or creation operators: they raise the eigenvalue or create energy. The last two are called lowering or annihilation operators for similar reasons. Starting from the solution $\varphi(z) = e^{-|z|^2}$ of degree $m=0$ and eigenvalue $E=1$ and repeatedly applying the operators $-\partial + \overline{z}$ and $-\overline{\partial} + z$ we find non-zero solutions for all pairs $(m, E)$ for which $m + E$ is odd and $E \geq |m|+1$. This process results in the solutions below up to scalar multiples. Solutions for negative degrees can be obtained by complex conjugation.

$\displaystyle \begin{matrix} m & E & \textrm{solution}\\[1ex] 0 & 1 & e^{-|z|^2} \\ 1 & 2 & z \, e^{-|z|^2}\\ 0 & 3 & (1 - 2|z|^2) \, e^{-|z|^2}\\ 2 & 3 & z^2 \, e^{-|z|^2}\\ 1 & 4 & z (1 - |z|^2) \, e^{-|z|^2}\\ 3 & 4 & z^3 \, e^{-|z|^2} \end{matrix}$
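The entries of this table can be verified symbolically. The following sketch (assuming sympy is available; written in real coordinates $x, y$ with $z = x + \textrm{i}y$) applies the Hamiltonian to each tabulated solution and checks the eigenvalue equation $H\varphi = E\varphi$:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y
r2 = x**2 + y**2  # |z|^2

def H(phi):
    """Hamiltonian H = -(1/4) Laplacian + |z|^2 applied to phi."""
    lap = sp.diff(phi, x, 2) + sp.diff(phi, y, 2)
    return -sp.Rational(1, 4) * lap + r2 * phi

# (m, E, solution) entries from the table above
table = [
    (0, 1, sp.exp(-r2)),
    (1, 2, z * sp.exp(-r2)),
    (0, 3, (1 - 2 * r2) * sp.exp(-r2)),
    (2, 3, z**2 * sp.exp(-r2)),
    (1, 4, z * (1 - r2) * sp.exp(-r2)),
    (3, 4, z**3 * sp.exp(-r2)),
]

for m, E, phi in table:
    assert sp.simplify(sp.expand(H(phi) - E * phi)) == 0
```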

## Bounds for central binomial coefficients

Let $n$ be a positive integer. The binomial coefficient ${2n \choose n}$ is called a central binomial coefficient. Several bounds for these coefficients are known and the more advanced ones are commonly derived from Stirling’s formula for factorials or from Wallis’ product for $\pi$. This post presents an alternative and self-contained elementary proof of the bounds

$\displaystyle \frac{4^n}{\sqrt{(n+\frac{1}{2})\pi}} < {2n \choose n} < \frac{4^n}{\sqrt{n\pi}}$.

Define a function $f$ by

$\displaystyle f(x) = \cos(x)^n e^{n x^2/2}$

and take $x\in[0, \pi/2)$. From $f(x) \geq 0$ and $\tan(x)\geq x$ it follows that

$\displaystyle f'(x) = n f(x)\left(x - \tan(x)\right) \leq 0$.

So $f$ is non-increasing on this interval and therefore $f(x) \leq f(0)=1$ or

$\displaystyle \cos(x)^n \leq e^{-n x^2/2}$.

This inequality clearly also holds at the endpoint $x=\pi/2$ and since both sides are symmetric in $x$ it holds throughout the interval $[-\pi/2, \pi/2]$. (It suggests that $\cos(x)^n$ closely resembles a normal distribution on this interval.) Integration of the inequality above leads to

$\displaystyle \int_{-\pi/2}^{\pi/2}\cos(x)^n dx \leq \int_{-\pi/2}^{\pi/2}e^{-n x^2/2}dx < \int_{-\infty}^{\infty}e^{-n x^2/2}dx$.

Both the far left and far right hand sides of these inequalities can be computed explicitly. Starting with the right hand side let

$\displaystyle b_n=\int_{-\infty}^{\infty}e^{-n x^2/2}dx$.

Then

$\displaystyle b_n^2=\int_{-\infty}^{\infty}e^{-n x^2/2}dx \int_{-\infty}^{\infty}e^{-n y^2/2}dy=\iint_{-\infty}^{\infty}e^{-n (x^2+y^2)/2}dx dy$

$\displaystyle =2 \pi \int_0^{\infty} r e^{-n r^2/2}dr = \frac{2\pi}{n}$

where polar coordinates were used in the last step. And so $b_n=\sqrt{\frac{2\pi}{n}}$. To evaluate the left hand side let

$\displaystyle a_n=\int_{-\pi/2}^{\pi/2}\cos(x)^n dx.$

By explicit computation we find $a_0=\pi$ and $a_1=2$. The other values can be found by a recursive relation that follows from integration by parts. For positive $n$ we have

$\displaystyle a_{n+1}=\int_{-\pi/2}^{\pi/2}\cos(x)^n \cos(x)dx=n\int_{-\pi/2}^{\pi/2}\cos(x)^{n-1} \sin(x)^2dx$

$\displaystyle =n\int_{-\pi/2}^{\pi/2}\cos(x)^{n-1}\left(1- \cos(x)^2\right)dx=n(a_{n-1}-a_{n+1})$

and therefore the recursion $a_{n+1}=\frac{n}{n+1}a_{n-1}$. Using this recursion one can check that the even and odd entries of the sequence $a$ are given respectively by

$\displaystyle a_{2n}=4^{-n}{2n \choose n}\pi$

$\displaystyle a_{2n+1}=\frac{2^{2n+1}}{{2n \choose n}(2n +1)}$.
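These closed forms are easy to check numerically against the recursion; a small Python sketch (names are mine):

```python
from math import comb, pi

# a_0 = pi, a_1 = 2, and the recursion a_{n+1} = n/(n+1) * a_{n-1}
a = [pi, 2.0]
for n in range(1, 30):
    a.append(n / (n + 1) * a[n - 1])

# compare against the closed forms for even and odd indices
for n in range(15):
    assert abs(a[2 * n] - comb(2 * n, n) * pi / 4**n) < 1e-9
    assert abs(a[2 * n + 1] - 2**(2 * n + 1) / (comb(2 * n, n) * (2 * n + 1))) < 1e-9
```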

Putting all results so far together we find for positive even integers $2n$

$\displaystyle 4^{-n}{2n \choose n}\pi<\sqrt{\frac{2\pi}{2n}}$

$\displaystyle {2n \choose n}<\frac{4^n}{\sqrt{n \pi}}$

and for odd integers $2n+1$

$\displaystyle \frac{2^{2n+1}}{{2n \choose n}(2n+1)} < \sqrt{\frac{2\pi}{2n+1}}$

$\displaystyle \frac{4^n}{\sqrt{(n+\frac{1}{2})\pi}} < {2n \choose n}$.
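A quick numerical check of the resulting bounds (a sketch, using Python's `math.comb`):

```python
import math

# verify 4^n / sqrt((n + 1/2) pi) < C(2n, n) < 4^n / sqrt(n pi)
for n in range(1, 50):
    c = math.comb(2 * n, n)
    lower = 4**n / math.sqrt((n + 0.5) * math.pi)
    upper = 4**n / math.sqrt(n * math.pi)
    assert lower < c < upper
```

Both bounds are remarkably tight: already for moderate $n$ they agree with ${2n \choose n}$ to within a few percent.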

## Fixed points of entire functions

This story has a long personal history. In my first attempt at the complex analysis exam I was asked whether the entire function $e^z$ has any fixed points. I had no idea how to approach this question and I failed the exam. This first failure was soon overcome but that particular question was (much to my advantage) never addressed during my studies again. Much later I found several ways to solve this question.

The image of $e^z-z$ is invariant under a shift $z \mapsto z + 2 \pi \textrm{i}$. The “Little Picard” theorem then asserts that this function maps onto $\mathbb{C}$ since it cannot omit just a single value and so it must have a root. Another way uses “Great Picard” and the fact that $z e^{-z}$ only has a single root so it must attain the value $1$ infinitely often. Both approaches seem nice enough but depend on non-trivial theorems (“Great Picard” more so than “Little Picard”) and there is no indication of the location of the fixed points. A more pedestrian approach shows that there must be infinitely many fixed points along the curve $|e^{2z}| = |z|^2$ but this is hardly in the spirit of complex analysis. This was the state of affairs until this week. Then I found a simple and much more satisfying answer.

The Banach fixed point theorem asserts that every contraction of a complete metric space has a single fixed point. I will use that theorem in the following setting.

Let $E$ be a non-empty closed convex subset of an open set $U$ and let $f : U \to E$ be holomorphic. If there exists a constant $k < 1$ such that $|f'| \leq k$ on $E$ then $f$ restricted to $E$ is a contraction and therefore $f$ has a single fixed point in $E$.

For $m \in \mathbb{Z}$ let $E_m$ denote the horizontal strip

$\displaystyle E_m = \{ z \mid |\textrm{Im}(z) - 2m \pi| \leq \pi \}$

and let $\log_m$ be the branch of the logarithm that maps the slit complex plane $\mathbb{C} \setminus (-\infty, 0]$ onto the interior of $E_m$. If $m \neq 0$ then

$\displaystyle |\log_m'| \leq \frac{1}{(2|m|-1)\pi} < 1$

on $E_m$ so the conditions of the theorem above apply and therefore $\log_m$ has a single fixed point $z_m \in E_m$ (which must lie in the interior). Each $z_m$ is a fixed point of $e^z$. Unfortunately the same argument does not work for $E_0$ where $e^z$ has two more fixed points.
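The fixed points $z_m$ can actually be computed by iterating $\log_m$, which is exactly the Banach iteration. A small Python sketch (the function name is mine; $\log_m$ is realised as the principal logarithm shifted by $2\pi\textrm{i}m$):

```python
import cmath

def fixed_point(m, iterations=100):
    """Iterate log_m(z) = Log(z) + 2*pi*i*m towards its fixed point in E_m."""
    z = 2j * cmath.pi * m  # a starting point inside the strip E_m
    for _ in range(iterations):
        z = cmath.log(z) + 2j * cmath.pi * m
    return z

z1 = fixed_point(1)
assert abs(cmath.exp(z1) - z1) < 1e-12            # a fixed point of e^z
assert abs(z1.imag - 2 * cmath.pi) <= cmath.pi    # it lies in the strip E_1
```

Since $|\log_m'| \leq \frac{1}{(2|m|-1)\pi}$ on $E_m$, the iteration converges geometrically; for $m=1$ it settles near $2.062 + 7.589\,\textrm{i}$.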

The fixed point method works equally well for a number of other functions. Here are two more examples:

Example 1. Let $m \in \mathbb{Z}$ and $m \neq 0$. The branch of $\arcsin$ that maps the (closed) upper half plane onto the half strip

$\displaystyle E_m = \{ z \mid \textrm{Im}(z) \geq 0 \textrm { and } |\textrm{Re}(z) - 2m\pi| \leq \frac{\pi}{2} \}$

is a contraction on $E_m$. So $\sin(z)$ has exactly one fixed point in each such strip (and therefore also exactly one in its complex conjugate).

Example 2. Let $m \in \mathbb{Z}$ and $m$ odd. The branch of $\arctan$ that maps the complement of the open unit disc into the strip

$\displaystyle E_m = \{ z \mid |\textrm{Re}(z) - \tfrac{1}{2}m \pi| \leq \frac{\pi}{4} \}$

is a contraction on $E_m$. So $\tan(z)$ has exactly one fixed point in each such strip.

These examples for $e^z$, $\sin$ and $\tan$ work so well because their inverses have a nice derivative that is less than one except in a small bounded region. Put another way, these functions all satisfy a differential equation of the form $(f')^d = p(f)$ for some $d \in \{1,2\}$ and some polynomial $p$.

## Acceleration of alternating series

Let $f:\mathbb{N}\to\mathbb{R}$ be a positive decreasing function such that $\lim_{n\to\infty}f(n)=0$. Then the alternating series

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n)$

converges. Depending on the rate at which $f$ decays it may however take many terms to get a decent approximation of the sum. Examples of such slowly converging series are

$\displaystyle \log(2) = \sum_{n=0}^{\infty}(-1)^n\frac{1}{n+1}$

$\displaystyle \frac{\pi}{4} = \sum_{n=0}^{\infty}(-1)^n\frac{1}{2n+1}$

A well known method to accelerate the convergence of such alternating series uses the binomial transform of the function $f$ as follows. For $m \geq 0$ let $\Delta^m f$ be the $m$-th difference of $f$:

$\displaystyle (\Delta^m f)(n) = \sum_{k=0}^m (-1)^k{m \choose k} f(n+k)$

Then it turns out that

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n) = \sum_{n=0}^{\infty}\frac{(\Delta^n f)(0)}{2^{n+1}}$
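For illustration, here is a direct numerical evaluation of the right hand side (a sketch; the helper name `euler_sum` is mine):

```python
from math import comb, log

def euler_sum(f, terms=30):
    """Partial sum of sum_n (Delta^n f)(0) / 2^(n+1)."""
    return sum(
        sum((-1)**k * comb(n, k) * f(k) for k in range(n + 1)) / 2**(n + 1)
        for n in range(terms)
    )

# 30 transformed terms already give log(2) to ~10 decimal places,
# whereas 30 terms of the original alternating series are only good to ~1e-2
approx = euler_sum(lambda n: 1 / (n + 1))
assert abs(approx - log(2)) < 1e-9
```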

and the right hand side may converge much faster. The binomial transform for the two examples above can be computed explicitly to obtain

$\displaystyle \log(2)=\sum_{n=0}^{\infty}\frac{1}{2^{n+1}(n+1)}$

$\displaystyle \frac{\pi}{4}=\sum_{n=0}^{\infty}\frac{n!}{2\cdot(2n+1)!!} = \frac{1}{2} + \frac{1}{2\cdot 3} + \frac{2}{2 \cdot 3 \cdot 5} + \frac{2 \cdot 3}{2 \cdot 3 \cdot 5 \cdot 7} + \ldots$

Both transformed series indeed converge much faster than the original ones. You may recognize the first as the series for $-\log(1-x)$ taken at $x=\tfrac{1}{2}$. The second one is harder to spot but it turns out to be the series of

$\displaystyle \frac{\arcsin(x)}{\sqrt{2-2x^2}}$

taken at $x=\tfrac{1}{2}\sqrt{2}$.

The binomial transform is a great tool if you can find an explicit expression for it such as in both examples above. This is however not a trivial transform in general. If you do not know an explicit expression for the binomial transform then it is also not very convenient to compute: $(\Delta^n f)(0)$ depends on the values $f(k)$ for all $k \leq n$.

There is however another much simpler method that only involves the repeated difference $\Delta^m f$ for a single fixed value of $m \geq 1$. This difference is now easy to compute (numerically) since $(\Delta^m f)(n)$ depends only on $m+1$ consecutive values of $f$. It is convenient for the sake of notation to extend $f$ to $\mathbb{Z}$ by setting $f(n)=0$ for all $n < 0$. Then $(\Delta^m f)(n) = 0$ for all $n < -m$.

Now the following equalities hold in which the second and third sums are taken over $\mathbb{Z}$:

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n) = 2^{1-m}\sum_{n \textrm{ even}}(\Delta^m f)(n) = -2^{1-m} \sum_{n \textrm{ odd}} (\Delta^m f)(n)$

If all terms $(\Delta^m f)(n)$ are positive for $n \geq 0$ then partial sums increase for the second series and decrease for the third from $n=0$ onward. This means that these partial sums are lower and upper bounds for the series which is useful to estimate the remaining error for these partial sums.

As an example take $m=2$. The series for $\log(2)$ and $\tfrac{\pi}{4}$ now become

$\displaystyle \log(2) = \frac{1}{2} + \sum_{n=0}^{\infty}\frac{1}{(2n+1)(2n+2)(2n+3)} = \frac{3}{4} - \sum_{n=0}^{\infty}\frac{1}{(2n+2)(2n+3)(2n+4)}$

$\displaystyle \frac{\pi}{4} = \frac{1}{2} + \sum_{n=0}^{\infty}\frac{4}{(4n+1)(4n+3)(4n+5)} = \frac{5}{6} - \sum_{n=0}^{\infty}\frac{4}{(4n+3)(4n+5)(4n+7)}$
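Both accelerated series are easy to check numerically; the sketch below compares 1000 terms of each first form against the exact values:

```python
import math

# m = 2 accelerated series for log(2) and pi/4 (first forms above)
log2 = 0.5 + sum(1 / ((2 * n + 1) * (2 * n + 2) * (2 * n + 3))
                 for n in range(1000))
pi4 = 0.5 + sum(4 / ((4 * n + 1) * (4 * n + 3) * (4 * n + 5))
                for n in range(1000))

# cubic decay of the terms makes 1000 terms accurate to ~1e-7
assert abs(log2 - math.log(2)) < 1e-6
assert abs(pi4 - math.pi / 4) < 1e-6
```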

## Game trees under a partial order part 3: Alpha Beta pruning

In this third and last post of the series I will discuss alpha beta pruning. The context will be evaluation functions on game trees that take values in a distributive lattice. Of course a prime example would be the linear lattice $\mathscr{L}(X, \leq)$ over the partially ordered set $(X, \leq)$ but this post does not depend on specific features of this linear lattice. The result is a description of alpha beta pruning for distributive lattices that is a slightly improved version of that given by Ginsberg et al in their paper “Alpha Beta pruning under partial orders” from 2001.

Alpha Beta pruning is a way to compute the value of an evaluation function at the root node of a game tree while visiting as few child nodes as possible. See the previous post for a more extensive introduction to evaluation functions. Here I will only briefly recall the recursive definition of an evaluation function $f$ on a game tree that takes values in a lattice $L$. I will assume that $L$ has a minimal and maximal element denoted by $\mathbf{0}$ and $\mathbf{1}$. The normal forms $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$ of the linear lattice are examples of this.

1. The value of $f$ for a leaf node is given by some heuristic game evaluation function (e.g. chess piece values).
2. At an inner min node (where the minimizing player is to make a move) the value of $f$ is defined recursively as the greatest lower bound in $L$ of all values of $f$ on direct child nodes.
3. At an inner max node the value of $f$ is defined as the least upper bound of all values of $f$ on direct child nodes.

This definition leads to the following algorithm for the computation of $f$. This algorithm is also the basis of further considerations on ways to prune parts of the game tree that are irrelevant for the computation of $f$ on the root node.

```
function f(node)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child)
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child)
        end
    end
    return value
end
```

The structure of this naive algorithm is simple. To compute the value at a min node start with the maximal value $\mathbf{1}$ then iterate over all child nodes and adjust the node’s value by taking the greatest lower bound with the value of a child node. At a max node the procedure is similar but starts off with the minimal value $\mathbf{0}$ and repeatedly takes the least upper bound. The computation recurses depth first left to right over all child nodes, where “left to right” is according to the enumeration order of child nodes. Move ordering does not play a role in this naive algorithm but it can play a big role in alpha beta pruning in terms of the number of pruned nodes. I will not address move ordering in this post. During execution of the algorithm the state of the computation is schematically depicted below. As in the example in the previous post I will assume that the top node is a min node.

Intermediate values in the depth first left to right computation of an evaluation function

Here $x_1, x_2, \dotsc$ and $y_1, y_2, \dotsc$ are the intermediate values that are computed for the inner min and max nodes respectively. The $\star$ symbol indicates all subtrees located to the left of the current path of computation in the tree. These are the subtrees for which a value has already (recursively) been computed. As the computation progresses each intermediate value $x_j$ descends from $\mathbf{1}$ and each $y_j$ ascends from $\mathbf{0}$. If we were to cut the computation short at the point depicted above then the algorithm would backtrack up the tree and yield the final outcome

$x_1 \wedge (y_1 \vee (x_2 \wedge \dotsc \vee (x_n \wedge y_n)))$

at the root node. The essential point of alpha beta pruning is that it identifies when this intermediate value remains constant even when $y_n$ gets larger in the remainder of the computation, because in that case any further children of the node marked $y_n$ in the diagram cannot contribute to the final result and can be pruned (skipped). In the picture above the search path ends with a max node but that is irrelevant for alpha beta pruning as we will see.

A lattice $L$ has a canonical partial order where $x \leq y$ is equivalent to $x \wedge y = x$ (or alternatively $x \vee y = y$). The lattice operators $\wedge$ and $\vee$ indeed result in the greatest lower bound and least upper bound with respect to this ordering. The following lattice property is the key point in alpha beta pruning.

Lemma: When $x, y, z \in L$ and $y \leq z$ then $x \wedge y \leq x \wedge z$ and $x \vee y \leq x \vee z$.

Since $y \leq z$ the greatest lower bound $x \wedge y$ is also a lower bound for $x$ and $z$ and is therefore less than or equal to the greatest lower bound $x \wedge z$. The other inequality is derived similarly. For sequences $x_1, x_2, \dotsc$ and $y_1, y_2, \dotsc$ in $L$ define sequences $\alpha$ and $\beta$ recursively by

1. $\alpha_1 = x_1 \wedge y_1$ and $\alpha_{j+1}$ is obtained from $\alpha_j$ by the substitution $y_j \leftarrow y_j \vee (x_{j+1} \wedge y_{j+1})$.
2. $\beta_1 = x_1$ and $\beta_{j+1}$ is obtained from $\beta_j$ by the substitution $x_j \leftarrow x_j \wedge (y_j \vee x_{j+1})$.

So the sequence $\alpha$ starts with

$\alpha_1 = x_1 \wedge y_1$
$\alpha_2 = x_1 \wedge (y_1 \vee (x_2 \wedge y_2))$
$\alpha_3 = x_1 \wedge (y_1 \vee (x_2 \wedge (y_2 \vee (x_3 \wedge y_3))))$
$\vdots$

and the sequence $\beta$ with

$\beta_1 = x_1$
$\beta_2 = x_1 \wedge (y_1 \vee x_2)$
$\beta_3 = x_1 \wedge (y_1 \vee (x_2 \wedge (y_2 \vee x_3)))$
$\vdots$

The following inequalities are a direct consequence of our lemma above. The sequences $\alpha$ and $\beta$ satisfy

$\alpha_j \leq \alpha_{j+1} \leq \beta_{j+1} \leq \beta_j$

for all indices $j$. In particular $\alpha$ is an ascending sequence, $\beta$ is a descending sequence, each $\alpha_j$ is a lower bound for $\beta$ and each $\beta_j$ is an upper bound for $\alpha$. If $\alpha_j = \beta_j$ or $\alpha_j = \beta_{j+1}$ for some index $j$ then both sequences are constant from that point onwards. The values in the sequences $\alpha$ and $\beta$ are precisely those obtained from aborting the computation of the evaluation function for the root node at min node $j$ (sequence $\alpha$) or at max node $j$ (sequence $\beta$). Hence further nodes can be pruned from the computation as soon as both sequences yield equal values.

Alpha Beta pruning (strong version): At the first index $j$ where either $\alpha_j = \beta_j$ or $\alpha_j = \beta_{j+1}$ further subnodes of max node $j$ or min node $j+1$ can be pruned.

Note that this formulation of alpha beta pruning resembles “classical” alpha beta pruning for totally ordered sets in that we have an ascending sequence of lower bounds and a descending sequence of upper bounds, and nodes can be pruned as soon as these values coincide. There is also an unfortunate difference between the two. In classical alpha beta pruning the high and low values are easy to compute progressively as a simple maximum or minimum value. In contrast, the high and low values in the pruning algorithm above are not so simple to compute. Since they are computed deepest node first, keeping them up to date requires constant backtracking. Moreover, computation of $\wedge$ and $\vee$ can be much more expensive than simply taking the maximum or minimum of two numbers.

Also note that the distributive property of the lattice $L$ is not used at all in this version of alpha beta pruning! This does not contradict the results of Ginsberg et al since they formulate alpha beta pruning (in particular “deep pruning”) differently. I will briefly come back to this point at the end of this post.

There is a weaker version of alpha beta pruning based on alternative sequences $\alpha'$ and $\beta'$ that are easier to compute. The consequence is however that it may take longer before the ascending high values and descending low values allow cut off, resulting in less aggressive pruning. The weaker estimates for high and low values follow from the following identities (where the distributive property of $L$ is now essential). Let $y_0 = \mathbf{0}$ and define two new sequences $\alpha'$ and $\beta'$ by

$\alpha'_j = y_0 \vee \dotsc \vee y_{j-1}$ and
$\beta'_j = x_1 \wedge \dotsc \wedge x_j$.

So $\alpha'_j$ is the least upper bound of the values of all max nodes above min node $j$ and $\beta'_j$ is the greatest lower bound of the values of all min nodes above max node $j$. If we take $\alpha_0 = \mathbf{0}$ and $\beta_0 = \mathbf{1}$ then for all indices $j \geq 1$ the following equalities hold:

$\alpha_j = \alpha_{j-1} \vee (\beta'_j \wedge y_j)$ and
$\beta_j = \beta_{j-1} \wedge (\alpha'_j \vee x_j)$.

These can be proved by induction and the distributive property of the lattice $L$. Taking $y_j = x_j$ in the first equation and $x_j = y_{j-1}$ in the second results in

$\beta_j = \alpha_{j-1} \vee \beta'_j$ and
$\alpha_{j-1} = \beta_{j-1} \wedge \alpha'_j$.

Comparing these equalities it follows that $x_j \leq \alpha'_j$ implies $\beta_j = \alpha_{j-1}$ and $\beta'_j \leq y_j$ implies $\alpha_j = \beta_j$. This leads to the following version of alpha beta pruning.

Alpha Beta pruning: If the value $x_j$ of min node $j$ gets less than or equal to $\alpha'_j$ or the value $y_j$ of max node $j$ gets greater than or equal to $\beta'_j$ then further child nodes can be pruned.

This is a slightly different formulation of alpha beta pruning than Ginsberg et al use in their paper. What they call (shallow or deep) pruning comes down to either $x_j \leq y_k$ for some $k < j$ at min node $j$ or $x_k \leq y_j$ for some $k \leq j$ at max node $j$. These are stronger conditions than those we found above and will therefore prune fewer nodes in general.

Here is an algorithmic formulation of alpha beta pruning. The value of the root node is computed by f(root, $\mathbf{0}$, $\mathbf{1}$).

```
function f(node, α, β)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child, α, β ∧ value)
            if value ≤ α then break end
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child, α ∨ value, β)
            if value ≥ β then break end
        end
    end
    return value
end
```
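To connect this with the classical case, here is a runnable specialisation of the algorithm above in which the lattice is the totally ordered set of real numbers, so $\wedge$ is min, $\vee$ is max, $\mathbf{0} = -\infty$ and $\mathbf{1} = +\infty$. The tree encoding (nested tuples, plain numbers as leaves) is mine:

```python
import math

def f(node, alpha=-math.inf, beta=math.inf):
    if isinstance(node, (int, float)):  # leaf: heuristic position value
        return node
    kind, *children = node              # ('min', ...) or ('max', ...)
    if kind == 'min':
        value = math.inf                # the lattice's 1
        for child in children:
            value = min(value, f(child, alpha, min(beta, value)))
            if value <= alpha:          # further children cannot change the result
                break
    else:
        value = -math.inf               # the lattice's 0
        for child in children:
            value = max(value, f(child, max(alpha, value), beta))
            if value >= beta:
                break
    return value

tree = ('min', ('max', 3, 5), ('max', 2, ('min', 9, 8)))
assert f(tree) == 5
```

In this special case the weak and strong versions coincide, because in a total order $\alpha'_j$ and $\beta'_j$ reduce to the familiar running maximum and minimum.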

To sum up the results of this post: I described two versions of alpha beta pruning that allow pruning nodes from the computation of the root-node value of an evaluation function with values in a distributive lattice. The first, strong, version has the disadvantage that it is computationally more demanding than the second. Both versions are stronger (they potentially prune more nodes) than the method described in the paper by Ginsberg et al.

## Game trees under a partial order part 2: Evaluation functions

The first part of this series introduced the maximizing and minimizing orders on finite antichains in any partially ordered set $(X, \leq)$. In this second part I will show how these orders and their properties are used to define evaluation functions on game trees. The previous post was mainly based on the paper “Searching game trees under a partial order” by P. Dasgupta et al. In this second post I will show how it is related to another paper, “Alpha-Beta pruning under partial orders” by M. Ginsberg et al from 2001. This second paper takes an interesting, different approach. Instead of starting with a partial order from the beginning, it investigates what structure an evaluation function on a game tree must have to satisfy some reasonable constraints. The answer they find is that such an evaluation function must be defined on a lattice, that is, a set with operators “greatest lower bound” $\wedge$ and “least upper bound” $\vee$. An evaluation function is then defined recursively, starting at the leaves. The minimizing player combines the values of subnodes with $\wedge$ while the maximizing player uses $\vee$. Moreover, Ginsberg et al show that such an evaluation function allows alpha beta pruning (which I will define in detail in the next post) exactly if the lattice is distributive.

While the paper by Ginsberg references that of Dasgupta, the relation between the two is not made very explicit. This post will give a very precise and (I think) satisfying relation between the two papers, ultimately improving the results of both: the work of Dasgupta is formalized in terms of lattice theory and can now be applied to any partially ordered set, not just $\mathbb{R}^n$, and the result of Ginsberg on alpha beta pruning is sharpened in the next post, potentially leading to more aggressive pruning in game tree searches.

Consider a game tree in which a min and max player make moves in turn. Suppose that leaf nodes in this tree get a value in the partially ordered space $X$ assigned to them. Is there a way, given the values at the leaves, to define an evaluation function on the entire tree? Let’s first assume that both players in fact use an (unspecified) linear extension $\prec$ of the partial order $\leq$ to select their optimal move. (Dasgupta et al call such a linear extension a strategy. See the intermezzo below for why it is fair to assume that both players use the same strategy.) Here is an example of how this linear extension can be used to define an evaluation function on a tree of positions. The symbols “$\bullet$” at the leaf nodes indicate some (not necessarily equal) values in $X$. The greatest lower bound and least upper bound $\wedge$ and $\vee$ are taken with respect to the linear extension $\prec$.

An evaluation function based on a linear extension (or strategy)

In this example the minimizing player is to move and the tree represents possible move developments that are considered. The value at inner nodes is defined recursively based on the linear extension: The minimizing player takes the minimum (the greatest lower bound) of the subnodes while the maximizing player takes the maximum (the least upper bound). This leads to a value for the top node in $X$ and the minimizing player will select a move that leads to this minimum.
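As a concrete illustration (not from the post), the recursive rule under a fixed linear extension can be sketched in Python. The model is a made-up one: values in $X$ are points of $\mathbb{Z}^2$ ordered componentwise, and a linear extension is represented by a sort key.

```python
# A minimal sketch, assuming values in Z^2 ordered componentwise and a
# linear extension modelled by a sort key (here lexicographic order,
# which refines the componentwise order).  Tree layout is illustrative.

def evaluate(node, key, is_min=True):
    """Evaluate a game tree under the linear extension given by `key`.

    `node` is either a leaf value in X or a list of child nodes.  The
    minimizing player takes the greatest lower bound (the minimum) and
    the maximizing player the least upper bound (the maximum), both
    with respect to the total order induced by `key`.
    """
    if not isinstance(node, list):  # leaf node with a heuristic value
        return node
    children = [evaluate(c, key, not is_min) for c in node]
    return min(children, key=key) if is_min else max(children, key=key)

# Min player to move at the root, two max nodes below it:
tree = [[(1, 3), (2, 2)], [(0, 4), (3, 1)]]
print(evaluate(tree, key=lambda p: p))  # → (2, 2)
```

A different linear extension, for example the reverse-lexicographic key `lambda p: p[::-1]`, can select a different value at the root, which is exactly the dependence on the strategy discussed below.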

Intermezzo. Assuming that the min player in this example uses an unspecified linear extension (or “strategy” as Dasgupta calls them) to select moves is central in our considerations. But why is it fair to assume that the max player uses the same strategy? The reason is simple. Suppose the max player uses a different strategy or in fact any system at all (even random decisions) to select moves. In that case the altered evaluation function would assign to each node a value that is less than or equal to the value obtained from equal strategies (when compared with the min player’s linear extension). This follows from the recursive definition of the evaluation function. Since the min player’s goal is to reach a minimal value, the worst case scenario is to assume that the opponent uses the same strategy. End of intermezzo.

Any evaluation function defined as above still depends on the choice of linear extension, so different linear extensions give different evaluation functions. What we are looking for instead is a way to define a single evaluation function based only on the given partial order on $X$. There is a nice trick to turn the example above into exactly such a function, but it requires altering the target space. To define the new target space start with $\mathscr{F}(X, \leq)$, the space of all functions that map linear extensions of $\leq$ to $X$. The evaluation defined above at any node can be regarded as an example of such a function by treating the linear extension $\prec$ in its definition not as fixed but as a variable. Let me make this more precise based on the following observations:

1. There is a canonical embedding $\iota: X \hookrightarrow \mathscr{F}(X, \leq)$ as constant functions: $\iota_x(\prec) = x$ for all linear extensions $\prec$.
2. If $f, g \in \mathscr{F}(X, \leq)$ then define an element $f \wedge g \in \mathscr{F}(X, \leq)$ by $(f \wedge g)(\prec) = f(\prec) \wedge g(\prec)$ where $\wedge$ in the right hand side refers to the minimum with respect to $\prec$.
3. If $f, g \in \mathscr{F}(X, \leq)$ then define an element $f \vee g \in \mathscr{F}(X, \leq)$ by $(f \vee g)(\prec) = f(\prec) \vee g(\prec)$ where $\vee$ in the right hand side refers to the maximum with respect to $\prec$.
4. Since any totally ordered set is a distributive lattice with respect to its minimum and maximum operators $\wedge$ and $\vee$ it follows that the definitions above turn $\mathscr{F}(X, \leq)$ into a distributive lattice and therefore also into a partially ordered set.
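Observations 1 through 3 can be sketched directly in Python. As before the model is an illustrative assumption, not from the post: a linear extension is a sort key on $X$, and an element of $\mathscr{F}(X, \leq)$ is a function mapping such a key to a value in $X$.

```python
# A hedged sketch: an element of F(X, <=) is modelled as a Python
# function from a linear extension (a sort key on X) to a value in X.
# Meet and join are defined pointwise, as in observations 2 and 3.

def const(x):
    """Observation 1: the canonical embedding of x as a constant function."""
    return lambda key: x

def meet(f, g):
    """Observation 2: (f ∧ g)(≺) = minimum of f(≺) and g(≺) w.r.t. ≺."""
    return lambda key: min(f(key), g(key), key=key)

def join(f, g):
    """Observation 3: (f ∨ g)(≺) = maximum of f(≺) and g(≺) w.r.t. ≺."""
    return lambda key: max(f(key), g(key), key=key)

# Two points of Z^2 that are incomparable in the componentwise order:
f = meet(const((1, 3)), const((2, 2)))
print(f(lambda p: p))        # lexicographic extension: picks (1, 3)
print(f(lambda p: p[::-1]))  # reverse-lexicographic:   picks (2, 2)
```

That `f` takes different values on different extensions shows why the meet of two constant functions is in general not itself constant, i.e. not in the image of the embedding $\iota$.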

The entire space $\mathscr{F}(X, \leq)$ is too big for our purpose, therefore I define $\mathscr{L}(X, \leq)$, the linear lattice over $(X, \leq)$ as the smallest sublattice of $\mathscr{F}(X, \leq)$ that contains $X$. Then $\mathscr{L}(X, \leq)$ is still a distributive lattice. The linear lattice $\mathscr{L}(X, \leq)$ will serve as the target space for evaluation functions on game trees under the partial order $(X, \leq)$.

Remark. The set $X$ embeds into $\mathscr{L}(X, \leq)$ as a partially ordered set but not always as a lattice. For example if $x, y \in X$ have a greatest lower bound $x \wedge y \in X$ then $\iota_{x \wedge y}$ is a lower bound for $\iota_x$ and $\iota_y$ but not necessarily the greatest lower bound. End of remark.

Example. If $(X, \leq)$ is a totally ordered space then $\mathscr{L}(X, \leq) = (X, \leq)$. End of example.

An evaluation function on a game tree is now defined recursively, as in the picture above, as follows.

1. Assign values in $X \subseteq \mathscr{L}(X, \leq)$ to the leaf nodes based on some heuristics.
2. Define the value of an inner min node as the greatest lower bound in $\mathscr{L}(X, \leq)$ of the values of all its direct successors.
3. Define the value of an inner max node as the least upper bound in $\mathscr{L}(X, \leq)$ of the values of all its direct successors.
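The three steps above can be sketched in Python. As an illustrative model (an assumption of this sketch, not from the post), a linear-lattice element is a function mapping a linear extension, given as a sort key on $X = \mathbb{Z}^2$, to a value in $X$; the tree is traversed once and the resulting element can then be queried per extension.

```python
# A minimal sketch, assuming values in Z^2 ordered componentwise and
# linear extensions modelled as sort keys.  The lattice element built
# for the root is a function of the linear extension.

def lift(node, is_min=True):
    if not isinstance(node, list):
        return lambda key: node  # step 1: leaf value as a constant function
    kids = [lift(c, not is_min) for c in node]
    pick = min if is_min else max  # steps 2 and 3: meet at min, join at max
    return lambda key: pick((k(key) for k in kids), key=key)

value = lift([[(1, 3), (2, 2)], [(0, 4), (3, 1)]])
print(value(lambda p: p))        # one linear extension ...
print(value(lambda p: p[::-1]))  # ... another may select a different value
```

Since meet and join in $\mathscr{L}(X, \leq)$ are pointwise, querying `value` at a key is the same as running the whole minimax evaluation under that single linear extension.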

Note that this construction is completely in line with the results of Ginsberg et al. The fact that $\mathscr{L}(X, \leq)$ is a distributive lattice even implies that alpha-beta pruning should be applicable in the recursive computation of the evaluation function. I will address that part in the next post.

The linear lattice may seem rather abstract but the fact that it is distributive and generated by $X$ means that each element can be put in both a disjunctive and conjunctive normal form built from elements of $X$ only. These normal forms can be described directly in terms of minimizing and maximizing orders as introduced in the first post. This will identify the linear lattice with what Dasgupta et al call “dominance algebra”. Define two partially ordered spaces, suggestively called $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$, by applying the star order extension twice as in the picture below:

Two step construction of DNF (top) and CNF (bottom)

So $\textrm{CNF}(X, \leq)$ is the set of finite antichains in $(X^{\ast}, \leqslant)$ ordered by their maximizing order and $\textrm{DNF}(X, \leq)$ the set of finite antichains in $(X^{\ast}, \eqslantless)$ ordered by their minimizing order. If $\prec$ is a linear extension of $\leq$ I denote its minimum and maximum operators by $\curlywedge$ and $\curlyvee$ respectively to distinguish them from the lattice operators in CNF and DNF. The following statements identify the normal forms of $\mathscr{L}(X, \leq)$.

CNF: For any partially ordered space $(X, \leq)$ the space $\textrm{CNF}(X, \leq)$ is a lattice with minimal and maximal elements $\mathbf{0} = \{\varnothing\}$ and $\mathbf{1} = \varnothing$ respectively. The sublattice $\textrm{CNF}(X, \leq) \setminus \{\mathbf{0}, \mathbf{1}\}$ is isomorphic to $\mathscr{L}(X, \leq)$ via the rule $x(\prec) = \curlywedge \{ \curlyvee s \mid s \in x\}$.

DNF: For any partially ordered space $(X, \leq)$ the space $\textrm{DNF}(X, \leq)$ is a lattice with minimal and maximal elements $\mathbf{0} = \varnothing$ and $\mathbf{1} = \{\varnothing\}$ respectively. The sublattice $\textrm{DNF}(X, \leq) \setminus \{\mathbf{0}, \mathbf{1}\}$ is isomorphic to $\mathscr{L}(X, \leq)$ via the rule $x(\prec) = \curlyvee \{ \curlywedge s \mid s \in x\}$.
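As an illustration of the two evaluation rules (with hand-picked values in $\mathbb{Z}^2$ and linear extensions modelled as sort keys, both assumptions of this sketch), the CNF and DNF representations of the same lattice element can be checked against each other:

```python
# A hedged sketch of the evaluation rules above: a CNF element is a set
# of clauses (finite subsets of X) evaluated as a meet of joins, a DNF
# element a set of terms evaluated as a join of meets.  The antichain
# conditions are not enforced here; the values are a toy example.

def eval_cnf(x, key):
    return min((max(s, key=key) for s in x), key=key)

def eval_dnf(x, key):
    return max((min(s, key=key) for s in x), key=key)

# The join of the constants (1,3) and (2,2) in both normal forms:
cnf = [[(1, 3), (2, 2)]]    # one clause: (1,3) ∨ (2,2)
dnf = [[(1, 3)], [(2, 2)]]  # two singleton terms
for key in (lambda p: p, lambda p: p[::-1]):
    assert eval_cnf(cnf, key) == eval_dnf(dnf, key)
```

The assertions passing for both extensions is a small instance of the isomorphism claim: two normal forms represent the same element of $\mathscr{L}(X, \leq)$ exactly when they agree on every linear extension.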

I will only show the CNF case. In the previous post we saw that any maximizing order always has a greatest lower bound operator $\wedge$ and it also has a least upper bound operator $\vee$ if its base space has one. Since in particular $(X^{\ast}, \leqslant)$ always has a least upper bound operator, this shows that $(X^{\ast \ast}, \leqslant)$ is a lattice. The empty set (which is a finite antichain) is maximal for any maximizing order and minimal for any minimizing order. Therefore the elements $\mathbf{0} = \{\varnothing\}$ and $\mathbf{1} = \varnothing$ are minimal and maximal elements for this lattice respectively. A simple verification shows that the given evaluation rule at least defines a surjective homomorphism onto $\mathscr{L}(X, \leq)$ so it remains to prove that this homomorphism is also injective (meaning that the conjunctive normal form is unique). Suppose $A, B \in \textrm{CNF}(X, \leq)$ are such that $A(\prec) = B(\prec)$ for all linear extensions $\prec$. I will show that this implies $A \subseteq B$ (which suffices since by symmetry then also $B \subseteq A$ and $A = B$).

Let $a \in A$. For any $x \in a$ there exists a linear extension $\prec$ such that

1. $x$ is maximal in the lower set ${\downarrow} a$ and
2. if $y \in X$ and $y \prec x$ then $y \in {\downarrow} a$.

This implies that $A(\prec) = x$: since $A$ is an antichain, no element of $A$ other than $a$ itself can be entirely contained in ${\downarrow} a$, so every other element of $A$ has a maximum greater than $x$ with respect to $\prec$. By assumption then also $B(\prec) = x$. In particular every $b \in B$ with $b \leqslant a$ must contain $x$, and at least one such element must exist in $B$. Since this holds for any $x \in a$ it follows that $a \subseteq b$ for all $b \in B$ with $b \leqslant a$. This implies that $a = b$ and hence $a \in B$.

Summing up the results of this post: The linear lattice $\mathscr{L}(X, \leq)$ is the natural target space for evaluation functions on game trees based on heuristic evaluation of positions with values in the partially ordered set $(X, \leq)$. Both lattices $\textrm{CNF}(X, \leq)$ and $\textrm{DNF}(X, \leq)$ (minus their maximal and minimal elements) are isomorphic to the linear lattice $\mathscr{L}(X, \leq)$ and explicitly represent elements in their unique conjunctive and disjunctive normal form respectively. In particular the CNF and DNF lattices are themselves distributive and isomorphic. The isomorphisms both ways correspond to what in the paper by Dasgupta et al is called “MAX to MIN” and “MIN to MAX”.

In the next and final post in this series I will discuss alpha-beta pruning for game trees under a partial order.