Vectors will be interpreted as column vectors and transposition is denoted by a superscript asterisk. It is assumed that is non-singular (and therefore positive definite). Define . Then is positive semi-definite on . Let be orthogonal eigenvectors in order of decreasing eigenvalue and let be the orthogonal projection of onto the span of for . (Note that this definition leaves some choice in case not all eigenvalues are distinct since the projections are not unique in that case.) Now the main result states:

For each the linear map minimises the expected square error among all linear maps of rank .

Let’s apply this result to two special cases. For the first case we simply assume that , and . In this case and . So the projections project onto the eigenspaces of the covariance matrix and . This result coincides with principal component analysis for the variable .
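The optimal rank-k map can be computed directly from an eigendecomposition of the covariance matrix. Here is a minimal numpy sketch (dimensions and data are illustrative, not from the post) that builds the top-k eigenprojection and checks that it reconstructs no worse than an arbitrary rank-k projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of a zero-mean variable x (dimensions here are illustrative).
n, k = 5, 2
A = rng.normal(size=(n, n))
X = rng.normal(size=(10000, n)) @ A.T        # rows are samples of x
M = X.T @ X / len(X)                         # second moment E[x x^T]

# Eigendecomposition; eigh returns eigenvalues in ascending order.
w, V = np.linalg.eigh(M)
Vk = V[:, ::-1][:, :k]                       # top-k eigenvectors
P = Vk @ Vk.T                                # orthogonal projection of rank k

# Mean square reconstruction error of the PCA projection versus an
# arbitrary rank-k projection; the PCA one must be at least as good.
err_pca = np.mean(np.sum((X - X @ P) ** 2, axis=1))
B = rng.normal(size=(n, k))
Q = B @ np.linalg.pinv(B)                    # projection onto a random subspace
err_rand = np.mean(np.sum((X - X @ Q) ** 2, axis=1))
assert err_pca <= err_rand
```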

For the second case assume that and for some random diagonal matrix where each diagonal entry is independent (also of ) and Bernoulli distributed with probability . The matrix models *dropout* in the coefficients of . Let be the covariance matrix of , the diagonal of , and . In this case

and .

Now are projections onto eigenspaces of

,

which is the positive semi-definite matrix that appeared in the previous post about linear encoders with dropout. Finally in this case .
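The effect of the dropout matrix on second moments is easy to check numerically. The following Monte Carlo sketch (dimensions, keep-probability and data are made up) verifies that for a diagonal matrix B with independent Bernoulli(p) entries, independent of x, the second moment E[Bx (Bx)ᵀ] equals p²·E[xxᵀ] + p(1−p)·diag(E[xxᵀ]) — the same kind of diagonal correction term that appears in the matrix above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.7, 4, 200000

A = rng.normal(size=(n, n))
X = rng.normal(size=(trials, n)) @ A.T          # samples of x
S = X.T @ X / trials                            # second moment E[x x^T]

B = rng.random(size=(trials, n)) < p            # Bernoulli dropout masks
Y = B * X                                       # B x, with B diagonal
M = Y.T @ Y / trials                            # empirical E[Bx (Bx)^T]

# Off-diagonal entries pick up a factor p^2, diagonal entries only p:
M_theory = p**2 * S + p * (1 - p) * np.diag(np.diag(S))
assert np.max(np.abs(M - M_theory)) < 0.2
```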

- The operator is unitary: on .
- The expected square error is minimal among pairs .

The second requirement is interpreted as “ reduces the dimension of the variable from to with a minimal loss of information”. The function

has a saddle point exactly when

- The image of in is invariant under .
- .

Note that is an orthogonal projection in this case. The expected error among such saddle points is minimal if the image of is a direct sum of eigenspaces of with the largest possible eigenvalues.

In practical situations the variable is a finite sample of observations, each occurring with equal probability. Doing a least-squares fit on as above risks being oversensitive to features that are only apparent in this specific sample. In other words it may be sensitive to outliers. One way to reduce this sensitivity is to introduce *dropout*.

A standard form of dropout is the following. Let be a Bernoulli random variable with expectation and a random operator with mutually independent coefficients (so for all indices ). This matrix is also taken to be independent of . Now in dropout the operator is replaced by the Hadamard product . This means that each coefficient of can “drop out” independently with probability . A linear autoencoder *with dropout* is a pair of operators similar to the pair above, but now minimises the altered expected error

.

Here the expectation is for the joint distribution of the independent pair . The idea is that must now be robust against random dropout and that this prevents it from being oversensitive to accidental features in . Also in this dropout case the saddle points of the function

can be described explicitly. Let denote the diagonal operator with the same diagonal entries as . The pair is a saddle point if

- The image of is invariant under .
- .

Indeed for (probability of dropout is zero) this reduces to the criterion above for a linear autoencoder without dropout.

was expressed in terms of the angle between the gradients of and :

.

Note that if both images are equal then everywhere and mutual information is not well defined in the form above. This can be fixed by adding a Gaussian error term to the Jacobian of as we will explore at the end of this post. Here we take a different approach that circumvents this problem altogether.

The goal of image registration is to position the reference image by shifting it relative to such that mutual information is maximized. At that position the reference image provides the most information about . This makes perfect sense for image registration. Note however that the formula above is rather expensive to use directly for this purpose since it has to be computed for *every possible* offset of the reference image . One way around this is to apply some algorithm to reduce the number of positions to inspect, such as steepest ascent. This is not what we will do here. Instead we use an approximation for mutual information that *is* efficient to compute for all positions.

Observing that the integrand in the mutual information expression is a function of we will approximate it by trigonometric polynomials of the form

for some finite degree . The constant term is chosen such that

for each trigonometric polynomial, so the area below all these functions is the same on the interval . The chosen approximation method considers these functions as probability densities. In particular each approximation is chosen such that

- for all
- and
- minimizes the Kullback-Leibler divergence among all such trigonometric polynomials of degree .

The first three approximations are:

Their graphs are shown below. Note that the peak around gets more pronounced with an increasing degree. Also turns out to be very close (up to a scalar multiple) to which is easier to remember and could be used as an alternative approximation.

Here is a plot of together with the actual distribution .

Instead of the actual distribution for mutual information we can use any of the approximations to compute the approximate mutual information

.

Moreover this value can be computed efficiently for every image offset of the reference image as a cross-correlation of vector fields as we will see now. Identify with the complex plane . The inner product on translates to complex numbers as

.

Write the gradients of and normalized to length one as complex fields and (so ). Then for the angle between these gradients and any integer we have

.

This shows that every term in is a cross-correlation between two fields ( and ) and hence can be computed efficiently using FFT. Since for the reference image everything can be pre-computed the approximate mutual information can be computed with real FFT computations: two per term for to compute its spectrum and one reverse transform to compute the cross-correlation from a linear combination of all these spectra.
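As a concrete illustration, the sketch below (numpy, periodic boundary conditions, made-up 8×8 unit gradient fields) computes one cross-correlation term for all offsets at once via FFT and checks a single entry against direct summation. For unit complex numbers u·conj(v) = e^{iθ}, so cos(kθ) = Re((u·conj(v))^k):

```python
import numpy as np

rng = np.random.default_rng(2)

# Unit-length complex "gradient" fields on a small periodic grid.
def unit_field(shape, rng):
    z = rng.normal(size=shape) + 1j * rng.normal(size=shape)
    return z / np.abs(z)

u = unit_field((8, 8), rng)
v = unit_field((8, 8), rng)

def corr_term(u, v, k):
    """Sum of cos(k*theta) over all circular offsets, via FFT.

    The sum of u**k against the shifted conjugate of v**k is a circular
    cross-correlation, computed here with the cross-correlation theorem."""
    U, V = np.fft.fft2(u**k), np.fft.fft2(v**k)
    return np.real(np.fft.ifft2(U * np.conj(V)))

# Check one entry against direct summation for the offset (dy, dx) = (1, 3).
k, dy, dx = 2, 1, 3
shifted = np.roll(v, (dy, dx), axis=(0, 1))
direct = np.real(np.sum(u**k * np.conj(shifted**k)))
fast = corr_term(u, v, k)[dy, dx]
assert abs(direct - fast) < 1e-9
```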

As a bonus this method can be used without any modification to locate only a part of the reference image in : Simply mask out the part of to be ignored by setting its normalized gradient field to zero there. This works because the fields are normalized and are not affected by local variations in gain and offset when the masked reference image is matched with . (In contrast, such variations make this simple masking scheme useless for an ordinary cross-correlation approach.)

A final remark about how, as mentioned in the beginning of this post, error terms can be introduced in mutual information to get rid of the singularity at . If the Jacobian appearing in the definition of mutual information is distorted by adding a Gaussian error term then mutual information is changed to

where is an error ratio. Note that for we regain the original mutual information while for the other extreme we have . Note that for the singularity at is resolved. For those who want to explore further in this direction I leave you with the remark that

which helps in treating the integrand as a probability density in the angle .

for and .

The *entropy* of is defined by the integral

where is the determinant and the standard volume element. (In this post I will disregard any question of well-definedness of this integral.) To motivate this definition: if and is injective then is the usual differential entropy of the push forward of the standard volume form.

If then the Jacobian equals the gradient of and so the entropy becomes

.

Let then the *mutual information* of is defined by

.

Mutual information is always non-negative. It expresses how much information is gained by knowing the joint value distribution of compared to knowing only the value distributions of the separate coordinates . In other words, mutual information is a measure of dependence between the coordinates: The higher the dependence the higher the mutual information while for independent coordinates the mutual information is (there is no information to be gained from their joint value distribution).

The nice thing about mutual information is that it is invariant under any injective coordinate-wise distortion. In imaging related terms it is for example invariant under changes of gamma, gain and offset of the image. This is hugely beneficial in practical imaging applications where lighting conditions are never the same. Different images (the coordinates) may even have been produced with completely different sensing equipment.

A key observation about mutual information is the following:

for some function with values in that depends only on the direction of the gradients but not their length. Moreover if and only if the gradients are linearly dependent and if and only if they are mutually orthogonal. Using this decomposition mutual information can be expressed as

.

This confirms that mutual information is non-negative since and therefore . I will conclude this post by looking at the specific case of a pair of 2-dimensional images so the case that . Then the function has a simple explicit form. Let be the angle between the gradients and . Then

.

There are two pleasant observations to make:

- Mutual information of a pair of images depends only on the double angle between their gradients. In particular it does not depend on the length or a sign change of either gradient.
- The expression is easy to compute as an inner product. The double angle can be accounted for by a simple rational transformation of the gradient. This will be explained in more detail in a later post.

A later post will discuss the application of mutual information to image registration. It results in a method that is very efficient (based on FFT), is robust against image distortions and can also be applied to register (locate) a partial template image of any shape within a bigger scene.

.

The results in this post were found while looking for ways to approximate this Mahalanobis norm without the need to invert . (Later I realised that using the Cholesky factorisation suited me better. Nice results can be found by looking in the wrong places!) The idea is to use projections on some smaller dimensional subspace to get estimates of the actual Mahalanobis norm. To be precise let be some subspace of dimension and let be the orthogonal projection onto . The operator is non-singular on the subspace . Let be its pseudo inverse such that . The projected Mahalanobis norm on is defined by

.

Let’s take the one-dimensional case as an example. Let be non-zero and denote the span of by . Then the norm is given by

.

Note that this expression does not involve the inverse of . The basic property of the projected Mahalanobis norm is the following:

The inequality holds throughout . Equality occurs if and only if .

This property follows from the Cauchy-Schwarz inequality for the inner product :

.

This is an equality if and only if and are linearly dependent. Combined with it follows that in fact .

The following realisation came as a surprise. It shows that projections onto two-dimensional subspaces suffice to get an *exact* value for the Mahalanobis norm:

Let be a non-zero vector and let be the span of (so ). Then .

The projected norm for a two-dimensional subspace also has a simple explicit form. Let be a non-zero vector orthogonal to and let be the span of . The norm is given by

.
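These formulas are easy to check numerically. The sketch below (numpy; the covariance matrix, subspaces and test vectors are made up) computes the projected norm via the pseudo-inverse, verifies the one-dimensional formula and the Cauchy-Schwarz bound, and tests the two-dimensional exactness in the form I read the statement: for U = span{v, Σv} the projected norm of x = Σv equals the true Mahalanobis norm, since Σ⁻¹x = v lies in U.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.normal(size=(n, n))
S = A @ A.T + np.eye(n)            # a positive definite covariance matrix

def mahalanobis_sq(x, S):
    return x @ np.linalg.solve(S, x)

def projected_sq(x, U, S):
    """Projected Mahalanobis norm squared: x^T (P S P)^+ x, where P is
    the orthogonal projection onto the column span of U."""
    Q, _ = np.linalg.qr(U)
    P = Q @ Q.T
    return x @ np.linalg.pinv(P @ S @ P) @ x

v = rng.normal(size=n)

# One-dimensional case: (v^T x)^2 / (v^T S v), no inverse of S needed.
x = rng.normal(size=n)
one_dim = (v @ x) ** 2 / (v @ S @ v)
assert abs(one_dim - projected_sq(x, v[:, None], S)) < 1e-8
assert one_dim <= mahalanobis_sq(x, S) + 1e-8   # Cauchy-Schwarz bound

# Two-dimensional exactness on U = span{v, Sv} for x = Sv.
x = S @ v
U = np.column_stack([v, S @ v])
assert abs(projected_sq(x, U, S) - mahalanobis_sq(x, S)) < 1e-6
```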

where is a complex function and a real eigenvalue (related to energy in physical terms). The operator is called the Hamiltonian and will be denoted by . Only radial solutions will be considered. A complex function is called radial when it is of the form

or

for some real analytic function and some . Radial functions of the first and second form are said to be of degree and respectively. If is a radial function of degree and then .

Working over it is convenient to rewrite the Laplace operator as follows. Let and . If is a complex differentiable function then and while and . (These are the Cauchy-Riemann equations.) The operators and commute and are related to the Laplace operator by

The Hamiltonian is therefore and we will use it in this form. For a radial function

.

Applying the Hamiltonian results in

and in particular if then . So for each degree the function is an eigenfunction of with eigenvalue . Complex conjugation shows that the same holds for of degree . For each degree we found a radial solution for the harmonic oscillator of degree and eigenvalue . These examples do not exhaust all such solutions however. Others can be found by a clever trick that was already known in this context by Schrödinger and Dirac.
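These eigenfunctions can be verified symbolically. The sketch below assumes the normalisation H = −Δ + x² + y² and the radial functions ψ_d = (x+iy)^d e^{−(x²+y²)/2} with eigenvalue 2(d+1); the post’s normalisation may differ by constant factors.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y

def H(psi):
    """2D harmonic oscillator Hamiltonian: -Laplacian + r^2
    (this normalisation is an assumption, not taken from the post)."""
    return -sp.diff(psi, x, 2) - sp.diff(psi, y, 2) + (x**2 + y**2) * psi

for d in range(4):
    psi = z**d * sp.exp(-(x**2 + y**2) / 2)
    # Radial eigenfunction of degree d with eigenvalue 2(d + 1):
    assert sp.simplify(sp.expand(H(psi) - 2 * (d + 1) * psi)) == 0
```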

This trick is known as the factorisation or algebraic method and the auxiliary operators that appear are called ladder operators or annihilation and creation operators. The following facts can be readily verified:

- The operators and lower the degree of a radial function by .
- The operators and raise the degree of a radial function by .
- If is a radial function of degree then and

Combining these observations we find for a radial function of degree :

In particular if is a solution with eigenvalue then is a solution of degree and eigenvalue . In this case and so if . Under the assumption that we can assemble the following table of solutions based on :

The first two operators in this table are called raising or creation operators: they raise the eigenvalue or create energy. The last two are called lowering or annihilation operators for similar reasons. Starting from the solution of degree and eigenvalue and repeatedly applying the operators and we find non-zero solutions for all pairs for which is odd and . This process results in the solutions below up to scalar multiples. Solutions for negative degrees can be obtained by complex conjugation.

.

Define a function by

and take . From and it follows that

.

So is non-increasing on this interval and therefore or

.

This inequality clearly also holds at the endpoint and since both sides are symmetric in it holds throughout the interval . (It suggests that closely resembles a normal distribution on this interval.) Integration of the inequality above leads to

.

Both the far left and right hand sides of this inequality can be computed explicitly. Starting with the right hand side let

.

Then

And so . To evaluate the left hand side let

By explicit computation we find and . The other values can be found by a recursive relation that follows from partial integration. For positive we have

and therefore the recursion . Using this recursion one can check that the even and odd entries of the sequence are given respectively by

.

Putting all results so far together we find for positive even integers

and for odd integers

.
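Both the recursion and the closed forms can be checked with sympy. The sketch below assumes the integrals in question are the classical Wallis integrals c_n = ∫₀^{π/2} cosⁿ t dt (the post’s integrand may differ by a substitution); the recursion from partial integration and the even/odd closed forms then come out as stated:

```python
import sympy as sp

t = sp.symbols('t')

def c(n):
    """Wallis integral c_n = integral of cos^n over [0, pi/2]."""
    return sp.integrate(sp.cos(t)**n, (t, 0, sp.pi / 2))

assert c(0) == sp.pi / 2 and c(1) == 1

# Recursion c_n = (n - 1)/n * c_{n-2}, from partial integration.
for n in range(2, 8):
    assert sp.simplify(c(n) - sp.Rational(n - 1, n) * c(n - 2)) == 0

# Closed forms for even n = 2m and odd n = 2m + 1 (double factorials).
for m in range(1, 4):
    even = sp.pi / 2 * sp.factorial2(2 * m - 1) / sp.factorial2(2 * m)
    odd = sp.factorial2(2 * m) / sp.factorial2(2 * m + 1)
    assert sp.simplify(c(2 * m) - even) == 0
    assert sp.simplify(c(2 * m + 1) - odd) == 0
```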

The image of is invariant under a shift . The “Little Picard” theorem then asserts that this function maps onto since it cannot omit just a single value and so it must have a root. Another way uses “Great Picard” and the fact that only has a *single* root so it must attain the value infinitely often. Both approaches seem nice enough but depend on non-trivial theorems (“Great Picard” more so than “Little Picard”) and there is no indication of the location of the fixed points. A more pedestrian approach shows that there must be infinitely many fixed points along the curve but this is hardly in the spirit of complex analysis. This was the state of affairs until this week. Then I found a simple and much more satisfying answer.

The Banach fixed point theorem asserts that every contraction of a complete metric space has a unique fixed point. I will use that theorem in the following setting.

Let be a non-empty closed convex subset of an open set and let be holomorphic. If there exists a constant such that on then restricted to is a contraction and therefore has a single fixed point in .

For let denote the horizontal strip

and let be the branch of the logarithm that maps the slit complex plane onto the interior of . If then

on so the conditions of the theorem above apply and therefore has a unique fixed point (which must lie in the interior). Each is a fixed point of . Unfortunately the same argument does not work for where has two more fixed points.
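The contraction argument translates directly into a numerical procedure: iterating the shifted logarithm converges to the fixed point of exp in each strip. A minimal sketch (the starting point and the strip description are assumptions consistent with the argument above, using the principal branch of the logarithm):

```python
import cmath

def fixed_point_of_exp(k, iterations=80):
    """Iterate z -> Log(z) + 2*pi*i*k, which maps into the strip
    2*pi*k - pi < Im(z) <= 2*pi*k + pi and is a contraction there
    for k >= 1, so by the Banach fixed point theorem the iteration
    converges to the unique fixed point of exp in that strip."""
    z = 1 + 2j * cmath.pi * k          # any starting point in the strip
    for _ in range(iterations):
        z = cmath.log(z) + 2j * cmath.pi * k
    return z

for k in (1, 2, 3):
    z = fixed_point_of_exp(k)
    assert abs(cmath.exp(z) - z) < 1e-12    # genuinely a fixed point of exp
```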

The fixed point method works equally well for a number of other functions. Here are two more examples:

**Example 1.** Let and . The branch of that maps the (closed) upper half plane onto the half strip

is a contraction on . So has exactly one fixed point in each such strip (and therefore also exactly one in its complex conjugate).

**Example 2.** Let and odd. The branch of that maps the complement of the open unit disc into the strip

is a contraction on . So has exactly one fixed point in each such strip.

These examples for , and work so well because their inverses have a nice derivative that is less than one except in a small bounded region. Put another way, these functions all satisfy a differential equation of the form for some and some polynomial .

converges. Depending on the rate at which decays it may however take many terms to get a decent approximation of the sum. Examples of such slowly converging series are

A well known method to accelerate the convergence of such alternating series uses the binomial transform of the function as follows. For let be the -th difference of :

Then it turns out that

and the right hand side may converge much faster. The binomial transform for the two examples above can be computed explicitly to obtain

Both transformed series indeed converge much faster than the original ones. You may recognize the first as the series for taken at . The second one is harder to spot but it turns out to be the series of

taken at .

The binomial transform is a great tool if you can find an explicit expression for it such as in both examples above. This is however not a trivial transform in general. If you do not know an explicit expression for the binomial transform then it is also not very convenient to compute: depends on the values for all .

There is however another much simpler method that only involves the repeated difference for a single *fixed* value of . This difference is now easy to compute (numerically) since depends only on consecutive values of . It is convenient for the sake of notation to extend to by setting for all . Then for all .

Now the following equalities hold in which the second and third sums are taken over :

If all terms are positive for then partial sums increase for the second series and decrease for the third from onward. This means that these partial sums are lower and upper bounds for the series which is useful to estimate the remaining error for these partial sums.
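For a concrete illustration, here is a small Python sketch of the classical binomial-transform (Euler) acceleration, which is built from the same repeated differences as the fixed-n variant described above (the two differ only in bookkeeping). Applied to the slowly converging series for ln 2 and π/4 it reaches near machine precision in a few dozen terms:

```python
import math

def alt_sum_accelerated(a, depth=35):
    """Approximate sum_{n>=0} (-1)^n a(n) by the Euler transform
    sum_k (Delta^k a)(0) / 2^(k+1), where (Delta a)(n) = a(n) - a(n+1)."""
    d = [a(n) for n in range(depth + 1)]
    total = 0.0
    for k in range(depth + 1):
        total += d[0] / 2 ** (k + 1)          # (Delta^k a)(0) / 2^(k+1)
        d = [d[i] - d[i + 1] for i in range(len(d) - 1)]
    return total

# sum (-1)^n / (n+1) = ln 2 and sum (-1)^n / (2n+1) = pi/4; both
# converge very slowly directly but very fast after the transform.
assert abs(alt_sum_accelerated(lambda n: 1.0 / (n + 1)) - math.log(2)) < 1e-9
assert abs(alt_sum_accelerated(lambda n: 1.0 / (2 * n + 1)) - math.pi / 4) < 1e-9
```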

As an example take . The series for and now become

Alpha Beta pruning is a way to compute the value of an evaluation function on a game tree for the root node by visiting as few child nodes as necessary. See the previous post for a more extensive introduction to evaluation functions. Here I will only briefly recall the recursive definition of an evaluation function on a game tree that takes values in a lattice . I will assume that has a minimal and maximal element denoted by and . The normal forms and of the linear lattice are examples of this.

- The value of for a leaf node is given by some heuristic game evaluation function (e.g. chess piece values).
- At an inner min node (where the minimizing player is to make a move) the value of is defined recursively as the greatest lower bound in of all values of on direct child nodes.
- At an inner max node the value of is defined as the least upper bound of all values of on direct child nodes.

This definition leads to the following algorithm for the computation of . This algorithm is also the basis of further considerations on ways to prune parts of the game tree that are irrelevant for the computation of on the root node.

```
function f(node)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child)
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child)
        end
    end
    return value
end
```

The structure of this naive algorithm is simple. To compute the value at a min node start with the maximal value then iterate over all child nodes and adjust the node’s value by taking the greatest lower bound with the value of a child node. At a max node the procedure is similar but starts off with the minimal value and repeatedly takes the least upper bound. The computation recurses depth first left to right over all child nodes, where “left to right” is according to the enumeration order of child nodes. Move ordering does not play a role in this naive algorithm but it can play a big role in alpha beta pruning in terms of the number of pruned nodes. I will not address move ordering in this post. During execution of the algorithm the state of the computation is schematically depicted below. As in the example in the previous post I will assume that the top node is a min node.

Here and are the intermediate values that are computed for the inner min and max nodes respectively. The symbol indicates all subtrees located to the left of the current path of computation in the tree. These are the subtrees for which a value is already (recursively) computed. As the computation progresses each intermediate value descends from and each ascends from . If we were to cut the computation short at the point depicted above then the algorithm would backtrack up and yield the final outcome

at the root node. The essential point of alpha beta pruning is that it identifies when this intermediate value remains constant even when gets larger in the remainder of the computation, because in that case any further children of the node marked in the diagram *will not contribute* to the final result and can be pruned (skipped). In the picture above the search path ends with a max node but that is irrelevant for alpha beta pruning as we will see.

A lattice has a canonical partial order where is equivalent to (or alternatively ). The lattice operators and indeed result in the greatest lower bound and least upper bound with respect to this ordering. The following lattice property is the key point in alpha beta pruning.

**Lemma:** When and then and .

Since the greatest lower bound is also a lower bound for and and is therefore less than or equal to the greatest lower bound . The other inequality is derived similarly. For sequences and in define sequences and recursively by

- and is obtained from by the substitution .
- and is obtained from by the substitution .

So the sequence starts with

and the sequence with

The following inequalities are a direct consequence of our lemma above. The sequences and satisfy

for all indices . In particular is an ascending sequence, is a descending sequence, each is a lower bound for and each is an upper bound for . If or for some index then both sequences are *constant* from that point onwards. The values in the sequences and are precisely those obtained from aborting the computation of the evaluation function for the root node at min node (sequence ) or at max node (sequence ). Hence further nodes can be pruned from the computation as soon as both sequences yield equal values.

**Alpha Beta pruning (strong version):** At the first index where either or further subnodes of max node or min node can be pruned.

Note that this formulation of alpha beta pruning resembles “classical” alpha beta pruning for totally ordered sets in that we have an ascending sequence of high values and a descending sequence of low values, and nodes can be pruned as soon as these values coincide. There is also an unfortunate difference between the two. In classical alpha beta pruning the high and low values are easy to compute progressively as a simple maximum or minimum value. In contrast, high and low values in the pruning algorithm above are not so simple to compute. Since they are computed deepest node first it requires constant backtracking. Moreover, computation of and can be much more expensive than simply taking the maximum or minimum of two numbers.

Also note that the distributive property of the lattice is not used at all in this version of alpha beta pruning! This does not contradict the results of Ginsberg et al since they formulate alpha beta pruning (in particular “deep pruning”) differently. I will briefly come back to this point at the end of this post.

There is a weaker version of alpha beta pruning based on alternative sequences and that are easier to compute. The consequence is however that it may take longer before the ascending high values and descending low values allow cut off, resulting in less aggressive pruning. The weaker estimates for high and low values follow from the following identities (where the distributive property of is now essential). Let and define two new sequences and by

and

.

So is the least upper bound of the values of all max nodes above min node and is the greatest lower bound of the values of all min nodes above max node . If we take and then for all indices the following equalities hold:

and

.

These can be proved by induction and the distributive property of the lattice . Taking in the first equation and in the second results in

and

.

Comparing these equalities it follows that implies and implies . This leads to the following version of alpha beta pruning.

**Alpha Beta pruning:** If the value of min node gets less than or equal to or the value of max node gets greater than or equal to then further child nodes can be pruned.

This is a slightly different formulation of alpha beta pruning than Ginsberg et al use in their paper. What they call (shallow or deep) pruning comes down to either for some at min node or for some at max node . These are stronger conditions than those we found above and will therefore prune fewer nodes in general.

Here is an algorithmic formulation of alpha beta pruning. The value of the root node is computed by `f(root, 0, 1)`.

```
function f(node, α, β)
    local value
    if node is a leaf node then
        value = heuristic position value of node
    elseif node is a min node then
        value = 1
        for child in all children of node do
            value = value ∧ f(child, α, β ∧ value)
            if value ≤ α then break end
        end
    else
        value = 0
        for child in all children of node do
            value = value ∨ f(child, α ∨ value, β)
            if value ≥ β then break end
        end
    end
    return value
end
```

To sum up the results of this post: I described two versions of alpha beta pruning that allow nodes to be pruned from the computation of an evaluation function with values in a distributive lattice at the root node. The first, strong, version has the disadvantage that it is computationally more difficult than the second. Both versions are stronger (potentially prune more nodes) than the method described in the paper by Ginsberg et al.
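The real numbers with ∧ = min and ∨ = max form the simplest distributive lattice, and on them the algorithm above reduces to classical alpha beta pruning. A small Python sketch (random game trees with made-up branching and depth, ±∞ playing the roles of the lattice elements 0 and 1) checking that the pruned computation agrees with the naive one:

```python
import random

def naive(node, is_min):
    # A node is either a leaf value or a list of child nodes.
    if not isinstance(node, list):
        return node
    op = min if is_min else max
    value = float('inf') if is_min else float('-inf')
    for child in node:
        value = op(value, naive(child, not is_min))
    return value

def alphabeta(node, is_min, alpha=float('-inf'), beta=float('inf')):
    if not isinstance(node, list):
        return node
    if is_min:
        value = float('inf')            # the lattice's top element
        for child in node:
            value = min(value, alphabeta(child, False, alpha, min(beta, value)))
            if value <= alpha:
                break                   # prune remaining children
        return value
    value = float('-inf')               # the lattice's bottom element
    for child in node:
        value = max(value, alphabeta(child, True, max(alpha, value), beta))
        if value >= beta:
            break                       # prune remaining children
    return value

def random_tree(depth, rng):
    if depth == 0:
        return rng.randint(0, 100)
    return [random_tree(depth - 1, rng) for _ in range(3)]

rng = random.Random(4)
for _ in range(50):
    tree = random_tree(3, rng)
    assert naive(tree, True) == alphabeta(tree, True)
```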
