## A fast 24 bit PRNG algorithm for the 6502 processor

Popular choices for pseudo random number generators (or PRNG) are linear feedback shift registers (LFSR) and Marsaglia’s xorshift. Both are based on non-zero elements $g$ of maximal order $2^n-1$ in $\mathrm{GL}(n, \mathbb{F}_2)$. Such PRNG algorithms keep a state vector $v \in \mathbb F_2^n$ from which a number of bits is used to generate a “random” number (for example the first $8$ or $16$ coordinates, to get a random byte or word). The operator $g$ is then applied to $v$ to get the next state $\displaystyle v \leftarrow g(v)$ before extracting the next random number, and so forth.

However, both LFSR and xorshift have some drawbacks when trying to implement them on an 8-bit architecture such as the 6502. Both methods rely on bit shifting of the state vector $v$ that consists of several bytes. However, the 6502 only has instructions to shift the accumulator (an 8 bit register) one bit in either direction. On top of that, the 6502 has only two additional registers (X and Y, both 8 bits), that cannot be shifted at all. So implementations need multiple shift instructions and careful use of auxiliary memory locations to store intermediate results. The more instructions and memoy accesses, the more bytes and cycles an algorithm needs. (To be fair, there is a nice implementation of Marsaglia’s algorithm for a 16 bit state and shifts $\ll 7$, $\gg 9$, and $\ll 8$ respectively.)

So let’s turn this situation around. Instead of trying to implement some well-known method, let’s start with some primitive operations that are easy and fast on a 6502 and then try to build a operator $g$ with maximal order from that simple set. I wanted to use a 24 bit (three byte) random state vector and so we’re looking for an operator $g$ of order $2^{24}-1 = 16{,}777{,}215$. Let the three bytes in the random state be a, b, c. We will use the following 18 basic operators as building blocks:

These basic operations can be combined by performing one after another. Such a combination can be denoted as a tuple. For example the tuple (1, 15) represents the following sequence:

1. Clear carry
2. Apply operation 1 to the random state
3. Apply operation 15 to the random state

Note that instead of clearing the carry it is also possible to replace the first occurrence of ROL or ROR in such a sequence with ASL or LSR respectively. It turns out that at least five basic operations must be combined to get an operator of maximum order on the three byte state. In fact, there are no less than 2904 5-tuples that result in a maximal order. However, note that the three bytes a, b, c and the shifts ROL, ROR can be permuted to give equivalent operators. So these 2904 tuples come in 242 groups of 12 equivalent tuples each.

Apart from having maximum order, those operators $g$ are preferred that have a minimal polynomial (of degree 24 in $\mathbb F_2[x]$) with about half of its coefficients equal to 1. For example, both tuples (7, 9, 5, 15, 6) and (7, 7, 4, 6, 8) produce operators with maximal order but the minimal polynomial for the first tuple is

$1+x^2+x^3+x^6+x^{11}+x^{12}+x^{15}+x^{16}+x^{17}+x^{19}+x^{20}+x^{23}+x^{24}$

while for the second tuple it is

$1+x^7+x^9+x^{16}+x^{24}$.

The following code implements the tuple (7, 9, 5, 15, 6) in 27 bytes and 42 cycles. For any initial state other than (0, 0, 0) this code produces –by repeated application– a sequence of states with period $2^{24}-1$.

                lda a           ; Operation 7 (with carry clear).
asl
eor b
sta b
rol             ; Operation 9.
eor c
sta c
eor a           ; Operation 5.
sta a
lda b           ; Operation 15.
ror
eor c
sta c
eor b           ; Operation 6.
sta b


## Least square fitting of vector-valued random variables

Let $(x, y) \in \mathbb{R}^n \times \mathbb{R}^m$ be joint random variables (typically not independent). This post describes which linear map $\sigma{: \mathbb{R}^n \to \mathbb{R}^m}$ relates these variables best in the sense that the expected square error $\mathbb{E} \lVert \sigma(x) - y \rVert^2$ is minimised. Related applications such as principle component analysis and auto encoding under random distortion (such as drop out) are recovered as special cases.

Vectors will be interpreted as column vectors and transposition is denoted by a superscript asterisk. It is assumed that $X = \mathbb{E}(x x^{\ast})$ is non-singular (and therefore positive definite). Define $Y = \mathbb{E}(y x^{\ast})$. Then $Y X^{-1} Y^{\ast}$ is positive semi-definite on $\mathbb{R}^m$. Let $v_1, \ldots, v_m \in \mathbb{R}^m$ be orthogonal eigenvectors in order of decreasing eigenvalue and let $\pi_k$ be the orthogonal projection of $\mathbb{R}^m$ onto the span of $v_1, \ldots, v_k$ for $k \in \{1, \ldots, m\}$. (Note that this definition leaves some choice in case not all eigenvalues are distinct since the projections are not unique in that case.) Now the main result states:

For each $k \in \{1, \ldots, m\}$ the linear map $\sigma_k = \pi_k Y X^{-1}$ minimises the expected square error $\mathbb{E} \lVert \sigma_k(x) - y \lVert^2$ among all linear maps of rank $k$.

Let’s apply this result to two special cases. For the first case we simply assume that $n = m$, $x=y$ and $\mathbb{E}(x) = 0$. In this case $Y = X$ and $Y X^{-1} Y^{\ast} = X$. So the projections $\pi_k$ project onto the eigenspaces of the covariance matrix $X$ and $\sigma_k = \pi_k Y X^{-1} = \pi_k$. This result coincides with principle component analysis for the variable $x$.

For the second case assume that $n = m$ and $x = A y$ for some random diagonal matrix where each diagonal entry $A_{ii}$ is independent (also of $y$) and Bernoulli distributed with probability $p > 0$. The matrix $A$ models dropout in the coefficients of $y$. Let $\Sigma = \mathbb{E}(y y^{\ast})$ be the covariance matrix of $y$, $D$ the diagonal of $\Sigma$, and $q = 1-p$. In this case

$X = p \left( p \Sigma + q D \right)$ and $Y = p \Sigma$.

Now $\pi_k$ are projections onto eigenspaces of

$\Sigma \left(p \Sigma + q D \right)^{-1} \Sigma$,

which is the semi-positive definite matrix that appeared in the previous post about linear encoders with dropout. Finally in this case $\sigma_k = \pi_k \Sigma \left(p \Sigma + q D \right)^{-1}$.

## Linear autoencoder with dropout

We begin with a brief review of least squares fitting formulated in autoencoder language. Let $x$ be a random variable in $\mathbb{R}^n$ such that $\mathbb{E}(x) = 0$ and let $X = \mathbb{E}(x x^{\ast})$ be its covariance matrix. Then $X$ is a self-adjoint (symmetric) operator. Let $d$ be a positive number not greater than $n$. A linear autoencoder for $x$ is a pair of linear operators $R{:\mathbb{R}^n \to \mathbb{R}^d}$ and $C{: \mathbb{R}^d \to \mathbb{R}^n}$ such that

1. The operator $C$ is unitary: $C^{\ast}C = 1$ on $\mathbb{R}^d$.
2. The expected square error $\mathbb{E} \lVert x - CRx \rVert^2$ is minimal among pairs $(C, R)$.

The second requirement is interpreted as “$R$ reduces the dimension of the variable $x$ from $n$ to $d$ with a minimal loss of information”. The function

$\displaystyle (C, R) \mapsto \mathbb{E} \lVert x - CRx \rVert^2$

has a saddle point exactly when

1. The image of $C$ in $\mathbb{R}^n$ is invariant under $X$.
2. $R = C^{\ast}$.

Note that $CR = C C^{\ast}$ is an orthogonal projection in this case. The expected error among such saddle points  is minimal if the image of $C$ is a direct sum of eigenspaces of $X$ with the largest possible eigenvalues.

The variable $x$ is in practical situation a finite sampled set of observations $x_1, x_2, \dots, x_m$ each occurring with equal probability. Doing a least square error fit on $x$ as above has the risk of being oversensitive to features that are only apparent in this specific sample. In other words it may be sensitive to outliers. One way to reduce this sensitivity is to introduce dropout.

A standard form of dropout is the following. Let $b = b(p) \in \{0, 1\}$ be a Bernoulli random variable with expectation $p > 0$ and $B {: \mathbb{R}^n \to \mathbb{R}^d}$ a random operator with mutually independent coefficients $b$ (so $B_{ij} = e_i^{\ast} B e_j = b$ for all indices $i, j$). This matrix $B$ is also taken to be independent of $x$. Now in dropout the operator $R$ is replaced by the Hadamard product $R \circ B$. This means that each coefficient of $R$ can “drop out” independently with probability $q = 1-p$. A linear autoencoder with dropout is a pair $(C, R)$ of operators similar as above but now minimises the altered expected error

$\displaystyle \mathbb{E}\left \lVert x - C (R \circ B) x \right \rVert^2$.

Here the expectation is for the joint distribution of the independent pair $(B, x)$. The idea is that $R$ must now be robust against random dropout and that this prevents it from being oversensitive to accidental features in $x$. Also in this dropout case the saddle points of the function

$\displaystyle (C, R) \mapsto \mathbb{E}\left \lVert x - C (R \circ B) x \right \rVert^2$

can be described explicitly. Let $\mathrm{diag}(X)$ denote the diagonal operator with the same diagonal entries as $X$. The pair $(C, R)$ is a saddle point if

1. The image of $C$ is invariant under $X \left( p X + q \, \mathrm{diag}(X) \right)^{-1} X$.
2. $R = C^{\ast} X \left(p X + q \, \mathrm{diag}(X)\right)^{-1}$.

Indeed for $p=1$ (probability of dropout is zero) this reduces to the criterion above for a linear autoencoder without dropout.

Posted in Uncategorized | 1 Comment

## Mutual information and fast image registration

In the previous post the mutual information for a pair of images

$\displaystyle{f = (f_1, f_2) {:[0,1]^2} \to [0,1]^2}$

was expressed in terms of the angle $\alpha$ between the gradients of $f_1$ and $f_2$:

$\displaystyle{I(f) = \int_{[0,1]^2} -\tfrac12 \log \left(\sin^2(\alpha) \right) \mathrm{d}V = \int_{[0,1]^2} -\tfrac12 \log\left(\frac{1-\cos(2\alpha)}2\right) \mathrm{d}V}$.

Note that if both images are equal then $\alpha=0$ everywhere and mutual information is not well defined in the form above. This can be fixed  by adding a Gaussian error term to the Jacobian of $f$ as we will explore at the end of this post. Here we take a different approach that circumvents this problem altogether.

The goal of image registration is to position the reference image $f_1$ by shifting it relative to $f_2$ such that mutual information is maximized. At that  position the reference image $f_1$ provides the most information about $f_2$. This makes perfect sense for image registration. Note however that the formula above is rather expensive to use directly for this purpose since it has to be computed for every possible offset of the reference image $f_1$. One way around this is to apply some algorithm to reduce the number of positions to inspect such as steepest ascend. This is not what we will do here. Instead we use an approximation for mutual information that is efficient to compute for all positions.

Observing that the integrand $p(\alpha)$ in the mutual information expression is a function of $\cos(2\alpha)$ we will approximate it by trigonometric polynomials of the form

$\displaystyle{p_d(\alpha) = \log(2) + \sum_{k=1}^dc_k \cos(2k \alpha)}$

for some finite degree $d \geq 1$. The constant term is chosen such that

$\displaystyle{\int_{-\pi/2}^{\pi/2}p(\alpha) \mathrm{d}\alpha = \int_{-\pi/2}^{\pi/2} p_d(\alpha) \mathrm{d}\alpha = \pi \log(2)}$

for each trigonometric polynomial, so the area below all these functions is the same on the interval $[-\pi/2, \pi/2]$. The chosen approximation method considers these functions as probability densities. In particular each approximation $p_d$ is chosen such that

1. $p_d(\alpha) \geq 0$ for all $\alpha$
2. $p_d(-\pi/2) = p_d(\pi/2) = 0$ and
3. $p_d$ minimizes the Kullback-Leibler divergence $\mathrm{KL}(p, p_d)$ among all such trigonometric polynomials of degree $d$.

The first three approximations are:

1. $p_1(\alpha) \approx 0.693 + 0.693\cos(2\alpha)$
2. $p_2(\alpha) \approx 0.693+0.912 \cos(2 \alpha)+0.219 \cos(4 \alpha)$
3. $p_3(\alpha) \approx 0.693+0.969 \cos(2 \alpha)+0.425 \cos(4 \alpha)+0.149 \cos(6 \alpha)$

Their graphs are shown below. Note that the peak around $\alpha=0$ gets more pronounced with an increasing degree. Also $p_2$ turns out to be very close (up to a scalar multiple) to $\cos^4(\alpha)$ which is easier to remember and could be used as an alternative approximation.

Here is a plot of $p_3$ together with the actual distribution $p$.

Instead of the actual distribution $p$ for mutual information we can use any of the approximations $p_d$ to compute the approximate mutual information

$\displaystyle{I_d(f) = \int_{[0,1]^2} p_d(\alpha) \, \mathrm{d}V}$.

Moreover this value can be computed efficiently for every image offset of the reference image $f_1$ as a cross-correlation of  vector fields as we will see now. Identify $\mathbb{R}^2$ with the complex plane $\mathbb{C}$. The inner product on $\mathbb{R}^2$ translates to complex numbers as

$\displaystyle{\langle z, w \rangle = \mathrm{Re}(z \overline{w})}$.

Write the gradients of $f_1$ and $f_2$ normalized to length one as complex fields $z_1$ and $z_2$ (so $\lvert z_1 \rvert = \lvert z_2 \rvert = 1$). Then for the angle $\alpha$ between these gradients and any integer $k$ we have

$\displaystyle{\cos(k\alpha) = \langle z_1^k, z_2^k \rangle}$.

This shows that every term in $I_d(f)$ is a cross-correlation between two fields ($z_1^k$ and $z_2^k$) and hence can be computed efficiently using FFT. Since for the reference image everything can be pre-computed the approximate mutual information $I_d(f)$ can be computed with $2d+1$ real FFT computations: two per term $\cos(2k \, \alpha)$ for $k=1 \ldots d$ to compute its spectrum and one reverse transform to compute the cross-correlation from a linear combination of all these spectra.

As a bonus this method can be used without any modification to locate only a part of the reference image $f_1$ in $f_2$: Simply mask out the part of $f_1$ to be ignored by setting its normalized gradient field $z_1$ to zero there. This works because the fields are normalized and are not affected by local variations in gain and offset when the masked reference image is matched with $f_2$. (In contrast, such variations make this simple masking scheme useless for an ordinary cross-correlatin approach.)

A final remark about how, as mentioned in the beginning of this post, error terms can be introduced in mutual information to get rid of the $\log$ singularity at $\alpha = 0$. If the Jacobian appearing in the definition of mutual information is distorted by adding a Gaussian error term then mutual information is changed to

$\displaystyle{I(t; f) = \int_{[0,1]^2} -\tfrac12 \log(\sin^2(\alpha) + t^2 \cos^2(\alpha)) \, \mathrm{d}V}$

where $t \in [0, 1]$ is an error ratio. Note that for $t=0$ we regain the original mutual information while for the other extreme $t=1$ we have $I(1; f) = 0$. Note that for $t > 0$ the singularity at $\alpha=0$ is resolved. For those who want to explore further in this direction I leave you with the remark that

$\displaystyle{\int_{-\pi/2}^{\pi/2} -\tfrac12 \log(\sin^2(\alpha) + t^2 \cos^2(\alpha)) \, \mathrm{d}\alpha = -\pi \log \left( \frac{t+1}2 \right)}$

which helps in treating the integrand as a probability density in the angle $\alpha$.

## Mutual information and gradients

In this post we will see a that mutual information between functions (e.g. two images) can be expressed in terms of their gradient fields. First some definitions and background. Let $f{: [0, 1]^n} \to [0, 1]^m$ be a differential function for some $n \geq m$. In this post $f$ typically represents an image or a pair of images with $n=2$ and $m \in \{1, 2\}$. Let $D$ be the Jacobian of $f$. This is the $m \times n$ matrix with

$\displaystyle{D_{ij}=\frac{\partial f_i}{\partial x_j}}$ for $1 \leq i \leq m$ and $1 \leq j \leq n$.

The entropy $H(f)$ of $f$ is defined by the integral

$H(f) = \displaystyle{\int_{[0,1]^n} \tfrac12 \log \lvert D D^{\mathrm{t}}\rvert \, \mathrm{d}V}$

where $\lvert \cdot \rvert$ is the determinant and $\mathrm{d}V$ the standard volume element. (In this post I will disregard any question of well-definedness of this integral.) To motivate  this definition: if $n=m$ and $f$ is injective then $H(f)$ is the usual differential entropy of the push forward $f_{\ast}(\mathrm{d}V)$ of the standard volume form.

If $m=1$ then the Jacobian $D = \nabla f$ equals the gradient of $f$ and $\lvert D D^{\mathrm{t}} \rvert = \lVert \nabla f \rVert^2$ so the entropy becomes

$\displaystyle{H(f) = \int_{[0,1]^n} \log \lVert \nabla f \rVert \, \mathrm{d}V}$.

Let $f = (f_1, \ldots, f_m)$ then the mutual information $I(f)$ of $f$ is defined by

$\displaystyle{I(f) = \sum_{k=1}^m H(f_k) - H(f)}$.

Mutual information is always non-negative. It expresses how much information is gained by knowing the joint value distribution of $f$  compared to knowing only the value distributions of the separate coordinates $f_1, \ldots, f_m$. In other words, mutual information is a measure of dependence between the coordinates: The higher the dependence the higher the mutual information while for independent coordinates the mutual information is $0$ (there is no information to be gained from their joint value distribution).

The nice thing about mutual information is that it is invariant under any injective coordinate-wise distortion. In imaging related terms it is for example invariant under changes of gamma, gain and offset of the image. This is hugely beneficial in practical imaging applications where lighting conditions are never the same. Different images (the coordinates) may even have been produced with completely different sensing equipment.

A key observation about mutual information is the following:

$\displaystyle{\lvert D D^{\mathrm{t}} \rvert = v \cdot \prod_{k=1}^m \lVert \nabla f_k \rVert^2}$

for some function $v$ with values in $[0, 1]$ that depends only on the direction of the gradients but not their length. Moreover $v=0$ if and only if the gradients are linearly dependent and $v=1$ if and only if they are mutually orthogonal. Using this decomposition mutual information can be expressed as

$\displaystyle{I(f) = \int_{[0,1]^n} -\tfrac12 \log(v) \, \mathrm{d}V}$.

This confirms that mutual information is non-negative since $v\in[0,1]$ and therefore $\log(v) \leq 0$. I will conclude this post by looking at the specific case of a pair of 2-dimensional images so the case that $n=m=2$. Then the function $v$ has a simple explicit form. Let $\alpha$ be the angle between the gradients $\nabla f_1$ and $\nabla f_2$. Then

$v = \sin^2(\alpha) = \frac{1-\cos(2\alpha)}{2}$.

There are two pleasant observations to make:

1. Mutual information of a pair of images depends only on the double angle between their gradients. In particular it does not depend on the length or a sign change of either gradient.
2. The expression $\cos(2\alpha)$ is easy to compute as an inner product. The double angle can be accounted for by a simple rational transformation of the gradient. This will be explained in more detail in a next post.

A next post will discuss the application of mutual information to image registration. It results in a method that is very efficient (based on FFT), is robust against image distortions and can also be applied to register (locate) a partial template image of any shape within a bigger scene.

## Projections of the Mahalanobis norm

In this post $\langle \cdot,\cdot \rangle$ is the standard inner product on $\mathbb{R}^n$ for some $n \geq 2$ and $A$ is a fixed Hermitian positive definite operator. The square Mahalanobis norm $m$ of a vector $v \in \mathbb{R}^n$ for this operator is defined by

$\displaystyle m(v) = \langle v, A^{-1}v \rangle$.

The results in this post were found while looking for ways to approximate this Mahalanobis norm without the need to invert $A$. (Later I realised that using the Cholesky factorisation suited me better. Nice results can be found by looking in the wrong places!) The idea is to use projections on some smaller dimensional subspace to get estimates of the actual Mahalanobis norm. To be precise let $V \subseteq \mathbb{R}^n$ be some subspace of dimension $1 \leq d \leq n$ and let $\pi_V$ be the orthogonal projection onto $V$. The operator $\pi_V A$ is non-singular on the subspace $V$. Let $A_V^{-1}:\mathbb{R}^n\to V$ be its pseudo inverse such that $A_V^{-1}A\pi_V = \pi_V A A_V^{-1} = \pi_V$. The projected Mahalanobis norm $m_V$ on $V$ is defined by

$\displaystyle m_V(v) = \langle A_V^{-1} v, v \rangle$.

Let’s take the one-dimensional case as an example. Let $v \in \mathbb{R}^n$ be non-zero and denote the span of $v$ by $\llbracket v \rrbracket$. Then the norm $m_{\llbracket v \rrbracket}$ is given by

$\displaystyle m_{\llbracket v \rrbracket}(v) = \frac{\langle v,v \rangle^2}{\langle A v, v \rangle}$.

Note that this expression does not involve the inverse of $A$. The basic property of the projected Mahalanobis norm is the following:

The inequality $m_V \leq m$ holds throughout $V$. Equality $m_V(v) = m(v)$ occurs if and only if $A_V^{-1}v = A^{-1}v$.

This property follows from the Cauchy-Schwarz inequality for the inner product $\langle \cdot, A^{-1} \cdot \rangle$:

$\displaystyle \langle A_V^{-1}v, v\rangle^2 = \langle A A_V^{-1}v, A^{-1} v\rangle^2 \leq \langle A A_V^{-1}v, A_V^{-1} v \rangle \langle v, A^{-1}v\rangle = \langle v, A_V^{-1} v \rangle \langle v, A^{-1}v\rangle$.

This is an equality if and only if $AA_V^{-1}v$ and $v$ are linearly dependent. Combined with $\pi_VAA_V^{-1}v = v$ it follows that in fact $AA_V^{-1}v = v$.

The following realisation came as a surprise. It shows that projections onto two-dimensional subspaces suffice to get an exact value for the Mahalanobis norm:

Let $v\in \mathbb{R}^n$ be a non-zero vector and let $V \subseteq \mathbb{R}^n$ be the span of $\{v, A^{-1}v \}$ (so $\dim(V) \leq 2$). Then $m_V(v) = m(v)$.

The projected norm for a two-dimensional subspace also has a simple explicit form. Let $w \in \mathbb{R}^n$ be a non-zero vector orthogonal to $v$ and let $V$ be the span of $\{v, w\}$. The norm $m_V$ is given by

$\displaystyle m_V(v) = \frac{\langle v, v \rangle^2}{\langle A v, v\rangle - \frac{\langle A v, w \rangle^2}{\langle A w, w\rangle}}$.

## The quantum harmonic oscillator in the plane

This post investigates some solutions of the (or rather a) quantum harmonic oscillator in the plane. Here the plane is $\mathbb{C}$ which will be identified with $\mathbb{R}^2$ through $z=x+\textrm{i} y$. Let $\partial_x$ and $\partial_y$ denote the partial derivative operators with respect to $x$ and $y$ respectively. Then the Laplace operator $\nabla^2$ is $\partial_x^2 + \partial_y^2$. The time independent form of the Schrödinger equation for the quantum harmonic oscillator considered here is

$\displaystyle (-\tfrac{1}{4} \nabla^2 + |z|^2) \varphi = E \varphi$

where $\varphi$ is a complex function and $E$ a real eigenvalue (related to energy in physical terms). The operator $-\tfrac{1}{4} \nabla^2 + |z|^2$ is called the Hamiltonian and will be denoted by $H$. Only radial solutions $\varphi$ will be considered. A complex function is called radial when it is of the form

$\displaystyle \varphi(z) = f(|z|^2) z^m$ or $\displaystyle \varphi(z) = f(|z|^2) \overline{z}^{\,m}$

for some real analytic function $f$ and some $m \in \mathbb{N}$. Radial functions of the first and second form are said to be of degree $m$ and $-m$ respectively. If $\varphi$ is a radial function of degree $m \in \mathbb{Z}$ and $|\omega| = 1$ then $\varphi(\omega z) = \omega^m \varphi(z)$.

Working over $\mathbb{C}$ it is convenient to rewrite the Laplace operator as follows. Let $\partial = \tfrac{1}{2}(\partial_x - \textrm{i}\partial_y)$ and $\overline{\partial} = \tfrac{1}{2}(\partial_x + \textrm{i} \partial_y)$. If $f$ is a complex differentiable function then $\overline{\partial} f = 0$ and $\partial f = f'$ while $\overline{\partial}\, \overline{f} = \overline{f'}$ and $\partial \overline{f} = 0$. (These are the Cauchy-Riemann equations.) The operators $\partial$ and $\overline{\partial}$ commute and are related to the Laplace operator by

$\displaystyle \partial \overline{\partial} = \overline{\partial} \partial = \tfrac{1}{4} \nabla^2.$

The Hamiltonian is therefore $-\partial \overline{\partial} + |z|^2$ and we will use it in this form. For a radial function $f(|z|^2) z^m$

$\displaystyle \partial \overline{\partial} f(|z|^2) z^m = \partial f'(|z|^2) z^{m+1} = |z|^2 f''(|z|^2) z^m + (m+1) f'(|z|^2) z^m$.

Applying the Hamiltonian results in

$H f(|z|^2) z^m = |z|^2 \left(f(|z|^2) - f''(|z|^2)\right) z^m - (m+1) f'(|z|^2) z^m$

and in particular if $f(x) = e^{-x}$ then $H e^{-|z|^2} z^m = (m+1) e^{-|z|^2} z^m$. So for each degree $m \geq 0$ the function $e^{-|z|^2} z^m$ is an eigenfunction of $H$ with eigenvalue $E = m + 1$. Complex conjugation shows that the same holds for $e^{-|z|^2} \overline{z}^{\,m}$ of degree $-m$. For each degree $m \in \mathbb{Z}$ we found a radial solution for the harmonic oscillator of degree $m$ and eigenvalue $|m| + 1$. These examples do not exhaust all such solutions however. Others can be found by a clever trick that was already known in this context by Schrödinger and Dirac.

This trick is known as the factorisation or algebraic method and the auxiliary operators that appear are called ladder operators or annihilation and creation operators. The following facts can be readily verified:

1. The operators $-\partial+\overline{z}$ and $\partial + \overline{z}$ lower the degree of a radial function by $1$.
2. The operators $-\overline{\partial} + z$ and $\overline{\partial} + z$ raise the degree of a radial function by $1$.
3. If $\varphi$ is a radial function of degree $m \in \mathbb{Z}$ then $(z \partial - \overline{z} \, \overline{\partial}) \varphi = m \varphi$ and

$\displaystyle (-\partial + \overline{z})(\overline{\partial} + z) \varphi = (H - 1 - m) \varphi$
$\displaystyle (\overline{\partial} + z)(-\partial + \overline{z}) \varphi = (H + 1 - m) \varphi$
$\displaystyle (-\overline{\partial} + z)(\partial + \overline{z}) \varphi = (H - 1 + m) \varphi$
$\displaystyle (\partial + \overline{z})(-\overline{\partial} + z) \varphi = (H + 1 + m) \varphi$

Combining these observations we find for a radial function $\varphi$ of degree $m$:

$\displaystyle \begin{matrix} H(-\partial + \overline{z})\varphi &=& ((-\partial + \overline{z})(\overline{\partial} + z) + m)(-\partial + \overline{z})\varphi\\ &=& (-\partial + \overline{z})((\overline{\partial} + z)(-\partial + \overline{z}) + m)\varphi\\ &=&(-\partial + \overline{z})(H + 1)\varphi. \end{matrix}$

In particular if $\varphi \neq 0$ is a solution with eigenvalue $E$ then $(-\partial + \overline{z})\varphi$ is a solution of degree $m-1$ and eigenvalue $E+1$. In this case $(\overline{\partial} + z)(-\partial + \overline{z})\varphi = (E + 1 - m) \varphi$ and so $(-\partial + \overline{z})\varphi \neq 0$ if $E \neq m - 1$. Under the assumption that $E \geq |m| + 1$ we can assemble the following table of solutions based on $\varphi$:

$\displaystyle \begin{matrix} \textrm{solution} & \textrm{degree} & \textrm{eigenvalue} & \textrm{remark}\\[1ex] (-\partial + \overline{z})\varphi & m - 1 & E + 1 & \neq 0\\ (-\overline{\partial} + z)\varphi & m + 1 & E + 1 & \neq 0\\ (\partial + \overline{z})\varphi & m - 1 & E - 1 & \neq 0 \textrm{ unless } E = 1 - m\\ (\overline{\partial} + z)\varphi & m + 1 & E - 1 & \neq 0 \textrm{ unless } E = 1 + m \end{matrix}$

The first two operators in this table are called raising or creation operators: they raise the eigenvalue or create energy. The last two are called lowering or annihilation operators for similar reasons. Starting from the solution $\varphi(z) = e^{-|z|^2}$ of degree $m=0$ and eigenvalue $E=1$ and repeatedly applying the operators $-\partial + \overline{z}$ and $-\overline{\partial} + z$ we find non-zero solutions for all pairs $(m, E)$ for which $m + E$ is odd and $E \geq |m|+1$. This process results in the solutions below up to scalar multiples. Solutions for negative degrees can be obtained by complex conjugation.

$\displaystyle \begin{matrix} m & E & \textrm{solution}\\[1ex] 0 & 1 & e^{-|z|^2} \\ 1 & 2 & z \, e^{-|z|^2}\\ 0 & 3 & (1 - 2|z|^2) \, e^{-|z|^2}\\ 2 & 3 & z^2 \, e^{-|z|^2}\\ 1 & 4 & z (1 - |z|^2) \, e^{-|z|^2}\\ 3 & 4 & z^3 \, e^{-|z|^2} \end{matrix}$

## Bounds for central binomial coefficients

Let $n$ be a positive integer. The binomial coefficient ${2n \choose n}$ is called a central binomial coefficient. Several bounds for these coefficients are known and the more advanced ones are commonly derived from Stirling’s formula for factorials or from Wallis’ product for $\pi$. This post presents an alternative and self-contained elementary proof of the bounds

$\displaystyle \frac{4^n}{\sqrt{(n+\frac{1}{2})\pi}} < {2n \choose n} < \frac{4^n}{\sqrt{n\pi}}$.

Define a function $f$ by

$\displaystyle f(x) = \cos(x)^n e^{n x^2/2}$

and take $x\in[0, \pi/2)$. From $f(x) \geq 0$ and $\tan(x)\geq x$ it follows that

$\displaystyle f'(x) = n f(x)\left(x - \tan(x)\right) \leq 0$.

So $f$ is non-increasing on this interval and therefore $f(x) \leq f(0)=1$ or

$\displaystyle \cos(x)^n \leq e^{-n x^2/2}$.

This inequality clearly also holds at the endpoint $x=\pi/2$ and since both sides are symmetric in $x$ it holds throughout the interval $[-\pi/2, \pi/2]$. (It suggests that $\cos(x)^n$ closely resembles a normal distribution on this interval.) Integration of the inequality above leads to

$\displaystyle \int_{-\pi/2}^{\pi/2}\cos(x)^n dx \leq \int_{-\pi/2}^{\pi/2}e^{-n x^2/2}dx < \int_{-\infty}^{\infty}e^{-n x^2/2}dx$.

Both the far left and right hand sides of these inequality can be computed explicitly. Starting with the right hand side let

$\displaystyle b_n=\int_{-\infty}^{\infty}e^{-n x^2/2}dx$.

Then

$\displaystyle b_n^2=\int_{-\infty}^{\infty}e^{-n x^2/2}dx \int_{-\infty}^{\infty}e^{-n y^2/2}dy=\iint_{-\infty}^{\infty}e^{-n (x^2+y^2)/2}dx dy$

$\displaystyle =2 \pi \int_0^{\infty} r e^{-n r^2/2}dr = \frac{2\pi}{n}$

And so $b_n=\sqrt{\frac{2\pi}{n}}$. To evaluate the left hand side let

$\displaystyle a_n=\int_{-\pi/2}^{\pi/2}\cos(x)^n dx.$

By explicit computation we find $a_0=\pi$ and $a_1=2$. The other values can be found by a recursive relation that follows from partial integration. For positive $n$ we have

$\displaystyle a_{n+1}=\int_{-\pi/2}^{\pi/2}\cos(x)^n \cos(x)dx=n\int_{-\pi/2}^{\pi/2}\cos(x)^{n-1} \sin(x)^2dx$

$\displaystyle =n\int_{-\pi/2}^{\pi/2}\cos(x)^{n-1}\left(1- \cos(x)^2\right)dx=n(a_{n-1}-a_{n+1})$

and therefore the recursion $a_{n+1}=\frac{n}{n+1}a_{n-1}$. Using this recursion one can check that the even and odd entries of the sequence $a$ are given respectively by

$\displaystyle a_{2n}=4^{-n}{2n \choose n}\pi$

$\displaystyle a_{2n+1}=\frac{2^{2n+1}}{{2n \choose n}(2n +1)}$.

Putting all results sofar together we find for positive even integers $2n$

$\displaystyle 4^{-n}{2n \choose n}\pi<\sqrt{\frac{2\pi}{2n}}$

$\displaystyle {2n \choose n}<\frac{4^n}{\sqrt{n \pi}}$

and for odd integers $2n+1$

$\displaystyle \frac{2^{2n+1}}{{2n \choose n}(2n+1)} < \sqrt{\frac{2\pi}{2n+1}}$

$\displaystyle \frac{4^n}{\sqrt{(n+\frac{1}{2})\pi}} < {2n \choose n}$.

## Fixed points of entire functions

This story has a long personal history. In my first attempted complex analysis exam I was asked if the entire function $e^z$ has any fixed points. I had no idea how to approach this question and I failed the exam. This first failure was soon overcome but that particular question was —much to my advantage— never addressed during my study again. Much later I found several ways to solve this question.

The image of $e^z-z$ is invariant under a shift $z \mapsto z + 2 \pi \textrm{i}$. The “Little Picard” theorem then asserts that this function maps onto $\mathbb{C}$ since it cannot omit just a single value and so it must have a root. Another way uses “Great Picard” and the fact that $z e^{-z}$ only has a single root so it must attain the value $1$ infinitely often. Both approaches seem nice enough but depend on non-trivial theorems (“Great Picard” more so than “Little Picard”) and there is no indication of the location of the fixed points. A more pedestrian approach shows that there must be infinitely many fixed points along the curve $|e^{2z}| = |z|^2$ but this is hardly in the spirit of complex analysis. This was the state of affairs until this week. Then I found a simple and much more satisfying answer.

The Banach fixed point theorem asserts that every contraction of a complete metric space has a single fixed point. I will use that theorem in the following setting.

Let $E$ be a non-empty closed convex subset of an open set $U$ and let $f{:U}\to E$ be holomorphic. If there exists a constant $k < 1$ such that $|f'| \leq k$ on $E$ then $f$ restricted to $E$ is a contraction and therefore $f$ has a single fixed point in $E$.

For $m \in \mathbb{Z}$ let $E_m$ denote the horizontal strip

$\displaystyle E_m = \{ z \mid |\textrm{Im}(z) - 2m \pi| \leq \pi \}$

and let $\log_m$ be the branch of the logarithm that maps the slit complex plane $\mathbb{C} \setminus (-\infty, 0]$ onto the interior of $E_m$. If $m \neq 0$ then

$\displaystyle |\log_m'| \leq \frac{1}{(2|m|-1)\pi} < 1$

on $E_m$ so the conditions of the theorem above apply and therefore $\log_m$ has a single fixed point $z_m \in E_m$ (which must lie in the interior.) Each $z_m$ is a fixed point of $e^z$. Unfortunately the same argument does not work for $E_0$ where $e^z$ has two more fixed points.

The fixed point method works equally well for a number of other functions. Here are two more examples:

Example 1. Let $m \in \mathbb{Z}$ and $m \neq 0$. The branch of $\arcsin$ that maps the (closed) upper half plane onto the half strip

$\displaystyle E_m = \{ z \mid \textrm{Im}(z) \geq 0 \textrm { and } |\textrm{Re}(z) - 2m\pi| \leq \frac{\pi}{2} \}$

is a contraction on $E_m$. So $\sin(z)$ has exactly one fixed point in each such strip (and therefore also exactly one in its complex conjugate).

Example 2. Let $m \in \mathbb{Z}$ and $m$ odd. The branch of $\arctan$ that maps the complement of the open unit disc into the strip

$\displaystyle E_m = \{ z \mid |\textrm{Re}(z) - \tfrac{1}{2}m \pi| \leq \frac{\pi}{4} \}$

is a contraction on $E_m$. So $\tan(z)$ has exactly one fixed point in each such strip.

These example for $e^z$, $\sin$ and $\tan$ work so well because their inverses have a nice derivative that is less than one except in a small bounded region. Put in another way these functions all satisfy a differential equation of the form $(f')^d = p(f)$ for some $d \in \{1,2\}$ and some polynomial $p$.

Posted in Uncategorized | 1 Comment

## Acceleration of alternating series

Let $f(n):\mathbb{N}\to\mathbb{R}$ be a positive decreasing function such that $\lim_{n\to\infty}f(n)=0$. Then the alternating series

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n)$

converges. Depending on the rate at which $f$ decays it may however take many terms to get a decent approximation of the sum. Examples of such slowly converging series are

$\displaystyle \log(2) = \sum_{n=0}^{\infty}(-1)^n\frac{1}{n+1}$

$\displaystyle \frac{\pi}{4} = \sum_{n=0}^{\infty}(-1)^n\frac{1}{2n+1}$

A well known method to accelerate the convergence of such alternating series uses the binomial transform of the function $f$ as follows. For $m \geq 0$ let $\Delta^m f$ be the $m$-th difference of $f$:

$\displaystyle (\Delta^m f)(n) = \sum_{k=0}^m (-1)^k{m \choose k} f(n+k)$

Then it turns out that

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n) = \sum_{n=0}^{\infty}\frac{(\Delta^n f)(0)}{2^{n+1}}$

and the right hand side may converge much faster. The binomial transform for the two examples above can be computed explicitly to obtain

$\displaystyle \log(2)=\sum_{n=0}^{\infty}\frac{1}{2^{n+1}(n+1)}$

$\displaystyle \frac{\pi}{4}=\sum_{n=0}^{\infty}\frac{n!}{2\cdot(2n+1)!!} = \frac{1}{2} + \frac{1}{2\cdot 3} + \frac{2}{2 \cdot 3 \cdot 5} + \frac{2 \cdot 3}{2 \cdot 3 \cdot 5 \cdot 7} + \ldots$

Both transformed series indeed converge much faster than the original ones. You may recognize the first as the series for $-\log(1-x)$ taken at $x=\tfrac{1}{2}$. The second one is harder to spot but it turns out to be the series of

$\displaystyle \frac{\arcsin(x)}{\sqrt{2-2x^2}}$

taken at $x=\tfrac{1}{2}\sqrt{2}$.

The binomial transform is a great tool if you can find an explicit expression for it such as in both examples above. This is however not a trivial transform in general. If you do not know an explicit expression for the binomial transform then it is also not very convenient to compute: $(\Delta^n f)(0)$ depends on the values $f(k)$ for all $k \leq n$.

There is however another much simpler method that only involves the repeated difference $\Delta^m f$ for a single fixed value of $m \geq 1$. This difference is now easy to compute (numerically) since $(\Delta^m f)(n)$ depends only on $m+1$ consecutive values of $f$. It is convenient for the sake of notation to extend $f$ to $\mathbb{Z}$ by setting $f(n)=0$ for all $n < 0$. Then $(\Delta^m f)(n) = 0$ for all $n < -m$.

Now the following equalities hold in which the second and third sums are taken over $\mathbb{Z}$:

$\displaystyle \sum_{n=0}^{\infty}(-1)^nf(n) = 2^{1-m}\sum_{n \textrm{ even}}(\Delta^m f)(n) = -2^{1-m} \sum_{n \textrm{ odd}} (\Delta^m f)(n)$

If all terms $(\Delta^m f)(n)$ are positive for $n \geq 0$ then partial sums increase for the second series and decrease for the third from $n=0$ onward. This means that these partial sums are lower and upper bounds for the series which is useful to estimate the remaining error for these partial sums.

As an example take $m=2$. The series for $\log(2)$ and $\tfrac{\pi}{4}$ now become

$\displaystyle \log(2) = \frac{1}{2} + \sum_{n=0}^{\infty}\frac{1}{(2n+1)(2n+2)(2n+3)} = \frac{3}{4} - \sum_{n=0}^{\infty}\frac{1}{(2n+2)(2n+3)(2n+4)}$

$\displaystyle \frac{\pi}{4} = \frac{1}{2} + \sum_{n=0}^{\infty}\frac{4}{(4n+1)(4n+3)(4n+5)} = \frac{5}{6} - \sum_{n=0}^{\infty}\frac{4}{(4n+3)(4n+5)(4n+7)}$