Linear autoencoder with dropout

We begin with a brief review of least squares fitting formulated in autoencoder language. Let $x$ be a random variable in $\mathbb{R}^n$ such that $\mathbb{E}(x) = 0$ and let $X = \mathbb{E}(x x^{\ast})$ be its covariance matrix. Then $X$ is a self-adjoint (symmetric) operator. Let $d$ be a positive integer not greater than $n$. A linear autoencoder for $x$ is a pair of linear operators $R{:\mathbb{R}^n \to \mathbb{R}^d}$ and $C{: \mathbb{R}^d \to \mathbb{R}^n}$ such that

1. The operator $C$ is an isometry: $C^{\ast}C = 1$ on $\mathbb{R}^d$.
2. The expected square error $\mathbb{E} \lVert x - CRx \rVert^2$ is minimal among pairs $(C, R)$.

The second requirement is interpreted as “$R$ reduces the dimension of the variable $x$ from $n$ to $d$ with a minimal loss of information”. The function
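As a concrete setup, here is a minimal sketch assuming NumPy and a synthetic Gaussian sample standing in for the distribution of $x$; the sample size and dimensions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 10_000

# Draw m observations of a zero-mean variable x in R^n with some
# correlation structure (A is an arbitrary mixing matrix).
A = rng.normal(size=(n, n))
samples = rng.normal(size=(m, n)) @ A.T   # rows are observations x_k
samples -= samples.mean(axis=0)           # enforce E(x) = 0 empirically

# Empirical covariance X = E(x x*), a symmetric (self-adjoint) operator.
X = samples.T @ samples / m
```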

$\displaystyle (C, R) \mapsto \mathbb{E} \lVert x - CRx \rVert^2$

has a saddle point exactly when

1. The image of $C$ in $\mathbb{R}^n$ is invariant under $X$.
2. $R = C^{\ast}$.

Note that $CR = C C^{\ast}$ is then an orthogonal projection. Among these saddle points the expected error is minimal when the image of $C$ is a direct sum of eigenspaces of $X$ with the largest possible eigenvalues.
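This optimal pair can be read off from an eigendecomposition of $X$. A sketch assuming NumPy (the helper name `optimal_autoencoder` is ours); it checks that $CC^{\ast}$ is an orthogonal projection and that the minimal expected error is the sum of the discarded eigenvalues:

```python
import numpy as np

def optimal_autoencoder(X, d):
    """Return (C, R) minimising E||x - CRx||^2 for covariance X."""
    # eigh returns eigenvalues in ascending order for a symmetric X.
    eigvals, eigvecs = np.linalg.eigh(X)
    C = eigvecs[:, -d:]   # d eigenvectors with the largest eigenvalues
    R = C.T               # at the optimum, R = C*
    return C, R

# Example with a diagonal covariance, so the eigenvalues are visible.
X = np.diag([4.0, 3.0, 1.0, 0.5])
C, R = optimal_autoencoder(X, 2)

P = C @ R   # CC*, an orthogonal projection: P^2 = P = P*
# E||x - CRx||^2 = tr(X) - tr(PX) = sum of the discarded eigenvalues.
err = np.trace(X) - np.trace(P @ X)
```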

In practical situations the variable $x$ is a finite sample of observations $x_1, x_2, \dots, x_m$, each occurring with equal probability. A least squares fit on such a sample risks being oversensitive to features that are apparent only in this specific sample; in other words, it may be sensitive to outliers. One way to reduce this sensitivity is to introduce dropout.

A standard form of dropout is the following. Let $b = b(p) \in \{0, 1\}$ be a Bernoulli random variable with expectation $p > 0$ and let $B {: \mathbb{R}^n \to \mathbb{R}^d}$ be a random operator whose coefficients $B_{ij} = e_i^{\ast} B e_j$ are mutually independent copies of $b$. This matrix $B$ is also taken to be independent of $x$. In dropout the operator $R$ is replaced by the Hadamard (entrywise) product $R \circ B$, so that each coefficient of $R$ can “drop out” independently with probability $q = 1-p$. A linear autoencoder with dropout is a pair $(C, R)$ of operators as above, but which now minimises the altered expected error

$\displaystyle \mathbb{E}\left \lVert x - C (R \circ B) x \right \rVert^2$.
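A minimal sketch of this masking, assuming NumPy; it also checks empirically the elementary fact that $\mathbb{E}(R \circ B) = pR$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 4, 2, 0.8
R = rng.normal(size=(d, n))

# One dropout realisation: Hadamard product of R with a Bernoulli mask B.
B = rng.binomial(1, p, size=(d, n))
R_dropped = R * B   # each coefficient of R survives with probability p

# Averaging over many independent masks recovers E(R o B) = p R.
masks = rng.binomial(1, p, size=(100_000, d, n))
mean_dropped = (R * masks).mean(axis=0)
```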

Here the expectation is over the joint distribution of the independent pair $(B, x)$. The idea is that $R$ must now be robust against random dropout, which prevents it from being oversensitive to accidental features in $x$. In this dropout case too, the saddle points of the function

$\displaystyle (C, R) \mapsto \mathbb{E}\left \lVert x - C (R \circ B) x \right \rVert^2$

can be described explicitly. Let $\mathrm{diag}(X)$ denote the diagonal operator with the same diagonal entries as $X$. The pair $(C, R)$ is a saddle point if

1. The image of $C$ is invariant under $X \left( p X + q \, \mathrm{diag}(X) \right)^{-1} X$.
2. $R = C^{\ast} X \left(p X + q \, \mathrm{diag}(X)\right)^{-1}$.

Indeed, for $p = 1$ (the dropout probability $q$ is zero) this reduces to the criterion above for a linear autoencoder without dropout: the image of $C$ is invariant under $X$ and $R = C^{\ast}$.
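A small numerical check of this reduction, assuming NumPy (the helper name `dropout_saddle_R` is ours):

```python
import numpy as np

def dropout_saddle_R(C, X, p):
    """Saddle-point decoder R = C* X (p X + q diag(X))^{-1}."""
    q = 1.0 - p
    M = p * X + q * np.diag(np.diag(X))
    return C.T @ X @ np.linalg.inv(M)

# A generic, well-conditioned, non-diagonal covariance matrix X.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
X = A @ A.T + 0.5 * np.eye(4)

# Take C spanning the top-2 eigenspace of X.
eigvals, eigvecs = np.linalg.eigh(X)
C = eigvecs[:, -2:]

# With p = 1 the formula collapses to R = C* X X^{-1} = C*.
R_no_dropout = dropout_saddle_R(C, X, 1.0)

# With dropout (p < 1) the optimal decoder is no longer simply C*.
R_dropout = dropout_saddle_R(C, X, 0.8)
```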