## Least square fitting of vector-valued random variables

Let $(x, y) \in \mathbb{R}^n \times \mathbb{R}^m$ be jointly distributed random variables (typically not independent). This post describes which linear map $\sigma : \mathbb{R}^n \to \mathbb{R}^m$ relates these variables best in the sense that the expected square error $\mathbb{E} \lVert \sigma(x) - y \rVert^2$ is minimised. Related applications such as principal component analysis and autoencoding under random distortion (such as dropout) are recovered as special cases.

Vectors will be interpreted as column vectors and transposition is denoted by a superscript asterisk. It is assumed that $X = \mathbb{E}(x x^{\ast})$ is non-singular (and therefore positive definite). Define $Y = \mathbb{E}(y x^{\ast})$. Then $Y X^{-1} Y^{\ast}$ is positive semi-definite on $\mathbb{R}^m$. Let $v_1, \ldots, v_m \in \mathbb{R}^m$ be orthogonal eigenvectors of $Y X^{-1} Y^{\ast}$ in order of decreasing eigenvalue and let $\pi_k$ be the orthogonal projection of $\mathbb{R}^m$ onto the span of $v_1, \ldots, v_k$ for $k \in \{1, \ldots, m\}$. (Note that this definition leaves some choice when not all eigenvalues are distinct, since the projections are not unique in that case.) Now the main result states:

For each $k \in \{1, \ldots, m\}$ the linear map $\sigma_k = \pi_k Y X^{-1}$ minimises the expected square error $\mathbb{E} \lVert \sigma_k(x) - y \rVert^2$ among all linear maps of rank at most $k$.
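This result can be checked numerically. The sketch below (with a hypothetical joint distribution where $y$ is a noisy linear function of $x$) estimates $X$ and $Y$ from a sample, builds $\sigma_k$ as above, and confirms that $\sigma_1$ achieves a smaller empirical square error than randomly chosen rank-one maps. Since the empirical distribution has exactly the estimated moments, the result applies to the empirical error on the same sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 20000

# Hypothetical joint distribution: y depends linearly on x plus noise.
B = rng.normal(size=(m, n))
x = rng.normal(size=(N, n))
y = x @ B.T + 0.5 * rng.normal(size=(N, m))

# Empirical second moments X = E[x x*] and Y = E[y x*].
X = x.T @ x / N
Y = y.T @ x / N

# Orthonormal eigenvectors of Y X^{-1} Y* in order of decreasing eigenvalue.
M = Y @ np.linalg.solve(X, Y.T)      # symmetric positive semi-definite
w, V = np.linalg.eigh(M)             # eigh returns ascending order
V = V[:, ::-1]

def sigma(k):
    pi_k = V[:, :k] @ V[:, :k].T     # orthogonal projection onto top-k span
    return pi_k @ Y @ np.linalg.inv(X)

def mse(S):
    return np.mean(np.sum((x @ S.T - y) ** 2, axis=1))

# sigma_1 should beat any other rank-one linear map on the same sample.
err1 = mse(sigma(1))
for _ in range(100):
    u = rng.normal(size=(m, 1))
    v = rng.normal(size=(1, n))
    assert err1 <= mse(u @ v) + 1e-9
```

For $k = m$ the projection $\pi_m$ is the identity, so $\sigma_m = Y X^{-1}$, the usual unconstrained least-squares solution.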

Let’s apply this result to two special cases. For the first case we simply assume that $n = m$, $x = y$ and $\mathbb{E}(x) = 0$. In this case $Y = X$ and $Y X^{-1} Y^{\ast} = X$. So the projections $\pi_k$ project onto the eigenspaces of the covariance matrix $X$ and $\sigma_k = \pi_k Y X^{-1} = \pi_k$. This result coincides with principal component analysis for the variable $x$.
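The correspondence with PCA can be made concrete. In the sketch below (with a hypothetical zero-mean sample), $\sigma_k$ collapses to $\pi_k$, and $\pi_k$ agrees with the rank-$k$ PCA projection obtained from the singular value decomposition of the centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50000, 4

# Hypothetical zero-mean sample with anisotropic covariance.
A = rng.normal(size=(n, n))
x = rng.normal(size=(N, n)) @ A.T
x -= x.mean(axis=0)

X = x.T @ x / N                       # here X = Y = E[x x*]
w, V = np.linalg.eigh(X)
V = V[:, ::-1]                        # decreasing eigenvalue order

k = 2
pi_k = V[:, :k] @ V[:, :k].T
sigma_k = pi_k @ X @ np.linalg.inv(X)  # = pi_k, as claimed
assert np.allclose(sigma_k, pi_k)

# Cross-check: pi_k equals the rank-k PCA projection from the SVD of x.
U, S, Vt = np.linalg.svd(x, full_matrices=False)
assert np.allclose(pi_k, Vt[:k].T @ Vt[:k], atol=1e-6)
```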

For the second case assume that $n = m$ and $x = A y$ for some random diagonal matrix $A$ whose diagonal entries $A_{ii}$ are independent (of each other and of $y$) and Bernoulli distributed with success probability $p > 0$. The matrix $A$ models dropout in the coefficients of $y$. Let $\Sigma = \mathbb{E}(y y^{\ast})$ be the second-moment matrix of $y$ (its covariance matrix when $\mathbb{E}(y) = 0$), $D$ the diagonal of $\Sigma$, and $q = 1-p$. In this case

$X = p \left( p \Sigma + q D \right)$ and $Y = p \Sigma$.
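These formulas follow from the entrywise expectation $\mathbb{E}(A \Sigma A)_{ij} = p^2 \Sigma_{ij}$ for $i \neq j$ and $p \Sigma_{ii}$ on the diagonal, since $A_{ii}^2 = A_{ii}$. A quick numerical check of this identity, for a hypothetical positive definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 4, 0.7
q = 1 - p

# Hypothetical positive definite Sigma and its diagonal part D.
G = rng.normal(size=(m, m))
Sigma = G @ G.T + m * np.eye(m)
D = np.diag(np.diag(Sigma))

# E[A Sigma A] has entries p^2 Sigma_ij off the diagonal (independent
# Bernoulli factors) and p Sigma_ii on the diagonal (A_ii^2 = A_ii).
X = p**2 * (Sigma - D) + p * D
assert np.allclose(X, p * (p * Sigma + q * D))

# E[y x*] = E[y y* A] = Sigma E[A] = p Sigma, by independence of A and y.
Y = p * Sigma
```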

Now $Y X^{-1} Y^{\ast} = p \, \Sigma \left(p \Sigma + q D \right)^{-1} \Sigma$, and since the positive factor $p$ changes neither the eigenspaces nor their ordering, the $\pi_k$ are projections onto eigenspaces of

$\Sigma \left(p \Sigma + q D \right)^{-1} \Sigma$,

which is the positive semi-definite matrix that appeared in the previous post about linear encoders with dropout. Finally, in this case $\sigma_k = \pi_k Y X^{-1} = \pi_k \Sigma \left(p \Sigma + q D \right)^{-1}$.
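Both reductions can be verified directly from the formulas for $X$ and $Y$, again with a hypothetical positive definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 4, 0.7
q = 1 - p
G = rng.normal(size=(m, m))
Sigma = G @ G.T + m * np.eye(m)      # hypothetical second-moment matrix of y
D = np.diag(np.diag(Sigma))

X = p * (p * Sigma + q * D)
Y = p * Sigma

# Y X^{-1} Y* equals p times Sigma (p Sigma + q D)^{-1} Sigma, so both
# matrices share eigenvectors and eigenvalue ordering.
M = Y @ np.linalg.solve(X, Y.T)
K = Sigma @ np.linalg.solve(p * Sigma + q * D, Sigma)
assert np.allclose(M, p * K)

# The factor p also cancels in sigma_k = pi_k Y X^{-1}.
k = 2
w, V = np.linalg.eigh(M)
V = V[:, ::-1]
pi_k = V[:, :k] @ V[:, :k].T
sigma_k = pi_k @ Y @ np.linalg.inv(X)
assert np.allclose(sigma_k, pi_k @ Sigma @ np.linalg.inv(p * Sigma + q * D))
```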