Incomplete Cholesky Decomposition of The Kernel Matrix¶
Cholesky decomposition of the kernel matrix is equivalent to the Gram-Schmidt orthogonalisation (QR) of the feature space matrix. The incomplete Cholesky decomposition can be used to obtain a low-rank approximation of the kernel matrix.
The incomplete Cholesky decomposition is implemented in the
template<class TKernel, typename T>IncCholesky class.
Gram-Schmidt orthonormalisation¶
Given a set of linearly independent vectors \(\{\mathbf{u}_i\}_{i=1}^{N}, \mathbf{u}_i \in \mathbb{R}^d\) of a finate dimensional inner product space \(V\), the GS process generates an orthonormal basis \(\{\mathbf{q}_i\}_{i=1}^{N},\mathbf{q}_i \in \mathbb{R}^d\) by orthogonalising each vector one by one to all the earlier vectors.
The first basis vector is just one of the vectors normalised to unit length as \(\mathbf{q}_1 := \mathbf{u}_1/\|\mathbf{u}_1\|\). The \(i\)-th basis vector is obtained by selecting \(\mathbf{u}_i\) i.e. one of the remaining \(N-i\) vectors, taking its (vector) projection to the orthogonal complement of the sub-space spanned by the \(\mathbf{q}_1,\dots,\mathbf{q}_{i-1}\) vectors, i.e. subtracting its (vector) projections to all the previously generated orthonormal vectors \(\mathbf{q}_1,\dots,\mathbf{q}_{i-1}\)
where \(V_{i-1}\) is the sub-space spanned by the \(\mathbf{q}_1,\dots, \mathbf{q}_{i-1}\) vectors, \(Q_{i-1} \in \mathbb{R}^{d\times i-1}\) matrix with \(\mathbf{q}_1,\dots,\mathbf{q}_{i-1}\) orthogonal vectors as its columns and \(I \in \mathbb{R}^{d \times d}\) identity matrix. For the final form of \(\mathbf{q}_i\) one needs to normalise this projection as
where \(\nu_i = \left\| \left[ \mathbf{I} - Q_{i-1}Q_{i-1}^T \right] \mathbf{u}_i \right\|\) [12].
According to Eq. (2), the input data vector \(\mathbf{u}_i\) can be written as
therefore the input data matrix \(X \in \mathbb{R}^{(N\times d)}\) with the \(\{\mathbf{u}_i\}_{i=1}^{N},\mathbf{x}_i \in \mathbb{R}^d\) vectors as rows can be written as
where the \(Q \in \mathbb{R}^{(d\times N)}\) matrix contains the \(\{\mathbf{q}_i\}_{i=1}^{N},\mathbf{q}_i \in \mathbb{R}^d\) orthonormal (i.e. \(\mathbf{q}_i^T\mathbf{q}_j = \delta_{ij}\)) basis vectors as columns and the \(i\)-th column of the \(R \in \mathbb{R}^{(N\times N)}\) upper triangular matrix \(R_{\cdot,i}\) contains the projections of the \(i\)-th input data vector \(\mathbf{u}_i\) onto these basis vectors (that is zero for \(j>i\)). One can interpret the \(R_{\cdot,i}, i=\{1,\dots,N\}\) columns of the matrix \(R\) as the representation of the \(\{\mathbf{u}_i\}_{i=1}^{N},\mathbf{x}_i \in \mathbb{R}^d\) input data vectors in the \(\{\mathbf{q}_i\}_{i=1}^{N},\mathbf{q}_i \in \mathbb{R}^d\) basis.
So the input data vectors processed one after the other generating the corresponding subsequent orthonormal basis vectors. When the input data vector \(\mathbf{u}_i\) processed is not linearly independent form the previously processed ones, the corresponding residual norm \(\nu_i\) (the length of the projection of the \(\mathbf{u}_i\) vector to the orthogonal complement of the space spanned by the \(\mathbf{q}_1,\dots,\mathbf{q}_{i-1}\) vectors) becomes zero (since \(\mathbf{u}_i\) can be expressed as \(\mathbf{u}_i=\sum_{j=1}^{i-1}[\mathbf{q}_j\mathbf{q}_j^T]\mathbf{u}_i\)).
In general, the residual norm \(\nu_i\) indicates how independent the corresponding input data \(\mathbf{u}_i\) from the previously processed ones. Changing the order, in which the input data is processed, by selecting the one with the largest residual norm to be processed at the next step (pivoting) and and eventually ignoring those with small residual norms (partial, incomplete) leads to incomplete and pivoted versions of the algorithm.
Cholesky decomposition of the kernel matrix¶
Let the \(\varphi(\cdot):\mathbb{R}^d\to\mathbb{R}^{n_h}\) is the mapping to the high(even infinite)-dimensional feature space and the \(\Phi \in \mathbb{R}^{(N\times n_h)}\) matrix is the feature space matrix with rows of the feature maps of the input data as
If \(\Phi^T_{(n_h\times N)}=Q_{(n_h\times N)}R_{(N\times N)}\) is the QR decomposition (Gram–Schmidt orthogonalization of the columns of \(\Phi^T\) i.e. the feature maps of the input data) of this feature space matrix, then the kernel matrix
where the fact that the matrix \(Q\) builds up from mutually orthonormal columns i.e. \(\mathbf{q}_i^T\mathbf{q}_j = \delta_{ij} \to QQ^T=I\) was used. Therefore, performing QR decomposition of the feature space \(\mathbf{ \{\varphi(\mathbf{u}_i)\}_{i=1}^N }\) is equivalent to the Cholesky decomposition of the corresponding kernel matrix \(\mathbf{ \Omega_{ij}=\varphi(\mathbf{u}_i)^T\varphi(\mathbf{u}_j), i,j={1,\dots,N} }\).
Computing the \(j,i\)-th element of the \(R \in \mathbb{R}^{(N\times N)}\) upper triangular matrix:
according to Eq. (3), computing the \(R_{\cdot,i}, i\)-th column of the \(R \in \mathbb{R}^{(N\times N)}\) upper triangular matrix involves the computation of the (scalar) projections of the feature map of the \(i\)-th input data \(\varphi(\mathbf{u}_i)\) onto all the previously generated \(\mathbf{q}_1,\dots,\mathbf{q}_{i-1}\) basis vectors:
\[\begin{split}R_{\cdot,i} = \begin{bmatrix} \mathbf{q}_1^T \varphi{(\mathbf{u}_i)} \\ \vdots \\ \mathbf{q}_{i-1}^T \varphi{(\mathbf{u}_i)} \\ \nu_i \\ \mathbf{0}_{N-i} \end{bmatrix}_{(N\times 1)}\end{split}\]furthermore, these basis vectors \(\mathbf{q}_j, j=1,\dots,i-1\) can be expressed as Eqs. (2) and (1) \(\nu_j\mathbf{q}_j = \varphi{(\mathbf{u}_j)} - \sum_{t=1}^{j-1} [\mathbf{q}_t\mathbf{q}_t^T] \varphi{(\mathbf{u}_j)}\)
The \(j,i\)-th element of the \(R, j<i\), i.e. the \(j<i\)-th elements the column vector written in the first point above and be expressed by using the second point as
where \(j<i\). Since \(\nu_j\) is given at Eq. (2)
where the mutually orthonormal property \(\mathbf{q}_k^T\mathbf{q}_l = \delta_{kl}\) of the basis vectors \(\mathbf{q}_k,\dots,\mathbf{q}_{j-1}\) was used to obtain the 4-th equation.
(Also note, that according to \(\Omega \approx R^TR\), actually \(\sum_{t=1}^{j-1} R_{tj}^2\) is the approximation of the \(\Omega_{jj}, j=1,\dots,N\) diagonal elements at the \(j-1\)-th step of the algorithm.)
The algorithm¶
Given the \(\{\mathbf{x}_i\}_{i=1}^{N},\mathbf{x}_i \in \mathbb{R}^d\) input data set with the kernel function \(K(\mathbf{x_i},\mathbf{x_j})=\varphi(\mathbf{x})_i^T\varphi(\mathbf{x}_j)\) with \(\varphi(\cdot):\mathbb{R}^d\to\mathbb{R}^{n_h}\) mapping to the feature space. One can formulate the algorithm for the incomplete Cholesky decomposition of the kernel matrix \(\Omega \approx \tilde{\Omega} = G^TG\) with \(\Omega \in \mathbb{R}^{(N\times N)}, \Omega_{ij} = K(\mathbf{x_i},\mathbf{x_j})\) symmetric, positive semi-definite matrix and \(G \in \mathbb{R}^{(R\times N)} , R \leq N\) upper triangular matrix by combining Eqs. (4) and (5). The following algorithm generates the rows of \(G\) one by one as:
Initialise the squared diagonal elements of the \(\tilde{\Omega} = G^TG\) as \(\nu_k^2 = \Omega_{kk} = K(\mathbf{x}_k,\mathbf{x}_k), k=1,\dots,N\):
these elements will be updated after the computation of the \(j\)-th row of the \(G\), \(G_{j\cdot}\) as
\[\nu_k^2 = \nu_k^2 - G_{jk}^2, k=j+1,\dots,N\]according to Eq. (5).
the \(\nu_k^2\) value is the squared residual norm of the feature map of the \(k\)-th input data i.e. \(\varphi(\mathbf{x}_k)\). These will be used to:
the maximum of these will be used to select the next, \(j+1\)-th input data (more exactly its feature map) to orthogonalise (to all the previously generated basis vectors). This will greedily select input data to minimise the residual norm i.e. the projection of the feature maps of the input data onto the orthogonal complement of the currently generated sub-space spanned by the underlying basis vectors.
these will be used to compute the approximation error after the \(j\)-th step as
\[\begin{split}\|\Omega-\tilde{\Omega}\|_1 & = \|\Omega-G^TG\|_1 = \text{tr}(\Omega) - \text{tr}(G^TG) = \text{tr}(\Omega) - \text{tr}(G^TG) \\ & = \text{tr}(\Omega) - \sum_{k=1}^{N} \sum_{t=1}^{j} G_{tk}^2 = \sum_{k=1}^{N}\Omega_{kk} - \sum_{k=1}^{N} \sum_{t=1}^{j} G_{tk}^2 \\ & = \sum_{k=1}^{N} \left[ \Omega_{kk} - \sum_{t=1}^{j} G_{tk}^2 \right] = \sum_{k=1}^{N} \nu_k^2\end{split}\]which will be normalised as \(\eta := \|\Omega-\tilde{\Omega}\|_1/N = \left[ \sum_{k=1}^{N} \nu_k^2 \right]/N = \left[ \sum_{k=j+1}^{N} \nu_k^2 \right]/N \in [0,1]\) where the last equality took into account that \(\Omega_{kk}=\tilde{\Omega}_{kk}, \text{ for } k \leq j\).
these residual norms will give the diagonal elements of the \(G\) matrix (see below)
The first row, i.e. \(j=1\) of \(G\):
set \(G_{jj} = \sqrt{\nu_j^2}\)
then for \(i=j+1,\dots,N\):
set \(G_{ji} = \Omega_{ji} = K(\mathbf{x}_j,\mathbf{x}_i)\)
update the residual norm as \(\nu_i^2 = \nu_i^2 - G_{ji}^2\)
while performing the above two sub-steps for all \(i=j+1,\dots,N\):
find the next pivot i.e. the index of the next input data feature map to orthogonalise i.e. find \(i\) with the maximum residual norm \(\texttt{pivot}=\arg\max_i(\nu_i^2)\)
compute the sum of the squared residual norms for the approximation error computation which is \(\eta = \left[ \sum_i \nu_i^2 \right]/N\) at the end of this \(j\)-th step
increase the index j of the generated rows of the matrix \(G\) to j+1
The \(1<j\leq N\)-th row of \(G\):
swap the:
\(j\)-th column of the current \(G\) matrix with the \(\texttt{pivot}\)-th column, i.e. the input data index selected during the previous step
\(j\)-th element of the squared residual norms with the \(\texttt{pivot}\)-th element i.e. \(\nu_{\texttt{pivot}}^2 \text{ and } \nu_j^2\)
set \(G_{jj} = \sqrt{\nu_j^2}\)
then for \(i=j+1,\dots,N\):
set \(G_{ji} = \Omega_{ji} - \sum_{t=1}^{j-1} G_{tj}G_{ti}\)
update the residual norm as \(\nu_i^2 = \nu_i^2 - G_{ji}^2\)
all steps are the same as in case of 1.
Repeat step 2. as long as \(\eta > \epsilon\) and \(j < \texttt{itr}_{\text{max}}\) where \(\epsilon\) is the required approximation error. The \(\texttt{itr}_{\text{max}}\) is the maximum iteration number which is the maximum number of rows of \(G\) to be generated. This is equal to the rank of the approximated kernel matrix \(\tilde{\Omega} = G^TG\) which is equal to the rank of the projected input data feature map.
Notes on the approximation error¶
Based on some things form [7].
As it has already been discussed, the incomplete Cholesky decomposition selects the data vector to orthogonalise in the next step (pivot) based on the norm of their projections to the orthogonal complement of the actual sub-space spanned by the actual set of bias vectors. This will remove the highest residual norm at each step by including the corresponding data vector into the orthogonalisation.
In turn the individual residual norms (square) after completing the \(k\)-th step (i.e. the norm of the projection of the individual data vectors to the orthogonal complement of the \(k\) basis vectors), is equal to the difference between the corresponding real and approximated diagonal elements of the kernel matrix \(\nu_j^2=\Omega_{jj} - \tilde{\Omega}_{jj} = \varphi(\mathbf{x}_j)^T\varphi(\mathbf{x}_j) - \sum_{t=1}^{k} G_{tj}^{(k)2})\) Therefore, by selecting the data vector for the \(k+1\)-th step to orthogonalise with the highest residual norm i.e. removing the highest residual norm in the next step can be interpreted as the minimisation of \(\texttt{trace}\{\Omega-\tilde{\Omega}\}\) by removing the highest individual contribution at each step.
Let the \(\|A\|_1\) denote the sum of the singular values of the matrix \(A\). If \(A\) is a square, symmetric matrix then its singular values are its eigenvalues. Furthermore, the sum of the eigenvalues of this matrix is equal to its trace. The kernel matrix \(\Omega\) is a Gram matrix i.e. square, symmetric, positive definite matrix i.e. \(\Omega \succeq 0\). Moreover, \(\Omega - \tilde{\Omega}^{(k)} \succeq 0\) since \(\tilde{\Omega}^{(k)} = G^{(k)T}G \preceq \Omega\) for each \(k\) steps (\(\succeq \text{ means that } \mathbf{x}^TA\mathbf{x} \geq 0, \forall \mathbf{x}\)).
Note
Therefore, \(\|\Omega-\tilde{\Omega}^{(k)}\|_1= \sum_{i} \lambda_i = \texttt{trace}\{ \Omega-\tilde{\Omega}^{(k)} \}\) So minimising the trace of \(\|\Omega-\tilde{\Omega} \|\) is maximising \(\texttt{trace}\{ \tilde{\Omega} \} = \sum_j \lambda_j\). Good feeling that \(\texttt{trace}\{ \tilde{\Omega} \} \leq \texttt{trace}\{ \Omega \}\) this can be seen from either that the diagonals of the approximation approaching the real one or from the fact that all the kernel, approximated kernel and their difference are positive semi-definite i.e. non negative eigenvalues and tehir sums…) since \(\tilde{\Omega}^{(k)} = G^{(k)T}G \preceq \Omega\) the eigenvalues of \(\mathbf{x}^T \left[ \Omega - \tilde{\Omega} \right] \mathbf{x} = \mathbf{x}^T \Omega \mathbf{x} - \mathbf{x}^T \tilde{\Omega}\mathbf{x} = \lambda \mathbf{x}\))
So back to the original, the goal of minimising the \(\| \Omega-\tilde{\Omega}^{(k)} \|_1\) which is the sum of the eigenvalues. After the \(k\)-th step, \(\texttt{trace} \{ \tilde{\Omega}^{(k)} \} = \sum_{i=1}^{N} \sum_{t=1}^{k} G_{ti}^{(k)^2}\) The gain in the \(k+1\)-th step to this is \(\sum_{i=1}^{N} G_{k+1,i}^{(k+1)^2}\) i.e. the squared sum of the last added, \(k+1\)-th row of the \(G\) matrix. According to Eqs. (4) and (5) these elements can be computed as (in the \(k+1\)-th step by selecting the \(p\)-th data vector out of the \(N-k\) remaining)
So we should chose \(p\) after the \(k\)-th step out of the \(N-k\) remaining such that the above sum is maximal. We should compute the sum for all possible values of \(p\) that would not be feasible since it would include the computation of all possible \(k+1\)-th row of the matrix \(G\) (at each step). Instead, a lower bound of this sum i.e. \(\nu_p^2\) is used. Since these diagonals are known after the completing the \(k\)-th step, one can chose the pivot \(p\) such that \(p = \arg \max_p{\nu_p^2}\) (and hope that it also will lead to a maximum of the above sum, but no guarantee). One could also chose more than one \(p\) say \(\kappa\) that are the \(\kappa\) maximal \(\nu_{p_{1,\dots,\kappa}}^2\). Then compute the sum for all and chose the one that maximal. If \(\kappa\) is low then it might help.
Summary and remarks¶
As it was shown, performing Cholesky decomposition of the kernel matrix is equivalent to performing QR decomposition of the feature map matrix. During this process, the feature map of the input data set will be expressed in a new orthonormal basis. The set of these basis vectors is greedily generated by choosing the feature map at each step that has the maximum residual norm: the residual norm is the projection to the orthogonal complement of the sub-space spanned by the current set of basis vectors. The higher this residual norm the more independent from the corresponding feature map form those processed previously:
it becomes zero: if the corresponding input data feature map can be expressed as linear combination of the previous i.e. this vectors lies into the sub-space spanned by the current set of basis vectors (sum of the projections to the current set of basis vectors already equals to the vector)
it’s equal to the norm of the vector itself: this vector is linearly independent from all the previous i.e. this vector lies entirely into the orthogonal complement of the sub-space spanned by the current set of basis vectors (sum of the projections to the current set of basis vectors is zero i.e. rthogonal to all of these basis vectors)
The rank of the feature map of the input data is equal to the rank of the kernel matrix. Although the kernel matrix might have even full rank, its spectrum decays rapidly in case of many kernels and can be well approximated by low rank (\(\ll \texttt{full rank}\)) matrices [13] [16] [6]. This happens when the data lives “near” a low-dimensional subspace in feature space which means that even with a low-number (low-dimensional) of basis vectors a small sum residual norm (“near”) can be achieved (the sum of the vector projections to these basis vectors already approximate very well the feature map vectors). Incomplete Cholesky decomposition can be used in such cases to obtain a low-rank approximation of the kernel matrix that can help e.g. to reduce the size of the eigenvalue problem in which the full kernel matrix is involved or deal with the full kernel matrix that would not fit into memory through an appropriate approximation.
Note, that the \(G_{\cdot i} \in \mathbb{R}^{R}, i=1,\dots,N\) columns of the \(G \in \mathbb{R}^{(R\times N)}\) matrix are the projections of the feature map of the input data set \(\left\{ \varphi(\mathbf{x}_i) \right\}_{i=1}^N\) onto the low (\(R\)) dimensional sub-space spanned by the \(R\) orthonormal basis vectors. Moreover, this sub-space is obtained such that the residual norm of the feature maps (i.e. the norm of the projection to the orthogonal complement of the sub-space) is minimised (at least its maximum value is reduced to zero at each step). Therefore, one can view the columns of \(G\) as a low dimensional representation of the input data feature map.
Note
One could see similarities here with:
KPCA: that will select directions to project the feature map of the input data such that the variance of the projection is maximised. The optimal solution to this maximisation problem is given by the leading eigenvectors of the kernel matrix i.e. using the eigenvectors as directions that corresponds to the highest eigenvalues.
as it was shown in [9], by selecting the directions to project as the largest sum of the eigenvalue and eigenvector pairs one can obtain a new set of data that will maximise the quadratic Renyi entropy (see also in [10])
one could keep in mind here that incomplete Cholesky is a kind of trace maximisation of the approximated \(\tilde{\Omega} = G^TG\) kernel matrix (not a global trace maximisation because we use just a lower bound…!). And keeping mind that the trace is the sum of the eigenvalues so it’s related somehow to the PCA ??? and related somehow to the Renyi entropy maximisatio (but in sub-set selection i.e. sum) ???. Or not. Also note, in [10] section 3.1. somethind interestuon.
??? Maximising the Reny entropy of a Kernel matrix (or a sub- as in sub-set selection) means computing the sum of the elements (which are projections to each other).
- ?? Also, note, \(\|A\|_F=\sqrt{ \texttt{trace} \{ A^TA\} }\) so
\(\| G \|_F=\sqrt{ \texttt{trace} \{ G^TG \} }\) so by maximising the \(\texttt{trace} \{ \tilde{Omega} \} = \texttt{trace} \{ G^TG \}\) one maximising the Frobenius norm of \(G\) which is the sum of the squared elements of \(G\).