KSC code documentation¶

An introductory description…then doxygen inputs. Here is how we can link one of the classes below: template<class TKernel, typename T>KscEncodingAndQM bla lab..

Cluster membership encoding and quality measure (QM)¶

Here is how we can cite one of the details documents Incomplete Cholesky Decomposition of The Kernel Matrix

template<typename T> class KscEncodingAndQM¶

Base class for different KSC cluster membership encoding and quality measure implementations.

This base class provides interfaces for implementing different cluster membership encodings, the cluster assignment that depends on the encoding, and the model evaluation criterion used for model selection.

cluster membership encoding: each cluster is represented by a special vector that depends on the selected cluster encoding scheme. These cluster representations are generated from a training data set by calling the KscEncodingAndQM<T>::GenerateCodeBook interface method. Each derived class can implement its own algorithm to generate these vectors depending on the corresponding encoding scheme.

Since all implemented cluster membership encoding schemes make use of the sign based encoding (of the training score data) when generating their own encodings, this is implemented in the KscEncodingAndQM<T>::GenerateCodeBook method of this base class.
cluster assignment: when assigning a data point to a cluster, each scheme (independently of the type of the encoding) needs to compute the distance of this data point from the clusters, i.e. from the vectors representing each cluster. Both the type of these cluster prototype vectors and the distance computation depend on the selected encoding scheme. Therefore, the KscEncodingAndQM<T>::ComputeDistance interface method is provided for implementing the computation of the distance between the score variable space representation of an input data point and a cluster. This computation is then used in the KscEncodingAndQM<T>::ClusterDataPoint interface methods to assign the input data point to a cluster and (in some cases like BAS, AMS) to compute further information regarding the strength of this membership. A higher level method, KscEncodingAndQM<T>::ClusterDataSet, which depends exclusively on the KscEncodingAndQM<T>::ClusterDataPoint interfaces, is implemented in this base class to cluster a set of data.

Since the base class implements the sign based cluster membership encoding scheme, the corresponding Hamming distance computation is implemented in the KscEncodingAndQM<T>::ComputeDistance interface method here in the base class. With the Hamming distance computation in place, the corresponding cluster assignment (based on the smallest Hamming distance) is also implemented in the KscEncodingAndQM<T>::ClusterDataPoint interface methods.
model evaluation criterion: in the ideal case, the special structure of the score variable space representation of the clusters makes it possible to measure the quality of a model or of a completed clustering. Different model quality measures are available, but some of them depend on the cluster membership encoding and the cluster assignment. Therefore, the quality measure computation is linked with the cluster membership encoding. The base class provides the KscEncodingAndQM<T>::ComputeQualityMeasure interface method for implementing the model quality evaluation algorithm. Each derived class implements its own algorithm.

The following cluster membership encoding and assignment schemes are available with the corresponding model selection criterion:

KscEncodingAndQM_BLF: sign based encoding; assignment based on the minimum Hamming distance, i.e. binary membership indicator; Balanced Line Fit (BLF) model selection criterion that measures the within cluster collinearity (can be used for model selection with $K \geq 2$).
KscEncodingAndQM_AMS: direction based encoding; assignment based on the highest membership indicator value that is based on the cosine distance (measured from the cluster prototype directions); soft membership indicator; Average Membership Strength (AMS) model selection criterion that measures the average within cluster collinearity (can be used for model selection with $K \geq 2$).
KscEncodingAndQM_BAS: similar to the above but specific to the sparse KSC model obtained by using the reduced set method; direction based encoding; assignment based on the smallest Euclidean distance (measured from the cluster prototype directions); Balanced Angular Similarity (BAS) model selection criterion that penalizes KSC models yielding more data near the decision boundaries (can be used for model selection with $K > 2$).

Author: M. Novak
Date: February 2020

Subclassed by KscEncodingAndQM_AMS< T >, KscEncodingAndQM_BAS< T >, KscEncodingAndQM_BLF< T >

Public Functions

KscEncodingAndQM(KscQMType qmt, const std::string &name)¶

The only available constructor.

Parameters

[in] qmt: the cluster membership encoding and quality measure type
- KscQMType::kBLF sign based encoding and BLF quality measure (see more at KscEncodingAndQM_BLF)
- KscQMType::kAMS direction based (score data + cosine distance) encoding and AMS quality measure (see more at KscEncodingAndQM_AMS)
- KscQMType::kBAS direction based (reduced set coefficient data and Euclidean distance) encoding and quality measure (see more at KscEncodingAndQM_BAS)

~KscEncodingAndQM()¶: Destructor.

KscQMType GetQualityMeasureType() const¶

Public method to obtain the cluster membership encoding and quality measure type.

Return: The cluster membership encoding and quality measure type.

const std::string &GetName() const¶: Public method to obtain the name of the cluster membership encoding and quality measure type.

void SetCoefEtaBalance(T eta)¶: Public method to set the balance term coefficient used in the quality measure computation to determine the weight of the balance over the collinearity terms (should be in [0,1]).

T GetCoefEtaBalance() const¶: Public method to obtain the balance term coefficient used in the quality measure.

void SetOutlierThreshold(size_t val)¶

Public method to set the outlier threshold used in the quality measure computation: clusters below this cardinality value are considered to contain outliers and do not contribute to the quality measure value.

Parameters

[in] val: the outlier threshold value.

size_t GetOutlierThreshold() const¶: Public method to obtain outlier threshold value used in the quality measure.

T GetTheQualityMeasureValue() const¶: Public method to obtain the KSC model quality measure value computed and set when invoking the ComputeQualityMeasure interface method.

void GenerateCodeBook(const Matrix<T> &encodeMatrix, size_t numClusters)¶

Interface method to generate the code book (the cluster encoding).

In case of KSC, each cluster is represented by a special vector that depends on the selected cluster encoding scheme. These cluster representations are generated in this interface method based on a training data set. Since the type of the cluster prototype vector depends on the encoding scheme, each derived class implements its own algorithm.

Since all implemented cluster encoding schemes make use of the sign based encoding (of the training score data or the corresponding reduced set coefficient data) when generating their own encodings, this is implemented here in this base class method through the KscEncodingAndQM<T>::GenerateSignBasedCodeBook method. Therefore, each derived class can invoke this base class implementation to generate the sign based encoding. The method also sets the KscEncodingAndQM<T>::fNumClusters member variable to the number of desired clusters.

Parameters

[in] encodeMatrix: reference to either the training set score data matrix or to the corresponding reduced set coefficient matrix.
[in] numClusters: number of required clusters
[in] isScoreVarBased: flag to indicate if the encoding is score variable based (BLF, AMS) or not (BAS), i.e. whether the encodeMatrix is a reference to the training set score data or to the corresponding reduced set coefficients.

const std::vector<std::vector<bool>> &GetTheSignBasedCodeBook()¶

Public method to obtain a reference to the sign based code book generated by calling the GenerateCodeBook method.

Return: A reference to the sign based code book: a vector of fNumClusters boolean vectors, each fNumClusters-1 dimensional, encoding the clusters (true = +, false = -).

size_t GetNumClusters()¶: Public method to get the number of required clusters set when calling the GenerateCodeBook method.

size_t ClusterDataPoint(const T *aScoreData, size_t dim)¶

One of the 3 different interface methods to cluster a data point given its representation in the score variable space.

Any data point, given by its representation in the score variable space, can be clustered after the cluster membership encoding has been generated by calling the GenerateCodeBook interface method. The 3 different methods provided to perform this cluster membership assignment differ in the information they fill in. Only the index of the cluster to which the data is assigned is returned in this case. The other two methods can be useful in case of soft cluster membership encodings such as AMS or BAS, since they provide information on how certain the given assignment is: either the assignment to the selected cluster or to all the possible clusters.

Since this base class implements the sign based code book generation, this cluster assignment method is also implemented here: the data is assigned to the cluster that gives the minimum Hamming distance between the binarised score data and the corresponding sign based code word. Note that this also means that the Hamming distance computation is implemented in the ComputeDistance interface method in this base class. Therefore, all ingredients that the BLF encoding and assignment require are implemented in this base class.

Return

Index of the cluster the input data is assigned to.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
[in] dim: dimension of the score data (must be fNumClusters-1).
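The sign based assignment described above can be illustrated with a minimal, stand-alone sketch. It is not the library implementation: `HammingDistance` and `AssignBySign` are hypothetical free functions, the code book is passed explicitly instead of being a class member, a zero component is treated as negative, and ties are resolved in favour of the lowest cluster index. The last two choices are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the documented API): Hamming distance
// between the sign pattern of a score point and one boolean code word.
std::size_t HammingDistance(const double* score, const std::vector<bool>& codeWord) {
  std::size_t dist = 0;
  for (std::size_t i = 0; i < codeWord.size(); ++i) {
    const bool isPositive = score[i] > 0.0;  // binarise: true = +, false = -
    if (isPositive != codeWord[i]) ++dist;
  }
  return dist;
}

// Hypothetical helper: index of the code word with the smallest Hamming
// distance. Ties go to the lowest index (an assumption of this sketch).
std::size_t AssignBySign(const double* score, std::size_t dim,
                         const std::vector<std::vector<bool>>& codeBook) {
  assert(!codeBook.empty() && dim == codeBook.size() - 1);
  std::size_t best = 0;
  std::size_t bestDist = dim + 1;  // larger than any possible distance
  for (std::size_t k = 0; k < codeBook.size(); ++k) {
    assert(codeBook[k].size() == dim);
    const std::size_t d = HammingDistance(score, codeBook[k]);
    if (d < bestDist) {
      bestDist = d;
      best = k;
    }
  }
  return best;
}
```

With a three cluster code book {(+,+), (+,-), (-,-)}, a score point with signs (+,-) has distance zero from the second code word and is therefore assigned to the cluster with index 1.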

size_t ClusterDataPoint(const T *aScoreData, size_t dim, T &aMemberships)¶

Interface method to cluster a data point given its representation in the score variable space.

Same as above with the difference that a soft cluster membership indicator, i.e. a value representing how certain the given assignment is, will be given in addition to the index of the cluster to which the data is assigned.

Note, this additional information on the strength of the cluster membership will be set to 1 in case of hard membership indicators such as the one when using the pure sign based encoding and the Hamming distance i.e. in case of BLF.

Return

Index of the cluster the input data is assigned to.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
[in] dim: dimension of the score data (must be fNumClusters-1).
[out] aMembership: reference to be filled with the cluster membership strength.

size_t ClusterDataPoint(const T *aScoreData, size_t dim, T *aMemberships)¶

Interface method to cluster a data point given its representation in the score variable space.

Same as above with the difference that a soft cluster membership indicator, i.e. a value representing how certain the given assignment is, will be given not only for the assigned cluster but for all possible assignments. This information is available only in case of AMS. In all other cases, the strength of the cluster membership indicator will be set only for the cluster to which the data is assigned (to 1 in case of BLF as a hard assignment and to the soft value in case of BAS).

Return

Index of the cluster the input data is assigned to.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
[in] dim: dimension of the score data (must be fNumClusters-1).
[out] aMembership: pointer to a contiguous memory block that is to be filled with the cluster membership strength values.

void ClusterDataSet(Matrix<T> &aScoreMatrix, Matrix<T, false> &aCMMatrix, size_t flag = 0)¶

Method to cluster a set of data points given their representation in the score variable space.

This method is implemented in this base class and used by all the derived classes since the only functionality required must be implemented in the ClusterDataPoint interface methods.

Parameters

[in] aScoreMatrix: reference to the matrix that contains the representation of the data points to be clustered in the score variable space, i.e. the score data matrix.
[inout] aCMMatrix: reference to a matrix where the assigned cluster index as well as possible membership information will be filled. The first column of the matrix will contain the assigned cluster indices: the $i$-th row will contain the index of the cluster to which the data given in the $i$-th column of the score data matrix is assigned. Further column(s) along this row might contain information on the strength of the assignment:
- $\texttt{flag}$=1 (BAS,AMS): the column with index 1 will contain the strength of the assignment to the selected cluster (whose index is given in col=0).
- $\texttt{flag}$=2 (AMS): the columns with index $= [1,\dots,K]$ will contain the strength of assigning the given data to the cluster with index=k.
[in] flag: a flag to indicate if:
- $\texttt{flag}$=0: simple cluster membership assignment is required.
- $\texttt{flag}$=1: the strength of the assignment is also required.
- $\texttt{flag}$=2: the strengths of the assignment to all clusters are required.

Note: score data are assumed to be stored in the input $\texttt{aScoreMatrix}$ as columns i.e. the aScoreMatrix is assumed to be $\in \mathbb{R}^{(K-1)\times N}$ where $ K$ is the number of clusters and $ N$ is the number of data points stored in the matrix. Moreover, the $\texttt{aCMMatrix}$ is assumed to be a row major matrix with $ N$ rows and 1, 2 or K+1 columns depending on the value of the $\texttt{flag}=0,1$ or $2$.
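The relation between the flag and the required shape of the output matrix can be written down as a small sketch. `NumCMColumns` and `RowMajorIndex` are hypothetical helpers introduced only for illustration; they are not members of the documented classes and do not use the library's Matrix type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of columns the cluster membership matrix
// needs for a given flag (1, 2 or K+1 as described in the note above).
std::size_t NumCMColumns(std::size_t flag, std::size_t numClusters) {
  assert(flag <= 2);
  if (flag == 0) return 1;   // assigned cluster index only
  if (flag == 1) return 2;   // index + strength of the selected assignment
  return numClusters + 1;    // index + strength for each of the K clusters
}

// Hypothetical helper: linear offset of element (row, col) in a row major
// buffer with numCols columns; row i belongs to the i-th data point.
std::size_t RowMajorIndex(std::size_t row, std::size_t col, std::size_t numCols) {
  assert(col < numCols);
  return row * numCols + col;
}
```

For example, with K=4 and flag=2 each data point occupies a row of 5 values, and the assigned cluster index of the data point with index 3 sits at offset 15 of a flat row major buffer.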

T ComputeQualityMeasure(Matrix<T, false> &aCMMatrix, const Matrix<T> *aScoreMatrix = nullptr, const Matrix<T> *theSecondVarForBLFM = nullptr)¶

Interface method to compute the quality measure for model selection.

Depending on the selected cluster membership encoding (BLF, AMS, BAS), one can compute the corresponding quality measure, i.e. a measure of how far the model is from the ideal situation. Since the type of the quality measure depends on the selected encoding, each derived class needs to implement its own version of this interface method.

Parameters

[in] aCMMatrix: reference to the matrix that stores the assigned cluster indices and in some cases (AMS, BAS) further membership information.
[in] aScoreMatrix: the score data matrix used to obtain the cluster assignment. Used only in case of the BLF quality measure computation.
[in] theSecondVarForBLFM: the second variable matrix (vector) for BLF (used only in case of BLF when the number of desired clusters = 2)
Returns: the computed quality measure that indicates how far the model is from the ideal case. This can be used for model selection when combined with a grid search.

T ComputeDistance(const T *aScoreData, size_t pCluster, size_t dim)¶

Interface method to compute the distance of a data point from a cluster.

The data point is given with its score variable space representation. The method needs to implement the computation of the (encoding dependent) distance of the input data measured from the cluster prototype vector generated during the GenerateCodeBook method. Since the distance computation depends on the cluster membership encoding scheme, each derived class implements its own distance computation.

Since this base class implements the sign based cluster membership encoding in the GenerateCodeBook method, the corresponding distance computation, i.e. the Hamming distance, is implemented in this base class method.

Return

The distance of the input data measured from the cluster specified with its index. The Hamming distance computation is implemented in this base class method.

Parameters

[in] aScoreData: pointer to the memory that stores the score variable space representation of a data point contiguously.
[in] pCluster: index of the cluster from which the distance needs to be computed.
[in] dim: dimension of the input score data (must be fNumClusters-1).

Private Functions

void GenerateSignBasedCodeBook(const Matrix<T> &encodeMatrix, size_t numClusters)¶

Private method to generate the sign based cluster encoding.

Generates the sign based code book: the K most frequent, K-1 dimensional sign codings determined from the K-1 dimensional entries of the input matrix. It also sets the fNumClusters member variable to the number of clusters. The encode matrix is either the score variable matrix $ \in \mathbb{R}^{(K-1)\times N_{tr}} $ or the reduced set coefficient matrix $ \in \mathbb{R}^{R\times (K-1)} $, as indicated by the $\texttt{isScoreVarBased}$ flag.

After binarising the $ K-1$ dimensional input (either score or reduced set coefficient) data by taking the sign of the components, the $ K$ most frequent $ \textbf{cw}^{(k)} \in \{-1,+1\}^{K-1}, k=1,\dots,K $ sign based code words are determined. These $ \textbf{cw}^{(k)}, k=1,\dots,K $ code words are then stored in the KscEncodingAndQM<T>::fTheSignBasedCodeBook member as $ K$ boolean vectors of dimension $ K-1$ ( $ \texttt{true}=+; \texttt{false}=-$) and a reference to this member can be obtained by using the KscEncodingAndQM<T>::GetTheSignBasedCodeBook public method.
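The procedure above (binarise by sign, count the resulting code words, keep the $ K$ most frequent ones) can be sketched as follows. This is an illustration only: `SignBasedCodeBook` is a hypothetical free function, the input is a plain vector of points rather than the library's Matrix type, a zero component is mapped to `false`, and the ordering of equally frequent code words is an arbitrary choice of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using CodeWord = std::vector<bool>;

// Hypothetical stand-alone sketch: `points` holds N data points, each with
// K-1 components (score or reduced set coefficient data).
std::vector<CodeWord> SignBasedCodeBook(const std::vector<std::vector<double>>& points,
                                        std::size_t numClusters) {
  assert(numClusters >= 2);
  // Count how often each sign pattern occurs.
  std::map<CodeWord, std::size_t> counts;
  for (const auto& p : points) {
    assert(p.size() == numClusters - 1);
    CodeWord cw(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) cw[i] = p[i] > 0.0;  // true = +
    ++counts[cw];
  }
  // Order the distinct patterns by decreasing frequency.
  std::vector<std::pair<CodeWord, std::size_t>> sorted(counts.begin(), counts.end());
  std::stable_sort(sorted.begin(), sorted.end(),
                   [](const auto& a, const auto& b) { return a.second > b.second; });
  // Keep (at most) the K most frequent code words.
  std::vector<CodeWord> codeBook;
  for (std::size_t k = 0; k < sorted.size() && k < numClusters; ++k)
    codeBook.push_back(sorted[k].first);
  return codeBook;
}
```

If fewer than $ K$ distinct sign patterns occur in the data, this sketch simply returns fewer code words; how the library handles that situation is not stated here.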

Private Members

KscQMType fKscQMType¶: Cluster membership encoding and quality measure type.

const std::string fKscQMTypeName¶: Name of the cluster membership encoding and quality measure type.

size_t fNumClusters¶: Number of required clusters.

std::vector<std::vector<bool>> fTheSignBasedCodeBook¶: The collection of the fNumClusters sign based code words.

size_t fOutlierThreshold¶: A minimum required cardinality below which clusters are considered to contain outliers and do not contribute to the quality measure.

T fEtaBalance¶: The weight given to the balance term over the collinearity in the quality measure.

template<typename T> class KscEncodingAndQM_BLF : public KscEncodingAndQM<T>¶

Sign based cluster membership encoding scheme and Balanced Line Fit (BLF) model quality measure.

Follows the base KscEncodingAndQM class interfaces to implement a cluster membership encoding based on the signs of the training score data components: data that belong to the same cluster are located in the same orthant of the score space. The cluster assignment is then done by computing the Hamming distance between these sign based code words and the binarised score space representation of the data. The data will be assigned to the cluster that gives the smallest Hamming distance. In order to generate a quality measure, the collinearity of the score data that belong to the same cluster is measured by computing a Line Fit, and an additional term accounts for the Balance of the resulting clusters (see more at the KscEncodingAndQM_BLF<T>::ComputeQualityMeasure interface implementation).

Since the base class implements the necessary sign based code book generation in its KscEncodingAndQM<T>::GenerateCodeBook method, the corresponding Hamming distance computation in its KscEncodingAndQM<T>::ComputeDistance method and the corresponding cluster assignments in its KscEncodingAndQM<T>::ClusterDataPoint methods, the only interface method to be implemented here is KscEncodingAndQM_BLF<T>::ComputeQualityMeasure.

Public Functions

KscEncodingAndQM_BLF()¶: Constructor.

T ComputeQualityMeasure(Matrix<T, false> &aCMMatrix, const Matrix<T> *aScoreMatrix, const Matrix<T> *theSecondVarForBLFM) override¶

Implementation of the interface method to compute the BLF quality measure for model selection.

The Line Fit part of this quality measure is motivated by the fact that the score space representations of the data points that belong to the same cluster are collinear in the ideal case. By measuring the collinearity of the score variables that have been assigned to the same cluster, one can have an indicator of how far the given KSC model and its results are from the ideal case. The collinearity of the score variables assigned to the same cluster is measured by determining the fraction of the total variance contained along the first principal direction. It can be done by forming the covariance matrix of the corresponding score data points

\[ \texttt{Cov}^{(k)} := \frac{1}{|\mathcal{A}_k|} Z^{(k)^T}Z^{(k)}, k=1,\dots,K \]

where the clustering resulted in the partition $ \mathcal{A}_1,\dots,\mathcal{A}_K$ of the score data points and the rows of the matrix $ Z^{(k)} \in \mathbb{R}^{|\mathcal{A}_k|\times(K-1)}$ are the score data points that belong to the $ k$-th cluster. The required ratio of the variance can be obtained by computing the eigenvalues $ \lambda_1^{(k)} \geq \lambda_2^{(k)} \geq \dots \geq \lambda_{K-1}^{(k)} $ of the covariance matrix for the $ k$-th cluster and taking the ratio $ \lambda_1^{(k)}/\sum_{p=1}^{K-1} \lambda_p^{(k)} $. This ratio is equal to 1 in the ideal case when all the variance is contained along the first principal direction and equal to $ 1/(K-1)$ when the variance is evenly distributed along all the $ K-1$ principal directions. These ratios are scaled to the $ [0,1]$ interval by defining the Line Fit as

\[ \texttt{LF}(K>2) = \frac{1}{K}\frac{K-1}{K-2} \sum_{k=1}^{K} \left[ \frac{ \lambda_1^{(k)} }{ \sum_{p=1}^{K-1} \lambda_p^{(k)} } - \frac{1}{K-1} \right] \]

In case of $ K=2$ there is a single score variable, i.e. the score data are one dimensional ($K-1=1$), so the above procedure is not applicable. However, taking the additional variable $ \sum_{j=1}^{N_{tr}} K(\mathbf{x}_i, \mathbf{x}_j) + b^{(1)}$ beside the single score variable $ z_i = \sum_{j=1}^{N_{tr}} \beta_j^{(1)}K(\mathbf{x}_i, \mathbf{x}_j) + b^{(1)}$ (where $ \mathbf{\beta}^{(1)}$ is the leading eigenvector of the $ D^{-1}M_d\Omega$ matrix and $ b^{(1)}$ is the corresponding bias term), the above procedure becomes applicable and the corresponding Line Fit is

\[ \texttt{LF}(K=2) = \sum_{k=1}^{2} \left[ \frac{ \lambda_1^{(k)} }{ \lambda_1^{(k)} + \lambda_2^{(k)} } -\frac{1}{2} \right] \]

In order to make it possible to give more weight to KSC models that result in a more balanced clustering, a balance (BL) term can be introduced as $ \texttt{BL} = \texttt{min}(|\mathcal{A}_k|)/\texttt{max}(|\mathcal{A}_k|), k=1,\dots,K $ and the final quality measure is

\[ \texttt{BLF} = [1-\eta]\texttt{LF} + \eta\texttt{BL} \]

where the input parameter $ \eta \in [0,1]$ determines the importance of the balance term over the collinearity, i.e. line fit, term.

Return

The computed Balanced Line Fit KSC model evaluation criterion.

Parameters

[in] aCMMatrix: reference to the matrix that stores the assigned cluster indices as its first (zeroth) column. Note that only this column is used in the quality measure computation and this information is filled by the KscEncodingAndQM_BLF<T>::ClusterDataSet method with the corresponding {flag>=0} value.
[in] aScoreMatrix: pointer to the score data matrix that was clustered with the KscEncodingAndQM<T>::ClusterDataSet method, with the corresponding cluster assignment filled in the $ \texttt{aCMMatrix}$ matrix.
[in] theSecondVarForBLFM: pointer to the second variable matrix used when the required number of clusters is two (used only in case of K=2)
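As a worked illustration of the formulas above, the following sketch evaluates the BLF for the special case $ K=3$, where the score data are two dimensional and the eigenvalues of the $ 2\times 2$ covariance matrix have a closed form. It is an assumption-laden sketch, not the library implementation: `Point2`, `LeadingVarianceFraction` and `BalancedLineFitK3` are hypothetical names, the clusters are passed as already partitioned point lists, the covariance is formed exactly as written above (without mean centring), and the outlier threshold is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x, y; };  // a 2D score point (K-1 = 2)

// lambda_1 / (lambda_1 + lambda_2) of Cov = (1/|A|) Z^T Z for one cluster.
double LeadingVarianceFraction(const std::vector<Point2>& pts) {
  assert(!pts.empty());
  double a = 0.0, b = 0.0, c = 0.0;  // Cov = [[a, b], [b, c]]
  for (const auto& p : pts) {
    a += p.x * p.x;
    b += p.x * p.y;
    c += p.y * p.y;
  }
  const double n = static_cast<double>(pts.size());
  a /= n; b /= n; c /= n;
  const double trace = a + c;  // lambda_1 + lambda_2
  assert(trace > 0.0);
  // Largest eigenvalue of a symmetric 2x2 matrix in closed form.
  const double lambda1 = 0.5 * trace + std::sqrt(0.25 * (a - c) * (a - c) + b * b);
  return lambda1 / trace;
}

// BLF = (1 - eta) * LF + eta * BL for K = 3 clusters.
double BalancedLineFitK3(const std::vector<std::vector<Point2>>& clusters, double eta) {
  const std::size_t K = clusters.size();
  assert(K == 3 && eta >= 0.0 && eta <= 1.0);
  double sum = 0.0;
  std::size_t minSize = clusters[0].size();
  std::size_t maxSize = clusters[0].size();
  for (const auto& c : clusters) {
    sum += LeadingVarianceFraction(c) - 1.0 / static_cast<double>(K - 1);
    minSize = std::min(minSize, c.size());
    maxSize = std::max(maxSize, c.size());
  }
  const double lf = (1.0 / static_cast<double>(K)) *
                    (static_cast<double>(K - 1) / static_cast<double>(K - 2)) * sum;
  const double bl = static_cast<double>(minSize) / static_cast<double>(maxSize);
  return (1.0 - eta) * lf + eta * bl;
}
```

For three equally sized clusters whose points each lie on a line through the origin, every variance fraction is 1, so both LF and BL equal 1 and the BLF is 1 for any $ \eta$.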

template<typename T> class KscEncodingAndQM_AMS : public KscEncodingAndQM<T>¶

Angular similarity based KSC cluster membership encoding and Average Membership Strength clustering quality measure.

Follows the base KscEncodingAndQM class interfaces to implement a cluster membership encoding based on the collinearity of the score space representation of the training data that belong to the same cluster.

After the partition of the training set score data into the desired number of clusters by using the sign based cluster membership encoding, a mean score vector is computed for each cluster and used as the cluster prototype (see more at the KscEncodingAndQM_AMS<T>::GenerateCodeBook interface method implementation).

In contrast to the binary, sign and Hamming distance based cluster indicator implemented in the base class, a soft cluster membership indicator is introduced (see more at the KscEncodingAndQM_AMS<T>::ClusterDataPoint interface method implementation) by computing the cosine distance of any score data from the cluster prototypes (see more at the KscEncodingAndQM_AMS<T>::ComputeDistance interface method implementation). The data is assigned to the cluster yielding the highest cluster membership indicator.

This cluster membership encoding and soft membership indicator come with the so called Average cluster Membership Strength computation as the model evaluation criterion (see more at the KscEncodingAndQM_AMS<T>::ComputeQualityMeasure interface method implementation).

Author: M. Novak
Date: February 2020

Public Functions

KscEncodingAndQM_AMS()¶: Constructor.

void GenerateCodeBook(const Matrix<T> &aScoreVariableM, size_t numClusters) override¶

Interface method implementation to generate the cluster membership encoding.

The base class interface method implementation KscEncodingAndQM::GenerateCodeBook is called first to generate the sign based encoding of the clusters using the training score data provided as input argument. Then a cluster prototype, as the average score vector, is computed for each cluster as

\[ \mathbf{s}_k =\frac{1}{|\mathcal{A}_k|} \sum_{i, \mathbf{z}_i^* \in \mathcal{A}_k} \mathbf{z}_{i}^{*} ,k=1,\dots,K \]

where $ \{ \mathcal{A}_1, \dots,\mathcal{A}_K \} $ is the partition of the training score data based on their signs. These cluster prototype score vectors are normalised $ \mathbf{s}_k = \mathbf{s}_k/\|\mathbf{s}_k\|_2 $, unless $ K=2$ (when the Euclidean distance between these prototypes and a given score point will be computed for the cluster assignment instead of the cosine distance).

Parameters

[in] aScoreVariableM: reference to the training set score data matrix.
[in] numClusters: number of required clusters.
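The prototype computation above (cluster mean followed by an optional $ L_2$ normalisation) can be sketched in a few lines. `ClusterPrototype` is a hypothetical free function used only for illustration; it receives the score vectors of a single cluster, whereas the library method works on the whole score matrix and its sign based partition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical sketch: mean of the score vectors assigned to one cluster,
// normalised to unit length when `normalise` is true (i.e. when K > 2).
Vec ClusterPrototype(const std::vector<Vec>& members, bool normalise) {
  assert(!members.empty());
  Vec s(members[0].size(), 0.0);
  for (const auto& z : members) {
    assert(z.size() == s.size());
    for (std::size_t i = 0; i < s.size(); ++i) s[i] += z[i];
  }
  for (auto& v : s) v /= static_cast<double>(members.size());
  if (normalise) {
    double norm2 = 0.0;
    for (double v : s) norm2 += v * v;
    const double norm = std::sqrt(norm2);
    assert(norm > 0.0);  // a zero mean vector has no direction
    for (auto& v : s) v /= norm;
  }
  return s;
}
```

For the two score vectors (3, 0) and (3, 8) the mean is (3, 4) and the normalised prototype is (0.6, 0.8).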

size_t ClusterDataPoint(const T *aScoreData, size_t dim) override¶

Implementation of the interface method to cluster a data point given its representation in the score variable space.

The cosine distance between the cluster prototype vectors $ \mathbf{s}_k $ (determined previously by calling the GenerateCodeBook method) and the score variable space representation $ \mathbf{z}_{t}^{*} \in \mathbb{R}^{(K-1)}$ of any input data point $ \mathbf{x}_t \in \mathbb{R}^{d} $ is computed as

\[ \texttt{d}_{\texttt{cos}}(\mathbf{z}^*_t, \mathbf{s}_k) = 1-\frac{\mathbf{z}^{*^{T}}_t \mathbf{s}_k}{\|\mathbf{z}^*_t \|_2\|\mathbf{s}_k \|_2} \]

implemented in the KscEncodingAndQM_AMS<T>::ComputeDistance interface method. Then the cluster membership indicator for this data

\[ \texttt{cm}^{(k)} (\mathbf{z}_{t}^{*}) = \frac{ \prod_{p\neq k} \texttt{d}_{\texttt{cos}}(\mathbf{z}^*_t, \mathbf{s}_p) }{ \sum_{q=1}^{K} \prod_{p\neq q} \texttt{d}_{\texttt{cos}}(\mathbf{z}^*_t, \mathbf{s}_p) } \]

is computed for all $ k=1,\dots,K$ clusters and the data is assigned to the cluster yielding the highest cluster membership indicator value

\[ \mathbf{x}_t \to k = \arg \max_{k} (\texttt{cm}^{(k)}(\mathbf{z}_{t}^{*})), k=1,\dots,K \]

The index of this cluster is returned by this method.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
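The soft assignment can be illustrated with the following stand-alone sketch. It reads the membership indicator as the product of the cosine distances from all other prototypes, normalised so that the $ K$ indicator values sum to one. `CosineDistance`, `Memberships` and `ArgMax` are hypothetical free functions; the prototypes are passed explicitly and the special handling of $ K=2$ is not covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

using Vec = std::vector<double>;

// d_cos(z, s) = 1 - z^T s / (|z| |s|)
double CosineDistance(const Vec& z, const Vec& s) {
  assert(z.size() == s.size());
  double dot = 0.0, nz = 0.0, ns = 0.0;
  for (std::size_t i = 0; i < z.size(); ++i) {
    dot += z[i] * s[i];
    nz += z[i] * z[i];
    ns += s[i] * s[i];
  }
  assert(nz > 0.0 && ns > 0.0);
  return 1.0 - dot / (std::sqrt(nz) * std::sqrt(ns));
}

// cm[k] = prod_{p != k} d_p / sum_q prod_{p != q} d_p
Vec Memberships(const Vec& z, const std::vector<Vec>& prototypes) {
  const std::size_t K = prototypes.size();
  assert(K >= 2);
  Vec dist(K);
  for (std::size_t k = 0; k < K; ++k) dist[k] = CosineDistance(z, prototypes[k]);
  Vec cm(K, 1.0);
  double total = 0.0;
  for (std::size_t k = 0; k < K; ++k) {
    for (std::size_t p = 0; p < K; ++p)
      if (p != k) cm[k] *= dist[p];
    total += cm[k];
  }
  assert(total > 0.0);  // holds when the prototype directions are distinct
  for (auto& v : cm) v /= total;
  return cm;
}

// Index of the largest membership value, i.e. the assigned cluster.
std::size_t ArgMax(const Vec& v) {
  assert(!v.empty());
  return static_cast<std::size_t>(
      std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}
```

A score point that is exactly collinear with one prototype has zero cosine distance from it, so all other indicators vanish and the membership of that cluster becomes 1.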

size_t ClusterDataPoint(const T *aScoreData, size_t dim, T &aMembership) override¶

Interface method to cluster a data point given its representation in the score variable space and provide the cluster membership indicator.

Same as above with the difference that soft cluster membership indicator $ \texttt{cm}^{(k)} $ for the selected cluster $ k \in \{1,\dots,K\}$ will also be written into the address specified by the corresponding input argument.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
[in] dim: dimension of the score data (must be fNumClusters-1).

size_t ClusterDataPoint(const T *aScoreData, size_t dim, T *aMemberships) override¶

Interface method to cluster a data point given its representation in the score variable space and provide cluster membership indicator.

Same as above with the difference that the soft cluster membership indicator $ \texttt{cm}^{(k)} $ for all clusters $ k \in \{1,\dots,K\}$ will also be written into the address specified by the corresponding input argument.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored (contiguously in memory).
[in] dim: dimension of the score data (must be fNumClusters-1).

T ComputeQualityMeasure(Matrix<T, false> &aCMMatrix, const Matrix<T> *aScoreMatrix = nullptr, const Matrix<T> *theSecondVarForBLFM = nullptr) override¶

Implementation of the interface method to compute the AMS quality measure for model selection.

The Average Membership Strength (AMS) implemented in this method can be used to measure how far the clustering result of a data set, obtained by a given KSC model, is from the ideal case. This information can be used for model selection.

The AMS is defined as

\[ \texttt{AMS} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{|\mathcal{A}_k|}\sum_{i\in \mathcal{A}_k} \texttt{cm}^{(k)}_i \]

where the input data set is partitioned into the $\mathcal{A}_1,\dots, \mathcal{A}_K $ clusters and $ \texttt{cm}^{(k)}_i$ (computed in the KscEncodingAndQM_AMS<T>::ClusterDataPoint methods) is the soft cluster membership indicator value for the $ i$-th data point assigned to the $ k$-th cluster.

Beyond this AMS value that measures the average within cluster collinearity of the corresponding score variables, a second term is introduced as $ \texttt{BL} = \texttt{min}(|\mathcal{A}_k|)/\texttt{max}(|\mathcal{A}_k|), k=1,\dots,K $ to measure how balanced the clustering result is.

The final quality measure is

\[ [1-\eta]\texttt{AMS} + \eta\texttt{BL} \]

where the input parameter $ \eta \in [0,1]$ determines the importance of the balance term over the collinearity term.

Return

The computed balanced Average Membership Strength KSC model evaluation criterion.

Parameters

[in] aCMMatrix: reference to the matrix that stores the assigned cluster indices as its first (zeroth) column and the corresponding soft cluster membership indicator values as its second (first) column. This matrix was filled by the KscEncodingAndQM_AMS<T>::ClusterDataSet method with the corresponding {flag>0} value.
[in] aScoreMatrix: not used in this method
[in] theSecondVarForBLFM: not used in this method
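The balanced AMS defined above can be sketched as follows. This is an illustration under stated assumptions rather than the library code: `Assignment` and `BalancedAMS` are hypothetical names, each row mimics the first two columns of the cluster membership matrix (assigned cluster index and its membership strength), cluster indices are assumed to be zero based, every cluster is assumed to be non-empty, and the outlier threshold is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row type: assigned cluster index and the membership
// strength of that assignment.
struct Assignment {
  std::size_t cluster;
  double strength;
};

// (1 - eta) * AMS + eta * BL
double BalancedAMS(const std::vector<Assignment>& rows, std::size_t K, double eta) {
  assert(K >= 2 && eta >= 0.0 && eta <= 1.0);
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (const auto& r : rows) {
    assert(r.cluster < K);
    sum[r.cluster] += r.strength;
    ++count[r.cluster];
  }
  double ams = 0.0;
  for (std::size_t k = 0; k < K; ++k) {
    assert(count[k] > 0);  // assumption of this sketch: no empty cluster
    ams += sum[k] / static_cast<double>(count[k]);  // mean strength in cluster k
  }
  ams /= static_cast<double>(K);
  const std::size_t minSize = *std::min_element(count.begin(), count.end());
  const std::size_t maxSize = *std::max_element(count.begin(), count.end());
  const double bl = static_cast<double>(minSize) / static_cast<double>(maxSize);
  return (1.0 - eta) * ams + eta * bl;
}
```

With two clusters of sizes 2 and 4, mean strengths 0.7 and 0.9 and $ \eta=0.2$, the AMS is 0.8, the balance term is 0.5 and the final value is 0.8·0.8 + 0.2·0.5 = 0.74.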

T ComputeDistance(const T *aScoreData, size_t pCluster, size_t dim) override¶

Implementation of the interface method to compute the distance of a data point from a given cluster.

The cosine distance between a cluster prototype vectors $ \mathbf{s}_k $ (determined previously by calling the KscEncodingAndQM_AMS<T>::GenerateCodeBook method) and the score variable space representation $ \mathbf{z}_{t}^{*} \in \mathbb{R}^{(K-1)}$ of an input data point $ \mathbf{x}_t \in \mathbb{R}^{d} $ is computed as

\[ \texttt{d}_{\texttt{cos}}(\mathbf{z}^*_t, \mathbf{s}_k) = 1-\frac{\mathbf{z}^{*^{T}}_t \mathbf{s}_k}{\|\mathbf{z}^*_t \|_2\|\mathbf{s}_k \|_2} \]

Return

The cosine distance computed between the score variable space representation of a data point and a given cluster specified by its index.

Parameters

[in] aScoreData: pointer to the memory that stores the score variable space representation of a data point contiguously.
[in] pCluster: index of the cluster from which the distance needs to be computed ( $ k$).
[in] dim: dimension of the input score data (must be KscEncodingAndQM<T>::fNumClusters-1).
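The cosine distance formula above can be sketched as a small free function. This is only an illustration, not the library's implementation: the name `CosineDistance` is hypothetical, and the prototype is not assumed to be normalised, so both norms are computed explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: cosine distance 1 - (z.s)/(|z||s|) between a score
// point z and a prototype vector s, both stored as `dim` contiguous values.
template <typename T>
T CosineDistance(const T* z, const T* s, std::size_t dim) {
  T dot = 0, zz = 0, ss = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    dot += z[i] * s[i];
    zz += z[i] * z[i];
    ss += s[i] * s[i];
  }
  // The distance is undefined for zero vectors.
  assert(zz > 0 && ss > 0);
  return T(1) - dot / (std::sqrt(zz) * std::sqrt(ss));
}
```

The result lies in [0, 2]: 0 for vectors pointing in the same direction, 1 for orthogonal vectors and 2 for opposite directions.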

Private Members

std::vector<std::vector<T>> fThePrototypeVectorBook¶

The code book generated by calling the KscEncodingAndQM_AMS<T>::GenerateCodeBook method.

The normalised (unless $ K=2$) $ \mathbf{s}_k, k=1,\dots,K$ vectors.

std::vector<T> fTheMemberships¶: A utility vector to store some intermediate information on the memberships.

template<typename T> class KscEncodingAndQM_BAS : public KscEncodingAndQM<T>¶

Angular similarity based KSC cluster membership encoding and Balanced Angular Similarity clustering quality measure.

Follows the base KscEncodingAndQM class interfaces to implement a cluster membership encoding and the corresponding model evaluation criterion. Similarly to KscEncodingAndQM_AMS, the encoding is based on the collinearity of the score space representation of the training data that belong to the same cluster. However, the cluster prototype directions are determined based on the reduced set coefficients instead of the score data. Therefore, this encoding can only be used when the reduced set method is utilised to obtain a sparse KSC model.

After the reduced set coefficient data are partitioned into the desired number of clusters by using the sign based cluster membership encoding, a mean reduced set coefficient vector is computed for each cluster; these vectors are used as the cluster prototype directions (see more at the KscEncodingAndQM_BAS<T>::GenerateCodeBook interface method implementation).

In contrast to the binary, sign and Hamming distances based cluster indicator implemented in the base class, a soft cluster membership indicator is introduced (see more at the KscEncodingAndQM_BAS<T>::ClusterDataPoint interface method implementation) by computing the Euclidean distance of any normalised score data from the cluster prototype directions (see more at the KscEncodingAndQM_BAS<T>::ComputeDistance interface method implementation). The data is assigned to the cluster yielding the smallest distance.

This cluster membership encoding comes with the Balanced Angular Similarity model evaluation criterion (see more at the KscEncodingAndQM_BAS<T>::ComputeQualityMeasure interface method implementation).

Author: M. Novak
Date: February 2020

Public Functions

KscEncodingAndQM_BAS()¶: Constructor.

void GenerateCodeBook(const Matrix<T> &aReducedSetCoefM, size_t numClusters) override¶

Interface method implementation to generate the cluster membership encoding.

Similar to KscEncodingAndQM_AMS<T>::GenerateCodeBook but specialised for the case when the reduced set method is used to obtain a sparse KSC model.

Consider a reduced set $ \mathcal{R} = \{\mathbf{x}_r\}_{r=1}^{R} \subset \{\mathbf{x}_i\}_{i=1}^{N_{tr}}, R \ll N_{tr} $ and the corresponding coefficients $ \{ \mathbf{\zeta}^{(k)} \}_{k=1}^{K-1}, \mathbf{\zeta}^{(k)} \in \mathbb{R}^{R} $ such that $ \Omega_{\Psi\Psi}\mathbf{\zeta}^{(k)} = \Omega_{\Psi\Phi}\mathbf{\beta}^{(k)}$, where $ \Omega_{\Psi\Psi}$ and $\Omega_{\Psi\Phi} $ are the within reduced set and the reduced set - training set kernel matrices, respectively, and $\mathbf{\beta}^{(k)}$ is the $k$-th (approximated) eigenvector of the $ D^{-1}M_D\Omega_{\Phi\Phi}$ matrix. The approximated score points associated with the input data $\mathbf{x}_i$ are then computed as

\[ \tilde{z}^{*^{(k)}}_i = \sum_{r=1}^{R} \zeta_r^{(k)} K(\mathbf{x}_r, \mathbf{x}_i) + \tilde{b}^{(k)}, k=1,\dots,K-1 \]
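A minimal sketch of this sum is shown below. It is an illustration under stated assumptions rather than the library's code: the Gaussian kernel `RbfKernel`, its bandwidth `h`, the `Vec` alias and the function name `ApproxScore` are all hypothetical, and the coefficients are stored as $R$ rows with $K-1$ columns.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical Gaussian (RBF) kernel with bandwidth h. This is an assumption
// of the sketch: the kernel type is a template parameter in the documented classes.
double RbfKernel(const Vec& a, const Vec& b, double h) {
  assert(a.size() == b.size() && h > 0.0);
  double dist2 = 0.0;
  for (std::size_t j = 0; j < a.size(); ++j) {
    dist2 += (a[j] - b[j]) * (a[j] - b[j]);
  }
  return std::exp(-dist2 / (2.0 * h * h));
}

// Approximated score point of x: z[k] = sum_r zeta[r][k] * K(x_r, x) + bias[k].
// zeta has R rows (one per reduced set point) and K-1 columns.
Vec ApproxScore(const std::vector<Vec>& reducedSet, const std::vector<Vec>& zeta,
                const Vec& bias, const Vec& x, double h) {
  assert(reducedSet.size() == zeta.size());
  Vec z(bias);  // start from the bias terms
  for (std::size_t r = 0; r < reducedSet.size(); ++r) {
    assert(zeta[r].size() == z.size());
    const double kv = RbfKernel(reducedSet[r], x, h);
    for (std::size_t k = 0; k < z.size(); ++k) {
      z[k] += zeta[r][k] * kv;
    }
  }
  return z;
}
```

The kernel is evaluated once per reduced set point and reused for all $K-1$ score components, so the cost per input point scales with $R$ rather than with $N_{tr}$.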

It can be shown that the $ R \ll N_{tr} $ reduced set coefficient points $ \mathbf{\tau}^* \in \mathbb{R}^{(K-1)}$ (formed as the $ R $ rows of the reduced set coefficient matrix) can be used to find the cluster prototype directions instead of the score data points, since the reduced set coefficients play the role of the $ \mathbf{\beta}^{(k)}, k=1,\dots,K$ eigenvectors when the reduced set method is used.

Therefore, after a sign based encoding (using the base class KscEncodingAndQM::GenerateCodeBook method) and the sign based clustering of the $ R $ reduced set coefficient points $ \mathbf{\tau}^*$ into the disjoint sets $\mathcal{D}_1,\dots,\mathcal{D}_K$, the cluster prototype directions are computed for each cluster as

\[ \mathbf{u}_k' =\frac{1}{|\mathcal{D}_k|} \sum_{i, \mathbf{x}_i \in \mathcal{D}_k} \mathbf{\tau}_i^* ,k=1,\dots,K \]

The final prototype directions are obtained with a normalisation as $ \mathbf{u}_k = \mathbf{u}_k'/\| \mathbf{u}_k'\|_2$.

Parameters

[in] aReducedSetCoefM: reference to the reduced set coefficient matrix.
[in] numClusters: number of required clusters.
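The mean and normalisation steps described for this method can be sketched as follows. The sketch assumes that the sign based clustering has already assigned a cluster label to each reduced set coefficient point; the function name `PrototypeDirections` and the container layout are hypothetical, and empty clusters are simply left as zero vectors here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// tau[r]: the r-th reduced set coefficient point (K-1 values).
// label[r]: cluster index of that point, assumed to come from the sign based step.
std::vector<std::vector<double>> PrototypeDirections(
    const std::vector<std::vector<double>>& tau,
    const std::vector<std::size_t>& label, std::size_t numClusters) {
  assert(!tau.empty() && tau.size() == label.size());
  const std::size_t dim = tau[0].size();
  std::vector<std::vector<double>> u(numClusters, std::vector<double>(dim, 0.0));
  std::vector<std::size_t> count(numClusters, 0);
  // Accumulate the per-cluster sums and cardinalities.
  for (std::size_t r = 0; r < tau.size(); ++r) {
    assert(label[r] < numClusters && tau[r].size() == dim);
    for (std::size_t d = 0; d < dim; ++d) {
      u[label[r]][d] += tau[r][d];
    }
    ++count[label[r]];
  }
  for (std::size_t k = 0; k < numClusters; ++k) {
    if (count[k] == 0) continue;  // empty cluster: left as zero (sketch only)
    double norm2 = 0.0;
    for (std::size_t d = 0; d < dim; ++d) {
      u[k][d] /= static_cast<double>(count[k]);  // mean vector u'_k
      norm2 += u[k][d] * u[k][d];
    }
    const double norm = std::sqrt(norm2);
    if (norm > 0.0) {
      for (std::size_t d = 0; d < dim; ++d) {
        u[k][d] /= norm;  // u_k = u'_k / |u'_k|
      }
    }
  }
  return u;
}
```

Dividing by the cardinality before normalising does not change the final direction; it is kept only to mirror the two formulas above.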

size_t ClusterDataPoint(const T *aScoreData, size_t dim) override¶

Implementation of the interface method to cluster a data point given its representation in the score variable space.

The Euclidean distance between the cluster prototype direction vectors $ \mathbf{u}_k $ (determined previously by calling the GenerateCodeBook method) and the (approximated) score variable space representation $ \tilde{\mathbf{z}}_{t}^{*} \in \mathbb{R}^{(K-1)}$ of any input data point $ \mathbf{x}_t \in \mathbb{R}^{d} $ is computed as

\[ \texttt{d}( \tilde{\mathbf{z}}^*_t, \mathbf{u}_k ) = \left\| \frac{ \tilde{\mathbf{z}}^{*}_t }{ \| \tilde{\mathbf{z}}^{*}_t \|_2 } - \mathbf{u}_k \right\|_2 \]

implemented in the KscEncodingAndQM_BAS<T>::ComputeDistance interface method. The data point is assigned to the cluster yielding the smallest distance:

\[ \mathbf{x}_t \to k = \arg \min_{k} \texttt{d}( \tilde{\mathbf{z}}^*_t, \mathbf{u}_k ), k=1,\dots,K \]

The index of this cluster is returned by this method.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored contiguously.
[in] dim: dimension of the score data (must be fNumClusters-1).

size_t ClusterDataPoint(const T *aScoreData, size_t dim, T &aMembership) override¶

Implementation of the interface method to cluster a data point given its representation in the score variable space and to provide the cluster membership strength.

Same as above, with the difference that the cluster membership strength indicator $ \texttt{cm}^{(k)} $ for the selected cluster $ k \in \{1,\dots,K\}$ is also written to the location specified by the corresponding argument.

This cluster membership strength is defined as

\[ \texttt{cm}^{(k)} (\tilde{\mathbf{z}}_{t}^{*}) = 1 - \frac{\texttt{d}^{1}( \tilde{\mathbf{z}}^*_t, \mathbf{u}_{k_{1}} )}{\texttt{d}^{2}( \tilde{\mathbf{z}}^*_t, \mathbf{u}_{k_{2}} )} \]

where $k_{1}$ and $k_{2}$ indicate the clusters yielding the smallest $ \texttt{d}^{1}$ and the second smallest $\texttt{d}^{2}$ distances measured from the given approximated score data point $\tilde{\mathbf{z}}_{t}^{*}$. Note that this membership strength penalises data points located near a decision boundary.

Note that the score space is one-dimensional when the desired number of clusters is two ($K=2 \to K-1=1$) and the corresponding cluster prototype directions are $+1, -1$. The Euclidean distance of the normalised score data takes the binary values $\{0,+2\}$ in this case, so the above cluster membership strength becomes the constant 1. Therefore, this membership strength can be used only when $K>2$.

Parameters

[in] aScoreData: pointer to the memory where the fNumClusters-1 dimensional score data to be clustered is stored contiguously.
[in] dim: dimension of the score data (must be fNumClusters-1).
[out] aMembership: reference to the variable into which the cluster membership strength is written.
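The selection of the closest cluster together with its membership strength can be sketched as below. The sketch assumes that the ratio is taken as the smallest over the second smallest distance, which is the reading under which the $K=2$ case gives the constant 1 as noted above. The function name `ClosestWithStrength` and the convention of returning 0 when both distances vanish are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// dist[k]: distance of the (normalised) score point from the k-th prototype.
// Returns the index of the closest prototype and writes 1 - d1/d2 into
// `membership`, with d1 the smallest and d2 the second smallest distance.
std::size_t ClosestWithStrength(const std::vector<double>& dist, double& membership) {
  assert(dist.size() >= 2);
  double d1 = std::numeric_limits<double>::infinity();
  double d2 = std::numeric_limits<double>::infinity();
  std::size_t k1 = 0;
  for (std::size_t k = 0; k < dist.size(); ++k) {
    if (dist[k] < d1) {
      d2 = d1;  // the previous best becomes the runner-up
      d1 = dist[k];
      k1 = k;
    } else if (dist[k] < d2) {
      d2 = dist[k];
    }
  }
  // Guard against a zero denominator (sketch convention, not from the source).
  membership = (d2 > 0.0) ? 1.0 - d1 / d2 : 0.0;
  return k1;
}
```

A point that is equally far from its two nearest prototypes gets strength 0, while a point lying exactly on a prototype direction gets strength 1.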

T ComputeQualityMeasure(Matrix<T, false> &aCMMatrix, const Matrix<T> *aScoreMatrix = nullptr, const Matrix<T> *theSecondVarForBLFM = nullptr) override¶

Implementation of the interface method to compute the BAS quality measure for model selection.

The Balanced Angular Similarity (BAS) model selection criterion is implemented in this method. It measures how far the clustering result of a data set, obtained by a given KSC model, is from the ideal case. This information can be used for model selection.

The Angular Similarity (AS) is defined as

\[ \texttt{AS} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{|\mathcal{A}_k|}\sum_{i\in \mathcal{A}_k} \texttt{cm}^{(k)}_i \]

where the input data set is partitioned into the $\mathcal{A}_1,\dots, \mathcal{A}_K $ clusters and $ \texttt{cm}^{(k)}_i$ (computed in the KscEncodingAndQM_BAS<T>::ClusterDataPoint methods) is the value of the cluster membership strength for the $ i$-th data point assigned to the $ k$-th cluster.

Beyond this AS value, which measures the average within-cluster collinearity of the corresponding score variables, a second term is introduced as $ \texttt{BL} = \texttt{min}(|\mathcal{A}_k|)/\texttt{max}(|\mathcal{A}_k|), k=1,\dots,K $ to measure how balanced the clustering result is.

The final quality measure is

\[ \texttt{BAS} = [1-\eta]\texttt{AS} + \eta\texttt{BL} \]

where the input parameter $ \eta \in [0,1]$ determines the importance of the balance term relative to the collinearity term.

Return

The computed Balanced Angular Similarity KSC model evaluation criterion.

Parameters

[in] aCMMatrix: reference to the matrix that stores the assigned cluster indices as its first (zeroth) column and the corresponding cluster membership strength values as its second (first) column. This matrix was filled by the KscEncodingAndQM_BAS<T>::ClusterDataSet method with the corresponding {flag=1} value.
[in] aScoreMatrix: not used in this method
[in] theSecondVarForBLFM: not used in this method
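The BAS formula can be illustrated with the following sketch. It is not the library's implementation: the function name `BalancedAngularSimilarity` and the two input vectors (standing in for the two columns of the cluster membership matrix) are hypothetical, the outlier threshold is ignored, and an empty cluster is assumed to contribute zero to the AS term.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// cluster[i]: assigned cluster index of the i-th data point.
// cm[i]: cluster membership strength of the i-th data point.
double BalancedAngularSimilarity(const std::vector<std::size_t>& cluster,
                                 const std::vector<double>& cm,
                                 std::size_t numClusters, double eta) {
  assert(cluster.size() == cm.size() && numClusters > 0);
  assert(eta >= 0.0 && eta <= 1.0);
  std::vector<double> sum(numClusters, 0.0);
  std::vector<std::size_t> count(numClusters, 0);
  for (std::size_t i = 0; i < cluster.size(); ++i) {
    assert(cluster[i] < numClusters);
    sum[cluster[i]] += cm[i];
    ++count[cluster[i]];
  }
  // AS: average over the clusters of the within-cluster mean strength.
  double as = 0.0;
  for (std::size_t k = 0; k < numClusters; ++k) {
    if (count[k] > 0) {
      as += sum[k] / static_cast<double>(count[k]);
    }
  }
  as /= static_cast<double>(numClusters);
  // BL: ratio of the smallest to the largest cluster cardinality.
  const std::size_t minC = *std::min_element(count.begin(), count.end());
  const std::size_t maxC = *std::max_element(count.begin(), count.end());
  const double bl = (maxC > 0) ? static_cast<double>(minC) / static_cast<double>(maxC) : 0.0;
  return (1.0 - eta) * as + eta * bl;
}
```

With $\eta = 0$ the result reduces to the AS term and with $\eta = 1$ to the balance term alone.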

T ComputeDistance(const T *aScoreData, size_t pCluster, size_t dim) override¶

Implementation of the interface method to compute the distance of a data point from a given cluster.

The Euclidean distance between the cluster prototype direction vectors $ \mathbf{u}_k $ (determined previously by calling the GenerateCodeBook method) and the (approximated) score variable space representation $ \tilde{\mathbf{z}}_{t}^{*} \in \mathbb{R}^{(K-1)}$ of any input data point $ \mathbf{x}_t \in \mathbb{R}^{d} $ is computed as

\[ \texttt{d}( \tilde{\mathbf{z}}^*_t, \mathbf{u}_k ) = \left\| \frac{ \tilde{\mathbf{z}}^{*}_t }{ \| \tilde{\mathbf{z}}^{*}_t \|_2 } - \mathbf{u}_k \right\|_2 \]

Return

The Euclidean distance computed between the normalised, approximated score variable space representation of a data point and a given cluster specified by its index.

Parameters

[in] aScoreData: pointer to the memory that stores the (approximated) score variable space representation of a data point contiguously.
[in] pCluster: index of the cluster from which the distance needs to be computed ( $ k$).
[in] dim: dimension of the input score data (must be KscEncodingAndQM<T>::fNumClusters-1).
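A minimal sketch of this distance, assuming the prototype direction is already a unit vector as stated for the code book, could look as follows. The function name `DistanceFromPrototype` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Euclidean distance between the normalised score point z/|z| and the unit
// prototype direction u; both are stored as `dim` contiguous values.
template <typename T>
T DistanceFromPrototype(const T* z, const T* u, std::size_t dim) {
  T norm2 = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    norm2 += z[i] * z[i];
  }
  assert(norm2 > 0);  // a zero score point cannot be normalised
  const T invNorm = T(1) / std::sqrt(norm2);
  T dist2 = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    const T diff = z[i] * invNorm - u[i];
    dist2 += diff * diff;
  }
  return std::sqrt(dist2);
}
```

Since both vectors have unit length, the result lies in [0, 2]. In the one-dimensional case ($K=2$) it can only take the values 0 and 2.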

Private Members

std::vector<std::vector<T>> fThePrototypeVectorBook¶

The code book generated by calling the KscEncodingAndQM_BAS<T>::GenerateCodeBook method.

The normalised $ \mathbf{u}_k, k=1,\dots,K$ vectors.

std::vector<T> fTheMemberships¶: A utility vector to store some intermediate information on the memberships.

template<class TKernel, typename T, typename TInputD> class KscWkpcaIChol¶

Sparse Kernel Spectral Clustering based on the incomplete Cholesky decomposition of the kernel matrix.

Data members

bool fUseGPU¶: Use GPU (if available and built with the -DUSE_CUBLAS CMake option) in the approximated training set eigenvector computation.

size_t fNumberOfClusters¶: Number of clusters to find.

TKernel *fKernel¶: Pointer to the kernel object that implements the kernel function.

Matrix<T> *fIncCholeskyM¶: Pointer to the incomplete Cholesky matrix (not owned by the class).

Matrix<TInputD, false> *fInputTrainingDataM¶: Pointer to the input data to be used for training (not owned by the class).

Matrix<T> *fTheAprxBiasTermsM¶

Approximated bias terms (K-1 values), used only with the BLF or AMS encoding and QM; computed as in Eq. (26) with the substitution given after Eq. (56).

Public Functions

KscWkpcaIChol()¶: Constructor.

~KscWkpcaIChol()¶: Destructor.

template<typename ...Args> void SetKernelParameters(Args... args)¶

Public method to set the parameters of the kernel function object.

The type of the kernel function object is selected at the instantiation of a KscWkpcaIChol object since the class is templated on this type. This public method can be used to set the parameters of the kernel function object. Note that the method invokes the corresponding $ \texttt{SetKernelParameters}$ method of the kernel function object through the base class KernelBase::SetParameters interface method, passing all provided parameters as arguments.

Parameters

[in] args: input argument list.

void SetInputTrainingDataMatrix(Matrix<TInputD, false> *inTrDataM)¶

Public method to set the pointer to the input training data matrix.

Note

The class does NOT own the data i.e. the corresponding memory needs to be freed by the caller.

Parameters

[in] inTrDataM: pointer to the input data matrix that stores the input data vectors in row-major (memory contiguous) order. These input data must be the ones used in the pivoted Cholesky decomposition (of the corresponding training data kernel matrix), with the resulting permutations already applied to the rows of the input data matrix.

void SetIncCholeskyMatrix(Matrix<T> *incChM)¶

Public method to set the incomplete Cholesky matrix.

Note

The class OWNS the data: the incomplete Cholesky matrix will be destroyed in the Train method by its QR factorisation. Therefore, the corresponding memory is freed in the Train method.

Parameters

[in] incChM: pointer to the incomplete Cholesky matrix obtained previously by the (pivoted) incomplete Cholesky decomposition of training data kernel matrix.

void SetNumberOfClustersToFind(size_t nc)¶

Set number of clusters to find in the training.

Parameters

[in] nc: number of required clusters

size_t GetNumberOfClustersToFind() const¶

Get number of clusters to find in the training.

Return: Number of required clusters.

void SetUseGPU(bool val)¶

Request to use GPU in the computations of the approximated eigenvectors related to the training.

When the library is built with the ${-DUSE_CUBLAS}$ ${CMake}$ configuration option, i.e. with GPU (CUDA) support, the most compute-intensive part of the training can be accelerated by using the GPU. This part consists of the QR and SVD decompositions and the QU computation of the approximated eigenvalues of the symmetric problem. These computations are performed on the GPU when this flag is set to true. This can be especially useful for large reduced set sizes.

Parameters

val: The value to request GPU based computation. If ${true}$, the QR and SVD decompositions as well as the QU computations are done on the GPU (in a row, without moving the data back to the host) instead of the CPU. Otherwise (default), all computations are done on the CPU.

void SetEncodingAndQualityMeasureType(KscQMType qmType)¶

Public method to set the cluster membership encoding/decoding (assignment) scheme and the corresponding clustering quality measure.

See more on the available cluster membership encodings, assignments and the corresponding model evaluation criteria in the KscEncodingAndQM base class. The following encoding, assignment and model quality measures are supported:

sign based encoding and Hamming distance: Balanced Line Fit (BLF) KscEncodingAndQM_BLF
direction based encoding-1 and cosine distance: Average Membership Strength (AMS) KscEncodingAndQM_AMS
direction based encoding-2 and norm of Euclidean distance: Balanced Angular Similarity (BAS) KscEncodingAndQM_BAS (model evaluation criterion can be used only for $ K>2 $)

Parameters

[in] qmType: type of the cluster membership encoding and the corresponding quality measure
- KscQMType::kBLF KscEncodingAndQM_BLF
- KscQMType::kAMS KscEncodingAndQM_AMS
- KscQMType::kBAS KscEncodingAndQM_BAS

const KscEncodingAndQM<T> *GetEncodingAndQualityMeasure() const¶

Public method to get the encoding and quality measure object pointer.

Return: Pointer to the encoding and quality measure object.

void SetQualityMeasureEtaBalance(T etaBalance)¶

Public method to set the weight to be given to the balance term in the model selection criterion.

The model selection criterion (quality measure) contains a term that accounts for how balanced the result of the clustering is. This parameter gives the weight of this term relative to the other (collinearity) term.

Parameters

[in] etaBalance: weight of the balance term. Must be in [0,1], where 1 means that all the importance is given to the balance term while 0 removes this term from the quality measure.

void SetQualityMeasureOutlierThreshold(size_t val)¶

Public method to set the outlier threshold to be used in the model selection criterion computation.

When the contributions from the different clusters to the model selection criterion (quality measure) are computed, clusters with cardinality below a certain threshold are considered to contain outliers, and these clusters do not contribute to the model selection criterion value. This threshold can be set by using this method.

Parameters

[in] val: The minimum required cardinality or outlier threshold value.

void Train(size_t numBLASThreads, bool isQMOnTraining = true, size_t qmFlag = 1, int verbose = 0)¶

Public method to train the incomplete Cholesky factorisation based sparse KSC model.

The method trains a KSC model on the training data set, which means that all the required parameter values and quantities are determined and stored in the object. After the training, the object can be used to cluster any unseen input data by using its Test() method. However, certain parameter values and object pointers need to be set before invoking the training (see below).

Training steps:

computes the K-1 leading, approximated eigenvectors of the $ D^{-1}M_D\Omega $ matrix (K is the number of required clusters)
creates the reduced set and computes the corresponding reduced set coefficients
generates the cluster membership encoding
(optionally) clusters the training data and computes the corresponding model selection criterion

Needs to be done before training:

the parameters of the kernel function object need to be set by using the SetKernelParameters<>() method
the training data matrix pointer must be set by using the SetInputTrainingDataMatrix() method
the incomplete Cholesky decomposition of the training data set kernel matrix needs to be done by using an IncCholesky object, and the resulting incomplete Cholesky matrix must be set by using the SetIncCholeskyMatrix() method
the number of required clusters must be set by the SetNumberOfClustersToFind() method
a cluster membership encoding scheme, which also defines the model selection criterion, must be set by the SetEncodingAndQualityMeasureType() method

Parameters

numBLASThreads: number of threads to be used in the BLAS and LAPACK functions (if the implementation used supports multi threading).
isQMOnTraining: flag to indicate if the optional clustering of the training data set, with the model selection criterion calculation, should also be done (true by default). The corresponding result can be obtained by using the GetTheClusterMembershipMatrix().
qmFlag: a value that determines what clustering information needs to be generated, i.e. which one of the 3 KscEncodingAndQM<>::ClusterDataPoint() interfaces should be used (generate only the cluster assignment, also the strength of this assignment, or the strength for each cluster). Used only when clustering of the training data set is required, i.e. when $\texttt{isQMOnTraining=true}$!
verbose: verbosity level that controls the verbosity of the output information.

template<typename TKernelParameterType> void Tune(std::vector<TKernelParameterType> &theKernelParametersVect, size_t minNumClusters, size_t maxNumClusters, Matrix<TInputD, false> &theValidInputDataM, size_t numBLASThreads, int verbose = 0)¶

Tuning of the parameters of the sparse KSC model.

The KSC model depends on the number of required clusters and the given kernel function parameters. This method trains a KSC model on the given training data set and evaluates the model selection criterion on the given validation data set over a 2D grid of cluster-number x kernel-parameters. The 2D grid is determined by the input parameters, and the resulting KSC model evaluation criterion matrix can be obtained by using the GetTheTuningResultMatrix() method. The 2D grid point that gives the highest value of the model evaluation criterion is also available through the GetTheOptimalClusterNumber() and GetTheOptimalKernelParIndex() methods. However, a more careful investigation of the resulting model evaluation surface is suggested to select the optimal KSC model parameters.

Since an incomplete Cholesky factorisation based sparse KSC model is trained on the training data set at each point of the 2D parameter grid, the same things need to be done before invoking the Tune() method as described at the Train() method.

The incomplete Cholesky matrix is still available after the Tune() method (unlike after the Train() method, which destroys it).

Parameters

theKernelParametersVect: vector that contains the kernel parameters that determine the rows of the 2D parameter grid. The SetKernelParameters<>() method will be invoked for each of these parameters during the tuning.
minNumClusters: minimum of the cluster number parameter that determines the minimum value of the columns of the 2D parameter grid.
maxNumClusters: maximum of the cluster number parameter that determines the maximum value of the columns of the 2D parameter grid.
theValidInputDataM: reference to the validation data set on which the model evaluation criterion will be evaluated at each point of the 2D parameter grid after training the model on the training data set. Note, that the training data set needs to be set by the SetInputTrainingDataMatrix() method before invoking this Tune() method.
numBLASThreads: number of threads to be used in the BLAS and LAPACK functions (if the implementation used supports multi threading).
verbose: verbosity level that controls the verbosity of the output information.
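Locating the grid point with the highest model evaluation criterion can be sketched as a plain scan over the result matrix. The sketch below is only illustrative: the `GridOptimum` struct, the `FindGridOptimum` function and the nested vector layout (one row per kernel parameter, one column per cluster number starting at `minNumClusters`) are hypothetical stand-ins for the tuning result matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the best grid point and its criterion value.
struct GridOptimum {
  std::size_t kernelParIndex;  // row index: which kernel parameter
  std::size_t numClusters;     // cluster number belonging to the column
  double value;                // model evaluation criterion at that point
};

// qm[ip][ic]: criterion value for the ip-th kernel parameter and for
// (minNumClusters + ic) clusters.
GridOptimum FindGridOptimum(const std::vector<std::vector<double>>& qm,
                            std::size_t minNumClusters) {
  assert(!qm.empty() && !qm[0].empty());
  GridOptimum best{0, minNumClusters, qm[0][0]};
  for (std::size_t ip = 0; ip < qm.size(); ++ip) {
    for (std::size_t ic = 0; ic < qm[ip].size(); ++ic) {
      if (qm[ip][ic] > best.value) {
        best = GridOptimum{ip, minNumClusters + ic, qm[ip][ic]};
      }
    }
  }
  return best;
}
```

A single maximum can be misleading on a flat or noisy surface, which is why inspecting the whole criterion matrix is still suggested.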

const Matrix<TInputD, false> *GetPermutedTrDataMatrix() const¶

Public method to obtain pointer to the matrix that stores the permuted input training data.

The order of the training data was changed during the incomplete Cholesky decomposition of the corresponding kernel matrix: the feature map with the highest residual (orthogonal projection) norm is included at each step of the orthogonalisation. The KSC model expects the training data with the corresponding permutations applied, so that they are in sync with the incomplete Cholesky factor matrix. This form of the training data, used in the KSC model, can be obtained with this method.

Return: Pointer to the permuted input training data matrix, i.e. the input training data in the order that is consistent with the KSC model.

const Matrix<T, false> *GetTheClusterMembershipMatrix() const¶

Public method to obtain the result of the clustering.

Return: Pointer to the matrix that stores the result of the clustering. Each row corresponds to the result obtained for the input data with the same row index in either the permuted input training data matrix (after Train()) or the test data matrix (after Test()).

const Matrix<T, false> *GetTheTuningResultMatrix() const¶

Obtain pointer to the model evaluation criterion matrix over the 2D KSC parameter grid generated during the tuning (Tune()).

Return: Pointer to the matrix that contains the KSC model evaluation criterion values over the 2D parameter grid. Each row of the matrix contains the model quality measure values that belong to one kernel parameter and each column contains the values for a given cluster number.

Private Functions

void ComputeApproximatedEigenvectors(Matrix<T> &theAprxEigenvectM, int numBLASThreads, int verbose = 0)¶: Auxiliary method to compute the approximated eigenvectors of the $ D^{-1}M_D\Omega $ matrix.