TY - JOUR
T1 - Perturbation theory for cross data matrix-based PCA
AU - Wang, Shao Hsuan
AU - Huang, Su Yun
N1 - Publisher Copyright:
© 2022 Elsevier Inc.
PY - 2022/7
Y1 - 2022/7
N2 - Principal component analysis (PCA) has long been a useful and important tool for dimension reduction. However, the method must be used with care in certain settings, such as high dimension with small sample size. In general, a low dimension with a large sample size, or a large signal-to-noise ratio, is vital to guarantee the consistency of the leading eigenvalues and eigenvectors obtained by PCA. Cross data matrix (CDM)-based PCA is another way to estimate PCA components: the data are split into two subsets and a singular value decomposition is computed for the cross product of the corresponding covariance matrices. CDM-based PCA has been shown to have a broader region of consistency than ordinary PCA for the leading eigenvalues and eigenvectors. Although the difference in regions of consistency is well studied, an interesting practical as well as theoretical question is how the two methods differ in eigenvalue and eigenvector estimation, especially when both fall in a common region of consistency. In this article, we derive finite-sample approximation results as well as the asymptotic behavior of CDM-based PCA via matrix perturbation. Furthermore, we derive a comparison measure for CDM-based PCA versus ordinary PCA; this measure depends only on the data dimension, the noise correlations, and the noise-to-signal ratio (NSR). Using this measure, we develop an algorithm that selects good partitions and integrates the results from these partitions into a final CDM-based PCA estimate. Numerical and real data examples are presented for illustration.
AB - Principal component analysis (PCA) has long been a useful and important tool for dimension reduction. However, the method must be used with care in certain settings, such as high dimension with small sample size. In general, a low dimension with a large sample size, or a large signal-to-noise ratio, is vital to guarantee the consistency of the leading eigenvalues and eigenvectors obtained by PCA. Cross data matrix (CDM)-based PCA is another way to estimate PCA components: the data are split into two subsets and a singular value decomposition is computed for the cross product of the corresponding covariance matrices. CDM-based PCA has been shown to have a broader region of consistency than ordinary PCA for the leading eigenvalues and eigenvectors. Although the difference in regions of consistency is well studied, an interesting practical as well as theoretical question is how the two methods differ in eigenvalue and eigenvector estimation, especially when both fall in a common region of consistency. In this article, we derive finite-sample approximation results as well as the asymptotic behavior of CDM-based PCA via matrix perturbation. Furthermore, we derive a comparison measure for CDM-based PCA versus ordinary PCA; this measure depends only on the data dimension, the noise correlations, and the noise-to-signal ratio (NSR). Using this measure, we develop an algorithm that selects good partitions and integrates the results from these partitions into a final CDM-based PCA estimate. Numerical and real data examples are presented for illustration.
KW - Cross data matrix
KW - Finite sample approximation
KW - High dimension and low sample size
KW - Matrix perturbation
KW - Principal component analysis
KW - Spiked covariance model
UR - http://www.scopus.com/inward/record.url?scp=85125482530&partnerID=8YFLogxK
U2 - 10.1016/j.jmva.2022.104960
DO - 10.1016/j.jmva.2022.104960
M3 - Journal article
AN - SCOPUS:85125482530
SN - 0047-259X
VL - 190
JO - Journal of Multivariate Analysis
JF - Journal of Multivariate Analysis
M1 - 104960
ER -
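
The splitting-and-SVD construction described in the abstract can be made concrete with a short sketch. The Python snippet below is an illustrative assumption, not the authors' procedure: the function name cdm_pca, the single random half split, and the averaged eigenvector estimate are choices made here for brevity, following the generic cross-data-matrix idea of forming the scaled cross product of two centered sub-samples, taking its SVD, and mapping the singular vectors back to the variable space.

import numpy as np

def cdm_pca(X, k, rng=None):
    # Illustrative cross-data-matrix (CDM) PCA for an (n, p) data array X
    # with rows as observations; returns k leading eigenvalue estimates
    # and a (p, k) matrix of eigenvector estimates.
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    idx = rng.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2:]
    # Center each sub-sample with its own mean.
    X1 = X[i1] - X[i1].mean(axis=0)
    X2 = X[i2] - X[i2].mean(axis=0)
    n1, n2 = len(i1), len(i2)
    # Scaled cross product of the two sub-samples (n1 x n2) and its SVD;
    # the singular values serve as eigenvalue estimates.
    D = X1 @ X2.T / np.sqrt((n1 - 1) * (n2 - 1))
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    lam = s[:k]
    # Map left/right singular vectors back to p-space, align signs,
    # and average the two eigenvector estimates.
    H1 = X1.T @ U[:, :k] / np.sqrt((n1 - 1) * lam)
    H2 = X2.T @ Vt[:k].T / np.sqrt((n2 - 1) * lam)
    signs = np.sign(np.sum(H1 * H2, axis=0))
    H = H1 + H2 * signs
    H /= np.linalg.norm(H, axis=0)
    return lam, H

The paper's actual method goes further: it uses a comparison measure driven by the data dimension, the noise correlations, and the NSR to select good partitions and integrate their results, whereas the sketch above covers only a single split.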