Determining how many factors to retain as the expression of an underlying structure is an important topic in principal component analysis (PCA). Several empirical criteria are commonly used for this purpose: retaining all eigenvalues higher than 1 (Kaiser-Guttman rule), retaining the first eigenvalues that together reach a preset amount of explained variance, retaining the eigenvalues higher than a preset threshold (broken-stick method), or retaining the eigenvalues that depart from the straight line on which all the other eigenvalues tend to lie (scree plot). Yet these rules often have weak theoretical bases and are not always applied appropriately. Approaching the problem analytically, by deriving the distribution of the eigenvalues of sample correlation matrices, is a hard task, and most studies report results that are valid only asymptotically or under specific assumptions. There is a need to generalize the method so that it also applies to real and possibly small datasets.

The aim of this thesis was to model the decision thresholds for the eigenvalue distribution as a function of the number of variables (k) and the sample size (n), under the assumptions that no latent factors exist and that variables are standard-normally distributed. Two methods were considered: a direct and an indirect method. In a simulation, data were generated for 70 different settings, obtained by combining 7 different values for n (75, 150, 300, 600, 1200, 2400, 4800) with 10 different values for k (6, 12, 18, 24, 30, 36, 42, 48, 60). All variables were generated as independent and standard-normally distributed. For each setting, PCA was applied to 6001 independent samples. The distribution of the first 4 eigenvalues of the correlation matrix was considered and the values of the 95th centiles were computed.

The simulation showed a positive correlation between pairs of consecutive eigenvalues; this correlation increases as k increases and, to a lesser extent, as n increases. This pattern is expected to persist when latent factors are present. With the direct method, the observed 95th centile of the distribution of the first 4 eigenvalues could be predicted as a function of k and n by a nonlinear model with 7 parameters. With the indirect method, the distribution of the first 4 eigenvalues was normalized through a 3-parameter Box-Cox transformation. The parameters of the Box-Cox transformation were then expressed as functions of k and n and used to predict the 95th centile. Both methods appeared to predict the value of the 95th centile accurately. For the first eigenvalue, the mean absolute difference between the type I error risk associated with the observed and the predicted thresholds is 3‰ for the direct method and 5‰ for the indirect method. The indirect method has the additional advantage of assigning to any computed eigenvalue its probability of occurrence under the null hypothesis.

The number of samples generated in this study is large enough to obtain highly precise estimates of the type I error risk for the 1st eigenvalue. The reliability of the estimates is lower for the 2nd and, a fortiori, the 3rd and 4th eigenvalues. Further research in this field should focus on how, and to what extent, the distribution of the eigenvalues of sample correlation matrices depends on the shape of the parent distribution (e.g. skewed, leptokurtic, multimodal), and on the possible extension of the predicting functions to the case where latent factors exist, by including a parameter that accounts for the variance explained by these factors.
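The null-hypothesis simulation described in the abstract can be sketched for a single setting as follows. This is a minimal illustration, not the thesis code: the function name `null_eigenvalues`, the seed, and the reduced number of samples (2000 instead of the 6001 used in the study) are assumptions made here, and NumPy is assumed to be available.

```python
import numpy as np


def null_eigenvalues(n, k, n_samples=2000, n_top=4, seed=0):
    """Largest eigenvalues of sample correlation matrices when no latent
    factors exist (hypothetical helper, not taken from the thesis)."""
    rng = np.random.default_rng(seed)
    top = np.empty((n_samples, n_top))
    for i in range(n_samples):
        # n observations of k independent standard-normal variables
        x = rng.standard_normal((n, k))
        # k x k sample correlation matrix (variables are in columns)
        r = np.corrcoef(x, rowvar=False)
        # eigvalsh returns eigenvalues in ascending order; keep the largest n_top
        eig = np.linalg.eigvalsh(r)
        top[i] = eig[::-1][:n_top]
    return top


# One of the settings listed in the abstract: n = 75, k = 6
top = null_eigenvalues(n=75, k=6)

# Empirical 95th centile of each of the first 4 eigenvalues:
# the decision thresholds under the null hypothesis for this setting
thresholds = np.percentile(top, 95, axis=0)
print("95th centiles, eigenvalues 1-4:", np.round(thresholds, 3))

# Correlation between the 1st and 2nd eigenvalue across samples
r12 = np.corrcoef(top[:, 0], top[:, 1])[0, 1]
print("correlation between eigenvalues 1 and 2:", round(float(r12), 3))
```

An observed first eigenvalue above `thresholds[0]` would be judged informative at the 5% level for this particular pair of n and k. The direct and indirect methods in the thesis replace such per-setting simulation with functions of k and n.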
IDENTIFICAZIONE DEGLI ASSI FATTORIALI INFORMATIVI NELL'ANALISI DELLE COMPONENTI PRINCIPALI (Identification of informative factorial axes in principal component analysis)
PLEBANI, MADDALENA
2014
File | Size | Format
---|---|---
phd_unimi_R09128.pdf (Open Access since 30/07/2015) | 8.86 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/171571
URN:NBN:IT:UNIMI-171571