Factor Analysis of Cross-Classified Data

Undergraduate

ABSTRACT
This thesis introduces a model hierarchy related to Principal Component Analysis and Factor Analysis, in which vector measurements are linearly decomposed into a relatively small set of hypothetical principal directions for purposes of dimension reduction. The mathematical specification of the models' unknown parameters is unified, and identifiability of the suitably defined models is proved. The EM algorithm and the Newton-Raphson algorithm, based on likelihoods and profile likelihoods, are implemented to obtain computationally effective maximum likelihood estimators for the unknown parameters. A restricted model (with some error variances equal to zero) and a sufficient condition for a local maximum likelihood estimate are established. Score tests are constructed to check whether error variances are zero, a condition shown to be associated with non-identifiability of the models. Statistical tests of goodness of fit of the models to data are established in a likelihood ratio testing framework, so that the most parsimoniously parameterized model consistent with the data can be chosen for purposes of description and classification of the experimental settings. The results are applied to a real data set of coronal cross-sectional ultrasound images of the human tongue surface during speech. The likelihood ratio test of the fit of the PARAFAC model to these coronal tongue data leads to a finding that the PARAFAC model is inadequate.

Introduction

In statistical practice, for investigations involving a large number of observed variables, it is often useful to simplify the analysis by considering a small number of linear combinations of the original variables. For example, scholastic achievement tests usually consist of a number of examinations in different subject areas. In attempting to rate students applying for admission, college administrators frequently reduce the scores from all subject areas to a single, overall score. Such a reduction is most useful when it can be done with minimal loss of information. Principal Component Analysis (PCA) is a method of this kind: it finds linear combinations of the original variables which account for most of the variance in the original sample.
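As a concrete illustration of this variance-based reduction, the following sketch computes principal components from the eigendecomposition of a sample covariance matrix. The data and all names are synthetic and illustrative, not taken from the thesis's data set:

```python
import numpy as np

# Toy PCA: project 3-dimensional observations onto the directions of
# maximal variance. The third variable is nearly a sum of the first
# two, so two components should capture almost all of the variance.
rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal((n, 2))
data = np.column_stack(
    [x, x @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(n)]
)

centered = data - data.mean(axis=0)
cov = centered.T @ centered / (n - 1)        # sample covariance matrix

# Eigenvectors of the covariance matrix are the principal directions;
# eigenvalues give the variance each direction accounts for.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # proportion of variance explained
scores = centered @ eigvecs[:, :2]           # reduced 2-dimensional representation
```

Here the first two components recover essentially all the sample variance, which is the sense in which the reduction loses minimal information.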
In many scientific fields, notably psychology and other social sciences, we are often interested in quantities, such as intelligence or social status, that are not directly measurable. However, it is often possible to measure other quantities which reflect the underlying variable of interest. Factor analysis is an attempt to explain the correlations between observable variables in terms of underlying factors, which are themselves not directly observable. For example, measurable quantities such as performance on a series of tests can be explained in terms of an underlying factor such as intelligence.
At first glance, factor analysis closely resembles principal components analysis. Both use linear combinations of variables to explain sets of observations of many variables. In principal component analysis, the observed variables are themselves the quantities of interest; the combination of these variables in the principal components is primarily a tool for simplifying the interpretation of the observed variables. Principal components analysis is merely a transformation of the data, and no assumptions are made about the form of the covariance matrix of the data. Factor analysis, on the other hand, assumes that the data come from a statistical model which can be expressed in terms of a few underlying, but unobservable, random quantities called factors together with additional sources of variation called errors. Factor analysis can thus be considered an extension of principal components analysis, and both can be viewed as attempts to approximate the covariance matrix. Applications of PCA and factor analysis have become very popular in many fields, such as psychology, economics, sociology, meteorology, medicine, political science, taxonomy and archaeology. Both have been used successfully in acoustic and phonetic research on tongue position by Harshman et al. (1977), Jackson (1988), Nix et al. (1996), and Stone et al. (1997).
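The factor model just described can be fitted by maximum likelihood, and (as the abstract notes) the EM algorithm is one standard route. The sketch below fits a one-factor Gaussian model x = Lf + e, with f ~ N(0, 1) and e ~ N(0, diag(Psi)), using the textbook EM updates on synthetic data; it is an assumed illustration, not the thesis's own implementation, and the names L and Psi are ours:

```python
import numpy as np

# Generate data from a known one-factor model.
rng = np.random.default_rng(1)
p, n, k = 4, 500, 1
true_L = np.array([[0.9], [0.8], [0.7], [0.6]])   # true loadings
true_psi = np.array([0.3, 0.4, 0.5, 0.6])          # true unique variances
f = rng.standard_normal((n, k))
x = f @ true_L.T + rng.standard_normal((n, p)) * np.sqrt(true_psi)

S = np.cov(x, rowvar=False)        # sample covariance
L = rng.standard_normal((p, k))    # random starting loadings
Psi = np.diag(S).copy()            # start unique variances at sample variances

for _ in range(200):
    # E-step: posterior moments of the factor given current parameters.
    beta = L.T @ np.linalg.inv(L @ L.T + np.diag(Psi))     # k x p
    Eff = np.eye(k) - beta @ L + beta @ S @ beta.T         # average E[f f' | x]
    # M-step: update loadings and unique (error) variances.
    L = S @ beta.T @ np.linalg.inv(Eff)
    Psi = np.diag(S - L @ beta @ S)

# Fitted covariance approximation Sigma = L L' + diag(Psi).
model_cov = L @ L.T + np.diag(Psi)
```

The fitted covariance L L' + diag(Psi) closely approximates the sample covariance S here, which makes concrete the statement that factor analysis approximates the covariance matrix through a low-dimensional model.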

The PARAFAC model was pioneered by Harshman et al. (1977). It is a technique for extracting “articulatory prime” shapes from data, allowing non-orthogonal components that scale differently for different speakers. The central question motivating the PARAFAC model is how to adapt a small set of prime shapes, which account for a large share of the variance of sound production, to different speakers without requiring large numbers of parameters for all speaker and sound combinations. PCA may reduce the dimension well, but it does not capture individual speaker differences; the PARAFAC model, by contrast, succeeds in decomposing tongue shape data into tongue shape factors. In this thesis, PCA, Factor Analysis and the PARAFAC model are introduced. A model hierarchy is defined and then applied to coronal tongue cross-section ultrasound data from multiple subjects, collected in the laboratory of Dr. M. Stone. We also discuss how the assumptions defining each of the models are to be interpreted for the tongue data, and then present data analytic results to determine which model is adequate.
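For intuition about what PARAFAC computes, a rank-R PARAFAC (CP) decomposition of a three-way array, e.g. speakers × sounds × measurement positions, can be obtained by alternating least squares. The sketch below works on a small synthetic tensor and is illustrative only; it is not the estimation procedure analyzed in the thesis, and the dimensions and variable names are assumptions:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move the given axis first, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (IxR) and B (JxR)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

rng = np.random.default_rng(2)
I, J, K, R = 5, 6, 7, 2
A0, B0, C0 = (rng.standard_normal((d, R)) for d in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)   # exact rank-2 tensor

# Random initialization, then alternating least-squares updates: each
# factor matrix is the least-squares solution with the other two fixed.
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
for _ in range(500):
    A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
    B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
    C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T

That = np.einsum('ir,jr,kr->ijk', A, B, C)   # reconstructed tensor
rel_err = np.linalg.norm(That - T) / np.linalg.norm(T)
```

Because each mode gets its own factor matrix, the same small set of components can be rescaled per speaker, which is exactly the property that distinguishes PARAFAC from a single PCA of the pooled data.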