THE MISSING VALUE PROBLEM: A REVIEW AND CASE STUDY

Postgraduate

ABSTRACT

The purpose of this thesis is to review methods of imputation and apply them to data collected by the Equal Employment Opportunity Commission (EEOC).

First, I discuss several imputation methods and review the theory of multiple imputation (MI). Next, I review aspects of missing data and outline an artificial data simulation. I then describe a simulation based on an EEOC dataset listing the numbers of employees by ethnicity in large establishments. Mean imputation and MI are applied to the simulated datasets. In the first scenario, I impute data for nonresponding establishments; the more values are imputed, the higher the resulting population means. In the second scenario, I simulate item nonresponse. I find that mean imputation and MI generate similar means, and that the means are not affected by the percentage of missingness under either imputation method. The results suggest that MI produces larger standard errors than mean imputation. Finally, the percentage of missingness has no effect on the standard error in the case of MI.

Introduction
Missing data are omnipresent in survey research. Although complete data for all subjects are desired when we collect information for statistical analysis, the possibility that some data will be unavailable should not be ignored. Data may be lost, costly to obtain or unusable. When no response is obtained for a whole unit in the survey, the missingness is called unit nonresponse. When responses are obtained for some of the items for a unit but not for others, it is called item nonresponse. Missing data have to be dealt with before we can do anything meaningful with the dataset. Many statistical problems have been viewed as missing data problems in the sense that one has to work with incomplete data. Advances in computer technology have not only made previously long and complicated numerical calculations a simple matter but also advanced the statistical analysis of missing data.

INTRODUCTION
Missing data usually means a lack of responses in the data. It is often indicated by “Don’t know”, “Refused”, “Unavailable” and so on. Missing data are problematic because most statistical procedures require a value for each variable. When a data set is incomplete, an analyst has to decide how to deal with it. This requires a missing-value procedure (Graham and Schafer, 2002).
Data may be missing in any type of study and for many reasons. For example, subjects in longitudinal studies often drop out before the study is complete. Sometimes this happens because they are no longer interested, cannot find time to participate, have died or have moved out of the area. Whatever the reason, the study will suffer from a missing-data problem.
Missing data cause a variety of problems in data analysis. First, lost data decrease statistical power. Statistical power refers to the ability of an analytic technique to detect a significant effect in a data set, and a high level of power often requires a large sample. Because missing data reduce the effective sample size, they may meaningfully diminish power.
Second, missing data can produce biases in parameter estimates and can make the analysis harder to conduct and the results harder to present. The bias may be either upward or downward, which means the true score may be either overestimated or underestimated. In an example from Roth and Switzer (1995), a memory study has a true-score validity of 0.7, but research may show an estimated validity of 0.5, an underestimate introduced when the median of the observed values is substituted for the missing values. This may happen because substituting the median reduces the observed variance in a variable.
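The variance-shrinking effect of median substitution can be illustrated with a small sketch. The data below are hypothetical (normally distributed scores with about 30% of the values deleted completely at random) and are not taken from Roth and Switzer (1995); the function name median_impute and all numeric settings are illustrative assumptions.

```python
import random
import statistics


def median_impute(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


rng = random.Random(0)

# Hypothetical complete scores (assumed mean 50, standard deviation 10).
complete = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Delete roughly 30% of the values completely at random (an assumption).
with_missing = [None if rng.random() < 0.3 else v for v in complete]

observed = [v for v in with_missing if v is not None]
imputed = median_impute(with_missing)

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print("variance of observed values:     ", round(var_observed, 2))
print("variance after median imputation:", round(var_imputed, 2))
```

Every imputed value sits at the same point near the centre of the distribution, so the imputed variable is less spread out than the observed values alone. Correlations and validity coefficients computed from such a variable can be attenuated as a result.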
The methods examined in this thesis to deal with missingness are mean imputation and multiple imputation. Imputation consists of replacing the missing data with values derived from the respondents or from a relationship between the nonrespondents and respondents. Mean imputation can be used when the missingness is either unit nonresponse or item nonresponse. Multiple imputation is more useful for item than unit nonresponse, but it is possible to use it for the latter as well (Little, 2006).
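To make the contrast between the two methods concrete, the following is a minimal sketch under stated assumptions, not the procedure used later in this thesis. It assumes a single hypothetical variable with values missing completely at random, and it swaps in the approximate Bayesian bootstrap as a simple way to generate multiple imputations. The estimates are pooled with Rubin's combining rules: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. All variable names and numeric settings are illustrative.

```python
import math
import random
import statistics

rng = random.Random(1)

# Hypothetical data: 500 values, about 30% missing completely at random.
n = 500
y = [rng.gauss(100.0, 20.0) for _ in range(n)]
is_missing = [rng.random() < 0.3 for _ in range(n)]
observed = [v for v, gone in zip(y, is_missing) if not gone]
n_mis = n - len(observed)


def mean_and_var_of_mean(values):
    """Return the sample mean and the usual variance estimate of that mean."""
    return statistics.mean(values), statistics.variance(values) / len(values)


# Mean imputation: every missing value is replaced by the observed mean.
filled = observed + [statistics.mean(observed)] * n_mis
mean_single, var_single = mean_and_var_of_mean(filled)
se_single = math.sqrt(var_single)

# Multiple imputation via the approximate Bayesian bootstrap (an assumption):
# resample the observed values to form a donor pool, then draw the missing
# values from that pool, repeating the process num_imputations times.
num_imputations = 20
estimates, within_vars = [], []
for _ in range(num_imputations):
    donors = rng.choices(observed, k=len(observed))
    draws = rng.choices(donors, k=n_mis)
    q, u = mean_and_var_of_mean(observed + draws)
    estimates.append(q)
    within_vars.append(u)

# Rubin's combining rules.
q_bar = statistics.mean(estimates)        # pooled point estimate
u_bar = statistics.mean(within_vars)      # within-imputation variance
b = statistics.variance(estimates)        # between-imputation variance
total_var = u_bar + (1 + 1 / num_imputations) * b
se_mi = math.sqrt(total_var)

print("mean imputation:     mean", round(mean_single, 2), "SE", round(se_single, 3))
print("multiple imputation: mean", round(q_bar, 2), "SE", round(se_mi, 3))
```

In this sketch the two point estimates are close, while the multiple-imputation standard error is larger because it carries the uncertainty about the imputed values. Mean imputation treats the filled-in values as if they had been observed.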
According to Little and Rubin (2002), the mechanisms leading to missing data can be classified into three subgroups: 
•    Missing completely at random (MCAR)
•    Missing at random (MAR) and
•    Not missing at random (NMAR).
Denote the complete data by $Y = \{y_{ij}\}$, $i = 1, \dots, n$, $j = 1, \dots, k$, and the missing-data indicator matrix by $M = \{M_{ij}\}$, $i = 1, \dots, n$, $j = 1, \dots, k$. Denote the conditional distribution of $M$ given $Y$ by $f(M \mid Y, Q)$, where $Q$ is a vector of unknown parameters.
MCAR means that the missing data mechanism is unrelated to the variables under study, whether missing or observed: a missing response happens purely by chance. That is, $f(M \mid Y, Q) = f(M \mid Q)$ for all $Y, Q$.
Let $Y_{\mathrm{obs}}$ denote the observed components of $Y$ and let $Y_{\mathrm{mis}}$ denote the missing components.
In the case of MAR, the missingness does not depend on the missing values, but may be related to other observed data. That is, $f(M \mid Y, Q) = f(M \mid Y_{\mathrm{obs}}, Q)$ for all $Y_{\mathrm{mis}}, Q$.
For example, consider a study with income as the key variable of interest. If ethnicity is always observed and minority group members tend not to report their income, the missing-value mechanism may be MAR, because whether a person responds depends on his or her ethnicity, which is observed, rather than on the unreported income itself.
When data are not missing at random, the missing data are said to be NMAR. In contrast to MAR, where the probabilities of missingness are determined entirely by the observed data and unknown parameters, NMAR arises when the pattern of missingness can only be explained by the values that are themselves missing. In other words, the distribution of $M$ depends on the missing values in the data matrix $Y$. For this reason, NMAR is also called informative missingness.
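The three mechanisms can be contrasted in a small simulation built on the income example above. This is a sketch under assumed, hypothetical settings (group shares, income distributions and missingness probabilities are all invented for illustration); it only shows how the mean of the observed incomes behaves under each mechanism.

```python
import random
import statistics

rng = random.Random(42)
n = 20000

# Ethnicity indicator, always observed (assumed 30% minority).
minority = [rng.random() < 0.3 for _ in range(n)]

# Hypothetical incomes in thousands; the group means are assumed to differ.
income = [rng.gauss(40.0 if m else 50.0, 10.0) for m in minority]


def observed_mean(prob_missing):
    """Mean income among units that respond, given per-unit missingness probabilities."""
    kept = [y for y, p in zip(income, prob_missing) if rng.random() >= p]
    return statistics.mean(kept)


# MCAR: the same probability for everyone, unrelated to any variable.
p_mcar = [0.3] * n
# MAR: the probability depends only on the observed ethnicity indicator.
p_mar = [0.5 if m else 0.1 for m in minority]
# NMAR: the probability depends on the income value itself.
p_nmar = [0.5 if y > 50.0 else 0.1 for y in income]

true_mean = statistics.mean(income)
mean_mcar = observed_mean(p_mcar)
mean_mar = observed_mean(p_mar)
mean_nmar = observed_mean(p_nmar)

print("complete-data mean: ", round(true_mean, 2))
print("observed mean, MCAR:", round(mean_mcar, 2))
print("observed mean, MAR: ", round(mean_mar, 2))
print("observed mean, NMAR:", round(mean_nmar, 2))
```

Under MCAR the respondents are a random subsample, so their mean stays close to the complete-data mean. Under MAR the overall observed mean shifts because the two groups respond at different rates, but the shift can be explained by the observed ethnicity variable. Under NMAR the shift is driven by the unreported incomes themselves, which is why the observed data alone cannot correct it.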
I demonstrate and compare various imputation methods using a data set from the EEOC which contains reports on the numbers of employees by gender, ethnicity and occupational category for all businesses meeting size and other criteria. In the first scenario, I identify companies which responded in 2002 but not in 2003 and treat these companies as nonrespondents. I impute their values (the numbers of employees) by mean imputation. In the second scenario, I artificially set some of the data values in the 2003 dataset (the numbers of minority employees) to missing. Then I impute their values by mean imputation and multiple imputation.
This thesis is organized as follows. 
The second chapter presents several single imputation methods and deletion methods. Mean imputation, regression imputation, hot deck imputation, listwise deletion and pairwise deletion are outlined. Their advantages and disadvantages are detailed.
The third chapter deals with multiple imputation. First, I list its advantages and assumptions. Next, I point out its main concepts and key features. I also give a general idea of how multiple imputation works.
The fourth chapter focuses on an artificial data simulation meant to resemble the EEOC data. At each Monte Carlo replication, I create the variables $y$, the number of minority employees; $N$, the total number of employees; and $M$, the missingness indicator. The resulting dataset contains 100 observations of each variable. I apply various imputation methods to this dataset.
The fifth chapter explains the structure of the EEOC data and provides some baseline information about the dataset. It includes a discussion of missing data mechanisms for the EEOC data. The method the EEOC currently uses for dealing with nonrespondents is briefly mentioned, and the listwise deletion method is discussed by way of contrast.
The sixth chapter is devoted to simulations based on the EEOC data. The number of minority employees in each job category is made missing and then imputed by mean imputation and multiple imputation. I use a subset of the data based on SIC codes because of limitations in computing power and time. The results are analyzed and discussed.
The seventh chapter reviews the essence of the missing data problem, the methods for dealing with it and the findings of the analyses. Suggestions for improving EEOC practices are proposed.