Some Estimation Methods and Their Assessment in Multilevel Models: A Review
Yahia S El-Horbaty* and Eman M Hanafy
Department of Mathematics, Insurance and Applied Statistics, Helwan University, Egypt
Submission: January 27, 2018; Published: February 27, 2018
*Corresponding author: Yahia S El-Horbaty, Department of Mathematics, Insurance, and Applied Statistics, Helwan University, Egypt; Email: yahia_mohamed@commerce.helwan.edu.egHow to cite this article: Yahia S El-Horbaty, Eman M Hanafy. Some Estimation Methods and Their Assessment in Multilevel Models: A Review. Biostat Biometrics Open Acc J. 2018; 5(3): 555662.DOI:10.19080/BBOAJ.2018.05.555662
Abstract
Multilevel linear regression models represent a generalization of linear models in which the regression coefficients are themselves given a model whose parameters are also estimated from the data. This paper reviews multilevel random coefficients regression models with a focus on the estimation problem and its assessment. Parameter estimation for the fixed effects and the variance components are highlighted. In addition, comparisons that are made in the literature to choose among the competing methods are highlighted. This is particularly emphasized when some of the assumptions underlying the estimation methods are violated.
Keywords: Multilevel models; Parametric estimation; Robust standard errors
Abbreviations: GLS: Generalized Least Squares; MINQUE: Minimum Norm Quadratic Estimation; ML: Maximum Likelihood; REML: Restricted Maximum Likelihood; IGLS: Iterative Generalized Least Squares; EM: Expectation Maximization; OLS: Ordinary Least Squares; ARE: Asymptotic Relative Efficiency
Introduction
Multilevel modeling is an approach that can be used to handle clustered or grouped data. Social sciences often involve problems that investigate the relationship between individual and society. The general concept is that individuals and the social groups are conceptualized as a hierarchical system, where the individuals and groups are defined at separate levels of this hierarchical system. There are many types of multilevel models, which differ in terms of the number of levels, type of design (random intercept, random slopes and random coefficients regression model), scale of the outcome variable (continuous, categorical), and number of outcomes (univariate, multivariate). In this article, the two-level random coefficients regression model is represented, letting the univariate outcome variable to be continuous.
The parameters to be estimated in multilevel regression models are the fixed coefficients that represent the fixed part of the model. The parameters also include the variance covariance matrix of the random coefficients and the variances of the residual errors that represent the random part of the model. If the components of the random part in the model were known, the unknown fixed coefficients could be estimated using generalized least squares (GLS) estimation method [1]. Similarly, if the fixed coefficients were known, then the unknown variance components of the model could also be estimated using GLS estimation method. If both are unknown, hence we have to follow different approaches.
In order to estimate the unknown parameters several procedures are discussed in Searle et al. [2]. These methods include the ANOVA method for balanced data which uses the expected mean squares approach. However, this method is difficult to deal with under unbalanced data situations. Under the latter case, Rao [3] proposed the minimum norm quadratic estimation (MINQUE) method for estimating the variance parameters that produces quadratic unbiased estimators with minimum norm (MINQUE). Maximum likelihood method (full maximum likelihood (ML) and restricted maximum likelihood (REML)), iterative generalized least squares (IGLS) and the expectation maximization (EM) estimation method represent standard methods that are used under both balanced and unbalanced data. These methods involve iterative procedures and thus the resulting estimators are not necessarily expressed in a closed form.
The commonly used methods to estimate the multilevel model are ML and REML [4]. The ML estimators of the variance components do not correct for the degrees of freedom lost due to the estimation of the fixed effects. As a result, the estimates of the variance components are generally too small. REML estimation corrects for the uncertainty about the fixed effects in estimating the variance components, which is especially useful if the number of the second-level units (e.g. groups) is relatively small [4,5]. There are other methods used to estimate the unknown parameters of the multilevel model. An iterative two-stage procedure is the IGLS [6]. IGLS usually starts with an ordinary least squares (OLS) estimate of the fixed parameters, which is then used to estimate the random part of the model. Next, the estimate of the covariance matrix of the model is used to make an improved estimate of the fixed part, which in turn is used to improve the estimate of that covariance matrix. The IGLS method alternates between estimating the fixed and the variance components of the model until convergence is reached. As far as the maximum likelihood approach is concerned, the normality of the error distributions at each level is introduced. Goldstein [7] proved that under normality of all the error terms the parameter estimates resulting from IGLS procedure are maximum likelihood estimates.
Another method of estimation is based on the EM algorithm that developed in Dempster et al. and is applied in Raudenbush & Bryk to models covering multilevel random coefficients regression models. The EM algorithm addresses the problem of maximizing the likelihood by conceiving it as a problem in missing data where, similar to the IGLS method, its procedure is based on an iterative procedure. The aforementioned parametric estimation methods provide approximately the same results for the parameter estimates and their standard errors under standard assumptions and large sample size Goldstein [7], Raudenbush & Bryk and Gumedze & Dunne. One of the essential assumptions of the tests of significance used in multilevel programs is normality of the error distributions involved. Mass & Hox [4] highlight the fact that the aforementioned estimation methods are asymptotically efficient. The authors suggest a number of groups of 50 groups to achieve accurate estimates of the standard errors of the fixed part estimates. This suggestion is based on an earlier conclusion by simulations provided by Van der Leeden and Busing and Van der Leeden et al. in addition they suggest that when assumptions of normality and large samples are not met, the standard errors have a small downward bias. On the other hand, in dealing with outliers Longford and Lewis offered practical procedures for dealing with outliers in multilevel data then he assesses the effect of them on the model estimators using different parametric and robust parametric estimation methods.
Importantly, different measures of assessment are presented that can be used to investigate the accuracy of the parameter estimates, variances and their standard errors in multilevel regression models and compare different estimation methods. Those include the percentage relative bias, the coverage probability and the asymptotic relative efficiency. Accordingly, the rest of this article is organized as follows. Section 2 describes the two-level model. Some basic parametric and robust parametric estimation methods are reviewed in Section 3. The measures of assessment are highlighted in Section 4 and the conclusions are set in Section 5.
Model and notations
Assume we have data from m groups, with a different number of responses n. (where j=1,2,,...,m) in each group. Let the data be obtained in the form of a vector of the outcome variable Yj a set of explanatory variables in Wj, and another set of group- level explanatory variables in Zj, To model these data, separate regression models for each level are accommodated. First, the individual-level model
Y=Wjαjc002B; εj
Where, Yj is a vector of length nj representing the response variable of group j,Wj is a design matrix of size nj x q, and εj is a vector of length nj representing the residuals on the same level where, εj ~N(o,σ2eInj). The group-level model is given by
α = Zjβ+δj
Where, Zj is q x p design matrix, β is an p x 1 vector of fixed effects, the random vector δ1 is of order q x 1 such that
is the corresponding symmetric covariance matrix. From [1] and [2] the combined model for group j is given by
Where, Xj = [wjZj]. By combining the data for all groups, we
Where, Y = (Y1T,...YmT)T, X = (X1T,...,XTm)T is the individual- level block diagonal matrix with Wj in the corresponding block,
with Ωg being the group-level covariance matrix. This implies that E (Y ) = Xβ and
In the next section, some estimation methods that are frequently used in practice and are also embedded in many software packages are highlighted. Focus is given to the estimation of both the model fixed effects and the variance components.
Estimation methods
Restricted maximum likelihood estimation: The first step of the analysis is to obtain estimates of the unknown fixed effects in β, as well as the estimates of the unknown variance components contained in ƍ and σ2e which are defined in the previous section. The common estimation methods for the multilevel regression model are ML and REML methods. ML estimators of the variance parameters are biased downwards, especially in small samples, since they do not take into account the degrees of freedom lost in the estimation of the fixed effects. The REML estimation method under model is presented as in the sequel. Unlike ML, REML maximizes the likelihood of the linearly independent error contrasts of linear combinations of the data Y, that are orthogonal to the explanatory design matrix. The linear combinations are chosen as ATY instead of Y such that
corresponding likelihood function to [6] can be expressed as
Differentiating the log-likelihood function [8] with respect to
Set equal to zero and solve it we have a REML estimator for the residual variance as
Where, that must be computed iteratively. By recalculate the new solutions of log-likelihood to obtain the new estimates of variance components and reformulate the new matrix g. The process continues until convergence. REML estimation includes no procedure for estimating fixed effects. However, it would seem to use maximum likelihood estimation of being the REML estimate of £ [Searle et al. [1992], p. 254]. One important assumption underlying REML estimation method is normality of the error distributions. When the residual errors are not normally distributed, the parameter estimates produced by the REML method are still asymptotically unbiased. However, the asymptotic standard errors are incorrect. Significance tests and confidence intervals can thus not be trusted (Goldstein, 2010]. This problem does not completely vanish when the sample size gets larger.
REML sandwich estimator
One method to obtain better tests and confidence intervals is to correct the asymptotic standard errors of the fixed coefficients using the sandwich estimator which makes them less dependent on the distributional assumptions [8,9]. The method yields asymptotically consistent covariance matrix estimates without making distributional assumptions even if the assumed model is incorrectly specified with this respect. The sandwich estimator is often called the robust covariance matrix estimator. In the ML approach, the usual estimator of sampling variances and covariance is the inverse of the information matrix. Using matrix notation, the asymptotic covariance matrix of the estimated coefficients under REML can be written as follows
Where, B is the information matrix [10]. Then the sandwich estimator is given as
where the covariance matrix VR is the robust covariance matrix of the regression coefficients, and R is the correction matrix that is based on the observed raw residuals. If the residuals follow a normal distribution, VA and VR are both consistent estimators of the covariances of the regression coefficients, but the model based asymptotic covariance matrix VA is more efficient because it leads to the smallest standard errors. However, when the residuals do not follow a normal distribution, the model-based asymptotic covariance matrix is incorrect, while the sandwich estimator of VR is still a consistent estimator of the covariances of the regression coefficients. This makes the inference based on the robust standard errors less dependent on the assumption of normality.
Iterative generalized least square estimation
The IGLS method estimates regression coefficients, variances and their random effects. Under the assumption of residuals have multivariate normal, the method constructs an additional linear model whose unknown parameters are the between-group covariances σJk that represent the covariance between the jth and the kth element of α (i.e. the correlations between first-level parameter estimates), j,k = 1,.., q and the within-group σ2e variance.
Iterative generalized least square
This method begins with estimating the components of Ω and σ2e in a linear modeling framework. We begin by expanding, re-expressing the combined covariance matrix Σ as a linear combination of the between-groups covariances σjk and the residual variances σ2e This allows a subsequent formulation of a general linear model with the unknown parameters σjkand σ2e being estimated. Let H be defined as a matrix that selects the element of the variance components such that H. j.k is a q x q indicator matrix which is 0 in every element except the (j, k )th cell. Then, Σ can be written as
For a pictorial representation of see an interesting example in Lindquist et al. [11]. The next step in IGLS is to formulate the linear model that estimates the variance components. This is achieved based on by vectorizing the lower triangular and diagonal elements of the matrix corresponding to each unknown variance component. This can be done using the vech operator, which when applied to a matrix stacks its columns after removing all supra-diagonal elements. Using this notation, the summands in can be written These become the regressors in the new design matrix X * (where * will be used to indicate the linear model for variance components, following the notation of Goldstein [6]). The response variable Y * in this new model is based on the residual covariance matrix U where,
Y* and X * are used to estimate the variance components in by using the following equation of the extra linear model
The IGLS procedure alternates between estimating β and β* until convergence. The specific steps are as follows A. Start with OLS estimates of the covariance, setting B. Estimate fixed effects. Use the current estimate of to calculate by the GLS solution:
C. Estimate variance components. Use the residuals to form
Y * and update estimates for the within and between-groups variance components, again using the GLS procedure. Following Goldstein [6], the covariance of the variance components can be shown to be This result holds if and only if the data are multivariate normal or if the sample variance matrix is Wish art distributed. Thus, the GLS solution is given by
If negative estimates are obtained for the variance components σjk,, j = k and σ2e they are truncated to 0. However, the covariance terms (σjk, j ≠ k) are allowed to be negative. Reform from 4. Repeat steps 2-3 until convergence.
Sandwich IGLS
As in the ML case, we can obtain IGLS sandwich estimators that correct the asymptotic standard errors of the fixed parameters. Consider first the fixed part of the model and the usual IGLS estimate of the fixed parameters based upon the random parameter estimates using. The covariance matrix of these estimates is:
Where, Cov(Y) = Σ and is unknown. If we replace the central covariance term Cov(Y) by the assumed model estimated value, we obtain the usual formula
The sandwich estimator is formed by replacing the estimate of central covariance term, Cov(Y) in [21] by a robust estimator based on U in, and then the sandwich estimator for the parameter standard errors is constructed as
Further details of the Huber-White correction for the multilevel model are given by Goldstein and Raudenbush & Bryk.
Expectation maximization
The EM algorithm addresses the problem of maximization likelihood as a problem of missing data.
EM algorithm: Each EM step requires sums of squares and product of the conditional means of the random effects [given the data and current parameter estimates] as well as the conditional variances of these random effects. Consider the combined model in [10]. Then, the alogritm begins with the M-step followed by the E-step as shown in the sequel.
M step: The EM algorithm conceives Yj as the observed data with δj as the missing data. Thus, the complete data set is (Yj,j). To estimate β, this is achieved by deducting Wj δj from both sides of [12], yielding
and justifying the OLS estimate as the complete-data ML estimate of pfor all groups. Thus, considering
The complete-data ML estimates of Ω and σ2e are similarly straightforward, where
E step: The CDSS are not observed, but they can be estimated by their conditional expression, given the data Yj and the parameter estimates from the previous iteration. Dempster et al. showed that substituting the expressed CDSS for the M-step formulas would produce new parameter estimates having higher likelihood than the current estimates. Consider multivariate normal distribution data in the form
Now having identified the CDSS needed for M step, the EM algorithm is readily specified as;
I. Estimate the CDSS data:
II.Substitute the estimated CDSS in to the M-step formulas to obtain new estimates of the parameters.
III. Feed the new parameter estimate into step 1.
IV. Continue until changes in the log-likelihood become sufficiently small or the largest change in the value of the parameters is sufficiently small.
Measures of assessment
Beside the review of some estimation methods, one may be interested in knowing which method may be more useful when the standard multilevel model assumptions are not met in the available data. In this section, we provide some notes about three commonly used measures by which one could assess the estimated parameters and compare their performance. This shall guide the practitioners to use the better method whenever their data reveal the need for special inferential procedures.
Percentage relative biast
The accuracy of the parameter estimates can be calculated using the mean relative bias. Let represent the estimate of the population parameter θ, then the percentage relative bias is given by
The parameter θ and its estimator can be any regression coefficient that we considered in and the choices among the ML estimator or the estimators in and for fixed effects as well as for the variance components under the corresponding estimation method. A better estimator could be checked by testing whether its relative bias differs from one. With α=0.001, the p-values of the estimated parameters are Bonferroni-corrected [12]. The Bonferroni correction issued to adjust hypothesis being tested and the confidence intervals by testing each individual hypothesis at a significance level of α / ñ, where a is the desired overall alpha level and ñ is the number of hypotheses. Mass & Hox [13] provided a simulation study to assess the accuracy of the parameter estimates using the percentage relative bias when the normality assumption of the error distribution is violated. The study showed that, the percentage relative bias is the same for the REML and robust REML, since it focuses on investigate the parameter estimates and not their standard errors. The study resulted in only one significant difference when the group- level residuals were generated from chi-square distribution as the worst condition where the number of groups was 30, group size equal 5 and interclass correlation coefficient equal 0.1. In a different context, Korendijk et al. [14] highlighted the effect of ignoring the group-level hetroscedasticity on the estimation of regression coefficients, the variances components, and their standard errors. The authors provided a simulation study using the percentage relative bias and coverage probability (see the next paragraph).The results showed that the individual-level regression coefficients, variances, and their standard errors are unbiased. However, the standard errors of group-level variances are underestimated.
Confidence intervals
To assess the accuracy of the standard errors, another useful measure is the coverage probabilities of estimation. For each parameter set a 95% confidence interval using the asymptotic standard normal distribution of the estimator. First, we obtain the actual coverage probability, which is counted as the proportion of the overall iterations in which the confidence interval covers the true values of true parameters. We are concerned with how far the actual coverage probability falls below the nominal level. Using the null hypothesis; H0: Actual Coverage probability = 1 -α the deviation is analyzed with a simple z-score, which is calculated based on normal approximation to the binomial distribution as follows
Where, c is the count that θis covered by the confidence interval, 1000 is the number of replicates in each condition and (1 -α) is the intended coverage probability. The downward with this method appear when it used for estimated variances of random effects since it doesn't take their asymptotic distributions into account. For this reason, we can depend on asymptotic relative efficiency to assess the accuracy of the variance components in multilevel models. In a comparison between the REML as a parametric method and the robust REML as an alternative robust parametric methods, Mass and Hox [13] introduced a simulation study using coverage probability for both fixed and random parameters in the model to evaluate the accuracy of the parameter estimates and their standard errors when the assumptions of normality and large samples are not met. A variety of factors were considered including the number of groups, group size, interclass correlation coefficient and the group-level residuals generated from three non-normal (uniform, Laplace and chi-square)distributions. The study shows that under the violation of normality assumption at the group-level residuals, there is no effect of such factors on the parameter estimates. On the other hand, for the random part in the model the standard errors of variances at the group-level are highly inaccurate and thus the robust REML do better than REML especially in large number of groups.
In dealing with the problem of determining the effect of different sample sizes at the group-level on the accuracy of the estimates in multilevel regression models, Mass & Hox [15] provided a simulation study based on the aforementioned coverage probability measure to determine the effect of number of number of groups, group sizes and interclass correlation coefficients on the parameter estimates and their standard errors. The results show that having small number of groups leads to biased estimates of the group-level standard errors of the regression coefficients. However, for the remaining factors the estimates of the parameters (regression coefficients and variance components) and their standard errors are unbiased and yet accurate.
Asymptotic relative efficiency
To assess the performance of the variance components, the asymptotic relative efficiency (ARE) is defined as one of the principal comparison measures. The ARE of two procedures is the ratio of their efficiencies measures where efficiencies are often defined using the asymptotic mean squared error as one of the measures of desirability. In order to study the sample properties of two estimators under a simulation study, the ARE can be calculated according to the following steps:
o For each case study we generate, say 1000, simulated data sets.
o Let θ be the target population parameter and denote the value of the estimator for the first method in the ith simulation, and h = 1,2.
o Denote the MSE for the first estimation method, that,
Where, the MSEh is a measure of the efficiency of the hth estimator and values closer to zero are more preferred. o Repeat the previous steps for each method of estimation and compare between each method using the ARE criterion
Based on Mckean et al. [16] investigate the robust approach (weighted rank based analyses of linear models) introduced by Hettmansperger & Mckean [17]. The study compared the ARE of the weighted Wilcoxon and OLS estimators in terms of their asymptotic variances and concluded that under normal errors, the Wilcoxon estimator is 95% as efficient as the OLS method. Thus, there is only a 5% loss in efficiency if the Wilcoxon estimator is used and the error distribution is actually normal. However, the ARE is usually greater than 1 if the true distribution has tails heavier than those under a normal distribution (e.g. contaminated normal distribution or the data corrupted by outliers).
Under linear mixed models, which represent the general case of multilevel regression models, Kloke et al. [18] extended the robust approach introduced by Hettmansperger and Mckean [16] to linear mixed models with covariates using general score functions. The authors compared between the OLS and weighted Wilcoxon using the ARE as an assessment measure and concluded that the robust analysis is more effective than OLS in the presence of outliers or underlying error distributions with heavy tails. Besides, Mckean & Kloke [19] proposed a family of optimal score functions under contaminated normal distributions of the error terms in both linear and nonlinear models [20-25]. In this study they compared Wilcoxon, OLS, and ML in terms of their asymptotic variances by using the ARE to conclude about the efficiency and validity of Wilcoxon over skewed-normal and contaminated normal distributions [26-29].
Conclusion
In this article, some estimation methods for multilevel models are reviewed. In addition, some measures that assess the behavior, reliability, and efficiency of those estimators are highlighted [30-32]. One important remark is that the efficiencies of the variance components estimators that are produced under each estimation method may not be easily compared using the well-known coverage probability measure. The ARE becomes a useful candidate in this case. A future research point of assessing the behavior of the fixed effects and the variance components under multilevel models with more than two levels are complicated covariance structures is needed to choose among the competing estimation methods and their robust versions. The same recommendation shall apply to linear mixed models.
References
- Snijders TAB, Bosker RJ (1999) An Introduction to basic and advanced multilevel modeling. London.
- Searle SR, Casella G, McCulloch CE (1992) Variance Components, JohnWiley and Sons, New York, USA.
- Rao CR (1971) Estimation of variance and covariance components -MINQUE theory. J Multivariate Anal 1: 257-275.
- Hox JJ (2002) Multilevel analysis: Techniques and applications. Mahwah, NJ: Erlbaum.
- Bryk AS, Raudenbush SW (1992) Hierarchical Linear Models. Sage, Newbury Park, CA.
- Goldstein H (1986) Multilevel Mixed Linear Model Analysis Using Iterative Generalized Least-Squares. Biometrika 73: 43-56.
- Goldstein H (1995) Multilevel Statistical Models .London: Institute of Education.
- Huber P J (1967) The Behavior of Maximum Likelihood Estimates Under Non-Standard Conditions, University of California Press, Berkely, 221-233.
- White H (1982) Maximum likelihood estimation of misspecified models, Econometrics. 50: 1-50.
- Eliason SR (1993) Maximum Likelihood Estimation: Logic and Pract, SAGE University Papers.
- Lindquist MA, Spicer J, Asllani I, Wager TD (2012) Estimating and Testing Variance Components in a Multilevel GIM, Neurolmage 59: 490-501.
- Dunn, Olive Jean (1961) Multiple Comparisons Among Means. Journal of the American Statistical Association. 56 (293): 52-64
- Mass J MC, Hox JJ (2003) The Influence of Violations of Assumptions on Multilevel Parameter Estimates and Their Standard Errors. Computational statistics & Data analysis 46: 427-440.
- Korendijk EJH, Mass JMC, Mirjam Moerbeek, Van der Heijden PGM (2008) The Influence of Misspecification of the Hetroscedasicity on Multilevel Regression Parameters and Standard Error Estimates, Methodology4(1): 67-72
- Mass JMC, Hox JJ (2005) Sufficient Sample Sizes for Multilevel Modeling. Methodology 1(3): 86-92.
- Mckean JW, Naranjo JD, Sheather SJ (1999) Diagnostics for comparing robust and least squares fits. Journal of Non-parametric Statistics 11: 161-188.
- Hettmansperger TP, Mckean JW (1978) A Robust Analysis of the General Linear Models Based on One Step R-Estimate. Biometrika 65: 571-579.
- Kloke JD, Mckean JW, Rashid M (2009) Rank-Based Estimation and Associated Inferences for Linear Models with Cluster Correlated Errors. Journal of the American Statistical Association 104: 384-390.
- Mckean JW, Kloke JD (2014) Efficient and adaptive rank-based fits for linear models with skewed-normal errors, Journal of Statistical Distributions and Applications 1: 1-18.
- Anderson RL, Bancroft T A (1952) Statistical Theory in Research. McGraw-Hill, New York.
- Goldstein H (1988) A General Model for The Analysis of Multilevel Data. Biometrika 53(4): 455-467.
- Goldstein H (1989) Restricted Unbiased Iterative Generalized Least- Square Estimation. Biometrika, 76(3): 662- 663.
- Goldstein H, Rasbash J (1992) Efficient Computational Procedures for The Estimation of Parameters in Multilevel Models Based on Iterative Generalized Least Square. Computational Statistics & Data Analysis 13: 63- 71.
- Harville DA (1977) Maximum Likelihood Approaches to Variance Component Estimation and Related Problems. Journal of the American Statistical Association 72[358]: 320-338.
- Henderson CR [1950] Estimation of genetic parameters. Ann Math Stat 21: 309-310.
- Henderson CR [1953] Estimation of variance and covariance components. Biometrics 9: 226-252.
- Henderson CR, Kempthorne O, Searle SR, Von Krosig CN (1959) Estimation of environmental and genetic trends from records subject to culling. Biometrics 15: 192-218.
- Hettmansperger TP, Mckean JW (1998) Robust Nonparametric Statistical Methods, Arnold, London.
- Mckean J W (2004) Robust Analysis of Linear Models, Statistical Science 19: 562-570.
- Mckean JW, Terpstra JW, Kloke JD (2009) Computational rank-based statistics. Wiley Interdisciplinary Reviews: Computational statistics 1: 132-140.
- Patterson HD, Thompson WA (1971) Recovery of Inter-Block Information When Block Sizes are Unequal. Biometrica 58(3): 545554.
- Searle SR (1982) Matrix Algebra Useful for Statistics, JohnWiley and Sons, New York.