Some Estimation Methods and Their Assessment in Multilevel Models: A Review

Multilevel linear regression models represent a generalization of linear models in which the regression coefficients are themselves given a model whose parameters are also estimated from the data. This paper reviews multilevel random coefficients regression models with a focus on the estimation problem and its assessment. Parameter estimation for the fixed effects and the variance components are highlighted. In addition, comparisons that are made in the literature to choose among the competing methods are highlighted. This is particularly emphasized when some of the assumptions underlying the estimation methods are violated.


Introduction
Multilevel modeling is an approach that can be used to handle clustered or grouped data. Social sciences often involve problems that investigate the relationship between individual and society. The general concept is that individuals and the social groups are conceptualized as a hierarchical system, where the individuals and groups are defined at separate levels of this hierarchical system. There are many types of multilevel models, which differ in terms of the number of levels, type of design (random intercept, random slopes and random coefficients regression model), scale of the outcome variable (continuous, categorical), and number of outcomes (univariate, multivariate). In this article, the two-level random coefficients regression model is represented, letting the univariate outcome variable to be continuous.
The parameters to be estimated in multilevel regression models are the fixed coefficients that represent the fixed part of the model. The parameters also include the variance covariance matrix of the random coefficients and the variances of the residual errors that represent the random part of the model. If the components of the random part in the model were known, the unknown fixed coefficients could be estimated using generalized least squares (GLS) estimation method [1]. Similarly, if the fixed coefficients were known, then the unknown variance components of the model could also be estimated using GLS estimation method. If both are unknown, hence we have to follow different approaches.
In order to estimate the unknown parameters several procedures are discussed in Searle et al. [2]. These methods include the ANOVA method for balanced data which uses the expected mean squares approach. However, this method is difficult to deal with under unbalanced data situations. Under the latter case, Rao [3] proposed the minimum norm quadratic estimation (MINQUE) method for estimating the variance parameters that produces quadratic unbiased estimators with minimum norm (MINQUE). Maximum likelihood method (full maximum likelihood (ML) and restricted maximum likelihood (REML)), iterative generalized least squares (IGLS) and the expectation maximization (EM) estimation method represent standard methods that are used under both balanced and unbalanced data. These methods involve iterative procedures and thus the resulting estimators are not necessarily expressed in a closed form.
The commonly used methods to estimate the multilevel model are ML and REML [4]. The ML estimators of the variance components do not correct for the degrees of freedom lost due to the estimation of the fixed effects. As a result, the estimates of the variance components are generally too small. REML estimation corrects for the uncertainty about the fixed effects in estimating the variance components, which is especially useful if the number of the second-level units (e.g. groups) is relatively small [4,5]. There are other methods used to estimate the unknown parameters of the multilevel model. An iterative two-stage procedure is the IGLS [6]. IGLS usually starts with an ordinary least squares (OLS) estimate of the fixed parameters, which is then used to estimate the random part of the model. Next, the estimate of the covariance matrix of the model is used to make an improved estimate of the fixed part, which in turn is used to improve the estimate of that covariance matrix. The IGLS method alternates between estimating the fixed and the variance components of the model until convergence is reached. As far as the maximum likelihood approach is concerned, the normality of the error distributions at each level is introduced. Goldstein [7] proved that under normality of all the error terms the parameter estimates resulting from IGLS procedure are maximum likelihood estimates.
Another method of estimation is based on the EM algorithm that developed in Dempster et al. and is applied in Raudenbush & Bryk to models covering multilevel random coefficients regression models. The EM algorithm addresses the problem of maximizing the likelihood by conceiving it as a problem in missing data where, similar to the IGLS method, its procedure is based on an iterative procedure. The aforementioned parametric estimation methods provide approximately the same results for the parameter estimates and their standard errors under standard assumptions and large sample size Goldstein [7], Raudenbush & Bryk and Gumedze & Dunne. One of the essential assumptions of the tests of significance used in multilevel programs is normality of the error distributions involved. Mass & Hox [4] highlight the fact that the aforementioned estimation methods are asymptotically efficient. The authors suggest a number of groups of 50 groups to achieve accurate estimates of the standard errors of the fixed part estimates. This suggestion is based on an earlier conclusion by simulations provided by Van der Leeden and Busing and Van der Leeden et al. in addition they suggest that when assumptions of normality and large samples are not met, the standard errors have a small downward bias. On the other hand, in dealing with outliers Longford and Lewis offered practical procedures for dealing with outliers in multilevel data then he assesses the effect of them on the model estimators using different parametric and robust parametric estimation methods.
Importantly, different measures of assessment are presented that can be used to investigate the accuracy of the parameter estimates, variances and their standard errors in multilevel regression models and compare different estimation methods. Those include the percentage relative bias, the coverage probability and the asymptotic relative efficiency. Accordingly, the rest of this article is organized as follows. Section 2 describes the two-level model. Some basic parametric and robust parametric estimation methods are reviewed in Section 3. The measures of assessment are highlighted in Section 4 and the conclusions are set in Section 5.

Model and notations
Assume we have data from m groups, with a different number of responses n j (where j=1,2,,…,m) in each group. Let the data be obtained in the form of a vector of the outcome variableY j , a set of explanatory variables in W j , and another set of grouplevel explanatory variables in Z j . To model these data, separate regression models for each level are accommodated. First, the individual-level model The group-level model is given by is the corresponding symmetric covariance matrix. From [1] and [2] the combined model for group j is given by Where, Where, In the next section, some estimation methods that are frequently used in practice and are also embedded in many software packages are highlighted. Focus is given to the estimation of both the model fixed effects and the variance components.

Estimation methods
Restricted maximum likelihood estimation: The first step of the analysis is to obtain estimates of the unknown fixed Where, A is an idempotent matrix of rank , n p − and then the corresponding likelihood function to [6] can be expressed as then after some algebra, the loglikelihood function is given by: Differentiating the log-likelihood function [8] with respect to Set equal to zero and solve it we have a REML estimator for the residual variance as that must be computed iteratively. By recalculate the new solutions of log-likelihood to obtain the new estimates of variance components and reformulate the new matrix  .
∑ The process continues until convergence. REML estimation includes no procedure for estimating fixed effects. However, it would seem to use maximum being the REML estimate of Σ (Searle et al. (1992), p. 254). One important assumption underlying REML estimation method is normality of the error distributions. When the residual errors are not normally distributed, the parameter estimates produced by the REML method are still asymptotically unbiased. However, the asymptotic standard errors are incorrect. Significance tests and confidence intervals can thus not be trusted (Goldstein, 2010). This problem does not completely vanish when the sample size gets larger.

REML sandwich estimator
One method to obtain better tests and confidence intervals is to correct the asymptotic standard errors of the fixed coefficients using the sandwich estimator which makes them less dependent on the distributional assumptions [8,9]. The method yields asymptotically consistent covariance matrix estimates without making distributional assumptions even if the assumed model is incorrectly specified with this respect. The sandwich estimator is often called the robust covariance matrix estimator. In the ML approach, the usual estimator of sampling variances and covariance is the inverse of the information matrix. Using matrix notation, the asymptotic covariance matrix of the estimated coefficients under REML can be written as follows Where, Β is the information matrix [10]. Then the sandwich estimator is given as where the covariance matrix R V is the robust covariance matrix of the regression coefficients, and R is the correction matrix that is based on the observed raw residuals. If the residuals follow a normal distribution, A V and R V are both consistent estimators of the covariances of the regression coefficients, but the model based asymptotic covariance matrix A V is more efficient because it leads to the smallest standard errors. However, when the residuals do not follow a normal distribution, the model-based asymptotic covariance matrix is incorrect, while the sandwich estimator of R V is still a consistent estimator of the covariances of the regression coefficients. This makes the inference based on the robust standard errors less dependent on the assumption of normality.

Iterative generalized least square estimation
The IGLS method estimates regression coefficients, variances and their random effects. Under the assumption of residuals have multivariate normal, the method constructs an additional linear model whose unknown parameters are the betweengroup covariances jk σ that represent the covariance between For a pictorial representation of see an interesting example in Lindquist et al. [11]. The next step in IGLS is to formulate the linear model that estimates the variance components. This is achieved based on by vectorizing the lower triangular and diagonal elements of the matrix corresponding to each unknown variance component. This can be done using the vech operator, which when applied to a matrix stacks its columns after removing all supra-diagonal elements. Using this notation, the summands in can be written The IGLS procedure alternates between estimating β and * β until convergence. The specific steps are as follows

Sandwich IGLS
As in the ML case, we can obtain IGLS sandwich estimators that correct the asymptotic standard errors of the fixed parameters. Consider first the fixed part of the model and the usual IGLS estimate of the fixed parameters based upon the random parameter estimates using. The covariance matrix of these estimates is: Where,

( )
Cov Y = Σ and is unknown. If we replace the central covariance term ( ) Cov Y by the assumed model estimated value,  , Σ we obtain the usual formula

Biostatistics and Biometrics Open Access Journal
The sandwich estimator is formed by replacing the estimate of central covariance term, ( ) Cov Y in [21] by a robust estimator based on U in, and then the sandwich estimator for the parameter standard errors is constructed as Further details of the Huber-White correction for the multilevel model are given by Goldstein and Raudenbush & Bryk.

Expectation maximization
The EM algorithm addresses the problem of maximization likelihood as a problem of missing data.
EM algorithm: Each EM step requires sums of squares and product of the conditional means of the random effects (given the data and current parameter estimates) as well as the conditional variances of these random effects. Consider the combined model in [10]. Then, the alogritm begins with the M-step followed by the E-step as shown in the sequel.

M step:
The EM algorithm conceives j Y as the observed data with j δ as the missing data. Thus, the complete data set is To estimate , β this is achieved by deducting j j W δ from both sides of [12], yielding and justifying the OLS estimate as the complete-data ML estimate of βfor all groups. Thus, considering The complete-data ML estimates of Ω and 2 e σ are similarly straightforward, where And, Then, ( ) Now, having identified the CDSS needed for M step, the EM algorithm is readily specified as; I. Estimate the CDSS data: Substitute the estimated CDSS in to the M-step formulas to obtain new estimates of the parameters.
III. Feed the new parameter estimate into step 1.
IV. Continue until changes in the log-likelihood become sufficiently small or the largest change in the value of the parameters is sufficiently small.

Measures of assessment
Beside the review of some estimation methods, one may be interested in knowing which method may be more useful when the standard multilevel model assumptions are not met in the available data. In this section, we provide some notes about three commonly used measures by which one could assess the estimated parameters and compare their performance. This shall guide the practitioners to use the better method whenever their data reveal the need for special inferential procedures.

Percentage relative bias
The accuracy of the parameter estimates can be calculated using the mean relative bias. Let  θ represent the estimate of the population parameter θ , then the percentage relative bias is given by

Biostatistics and Biometrics Open Access Journal
The parameter θ and its estimator  θ can be any regression coefficient that we considered in and the choices among the ML estimator or the estimators in and for fixed effects as well as for the variance components under the corresponding estimation method. A better estimator could be checked by testing whether its relative bias differs from one. With α=0.001, the p-values of the estimated parameters are Bonferroni-corrected [12]. The Bonferroni correction issued to adjust hypothesis being tested and the confidence intervals by testing each individual hypothesis at a significance level of / n α  , where α is the desired overall alpha level and n  is the number of hypotheses. Mass & Hox [13] provided a simulation study to assess the accuracy of the parameter estimates using the percentage relative bias when the normality assumption of the error distribution is violated.
The study showed that, the percentage relative bias is the same for the REML and robust REML, since it focuses on investigate the parameter estimates and not their standard errors. The study resulted in only one significant difference when the grouplevel residuals were generated from chi-square distribution as the worst condition where the number of groups was 30, group size equal 5 and interclass correlation coefficient equal 0.1. In a different context, Korendijk et al. [14] highlighted the effect of ignoring the group-level hetroscedasticity on the estimation of regression coefficients, the variances components, and their standard errors. The authors provided a simulation study using the percentage relative bias and coverage probability (see the next paragraph).The results showed that the individual-level regression coefficients, variances, and their standard errors are unbiased. However, the standard errors of group-level variances are underestimated.

Confidence intervals
To assess the accuracy of the standard errors, another useful measure is the coverage probabilities of estimation. For each parameter set a 95% confidence interval using the asymptotic standard normal distribution of the estimator. First, we obtain the actual coverage probability, which is counted as the proportion of the overall iterations in which the confidence interval covers the true values of true parameters. We are concerned with how far the actual coverage probability falls below the nominal level. Using the null hypothesis; with this method appear when it used for estimated variances of random effects since it doesn't take their asymptotic distributions into account. For this reason, we can depend on asymptotic relative efficiency to assess the accuracy of the variance components in multilevel models. In a comparison between the REML as a parametric method and the robust REML as an alternative robust parametric methods, Mass and Hox [13] introduced a simulation study using coverage probability for both fixed and random parameters in the model to evaluate the accuracy of the parameter estimates and their standard errors when the assumptions of normality and large samples are not met. A variety of factors were considered including the number of groups, group size, interclass correlation coefficient and the group-level residuals generated from three non-normal (uniform, Laplace and chi-square)distributions. The study shows that under the violation of normality assumption at the group-level residuals, there is no effect of such factors on the parameter estimates. On the other hand, for the random part in the model the standard errors of variances at the group-level are highly inaccurate and thus the robust REML do better than REML especially in large number of groups.
In dealing with the problem of determining the effect of different sample sizes at the group-level on the accuracy of the estimates in multilevel regression models, Mass & Hox [15] provided a simulation study based on the aforementioned coverage probability measure to determine the effect of number of number of groups, group sizes and interclass correlation coefficients on the parameter estimates and their standard errors. The results show that having small number of groups leads to biased estimates of the group-level standard errors of the regression coefficients. However, for the remaining factors the estimates of the parameters (regression coefficients and variance components) and their standard errors are unbiased and yet accurate.

Asymptotic relative efficiency
To assess the performance of the variance components, the asymptotic relative efficiency (ARE) is defined as one of the principal comparison measures. The ARE of two procedures is the ratio of their efficiencies measures where efficiencies are often defined using the asymptotic mean squared error as one of the measures of desirability. In order to study the sample properties of two estimators under a simulation study, the ARE can be calculated according to the following steps:  [17]. The study compared the ARE of the weighted Wilcoxon and OLS estimators in terms of their asymptotic variances and concluded that under normal errors, the Wilcoxon estimator is 95% as efficient as the OLS method. Thus, there is only a 5% loss in efficiency if the Wilcoxon estimator is used and the error distribution is actually normal. However, the ARE is usually greater than 1 if the true distribution has tails heavier than those under a normal distribution (e.g. contaminated normal distribution or the data corrupted by outliers).
Under linear mixed models, which represent the general case of multilevel regression models, Kloke et al. [18] extended the robust approach introduced by Hettmansperger and Mckean [16] to linear mixed models with covariates using general score functions. The authors compared between the OLS and weighted Wilcoxon using the ARE as an assessment measure and concluded that the robust analysis is more effective than OLS in the presence of outliers or underlying error distributions with heavy tails. Besides, Mckean & Kloke [19] proposed a family of optimal score functions under contaminated normal distributions of the error terms in both linear and nonlinear models [20][21][22][23][24][25]. In this study they compared Wilcoxon, OLS, and ML in terms of their asymptotic variances by using the ARE to conclude about the efficiency and validity of Wilcoxon over skewed-normal and contaminated normal distributions [26][27][28][29].

Conclusion
In this article, some estimation methods for multilevel models are reviewed. In addition, some measures that assess the behavior, reliability, and efficiency of those estimators are highlighted [30][31][32]. One important remark is that the efficiencies of the variance components estimators that are produced under each estimation method may not be easily compared using the well-known coverage probability measure. The ARE becomes a useful candidate in this case. A future research point of assessing the behavior of the fixed effects and the variance components under multilevel models with more than two levels are complicated covariance structures is needed to choose among the competing estimation methods and their robust versions. The same recommendation shall apply to linear mixed models.