Preliminary tests of homogeneity: type I error rates under non-normality

Many statistical procedures use preliminary tests to improve the accuracy of the final inferences. Preliminary tests such as the Goldfeld-Quandt (GQ) and Levene-type tests are used to assess the assumption of equal population variances, with normality as the underlying distributional assumption. Such tests must be used with care, as the final inferences are conditional on the performance of these tests at the first stage. This study explores the size distortions of the GQ and Levene-type tests under non-normality. The results do not warrant the use of the GQ and Levene tests under non-normality, as the size distortions are as high as 88% and 48% for the respective statistics. The modified form of Levene's test (BF-test) retains its size properties, except for multimodal alternatives with relatively big outliers. This study therefore does not recommend the GQ and Levene tests for assessing the equality of population variances when the distribution is non-normal, and does not recommend the BF-test when the distribution is multimodal and contains relatively big outliers.


Introduction
Many statistical procedures use preliminary tests to improve the accuracy of the final inference. For example, in time series regression models, the Chow test, which is widely used to test for structural change in the Data Generating Process (DGP), employs the Goldfeld-Quandt (1965) test (GQ) as a preliminary test to assess the assumption of homogeneity of variances. The GQ test is usually applied prior to the Chow test, with normality as the underlying distributional assumption. Several other statistical procedures in the medical and social sciences, for example one-way ANOVA, use Levene's test and the Brown-Forsythe (BF) test as preliminary tests to assess the equality of population variances.
These preliminary tests must be used with care (Schucany & Ng, 2006), as the final inferences are conditional on the performance of these preliminary tests at the first stage (Gastwirth, Gel, & Miao, 2009). The GQ and Levene-type tests assume normality of the data when assessing the equality of population variances. Although some authors report that the modified Levene-type tests are robust to non-normality, this study reemphasizes the use of diagnostic tests for normality when validating inferences from regression models and from other statistical procedures that rely on GQ and Levene-type preliminary tests.
This study explores the impact of non-normality on the performance of the GQ and Levene-type tests. Since I plan to use numerical methods, the alternative (non-normal) space must be narrowed down to something sufficiently small to permit exploration by numerical methods. At the same time, the space should be large enough to provide a good approximation to the full space of alternatives; failing that, it should be large enough to approximate the distributions conventionally used in simulation studies to assess the performance of normality tests.

The Goldfeld-Quandt (GQ) Test (1965)
For this test, it is assumed that the observations can be divided into two groups in such a way that, under the hypothesis of homoscedasticity, the disturbance variances would be the same in the two groups, whereas under the alternative, the disturbance variances would differ systematically. The most favorable case for this would be the group-wise heteroskedastic model. In the suggested procedure, the observations are ranked on the explanatory variable x to which the disturbance variance is thought to be related, and the central c values are dropped, which separates the observations into those with high and low variances. The test is applied by dividing the sample into two groups with $n_1$ and $n_2$ observations such that $n_1 + n_2 = n - c$. To obtain statistically independent variance estimators, the regression is then estimated separately with the two sets of observations. The test statistic is
$$F[n_1 - K,\; n_2 - K] = \frac{e_1' e_1 / (n_1 - K)}{e_2' e_2 / (n_2 - K)},$$
where $e_1$ and $e_2$ are the residual vectors from the two regressions, $K$ is the number of estimated regression coefficients, and it is assumed that the disturbance variance is larger in the first sample. (If not, reverse the subscripts.) Under the null hypothesis of homoscedasticity, this statistic has an F distribution with $n_1 - K$ and $n_2 - K$ degrees of freedom. A value larger than the tabulated F critical value at the given level of significance leads to rejection of the null hypothesis.
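The procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the code used in this study: the function name `goldfeld_quandt`, the simple regression with an intercept and one slope (so K = 2), and the choice to treat the high-x group as the larger-variance sample are all assumptions made for the example. The argument `drop` plays the role of c.

```python
import numpy as np
from scipy import stats


def goldfeld_quandt(x, y, drop=0):
    """One-sided Goldfeld-Quandt test for a simple regression of y on x.

    Assumption for this sketch: the disturbance variance rises with x, so the
    high-x group is treated as the "first" (larger-variance) sample.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)                  # rank the observations on x
    x, y = x[order], y[order]

    n = len(x)
    n_low = (n - drop) // 2                # size of the low-x group
    n_high = n - drop - n_low              # size of the high-x group
    groups = {
        "low": (x[:n_low], y[:n_low]),
        "high": (x[n - n_high:], y[n - n_high:]),
    }

    K = 2                                  # intercept + slope (assumed model)
    s2 = {}
    for name, (xg, yg) in groups.items():
        X = np.column_stack([np.ones_like(xg), xg])
        beta, *_ = np.linalg.lstsq(X, yg, rcond=None)   # separate OLS fit
        resid = yg - X @ beta
        s2[name] = resid @ resid / (len(yg) - K)        # e'e / (n_g - K)

    F = s2["high"] / s2["low"]
    df = (n_high - K, n_low - K)
    p_value = stats.f.sf(F, *df)           # upper-tail F probability
    return F, p_value, df


# Illustrative data: disturbance standard deviation proportional to x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60) * x
print(goldfeld_quandt(x, y, drop=10))
```

The p-value relies on the F reference distribution, which is exactly the step that depends on normal disturbances.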

The Levene-type Tests
Levene-type tests are used to assess the underlying assumption of homogeneity of variances. Statistical procedures that typically assume equality of variances include analysis of variance (ANOVA) and t-tests. Levene's test (1960) and the Brown-Forsythe (1974) test are often used as preliminary tests to validate the inferences drawn from ANOVA and t-tests.
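ANOVA is used to assess whether k populations have a common mean µ. For this, k samples $x_1, x_2, \ldots, x_k$ of sizes $n_i$, with respective means $\bar{x}_i$ and variances $s_i^2$, $i = 1, \ldots, k$, are drawn from the k populations. To test the equality of means, the standard F-test assumes that the k populations have a common variance, $\sigma^2$. To test the homogeneity of variances assumption, Levene proposed the following statistic:
$$W = \frac{N - k}{k - 1} \cdot \frac{\sum_{i=1}^{k} n_i \left(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot}\right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Z_{ij} - \bar{Z}_{i\cdot}\right)^2},$$
where $Z_{ij} = |x_{ij} - \bar{x}_i|$ is the absolute deviation of the j-th observation in sample i from its sample mean, $\bar{Z}_{i\cdot}$ is the mean of the $Z_{ij}$ in sample i, $\bar{Z}_{\cdot\cdot}$ is the overall mean of the $Z_{ij}$, and $N = \sum_{i=1}^{k} n_i$.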
Levene's statistic is approximately F-distributed with k-1 and N-k degrees of freedom. The Brown-Forsythe (1974) test uses the median instead of the mean. The median-based Levene-type test is recommended in the literature because it is more robust than Levene's test to non-normality of the data.
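To make the difference between the two centerings concrete, the following minimal sketch computes the Levene-type statistic from absolute deviations about either the group mean (Levene) or the group median (Brown-Forsythe), using only the Python standard library. The function name `levene_statistic` and the small data set are hypothetical and chosen purely for illustration.

```python
from statistics import mean, median


def levene_statistic(samples, center="mean"):
    """Levene-type W statistic for k samples.

    center="mean" gives Levene's (1960) statistic; center="median" gives the
    Brown-Forsythe (1974) variant.
    """
    centre_fn = mean if center == "mean" else median
    # Absolute deviation of each observation from its own group centre
    z = [[abs(v - centre_fn(g)) for v in g] for g in samples]

    k = len(z)
    N = sum(len(g) for g in z)
    z_bar_i = [mean(g) for g in z]           # group means of the deviations
    z_bar = sum(sum(g) for g in z) / N       # overall mean of the deviations

    between = sum(len(g) * (zb - z_bar) ** 2 for g, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for g, zb in zip(z, z_bar_i) for v in g)
    return (N - k) / (k - 1) * between / within


# Hypothetical example: one large outlier in the first group
with_outlier = [[1, 2, 3, 4, 100], [1, 2, 3, 4, 5]]
print(levene_statistic(with_outlier, center="mean"))
print(levene_statistic(with_outlier, center="median"))
```

In this example the mean-based statistic is roughly 6.56 while the median-based one is roughly 0.98: the single outlier pulls the group mean toward itself and inflates every other deviation in that group, whereas the median stays put.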

Simulation Study & Type-I Error Rates
Monte Carlo simulations are conducted to compute the type-I error rates of the GQ and Levene-type tests. These type-I error rates are based on 100,000 samples from each of the selected distributions (Table 1), for equal and unequal sample sizes. Unequal sample sizes are chosen in 1:2, 1:3 and 1:4 ratios.
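A Monte Carlo experiment of this kind might be sketched as below. Several choices here are assumptions for illustration and are not taken from the study: the lognormal distribution stands in for a skewed, heavy-tailed alternative (the distributions of Table 1 are not reproduced here), the number of replications is cut to 2,000 to keep the run short, and the GQ test is represented by its simplest special case, the one-sided variance-ratio F test obtained when each group's regression contains only an intercept. The helper names `rejection_rates`, `draw_normal` and `draw_lognormal` are hypothetical.

```python
import numpy as np
from scipy import stats


def rejection_rates(draw, n1, n2, reps=2000, alpha=0.05, seed=0):
    """Estimate the size of three variance tests when H0 is true.

    draw(rng, n) returns n i.i.d. observations. Both groups come from the same
    distribution, so every rejection is a type-I error.
    """
    rng = np.random.default_rng(seed)
    f_crit = stats.f.ppf(1 - alpha, n1 - 1, n2 - 1)   # one-sided critical value
    hits = {"F-ratio": 0, "Levene": 0, "BF": 0}
    for _ in range(reps):
        a, b = draw(rng, n1), draw(rng, n2)
        # Variance-ratio F test (intercept-only special case of GQ)
        hits["F-ratio"] += int((np.var(a, ddof=1) / np.var(b, ddof=1)) > f_crit)
        # Levene's test: deviations about the group means
        hits["Levene"] += int(stats.levene(a, b, center="mean").pvalue < alpha)
        # Brown-Forsythe test: deviations about the group medians
        hits["BF"] += int(stats.levene(a, b, center="median").pvalue < alpha)
    return {name: count / reps for name, count in hits.items()}


def draw_normal(rng, n):
    return rng.normal(size=n)


def draw_lognormal(rng, n):
    # Stand-in for a skewed, heavy-tailed alternative (an assumption)
    return rng.lognormal(mean=0.0, sigma=1.0, size=n)


for label, draw in [("normal", draw_normal), ("lognormal", draw_lognormal)]:
    for n1, n2 in [(25, 25), (25, 50)]:    # equal sizes and a 1:2 ratio
        print(label, (n1, n2), rejection_rates(draw, n1, n2))
```

With 2,000 replications the standard error of an estimated rate near 5% is about half a percentage point, so only gross distortions are visible at this scale; the 100,000 replications used in the study give far tighter estimates.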

Performance of the GQ Test
In general, the GQ test performed poorly in terms of size when evaluated over the entire range of the selected alternative space, for all sample sizes (Tables 1 & 2). At the 5% level of significance, the size of the GQ test goes up to 88% against highly skewed and heavy-tailed alternatives, for both equal and unequal sample sizes. The size of the test falls below the nominal level when the alternative belongs to the symmetric short-tailed class of distributions. The persistent size distortions do not improve as the sample size increases (Fig. 1a & 1b). The size distortions lie between 10% and 20% only for those alternatives whose skewness and kurtosis are not far from the normal benchmark values of 0 and 3, respectively.
Size distortions increase as either skewness or kurtosis increases.

Performance of the Levene-type Tests
The performance of Levene's test is not satisfactory in comparison with its robust form (BF-test), which is based on the median instead of the arithmetic mean. The size of the test exceeds 10% when the alternative belongs to the group with skewness greater than one and kurtosis greater than five.
Maximum size distortion reaches as high as 48% for a sample size of 25 (Fig. 2a). There is a slight improvement in size distortions as the sample size increases (Fig. 2a & 2b). Mostly, the significant distortions occur against alternative distributions containing outliers, with high values of skewness and kurtosis.

Conclusion
Preliminary tests of homogeneity, such as the Goldfeld-Quandt (1965) and Levene-type tests, are used to assess the homogeneity of variances, which is an underlying assumption of many statistical procedures, including the Chow test and one-way ANOVA. These preliminary tests assume normality of the data when assessing the equality of population variances. Such preliminary tests should be used with care (Schucany & Ng, 2006), as the final inferences are conditional on the performance of these tests at the first stage. This study explores the impact of non-normality on the size of these tests. At the 5% level of significance, the size of the GQ test goes up to 88% against highly skewed and heavy-tailed alternatives, for both equal and unequal sample sizes (Tables 1 & 2). The size of the Levene test exceeds 10% when the alternative belongs to the group with skewness greater than one and kurtosis greater than five.
Maximum size distortion reaches as high as 48% for a sample size of 25 (Fig. 2a). The modified form of Levene's test (BF-test) retains its size properties; however, its use is not recommended when the distribution is multimodal and contains relatively big outliers.