Partial Variable Selection and Its Applications in Biostatistics
Jingwen Gu1, Ao Yuan1,2* and Ming T Tan1
1Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, USA
2Department of Epidemiology and Biostatistics Section, Rehabilitation Medicine, National Institutes of Health, USA
Submission: December 23, 2017; Published: April 18, 2018
*Corresponding author: Ao Yuan, Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington DC 20057, USA; Email: ay312@georgetown.edu
How to cite this article: Jingwen Gu, Ao Yuan, Ming T Tan. Partial Variable Selection and Its Applications in Biostatistics. Biostat Biometrics Open Acc J. 2018; 6(1): 555678. DOI: 10.19080/BBOAJ.2018.06.555678
Abstract
We propose and study a method for partial covariate selection, which selects only the covariate values that fall within their effective ranges. The coefficient estimates based on the resulting data are more interpretable in terms of the effective covariates. This is in contrast to existing methods of variable selection, in which variables are selected or deleted as a whole. To test the validity of the partial variable selection, we extend the Wilks theorem to handle this case. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data analysis as illustration.
Keywords: Covariate; Effective range; Partial variable selection; Linear model; Likelihood ratio test
Abbreviations: NIH: National Institutes of Health; UPDRS: Unified Parkinson's Disease Rating Scale
Introduction
Variable selection is a common practice in biostatistics, and there is a vast literature on this topic. Commonly used methods include the likelihood ratio test [1], AIC [2], BIC [3], the minimum description length [4,5], etc. Principal components analysis models linear combinations of the original covariates and reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some statistical significance measure. May et al. [6] addressed variable selection in artificial neural network models, Mehmood et al. [7] gave a review of variable selection with partial least squares models, Wang et al. [8] addressed variable selection in generalized additive partial linear models, and Liu et al. [9] addressed variable selection in semiparametric additive partial linear models.
The Lasso [10,11] and its variations [12,13] are used to select a few significant variables in the presence of a large number of covariates. However, existing methods only select whole variables to enter into or delete from the model, which may not be the most desirable in some biomedical practice. For example, in heart disease studies [14,15], there are more than ten risk factors identified by medical researchers in long-term investigations. With the existing variable selection methods, some of these risk factors would be deleted wholly from the investigation. This is not desirable, since risk factors are really risky only when they fall into certain risk ranges; deleting a whole variable in this case seems unreasonable, and a more sensible approach is to find the risk ranges of these variables and delete the non-risky ranges. In some other studies, some of the covariate values may be just random errors which do not contribute to the responses, and removing these covariate values will make the model interpretation more accurate. In this sense we select variables only where they fall within some range. To our knowledge, a method for partial variable selection has not appeared in the literature, and our goal here is to explore such a method. In the existing method of deleting whole variables, the validity of the selection can be justified using the Wilks result: under the null hypothesis of no effect of the deleted variables, twice the log-likelihood ratio is asymptotically chi-squared distributed. We extend the Wilks theorem to the case of partial variable deletion, and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to analyze a real data set as illustration.
The proposed method
The observed data are $(y_i, x_i)$ $(i = 1, \ldots, n)$, where $y_i$ is the response and $x_i \in \mathbb{R}^d$ is the covariate vector of the $i$th subject. Denote $y_n = (y_1, \ldots, y_n)'$ and $X_n = (x_1, \ldots, x_n)'$. Consider the linear model

$$y_n = X_n \beta + \epsilon_n, \qquad (1)$$
where $\beta = (\beta_1, \ldots, \beta_d)'$ is the vector of regression parameters and $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)'$ is the vector of random errors. Without loss of generality we consider the case where the $\epsilon_i$'s are iid, i.e. $\mathrm{Var}(\epsilon_n) = \sigma^2 I_n$, where $I_n$ is the $n$-dimensional identity matrix. When the $\epsilon_i$'s are not iid, it is often assumed that $\mathrm{Var}(\epsilon_n) = \Omega$ for some known positive-definite $\Omega$; then making the transformations $\tilde{y}_n = \Omega^{-1/2} y_n$, $\tilde{X}_n = \Omega^{-1/2} X_n$ and $\tilde{\epsilon}_n = \Omega^{-1/2} \epsilon_n$, we get the model $\tilde{y}_n = \tilde{X}_n \beta + \tilde{\epsilon}_n$, and the $\tilde{\epsilon}_i$'s are iid with $\mathrm{Var}(\tilde{\epsilon}_n) = I_n$. When $\Omega$ is unknown, it can be estimated in various ways. So below we only need to discuss the case where the $\epsilon_i$'s are iid.
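For concreteness, a minimal sketch of this whitening transformation in Python (numpy); the function name and the eigendecomposition route are our choices, and $\Omega$ is assumed known and positive definite:

```python
import numpy as np

def whiten(y, X, Omega):
    """Transform the model y = X beta + eps with Var(eps) = Omega
    into one with iid errors, by multiplying with Omega^{-1/2}."""
    vals, vecs = np.linalg.eigh(Omega)  # Omega is symmetric positive definite
    Omega_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Omega_inv_half @ y, Omega_inv_half @ X
```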
We first give a brief review of the existing method of variable selection. Assume $\epsilon = y - x'\beta$ has some known density $f(\cdot)$ (such as normal), possibly with some unknown parameter(s). For simplicity of discussion we assume there are no unknown parameters. Then the log-likelihood is

$$\ell_n(\beta) = \sum_{i=1}^n \log f(y_i - x_i'\beta).$$

Let $\hat\beta$ be the MLE of $\beta$ (when $f(\cdot)$ is the standard normal density, $\hat\beta$ is just the least squares estimate). If we delete $k$ $(k \le d)$ columns of $X_n$ and the corresponding components of $\beta$, denote the remaining covariate matrix as $X_n^-$, the resulting parameter as $\beta^-$, and the corresponding MLE as $\hat\beta^-$. Then under the hypothesis $H_0$: the deleted columns of $X_n$ have no effects, or equivalently the deleted components of $\beta$ are all zeros, asymptotically

$$\Lambda_n = 2\big[\ell_n(\hat\beta) - \ell_n(\hat\beta^-)\big] \stackrel{D}{\longrightarrow} \chi_k^2,$$

where $\chi_k^2$ is the chi-squared distribution with $k$ degrees of freedom. Let $\chi_k^2(1-\alpha)$ be the $(1-\alpha)$th upper quantile of the $\chi_k^2$ distribution. If $\Lambda_n > \chi_k^2(1-\alpha)$, then $H_0$ is rejected at significance level $\alpha$ and it is not good to delete these columns of $X_n$; otherwise we accept $H_0$ and delete these columns of $X_n$. There are other methods to select columns of $X_n$, such as AIC, BIC and their variants, as in the model selection field. In these methods, the optimal deletion of columns of $X_n$ corresponds to the best selected model, which maximizes the AIC or BIC. These methods are not as solid as the above one, as they may sometimes depend on eye inspection to choose the model which maximizes the AIC or BIC.
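As a quick illustration of this classical whole-column deletion test, here is a minimal sketch in Python (numpy/scipy), assuming Gaussian errors with known variance $\sigma^2 = 1$; the function names are ours:

```python
import numpy as np
from scipy import stats

def gaussian_loglik(y, X, beta, sigma2=1.0):
    """Gaussian log-likelihood of the linear model y = X beta + eps."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

def wilks_whole_deletion_test(y, X, drop_cols, alpha=0.05):
    """Twice the log-likelihood ratio for deleting whole columns of X,
    compared against the chi-squared(k) upper quantile."""
    keep = [j for j in range(X.shape[1]) if j not in set(drop_cols)]
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_red, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    lam = 2 * (gaussian_loglik(y, X, beta_full)
               - gaussian_loglik(y, X[:, keep], beta_red))
    cutoff = stats.chi2.ppf(1 - alpha, df=len(drop_cols))
    return lam, cutoff, lam > cutoff   # True: reject H0, do not delete
```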
All the above methods require the models under consideration to be nested within each other, i.e., one is a sub-model of the other. Another more general model selection criterion is the minimum description length (MDL) criterion, a measure of complexity, developed by Kolmogorov [4], Wallace and Boulton, and others. The Kolmogorov complexity has a close relationship with the entropy: it is the output of a Markov information source, normalized by the length of the output, and it converges almost surely (as the length of the output goes to infinity) to the entropy of the source. Let $\mathcal{G} = \{g(\cdot,\cdot)\}$ be a finite set of candidate models under consideration, and $\Theta = \{\theta_j : j = 1, \ldots, h\}$ be the set of parameters of interest. $\theta_i$ may or may not be nested within some other $\theta_j$, or $\theta_i$ and $\theta_j$, both in $\Theta$, may have the same dimension but with different parametrizations. Next consider a fixed density $f(\cdot|\theta_j)$ with parameter $\theta_j$ running through a subset of $\Theta$. To emphasize the index of the parameter, we denote the MLE of $\theta_j$ under model $f(\cdot|\cdot)$ by $\hat\theta_{j,n}$ (instead of $\hat\theta_j$, to emphasize the dependence on the sample size), by $I(\theta_j)$ the Fisher information under $\theta_j$, by $|I(\theta_j)|$ its determinant, and by $d_j$ the dimension of $\theta_j$. Then the MDL criterion [16] chooses the $\theta_j$ minimizing

$$\mathrm{MDL}(j) = -\log f(y_n \mid \hat\theta_{j,n}) + \frac{d_j}{2}\log\frac{n}{2\pi} + \log \int \sqrt{|I(\theta_j)|}\, d\theta_j.$$
This method does not require the models to be nested, but it still requires selecting or deleting whole columns, and so does not apply to our case.
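Since the Fisher-information integral above is often hard to evaluate, a common simplification keeps only the first two terms of the criterion. A sketch of such a reduced MDL score for a Gaussian linear model, under the same known-variance assumption as before (the omitted integral term is a stated simplification, not part of the paper's criterion):

```python
import numpy as np

def mdl_score(y, X, sigma2=1.0):
    """First two terms of the MDL criterion for a Gaussian linear model:
    negative maximized log-likelihood plus (d/2) log(n / (2 pi)).
    The Fisher-information integral term is omitted for simplicity."""
    n, d = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    neg_loglik = 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2
    return neg_loglik + 0.5 * d * np.log(n / (2 * np.pi))
```

The candidate model with the smallest score is preferred; unlike the likelihood ratio test, the candidates need not be nested.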
Now we come to our question, which is non-standard; we are not aware of a formal method to address this problem. However, we think the following question is of practical meaning. Consider deleting some of the components within fixed $k$ $(k \le d)$ columns of $X_n$, where the deleted proportions for these columns are $\gamma_1, \ldots, \gamma_k$ $(0 < \gamma_j < 1)$. Denote by $X_n^-$ the remaining covariate matrix, which is $X_n$ with the corresponding entries deleted. After the partial deletion of covariates, the model becomes

$$y_n = X_n^- \beta^- + \epsilon_n.$$
Note that here $\beta$ and $\beta^-$ have the same dimension, as no covariate is completely deleted. $\beta$ represents the effects of the original covariates, while $\beta^-$ represents the effects of the covariates after some possible partial deletion; it gives the effects of the effective covariates. Thus, though $\beta$ and $\beta^-$ have the same structure, they have different interpretations. The problem can be formulated as testing the hypothesis:
$$H_0: \beta = \beta^- \quad \text{vs.} \quad H_1: \beta \neq \beta^-.$$
Note that, different from the standard null hypothesis that some components of the parameters are zeros, the above null hypothesis is not a nested hypothesis: $\beta^-$ is not a sub-vector of $\beta$, so the existing Wilks' theorem for the likelihood ratio statistic does not directly apply to our problem. Denote by $\ell_n^-(\beta)$ the corresponding log-likelihood based on the data $(y_n, X_n^-)$ and by $\hat\beta^-$ the corresponding MLE. Since after the partial deletion $\hat\beta^-$ is the MLE under a constrained log-likelihood, while $\hat\beta$ is the MLE under the full likelihood, we have $\ell_n(\hat\beta) \ge \ell_n^-(\hat\beta^-)$. Parallel to the log-likelihood ratio statistic for (whole) variable deletion, let, for our case,

$$E_n = 2\big[\ell_n(\hat\beta) - \ell_n^-(\hat\beta^-)\big].$$
Let $(j_1, \ldots, j_k)$ be the columns with partial deletions, $C_{j_r} = \{i : x_{j_r,i} \text{ is deleted},\ 1 \le i \le n\}$ be the index set of the deleted covariates in the $j_r$th column, and $m_{j_r} = |C_{j_r}|$ be the cardinality of $C_{j_r}$, so that $\gamma_{j_r} = m_{j_r}/n$ $(r = 1, \ldots, k)$. We first give Theorem 1 below for the simple case in which the index sets $C_{j_r}$ are mutually exclusive. Then in Corollary 1 we give the result for the more general case in which the index sets $C_{j_r}$ need not be mutually exclusive. For given $X_n$ there are many different ways of partial column deletion, and we may use Theorem 1 to test each of these deletions. Given a significance level $\alpha$, a deletion is valid at level $\alpha$ if $E_n < \chi_\gamma^2(1-\alpha)$, where $\chi_\gamma^2(1-\alpha)$ is the $(1-\alpha)$ upper quantile of the limiting distribution of $E_n$, which can be computed by simulation for given $\gamma_1, \ldots, \gamma_k$ (see the sketch after Example 1 below).
The following Theorem 1 is a generalization of the Wilks [1] theorem. Deleting some whole columns of $X_n$ corresponds to $\gamma_j = 1$ $(j = 1, \ldots, k)$ in the theorem, and then we recover the existing Wilks' theorem.
Theorem 1
Under $H_0$, suppose $C_{j_r} \cap C_{j_s} = \emptyset$ (the empty set) for all $1 \le r \neq s \le k$; then we have

$$E_n \stackrel{D}{\longrightarrow} \sum_{r=1}^k \gamma_{j_r}\, \chi_r^2,$$

where $\chi_1^2, \ldots, \chi_k^2$ are iid chi-squared random variables with 1 degree of freedom. The case where the $C_{j_r}$ are not mutually exclusive is a bit more complicated. We first re-write the sets $C_{j_r}$ as unions of disjoint pieces,

$$C_{j_r} = D_{j_r} \cup \Big(\bigcup_{s \ne r} D_{j_r, j_s}\Big) \cup \Big(\bigcup_{s < t,\ s,t \ne r} D_{j_r, j_s, j_t}\Big) \cup \cdots,$$
where the $D$'s are mutually exclusive: $D_{j_1}, \ldots, D_{j_k}$ are index sets for one column of $X_n$ only; the $D_{j_1,j_2}$'s are index sets common to columns $j_1$ and $j_2$ only; the $D_{j_1,j_2,j_3}$'s are index sets common to columns $j_1$, $j_2$ and $j_3$ only; and so on. Generally some of the $D_{j_1,\ldots,j_r}$'s are empty sets. Let $m_{j_1,\ldots,j_r} = |D_{j_1,\ldots,j_r}|$ be the cardinality of $D_{j_1,\ldots,j_r}$ and $\gamma_{j_1,\ldots,j_r} = m_{j_1,\ldots,j_r}/n$ $(r = 1, \ldots, k)$. By examining the proof of Theorem 1, we get the following corollary, which gives the result in the more general case.

Corollary 1: Under $H_0$, we have

$$E_n \stackrel{D}{\longrightarrow} \sum_{r=1}^k \sum_{j_1 < \cdots < j_r} \gamma_{j_1,\ldots,j_r}\, \chi_{j_1,\ldots,j_r}^2,$$

where the $\chi_{j_1,\ldots,j_r}^2$'s are all independent chi-squared random variables with $r$ degrees of freedom $(r = 1, \ldots, k)$.
Below we give two examples to illustrate the use of Theorem 1 and Corollary 1.
Example 1: $n = 1000$, $d = 5$, $k = 3$. Columns (1, 2, 4) have partial deletions with $C_1 = \{201, 202, \ldots, 300\}$, $C_2 = \{351, 352, \ldots, 550\}$, $C_3 = \{601, 602, \ldots, 850\}$; the $C_j$'s have no overlap, and $\gamma_1 = 1/10$, $\gamma_2 = 1/5$, $\gamma_3 = 1/4$. So by Theorem 1, under $H_0$ we have

$$E_n \stackrel{D}{\longrightarrow} \tfrac{1}{10}\chi_1^2 + \tfrac{1}{5}\chi_2^2 + \tfrac{1}{4}\chi_3^2,$$

where all the chi-squared random variables are independent, each with 1 degree of freedom.
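The cutoff $\chi_\gamma^2(1-\alpha)$ for such a mixture has no closed form, but as noted above it is easily simulated; a minimal sketch (the function name is ours):

```python
import numpy as np

def mixture_chi2_quantile(weights, dfs, alpha=0.05, n_sim=200_000, seed=0):
    """Simulate the (1 - alpha) upper quantile of sum_r w_r * chi2(df_r),
    the limiting null distribution of E_n in Theorem 1 / Corollary 1."""
    rng = np.random.default_rng(seed)
    draws = sum(w * rng.chisquare(df, size=n_sim)
                for w, df in zip(weights, dfs))
    return np.quantile(draws, 1 - alpha)

# Example 1: weights (1/10, 1/5, 1/4), each component with 1 degree of freedom.
cutoff = mixture_chi2_quantile([0.1, 0.2, 0.25], [1, 1, 1])
```

For Example 2 below, the same function applies with weights $(1/5, 1/20, 1/10, 1/20, 1/20, 1/20)$ and degrees of freedom $(1, 1, 1, 2, 2, 3)$.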
Example 2: $n = 1000$, $d = 5$, $k = 3$. Columns (1, 2, 4) have partial deletions with $C_1 = \{101, \ldots, 300\} \cup \{651, \ldots, 750\}$, $C_2 = \{201, \ldots, 350\}$, $C_3 = \{251, \ldots, 300\} \cup \{701, \ldots, 800\}$. In this case the $C_j$'s have overlaps, so Theorem 1 cannot be used directly and we use Corollary 1. Then $D_1 = \{101, \ldots, 200\}$, $D_2 = \{301, \ldots, 350\}$, $D_3 = \{701, \ldots, 800\}$, $D_{1,2} = \{201, \ldots, 250\}$, $D_{1,3} = \{701, \ldots, 750\}$, $D_{1,2,3} = \{251, \ldots, 300\}$, with $\gamma_1 = 1/5$, $\gamma_2 = 1/20$, $\gamma_3 = 1/10$, $\gamma_{1,2} = 1/20$, $\gamma_{1,3} = 1/20$, $\gamma_{2,3} = 0$, $\gamma_{1,2,3} = 1/20$. So by the Corollary, under $H_0$ we have

$$E_n \stackrel{D}{\longrightarrow} \tfrac{1}{5}\chi_1^2 + \tfrac{1}{20}\chi_2^2 + \tfrac{1}{10}\chi_3^2 + \tfrac{1}{20}\chi_{1,2}^2 + \tfrac{1}{20}\chi_{1,3}^2 + \tfrac{1}{20}\chi_{1,2,3}^2,$$

where all the chi-squared random variables are independent, with $\chi_1^2$, $\chi_2^2$ and $\chi_3^2$ each of 1 degree of freedom, $\chi_{1,2}^2$ and $\chi_{1,3}^2$ each of 2 degrees of freedom, and $\chi_{1,2,3}^2$ of 3 degrees of freedom. Next, we discuss the consistency of the estimate $\hat\beta^-$ under the null hypothesis $H_0$. Let $x^- = x^{(j_r)}$ with probability $\gamma_{j_r}$ $(r = 0, 1, \ldots, k)$, where $x^{(j_r)}$ is an i.i.d. copy of $x$ with the $j_r$th component deleted, $x^{(j_0)} = x$, and $\gamma_{j_0} = 1 - \sum_{r=1}^k \gamma_{j_r}$.
Theorem 2

Under the conditions of Theorem 1, $\hat\beta^- \to \beta^-$ almost surely, where $\beta^-$ is determined by the distribution of the partially deleted covariate $x^-$ defined above, in particular by its mean $\mu^- = E(x^-)$ and covariance matrix $E[(x^- - \mu^-)(x^- - \mu^-)']$.
To extend the results of Theorem 2 to the general case, we need some more notation. Let $x^{(j_1,\ldots,j_r)}$ be an i.i.d. copy of $x$ whose components with indices in $\{j_1, \ldots, j_r\}$ are deleted, corresponding to the data in the set $D_{j_1,\ldots,j_r}$, and let $x^- = x^{(j_1,\ldots,j_r)}$ with probability $\gamma_{j_1,\ldots,j_r}$ $(r = 0, 1, \ldots, k)$.

Corollary 2: Under the conditions of Corollary 1, the results of Theorem 2 hold with $x^-$ given above.

Computationally, $E[(x^- - \mu^-)(x^- - \mu^-)']$ is well approximated by its empirical version $n^{-1}\sum_{i=1}^n (x_i^- - \bar{x}^-)(x_i^- - \bar{x}^-)'$, where $\bar{x}^- = n^{-1}\sum_{i=1}^n x_i^-$.
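A sketch of this empirical approximation, assuming that deleted entries are represented by zeros in the partially deleted design (our convention here, chosen for illustration):

```python
import numpy as np

def empirical_cov_deleted(X, deleted_mask):
    """Empirical version of E[(x- - mu-)(x- - mu-)'], where deleted
    covariate entries (marked True in deleted_mask) are set to zero."""
    X_minus = np.where(deleted_mask, 0.0, X)   # partially deleted covariates
    Xc = X_minus - X_minus.mean(axis=0)        # center at the sample mean
    return Xc.T @ Xc / X.shape[0]
```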
Simulation study and application
Simulation study: We illustrate the proposed method with two examples, Example 3 and Example 4 below. The former rejects the null hypothesis $H_0$ while the latter accepts it. In each case we simulate $n = 1000$ i.i.d. data with responses $y_i$ and covariates $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5})'$. We first generate the covariates, sampling the $x_i$'s from the 5-dimensional normal distribution with mean vector $\mu = (3.1, 1.8, -0.5, 0.7, 1.5)'$ and a given covariance matrix $\Sigma$. Then we generate the response data: given the covariates, the $y_i$'s are generated as $y_i = x_i'\beta_0 + \epsilon_i$ $(i = 1, \ldots, n)$, with $\beta_0 = (0.42, 0.11, 0.65, 0.83, 0.72)'$ and the $\epsilon_i$'s i.i.d. $N(0,1)$. A hypothesis test is conducted to examine whether the partial deletion is valid or not. The significance level is set as $\alpha = 0.05$. The experiment is repeated 1000 times; Prop represents the proportion of replications with $E_n \le \chi_\gamma^2(1-\alpha)$, i.e., the proportion of times $H_0$ is accepted.
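A minimal end-to-end sketch of one such simulation in Python (numpy). The covariance matrix $\Sigma$ used in the paper is not shown, so the identity matrix is a stand-in; deleted entries are assumed to be zeroed in $X^-$, and the cutoff uses Theorem 1's mixture, which presumes disjoint deletion sets across columns:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
mu = np.array([3.1, 1.8, -0.5, 0.7, 1.5])
beta0 = np.array([0.42, 0.11, 0.65, 0.83, 0.72])
Sigma = np.eye(d)  # stand-in: the paper's covariance matrix is not reported

def acceptance_proportion(deleted_mask, alpha=0.05, n_rep=1000, n_sim=100_000):
    """Proportion of replications in which H0 (partial deletion valid)
    is accepted.  Assumes the deletion index sets of different columns
    are disjoint, so Theorem 1's mixture sum_r gamma_r chi2(1) applies."""
    gammas = deleted_mask.mean(axis=0)          # deleted fraction per column
    gammas = gammas[gammas > 0]
    null = sum(g * rng.chisquare(1, size=n_sim) for g in gammas)
    cutoff = np.quantile(null, 1 - alpha)
    accept = 0
    for _ in range(n_rep):
        X = rng.multivariate_normal(mu, Sigma, size=n)
        y = X @ beta0 + rng.standard_normal(n)
        X_minus = np.where(deleted_mask, 0.0, X)   # partially deleted design
        b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
        b_red, *_ = np.linalg.lstsq(X_minus, y, rcond=None)
        # with sigma^2 = 1, E_n = 2(l_full - l_reduced) reduces to the
        # difference of residual sums of squares
        En = np.sum((y - X_minus @ b_red) ** 2) - np.sum((y - X @ b_full) ** 2)
        accept += En <= cutoff
    return accept / n_rep
```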
Example 3: In this example, we are interested in whether the covariate values in the specified deletion ranges can be deleted. Five data sets with different $\beta_0$ values are simulated, with deletion proportions $\gamma = (\gamma_1, \ldots, \gamma_k)$; the results are shown in Table 1. We see that the proportions of accepting $H_0$ (Prop) are all smaller than 0.05 for the five sets of $\beta_0$. This suggests that these covariate values should not be deleted at the 0.05 significance level.

Example 4: In this example, the original $X$ is as in Example 3, but now we replace the entries in the first 100 rows and first three columns by $\epsilon$, where $\epsilon \sim N(0, 1/9)$. We are interested to see whether in this case these noises can be deleted, i.e., whether $H_0$ is rejected or not. The results are shown in the following. We see that the proportions of accepting $H_0$ (Prop) are all greater than 0.95 for the five sets of $\beta_0$. This suggests that the data provide strong evidence that the deleted values are noise and are not necessary to the data set, at the 0.05 significance level.

Application to real data problem
We analyze a data set from the Deprenyl and Tocopherol Antioxidative Therapy of Parkinsonism (DATATOP) study, obtained from the National Institutes of Health (NIH) [17]. It is a multicenter, placebo-controlled clinical trial that aimed to determine a treatment for early Parkinson's disease patients to prolong the time before levodopa therapy is required. The number of patients enrolled was 800. The selected subjects were untreated patients who had had Parkinson's disease (stage I or II) for less than five years and met the other eligibility criteria. They were randomly assigned according to a two-by-two factorial design to one of four treatment groups:
Placebo
Active tocopherol
Active deprenyl
Active deprenyl and tocopherol.
Observation continued for 14±6 months, with reevaluation every 3 months. At each visit, the Unified Parkinson's Disease Rating Scale (UPDRS), including its motor, mental and activities of daily living components, was evaluated. The statistical analysis was based on 800 subjects. The results revealed no beneficial effect of tocopherol. Deprenyl was found to significantly prolong the time until levodopa therapy was required, reducing the risk of disability by 50 percent according to the UPDRS measurements. Our goal is to examine whether some of the covariates can be partially deleted. The response variables to be examined are PDRS, TREMOR, S/E ADL by Rater, PIGD, Days from enrollment, and Days from enrollment to Need for LEVODOPA. The covariates are Age, Motor and ADL for all these responses [18]. The deleted covariate values are the ones below the $\gamma$th data quantile, with $\gamma = 0.01, 0.02, 0.03$ and $0.05$. We examine the responses one by one. The results are shown in Tables 2-5 below.
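A sketch of the deletion rule used here (values of a covariate below its $\gamma$th sample quantile are removed), again representing deleted entries by zeros; the function name is ours:

```python
import numpy as np

def delete_below_quantile(X, col, gamma):
    """Partially delete column `col`: zero out values below its
    gamma-th sample quantile; return the modified design and the
    boolean mask of deleted entries (needed to compute the cutoff)."""
    threshold = np.quantile(X[:, col], gamma)
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, col] = X[:, col] < threshold
    return np.where(mask, 0.0, X), mask
```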
In Table 3, the response TREMOR is examined. For the covariable Age, the likelihood ratio $E_n$ is larger than the cutoff point $\chi_\gamma^2(1-\alpha)$ at the 0.03 and 0.05 deletion proportions, suggesting that for Age, partial deletions with these proportions are not valid. For the covariable Motor, $E_n$ is smaller than the cutoff point $\chi_\gamma^2(1-\alpha)$ at the 0.05 and 0.1 proportions, so this covariable can be partially deleted at these proportions. For the covariable ADL, with deletion proportions 0.01-0.1, the likelihood ratio is smaller than $\chi_\gamma^2(1-\alpha)$, which suggests that the lower 1%-10% of its values can be deleted.

In Table 4, PIGD is the response variable. For the covariable Age, $E_n$ is larger than the cutoff point $\chi_\gamma^2(1-\alpha)$ at the 0.01, 0.02, 0.03 and 0.05 proportions, suggesting that it cannot be partially deleted with these proportions [19]. For the covariable Motor, $E_n$ is smaller than the cutoff point $\chi_\gamma^2(1-\alpha)$ at the deletion proportions of 0.02 and 0.03, suggesting that the lower 2% and 3% of its values can be deleted. For the variable ADL, $E_n$ is larger than the cutoff point $\chi_\gamma^2(1-\alpha)$ at the deletion proportions of 0.01, 0.02, 0.03 and 0.05, hence partial deletion is not valid.

In Table 5, the response is PDRS. The likelihood ratios $E_n$ for Age, Motor and ADL are all larger than $\chi_\gamma^2(1-\alpha)$ at the deletion proportions of 0.01, 0.02, 0.03 and 0.05. Thus the null hypothesis is rejected at all these proportions. Note that the coefficient for Age is insignificant, and hence the corresponding $E_n$ values at the deleted proportions are not meaningful.




Concluding remarks
We proposed a method for partial variable deletion, which is a generalization of existing variable selection. The question is motivated by practical problems. The method can be used to find the effective ranges of the covariates, or to remove possible noise in the covariates, so that the corresponding estimated effects are more interpretable. The procedure is a generalization of the Wilks likelihood ratio statistic and is simple to use. Simulation studies are conducted to evaluate the performance of the method, and it is applied to analyze a real Parkinson's disease data set as illustration.
References
- Wilks SS (1938) The large-sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9: 60-62.
- Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.
- Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics 6: 461-464.
- Kolmogorov A (1963) On tables of random numbers. Sankhya Ser A 25: 369-375.
- Hansen M, Yu B (2001) Model selection and the principle of minimum description length. Journal of the American Statistical Association 96: 746-774.
- May RJ, Maier HR, Dandy GC, Fernando TG (2008) Non-linear variable selection for artificial neural networks using partial mutual information. Environmental Modelling and Software 23: 1312-1326.
- Mehmood T, Liland KH, Snipen L, Sæbø S (2012) A review of variable selection methods in partial least squares regression. Chemometrics and Intelligent Laboratory Systems 118: 62-69.
- Wang L, Liu X, Liang H, Carroll RJ (2011) Estimation and variable selection for generalized additive partial linear models. Annals of Statistics 39: 1827-1851.
- Liu X, Wang L, Liang H (2011) Estimation and variable selection for semiparametric additive partial linear models. Stat Sin 21(3): 1225-1248.
- Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58: 267-288.
- Tibshirani R (1997) The lasso method for variable selection in the Cox model. Stat Med 16(4): 385-395.
- Fan J, Li R (2001) Variable selection via non-concave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96: 1348-1360.
- Fan J, Li R (2002) Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics 30(1): 74-99.
- Wang HX, Leineweber C, Kirkeeide R, Svane B, Theorell T, et al. (2007) Psychosocial stress and atherosclerosis: family and work stress accelerate progression of coronary disease in women. The Stockholm Female Coronary Angiography Study. J Intern Med 261(3): 245-254.
- Shara NM, Wang H, Valaitis E, Pehlivanova M, Carter EA, et al. (2011) Comparison of estimated glomerular filtration rates and albuminuria in predicting risk of coronary heart disease in a population with high prevalence of diabetes mellitus and renal disease. Am J Cardiol 107(3): 399-405.
- Rissanen J (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory 42: 40-47.
- Bickel PJ, Klaassen CA, Ritov Y, Wellner JA (1993) Efficient and Adaptive Estimation for Semiparametric Models, Johns Hopkins University Press, Baltimore, Maryland, USA.
- Shoulson I (1989) Deprenyl and tocopherol antioxidative therapy of parkinsonism (DATATOP). Parkinson Study Group. Acta Neurol Scand Suppl 126: 171-175.
- http://www2.math.umd.edu/slud/s701/WilksThm.pdf