Partial Variable Selection and its Applications in Biostatistics

Citation: Gu J, Yuan A, Zhou C, Chan L, Tan MT (2018) Partial Variable Selection and its Applications in Biostatistics. Int J Clin Biostat Biom 4:017. doi.org/10.23937/2469-5831/1510017
Received: January 08, 2018; Accepted: April 12, 2018; Published: April 14, 2018
Copyright: © 2018 Gu J, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
Variable selection is a common practice in biostatistics, and there is a vast literature on this topic. Commonly used methods include the likelihood ratio test [1], the Akaike information criterion (AIC) [2], the Bayesian information criterion (BIC) [3], the minimum description length [4,5], stepwise regression, and the Lasso [6]. Principal components analysis models linear combinations of the original covariates and reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some measure of statistical significance. See also May, et al. [7].
For simplicity of discussion we assume that the residual density $f$ has no unknown parameters. Then the log-likelihood is
$$\ell_n(\beta) = \sum_{i=1}^{n} \log f(y_i - x_i'\beta),$$
where $y_i$ is the $i$-th response and $x_i'$ is the $i$-th row of $X_n$. Let $\hat\beta$ be the maximum likelihood estimate under the full model and $\tilde\beta$ the maximum likelihood estimate after $k$ columns of $X_n$ are deleted. Under the hypothesis $H_0$: the deleted columns of $X_n$ have no effects, or equivalently the deleted components of $\beta$ are all zeros, we have asymptotically [1]
$$2\,[\ell_n(\hat\beta) - \ell_n(\tilde\beta)] \to \chi^2_k \quad \text{in distribution},$$
where $\chi^2_k$ is the chi-squared distribution with $k$ degrees of freedom. For a given nominal level $\alpha$, let $\chi^2_k(1-\alpha)$ be the $(1-\alpha)$ quantile of $\chi^2_k$; $H_0$ is rejected at level $\alpha$ when the statistic exceeds this quantile.
To our knowledge, a method for partially deleting variables is not available in the literature, and developing one is the goal of our study here. Note that in existing methods of variable selection, whole variables are selected or deleted, while in our method some variable(s) are partially selected or deleted, i.e., only some proportions of the observations of some variable(s) are selected or deleted. The latter is very different from the existing methods. In summary, with traditional variable selection methods, such as stepwise regression or the Lasso, a covariate is either removed wholly from the analysis or not at all. This is not very reasonable: some of the removed covariates may be partially effective, and removing all their values may yield misleading results, or at least cause information loss; while for the variables remaining in the model, not all their values are necessarily effective for the analysis. With the proposed method, only the non-effective values of the covariates are removed, and the effective values are kept in the analysis. This is more reasonable than the existing all-or-nothing approach.
In the existing method of deleting whole variable(s), the validity of such selection can be justified using the Wilks result: under the null hypothesis of no effect of the deleted variable(s), twice the log-likelihood ratio is asymptotically chi-squared distributed. We extend the Wilks theorem to the proposed partial variable deletion and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data set as illustration.
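As a concrete illustration of the whole-variable case, the following minimal sketch computes the Wilks statistic for dropping one covariate. It assumes a no-intercept linear model with two covariates and normal residuals with known standard deviation `sigma`; the function names and the two-covariate setup are illustrative assumptions, not taken from the paper.

```python
import math


def fit_two_covariates(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept) via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def rss(x1, x2, y, b1, b2):
    """Residual sum of squares for given coefficients."""
    return sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))


def wilks_test_drop_x2(x1, x2, y, sigma=1.0):
    """Return (statistic, p_value) for H0: the coefficient of x2 is zero.

    Assumption: residuals are N(0, sigma^2) with sigma known.
    """
    b1_full, b2_full = fit_two_covariates(x1, x2, y)
    rss_full = rss(x1, x2, y, b1_full, b2_full)
    # Reduced model: y = b1*x1 only (the whole column x2 is deleted).
    b1_red = sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)
    rss_red = rss(x1, x2, y, b1_red, 0.0)
    # Under normal residuals with known sigma, twice the log-likelihood
    # ratio equals the drop in residual sum of squares divided by sigma^2.
    stat = (rss_red - rss_full) / sigma ** 2
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value
```

With `sigma = 1`, a statistic above 3.841 (the 0.95 quantile of the chi-squared distribution with 1 degree of freedom) rejects the null hypothesis at the 0.05 level, so the covariate should not be deleted.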

The Proposed Method
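The observed data is $(y_n, X_n)$, where $y_n$ is the vector of the $n$ responses and $X_n$ is the $n \times d$ covariate matrix.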

Summary of existing work:
We first give a brief review of the existing method of variable selection. Assume the model residual has some known density function $f(\cdot)$ (such as normal), with possibly some unknown parameter(s).
Parallel to the log-likelihood ratio statistic for (whole) variable deletion, we define a statistic $\Lambda_n$ for our case. Note that in the Wilks problem the null hypothesis is that the coefficients corresponding to some variables are zero, so the null hypothesis is nested within the alternative. In our problem the null hypothesis concerns the coefficients corresponding to some partial variables, and it is not nested within the alternative. So the results of the two methods are not really comparable.
The case in which the $C_j^r$'s are not mutually exclusive is a bit more complicated, and is handled by first re-writing the sets.
This method does not require the models to be nested, but still requires selecting or deleting whole columns. The other existing methods for variable selection, such as stepwise regression and the Lasso, all delete or keep whole variables and do not apply to our problem.

The proposed work
We now come to our question, which is non-standard; we are not aware of a formal method that addresses it. However, we think the question is of practical importance. Consider deleting some of the components within $k$ fixed columns ($k \le d$) of $X_n$, with given deleted proportions $\gamma$ for these columns. Denote by $X_n^-$ the remaining covariate matrix, which is $X_n$ with the entries corresponding to the deleted elements replaced by 0's. Before the partial deletion, the model is
$$y_n = X_n \beta + \epsilon_n,$$
and after the partial deletion of covariates, the model becomes
$$y_n = X_n^- \beta^- + \epsilon_n,$$
where $y_n$ is the response vector and $\epsilon_n$ is the residual vector. Note that here $\beta$ and $\beta^-$ have the same dimension, as no covariate is completely deleted: $\beta$ is the vector of effects of the original covariates, and $\beta^-$ is the vector of effects of the covariates after possible partial deletion, i.e., the effects of the effective covariates. As an oversimplified example, consider individuals with five responses. When the null hypothesis $H_0$ is accepted, the partial deletion is valid.
Note that, unlike the standard null hypothesis that some components of the parameters are zero, the above null hypothesis is not a nested hypothesis.
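The construction of the partially deleted covariate matrix (the original matrix with the deleted entries replaced by 0's) can be sketched as follows. This is an illustration under stated assumptions: the deleted entries of a column are taken to be its smallest values, as in the lower-quantile deletions discussed in the real data analysis, and the helper names `partial_delete` and `partial_delete_matrix` are hypothetical.

```python
def partial_delete(column, gamma):
    """Replace the smallest round(gamma * n) entries of a covariate column by 0.

    Assumption: the deleted proportion gamma refers to the lower tail of the
    column. Rows are kept, so the dimension of the coefficient vector is unchanged.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    n_delete = round(gamma * len(column))  # number of entries to zero out
    if n_delete == 0:
        return list(column)
    # Indices of the n_delete smallest values (ties broken by position).
    order = sorted(range(len(column)), key=lambda i: column[i])
    deleted = set(order[:n_delete])
    return [0.0 if i in deleted else v for i, v in enumerate(column)]


def partial_delete_matrix(rows, gammas):
    """Apply partial_delete column by column.

    rows   : list of rows of the covariate matrix
    gammas : dict mapping column index -> deleted proportion for that column
    """
    columns = [list(col) for col in zip(*rows)]
    for j, gamma in gammas.items():
        columns[j] = partial_delete(columns[j], gamma)
    return [list(row) for row in zip(*columns)]
```

For example, `partial_delete([5.0, 1.0, 3.0, 2.0, 4.0], 0.4)` zeroes the two smallest entries and leaves the other three unchanged. Columns not listed in `gammas` are left intact, which corresponds to a deleted proportion of zero.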

Simulation study
We illustrate the proposed method with two examples, Examples 3 and 4 below. The former rejects the null hypothesis $H_0$ while the latter accepts it. In each case we simulate five data sets, and the results are reported in Table 1. The five rows in Table 1 are the results for the five data sets. For each data set, the parameter $\beta$ is estimated, a test is conducted using the given $\gamma$, $\Lambda_n$ is computed, the cut-off point $Q(1-\alpha)$ is given, and the corresponding p-value is provided.
Note that for our problem, a p-value smaller than $\alpha$ leads to rejection of $H_0$.
By examining the proof of Theorem 1, we get the following corollary, which gives the result in the more general case. The limiting distribution there involves chi-squared random variables that are all independent, each with 1 degree of freedom.
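The cut-off point and p-value for a statistic whose null limit is built from independent chi-squared variables with 1 degree of freedom can be approximated by simulation. The sketch below assumes the limit is a weighted sum of such variables; the weights are hypothetical inputs, since their form is not given in this section, and the function names are illustrative.

```python
import random


def simulate_cutoff(weights, alpha=0.05, n_draws=20000, seed=0):
    """Monte Carlo (1 - alpha) quantile of sum_j weights[j] * Z_j**2, Z_j iid N(0, 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Each Z**2 is a chi-squared variable with 1 degree of freedom.
        draws.append(sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights))
    draws.sort()
    # Empirical quantile: the draw at position about (1 - alpha) * n_draws.
    index = min(n_draws - 1, int((1.0 - alpha) * n_draws))
    return draws[index]


def monte_carlo_p_value(statistic, weights, n_draws=20000, seed=0):
    """Fraction of simulated null draws that are at least as large as the statistic."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if draw >= statistic:
            exceed += 1
    return exceed / n_draws
```

With a single weight equal to 1, the simulated cut-off should be close to 3.841, the 0.95 quantile of the chi-squared distribution with 1 degree of freedom, which gives a quick sanity check before using other weights. An observed statistic above the simulated cut-off (equivalently, a Monte Carlo p-value below the nominal level) leads to rejection of the null hypothesis.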
The results of Theorem 2 extend to the general case.
The patients had Parkinson's disease (stage I or II) for less than five years and met other eligibility criteria. They were randomly assigned, according to a two-by-two factorial design, to one of four treatment groups: 1) placebo; 2) active tocopherol; 3) active deprenyl; 4) active deprenyl and tocopherol. Observation continued for 14 ± 6 months, and patients were reevaluated every 3 months. At each visit, the Unified Parkinson's Disease Rating Scale (UPDRS), including its motor, mental and activities of daily living components, was evaluated. The statistical analysis was based on 800 subjects. The results revealed no beneficial effect of tocopherol. Deprenyl was found to significantly prolong the time until levodopa therapy was required, reducing the risk of disability by 50 percent as measured by the UPDRS.
Our goal is to examine whether some of the covariates can be partially deleted. If traditional variable selection methods such as stepwise regression or the Lasso are used, some covariate(s) will end up being removed wholly from the analysis. This is not very reasonable, since some of the removed covariates may be partially effective, and removing all their values may yield misleading results, or at least cause information loss. We use the proposed method to examine three of the response variables, PDRS, TREMOR and PIGD, and three covariates, Age, Motor and ADL. The results are shown in Table 3, Table 4 and Table 5 below.
In Table 3, response TREMOR is examined. For covariate Age, the likelihood ratio $\Lambda_n$ is larger than the cut-off point $Q(1-\alpha)$. For covariate Motor, at the 0.01 proportion the covariate can be partially deleted. In other words, the values of Motor smaller than its 1% quantile have no impact on the analysis and can be treated as noise.
A significant value of $\Lambda_n$ means a significant difference between the regression coefficients of the original covariates and those of the covariates after partial deletion, which in turn implies that the null hypothesis should be rejected and the partial deletion should not be conducted (Table 1).
We see that the p-values for rejecting $H_0$ are all smaller than 0.05 for the five sets of $\beta_0$. This suggests that the covariates should not be partially deleted at the given proportions.

Application to real data problem
We analyze a data set from the Deprenyl and Tocopherol Antioxidative Therapy of Parkinsonism study, obtained from the National Institutes of Health (NIH); for a detailed description and data link, see https://www.ncbi.nlm.nih.gov/pubmed/2515723. It is a multi-center, placebo-controlled clinical trial that aimed to determine a treatment for patients with early Parkinson's disease that prolongs the time until they require levodopa therapy. The number of patients enrolled was 800. The selected subjects were untreated patients with Parkinson's disease.
The new estimates are more meaningful, since the non-effective values of covariate Motor are removed from the analysis.
In Table 5, the null hypothesis is rejected at all these proportions; that is, no deletion is valid at these proportions, and the analysis should be based on the original full data, with the parameter estimates shown in the table (Table 3, Table 4, and Table 5).
Note that the coefficient for Age is insignificant, and hence the corresponding $\Lambda_n$ values at the deleted proportions are not meaningful.

Covariate values that can be treated as noise should be removed from the analysis. In Table 3, for covariate ADL, with deletion proportions 0.01-0.1, the likelihood ratio $\Lambda_n$ is smaller than $Q(1-\alpha)$, which suggests that the lowest 1%-10% of the values of this covariate have no impact on the analysis and should be deleted. After removing the corresponding proportions of Motor and ADL, the model is re-fitted to get the parameter estimates shown in the table. These estimates are more meaningful than the ones based on the whole covariate data, since the noise values of the covariates are now removed and only the effective covariate values enter the analysis. However, if traditional variable selection methods such as stepwise regression or the Lasso are used, the whole covariate Motor, ADL, or both may end up being removed, leading to loss of information or even misleading results.
In Table 4, response PIGD is investigated. For covariate Age, $\Lambda_n$ is larger than the cut-off point $Q(1-\alpha)$.

Concluding Remarks
We proposed a method for partial variable deletion, in which only some proportion(s) of the values of the covariate(s) are deleted. This is in contrast to the existing methods, which either select or delete entire variable(s). Thus the method is new and is a generalization of existing variable selection. The question is motivated by practical problems. The method can be used to find the effective ranges of the covariates, or to remove possible noise in the covariates, so that the corresponding estimated effects are more interpretable. The proposed test statistic is a generalization of the Wilks likelihood ratio statistic. Its asymptotic distribution is generally a chi-squared mixture distribution, and the corresponding cut-off point can be computed by simulation. Simulation studies are conducted to evaluate the performance of the method, and it is applied to a real Parkinson's disease data set as illustration. A drawback of the current version of the method is that the proportions of possible deletions must be specified for the variables, so the optimal proportions are not easy to find. In our next step of research we will try to implement an algorithm that finds the optimal proportions automatically and is easier to use. As suggested by a reviewer, simulation studies should be performed to test the statistical significance of differences between the proposed method and existing variable selection method(s), in order to address the contribution of the proposed method. This is a topic for our future research (Appendix).