Linear regression model with a randomly censored predictor: Estimation procedures

.


Introduction
Modeling continuous outcome data using linear regression usually assumes in theory that the values of the covariates are fully observed.However, in practice and especially for any large data set, it is unlikely that complete information will be available for all study participants.The issue of censored data is ubiquitous and affects many studies and permeates a wide range of research areas, including medicine, economics, and social sciences .
Multiple reasons lead to incomplete observations in a data set including nonresponse, attrition, and absence of event of interest.Usually during the design, implementation, and data collection phases of a study, efforts are made to minimize the occurrence of incomplete data whenever possible and, when unavoidable, to understand the reasons for such a discrepancy in order to handle the available data adequately and run appropriate statistical analyses.Although there is an extensive literature on missing data [33,34,52] and censored outcomes, [28,26,25] only a small number of papers have explored scenarios in which the covariate is censored.[12,6,43,42] Arguably, the inadequacy of linear regression models on censored outcome variables has sparked an interest in alternative methods, and subsequently has led to major developments of regression models for survival analysis for decades.Extensive literature has been published regarding censored outcomes, especially in studies of time-to-event outcomes where censoring is due to loss of follow-up, drop out, or study termination.[28,26,25] While there is a vast literature on censored outcomes and different related methods have been discussed extensively, only a limited number of papers have focused on the issue of censored covariates.Ignoring or using a wrong approach to account for the censored nature of a covariate in regression model estimation can lead to analytical issues and spurious results [43,42,8].It is important that censored covariates be recognized, acknowledged, and handled appropriately to produce reliable results.[18,17,16,35,4,41,3,19,44] However, the vast majority of literature on censored covariates has focused on censoring due limit of detection or type 1 censored covariates where observations of the covariate below such a limit cannot be measured or detected, but recorded at or less than the limit of detection value.[18,10,46,8,9,53,30] Only a handful of publications have investigated the implications of randomly censored covariates where some observations of the covariate are censored at varying censoring time points.[44,5,6,7,31] Censored covariate measurements arise when, for some participants in a study, the ascertained information of interest has not yet occurred (or will not occur) at the time of assessment.This is due to a time lag between the time when a covariate is measured (usually at baseline) and the occurrence (or non-occurrence) of an event of particular interest that needs to happen for such a measure to be available and assessed.For instance, Clayton [11] investigated familial aggregation in chronic disease incidence and modeled the possible influence that parental age at onset of a given disease might have on an individual's risk of succumbing to a particular disease.Using the Framingham Heart Study-an ongoing multi-generational landmark study designed to identify factors and characteristics that contribute to the development of cardiovascular disease (CVD) and other diseases through long-term, active surveillance and monitoring-Atem and Matsouaka [5] studied the impact that age at onset of clinically-diagnosed cardiovascular events in parents may have on the onset of cardiovascular events among offspring.In both cases, even if important factors have been thoroughly measured for both parents and their offspring, it is unlikely that all parents have had or will have developed the disease of interest at the time of investigation.This means that the variable "parental age at onset of a given disease" is guaranteed to be censored, i.e., not fully observed.Therefore, it is extremely important in any statistical analysis to account for the fact that the variable of interest is censored for some participants.
In theory, there are many ways to address the issue of censored covariates in data analyses.From a practical point of view, however, the most important questions are: When and under what conditions can one safely consider the problem of censored covariates to be trivial?How can current methods be applied and under what conditions can one expect (asymptotically) unbiased and meaningful results?In general, inappropriate handling of censored covariates may affect the type I error [8], yield biased results, hinder the power to detect any meaningful treatment differences, or lead to loss of efficiency in estimating the coefficient parameters of a regression model.[42] Complete-case analysis, whereby observations with censored covariate values are discarded (on purpose or through a software default option), is the most commonly used method.When the sample size of the data is large, the censoring mechanism is independent of the outcome, and the proportion of censored data is relatively small, complete-case analysis of the data can be employed safely [33,34] since the impact of censored observations on the analysis of the data may be negligible.In that case the complete-case analysis yields valid (consistent and asymptotically unbiased) estimates for regression coefficient parameters.
However, under moderate to heavy censoring of data there might be a substantial loss of efficiency due to the reduction in the sample size and the significant loss of information on other fully-observed covariates and on the outcome measures of the deleted observations.[32] Furthermore, when the censoring mechanism is informative, using a complete-case analysis can lead to biased results which are in part exacerbated by the losses of information and efficiency since restricting the analysis solely to truly observed covariate measures may introduce some imbalance in the dataset in a way that misrepresents the population under study.
When the issue of censored covariate is not ignored all together, simple substitution methods (or to be more accurate ad hoc fill-in methods)-where censored observations are replaced by the overall mean or median of the observed variable or, alternatively, by a constant-are frequently used because they are simple, easy to understand, are easy to implement.Unfortunately, they may lead to substantial biased estimates and inaccurate conclusions.[13,10,45,46,38,43,17] Several non-trivial statistical methods have been developed specifically to input censored covariates and estimate regression coefficient parameters in a model with a censored covariate.[51,27] Some of these methods, known as parametric methods, use maximum likelihood estimation (MLE) under the assumption that the covariate follows a specific distribution.[40,4,29,45,31,13,10,12,3,48,47,27] For example, when such a distribution assumption is plausible, Richardson and Ciampi [41] proposed using MLE and input censored observations with E(X|X < ξ), in the context where measurements of the covariate, X, are left-censored due to limit of detection ξ.However, this approach has some limitations, especially when the censored covariate is correlated with other covariates.[40,27,48,47] As we know, an MLE method relies on a parametric distribution assumption of the censored covariate, i.e., the postulated distribution is assumed to be true and correctly specified.It is less reliable when the distribution assumption is incorrect or when the data set is so small that it becomes questionable whether the assumed distribution fits the data well.In that case, a semiparametric model that makes weaker parametric distribution assumptions or, even better, a nonparametric method that does not assume any specific distribution model at all is preferable.[40,27,35] As we previously mentioned, most of these methods highlighted above have been developed to account for type I censoring or limit of detection and are typically developed for left-censored covariates.Nevertheless, it is fairly straightforward to adapt the methods for a limit of detection data or type I censored covariate to a right-censored covariate.However, to the best of our knowledge, no such parametric approach employed for type I censored covariates has been extended to handle a randomly censored covariates.In addition, barring a few papers on dependent (randomly) censoring mechanism [44], the vast majority of published methods for type I censoring rely on the assumption that the censoring mechanism is independent of the outcome in interest.[19,35,38,18] Our primary objective in this paper is to adapt a parametric method proposed in linear regression models with a type I censored covariate to the case in which a covariate is randomly censored covariate.We use simulation studies to compare this newly developed method to the methods proposed by Sampene and Folefac [44]-in which randomly censored covariate values are replaced by a nonparametric and a semiparametric estimations of E(X|X ≥ τ ) or E(X|X ≥ τ, Y ), where τ denotes the maximum observation time for the variable X, and the outcome of interest Y .For this purpose, we will consider both dependent and independent censoring mechanism which occurred depending on whether such a censoring mechanism depends or not on the outcome of interest.Furthermore, we will also compare the aforementioned nonparametric estimation method to the commonly used deletion or complete-case analysis and give recommendations on the methods of estimation based on our simulation results.
We begin in Section 2 by presenting parametric and non-parametric methods used in the censored covariate literature.We then introduce the methods proposed by Sampene and Folefac [44] to handle randomly censored covariates.In Section 3 we run simulation studies to compare each of the discussed methods as well as the complete-case analysis method.Finally, we apply these methods in Section 4 to the Framingham Offspring Study to assess the influence of parental age at onset of cardiovascular disease on the systolic blood pressure of their offspring.

Notation and Methods
We consider n study participants independently sampled from a referenced population.Let Y i , X i and C i be, respectively, the continuous outcome variable, the potentially censored covariate (from which we are interested in making inferences), and the right censoring variable, where i indexes subjects.
Due to the right-censoring in the covariate X, for each participant i, we observe the vector

, n. The linear regression model is given by
where the parameter coefficients β 0 and β 1 are the intercept and slope, respectively.The random error is assumed to be independent of X and follows a normal distribution with mean 0 and variance σ 2 i.e. ∼ N (0, σ 2 ).We consider two different cases of censoring mechanisms.In the first case, we assume that the censoring mechanism is non-informative, i.e., C is independent of the outcome Y .For the second case, we assume that the censoring mechanism depends on the outcome Y, in a sense that there is some known point c 0 ∈ R such that the random variable C follows a distribution characterized by the distribution function G 0 when Y < c 0 and by the distribution function G 1 when Y ≥ c 0 .
For simplicity and demonstrative purposes, we limit our discussion to cases with no additional covariates.If, in practice, the data at hand contain a set of additional fully-observed (i.e.noncensored) covariates, Z, the method discussed here could easily be extended to accommodate such covariates.

Parametric method: Maximum likelihood estimation
A parametric method assumes an underlying distribution of the population from which the data at hand were sampled and uses the maximum likelihood estimation method to draw inference.
Suppose that the censoring C is independent of Y ; this implies X i and i are independent.Therefore, the distribution of Y is a product of the distributions of X i and i .The likelihood L of Y is made up of two components; one based on the uncensored (observed) X 1 , . . ., X m and the other on the right-censored X m+1 , . . ., X n : The maximum likelihood estimate of the unknown regression parameter corresponding to the censored covariate X is derived from the log-likelihood function where log Suppose that X follows a normal distribution with mean µ 2 x and variance σ 2 x , we have For the censored component, consider Φ(x) the cummulative gaussian distribution function and When censoring is dependent as described above, C i and Y are dependent.The likelihood can be expressed as where the data is made of fully observed x 1 , . . ., x m and right censored x m+1 , . . ., x n .The censored component is divided into x m+1 , . . ., x k and x k+1 , . . ., x n components with associated distribution G 0 and G 1 respectively, as in equation ( 2).

Overview of nonparametric methods
As stated in the introduction, most of the methods described in published literature that examine the issue of covariates are subject to the limit of detection.Prior to the late 1990's, the most common approach to handling such censored covariates was the complete-case analysis method.Alternatively, several naive ad hoc alternative methods have been proposed, including substitution methods, which consist of replacing censored covariate values with either a function of the limit of detection, ξ, e.g., ξ, √ 2ξ, 2ξ [19] or the mean of the observed covariate measures (mean substitution) [48] as well as dichotomizing the potentially censored covariate into a binary covariate.[8,43] Inevitably, each of these ad hoc methods leads to a biased estimation of β 1 .For instance, Helsel investigated the use of these naive substitution methods and concluded that they are inefficient and have no mathematically plausible backing [16].The extent of the inefficiency depends on the extent and the severity of censoring (i.e., the distance between the limit of detection or random censoring value and the natural limit for X) of censoring.Finally, Atem et al [6,7] explored additional non-parametric methods, based on multiple imputation approach, but concluded that these methods were not efficient when applied to the cases of dependent censoring.
Recently, Sampene and Atem [44] proposed two conditional multiple imputation methods for estimation and inference.The underpinnings of these methods involve replacing the randomly censored values X i by estimates of E(X i |X i > τ ), for i = m + 1 . . ., n.In the absence of additional covariates, the former is determined via a Kaplan-Meier estimator and performs well when the correlation between X and Y is weak.When the correlation between X and Y is strong, similar to case of missing covariate [33] the outcome of interest is included in the imputation E(X i |X i > τ, Y i ).This conditional imputation involving outcome Y , unlike the imputation not involving Y used estimates from the Cox proportional hazard hence the name Cox Multiple Imputation.To estimate the corresponding variance of β 1 for inference, Sampene and Atem [44] suggested using either a conditional multiple imputation or a conditional single imputation along with a bootstrap resampling procedure to correct for the underestimation of the variance inherent to the single imputation.In doing so, we showed that these improvements to the complete-case analysis method result in valid inferences regardless of whether the censoring mechanism is dependent or independent of the outcome.Furthermore, using simulation studies, they demonstrated that the multiple imputation method is similar to the conditional single imputation with bootstrap resampling.
In the next section, we will run simulation studies to compare the complete-case analysis, parametric, mean imputation, naive ad hoc, conditional single imputation, conditional multiple imputation and Cox multiple imputation methods.It is worth mentioning that Cole et al (2009) and Nie et al [40] have explored the parametric approach for type 1 censoring.They showed that this approach is very efficient for limit of detection data.However, as pointed out by one of the reviewers, it is worth exploring how well this parametric approach works compared to others non parametric approaches when censoring is random.
3 Monte Carlo Simulations

Data generation and simulation set up
We assumed that the true linear regression model is given by Y = β 0 + β 1 X + , with (β 0 , β 1 ) = (1, 0.5) and ∼ N (0, 1).The variable X as well as the censoring variable distribution were generated from a two-parameter Weibull distribution where θ is the shape parameter also known as the Weibull slope, with θ > 0 and λ, λ > 0, is the scale parameter.

Simulation results
Tables 1-4 summarize the results of the four sets of simulations performed for light censoring (20%) and heavy censoring (40%) in terms of , which is the (overall) deviation of a parameter estimate from the true parameter β 1 = 0.5, where β 1k is the estimate from the k-th generated data set; 4. mean squared error (MSE), which is the expectation of the square deviation of a parameter estimate from the truth.It is equal to Bias 2 + SE( β 1 ) 2 ; 5. coverage probability which is the proportion of simulated samples for which the 95% confidence interval β 1k ± Z 1−α/2 SE( β 1k ) includes β 1 = 0.5, for k = 1, ..., K.
Tables 1 and 2 show that when the distribution of X is highly skewed (see Figure 1), the parametric approach results in larger bias and MSE as compared to the conditional multiple imputation approach.Although the complete case is unbiased, deleting observations reduces the sample size, which results in an increased standard error and larger MSE as compared to both the maximum likelihood and the conditional multiple imputation methods.Despite being unbiased, both the complete case and the mean substitution methods are inefficient with higher MSE as compared to the maximum likelihood approach ,conditional multiple imputation and the Cox multiple imputation approach.The single conditional imputation is unbiased and is more efficient than the mean imputation with smaller MSE because its underestimates the standard error when imputed values are used as true values with no uncertainty.All approaches resulted in acceptable coverage probabilities.
Tables 3 and 4 show that, when the distribution of X is close to normal (see Figure 1), the maximum likelihood approach results in smaller bias and standard error.The log likelihood (2) is derived under the normal distribution assumption.Therefore, the distribution of X in Tables 3-4, which is close to the true distribution from which the maximum likelihood method is based, provides a better and a more efficient parameter estimate than the multiple imputation methods.The standard error and MSE is smaller than that of both the conditional multiple and Cox multiple imputation approaches.The other imputation methods are less efficient and more biased.Overall, it is worth mentioning that the Cox multiple imputation is more efficient than the conditional imputation when the data is well powered.As the sample size increases, this Cox multiple imputation is very efficient, which might be due to the fact that this approach uses one additional parameter Y in the imputation model as compared to the Kaplan-Meier based conditional imputation that does not involve the outcome in the imputation model.
4 Illustrative example: Association between parent age of cardiac events and low density lipoprotein (LDL) in offspring.
According to the American Heart Association cardiovascular disease (CVD) is a multi-faceted disease that affects the heart or blood vessels.CVD includes hypertensive, rheumatic, congenital, and vulvar heart diseases as well as cardiomyopathies, heart arrhythmias, carditis, aortic aneurysms, peripheral artery disease, venous thrombosis, coronary death, myocardial infarction, coronary insufficiency, angina, ischemic stroke, hemorrhagic stroke, transient ischemic attack, peripheral artery disease, and heart failure.It is the global leading cause of death, accounting for over 30% of all deaths worldwide-approximately 17.3 million deaths per year.In the United States, someone dies from CVD every 39 seconds, with most of those deaths being attributed to coronary heart disease.[49,2] Though the death rate due to CVD has decreased slowly over the last 30 years, death from heart disease remains the leading cause of death in the United States, and caring for patients with poor cardiovascular health continues to be one of the largest burdens on the health care system today.From 1990 to 2009, CVD ranked first in the number of days for which patient received hospital care, [1] yet 72% of Americans do not consider themselves at risk for heart disease.[2] Associations have long been established between CVD and a wide variety of risk factors, including non-modifiable variable such as family history [21,24,39,37,15,23,20].Blood levels of low density lipoproteins (LDL), one of the five major groups of lipoproteins categorized by density, are regarded as a strong predictor of CVD.To illustrate the methods proposed in this paper, we study the association between LDL in offspring and age at onset of a clinically-diagnosed cardiovascular event in parents, using data from the Framingham Heart Study database and looking at both the Original and Offspring cohorts.[22] The Framingham Heart Study (FHS) is an ongoing prospective study of the etiology of cardiovascular disease, among other prevalent diseases.The study began in 1948 and enrolled 5,209 participants (55% women) aged 28 and 62 years old residing in Framingham, Massachusetts as part of the original cohort who have been followed up to the present.In 1971, the Framingham Offspring Study was established with a sample of 5,124 men and women aged 5 to 70 years old who were either (genetic or adoptive) offspring or spouses of offspring of the original cohort [14,36].Study participants are examined routinely to update their health status information and potential risk factors.Standard clinical examinations included physician interview, physical examination, and laboratory tests, and continue to the present.Participants in the original cohort have been followed biannually; there were 40 participants during the 32 nd Exam visit held in 2012-2014.In the offspring cohort, participants have been followed approximately every four years.The Offspring Examination Cycle 9 covered the years 2011 to 2014 and had 2430 participants.
In this example, we performed two separate analyses, one for each parent, to evaluate the relationship between ages of CVD in parents and log(LDL) in offspring.Data gathered from the original cohort (Exam 12(1971(Exam 12( -1974)); 3,261 participants) and the offspring cohort(Exam 1 (1971)(1972)(1973)(1974)(1975); 5,124 participants) were used.We deleted all missing data and restricted the LDL to physician recorded values; this reduced the sample size to n = 2622 (1,401 mothers and 1,221 fathers).
Of the 1401 mothers in the final data set, 907 of them (i.e., 35.26%) experienced a cardiovascular event whereas 909 (i.e 74.45%) out of 1,221 fathers experienced a cardiovascular event.The median age of CVD was 66 years and 63 years for mothers and fathers, respectively.
Results of the data analyses are provided in Tables 5 and 6.The results for the complete-case analysis, mean substitution, maximum likelihood, conditional single imputation and the conditional multiple imputation are consistent with the simulation results.With a larger sample, the assumption of normality for the censored covariate is met and the parametric method provides better estimates, along with smaller standard error.On the other hands, the results from the ad hoc substitution methods are inconsistent with the simulation results.This is because there is no scientific bases for such substitutions.

Conclusion
Most of the literature on censored covariates deals with the issue of limits of detection, the point at which observations below this limit cannot be measured or detected and are instead recorded at the limit of detection value.[10,46,43] In this paper, we considered the estimation of linear regression models when the covariate of interest is randomly censored.We evaluated non parametric conditional imputation methods based on the Kaplan-Meier estimate to impute a censored covariate.We compared this non parametric approach based on Kaplan-Meier to the regression from the full data (without censoring), the complete-case analysis, a naive ad hoc substitution (replacing censored values by the mean of the observed covariate values) and the maximum likelihood approach.
Parametric estimators were determined via maximum likelihood estimation method based on an underlying distribution assumption of the censored covariate.Throughout our simulations, we demonstrate that the naive ad hoc substitution method provides biased estimation of the regression parameter of the censored covariate.As Helsel pointed out [17], these substitution methods are akin to fabricating data; they don't have any theoretical basis and should thus be discouraged.
The complete-case analysis method is the widely used approach for handling censored predictors as it is easy to implement.The obvious pitfall of the complete-case approach is that it potentially sacrifices information by discarding observations.Although, this method leads to unbiased estimates under independent censoring, it can result in a substantial loss of power, especially under moderate to high percentages of censored observations.Under dependent censoring, complete-case analysis may lead to model misspecification due to selection bias if a group of subjects with similar characteristics do not experience the event of interest or leave a study before its completion.
The mean substitution approach is easy and looks reliable but a detailed analysis of this approach shows it has many short comings.We cannot always guarantee that the mean of the complete case will be greater than the time at censoring.One basic assumption of censored data is that, if the event is to occur, it can only happen after the censored time.Furthermore, this approach does not make use of the available information, that is, the time at censoring.
Using parametric methods requires prior knowledge or postulating a distribution model for the censored covariate.When the postulated parametric distribution of the censored covariate corresponds to the true distribution, the maximum likelihood estimation method and the nonparametric method via Kaplan-Meier estimation all provide consistent estimates, under independent censoring.Under dependent censoring, if the distribution of X and C are similar, these methods are efficient; however, if the distribution of X and C are dissimilar the MLE approach will be highly inefficient (as shown in section 2.1).Therefore, we propose the use of Kaplan-Meier nonparametric imputation in absence of prior knowledge of the distribution of censored covariate or when such a distribution cannot be accurately ascertained.On the other hand, if the sample size is large and the distribution of X is a member of the exponential family, the MLE approach can be suitable.

25 Figure 1 :
Figure 1: Distribution of the censored covariate as the function of the shape and scale parameters

Table 5 :
Relationship between Maternal age of onset of CVD and LDL in offspring

Table 6 :
Relationship between Paternal age of onset of CVD and LDL in offspring