Online Computerized Adaptive Testing to Examine a Person Being Depressed Using a Taiwanese Depression Scale (TDS)

Background: Depression has been measured in many studies of mental health. None has used online computerized adaptive testing (CAT) with cutting points to report the prevalence rate of depression in a workplace. Objective: To develop an online CAT for assessing depression and to verify whether item response theory-based CAT can be applied online to measure a person's depression. Methods: A total of 413 persons (213 depression patients and 200 normal undergraduates) were recruited and responded to the 22-item Taiwanese Depression Scale (TDS). All non-adaptive testing (NAT) items were calibrated with the Rasch rating scale model. Three scenarios (i.e., NAT, CAT, and the randomly selected method) were compared on response efficiency and precision using i) the number of items answered and the person measures, ii) correlation coefficients, iii) paired t tests, and iv) estimated standard errors (SE), with CAT and the random scenario each compared against NAT. Results: The TDS is a unidimensional construct that can be administered to patients through CAT to measure depressive perceptions. CAT required fewer items (= 14) than NAT (= 22, an efficiency gain of 36% = 1 - 14/22). Person measures from CAT and from the random scenario were both highly correlated with those from NAT (r = 0.96 and 0.98), and they did not differ statistically from the NAT measures (fewer than 5% of the paired comparisons were significant), as expected, but CAT yielded substantially smaller person measure SEs than the random scenario. The positive and negative predictive values were 0.96 and 0.91, respectively, when cutting points were set at -0.7 and 0.7 logits. Conclusion: Online CAT administration of the TDS substantially reduced patients' response burden without compromising measurement precision.


Introduction
Depression is a disease of modernity whose prevalence has increased with the drastic changes in daily life over the past century [1]. Mental disorders are common in the United States and internationally. According to a 2013 report from the US National Institute of Mental Health (NIMH), an estimated 26.2% of Americans aged 18 and older, about one in four adults, suffer from a diagnosable mental disorder in a given year [2], up from 16.2% in 2003 [3]. Major Depressive Disorder (MDD) is the leading cause of disability in the U.S. for ages 15-44 [4]. MDD affects about 14.8 million American adults (i.e., 6.7% of the U.S. population aged 18 and older) in a given year [5,6]. Although MDD can develop at any age, the median age at onset is 32 years [7]. MDD is about twice as prevalent in women as in men [8]. People are most likely to suffer their first depressive episode between the ages of 30 and 40, and there is a second, smaller peak of incidence between ages 50 and 60 [9].
DSM-III criteria were used to define depression. Lifetime prevalence estimates of MDD ranged from 1.0% (Czech Republic) to 16.9% (US), with midpoints at 8.3% (Canada) and 9.0% (Chile) [10]; in Taiwan, estimates were 1.14% in 1996 [11] and 1.20% in 2012 [12]. Weissman et al. [13] published the first cross-national comparison of major depression from 10 population-based surveys in 1996 and reported prevalence rates ranging from the lowest in Taiwan (1.5%) to the highest in Beirut (19.0%), with midpoints at 9.2% (West Germany) and 9.6% (Edmonton, Canada). An easy, user-friendly assessment that helps an institution or a school continuously monitor its own MDD prevalence rate is urgently needed in this high-tech industrial society.

An assessment to calculate depression prevalence rates
Many depression scales have been developed and validated in published papers [14][15][16][17][18]. Some were translated into local languages and some were developed by researchers in their own languages. All of them share a common problem: how to determine cutting points for calculating a depression prevalence rate. Different numbers of items and response categories lead to different cutting points if summed scores are used. Importantly, comparing derived score levels with the suggested best cutoff points can help clinicians (or practitioners) assess examinees at risk of an incident [19,20]. Multiple cutoff points are recommended as more powerful and useful than a single cutoff point [21,22]. How to determine appropriate cutting points for a depression scale is the first research question of the current study.

Online computerized adaptive testing is required
Many studies [23][24][25][26][27] have reported that item response theory (IRT)-based computerized adaptive testing (CAT) combines the advantages of long-form and short-form questionnaires [28][29][30] in precision and efficiency. Mobile phones are commonly used by people in all walks of life. However, no study to date has reported an online CAT delivered via mobile phones for gathering data in healthcare settings. The second research question is how to develop an online CAT that can be used for depression assessment.

Objectives
First, we demonstrate that the Taiwanese Depression Scale (TDS) is a unidimensional construct. Second, we determine a set of cutting points that can be used for computing a workplace prevalence rate with CAT. Third, we compare CAT with non-adaptive testing (NAT) and with the randomly selected method on efficiency and precision. Fourth, we develop an online CAT for individual examinees to measure their level of depression.

Taiwanese depression scale (TDS) and data source
The study data (i.e., item and person parameters) were extracted from two published papers [31,32], comprising 213 depression patients and 200 normal undergraduates who answered the 22-item TDS in a 4-point Likert-type format (i.e., 0 for seldom, 1 for occasionally, 2 for frequently, and 3 for always). The TDS comprises four facets: a cognitive dimension with six items, an emotional dimension with six items, a physical dimension with six items, and an interpersonal dimension with four items. The 22-item TDS (Cronbach's α = 0.97) has been shown to be a unidimensional instrument for evaluating depressive symptoms and can replace some out-of-date depression scales in Taiwan [31]. With 22 item difficulties, three threshold difficulties under the Rasch [33] rating scale model [34], and 413 person ability parameters, we conducted a simulation [35] to form a 413 × 22 rectangular response matrix fitting the Rasch model's requirements (see the demonstration in Additional files 1 to 3). The Rasch person separation reliability is 0.85 (mean = -0.70, SD = 1.81). According to the literature [36][37][38], as a scale's reliability (i.e., Cronbach's α) increases, so does the number of ranges of persons that can be confidently distinguished. Measures from instruments with reliabilities of 0.67 will tend to vary within two groups that can be separated with 95% confidence; 0.80, within three groups; 0.90, within four groups; 0.94, within five groups; 0.96, within six groups; 0.97, within seven groups, etc. [39]. To be conservative in computing the number of strata, we relied on the Rasch person separation reliability (= 0.85) rather than Cronbach's α, and then referred to the Rasch threshold difficulty guideline [40], under which an appropriate distance between two thresholds ranges from 1.4 to 5.0 logits (log odds). Three strata were thus determined. The standard error of measurement was 0.7 (= SD × √(1 - reliability) = 1.81 × √(1 - 0.85)).
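The standard error of measurement and the number of strata reported above can be checked with a short sketch. The function names are ours, for illustration only. The strata formula, H = (4G + 1)/3 with separation G = √(R/(1 - R)), is the conventional Rasch separation formula and is an assumption here, because the text lists only the resulting groups; rounded down, it reproduces the listed values.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def strata(reliability):
    """Number of distinguishable strata (assumed formula, rounded down)."""
    separation = math.sqrt(reliability / (1.0 - reliability))
    return math.floor((4.0 * separation + 1.0) / 3.0)

# Values taken from the text: person SD = 1.81, separation reliability = 0.85.
print(round(sem(1.81, 0.85), 2))
print(strata(0.85))
```

With the sample values this gives an SEM of about 0.70 logits and three strata.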
Accordingly, the cutting points can be set at -0.7 and 0.7 logits, given that the mean of the 22 item difficulties is conventionally calibrated at zero logits (Figure 1). Among all possible cutting points, a comparison was made to select those with the highest Kappa coefficient and hit ratio (i.e., accurate classification rate = the number of accurate classifications in both the positive and negative cells divided by the total number). Three scenarios (i.e., NAT, CAT, and the randomly selected method) were compared on response efficiency and precision using the number of items answered and the person measures; correlation coefficients; Smith's paired t tests [41]; and estimated standard errors (SE), with CAT and the random scenario each compared against NAT (Figure 2 and Additional file 4). We ran an author-programmed VBA (Visual Basic for Applications) module in Microsoft Excel. The Rasch person separation reliability of the TDS yielded by Winsteps (i.e., excluding all extreme scores summed to zero) was used to determine the CAT termination criterion through the standard error of measurement (SEM = SD × √(1 - reliability)). A second termination criterion was that the mean of the last five changes between consecutive ability estimates in a CAT be less than 0.05. The minimum number of questions required for completion was set at 7 (7 of the 22 TDS items, roughly 30%). The first item was randomly selected from the 22 items when starting the CAT. The provisional measures were estimated by maximum likelihood estimation (MLE). The next question selected was the one providing the most information among the remaining unanswered items at the current provisional person measure.
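The CAT rules described above (random first item, maximum-information selection, MLE, a minimum of seven items, and the two termination criteria) can be sketched as follows. This is a minimal Python illustration, not the authors' VBA module. Several details are assumptions: the item difficulties and the respondent are hypothetical, the two termination criteria are combined with "or", the changes in the estimate are taken in absolute value, extreme raw scores are nudged by 0.5 so that a finite MLE exists, and the SE threshold is passed in by the caller (0.70 is used here only as an example).

```python
import math
import random

# Step thresholds reported for the TDS; item difficulties below are hypothetical.
THRESHOLDS = (-1.08, -0.52, 1.60)

def category_probs(theta, delta):
    """Rasch rating scale model probabilities for scores 0..3."""
    logits = [0.0]
    for tau in THRESHOLDS:
        logits.append(logits[-1] + theta - delta - tau)
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def item_moments(theta, delta):
    """Expected score and its variance; the variance is the item information."""
    probs = category_probs(theta, delta)
    mean = sum(k * p for k, p in enumerate(probs))
    info = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, info

def estimate(answered, theta):
    """Newton-Raphson MLE from (difficulty, score) pairs; returns (theta, SE)."""
    top = len(THRESHOLDS) * len(answered)
    # Assumption: extreme raw scores are nudged by 0.5 so a finite MLE exists.
    raw = min(max(sum(score for _, score in answered), 0.5), top - 0.5)
    for _ in range(100):
        moments = [item_moments(theta, d) for d, _ in answered]
        info = sum(v for _, v in moments)
        step = (raw - sum(m for m, _ in moments)) / info
        theta += max(-1.0, min(1.0, step))  # damp large jumps
        if abs(step) < 1e-4:
            break
    info = sum(item_moments(theta, d)[1] for d, _ in answered)
    return theta, 1.0 / math.sqrt(info)

def run_cat(deltas, answer, se_stop, rng, min_items=7, change_stop=0.05):
    """Administer items adaptively; `answer(i)` returns the score on item i."""
    remaining = list(range(len(deltas)))
    answered, changes, theta = [], [], 0.0
    item = rng.choice(remaining)  # the first item is drawn at random
    while True:
        remaining.remove(item)
        answered.append((deltas[item], answer(item)))
        new_theta, se = estimate(answered, theta)
        changes.append(abs(new_theta - theta))
        theta = new_theta
        stable = len(changes) >= 5 and sum(changes[-5:]) / 5 < change_stop
        enough = len(answered) >= min_items
        if not remaining or (enough and (se <= se_stop or stable)):
            return theta, se, len(answered)
        # Next item: the most information at the provisional measure.
        item = max(remaining, key=lambda i: item_moments(theta, deltas[i])[1])

rng = random.Random(2016)
deltas = [-1.0 + 2.0 * i / 21 for i in range(22)]  # hypothetical difficulties
true_theta = 0.5                                   # hypothetical respondent

def respond(i):
    """Simulate a response by sampling from the model probabilities."""
    return rng.choices(range(4), category_probs(true_theta, deltas[i]))[0]

theta, se, used = run_cat(deltas, respond, se_stop=0.70, rng=rng)
print(f"measure={theta:.2f} SE={se:.2f} items={used}")
```

Passing the respondent as a callback keeps the same loop usable for simulation, as here, and for live administration, where `answer` would return the examinee's actual choice.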

An online CAT was designed for smart phones
An online CAT was designed for examinees to report their depression scores in logits. The 22 items with their threshold difficulties (calibrated by Rasch Winsteps) and their corresponding audio clips and pictures were uploaded to the website. The rules for selecting the first and subsequent CAT items and the termination criteria are the same as in the aforementioned simulation. SPSS 15.0 for Windows (SPSS Inc., Chicago, IL) and MedCalc 9.5.0.0 for Windows (MedCalc Software, Mariakerke, Belgium) were used to calculate (1) Cronbach's α, (2) dimension coefficients [42], and (3) correlation coefficients between the person measures estimated for CAT and the random scenario and their NAT counterparts. Independent t tests were used to compare (4) the ratios of the different paired person measures. Rasch Winsteps was used to produce (5) the person separation reliability. The prevalence (or incidence) rate was calculated as the number of persons whose depression grade falls outside the low stratum divided by the sample size.
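The prevalence formula can be illustrated with a short sketch that uses the cutting points of -0.7 and 0.7 logits set earlier. The function names and the sample measures are hypothetical, and the treatment of a measure falling exactly on a cutting point (assigned to the higher stratum) is an assumption.

```python
def depression_grade(measure, low_cut=-0.7, high_cut=0.7):
    """Classify a person measure (logits) into one of three strata."""
    if measure < low_cut:
        return "low"
    if measure < high_cut:
        return "moderate"
    return "high"

def prevalence_rate(measures):
    """Share of persons whose grade falls outside the low stratum."""
    grades = [depression_grade(m) for m in measures]
    return sum(grade != "low" for grade in grades) / len(grades)

sample = [-2.1, -0.9, -0.3, 0.4, 1.2, 2.5]  # hypothetical person measures
print(prevalence_rate(sample))
```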

Results
The sample of 413 persons was obtained from the study (Additional file 3). Count distribution for the two study samples is shown in Figure 2.

Dimensionality
The TDS can be regarded as unidimensional because a) one factor was extracted using parallel analysis; b) all Infit and Outfit mean squares for the 22 items are in the range of 0.5 to 1.5 (the Infit column in Table 1; Figure 4); c) item loadings on the first contrast from the Rasch PCA of residuals, once standardized (i.e., (loading - mean)/SD), lie between -1.66 and 2.24 (< 2.58, P > .01) in Table 1, and the point-measure correlations (PTME) are between 0.71 and 0.84 (the PTME column in Table 1), indicating high item loadings on the unobserved latent trait; d) the Rasch person separation reliability = 0.85, Cronbach's α = 0.97, DC = 0.88 (> 0.67), and the proportion of Smith's t tests [41] falling outside the range ±1.96 is near zero (= 1.4% = 11/414). In addition, the category structure of the TDS displays monotonically increasing thresholds (-1.08, -0.52, and 1.60 logits) in compliance with Linacre's guidelines [40].
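The proportion-of-t-tests check in (d) can be sketched as below. The text does not spell out the formula, so this assumes the common form in which the difference between two estimates of the same person is divided by the square root of the sum of their squared standard errors; the function name and the example pairs are hypothetical.

```python
import math

def significant_share(pairs, critical=1.96):
    """Share of persons whose two measures differ beyond +/- critical.

    Each pair is (measure_a, se_a, measure_b, se_b) in logits.
    """
    flagged = 0
    for measure_a, se_a, measure_b, se_b in pairs:
        t = (measure_a - measure_b) / math.sqrt(se_a ** 2 + se_b ** 2)
        if abs(t) > critical:
            flagged += 1
    return flagged / len(pairs)

pairs = [(0.0, 0.5, 0.1, 0.5), (1.0, 0.3, -0.5, 0.3)]  # hypothetical estimates
print(significant_share(pairs))
```

A share below 5% is read in this study as support for equivalent measures.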

Cutting point determination
The person separation reliability for the TDS is 0.85, indicating that three strata can be separated, with cutting points at -0.7 and 0.7 yielding the highest Kappa coefficient and hit ratio (Table 2). The incidence rate of MDD for this study sample is 52.7% (= 218/413; Figure 2). Three strata of equal size, with equivalent cumulative probability, are separated by the cutting points at -0.7 and 0.7 (Figures 3 and 4).
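For readers who wish to reproduce the comparison of candidate cutting points, the hit ratio and Cohen's Kappa can be computed from a cross-classification table as in the sketch below. The function name and the counts are hypothetical and are not the study data.

```python
def hit_ratio_and_kappa(table):
    """Agreement statistics for a square cross-classification table.

    table[i][j] counts persons in true class i who were classified as j.
    """
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

counts = [[40, 10], [5, 45]]  # hypothetical counts, not the study data
print(hit_ratio_and_kappa(counts))
```

Repeating the calculation for each candidate pair of cutting points and keeping the pair with the highest values mirrors the selection rule described in the Methods.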

Comparison of efficiency and precision
The CAT required substantially fewer items (mean = 14.3; SD = 0.39; SE = 0.28; 95% CI = 13.7-14.8, p < .05) than did NAT (= 22) and provided an efficiency gain in test length of 36% (= 1 - 14/22; Figure 5, panel A). Person measures from CAT did not differ statistically from those of NAT because (1) the proportion of significant Smith's t tests [41] is 3.1% (= 13/413 < 5%; Figure 5, panel B), and (2) the correlation coefficient = 0.97 (R-square = 0.95; see Figure 5, panel C). Compared with the random scenario, CAT yields smaller SEs (Figure 5, panel D).
We developed an online CAT module to demonstrate the assessment in action. By scanning the QR code at the bottom right of Figure 6, the TDS item appears on the smartphone. The CAT processes each person item by item with picture animations (Figure 6, top left). Adaptive item selection is based on maximizing information across unanswered items. The standard error of measurement for each subscale decreased as the number of items increased (Figure 6). The result, with a person measure and the depression grade (i.e., low, moderate, or high), is shown instantly on the smartphone (Figure 6).

Key findings
The results from this study indicate that the 22-item TDS is unidimensional. A set of cutting points at -0.7 and 0.7 logits was determined for future use in workplace depression surveys. The incidence rate of depression for the study sample was 52.7%. The CAT was 36% more efficient in the number of questions answered and achieved precision similar to that of NAT. An online CAT app for the TDS, available for download, is suited to smartphones.

What this adds to what was known
Consistent with the literature [43][44][45][46][47][48], the 22-item TDS can be regarded as unidimensional. The efficiency of CAT over NAT was supported. We confirm that the CAT-based TDS requires significantly fewer answered items than NAT to measure depressive symptoms without compromising measurement precision.

What it implies and what should be changed?
Cutoff point recommended for calculating depression prevalence rate
Many depression scales share a common problem: how to determine cutting points for calculating a depression prevalence rate. Different numbers of items and response categories lead to different cutting points if summed scores are used. In this study we determined cutting points at -0.7 and 0.7 logits that are suitable for CAT regardless of the number of items administered and that can be transferred to any depression scale with a different summed score by using the percentage score (Figure 1).
For instance, a 20-item depression scale with 5 rating categories has two cutting points at about 26 (= 33% × 80 = 26.4) and 54 (= 67% × 80 = 53.6), where 80 is the maximum summed score (= 20 × 4). In this way, comparing derived score levels with the suggested best cutoff points can help clinicians (or practitioners) assess examinees at risk of an incident [19,20]. Multiple cutoff points are usually more powerful and useful than a single cutoff point [21,22]. Maslach et al. [49] suggested setting an equal sample size in each stratum as a way to determine cutting points. The value of 0.7 logits is the standard error of measurement of the sample. In this study the person SD = 1.81, which is close to the 1.7 scaling constant used in IRT: the logistic ogive (in logit units) is about 1.7 times wider than the normal ogive (see the difference between logit and probit [50]).
At the end of 2016, more than 10,977 papers were found in a search with the keyword "cut point". None discussed the determination of cutting points for CAT, in which the number of items differs across respondents. In practice, we usually do not know the patient's true- and false-positive disease-specific status, as with the TDS. The issue we face in clinical settings is how to identify the degree of a patient's problem. In this study, if cutting points at -0.7 and 0.7 logits are selected for the TDS, the corresponding raw-score cutting points can be obtained by the formula (= total score × the probabilities 0.33 and 0.67), where 0.33 comes from exp(-0.7)/(1 + exp(-0.7)), 0.67 comes from 1 - exp(-0.7)/(1 + exp(-0.7)), and the total score = 66 for the 22-item TDS with the 4-point format (from 0 to 3) defined in Methods. The cutting points in raw scores can thus be set at < 22 (= 66 × 0.33) and ≥ 44 (= 66 × 0.67) to separate three strata of depression severity. The prevalence (or incidence) rate is then easy to calculate and compare, whether with a paper-and-pencil format or with CAT, in the future.
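The conversion from logit cutting points to raw-score cutting points can be written out as a small sketch. It follows the formula in the text (maximum score multiplied by the logistic probability at each cutting point); the function names are ours, for illustration.

```python
import math

def logit_to_probability(logit):
    """Logistic transform: exp(x) / (1 + exp(x))."""
    return math.exp(logit) / (1.0 + math.exp(logit))

def raw_cut_points(max_score, low_cut=-0.7, high_cut=0.7):
    """Raw-score equivalents of two logit cutting points."""
    return (max_score * logit_to_probability(low_cut),
            max_score * logit_to_probability(high_cut))

low, high = raw_cut_points(22 * 3)  # 22 items, each scored 0 to 3
print(round(low), round(high))
```

Calling `raw_cut_points` with another maximum score, such as 80 for a 20-item scale scored 0 to 4, gives the corresponding cutting points for that scale.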

Online CAT assessment
At the end of 2016, 757 papers were retrieved from the US National Library of Medicine, National Institutes of Health (pubmed.org) when searching with the keywords "computer adaptive testing". None used an online assessment suited to smartphones until the online skin cancer CAT was published [51]. We expect that more papers on the usefulness of online CAT will be published in the future, as all forms of Web-based technology are rapidly increasing [52].

Unidimensional scale detection
Many studies [42,53] have addressed the detection of scale unidimensionality. In PubMed and BioMed Central, we found 1,005 and 333 papers with the keyword "unidimensionality", and 359,957 and 23,902 results for "depression". In the current study, we demonstrated the method Tennant [54] suggested, which uses three steps to assess scale unidimensionality: conduct prior testing using Horn's parallel analysis; use Rasch fit statistics; and run post hoc tests using Rasch standardized residual loadings and Smith's [41] independent t tests to compare estimates (percentage outside ±1.96 below 5%). In addition, we recommend including the dimension coefficient (DC ≥ 0.67) and PTME (> 0.40) when detecting scale unidimensionality.

Strengths of this study
Four goals have been reached in this study: (1) we demonstrated that the Taiwanese Depression Scale (TDS) is a unidimensional construct; (2) cutting points at -0.7 and 0.7 logits were recommended to future studies for computing the workplace depression prevalence rate using the TDS; (3) CAT was 36% more efficient than NAT; and (4) online CAT is applicable in practice. The 36% efficiency gain over NAT arose because we added another termination rule to the CAT: the mean of the last five changes between consecutive ability estimates being less than 0.05. This rule makes the number of items administered smaller than in other studies [42,53]. If all CAT cases were controlled only by the termination rule of SE less than 0.44 (≈ SQRT(1 - 0.8) = SQRT(1 - reliability)), the precision measured by SE on CAT (Figure 3) would be substantially higher than under the dual stopping conditions used in this study, because a longer test yields a higher reliability (or a smaller measure SE) than a shorter one.
In addition, the online CAT with audio and picture animations is available for interested readers to try by scanning the QR code in Figure 6, which is rare in previously published articles. Furthermore, cutting points set at -0.7 and 0.7 logits with equal stratum sizes might be generalized to other conditions or diseases when the patient's true- and false-positive disease-specific status is not known beforehand. As with the TDS, we merely intend to identify the grade of the condition and compare it with the norm.

Limitations of the study
Several issues should be considered more thoroughly in further studies. First, the secondary data source prevented us from identifying differential item functioning (DIF) across gender or race groups. Second, the high incidence rate (52.7%) cannot be generalized as a prevalence rate because the sample (comprising 213 depression patients and 200 normal undergraduates) was deliberately assembled for validating the TDS only (Table 1) rather than for calculating a real-world prevalence rate. More studies are recommended to assess the generalizability of this study with different samples using the same cutting points and the same version of the TDS. Third, the online CAT is not yet equipped with some functionality that would be useful in practice, such as preventing cheating and detecting aberrant responses, which should be included in future versions. Fourth, although the scale's Cronbach's α coefficient was 0.97, we conservatively determined that the scale had three person strata according to the Rasch separation reliability (= 0.85) and the literature [36][37][38]. More than three strata could be distinguished if the separation index reached a much higher level, which would affect the determination of appropriate cutting points for the TDS.

Conclusion
The CAT-based TDS, which forms a unidimensional construct, reduces respondents' burden without compromising measurement precision and increases response efficiency. The online TDS module developed by the authors is recommended for assessing hospital employees or other workplace members using the criteria at -0.7 and 0.7 logits (or < 22 and < 44 in summed score) to classify depression as one of three levels (high, moderate, or low).

Ethics approval and consent to participate
The secondary data were retrieved from two published papers [31,32] and used as the CAT item pool, for simulation, and for demonstration in MS Excel format. The way we extracted data from the papers is fully disclosed in a video in Additional file 3.

Availability of data and materials
All data used for verifying the proposed computer module in this study were extracted from two published papers [31,32]. The Microsoft Excel-based computer module, including the demonstration data, can be downloaded from the supplementary information files.

Authors' contributions
TW developed the study concept and design. TW and YS analyzed and interpreted the data. TW drafted the manuscript, and all authors provided critical revisions for important intellectual content. All authors have read and approved the final manuscript as well as agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was supervised by SC.