Daniel R Jeske

doi:10.19080/BBOAJ.2018.08.555728

Review Article

Metrics Used When Evaluating the Performance of Statistical Classifiers

Daniel R Jeske*

Department of Statistics, University of California, USA

Submission: June 05, 2018; Published: August 01, 2018

*Corresponding author: Daniel R Jeske, Department of Statistics, University of California, Riverside, CA, USA, Tel: 951-827-3014; Email: daniel.jeske@ucr.edu

How to cite this article: Daniel R Jeske. Metrics Used When Evaluating the Performance of Statistical Classifiers. Biostat Biometrics Open Acc J. 2018; 8(1): 555728. DOI: 10.19080/BBOAJ.2018.08.555728

Abstract

This article reviews important performance metrics that are used to evaluate the accuracy of statistical classifiers. How the metrics are used to construct Receiver Operator Characteristic (ROC) curves, Predictive ROC (PROC) curves, and Precision-Recall (PR) curves is also discussed. Relationships between the metrics are revealed.

Keywords: False positive rate; False negative rate; Specificity; Sensitivity; Positive predictive value; Negative predictive value; Precision; Recall; Youden threshold

Abbreviations: ROC: Receiver Operator Characteristic Curves; PROC: Predictive ROC Curves; PR: Precision-Recall; AUC: Area Under the Curve; NPV: Negative Predictive Value; PPV: Positive Predictive Value; FPR: False Positive Rate; FNR: False Negative Rate

Introduction

A statistical classifier maps a set of features, X, to a class variable, C. The features X can be a mix of categorical and interval variables, and the class C is one of a finite number of possible classes. Applications are frequently concerned with two classes, and in this context the classifier is referred to as a binary classifier. In medical diagnostic applications, X could represent patient characteristics and C=0 (C=1) might correspond to healthy (diseased) patient status.

There are a number of methods available for developing a statistical classifier, including Bayes classifiers, tree classifiers, support vector classifiers, neural network classifiers, logistic regression classifiers, and ensemble classification methods. See, for example, reference [1] for details on these methods. Using training data that has both features and the class label for a sample of subjects, the classification methods construct a predictive function T(·) that maps X to a predicted class label, $\hat{C}$. For binary classifiers,

$$\hat{C} = \begin{cases} 1, & T(X) > u \\ 0, & T(X) \le u \end{cases} \qquad (1)$$

where u is a threshold that is chosen to trade off performance objectives for the classifier. Equation (1) assumes, without loss of generality, that large values of T(X) correlate with class C=1. It is understood in practice that no single classification method works uniformly best, and typically investigators will experiment with a variety of options and choose the one that works best for their application.
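As a minimal sketch of the thresholding rule in Equation (1), the snippet below assumes the convention that class 1 is predicted when T(X) strictly exceeds u. The function name `classify` and the score values are hypothetical and serve only as an illustration.

```python
def classify(score, u):
    """Predict class 1 when the score T(X) exceeds the threshold u, else class 0."""
    return 1 if score > u else 0

# Hypothetical values of T(X) for four subjects (illustrative only).
scores = [0.12, 0.47, 0.80, 0.95]
predicted = [classify(t, u=0.5) for t in scores]
```

Raising u moves subjects from predicted class 1 to predicted class 0, which is the trade-off the threshold controls.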

When choosing the threshold u, there are four important performance metrics that should be examined. The key to understanding these metrics is the notion of class-conditional distributions of T(X). Let $F_0$ ($F_1$) denote the conditional cumulative distribution function of T(X) given C=0 (C=1).

The ROC Curve

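The first two performance metrics of importance are the false positive rate (FPR) and false negative rate (FNR), defined as

$$\mathrm{FPR}(u) = P(\hat{C}=1 \mid C=0) = 1 - F_0(u), \qquad \mathrm{FNR}(u) = P(\hat{C}=0 \mid C=1) = F_1(u).$$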

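Alternative terminology used with the ROC curve is sensitivity and specificity, which are defined as

$$\mathrm{sensitivity}(u) = 1 - \mathrm{FNR}(u) = 1 - F_1(u), \qquad \mathrm{specificity}(u) = 1 - \mathrm{FPR}(u) = F_0(u).$$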
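In practice the class-conditional distributions are unknown, and the four rates are usually estimated from labelled scores. The sketch below shows one way to do this, assuming class 1 is predicted when T(X) > u. The function name `error_rates` and the data are hypothetical.

```python
def error_rates(scores, classes, u):
    """Empirical FPR, FNR, sensitivity and specificity at threshold u."""
    neg = [t for t, c in zip(scores, classes) if c == 0]  # scores with C = 0
    pos = [t for t, c in zip(scores, classes) if c == 1]  # scores with C = 1
    fpr = sum(t > u for t in neg) / len(neg)    # class 0 predicted as class 1
    fnr = sum(t <= u for t in pos) / len(pos)   # class 1 predicted as class 0
    return {"FPR": fpr, "FNR": fnr,
            "sensitivity": 1 - fnr, "specificity": 1 - fpr}

# Hypothetical scores T(X) and true class labels (illustrative only).
scores = [0.1, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9]
classes = [0, 0, 0, 0, 1, 1, 1, 1]
rates = error_rates(scores, classes, u=0.5)
```

With these data, one of the four class 0 scores exceeds 0.5 and one of the four class 1 scores does not, so both error rates are 0.25.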

The Receiver Operating Characteristic (ROC) curve is a plot of the locus of points (FPR(u), 1 − FNR(u)), that is, (1 − specificity, sensitivity), obtained by varying u. Figure 1 shows a schematic picture of an ROC curve, and it can be seen how it facilitates choosing a threshold u that strikes a balance between the conflicting objectives of simultaneously achieving high sensitivity and high specificity [2-4]. A commonly used threshold is $u^{*} = \arg\max_{u} \{\mathrm{sensitivity}(u) + \mathrm{specificity}(u) - 1\}$, which is known as the Youden threshold [5]. An alternative threshold is the point on the ROC curve that is closest to the optimal point (0,1).
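The two threshold choices can be illustrated with a small grid search. This is a sketch under assumed class-conditional distributions: the normal distributions, the grid, and the helper names (`roc_point`, `youden_j`, `dist_sq`) are hypothetical and not taken from the article.

```python
from statistics import NormalDist

# Assumed class-conditional distributions of T(X) (illustrative only).
F0 = NormalDist(mu=0.0, sigma=1.0)  # scores given C = 0
F1 = NormalDist(mu=2.0, sigma=1.0)  # scores given C = 1

thresholds = [-3 + 0.01 * i for i in range(801)]  # grid from -3 to 5

def roc_point(u):
    """Return (FPR(u), sensitivity(u)) = (1 - F0(u), 1 - F1(u))."""
    return 1 - F0.cdf(u), 1 - F1.cdf(u)

def youden_j(u):
    """Youden index: sensitivity + specificity - 1 = sensitivity - FPR."""
    fpr, sens = roc_point(u)
    return sens - fpr

def dist_sq(u):
    """Squared distance from the ROC point to the ideal corner (0, 1)."""
    fpr, sens = roc_point(u)
    return fpr ** 2 + (1 - sens) ** 2

u_youden = max(thresholds, key=youden_j)   # maximizes the Youden index
u_closest = min(thresholds, key=dist_sq)   # closest ROC point to (0, 1)
```

In this symmetric equal-variance example both criteria select a threshold near the midpoint of the two means; with unequal variances the two choices generally differ.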

The area under the curve (AUC) is often reported as a measure of merit for the particular methodology used to develop the classifier [6]. AUC is a global measure that is not particular to a single threshold, and as such it loses its relevance with a specific implementation of the classifier that requires choosing one threshold.
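As a rough illustration of how AUC summarizes the whole curve rather than a single threshold, the sketch below integrates an ROC curve with the trapezoidal rule. The normal distributions and the grid are assumptions made for the example. For two normal score distributions, AUC also has the closed form Φ((μ₁ − μ₀)/√(σ₀² + σ₁²)), which is used here as a check.

```python
from math import sqrt
from statistics import NormalDist

# Assumed class-conditional distributions of T(X) (illustrative only).
F0 = NormalDist(mu=0.0, sigma=1.0)  # scores given C = 0
F1 = NormalDist(mu=2.0, sigma=1.0)  # scores given C = 1

# Thresholds in decreasing order, so that FPR increases along the curve.
thresholds = [8 - 0.01 * i for i in range(1401)]  # from 8 down to -6
fpr = [1 - F0.cdf(u) for u in thresholds]
sens = [1 - F1.cdf(u) for u in thresholds]

# Trapezoidal rule over consecutive ROC points.
auc_trapezoid = sum(
    (fpr[i + 1] - fpr[i]) * (sens[i + 1] + sens[i]) / 2
    for i in range(len(thresholds) - 1)
)

# Closed form for two normal score distributions.
auc_closed_form = NormalDist().cdf(
    (F1.mean - F0.mean) / sqrt(F0.variance + F1.variance)
)
```

Every threshold on the grid contributes to the sum, which is why the resulting number says nothing about the operating point that is eventually chosen.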

The PROC Curve

A second pair of important performance metrics for a classifier are negative predictive value (NPV) and positive predictive value (PPV), defined as

$$\mathrm{NPV}(u) = P(C=0 \mid \hat{C}=0), \qquad \mathrm{PPV}(u) = P(C=1 \mid \hat{C}=1).$$

NPV and PPV are interpreted as the fraction of class 0 predictions that are correct and the fraction of class 1 predictions that are correct, respectively. Whereas FPR and FNR measure error rates of the classifier before the prediction is made (a priori), NPV and PPV measure the accuracy of the classifier after the prediction is made (a posteriori). In medical diagnostic applications, FPR and FNR aid in determining whether or not it is useful to perform the diagnostic procedure, and NPV and PPV aid in interpreting the results if it is performed. Each of the metrics plays a role in providing a comprehensive assessment of the performance capability of the classifier.

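In order to calculate NPV and PPV, it is necessary to know the prevalence of class C=1, denoted by $\pi_1$. This necessity is revealed in the following formulas for NPV and PPV, which follow from use of Bayes’ rule,

$$\mathrm{NPV}(u) = \frac{(1-\pi_1)\,F_0(u)}{(1-\pi_1)\,F_0(u) + \pi_1\,F_1(u)}, \qquad \mathrm{PPV}(u) = \frac{\pi_1\,[1-F_1(u)]}{\pi_1\,[1-F_1(u)] + (1-\pi_1)\,[1-F_0(u)]}.$$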
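The dependence on prevalence can be made concrete with a short calculation. The sketch below applies Bayes' rule to a classifier with specificity F₀(u) = 0.9 and FNR F₁(u) = 0.1 at two prevalences. The function name `predictive_values` and all numerical inputs are hypothetical.

```python
def predictive_values(f0_u, f1_u, pi1):
    """NPV and PPV at a threshold u via Bayes' rule.

    f0_u: F0(u), the specificity at u
    f1_u: F1(u), the false negative rate at u
    pi1:  prevalence of class 1
    """
    pi0 = 1 - pi1
    npv = pi0 * f0_u / (pi0 * f0_u + pi1 * f1_u)
    ppv = pi1 * (1 - f1_u) / (pi1 * (1 - f1_u) + pi0 * (1 - f0_u))
    return npv, ppv

# Same sensitivity (0.9) and specificity (0.9), two assumed prevalences.
npv_rare, ppv_rare = predictive_values(0.9, 0.1, pi1=0.01)
npv_common, ppv_common = predictive_values(0.9, 0.1, pi1=0.30)
```

At 1% prevalence the PPV is only about 0.08 even though sensitivity and specificity are both 0.9, whereas at 30% prevalence it rises to about 0.79. The ROC metrics alone would not reveal this difference.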

The predictive ROC (PROC) curve is a plot of the locus of points (1 − NPV(u), PPV(u)) obtained by varying u. Unlike the ROC curve, which is always monotone increasing, the PROC curve need not be monotone increasing. Monotonicity of the PROC curve requires the hazard and reversed hazard functions of $F_0$ and $F_1$ to be ordered [7]. Figure 2 illustrates the general result that when $F_0$ and $F_1$ are homogeneous normal distributions, the PROC curve is monotone, but it is not monotone for the heterogeneous normal case.
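The contrast between the two normal cases can be checked numerically. The sketch below traces PROC points on a grid and tests whether both coordinates are nondecreasing in u, which is a simple sufficient check for a monotone increasing curve and not the hazard-ordering criterion of [7]. The distribution parameters, the prevalence of 0.5, and the helper names are assumptions chosen for illustration, with "homogeneous" taken to mean equal variances.

```python
from statistics import NormalDist

def proc_points(f0, f1, pi1, thresholds):
    """Return the PROC points (1 - NPV(u), PPV(u)) over a grid of thresholds."""
    points = []
    for u in thresholds:
        a, b = f0.cdf(u), f1.cdf(u)  # F0(u), F1(u)
        npv = (1 - pi1) * a / ((1 - pi1) * a + pi1 * b)
        ppv = pi1 * (1 - b) / (pi1 * (1 - b) + (1 - pi1) * (1 - a))
        points.append((1 - npv, ppv))
    return points

def both_coordinates_nondecreasing(points, tol=1e-12):
    """True if 1 - NPV and PPV both never decrease as u increases."""
    return all(x2 >= x1 - tol and y2 >= y1 - tol
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

thresholds = [-3 + 0.05 * i for i in range(141)]  # grid from -3 to 4

# Equal variances (assumed example of the homogeneous case).
homogeneous = proc_points(NormalDist(0, 1), NormalDist(1, 1), 0.5, thresholds)
# Unequal variances (assumed example of the heterogeneous case).
heterogeneous = proc_points(NormalDist(0, 1), NormalDist(1, 2), 0.5, thresholds)

monotone_homogeneous = both_coordinates_nondecreasing(homogeneous)
monotone_heterogeneous = both_coordinates_nondecreasing(heterogeneous)
```

In the unequal-variance example the wider class 1 distribution dominates the far left tail, so 1 − NPV first falls and then rises as u increases, and the curve doubles back on itself.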

Discussion

The literature on classifier performance metrics also includes discussion of the precision-recall (PR) curve [8-9]. Precision is an alternative term for PPV and recall is an alternative term for sensitivity. The PR curve is therefore an alternative plot for showing two of the four important performance metrics that have been discussed. The diversity in the references included in this review reflects the fact that research pertaining to the use and evaluation of statistical classifiers spans a variety of different disciplines.
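Because precision is PPV, a point on the PR curve depends on prevalence while its recall coordinate does not. The following sketch computes one PR point at two prevalences. The normal distributions, the threshold, the prevalences, and the function name `pr_point` are hypothetical.

```python
from statistics import NormalDist

# Assumed class-conditional distributions of T(X) (illustrative only).
F0 = NormalDist(mu=0.0, sigma=1.0)  # scores given C = 0
F1 = NormalDist(mu=2.0, sigma=1.0)  # scores given C = 1

def pr_point(u, pi1):
    """Return (recall, precision) = (sensitivity(u), PPV(u)) at threshold u."""
    recall = 1 - F1.cdf(u)   # sensitivity
    fpr = 1 - F0.cdf(u)      # false positive rate
    precision = pi1 * recall / (pi1 * recall + (1 - pi1) * fpr)
    return recall, precision

recall_balanced, precision_balanced = pr_point(u=1.0, pi1=0.50)
recall_rare, precision_rare = pr_point(u=1.0, pi1=0.05)
```

The recall is identical in both cases, but the precision drops substantially when class 1 is rare, so a PR curve is specific to the prevalence at which it was drawn.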

