There are many medical tests for detecting the presence of a disease or condition. Some examples include tests for lesions, cancer, pregnancy, or COVID-19. While these tests are usually accurate, they’re not perfect. In addition, some tests are designed to detect the same condition, but use a different method. A recent example are PCR and antigen tests for COVID-19. In these cases we might want to compare the two tests on the same subjects. This is known as a *paired study design*. This can help us assess if one test is better than the other, or maybe just as good as the other. The latter may be of interest if an alternative test is cheaper or less painful or invasive.

In this article we discuss some approaches for carrying out such analyses in R.

We’ll look at data presented in the paper *Comparing the Predictive Values of Diagnostic Tests: Sample Size and Analysis for Paired Study Designs* by Moskowitz and Pepe (2006). In the paper the authors analyze data from the National Cystic Fibrosis Patient Registry. The goal is to compare two prognostic factors as tests for identifying subjects that will develop severe respiratory infection. The sample size is 11,960, of which 5,054 experienced severe respiratory infection.

The first prognostic factor, or test, is a positive culture for the bacterium *Pseudomonas Aeruginosa*, which we will call “PA” for the remainder of the article. In the two-way table below we present the results of the test (positive or negative) versus whether or not an infection occurred (presence or absence). We enter the data as a matrix and add names to the rows and columns using the `dimnames()`

function. Finally we print the matrix with margin totals courtesy of the `addmargins()`

function.

PA <- matrix(c(185, 4869, 123, 6783), ncol = 2) dimnames(PA) <- list("Test" = c("Positive", "Negative"), "Disease" = c("Present", "Absent")) addmargins(PA) ## Disease ## Test Present Absent Sum ## Positive 185 123 308 ## Negative 4869 6783 11652 ## Sum 5054 6906 11960

Two common statistics calculated from such a table are *sensitivity* and *specificity*. Each estimate the probability of a correct test result given disease or condition status. Sensitivity is the proportion of positive tests out of everyone who has the disease or condition. Specificity is the proportion of negative tests out of everyone who does not have the disease or condition.

In our data:

- sensitivity = 185/5054 = 0.037
- specificity = 6783/6906 = 0.982

We can calculate these using matrix indexing as follows:

# sensitivity: 185/5054 PA[1,1]/sum(PA[1,1], PA[2,1]) ## [1] 0.03660467

# specificity: 6783/6906 PA[2,2]/sum(PA[1,2], PA[2,2]) ## [1] 0.9821894

It appears the PA test has about a 3% chance of being “positive” given someone with cystic fibrosis (CF) will develop a severe respiratory infection. That’s not good at all. On the other hand, it has about a 98% chance of being “negative” given someone with CF will not develop a severe respiratory infection.

We can flip sensitivity and specificity around and calculate two other statistics: *Positive Predictive Value* (PPV) and *Negative Predictive Value* (NPV). The first estimates the probability of having the disease or condition given a positive test result. The latter estimates the probability of not having the disease or condition given a negative test result.

In our data:

- PPV = 185/308 = 0.601
- NPV = 6783/11652 = 0.582

We can also calculate these using matrix indexing:

# PPV: 185/308 PA[1,1]/sum(PA[1,1], PA[1,2]) ## [1] 0.6006494

# NPV: 6783/11562 PA[2,2]/sum(PA[2,1], PA[2,2]) ## [1] 0.5821318

It appears there is a 60% chance someone with CF will develop a severe respiratory infection given a positive PA test. Likewise there is about a 58% chance someone with CF will not develop a respiratory infection given a negative PA test.

According to Moskowitz and Pepe (2006) sensitivity and specificity statistics tend to be used in the early stages of test development, while PPV and NPV are used in the later stages. At first we’re interested in the performance of the test given disease (sensitivity and specificity). But in the end – in a clinical setting – we’re more interested in the disease status given the test result (PPV and NPV).

We can quickly calculate these statistics in R using the `acc.1test()`

in the DTComPair package. To do this we need to supply the data as a `tab.1test`

object, which we can do with the `read.tab.1test()`

function. We enter the cell totals from our table above into four arguments:

`a`

: The number of diseased subjects with a positive test.`b`

: The number of non-diseased subjects with a positive test.`c`

: The number of diseased subjects with a negative test.`d`

: The number of non-diseased subjects with a negative test.

We also supply the name of our test to the `testname`

argument. Then we save the result to “PA_tab” and use the `acc.1test()`

function to calculate our statistics.

# install.packages("DTComPair") library(DTComPair) PA_tab <- read.tab.1test(a = 185, b = 123, c = 4869, d = 6783, testname = "PA") acc.1test(PA_tab) ## Diagnostic accuracy of test 'PA' ## ## (Estimates, standard errors and 95%-confidence intervals) ## ## Est. SE Lower CL Upper CL ## Sensitivity 0.03660467 0.002641514 0.0314274 0.04178194 ## Specificity 0.98218940 0.001591562 0.9790700 0.98530881 ## PPV 0.60064935 0.027906948 0.5459527 0.65534596 ## NPV 0.58213182 0.004569094 0.5731766 0.59108708 ## ## Est. SE (log) Lower CL Upper CL ## PDLR 2.0552183 0.114860010 1.6409258 2.5741092 ## NDLR 0.9808651 0.003184913 0.9747613 0.9870071

The output presents estimates for all four statistics along with standard errors and 95% confidence intervals. Let’s recap these interpretations one more time:

- sensitivity: about 3.6% of subjects will get a positive bacterium test if they develop a respiratory infection. 95% CI: [3.1, 4.2]
- specificity: about 98.2% of subjects will get a negative bacterium test if they do not develop a respiratory infection. 95% CI: [97.9, 98.5]
- PPV: about 60% of subjects develop a respiratory infection given a positive bacterium test. 95% CI: [54.6, 65.5]
- NPP: about 58% of subjects do not develop a respiratory infection given a negative bacterium test. 95% CI: [57.3, 59.1]

The other two estimates, PDLR and NDLR, are for the positive and negative diagnostic likelihood ratios. More information on these statistics, and the ones we just discussed, can be found on this helpful web page on the Columbia Public Health web site.

Getting back to our Cystic Fibrosis data, the second prognostic factor of interest is the occurrence of a prior respiratory infection. The paper refers to these as pulmonary exacerbations, or PExs. As we did before, we present the results of the test (positive or negative) versus whether or not an infection occurred (presence or absence) in a two-way table. In this case “positive” means the subject had a prior PEx, and “negative” means the subject did not have a prior PEx.

PEx <- matrix(c(3630, 1424, 1342, 5564), ncol = 2) dimnames(PEx) <- list("Test" = c("Positive", "Negative"), "Disease" = c("Present", "Absent")) addmargins(PEx) ## Disease ## Test Present Absent Sum ## Positive 3630 1342 4972 ## Negative 1424 5564 6988 ## Sum 5054 6906 11960

And once again we can calculate statistics of interest using functions from the DTComPair package:

PEx_tab <- read.tab.1test(a = 3630, b = 1342, c = 1424, d = 5564, testname = "PEx") acc.1test(PEx_tab) ## Diagnostic accuracy of test 'PEx' ## ## (Estimates, standard errors and 95%-confidence intervals) ## ## Est. SE Lower CL Upper CL ## Sensitivity 0.7182430 0.006327839 0.7058406 0.7306453 ## Specificity 0.8056762 0.004761348 0.7963442 0.8150083 ## PPV 0.7300885 0.006295539 0.7177495 0.7424275 ## NPV 0.7962221 0.004818582 0.7867778 0.8056663 ## ## Est. SE (log) Lower CL Upper CL ## PDLR 3.696115 0.02603793 3.512222 3.8896361 ## NDLR 0.349715 0.02322303 0.334154 0.3660005

The PEx test appears to be superior to the PA test in terms of sensitivity, PPV and NPV.

Let’s say we wanted to formally compare the PPV statistics for each test. The PPV for the PA test is 0.60. The PPV for the PEx test is 0.73. The latter test seems better. It correctly identified 73% of the subjects with a respiratory infection based on them having a prior infection. The bacterium test only identified 60% of the subjects. Is this difference meaningful? Would we expect to see a difference like this again given a different sample?

We can statistically compare PPV and NPV values using the `pv.rpv()`

function, also from the DTComPair package. Before we show how to do this it’s important to note that the following procedure is for *paired data*. That is, it assumes both tests were conducted on the same patients.

Unfortunately we cannot use our earlier data objects to carry out this analysis. Instead we need to set up our data as a paired data object. We can do that using the `tab.paired()`

function. It requires three arguments:

`d`

: A numeric vector specifying the gold-standard results (1 = presence of disease, 0 = absence of disease).`y1`

: A numeric vector specifying the results of diagnostic test 1 (1 = positive, 0 = negative).`y2`

: A numeric vector specifying the results of diagnostic test 2 (1 = positive, 0 = negative).

We first create vectors for each of these arguments. Since we have 5,054 subjects with infection, we have 11,096 – 5,054 = 6,906 without infection. We use the `rep()`

function to create and repeat the necessary ones and zeroes.

d <- c(rep(1, 5054), rep(0, 6906))

Creating the `y1`

and `y2`

vectors take a little more work. We basically create a vector of ones and zeroes according to the original accuracy tables we created above. We start in the upper-left cell and work our way down the columns.

# PEx y1 <- c(rep(1,3630), rep(0,1424), rep(1, 1342), rep(0,5564)) # PA y2 <- c(rep(1,185), rep(0,4869), rep(1, 123), rep(0,6783))

We then use the `tab.paired()`

function to create our object, which we save as “paired.d”. Notice we also use the `testnames`

argument to label our tests. Printing the data object shows the paired data stratified by “Diseased” and “Non-diseased”. In our case “Diseased” corresponds to those with respiratory infections. Test 1 (PEx) is in the columns, Test 2 (PA) is in the rows. We see that, for example, 3,445 of the subjects with infection had a “positive” PEx test but a “negative” PA test.

paired.d <- tab.paired(d, y1, y2, testnames = c("PEx", "PA")) paired.d ## Two binary diagnostic tests (paired design) ## ## Test1: 'PEx' ## Test2: 'PA' ## ## Diseased: ## Test1 pos. Test1 neg. Total ## Test2 pos. 185 0 185 ## Test2 neg. 3445 1424 4869 ## Total 3630 1424 5054 ## ## Non-diseased: ## Test1 pos. Test1 neg. Total ## Test2 pos. 123 0 123 ## Test2 neg. 1219 5564 6783 ## Total 1342 5564 6906

With our paired data object created, we can carry out a formal statistical comparison between the PPV statistics for each test using the `pv.rpv()`

function. The function name is an abbreviation of “Comparison of Predictive Values using Relative Predictive Values.” In other words, instead of comparing the absolute difference in PPV or NPV, it compares the relative difference. We save the result to an object named “rpv.test” and then we look at the ppv results by extracting the `ppv`

component.

rpv.test <- pv.rpv(paired.d) rpv.test$ppv ## $test1 ## [1] 0.7300885 ## ## $test2 ## [1] 0.6006494 ## ## $rppv ## [1] 1.215499 ## ## $se.log.rppv ## [1] 0.04486625 ## ## $lcl.rppv ## [1] 1.113177 ## ## $ucl.rppv ## [1] 1.327225 ## ## $test.statistic ## [1] 4.349694 ## ## $p.value ## [1] 1.363278e-05

The output is a list of values. The first two elements, test1 and test2, list the PPV statistics for each test, which matches our calculations above. The third element is the relative PPV, which is 0.73/0.60 = 1.22. It says the Test 1 PPV (PEx) is about 22% higher than the Test 1 (PA) PPV. The next element is the estimated standard error of the relative PPV, and the following two elements provide a 95% confidence interval on the relative PPV: [1.11, 1.33] This suggests the PPV of the PEx prognostic factor is at least 11% higher than PA. Finally the last two elements return a test statistic and p-value for the null hypothesis test that the relative PPV is equal to 1 (Recall: a ratio of 1 implies no difference.) The p-value is small and provides strong evidence against the null. We conclude that if someone with cystic fibrosis receives a “positive” PEx test, they have a higher probability of developing a respiratory infection than if they receive a “positive” PA test.

We can also see results for the comparison of the two NPV statistics by extracting the `npv`

element. We present the code but leave the exploration of the output to the reader.

rpv.test$npv

The details of these procedures are detailed in Moskowitz and Pepe (2006).

The DTComPair package also provides a function for testing absolute differences in NPV and PPV called `pv.gs()`

. However it provides no confidence interval on the absolute difference between PPV or NPV.

To compare sensitivities and specificities, we use the McNemar Test. The name of the function for this procedure is `sesp.mcnemar()`

. Once again it requires a paired data object. Below we run the function on our data and save to “mtest”. To see the results for “sensitivity”, call `mtest$sensitivity`

. To see the results for specificity, call `mtest$specificity`

. Each test is for the null that the absolute difference is 0. Small p-values provide evidence against the null. Below we report the results for the comparison of specificity. The difference is large and the p-value is virtually 0.

mtest <- sesp.mcnemar(paired.d) mtest$specificity ## $test1 ## [1] 0.8056762 ## ## $test2 ## [1] 0.9821894 ## ## $diff ## [1] 0.1765132 ## ## $test.statistic ## [1] 1219 ## ## $p.value ## [1] 0

To get confidence intervals for differences in sensitivity and specificity, we can use the `sesp.diff.ci()`

function. As before you need a paired data object. Below we demonstrate the function and extract the 95% confidence interval for specificity. The lower and upper limits are presented under `diff.lcl`

and `diff.ucl`

. The Wald method is used to calculate the confidence interval. (Other methods are available. Enter `?sesp.diff.ci`

in the R console to view the documentation.) It appears the difference in specificity between the PEx (test1) and PA (test2) prognostic factors is plausibly between 0.17 and 0.19.

ci <- sesp.diff.ci(paired.d) ci$specificity ## test1 test2 diff diff.se diff.lcl diff.ucl ## 0.805676224 0.982189401 0.176513177 0.004587791 0.167521272 0.185505082

Hopefully you now have a better understanding of how to compare the accuracy of two binary diagnostic tests in a paired study design.

**References**

- Christian Stock, Thomas Hielscher (2014). DTComPair: comparison of binary diagnostic tests in a paired study design. R package version 1.0.3. URL: http://CRAN.R-project.org/package=DTComPair
- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153-7.
- Moskowitz, C.S., and Pepe, M.S. (2006). Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clin Trials, 3(3):272-9.
- R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
- “Evaluating Risk Prediction with ROC Curves.” Columbia Public Health, 30 March 2022, https://www.publichealth.columbia.edu/research/population-health-methods/evaluating-risk-prediction-roc-curves.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay FordStatistical Research Consultant

University of Virginia Library

March 30, 2022