Clause 61: The Pushback Blog

Because ideas have consequences

Testing, Sensitivity, Specificity and Errors

There is already an abundance of material discussing decisions under uncertainty in general, medical testing specifically, and how to interpret the results. This is a basic summary of the terms you will see and what they mean.

Many disciplines encountered these issues in parallel as they developed. Since practitioners worked in their silos, they developed their own jargon for the same phenomena (natch). I am going to use the example of a medical test, but similar considerations exist in data science, machine learning and social science methodology.

A provider performs a medical test on a subject, and it has a binary outcome: it predicts that the subject does or does not have the condition of interest. There is also an objective reality: the subject really does or does not have the condition of interest. The test attempts to pierce the veil of uncertainty around the objective reality, but tests will always have inaccuracies. We can build a matrix around the possible results for a single test:

Simple Results Matrix

                   Really Positive    Really Negative
    Test Positive  True positive      False positive
    Test Negative  False negative     True negative

We would like all the positive test results to be true positives, and all the negative test results to be true negatives. In the real world, we are not going to get that, so we have to be able to take a large number of test outcomes and analyze them.

We repeat this test over a number of subjects in a population, such as a sample of participants in a clinical trial. We get a stack of outcomes that we can aggregate and analyze statistically:

Aggregate Results Matrix

                   Really Positive          Really Negative
    Test Positive  True positives (TP)      False positives (FP)
    Test Negative  False negatives (FN)     True negatives (TN)

    Sensitivity = TP / (TP + FN)
    Specificity = TN / (TN + FP)
    False positive rate = FP / (FP + TN) = 1 - specificity
    False negative rate = FN / (FN + TP) = 1 - sensitivity

There are the customary tradeoffs between the statistical confidence a large sample provides and the time and expense of gathering the data as the sample size increases. In some social sciences, people just say that if your sample size is greater than 30, you’re good. Medical research is more rigorous than that; the peer review of a medical research finding can include discussion of how the researchers determined the sample size.

If you are into data science, you recognize the test as a kind of classifier, and you call this matrix a confusion matrix. In social sciences, a false positive is sometimes called a type I error and a false negative a type II error.
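
As a minimal sketch of that data science view, assuming scikit-learn is available (the label arrays below are hypothetical stand-ins for real trial outcomes):

    # Minimal sketch, assuming scikit-learn; the labels are hypothetical.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 0, 1, 0, 0, 0, 1]  # objective reality: 1 = has the condition
    y_pred = [1, 0, 1, 0, 0, 0, 0, 1]  # test outcome:      1 = test positive

    # For binary 0/1 labels, scikit-learn lays the matrix out as
    # [[TN, FP], [FN, TP]], so ravel() unpacks it in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=4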

There are a number of names for these measures:

  • Sensitivity is also known as true positive rate (TPR), recall, power or probability of detection;
  • Specificity is also known as true negative rate (TNR) or selectivity;
  • False positive rate is also known as fall-out or probability of false alarm;
  • False negative rate is also known as miss rate.

It can be difficult to keep sensitivity and specificity straight:

  • Sensitivity measures the ability to detect the positive cases.
  • Specificity measures the ability to avoid flagging the negative cases.
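
Here is a short sketch of these measures in plain Python, using the Test A counts from the trial described later in this post:

    # Aggregate counts; these are the Test A results from the trial below.
    tp, fp, fn, tn = 21, 24, 2, 4953

    sensitivity = tp / (tp + fn)  # share of really-positive cases we caught
    specificity = tn / (tn + fp)  # share of really-negative cases we cleared
    fp_rate = fp / (fp + tn)      # 1 - specificity: false alarms
    fn_rate = fn / (fn + tp)      # 1 - sensitivity: misses

    print(f"sensitivity={sensitivity:.2%}  specificity={specificity:.2%}")
    print(f"FP rate={fp_rate:.2%}  FN rate={fn_rate:.2%}")
    # sensitivity=91.30%  specificity=99.52%
    # FP rate=0.48%  FN rate=8.70%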

Which is worse, false positives or false negatives? There is no uniform answer. It depends on a number of factors, including:

  • Subject domain;
  • Relative consequences of the two classes of errors;
  • Other available risk mitigation strategies.

Consider the subject domain of criminal law. In the Anglo-American tradition, we believe:

Better that ten guilty persons escape, than that one innocent suffer.
— William Blackstone (1723-1780).

Mapping this to the matrix, there is an objective reality — did the defendant really do it? — and a test result — the court verdict. Blackstone’s Ratio says that it is better to have ten false negatives than one false positive in the domain of a criminal trial.

In medicine, there are risks either way. A false negative can cause you to think the patient is healthy when she is not. A false positive can cause you to attempt therapies that are ineffective or even harmful to the patient.

In public health, false positives cause you to overreact; false negatives cause you to fail to react. This punts the question to the next stop: is it better to react excessively or insufficiently to a public health threat? Health professionals have an instinctive preference to err on the side of excess caution. However, the resources do not exist to be excessively cautious against every threat, so an excessive reaction to one threat may take resources away from defenses against several others.

There are calculations that attempt to account for both false positives and false negatives, but these always assign implicit weights to them. For example, a common measure of accuracy takes the sum of the true positive results and the true negative results and divides it by the total population. However, this implicitly gives equal weight to a false positive and a false negative, which may not be the trade-off you really want for the problem you have in mind.
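
To make the implicit weighting concrete, here is a sketch comparing plain accuracy to a cost-weighted error measure; the 10:1 cost ratio is a made-up illustration, not a recommendation:

    tp, fp, fn, tn = 21, 24, 2, 4953
    total = tp + fp + fn + tn

    # Accuracy charges every error the same price, whichever kind it is.
    accuracy = (tp + tn) / total

    # A weighted alternative: suppose a miss costs ten times a false alarm.
    miss_cost, false_alarm_cost = 10, 1
    weighted_cost = miss_cost * fn + false_alarm_cost * fp

    print(f"accuracy={accuracy:.2%}  weighted error cost={weighted_cost}")
    # accuracy=99.48%  weighted error cost=44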

Let’s imagine a disease that really occurs in 0.5% of the population. We pull a sample of 5,000 people to test; this sample happens to contain 23 persons who actually have the disease. The people performing the trial don’t know who has the disease among the sample population, nor do they know anything about the internal characteristics of the tests.

We have two tests we want to try on this population, test A and test B. The participants in the trial do not know that test B is actually worthless; it will never, under any circumstances, return a positive result.

Here are the results collected for the trial:

                       Test A      Test B
    Real Positives         23          23
    Real Negatives      4,977       4,977
    Test Positive          45           0
    Test Negative       4,955       5,000
    True Positives         21           0
    True Negatives      4,953       4,977
    False Positives        24           0
    False Negatives         2          23
    Sensitivity        91.30%       0.00%
    Specificity        99.52%     100.00%
    FP Rate             0.48%       0.00%
    FN Rate             8.70%     100.00%
    Accuracy           99.48%      99.54%

If you just look at accuracy, test B looks a little better than test A, even though test B is really useless.
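
Here is a sketch that reproduces the numbers above and shows how accuracy hides the difference:

    def metrics(tp, fp, fn, tn):
        total = tp + fp + fn + tn
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        }

    # Counts from the trial above: 23 real positives among 5,000 subjects.
    results = {
        "Test A": metrics(tp=21, fp=24, fn=2, tn=4953),
        "Test B": metrics(tp=0, fp=0, fn=23, tn=4977),
    }

    for name, m in results.items():
        print(name, {k: f"{v:.2%}" for k, v in m.items()})
    # Test B "wins" on accuracy (99.54% vs. 99.48%) while detecting nothing.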

Moral: beware of reducing comparative measures to a single number.

Written by srojak

April 21, 2020 at 4:30 pm