— Jonathan A. Handler, MD, FACEP, FAMIA

— Craig F. Feied, MD, FACEP, FAMIA

Imagine this:

My personal library contains 100 books, 50 with red bindings and 50 with blue bindings. I hide coins inside 20 of the books: 10 of the red books each contain a penny and 10 of the blue books each contain a silver dollar. I’ll let you pick 10 red books and 10 blue books from my library (20 books in all). After you select your 20 books I’ll let you open each one to see whether it contains any money, and you can keep whatever money you find. If you choose perfectly, you’ll get to keep all the money in my library!

You immediately start grabbing books off the shelves.

Every book you grab that is found to have a coin in it we will label a True Positive (TP). Every book you grab that doesn’t have a coin in it we will label a False Positive (FP). Every book you fail to grab that has a coin in it we will label a False Negative (FN). Every book you don’t grab that doesn’t have a coin in it we will call a True Negative (TN).

After you’ve chosen your 20 books, it turns out that two of your red books contain a penny, eight of your blue books contain a silver dollar, and the other 10 books contain no coin at all.

How good a chooser are you? There are several possible metrics that might be used to describe your performance at choosing books with hidden coins. Which of the following metrics describes your performance in the way that matters most to you?

- Performance = 50%: the percentage of available coins in my library that you found (you found 10 coins out of 20 available).
- Performance = 79%: the percentage of available value in my library that you found (you found $8.02 out of $10.10 available).

Using metric #1, I could achieve 50% performance by finding all 10 pennies while finding none of the dollars — or vice versa. Those two seem completely different to me, but to metric #1 they are exactly the same! I’d prefer metric #2 because it always tells me how much of the available money I got to take home. The value captured is much more important to me than just the number of coins I found.
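To make the arithmetic concrete, here’s a minimal Python sketch of both candidate metrics, using the counts and coin values from the scenario above:

```python
# A quick sketch of both metrics, using the scenario's counts and values:
# 2 of 10 pennies and 8 of 10 silver dollars found.

found = {"penny": 2, "dollar": 8}          # coins you chose correctly
hidden = {"penny": 10, "dollar": 10}       # all coins in the library
value = {"penny": 0.01, "dollar": 1.00}    # worth of each coin type

# Metric #1: fraction of available coins found (count-based).
metric_1 = sum(found.values()) / sum(hidden.values())

# Metric #2: fraction of available value captured (utility-based).
found_value = sum(n * value[c] for c, n in found.items())
total_value = sum(n * value[c] for c, n in hidden.items())
metric_2 = found_value / total_value       # $8.02 / $10.10 ≈ 79%
```

Both metrics summarize the same 20 picks, but only metric #2 notices that the two coin types differ in value by a factor of 100.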

Unfortunately, our familiar Boolean (True/False) performance metrics don’t give us this option. They are based solely on the simple counts of true positives, false positives, true negatives, and false negatives. For each prediction, a “score” of 1 is assigned to one of those “buckets” (TP, TN, FP, or FN). This “count-based” approach makes some astonishing assumptions that are almost never true in real-life clinical scenarios. One of these, demonstrated in this example, is the often false assumption that every result has the same amount of value (“utility”) as every other result. In other words, the utility is assumed to be “**uniform**” across all results. In this case it’s obvious that utility is not uniform across all results: some results are worth 1 penny and others are worth 1 silver dollar. The uniformity assumption of count-based metrics does not hold.

So maybe now you conclude that we shouldn’t do counts. Instead, we should weight each result depending on how much utility it created. You decide that each true positive in which a penny is found gets 0.01 “points” and each true positive in which a silver dollar is found gets 1.00 “points.” Each false negative in which a penny was missed loses 0.01 points, and each false negative in which a silver dollar was missed loses 1.00 point. Great! The problem of non-uniformity among results is solved!

Now imagine I make you a different offer. I’ve hidden a silver dollar in ten of the 100 books in my library. You can choose any ten books, but again, no peeking! After you’ve chosen, we’ll look in the books to see how many silver dollars you’ve found. You can keep anything you’ve found up to a maximum of $5. However, if you find any more than 5 silver dollars then I’m going to label you “greedy” and you can’t keep any of the money. On top of that, I’m going to fine you $5 for being so greedy.

Classic count-based metrics assume that every correct result is “good” and every incorrect result is “bad.” We call this an assumption that the results are “**fully dichotomized**.” However, with my new offer the sixth true positive result is actually harmful to you: it makes you lose all your winnings AND pay an additional $5. Even though the sixth true positive result is correct, it’s harmful rather than beneficial. We can see that in this case, the results are not fully dichotomized. The assumption of full dichotomization required by count-based metrics does not hold, and that creates a problem: count-based metrics will continue to calculate a better and better performance the more positive results you get, when in actuality you’ll be losing money for any positive result after the first five.
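The payoff rule in this offer can be sketched as a tiny function (a hypothetical `payoff` helper, assuming each silver dollar is worth exactly $1), which makes the reversal at the sixth true positive easy to see:

```python
# A sketch of the "greedy" offer's payoff rule, assuming each silver
# dollar is worth exactly $1. Count-based metrics reward every extra
# true positive; the actual payoff reverses at the sixth.

def payoff(dollars_found: int) -> float:
    """Winnings: keep up to $5; finding more than 5 forfeits all and adds a $5 fine."""
    if dollars_found > 5:
        return -5.0   # labeled "greedy": lose everything and pay $5
    return float(dollars_found)

print(payoff(5))   # best case: $5
print(payoff(6))   # one more "correct" result, and you owe $5
```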

So… maybe now you conclude that, in addition to assigning points for utility, we should also stop categorizing results by whether they are correct (true) or incorrect (false) and instead categorize them by whether the result was beneficial or harmful. In other words, instead of counting true positives we’d add up all the value created by positives that were beneficial (regardless of whether correct or incorrect). Instead of counting false positives we’d sum up the value from all the positives that were harmful. Instead of counting false negatives we’d sum up the value from all the negatives that were harmful, and instead of true negatives we’d sum up the value from all the negatives that were beneficial. Great! Now the problem of results that are not fully dichotomized is solved too!

Determined to win as much money as possible, you go home and train yourself to detect the presence of a coin inside a book without peeking inside. Not so fast! Given your training, I’m changing the deal. Once again I’ve hidden a silver dollar in each of 10 blue books and a penny in each of 10 red books out of the 100 books in my library. Once again you can choose 10 red and 10 blue books. You can still keep every coin you find. However, since you claim you’re such an expert, this time you must pay me $10 for every coin you fail to find. You agree to the deal because practicing in your own home has convinced you that you are really amazing at detecting books holding hidden coins. You apply your detection technique to my library and choose 20 books, finding 1 penny and 5 silver dollars. Oops, you now owe me a bunch of money for the 14 coins you didn’t find!

**Now** which of the following metrics describes the performance that matters most to you?

- Performance = 30%: The percentage of books with coins in them that you found (6/20).
- Performance = 50%: The percentage of potential benefit available to you in my library that you found ($5.01/$10.10).
- Performance = 3%: The percentage of money exchanged between us that actually went in the right direction (into your wallet). You got $5.01 from me but you had to pay me $140. So, of the $145.01 in total that we exchanged, 3% went from me to you ($5.01/$145.01) and 97% went from you to me. Anytime this percentage is greater than 50% you are making money from me. Anytime it’s less than 50%, you are losing money to me. Muwahahahaha.
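The three candidate metrics can be sketched directly from the numbers above:

```python
# A sketch of the three candidate metrics under the $10-per-miss penalty,
# using the counts and values from the scenario above.

coins_found, coins_hidden = 6, 20
benefit = 1 * 0.01 + 5 * 1.00                  # $5.01 captured
available = 10 * 0.01 + 10 * 1.00              # $10.10 available
harm = (coins_hidden - coins_found) * 10.00    # $140.00 in penalties

metric_1 = coins_found / coins_hidden    # fraction of coins found (~30%)
metric_2 = benefit / available           # fraction of value captured (~50%)
metric_3 = benefit / (benefit + harm)    # benefit vs. total money moved (~3%)
```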

We’ve already decided that metric #1 doesn’t measure up because the percentage of available coins actually found does not take into account the fact that some coins are worth more than others.

Metric #2 recognizes that we care more about the money (benefit) than about the number of coins. It tells us how much of the money we found compared with how much we **could** have found if all our false negatives (representing missed benefit) had been true positives instead. However, in this situation each coin you failed to detect not only represents missed value, but it also costs real cash money out of your pocket (harm experienced). Metric #2 doesn’t include harm at all. The missed benefit is small and the cost you’ll have to pay is large, so #2 can’t be a very good measure of what we care about.

OTOH, metric #3 takes into account both the benefit and the harm resulting from each prediction you made. That’s exactly what we need! Of these three metrics, it’s the only one that measures what we really care about: how well (or poorly) we are doing at making money rather than losing it.

By now it’s pretty obvious that our choice of performance metric has to depend on the rules of the game. In the very first scenario above (where I let you keep all the coins you found and I did not penalize you for coins you failed to find) we cared most about the percentage of available opportunity that we captured (the value we actually received compared to the value we would have gotten if all of our false negatives had instead been true positives). In that “no-lose” situation, we’d choose metric #2 because that is exactly what metric #2 measures. The scenario where I penalized you heavily for each missed coin gives you a chance to end up better off than you started, but you also risk ending up worse off if you perform poorly. In that scenario, we’d choose metric #3 because it incorporates both the benefit you receive and the harm you incurred.

Unfortunately, with classic count-based metrics we only get a single formula to describe the relationship of true positives to false negatives. That’s the formula for metric #1:

*TP/(TP + FN)*

where TP is the count of True Positives and FN is the count of False Negatives.

This formula might be familiar to you. In count-based metrics, it has been given two different names: Recall and Sensitivity. Since count-based metrics use the same formula for both Recall and Sensitivity, many (perhaps most) people think of them as being “the same thing”.

However, we’ve just seen that count-based metrics ignore a lot of things that we care about. With our utility-based approach, we see that instead of the single count-based formula #1, we now have two distinctly different formulas (#2 and #3) that take utility into account. The two metrics measure different aspects of the relationship between True Positives and False Negatives. Which one to use depends on what question you’re trying to answer in a particular scenario.

Let’s look first at the formula for metric #3:

*(#TP × $ per TP) / ((#TP × $ per TP) + (#FN × $ per FN))*

In this formula #TP is the count of true positives and $ per TP is the amount of money we receive for each true positive, while #FN is the count of false negatives and $ per FN is the amount of money we pay out for each false negative. We can simplify this formula to:

*$TP/($TP + $FN)*

where $TP is the sum of all the money we received from the true positives and $FN is the sum of all the money we paid out due to the false negatives. You’ve probably already noticed that this formula looks an awful lot like the traditional count-based formula for metric #1, except that instead of counting items we sum up the money (or utility) associated with each result.
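A minimal sketch of this simplified formula as a function, using the penalty scenario’s numbers (1 penny and 5 silver dollars found, 14 misses at $10 each):

```python
# A minimal sketch of metric #3: sum the money received from true
# positives ($TP) and the money paid out for false negatives ($FN),
# instead of counting results.

def utility_sensitivity(tp_utilities, fn_harms):
    """$TP / ($TP + $FN), with per-result utilities instead of counts."""
    u_tp = sum(tp_utilities)
    u_fn = sum(fn_harms)
    return u_tp / (u_tp + u_fn)

# Penalty scenario: 1 penny + 5 silver dollars found, 14 misses at $10 each.
s = utility_sensitivity([0.01] + [1.00] * 5, [10.00] * 14)   # ~0.035, i.e. ~3%
```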

Wasn’t that easy? Now let’s look at the formula for metric #2:

*(#TP × $ per TP) / ((#TP × $ per TP) + (#FN × $ per TP))*

In this formula the last term represents the money we **would** have received if we had made the “complementary” choice (positive instead of negative) for each of the false negatives. It’s the money we could have gotten if every single choice had been perfect, so all the false negatives had instead been true positives. We call this the complementary benefit of the false negatives.

If we do the same simplification as before, this becomes:

*$TP/($TP + $C(FN))*

where $C(FN) is the complementary value of the false negatives: the sum of all the dollars we could have gotten if we had made the correct guess instead of the wrong one.
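Here is the matching sketch for this formula; the complementary values are what the 14 missed coins (9 pennies and 5 silver dollars) would have been worth as true positives:

```python
# A sketch of metric #2, utility-based Recall: the complementary
# value $C(FN) is what each missed coin WOULD have been worth as a hit.

def utility_recall(tp_utilities, fn_complementary_values):
    """$TP / ($TP + $C(FN))."""
    u_tp = sum(tp_utilities)
    u_cfn = sum(fn_complementary_values)
    return u_tp / (u_tp + u_cfn)

# Penalty scenario: the 14 missed coins were 9 pennies and 5 silver dollars.
r = utility_recall([0.01] + [1.00] * 5, [0.01] * 9 + [1.00] * 5)   # ~0.496
```

Note that the $10 penalties never appear here: this metric measures missed benefit, not harm, which is exactly why it answered a different question than metric #3 did.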

You may have noticed that whenever the harm from a false negative exactly equals the benefit of a true positive (the result is *symmetric*), these two different equations will yield the exact same result:

if *$C(FN) = $FN*

then

*$TP/($TP + $C(FN)) = $TP/($TP + $FN)*

In fact, count-based metrics not only ignore the fact that different results might have different utility (treating utility as uniform) and the fact that true predictions might cause harm instead of benefit (treating utility as fully dichotomized), but **they also always treat all results as symmetric**. For this reason, under count-based metrics the two equations will always be the same, whether or not this matches reality.

OTOH, any time results **are** asymmetric, the two equations always yield different numbers. Each equation has a different meaning, and they only happen to evaluate to the same number in the rare cases when the size of the benefit of a true positive is **exactly** equal to the size of the harm from the alternative false negative.

So what should we call these two distinctly different metrics? In count-based metrics formula #1 is called both Sensitivity and Recall.

Many (perhaps most) people would recognize “Recall” as the fraction of available benefit that was captured (for example, here). Recall quantifies the relationship between captured value and total available value, where total value includes both captured and missed value (lost opportunity). It makes sense, then, for us to call metric #2 “Recall” because that’s exactly what it measures!

In count-based metrics the term “Sensitivity” is often used to mean the fraction of actual items (e.g., coins, or cases of a disease) that were correctly detected or predicted. However, even though there’s only one formula, the term is used to mean different things in different contexts.

For utility-based metrics we’ll use “Sensitivity” to mean the fraction of all the utility associated with actual items that is beneficial utility. This is the beneficial utility of the actual items divided by *all* the utility of those items. If the ones we got wrong generate harm, that gets included.

In sum:

**Sensitivity**: A measure of the relationship between the benefit actually received (usually it comes from true positives) and any harm actually received (if there is any, usually it comes from false negatives). This is calculated as:

*$TP/($TP + $FN)*

Going forward, since utility is not always measured in dollars, let’s substitute “u” to more generically indicate the amount of utility, so that the equation becomes:

*uTP/(uTP + uFN)*

Making it even more general… remember that results are not always dichotomized: not all benefit comes from true results and not all harm comes from false results. So if we set Beneficial Positives (BP) to mean the sum of utilities from all positive results that were beneficial (whether a true positive or false positive) and Adverse Negatives (AN) to mean the sum of utilities from all negative results that were adverse or harmful (or at least would have been more beneficial had they been positive), then the final equation to allow for non-dichotomy of results becomes:

*BP/(BP + AN)*

**Recall**: A measure of the benefit actually received from positive results compared to the benefit from positive results that would have been received from an optimally performing system. This is calculated as:

*$TP/($TP + $C(FN))*

or more generally (allowing for any measure of utility):

*uTP/(uTP + uC(FN))*

or even more generally (allowing for non-dichotomy of results):

*BP/(BP + B_{C}(AN))*

The last term in this equation would be read as “the complementary benefit of adverse negatives.”
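Under these definitions, the two fully generalized metrics can be sketched side by side; the names `BP`, `AN`, and `Bc_AN` stand for the summed utilities defined above:

```python
# A sketch of the two generalized metrics. BP, AN, and Bc_AN are the
# summed utilities defined above: beneficial positives, adverse
# negatives, and the complementary benefit of adverse negatives.

def sensitivity_u(BP: float, AN: float) -> float:
    """Utility-based Sensitivity: BP / (BP + AN)."""
    return BP / (BP + AN)

def recall_u(BP: float, Bc_AN: float) -> float:
    """Utility-based Recall: BP / (BP + Bc(AN))."""
    return BP / (BP + Bc_AN)

# Symmetric special case: when AN == Bc(AN), the two metrics coincide.
assert sensitivity_u(5.0, 5.0) == recall_u(5.0, 5.0) == 0.5
```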

**Our shocking discovery**, one that we cannot find described anywhere previously, is that when we properly weight each result based on its utility, and the utility is asymmetric (different in magnitude if we predict positive than if we predict negative), then there’s not just one but rather two distinct basic metrics for the relationship between true positives and false negatives. These two metrics, Sensitivity and Recall, measure different things, and they also return different values in nearly every case. In medicine, the only time they give the same result is the exceedingly rare case where the **harm** (as opposed to loss of benefit) from a negative result exactly balances the **benefit** from a positive result. Taking it a step further, there’s actually an entirely new set of basic performance metrics (not previously described, to our knowledge) that we have uncovered.

To properly handle cases of non-uniformity, non-dichotomy, and asymmetry of prediction results, we also discovered that a better and more complete version of the classic count-based confusion matrix is needed.

The classic confusion matrix tallies counts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):

|  | Actual = Yes | Actual = No |
| --- | --- | --- |
| Predict = Yes | TP | FP |
| Predict = No | FN | TN |

Our new, utility-based confusion matrix sums the utilities of positive and negative predictions rather than just tallying counts. In this way it properly handles both uniform and non-uniform utility of results. It categorizes results based on the benefit or harm created rather than the correctness or incorrectness of the result. In this way it properly handles both dichotomized and non-dichotomized results. Finally, it includes columns that explicitly sum the “complementary” utilities — the utilities that would have been realized had the opposite prediction been made. In this way it properly handles both symmetric and asymmetric results.

Here is the complete, utility-based confusion matrix that makes no assumptions about the uniformity, dichotomy, or symmetry of results:

|  | Complementary Adverse Utility | Realized Beneficial Utility | Realized Adverse Utility | Complementary Beneficial Utility |
| --- | --- | --- | --- | --- |
| Predict = Yes | A_{C}(BP) | BP | AP | B_{C}(AP) |
| Predict = No | A_{C}(BN) | BN | AN | B_{C}(AN) |
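One way this matrix might be tallied in code. The `Result` record and its fields are illustrative assumptions, not an interface from the paper: each result carries its prediction, the utility actually realized (positive for benefit, negative for harm), and the utility the opposite prediction would have realized.

```python
# A sketch of tallying the utility-based confusion matrix from per-result
# records. The Result record is a hypothetical representation, not an
# interface from the paper.
from dataclasses import dataclass

@dataclass
class Result:
    pred: str             # "yes" or "no"
    realized: float       # utility actually experienced (+benefit, -harm)
    complementary: float  # utility the opposite prediction would have realized

def utility_matrix(results):
    """Sum utilities into the eight cells of the utility-based matrix."""
    cells = ["Ac(BP)", "BP", "AP", "Bc(AP)", "Ac(BN)", "BN", "AN", "Bc(AN)"]
    m = {k: 0.0 for k in cells}
    for r in results:
        row = "P" if r.pred == "yes" else "N"
        if r.realized >= 0:                        # beneficial result
            m[f"B{row}"] += r.realized
            m[f"Ac(B{row})"] += -min(r.complementary, 0.0)
        else:                                      # adverse result
            m[f"A{row}"] += -r.realized
            m[f"Bc(A{row})"] += max(r.complementary, 0.0)
    return m

# A found silver dollar, and a missed coin that cost a $10 penalty:
m = utility_matrix([Result("yes", 1.00, 0.0),
                    Result("no", -10.00, 1.00)])
```

With these sums in hand, utility-based Sensitivity is `m["BP"] / (m["BP"] + m["AN"])` and utility-based Recall is `m["BP"] / (m["BP"] + m["Bc(AN)"])`.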

We have published an academic paper, **available here**, that more fully describes and assesses:

- The entirety of this discovery of a more complete and utility-based confusion matrix, along with how to implement it in real-world use.
- Two new and important sets of key performance metrics: one for metrics of realized utility and another for the capture rates of potentially available utility.
- A method for improving the performance of Boolean predictors in real world use, often improving utility-based precision substantially for any given level of utility-based recall. We won’t spend more time on this exciting development here, but in a later post we will dive right in.

OK, let’s play the game one final time. I’ll put a new penny in each of 10 red books and a vintage gold doubloon (worth at least $1,000) in each of 10 blue books. Again, you can pick 10 red books and 10 blue books out of my library. I feel bad for taking 140 bucks from you earlier, so I’ll let you keep whatever you find — no penalty! Plus, I’ll even let you use a metal detector.

You go online and search for metal detectors. You find two that are about the same price, and both get good reviews from customers. Since my library is cramped and you won’t be able to bring both inside, you’ll have to pick one. Yikes, that means it’s time for performance metrics. But which one?

Wait a minute. With no penalty, this is a no-lose situation, so why make a big deal over the metrics? These academics in their ivory towers with their fancy equations! They’re out of touch with us down here in the real world just trying to make things a little better. You resolve to follow the “KISS” principle and just rely on classic count-based metrics to help you choose. After all, everybody uses them all the time!

The manufacturers of Metal Detector #1 claim a sensitivity (TP/(TP + FN)) of 50%: you interpret this to mean that it will find 50% of the coins. The manufacturers of Metal Detector #2 report a sensitivity of only 40%. Obviously #1 is better, right? You can’t argue with the numbers!

However, a nagging little voice reminds you that there’s a lot of money at stake. You decide to order both units and test them yourself before choosing one. You replicate the exact scenario you expect to encounter in my library, and your test shows that the manufacturers didn’t lie: Metal Detector #1 finds 10 coins, or 50%. Metal Detector #2 finds only 8 coins, or 40%. How can they even sell such an inferior product?

But wait, what about utility? You look at the coins that were detected.

It turns out the classifier inside Metal Detector #1 was trained on US pocket change and is very good at detecting copper, zinc, and nickel. The 10 coins found by Metal Detector #1 were all pennies! Your captured value was just $0.10 of $10,000.10 available.

The classifier inside Metal Detector #2 was trained on recovered pirate treasure, and is good at detecting precious metals. Arrrr, matey! All 8 coins found were gold doubloons, so your captured value was $8,000 of $10,000.10 available.

No matter which metal detector you choose, you will go home with more money than you originally had. If you “keep it simple” and use count-based metrics to make your decision, you will choose Metal Detector #1 because it has a count-based sensitivity of 50%, obviously better than the 40% sensitivity of Metal Detector #2. That’s a special kind of KISS, with the emphasis on the last ‘S’. No pirate treasure for you!
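The detector comparison reduces to a few lines; each doubloon is taken at $1,000, consistent with the $10,000.10 total above:

```python
# Count-based vs. utility-based sensitivity for the two detectors, taking
# each doubloon at $1,000 (consistent with the $10,000.10 total above).

penny, doubloon = 0.01, 1000.00
available = 10 * penny + 10 * doubloon        # $10,000.10

# Detector #1 finds all 10 pennies; Detector #2 finds 8 doubloons.
d1_count = 10 / 20                            # 50% of coins
d1_value = (10 * penny) / available           # ~0.001% of value
d2_count = 8 / 20                             # 40% of coins
d2_value = (8 * doubloon) / available         # ~80% of value
```

The count-based numbers favor Detector #1; the utility-based numbers reveal it captures almost none of the available value.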

Pirate treasure aside, when choosing between different predictors or different implementations of predictors that will be used in the clinical arena, **patient lives are affected by those decisions**. If you assess predictors using metrics that don’t reflect what you care about, you’re likely to choose the wrong predictor or design a bad implementation. In many cases, this may result in clinical harm and even in excess lives lost. Choosing metrics that accurately reflect the kind of real-world clinical value you care about will give you a better chance of selecting the right predictor and the right implementation to achieve better outcomes.

In recent years, clinicians have often complained that “machine learning” is overhyped. We often see articles claiming great performance for a predictor, when the real-world performance on the front lines doesn’t seem to be anywhere near as good. At other times, predictors that are working well in practice are abandoned because “measured performance” using count-based metrics was poor.

Our new utility-based performance metrics are designed to more closely match up with the utility that real users will actually experience from a predictor. We hope this will eventually reduce the frequency with which health systems adopt a bad predictor with good count-based metrics, or abandon a good predictor that happens to have poor count-based metrics. Even without the help of a machine-learning algorithm we can predict that would be a good thing for healthcare.

*The opinions expressed here are personal, and not necessarily those of anyone else, including our employers.*