— Jonathan A. Handler, MD, FACEP, FAMIA
w/ Craig F Feied, MD, FACEP, FAMIA, FAAEM
In a previous post we described our novel “u-metrics” approach to assessing the performance of Boolean (Yes/No) algorithms (for example, “does this patient have cancer — yes or no?”). This approach addressed the serious deficiencies in the classic “count-based” methods, which are based on simple tallies of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). As shown in that post, the classic approach is valid only when the magnitude of the utility (benefit or harm) of each prediction is exactly the same as the utility magnitude of every other prediction (i.e., TP, TN, FP, and FN predictions all have exactly the same utility magnitude). In real life, this assumption almost never holds true.
Classic metrics (c-metrics) were so terrible at measuring the real-world performance of our Boolean classifiers that we were forced to create a new utility-based statistical approach (u-metrics) to address these deficiencies. This work was published as a peer-reviewed academic paper that appeared in the IEEE journal of applied informatics as described in the previous post. If you haven’t read the previous post or the academic paper, we strongly recommend doing so before reading this post.
Now we’re going to address the problem that if the foundation is wrong, anything built upon it will also be wrong. Since count-based “scoring” metrics fail to correctly measure the performance of Boolean algorithms, they cannot be relied upon to track changes in performance. In particular, when a change is made that improves the performance of an algorithm, count-based scoring may fail to recognize the improvement.
One important example arises when we take action to reduce false positive alerts, commonly called false alarms. False alarms rarely create benefit, and in some cases they can be quite harmful. Let’s explore that scenario and see how utility-based u-metrics correctly detect and measure the usefulness of a method that reduces false alarms. While we’re at it, we’ll also reduce the number of redundant true positive alerts, because multiple alerts for the same event generally are just annoying and distracting, and rarely add any value.
Let’s start by questioning the implicit assumptions that are built into count-based metrics. At first consideration it seems logical to assume that all correct predictions are good and all false predictions are bad, but is that always true in real life? If so, can we also assume that all correct predictions are always equally good, and all false predictions are always equally bad? And do false predictions create as much harm as correct predictions create benefit?
Consider a predicting machine that predicts whether a patient on narcotics will stop breathing some time in the next 10 minutes. The machine operates silently in the background, so nobody is even aware of the predictor until it predicts that a patient will stop breathing. A prediction of impending respiratory arrest causes an alert to appear on the patient’s monitoring screens, along with a loud siren noise lasting 5 seconds. Every 30 seconds the predictor assesses all the available data and makes a new prediction. The alert and the siren are triggered every time there is a prediction that the patient will stop breathing, regardless of how many times the alarm has already been sounded. The prediction machine scores well on datasets used for validation, so let’s consider what happens when it’s deployed in your favorite hospital.
A patient in the ICU has been overmedicated with narcotics and his respiratory rate is slowing. As long as an alarm sounds at least once before the patient stops breathing, no harm will come to the patient. Fortunately, the predicting machine correctly predicts that the patient will stop breathing (“respiratory arrest”) within its 10-minute prediction window. As it happens, if there were no intervention then respiratory arrest would actually occur in 8 minutes, meaning that in this case the predictor missed the first two minutes of its prediction window (it made false negative predictions during those two minutes). However, no harm was done: the alarm sounded with more than enough time for the care team to address the problem.
When the first positive prediction is made, the loud siren is turned on to alert the care team of trouble. The doctor and nurse immediately rush to the bedside to assess and address the issue. They correctly recognize the problem and begin therapy to address it. While they are trying to work, the predicting machine continues to make predictions every 30 seconds.
Predictably, the predicting machine continues to predict respiratory arrest. Every 30 seconds the doctor and nurse are distracted by the loud siren noise of the predictor redundantly re-alerting them of the impending respiratory arrest that they already know about and are already trying to treat. “Shut that thing off!” yells the doctor after the 4th alarm, 3 of which were redundant alarms following the initial one. “I can’t think with it constantly blaring in my ear.” The nurse presses a button to temporarily suppress alerts while they work. Over the next 6 minutes, 12 more redundant alarms are suppressed. Without the constant distraction of the unwanted redundant alarms, the doctor and nurse successfully identify the problem and administer the life-saving medication that reverses the narcotic effects. The patient begins to breathe normally again. Another save by the care team!
The predictor’s first alarm may very well have saved the patient’s life, and it clearly deserves full credit for providing benefit! But what about the other “extra” alarms after the first one? How much credit should we give to those additional, redundant alarms? You probably have already recognized that this is where classic count-based scoring gets into trouble. Classic c-metrics will always give full credit for each and every correct guess: all true predictions are classified together as true positive alarms, regardless of when they were made or what outcome they produced. But does that reflect reality? Were the extra, redundant alarms really as helpful as that first alarm? Were they helpful at all? Or were they actually harmful?
The doctor treating the patient found the additional alarms not just useless but actually harmful. Not surprisingly, count-based statistics continue to give full credit for benefit provided by each of these annoying, harmful, redundant alarms. After all, they were true positive predictions, and that’s all that matters.
And how should we account for the 12 alarms that were suppressed after the nurse pressed the “snooze” button? There are only a limited number of ways that a prediction can be classified in count-based statistics. C-metrics could attempt to handle this in one of the following four ways:
- Option one
Treat each of the 12 suppressed alerts as true positives (which they are) and give each one full credit for providing benefit, even though those alerts never actually fired nor did they create any benefit. This makes no sense. For one thing, it will score performance as identical regardless of whether or not snoozing is implemented. Nonetheless, it’s what count-based metrics will do by default if we don’t figure out something better.
- Option two
Treat the suppressed alerts as if the predictions never occurred at all (simply don’t score them). Can you already guess why this is not a very good option? If we were to do this, the system would not be penalized even if it were suppressed so long that it completely missed predicting one or more events. For example, a patient could have a respiratory arrest without the suppressed alarm ever going off at all, and the predictor would still get a perfect score. This would be nonsensical.
- Option three
Treat each of the 12 suppressed alerts as if they had been true negative predictions, even though they were actually true positives. Is this a good idea? Think about situations in which an actual true negative prediction creates benefit (e.g., a negative Covid test when you want to get on an airplane). Treating a true positive prediction as if it were a true negative means rewarding the predictor for predictions that it didn’t make! Among other problems, both the true positive rate and the true negative rate would be completely wrong. Suppressing the redundant alarms prevented them from creating additional harm, but it certainly didn’t convert them into beneficial true negative predictions. As Edison might say, “we’ve successfully found another approach that doesn’t work.”
- Option four
Treat each of the 12 suppressed alerts as false negatives, scoring each of them as if they created harm exactly equal to the benefit of a true positive prediction. Wait, what? Wasn’t the whole point of suppressing the alarms to reduce the harm caused by annoying and distracting interruptions? Now we’re going to penalize the predictor rather than score an improvement in performance? It could be argued that treating the alerts as false negatives approximates the actual experience of the users, and to a certain extent this makes sense: the problem does exist, yet users get the (non-alerting) experience of a negative prediction. However, if we were to choose this option then count-based metrics would penalize the predictor for imaginary false negatives, even though the suppression of redundant alerts actually improved the performance experienced by the users. Converting true positives into false negatives means the calculation of sensitivity (TP/(TP+FN)) would be completely broken. Can we agree to never do this?
By now it should be apparent that there is no way for count-based metrics to properly account for snoozing. Options two, three, and four will break c-metrics in ways that are simply not acceptable, yet simply calculating c-metrics as usual (option one) will never show any difference between snoozing and non-snoozing, no matter how much snoozing improves the actual performance experienced by the users.
Count-based metrics will fail to measure the performance impact of snoozing whenever “redundant” true positive alerts are less helpful than the first true positive alert. The calculated results will either be nonsensical or they will falsely suggest that snoozing was useless or even detrimental to predictor performance. This may be one reason why we were unable to find any previous literature (prior to our own paper) reporting the performance benefits of snoozing after a positive prediction that fires an interruptive alert.
In reality, snoozing is helpful in many situations. One common situation in which snoozing tends to be very helpful is what we call “alarm-centric scenarios.” This is the term we use for situations in which
1) the event to be predicted is relatively rare,
2) multiple predictions can be made for the same event,
3) the user experiences a positive prediction as an interruptive alert or alarm,
4) false positives are always annoying or harmful, and
5) the user gets no indication of negative predictions and does not value them.
Our respiratory arrest example above was an alarm-centric scenario because all of these attributes apply:
- Most patients do not experience respiratory arrest in the hospital (reportedly it occurs in less than 1% of hospitalized patients).
- An alarm could fire multiple times in relation to a single future event.
- Users are notified of positive predictions (as alarms) and do not get any notification when there is a negative prediction (when the patient is predicted not to have a respiratory arrest in the next 10 minutes).
- False positives generate an interruptive alarm that needlessly distracts users from other important work.
- Users don’t place high value on negative predictions because respiratory arrest is relatively rare, so they already assume that most patients won’t suddenly have a respiratory arrest unless a strong reason arises to think otherwise.
Let’s go ahead and work through the 10-minute period leading up to the respiratory arrest event in our example, and calculate some statistics both with and without snoozing.
The algorithm falsely predicted negative every 30 seconds for the first 2 minutes, yielding 4 false negative predictions. All the predictions for the next 8 minutes were true positive predictions. Since predictions are made twice a minute, that yields 16 true positive predictions.
Without snoozing, the alarm would have fired on each of those 16 predictions. However, the users only wanted (and the patient only benefitted from) the first alarm for the event. If the other 15 alarms went off, they would be redundant and each one would be an annoying distraction. Classic count-based metrics would report the recall as:
TP/(TP+FN) = 16/(16+4) = 16/20 = 80%.
and the precision as:
TP/(TP+FP) = 16/(16+0) = 16/16 = 100%.
C-metrics gave full credit for all the unwanted alarms, resulting in a calculated c-precision of 100% despite the fact that users got 16 times as many alarms as they wanted (and 15/16 = 94% of the alarms were harmful rather than beneficial). At the same time, c-metrics penalized the predictor for missing each of the first 4 opportunities to provide additional unwanted alarms, resulting in a calculated c-sensitivity of 80%, despite the fact that users got 100% of the alarms they wanted. The users needed one alarm any time prior to the respiratory arrest, and they got it.
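The count-based arithmetic for the no-snoozing scenario can be checked in a few lines. This is our own illustrative sketch (the function names are ours); the counts come straight from the example: 4 early false negatives, then 16 true positive predictions, and no false positives.

```python
# Classic count-based metrics for the no-snoozing scenario.

def c_recall(tp: int, fn: int) -> float:
    """Classic recall (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)

def c_precision(tp: int, fp: int) -> float:
    """Classic precision (positive predictive value): TP / (TP + FP)."""
    return tp / (tp + fp)

tp, fn, fp = 16, 4, 0  # no snoozing: every positive prediction fires an alarm
print(f"c-recall:    {c_recall(tp, fn):.0%}")     # 80%
print(f"c-precision: {c_precision(tp, fp):.0%}")  # 100%
```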
With snoozing activated we still have the same initial 4 false negatives. After that, the alarm fired 4 times and 12 subsequent alerts were suppressed due to snoozing. Since the users only wanted a single alarm, the first of the 4 alarms provided benefit to the users and the next 3 alarms were redundant nuisance alarms.
If we count the snoozed alarms as if they were false negatives (matching the way they are experienced by the user), then classic count-based metrics would report the precision as TP/(TP+FP) = 4/(4+0) = 4/4 = 100% and the recall as TP/(TP+FN) = 4/(4+16) = 4/20 = 20%. That makes no sense, since the true event recall was 100% (there was one event and it was captured).
Problem one: the user received 100% of the value they desired from the predictor (all the value was derived from the initial alarm that sent them scrambling to the bedside), yet c-metrics falsely show that recall worsened with snoozing. Calculated recall dropped from 80% without snoozing to just 20% with snoozing.
Problem two: even though snoozing reduced the number of unwanted nuisance alarms from 15 down to 3, c-metrics falsely show no change in precision. In fact, c-metrics measures of precision claimed a perfect 100% all the way along, even when 15 out of 16 alarms created harm rather than benefit.
It’s apparent that in scenarios like this one, count-based metrics fail to correctly report the impact of snoozing. C-metrics falsely show worsening recall and fail to show improved precision, whereas users experience the exact opposite: improved precision without loss of recall.
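For comparison, here is a minimal sketch (ours, not from the paper) of what the classic formulas report when snoozing is on and the 12 suppressed alerts are counted as false negatives, as in option four above:

```python
# Classic recall for the snoozing scenario: 4 alarms actually fired (TP = 4),
# while the 4 early misses plus the 12 suppressed alerts are all counted as
# false negatives (FN = 16).

def c_recall(tp: int, fn: int) -> float:
    """Classic recall (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)

print(f"c-recall with snoozing: {c_recall(4, 16):.0%}")  # 20%
# c-precision is unchanged at 4 / (4 + 0) = 100%, even though snoozing cut
# the number of nuisance alarms from 15 down to 3.
```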
In contrast, the utility-based approach matches the user experience because scoring is based on the amount of benefit or harm resulting from each prediction, rather than just the correctness of the prediction.
Beneficial utility is quantified on a scale from 0 to 1, and adverse utility is quantified on a separate scale from 0 to 1 (the range 0 to 1 is chosen for convenience, but any range could be used). The utility of each prediction is assigned based on the context in which it is made and how it affects real-world outcomes. At the time of implementation and configuration, somebody (often a team of clinicians) makes decisions about what beneficial or adverse utility scores should be assigned to each type of prediction. To understand how that works, let’s walk through it for the predictor used in the example given above.
- At the time of implementation, the configuration team decided that the first true positive prediction for an event should be classified as a “beneficial positive” prediction and be given full credit: a beneficial positive score of 1 (the maximum possible).
- The team also decided that each redundant true positive prediction (triggering yet another alarm for the same event) is a nuisance that produces no benefit. Each positive prediction of this type will be classified as “adverse positive” rather than as “beneficial positive,” but how big an adverse positive score should be assigned? Although the redundant alarms create harm, the team decided that the harm from a nuisance alert is much less than the benefit from that first correct alarm. They reasoned that they’d be willing to suffer through ten nuisance alerts in order to receive one beneficial useful alert, so they chose to assign an adverse positive score of 0.1.
- The team concluded that false positive predictions are even worse than redundant true positives. False alarms reduce the likelihood that users will respond to true positive alarms in the future. False positives also force clinicians to abandon their work on another patient to evaluate someone who isn’t actually having a problem. The team therefore classified false positives as “adverse positive” predictions and assigned an adverse positive score of 0.5, reasoning that it would be worth putting up with two false alarms to be able to correctly capture one event. Why did the team decide that every redundant false alarm should have the same score as the first false alarm, when redundant true alarms get a different score than the first true alarm for an event? They recognized that each positive prediction could be correct even if the previous one was not, thus clinicians have to respond each time.
- The team thought for a long time about how to score true negative predictions. Clinicians don’t even know negative predictions are being made, since there aren’t any notifications, and the baseline clinical assumption is that patients will not have a respiratory arrest unless something suggests otherwise. The team decided to classify true negatives as “beneficial negative” predictions and assigned a beneficial negative score of zero. The zero score of a true negative prediction is classified as “beneficial” because if the opposite prediction had been made it would have been a false positive and would have been classified as “adverse.”
- False negative predictions are scored depending on whether or not the event was detected by a true positive prediction. In our respiratory arrest example, the configuration team made the assumption that as long as clinicians treat a patient before respiratory arrest occurs, no harm occurs to the patient.
- If there is at least one true positive for an event (i.e., the alarm went off and the event was detected), then all the false negatives for that same event are classified as “beneficial negative” and are assigned a beneficial negative score of zero, since they neither create harm nor benefit. In this case the zero score is classified as “beneficial” because if the opposite prediction had been made it would have been a redundant true positive, and those are classified as “adverse.”
- If there is not at least one true positive for an event (only false negative predictions exist), the first false negative prediction is classified as an “adverse negative” prediction and assigned an adverse negative score of 1 (the maximum possible). The team’s reasoning is that clinicians depend on the alarm to notify them of an impending problem, so failure to fire the alarm causes patient harm. What about the additional false negatives for the same event? They are assigned a beneficial negative score of zero: they create no value but they also don’t create any extra harm, since we’ve already accounted for the fact that the alarm isn’t going to be triggered for the event. The zero score is classified as “beneficial” because in this case if the opposite prediction had been made it would have been an annoying redundant true positive, which would have been “adverse.”
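The configuration team’s rules above can be sketched as a small scoring function. This is our illustrative sketch, not code from the paper; the prediction labels (“TP”, “FP”, “TN”, “FN”) and the function shape are assumptions, while the scores themselves (1 for the first true positive, 0.1 for redundant true positives, 0.5 for false positives, 1 for the first false negative of a missed event, 0 otherwise) come from the team’s decisions described above.

```python
# Sketch of the configuration team's utility-scoring rules for one event,
# given that event's predictions in time order.

def score_event(predictions):
    """Return (class, score) pairs: BP, AP, BN, or AN for each prediction."""
    event_detected = "TP" in predictions  # did at least one alarm fire?
    scored, seen_tp, seen_fn = [], False, False
    for p in predictions:
        if p == "TP":
            if not seen_tp:
                scored.append(("BP", 1.0))  # first alarm: full benefit
                seen_tp = True
            else:
                scored.append(("AP", 0.1))  # redundant alarm: small harm
        elif p == "FP":
            scored.append(("AP", 0.5))      # false alarm: larger harm
        elif p == "TN":
            scored.append(("BN", 0.0))      # silent and correct: neutral
        elif p == "FN":
            if event_detected or seen_fn:
                scored.append(("BN", 0.0))  # harmless miss: event still caught,
                                            # or the miss was already penalized
            else:
                scored.append(("AN", 1.0))  # event missed entirely: full harm
                seen_fn = True
    return scored

# The example event: 4 false negatives, then 16 true positives.
scores = score_event(["FN"] * 4 + ["TP"] * 16)
```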
Once we have defined a set of utility scores that reflect the benefit or harm for each type of prediction based on context, we no longer have to base our performance metrics solely on the counts of TP, FP, TN, and FN predictions. Utility-based metrics use formulae that are similar to those of c-metrics, but that derive from the scores aggregated across all the different types of predictions observed: beneficial positives (BP), adverse positives (AP), beneficial negatives (BN) and adverse negatives (AN).
In u-metrics, recall is calculated using the formula BP / (BP + BC(AN)), where BP means the sum of beneficial positive scores and BC(AN) means the sum of the “complementary benefit for adverse negative predictions.” This is the amount of missed benefit that could have been captured if the opposite predictions had been made (i.e., if the adverse negative predictions had instead been beneficial positive predictions).
To calculate the utility-based recall (u-recall) for our example scenario, we need to add up the total utility for Beneficial Positive (BP) predictions and the total complementary beneficial (BC) utility for adverse negative (AN) predictions. The beneficial positive utility comes entirely from the first true positive prediction, which contributed a score of BP = 1. Since there was at least one true positive prediction (causing the event to be detected) all of the false negatives are classified as “Beneficial Negatives,” thus the example scenario doesn’t actually contain any Adverse Negative predictions. The complementary benefit that would have been captured had the opposite predictions been made is therefore BC(AN) = 0. The calculated utility-weighted recall (u-recall) in our example is thus:
BP/(BP+BC(AN)) = 1/(1+0) = 100%
This accurately reflects the user experience: there was one event and that event had at least one alarm. If we do the same calculation with snoozing, u-recall remains unchanged at 100% because we still have one Beneficial Positive prediction with a score of 1 (thus BP=1), and we still have no adverse negatives (thus BC(AN)= 0).
Utility-weighted precision (u-precision) is calculated as BP/(BP+AP). In our example, before snoozing is applied we have one beneficial positive prediction with a score of BP=1 and fifteen adverse positive predictions, each with a score of AP=0.1. The calculated u-precision in our example is thus:
BP/(BP+AP) = 1/(1+(15*0.1)) = 1/(1+1.5) = 1/2.5 = 40%.
Does this match user experience? We have one true and useful alarm followed by 15 unwanted and harmful (though true) alarms. Does that translate to a predictor that has 40% precision with respect to capturing benefit while avoiding harm? If you think this number is too high or too low, it means you disagree with the configuration team about the relative harm of an annoying redundant true alarm or the relative value of the first true alarm. Unlike classic metrics, utility-based metrics are configurable, and the utility values should be adjusted as necessary until they match up as closely as possible with user experience. That’s what “utility” is all about.
When we turn on snoozing, we still receive one beneficial positive prediction contributing a score of BP=1, but now instead of fifteen annoying alarms we have only three adverse positive predictions, each contributing a score of AP=0.1. The calculated u-precision with snoozing is thus:
BP/(BP+AP) = 1/(1 + (3*0.1)) = 1/(1+0.3) = 1/1.3 = 77%.
This improvement in precision matches the user experience that snoozing significantly reduced the burden of unwanted redundant alarms without missing events. In this case snoozing significantly improved u-precision (from 40% to 77%) while maintaining u-recall (at 100%). This is not an isolated occurrence: our published paper demonstrates that snoozing often can maintain event capture while significantly reducing false positive notifications and also reducing redundant true positive notifications that provide limited benefit and may even cause harm.
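As a sanity check, the u-metric arithmetic above can be reproduced in a few lines. This is a sketch of the formulas as stated in the text (the function names are our own):

```python
# U-recall and u-precision for the example, using the aggregate utility
# scores derived above. BC(AN) is the complementary benefit of the adverse
# negatives; there are none in either scenario, so it is 0 throughout.

def u_recall(bp: float, bc_an: float) -> float:
    """Utility-weighted recall: BP / (BP + BC(AN))."""
    return bp / (bp + bc_an)

def u_precision(bp: float, ap: float) -> float:
    """Utility-weighted precision: BP / (BP + AP)."""
    return bp / (bp + ap)

# Without snoozing: 1 beneficial alarm, 15 redundant alarms at 0.1 each.
print(f"u-precision, no snooze:   {u_precision(1.0, 15 * 0.1):.0%}")  # 40%
# With snoozing: only 3 redundant alarms reach the user.
print(f"u-precision, with snooze: {u_precision(1.0, 3 * 0.1):.0%}")   # 77%
# Recall is unchanged: the one event was captured in both cases.
print(f"u-recall (both cases):    {u_recall(1.0, 0.0):.0%}")          # 100%
```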
Traditional count-based metrics cannot correctly account for the benefits of snoozing. This is because c-metrics are only correct (and appropriate for assessing Boolean predictors) when certain underlying assumptions can be met, and these assumptions are not met in most real-world deployments. Most often, the use of c-metrics leads to the false conclusion that any benefits of snoozing are offset by harm created by snoozing. If count-based metrics are used to assess the overall benefit provided by the system’s implementation, the implementation team often will incorrectly conclude that snoozing is of limited benefit or is harmful. They may even falsely conclude that snoozing is comparable to setting a different classifier cutoff, with the same tradeoffs in sensitivity vs specificity. Nothing could be further from the truth.
The more universal utility-based approach correctly measures predictor performance both with and without snoozing. In scenarios where snoozing significantly improves realized predictor performance, u-metrics correctly reflect this improvement. This enables an implementation team to identify the specific implementation details (including snoozing configuration) that will yield the most useful predictor. When used correctly, u-metrics help to achieve a predictor configuration optimized for capturing desired events, while minimizing false positive alerts through a mechanism that is completely different from simply choosing a different classifier cutoff (and that does not come with the same sensitivity vs specificity tradeoffs).
If you want to learn more about how to use and apply our approach to utility-weighted metrics for Boolean algorithms, take a look at our paper on the topic. If you made it this far, you might just be one of those people.
Opinions expressed here are our own, and not necessarily those of anyone else, including any employers we may or may not have.