— Jonathan A. Handler, MD, FACEP, FAMIA
Healthcare AI is (mostly) not shareable because it is neither interoperable nor generalizable
NOTE: Many consider “AI” (artificial intelligence) and “ML” (machine learning) to be distinct concepts. For convenience, I intentionally conflate them here under the term “AI”.
Every day, sometimes multiple times a day, some institution publishes a press release announcing a study that proves their new machine learning or artificial intelligence system (let’s call it an “AI model”) can do something amazing in healthcare. The implication by the authors, or at least a likely inference by readers, is typically that the model can “revolutionize healthcare” if widely adopted.
In reality, in most cases, a model is appropriate for use only at the institution at which it was developed. Let’s define “interoperability” as the ability of a model developed in one place to be technically implemented at another. Let’s define “generalizability” as the ability of a model to perform about as well on another institution’s data as it did on the data used to train it. Together, interoperability and generalizability determine whether a model developed at one place is appropriate for use at another. Let’s label models that are both interoperable and generalizable as “shareable”.

Unfortunately, the concept that AI models will be shareable is generally a myth. Sure, some models use only a small number of commonly available and readily specified input features, and therefore may be easily implemented at other sites and achieve similar performance. However, that is not the typical AI model of the modern world. Below I give a list of reasons why someone else’s AI model likely won’t work at all, or won’t work as well, at your institution. Not all will apply in every case, but many will apply in many cases.
- They didn’t publish their model’s actual code: No matter how carefully a model’s creators describe their input features, their hyperparameters (configuration settings), the dataset, and every other element of their system, unless you have the actual code that generated the model, you will never be able to perfectly reproduce their model. Well, that’s not totally true — if they provide their training dataset (including all engineered features, such as moving averages), hyperparameters, and the code of the algorithm generating the model, then you can perfectly reproduce the model. But if they do that, they might as well publish the actual model’s code as well. Since sharing the model itself (and/or the elements required to perfectly recreate that model) is rarely done in healthcare, the model is not truly reproducible. You can create your own version, but it will be different, often meaningfully so, from the one that was published.
- They didn’t publish all their input features: Nowadays it’s not uncommon to see models that use literally thousands of data elements as inputs (“features”). Often, only the “top” most important features are published, or only the raw data elements are published and not the engineered features. Even when all the features are published, there are usually at least a few without enough explanation or description to know exactly what the data element is.
- Their patients are different than yours: Even if the authors share their model and their input features, their model will not perform the same way for you as it did for them, because their model was trained on their own patient population and yours is surely different, at least a little and sometimes a lot.
- Their data is different than yours: Let’s consider some different kinds of data…
- Lab tests can vary significantly in their normal range and clinical meaning depending upon the methodology and instrumentation used. If converted to a categorical (e.g. high, normal, low), the impact of these variances across institutions may be magnified or reduced.
- Medical images can vary by device manufacturer, device version, institutional imaging protocols, and more. In contrast to the proliferation of publications reporting imaging AI successes, some have noted the disappointments in real-life implementations and estimated that very few imaging AI algorithms have widespread use (see here and here). With that said, AI on medical images may ultimately prove more amenable to “share-ability” than other data types.
- Medication prescribing can vary significantly across institutions based on practice patterns, formulary differences, and more. If converted to a categorical (e.g. “is on blood thinner”), the impact of these variances across institutions can be magnified or reduced.
- The definitions of the most basic elements can vary widely across institutions. For example, if a patient sees an internist and comes back in 2 days for a planned blood pressure re-check, is that a new encounter or is it considered an extension of the same encounter? I know of a health system whose logic to determine the time of an inpatient discharge from the hospital spanned 2 pages! Do you think any other health system uses that exact same logic? Alternatively, consider the time of admission for a patient presenting to the ED. Is it the time of registration? The time of triage? The estimated time the patient walked through the front door? Other? A recent article highlighted that IV infusion start times can differ significantly depending on whether the nurse-documented start time is used or the start time from IV pump logs is used. If these things aren’t specified exactly, you will be giving the model different data than it expects. These differences may have minimal or significant impact on the model’s performance.
- The health system where the model was developed may have data elements that you don’t. Perhaps they capture social determinants of health (SDOH) at the time of a visit and you don’t. Perhaps you both capture SDOH, but you ask different questions.
- The clinical processes, expectations, initiatives, and norms may vary widely across institutions. The system where the model was developed may have a quality improvement program in place for controlling hypertension but not one for managing diabetes, while at your institution perhaps the situation is reversed.
- The clinical notes used for training the model will probably contain lingo, grammar, abbreviations, and common phrases that differ from yours. Extracting content using natural language processing will have different success rates at your institution.
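To make the data-variance point concrete, here is a minimal sketch of how the same lab result can be categorized differently at two institutions. The reference ranges and the creatinine value below are made-up, illustrative numbers, not clinical guidance; the point is only that a model trained on one site’s categories silently receives different categories at another.

```python
# Hypothetical reference ranges for "the same" creatinine test at two sites.
# Cutoffs differ because methodology and instrumentation differ.
SITE_A_RANGE = (0.7, 1.2)   # mg/dL, the site where the model was trained
SITE_B_RANGE = (0.6, 1.1)   # mg/dL, the site implementing the model

def categorize(value_mg_dl, normal_range):
    """Map a numeric lab result to a low/normal/high category."""
    low, high = normal_range
    if value_mg_dl < low:
        return "low"
    if value_mg_dl > high:
        return "high"
    return "normal"

value = 1.15  # the identical patient result, fed to the identical model
print(categorize(value, SITE_A_RANGE))  # "normal" at the training site
print(categorize(value, SITE_B_RANGE))  # "high" at the implementing site
```

The model learned what “high” means from Site A’s cutoffs; at Site B, the same input pipeline hands it a different answer for the same patient.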
Yeah, yeah, yeah, I know what you’re thinking. “At least for ‘their data is different than yours,’ standards are gonna solve all those data problems.” Except they don’t. I’m not saying they’re not good and they don’t help, but we don’t have standards for every data element, and there’s often variation in the assignment of a standard terminology code to a concept. For example, one recent study found that the wrong LOINC (medical terminology) code was assigned to commonly-ordered lab tests in about 20% of the reported cases.
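A sketch of why miscoded terminology is so hard to catch: one plausible sanity check is to flag local tests whose reported units disagree with the units expected for their assigned LOINC code. The tiny expected-units table and the local test list below are illustrative assumptions, not a real institutional mapping (the two LOINC codes shown are real, but everything else is hypothetical).

```python
# Expected units for a couple of LOINC codes (deliberately tiny sample).
EXPECTED_UNITS = {
    "2345-7": "mg/dL",  # Glucose [Mass/volume] in Serum or Plasma
    "2160-0": "mg/dL",  # Creatinine [Mass/volume] in Serum or Plasma
}

# Hypothetical local lab catalog entries with their assigned LOINC codes.
local_results = [
    {"local_name": "GLU",    "loinc": "2345-7", "units": "mg/dL"},
    {"local_name": "CREAT",  "loinc": "2345-7", "units": "mg/dL"},   # miscoded as glucose
    {"local_name": "GLU-WB", "loinc": "2345-7", "units": "mmol/L"},  # unit mismatch
]

def flag_suspect_mappings(results):
    """Return entries whose units disagree with the assigned LOINC code."""
    return [r for r in results
            if EXPECTED_UNITS.get(r["loinc"]) not in (None, r["units"])]

for r in flag_suspect_mappings(local_results):
    print("check mapping for", r["local_name"])
```

Note that the check catches the unit mismatch (GLU-WB) but not the creatinine result miscoded as glucose, whose units happen to match. Automated checks catch only a subset of coding errors, which is part of why error rates like the ~20% figure above persist.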
As the number of input features grows, the model becomes less and less likely to be applicable at your institution, because:
- It becomes increasingly likely that you don’t have one or more of these features.
- It becomes increasingly likely that some of your features are not perfectly equivalent to those expected by the model.
- Even worse, the greater the number of features in the model, the more likely the model has been tuned very specifically to the population of the institution on which it was trained. Therefore, it becomes less likely to perform similarly at your institution.
Perhaps it’s no surprise, then, that a recent systematic review of machine learning algorithms to predict trauma outcomes found significant “heterogeneity in the development and evaluation of machine learning models, even within the single field of trauma medicine. Differences were widespread in areas including model development (ie, feature selection, data sampling, features used), algorithms used, model validation, performance metrics, and research reporting.” YIKES!
The Typical “Fix” Actually Equates to “Start Over”
The usual proposed “fix” for all these issues? Retrain the model on your own data. To the uninitiated, it sounds like you use the same model, just with some minor “tweaks” here and there. Kind of like taking your car for a tune-up in prep for a trip to Alaska in the winter. They’ll just put some lower viscosity motor oil in the engine and a little more air in the tires, and then your car — the very same car you drive daily around your neighborhood in California — is ready for Alaska! Nope, that’s not at all what happens when you retrain an AI model. When you “retrain”, it’s not the old model with a few settings changed. In fact, you don’t use the old model at all. Instead, you create a completely new and different model specifically based on your population and available data. It may not use the same inputs, and even if it does, it will not process the inputs in exactly the same way. Of course, this new model needs to be tested and validated at your site. There’s no guarantee it will perform adequately, especially on the first try.
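A tiny sketch of why “retraining” yields a genuinely different model: even when the fitting pipeline is byte-for-byte identical, refitting on local data produces different parameters, and therefore different predictions. The two “site” datasets below are made-up toy numbers (a risk score versus age), and the simple least-squares slope stands in for whatever algorithm the real model uses.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope: a stand-in for any model-fitting step."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical (age, risk) data at two institutions.
site_a = ([40, 50, 60, 70], [0.10, 0.20, 0.30, 0.40])  # training institution
site_b = ([40, 50, 60, 70], [0.05, 0.25, 0.30, 0.55])  # your institution

model_a = fit_slope(*site_a)
model_b = fit_slope(*site_b)
print(model_a, model_b)  # identical pipeline, different fitted model
```

The “retrained” model shares a recipe with the original, but its parameters, and hence its behavior, come entirely from the new data, which is why it needs its own validation.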
In many (maybe most) cases, the model needs far more than just “retraining,” it actually needs significant modification to achieve adequate performance. That often includes finding new input features and developing new, engineered input features.
Some models built at other institutions “may just work” at your own place if they were trained on populations similar to your own and only require a few, commonly available data elements. But very often, the existing model is more like an example, and each institution must entirely recreate its own, new version. The new version may be inspired by the original, but it’s still a different “thing.” Declaring a “retrained” model “the same as” the “original” model is akin to declaring ibuprofen “the same as” aspirin. Try convincing your platelets of that! (If you aren’t familiar with the differing effects of those medications on platelets, see #3 of 7 in this link.)
The FDA, which had previously declined regulatory enforcement of most AI models, recently announced an intent to tighten regulatory enforcement for some models. Since a retrained model is a new model, it will be interesting to see if the FDA requires a new approval any time a site implements a model by retraining a previously approved one (in other words, by creating a new model). If it does, that may strongly disincentivize model retraining, leading to implementations using models that perform poorly for patients. If it doesn’t, then how will it achieve its goals?
So, where do we go from here?
Clearly, a lot of clinical AI will have to be DIY, at least in the near-term. It may seem crazy to imagine anyone but data scientists building a predictive model. Indeed, only specially trained people have the skills needed to do this work today. However, remember that many technologies used every day by nearly everyone used to be limited to specially trained folks. Back in my day, spreadsheet software was limited to technically oriented accountants and financial whizzes — the original “quants”. Most executives couldn’t and wouldn’t type a document — that was handled by secretaries with proven typing speed and accuracy. Now, word processing and spreadsheets are used by virtually everyone. With the right components and processes, AI will be no different. Below, I list some key needed components and processes for AI development to be more widely accessible to those with little training:
- Access to real-time data: There is no AI without data, and many clinical use cases require the freshest data. Want to develop, retrain, validate, and/or implement a model that predicts respiratory compromise in the next hour? It probably needs data from the last few minutes or hours. If the soonest data available for the model to use was from yesterday, that’s probably a lot less helpful. A real-time data warehouse is often very costly to implement and maintain if you have to build it from scratch, especially if you have to do so using healthcare’s typical HL7 data feeds. Health system software should offer real-time (or very near real-time) data warehouses that customers can freely use for any need and any reason they see fit, including providing the data needed to develop and fuel AI models. It should be straightforward to query, clone, migrate, process, and/or combine data in very near real-time from across these warehouses using typical methods. Healthcare software systems, especially electronic health records (EHRs), are usually costly to implement and even more costly to “rip and replace.” Therefore, some customers may be financially “locked in” to their products and have little market power to demand this feature if it’s not already available. As a result, regulatory forces may be needed to achieve this goal across all software used in health systems.
- True AutoML solutions: If most clinical AI will be “DIY” and require development, validation, retraining, revalidation, and often significant modification, then simple but powerful tools that anyone can use are needed to do that work. AutoML software claims to “automatically” generate machine learning (AI) models. Unfortunately, all AutoML systems I have seen actually require very significant training and manual effort to use. There’s very little “auto” about AutoML today. Soon though, truly automated machine learning systems will enable anyone to easily do this work. To move things forward, I have built “Architensor” (blog post about it coming soon), a much more automated AutoML for creating, validating, and implementing Boolean predictors. Except for gathering and engineering the data (self-service tools already exist for this), Architensor does just about everything else to auto-generate AI easily, quickly, and correctly. That includes doing appropriate validation and testing (using appropriate metrics) so that users can make correctly informed decisions when considering implementation. I hope to be able to make Architensor widely available soon.
- Right-sized processes: When it comes to AI in healthcare, misinformation and bias are rampant. For example, I have heard a claim that “all AI in healthcare is, by its very nature, intended to be generalizable, and therefore constitutes research.” On the contrary, as described extensively above, most AI cannot generalize and must be rebuilt and revalidated at each new institution and on each new population. In my experience, generalizable AI is exceedingly rare. Here I define “bespoke AI” as all other AI, including any AI that must be modified, retrained, and/or rebuilt to be used at other institutions or with other populations. For virtually all bespoke AI, validation (or re-validation) is appropriate. Processes regarding AI should be “right-sized” to the intended and anticipated use of the AI and its attendant risks and benefits. This is important for managing scarce reviewer resources, and to enable AI that delivers significant benefits with minimal risk to be efficiently and safely developed and deployed.
Today, interoperable and generalizable (“shareable”) healthcare AI models are mostly a myth. However, AI functionality will more easily be replicable when the three key ingredients are in place: real-time access to data, truly automated “AutoML” tools, and right-sized AI-related processes. With those ingredients, new AI models can be adopted easily and widely enough to truly revolutionize healthcare.
Opinions expressed here are those of the authors, not necessarily those of anyone else, including any employers the authors may or may not have. The authors reserve the right to change their minds at any time.