
by Jonathan A. Handler, MD, FACEP, FAMIA
Not long ago I posted the availability of Darth Vecdor (DV), a platform for using Large Language Models (LLMs, like ChatGPT) to create knowledge bases. I made it open source and freely available, and I said I hoped someday to share DV use cases. To facilitate that sharing by me and by others, I enhanced DV with export/import functionality for DV configurations (announced in a recent post). I imagined a world where people share their DV configurations so that everyone could have continually improving knowledge bases.
Well, with some trepidation, today is that day! I am sharing a set of example DV configurations targeted to a theoretical use case in healthcare.
Important Note: Be sure to read Epilogue #2 at the end of this post for very important warnings, caveats, and more.
The Use Case: Facilitating Clinical Quality Improvement
Note: During my personal and professional life, I have probably interacted with at least hundreds (thousands?) of healthcare providers, clinics, and systems. Unless otherwise indicated, my statements about quality measurement in healthcare are generalizations based on those experiences and may not apply to any specific system or systems, including those.
The Problem
I previously proposed a “patient hierarchy of healthcare needs.” In that hierarchy, the most basic needs (just above avoidance of gross negligence and ill-intentions) are providing a beneficial diagnosis and treatment. Yet, it seems that the quality of diagnosis and treatment for the overwhelming majority of healthcare visits is never measured. Rather, it appears as if healthcare generally makes the following assumption:
“You saw a credentialed provider, so we will assume everything was perfect unless we hear otherwise from you.”
Unfortunately, we have plenty of evidence telling us otherwise (cited in yet another post). Healthcare’s safety record lags far behind other high-stakes industries (such as the airline industry).
In fairness, though, it has been exceedingly difficult and expensive to measure the clinical quality of a patient encounter. Even if there were enough funding to manually review the quality of every encounter, we don’t have the people needed to do the work. We already have a shortage of doctors and nurses that is projected to significantly worsen. Given this, some have resorted to reviewing a small number of randomly selected records to look for opportunities to improve quality. However, looking for serious medical care quality issues is a bit like looking for needles in a haystack. If you randomly sample only a small part of the haystack, the likelihood that you will find a needle is low. It is hard to get a sense of common patterns and make the system better if you can’t reliably identify the problems in it.
You might be thinking, “Didn’t you say that medical errors are too common? Now you say that finding them is like searching for a needle in a haystack? You have contradicted yourself!”
Studies suggest that harmful medical errors occur rarely, yet also too commonly. How can that be? Errors may be relatively rare on a “per-encounter” or “per-decision” basis. However, when all decisions across all medical encounters are aggregated across a population of patients or over the lifetime of a single person, the likelihood that one or more of those decisions will have caused harm or failed to benefit becomes unacceptably high.
For example, a late 2022 AHRQ report stated (page ES-1) that “overall diagnostic accuracy in the emergency department (ED) is high,” with only a minority (about 5.7%, or 1 in 18) of visits having an incorrect diagnosis, even fewer having an adverse event because of an incorrect diagnosis (2.0%, or 1 in 50), and fewer still having permanent disability or death from an incorrect diagnosis (~0.3%, or 1 in 350). The report notes that these rates are similar in primary care and inpatient hospital care. As rare as those errors may seem on a per-visit basis, consider them on a per-doctor basis. Emergency physicians reportedly may see between 1.8 and 5.0 patients per hour. Even at the lowest end of that range, if each emergency physician averages 20 ten-hour shifts per month, that’s an average of 360 patients per emergency physician per month. Thus, based on the AHRQ report, we might expect each emergency physician, on average, to make an incorrect diagnosis leading to an adverse event for 7 patients per month (2% of 360), and to make an incorrect diagnosis leading to permanent disability or death for 1 patient per month (0.3% of 360). So, on a per-visit basis, errors may be relatively rare, but on a per-doctor basis, they seem relatively common. Similarly, on a per-patient basis, since people will usually have many encounters with the healthcare system over their lifetime, the US National Academy of Medicine reportedly concluded that “most people will experience at least one diagnostic error in their lifetime, sometimes with devastating consequences.”
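The per-doctor arithmetic above can be checked directly. All inputs below are the figures cited from the AHRQ report and the article’s stated workload assumptions, not independent data:

```python
# Per-physician arithmetic from the cited AHRQ figures and workload assumptions.
patients_per_hour = 1.8           # low end of the cited range
shifts_per_month = 20
hours_per_shift = 10

patients_per_month = patients_per_hour * shifts_per_month * hours_per_shift
print(patients_per_month)  # 360.0

adverse_event_rate = 0.02         # 2% of visits: incorrect diagnosis -> adverse event
disability_or_death_rate = 0.003  # ~0.3% of visits: permanent disability or death

print(patients_per_month * adverse_event_rate)        # ~7 patients per month
print(patients_per_month * disability_or_death_rate)  # ~1 patient per month
```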
Given these challenges, another typical approach in healthcare has been to track quality for a few common conditions (e.g., high blood pressure, diabetes) using relatively simple electronic health record queries. For example, the hemoglobin A1C test is a simple, common, and effective way to measure the control of diabetes. It’s also easy to get those test results from most electronic health records. Unfortunately, we don’t have simple, easy-to-get tests we can use to assess care quality for most other conditions.
So, to improve healthcare quality, we need to be able to find failures, assess the root causes for the failures, and make changes to the process of healthcare that make those failures less likely to occur. That’s really hard to do if the failures are rare and therefore hard to find on a per-encounter or per-decision basis, even if they add up to being far too common on a per-patient, per-doctor, or per-health system basis. Finding serious clinical errors may be a little like finding a few needles in a haystack, but if you jump into the haystack enough times, eventually you’re going to get pricked. So before you jump in, you really want someone to look hard for needles and remove them before you jump!
A Proposed (partial) Solution
LLM-generated knowledge graphs — perhaps those created using Darth Vecdor — may help us find those “needles in the haystack.” When we can find more of them, we may better identify failure patterns, enabling us to put in place mitigations, and then better assess whether those mitigations were successful, ultimately improving the quality of care. Right now, healthcare mostly “drives blind,” unable to see when, where, on whom, how, or why clinical failures occur. How might knowledge graphs help?
- Misdiagnosis: Imagine that we had a knowledge base containing the “lookalike conditions” for virtually every medical problem. Clinicians call the lookalike conditions for a problem its “differential diagnosis.” For example, a strain of the muscles between the ribs can cause chest pain. A heart attack can also cause chest pain. So, “rib muscle strain” is a lookalike diagnosis of (in the “differential diagnosis” of) heart attack. Imagine a patient sees a doctor and gets a diagnosis of “rib muscle strain.” Two days later, the same patient goes to the emergency room and is found to have a heart attack. One might reasonably wonder whether the patient was actually having a heart attack on that first visit and was misdiagnosed as having a rib muscle strain. A knowledge base of the differential diagnosis of every medical problem might enable a simple electronic health record query to identify many more potential cases of misdiagnosis among patients getting care within a health system.
- Mismanagement / Suboptimal management: Imagine that we had a knowledge base containing the complications one might suffer from virtually every medical problem. For example, if a patient has a blood clot in the big, deep vein of their thigh and is not put on proper treatment for it, they may suffer the potentially fatal complication of that blood clot traveling to the lung (“pulmonary embolism”). Imagine someone has a diagnosis of “deep thigh vein blood clot” and three days later has a diagnosis of “pulmonary embolism.” Some patients will have this complication despite getting the best possible treatment. At the same time, one might reasonably wish to double-check such a case to ensure the patient did get the best possible treatment. A knowledge base of the complications of every medical problem might enable a simple electronic health record query to identify many more potential cases of suboptimal management among patients getting care within a health system.
- Clinical Failure Severity: Virtually every case of misdiagnosis or suboptimal management represents a potentially fruitful opportunity to explore whether something in the system of care could be improved. At the same time, given limited resources and the high costs of investigating each such case, we might imagine focusing first on the failures having the most significant outcomes. Unexpected and preventable death is an obvious outcome warranting significant attention in many cases, but there are many other serious complications warranting attention. How do we limit our “needle in the haystack” search to just the biggest, sharpest, most dangerous needles? Imagine that we had a knowledge base containing the “medical gravity” for virtually every medical problem. In other words, how grave, serious, or concerning is that problem? Then we could limit our search for cases of potential misdiagnosis and mismanagement to only the most serious ones. The costly time and effort of case reviews could be focused on cases that may be the most important to find.
Note: if you ask doctors “how bad is [X] disease?” in most cases they will probably (correctly) say “it depends.” For example, although many cancers cause serious harm and even death, many cases are easily and permanently cured. Conversely, while most diaper rashes are mild, a very severe diaper rash can be complicated by serious infection. Still, given no information other than “cancer” or “diaper rash,” I think most doctors would say cancer generally has a higher “gravity” than diaper rash.
So, imagine that we had a knowledge base containing:
- The differential diagnosis for every problem.
- The potential complications for every problem.
- The “medical gravity” (“concerningness”) of every problem.
Health systems might be able to use that knowledge base to help:
- Automatically find patients who had an encounter with the health system, then returned a relatively short time later and received another diagnosis having high medical gravity (very serious) and that is a “lookalike” to the diagnosis from the first encounter (suggesting that the first diagnosis may have been incorrect).
- Automatically find patients who had an encounter with the health system, then returned a relatively short time later and received another diagnosis having high medical gravity (very serious) and that is a complication of the diagnosis from the first encounter (suggesting that the treatment of the first diagnosis may have been suboptimal).
Of course, those cases need further investigation by a human expert to determine if:
- The automated flagging of the case was accurate.
- If so, whether a different and better outcome could have been achieved.
- And if so, how the health system overall could be improved to make future cases more likely to be successful.
The hope is that such a system would more accurately identify appropriate cases for continuous quality improvement review than a random chart audit.
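A flagging query of the kind described above can be sketched with an in-memory SQLite database. The table and column names, the 14-day window, the MGI threshold, and the sample ICD-10-CM codes are all my illustrative assumptions, not the actual Darth Vecdor schema:

```python
# Sketch: join visit data to a knowledge-base table of "concerning" diagnosis
# pairs to flag short-interval revisits with a high-gravity lookalike diagnosis.
# Schema and values are hypothetical, for illustration only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE encounters (patient_id TEXT, visit_date TEXT, icd10 TEXT);
CREATE TABLE concerning_pairs (first_dx TEXT, second_dx TEXT, second_mgi REAL);

-- A rib-muscle-strain-like first visit followed two days later by a heart attack.
INSERT INTO encounters VALUES ('p1', '2024-01-01', 'S29.011');
INSERT INTO encounters VALUES ('p1', '2024-01-03', 'I21.9');
INSERT INTO encounters VALUES ('p2', '2024-01-01', 'L22');  -- no concerning revisit

-- Knowledge-base row: heart attack is a high-gravity lookalike of the strain.
INSERT INTO concerning_pairs VALUES ('S29.011', 'I21.9', 0.9);
""")

# Flag revisits within 14 days whose second diagnosis the knowledge base pairs
# with the first diagnosis at a medical gravity above a chosen threshold.
rows = db.execute("""
SELECT e1.patient_id, e1.icd10 AS first_dx, e2.icd10 AS second_dx
FROM encounters e1
JOIN encounters e2
  ON e1.patient_id = e2.patient_id
 AND julianday(e2.visit_date) - julianday(e1.visit_date) BETWEEN 1 AND 14
JOIN concerning_pairs cp
  ON cp.first_dx = e1.icd10 AND cp.second_dx = e2.icd10
WHERE cp.second_mgi >= 0.8
""").fetchall()

print(rows)  # [('p1', 'S29.011', 'I21.9')]
```

As the text stresses, a row returned by such a query is only a candidate for human expert review, not a confirmed error.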
Enter Darth Vecdor
I have made and released as open source a series of Darth Vecdor (DV) configurations that aim to generate differential diagnosis, complication, symptom, and medical gravity knowledge graphs for virtually every ICD-10 CM diagnosis under 7 characters in length. As described above, such knowledge graphs may be a useful part of “human-in-the-loop” workflows for assessing clinical care quality. The configurations use OpenAI’s GPT-4o mini model through its API. Be aware that OpenAI charges for these queries; both OpenAI and DV have features intended to protect a user from spending more on queries than they intended, but do not rely on DV’s feature alone for this.
The configurations are available here.
Each configuration (see Configuration Files Listing section below) should be importable onto the relevant DV form using Darth Vecdor’s import functions. The configuration files are intended to be imported and run in the order listed. Some may run quickly, others may take a long time (hours, even days). I have a MacBook Pro with an M1 Max chip, 64 GB of RAM, and a 4 TB solid-state drive. It’s getting “long in the tooth” but it’s still a monster. Your own speeds may be much faster or slower, depending on your hardware and other factors.
At the end of the process, a single database table is generated containing every ICD-10 diagnosis pair identified algorithmically as “potentially concerning”. The query generating the table assumes that ancestor (less specific) diagnoses in the ICD-10 hierarchy share the relationships of their descendants. This table is the capstone of all this effort, potentially providing an automated way to help identify potential quality of care issues in support of continuous quality improvement efforts.
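The assumption that ancestors share their descendants’ relationships can be sketched as follows. In ICD-10-CM, an ancestor code is (roughly) a prefix of its descendants, so a pair observed between specific codes can be lifted to every combination of their ancestor prefixes. This is my simplification for illustration, not the actual DV query:

```python
# Illustrative sketch: propagate diagnosis pairs up the ICD-10-CM hierarchy
# by treating each progressively shorter prefix as an ancestor code.
# This is a simplification of the hierarchy, not the actual DV logic.

def ancestors(code: str):
    """Yield the code and its progressively less specific prefixes."""
    base = code.replace(".", "")
    for n in range(len(base), 2, -1):  # stop at the 3-character category
        prefix = base[:n]
        # Reinsert the dot after the third character where applicable.
        yield prefix[:3] + "." + prefix[3:] if len(prefix) > 3 else prefix

def lift_pairs(pairs):
    """Propagate each (first_dx, second_dx) pair to all ancestor combinations."""
    lifted = set()
    for a, b in pairs:
        for anc_a in ancestors(a):
            for anc_b in ancestors(b):
                lifted.add((anc_a, anc_b))
    return lifted

pairs = {("S29.011", "I21.9")}  # hypothetical strain / heart-attack pair
lifted = lift_pairs(pairs)
print(sorted(lifted))  # includes ('S29', 'I21') and every prefix combination
```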
The table should be easily exportable into a delimited file that should be importable into almost any other database. I believe having this content as a simple table of ICD-10 diagnosis pairs has significant advantages:
- Easy to query: Relatively simple database queries joining the table to existing visit data might help identify potentially concerning patterns of visits warranting further evaluation.
- Fast and cheap to query: The queries might run orders of magnitude faster and cheaper than repeatedly querying an external LLM. The table is generated by pre-processing vector comparisons, with the intent of avoiding vector-based queries running in real-time production while users wait, since those queries may be both costly and slow.
- No ambiguity or mysterious “black box” AI behavior: The same query on the same data in that table should return the same result every time, and that result can be completely explainable by simply examining the underlying table. Although it may be “black box” as to how or why the AI generated the content, the results of queries on the table after that point can be understood and explained.
- Straightforward to validate and correct: The knowledge base content is in common English and/or standard ICD-10 codes. Human experts can examine and validate it as they see fit. When, during validation or use, information is found to be incorrect or missing, the table can be modified to correct it.
- Potentially more reliable, secure, and private: Running queries in a safe, secure, private, and trusted environment may avoid increased potential privacy, security, and failure risks when relying on an external LLM service.
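Exporting the final table to a delimited file, as described above, is a few lines of standard-library Python. The table and column names here are the same illustrative assumptions used earlier, not the actual DV schema:

```python
# Sketch: export the pairs table to a CSV file importable into almost any
# other database. Schema and values are hypothetical, for illustration only.
import csv
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE concerning_pairs (first_dx TEXT, second_dx TEXT, second_mgi REAL)")
db.execute("INSERT INTO concerning_pairs VALUES ('S29.011', 'I21.9', 0.9)")

with open("concerning_pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["first_dx", "second_dx", "second_mgi"])  # header row
    writer.writerows(db.execute("SELECT * FROM concerning_pairs"))
```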
Conclusion
I haven’t rigorously tested the output of the configuration files, and I can absolutely see that they are imperfect. I am sure I made mistakes, potentially serious ones. The output will be incomplete and some of it will be incorrect. However, I am guessing it can help find a higher percentage of re-visits appropriate for expert quality review than a truly random chart audit. Even if it would not replace random audits, it might supplement them. It’s also possible the outputs are completely useless for ANY purpose. But, even if that’s the case, as LLMs continue to improve and as DV users create new and improved prompts, we might anticipate ever-improving versions. Ultimately, I hope Darth Vecdor can be a tool that helps substantially advance the state of the art in healthcare’s ability to review, quantify, assess, and improve the quality of care.
Epilogue #1: Configuration Files Listing
Be careful, I may have made mistakes here.
- 01a – terminology_populator_major_umls_vocabs_mrconso_2024aa_srl0149_eng_20260111_144733.json: This loads the Concept Unique Identifiers (CUIs) and associated strings (text descriptions) from the UMLS MRCONSO table into Darth Vecdor. The query assumes that the user has a UMLS MRCONSO table containing the major UMLS vocabularies in a specific database location. The user should adjust the query as needed to point to the correct table.
- 01b – terminology_populator_icd10cm_under_7_chars_20260111_144810.json: This loads the ICD-10 CM codes and associated strings (text descriptions) from the UMLS MRCONSO table into Darth Vecdor. The query assumes that the user has a UMLS MRCONSO table containing the major UMLS vocabularies in a specific database location. The user should adjust the query as needed to point to the correct table.
- 02a – code_set_populator_icd10_under_7_chars_20260111_144655.json: This generates a code subset of UMLS CUIs found in the ICD-10 CM vocabulary for the ICD-10 CM codes under 7 characters in length (including the dot).
- 03a – relationship_set_medical_gravity_v004_20260111_144533.json: This generates a set of relationships (a knowledge graph) attempting to characterize the medical gravity of each code in the 02a code subset above. For example, it attempts to characterize how often the condition has mild, moderate, and severe effects.
- 03b – relationship_set_icd_diff_dx_and_compl_cgpt4o_mini_v001_20260111_144219.json: This generates a set of relationships (a knowledge graph) attempting to list the symptoms, differential diagnoses, and complications of each code in the 02a code subset above.
- 04a – relationship_string_to_code_matcher_term_matcher_diff_dx_v001_20260111_144124.json: This attempts to match the free-text responses of differential diagnoses from the LLM to an ICD-10 CM code by comparing vectors using cosine similarity. The query assumes that the user has a UMLS MRCONSO table containing the major UMLS vocabularies in a specific database location. The user should adjust the query as needed to point to the correct table.
- 04b – relationship_string_to_code_matcher_term_matcher_compl_v001_20260111_144135.json: This attempts to match the free-text responses of complications from the LLM to an ICD-10 CM code (abbreviated as ICD-10 going forward) by comparing vectors using cosine similarity. The query assumes that the user has a UMLS MRCONSO table containing the major UMLS vocabularies in a specific database location. The user should adjust the query as needed to point to the correct table.
- 05a – custom_table_populator_icd10_under_7_char_single_cui_maps_v001_20260111_143709.json: This generates a “custom table” in the database that maps ICD-10 codes to a single UMLS CUI. Even though the configs create a “terminology” in DV for ICD-10 codes directly, I had originally generated the relationships using UMLS CUI codes that map to ICD-10 codes. However, some ICD-10 codes map to multiple CUIs. So, this table maps under seven character ICD-10 codes to a single UMLS CUI to facilitate later processing.
- 05b – custom_table_populator_icd10_max_descendant_dists_v002_20260111_143637.json: This generates a “custom table” in the database that has the maximum cosine distance among all descendant pairs of a code (at least I think I am remembering that correctly). The purpose of this is to help determine if a descendant code might be reasonably considered “almost the same as” its ancestor. If none of the descendants of the ancestor significantly differ from one another, then a potential assumption is that the ancestor can be used interchangeably for any of its descendants.
- 05c – custom_table_populator_icd10_mgi_v001_20260111_144041.json: This table combines the various DV-generated medical gravity relationships into a single number for all included ICD-10 codes. I call this number the “Medical Gravity Index” (or “MGI”). A higher number is intended to indicate a higher gravity/severity/level of concern. On anecdotal review of some of the data, I disagreed with some of the values, and some of them seemed way off (incorrect in my view), but most of the values I saw seemed reasonable to me.
- 05d – custom_table_populator_icd10_ddx_by_llm_and_hierarchy_v003_20260111_143911.json: For all included ICD-10 codes, this table provides all the other ICD-10 codes that might be considered “in the differential diagnosis of” each code. This uses the results of custom table 05b to determine if codes can share each other’s differential diagnosis. If I remember my logic correctly: where all the descendant codes of an ancestor are close to one another in cosine distance, then if any one of the descendants or the ancestor has another code in its differential diagnosis, all of the descendants and the ancestor are considered to have that code in their differential diagnoses. Since lookalikes should be symmetrical (if diagnosis A is a lookalike of diagnosis B, then diagnosis B is a lookalike of diagnosis A), this too is taken into account when generating this table.
- 05e – custom_table_populator_icd10_compls_by_llm_and_hierarchy_v002_20260111_143323.json: For all included ICD-10 codes, this table provides all the other ICD-10 codes that might be considered “a complication of” each code. If I remember correctly, this uses the results of custom table 05b to determine if codes can share each other’s complications, similar to that of 05d. Unlike differential diagnosis, I wouldn’t consider complication a symmetrical relationship (if diagnosis B is a complication of diagnosis A, diagnosis A is not usually a complication of diagnosis B). Therefore, the query differs from 05d by not assuming a symmetrical relationship.
- 05f – custom_table_populator_icd10_sxs_by_llm_and_hierarchy_v001_20260113_222605.json: For all included ICD-10 codes, this table provides all the other ICD-10 codes that might be considered “a symptom of” each code. I believe this also uses 05b to determine if codes can share each other’s symptoms.
- 06a – possibly_clinically_concerning _icd10_pairs_20260113_2228.sql: This combines diagnosis pairs from 05d (differential diagnoses) and 05e (complications) to identify possibly concerning diagnosis pairs, incorporating the MGI from 05c to enable queries to select only pairs in which the second diagnosis MGI exceeds a certain value. If I remember correctly, it also uses 05f data to exclude situations in which the second diagnosis in the pair is a symptom.
- other configs and info.txt: This contains some of the other DV configurations and information that I think were in place when I created and ran all these configurations, to hopefully facilitate replication by others.
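The cosine-similarity matching used in 04a/04b (and the descendant-distance comparisons in 05b) can be sketched in a few lines. The tiny hand-made vectors below stand in for real embeddings, which DV obtains from an embedding model via the OpenAI API; the codes, descriptions, and vector values are all illustrative assumptions:

```python
# Sketch: match an LLM's free-text diagnosis to the ICD-10-CM candidate whose
# embedding is closest by cosine similarity. Vectors here are toy stand-ins
# for real embedding-model output, for illustration only.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embedding of the LLM's free-text answer, e.g. "heart attack".
answer_vec = [0.9, 0.1, 0.0]

# Hypothetical embeddings of candidate ICD-10-CM code descriptions.
candidates = {
    "I21.9": [0.88, 0.12, 0.05],  # Acute myocardial infarction, unspecified
    "L22":   [0.05, 0.02, 0.99],  # Diaper dermatitis
}

best_code = max(candidates, key=lambda c: cosine_similarity(answer_vec, candidates[c]))
print(best_code)  # I21.9
```

Precomputing these comparisons once, as described earlier, is what lets the final table be queried with plain joins instead of live vector searches.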
Epilogue #2: Important Warnings, Caveats, and More
While I hope Darth Vecdor can provide value in many areas, Darth Vecdor and the released configuration files are distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. Any use is entirely at your own discretion and risk (and, of course, the risk of those for whom you have responsibility). Read the license and other information on the relevant GitHub sites, in the code, and on the Darth Vecdor user interface if it is run. Nothing here should be construed as superseding the license terms. Nothing here should be construed as medical advice. You need the appropriate expertise and associated due diligence to safely, effectively, and appropriately use Darth Vecdor and any of its outputs. There is no assurance that Darth Vecdor or any of its outputs meets or will meet any or all needs for any use. I have discussed its potential use in the medical domain, but that does not imply it is safe or appropriate for any use in healthcare or any use at all. Darth Vecdor is highly configurable, so suitability or unsuitability for any use also may relate to your configuration and use of the system. I tried hard to build Darth Vecdor and the shared configuration files with quality, but you should assume they have serious bugs and design flaws, and the system and content surely lack critical functionality. Depending on whether and how it and/or its outputs are used, Darth Vecdor and/or its outputs could lead to dangerous outcomes. It’s entirely up to you to assess and validate its suitability (or the suitability of its outputs) for any purpose whatsoever. If there are disclaimers that I should have put here but didn’t, please imagine that any and all such disclaimers are here.
All opinions expressed here are entirely those of the author(s) and do not necessarily represent the opinions or positions of his/her/their employers (if any), affiliates (if any), or anyone else. The author(s) reserve the right to change his/her/their minds at any time.