— Jonathan A. Handler, MD, FACEP, FAMIA
This is my first “grimoire”-type post. A grimoire is a book of magic spells. In the real world, very often I have to search all over or create a solution to complete some task, typically when doing something on a computer. Installing software under certain conditions is a typical issue — easy if you are on a common platform, harder at other times. When I find a solution (e.g., on Reddit, Quora, StackExchange, I create one myself, or some combo), it feels like magic! Hence, the solution feels like a magic spell or potion, and a bunch of those in one place would then be a “grimoire.”
With ZeroEffectors now live, I have a convenient place to put these grimoire entries for myself and for others who face the same issues. This one is my first.
The National Library of Medicine’s (NLM’s) Unified Medical Language System (UMLS) is totally awesome. It comes with SQL scripts to load the UMLS into a database. Unfortunately, the scripts never seem made for the database with which I’m working. Lately I’ve been doing a lot of work with PostgreSQL. The UMLS does not come with load scripts made for that database software. I’ve also found that, very often, unexpected characters cause load scripts to fail or load improperly. Sometimes that has caused errors that were very hard to debug, and sometimes very difficult to even know they occurred. ARRRGGHHHH!!!
Well, at least for loading the UMLS MRCONSO table (and presumably it would work for all others), I found this site! Thank you Zeljko! The MRCONSO load script there worked almost perfectly. The copy command failed because of a few rows (less than 50 out of millions) that had a backslash at the end of a field. I think the backslash was being interpreted as an escape character, even though I thought the documentation I read for PostgreSQL said that the same character used for quoting, when doubled up, would serve as the escape character. The backslash was not the quote character. In any event, I:
- Opened the MRCONSO.rrf file in a text editor and did a simple Find and Replace to remove all the backslashes from the file.
- Some backslashes seemed unnecessary, others were used to separate items in a “list”, so I figured the best general substitute was a forward slash.
- To remove the backslashes, I used Visual Studio Code from Microsoft as my text editor.
- To load such a large file, I had to increase the maximum memory allowed for use by the program. I did this by going to Visual Studio Code’s Settings -> Text Editor -> Files -> Max Memory For Large Files MB and adjusting the value to 4096 MB. Luckily I have a lot of RAM on my computer.
- I pasted Zeljko’s script for loading MRCONSO into a Query Window in PGAdmin, the PostgreSQL tool for administrating the database.
- I modified the copy command part of the script to point to the location of my MRCONSO.rrf file.
- I added the database and schema to the table name in the script wherever I found it. For example: my_database.my_schema.mrconso.
- NOTE: The database and schema names need to be lower case in their actual names in the database or you will get errors. If you make them upper case, then no matter what you do in the query, you will get errors. ARRGGHHH! There’s probably some collation or setting that fixes this, someday I’ll figure it out and apply it.
- It loaded perfectly!
A better approach would surely be to adjust the escape code in the SQL script’s copy command, but I admit it, I couldn’t figure that out in the time allotted and this approach served my needs. Please comment if you have a better approach.
— All opinions and content expressed here are my own and do not necessarily reflect those of any employer(s) I may have or anyone else.