Querying healthcare bills with Dolt

Brian Kotos

December 21, 2022

Background

On a regular basis, I am incorrectly billed by healthcare providers in accordance with what is outlined in my health insurance plan. As a result, I face the logistical burden of keeping track of health insurance claims, invoices from healthcare providers, and all my communications with them. It's a daunting problem that I know isn't unique to me.

One recent example was when my family and I came into contact with a bat in our 100-year-old house. Since rabies is essentially always fatal, we erred on the side of caution and decided to get vaccinated. The rabies vaccine in most cases is only administered in hospital emergency rooms. It involves four doses that must be administered on four separate ER visits. While we were grateful to receive the vaccine and to the doctors and nurses, it was incredibly expensive. Considering my wife, son, and I each got vaccinated, I was billed for 12 ER visits. In addition to the financial burden, it was draining on my time. It was also confusing to manually track and tie together the numerous health insurance claims, hospital invoices, and how close I was to my out-of-pocket maximum.

Currently, I track this all digitally with the help of Microsoft Excel and my handy OCR scanner. While my current solution works, it's far from ideal and involves a lot of manual data entry.

Enter Dolt

I recently set out to improve the above process by writing some custom software in either Node.js or PHP. My health insurer, Anthem Blue Cross Blue Shield, allows you to export all claims to a CSV file. Originally I was going to write code to extract data from the CSV, transform it to the correct data type, and load it into an SQLite database. However, then I remembered this project Dolt I heard about a couple of years ago. If you haven't tried Dolt, I'd highly recommend it. Essentially, Dolt is like a hybrid between MySQL and Git. It allows you to instantly create a MySQL-compatible database by creating a directory and running dolt init. Sound familiar? That's because their CLI is modeled after Git. Since I've already a high proficiency in both MySQL and Git, picking up Dolt has been a breeze. Additionally, Dolt allows you to jump into a MySQL-like console by running dolt sql, and host a MySQL-compatible server that I could connect to using JetBrains DataGrip.

Extract

Dolt made the process of extracting CSV data into a relational database much easier. Instead of having to write and test my own code to read the CSV and line-by-line run INSERT INTO, with Dolt this was simple:

dolt table import -c --pk="Claim Number" source_anthem_claims ./claims.csv

Using dolt sql, I was able to inspect the schema using a MySQL-like console:

pdms> DESCRIBE source_anthem_claims;
+----------------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+----------------+------+-----+---------+-------+
| Claim Type | varchar(16383) | YES | | NULL | |
| Claim Number | varchar(16383) | NO | PRI | NULL | |
| Patient | varchar(16383) | YES | | NULL | |
| Service Date | varchar(16383) | YES | | NULL | |
| Claim Received | varchar(16383) | YES | | NULL | |
| Status | varchar(16383) | YES | | NULL | |
| Processed Date | varchar(16383) | YES | | NULL | |
| Provided By | varchar(16383) | YES | | NULL | |
| Billed | varchar(16383) | YES | | NULL | |
| Plan Discount | varchar(16383) | YES | | NULL | |
| Allowed | varchar(16383) | YES | | NULL | |
| Plan Paid | varchar(16383) | YES | | NULL | |
| Deductible | varchar(16383) | YES | | NULL | |
| Coinsurance | varchar(16383) | YES | | NULL | |
| Copay | varchar(16383) | YES | | NULL | |
| Not Covered | varchar(16383) | YES | | NULL | |
| Your Cost | varchar(16383) | YES | | NULL | |
+----------------+----------------+------+-----+---------+-------+
17 rows in set (0.00 sec)

And this is a sample row:

pdms> SELECT * FROM source_anthem_claims LIMIT 1\G
*************************** 1. row ***************************
Claim Type: Medical
Claim Number: 2022XXXXXXXXX
Patient: XXXXXX (XX/XX/XXXX)
Service Date: Nov 18, 2022
Claim Received: Nov 24, 2022
Status: Approved
Processed Date: Nov 24, 2022
Provided By: XXXXXXXXXXX
Billed: $ 44021.80
Plan Discount: $ 35681.80
Allowed: $ 8340.00
Plan Paid: $ 7506.00
Deductible: $ 0.00
Coinsurance: $ 834.00
Copay: $ 0.00
Not Covered: $ 0.00
Your Cost: $ 834.00
1 row in set (0.00 sec)

Not bad, considering the fact that it created the schema automatically and I instantly have the ability to query the data using the richness of SQL rather than using various formulas and filters in Microsoft Excel.

Transform Dates

In addition to being imported as VARCHAR-typed data, the columns Service Date, Claim Received, and Processed Date were not in ISO 8601 format, and thus couldn't easily be converted to the MySQL DATE type.

Writing a regular expression for this was fairly easy:

^[A-Z][a-z][a-z] [0-9]{1,2}, [0-9]{4}$

And rather than creating an endless set of SQL CASE expressions, I created two lookup tables for getting two-digit months and days:

CREATE TABLE month_abbreviations(
abbreviation VARCHAR(5) PRIMARY KEY,
twoDigits CHAR(2)
);
INSERT INTO month_abbreviations (abbreviation, twoDigits)
VALUES ('Jan', '01'), ('Feb', '02'), ('Mar', '03'), ('Apr', '04'), ('May', '05'), ('Jun', '06'), ('Jul', '07'), ('Aug', '08'),
('Sep', '09'), ('Oct', '10'), ('Nov', '11'), ('Dec', '12');
CREATE TABLE day_abbreviations(
abbreviation VARCHAR(2) PRIMARY KEY,
twoDigits CHAR(2)
);
INSERT INTO day_abbreviations (abbreviation, twoDigits)
VALUES ('1', '01'), ('2', '02'), ('3', '03'), ('4', '04'), ('5', '05'), ('6', '06'), ('7', '07'), ('8', '08'), ('9', '09'),
('10', '10'), ('11', '11'), ('12', '12'), ('13', '13'), ('14', '14'), ('15', '15'), ('16', '16'), ('17', '17'), ('18', '18'),
('19', '19'), ('20', '20'), ('21', '21'), ('22', '22'), ('23', '23'), ('24', '24'), ('25', '25'), ('26', '26'), ('27', '27'),
('28', '28'), ('29', '29'), ('30', '30'), ('31', '31');

And voila! Using some regex and a couple of nested SELECT statements, I can now convert the dates to ISO 8061:

SELECT
`Service Date` as 'Service Date (original)',
(
DATE(CONCAT(REGEXP_REPLACE(`Service Date`, '^[A-Z][a-z][a-z] [0-9]{1,2}, ([0-9]{4}$)', '$1'),
'-',
(
SELECT twoDigits
FROM month_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Service Date`, '^([A-Z][a-z][a-z]) [0-9]{1,2}, [0-9]{4}$', '$1')
LIMIT 1
),
'-',
(
SELECT twoDigits
FROM day_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Service Date`, '^[A-Z][a-z][a-z] ([0-9]{1,2}), [0-9]{4}$', '$1')
LIMIT 1
), ' 00:00:00'))
) as 'Service Date (ISO 8061)'
FROM source_anthem_claims
LIMIT 1;
+-------------------------+-------------------------+
| Service Date (original) | Service Date (ISO 8061) |
+-------------------------+-------------------------+
| Nov 18, 2022 | 2022-11-18 |
+-------------------------+-------------------------+
1 row in set (0.00 sec)

Transform Currency

This was much more straightforward. The strings were prepended with a $, so I simply had to remove that from the beginning. I also added 0.0 to cast the value to a MySQL FLOAT.

SELECT
`Billed` as 'Billed (original)',
REPLACE(`Billed`, '$ ', '') + 0.0 as 'Billed (float)'
FROM source_anthem_claims
LIMIT 1;
+-------------------+----------------+
| Billed (original) | Billed (float) |
+-------------------+----------------+
| $ 44021.80 | 44021.8 |
+-------------------+----------------+
1 row in set (0.00 sec)

Load

After figuring out the transform piece, I could now load the data into a new schema.

I created the following new schema, with the correct data types:

CREATE TABLE transformed_anthem_claims(
claim_number VARCHAR(32) PRIMARY KEY,
claim_type VARCHAR(32),
patient VARCHAR(64),
service_date DATE,
claim_received DATE,
status VARCHAR(32),
processed_date DATE NULL DEFAULT NULL,
provided_by VARCHAR(256),
billed DOUBLE(10, 2),
plan_discount DOUBLE(10, 2),
allowed DOUBLE(10, 2),
plan_paid DOUBLE(10, 2),
deductible DOUBLE(10, 2),
coinsurance DOUBLE(10, 2),
copay DOUBLE(10, 2),
not_covered DOUBLE(10, 2),
your_cost DOUBLE(10, 2)
);

And brought everything together into a MySQL INSERT INTO SELECT statement:

INSERT INTO transformed_anthem_claims
(claim_number, claim_type, patient, service_date, claim_received,
status, processed_date, provided_by, billed, plan_discount, allowed,
plan_paid, deductible, coinsurance, copay, not_covered, your_cost)
SELECT
`Claim Number`,
`Claim Type`,
`Patient`,
(
DATE(CONCAT(REGEXP_REPLACE(`Service Date`, '^[A-Z][a-z][a-z] [0-9]{1,2}, ([0-9]{4}$)', '$1'),
'-',
(
SELECT twoDigits
FROM month_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Service Date`, '^([A-Z][a-z][a-z]) [0-9]{1,2}, [0-9]{4}$', '$1')
LIMIT 1
),
'-',
(
SELECT twoDigits
FROM day_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Service Date`, '^[A-Z][a-z][a-z] ([0-9]{1,2}), [0-9]{4}$', '$1')
LIMIT 1
), ' 00:00:00'))
) as `service_date`,
(
DATE(CONCAT(REGEXP_REPLACE(`Claim Received`, '^[A-Z][a-z][a-z] [0-9]{1,2}, ([0-9]{4}$)', '$1'),
'-',
(
SELECT twoDigits
FROM month_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Claim Received`, '^([A-Z][a-z][a-z]) [0-9]{1,2}, [0-9]{4}$', '$1')
LIMIT 1
),
'-',
(
SELECT twoDigits
FROM day_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Claim Received`, '^[A-Z][a-z][a-z] ([0-9]{1,2}), [0-9]{4}$', '$1')
LIMIT 1
), ' 00:00:00'))
) as `claim_received`,
`Status`,
IF(
`Processed Date` != 'Not Available',
DATE(CONCAT(REGEXP_REPLACE(`Processed Date`, '^[A-Z][a-z][a-z] [0-9]{1,2}, ([0-9]{4}$)', '$1'),
'-',
(
SELECT twoDigits
FROM month_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Processed Date`, '^([A-Z][a-z][a-z]) [0-9]{1,2}, [0-9]{4}$', '$1')
LIMIT 1
),
'-',
(
SELECT twoDigits
FROM day_abbreviations
WHERE abbreviation = REGEXP_REPLACE(`Processed Date`, '^[A-Z][a-z][a-z] ([0-9]{1,2}), [0-9]{4}$', '$1')
LIMIT 1
), ' 00:00:00')),
NULL
) as `processed_date`,
`Provided By`,
REPLACE(`Billed`, '$ ', '') + 0.0 as billed,
REPLACE(`Plan Discount`, '$ ', '') + 0.0 as plan_discount,
REPLACE(`Allowed`, '$ ', '') + 0.0 as allowed,
REPLACE(`Plan Paid`, '$ ', '') + 0.0 as plan_paid,
REPLACE(`Deductible`, '$ ', '') + 0.0 as deductible,
REPLACE(`Coinsurance`, '$ ', '') + 0.0 as coinsurance,
REPLACE(`Copay`, '$ ', '') + 0.0 as copay,
REPLACE(`Not Covered`, '$ ', '') + 0.0 as not_covered,
REPLACE(`Your Cost`, '$ ', '') + 0.0 as your_cost
FROM `source_anthem_claims`;

Now the loaded data looks like this, with dates and currency values that I can query and run aggregate calculations against.

pdms> SELECT * FROM transformed_anthem_claims LIMIT 1\G
*************************** 1. row ***************************
claim_number: 2022XXXXXXXXX
claim_type: Medical
patient: XXXXXX (XX/XX/XXXX)
service_date: 2022-11-18
claim_received: 2022-11-24
status: Approved
processed_date: 2022-11-24
provided_by: XXXXXXXXXXX
billed: 44021.8
plan_discount: 35681.8
allowed: 8340
plan_paid: 7506
deductible: 0
coinsurance: 834
copay: 0
not_covered: 0
your_cost: 834
1 row in set (0.00 sec)

What's Next?

Now I have the data extracted, transformed, and loaded from the Anthem-provided CSV. While I could have written software to do this, Dolt made this process much simpler. Not only on the ETL side but also allowing me to create a MySQL-compatible database instantly.

Next, I plan to write some code that will query this data, compare it against my health insurance plan, and recommend "next actions" that I can take to contest various invoices and claims. I also plan on implementing a system that will allow me to keep track of all correspondence with healthcare providers and my health insurance company.

Longer term I'd like to scrape the text from my scanned invoice PDFs and get them populated into Dolt.

To be continued.