Building a Personal Medical AI

What are record structuring and abstraction?

PicnicHealth’s AI strategy is focused on building personal medical AI that helps patients get better care and, through our research-focused products, eventually better therapeutic options. To do this, our approach to large language model (LLM) development introduces a distinctive dimension – we bring generalized medical knowledge together with a detailed understanding of a patient’s own medical journey by interpreting their records.

Practically, this means that our team has unique expertise in building AI that can interpret longitudinal sets of records as effectively as an experienced clinician. The records we operate on for a given patient span long stretches of time, different facilities, different doctors, different diseases, and more. We train models that pull that data together into a coherent, complete picture and power PicnicHealth’s analysis, recommendation, and care tools.

Under the hood, that means our team works on developing an LLM that is jointly trained on tasks that uncover all of the nuggets of information in a set of records and align them in a common vocabulary – a process we call structuring – as well as on tasks that roll that data up into more intuitive, directly useful outputs – a process we call abstraction.
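
To make the distinction concrete, here is a minimal sketch of how a structuring task and an abstraction task might each be framed as instruction-tuning examples for a single jointly trained model. The prompts, field names, and identifiers are illustrative stand-ins, not our production schema.

```python
# Illustrative instruction-tuning examples; schema and values are hypothetical.

# A structuring-flavored task: surface a nugget of information and align it
# to a common vocabulary.
structuring_example = {
    "instruction": "Extract every medication mention from the passage and "
                   "map each to a standardized identifier.",
    "input": "Pt reports taking Coumadin 5mg daily since last visit.",
    "output": '[{"mention": "Coumadin", "code": "RXNORM:11289", '
              '"dose": "5 mg", "frequency": "daily"}]',
}

# An abstraction-flavored task: roll structured mentions up into a directly
# useful output.
abstraction_example = {
    "instruction": "Given the structured medication mentions below, summarize "
                   "the patient's warfarin use as an era with start and stop "
                   "dates and a stop reason if one is stated.",
    "input": '[{"date": "2019-03-02", "code": "RXNORM:11289", "dose": "5 mg"}]',
    "output": '{"drug": "warfarin", "start": "2019-03-02", '
              '"stop": null, "stop_reason": null}',
}
```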

Our experience is that only once a set of medical records has been both structured and abstracted can we practically get insights from it. Prior to structuring, it is impossible to sift through the sheer volume of information, to filter noise, and to look for low-level data that may combine into trends or recognizable milestones, all while being confident that we haven’t missed anything important. And it’s not until the data is abstracted that we can intuitively interrogate it and relate health status, treatments, and outcomes. This is because rarely does any one record or one piece of information in isolation say much about a patient’s journey.

The AI tasks powering structuring and abstraction

Structuring takes a stack of patient records – paper arriving at our Oakland mailroom, faxes, e-mails, CDs, instant electronic records from different EHR systems – and gets the underlying, noisy information into a form that we can manipulate and model. It spans OCR (pixels to words), document detection and metadata tagging (turning 1,000 pages faxed by a hospital into the 20 documents they represent, tagged with dates, providers, specialty, etc.), named entity recognition (NER) (free text to drugs, labs, vitals, conditions, and associated values), and ontology alignment (mapping various medication brand names to a standardized identifier).
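
As a rough illustration, the data flow looks something like the sketch below. The stage functions are stubs standing in for the in-house models described here; their names and signatures are hypothetical.

```python
# A simplified sketch of the structuring data flow. Each stage function is a
# stub for a model; none of this is PicnicHealth's actual code.

from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    date: str | None = None
    provider: str | None = None
    specialty: str | None = None
    entities: list[dict] = field(default_factory=list)

def run_ocr(page_image) -> str:
    """Pixels -> words. Stand-in for an OCR model."""
    return ""

def detect_documents(page_texts: list[str]) -> list[Document]:
    """Group e.g. 1,000 faxed pages into the distinct documents they represent."""
    return [Document(text="\n".join(page_texts))]

def extract_entities(doc: Document) -> list[dict]:
    """NER stand-in: free text -> drug/lab/vital/condition mentions with values."""
    return []

def align_to_ontology(entity: dict) -> str:
    """Map e.g. a medication brand name to a standardized identifier."""
    return "UNKNOWN"

def structure_records(page_images: list) -> list[Document]:
    texts = [run_ocr(img) for img in page_images]   # pixels -> words
    docs = detect_documents(texts)                  # pages -> documents (+ metadata)
    for doc in docs:
        doc.entities = extract_entities(doc)        # free text -> typed mentions
        for ent in doc.entities:
            ent["code"] = align_to_ontology(ent)    # mention -> standard identifier
    return docs
```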


At first blush, structuring maps onto many traditional machine learning and AI problems. In this vein, PicnicHealth has been developing in-house models from our earliest days for OCR, NER, boundary detection, anonymization, ontology alignment via k-nearest neighbor, and more – all tuned to the peculiarities and nuances of medical data. Throughout, we have sought to evolve with the best techniques available at any given time – so for NER, that meant starting in 2015 with lots of boosted decision trees and CNNs, later moving to our own BERT-style language model with positional encodings suited to the embedded tables common in EHR printouts, and most recently to LLMD, an LLM built by instruction fine-tuning on top of open-source foundation models.
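
For a sense of how the k-nearest-neighbor ontology alignment step works, here is a minimal sketch: embed the free-text mention and each canonical ontology name, then return the closest codes. The embed function is a deterministic stand-in for a learned text encoder, and the ontology slice is for illustration only.

```python
# k-nearest-neighbor ontology alignment, sketched. `embed` stands in for a
# trained text encoder; the RxNorm-style entries are an illustrative slice.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a learned text encoder; returns a deterministic unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

ONTOLOGY = {
    "RXNORM:11289": "warfarin",
    "RXNORM:1191": "aspirin",
    "RXNORM:6809": "metformin",
}
ONTOLOGY_VECS = {code: embed(name) for code, name in ONTOLOGY.items()}

def align(mention: str, k: int = 1) -> list[str]:
    """Return the k ontology codes whose name embeddings sit closest to the mention."""
    q = embed(mention)
    ranked = sorted(ONTOLOGY_VECS, key=lambda code: -float(q @ ONTOLOGY_VECS[code]))
    return ranked[:k]

# With a real encoder, a brand-name mention like "Coumadin 5mg tab" lands
# nearest the warfarin entry.
print(align("Coumadin 5mg tab", k=1))
```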


A common format for medical records that breaks the simple left-to-right, top-to-bottom layout assumptions of the positional embeddings in many early language models. One of the many improvements we made when building our first in-house language model for medical records in 2021 was a more appropriate positional embedding approach than early off-the-shelf models offered.
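
As a hedged illustration of the idea in the caption above, the sketch below adds embeddings for a token's bucketed x/y page coordinates alongside the usual 1-D sequence position, in the spirit of layout-aware models such as LayoutLM. The dimensions, bucketing, and class names are assumptions for illustration, not our actual architecture.

```python
import torch
import torch.nn as nn

class LayoutPositionalEmbedding(nn.Module):
    """Adds 1-D sequence position plus bucketed x/y page-coordinate embeddings."""

    def __init__(self, hidden: int = 256, max_seq: int = 2048, grid: int = 1024):
        super().__init__()
        self.seq = nn.Embedding(max_seq, hidden)  # classic left-to-right position
        self.x = nn.Embedding(grid, hidden)       # horizontal location on the page
        self.y = nn.Embedding(grid, hidden)       # vertical location on the page

    def forward(self, token_emb, positions, x_coords, y_coords):
        # token_emb: (batch, seq, hidden); coordinates bucketed into [0, grid)
        return token_emb + self.seq(positions) + self.x(x_coords) + self.y(y_coords)

# Four tokens laid out as a 2x2 table: cells in a row share y, in a column share x.
emb = LayoutPositionalEmbedding()
tokens = torch.randn(1, 4, 256)
positions = torch.arange(4).unsqueeze(0)          # reading order
x = torch.tensor([[10, 500, 10, 500]])            # two columns
y = torch.tensor([[40, 40, 60, 60]])              # two rows
out = emb(tokens, positions, x, y)                # shape: (1, 4, 256)
```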

Abstraction then rolls the high-granularity data collected across a patient’s records up into a form that can more directly answer clinical questions (from 1,000 mentions of a drug to a single era). Typically, abstraction produces one of three types of information: distinct events, such as diagnoses or procedures tied to a specific date; multi-occurrence or episodic events, such as pain crises, relapses, or side effects; and eras that represent spans of time, e.g. the period when a patient is taking a particular drug.
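
These three output shapes are concrete enough to pin down in code. Below is a small sketch of how they might be represented, using hypothetical dataclasses; the field names are illustrative.

```python
# Illustrative representations of the three abstraction output types.

from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    """A distinct event tied to a specific date, e.g. a diagnosis or procedure."""
    kind: str
    on: date

@dataclass
class Episode:
    """A multi-occurrence or episodic event, e.g. pain crises, relapses, side effects."""
    kind: str
    occurrences: list[date]

@dataclass
class Era:
    """A span of time, e.g. while a patient is taking a particular drug."""
    kind: str
    start: date
    stop: date | None = None          # None while ongoing
    stop_reason: str | None = None
```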

Abstraction maps less clearly at first to standard AI problems. Given 1,000 mentions of a drug in a patient’s records, it is surprisingly difficult to conclude “the patient started the drug on this date, stopped on this date, and did so for the following reason.” This is because even structured medical data is filled with uncertainty, contradictions, gaps, and noise – some mentions of a drug may capture its administration to a patient, some may be recommendations from a specialist, while others may be patient recollections of prior events, a portion of which we can verify with ancillary data such as insurance claims. Sometimes the stop reason for a medication is stated outright, but more often we need to reach deep into narrative text to identify symptoms and connect them to the change in treatment. How do we confidently combine all of that information into an intuitive drug era and ensure that we didn’t misinterpret the record, or that the answer wouldn’t change unexpectedly if we perturbed some model parameters? For these reasons, many abstraction tasks remain at best partially solved in the literature and among the practitioners we work with.
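
To see why this is hard, consider even the naive approach: filter mentions down to those that actually evidence use, then merge nearby dates into eras with a gap threshold. The toy sketch below does exactly that, and real records defeat it quickly (contradictions, unverifiable recollections, stop reasons buried in narrative text). Every name and the 90-day gap are assumptions, not our method.

```python
# A toy baseline for collapsing drug mentions into eras; real records need
# far more care than this.

from datetime import date, timedelta

# Mention types treated as evidence of use, vs. recommendations/recollections.
EVIDENCES_USE = {"administration", "refill", "claim"}

def mentions_to_eras(mentions, max_gap=timedelta(days=90)):
    """mentions: (date, mention_type) pairs for one drug -> list of (start, stop) eras."""
    dates = sorted(d for d, kind in mentions if kind in EVIDENCES_USE)
    eras, start, prev = [], None, None
    for d in dates:
        if start is None:
            start = prev = d
        elif d - prev > max_gap:      # gap too long: close the current era
            eras.append((start, prev))
            start = prev = d
        else:
            prev = d
    if start is not None:
        eras.append((start, prev))
    return eras

mentions = [
    (date(2021, 1, 5), "administration"),
    (date(2021, 2, 20), "refill"),
    (date(2021, 9, 1), "recommendation"),  # ignored: not evidence of use
    (date(2022, 1, 10), "claim"),          # long gap, so a second era begins
]
print(mentions_to_eras(mentions))
# Two eras: Jan-Feb 2021, and a single-day era in Jan 2022.
```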

Ultimately, we approach difficult abstraction tasks by mimicking clinicians – a great model for drug eras may be elusive at first, but we found that we could give a stack of medical records to a clinician and simply ask them when patients started and stopped medications and why. Their expertise and experience were key to coming back not just with an answer, but with an answer that 9 out of 10 of their peers would agree on. Our approach has therefore been to build software that helps them do this, and to simultaneously learn how they read and interpret records. This means tracking what context they look for, how they filter the information they see, how they make connections between concepts, and how they build up evidence for their conclusions.


A screenshot from an abstraction task for Multiple Sclerosis ambulatory status designed to both enable fast, consistent clinical abstraction and to collect training data to mimic how clinicians answer such questions.

Imitating how a clinical abstractor reads a record drives our approach to training our LLMs. Today, we use instruction fine-tuning on top of foundational open-weight models. This allows us to leverage the best pattern-matching and generalized knowledge capabilities available, and then tailor them to the types of tasks that matter when reading medical records. Critically, instruction fine-tuning allows us to break down interpretive abstraction tasks into simpler steps, reminiscent of chain-of-thought reasoning layered on retrieval-augmented generation to manage context.
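
As a hedged sketch of what that decomposition can look like, the code below breaks a drug-era question into retrieval, a structuring-flavored sub-task per chunk, and an abstraction-flavored roll-up. The retrieve and llm functions are stubs standing in for a record index and the fine-tuned model, not our actual interfaces.

```python
# Decomposing an abstraction question into simpler instruction-tuned steps
# over retrieved context. All names and prompts here are hypothetical.

def retrieve(index: list[str], query: str, top_k: int) -> list[str]:
    """Stub: return the top_k record chunks most relevant to the query."""
    return index[:top_k]

def llm(instruction: str, context: str) -> str:
    """Stub: call the instruction-fine-tuned model with one sub-task."""
    return f"[model output for: {instruction[:40]}...]"

def answer_drug_era(record_chunks: list[str], drug: str) -> str:
    # Step 1: retrieval keeps the context window manageable.
    chunks = retrieve(record_chunks, query=f"mentions of {drug}", top_k=20)

    # Step 2: a structuring-flavored sub-task per chunk.
    mentions = [
        llm(
            instruction=(
                f"List each mention of {drug} with its date, and label it as an "
                "administration, a recommendation, or a patient recollection."
            ),
            context=chunk,
        )
        for chunk in chunks
    ]

    # Step 3: an abstraction-flavored sub-task over the intermediate results.
    return llm(
        instruction=(
            f"From these dated, labeled mentions, state when the patient started "
            f"and stopped {drug} and the stop reason, if one is evidenced."
        ),
        context="\n".join(mentions),
    )
```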

In the next post in this series, we’ll take a look at some specific examples of the data we see in real-world patient records.