May 14, 2025 2:55:17 PM

Parsing Clinical Trial Criteria for Accurate Patient Matching

Unlocking the Potential of Trial Criteria to Improve Patient Matching

At Cognome, we’ve built (O)TESSA (Oncology Trial Eligibility Smart Screening Algorithm), an end-to-end clinical trial matching system powered by an ensemble of open-source language models. The system is organized into three core modules: Clinical Trial Eligibility Processing, Patient Information Retrieval, and Matching. In this post, we focus on the first: how structuring trial criteria enables better downstream retrieval and, ultimately, more accurate patient-trial matching. The examples come from breast cancer trials, but the core logic extends to any type of clinical trial.

Parsing Clinical Trial Criteria

The Hidden Bottleneck in Patient-Trial Matching

Many systems promise AI-driven patient-trial matching—but in our experience working closely with clinicians, we've found a major bottleneck is the trial data itself.

ClinicalTrials.gov makes it easy to pull raw JSONs, but what you get is often inconsistent, fragmented, and hard to parse. Eligibility criteria are free-form text filled with shorthand, clinical assumptions, and ambiguous logic. Before any model can determine if a patient qualifies, the trial must be deeply understood and structured.

Why Parsing Trial Criteria Is So Hard

Eligibility criteria come in all shapes and sizes. Some are long paragraphs; others are poorly formatted bullet points. A single criterion might contain multiple conditions, unclear negations, or assumptions about medical knowledge (e.g., "TNM Stage III"). The raw trial JSON we pull carries a lot of information about standard concepts relevant to the trial, but how those concepts connect to the listed eligibility criteria is often missing or unclear. The challenge starts at the trial level: unless we structure this chaos, downstream AI matching falls short.

Our Guiding Philosophy

The quality of patient-trial matching depends entirely on the quality of the trial preprocessing. Our system doesn't just read raw text; it operates over structured, semantically rich representations that let it reason like a clinician.

[Figure: diagram of the trial parsing pipeline stages]

Our Pipeline for Parsing Trials

We process trials in stages (see the pipeline diagram):

Format Detection:

We first check if the criteria are written as a paragraph or a list. Poor formatting is more common than you'd think. Some trials even merge multiple eligibility rules into a single run-on sentence.
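As a rough illustration of this check, here is a minimal sketch in Python. The regular expression, the 50% threshold, and the function name `detect_format` are all assumptions made for this example; they are not taken from (O)TESSA itself.

```python
import re

# Assumed heuristic: a line counts as a list item if it starts with a bullet
# character, or with a number or single letter followed by "." or ")".
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)]|[A-Za-z][.)])\s+")


def detect_format(criteria_text: str) -> str:
    """Return "list" if at least half of the non-empty lines look like list items."""
    lines = [ln for ln in criteria_text.splitlines() if ln.strip()]
    if not lines:
        return "paragraph"
    list_lines = sum(1 for ln in lines if LIST_ITEM.match(ln))
    return "list" if list_lines / len(lines) >= 0.5 else "paragraph"
```

A heuristic like this only catches well-marked lists; run-on sentences that merge several rules would still come back as "paragraph" and need the later segmentation stage.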

Entity Enrichment:

Raw criteria such as "Patient must have ER+ or HER2- breast cancer" contain biomarker references that are never explicitly linked to standard concepts in the raw trial record. The raw trial JSON we pull does contain NCI Thesaurus concepts, but they map to the eligibility criteria only implicitly. To make this mapping explicit, our system maps terms like "ER+" to "Estrogen Receptor Positive" and "HER2-" to "HER2 Negative". Once a criterion has mapped concept IDs, we can use their synonyms to enrich the context for downstream matching.

 
Criterion: Patient must have ER+ or HER2- breast cancer
Available Entities in trial JSON: 
[
"Estrogen Receptor Positive", 
"HER2 Negative", 
"Triple Negative Breast Cancer",
"Oral Endocrine Therapy" ...
]

Map all terms that correspond to the criterion.
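To make the input and output of this step concrete, here is a toy sketch in Python. It swaps the model-driven mapping for a hand-written shorthand dictionary, which is an assumption for illustration only; the `SHORTHAND` table and the `map_entities` function are hypothetical names, and naive substring matching like this would not be robust on real criteria.

```python
# Hypothetical shorthand table, standing in for the model-driven mapping.
SHORTHAND = {
    "ER+": "Estrogen Receptor Positive",
    "HER2-": "HER2 Negative",
}


def map_entities(criterion: str, available_entities: list[str]) -> list[str]:
    """Return the trial entities that the criterion mentions by shorthand or by name."""
    text = criterion.lower()
    matched = []
    for entity in available_entities:
        mentioned_by_name = entity.lower() in text
        mentioned_by_shorthand = any(
            full == entity and short.lower() in text
            for short, full in SHORTHAND.items()
        )
        if mentioned_by_name or mentioned_by_shorthand:
            matched.append(entity)
    return matched
```

Called with the criterion and entity list shown above, this returns only the two biomarker entities and leaves "Triple Negative Breast Cancer" and "Oral Endocrine Therapy" unmapped.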

Key Requirement Extraction:

In the "extract_structured_criteria" phase, we extract four anchor requirement types for breast cancer trials: cancer stage, cancer type, HR status, and prior therapies. Extracting these key requirements lets us pre-filter unqualified patients early in our matching pipeline.

{
  "id": "cancer_stage_inc_001",
  "value": "Stage IV or locally advanced breast cancer",
  "group": "cancer_stage",
  "type": "include",
  "source": "trial_title"
},
{
  "id": "cancer_stage_exc_001",
  "value": "Metastatic disease",
  "group": "cancer_stage",
  "type": "exclude",
  "source": "criterion_exc_C1" 
} ....
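One way such records could drive an early pre-filter is sketched below in Python. The field names mirror the records above, but the `KeyRequirement` class, the `passes_prefilter` function, the exact-string comparison, and the sample values in the usage note are assumptions for illustration; real matching would need to handle synonyms and staging equivalences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRequirement:
    id: str
    value: str
    group: str   # e.g. "cancer_stage"
    type: str    # "include" or "exclude"
    source: str  # where the requirement was found


def passes_prefilter(patient_facts: dict[str, set[str]],
                     requirements: list[KeyRequirement]) -> bool:
    """Reject a patient only when a known fact clearly conflicts with a requirement."""
    for group in {r.group for r in requirements}:
        facts = patient_facts.get(group, set())
        includes = {r.value for r in requirements
                    if r.group == group and r.type == "include"}
        excludes = {r.value for r in requirements
                    if r.group == group and r.type == "exclude"}
        if excludes & facts:
            return False  # patient has an excluded value
        if includes and facts and not (includes & facts):
            return False  # patient has facts for this group, none of them allowed
    return True  # missing information is left for the full matcher
```

For example, with a single include requirement of "Stage IV" in the "cancer_stage" group, a patient recorded as "Stage II" is filtered out, while a patient with no recorded stage passes through to later stages. Treating missing data as a pass is a design choice in this sketch: a pre-filter should drop only clearly unqualified patients.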

Segmentation:

In this phase, we break down each listed criterion into atomic, testable conditions. Consider this criterion:

"No prior chemotherapy or radiation to the breast is allowed. Bisphosphonates are allowed."

We break it into:

Segment A: Patient must not have received chemotherapy to the breast
Segment B: Patient must not have received radiation to the breast
Segment C: Patient may have received bisphosphonates

Breaking a criterion down this way enables better retrieval for each segment, because the retrieved patient information is relevant only to that segment. In the next stage, we enrich each segment to improve retrieval further and, in turn, downstream matching.

Segment Enrichment:

Every segment is tagged with:

Group: e.g., Biomarkers, Lab/Vitals, Prior Therapies
Matchability: Can we find this in unstructured patient notes?
Hardness: Is it objective and testable?

For each segment, we also create search queries and simplified versions. The search queries reflect how information about a criterion may be present in patient notes. For example, the segment "Patient must not have received prior oral endocrine therapy" produces the following search queries:

"search_queries": ["no prior oral endocrine therapy",
                   "patient has not received oral endocrine therapy",
                   "no history of oral endocrine therapy",
                   "prior oral endocrine therapy exclusion",
                   "patient must not have had oral endocrine therapy"]

These search queries help retrieve relevant patient note chunks that might have been missed had we searched with only the original criterion, or even the broken-down segment.
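The benefit of several phrasings can be sketched with a toy retriever in Python. Token overlap is used here purely as a stand-in for whatever similarity measure the real retriever uses; the function names and the note chunks in the usage note are made up for this example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, chunk: str) -> float:
    """Fraction of query tokens found in the chunk (stand-in for a real similarity)."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0


def retrieve(search_queries: list[str], chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by their best score across all query phrasings."""
    ranked = sorted(
        chunks,
        key=lambda c: max((score(q, c) for q in search_queries), default=0.0),
        reverse=True,
    )
    return ranked[:top_k]
```

Given a hypothetical chunk such as "Patient reports no history of oral endocrine therapy.", the query "no history of oral endocrine therapy" matches every token, while the query "no prior oral endocrine therapy" misses on "prior". Taking the best score over all phrasings is what lets the extra queries recover chunks a single phrasing would rank lower.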

Table 1: Segment Classification (Segment Enrichment module). Rows highlighted in green in the original are the ones we match on.

Segment	Group	Matchable	Hardness	Reason
Age > 18	Demographics	Yes	HARD	Always in notes
Stage III Breast Cancer	Cancer Stage	Yes	HARD	Present in notes
ECOG score 0-1	Performance Status	Yes	SOFT	Often subjective
Consent form signed	Consent/Compliance	No	SOFT	Administrative only
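The tags in Table 1 lend themselves to a simple filter. The Python sketch below encodes the four table rows and keeps only matchable, HARD segments from an allowed set of groups; the `Segment` class, the `select_for_matching` function, and the particular allowed groups in the usage note are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    group: str
    matchable: bool
    hardness: str  # "HARD" or "SOFT"


# The four rows of Table 1.
SEGMENTS = [
    Segment("Age > 18", "Demographics", True, "HARD"),
    Segment("Stage III Breast Cancer", "Cancer Stage", True, "HARD"),
    Segment("ECOG score 0-1", "Performance Status", True, "SOFT"),
    Segment("Consent form signed", "Consent/Compliance", False, "SOFT"),
]


def select_for_matching(segments: list[Segment], allowed_groups: set[str]) -> list[Segment]:
    """Keep segments that are matchable, HARD, and in one of the allowed groups."""
    return [
        s for s in segments
        if s.matchable and s.hardness == "HARD" and s.group in allowed_groups
    ]
```

With allowed groups of "Demographics" and "Cancer Stage", only the first two rows survive; the ECOG row is dropped for being SOFT and the consent row for being unmatchable.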

System Criteria:

Once we filter for matchable, HARD segments from the right groups, we regenerate the full eligibility logic using only those building blocks:

INPUT:
Segments (Filtered for HARD segments that fall in specified groups):
  Segment A: Biomarkers - HER2-positive [Inclusion]
  Segment C: Cancer Stage - Stage IV [Inclusion]
  Segment D: Previous Therapies - Prior Radiation Therapy [Exclusion]

OUTPUT:
{
System Criterion: A patient may qualify if they have HER2-positive, Stage IV breast cancer AND no prior radiation.
Structured Logic: (Segment A AND Segment C AND Segment D)

}
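Structured logic of this kind can be evaluated mechanically once each segment has a verdict. The Python sketch below represents the expression as a small tree and evaluates it; the tree encoding and the `evaluate` function are assumptions for this example. It also assumes that an exclusion segment such as Segment D counts as satisfied when the patient does not have the excluded condition, which is one reading of the logic above.

```python
from typing import Union

# A segment name, or an (operator, operands) pair where operator is "AND" or "OR".
Expr = Union[str, tuple]


def evaluate(expr: Expr, satisfied: dict[str, bool]) -> bool:
    """Evaluate a logic tree against per-segment verdicts."""
    if isinstance(expr, str):
        return satisfied[expr]
    op, operands = expr
    results = [evaluate(e, satisfied) for e in operands]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


# (Segment A AND Segment C AND Segment D)
LOGIC = ("AND", ["Segment A", "Segment C", "Segment D"])
```

Under the stated assumption, a patient who is HER2-positive and Stage IV but has received prior radiation would have Segment D marked unsatisfied, and the whole expression would evaluate to false.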

Why Modularity Matters

Our trial parsing pipeline is designed as a modular, stepwise chain—each stage adding structure, context, and precision. We begin by detecting the formatting of the eligibility criteria as sourced from ClinicalTrials.gov. From there, we enrich each criterion by mapping terms to known clinical concepts provided in the trial metadata.

Next, we extract structured key requirements—such as cancer type, stage, biomarkers, and prior therapies—which help us pre-filter patients. We then segment complex eligibility statements into atomic, testable conditions. Each segment is further enriched with metadata that flags whether it’s matchable, objective (HARD), and clinically useful.

Finally, we recombine only the matchable, relevant segments into unified system criteria—clear, structured representations that define how a patient may qualify. These become the foundation for downstream patient-trial matching.

We use LangGraph to manage this architecture, allowing us to track, debug, and iterate on each node independently without breaking the entire pipeline.
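The idea of independently testable nodes can be shown without any framework. The Python sketch below is a plain standard-library stand-in, not LangGraph's API: each stage is a function from state to state, and a small runner records which nodes ran. The node names, the state keys, and the placeholder logic inside each node are assumptions for illustration.

```python
from typing import Callable

State = dict
Node = Callable[[State], State]


def detect_format_node(state: State) -> State:
    """Placeholder format check: treat text starting with "-" as a list."""
    text = state["raw_criteria"]
    fmt = "list" if text.lstrip().startswith("-") else "paragraph"
    return {**state, "format": fmt}


def segment_node(state: State) -> State:
    """Placeholder segmentation: one segment per non-empty line."""
    parts = [ln.lstrip("- ").strip()
             for ln in state["raw_criteria"].splitlines() if ln.strip()]
    return {**state, "segments": parts}


def run_pipeline(nodes: list[tuple[str, Node]], state: State) -> tuple[State, list[str]]:
    """Run nodes in order and return the final state plus a trace of node names."""
    trace = []
    for name, node in nodes:
        state = node(state)
        trace.append(name)
    return state, trace
```

Because each node only reads and extends the shared state, any one of them can be replaced or unit-tested on its own, which is the property the modular design relies on.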

Interactive Dashboards for Clinicians

All of this would mean nothing if clinicians couldn’t see it, understand it, and intervene. That’s why we expose every step of our pipeline in a live dashboard. Clinicians can view the parsed trial, see segment-level logic, and even edit entries if something looks off. This is exactly where a modular design shines: we can take clinician feedback at each stage and improve that stage independently, with minimal effect on the rest of the pipeline.

Matching Starts with the Right Data

In summary, the structured trial data powers our downstream RAG-based matcher. We query relevant patient note chunks per segment, ensuring we match only on HARD, matchable criteria from the selected groups. Guided by this structure, (O)TESSA can make judgments with clarity and confidence.

What’s Next

We’re expanding this pipeline to other disease domains and integrating OMOP concept mapping to link criteria with standard vocabularies. And because every trial we process is structured and traceable, clinicians can close the loop with real-world feedback.

Conclusion

Parsing trial criteria isn’t a side task. It’s the foundation of everything. At Cognome, we’re not just matching patients to trials. We’re building the data layer that makes it possible. In the next few articles we will dive deeper into how patient information is retrieved and how our multi-stage matching runs, realizing the benefits of having a robust criteria processing stage as the first step.