Why my clinical RAG is mostly SQL

In the previous post I built a RAG system over my One Piece notes. At the end, I said I'd tackle something more real next: RAG for the healthcare industry. So here we are.
The anime version was fun, and the stakes were zero. If the model got the Void Century wrong, the worst case was me arguing with my nakama. Clinical data is a different story: here, a confident wrong answer is the problem.
So this time I kept the same architecture but pointed it at data where correctness matters: patient-scoped Q&A over synthetic electronic health records. The kind of question a clinician would ask about the one patient in front of them: "What conditions has this patient been diagnosed with?", "Any drug-allergy contraindications?", "What was their HbA1c trend?"
No real patient data, of course. I used Synthea, which generates realistic synthetic FHIR bundles, complete with encounters, conditions, medications, labs, and clinical notes.
Here's how the pieces fit together at the application layer:
One patient at a time
There's a pretty standard RAG playbook by now: docs go through an embedding model into a vector store, the question goes through the same model, and you grab the nearest matches to stuff into the prompt. That works fine if you're asking what the literature says about diabetes.
But a clinician doesn't ask the literature. They ask about this patient. "Does Lindsey have any active medications?" is not a semantic search problem, it's a SELECT ... WHERE patient_id = ? AND status = 'active'. The data is already structured.
That reframing has two consequences. The most important one is for access control: the patient ID stops being a query parameter and becomes a security boundary. It's bound server-side from the URL (/api/patients/:patientId/query), never something the model can choose. The model literally cannot ask about a patient it wasn't scoped to.
The other consequence is for retrieval. Structured data like encounters, labs, and medications lives in SQL tables. The freeform notes a doctor writes live in a vector index. The system needs both, and it needs to know which to use per question.
@Post(':patientId/query/stream')
@Sse()
async queryStream(
@Param('patientId', new ParseUUIDPipe({ version: '4' })) patientId: string,
@Body() body: QueryDto,
) {
return this.queryService.streamAboutPatient(patientId, body.question);
}
The agentic loop with tool-use
Instead of pre-retrieving context and hoping it's relevant, I gave the model a set of tools and let it decide what to fetch.
The model gets seven tools, each backed by a real query:
| Tool | Backed by |
|---|---|
get_conditions |
SQL on the conditions table |
get_medications |
SQL, filtered by status = 'active' |
get_labs |
SQL on observations |
get_allergies |
SQL on allergies |
get_encounters |
SQL on encounters |
get_immunizations |
SQL on immunizations |
search_notes |
vector similarity search (pgvector + Titan embeddings) |
Bedrock Converse supports tool-use natively: you pass the tool definitions, the model responds with either text or a tool_use block, you execute the tool, hand back the result, and repeat until the model stops asking for tools.
Here's one turn of that loop:
Here's a simplified version. The real one is an async generator that yields SSE events (tool calls, text deltas, final usage):
async function* runAgenticLoop(question: string, tools: Tool[]) {
const messages = [{ role: 'user', content: [{ text: question }] }];
while (true) {
const response = await bedrock.converse({
messages,
toolConfig: { tools },
});
messages.push(response.output.message);
const toolUses = response.output.message.content.filter((c) => c.toolUse);
if (toolUses.length === 0) {
break;
}
const results = await Promise.all(
toolUses.map((t) => executeTool(t.toolUse.name, t.toolUse.input))
);
messages.push({ role: 'user', content: results });
}
}
Here's a real trace. I asked the deployed system "What conditions has this patient been diagnosed with?" about a synthetic patient:
The model called get_conditions once (no arguments needed, the patient is already scoped), got back 4 rows, and summarized them. The metrics panel shows what that cost: time-to-first-token 1518ms, total 7.61s, 3,248 input tokens, 309 output tokens, 2 iterations, 1 tool call, and $0.0144. The per-tool latency was 94ms, so almost all the wall-clock time went to Bedrock round-trips, not to the database.
Hybrid retrieval: SQL where it makes sense, vectors where it doesn't
If your data is already structured and you're running it through a vector store anyway, that's a smell.
Embedding "blood pressure 120/80 on 2024-03-15" into a 1024-dimension vector and hoping cosine similarity retrieves it when someone asks for the latest reading is strictly worse than SELECT ... ORDER BY date DESC LIMIT 1. It's slower, less precise, and you lose the ability to filter or sort.
So the split is deliberate:
async getMedications(patientId: string) {
return this.prisma.medication.findMany({
where: {
patientId,
status: 'active',
OR: [
{ periodEnd: null },
{ periodEnd: {
gt: new Date() }
}
],
},
orderBy: { periodStart: 'desc' },
});
}
async searchNotes(patientId: string, query: string) {
const embedding = await this.bedrock.embed(query);
return this.prisma.$queryRaw`
SELECT content, 1 - (embedding <=> ${embedding}::vector) AS score
FROM document_chunks
WHERE patient_id = ${patientId}::uuid
ORDER BY embedding <=> ${embedding}::vector
LIMIT 5
`;
}
The <=> operator is pgvector's cosine distance. The index behind it is HNSW (Hierarchical Navigable Small World), which makes approximate nearest-neighbor search fast even as the note corpus grows.
The model picks the right tool per question based on the tool descriptions. Lifestyle and symptom questions land on search_notes. Questions about concrete clinical facts (medications, allergies, labs) land on the structured tools. I never wrote any routing logic.
Eval-driven development
I built an eval harness before I trusted any of this.
I auto-generated about 60 questions across three categories, pulling the ground-truth answers straight from the database so I always knew the correct response:
Factoid: "How many encounters has this patient had?" (direct lookup)
Temporal: "What was the most recent lab result?" (requires ordering by date)
Reasoning: "Are there any drug-allergy contraindications?" (requires combining two tools)
Then I used Claude Haiku as a judge to compare the agent's answer against ground truth. In my runs the system scored well on factoid and reasoning, and noticeably weaker on temporal questions, which tracks: "most recent" requires the model to reason about dates, and that's exactly where it slipped.
Two failure modes the evals caught that unit tests didn't:
get_medicationswasn't filtering bystatus = 'active'. It returned historical meds too, so "what is she taking now" included drugs she'd stopped months ago.get_labsdeduplicated lab results by code using aMap, butMap.setkeeps the last write, which after iteration meant the oldest reading survived instead of the newest. A one-line fix once I saw it, but I only saw it because the eval flagged a wrong "latest value."
The bugs didn't throw errors. They returned confident wrong answers, which is the failure mode unit tests miss most often.
Streaming the response
A 7-second response feels broken when the screen stays blank. Spend the same 7 seconds watching the model call get_conditions, see the rows come back, and watch the answer type itself out, and the wait disappears.
So the frontend streams everything over Server-Sent Events. React + Vite on the front, eventsource-parser decoding the stream, and a "live turn" component that renders the in-flight response (tool calls included) separately from the completed messages.
Here's what that looks like in the deployed app:
The tool calls aren't there for show. The clinician sees which data each answer is based on. When the model says "she has an active medication review due," the trace shows it actually queried get_conditions to say that.
A few notes from the build
Tool-input streaming in Bedrock Converse is cleaner than you might expect. The contentBlockDelta events interleave text and tool-input deltas, and the model "filling in" a tool's arguments streams without any extra parsing on the consumer side.
Synthea's FHIR bundles have one quirk worth flagging. They use urn:uuid references that only resolve inside that one bundle. A condition pointing to its encounter via urn:uuid:abc-123 won't match any real database id. The fix is two-pass ingestion: first insert everything to build a urn:uuid → db_uuid map, then wire up the foreign keys on the second pass.
And one recommendation on testing LLM systems: don't lean on unit tests for correctness. They check that your code does what you asked, not whether the answers are right. For anything where the failure mode is "confident wrong answer" instead of "thrown error," evals are the only safety net.
What's next
The demo runs perfectly on localhost. Next post: how I deployed it to AWS (Terraform, ECS Fargate, RDS with pgvector) and six gotchas I'd flag for anyone running this stack in production.
Keep coding and learning!



