JuliaCon 2025

A RAG-LLM Workflow for Observational Health Research
2025-07-23 , Main Room 3

We describe a Retrieval-Augmented Generation (RAG)-informed Large Language Model (LLM) workflow for querying observational health data in the OMOP Common Data Model (OMOP CDM). Leveraging Julia software such as FunSQL.jl and OMOPCDMCohortCreator.jl, we explore how the model can automate and refine complex research queries in health research settings. This work explores RAG architectures, query examples, and reproducibility within health informatics.


Description

This work presents an innovative approach to developing a Retrieval-Augmented Generation (RAG)-informed Large Language Model (LLM) tailored to a domain-specific query language within observational health settings. By leveraging the Julia ecosystem, particularly the growing JuliaHealth community and its specialized tools, this project aims to help observational health researchers investigate complex patient datasets stored in the OMOP Common Data Model (OMOP CDM).

Background and Context

Observational health research relies on retrospective data, including patient medical records and claims, to inform studies in pharmacovigilance, public health surveillance, and health economics. The OMOP CDM standardizes these data, enabling scalable and reproducible research across diverse clinical datasets. Within this framework, the JuliaHealth community has been exploring tools that simplify complex data queries and analyses. Tools such as FunSQL.jl offer a domain-specific language (DSL) that abstracts SQL’s intricacies, translating high-level query expressions into executable SQL commands. This DSL not only facilitates query composition and reasoning but also promotes reuse and modularity across various observational health research applications. Additionally, it enables the development of queries that work across different SQL dialects, enhancing interoperability across database systems.
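To make the idea of a composable query DSL concrete, here is a minimal sketch in Python. It is a toy illustration only and is not the FunSQL.jl API: the `Query` class, its methods, and the `render` function are hypothetical names invented for this example. It shows a query fragment being reused and then rendered for two database systems that limit rows differently (`LIMIT n` in PostgreSQL, `TOP n` in SQL Server). The table and column names follow the OMOP CDM `person` table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Query:
    """A tiny immutable query description (hypothetical; not the FunSQL.jl API)."""
    table: str
    columns: tuple = ("*",)
    conditions: tuple = ()
    limit: Optional[int] = None

    def select(self, *columns):
        # Return a new query with a different column list.
        return Query(self.table, columns, self.conditions, self.limit)

    def where(self, condition):
        # Return a new query with one more filter condition.
        return Query(self.table, self.columns, self.conditions + (condition,), self.limit)

    def take(self, n):
        # Return a new query restricted to the first n rows.
        return Query(self.table, self.columns, self.conditions, n)


def render(query, dialect="postgresql"):
    """Turn the query description into SQL text for the given dialect."""
    top, tail = "", ""
    if query.limit is not None:
        if dialect == "sqlserver":
            top = f"TOP {query.limit} "
        else:
            tail = f" LIMIT {query.limit}"
    where = ""
    if query.conditions:
        where = " WHERE " + " AND ".join(query.conditions)
    columns = ", ".join(query.columns)
    return f"SELECT {top}{columns} FROM {query.table}{where}{tail}"


# A reusable fragment, extended into a concrete query.
born_before_1960 = Query("person").where("year_of_birth < 1960")
sample = born_before_1960.select("person_id", "year_of_birth").take(10)

print(render(sample, "postgresql"))
print(render(sample, "sqlserver"))
```

Because each method returns a new object, the `born_before_1960` fragment can be shared across studies without being modified, which is the kind of reuse and modularity the paragraph above describes.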

RAG-Informed LLM and Agentic Workflows

At the core of this project is the development of a RAG-informed LLM that integrates with the FunSQL.jl DSL. The proposed system draws on a knowledge corpus built from key resources (FunSQL.jl, OMOPCDMCohortCreator.jl, and the OMOP CDM standard) together with additional materials from the broader Julia and generative AI communities. This corpus serves as the backbone for the LLM’s contextual understanding, enabling it to generate, refine, and validate complex queries automatically.

An agentic workflow is a dynamic, iterative process where an AI system actively engages with data, tools, and human feedback to refine its outputs over time. Instead of passively generating responses, the system continuously evaluates, modifies, and optimizes its results based on execution feedback, constraints, and expert input. This approach enhances adaptability, ensuring that AI-generated outputs are accurate and contextually relevant.

Within our approach, we incorporate several tools from the JuliaGenAI community, such as PromptingTools.jl and RAGTools.jl, which facilitate dynamic and interactive workflows. These tools allow the LLM not only to generate queries but also to refine them iteratively based on constraints encountered during execution and on researcher feedback. In this agentic workflow, the LLM generates initial queries, evaluates their execution results, and integrates researcher feedback to improve accuracy and relevance. By iterating in this way, the approach bridges the gap between automated query generation and human expert validation, so that the queries produced are both semantically correct and meaningful in a research context.
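The generate, execute, and refine loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: `stub_generate` is a hypothetical stand-in for an LLM call, `refine_query` is a name invented here, and an in-memory SQLite table stands in for an OMOP CDM database. Only execution feedback is modeled; in the workflow described here, researcher feedback would enter the same loop.

```python
import sqlite3


def stub_generate(question, feedback=None):
    """Hypothetical stand-in for an LLM call.

    The first attempt uses a wrong column name; once error feedback is
    supplied, a corrected query is returned.
    """
    if feedback is None:
        return "SELECT COUNT(*) FROM person WHERE birth_year < 1960"
    return "SELECT COUNT(*) FROM person WHERE year_of_birth < 1960"


def refine_query(question, conn, generate, max_rounds=3):
    """Generate SQL, run it, and feed any database error back to the generator."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        sql = generate(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            # Keep the error text so the next attempt can react to it.
            feedback = f"Query failed: {err}"
            history.append((sql, feedback))
            continue
        history.append((sql, "ok"))
        return sql, rows, history
    raise RuntimeError(f"No valid query after {max_rounds} rounds: {history}")


# Toy database standing in for an OMOP CDM instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, 1950), (2, 1975), (3, 1958)])

sql, rows, history = refine_query(
    "How many people were born before 1960?", conn, stub_generate
)
print(sql)
print(rows)
print(len(history), "attempts")
```

The `history` list records each attempt alongside its outcome, which gives a human reviewer an audit trail of how the final query was reached.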

Objectives and Methodology

The primary goals of this project are multifaceted:

Evaluation of State-of-the-Art LLMs

This project assesses various local LLMs based on inference times, accuracy, and adherence to best practices within the Julia ecosystem. This evaluation is crucial to determine the most suitable models for integration into the RAG pipeline.
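As a rough illustration of how such a comparison might be organized, the sketch below times each candidate model and scores its answers against reference queries. Everything here is hypothetical: `model_a` and `model_b` are stub functions standing in for local LLMs, the test cases are invented, and case-insensitive exact match is only a crude proxy for accuracy (comparing query execution results would be more robust). Adherence to Julia best practices is not covered by this sketch.

```python
import time
from statistics import mean


def evaluate(model, cases):
    """Return mean latency in seconds and exact-match accuracy for one model."""
    latencies = []
    correct = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        # Case-insensitive exact match; a deliberately simple accuracy proxy.
        correct += int(answer.strip().lower() == expected.strip().lower())
    return {"mean_latency_s": mean(latencies), "accuracy": correct / len(cases)}


# Hypothetical stand-ins for local LLMs.
def model_a(prompt):
    return "SELECT COUNT(*) FROM person"


def model_b(prompt):
    return "SELECT * FROM person"


# Invented test cases: (prompt, reference query).
cases = [
    ("Count all patients.", "SELECT COUNT(*) FROM person"),
    ("Count every person.", "select count(*) from person"),
]

results = {
    name: evaluate(model, cases)
    for name, model in [("model_a", model_a), ("model_b", model_b)]
}
for name, metrics in results.items():
    print(name, metrics)
```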

Knowledge Corpus Development

A robust corpus is assembled from the documentation and source code of FunSQL.jl, OMOPCDMCohortCreator.jl, and the OMOP CDM. This corpus can be augmented with insights from related packages and tools, such as EasyContext.jl for embedders and rerankers, ensuring a comprehensive knowledge base for the LLM.

Designing a Hybrid RAG Architecture

The LLM pipeline incorporates advanced embedding models with optimized dimensions (e.g., 256–1024) and, where applicable, binary embeddings. The integration of vector databases (or alternative optimizations such as binary or Matryoshka-based approaches) is carefully considered. In hybrid RAG setups, traditional techniques such as BM25 are also evaluated to complement embedding-based methods.
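A minimal sketch of hybrid retrieval, assuming a three-document toy corpus, may help make these pieces concrete. It combines a standard Okapi BM25 lexical score with binary embeddings compared by Hamming distance, then merges the two rankings with reciprocal rank fusion. The fusion method is an assumption of this example; the source does not say how rankings are combined. The embedding vectors are hand-written placeholders rather than model output, and the optional `dims` prefix truncation only makes sense for models trained in the Matryoshka style.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def binarize(vector, dims=None):
    """Keep an optional prefix of the vector, then store one sign bit per dimension."""
    prefix = vector if dims is None else vector[:dims]
    return [1 if x > 0 else 0 for x in prefix]


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank(scores, reverse):
    """Document indices ordered by score (reverse=True means higher is better)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings; each contributes 1 / (k + position) per document."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "condition occurrence table stores patient diagnoses",
    "drug exposure table stores prescriptions and dispensings",
    "person table stores demographics such as year of birth",
]
# Placeholder embeddings (hand-written, not produced by a real model).
doc_vectors = [[0.8, 0.2, -0.1, 0.1], [-0.5, 0.7, 0.3, -0.2], [0.1, -0.6, 0.4, 0.5]]
query = "patient diagnoses table"
query_vector = [0.9, 0.1, -0.2, 0.3]

bm25_ranking = rank(bm25_scores(query, docs), reverse=True)
query_code = binarize(query_vector)
distances = [hamming(query_code, binarize(v)) for v in doc_vectors]
hamming_ranking = rank(distances, reverse=False)
fused = reciprocal_rank_fusion([bm25_ranking, hamming_ranking])

print("BM25:", bm25_ranking)
print("Hamming:", hamming_ranking)
print("Fused top hit:", docs[fused[0]])
```

Rank-based fusion is used here because BM25 scores and Hamming distances live on different scales, so combining positions avoids having to normalize the raw values.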

Agentic Workflow and Query Development

The primary objective of this study is to evaluate the effectiveness of a RAG-informed LLM workflow in answering complex observational health research questions. By integrating FunSQL.jl, OMOP CDM, and JuliaGenAI tools, this workflow aims to automate, refine, and validate query generation for large-scale health data analysis. To assess its real-world applicability, we will apply this approach to a population characterization st

My name is Jacob Scott Zelko! I am currently pursuing my MS in Applied Mathematics at Northeastern University (NEU) and am a trainee of NEU's Roux Institute.

My research career has focused broadly on population health, in particular on chronic mental illness (i.e., depression, suicidality, and bipolar disorder), social determinants of health and health disparities within intersectional populations, chronic illness, and neurocognitive disabilities. As a convergence of my interests, I am very interested in how we can use mathematical structures (such as categories) to establish meaningful relationships between non-traditional health data sources to gain greater insights into population health. To bridge these worlds, I have been heavily involved with observational health research methods using "Real World Data" and am an active member of both the OHDSI and Category Theory communities.

Param Umesh Thakkar is a Computer Engineering student at VJTI, Mumbai, specializing in Generative AI, LLMs, and automation. He has built AI-driven recommendation models, content tools, and multi-agent systems using LangChain, CrewAI, and Llama. An open-source contributor at SciML, he has worked on Julia's OrdinaryDiffEq.jl and ODE solver testing. Skilled in C++, JavaScript, and Python, he secured third place in the Deloitte Quantum Climate Challenge 2024.