2021-07-28 –, Red
Julia programmers sometimes need to interrogate data with the Structured Query Language (SQL). But SQL is notoriously hard to write in a modular fashion. There is no way to reuse SQL query fragments among different queries.
FunSQL exposes full expressive power of SQL with a compositional semantics. FunSQL allows you to build queries incrementally from small independent fragments. This approach is particularly useful for building applications that programmatically construct SQL queries.
To introduce FunSQL, we will construct a practical query from healthcare informatics and then discuss how it works. We use a fragment of the OMOP Common Data Model, a cross-platform database model for observational healthcare data.
As typical in healthcare, this schema is patient-centric. The table person
contains de-identified information about patients including the unique identifier, approximate birthdate, and demographic information. To make this table available for FunSQL, we define it as follows.
const person = SQLTable(:person, columns = [:person_id, :year_of_birth, :location_id])
The patient
table has a foreign key to location
, which specifies geographic location, typically down to a zipcode.
const location = SQLTable(:location, columns = [:location_id, :city, :state, :zip])
Each person is associated with clinical events: encounters with care providers, recorded observations, diagnosed conditions, performed procedures, etc. We will represent one of them.
const visit_occurrence = SQLTable(:visit_occurrence, columns = [:visit_occurrence_id, :person_id, :visit_start_date])
With this background in place, let us suppose a physician scientist asks:
When was the last time each person, born in 2000 or earlier and living in Illinois, was seen by a care provider?
This research question could be answered using FunSQL.
From(person) |> Where(Get.year_of_birth .<= 2000) |> Join(:location => From(location), on = (Get.location_id .== Get.location.location_id)) |> Where(Get.location.state .== "IL") |> Join(:visit_group => From(visit_occurrence) |> Group(Get.person_id), on = (Get.person_id .== Get.visit_group.person_id), left = true) |> Select(Get.person_id, :max_visit_start_date => Get.visit_group |> Agg.Max(Get.visit_start_date))
FunSQL provides operations with familiar SQL names such as From
, Where
, Join
, Group
, and Select
, which can be chained together using the |>
operator. The notation :location => From(location)
, and its counterpart Get.location.state
, lets us arrange table attributes hierarchically. Most importantly, the query can be constructed and tested incrementally, one operation at a time.
Contrast this with a hand-crafted SQL query.
SELECT p.person_id, MAX(vo.visit_start_date) FROM person p JOIN location l ON (p.location_id = l.location_id) LEFT JOIN visit_occurrence vo ON (p.person_id = vo.person_id) WHERE (p.year_of_birth <= 2000) AND (l.state = 'IL') GROUP BY p.person_id
Although the SQL query is compact, it cannot be incrementally constructed. Indeed, if we follow the progression of the research question, we arrive at:
FROM person p WHERE (p.year_of_birth <= 2000) JOIN location l ON (p.location_id = l.location_id) ...
But this is not valid SQL. SQL enforces a rigid order of clauses: FROM
, JOIN
, WHERE
, GROUP BY
. As we refine a SQL query, attempting to incrementally correlate it with the research question, we are always forced to backtrack and rebuild it. This is what makes SQL tedious and error-prone.
FunSQL solves the problem of compositional query construction by representing individual operations as subqueries with a deferred SELECT
list.
q1 AS (SELECT ... FROM person) q2 AS (SELECT ... FROM q1 WHERE q1.year_of_birth <= 2000) q3 AS (SELECT ... FROM location) q4 AS (SELECT ... FROM q2 JOIN q3 ON (q2.location_id = q3.location_id)) q5 AS (SELECT ... FROM q4 WHERE q4.state = 'IL') q6 AS (SELECT ... FROM visit_occurrence) q7 AS (SELECT ... FROM q6 GROUP BY q6.person_id) q8 AS (SELECT ... FROM q5 LEFT JOIN q7 ON (q5.person_id = q7.person_id))
The final subquery fixes the output columns.
SELECT q8.person_id, q8.max_visit_start_date FROM q8
Once the output columns are known, each deferred SELECT
list can be resolved automatically. For instance, references q1.year_of_bith
, q2.location_id
, q5.person_id
force q1
to take the following form.
q1 AS (SELECT person_id, year_of_birth, location_id FROM person)
This SELECT
resolution also propagates aggregate expressions. Thus, q7
becomes:
q7 AS (SELECT q6.person_id, MAX(q6.visit_start_date) AS max_visit_start_date FROM q6 GROUP BY q6.person_id)
This approach provides a uniform compositional interface to the variety of SQL operations, preserving the expressive power of SQL while eliminating its stifling inflexibility.
For a Julia programmer, FunSQL realizes query operations as 1st class objects. Treated as values, they could be generated independently, assembled into composite operations, and remixed as needed. FunSQL lets us construct queries systematically, converging upon the research questions we wish to ask our databases.
Collaborator on YAML, HTSQL, DataKnots, and other projects that advance the usability of software systems.
The author of PyYAML, LibYAML, HTSQL, and DataKnots.