JuliaCon 2022 (Times are UTC)

RegressionFormulae.jl: familiar `@formula` syntax for regression
07-29, 17:40–17:50 (UTC), Purple

StatsModels.jl provides the @formula mini-language for conveniently specifying table-to-matrix transformations for statistical modeling. RegressionFormulae.jl extends this mini-language with additional syntax that may be familiar to users coming from other statistical modeling ecosystems such as R. This package also serves as a template for developers who wish to extend the StatsModels.jl @formula syntax in their own packages.


StatsModels.jl provides the @formula mini-language for conveniently specifying table-to-matrix transformations for statistical modeling. This mini-language is designed with extensibility and composability in mind, using normal Julia mechanisms of multiple dispatch to implement additional syntax both inside StatsModels.jl and in external packages. RegressionFormulae.jl takes advantage of this extensibility to provide additional syntax that is familiar to many users of other statistical software (e.g., R) in an "opt-in" manner, without forcing all downstream packages that depend on StatsModels.jl/@formula to support this syntax.

The StatsModels.jl @formula syntax is based on the Wilkinson-Rogers formula notation, which has been a widely used standard in multi-factor regression modeling since it was first described in Wilkinson and Rogers (1973). The basic syntax includes operators for addition (+), interaction (&), and crossing (*) of regressors, as well as the ~ operator to link outcome and regressor terms. As the conventions around this syntax have evolved over the last 50 years, other systems have introduced additional operators.

RegressionFormulae.jl extends the StatsModels.jl @formula to support two commonly used operators from R: ^ (incomplete crossing) and / (nesting). Specifically, it implements:
- (a + b + c + ...) ^ n to create all interactions up to n-way, corresponding to an incomplete cross of a, b, c, ....
- a / b to create a + a & b, which results in a "nested" model of b, with a separate coefficient for b within each level of a.
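To make the two expansions concrete, here is a minimal sketch in plain Julia that works on symbols rather than on real StatsModels.jl terms. The helpers `subsets_of_size`, `incomplete_cross`, `nest`, and `render` are hypothetical names introduced for illustration; they are not part of RegressionFormulae.jl, which operates on term objects instead.

```julia
# Hypothetical helpers: an illustration on Symbols, not the package's implementation.

# All subsets of `terms` with exactly `k` elements, in order of appearance.
function subsets_of_size(terms::Vector{Symbol}, k::Int)
    k == 0 && return [Symbol[]]
    length(terms) < k && return Vector{Symbol}[]
    first_term, rest = terms[1], terms[2:end]
    with_first = Vector{Symbol}[vcat(first_term, s) for s in subsets_of_size(rest, k - 1)]
    without_first = subsets_of_size(rest, k)
    return vcat(with_first, without_first)
end

# (a + b + c + ...)^n: main effects plus all interactions up to n-way.
function incomplete_cross(terms::Vector{Symbol}, n::Int)
    result = Vector{Symbol}[]
    for k in 1:min(n, length(terms))
        append!(result, subsets_of_size(terms, k))
    end
    return result
end

# a / b: the main effect of a plus the interaction a & b.
nest(a::Symbol, b::Symbol) = [[a], [a, b]]

# Render a list of terms in formula notation.
render(terms) = join([join(string.(t), " & ") for t in terms], " + ")

println(render(incomplete_cross([:a, :b, :c], 2)))
println(render(nest(:a, :b)))
```

Under these assumptions, the first call prints the three main effects followed by the three two-way interactions and omits a & b & c, while the second prints a + a & b.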

Both of these operators are particularly useful for creating interpretable models. Models with high-order interactions are extremely challenging to interpret and require considerable care, and are prone to over-fitting since the number of coefficients grows very quickly with additional terms participating in the interactions. The incomplete cross ^ syntax can ameliorate these difficulties, limiting the highest degree of the resulting interaction terms and reducing the overall number of predictors. Nesting (a / b) similarly provides an alternative to fully crossed models (a * b) that is more directly interpretable in situations where the analytic questions are focused on the effects of a predictor b within each individual level of some other variable a, without concern for direct comparison of these effects to each other.
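The growth in the number of terms can be checked with a little arithmetic. For k predictors, a full cross contains 2^k - 1 main-effect and interaction terms, while an incomplete cross up to n-way contains the sum of binomial(k, i) for i from 1 to n. The sketch below counts terms only (the number of coefficients is larger still when categorical predictors have more than two levels), and the function names are hypothetical.

```julia
# Hypothetical helpers for counting terms; assumes k >= 1 and n >= 1.

# Number of main-effect and interaction terms in a full cross a * b * c * ...
full_cross_terms(k::Int) = 2^k - 1

# Number of terms in (a + b + c + ...)^n, i.e. everything up to n-way.
incomplete_cross_terms(k::Int, n::Int) = sum(binomial(k, i) for i in 1:min(n, k))

for k in (3, 5, 8)
    println("k = ", k, ": full cross = ", full_cross_terms(k),
            ", up to 2-way = ", incomplete_cross_terms(k, 2))
end
```

With eight predictors, for example, the full cross has 255 terms, whereas stopping at two-way interactions leaves 36.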

Finally, this syntax is implemented in a way that does not require other modeling packages that use @formula to support it, nor does it prevent other packages from defining alternative meanings for the ^ or / operators. Within a @formula, the special syntax is implemented by methods like

function StatsModels.apply_schema(
    t::FunctionTerm{typeof(/)},
    ...

and

function Base.:(/)(outer::CategoricalTerm, inner::AbstractTerm)
    ...

The result is that if RegressionFormulae.jl is not loaded, then / and ^ inside a @formula behave exactly as they normally would (i.e., as calls to the normal Julia functions / and ^). Moreover, if a user loads RegressionFormulae.jl at the same time as some other package that defines special syntax for / or ^ (for RegressionModel), they will receive a warning about method redefinition or method ambiguity.
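The opt-in behavior follows from ordinary multiple dispatch, which can be illustrated with a toy example in plain Julia. All of the types and the `expand` function below are hypothetical stand-ins (loosely mirroring a function-call term and a model-type context) and are not the StatsModels.jl API.

```julia
# Toy stand-ins, not StatsModels.jl types.
abstract type ToyModel end
struct ToyRegression <: ToyModel end   # a context that opts in to special syntax
struct ToyOther <: ToyModel end        # a context that does not

# A captured function call inside a formula, e.g. a / b.
struct ToyCall{F}
    f::F
    args::Vector{Symbol}
end

# Fallback: any call in any context stays an ordinary function call.
expand(t::ToyCall, ::Type{<:ToyModel}) =
    "call " * string(t.f) * "(" * join(t.args, ", ") * ")"

# Opt-in method: only calls to `/` in the regression context are treated as nesting.
expand(t::ToyCall{typeof(/)}, ::Type{<:ToyRegression}) =
    string(t.args[1]) * " + " * string(t.args[1]) * " & " * string(t.args[2])

t = ToyCall(/, [:a, :b])
println(expand(t, ToyRegression))
println(expand(t, ToyOther))
```

Because the second method is more specific in both arguments, it applies only when the call is to / and the context is the regression type; every other combination falls through to the fallback. If the second method were never defined, which is the analogue of not loading the package, the fallback would handle every case.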

Phillip is a neuroscientist and contributor to the MixedModels.jl ecosystem. Additionally, he has contributed substantially to Effects.jl and StandardizedPredictors.jl.

Research scientist at Beacon Biosignals, recovering academic.