What are formulas?
This section introduces the basic notions and origins of formulas. If you are already familiar with formulas from another context, you might want to skip forward to the Formula Grammar or other User Guides.
Origins
Formulas were originally proposed by Wilkinson and Rogers1 to aid in the description of ANOVA problems, but were popularised by the S language (and then R, as an implementation of S) in the context of linear regression. Since then they have been extended in R, and implemented in Python (by patsy), in MATLAB, in Julia, and quite conceivably elsewhere. Each implementation has its own nuances and grammatical extensions, including Formulaic's, which are described more completely in the Formula Grammar section of this manual.
Why are they useful?
Formulas are useful because they provide a concise and explicit specification for how data should be prepared for a model. Typically, the raw input data for a model is stored in a dataframe, but the actual implementations of various statistical methodologies (e.g. linear regression solvers) act on two-dimensional numerical matrices that go by several names depending on the prevailing nomenclature of your field, including "model matrices", "design matrices" and "regressor matrices" (within Formulaic, we refer to them as "model matrices"). A formula provides the necessary information required to automate much of the translation of a dataframe into a model matrix suitable for ingestion into a statistical model.
Suppose, for example, that you have a dataframe with \(N\) rows and three numerical columns labelled: `y`, `a` and `b`. You would like to construct a linear regression model for `y` based on `a`, `b` and their interaction: \[ y = \alpha + \beta_a a + \beta_b b + \beta_{ab} ab + \varepsilon \] with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\). Rather than manually constructing the required matrices to pass to the regression solver, you could specify a formula of form:
y ~ a + b + a:b
and have the formula materializer generate for you an \( N \times 1 \) matrix \(Y\) for the output column `y`, and an \( N \times 4 \) matrix \(X\) for the input columns intercept, `a`, `b`, and `a * b`. You can then directly pass these matrices to your regression solver, which internally will solve for \(\beta\) in: \[ Y = X\beta + \varepsilon. \]
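As a minimal sketch of how this looks in Formulaic, using the `model_matrix` helper and a small toy dataframe (the values below are invented purely for illustration):

```python
import pandas as pd
from formulaic import model_matrix

# A toy dataframe with the three numerical columns described above.
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "a": [0.5, 1.5, 2.5, 3.5],
    "b": [2.0, 1.0, 4.0, 3.0],
})

# The left-hand side of the formula becomes Y (N x 1); the right-hand side
# becomes X (N x 4): an intercept column, a, b, and the interaction a:b.
y, X = model_matrix("y ~ a + b + a:b", df)

print(X.columns)  # something like: Intercept, a, b, a:b
```

By default the resulting matrices are pandas objects, so they can be handed straight to your regression solver of choice.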
The true value of formulas becomes more apparent as model complexity increases, where they can be a huge time-saver. For example:
~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))
This formula would generate a model matrix with a column for each of the `f*` fields (3), each of the `x*` fields (3), and the combination of each `f` with each `x` (9), along with the implicit intercept. It also instructs the materializer to ensure that the `x3` column is rescaled during the model matrix materialization phase such that it has a mean of zero and a standard deviation of 1. If any of these columns are categorical in nature, they would by default also be one-hot/dummy encoded. Depending on the formula interpreter (including Formulaic), extra steps would also be taken to ensure that the resulting model matrix is structurally full-rank.
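A rough sketch of this in Formulaic follows; the dataframe is invented for illustration, with `f1` deliberately made categorical so that the dummy encoding is visible:

```python
import pandas as pd
from formulaic import model_matrix

# Invented data: f1 is categorical (strings); the remaining columns are numerical.
df = pd.DataFrame({
    "f1": ["a", "b", "a", "b"],
    "f2": [0.0, 1.0, 0.0, 1.0],
    "f3": [1.0, 1.0, 0.0, 0.0],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.1, 0.4, 0.2, 0.3],
    "x3": [10.0, 20.0, 30.0, 40.0],
})

X = model_matrix("~ (f1 + f2 + f3) * (x1 + x2 + scale(x3))", df)

# f1 is dummy-encoded, scale(x3) is standardised, and every f*/x* combination
# appears as its own column (alongside the intercept).
print(X.columns)
```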
As an added bonus, some formula implementations (including Formulaic) can remember any choices made during the materialization process and apply them consistently to new data, making it easy to generate model matrices from new data that conform to the same structure as those built from the training data. For example, the `scale(...)` transform in the example above makes use of the mean and variance of the column being scaled. Any future data should not undergo scaling based on its own mean and variance, but rather on the mean and variance that were measured for the training data set (otherwise the new dataset will not be consistent with the expectations of the trained model that will be interpreting it).
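A minimal sketch of this behaviour in Formulaic, reusing the model spec attached to the training-time model matrix (the data is invented for illustration):

```python
import pandas as pd
from formulaic import model_matrix

train = pd.DataFrame({"x3": [10.0, 20.0, 30.0, 40.0]})
X_train = model_matrix("~ scale(x3)", train)

# The state learned during materialization (the mean and variance of x3) is
# stored on the model spec, so new data is scaled using the *training*
# statistics rather than its own.
new = pd.DataFrame({"x3": [50.0, 60.0]})
X_new = X_train.model_spec.get_model_matrix(new)
```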
Limitations
Formulas are a very flexible tool, and can be augmented with arbitrary user-defined transforms. However, some transformations required by certain models may be more elegantly defined via a pre-formula dataframe operation or post-formula model matrix operation. Another consideration is that the default encoding and materialization choices for data are aligned with linear regression. If you are using a tree model, for example, you may not be interested in dummy encoding of "categorical" features, and this type of transform would have to be explicitly noted in the formula. Nevertheless, even in these cases, formulas are an excellent tool, and can often be used to greatly simplify data preparation workflows.
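For instance, one way to sidestep dummy encoding for a tree model is a pre-formula dataframe operation: ordinal-encode the column yourself with plain pandas so that the formula sees it as an ordinary numerical column. The column names below are invented for illustration:

```python
import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({
    "colour": ["red", "green", "red", "blue"],
    "x1": [1.0, 2.0, 3.0, 4.0],
})

# Pre-formula dataframe operation: ordinal-encode `colour` so that the formula
# treats it as a numerical column rather than dummy-encoding it.
df["colour"] = df["colour"].astype("category").cat.codes

X = model_matrix("~ colour + x1", df)
```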
Where to from here?
To learn about the full set of features supported by the formula language as implemented by Formulaic, please review the Formula Grammar. To get a feel for how you can use `formulaic` to transform your dataframes into model matrices, please review the Quickstart.
1. Wilkinson, G. N., and C. E. Rogers. "Symbolic Description of Factorial Models for Analysis of Variance." Journal of the Royal Statistical Society, Series C (Applied Statistics) 22, pp. 392–399, 1973.