How it works
This section of the documentation is intended to provide a high-level overview of the way in which formulae are interpreted and materialized by Formulaic.
Recall that the goal of a formula is to act as a recipe for building a "model matrix" (also known as a "design matrix") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which typically being estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.
The ingredients of each formula are the columns of the original dataset, and
each operator acting on these columns in the formula should be thought of as
inclusion/exclusion of the column in the resulting model matrix, or as a
transformation on the column(s) prior to inclusion. Thus, a +
operator does
not act in its usual algebraic manner, but rather acts as set union, indicating
that both the left- and right-hand arguments should be included in the model
matrix; a -
operator acts like a set difference; and so on.
Anatomy of a Formula¶
Formulas in Formulaic are represented by (subclasses of) the Formula
class. Instances of Formula
subclasses are a ultimately containers for sets of Term
instances, which in turn are a container for a set of Factor
instances. Let's start our dissection at the bottom, and work our way up.
Factor¶
Factor
instances are the atomic unit of a formula, and represent the output
of a single expression evaluation. Typically this will be one vector of data,
but could also be more than one column (especially common with categorically
encoded data).
A Factor
instance's expression can be evaluated in one of three ways:
- As a literal; in which case the expression is taken as number or string, and repeated over all rows in the resulting model matrix.
- As a lookup: in which case the expression is treated as a name to be looked up in the data context provided during materialization.
- As a Python expression: in which case it is executed in the data context provided.
Note: Factor instances act as metadata only, and are not directly
responsible for doing the evaluation. This is handled in a backend specific
way by the appropriate Materializer
instance.
In code, instantiating a factor looks like:
from formulaic.parser.types import Factor
Factor("1", eval_method="literal") # a factor that represents the numerical constant of 1
Factor("a") # a factor that will be looked up from the data context
Factor("a + b", eval_method="python") # a factor that will return the sum of `a` and `b`
a + b
Term¶
Term
instances are a thin wrapper around a set of Factor
instances, and
represent the Cartesian (or Kronecker) product of the factors. If all of the
Factor
instances evaluate to single columns, then the Term
represents the
product of all of the factor columns.
Instantiating a Term
looks like:
from formulaic.parser.types import Term
Term(factors=[Factor("b"), Factor("a"), Factor("c")])
b:a:c
Note that to ensure uniqueness in the representation, the factor instances are sorted.
Formula¶
Formula
instances are (potentially nested) wrappers around collections of
Term
instances. During materialization into a model matrix, each Term
instance will have its columns independently inserted into the resulting matrix.
Formula
instances can consist of a single "list" of Term
instances, or may
be "structured"; for example, we may want a separate collection of terms for the
left- and right-hand side of a formula; or to simultaneously construct multiple
model matrices for different parts of our modeling process.
For example, an unstructured formula might look like:
from formulaic import Formula
# Unstructured formula (a simple list of terms)
Formula([
Term(factors=[Factor("c"), Factor("d"), Factor("e")]),
Term(factors=[Factor("a"), Factor("b")]),
])
a:b + c:d:e
Note that unstructured formulae are actually instances of SimpleFormula
(a Formula
subclass that acts like a mutable list of Term
instances):
type(_), list(_)
(formulaic.formula.SimpleFormula, [a:b, c:d:e])
Also note that in its standard representation, the terms are separated by "+"
which is interpreted as the set union in this context, and that (as we have seen
for Term
instances) Formula
instances are sorted (the default is to sort
terms only by interaction order, but this can be customized and/or disabled, as
described below).
Structured formula are constructed similary:
f = Formula(
[
Term(factors=[Factor("root_col")]),
],
my_substructure=[
Term(factors=[Factor("sub_col")]),
],
nested=Formula(
[
Term(factors=[Factor("nested_col")]),
Term(factors=[Factor("another_nested_col")]),
],
really_nested=[
Term(factors=[Factor("really_nested_col")]),
],
)
)
f
root: root_col .my_substructure: sub_col .nested: root: nested_col + another_nested_col .really_nested: really_nested_col
Structured formulae are instances of StructuredFormula
:
type(_)
formulaic.formula.StructuredFormula
And the sub-formula can be selected using:
f.root
root_col
f.nested
root: nested_col + another_nested_col .really_nested: really_nested_col
Formulae can also have different ordering conventions applied to them. By
default, Formulaic follows R conventions around ordering whereby terms are
sorted by their interaction degree (number of factors) and then by the order in
which they were present in the the term list. This behaviour can be modified to
perform no ordering or full lexical sorting of terms and factors by passing
_ordering="none"
or _ordering="sort"
respectively to the Formula
constructor. The default ordering is equivalent to passing _ordering="degree"
.
For example:
{
"degree": Formula("z + z:a + z:b:a + g"),
"none": Formula("z + z:a + z:b:a + g", _ordering="none"),
"sort": Formula("z + z:a + z:b:a + g", _ordering="sort"),
}
{'degree': 1 + z + g + z:a + z:b:a, 'none': 1 + z + z:a + z:b:a + g, 'sort': 1 + g + z + a:z + a:b:z}
Parsed Formulae¶
While it would be possible to always manually construct Formula
instances in
this way, it would quickly grow tedious. As you might have guessed from reading
the quickstart or via other implementations, this is where Wilkinson formulae
come in. Formulaic has a rich extensible formula parser that converts string
expressions into the formula structures you see above. Where functionality and
grammar overlap, it tries to conform to existing patterns found in R and patsy.
Formula parsing happens in three phases:
- tokenization of the formula string;
- conversion of the tokens into an abstract syntax tree (AST); and
- evaluation of the AST into (potentially structured) lists of
Term
instances.
In the sections below these phases are described in more detail.
Tokenization¶
Formulaic intentionally makes the tokenization phase as unopinionated and
unstructured as possible. This allows formula grammars to be extended via
plugins only high-level APIs (usually Operator
s).
The tokenizer's role is to take an arbitrary string representation of a formula
and convert it into a series of Token
instances. The tokenization phase knows
very little about formula grammar except that whitespace doesn't matter, that we
distinguish non-word characters as operators or context indicators.
Interpretation of these tokens is left to the AST generation phase. There are
five different kinds of tokens:
- Context: Symbol pairs representing he opening or closing of nested contexts.
This applies to parentheses
()
and square brackets[]
. - Operator: Symbol(s) representing a operation on other tokens in the string (interpreted during AST creation).
- Name: A character sequence representing variable/data-column name.
- Python: A character sequence representing executable Python code.
- Value: A character sequence representing a Python literal.
The tokenizer treats text quoted with `
characters as a name token, and
{}
are used to quote Python operations.
An example of the tokens generated can be seen below:
from formulaic.parser import DefaultFormulaParser
[
f"{token.token} : {token.kind.value}"
for token in (
DefaultFormulaParser(include_intercept=False)
.get_tokens("y ~ 1 + b:log(c) | `d$in^df` + {e + f}")
)
]
['y : name', '~ : operator', '1 : value', '+ : operator', 'b : name', ': : operator', 'log(c) : python', '| : operator', 'd$in^df : name', '+ : operator', 'e + f : python']
Abstract Syntax Tree (AST)¶
The next phase is to assemble an abstract syntax tree (AST) from the tokens
output from the above that when evaluated will generate the Term
instances we
need to build a formula. This is done by using an enriched shunting yard
algorithm which
determines how to interpret each operator token based on the symbol used, the
number and position of the non-operator arguments, and the current context (i.e.
how many parentheses deep we are). This allows us to disambiguate between, for
example, unary and binary addition operators. The available operators and their
implementation are described in more detail in the Formula
Grammar section of this documentation. It is worth noting that the
available operators can be easily modified at runtime, and is typically all that
needs to be modified in order to add new formula grammars.
The result is an AST that look something like:
DefaultFormulaParser().get_ast("y ~ a + b:c")
<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
Evaluation¶
Now that we have the AST, we can readily evaluate it to generate the Term
instances we need to pass to our Formula
constructor. For example:
terms = DefaultFormulaParser(include_intercept=False).get_terms("y ~ a + b:c")
terms
.lhs: {y} .rhs: {a, b:c}
Formula(terms)
.lhs: y .rhs: a + b:c
Of course, manually building the terms and passing them to the formula
constructor is a bit annoying, and so instead we allow passing the string
directly to the Formula
constructor; and allow you to override the default
parser if you so desire (though 99.9% of the time this wouldn't be necessary).
Thus, we can generate the same formula from above using:
Formula("y ~ a + b:c", _parser=DefaultFormulaParser(include_intercept=False))
.lhs: y .rhs: a + b:c
Materialization¶
Once you have Formula
instance, the next logical step is to use it to
materialize a model matrix. This is usually as simple passing the raw data as
an argument to .get_model_matrix()
:
import pandas
data = pandas.DataFrame({"a": [1,2,3], "b": [4,5,6], "c": [7, 8, 9], "A": ["a", "b", "c"]})
Formula("a + b:c").get_model_matrix(data)
Intercept | a | b:c | |
---|---|---|---|
0 | 1.0 | 1 | 28 |
1 | 1.0 | 2 | 40 |
2 | 1.0 | 3 | 54 |
Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:
Formula("a", group="b+c").get_model_matrix(data)
root: Intercept a 0 1.0 1 1 1.0 2 2 1.0 3 .group: b c 0 4 7 1 5 8 2 6 9
Under the hood, both of these calls have looked at the type of the data
(pandas.DataFrame
here) and then looked up the FormulaMaterializer
associated with that type (PandasMaterializer
here), and then passed the
formula and data along to the materializer for materialization. It is also
possible to request a specific output type that varies by materializer
(PandasMaterializer
supports "pandas", "numpy", and "sparse"). If one is not
selected, the first available output type is selected for you. Thus, the above
code is equivalent to:
from formulaic.materializers import PandasMaterializer
PandasMaterializer(data).get_model_matrix(Formula("a + b:c"), output="pandas")
Intercept | a | b:c | |
---|---|---|---|
0 | 1.0 | 1 | 28 |
1 | 1.0 | 2 | 40 |
2 | 1.0 | 3 | 54 |
The return type of .get_model_matrix()
is either a ModelMatrix
instance if
the original formula was unstructured, or a ModelMatrices
instance that is
just a structured container for ModelMatrix
instances. However, ModelMatrix
is an ObjectProxy
subclass, and so it also acts like the type of object requested. For example:
import numpy
from formulaic import ModelMatrix
mm = Formula("a + b:c").get_model_matrix(data, output="numpy")
isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)
(True, True)
The main purpose of this additional proxy layer is to expose the ModelSpec
instance associated with the materialization, which retains all of the encoding
choices made during materialization (for reuse in subsequent materializations),
as well as metadata about the feature names of the current model matrix (which
is very useful when your model matrix output type doesn't have column names,
like numpy or sparse arrays). This ModelSpec
instance is always available via
.model_spec
, and is introduced in more detail in the Model
Specs section of this documentation.
mm.model_spec
ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
It is sometimes convenient to have the columns in the final model matrix
be clustered by numerical factors included in the terms. This means that in
regression reports, for example, all of the columns related to a particular
feature of interest (including its interactions with various categorical
features) are contiguously clustered. This is the default behaviour in patsy.
You can perform this clustering in Formulaic by passing the
cluster_by="numerical_factors"
argument to model_matrix
or any of the
.get_model_matrix(...)
methods. For example:
Formula("a + b + a:A + A:b").get_model_matrix(data, cluster_by="numerical_factors")
Intercept | a | a:A[T.b] | a:A[T.c] | b | A[T.b]:b | A[T.c]:b | |
---|---|---|---|---|---|---|---|
0 | 1.0 | 1 | 0 | 0 | 4 | 0 | 0 |
1 | 1.0 | 2 | 2 | 0 | 5 | 5 | 0 |
2 | 1.0 | 3 | 0 | 3 | 6 | 0 | 6 |