How it works

This section provides a high-level overview of how Formulaic interprets and materializes formulae.

Recall that the goal of a formula is to act as a recipe for building a "model matrix" (also known as a "design matrix") from an existing dataset. Following the recipe should result in a dataset that consists only of numerical columns that can be linearly combined to model an outcome/response of interest (the coefficients of which are typically estimated using linear regression). As such, this process will bake in any desired non-linearity via interactions or transforms, and will encode nominal/categorical/factor data as a collection of numerical contrasts.

The ingredients of each formula are the columns of the original dataset, and each operator acting on these columns in the formula should be thought of as inclusion/exclusion of the column in the resulting model matrix, or as a transformation on the column(s) prior to inclusion. Thus, a + operator does not act in its usual algebraic manner, but rather acts as set union, indicating that both the left- and right-hand arguments should be included in the model matrix; a - operator acts like a set difference; and so on.
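
To make the set semantics concrete, here is a minimal sketch that models a formula as a plain Python set of term names. This is only an illustration of the idea and is not how Formulaic stores terms internally.

```python
# Toy model: a formula is a set of terms, and each term is named by a string.
left = {"a", "b"}
right = {"b", "c"}

# `a + b + b + c`: `+` acts as set union, so the repeated `b` collapses.
plus = left | right

# `a + b + c - b`: `-` acts as set difference, so `b` is removed again.
minus = plus - {"b"}

print(sorted(plus))   # ['a', 'b', 'c']
print(sorted(minus))  # ['a', 'c']
```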

Anatomy of a Formula

Formulas in Formulaic are represented by (subclasses of) the Formula class. Instances of Formula subclasses are ultimately containers for sets of Term instances, which in turn are containers for sets of Factor instances. Let's start our dissection at the bottom, and work our way up.

Factor

Factor instances are the atomic unit of a formula, and represent the output of a single expression evaluation. Typically this will be one vector of data, but could also be more than one column (especially common with categorically encoded data).

A Factor instance's expression can be evaluated in one of three ways:

As a literal: in which case the expression is taken as a number or string, and repeated over all rows in the resulting model matrix.
As a lookup: in which case the expression is treated as a name to be looked up in the data context provided during materialization.
As a Python expression: in which case it is executed in the data context provided.

Note: Factor instances act as metadata only, and are not directly responsible for doing the evaluation. This is handled in a backend-specific way by the appropriate Materializer instance.
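
The three evaluation methods can be pictured with a small standalone sketch. The function `evaluate_factor` and the dict-of-lists data context are hypothetical stand-ins chosen for illustration; a real materializer works on whole columns of its backend's data type rather than row by row.

```python
import ast


def evaluate_factor(expr, eval_method, data):
    """Toy evaluator; `data` maps column names to equal-length lists."""
    n_rows = len(next(iter(data.values())))
    if eval_method == "literal":
        # The expression is a number or string, repeated over all rows.
        return [ast.literal_eval(expr)] * n_rows
    if eval_method == "lookup":
        # The expression is a column name in the data context.
        return list(data[expr])
    if eval_method == "python":
        # The expression is Python code, run here once per row for simplicity.
        return [
            eval(expr, {"__builtins__": {}}, {k: v[i] for k, v in data.items()})
            for i in range(n_rows)
        ]
    raise ValueError(f"Unknown eval_method: {eval_method!r}")


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
literal = evaluate_factor("1", "literal", data)     # [1, 1, 1]
lookup = evaluate_factor("a", "lookup", data)       # [1, 2, 3]
python = evaluate_factor("a + b", "python", data)   # [5, 7, 9]
```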

In code, instantiating a factor looks like:

In [1]:

from formulaic.parser.types import Factor

Factor("1", eval_method="literal")  # a factor that represents the numerical constant of 1
Factor("a")  # a factor that will be looked up from the data context
Factor("a + b", eval_method="python")  # a factor that will return the sum of `a` and `b`

Out[1]:

a + b

Term

Term instances are a thin wrapper around a set of Factor instances, and represent the Cartesian (or Kronecker) product of the factors. If all of the Factor instances evaluate to single columns, then the Term represents the product of all of the factor columns.
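
As a rough sketch of what this product means for the materialized columns, the helper below (a hypothetical name, not part of Formulaic) computes a row-wise Kronecker product of two factors, each given as a list of rows. With single-column factors it reduces to an ordinary elementwise product.

```python
def row_kronecker(left_rows, right_rows):
    """For each row, multiply every left column by every right column."""
    return [
        [l * r for l in left_row for r in right_row]
        for left_row, right_row in zip(left_rows, right_rows)
    ]


# Two single-column factors: the term is the elementwise product.
b = [[4], [5], [6]]
c = [[7], [8], [9]]
b_c = row_kronecker(b, c)  # [[28], [40], [54]]

# A numerical factor times a two-column dummy encoding of a categorical factor.
a = [[1], [2], [3]]
dummies = [[0, 0], [1, 0], [0, 1]]
a_dummies = row_kronecker(a, dummies)  # [[0, 0], [2, 0], [0, 3]]
```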

Instantiating a Term looks like:

In [2]:

from formulaic.parser.types import Term

Term(factors=[Factor("b"), Factor("a"), Factor("c")])

Out[2]:

b:a:c

Note that to ensure uniqueness in the representation, the factor instances are sorted.

Formula

Formula instances are (potentially nested) wrappers around collections of Term instances. During materialization into a model matrix, each Term instance will have its columns independently inserted into the resulting matrix.

Formula instances can consist of a single "list" of Term instances, or may be "structured"; for example, we may want a separate collection of terms for the left- and right-hand side of a formula; or to simultaneously construct multiple model matrices for different parts of our modeling process.

For example, an unstructured formula might look like:

In [3]:

from formulaic import Formula

# Unstructured formula (a simple list of terms)
Formula([
    Term(factors=[Factor("c"), Factor("d"), Factor("e")]),
    Term(factors=[Factor("a"), Factor("b")]),
])

Out[3]:

a:b + c:d:e

Note that unstructured formulae are actually instances of SimpleFormula (a Formula subclass that acts like a mutable list of Term instances):

In [4]:

type(_), list(_)

Out[4]:

(formulaic.formula.SimpleFormula, [a:b, c:d:e])

Also note that in its standard representation, the terms are separated by "+" which is interpreted as the set union in this context, and that (as we have seen for Term instances) Formula instances are sorted (the default is to sort terms only by interaction order, but this can be customized and/or disabled, as described below).

Structured formulae are constructed similarly:

In [5]:

f = Formula(
    [
        Term(factors=[Factor("root_col")]),
    ],
    my_substructure=[
        Term(factors=[Factor("sub_col")]),
    ],
    nested=Formula(
        [
            Term(factors=[Factor("nested_col")]),
            Term(factors=[Factor("another_nested_col")]),
        ],
        really_nested=[
            Term(factors=[Factor("really_nested_col")]),
        ],
    )
)
f

Out[5]:

root:
    root_col
.my_substructure:
    sub_col
.nested:
    root:
        nested_col + another_nested_col
    .really_nested:
        really_nested_col

Structured formulae are instances of StructuredFormula:

In [6]:

type(_)

Out[6]:

formulaic.formula.StructuredFormula

Sub-formulae can be selected by attribute access:

In [7]:

f.root

Out[7]:

root_col

In [8]:

f.nested

Out[8]:

root:
    nested_col + another_nested_col
.really_nested:
    really_nested_col

Formulae can also have different ordering conventions applied to them. By default, Formulaic follows R conventions around ordering, whereby terms are sorted by their interaction degree (number of factors) and then by the order in which they were present in the term list. This behaviour can be modified to perform no ordering or full lexical sorting of terms and factors by passing _ordering="none" or _ordering="sort" respectively to the Formula constructor. The default ordering is equivalent to passing _ordering="degree". For example:

In [9]:

{
    "degree": Formula("z + z:a + z:b:a + g"),
    "none": Formula("z + z:a + z:b:a + g", _ordering="none"),
    "sort": Formula("z + z:a + z:b:a + g", _ordering="sort"),
}

Out[9]:

{'degree': 1 + z + g + z:a + z:b:a,
 'none': 1 + z + z:a + z:b:a + g,
 'sort': 1 + g + z + a:z + a:b:z}
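
The three conventions can be mimicked in a few lines of plain Python. This is a sketch that reproduces the output shown above under the assumption that "sort" orders terms by degree and then lexically; terms are modelled as tuples of factor names, with the empty tuple standing in for the intercept. It is not Formulaic's implementation.

```python
def order_terms(terms, ordering="degree"):
    """Toy re-implementation of the three ordering conventions."""
    if ordering == "none":
        return list(terms)
    if ordering == "degree":
        # Python's sort is stable, so ties keep their original order.
        return sorted(terms, key=len)
    if ordering == "sort":
        # Sort factors within each term, then terms by degree and name.
        return sorted((tuple(sorted(t)) for t in terms), key=lambda t: (len(t), t))
    raise ValueError(f"Unknown ordering: {ordering!r}")


def render(terms):
    return " + ".join(":".join(t) if t else "1" for t in terms)


# 1 + z + z:a + z:b:a + g, in the order written.
terms = [(), ("z",), ("z", "a"), ("z", "b", "a"), ("g",)]

by_degree = render(order_terms(terms, "degree"))  # '1 + z + g + z:a + z:b:a'
unordered = render(order_terms(terms, "none"))    # '1 + z + z:a + z:b:a + g'
fully_sorted = render(order_terms(terms, "sort"))  # '1 + g + z + a:z + a:b:z'
```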

Parsed Formulae

While it would be possible to always manually construct Formula instances in this way, it would quickly grow tedious. As you might have guessed from reading the quickstart or via other implementations, this is where Wilkinson formulae come in. Formulaic has a rich extensible formula parser that converts string expressions into the formula structures you see above. Where functionality and grammar overlap, it tries to conform to existing patterns found in R and patsy.

Formula parsing happens in three phases:

tokenization of the formula string;
conversion of the tokens into an abstract syntax tree (AST); and
evaluation of the AST into (potentially structured) lists of Term instances.

In the sections below these phases are described in more detail.

Tokenization

Formulaic intentionally makes the tokenization phase as unopinionated and unstructured as possible. This allows formula grammars to be extended via plugins using only high-level APIs (usually Operators).

The tokenizer's role is to take an arbitrary string representation of a formula and convert it into a series of Token instances. The tokenization phase knows very little about formula grammar, except that whitespace doesn't matter and that non-word characters are treated as operators or context indicators. Interpretation of these tokens is left to the AST generation phase. There are five different kinds of tokens:

Context: Symbol pairs representing the opening or closing of nested contexts. This applies to parentheses () and square brackets [].
Operator: Symbol(s) representing an operation on other tokens in the string (interpreted during AST creation).
Name: A character sequence representing a variable/data-column name.
Python: A character sequence representing executable Python code.
Value: A character sequence representing a Python literal.

The tokenizer treats text quoted with ` characters as a name token, and {} are used to quote Python operations.

An example of the tokens generated can be seen below:

In [10]:

from formulaic.parser import DefaultFormulaParser

[
    f"{token.token} : {token.kind.value}"
    for token in (
        DefaultFormulaParser(include_intercept=False)
        .get_tokens("y ~ 1 + b:log(c) | `d$in^df` + {e + f}")
    )
]

Out[10]:

['y : name',
 '~ : operator',
 '1 : value',
 '+ : operator',
 'b : name',
 ': : operator',
 'log(c) : python',
 '| : operator',
 'd$in^df : name',
 '+ : operator',
 'e + f : python']
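
To give a feel for how little grammar the tokenizer needs, the following standalone sketch reproduces the token kinds above for this particular string. The function `toy_tokenize` is a hypothetical, heavily simplified stand-in: it emits every operator as a single character and does not handle quoted strings or other cases the real tokenizer would need to cover.

```python
def toy_tokenize(formula):
    """Split a formula string into (token, kind) pairs."""
    tokens, i, n = [], 0, len(formula)
    while i < n:
        ch = formula[i]
        if ch.isspace():
            i += 1  # whitespace doesn't matter
        elif ch == "`":
            # Backtick-quoted text is a name.
            j = formula.index("`", i + 1)
            tokens.append((formula[i + 1 : j], "name"))
            i = j + 1
        elif ch == "{":
            # Brace-quoted text is Python code.
            j = formula.index("}", i + 1)
            tokens.append((formula[i + 1 : j], "python"))
            i = j + 1
        elif ch in "()[]":
            tokens.append((ch, "context"))
            i += 1
        elif ch.isalnum() or ch == "_":
            j = i
            while j < n and (formula[j].isalnum() or formula[j] == "_"):
                j += 1
            word = formula[i:j]
            if j < n and formula[j] == "(" and not word[0].isdigit():
                # A word directly followed by "(" is a function call: consume
                # up to the matching ")" and treat the whole call as Python.
                depth, k = 0, j
                while k < n:
                    if formula[k] == "(":
                        depth += 1
                    elif formula[k] == ")":
                        depth -= 1
                        if depth == 0:
                            break
                    k += 1
                tokens.append((formula[i : k + 1], "python"))
                i = k + 1
            else:
                tokens.append((word, "value" if word[0].isdigit() else "name"))
                i = j
        else:
            # Any other non-word character is treated as an operator.
            tokens.append((ch, "operator"))
            i += 1
    return tokens


toy_tokens = toy_tokenize("y ~ 1 + b:log(c) | `d$in^df` + {e + f}")
```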

Abstract Syntax Tree (AST)

The next phase is to assemble an abstract syntax tree (AST) from the tokens produced above; when evaluated, this tree generates the Term instances we need to build a formula. This is done using an enriched shunting yard algorithm, which determines how to interpret each operator token based on the symbol used, the number and position of the non-operator arguments, and the current context (i.e. how many parentheses deep we are). This allows us to disambiguate between, for example, unary and binary addition operators. The available operators and their implementation are described in more detail in the Formula Grammar section of this documentation. It is worth noting that the available operators can be easily modified at runtime, and that this is typically all that needs to be modified in order to add new formula grammars.

The result is an AST that looks something like:

In [11]:

DefaultFormulaParser().get_ast("y ~ a + b:c")

Out[11]:

<ASTNode ~: [y, <ASTNode +: [<ASTNode +: [1, a]>, <ASTNode :: [b, c]>]>]>
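
For readers unfamiliar with the shunting yard algorithm, here is a minimal sketch of its plain (non-enriched) form for binary operators only. The precedence table is an assumption made for illustration, the AST is represented as nested tuples rather than ASTNode instances, and no intercept is added.

```python
# Assumed precedences for illustration: higher binds more tightly.
PRECEDENCE = {"~": 1, "+": 2, ":": 3}


def to_ast(tokens):
    """Build a nested-tuple AST from a list of token strings."""
    output, operators = [], []

    def reduce_top():
        # Pop one operator and its two arguments, and push the combined node.
        op = operators.pop()
        right = output.pop()
        left = output.pop()
        output.append((op, left, right))

    for token in tokens:
        if token == "(":
            operators.append(token)
        elif token == ")":
            while operators[-1] != "(":
                reduce_top()
            operators.pop()  # discard the "("
        elif token in PRECEDENCE:
            # Left-associative: first reduce operators that bind at least as tightly.
            while (
                operators
                and operators[-1] != "("
                and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]
            ):
                reduce_top()
            operators.append(token)
        else:
            output.append(token)

    while operators:
        reduce_top()
    return output[0]


tree = to_ast(["y", "~", "a", "+", "b", ":", "c"])
# ('~', 'y', ('+', 'a', (':', 'b', 'c')))
grouped = to_ast(["(", "a", "+", "b", ")", ":", "c"])
# (':', ('+', 'a', 'b'), 'c')
```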

Evaluation

Now that we have the AST, we can readily evaluate it to generate the Term instances we need to pass to our Formula constructor. For example:

In [12]:

terms = DefaultFormulaParser(include_intercept=False).get_terms("y ~ a + b:c")
terms

Out[12]:

.lhs:
    {y}
.rhs:
    {a, b:c}

In [13]:

Formula(terms)

Out[13]:

.lhs:
    y
.rhs:
    a + b:c
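
Conceptually, evaluating the AST is a recursive walk in which each operator combines the term sets produced by its arguments. The sketch below assumes a nested-tuple AST and models terms as frozensets of factor names; both representations are illustrative choices, not Formulaic's own types.

```python
def evaluate(node):
    """Recursively evaluate a nested-tuple AST into sets of terms."""
    if isinstance(node, str):
        # A leaf is a single term made of a single factor.
        return {frozenset([node])}
    op, left, right = node
    if op == "~":
        # Keep the two sides apart, giving a structured result.
        return {"lhs": evaluate(left), "rhs": evaluate(right)}
    left_terms, right_terms = evaluate(left), evaluate(right)
    if op == "+":
        # Set union of the terms on both sides.
        return left_terms | right_terms
    if op == ":":
        # Interaction: combine every left term with every right term.
        return {l | r for l in left_terms for r in right_terms}
    raise ValueError(f"Unknown operator: {op!r}")


tree = ("~", "y", ("+", "a", (":", "b", "c")))
result = evaluate(tree)
# lhs holds the term y; rhs holds the terms a and b:c
```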

Of course, manually building the terms and passing them to the formula constructor is a bit annoying, and so instead we allow passing the string directly to the Formula constructor; and allow you to override the default parser if you so desire (though 99.9% of the time this wouldn't be necessary).

Thus, we can generate the same formula from above using:

In [14]:

Formula("y ~ a + b:c", _parser=DefaultFormulaParser(include_intercept=False))

Out[14]:

.lhs:
    y
.rhs:
    a + b:c

Materialization

Once you have a Formula instance, the next logical step is to use it to materialize a model matrix. This is usually as simple as passing the raw data as an argument to .get_model_matrix():

In [15]:

import pandas

data = pandas.DataFrame({"a": [1,2,3], "b": [4,5,6], "c": [7, 8, 9], "A": ["a", "b", "c"]})
Formula("a + b:c").get_model_matrix(data)

Out[15]:

   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54

Just as for formulae, the model matrices can be structured, and will be structured in the same way as the original formula. For example:

In [16]:

Formula("a", group="b+c").get_model_matrix(data)

Out[16]:

root:
       Intercept  a
    0        1.0  1
    1        1.0  2
    2        1.0  3
.group:
       b  c
    0  4  7
    1  5  8
    2  6  9

Under the hood, both of these calls have looked at the type of the data (pandas.DataFrame here) and then looked up the FormulaMaterializer associated with that type (PandasMaterializer here), and then passed the formula and data along to the materializer for materialization. It is also possible to request a specific output type that varies by materializer (PandasMaterializer supports "pandas", "numpy", and "sparse"). If one is not selected, the first available output type is selected for you. Thus, the above code is equivalent to:

In [17]:

from formulaic.materializers import PandasMaterializer
PandasMaterializer(data).get_model_matrix(Formula("a + b:c"), output="pandas")

Out[17]:

   Intercept  a  b:c
0        1.0  1   28
1        1.0  2   40
2        1.0  3   54
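
The type-based lookup described above can be pictured as a small registry keyed by data type. Everything in this sketch (the registry, the decorator and the class names) is hypothetical and only illustrates the dispatch pattern, not Formulaic's actual mechanism.

```python
# Hypothetical registry mapping input data types to materializer classes.
MATERIALIZERS = {}


def register_materializer(data_type):
    """Class decorator that associates a materializer with a data type."""
    def decorator(cls):
        MATERIALIZERS[data_type] = cls
        return cls
    return decorator


@register_materializer(dict)
class ToyDictMaterializer:
    def __init__(self, data):
        self.data = data


@register_materializer(list)
class ToyListMaterializer:
    def __init__(self, data):
        self.data = data


def materializer_for(data):
    """Return a materializer instance suited to the type of `data`."""
    for data_type, cls in MATERIALIZERS.items():
        if isinstance(data, data_type):
            return cls(data)
    raise TypeError(f"No materializer registered for {type(data).__name__}")


chosen = materializer_for({"a": [1, 2, 3]})
```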

The return type of .get_model_matrix() is either a ModelMatrix instance if the original formula was unstructured, or a ModelMatrices instance that is just a structured container for ModelMatrix instances. However, ModelMatrix is an ObjectProxy subclass, and so it also acts like the type of object requested. For example:

In [18]:

import numpy
from formulaic import ModelMatrix

mm = Formula("a + b:c").get_model_matrix(data, output="numpy")
isinstance(mm, ModelMatrix), isinstance(mm, numpy.ndarray)

Out[18]:

(True, True)
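
The way a proxy can pass both isinstance checks may be easier to see in a stripped-down form. The class below is a toy written for this explanation, not the ObjectProxy implementation: it reports the wrapped object's class via `__class__` and forwards unknown attributes to the wrapped object, while carrying an extra `model_spec` attribute.

```python
class ToyModelMatrix:
    """Minimal proxy: wraps an object and attaches extra metadata."""

    def __init__(self, wrapped, model_spec):
        self._wrapped = wrapped
        self.model_spec = model_spec

    @property
    def __class__(self):
        # isinstance() falls back to __class__, so the proxy also
        # passes as an instance of the wrapped object's type.
        return type(self._wrapped)

    def __getattr__(self, name):
        # Only called when normal lookup fails: defer to the wrapped object.
        return getattr(self._wrapped, name)


mm = ToyModelMatrix(
    [[1.0, 1, 28], [1.0, 2, 40], [1.0, 3, 54]],
    model_spec={"column_names": ["Intercept", "a", "b:c"]},
)

checks = (isinstance(mm, ToyModelMatrix), isinstance(mm, list))  # (True, True)
```

This toy does not forward special methods such as `len()` or indexing, which a full proxy also has to handle.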

The main purpose of this additional proxy layer is to expose the ModelSpec instance associated with the materialization, which retains all of the encoding choices made during materialization (for reuse in subsequent materializations), as well as metadata about the feature names of the current model matrix (which is very useful when your model matrix output type doesn't have column names, like numpy or sparse arrays). This ModelSpec instance is always available via .model_spec, and is introduced in more detail in the Model Specs section of this documentation.

In [19]:

mm.model_spec

Out[19]:

ModelSpec(formula=1 + a + b:c, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b:c, scoped_terms=[b:c], columns=['b:c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})

It is sometimes convenient to have the columns in the final model matrix be clustered by numerical factors included in the terms. This means that in regression reports, for example, all of the columns related to a particular feature of interest (including its interactions with various categorical features) are contiguously clustered. This is the default behaviour in patsy. You can perform this clustering in Formulaic by passing the cluster_by="numerical_factors" argument to model_matrix or any of the .get_model_matrix(...) methods. For example:

In [20]:

Formula("a + b + a:A + A:b").get_model_matrix(data, cluster_by="numerical_factors")

Out[20]:

   Intercept  a  a:A[T.b]  a:A[T.c]  b  A[T.b]:b  A[T.c]:b
0        1.0  1         0         0  4         0         0
1        1.0  2         2         0  5         5         0
2        1.0  3         0         3  6         0         6
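
The effect of this clustering on term order can be sketched as a grouping step: terms that share the same set of numerical factors are kept together, with groups appearing in order of first occurrence. The function below is an illustrative guess at the idea that reproduces the column order above; it is not taken from Formulaic's source.

```python
def cluster_by_numerical_factors(terms, numerical):
    """Group terms by the numerical factors they contain, keeping first-seen order."""
    clusters = {}  # dicts preserve insertion order
    for term in terms:
        key = frozenset(factor for factor in term if factor in numerical)
        clusters.setdefault(key, []).append(term)
    return [term for group in clusters.values() for term in group]


# 1 + a + b + a:A + A:b in the default (degree) order; () is the intercept.
terms = [(), ("a",), ("b",), ("a", "A"), ("A", "b")]
clustered = cluster_by_numerical_factors(terms, numerical={"a", "b"})
# [(), ('a',), ('a', 'A'), ('b',), ('A', 'b')]
```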