Skip to content

Quickstart

This document provides high-level documentation on how to get started using Formulaic. For deeper documentation about the internals, please refer to the Advanced Usage documentation.

Building Model Matrices

In formulaic, the simplest way to build your model matrices is to use the high-level model_matrix function:

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'a': ['A', 'B', 'C'],
    'b': [0.3, 0.1, 0.2],
})

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)
# This lower-level API discussed in the Advanced Usage documentation.

y =

y
0 0
1 1
2 2

X =

Intercept a[T.B] a[T.C] b a[T.B]:b a[T.C]:b
0 1.0 0 0 0.3 0.0 0.0
1 1.0 1 0 0.1 0.1 0.0
2 1.0 0 1 0.2 0.0 0.2

You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X1, one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:

X = model_matrix("a + b + a:b", df, ensure_full_rank=False)

X =

Intercept a[T.A] a[T.B] a[T.C] b a[T.A]:b a[T.B]:b a[T.C]:b
0 1.0 1 0 0 0.3 0.3 0.0 0.0
1 1.0 0 1 0 0.1 0.0 0.1 0.0
2 1.0 0 0 1 0.2 0.0 0.0 0.2

Note that the dropped level in a has been restored.

Sparse Model Matrices

By default, the generated model matrices are dense. In some case, particularly in large datasets with many categorical features, dense model matrices become hugely memory inefficient (since most entries of the data will be zero). Formulaic allows you to directly generate sparse model matrices using:

X = model_matrix("a + b + a:b", df, output='sparse')
In this example, X is a \( 6 \times 3 \) scipy.sparse.csc_matrix instance.


  1. X must be full-rank in order for the regression algorithm to invert a matrix derived from X