Quickstart
This document provides high-level documentation on how to get started using Formulaic. For deeper documentation about the internals, please refer to the Advanced Usage documentation.
Building Model Matrices
In formulaic, the simplest way to build your model matrices is to use the high-level model_matrix function:
import pandas
from formulaic import model_matrix
df = pandas.DataFrame({
    'y': [0, 1, 2],
    'a': ['A', 'B', 'C'],
    'b': [0.3, 0.1, 0.2],
})
y, X = model_matrix("y ~ a + b + a:b", df)
# This is shorthand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)
# This lower-level API is discussed in the Advanced Usage documentation.
y =
   y
0  0
1  1
2  2
X =
   Intercept  a[T.B]  a[T.C]    b  a[T.B]:b  a[T.C]:b
0        1.0       0       0  0.3       0.0       0.0
1        1.0       1       0  0.1       0.1       0.0
2        1.0       0       1  0.2       0.0       0.2
You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X^{1}, one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation.
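The effect of dropping a level can be checked numerically. The following is a minimal sketch that uses plain NumPy rather than Formulaic, and it looks only at the intercept and the columns for a (the b and a:b columns are left out for brevity). The arrays are typed in by hand from the example data, and the variable names are illustrative assumptions.

```python
import numpy as np

# Columns: Intercept, a[T.A], a[T.B], a[T.C] for the rows a = 'A', 'B', 'C'.
full_one_hot = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Dropping a[T.A] leaves Intercept, a[T.B], a[T.C], as in the X shown above.
reduced = full_one_hot[:, [0, 2, 3]]

# The intercept equals the sum of the three dummy columns, so the four
# columns of full_one_hot are linearly dependent.
rank_full = np.linalg.matrix_rank(full_one_hot)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full one-hot: rank", rank_full, "of", full_one_hot.shape[1], "columns")
print("reduced:      rank", rank_reduced, "of", reduced.shape[1], "columns")

# With full column rank, the Gram matrix X^T X (assumed here to be the kind
# of matrix a regression routine would invert) is invertible.
gram = reduced.T @ reduced
gram_inv = np.linalg.inv(gram)
```

The full one-hot matrix has more columns than its rank, so its Gram matrix would be singular; the reduced matrix has rank equal to its column count.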
If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:
X = model_matrix("a + b + a:b", df, ensure_full_rank=False)
X =
   Intercept  a[T.A]  a[T.B]  a[T.C]    b  a[T.A]:b  a[T.B]:b  a[T.C]:b
0        1.0       1       0       0  0.3       0.3       0.0       0.0
1        1.0       0       1       0  0.1       0.0       0.1       0.0
2        1.0       0       0       1  0.2       0.0       0.0       0.2
Note that the dropped level in a has been restored.
Sparse Model Matrices
By default, the generated model matrices are dense. In some cases, particularly with large datasets that have many categorical features, dense model matrices become very memory-inefficient (since most entries of the matrix will be zero). Formulaic allows you to directly generate sparse model matrices using:
X = model_matrix("a + b + a:b", df, output='sparse')
Here X is a \( 3 \times 6 \) (rows by columns) scipy.sparse.csc_matrix instance.
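To illustrate why a sparse representation saves memory, here is a minimal sketch that builds a scipy.sparse.csc_matrix directly from the dense values of the first example, which has 3 rows and 6 columns. It does not call Formulaic; the array is typed in by hand and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

# Dense values copied from the full-rank X in the first example
# (columns: Intercept, a[T.B], a[T.C], b, a[T.B]:b, a[T.C]:b).
dense = np.array([
    [1.0, 0.0, 0.0, 0.3, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.1, 0.0],
    [1.0, 0.0, 1.0, 0.2, 0.0, 0.2],
])

# Compressed sparse column storage keeps only the nonzero entries.
X_sparse = sparse.csc_matrix(dense)

stored = X_sparse.nnz   # number of explicitly stored (nonzero) values
total = dense.size      # number of cells in the dense array

print("shape:", X_sparse.shape)
print("stored values:", stored, "of", total)
```

On a matrix this small the saving is negligible, but the proportion of zeros grows quickly as categorical features gain more levels.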

1. X must be full-rank in order for the regression algorithm to invert a matrix derived from X.