Quickstart

This document gives a high-level overview of how to get started with Formulaic.

Building Model Matrices

In Formulaic, the simplest way to build model matrices is to use the high-level model_matrix function:

In [1]:

import pandas
from formulaic import model_matrix

df = pandas.DataFrame({
    'y': [0, 1, 2],
    'a': ['A', 'B', 'C'],
    'b': [0.3, 0.1, 0.2],
})

y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)

In [2]:

y

Out[2]:

	y
0	0
1	1
2	2

In [3]:

X

Out[3]:

	Intercept	a[T.B]	a[T.C]	b	a[T.B]:b	a[T.C]:b
0	1.0	0	0	0.3	0.0	0.0
1	1.0	1	0	0.1	0.1	0.0
2	1.0	0	1	0.2	0.0	0.2
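
To make the meaning of each column concrete, the matrix above can be rebuilt by hand. This is only an illustrative sketch using pandas.get_dummies, not how Formulaic computes it internally; the variable names dummies and manual are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [0, 1, 2],
    "a": ["A", "B", "C"],
    "b": [0.3, 0.1, 0.2],
})

# Dummy-encode `a`, dropping the first level ("A") as the reference level.
dummies = pd.get_dummies(df["a"], drop_first=True, dtype=float)

# Assemble the columns in the same order as the output above.
manual = pd.DataFrame({
    "Intercept": 1.0,                    # constant column
    "a[T.B]": dummies["B"],              # indicator for a == "B"
    "a[T.C]": dummies["C"],              # indicator for a == "C"
    "b": df["b"],                        # numerical column passed through
    "a[T.B]:b": dummies["B"] * df["b"],  # interaction: indicator times b
    "a[T.C]:b": dummies["C"] * df["b"],
})
print(manual)
```

Each interaction column a[T.x]:b is the element-wise product of the corresponding indicator column and b.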

You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X[^1], one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation. If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:

In [4]:

X = model_matrix("a + b + a:b", df, ensure_full_rank=False)
X

Out[4]:

	Intercept	a[T.A]	a[T.B]	a[T.C]	b	a[T.A]:b	a[T.B]:b	a[T.C]:b
0	1.0	1	0	0	0.3	0.3	0.0	0.0
1	1.0	0	1	0	0.1	0.0	0.1	0.0
2	1.0	0	0	1	0.2	0.0	0.0	0.2

Note that the dropped level in a has been restored.
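
The effect of dropping a level can be checked numerically. The following sketch uses plain NumPy on a small hand-built example (it does not call Formulaic, and the six-row data is invented for illustration): with an intercept and all three indicator columns, the indicators sum to the intercept, so the matrix has fewer independent columns than it has columns.

```python
import numpy as np

# Six observations of a three-level categorical variable (illustrative data).
levels = np.array(["A", "B", "C", "A", "B", "C"])
intercept = np.ones((len(levels), 1))

# Full one-hot encoding: one indicator column per level.
one_hot = np.column_stack(
    [(levels == lvl).astype(float) for lvl in ["A", "B", "C"]]
)

full = np.hstack([intercept, one_hot])            # intercept + A, B, C
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + B, C only

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 4 columns, rank 3: rank-deficient
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3: full column rank
```

Dropping one level removes the redundant column without losing information, which is why it is the default for regression use.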

There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec instance attached to the model matrix, for example:

In [5]:

X.model_spec

Out[5]:

ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})

You can read more about the model specs in the Model Specs documentation.

Sparse Model Matrices

By default, the generated model matrices are dense. In some cases, particularly with large datasets that have many categorical features, dense model matrices become very memory-inefficient (since most entries of the matrix will be zero). Formulaic allows you to directly generate sparse model matrices using:

In [6]:

X = model_matrix("a + b + a:b", df, output='sparse')
X

Out[6]:

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Column format>

In this example, X is a $ 3 \times 6 $ scipy.sparse.csc_matrix instance.
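
To give a rough sense of the memory savings, the following sketch compares dense and sparse storage for a one-hot encoded categorical feature. It uses NumPy and SciPy directly rather than Formulaic, and the sizes (10,000 rows, 500 levels) are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy import sparse

n_rows, n_levels = 10_000, 500
rng = np.random.default_rng(0)
codes = rng.integers(0, n_levels, size=n_rows)  # one level per row

# One-hot encode: exactly one non-zero entry per row.
one_hot = sparse.csc_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_levels),
)

# Dense storage needs 8 bytes for every entry, zero or not.
dense_bytes = one_hot.toarray().nbytes

# CSC storage keeps only the non-zero values plus their index arrays.
sparse_bytes = (
    one_hot.data.nbytes + one_hot.indices.nbytes + one_hot.indptr.nbytes
)

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

The dense array stores all 5,000,000 entries, while the sparse matrix stores only the 10,000 non-zero ones and their indices.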

Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:

In [7]:

X.model_spec.column_names

Out[7]:

('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
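
These names can then be used to select columns by label or to reattach labels to the sparse data. In the sketch below, the matrix and names are typed in by hand from the outputs above so that the snippet runs on its own; in practice you would use X and X.model_spec.column_names instead. The names X_sparse, labelled, and b_column are illustrative.

```python
import pandas as pd
from scipy import sparse

# Stand-ins for X.model_spec.column_names and X from the cells above.
column_names = ("Intercept", "a[T.B]", "a[T.C]", "b", "a[T.B]:b", "a[T.C]:b")
X_sparse = sparse.csc_matrix([
    [1.0, 0, 0, 0.3, 0.0, 0.0],
    [1.0, 1, 0, 0.1, 0.1, 0.0],
    [1.0, 0, 1, 0.2, 0.0, 0.2],
])

# Select a single column by label via its position in column_names.
b_column = X_sparse[:, column_names.index("b")].toarray().ravel()
print(b_column)

# Or wrap the sparse matrix in a labelled, sparse-backed DataFrame.
labelled = pd.DataFrame.sparse.from_spmatrix(
    X_sparse, columns=list(column_names)
)
print(labelled)
```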

[^1]: X must be full-rank in order for the regression algorithm to invert a matrix derived from X.