Quickstart
This document gives a high-level overview of how to get started with Formulaic.
Building Model Matrices
In formulaic, the simplest way to build your model matrices is to use the high-level model_matrix function:
import pandas
from formulaic import model_matrix
df = pandas.DataFrame(
    {
        "y": [0, 1, 2],
        "a": ["A", "B", "C"],
        "b": [0.3, 0.1, 0.2],
    }
)
y, X = model_matrix("y ~ a + b + a:b", df)
# This is short-hand for:
# y, X = formulaic.Formula('y ~ a + b + a:b').get_model_matrix(df)
y
| | y |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
X
| | Intercept | a[T.B] | a[T.C] | b | a[T.B]:b | a[T.C]:b |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0.3 | 0.0 | 0.0 |
| 1 | 1.0 | 1 | 0 | 0.1 | 0.1 | 0.0 |
| 2 | 1.0 | 0 | 1 | 0.2 | 0.0 | 0.2 |
You will notice that the categorical values for a have been one-hot (aka dummy) encoded, and to ensure structural full-rankness of X[^1], one level has been dropped from a. For more details about how this guarantees that the matrix is full-rank, please refer to the excellent patsy documentation.
If you are not using the model matrices for regression, and don't care if the matrix is not full-rank, you can pass ensure_full_rank=False:
X = model_matrix("a + b + a:b", df, ensure_full_rank=False)
X
| | Intercept | a[T.A] | a[T.B] | a[T.C] | b | a[T.A]:b | a[T.B]:b | a[T.C]:b |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 0 | 0.3 | 0.3 | 0.0 | 0.0 |
| 1 | 1.0 | 0 | 1 | 0 | 0.1 | 0.0 | 0.1 | 0.0 |
| 2 | 1.0 | 0 | 0 | 1 | 0.2 | 0.0 | 0.0 | 0.2 |
Note that the dropped level in a has been restored.
There is a rich trove of information about the columns and structure of the model matrix stored in the ModelSpec instance attached to the model matrix, for example:
X.model_spec
ModelSpec(formula=1 + a + b + a:b, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a[T.A]', 'a[T.B]', 'a[T.C]']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=a:b, scoped_terms=[a:b], columns=['a[T.A]:b', 'a[T.B]:b', 'a[T.C]:b'])], transform_state={}, encoder_state={'a': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'b': (<Kind.NUMERICAL: 'numerical'>, {})})
You can read more about the model specs in the Model Specs documentation.
Sparse Model Matrices
By default, the generated model matrices are dense. In some cases, particularly with large datasets that have many categorical features, dense model matrices become highly memory-inefficient (since most entries will be zero). Formulaic allows you to directly generate sparse model matrices using:
X = model_matrix("a + b + a:b", df, output="sparse")
X
<3x6 sparse matrix of type '<class 'numpy.float64'>' with 10 stored elements in Compressed Sparse Column format>
In this example, X is a $ 3 \times 6 $ scipy.sparse.csc_matrix instance.
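To give a rough sense of the memory argument, the sketch below compares the storage needed for a dense and a sparse one-hot encoding of a single categorical feature. It uses SciPy directly rather than Formulaic, and the row count, level count, and category codes are all made-up values chosen for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes: 10,000 rows and one categorical feature with 1,000 levels.
n_rows, n_levels = 10_000, 1_000
codes = np.arange(n_rows) % n_levels  # made-up category code for each row

# One-hot encoding stored in Compressed Sparse Column format:
# exactly one non-zero entry per row.
one_hot = sparse.csc_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_levels),
)

# A dense float64 matrix stores every entry, zero or not.
dense_bytes = n_rows * n_levels * np.dtype(np.float64).itemsize
# The sparse matrix stores only the non-zero values plus index arrays.
sparse_bytes = (
    one_hot.data.nbytes + one_hot.indices.nbytes + one_hot.indptr.nbytes
)

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

The gap widens as the number of levels grows, because the dense size scales with rows times levels while the sparse size scales with the number of non-zero entries.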
Since sparse matrices do not have labels for columns, you can look these up from the model spec described above; for example:
X.model_spec.column_names
('Intercept', 'a[T.B]', 'a[T.C]', 'b', 'a[T.B]:b', 'a[T.C]:b')
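As an illustration of pairing these names with the sparse data, the sketch below rebuilds the $ 3 \times 6 $ matrix from the values shown earlier using SciPy directly (rather than calling Formulaic) and then selects a column by name. The names tuple is copied from the output above; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

# Column names as reported by X.model_spec.column_names above.
column_names = ("Intercept", "a[T.B]", "a[T.C]", "b", "a[T.B]:b", "a[T.C]:b")

# The same values as the dense model matrix shown earlier.
dense = np.array(
    [
        [1.0, 0.0, 0.0, 0.3, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.1, 0.1, 0.0],
        [1.0, 0.0, 1.0, 0.2, 0.0, 0.2],
    ]
)
X_sparse = sparse.csc_matrix(dense)

# Translate a column name into a positional index, then slice that column.
b_index = column_names.index("b")
b_column = X_sparse[:, b_index].toarray().ravel()

print(X_sparse.shape, X_sparse.nnz)
print(b_index, b_column)
```

Keeping the names in a tuple alongside the matrix is enough for simple lookups; a dictionary from name to index would avoid the linear search when there are many columns.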
[^1]: X must be full-rank in order for the regression algorithm to invert a matrix derived from X.