Model Specs

While Formula instances (discussed in How it works) are the source of truth for abstract user intent, ModelSpec instances are the source of truth for the materialization process. Each ModelSpec bundles a Formula instance with explicit metadata about the encoding choices that were made (or should be made) when the formula was (or will be) materialized. As soon as materialization begins, Formula instances are upgraded into ModelSpec instances, and any missing metadata is attached as decisions are made during the materialization process.

Besides acting as runtime state during materialization, a ModelSpec serves two main purposes:

- It acts as a metadata store about model matrices, for example providing ready access to the column names, the terms from which they were derived, and so on. This is especially useful when the output data type does not have native ways of representing this information (e.g. numpy arrays or scipy sparse matrices, where even naming columns is challenging).
- It guarantees reproducibility. Once a Formula has been materialized, you can use the generated ModelSpec instance to repeat the process on similar datasets, confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid.

In the remainder of this portion of the documentation, we will introduce how to leverage the metadata stored inside ModelSpec instances derived from materializations, and for more advanced programmatic use-cases, how to manually build a ModelSpec.

Anatomy of a `ModelSpec` instance

As noted above, a ModelSpec is the complete specification and record of the materialization process, combining all user-specified parameters with the runtime state of the materializer. In particular, ModelSpec instances have the following explicitly specifiable attributes:

Configuration (these attributes are typically specified by the user):
- formula: The formula for which the model matrix was (and/or will be) generated.
- materializer: The materializer used (and/or to be used) to materialize the formula into a matrix.
- ensure_full_rank: Whether to ensure that the generated matrix is "structurally" full-rank (features that are known to violate full-rankness are not included).
- na_action: The action to be taken if NA values are found in the data. Can be one of: "drop" (the default), "raise" or "ignore".
- output: The desired output type (as interpreted by the materializer; e.g. "pandas", "sparse", etc).
State (these attributes are typically only populated during materialization):
- structure: The model matrix structure resulting from materialization.
- transform_state: The state of any stateful transformations that took place during factor evaluation.
- encoder_state: The state of any implicit stateful transformations that took place during encoding.

Often, only formula is explicitly specified, and the rest is inferred on the user's behalf.

ModelSpec instances also have derived properties and methods that you can use to introspect the structure of generated model matrices. These derived methods assume that the ModelSpec has been fully populated, and thus usually only make sense to consider on ModelSpec instances that are attached to a ModelMatrix. They are:

Property attributes:
- column_names: An ordered sequence of names associated with the columns of the generated model matrix.
- column_indices: An ordered mapping from column names to the column index in generated model matrices.
- terms: A sequence of Term instances that were used to generate this model matrix.
- term_indices: An ordered mapping of Term instances to the generated column indices.
- term_slices: An ordered mapping of Term instances to a slice that when used on the columns of the model matrix will subsample the model matrix down to those corresponding to each term.
- term_factors: An ordered mapping of Term instances to the set of factors used by that term.
- term_variables: An ordered mapping of Term instances to Variable instances (a string subclass with additional attributes of roles and source), indicating the variables used by that term.
- factors: A set of Factor instances used in the entire formula.
- factor_terms: A mapping from Factor instances to the Term instances that used them.
- factor_variables: A mapping from Factor instances to Variable instances, corresponding to the variables used by that factor.
- factor_contrasts: A mapping from Factor instances to ContrastsState instances that can be used to reproduce the coding matrices used during materialization.
- variables: A set of Variable instances describing the variables used in the entire formula.
- variable_terms: The reverse lookup of term_variables.
- variable_indices: A mapping from Variable instances to the indices of the columns in the model matrix that use that variable.
- variables_by_source: A mapping from source name (typically one of "data", "context", or "transforms") to the variables derived from that source.
Utility methods:
- get_model_matrix(...): Build a model matrix using this spec. This allows a new dataset to be generated using exactly the same encoding process as an earlier dataset.
- get_linear_constraints(...): Build a set of linear constraints for use during constrained linear regressions.
- get_slice(...): Build a slice instance that can subset a matrix down to the columns associated with a Term instance, its string representation, a column name, or pre-specified ints/slices.
Transform methods:
- update(...): Create a copy of this ModelSpec instance with the nominated attributes mutated.

We'll cover some of these attributes and methods in examples below, but you can always refer to help(ModelSpec) for more details.

Using `ModelSpec` as metadata

One of the most common use-cases for ModelSpec instances is as metadata to describe a generated model matrix. This metadata can be used to programmatically access the appropriate features in the model matrix, for example to assign sensible names to the coefficients fit during a regression.

In [1]:

# Let's get ourselves a simple `ModelMatrix` instance to play with.
from formulaic import model_matrix
from pandas import DataFrame

mm = model_matrix("center(a) + b", DataFrame({"a": [1,2,3], "b": ["A", "B", "C"]}))
mm

Out[1]:

	Intercept	center(a)	b[T.B]	b[T.C]
0	1.0	-1.0	0	0
1	1.0	0.0	1	0
2	1.0	1.0	0	1

In [2]:

# And extract the model spec from it
ms = mm.model_spec
ms

Out[2]:

ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})

In [3]:

# We can now interrogate it for various column, factor, term, and variable related metadata
{
    "column_names": ms.column_names,
    "column_indices": ms.column_indices,
    "terms": ms.terms,
    "term_indices": ms.term_indices,
    "term_slices": ms.term_slices,
    "term_factors": ms.term_factors,
    "term_variables": ms.term_variables,
    "factors": ms.factors,
    "factor_terms": ms.factor_terms,
    "factor_variables": ms.factor_variables,
    "factor_contrasts": ms.factor_contrasts,
    "variables": ms.variables,
    "variable_terms": ms.variable_terms,
    "variable_indices": ms.variable_indices,
    "variables_by_source": ms.variables_by_source,
}

Out[3]:

{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'),
 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3},
 'terms': [1, center(a), b],
 'term_indices': {1: [0], center(a): [1], b: [2, 3]},
 'term_slices': {1: slice(0, 1, None),
  center(a): slice(1, 2, None),
  b: slice(2, 4, None)},
 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}},
 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}},
 'factors': {1, b, center(a)},
 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}},
 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}},
 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])},
 'variables': {'a', 'b', 'center'},
 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}},
 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]},
 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}

In [4]:

# And use it to select out various parts of the model matrix; here the columns
# produced by the `b` term.
mm.iloc[:, ms.term_indices["b"]]

Out[4]:

	b[T.B]	b[T.C]
0	0	0
1	1	0
2	0	1

Some of this metadata may seem redundant at first, but this kind of metadata is essential when the generated model matrix does not natively support indexing by names; for example:

In [5]:

mm_numpy = model_matrix(
    "center(a) + b",
    DataFrame({"a": [1,2,3], "b": ["A", "B", "C"]}),
    output='numpy'
)
mm_numpy

Out[5]:

array([[ 1., -1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  1.]])

In [6]:

ms_numpy = mm_numpy.model_spec
mm_numpy[:, ms_numpy.term_indices['b']]

Out[6]:

array([[0., 0.],
       [1., 0.],
       [0., 1.]])

Reusing model specifications

Another common use-case for ModelSpec instances is replaying the same materialization process used to prepare a training dataset on a new dataset. Since the ModelSpec instance stores all relevant choices made during materialization, achieving this is as simple as using the ModelSpec to generate the new model matrix.

By way of example, recall from the section above that we used the formula

center(a) + b

where a was a numerical vector, and b was a categorical vector. When generating model matrices for subsequent datasets it is very important to use the same centering used during the initial model matrix generation, and not just center the incoming data again. Likewise, b should be aware of which categories were present during the initial training, and ensure that the same columns are created during subsequent materializations (otherwise the model matrices will not be of the same form, and cannot be used for predictions/etc). These kinds of transforms that require memory are called "stateful transforms" in Formulaic, and are described in more detail in the Transforms documentation.

We can see this in action below:

In [7]:

ms.get_model_matrix(DataFrame({"a": [4,5,6], "b": ["A", "B", "D"]}))

/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being  cast to nan, which will likely skew the results of your analyses.
  warnings.warn(

Out[7]:

	Intercept	center(a)	b[T.B]
0	1.0	2.0	0
1	1.0	3.0	1
2	1.0	4.0	0

Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.

You can also just pass the ModelSpec directly to model_matrix, for example:

In [8]:

model_matrix(ms, data=DataFrame({"a": [4,5,6], "b": ["A", "A", "A"]}))

Out[8]:

	Intercept	center(a)
0	1.0	2.0
1	1.0	3.0
2	1.0	4.0

Directly constructing `ModelSpec` instances

It is possible to directly construct ModelSpec instances, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and it is likely better to encode these choices into the formula itself where possible). For example:

In [9]:

from formulaic import ModelSpec

ms = ModelSpec("a+b+c", output='numpy', ensure_full_rank=False)
ms

Out[9]:

ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

In [10]:

import pandas
mm = ms.get_model_matrix(pandas.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]}))
mm

Out[10]:

array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])

In [11]:

mm.model_spec

Out[11]:

ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})

Notice that any missing fields not provided by the user are imputed automatically.

Structured `ModelSpecs`

As discussed in How it works, formulae can be arbitrarily structured, resulting in a similarly structured set of model matrices. ModelSpec instances can also be arranged into a structured collection using ModelSpecs, allowing different choices to be made at different levels of the structure. You can either create these structures yourself, or inherit the structure from a formula. For example:

In [12]:

from formulaic import Formula, ModelSpecs

ModelSpecs(ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c"))

Out[12]:

root:
    ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.substructure:
    ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.another_substructure:
    ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

In [13]:

ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))

Out[13]:

.lhs:
    ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
.rhs:
    ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})

Serialization

ModelSpec and ModelSpecs instances have been designed to support serialization via the standard pickling process offered by Python. This allows model specs to be persisted into storage and reloaded at a later time, or used in multiprocessing scenarios.

Serialized model specs are not guaranteed to work between different versions of formulaic. While things will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.
