Model Specs
While Formula instances (discussed in How it works) are the source of truth for
abstract user intent, ModelSpec instances are the source of truth for the
materialization process, and bundle a Formula instance with explicit metadata
about the encoding choices that were made (or should be made) when a formula
was (or will be) materialized. As soon as materialization begins, Formula
instances are upgraded into ModelSpec instances, and any missing metadata is
attached as decisions are made during the materialization process.
Besides acting as runtime state during materialization, a ModelSpec serves two main purposes:
- It acts as a metadata store about model matrices, for example providing ready access to the column names, the terms from which they derived, and so on. This is especially useful when the output data type does not have native ways of representing this information (e.g. numpy arrays or scipy sparse matrices where even naming columns is challenging).
- It guarantees reproducibility. Once a Formula has been materialized, you can use the generated ModelSpec instance to repeat the process on similar datasets, confident that the encoding choices will be identical. This is especially useful during out-of-sample prediction, where you need to prepare the out-of-sample data in exactly the same way as the training data for the predictions to be valid.
In the remainder of this portion of the documentation, we will introduce how to
leverage the metadata stored inside ModelSpec instances derived from
materializations, and for more advanced programmatic use-cases, how to manually
build a ModelSpec.
Anatomy of a ModelSpec instance
As noted above, a ModelSpec is the complete specification and record of the
materialization process, combining all user-specified parameters with the
runtime state of the materializer. In particular, ModelSpec instances have the
following explicitly specifiable attributes:
- Configuration (these attributes are typically specified by the user):
- formula: The formula for which the model matrix was (and/or will be) generated.
- materializer: The materializer used (and/or to be used) to materialize the formula into a matrix.
- ensure_full_rank: Whether to ensure that the generated matrix is "structurally" full-rank (features are not included which are known to violate full-rankness).
- na_action: The action to be taken if NA values are found in the data. Can be one of: "drop" (the default), "raise" or "ignore".
- output: The desired output type (as interpreted by the materializer; e.g. "pandas", "sparse", etc).
- State (these attributes are typically only populated during materialization):
- structure: The model matrix structure resulting from materialization.
- transform_state: The state of any stateful transformations that took place during factor evaluation.
- encoder_state: The state of any implicit stateful transformations that took place during encoding.
Often, only formula is explicitly specified, and the rest is inferred on the
user's behalf.
ModelSpec instances also have derived properties and methods that you can use
to introspect the structure of generated model matrices. These derived methods
assume that the ModelSpec has been fully populated, and thus usually only make
sense to consider on ModelSpec instances that are attached to a ModelMatrix.
They are:
- Property attributes:
- column_names: An ordered sequence of names associated with the columns of the generated model matrix.
- column_indices: An ordered mapping from column names to the column index in generated model matrices.
- terms: A sequence of Term instances that were used to generate this model matrix.
- term_indices: An ordered mapping of Term instances to the generated column indices.
- term_slices: An ordered mapping of Term instances to a slice that, when used on the columns of the model matrix, will subsample the model matrix down to those corresponding to each term.
- term_factors: An ordered mapping of Term instances to the set of factors used by that term.
- term_variables: An ordered mapping of Term instances to Variable instances (a string subclass with additional attributes of roles and source), indicating the variables used by that term.
- factors: A set of Factor instances used in the entire formula.
- factor_terms: A mapping from Factor instances to the Term instances that used them.
- factor_variables: A mapping from Factor instances to Variable instances, corresponding to the variables used by that factor.
- factor_contrasts: A mapping from Factor instances to ContrastsState instances that can be used to reproduce the coding matrices used during materialization.
- variables: A set of Variable instances describing the variables used in the entire formula.
- variable_terms: The reverse lookup of term_variables.
- variable_indices: A mapping from Variable instances to the indices of the columns in the model matrix that use that variable.
- variables_by_source: A mapping from source name (typically one of "data", "context", or "transforms") to the variables derived from that source.
- Utility methods:
- get_model_matrix(...): Build a model matrix using this spec. This allows a new dataset to be generated using exactly the same encoding process as an earlier dataset.
- get_linear_constraints(...): Build a set of linear constraints for use during constrained linear regressions.
- get_slice(...): Build a slice instance that can subset a matrix down to the columns associated with a Term instance, its string representation, a column name, or pre-specified ints/slices.
- Transform methods:
- update(...): Create a copy of this ModelSpec instance with the nominated attributes mutated.
We'll cover some of these attributes and methods in examples below, but you can
always refer to help(ModelSpec) for more details.
Using ModelSpec as metadata
One of the most common use-cases for ModelSpec
instances is as metadata to
describe a generated model matrix. This metadata can be used to programmatically
access the appropriate features in the model matrix, for example to assign
sensible names to the coefficients fitted during a regression.
# Let's get ourselves a simple `ModelMatrix` instance to play with.
from formulaic import model_matrix
from pandas import DataFrame
mm = model_matrix("center(a) + b", DataFrame({"a": [1,2,3], "b": ["A", "B", "C"]}))
mm
| | Intercept | center(a) | b[T.B] | b[T.C] |
|---|---|---|---|---|
| 0 | 1.0 | -1.0 | 0 | 0 |
| 1 | 1.0 | 0.0 | 1 | 0 |
| 2 | 1.0 | 1.0 | 0 | 1 |
# And extract the model spec from it
ms = mm.model_spec
ms
ModelSpec(formula=1 + center(a) + b, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=center(a), scoped_terms=[center(a)], columns=['center(a)']), EncodedTermStructure(term=b, scoped_terms=[b-], columns=['b[T.B]', 'b[T.C]'])], transform_state={'center(a)': {'ddof': 1, 'center': np.float64(2.0), 'scale': None}}, encoder_state={'center(a)': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])})})
# We can now interrogate it for various column, factor, term, and variable related metadata
{
"column_names": ms.column_names,
"column_indices": ms.column_indices,
"terms": ms.terms,
"term_indices": ms.term_indices,
"term_slices": ms.term_slices,
"term_factors": ms.term_factors,
"term_variables": ms.term_variables,
"factors": ms.factors,
"factor_terms": ms.factor_terms,
"factor_variables": ms.factor_variables,
"factor_contrasts": ms.factor_contrasts,
"variables": ms.variables,
"variable_terms": ms.variable_terms,
"variable_indices": ms.variable_indices,
"variables_by_source": ms.variables_by_source,
}
{'column_names': ('Intercept', 'center(a)', 'b[T.B]', 'b[T.C]'), 'column_indices': {'Intercept': 0, 'center(a)': 1, 'b[T.B]': 2, 'b[T.C]': 3}, 'terms': [1, center(a), b], 'term_indices': {1: [0], center(a): [1], b: [2, 3]}, 'term_slices': {1: slice(0, 1, None), center(a): slice(1, 2, None), b: slice(2, 4, None)}, 'term_factors': {1: {1}, center(a): {center(a)}, b: {b}}, 'term_variables': {1: set(), center(a): {'a', 'center'}, b: {'b'}}, 'factors': {1, b, center(a)}, 'factor_terms': {1: {1}, center(a): {center(a)}, b: {b}}, 'factor_variables': {b: {'b'}, 1: set(), center(a): {'a', 'center'}}, 'factor_contrasts': {b: ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['A', 'B', 'C'])}, 'variables': {'a', 'b', 'center'}, 'variable_terms': {'center': {center(a)}, 'a': {center(a)}, 'b': {b}}, 'variable_indices': {'center': [1], 'a': [1], 'b': [2, 3]}, 'variables_by_source': {'transforms': {'center'}, 'data': {'a', 'b'}}}
# And use it to select out various parts of the model matrix; here the columns
# produced by the `b` term.
mm.iloc[:, ms.term_indices["b"]]
| | b[T.B] | b[T.C] |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
Some of this metadata may seem redundant at first, but this kind of metadata is essential when the generated model matrix does not natively support indexing by names; for example:
mm_numpy = model_matrix(
"center(a) + b",
DataFrame({"a": [1,2,3], "b": ["A", "B", "C"]}),
output='numpy'
)
mm_numpy
array([[ 1., -1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 1.,  1.,  0.,  1.]])
ms_numpy = mm_numpy.model_spec
mm_numpy[:, ms_numpy.term_indices['b']]
array([[0., 0.],
       [1., 0.],
       [0., 1.]])
Reusing model specifications
Another common use-case for ModelSpec instances is replaying the same
materialization process used to prepare a training dataset on a new dataset.
Since the ModelSpec instance stores all relevant choices made during
materialization, achieving this is as simple as using the ModelSpec to
generate the new model matrix.
By way of example, recall from the section above that we used the formula
center(a) + b, where a was a numerical vector and b was a categorical vector.
When generating model matrices for subsequent datasets it is very important to
use the same centering used during the initial model matrix generation, and not
just center the incoming data again. Likewise, b should be aware of which
categories were present during the initial training, and ensure that the same
columns are created during subsequent materializations (otherwise the model
matrices will not be of the same form, and cannot be used for predictions/etc).
These kinds of transforms that require memory are called "stateful transforms"
in Formulaic, and are described in more detail in the Transforms
documentation.
We can see this in action below:
ms.get_model_matrix(DataFrame({"a": [4,5,6], "b": ["A", "B", "D"]}))
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:169: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {'D'}. They are being cast to nan, which will likely skew the results of your analyses. warnings.warn(
| | Intercept | center(a) | b[T.B] | b[T.C] |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 0 | 0 |
| 1 | 1.0 | 3.0 | 1 | 0 |
| 2 | 1.0 | 4.0 | 0 | 0 |
Notice that when the assumptions of the stateful transforms are violated, warnings and/or exceptions will be generated.
You can also just pass the ModelSpec directly to model_matrix, for example:
model_matrix(ms, data=DataFrame({"a": [4,5,6], "b": ["A", "A", "A"]}))
| | Intercept | center(a) | b[T.B] | b[T.C] |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 0 | 0 |
| 1 | 1.0 | 3.0 | 0 | 0 |
| 2 | 1.0 | 4.0 | 0 | 0 |
Directly constructing ModelSpec instances
It is possible to directly construct ModelSpec instances, and to prepopulate them with various choices (e.g. output types, materializer, etc). You could even, in principle, populate them with state information (but this is not recommended; it is easy to make mistakes here, and it is likely better to encode these choices into the formula itself where possible). For example:
from formulaic import ModelSpec
ms = ModelSpec("a+b+c", output='numpy', ensure_full_rank=False)
ms
ModelSpec(formula=1 + a + b + c, materializer=None, materializer_params=None, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
import pandas
mm = ms.get_model_matrix(pandas.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]}))
mm
array([[1., 1., 4., 7.],
       [1., 2., 5., 8.],
       [1., 3., 6., 9.]])
mm.model_spec
ModelSpec(formula=1 + a + b + c, materializer='pandas', materializer_params={}, ensure_full_rank=False, na_action=<NAAction.DROP: 'drop'>, output='numpy', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=a, scoped_terms=[a], columns=['a']), EncodedTermStructure(term=b, scoped_terms=[b], columns=['b']), EncodedTermStructure(term=c, scoped_terms=[c], columns=['c'])], transform_state={}, encoder_state={'a': (<Kind.NUMERICAL: 'numerical'>, {}), 'b': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.NUMERICAL: 'numerical'>, {})})
Notice that any missing fields not provided by the user are imputed automatically.
Structured ModelSpecs
As discussed in How it works, formulae can be arbitrarily structured, resulting
in a similarly structured set of model matrices. ModelSpec instances can also be
arranged into a structured collection using ModelSpecs, allowing different
choices to be made at different levels of the structure. You can either create
these structures yourself, or inherit the structure from a formula. For example:
from formulaic import Formula, ModelSpecs
ModelSpecs(ModelSpec("a"), substructure=ModelSpec("b"), another_substructure=ModelSpec("c"))
root: ModelSpec(formula=1 + a, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={}) .substructure: ModelSpec(formula=1 + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={}) .another_substructure: ModelSpec(formula=1 + c, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
ModelSpec.from_spec(Formula(lhs="y", rhs="a + b"))
.lhs: ModelSpec(formula=y, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={}) .rhs: ModelSpec(formula=a + b, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, cluster_by=<ClusterBy.NONE: 'none'>, structure=None, transform_state={}, encoder_state={})
Serialization
ModelSpec and ModelSpecs instances have been designed to support serialization
via the standard pickling process offered by Python. This allows model specs to
be persisted into storage and reloaded at a later time, or used in
multiprocessing scenarios.
Serialized model specs are not guaranteed to work between different versions of formulaic. While things will work in the vast majority of cases, the internal state of transforms is free to change from version to version, and may invalidate previously serialized model specs. Efforts will be made to reduce the likelihood of this, and when it happens it should be indicated in the changelogs.