Transforms
A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor (see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).
There are two types of transform:
- Regular transforms: These are just normal functions that are applied to features prior to encoding. For example, you could apply the numpy.cumsum function to any vector being fed into the model matrix materialization procedure.
- Stateful transforms: These are functions that keep track of the transform state so that they can be reapplied in the future with the same state. This is useful if the transform does something data specific that has to be replicated in future materializations (such as subtracting the mean of the dataset; subsequent materializations should use the mean of the training dataset rather than the mean of the current data).
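To make the distinction concrete, the following minimal sketch uses plain Python lists as a stand-in for the vectors Formulaic would pass. The variable names are illustrative and not part of Formulaic; the point is only to show why centering needs remembered state.

```python
from statistics import fmean

train = [1.0, 2.0, 3.0]   # data seen when the model was first built
new = [10.0, 20.0]        # data seen in a later materialization

# Regular (stateless) behaviour: the mean is recomputed from whatever
# data is passed in, so the new data is centered around its own mean.
stateless = [x - fmean(new) for x in new]

# Stateful behaviour: the training mean is remembered and reused.
train_mean = fmean(train)
stateful = [x - train_mean for x in new]

print(stateless)  # [-5.0, 5.0]
print(stateful)   # [8.0, 18.0]
```

Only the second result is comparable with values produced from the training data, which is the behaviour a stateful transform is meant to guarantee.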
Below, we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then show how to use the provided stateful transforms and/or write your own.
Adding transforms to the evaluation context
The only requirement for using a transform in a formula is that it be available in the evaluation context. The evaluation context is always pre-seeded with:
- Regular transforms (and modules):
    - np: The top-level numpy module.
    - log: numpy.log.
    - log10: numpy.log10.
    - log2: numpy.log2.
    - exp: numpy.exp.
    - exp10: The base-10 exponential, 10 ** x (NumPy itself has no exp10 function).
    - exp2: numpy.exp2.
    - I: Identity/null transform (alternative to {<expr>} syntax).
    - lag: Generate lagging or leading columns (useful for datasets collected at regular intervals).
- Stateful transforms (documented below):
    - bs: Basis spline coding.
    - center: Subtraction of the mean.
    - hashed: Categorical coding of a deterministically hashed representation.
    - poly: Polynomial spline coding.
    - scale: Centering and renormalization.
    - C: Categorical coding.
    - contr.: An R-like interface to specification of contrast coding.
The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix function, then the local context in which model_matrix is called is automatically added to the evaluation context; otherwise, you need to specify this context manually. For example:
import pandas
from formulaic import model_matrix, Formula

def my_transform(col: pandas.Series) -> pandas.Series:
    return col ** 2

# Local context is automatically added
model_matrix("a + my_transform(a)", pandas.DataFrame({"a": [1, 2, 3]}))
| | Intercept | a | my_transform(a) |
|---|---|---|---|
| 0 | 1.0 | 1 | 1 |
| 1 | 1.0 | 2 | 4 |
| 2 | 1.0 | 3 | 9 |
# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)
| | Intercept | a | my_transform(a) |
|---|---|---|---|
| 0 | 1.0 | 1 | 1 |
| 1 | 1.0 | 2 | 4 |
| 2 | 1.0 | 3 | 9 |
Stateful transforms
In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__ set to True. Such callables will be passed up to three additional arguments by Formulaic if they are present in the callable signature:
- _state: The existing state, or an empty dictionary that should be mutated to record any additional state.
- _metadata: An additional metadata dictionary passed on about the factor, or None. Will typically only be present if the Factor metadata is populated.
- _spec: The current model spec being evaluated (or an empty ModelSpec if being called outside of Formulaic's materialization routines).
Only _state is required; _metadata and _spec will only be passed in by Formulaic if they are present in the callable signature.
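A minimal sketch of implementing this interface directly, without any Formulaic helpers, is shown below. The function name center_direct is hypothetical, plain lists stand in for the data Formulaic would normally pass, and the function is called by hand here, so the state dictionary is supplied explicitly rather than by Formulaic.

```python
from statistics import fmean


def center_direct(data, _state):
    """Subtract the mean, recording it in `_state` on first use."""
    if "mean" not in _state:
        _state["mean"] = fmean(data)
    return [x - _state["mean"] for x in data]


# Mark the callable as stateful using the attribute described above.
center_direct.__is_stateful_transform__ = True

state = {}
first = center_direct([1, 2, 3], _state=state)   # records the mean
later = center_direct([10, 20], _state=state)    # reuses the stored mean

print(first)  # [-1.0, 0.0, 1.0]
print(state)  # {'mean': 2.0}
print(later)  # [8.0, 18.0]
```

Because this sketch only declares _state, the _metadata and _spec arguments would not be passed to it.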
Provided stateful transforms
Formulaic comes preloaded with some useful stateful transforms, which are outlined below.
Scaling and Centering
There are two provided scaling transforms: scale(...) and center(...).
scale rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center is a simple wrapper around scale that only does the centering. For more details, refer to the inline documentation: help(scale).
Example usage is shown below:
from formulaic.transforms import scale, center
scale(pandas.Series([1,2,3,4,5,6,7,8]))
array([-1.42886902, -1.02062073, -0.61237244, -0.20412415, 0.20412415, 0.61237244, 1.02062073, 1.42886902])
center(pandas.Series([1,2,3,4,5,6,7,8]))
array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])
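The scale output above is consistent with subtracting the mean and dividing by the sample standard deviation (that is, using n - 1 in the denominator). The sketch below reproduces the first values using only the standard library; it is a check of the arithmetic under that assumption, not Formulaic's implementation.

```python
from statistics import fmean, stdev

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

mean = fmean(data)   # 4.5
sd = stdev(data)     # sample standard deviation, sqrt(6)

centered = [x - mean for x in data]
scaled = [(x - mean) / sd for x in data]

print(centered[:2])          # [-3.5, -2.5]
print(round(scaled[0], 8))   # -1.42886902
```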
Categorical Encoding
Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.
Spline Encoding
Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs for basis splines, and poly for polynomial splines. These are separately documented in the Spline Encoding section.
Implementing custom stateful transforms
You can either implement the above interface directly, or leverage the stateful_transform decorator provided by Formulaic, which also turns your function into a single-dispatch function, allowing multiple implementations that depend on the currently materialized type. A simple centering example is explored below.
import numpy
from formulaic.transforms import stateful_transform

@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    print("state", _state)
    print("metadata", _metadata)
    print("spec", _spec)
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]

state = {}
center(pandas.Series([1,2,3]), _state=state)
state {}
metadata None
spec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})
0   -1.0
1    0.0
2    1.0
dtype: float64
state
{'mean': 2.0}
The mutated state object is then automatically stored by Formulaic in the right context of the appropriate ModelSpec instance for reuse as necessary.
If you wanted to leverage the single dispatch functionality, you could do something like:
import numpy
from formulaic.transforms import stateful_transform

@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    raise ValueError(f"No implementation for data of type {repr(type(data))}")

@center.register(pandas.Series)
def _(data, _state=None, _metadata=None, _spec=None):
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]
Note
If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type-specific implementations.
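To illustrate the dispatch pattern on its own, the sketch below uses the standard library's functools.singledispatch rather than Formulaic's stateful_transform, whose internals are not shown here. The name center_sketch and the list and tuple implementations are hypothetical; every implementation keeps the same signature as the top-level function, as the note recommends.

```python
from functools import singledispatch
from statistics import fmean


@singledispatch
def center_sketch(data, _state=None):
    # Fallback used when no type-specific implementation is registered.
    raise ValueError(f"No implementation for data of type {type(data)!r}")


@center_sketch.register(list)
def _(data, _state=None):
    if "mean" not in _state:
        _state["mean"] = fmean(data)
    return [x - _state["mean"] for x in data]


@center_sketch.register(tuple)
def _(data, _state=None):
    # Delegate to the list implementation, then restore the input type.
    return tuple(center_sketch(list(data), _state=_state))


state = {}
print(center_sketch([1, 2, 3], _state=state))  # [-1.0, 0.0, 1.0]
print(center_sketch((4, 5), _state=state))     # (2.0, 3.0)
```

The second call reuses the mean stored by the first, and any unregistered type (a string, for instance) falls through to the top-level function and raises ValueError.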