Transforms

A transform in Formulaic is any function that is called to modify factor values during the evaluation of a Factor (see the How it works documentation). Any function can be used as a transform, so long as it is present in the evaluation context (see below).

There are two types of transform:

Regular transforms: These are just normal functions that are applied to features prior to encoding. For example, you could apply the numpy.cumsum function to any vector being fed into the model matrix materialization procedure.
Stateful transforms: These are functions that keep track of the transform state so that they can be reapplied in the future with the same state. This is useful if the transform does something data specific that has to be replicated in future materializations (such as subtracting the mean of the dataset; subsequent materializations should use the mean of the training dataset rather than the mean of the current data).
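
To make the distinction concrete, the following plain-Python sketch contrasts recomputing the mean on every call with reusing a mean recorded on the first call. It does not use Formulaic itself, and the function names `center_stateless` and `center_stateful` are hypothetical.

```python
from statistics import mean


def center_stateless(values):
    """Regular transform: recomputes the mean from whatever data it is given."""
    m = mean(values)
    return [v - m for v in values]


def center_stateful(values, state):
    """Stateful transform: records the mean on first use and reuses it afterwards."""
    if "mean" not in state:
        state["mean"] = mean(values)
    return [v - state["mean"] for v in values]


train = [1.0, 2.0, 3.0]
new = [10.0, 20.0]

state = {}
train_centered = center_stateful(train, state)  # records the training mean (2.0)
new_stateful = center_stateful(new, state)      # reuses the training mean
new_stateless = center_stateless(new)           # uses the mean of `new` (15.0)

print(new_stateful)   # [8.0, 18.0]
print(new_stateless)  # [-5.0, 5.0]
```

Only the stateful version treats the new data consistently with the training data, which is the behaviour a fitted model needs.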

Below we describe how to make a function available for use as a transform during materialization, demonstrate this for regular transforms, and then introduce how to use the stateful transforms that are already implemented and how to write your own.

Adding transforms to the evaluation context

The only requirement for using a transform in a formula is that it be available in the evaluation context. The evaluation context is always pre-seeded with:

Regular transforms (and modules):
- np: The top-level numpy module.
- log: numpy.log.
- log10: numpy.log10.
- log2: numpy.log2.
- exp: numpy.exp.
- exp10: numpy.exp10.
- exp2: numpy.exp2.
- I: Identity/null transform (alternative to {<expr>} syntax).
- lag: Generate lagging or leading columns (useful for datasets collected at regular intervals).
Stateful transforms (documented below):
- bs: Basis spline coding.
- center: Subtraction of the mean.
- hashed: Categorical coding of a deterministically hashed representation.
- poly: Polynomial spline coding.
- scale: Centering and renormalization.
- C: Categorical coding.
  - contr.: An R-like interface to specification of contrast coding.

The evaluation context can be extended to include arbitrary additional functions. If you are using the top-level model_matrix function, the local context in which model_matrix is called is automatically added to the evaluation context; otherwise, you need to specify this context manually. For example:

In [1]:

import pandas
from formulaic import model_matrix, Formula

def my_transform(col: pandas.Series) -> pandas.Series:
    return col ** 2

In [2]:

# Local context is automatically added
model_matrix("a + my_transform(a)", pandas.DataFrame({"a": [1, 2, 3]}))

Out[2]:

	Intercept	a	my_transform(a)
0	1.0	1	1
1	1.0	2	4
2	1.0	3	9

In [3]:

# Manually add `my_transform` to the context
Formula("a + my_transform(a)").get_model_matrix(
    pandas.DataFrame({"a": [1, 2, 3]}),
    context={"my_transform": my_transform},  # could also use: context=locals()
)

Out[3]:

	Intercept	a	my_transform(a)
0	1.0	1	1
1	1.0	2	4
2	1.0	3	9

Stateful transforms

In Formulaic, a stateful transform is just a regular callable object (typically a function) that has an attribute __is_stateful_transform__ that is set to True. Such callables will be passed up to three additional arguments by formulaic if they are present in the callable signature:

_state: The existing state or an empty dictionary that should be mutated to record any additional state.
_metadata: An additional metadata dictionary passed on about the factor or None. Will typically only be present if the Factor metadata is populated.
_spec: The current model spec being evaluated (or an empty ModelSpec if being called outside of Formulaic's materialization routines).

Only _state is required; _metadata and _spec are passed in by Formulaic only if they are present in the callable signature.
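
As a minimal sketch of implementing this interface by hand (without the decorator described later), the function below accepts `_state` and sets the `__is_stateful_transform__` attribute. The name `demean` is hypothetical, plain lists stand in for real column data, and the calls at the bottom pass `_state` manually to imitate what Formulaic would do during materialization.

```python
from statistics import mean


def demean(data, _state):
    """Subtract the mean recorded in `_state`, computing it on first use."""
    if "mean" not in _state:
        _state["mean"] = mean(data)
    return [x - _state["mean"] for x in data]


# Mark the function as a stateful transform, as described above.
demean.__is_stateful_transform__ = True

# Imitate two materializations that share one state dictionary.
state = {}
first = demean([1.0, 2.0, 3.0], _state=state)
second = demean([4.0, 5.0], _state=state)

print(first)   # [-1.0, 0.0, 1.0]
print(second)  # [2.0, 3.0]
print(state)   # {'mean': 2.0}
```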

Provided stateful transforms

Formulaic comes preloaded with some useful stateful transforms, which are outlined below.

Scaling and Centering

There are two provided scaling transforms: scale(...) and center(...).

scale rescales the data such that it is centered around zero with a standard deviation of 1. The centering and variance standardisation can be independently disabled as necessary. center is a simple wrapper around scale that only does the centering. For more details, refer to inline documentation: help(scale).

Example usage is shown below:

In [4]:

from formulaic.transforms import scale, center
scale(pandas.Series([1,2,3,4,5,6,7,8]))

Out[4]:

array([-1.42886902, -1.02062073, -0.61237244, -0.20412415,  0.20412415,
        0.61237244,  1.02062073,  1.42886902])

In [5]:

center(pandas.Series([1,2,3,4,5,6,7,8]))

Out[5]:

array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])
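
The outputs above can be reproduced by hand. The sketch below uses only the standard library and assumes, based on the printed values rather than on the documented API, that scale divides by the sample standard deviation (n - 1 in the denominator).

```python
from statistics import mean, stdev

data = [1, 2, 3, 4, 5, 6, 7, 8]

m = mean(data)   # 4.5
s = stdev(data)  # sample standard deviation, sqrt(6)

centered = [x - m for x in data]      # compare with center(...) above
scaled = [(x - m) / s for x in data]  # compare with scale(...) above

print(centered)
print([round(v, 8) for v in scaled])
```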

Categorical Encoding

Formulaic provides a rich family of categorical stateful transforms. These are perhaps the most commonly used transforms, and are used to encode categorical/factor data into a form suitable for numerical analysis. Use of these transforms is separately documented in the Categorical Encoding section.

Spline Encoding

Spline coding is used to enable non-linear dependence on numerical features in linear models. Formulaic currently provides two spline transforms: bs for basis splines, and poly for polynomial splines. These are separately documented in the Spline Encoding section.

Implementing custom stateful transforms

You can either implement the above interface directly or use the stateful_transform decorator provided by Formulaic. The decorator also turns your function into a single-dispatch function, allowing multiple implementations that are selected by the type of the data being materialized. A simple centering example is explored below.

In [6]:

import numpy
from formulaic.transforms import stateful_transform

@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    print("state", _state)
    print("metadata", _metadata)
    print("spec", _spec)
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]

state = {}
center(pandas.Series([1,2,3]), _state=state)

state {}
metadata None
spec ModelSpec(formula=, materializer=None, materializer_params=None, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output=None, structure=None, transform_state={}, encoder_state={})

Out[6]:

0   -1.0
1    0.0
2    1.0
dtype: float64

In [7]:

state

Out[7]:

{'mean': 2.0}

During materialization, Formulaic automatically stores the mutated state object in the appropriate ModelSpec instance so that it can be reused as necessary.

If you wanted to leverage the single dispatch functionality, you could do something like:

In [8]:

import numpy
from formulaic.transforms import stateful_transform

@stateful_transform
def center(data, _state=None, _metadata=None, _spec=None):
    raise ValueError(f"No implementation for data of type {repr(type(data))}")

@center.register(pandas.Series)
def _(data, _state=None, _metadata=None, _spec=None):
    if "mean" not in _state:
        _state["mean"] = numpy.mean(data)
    return data - _state["mean"]

Note

If taking advantage of the single dispatch functionality, it is important that the top-level function has exactly the same signature as the type specific implementations.
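
The dispatch behaviour can be illustrated with the standard library's functools.singledispatch, used here as a stand-in for the stateful_transform decorator (that the two behave alike in this respect is an assumption of this sketch, and the name `center_values` is hypothetical). The fallback and the type-specific implementation share the same signature, as the note above recommends.

```python
from functools import singledispatch
from statistics import mean


@singledispatch
def center_values(data, _state=None):
    # Fallback used when no implementation is registered for type(data).
    raise ValueError(f"No implementation for data of type {type(data)!r}")


@center_values.register(list)
def _(data, _state=None):
    # Same signature as the fallback above.
    if "mean" not in _state:
        _state["mean"] = mean(data)
    return [x - _state["mean"] for x in data]


state = {}
print(center_values([1.0, 2.0, 3.0], _state=state))  # [-1.0, 0.0, 1.0]

try:
    center_values("abc", _state=state)  # str has no registered implementation
except ValueError as exc:
    print("fallback raised:", exc)
```

Dispatch is decided by the type of the first positional argument only, so the extra keyword arguments do not influence which implementation is chosen.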