Categorical Encoding
Categorical data (also known as "factors") is encoded in model matrices using "contrast codings" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, explaining how to select and/or design your own coding, and (for more advanced readers) describing how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
Basic usage
Formulaic follows the lead of R and Patsy by automatically inferring from the data whether a feature needs to be categorically encoded. For example:
from formulaic import model_matrix
from pandas import DataFrame, Categorical
df = DataFrame({
"letters": ["a", "b", "c"],
"numbers": Categorical([1,2,3]),
"values": [20, 200, 30],
})
model_matrix("letters + numbers + values", df)
| | Intercept | letters[T.b] | letters[T.c] | numbers[T.2] | numbers[T.3] | values |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0 | 0 | 20 |
| 1 | 1.0 | 1 | 0 | 1 | 0 | 200 |
| 2 | 1.0 | 0 | 1 | 0 | 1 | 30 |
Here `letters` was identified as a categorical variable because it consists of strings, `numbers` was identified as categorical because of its data type, and `values` was treated as a vector of numerical values. The categorical data was encoded using the default "Treatment" encoding (aka. "Dummy"; see below for more details).
If we want to force Formulaic to treat a column as categorical, we can use the `C()` transform (just as in Patsy and R). For example:
model_matrix("C(values)", df)
| | Intercept | C(values)[T.30] | C(values)[T.200] |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 1 | 1.0 | 0 | 1 |
| 2 | 1.0 | 1 | 0 |
The `C()` transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix("C(values, contr.poly, levels=[10, 20, 30])", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.
  warnings.warn(
| | Intercept | C(values, contr.poly, levels=[10, 20, 30]).L | C(values, contr.poly, levels=[10, 20, 30]).Q |
|---|---|---|---|
| 0 | 1.0 | 0.000000 | -0.816497 |
| 1 | 1.0 | 0.000000 | 0.000000 |
| 2 | 1.0 | 0.707107 | 0.408248 |
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
How does contrast coding work?
As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If your model does not have a global intercept, you can use this dummy encoding directly, with all $K$ columns, and contrasts are unnecessary. When a global intercept is present, however, you need to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using "contrasts" looks like constructing a $K \times (K-1)$ "coding matrix" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \in \mathbb{R}^{N \times (K-1)}$ is the contrast coded categorical data, $D \in \{0, 1\}^{N \times K}$ is the dummy encoded data, and $C \in \mathbb{R}^{K \times (K-1)}$ is the coding matrix.
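To make this concrete, here is a small sketch (plain pandas/numpy, purely for illustration and not Formulaic's internal implementation) that dummy encodes a short categorical vector and post-multiplies it by a treatment-style coding matrix:

import numpy
import pandas

# Dummy encode a categorical vector into K = 3 boolean indicator columns (D).
D = pandas.get_dummies(pandas.Series(["a", "b", "c", "b"])).to_numpy(dtype=float)

# A treatment-style K x (K - 1) coding matrix (C) that contrasts each
# level against the first.
C = numpy.array([
    [0, 0],
    [1, 0],
    [0, 1],
])

# The contrast coded columns (E) are the matrix product of D and C.
E = D @ C  # shape (4, 2)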
The easiest way to construct a coding matrix is to start with a "coefficient matrix" $Z \in \mathbb{R}^{K \times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the contrasts). For a consistently chosen set of contrasts, this matrix will be full-rank, and its inverse will have a constant column representing the global intercept. Removing this column results in the $K \times (K-1)$ coding matrix that should be applied to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \begin{align} Z =& \left(\begin{array}{c c c c} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0\\ -1 & 0 & 1 & 0\\ -1 & 0 & 0 & 1 \end{array}\right)\\ \therefore Z^{-1} =& \left(\begin{array}{c c c c} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 0 & 1 \end{array}\right)\\ \implies C =& \left(\begin{array}{c c c} 0 & 0 & 0 \\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{array}\right) \end{align} $$ This is none other than the default "treatment" coding described below, which one-hot encodes every level other than the reference (first) level.
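You can verify this numerically with a few lines of plain numpy, mirroring the worked example above:

import numpy

# The coefficient matrix from the example above: the first row is the
# reference level, and each remaining row is "level k minus level 1".
Z = numpy.array([
    [1, 0, 0, 0],
    [-1, 1, 0, 0],
    [-1, 0, 1, 0],
    [-1, 0, 0, 1],
])

# Inverting Z and dropping the constant intercept column recovers the
# treatment coding matrix C shown above.
C = numpy.linalg.inv(Z)[:, 1:]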
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to retrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the "correct" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
Contrast codings
This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in `formulaic.transforms.contrasts`, and are surfaced by default in formulae as attributes of `contr` (e.g. `contr.treatment`, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters, you can directly import these contrast implementations and play with them in a Python shell; for brevity, we will not exhaustively show these in the documentation that follows. For example:
from formulaic.transforms.contrasts import C, TreatmentContrasts
TreatmentContrasts(base="B").get_coding_matrix(["A", "B", "C", "D"])
| | A | C | D |
|---|---|---|---|
| A | 1.0 | 0.0 | 0.0 |
| B | 0.0 | 0.0 | 0.0 |
| C | 0.0 | 1.0 | 0.0 |
| D | 0.0 | 0.0 | 1.0 |
TreatmentContrasts(base="B").get_coefficient_matrix(["A", "B", "C", "D"])
| | A | B | C | D |
|---|---|---|---|---|
| B | 0.0 | 1.0 | 0.0 | 0.0 |
| A-B | 1.0 | -1.0 | -0.0 | -0.0 |
| C-B | 0.0 | -1.0 | 1.0 | 0.0 |
| D-B | 0.0 | -1.0 | 0.0 | 1.0 |
Treatment (aka. dummy)
This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the `TreatmentContrasts`/`contr.treatment` constructor.
Example formulae:
- `~ X`: Assuming `X` is categorical, the treatment encoding will be used by default.
- `~ C(X)`: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.
- `~ C(X, contr.treatment)`: Explicitly indicate that the treatment encoding should be used.
- `~ C(X, contr.treatment("x"))`: Indicate that the reference level should be "x" instead of the first level.
- `~ C(X, contr.treatment(base="x"))`: As above.
model_matrix("C(letters, contr.treatment)", df)
| | Intercept | C(letters, contr.treatment)[T.b] | C(letters, contr.treatment)[T.c] |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 1 | 1.0 | 1 | 0 |
| 2 | 1.0 | 0 | 1 |
SAS Treatment
The contrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
- `~ C(X, contr.SAS)`: Basic use-case.
- `~ C(X, contr.SAS("x"))`: Same as the treatment encoding case above.
- `~ C(X, contr.SAS(base="x"))`: Same as the treatment encoding case above.
model_matrix("C(letters, contr.SAS)", df)
| | Intercept | C(letters, contr.SAS)[T.a] | C(letters, contr.SAS)[T.b] |
|---|---|---|---|
| 0 | 1.0 | 1 | 0 |
| 1 | 1.0 | 0 | 1 |
| 2 | 1.0 | 0 | 0 |
Sum (or Deviation)
These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
- `~ C(X, contr.sum)`: Encode categorical data using the sum coding.
model_matrix("C(letters, contr.sum)", df)
| | Intercept | C(letters, contr.sum)[S.a] | C(letters, contr.sum)[S.b] |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 |
| 2 | 1.0 | -1.0 | -1.0 |
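To see where these values come from, here is a short numpy sketch (illustrative only, not Formulaic's implementation) that builds the coefficient matrix for "each level versus the grand mean" and recovers the sum coding shown above:

import numpy

# Rows: the grand mean, then "a minus the grand mean" and "b minus the
# grand mean" (the contrast for "c" is redundant and omitted).
Z = numpy.array([
    [1/3, 1/3, 1/3],
    [2/3, -1/3, -1/3],
    [-1/3, 2/3, -1/3],
])

# Dropping the intercept column of the inverse yields the sum coding
# matrix: [[1, 0], [0, 1], [-1, -1]].
numpy.linalg.inv(Z)[:, 1:]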
Helmert
These contrasts compare each successive level to the average of all previous/subsequent levels. It has two configurable parameters: `reverse`, which controls the direction of comparison, and `scale`, which controls whether to scale the encoding to simplify interpretation of the coefficients (resulting in a floating point model matrix instead of an integer one). When `reverse` is `True`, the contrasts compare a level to all previous levels; when it is `False`, they compare it to all subsequent levels.

The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
- `~ C(X, contr.helmert)`: Unscaled reverse coding.
- `~ C(X, contr.helmert(reverse=False))`: Unscaled forward coding.
- `~ C(X, contr.helmert(scale=True))`: Scaled reverse coding.
- `~ C(X, contr.helmert(scale=True, reverse=False))`: Scaled forward coding.
model_matrix("C(letters, contr.helmert)", df)
| | Intercept | C(letters, contr.helmert)[H.b] | C(letters, contr.helmert)[H.c] |
|---|---|---|---|
| 0 | 1.0 | -1.0 | -1.0 |
| 1 | 1.0 | 1.0 | -1.0 |
| 2 | 1.0 | 0.0 | 2.0 |
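If you want to inspect the resulting matrices for a given configuration directly, the approach shown earlier for `TreatmentContrasts` should carry over (assuming here that `HelmertContrasts` is importable from `formulaic.transforms.contrasts` and exposes the same methods):

from formulaic.transforms.contrasts import HelmertContrasts

# The default (reversed, unscaled) Helmert coding matrix for three levels;
# its columns correspond to the [H.b] and [H.c] columns above.
HelmertContrasts().get_coding_matrix(["a", "b", "c"])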
Diff
These contrasts take the difference of each level with the previous level. They have one parameter, `forward`, which inverts the difference such that it is taken between the previous level and the current level. The default attribute values are chosen to match the R implementation, and correspond to a backward difference coding.
Example formulae:
- `~ C(X, contr.diff)`: Backward coding.
- `~ C(X, contr.diff(forward=True))`: Forward coding.
model_matrix("C(letters, contr.diff)", df)
| | Intercept | C(letters, contr.diff)[D.b] | C(letters, contr.diff)[D.c] |
|---|---|---|---|
| 0 | 1.0 | -0.666667 | -0.333333 |
| 1 | 1.0 | 0.333333 | -0.333333 |
| 2 | 1.0 | 0.333333 | 0.666667 |
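The fractional values above fall out of the coefficient-matrix construction described earlier. A numpy sketch (again illustrative only) that encodes "the grand mean, b - a, and c - b" reproduces the coding:

import numpy

# Rows: the grand mean, then the successive differences b - a and c - b.
Z = numpy.array([
    [1/3, 1/3, 1/3],
    [-1, 1, 0],
    [0, -1, 1],
])

# Dropping the intercept column of the inverse yields the backward
# difference coding: [[-2/3, -1/3], [1/3, -1/3], [1/3, 2/3]].
numpy.linalg.inv(Z)[:, 1:]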
Polynomial
The "contrasts" represent a categorical variable that is assumed to equal (or
known) spacing/scores, and allow us to model non-linear polynomial behaviour of
the dependent variable with respect to the ordered levels by projecting the
spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as
the number of levels. If not provided, the categories are assumed equidistant
and spaced by 1.
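One way to derive such a coding (not necessarily Formulaic's exact implementation) is to orthonormalize a polynomial basis over the scores via a QR decomposition; for the default equidistant scores this reproduces the `.L` and `.Q` columns shown in the example below:

import numpy

scores = numpy.array([1.0, 2.0, 3.0])  # default: equidistant, spaced by 1

# Polynomial basis with columns 1, x, x^2, orthonormalized via QR.
basis = numpy.vander(scores, increasing=True)
Q, R = numpy.linalg.qr(basis)

# Fix the column signs and drop the constant column, leaving the linear
# (.L) and quadratic (.Q) contrasts.
contrasts = (Q * numpy.sign(numpy.diag(R)))[:, 1:]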
model_matrix("C(letters, contr.poly)", df)
| | Intercept | C(letters, contr.poly).L | C(letters, contr.poly).Q |
|---|---|---|---|
| 0 | 1.0 | -0.707107 | 0.408248 |
| 1 | 1.0 | 0.000000 | -0.816497 |
| 2 | 1.0 | 0.707107 | 0.408248 |
Aliasing categorical features
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via the formula context). This is done automatically if you use the `model_matrix` function. For example:
my_letters = C(df.letters, TreatmentContrasts(base="b"))
model_matrix("my_letters", df)
| | Intercept | my_letters[T.a] | my_letters[T.c] |
|---|---|---|---|
| 0 | 1.0 | 1 | 0 |
| 1 | 1.0 | 0 | 0 |
| 2 | 1.0 | 0 | 1 |
Writing your own coding
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the `CustomContrasts` class directly or via the `contr.custom` alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
import numpy
Z = numpy.array([
[1, 0, 0, 0], # A
[-1, 1, 0, 0], # B - A
[0, -1, 1, 0], # C - B
[-1, 0, 0, 1], # D - A
])
coding = numpy.linalg.inv(Z)[:,1:]
coding
array([[0., 0., 0.],
       [1., 0., 0.],
       [1., 1., 0.],
       [0., 0., 1.]])
model_matrix("C(letters, contr.custom(coding))", DataFrame({"letters": ["A", "B", "C", "D"]}))
| | Intercept | C(letters, contr.custom(coding))[1] | C(letters, contr.custom(coding))[2] | C(letters, contr.custom(coding))[3] |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 1.0 |
Guaranteeing Structural Full Rankness
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
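To see the structural problem concretely, consider a quick numpy check: an intercept column alongside all $K$ dummy columns is rank deficient, because the dummy columns sum to the intercept.

import numpy

intercept = numpy.ones((3, 1))
D = numpy.eye(3)  # full dummy encoding of three single-member categories

# Stacking the intercept with all K dummy columns loses a rank.
X = numpy.hstack([intercept, D])
numpy.linalg.matrix_rank(X)  # 3, despite X having 4 columns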
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vector space. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional: while Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can disable Formulaic's structural full-rankness algorithms by passing `ensure_full_rank=False` to the `model_matrix()` or `.get_model_matrix()` methods, and can bypass the rank reduction of a single categorical term in a formula using `C(..., spans_intercept=False)` (this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
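For example, both escape hatches could be used as follows (a sketch using the `df` defined at the top of this guide; the exact columns produced will depend on your data):

# Disable the structural full-rankness algorithm for the whole matrix;
# categorical features are then fully dummy encoded (K columns, not K - 1).
model_matrix("letters + numbers", df, ensure_full_rank=False)

# Or flag a single categorical term as not spanning the intercept, so that
# its rank is not reduced (intentionally over-specifying the model).
model_matrix("C(letters, spans_intercept=False)", df)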
The algorithm that Formulaic uses was heavily inspired by patsy[^1]. The basic idea is to recognize that all categorical codings span the intercept[^2], and then to break each coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. The categorical factors are then expanded associatively, and the components greedily recombined, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are "complete", in that each unit has been assigned a category. You can "complete" categories by treating all those unassigned as being a member of an imputed "null" category.