Categorical Encoding
Categorical data (also known as "factors") is encoded in model matrices using "contrast codings" that transform categorical vectors into a collection of numerical vectors suitable for use in regression models. In this guide we begin with some basic examples, before introducing the concepts behind contrast codings, explaining how to select and/or design your own coding, and (for more advanced readers) describing how we guarantee structural full-rankness of formulae with complex interactions between categorical and other features.
Basic usage
Formulaic follows the lead of R and Patsy by automatically inferring from the data whether a feature needs to be categorically encoded. For example:
from formulaic import model_matrix
from pandas import DataFrame, Categorical
df = DataFrame({
"letters": ["a", "b", "c"],
"numbers": Categorical([1,2,3]),
"values": [20, 200, 30],
})
model_matrix("letters + numbers + values", df)
| | Intercept | letters[T.b] | letters[T.c] | numbers[T.2] | numbers[T.3] | values |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 0 | 0 | 0 | 20 |
| 1 | 1.0 | 1 | 0 | 1 | 0 | 200 |
| 2 | 1.0 | 0 | 1 | 0 | 1 | 30 |
Here `letters` was identified as a categorical variable because it consists of strings, `numbers` was identified as categorical because of its data type, and `values` was treated as a vector of numerical values. The categorical data was encoded using the default "Treatment" encoding (aka. "Dummy"; see below for more details).
If we want to force Formulaic to treat a column as categorical, we can use the `C()` transform (just as in Patsy and R). For example:
model_matrix("C(values)", df)
| | Intercept | C(values)[T.30] | C(values)[T.200] |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 1 | 1.0 | 0 | 1 |
| 2 | 1.0 | 1 | 0 |
The `C()` transform tells Formulaic that the column should be encoded as categorical data, and allows you to customise how the encoding is performed. For example, we could use polynomial coding (detailed below) and explicitly specify the categorical levels and their order using:
model_matrix("C(values, contr.poly, levels=[10, 20, 30])", df)
/home/matthew/Repositories/github/formulaic/formulaic/transforms/contrasts.py:124: DataMismatchWarning: Data has categories outside of the nominated levels (or that were not seen in original dataset): {200}. They are being cast to nan, which will likely skew the results of your analyses.
  warnings.warn(
| | Intercept | C(values, contr.poly, levels=[10, 20, 30]).L | C(values, contr.poly, levels=[10, 20, 30]).Q |
|---|---|---|---|
| 0 | 1.0 | 0.000000 | -0.816497 |
| 1 | 1.0 | 0.000000 | 0.000000 |
| 2 | 1.0 | 0.707107 | 0.408248 |
Where possible, as you can see above, we also provide warnings when a categorical encoding does not reflect the structure of the data.
How does contrast coding work?
As you have seen, contrast coding transforms categorical vectors into a matrix of numbers that can be used during modeling. If your data has $K$ mutually exclusive categories, these matrices typically consist of $K-1$ columns. This reduction in dimensionality reflects the fact that membership of the $K$th category could be inferred from the lack of membership in any other category, and so is redundant in the presence of a global intercept. You can read more about this in the full rankness discussion below.
The first step toward generating numerical vectors from categorical data is to dummy encode it. This transforms the single vector of $K$ categories into $K$ boolean vectors, each having a $1$ only in rows that are members of the corresponding category. If your model does not have a global intercept, you can use this dummy encoding directly, with all $K$ columns, and contrasts are unnecessary. When a global intercept is present, however, you need to reduce the rank of your coding by thinking about contrasts (or differences) between the levels.
In practice, this dimension reduction using "contrasts" looks like constructing a $K \times (K-1)$ "coding matrix" that describes the contrasts of interest. You can then post-multiply your dummy-encoded columns by it. That is: $$ E = DC $$ where $E \in \mathbb{R}^{N \times (K-1)}$ is the contrast coded categorical data, $D \in \{0, 1\}^{N \times K}$ is the dummy encoded data, and $C \in \mathbb{R}^{K \times (K-1)}$ is the coding matrix.
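To make this concrete, here is a small sketch (plain pandas/numpy, purely for illustration and not Formulaic's internal implementation) that dummy encodes a short categorical vector and post-multiplies it by a treatment-style coding matrix:

import numpy
import pandas

# Dummy encode a categorical vector into K = 3 boolean indicator columns (D).
D = pandas.get_dummies(pandas.Series(["a", "b", "c", "b"])).to_numpy(dtype=float)

# A treatment-style K x (K - 1) coding matrix (C) that contrasts each
# level against the first.
C = numpy.array([
    [0, 0],
    [1, 0],
    [0, 1],
])

# The contrast coded columns (E) are the matrix product of D and C.
E = D @ C  # shape (4, 2)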
The easiest way to construct a coding matrix is to start with a "coefficient matrix" $Z \in \mathbb{R}^{K \times K}$ which describes the contrasts that you want the coefficients of a trained linear regression model to represent (with columns representing the untransformed levels, and rows representing the contrasts). For a consistently chosen set of contrasts, this matrix will be full-rank, and its inverse will have a constant column representing the global intercept. Removing this column results in the $K \times (K-1)$ coding matrix that should be applied to the dummy encoded data in order for the coefficients to have the desired interpretation.
For example, if we wanted all of the levels to be compared to the first level, we would build a matrix $Z$ as: $$ \begin{align} Z =& \left(\begin{array}{c c c c} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0\\ -1 & 0 & 1 & 0\\ -1 & 0 & 0 & 1 \end{array}\right)\\ \therefore Z^{-1} =& \left(\begin{array}{c c c c} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 0 & 1 \end{array}\right)\\ \implies C =& \left(\begin{array}{c c c} 0 & 0 & 0 \\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{array}\right) \end{align} $$ This is none other than the default "treatment" coding described below, which one-hot encodes every level other than the reference (first) level.
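You can verify this numerically with a few lines of plain numpy, mirroring the worked example above:

import numpy

# The coefficient matrix from the example above: the first row is the
# reference level, and each remaining row is "level k minus level 1".
Z = numpy.array([
    [1, 0, 0, 0],
    [-1, 1, 0, 0],
    [-1, 0, 1, 0],
    [-1, 0, 0, 1],
])

# Inverting Z and dropping the constant intercept column recovers the
# treatment coding matrix C shown above.
C = numpy.linalg.inv(Z)[:, 1:]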
It is important to note that while your choice of contrast coding will change the interpretation and values of your coefficients, all contrast encodings ultimately result in equivalent regressions, and it is possible to retrospectively infer any other set of interesting contrasts given the regression covariance matrix. The task is therefore to find the most useful representation, not the "correct" one.
For those interested in reading more, the R Documentation on Coding Matrices covers this in more detail.
Contrast codings
This section introduces the contrast encodings that are shipped as part of Formulaic. These implementations live in `formulaic.transforms.contrasts`, and are surfaced by default in formulae as attributes of `contr` (e.g. `contr.treatment`, in order to be consistent with R). You can always implement your own contrasts if the need arises.
If you would like to dig deeper and see the actual contrast/coefficient matrices for various parameters, you can directly import these contrast implementations and play with them in a Python shell; for brevity, we will not exhaustively show these in the documentation that follows. For example:
from formulaic.transforms.contrasts import C, TreatmentContrasts
TreatmentContrasts(base="B").get_coding_matrix(["A", "B", "C", "D"])
| | A | C | D |
|---|---|---|---|
| A | 1.0 | 0.0 | 0.0 |
| B | 0.0 | 0.0 | 0.0 |
| C | 0.0 | 1.0 | 0.0 |
| D | 0.0 | 0.0 | 1.0 |
TreatmentContrasts(base="B").get_coefficient_matrix(["A", "B", "C", "D"])
| | A | B | C | D |
|---|---|---|---|---|
| B | 0.0 | 1.0 | 0.0 | 0.0 |
| A-B | 1.0 | -1.0 | -0.0 | -0.0 |
| C-B | 0.0 | -1.0 | 1.0 | 0.0 |
| D-B | 0.0 | -1.0 | 0.0 | 1.0 |
Treatment (aka. dummy)
This contrast coding compares each level with some reference level. If not specified, the reference level is taken to be the first level. The reference level can be specified as the first argument to the `TreatmentContrasts`/`contr.treatment` constructor.
Example formulae:
- `~ X`: Assuming `X` is categorical, the treatment encoding will be used by default.
- `~ C(X)`: You can also explicitly flag a feature to be encoded as categorical, whereupon the default is treatment encoding.
- `~ C(X, contr.treatment)`: Explicitly indicate that the treatment encoding should be used.
- `~ C(X, contr.treatment("x"))`: Indicate that the reference level should be "x" instead of the first level.
- `~ C(X, contr.treatment(base="x"))`: As above.
model_matrix("C(letters, contr.treatment)", df)
| | Intercept | C(letters, contr.treatment)[T.b] | C(letters, contr.treatment)[T.c] |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 1 | 1.0 | 1 | 0 |
| 2 | 1.0 | 0 | 1 |
SAS Treatment
The contrasts generated by this class are the same as the above, but with the reference level defaulting to the last level (the default in SAS).
Example formulae:
- `~ C(X, contr.SAS)`: Basic use-case.
- `~ C(X, contr.SAS("x"))`: Same as the treatment encoding case above.
- `~ C(X, contr.SAS(base="x"))`: Same as the treatment encoding case above.
model_matrix("C(letters, contr.SAS)", df)
| | Intercept | C(letters, contr.SAS)[T.a] | C(letters, contr.SAS)[T.b] |
|---|---|---|---|
| 0 | 1.0 | 1 | 0 |
| 1 | 1.0 | 0 | 1 |
| 2 | 1.0 | 0 | 0 |
Sum (or Deviation)
These contrasts compare each level (except the last, which is redundant) to the global average of all levels.
Example formulae:
- `~ C(X, contr.sum)`: Encode categorical data using the sum coding.
model_matrix("C(letters, contr.sum)", df)
| | Intercept | C(letters, contr.sum)[S.a] | C(letters, contr.sum)[S.b] |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 |
| 2 | 1.0 | -1.0 | -1.0 |
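To see where these values come from, here is a short numpy sketch (illustrative only, not Formulaic's implementation) that builds the coefficient matrix for "each level versus the grand mean" and recovers the sum coding shown above:

import numpy

# Rows: the grand mean, then "a minus the grand mean" and "b minus the
# grand mean" (the contrast for "c" is redundant and omitted).
Z = numpy.array([
    [1/3, 1/3, 1/3],
    [2/3, -1/3, -1/3],
    [-1/3, 2/3, -1/3],
])

# Dropping the intercept column of the inverse yields the sum coding
# matrix: [[1, 0], [0, 1], [-1, -1]].
numpy.linalg.inv(Z)[:, 1:]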
Helmert
These contrasts compare each successive level to the average of all previous/subsequent levels. It has two configurable parameters: `reverse`, which controls the direction of comparison, and `scale`, which controls whether to scale the encoding to simplify interpretation of the coefficients (resulting in a floating point model matrix instead of an integer one). When `reverse` is `True`, the contrasts compare a level to all previous levels; when it is `False`, they compare it to all subsequent levels.

The default parameter values are chosen to match the R implementation, which corresponds to a reversed and unscaled Helmert coding.
Example formulae:
- `~ C(X, contr.helmert)`: Unscaled reverse coding.
- `~ C(X, contr.helmert(reverse=False))`: Unscaled forward coding.
- `~ C(X, contr.helmert(scale=True))`: Scaled reverse coding.
- `~ C(X, contr.helmert(scale=True, reverse=False))`: Scaled forward coding.
model_matrix("C(letters, contr.helmert)", df)
| | Intercept | C(letters, contr.helmert)[H.b] | C(letters, contr.helmert)[H.c] |
|---|---|---|---|
| 0 | 1.0 | -1.0 | -1.0 |
| 1 | 1.0 | 1.0 | -1.0 |
| 2 | 1.0 | 0.0 | 2.0 |
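If you want to inspect the resulting matrices for a given configuration directly, the approach shown earlier for `TreatmentContrasts` should carry over (assuming here that `HelmertContrasts` is importable from `formulaic.transforms.contrasts` and exposes the same methods):

from formulaic.transforms.contrasts import HelmertContrasts

# The default (reversed, unscaled) Helmert coding matrix for three levels;
# its columns correspond to the [H.b] and [H.c] columns above.
HelmertContrasts().get_coding_matrix(["a", "b", "c"])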
Diff
These contrasts take the difference of each level with the previous level. They have one parameter, `forward`, which inverts the difference such that it is taken between the previous level and the current level. The default attribute values are chosen to match the R implementation, and correspond to a backward difference coding.
Example formulae:
- `~ C(X, contr.diff)`: Backward coding.
- `~ C(X, contr.diff(forward=True))`: Forward coding.
model_matrix("C(letters, contr.diff)", df)
| | Intercept | C(letters, contr.diff)[D.b] | C(letters, contr.diff)[D.c] |
|---|---|---|---|
| 0 | 1.0 | -0.666667 | -0.333333 |
| 1 | 1.0 | 0.333333 | -0.333333 |
| 2 | 1.0 | 0.333333 | 0.666667 |
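The fractional values above fall out of the coefficient-matrix construction described earlier. A numpy sketch (again illustrative only) that encodes "the grand mean, b - a, and c - b" reproduces the coding:

import numpy

# Rows: the grand mean, then the successive differences b - a and c - b.
Z = numpy.array([
    [1/3, 1/3, 1/3],
    [-1, 1, 0],
    [0, -1, 1],
])

# Dropping the intercept column of the inverse yields the backward
# difference coding: [[-2/3, -1/3], [1/3, -1/3], [1/3, 2/3]].
numpy.linalg.inv(Z)[:, 1:]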
Polynomial
The "contrasts" represent a categorical variable that is assumed to equal (or
known) spacing/scores, and allow us to model non-linear polynomial behaviour of
the dependent variable with respect to the ordered levels by projecting the
spacing onto a basis of orthogonal polynomials. It has one parameter, scores
which indicates the spacing of the categories. It must have the same length as
the number of levels. If not provided, the categories are assumed equidistant
and spaced by 1.
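One way to derive such a coding (not necessarily Formulaic's exact implementation) is to orthonormalize a polynomial basis over the scores via a QR decomposition; for the default equidistant scores this reproduces the `.L` and `.Q` columns shown in the example below:

import numpy

scores = numpy.array([1.0, 2.0, 3.0])  # default: equidistant, spaced by 1

# Polynomial basis with columns 1, x, x^2, orthonormalized via QR.
basis = numpy.vander(scores, increasing=True)
Q, R = numpy.linalg.qr(basis)

# Fix the column signs and drop the constant column, leaving the linear
# (.L) and quadratic (.Q) contrasts.
contrasts = (Q * numpy.sign(numpy.diag(R)))[:, 1:]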
model_matrix("C(letters, contr.poly)", df)
| | Intercept | C(letters, contr.poly).L | C(letters, contr.poly).Q |
|---|---|---|---|
| 0 | 1.0 | -0.707107 | 0.408248 |
| 1 | 1.0 | 0.000000 | -0.816497 |
| 2 | 1.0 | 0.707107 | 0.408248 |
Aliasing categorical features
The feature names of categorical variables can become quite unwieldy, as you may have noticed. Fortunately this is easily remedied by aliasing the variable outside of your formula (and then making it available via the formula context). This is done automatically if you use the `model_matrix` function. For example:
my_letters = C(df.letters, TreatmentContrasts(base="b"))
model_matrix("my_letters", df)
| | Intercept | my_letters[T.a] | my_letters[T.c] |
|---|---|---|---|
| 0 | 1.0 | 1 | 0 |
| 1 | 1.0 | 0 | 0 |
| 2 | 1.0 | 0 | 1 |
Writing your own coding
It may be useful to define your own coding matrices in some contexts. This is readily achieved using the `CustomContrasts` class directly or via the `contr.custom` alias. In these cases, you are responsible for providing the coding matrix ($C$ from above). For example, if you had four levels: A, B, C and D, and wanted to compute the contrasts: B - A, C - B, and D - A, you could write:
import numpy
Z = numpy.array([
[1, 0, 0, 0], # A
[-1, 1, 0, 0], # B - A
[0, -1, 1, 0], # C - B
[-1, 0, 0, 1], # D - A
])
coding = numpy.linalg.inv(Z)[:,1:]
coding
array([[0., 0., 0.],
       [1., 0., 0.],
       [1., 1., 0.],
       [0., 0., 1.]])
model_matrix("C(letters, contr.custom(coding))", DataFrame({"letters": ["A", "B", "C", "D"]}))
| | Intercept | C(letters, contr.custom(coding))[1] | C(letters, contr.custom(coding))[2] | C(letters, contr.custom(coding))[3] |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 1.0 |
Guaranteeing Structural Full Rankness
The model matrices generated from formulae are often consumed directly by linear regression algorithms. In these cases, if your model matrix is not full rank, then the features in your model are not linearly independent, and the resulting coefficients (assuming they can be computed at all) cannot be uniquely determined. While there are ways to handle this, such as regularization, it is usually easiest to put in a little more effort during the model matrix creation process, and make the incoming vectors in your model matrix linearly independent from the outset. As noted in the text above, categorical coding requires consideration about the overlap of the coding with the intercept in order to remain full rank. The good news is that Formulaic will do most of the heavy lifting for you, and does so by default.
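To see the structural problem concretely, consider a quick numpy check: an intercept column alongside all $K$ dummy columns is rank deficient, because the dummy columns sum to the intercept.

import numpy

intercept = numpy.ones((3, 1))
D = numpy.eye(3)  # full dummy encoding of three single-member categories

# Stacking the intercept with all K dummy columns loses a rank.
X = numpy.hstack([intercept, D])
numpy.linalg.matrix_rank(X)  # 3, despite X having 4 columns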
It is important to note at this point that Formulaic does not protect against all forms of linear dependence, only structural linear dependence; i.e. linear dependence that results from multiple categorical variables overlapping in vector space. If you have two identical numerical vectors called by two different names in your model matrix, Formulaic will happily build the model matrix you requested, and you're on your own. This is intentional: while Formulaic strives to make the model matrix generation process as painless as possible, it also doesn't want to make more assumptions about the use of the data than is necessary. Note that you can disable Formulaic's structural full-rankness algorithms by passing `ensure_full_rank=False` to the `model_matrix()` or `.get_model_matrix()` methods, and can bypass the rank reduction of a single categorical term in a formula using `C(..., spans_intercept=False)` (this is especially useful, for example, if your model includes regularization and you would prefer to use the over-specified model to ensure fairer shrinkage).
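For example, both escape hatches could be used as follows (a sketch using the `df` defined at the top of this guide; the exact columns produced will depend on your data):

# Disable the structural full-rankness algorithm for the whole matrix;
# categorical features are then fully dummy encoded (K columns, not K - 1).
model_matrix("letters + numbers", df, ensure_full_rank=False)

# Or flag a single categorical term as not spanning the intercept, so that
# its rank is not reduced (intentionally over-specifying the model).
model_matrix("C(letters, spans_intercept=False)", df)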
The algorithm that Formulaic uses was heavily inspired by patsy[^1]. The basic idea is to recognize that all categorical codings span the intercept[^2], and then to break each coding up into two pieces: a single column that can be dropped to avoid spanning the intercept, and the remaining body of the coding that will always be present. The categorical factors are then expanded associatively, and the components greedily recombined, omitting any that would lead to structural linear dependence. The result is a set of categorical codings that only spans the intercept when it is safe to do so, guaranteeing structural full rankness. The patsy documentation goes into this in much more detail if this is interesting to you.
[^2]: This assumes that categories are "complete", in that each unit has been assigned a category. You can "complete" categories by treating all those unassigned as being a member of an imputed "null" category.