Handling Missing Data

Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either:

Automatically drop null rows from the dataset (the default).
Ignore nulls, and allow them to propagate naturally.
Raise an exception when null values are encountered.

You can specify the desired behaviour by passing an NAAction enum value (or its string value) to the materialization methods (model_matrix and *.get_model_matrix()). An example of each approach is shown below.

Dropping null rows (`NAAction.DROP`, or `"drop"`)

This is the default behaviour. Any row that has a null in any column used by the materialization is dropped from the resulting dataset. For example:

In [36]:

from formulaic import model_matrix
from formulaic.materializers import NAAction
from pandas import DataFrame, Categorical

df = DataFrame({
    "c": [1, 2, None, 4, 5],
    "C": Categorical(["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"])
})

model_matrix("c + C", df, na_action=NAAction.DROP)
# Equivalent to:
#   * model_matrix("c + C", df)
#   * model_matrix("c + C", df, na_action="drop")

Out[36]:

	Intercept	c	C[T.b]	C[T.e]
0	1.0	1.0	0	0
1	1.0	2.0	1	0
4	1.0	5.0	0	1

You can also specify additional rows to drop using the drop_rows argument:

In [24]:

model_matrix("c + C", df, drop_rows={0, 4})

Out[24]:

	Intercept	c	C[T.b]	C[T.c]	C[T.d]	C[T.e]
1	1.0	2.0	1	0	0	0

Note that the set passed to drop_rows is expected to be mutable, because it is also updated with the indices of any rows that are dropped automatically. This can be useful if you need to keep track of the dropped rows outside of the materialization procedure.

In [25]:

drop_rows = {0, 4}
model_matrix("c + C", df, drop_rows=drop_rows)
drop_rows

Out[25]:

{0, np.int64(2), np.int64(3), 4}

Ignore nulls (`NAAction.IGNORE`, or `"ignore"`)

If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action="ignore" to the materialization methods. Null values then remain in their columns, and nothing is done to stop them from propagating.

In [31]:

model_matrix("c + C", df, na_action="ignore")

Out[31]:

	Intercept	c	C[T.b]	C[T.c]	C[T.e]
0	1.0	1.0	0	0	0
1	1.0	2.0	1	0	0
2	1.0	NaN	0	1	0
3	1.0	4.0	0	0	0
4	1.0	5.0	0	0	1

Note the NaN in the c column, and that NaN does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. This could result in misleading model estimates, so care should be taken.

You can combine this with drop_rows, as described above, to manually filter out the null values you are concerned about.

Raise for null values (`NAAction.RAISE` or `"raise"`)

If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent you from accidentally biasing your model, but it also makes your code more brittle. For example:

In [41]:

try:
    model_matrix("c + C", df, na_action="raise")
except Exception as e:
    print(e)

Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.

As with ignoring nulls above, you can combine this raising behaviour with drop_rows to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.
