Handling Missing Data
Sooner or later, you will encounter datasets with null values, and it is important to know how their presence will impact your modeling. Formulaic model matrix materialization procedures allow you to specify how you want nulls to be handled. You can either:
- Automatically drop null rows from the dataset (the default).
- Ignore nulls, and allow them to propagate naturally.
- Raise an exception when null values are encountered.
You can specify the desired behaviour by passing an NAAction enum value (or string value thereof) to the materialization methods (model_matrix and *.get_model_matrix()). Examples of each of these approaches are shown below.
Dropping null rows (NAAction.DROP, or "drop")
This is the default behaviour: any row with a null in any column used by the materialization is dropped from the resulting dataset. For example:
from pandas import Categorical, DataFrame
from formulaic import model_matrix
from formulaic.materializers import NAAction
df = DataFrame(
    {
        "c": [1, 2, None, 4, 5],
        "C": Categorical(
            ["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"]
        ),
    }
)
model_matrix("c + C", df, na_action=NAAction.DROP)
# Equivalent to:
# * model_matrix("c + C", df)
# * model_matrix("c + C", df, na_action="drop")
| | Intercept | c | C[T.b] | C[T.c] | C[T.d] | C[T.e] |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2.0 | 1 | 0 | 0 | 0 |
| 4 | 1.0 | 5.0 | 0 | 0 | 0 | 1 |
You can also specify additional rows to drop using the drop_rows argument:
model_matrix("c + C", df, drop_rows={0, 4})
| | Intercept | c | C[T.b] | C[T.c] | C[T.d] | C[T.e] |
|---|---|---|---|---|---|---|
| 1 | 1.0 | 2.0 | 1 | 0 | 0 | 0 |
Note that the set passed to drop_rows is expected to be mutable, as it will also be updated with the indices of any rows that are dropped automatically. This can be useful if you need to keep track of this information outside of the materialization procedure.
drop_rows = {0, 4}
model_matrix("c + C", df, drop_rows=drop_rows)
drop_rows
{0, np.int64(2), np.int64(3), 4}
Ignore nulls (NAAction.IGNORE, or "ignore")
If your modeling toolkit can handle the presence of nulls, or you otherwise want to keep them in the dataset, you can pass na_action="ignore" to the materialization methods. This allows null values to remain in columns, and takes no action to prevent the propagation of nulls.
model_matrix("c + C", df, na_action="ignore")
| | Intercept | c | C[T.b] | C[T.c] | C[T.d] | C[T.e] |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 2.0 | 1 | 0 | 0 | 0 |
| 2 | 1.0 | NaN | 0 | 1 | 0 | 0 |
| 3 | 1.0 | 4.0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 5.0 | 0 | 0 | 0 | 1 |
Note the NaN in the c column, and that NaN does NOT appear in the dummy coding of C on row 3, consistent with standard implementations of dummy coding. Row 3 is therefore encoded exactly like the reference level a (all zeros), which could result in misleading model estimates, so care should be taken.
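The same effect can be reproduced outside of Formulaic. The sketch below uses pandas.get_dummies purely as a stand-in for a standard dummy-coding implementation (it is not how Formulaic encodes categoricals internally) to show that a null row becomes indistinguishable from a reference-level row.

```python
import pandas as pd

C = pd.Series(
    pd.Categorical(["a", "b", "c", None, "e"], categories=["a", "b", "c", "d", "e"])
)

# Treatment-style coding: drop the first level ("a") as the reference.
dummies = pd.get_dummies(C, drop_first=True, dtype=int)

# Rows whose dummy columns are all zero.
all_zero = dummies.sum(axis=1) == 0
indistinguishable = all_zero[all_zero].index.tolist()

print(dummies)
print(indistinguishable)  # row 0 is level "a", row 3 is the null
```

Both the genuine reference row and the null row come out as all zeros, which is why a downstream model cannot tell them apart once the matrix has been built.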
You can combine this with drop_rows, as described above, to manually filter out the null values you are concerned about.
Raise for null values (NAAction.RAISE, or "raise")
If you are unwilling to risk the perils of dropping or ignoring null values, you can instead opt to raise an exception whenever a null value is found. This can prevent you from accidentally biasing your model, but it also makes your code more brittle. For example:
try:
    model_matrix("c + C", df, na_action="raise")
except Exception as e:
    print(e)
Error encountered while checking for nulls in `C`: `C` contains null values after evaluation.
As with ignoring nulls above, you can combine this raising behaviour with drop_rows to manually filter out the null values that you feel you can safely ignore, and then raise if any additional null values make it into your data.