Using with existing libraries
As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:
- lifelines
- linearmodels
- Others?
Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!
StatsModels¶
statsmodels is a popular
toolkit hosting many different statistical models, tests, and exploration tools.
The formula API in statsmodels
is currently based on patsy
. If you need the
features found in Formulaic, you can use it directly to generate the model
matrices, and use the regular API. For example:
import pandas
from formulaic import model_matrix
from statsmodels.api import OLS
data = pandas.DataFrame({"y": [0.1, 0.4, 3], "a": [1,2,3], "b": ["A", "B", "C"]})
y, X = model_matrix("y ~ a + b", data)
model = OLS(y, X)
results = model.fit()
print(results.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 1.000 Model: OLS Adj. R-squared: nan Method: Least Squares F-statistic: nan Date: Fri, 14 Apr 2023 Prob (F-statistic): nan Time: 21:15:59 Log-Likelihood: 102.01 No. Observations: 3 AIC: -198.0 Df Residuals: 0 BIC: -200.7 Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.7857 inf -0 nan nan nan a 0.8857 inf 0 nan nan nan b[T.B] -0.5857 inf -0 nan nan nan b[T.C] 1.1286 inf 0 nan nan nan ============================================================================== Omnibus: nan Durbin-Watson: 0.820 Prob(Omnibus): nan Jarque-Bera (JB): 0.476 Skew: -0.624 Prob(JB): 0.788 Kurtosis: 1.500 Cond. No. 6.94 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The input rank is higher than the number of observations.
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given. warn("omni_normtest is not valid with less than 8 observations; %i " /home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid) /home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid) /home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide return np.dot(wresid, wresid) / self.df_resid
Scikit-Learn¶
scikit-learn is a very popular machine learning
toolkit for Python. You can use Formulaic directly, as for statsmodels
, or
as a module in scikit-learn pipelines along the lines of:
from typing import Iterable, List, Optional
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from formulaic import Formula, FormulaSpec, ModelSpec
class FormulaicTransformer(TransformerMixin, BaseEstimator):
def __init__(self, formula: FormulaSpec):
self.formula: Formula = Formula.from_spec(formula)
self.model_spec: Optional[ModelSpec] = None
if self.formula._has_structure:
raise ValueError(f"Formula specification {repr(formula)} results in a structured formula, which is not supported.")
def fit(self, X, y = None):
"""
Generate the initial model spec by which subsequent X's will be
transformed.
"""
self.model_spec = self.formula.get_model_matrix(X).model_spec
return self
def transform(self, X, y = None):
"""
Transform `X` by generating a model matrix from it based on the fit
model spec.
"""
if self.model_spec is None:
raise RuntimeError("`FormulaicTransformer.fit()` must be called before `.transform()`.")
X_ = self.model_spec.get_model_matrix(X)
return X_
def get_feature_names_out(self, input_features: Optional[Iterable[str]] = None) -> List[str]:
"""
Expose model spec column names to scikit learn to allow column transforms later in the pipeline.
"""
if self.model_spec is None:
raise RuntimeError("`FormulaicTransformer.fit()` must be called before columns can be assigned names.")
return self.model_spec.column_names
pipe = Pipeline([
("formula", FormulaicTransformer("x1 + x2 + x3")),
("model", LinearRegression())
])
pipe_fit = pipe.fit(pandas.DataFrame({"x1": [1,2,3], "x2": [2, 3.4, 6], "x3": [7, 3, 1]}), y=pandas.Series([1,3,5]))
pipe_fit
# Note: You could optionally serialize `pipe_fit` here.
# Then: Use the pipe to predict outcomes for new data.
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)), ('model', LinearRegression())])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)), ('model', LinearRegression())])
FormulaicTransformer(formula=1 + x1 + x2 + x3)
LinearRegression()