Using with existing libraries

As Formulaic matures it is expected that it will be integrated directly into downstream projects where formula parsing is required. This is known to have already happened in the following high-profile projects:

Where direct integration has not yet happened, you can still use Formulaic in conjunction with other commonly used libraries. On this page, we will add various examples how to achieve this. If you have done some integration work, please feel free to submit a PR that extends this documentation!

StatsModels¶

statsmodels is a popular toolkit hosting many different statistical models, tests, and exploration tools. The formula API in statsmodels is currently based on patsy. If you need the features found in Formulaic, you can use it directly to generate the model matrices, and use the regular API. For example:

In [1]:

            
                Copied!
                
                    
                    
                
                

        
import pandas
from formulaic import model_matrix
from statsmodels.api import OLS
data = pandas.DataFrame({"y": [0.1, 0.4, 3], "a": [1,2,3], "b": ["A", "B", "C"]})
y, X = model_matrix("y ~ a + b", data)
model = OLS(y, X)
results = model.fit()
print(results.summary())
import pandas
from formulaic import model_matrix
from statsmodels.api import OLS
data = pandas.DataFrame({"y": [0.1, 0.4, 3], "a": [1,2,3], "b": ["A", "B", "C"]})
y, X = model_matrix("y ~ a + b", data)
model = OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Fri, 14 Apr 2023   Prob (F-statistic):                nan
Time:                        21:15:59   Log-Likelihood:                 102.01
No. Observations:                   3   AIC:                            -198.0
Df Residuals:                       0   BIC:                            -200.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.7857        inf         -0        nan         nan         nan
a              0.8857        inf          0        nan         nan         nan
b[T.B]        -0.5857        inf         -0        nan         nan         nan
b[T.C]         1.1286        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.820
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.476
Skew:                          -0.624   Prob(JB):                        0.788
Kurtosis:                       1.500   Cond. No.                         6.94
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 3 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: divide by zero encountered in divide
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1765: RuntimeWarning: invalid value encountered in scalar multiply
  return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/matthew/.pyenv/versions/3.11.2/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1687: RuntimeWarning: divide by zero encountered in scalar divide
  return np.dot(wresid, wresid) / self.df_resid

Scikit-Learn¶

scikit-learn is a very popular machine learning toolkit for Python. You can use Formulaic directly, as for statsmodels, or as a module in scikit-learn pipelines along the lines of:

In [2]:

            
                Copied!
                
                    
                    
                
                

        
from typing import Iterable, List, Optional

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from formulaic import Formula, FormulaSpec, ModelSpec


class FormulaicTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, formula: FormulaSpec):
        self.formula: Formula = Formula.from_spec(formula)
        self.model_spec: Optional[ModelSpec] = None
        if self.formula._has_structure:
            raise ValueError(f"Formula specification {repr(formula)} results in a structured formula, which is not supported.")

    def fit(self, X, y = None):
        """
        Generate the initial model spec by which subsequent X's will be
        transformed.
        """
        self.model_spec = self.formula.get_model_matrix(X).model_spec
        return self

    def transform(self, X, y = None):
        """
        Transform `X` by generating a model matrix from it based on the fit
        model spec.
        """
        if self.model_spec is None:
            raise RuntimeError("`FormulaicTransformer.fit()` must be called before `.transform()`.")
        X_ = self.model_spec.get_model_matrix(X)
        return X_

    def get_feature_names_out(self, input_features: Optional[Iterable[str]] = None) -> List[str]:
        """
        Expose model spec column names to scikit learn to allow column transforms later in the pipeline.
        """
        if self.model_spec is None:
            raise RuntimeError("`FormulaicTransformer.fit()` must be called before columns can be assigned names.")
        return self.model_spec.column_names


pipe = Pipeline([
    ("formula", FormulaicTransformer("x1 + x2 + x3")),
    ("model", LinearRegression())
])
pipe_fit = pipe.fit(pandas.DataFrame({"x1": [1,2,3], "x2": [2, 3.4, 6], "x3": [7, 3, 1]}), y=pandas.Series([1,3,5]))
pipe_fit
# Note: You could optionally serialize `pipe_fit` here.
# Then: Use the pipe to predict outcomes for new data.
from typing import Iterable, List, Optional

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from formulaic import Formula, FormulaSpec, ModelSpec


class FormulaicTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, formula: FormulaSpec):
        self.formula: Formula = Formula.from_spec(formula)
        self.model_spec: Optional[ModelSpec] = None
        if self.formula._has_structure:
            raise ValueError(f"Formula specification {repr(formula)} results in a structured formula, which is not supported.")

    def fit(self, X, y = None):
        """
        Generate the initial model spec by which subsequent X's will be
        transformed.
        """
        self.model_spec = self.formula.get_model_matrix(X).model_spec
        return self

    def transform(self, X, y = None):
        """
        Transform `X` by generating a model matrix from it based on the fit
        model spec.
        """
        if self.model_spec is None:
            raise RuntimeError("`FormulaicTransformer.fit()` must be called before `.transform()`.")
        X_ = self.model_spec.get_model_matrix(X)
        return X_

    def get_feature_names_out(self, input_features: Optional[Iterable[str]] = None) -> List[str]:
        """
        Expose model spec column names to scikit learn to allow column transforms later in the pipeline.
        """
        if self.model_spec is None:
            raise RuntimeError("`FormulaicTransformer.fit()` must be called before columns can be assigned names.")
        return self.model_spec.column_names


pipe = Pipeline([
    ("formula", FormulaicTransformer("x1 + x2 + x3")),
    ("model", LinearRegression())
])
pipe_fit = pipe.fit(pandas.DataFrame({"x1": [1,2,3], "x2": [2, 3.4, 6], "x3": [7, 3, 1]}), y=pandas.Series([1,3,5]))
pipe_fit
# Note: You could optionally serialize `pipe_fit` here.
# Then: Use the pipe to predict outcomes for new data.

Out[2]:

Pipeline(steps=[('formula', FormulaicTransformer(formula=1 + x1 + x2 + x3)),
                ('model', LinearRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.