Machine Learning Models (tm.ml)

The tablemage.ml module contains the machine learning models used by the regress() and classify() methods of the tablemage.Analyzer class. These models are designed to be used much like the models in scikit-learn, with additional functionality for feature selection, hyperparameter optimization, and cross-validation.

tm.ml.LinearR

class tablemage.ml.LinearR(type: Literal['ols', 'l1', 'l2', 'elasticnet'] = 'ols', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Linear regression.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['ols', 'l1', 'l2', 'elasticnet'] = 'ols', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Initializes a LinearR object.

Parameters:
  • type (Literal['ols', 'l1', 'l2', 'elasticnet']) – Default: ‘ols’. The type of linear regression to be used.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSR]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, the name is set to be the class name.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
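As a rough illustration of what inner cross validation does during a grid search, the sketch below scores each candidate penalty by its mean validation error over seeded folds and keeps the best one. It is plain Python, not TableMage's HyperparameterSearcher; the helper names (kfold_indices, fit_ridge_slope, cv_score, grid_search) and the one-predictor ridge model are assumptions made for the example.

```python
import random
import statistics


def kfold_indices(n, n_splits=5, seed=42):
    """Shuffle indices with a fixed seed and deal them into n_splits folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def fit_ridge_slope(xs, ys, alpha):
    """No-intercept ridge regression with a single predictor."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_score(xs, ys, alpha, folds):
    """Mean validation MSE across folds for one candidate alpha."""
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        fold_errors.append(
            statistics.fmean((slope * xs[i] - ys[i]) ** 2 for i in held_out)
        )
    return statistics.fmean(fold_errors)


def grid_search(xs, ys, grid, n_splits=5, seed=42):
    """Score every alpha in the grid and return the one with the lowest CV error."""
    folds = kfold_indices(len(xs), n_splits, seed)
    scores = {alpha: cv_score(xs, ys, alpha, folds) for alpha in grid}
    return min(scores, key=scores.get), scores


# Noise-free toy data (y = 2x), so the unpenalized fit is the best candidate.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]
best_alpha, scores = grid_search(xs, ys, grid=[0.0, 1.0, 10.0, 100.0])
```

The fixed seed plays the same role as inner_cv_seed: the folds, and therefore the selected candidate, are reproducible from run to run.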

feature_importance() pd.DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. The data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if the user has not specified how to handle them.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. The preprocessing steps previously fitted on the train data are applied to the test data.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.
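The ordering of the steps above matters mainly because preprocessing is learned from the train data only. The following plain-Python sketch mimics that order with a single predictor; the helpers (drop_missing, fit_scaler, fit_ols, rmse) are hypothetical and stand in for the DataEmitter and the estimator. The hyperparameter search of step 2 is left out because ordinary least squares has nothing to tune.

```python
import math
import statistics


def drop_missing(rows):
    """Drop observations that contain a missing (None) entry."""
    return [row for row in rows if None not in row]


def fit_scaler(xs):
    """Learn centering and scaling constants from the training predictor only."""
    return statistics.fmean(xs), statistics.pstdev(xs)


def fit_ols(xs, ys):
    """Closed-form least squares for one predictor: y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on the given data."""
    intercept, slope = model
    return math.sqrt(
        statistics.fmean((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
    )


train = [(1.0, 3.0), (2.0, 5.0), (None, 4.0), (3.0, 7.0)]
test = [(4.0, 9.0), (5.0, None), (6.0, 13.0)]

# Step 1: emit the train data (drop missing rows, fit the scaler on train).
train_rows = drop_missing(train)
mean, std = fit_scaler([x for x, _ in train_rows])
x_train = [(x - mean) / std for x, _ in train_rows]
y_train = [y for _, y in train_rows]

# Step 2 (without a search): fit the estimator and evaluate it on the train data.
model = fit_ols(x_train, y_train)
train_rmse = rmse(model, x_train, y_train)

# Step 3: emit the test data using the constants learned on the train data.
test_rows = drop_missing(test)
x_test = [(x - mean) / std for x, _ in test_rows]
y_test = [y for _, y in test_rows]

# Step 4: evaluate the same fitted estimator on the test data.
test_rmse = rmse(model, x_test, y_test)
```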

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator() BaseEstimator

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline | InverseTransformRegressor

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Returns:

Returns either a Pipeline (from scikit-learn) or InverseTransformRegressor object. Both objects have a .predict(X) method. In the case that the target variable was transformed, an InverseTransformRegressor object is returned, which simply wraps around the Pipeline and inverse transforms the predictions as the final step.

Return type:

Pipeline | InverseTransformRegressor
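To make the wrapping idea concrete, here is a minimal sketch of a predictor that inverse transforms its predictions as the final step. InverseTransformSketch and MeanOfLogs are hypothetical stand-ins written for this example; they are not TableMage's InverseTransformRegressor or an sklearn Pipeline.

```python
import math


class MeanOfLogs:
    """Stand-in for a fitted pipeline: predicts the mean of the log10 targets."""

    def fit(self, X, log_y):
        self.mean_ = sum(log_y) / len(log_y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class InverseTransformSketch:
    """Wraps a fitted model and maps its predictions back to the original scale."""

    def __init__(self, model, inverse_func):
        self.model = model
        self.inverse_func = inverse_func

    def predict(self, X):
        return [self.inverse_func(p) for p in self.model.predict(X)]


# The target was log10-transformed before fitting, so predictions are in log units.
y = [1.0, 10.0, 100.0]
inner = MeanOfLogs().fit([[0], [1], [2]], [math.log10(v) for v in y])

# The wrapper undoes the transform, so callers receive values on the original scale.
wrapped = InverseTransformSketch(inner, lambda p: 10.0 ** p)
predictions = wrapped.predict([[3], [4]])
```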

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.RobustLinearR

class tablemage.ml.RobustLinearR(type: Literal['huber', 'ransac'] = 'huber', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Robust linear regressor.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['huber', 'ransac'] = 'huber', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Initializes a RobustLinearR object.

Parameters:
  • type (Literal['huber', 'ransac']) – Default: ‘huber’. The type of robust linear regression to be used.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSR]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, the name is set to be the class name.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
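For intuition about the 'huber' option, the sketch below compares the squared loss with a textbook Huber loss, which grows only linearly for large residuals and therefore gives outliers less influence. The threshold delta=1.35 is an illustrative choice, and the underlying estimator may scale or parameterize its loss differently.

```python
def squared_loss(residual):
    """Ordinary least squares penalty: grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.35):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


# Small residuals are penalized identically; a large one is penalized far less.
small = (squared_loss(0.5), huber_loss(0.5))
large = (squared_loss(10.0), huber_loss(10.0))
```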

feature_importance() pd.DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. The data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if the user has not specified how to handle them.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. The preprocessing steps previously fitted on the train data are applied to the test data.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator() BaseEstimator

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline | InverseTransformRegressor

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Returns:

Returns either a Pipeline (from scikit-learn) or InverseTransformRegressor object. Both objects have a .predict(X) method. In the case that the target variable was transformed, an InverseTransformRegressor object is returned, which simply wraps around the Pipeline and inverse transforms the predictions as the final step.

Return type:

Pipeline | InverseTransformRegressor

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.TreesR

class tablemage.ml.TreesR(type: Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf'] = 'random_forest', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Tree ensemble regressor.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf'] = 'random_forest', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Initializes a TreesR object.

Parameters:
  • type (Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf']) – Default: ‘random_forest’. The type of tree ensemble to use.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSR]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, the name is set to be the class name.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
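The interaction between feature_selectors and max_n_features can be pictured as a vote. In the sketch below, each selector contributes a set of chosen features, and with max_n_features=None every feature chosen by at least half of the selectors is kept, as described above. Keeping the top-voted features when max_n_features is given is an assumption of this example, as is the helper name select_by_votes; this is not the VotingSelectionReport implementation.

```python
from collections import Counter


def select_by_votes(votes_per_selector, max_n_features=None):
    """Pick features by vote count across several selectors.

    votes_per_selector: list of sets, one set of chosen feature names per selector.
    max_n_features: if None, keep every feature with at least 50% support;
    otherwise keep the max_n_features features with the most votes (assumed rule).
    """
    n_selectors = len(votes_per_selector)
    counts = Counter(name for chosen in votes_per_selector for name in chosen)
    # Rank by descending vote count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    if max_n_features is None:
        return [name for name in ranked if counts[name] / n_selectors >= 0.5]
    return ranked[:max_n_features]


votes = [{"age", "income"}, {"age", "height"}, {"age", "income", "zip"}]
majority = select_by_votes(votes)
top_one = select_by_votes(votes, max_n_features=1)
```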

feature_importance() pd.DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. The data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if the user has not specified how to handle them.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. The preprocessing steps previously fitted on the train data are applied to the test data.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.
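The cross validation metrics can be thought of as per-fold scores that are then summarized. The sketch below holds out each fold once, fits a trivial mean predictor on the remaining data, and averages the held-out errors. The function names and the mean predictor are assumptions chosen to keep the example short; they do not reflect how TableMage builds its fold-specific DataEmitters.

```python
import statistics


def mean_model_mae(train_y, eval_y):
    """Fit a constant (mean) predictor on train_y and return its MAE on eval_y."""
    center = statistics.fmean(train_y)
    return statistics.fmean(abs(y - center) for y in eval_y)


def cross_validated_mae(y, n_splits):
    """Hold out each contiguous fold once and return the per-fold MAE values.

    If len(y) is not divisible by n_splits, the trailing observations are
    never held out in this simplified version.
    """
    fold_size = len(y) // n_splits
    fold_maes = []
    for k in range(n_splits):
        start, stop = k * fold_size, (k + 1) * fold_size
        held_out = y[start:stop]
        training = y[:start] + y[stop:]
        fold_maes.append(mean_model_mae(training, held_out))
    return fold_maes


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_maes = cross_validated_mae(y, n_splits=3)
cv_mae = statistics.fmean(fold_maes)
```

Because every observation is held out exactly once here, the averaged score draws on more evaluation data than a single train/test split does.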

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator() BaseEstimator

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline | InverseTransformRegressor

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Returns:

Returns either a Pipeline (from scikit-learn) or InverseTransformRegressor object. Both objects have a .predict(X) method. In the case that the target variable was transformed, an InverseTransformRegressor object is returned, which simply wraps around the Pipeline and inverse transforms the predictions as the final step.

Return type:

Pipeline | InverseTransformRegressor

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.SVMR

class tablemage.ml.SVMR(type: Literal['linear', 'poly', 'rbf'] = 'rbf', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, name: str | None = None, **kwargs)

Support vector machine regressor.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['linear', 'poly', 'rbf'] = 'rbf', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, name: str | None = None, **kwargs)

Initializes an SVMR object.

Parameters:
  • type (Literal['linear', 'poly', 'rbf']) – Default: ‘rbf’. The type of kernel to use.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSR]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, the name is set to be the class name.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
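The three kernel types correspond to the standard kernel functions sketched below in plain Python. The gamma, degree, and coef0 defaults are illustrative assumptions, not the values searched by SVMR.

```python
import math


def linear_kernel(u, v):
    """Plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    return (gamma * linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


u, v = [1.0, 2.0], [3.0, 4.0]
similarities = {
    "linear": linear_kernel(u, v),
    "poly": poly_kernel(u, v),
    "rbf": rbf_kernel(u, v),
}
```

The linear kernel yields a model that is linear in the original features, while the polynomial and RBF kernels allow nonlinear fits at the cost of extra hyperparameters to tune.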

feature_importance() pd.DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. The data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if the user has not specified how to handle them.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. The preprocessing steps previously fitted on the train data are applied to the test data.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator() BaseEstimator

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline | InverseTransformRegressor

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Returns:

Returns either a Pipeline (from scikit-learn) or InverseTransformRegressor object. Both objects have a .predict(X) method. In the case that the target variable was transformed, an InverseTransformRegressor object is returned, which simply wraps around the Pipeline and inverse transforms the predictions as the final step.

Return type:

Pipeline | InverseTransformRegressor

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.MLPR

class tablemage.ml.MLPR(hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Multi-layer perceptron regressor.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, **kwargs)

Initializes an MLPR object.

Parameters:
  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSR]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, the name is set to be the class name.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
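To show what a trial budget such as n_trials controls, the sketch below runs a trial-based search in which each trial samples one candidate and the best score seen so far is kept. Plain seeded random sampling is substituted here for Optuna's samplers, and random_search and toy_objective are hypothetical names; the objective is a made-up validation loss, not a trained MLP.

```python
import math
import random


def random_search(objective, low, high, n_trials=100, seed=42):
    """Sample n_trials candidates log-uniformly in [low, high]; keep the lowest score."""
    rng = random.Random(seed)
    best_value, best_score = None, float("inf")
    for _ in range(n_trials):
        candidate = math.exp(rng.uniform(math.log(low), math.log(high)))
        score = objective(candidate)
        if score < best_score:
            best_value, best_score = candidate, score
    return best_value, best_score


def toy_objective(learning_rate):
    """Hypothetical validation loss, minimized at learning_rate = 0.01."""
    return (math.log10(learning_rate) + 2.0) ** 2


best_lr, best_loss = random_search(toy_objective, 1e-4, 1.0, n_trials=100)
short_lr, short_loss = random_search(toy_objective, 1e-4, 1.0, n_trials=10)
```

With the same seed, the 10-trial run sees a prefix of the candidates seen by the 100-trial run, so the larger budget can only match or improve the best loss.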

feature_importance() DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. This means that the data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if not specified by the user.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. Preprocessing steps were previously fitted on the train data. The test data is transformed accordingly.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.
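The four fitting steps can be sketched in plain Python. This is an illustration under stated assumptions, not TableMage code: a one-feature least-squares line stands in for the hyperparameter search, standardization stands in for the user-specified preprocessing, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def drop_missing(rows):
    """Drop observations that contain a missing entry (None)."""
    return [r for r in rows if None not in r]


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


train = [(1.0, 2.0), (2.0, 4.0), (None, 5.0), (3.0, 6.0)]
test = [(4.0, 8.0), (5.0, None), (6.0, 12.0)]

# Step 1: emit the train data (drop missing rows, fit the scaler on train only).
train = drop_missing(train)
mu = mean(x for x, _ in train)
sigma = pstdev([x for x, _ in train])
x_train = [(x - mu) / sigma for x, _ in train]
y_train = [y for _, y in train]

# Step 2: fit the stand-in model and evaluate it on the train data.
x_bar, y_bar = mean(x_train), mean(y_train)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train)) / sum(
    (x - x_bar) ** 2 for x in x_train
)
intercept = y_bar - slope * x_bar
train_mse = mse(y_train, [intercept + slope * x for x in x_train])

# Step 3: emit the test data, reusing the parameters learned on the train data.
test = drop_missing(test)
x_test = [(x - mu) / sigma for x, _ in test]
y_test = [y for _, y in test]

# Step 4: evaluate the fitted model on the test data.
test_mse = mse(y_test, [intercept + slope * x for x in x_test])
```

The point of the sketch is the ordering: mu and sigma are computed once from the train data and reused unchanged on the test data.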

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator() BaseEstimator

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline | InverseTransformRegressor

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage in a production environment. We recommend using TableMage to quickly identify promising models and then manually implementing and retraining the best model.

Returns:

Returns either a Pipeline (from scikit-learn) or InverseTransformRegressor object. Both objects have a .predict(X) method. In the case that the target variable was transformed, an InverseTransformRegressor object is returned, which simply wraps around the Pipeline and inverse transforms the predictions as the final step.

Return type:

Pipeline | InverseTransformRegressor
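The wrapping behavior described for a transformed target can be sketched as follows. Both classes below are hypothetical stand-ins written for this example (they are not the TableMage or scikit-learn classes), and the log10 target transform is an assumed choice.

```python
import math


class MeanPipeline:
    """Stand-in for a fitted pipeline: predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class InverseTransformSketch:
    """Wraps a pipeline that was trained on a transformed target."""

    def __init__(self, pipeline, inverse_func):
        self.pipeline = pipeline
        self.inverse_func = inverse_func

    def predict(self, X):
        # Predict on the transformed scale, then map back to the original scale.
        return [self.inverse_func(p) for p in self.pipeline.predict(X)]


y = [1.0, 10.0, 100.0]
X = [[0.0], [1.0], [2.0]]
# Train on log10(y); the wrapper undoes the transform at prediction time.
pipeline = MeanPipeline().fit(X, [math.log10(v) for v in y])
wrapped = InverseTransformSketch(pipeline, lambda p: 10.0 ** p)
predictions = wrapped.predict([[3.0]])
```

Callers of the wrapper only see predictions on the original scale of the target, which is why both return types can share the same .predict(X) interface.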

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.LinearC

class tablemage.ml.LinearC(type: Literal['no_penalty', 'l1', 'l2', 'elasticnet'] = 'l2', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Logistic regression classifier.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['no_penalty', 'l1', 'l2', 'elasticnet'] = 'l2', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Initializes a LinearC object.

Parameters:
  • type (Literal['no_penalty', 'l1', 'l2', 'elasticnet']) – Default: ‘l2’. Specifies the type of logistic regression penalty.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSC]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, a default name is set based on the type of the model.

  • threshold_strategy (Literal['f1', 'roc'] | None) – Default: ‘roc’. Determines the decision threshold optimization strategy. ‘f1’ uses the F1 score, ‘roc’ uses the ROC curve. If None, no threshold optimization is performed. Only considered if the model yields probabilities.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
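The two threshold_strategy options can give different decision thresholds on imbalanced data. The sketch below is illustrative only: the text above says that ‘roc’ uses the ROC curve without naming a criterion, so maximizing Youden's J (TPR minus FPR) is an assumption here, as are the helper names and the toy scores.

```python
def confusion(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, fp, fn, tn


def f1_at(y_true, scores, threshold):
    tp, fp, fn, _ = confusion(y_true, scores, threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def youden_j_at(y_true, scores, threshold):
    # Assumes both classes are present, so neither denominator is zero.
    tp, fp, fn, tn = confusion(y_true, scores, threshold)
    return tp / (tp + fn) - fp / (fp + tn)  # TPR - FPR


def best_threshold(y_true, scores, strategy):
    criterion = f1_at if strategy == "f1" else youden_j_at
    # Candidate thresholds are the observed scores; ties go to the lowest one.
    return max(sorted(set(scores)), key=lambda t: criterion(y_true, scores, t))


data = [(0, 0.02), (0, 0.05), (0, 0.10), (0, 0.15), (0, 0.20), (0, 0.25),
        (0, 0.30), (0, 0.33), (0, 0.36), (1, 0.40), (0, 0.50), (0, 0.60),
        (0, 0.70), (1, 0.80), (1, 0.90)]  # (label, predicted probability)
y_true = [label for label, _ in data]
scores = [score for _, score in data]
f1_threshold = best_threshold(y_true, scores, "f1")
roc_threshold = best_threshold(y_true, scores, "roc")
```

With 3 positives and 12 negatives, the F1 criterion prefers the higher threshold (0.8) because it removes all false positives, while the ROC-based criterion prefers the lower one (0.4) because it recovers every positive at a modest false positive rate.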

feature_importance() DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. This means that the data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if not specified by the user.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. Preprocessing steps were previously fitted on the train data. The test data is transformed accordingly.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.
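When cross validation is specified, the same fit-and-evaluate cycle runs once per fold and the fold metrics are combined. A minimal sketch, assuming contiguous unshuffled folds, a mean predictor as a stand-in model, and a simple average of the fold scores (all illustrative choices, not TableMage internals):

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), n_folds=3):
    # "Fit" on the fold's train part: a mean predictor stands in for the model.
    prediction = mean(y[i] for i in train_idx)
    # Evaluate on the fold's held-out part (mean absolute error).
    fold_scores.append(mean(abs(y[i] - prediction) for i in test_idx))

cv_score = mean(fold_scores)
```

Every observation is held out exactly once, which is what makes the averaged fold metric less dependent on a single train/test split.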

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_binary() bool

Returns True if the model is binary.

Returns:

True if the model is binary.

Return type:

bool

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator()

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

BaseEstimator
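One way to keep preprocessing and model together for later predictions is to store the preprocessing parameters next to the model parameters. The sketch below uses JSON and a hand-written linear predictor; the artifact layout, the numbers, and the predict helper are all hypothetical and only illustrate the advice above about specifying every preprocessing step explicitly.

```python
import json

# Parameters learned at training time (illustrative values, not real output).
artifact = {
    "preprocessing": {"mean": {"age": 40.0}, "std": {"age": 10.0}},
    "model": {"intercept": 1.0, "coef": {"age": 0.5}},
}
serialized = json.dumps(artifact)  # what would be written to disk


def predict(serialized_artifact, row):
    """Apply the stored preprocessing, then the stored linear model."""
    loaded = json.loads(serialized_artifact)
    prep, model = loaded["preprocessing"], loaded["model"]
    total = model["intercept"]
    for name, weight in model["coef"].items():
        # Standardize with the train-time statistics before applying the weight.
        scaled = (row[name] - prep["mean"][name]) / prep["std"][name]
        total += weight * scaled
    return total


prediction = predict(serialized, {"age": 60.0})
```

Because the scaling statistics travel with the coefficients, a new row is transformed exactly as the training data was.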

sklearn_pipeline() Pipeline

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

Pipeline

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.TreesC

class tablemage.ml.TreesC(type: Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf'] = 'random_forest', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Tree ensemble classifier.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf'] = 'random_forest', hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Initializes a TreesC object.

Parameters:
  • type (Literal['decision_tree', 'random_forest', 'gradient_boosting', 'adaboost', 'bagging', 'xgboost', 'xgboostrf']) – Default: ‘random_forest’. The type of tree ensemble to use.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSC]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, a default name is set based on the type of the model.

  • threshold_strategy (Literal['f1', 'roc'] | None) – Default: ‘roc’. Determines the decision threshold optimization strategy. ‘f1’ uses the F1 score, ‘roc’ uses the ROC curve. If None, no threshold optimization is performed. Only considered if the model yields probabilities.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
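A grid search over a hyperparam_search_space mapping, scored with inner cross validation, can be sketched as below. This is a simplified illustration, not the HyperparameterSearcher: a one-dimensional k-nearest-neighbors regressor stands in for the tree models, the folds are interleaved rather than shuffled, and all function names are hypothetical.

```python
from itertools import product
from statistics import mean


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(y for _, y in nearest)


def inner_cv_mse(data, params, n_folds):
    """Mean squared error of the stand-in model over interleaved inner folds."""
    errors = []
    for fold in range(n_folds):
        held_out = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors.extend((y - knn_predict(train, x, params["k"])) ** 2
                      for x, y in held_out)
    return mean(errors)


def grid_search(data, search_space, n_folds=5):
    """Try every combination in the space; keep the lowest inner-CV error."""
    names = list(search_space)
    candidates = [dict(zip(names, values))
                  for values in product(*(search_space[n] for n in names))]
    return min(candidates, key=lambda p: inner_cv_mse(data, p, n_folds))


data = [(float(x), 2.0 * x) for x in range(20)]  # noiseless line y = 2x
best_params = grid_search(data, {"k": [1, 2, 8]}, n_folds=5)
```

The mapping keys are hyperparameter names and the values are the candidates to try, so the number of fits grows with the product of the candidate counts times the number of inner folds.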

feature_importance() DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. This means that the data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if not specified by the user.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. Preprocessing steps were previously fitted on the train data. The test data is transformed accordingly.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_binary() bool

Returns True if the model is binary.

Returns:

True if the model is binary.

Return type:

bool

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator()

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

Pipeline

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.SVMC

class tablemage.ml.SVMC(type: Literal['linear', 'poly', 'rbf'] = 'rbf', hyperparam_search_method: Literal[None, 'grid', 'optuna'] = None, hyperparam_search_space: Mapping[str, Iterable] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Support Vector Machine with kernel trick.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(type: Literal['linear', 'poly', 'rbf'] = 'rbf', hyperparam_search_method: Literal[None, 'grid', 'optuna'] = None, hyperparam_search_space: Mapping[str, Iterable] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Initializes an SVMC object.

Parameters:
  • type (Literal['linear', 'poly', 'rbf']) – Default: ‘rbf’. The type of kernel to use.

  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSC]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: None. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, a default name is set based on the type of the model.

  • threshold_strategy (Literal['f1', 'roc'] | None) – Default: ‘roc’. Determines the decision threshold optimization strategy. ‘f1’ uses the F1 score, ‘roc’ uses the ROC curve. If None, no threshold optimization is performed. Only considered if the model yields probabilities.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
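The role of n_trials in a trial-based search can be illustrated with plain random search. Note the substitution: this sketch samples uniformly at random with Python's random module and is not Optuna's sampler; the toy objective, the log-uniform range for C, and the function names are assumptions made for the example.

```python
import math
import random


def objective(params):
    """Toy validation loss with its minimum at C = 1 (log10(C) = 0)."""
    return math.log10(params["C"]) ** 2


def random_search(n_trials, seed):
    """Run n_trials random trials and return the best (params, loss) pair."""
    rng = random.Random(seed)  # a fixed seed makes the search reproducible
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Sample C log-uniformly from [1e-3, 1e3].
        params = {"C": 10.0 ** rng.uniform(-3.0, 3.0)}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


few_params, few_loss = random_search(n_trials=10, seed=42)
many_params, many_loss = random_search(n_trials=100, seed=42)
```

With the same seed, the longer run revisits the first ten samples and then keeps exploring, so its best loss can only match or improve on the shorter run, at the cost of more model fits.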

feature_importance() DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None
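The three possible outcomes (importances, coefficients, or None) can be sketched with attribute checks. The attribute names feature_importances_ and coef_ follow scikit-learn conventions; the helper importance_table, the stand-in classes, and the list-of-tuples return value (instead of a DataFrame) are assumptions for illustration and do not show how TableMage builds its result.

```python
def importance_table(estimator, feature_names):
    """Return (feature, value) rows sorted by magnitude, or None."""
    if hasattr(estimator, "feature_importances_"):
        values = estimator.feature_importances_
    elif hasattr(estimator, "coef_"):
        values = estimator.coef_
    else:
        return None  # e.g. a kernel model that exposes neither attribute
    rows = list(zip(feature_names, values))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


class FakeForest:
    """Stand-in for a fitted tree ensemble."""
    feature_importances_ = [0.2, 0.8]


class FakeLinear:
    """Stand-in for a fitted linear model."""
    coef_ = [-3.0, 1.0]


class FakeKernelModel:
    """Stand-in for a model without importances or coefficients."""


tree_rows = importance_table(FakeForest(), ["a", "b"])
linear_rows = importance_table(FakeLinear(), ["a", "b"])
kernel_rows = importance_table(FakeKernelModel(), ["a", "b"])
```

Callers should therefore check for None before using the result, particularly with non-linear kernels.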

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. This means that the data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if not specified by the user.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. Preprocessing steps were previously fitted on the train data. The test data is transformed accordingly.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_binary() bool

Returns True if the model is binary.

Returns:

True if the model is binary.

Return type:

bool

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator()

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

BaseEstimator

sklearn_pipeline() Pipeline

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

Pipeline

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.

tm.ml.MLPC

class tablemage.ml.MLPC(hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = 10, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Multi-layer Perceptron classifier.

Hyperparameter optimization is performed automatically during training. The hyperparameter search process can be modified by the user.

__init__(hyperparam_search_method: Literal['optuna', 'grid'] | None = None, hyperparam_search_space: Mapping[str, Iterable | BaseDistribution] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = 10, model_random_state: int = 42, name: str | None = None, threshold_strategy: Literal['f1', 'roc'] | None = 'roc', **kwargs)[source]

Initializes an MLPC object.

Parameters:
  • hyperparam_search_method (Literal[None, 'grid', 'optuna']) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • hyperparam_search_space (Mapping[str, Iterable | BaseDistribution]) – Default: None. If None, a model-specific default hyperparameter search is conducted.

  • feature_selectors (list[BaseFSC]) – Default: None. If not None, specifies the feature selectors for the VotingSelectionReport.

  • max_n_features (int | None) – Default: 10. Only useful if feature_selectors is not None. If None, then all features with at least 50% support are selected.

  • model_random_state (int) – Default: 42. Random seed for the model.

  • name (str) – Default: None. Determines how the model shows up in the reports. If None, a default name is set based on the type of the model.

  • threshold_strategy (Literal['f1', 'roc'] | None) – Default: ‘roc’. Determines the decision threshold optimization strategy. ‘f1’ uses the F1 score, ‘roc’ uses the ROC curve. If None, no threshold optimization is performed. Only considered if the model yields probabilities.

  • **kwargs (dict) –

    Keyword arguments are passed directly into the initialization of the HyperparameterSearcher class. See below for options.

    inner_cv (int | BaseCrossValidator)

    Default: 5. Number of inner cross validation folds. Inner cross validation is used for hyperparameter optimization.

    inner_cv_seed (int)

    Default: 42. Random seed for inner cross validation.

    n_jobs (int)

    Default: 1. Number of parallel jobs to run.

    verbose (int)

    Default: 0. Sets the sklearn verbosity level for the sklearn estimator. 2 is the most verbose.

    n_trials (int)

    Default: 100. Number of trials for hyperparameter optimization. Only used if hyperparam_search_method is ‘optuna’.
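The effect of inner_cv and inner_cv_seed can be pictured as a seeded shuffle followed by a split into folds. The helper below is a hypothetical sketch using Python's random module; the splitter TableMage actually uses may assign observations to folds differently.

```python
import random


def shuffled_folds(n_samples, n_folds, seed):
    """Shuffle the row indices with a fixed seed, then deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]


# Values mirror the documented defaults: 5 inner folds, seed 42.
folds = shuffled_folds(n_samples=10, n_folds=5, seed=42)
same_folds = shuffled_folds(n_samples=10, n_folds=5, seed=42)
```

Reusing the seed reproduces the same fold assignment, so repeated hyperparameter searches are scored on identical inner splits.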

feature_importance() DataFrame | None

Returns the feature importances of the best estimator. If the best estimator is a linear model, the coefficients are returned.

Returns:

A DataFrame with feature importances or coefficients. None if no importances or coefficients are available.

Return type:

pd.DataFrame | None

fit(verbose: bool = False)

Fits and evaluates the model.

The model fitting process is as follows:

  1. The train data is emitted. This means that the data is preprocessed based on user specifications and any necessary automatic preprocessing steps. That is, the DataEmitter will automatically drop observations with missing entries and encode categorical variables if not specified by the user.

  2. The hyperparameter search is performed. The best estimator is saved and evaluated on the train data.

  3. The test data is emitted. Preprocessing steps were previously fitted on the train data. The test data is transformed accordingly.

  4. The best estimator determined from the training step is evaluated on the test data.

If cross validation is specified, fold-specific DataEmitters are generated. Steps 1-4 are repeated for each fold.

The fitting process yields three sets of metrics:

  1. The training set metrics.

  2. The cross validation set metrics (only if cross validation was specified).

    Note that the cross validation metrics are computed on the test set of each fold and are therefore a more robust estimate of model performance than the test set metrics.

  3. The test set metrics.

Parameters:

verbose (bool) – Default: False. If True, prints progress.

fs_report() VotingSelectionReport | None

Returns the VotingSelectionReport object.

Returns:

None if the VotingSelectionReport object has not been set (e.g. no feature selection was conducted).

Return type:

VotingSelectionReport | None

hyperparam_searcher() HyperparameterSearcher

Returns the HyperparameterSearcher object.

Return type:

HyperparameterSearcher

is_binary() bool

Returns True if the model is binary.

Returns:

True if the model is binary.

Return type:

bool

is_cross_validated() bool

Returns True if the cross validation metrics are available.

Returns:

True if cross validation metrics are available.

Return type:

bool

predictors() → list[str] | None

Returns a list of predictor variable names. If the model has not been fitted, a warning is printed and None is returned.

Returns:

A list of predictor variable names used in the final model, after feature selection and data transformation.

Return type:

list[str] | None

sklearn_estimator()

Returns the best estimator (sklearn estimator object). The best estimator was fitted on the train data through the hyperparameter search process.

Note that the sklearn estimator can be saved and used for future predictions. However, the input data must be preprocessed in the same way. If you intend to use the estimator for future predictions, it is recommended that you manually specify every preprocessing step, which will ensure that you have full control over how the data is being transformed for future reproducibility and predictions.
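
The following sketch illustrates why the preprocessing must travel with a saved estimator. It uses only scikit-learn, NumPy, and pickle; the synthetic data, the StandardScaler preprocessing, and the Ridge model are assumptions chosen for illustration and are not taken from TableMage.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_coef = np.array([0.5, -0.2])
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=200)

# Preprocessing is fitted on the train data; the estimator sees its output.
scaler = StandardScaler().fit(X_train)
estimator = Ridge(alpha=1.0).fit(scaler.transform(X_train), y_train)

# Persist both objects together; the estimator alone is not enough.
blob = pickle.dumps({"scaler": scaler, "estimator": estimator})
restored = pickle.loads(blob)

X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 2))
expected = X_new @ true_coef

# Same preprocessing as during training: predictions are close to the truth.
good = restored["estimator"].predict(restored["scaler"].transform(X_new))
# Preprocessing skipped: the estimator receives inputs on the wrong scale.
bad = restored["estimator"].predict(X_new)

good_error = float(np.abs(good - expected).max())
bad_error = float(np.abs(bad - expected).max())
print(good_error, bad_error)
```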

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

BaseEstimator

sklearn_pipeline() → Pipeline

Returns an sklearn pipeline object. The pipeline allows for retrieving model predictions directly from data formatted like the original train and test data.

The pipeline is composed of the following steps:
  1. Custom data preprocessing steps.

  2. Hyperparameter search object.

  3. The best model determined from the hyperparameter search process.
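
A comparable structure can be built directly in scikit-learn, as sketched below. This is an analogy rather than the pipeline TableMage returns: the step names, the StandardScaler stand-in for custom preprocessing, and the Ridge search grid are assumptions. In this sketch the best model lives inside the fitted search step instead of being a separate step.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = X @ np.array([0.5, -0.2]) + rng.normal(scale=0.1, size=200)

pipeline = Pipeline(
    steps=[
        # Stand-in for the custom data preprocessing steps.
        ("preprocess", StandardScaler()),
        # Hyperparameter search object; it refits the best model after searching.
        ("search", GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0]}, cv=3)),
    ]
)
pipeline.fit(X[:150], y[:150])

# The best model determined by the search.
best_model = pipeline.named_steps["search"].best_estimator_

# Raw, unscaled rows go straight in; the pipeline applies preprocessing itself.
predictions = pipeline.predict(X[150:])
r2 = pipeline.score(X[150:], y[150:])
print(best_model, r2)
```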

It is not recommended to use TableMage for ML production. We recommend using TableMage to quickly identify promising models and then manually implementing and training the best model in a production environment.

Return type:

Pipeline

specify_data(dataemitter: DataEmitter | None = None, dataemitters: list[DataEmitter] | None = None)

Specifies the DataEmitters for the model fitting process.

Parameters:
  • dataemitter (DataEmitter | None) – Default: None. DataEmitter that contains the data. If not None, re-specifies the DataEmitter for the model.

  • dataemitters (list[DataEmitter] | None) – Default: None. If not None, re-specifies the DataEmitters for nested cross validation.