Reports (tm._reports)

Report objects are returned by the tablemage.Analyzer.eda(), tablemage.Analyzer.ols(), tablemage.Analyzer.logit(), tablemage.Analyzer.regress(), and tablemage.Analyzer.classify() methods of the tablemage.Analyzer class. They may contain information about model performance, feature importance, or other relevant statistics, and they provide methods for plotting diagnostic figures.

tm._reports.MLClassificationReport

class tablemage._reports.MLClassificationReport(models: list[BaseC], datahandler: DataHandler, target: str, predictors: list[str], feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42, verbose: bool = True)[source]

Class for evaluating multiple classification models. Fits the models using the provided DataHandler.

cv_metrics(average_across_folds: bool = True) DataFrame | None[source]

Returns a DataFrame containing the cross-validated evaluation metrics for all models on the training data. If cross validation was not conducted, None is returned.

Parameters:

average_across_folds (bool) – Default: True. If True, returns a DataFrame containing evaluation metrics averaged across all folds. Otherwise, returns a DataFrame containing evaluation metrics for each fold.

Returns:

None is returned if cross validation was not conducted.

Return type:

pd.DataFrame | None

cv_metrics_by_class(averaged_across_folds: bool = True) DataFrame | None[source]

Returns a DataFrame containing the cross-validated evaluation metrics for all models on the specified data, broken down by class.

Parameters:

averaged_across_folds (bool) – Default: True. If True, returns a DataFrame containing evaluation metrics averaged across all folds. Otherwise, returns a DataFrame containing evaluation metrics for each fold.

Returns:

None is returned if cross validation was not conducted.

Return type:

pd.DataFrame | None

feature_importance(model_id: str) DataFrame | None[source]

Returns the feature importances of the model with the specified id. If the model does not have feature importances, the coefficients are returned instead. If it has neither, None is returned.

Parameters:

model_id (str) – The id of the model.

Returns:

None is returned if the model does not have feature importances or coefficients.

Return type:

pd.DataFrame | None

fs_report() VotingSelectionReport | None[source]

Returns the feature selection report. If feature selectors were specified at the model level or not at all, then this method will return None.

To access the feature selection report for a specific model, use model_report(<model_id>).feature_selection_report().

Returns:

None is returned if no feature selectors were specified.

Return type:

VotingSelectionReport | None

is_binary() bool[source]

Returns True if the target variable is binary.

Returns:

True if the target variable is binary.

Return type:

bool

metrics(dataset: Literal['train', 'test', 'both']) DataFrame[source]

Returns a DataFrame containing the evaluation metrics for all models on the specified data.

Parameters:

dataset (Literal['train', 'test', 'both']) – The dataset to return the metrics for.

Return type:

pd.DataFrame

metrics_by_class(dataset: Literal['train', 'test']) DataFrame | None[source]

Returns a DataFrame containing the evaluation metrics for all models on the specified data, broken down by class.

Parameters:

dataset (Literal['train', 'test']) – The dataset to return the fit statistics for.

Returns:

None is returned if the target variable is binary.

Return type:

pd.DataFrame | None
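As an illustration of what a per-class breakdown involves, the sketch below computes precision and recall for each class from true and predicted labels in plain Python. It is not tablemage's implementation: the function name per_class_precision_recall, the sample labels, and the choice of precision and recall as the reported metrics are all assumptions made for the example.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: {"precision": ..., "recall": ...}} (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    result = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # predicted c, was not c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # was c, predicted other
        result[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return result


# Illustrative three-class labels (made up for this sketch).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
by_class = per_class_precision_recall(y_true, y_pred)
```

Each class gets its own row of metrics, which is the shape of information a by-class table conveys.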

model(model_id: str) BaseC[source]

Returns the model with the specified id.

Parameters:

model_id (str) – The id of the model.

Return type:

BaseC

plot_confusion_matrix(model_id: str, dataset: Literal['train', 'test'], figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure[source]

Returns a figure that is the confusion matrix for the model.

Parameters:
  • model_id (str) – The id of the model.

  • dataset (Literal['train', 'test']) – The dataset to plot the confusion matrix for.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure.

  • ax (plt.Axes | None) – Default: None. The axes on which to plot the figure. If None, a new figure is created.

Returns:

Figure of the confusion matrix.

Return type:

plt.Figure
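For readers unfamiliar with the quantity being plotted, here is a minimal plain-Python sketch of how a confusion matrix is tallied. It is an illustration only, not the library's code; the helper name confusion_matrix, the sample labels, and the convention that rows are true classes and columns are predicted classes are assumptions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Illustrative binary labels (made up for this sketch).
labels = ["neg", "pos"]
y_true = ["neg", "neg", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "pos", "neg"]
matrix = confusion_matrix(y_true, y_pred, labels)
```

The diagonal entries count correct predictions; off-diagonal entries count each kind of misclassification.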

plot_roc_curve(model_id: str, dataset: Literal['train', 'test'], figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure | None[source]

Plots the ROC curve for a single model.

Parameters:
  • model_id (str) – The id of the model.

  • dataset (Literal['train', 'test']) – The dataset to plot the ROC curve for.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure.

  • ax (plt.Axes | None) – Default: None. The axes to plot on. If None, a new figure is created.

Returns:

Figure of the ROC curve. None is returned if the model is not binary.

Return type:

plt.Figure | None
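The following sketch shows, under simplifying assumptions, how the points of a binary ROC curve can be derived from predicted scores by sweeping a decision threshold, and how the area under the curve follows from the trapezoidal rule. The function names roc_points and auc and the sample data are hypothetical; tablemage may compute the curve differently.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # An example is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative labels (1 = positive class) and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)
```

Because the construction needs a single positive class, it only applies directly to binary targets, which is consistent with None being returned otherwise.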

plot_roc_curves(dataset: Literal['train', 'test'], figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure[source]

Plots the ROC curves for all models.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to plot the ROC curves for.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure.

  • ax (plt.Axes | None) – Default: None. The axes to plot on. If None, a new figure is created.

tm._reports.MLRegressionReport

class tablemage._reports.MLRegressionReport(models: list[BaseR], datahandler: DataHandler, target: str, predictors: list[str], feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42, verbose: bool = True)[source]

Class for reporting model goodness of fit. Fits the models using the provided DataHandler.

cv_metrics(average_across_folds: bool = True) DataFrame | None[source]

Returns a DataFrame containing the cross-validated goodness-of-fit statistics for all models on the training data. Cross validation must have been conducted, otherwise None is returned.

Parameters:

average_across_folds (bool) – Default: True. If True, returns a DataFrame containing goodness-of-fit statistics averaged across all folds. Otherwise, returns a DataFrame containing goodness-of-fit statistics for each fold.

Returns:

None if cross validation was not conducted.

Return type:

pd.DataFrame | None
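To make the effect of average_across_folds concrete, the sketch below contrasts an averaged summary with a per-fold summary using plain dictionaries. The model ids, the scores, and the helper summarize are invented for illustration; the real method returns a DataFrame whose layout is not shown here.

```python
from statistics import mean

# Hypothetical per-fold scores for two models (illustrative numbers only).
fold_scores = {
    "model_a": [0.80, 0.90, 0.70],
    "model_b": [0.60, 0.80, 0.70],
}


def summarize(scores, average_across_folds=True):
    """Return one averaged value per model, or the per-fold values unchanged."""
    if average_across_folds:
        return {model: mean(values) for model, values in scores.items()}
    return {model: list(values) for model, values in scores.items()}


averaged = summarize(fold_scores)
per_fold = summarize(fold_scores, average_across_folds=False)
```

The per-fold view is useful for judging how stable a model's performance is across splits, which the averaged view hides.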

feature_importance(model_id: str) DataFrame | None[source]

Returns the feature importances of the model with the specified id. If the model does not have feature importances, the coefficients are returned instead. If it has neither, None is returned.

Parameters:

model_id (str) – The id of the model.

Returns:

None is returned if the model does not have feature importances or coefficients.

Return type:

pd.DataFrame | None

fs_report() VotingSelectionReport | None[source]

Returns the feature selection report. If feature selectors were specified at the model level or not at all, then this method will return None.

To access the feature selection report for a specific model, use model_report(<model_id>).feature_selection_report().

Returns:

None if feature selectors were not specified.

Return type:

VotingSelectionReport | None

metrics(dataset: Literal['train', 'test', 'both']) DataFrame[source]

Returns a DataFrame containing the metrics for all models on the specified data.

Parameters:

dataset (Literal['train', 'test', 'both']) – The dataset for which to return the metrics.

Return type:

pd.DataFrame

model(model_id: str) BaseR[source]

Returns the model with the specified id.

Parameters:

model_id (str) – The id of the model.

Return type:

BaseR

plot_obs_vs_pred(model_id: str, dataset: Literal['train', 'test'], figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure[source]

Returns a figure that is a scatter plot of the observed (y-axis) and predicted (x-axis) values for the specified model and dataset.

Parameters:
  • model_id (str) – The id of the model.

  • dataset (Literal['train', 'test']) – The dataset for which to plot the observed vs predicted values.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure.

  • ax (plt.Axes | None) – Default: None. The axes on which to plot the figure. If None, a new figure is created.

Return type:

plt.Figure

tm._reports.OLSReport

class tablemage._reports.OLSReport(model: OLSLinearModel, datahandler: DataHandler, target: str, predictors: list[str], dataemitter: DataEmitter | None = None)[source]

OLSReport. Fits the model based on provided DataHandler. Contains methods for generating regression-relevant diagnostic plots and tables for a single linear regression model.

coefs(format: Literal['coef(se)|pval', 'coef|se|pval', 'coef(ci)|pval', 'coef|ci_low|ci_high|pval'] = 'coef(se)|pval') DataFrame[source]

Returns the coefficients of the model.

Parameters:

format (Literal['coef(se)|pval', 'coef|se|pval', 'coef(ci)|pval', 'coef|ci_low|ci_high|pval']) – Default: ‘coef(se)|pval’.

Return type:

pd.DataFrame

get_outlier_indices(dataset: Literal['train', 'test'] = 'test') list[source]

Returns the indices corresponding to DataFrame examples associated with standardized residual outliers.

Parameters:

dataset (Literal['train', 'test']) – Default: ‘test’.

Returns:

The DataFrame indices of the outliers (outliers_df_idx).

Return type:

list of length n_outliers

metrics(dataset: Literal['train', 'test', 'both']) DataFrame[source]

Returns a DataFrame containing the goodness-of-fit statistics for the model.

Parameters:

dataset (Literal['train', 'test', 'both']) – The dataset to compute the metrics for.

Return type:

pd.DataFrame

model() OLSLinearModel[source]

Returns the fitted OLSLinearModel object.

Return type:

OLSLinearModel

plot_diagnostics(dataset: Literal['train', 'test'], show_outliers: bool = False, figsize: tuple[float, float] = (7.0, 7.0)) Figure[source]

Plots several useful linear regression diagnostic plots.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • show_outliers (bool) – Default: False. If True, plots the residual outliers in red.

  • figsize (tuple[float, float]) – Default: (7.0, 7.0).

Return type:

plt.Figure

plot_obs_vs_pred(dataset: Literal['train', 'test'], show_outliers: bool = True, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Plots a scatter plot of the true and predicted y values.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • show_outliers (bool) – Default: True. If True, then the outliers calculated using standard errors will be shown in red.

  • figsize (tuple[float, float]) – Default: (5.0,5.0). Sets the size of the resulting graph.

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_qq(dataset: Literal['train', 'test'], standardized: bool = True, show_outliers: bool = False, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Plots a quantile-quantile plot of the residuals.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • standardized (bool) – Default: True. If True, standardizes the residuals.

  • show_outliers (bool) – Default: False. If True, plots the outliers in red.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0).

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_residuals_hist(dataset: Literal['train', 'test'], standardized: bool = False, density: bool = False, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Returns a figure that is a histogram of the residuals.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • standardized (bool) – Default: False. If True, standardizes the residuals.

  • density (bool) – Default: False. If True, plots density rather than frequency.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0). Determines the size of the returned figure.

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_residuals_vs_fitted(dataset: Literal['train', 'test'], standardized: bool = False, show_outliers: bool = True, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Plots the residuals versus the fitted values.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • standardized (bool) – Default: False. If True, plots the standardized residuals as opposed to the raw residuals.

  • show_outliers (bool) – Default: True. If True, colors the outliers determined by the standardized residuals in red.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0). Determines the size of the returned figure.

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_residuals_vs_leverage(dataset: Literal['train', 'test'], standardized: bool = True, show_outliers: bool = True, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Plots the residuals versus leverage.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • standardized (bool) – Default: True. If True, standardizes the residuals.

  • show_outliers (bool) – Default: True. If True, plots the outliers in red.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0).

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_residuals_vs_var(predictor: str, dataset: Literal['train', 'test'], standardized: bool = False, show_outliers: bool = False, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Returns a figure that plots the residuals against the values of the specified predictor variable.

Parameters:
  • predictor (str) – The predictor variable whose values should be plotted on the x-axis.

  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • standardized (bool) – Default: False. If True, standardizes the residuals.

  • show_outliers (bool) – Default: False. If True, plots the outliers in red.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0). Determines the size of the returned figure.

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

plot_scale_location(dataset: Literal['train', 'test'], show_outliers: bool = True, figsize: tuple[float, float] = (5.0, 5.0), ax: Axes | None = None) Figure[source]

Returns a figure that is a scale-location plot: the square root of the absolute residuals versus the fitted values.

Parameters:
  • dataset (Literal['train', 'test']) – The dataset to generate the plot for.

  • show_outliers (bool) – Default: True. If True, plots the outliers in red.

  • figsize (tuple[float, float]) – Default: (5.0, 5.0).

  • ax (plt.Axes) – Default: None.

Return type:

plt.Figure

set_outlier_threshold(threshold: float) OLSReport[source]

Sets the standardized-residual threshold used for outlier identification and recomputes the outliers.

Parameters:

threshold (float) – Default: 2. Must be a nonnegative value.

Returns:

Returns self for method chaining.

Return type:

OLSReport
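The idea of flagging outliers by a standardized-residual threshold can be sketched as follows. This is a deliberately simplified illustration: it scales residuals by their root mean square, whereas standardized residuals in OLS usually also account for leverage and residual degrees of freedom. The function name outlier_indices, the strict inequality, and the sample residuals are assumptions.

```python
import math


def outlier_indices(residuals, threshold=2.0):
    """Return positions whose scaled residual exceeds the threshold in magnitude."""
    n = len(residuals)
    # Simplification: scale by the root mean square of the residuals.
    scale = math.sqrt(sum(r * r for r in residuals) / n)
    return [i for i, r in enumerate(residuals) if abs(r / scale) > threshold]


# Illustrative residuals with one clearly large value at position 4.
residuals = [0.5, -0.3, 0.2, -0.4, 3.0, 0.1, -0.2, 0.3]
default_outliers = outlier_indices(residuals)        # threshold of 2
strict_outliers = outlier_indices(residuals, 3.0)    # stricter threshold
```

Raising the threshold shrinks the set of flagged observations, which is why the outliers must be recomputed whenever the threshold changes.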

statsmodels_summary()[source]

Returns the summary of the statsmodels RegressionResultsWrapper for OLS.

step(direction: Literal['both', 'backward', 'forward'] = 'backward', criteria: Literal['aic', 'bic'] = 'aic', kept_vars: list[str] | None = None, all_vars: list[str] | None = None, start_vars: list[str] | None = None, max_steps: int = 100) OLSReport[source]

Performs stepwise selection. Returns a new OLSReport object with the reduced model.

Parameters:
  • direction (Literal["both", "backward", "forward"]) – Default: ‘backward’. The direction of the stepwise selection.

  • criteria (Literal["aic", "bic"]) – Default: ‘aic’. The criteria to use for selecting the best model.

  • kept_vars (list[str]) – Default: None. The variables that should be kept in the model. If None, defaults to an empty list.

  • all_vars (list[str]) – Default: None. The variables that are candidates for inclusion in the model. If None, defaults to all variables in the training data.

  • start_vars (list[str]) – Default: None. The variables to start the bidirectional stepwise selection with. Ignored if direction is not ‘both’. If direction is ‘both’ and start_vars is None, then the starting variables are the kept_vars.

  • max_steps (int) – Default: 100. The maximum number of steps to take.

Return type:

OLSReport
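As a rough sketch of how a single backward step might compare candidate models, the code below scores each candidate with AIC and BIC computed from its residual sum of squares, using the standard Gaussian linear-model forms up to an additive constant. The residual sums of squares, variable names, and helper functions are invented for illustration, and the library's actual search procedure is not shown in this reference.

```python
import math


def aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)


n_obs = 100
# Hypothetical residual sums of squares: the current model and each
# candidate obtained by dropping one variable from it.
rss_by_model = {
    ("x1", "x2", "x3"): 40.0,
    ("x1", "x2"): 40.5,
    ("x1", "x3"): 70.0,
    ("x2", "x3"): 90.0,
}


def best_model(criterion):
    # k counts the predictors plus the intercept.
    return min(rss_by_model,
               key=lambda m: criterion(rss_by_model[m], n_obs, len(m) + 1))


best_by_aic = best_model(aic)
best_by_bic = best_model(bic)
```

Here dropping x3 barely increases the residual sum of squares, so the penalty saved by removing a parameter outweighs the loss of fit under both criteria.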

test_lr(alternative_report: OLSReport) StatisticalTestReport[source]

Performs a likelihood ratio test comparing the current model with an alternative OLSLinearModel. Returns an object of class StatisticalTestReport describing the results.

Parameters:

alternative_report (OLSReport) – The report of an alternative OLSLinearModel. The alternative model must be a nested version of the current model or vice-versa.

Return type:

StatisticalTestReport

test_partialf(alternative_report: OLSReport) StatisticalTestReport[source]

Performs a partial F-test comparing the current model with an alternative OLSLinearModel. Returns an object of class StatisticalTestReport describing the results.

Parameters:

alternative_report (OLSReport) – The report of an alternative OLSLinearModel. The alternative model must be a nested version of the current model or vice-versa.

Return type:

StatisticalTestReport
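For reference, the textbook partial F statistic for two nested linear models can be computed from their residual sums of squares as sketched below. The function name and the numbers are hypothetical, and the sketch stops at the statistic (no p-value); it is not taken from tablemage's source.

```python
def partial_f_statistic(rss_reduced, rss_full, p_reduced, p_full, n):
    """F statistic for nested models; p_* count coefficients incl. intercept."""
    # Improvement in fit per additional coefficient in the full model.
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    # Residual variance estimate of the full model.
    denominator = rss_full / (n - p_full)
    return numerator / denominator


# Illustrative values: the full model adds two coefficients to the reduced one.
f_stat = partial_f_statistic(rss_reduced=120.0, rss_full=100.0,
                             p_reduced=2, p_full=4, n=54)
degrees_of_freedom = (4 - 2, 54 - 4)
```

The nesting requirement matters because the numerator assumes the reduced model's residual sum of squares can never be smaller than the full model's.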

tm._reports.EDAReport

class tablemage._reports.EDAReport(df: DataFrame)[source]

Class for generating EDA-relevant plots and tables for all variables.

anova(numeric_var: str, stratify_by: str, strategy: Literal['auto', 'anova_oneway', 'kruskal'] = 'auto') StatisticalTestReport[source]

Tests for equal means between three or more groups. Null hypothesis: All group means are equal. Alternative hypothesis: At least one group’s mean is different from the others. NaNs in numeric_var and stratify_by are dropped before the test is conducted.

Parameters:
  • numeric_var (str) – Numeric variable name to be stratified and compared.

  • stratify_by (str) – Categorical variable name.

  • strategy (Literal['auto', 'anova_oneway', 'kruskal']) – Default: ‘auto’. If ‘auto’, a test is selected as follows: If the data in any group is not normally distributed or not homoskedastic, then the Kruskal-Wallis test is used. Otherwise, the one-way ANOVA test is used. ANOVA is somewhat robust to heteroscedasticity and violations of the normality assumption.

Return type:

StatisticalTestReport

categorical_stats() DataFrame | None[source]

Returns a DataFrame containing summary statistics for all categorical variables.

Returns None if there are no categorical variables.

Return type:

pd.DataFrame | None

categorical_vars() list[str][source]

Returns a list of the names of all categorical variables.

Return type:

list[str]

chi2(categorical_var_1: str, categorical_var_2: str) StatisticalTestReport[source]

Tests for independence between two categorical variables using the chi-squared test.

Parameters:
  • categorical_var_1 (str) – Name of the first categorical variable.

  • categorical_var_2 (str) – Name of the second categorical variable.

Returns:

A structured report of the statistical test results.

Return type:

StatisticalTestReport

numeric_stats() DataFrame | None[source]

Returns a DataFrame containing summary statistics for all numeric variables.

Returns None if there are no numeric variables.

Return type:

pd.DataFrame | None

numeric_vars() list[str][source]

Returns a list of the names of all numeric variables.

Return type:

list[str]

plot(x: str, y: str | None = None, figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure[source]

General purpose plot method for single variable distributions and relationships between two variables. Variables may be numeric or categorical.

If both numeric, scatter plot is produced. If one numeric and one categorical, boxplot is produced. If both categorical, cross tab heatmap is produced.

Parameters:
  • x (str) – The name of the variable to plot on the x-axis.

  • y (str | None) – Default: None. The name of the variable to plot on the y-axis.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure. Only used if ax is None.

  • ax (plt.Axes | None) – Default: None. The axes to plot on. If None, a new figure is created.
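The dispatch rule described above can be written as a small decision function. This is only a sketch of the stated logic: the function name choose_plot_kind and the returned strings are made up, and the specific single-variable plot is labelled generically because the reference does not say which one is drawn.

```python
def choose_plot_kind(x_is_numeric, y_is_numeric=None):
    """Map variable types to the kind of plot described in the text."""
    if y_is_numeric is None:
        return "single-variable distribution"  # only x was given
    if x_is_numeric and y_is_numeric:
        return "scatter plot"
    if x_is_numeric != y_is_numeric:
        return "boxplot"  # one numeric, one categorical
    return "cross tab heatmap"  # both categorical


kinds = [
    choose_plot_kind(True),
    choose_plot_kind(True, True),
    choose_plot_kind(True, False),
    choose_plot_kind(False, False),
]
```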

plot_correlation_heatmap(numeric_vars: list[str] | None = None, htest: bool = False, cmap: str | Colormap | None = None, figsize: tuple[float, float] = (7, 7), ax: Axes | None = None) Figure[source]

Plots a heatmap of the correlation matrix of the numeric variables.

Parameters:
  • numeric_vars (list[str] | None) – List of numeric variables to include in the heatmap. If None, all numeric variables are considered.

  • htest (bool) – If True, displays correlation coefficients with their corresponding p-values in parentheses.

  • cmap (str | plt.Colormap | None) – The colormap to use for the heatmap visualization. If None, uses a default colormap.

  • figsize (tuple[float, float]) – The size of the figure (width, height) in inches. Only used if ax is None.

  • ax (plt.Axes | None) – If provided, the plot is drawn on this Axes instance.

Returns:

The figure containing the correlation heatmap.

Return type:

plt.Figure

plot_pairs(vars: list[str] | None = None, htest: bool = True, figsize: tuple[float, float] = (7, 7)) Figure[source]

Plots pairwise relationships among the specified variables (numeric or categorical).

Diagonal plots show distributions of single variables, lower panels show one type of plot, upper panels another.

Parameters:
  • vars (list[str] | None) – Default: None. A list of variable names (numeric or categorical). If None, all columns are considered.

  • htest (bool) – Default: True. If True, includes correlation coefficients and p-values for numeric-numeric pairs, chi-squared test results for categorical-categorical pairs, and either t-test or ANOVA results for numeric-categorical pairs in the upper triangle.

  • figsize (tuple[float, float]) – Default: (7, 7). The size of the figure.

Return type:

plt.Figure

plot_pca(numeric_vars: list[str], stratify_by: str | None = None, strata: Series | None = None, scale_strategy: Literal['standardize', 'center', 'none'] = 'center', whiten: bool = False, three_components: bool = False, figsize: tuple[float, float] = (5, 5), ax: Axes | None = None) Figure[source]

Plots the first two (or three) principal components, optionally stratified by an additional variable. Drops examples with missing values across the given variables of interest.

Parameters:
  • numeric_vars (list[str]) – List of numeric variables across which the PCA will be performed.

  • stratify_by (str | None) – Default: None. Categorical variable from which strata are identified.

  • strata (pd.Series | None) – Default: None. The labels/strata. Must be the same length as the dataset. Index must be compatible with self.df. Overridden by stratify_by if both are provided.

  • scale_strategy (Literal['standardize', 'center', 'none']) – Default: ‘center’.

  • whiten (bool) – Default: False. If True, performs whitening on the data during PCA.

  • three_components (bool) – Default: False. If True, returns a 3D plot. Otherwise plots the first two components only.

  • figsize (tuple[float, float]) – Default: (5, 5). The size of the figure. Only used if ax is None.

  • ax (plt.Axes | None) – Default: None. If not None, does not return a figure; plots the plot directly onto the input Axes.

Return type:

plt.Figure
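To show what the three scale_strategy options mean for a single column before PCA, here is a plain-Python sketch. The helper scale_column is hypothetical, and the use of the population standard deviation for 'standardize' is an assumption; the library may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def scale_column(values, strategy="center"):
    """Apply one of the three scaling strategies to a list of numbers."""
    if strategy == "none":
        return list(values)  # leave the data untouched
    mu = mean(values)
    centered = [v - mu for v in values]
    if strategy == "center":
        return centered  # zero mean, original spread
    if strategy == "standardize":
        sd = pstdev(values)  # assumption: population standard deviation
        return [c / sd for c in centered]  # zero mean, unit spread
    raise ValueError(f"unknown strategy: {strategy!r}")


column = [2.0, 4.0, 6.0]
centered = scale_column(column, "center")
standardized = scale_column(column, "standardize")
untouched = scale_column(column, "none")
```

Centering preserves each variable's original scale, so variables with large variance dominate the components; standardizing puts all variables on equal footing.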

tabulate_correlation_comparison(numeric_vars: list[str], target: str, bonferroni_correction: bool = False) DataFrame[source]

Generates a table of the Pearson correlation coefficients between the numeric variables and a target variable.

Parameters:
  • numeric_vars (list[str]) – List of numeric variables.

  • target (str) – The numeric variable to correlate the numeric_vars with.

  • bonferroni_correction (bool, default=False) – If True, applies the Bonferroni correction to the p-values (multiplies them by the number of tests).

  • dropna (bool, default=True) – If True, drops rows with NaN values when computing correlations. If False, raises an error if NaN values are present.

Returns:

DataFrame with index as the numeric variables. Columns include the Pearson correlation coefficient, p-value, and number of units considered (if dropna was True).

Return type:

pd.DataFrame
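The Bonferroni correction mentioned above amounts to multiplying each p-value by the number of tests. A minimal sketch follows; capping the result at 1 is a common convention assumed here rather than something this reference states, and the function name and p-values are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Illustrative raw p-values from three correlation tests.
adjusted = bonferroni([0.01, 0.04, 0.5])
```

With three tests, a raw p-value of 0.04 no longer falls below a 0.05 significance level after correction.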

tabulate_correlation_matrix(numeric_vars: list[str], htest: bool = False) DataFrame[source]

Generates a table of the Pearson correlation coefficients between numeric variables.

The function computes correlations efficiently by leveraging numpy operations and avoiding redundant calculations. For symmetric pairs (i,j) and (j,i), it only computes one and mirrors the result. Handles missing values by using pairwise complete observations.

Parameters:
  • numeric_vars (list[str]) – List of numeric variables to compute correlations for.

  • htest (bool, default=False) – If True, includes p-values in the output in format: “corr (p-val)”

Returns:

DataFrame with index and columns as the numeric variables. Values are either correlation coefficients or “correlation (p-value)” if htest=True. Missing values are represented as “NA”.

Return type:

pd.DataFrame

Raises:

ValueError – If any variable in numeric_vars is not a known numeric variable.
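The two ideas described above, pairwise complete observations and mirroring of symmetric entries, can be sketched in plain Python as follows. This is not the library's numpy-based implementation; the helper names, the use of None for missing values, and the sample data are assumptions.

```python
import math


def pearson_pairwise(xs, ys):
    """Pearson r using only positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Compute each pair (i, j) once and mirror the result to (j, i)."""
    names = list(columns)
    matrix = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i:]:
            r = pearson_pairwise(columns[a], columns[b])
            matrix[a][b] = r
            matrix[b][a] = r
    return matrix


# Illustrative data with missing entries marked as None.
data = {
    "x": [1.0, 2.0, 3.0, None],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [3.0, 2.0, None, 0.0],
}
corr = correlation_matrix(data)
```

Note that each pair may be computed on a different subset of rows, so the number of observations behind each coefficient can vary.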

tabulate_tableone(vars: list[str], stratify_by: str | None, show_missingness: bool = True, show_htest_name: bool = True, bonferroni_correction: bool = False) TableOne[source]

Generates a tableone for the given variables stratified by the given variable.

Parameters:
  • vars (list[str]) – List of variables to include in the tableone.

  • stratify_by (str | None) – Categorical variable to stratify by.

  • show_missingness (bool) – Default: True. If True, includes missingness information in the table.

  • show_htest_name (bool) – Default: True. If True, includes the name of the hypothesis test in the table.

  • bonferroni_correction (bool) – Default: False. If True, applies Bonferroni correction to the p-values.

Return type:

TableOne

test_categorical_independence(categorical_var_1: str, categorical_var_2: str) StatisticalTestReport[source]

Tests for independence between two categorical variables using the chi-squared test.

Parameters:
  • categorical_var_1 (str) – Name of the first categorical variable.

  • categorical_var_2 (str) – Name of the second categorical variable.

Returns:

A structured report of the statistical test results.

Return type:

StatisticalTestReport

test_equal_means(numeric_var: str, stratify_by: str) StatisticalTestReport[source]

Conducts the appropriate statistical test for equal means between two or more groups (null hypothesis).

Parameters:
  • numeric_var (str) – Numeric variable name to be stratified and compared.

  • stratify_by (str) – Categorical variable name.

Return type:

StatisticalTestReport

test_normality(numeric_var: str, method: Literal['shapiro', 'kstest', 'anderson'] = 'shapiro') StatisticalTestReport[source]

Tests the normality of a numeric variable.

Parameters:
  • numeric_var (str) – Numeric variable name.

  • method (str) – Default: ‘shapiro’. The normality test to use. Options: ‘shapiro’, ‘kstest’, ‘anderson’.

Return type:

StatisticalTestReport

ttest(numeric_var: str, stratify_by: str, strategy: Literal['auto', 'student', 'welch', 'yuen', 'mann-whitney'] = 'welch') StatisticalTestReport[source]

Conducts the appropriate statistical test to test for equal means between two groups. The parameter stratify_by must be the name of a binary variable, i.e. a categorical or numeric variable with exactly two unique values.

Null hypothesis: mu_1 = mu_2. Alternative hypothesis: mu_1 != mu_2. This is a two-sided test.

NaNs in numeric_var and stratify_by are dropped before the test is conducted.

Parameters:
  • numeric_var (str) – Numeric variable name to be stratified and compared.

  • stratify_by (str) – Categorical or numeric variable name. Must be binary.

  • strategy (Literal['auto', 'student', 'welch', 'yuen', 'mann-whitney']) – Default: ‘welch’. If ‘auto’, a test is selected as follows: If the data in either group is not normally distributed, and the variances are not equal, then Yuen’s (20% trimmed mean) t-test is used. If the data in either group is not normally distributed, but the variances are equal, then the Mann-Whitney U test is used. If the data in both groups are normally distributed but the variances are not equal, Welch’s t-test is used. Otherwise, Student’s t-test is used.

Return type:

StatisticalTestReport
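The selection rule described for strategy ‘auto’ can be expressed as a small decision function. The sketch takes the outcomes of the normality and equal-variance checks as booleans, since the reference does not say which checks the library runs; the function name choose_two_group_test is hypothetical.

```python
def choose_two_group_test(both_normal: bool, equal_variances: bool) -> str:
    """Return the strategy name that the 'auto' rule would select."""
    if not both_normal and not equal_variances:
        return "yuen"  # Yuen's (20% trimmed mean) t-test
    if not both_normal:
        return "mann-whitney"  # non-normal data, equal variances
    if not equal_variances:
        return "welch"  # normal data, unequal variances
    return "student"  # normal data, equal variances


selected = {
    (normal, equal): choose_two_group_test(normal, equal)
    for normal in (True, False)
    for equal in (True, False)
}
```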

value_counts(var: str, normalize: bool = False) DataFrame[source]

Returns the value counts for a given categorical variable as a DataFrame, with first column as the unique values and the second column as the counts.

Parameters:
  • var (str) – Categorical variable name.

  • normalize (bool) – Default: False. If True, returns the value counts as proportions.

Return type:

pd.DataFrame
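The effect of normalize can be illustrated with a plain-Python counterpart that returns (value, count) rows. The helper below is a sketch, not the library's code; ordering rows by descending count is an assumption, and the sample values are made up.

```python
from collections import Counter


def value_counts(values, normalize=False):
    """Return (value, count) rows, or (value, proportion) rows if normalize."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = counts.most_common()  # assumption: most frequent value first
    if normalize:
        rows = [(value, count / total) for value, count in rows]
    return rows


observations = ["a", "b", "a", "c", "a", "b"]
raw_rows = value_counts(observations)
proportion_rows = value_counts(observations, normalize=True)
```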

tm._reports.VotingSelectionReport

class tablemage._reports.VotingSelectionReport(selectors: list[BaseFS], dataemitter: DataEmitter, max_n_features: int | None = None, verbose: bool = True)[source]

Class for generating feature selection-relevant tables.

all_features() list[source]

Returns a list of all features considered by the voting selectors.

Returns:

All features.

Return type:

list

top_features() list[source]

Returns a list of top features determined by the voting selectors.

Returns:

Top features.

Return type:

list

votes() DataFrame[source]

Returns a DataFrame that describes the distribution of votes among selectors.

Returns:

Votes DataFrame.

Return type:

pd.DataFrame
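One plausible reading of voting-based selection is that each selector votes for the features it chose and the features with the most votes win, up to max_n_features. The sketch below implements that reading in plain Python; the selector names, feature names, alphabetical tie-breaking, and the function tally_votes are all assumptions, and the library's actual voting rule may differ.

```python
from collections import Counter


def tally_votes(selections, max_n_features):
    """Count one vote per selector per feature and keep the top features."""
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # a selector votes at most once per feature
    # Assumption: ties are broken alphabetically by feature name.
    ranked = sorted(votes, key=lambda feature: (-votes[feature], feature))
    return votes, ranked[:max_n_features]


# Hypothetical selectors and the features each one picked.
selections = {
    "selector_1": ["age", "income", "height"],
    "selector_2": ["age", "income"],
    "selector_3": ["age", "weight"],
}
votes, top = tally_votes(selections, max_n_features=2)
```

In this reading, the vote counts correspond to the information in votes(), the ranked prefix to top_features(), and the full set of keys to all_features().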

tm._reports.StatisticalTestReport

class tablemage._reports.StatisticalTestReport(description: str, statistic: float, pval: float, descriptive_statistic: float | None = None, degfree: float | None = None, statistic_description: str | None = None, descriptive_statistic_description: str | None = None, null_hypothesis_description: str | None = None, alternative_hypothesis_description: str | None = None, assumptions_description: str | list | None = None, long_description: str | None = None)[source]

Class for storing and displaying statistical test results.

pval() float[source]

Returns the p-value.

statistic() float[source]

Returns the statistic.

tm._reports.CausalReport

class tablemage._reports.CausalReport(estimate: float, se: float, n_units: int, n_units_treated: int, outcome_var: str, treatment_var: str, confounders: list[str], estimand: str, method: str, method_description: str, p_value: float | None = None)[source]

Class for storing and displaying causal inference results.

effect()[source]

Returns the estimate of the causal effect.

n_units()[source]

Returns the number of units in the data.

pval()[source]

Returns the p-value of the estimator.

se()[source]

Returns the standard error of the estimator.