Analyzer `(tm.Analyzer)`

tablemage.Analyzer is a high-level class that provides low-code tools for data analysis.

class tablemage.Analyzer(df: DataFrame, df_test: DataFrame | None = None, test_size: float = 0.0, split_seed: int = 42, id_column: str | None = None, verbose: bool = True, name: str = 'Unnamed Dataset')[source]

Analyzer is the high-level interface class of TableMage.

An Analyzer object can be initialized from a single DataFrame which is then split into train and test DataFrames, or, alternatively, from pre-split train and test DataFrames. The object can then be used to conduct a variety of analyses, including exploratory data analysis (the eda() method), regression analysis (ols() and logit() methods), and machine learning modeling (classify() and regress() methods).

The Analyzer object also handles data preprocessing tasks, such as scaling, imputing missing values, dropping rows with missing values, one-hot encoding, and selecting variables. These methods can be chained together for easy data transformation. The Analyzer object remembers how the data was transformed, enabling proper fitting and transforming of cross validation splits of the train dataset.

__init__(df: DataFrame, df_test: DataFrame | None = None, test_size: float = 0.0, split_seed: int = 42, id_column: str | None = None, verbose: bool = True, name: str = 'Unnamed Dataset')[source]

Initializes an Analyzer object.

Parameters:

df (pd.DataFrame) – The DataFrame to be analyzed. Must be in wide format, i.e. with shape (n_units, n_vars). If df_test is provided, then df is treated as the train DataFrame. Otherwise, df is split into train and test DataFrames according to the test_size parameter.
df_test (pd.DataFrame | None) – Default: None. If not None, then treats df as the train DataFrame.
test_size (float) – Default: 0. Proportion of the DataFrame to withhold for testing. If test_size = 0, then the train DataFrame and the test DataFrame will both be the same as the input df. If df_test is provided, then test_size is ignored.
id_column (str | None) – Default: None. The name of the column containing unique identifiers. If not None, then the column will be set as the index of the DataFrame. If None, then the input index will be used as the index of the DataFrame.
split_seed (int) – Default: 42. Used only for the train test split. If df_test is provided, then split_seed is ignored.
verbose (bool) – Default: True. If True, prints helpful update messages for certain Analyzer function calls.
name (str) – Default: ‘Unnamed Dataset’. Name of the dataset the Analyzer is initialized for.
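
The interaction between df, df_test, and test_size described above can be sketched in plain Python. This only illustrates the documented rules and is not TableMage's implementation: the helper name resolve_split, the use of lists in place of DataFrames, and the shuffle-based split are all assumptions.

```python
import random


def resolve_split(rows, rows_test=None, test_size=0.0, split_seed=42):
    """Return (train, test) following the documented precedence rules.

    Hypothetical helper: lists stand in for DataFrames.
    """
    # Case 1: a pre-split test set is given, so test_size and split_seed are ignored.
    if rows_test is not None:
        return list(rows), list(rows_test)
    # Case 2: test_size = 0, so train and test are both the full input.
    if test_size == 0:
        return list(rows), list(rows)
    # Case 3: withhold a proportion of the rows for testing (assumed: seeded shuffle).
    shuffled = list(rows)
    random.Random(split_seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = resolve_split(list(range(10)), test_size=0.2)
print(len(train), len(test))  # 8 2
```

With an Analyzer, the three cases correspond to passing df_test, leaving test_size at its default of 0, or setting test_size to a proportion such as 0.2.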

categorical_vars() → list[str][source]

Returns the categorical variables in the working train DataFrame.

Returns: The categorical variables.
Return type: list[str]

causal(treatment: str, outcome: str, confounders: list[str], dataset: Literal['train', 'test', 'all'] = 'all') → CausalModel[source]

Returns a CausalModel object for estimating causal effects. The CausalModel object contains rudimentary methods for causal effect estimation (weighted least squares, IPW estimator).

Parameters:

treatment (str) – The treatment variable. Must be binary numeric (0 or 1-valued).
outcome (str) – The outcome variable.
confounders (list[str]) – The confounding variables.
dataset (Literal['train', 'test', 'all']) – The dataset to be analyzed. By default, analyzes all data.

Returns:

The CausalModel object contains methods for estimating causal effects.

Return type:

CausalModel

classify(models: list[BaseC], target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42) → MLClassificationReport[source]

Conducts a comprehensive classification ML model benchmarking exercise. Observations with missing data will be dropped.

Parameters:

models (list[BaseC]) – Models to be evaluated.
target (str) – The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, uses all variables except target as predictors.
feature_selectors (list[BaseFSC] | None) – Default: None. The feature selectors for voting selection. Feature selectors can be used to select the most important predictors. Feature selectors can also be specified at the model level. If specified here, the same feature selectors will be used for all models.
max_n_features (int | None) – Default: None. Maximum number of predictors to utilize. Ignored if feature_selectors is None. If None, then all features with at least 50% support are selected.
outer_cv (int | None) – Default: None. If not None, reports training scores via nested k-fold CV.
outer_cv_seed (int) – Default: 42. The random seed for the outer cross validation loop.

Return type:

MLClassificationReport

cluster(models: list[BaseClust], features: list[str] | None = None, dataset: Literal['train', 'all'] = 'all') → ClusterReport[source]

Conducts a clustering exercise.

Parameters:

models (list[BaseClust]) – Models to be evaluated.
features (list[str] | None) – Default: None. The features to cluster on. If None, uses all the variables.
dataset (Literal['train', 'all']) – Dataset to fit models on. If “train”, only fits models on training data. Then, cluster predictions can be made on test data. If “all”, fits models on all data. By default, fits models on all data.

df_all() → DataFrame[source]

Returns the working DataFrame.

Returns: The working DataFrame.
Return type: pd.DataFrame

df_test() → DataFrame[source]

Returns the working test DataFrame.

Returns: The working test DataFrame.
Return type: pd.DataFrame

df_train() → DataFrame[source]

Returns the working train DataFrame.

Returns: The working train DataFrame.
Return type: pd.DataFrame

drop_highly_missing_vars(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, threshold: float = 0.5) → Analyzer[source]

Drops variables (columns) with missingness rate above a specified threshold.

Parameters:

include_vars (list[str] | None) – Default: None. If not None, only the specified variables are considered for dropping. Otherwise, all variables are considered. A considered variable is dropped if its proportion of missing values exceeds threshold.
exclude_vars (list[str] | None) – Default: None. If not None, excludes the specified variables from the list of variables to drop (which is set to all variables by default).
threshold (float) – Default: 0.5. Proportion of missing values above which a column is dropped. For example, if threshold = 0.2, then columns with more than 20% missing values are dropped.

Returns:

Returns self for method chaining.

Return type:

Analyzer
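
The threshold rule can be made concrete with a small stand-alone sketch. It assumes a plain dict of columns with None marking missing values; the function name highly_missing_vars is hypothetical and this is not the library's code.

```python
def highly_missing_vars(columns, threshold=0.5, include_vars=None, exclude_vars=None):
    """Return names of columns whose proportion of missing values exceeds threshold."""
    # Start from all columns unless a subset is requested.
    candidates = list(columns) if include_vars is None else list(include_vars)
    if exclude_vars is not None:
        candidates = [name for name in candidates if name not in exclude_vars]
    to_drop = []
    for name in candidates:
        values = columns[name]
        missing_rate = sum(v is None for v in values) / len(values)
        # Strictly "more than" the threshold, as in the parameter description.
        if missing_rate > threshold:
            to_drop.append(name)
    return to_drop


columns = {
    "age": [31, None, 45, 52, 28],              # 20% missing
    "income": [None, None, None, 40.0, None],   # 80% missing
    "city": ["A", None, None, "B", "C"],        # 40% missing
}
print(highly_missing_vars(columns))                 # ['income']
print(highly_missing_vars(columns, threshold=0.3))  # ['income', 'city']
```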

dropna(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None) → Analyzer[source]

Drops observations (rows) with missing values on both the train and test DataFrames.

Parameters:

include_vars (list[str] | None) – Default: None. List of columns to check for missing values when dropping rows. If None, all columns are checked.
exclude_vars (list[str] | None) – Default: None. List of columns to ignore when dropping rows with missing values. If None, no variables are excluded.

Returns:

Returns self for method chaining.

Return type:

Analyzer

eda(dataset: Literal['train', 'test', 'all'] = 'all') → EDAReport[source]

Constructs an EDAReport object for the working train dataset, the working test dataset, or both datasets combined.

Parameters: dataset (Literal['train', 'test', 'all']) – The dataset to be analyzed. By default, analyzes all data.
Returns: The EDAReport object contains a variety of exploratory data analysis methods, including summary statistics for numeric and categorical variables, t-tests, and data visualizations.
Return type: EDAReport

engineer_categorical_var(name: str, numeric_var: str, level_names: list[str], thresholds: list[float], leq: bool = False) → Analyzer[source]

Engineers a new categorical variable/feature based on a list of thresholds.

Parameters:

name (str) – The name of the new variable engineered.
numeric_var (str) – The name of the numeric variable.
level_names (list[str]) – The names of the levels of the new categorical variable. The first level is the lowest level, and the last level is the highest level.
thresholds (list[float]) – The (upper) thresholds for the levels of the new categorical variable. The thresholds must be in ascending order. For example, if thresholds = [0, 10, 20] and level_names = [“Low”, “Medium”, “High”, “Very High”], then the new variable will have the following levels:
- “Low” for values less than 0,
- “Medium” for other values less than 10,
- “High” for other values less than 20,
- “Very High” for values greater than or equal to 20.
leq (bool) – Default: False. If True, the thresholds are inclusive.

Returns:

Returns self for method chaining.

Return type:

Analyzer
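
The threshold-to-level mapping, including the effect of leq, can be reproduced with the standard bisect module. This is a sketch of the documented semantics only; assign_level is a hypothetical helper, not a TableMage function.

```python
from bisect import bisect_left, bisect_right


def assign_level(value, level_names, thresholds, leq=False):
    """Map a numeric value to a level name using ascending upper thresholds."""
    if len(level_names) != len(thresholds) + 1:
        raise ValueError("level_names must have one more entry than thresholds")
    # leq=False: a level covers values strictly less than its threshold.
    # leq=True: a level also includes values equal to its threshold.
    pick = bisect_left if leq else bisect_right
    return level_names[pick(thresholds, value)]


levels = ["Low", "Medium", "High", "Very High"]
cuts = [0, 10, 20]
print([assign_level(v, levels, cuts) for v in [-5, 0, 9.9, 10, 20, 35]])
# ['Low', 'Medium', 'Medium', 'High', 'Very High', 'Very High']
print([assign_level(v, levels, cuts, leq=True) for v in [-5, 0, 9.9, 10, 20, 35]])
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'Very High']
```

The equivalent Analyzer call for the first case would be analyzer.engineer_categorical_var(name, numeric_var, level_names=levels, thresholds=cuts), with name and numeric_var chosen for the dataset at hand.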

engineer_numeric_var(name: str, formula: str) → Analyzer[source]

Engineers a new variable/feature based on a formula. The formula can only involve numeric variables. Creates another numeric variable.

Parameters:

name (str) – The name of the new variable engineered.
formula (str) – Formula for the new feature. For example, “x1 + x2” would create a new feature that is the sum of the columns x1 and x2 in the DataFrame. All variables used must be numeric. Handles the following operations:
- Addition (+)
- Subtraction (-)
- Multiplication (*)
- Division (/)
- Parentheses ()
- Exponentiation (**)
- Logarithm (log)
- Exponential (exp)
- Square root (sqrt)

If the i-th unit is missing a value in any of the variables used in the formula, then the i-th unit of the new feature will be missing.

Examples

>>> analyzer = analyzer.engineer_numeric_var("x3", "x1 + x2")
>>> "x3" in analyzer.datahandler.vars()
True
>>> analyzer.datahandler.df_train()["x3"].equals(
...     analyzer.datahandler.df_train()["x1"] + analyzer.datahandler.df_train()["x2"]
... )
True

Returns: Returns self for method chaining.
Return type: Analyzer
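
The missing-value rule for formulas (a unit missing any input yields a missing result) can be illustrated without the library. The sketch below uses dict rows with None for missing values and Python's eval with a restricted namespace for brevity; evaluate_formula and the used_vars argument are hypothetical, and how TableMage actually parses formulas is not stated here.

```python
import math

# Only the named functions listed in the formula description are exposed.
SAFE_FUNCS = {"log": math.log, "exp": math.exp, "sqrt": math.sqrt}


def evaluate_formula(rows, formula, used_vars):
    """Evaluate formula for each row; the result is None if any input is missing."""
    results = []
    for row in rows:
        if any(row[name] is None for name in used_vars):
            results.append(None)  # missingness propagates to the new feature
            continue
        namespace = dict(SAFE_FUNCS)
        namespace.update({name: row[name] for name in used_vars})
        # Builtins are disabled so only variables and SAFE_FUNCS are visible.
        results.append(eval(formula, {"__builtins__": {}}, namespace))
    return results


rows = [
    {"x1": 1.0, "x2": 2.0},
    {"x1": None, "x2": 5.0},
    {"x1": 4.0, "x2": 3.0},
]
print(evaluate_formula(rows, "x1 + x2", ["x1", "x2"]))             # [3.0, None, 7.0]
print(evaluate_formula(rows, "sqrt(x1) * x2 ** 2", ["x1", "x2"]))  # [4.0, None, 18.0]
```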

force_binary(var: str, pos_label: str | None = None, ignore_multiclass: bool = True, rename: bool = True) → Analyzer[source]

Forces a variable to be binary (a 0 and 1 valued numeric variable). Does nothing if the data contains more than two classes unless ignore_multiclass is True and pos_label is specified, in which case all classes except pos_label are labeled with zero.

Parameters:

var (str) – Name of variable to force to binary.
pos_label (str | None) – Default: None. The positive label. If None, the most common class is labeled as the positive class.
ignore_multiclass (bool) – Default: True. If True, all classes except pos_label are labeled with zero. Otherwise raises ValueError.
rename (bool) – Default: True. If True, the variable is renamed to {var}::{pos_label}.

Returns:

Returns self for method chaining.

Return type:

Analyzer

force_categorical(vars: list[str]) → Analyzer[source]

Forces specified variables (columns) to have categorical values. That is, the variables’ values are converted to strings.

Parameters: vars (list[str]) – Names of variables to force to categorical.
Returns: Returns self for method chaining.
Return type: Analyzer

force_numeric(vars: list[str]) → Analyzer[source]

Forces specified variables to numeric (float).

Parameters: vars (list[str]) – Names of variables to force to numeric.
Returns: Returns self for method chaining.
Return type: Analyzer

impute(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, numeric_strategy: Literal['median', 'mean', '5nn', '10nn'] = 'median', categorical_strategy: Literal['most_frequent', 'missing'] = 'most_frequent') → Analyzer[source]

Imputes missing values. The imputer is fit on the train DataFrame and transforms both train and test DataFrames.

Parameters:

include_vars (list[str] | None) – Default: None. List of variables in which to impute missing values. If None, imputes missing values in all columns.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from imputing missing values. If None, no variables are excluded.
numeric_strategy (Literal['median', 'mean', '5nn', '10nn']) – Default: ‘median’. Strategy for imputing missing values in numeric variables:
- ‘median’: impute with median.
- ‘mean’: impute with mean.
- ‘5nn’: impute with 5-nearest neighbors.
- ‘10nn’: impute with 10-nearest neighbors.
categorical_strategy (Literal['most_frequent', 'missing']) – Default: ‘most_frequent’. Strategy for imputing missing values in categorical variables:
- ‘most_frequent’: impute with most frequent value.
- ‘missing’: impute with ‘missing’.

Returns:

Returns self for method chaining.

Return type:

Analyzer
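
Fitting on the train data and then transforming both sets means the fill value for the test data comes from the train data only. A minimal sketch of that pattern for the ‘median’ strategy, using dicts of lists with None for missing values (the helper names are hypothetical, and this is not the library's code):

```python
from statistics import median


def fit_median_imputer(train_columns):
    """Learn one fill value per column from the non-missing train values."""
    return {
        name: median(v for v in values if v is not None)
        for name, values in train_columns.items()
    }


def apply_imputer(columns, fill_values):
    """Replace missing values using fill values learned on the train data."""
    return {
        name: [fill_values[name] if v is None else v for v in values]
        for name, values in columns.items()
    }


train = {"age": [20, None, 40, 30]}
test = {"age": [None, 90]}

fill = fit_median_imputer(train)
print(fill)                        # {'age': 30}
print(apply_imputer(train, fill))  # {'age': [20, 30, 40, 30]}
print(apply_imputer(test, fill))   # {'age': [30, 90]}
```

Note that the test value 90 has no influence on the fill value, which is the point of fitting on the train DataFrame alone.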

load_data_checkpoint(checkpoint_name: str | None = None) → Analyzer[source]

Loads a saved checkpoint of the train and test DataFrames. If no checkpoint name is given, loads the original train and test DataFrames.

Parameters: checkpoint_name (str | None) – Default: None. The name of the checkpoint to load. If None, loads the original train and test DataFrames.
Returns: Returns self for method chaining.
Return type: Analyzer

logit(target: str | None = None, predictors: list[str] | None = None, alpha: float = 0.0, l1_weight: float = 0.0, threshold_strategy: Literal['f1', 'roc'] | None = None) → LogitReport | MNLogitReport[source]

Performs logistic regression. Units with missing data will be dropped.

Parameters:

target (str | None) – Default: None. The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, all variables except target will be used as predictors.
alpha (float) – Default: 0. Regularization strength. Must be a non-negative float.
l1_weight (float) – Default: 0. The weight of the L1 penalty. Must be a float between 0 and 1.
threshold_strategy (Literal['f1', 'roc'] | None) – Default: None. The strategy for determining the threshold for binary classification. If None, the threshold is set to 0.5.

Returns:

The appropriate regression report object is returned.

Return type:

LogitReport | MNLogitReport

numeric_vars() → list[str][source]

Returns the numeric variables in the working train DataFrame.

Returns: The numeric variables.
Return type: list[str]

ols(target: str | None = None, predictors: list[str] | None = None, alpha: float = 0.0, l1_weight: float = 0.0) → OLSReport[source]

Performs OLS regression. Units with missing data will be dropped.

Parameters:

target (str | None) – Default: None. The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, all variables except target will be used as predictors.
alpha (float) – Default: 0. Regularization strength. Must be a non-negative float.
l1_weight (float) – Default: 0. The weight of the L1 penalty. Must be a float between 0 and 1.

Returns:

The OLSReport object contains a variety of OLS regression methods, including summary statistics, model coefficients, and data visualizations.

Return type:

OLSReport

onehot(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, dropfirst: bool = True, keep_original: bool = False) → Analyzer[source]

One-hot encodes the specified variables (columns).

Parameters:

include_vars (list[str] | None) – Default: None. List of variables to one-hot encode. If None, one-hot encodes all categorical variables.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from one-hot encoding. If None, no variables are excluded.
dropfirst (bool) – Default: True. If True, drops the first one-hot encoded column.
keep_original (bool) – Default: False. If True, keeps the original variables in the DataFrame.

Returns:

Returns self for method chaining.

Return type:

Analyzer
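
The effect of dropfirst is easiest to see on a small column. The sketch below is a plain-Python illustration, not the library's encoder: the helper one_hot, the alphabetical level ordering, and the "{var}::{level}" column naming are all assumptions made for the example.

```python
def one_hot(values, var, dropfirst=True):
    """One-hot encode a list of category labels into 0/1 indicator columns."""
    levels = sorted(set(values))  # assumed ordering: alphabetical
    if dropfirst:
        levels = levels[1:]  # the first level becomes the all-zeros baseline
    return {
        f"{var}::{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


colors = ["red", "blue", "green", "blue"]
print(one_hot(colors, "color"))
# {'color::green': [0, 0, 1, 0], 'color::red': [1, 0, 0, 0]}
print(list(one_hot(colors, "color", dropfirst=False)))
# ['color::blue', 'color::green', 'color::red']
```

Dropping one level avoids perfectly collinear indicator columns, which matters for regression models that include an intercept.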

regress(models: list[BaseR], target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42) → MLRegressionReport[source]

Conducts a comprehensive regression ML model benchmarking exercise. Observations with missing data will be dropped.

Parameters:

models (list[BaseR]) – Models to be evaluated.
target (str) – The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, uses all variables except target as predictors.
feature_selectors (list[BaseFSR] | None) – Default: None. The feature selectors for voting selection. Feature selectors can be used to select the most important predictors. Feature selectors can also be specified at the model level. If specified here, the same feature selectors will be used for all models.
max_n_features (int | None) – Default: None. Maximum number of predictors to utilize. Ignored if feature_selectors is None. If None, then all features with at least 50% support are selected.
outer_cv (int | None) – Default: None. If not None, reports training scores via nested k-fold CV.
outer_cv_seed (int) – Default: 42. The random seed for the outer cross validation loop.

Return type:

MLRegressionReport

remove_data_checkpoint(checkpoint_name: str) → Analyzer[source]

Deletes a saved checkpoint.

Parameters: checkpoint_name (str) – The name of the checkpoint to delete.
Returns: Returns self for method chaining.
Return type: Analyzer

save_data_checkpoint(checkpoint_name: str) → Analyzer[source]

Saves the current train and test DataFrames.

Parameters: checkpoint_name (str) – The name of the checkpoint.
Returns: Returns self for method chaining.
Return type: Analyzer

scale(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, strategy: Literal['standardize', 'minmax', 'log', 'log1p', 'robust_standardize', 'normal_quantile', 'uniform_quantile'] = 'standardize') → Analyzer[source]

Scales the variables.

Parameters:

include_vars (list[str] | None) – Default: None. List of variables to scale. If None, scales values in all columns.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from scaling. If None, no variables are excluded.
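strategy (Literal['standardize', 'minmax', 'log', 'log1p', 'robust_standardize', 'normal_quantile', 'uniform_quantile']) – Default: ‘standardize’. The scaling strategy.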

Returns:

Returns self for method chaining.

Return type:

Analyzer

select_features(target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSR] | list[BaseFSC] | None = None, max_n_features: int | None = None) → VotingSelectionReport[source]

Selects the most important features using a variety of feature selection methods. The feature selection methods can be used to select the most important predictors for regression or classification.

Parameters:

target (str) – The target variable.
predictors (list[str] | None) – Default: None. The predictors to select from. If None, uses all variables except the target as predictors.
feature_selectors (list[BaseFSR] | list[BaseFSC] | None) – Default: None. The feature selection methods to use. If None, uses all feature selection methods.
max_n_features (int | None) – Default: None. Maximum number of features to select. If None, then all features with at least 50% support are selected.

Returns:

Report object containing the results of the feature selection methods.

Return type:

VotingSelectionReport
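
The "at least 50% support" rule can be read as a vote: each feature selector nominates features, and a feature is kept if at least half of the selectors chose it. The sketch below illustrates that reading. The function voting_selection and the selector names are hypothetical, and ranking by vote count when max_n_features is set is an assumption rather than documented behavior.

```python
from collections import Counter


def voting_selection(votes_by_selector, max_n_features=None):
    """Combine per-selector feature choices into one selected feature list."""
    n_selectors = len(votes_by_selector)
    # Count how many selectors chose each feature (one vote per selector).
    counts = Counter(
        feature for chosen in votes_by_selector.values() for feature in set(chosen)
    )
    if max_n_features is None:
        # Keep every feature supported by at least 50% of the selectors.
        return sorted(f for f, c in counts.items() if c / n_selectors >= 0.5)
    # Assumed behavior: keep the top features by vote count, ties broken by name.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return ranked[:max_n_features]


votes = {
    "selector_a": ["x1", "x2", "x3"],
    "selector_b": ["x1", "x3"],
    "selector_c": ["x1", "x4"],
    "selector_d": ["x2", "x1"],
}
print(voting_selection(votes))                    # ['x1', 'x2', 'x3']
print(voting_selection(votes, max_n_features=2))  # ['x1', 'x2']
```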

select_vars(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None) → Analyzer[source]

Selects the specified variables.

Parameters:

include_vars (list[str] | None) – Default: None. List of variables to include. If None, includes all variables.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude. If None, no variables are excluded.

Returns:

Returns self for method chaining.

Return type:

Analyzer

shape(dataset: Literal['train', 'test']) → tuple[int, int][source]

Returns the shape of the working train or test DataFrame.

Parameters: dataset (Literal['train', 'test']) – The dataset to get the shape of.
Returns: The shape of the specified working DataFrame.
Return type: tuple[int, int]

value_counts(var: str, dataset: Literal['train', 'test', 'both'] = 'both') → Series[source]

Returns the value counts of a variable in the working train DataFrame, the working test DataFrame, or both.

Parameters:

var (str) – The variable to get the value counts of.
dataset (Literal['train', 'test', 'both']) – Default: ‘both’. The dataset to get the value counts of.

Returns:

The value counts of the variable.

Return type:

pd.Series

vars() → list[str][source]

Returns the variables in the working train DataFrame.

Returns: The variables.
Return type: list[str]
