Analyzer (tm.Analyzer)
tablemage.Analyzer is a high-level class that provides low-code tools for data analysis.
- class tablemage.Analyzer(df: DataFrame, df_test: DataFrame | None = None, test_size: float = 0.0, split_seed: int = 42, id_column: str | None = None, verbose: bool = True, name: str = 'Unnamed Dataset')
Analyzer is the high-level interface class of TableMage.
An Analyzer object can be initialized from a single DataFrame which is then split into train and test DataFrames, or, alternatively, from pre-split train and test DataFrames. The object can then be used to conduct a variety of analyses, including exploratory data analysis (the eda() method), regression analysis (ols() and logit() methods), and machine learning modeling (classify() and regress() methods).
The Analyzer object also handles data preprocessing tasks, such as scaling, imputing missing values, dropping rows with missing values, one-hot encoding, and selecting variables. These methods can be chained together for easy data transformation. The Analyzer object remembers how the data was transformed, enabling proper fitting and transforming of cross validation splits of the train dataset.
- __init__(df: DataFrame, df_test: DataFrame | None = None, test_size: float = 0.0, split_seed: int = 42, id_column: str | None = None, verbose: bool = True, name: str = 'Unnamed Dataset')
Initializes an Analyzer object.
- Parameters:
df (pd.DataFrame) – The DataFrame to be analyzed. Must be in wide format, i.e. with shape (n_units, n_vars). If df_test is provided, then df is treated as the train DataFrame. Otherwise, df is split into train and test DataFrames according to the test_size parameter.
df_test (pd.DataFrame | None) – Default: None. If not None, df_test is used as the test DataFrame and df is treated as the train DataFrame.
test_size (float) – Default: 0. Proportion of the DataFrame to withhold for testing. If test_size = 0, then the train DataFrame and the test DataFrame will both be the same as the input df. If df_test is provided, then test_size is ignored.
id_column (str | None) – Default: None. The name of the column containing unique identifiers. If not None, then the column will be set as the index of the DataFrame. If None, then the input index will be used as the index of the DataFrame.
split_seed (int) – Default: 42. Used only for the train test split. If df_test is provided, then split_seed is ignored.
verbose (bool) – Default: True. If True, prints helpful update messages for certain Analyzer function calls.
name (str) – Default: ‘Unnamed Dataset’. Name of the dataset the Analyzer is initialized for.
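The interaction between df, df_test, test_size, and split_seed described above can be sketched in plain Python. This is only an illustration on a list of rows, not TableMage's implementation: the helper name split_rows is hypothetical, and the shuffling and rounding used here are assumptions.

```python
import random


def split_rows(rows, rows_test=None, test_size=0.0, split_seed=42):
    """Hypothetical helper mirroring the documented df / df_test / test_size rules."""
    if rows_test is not None:
        # Pre-split data: test_size and split_seed are ignored.
        return list(rows), list(rows_test)
    if test_size == 0:
        # No hold-out: train and test are both the full input.
        return list(rows), list(rows)
    shuffled = list(rows)
    random.Random(split_seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_size)  # assumption: simple rounding
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = split_rows(rows, test_size=0.2)
```

Calling split_rows again with the same split_seed yields the same split, which is the purpose of the seed.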
- categorical_vars() -> list[str]
Returns the categorical variables in the working train DataFrame.
- Returns:
The categorical variables.
- Return type:
list[str]
- causal(treatment: str, outcome: str, confounders: list[str], dataset: Literal['train', 'test', 'all'] = 'all') -> CausalModel
Returns a CausalModel object for estimating causal effects. The CausalModel object contains rudimentary methods for causal effect estimation (weighted least squares, IPW estimator).
- Parameters:
treatment (str) – The treatment variable. Must be binary numeric (0 or 1-valued).
outcome (str) – The outcome variable.
confounders (list[str]) – The confounding variables.
dataset (Literal['train', 'test', 'all']) – The dataset to be analyzed. By default, analyzes all data.
- Returns:
The CausalModel object contains methods for estimating causal effects.
- Return type:
CausalModel
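As a rough illustration of the IPW idea mentioned above, the following sketch computes the standard (unnormalized) inverse-probability-weighted estimate of the average treatment effect. The function name ipw_ate is hypothetical, the propensity scores are supplied directly rather than estimated from confounders, and the source does not state which IPW variant CausalModel implements.

```python
def ipw_ate(treatment, outcome, propensity):
    """Inverse-probability-weighted estimate of the average treatment effect.

    treatment: 0/1 values; outcome: numeric values;
    propensity: estimated P(treatment = 1 | confounders) for each unit.
    """
    n = len(treatment)
    treated = sum(t * y / e for t, y, e in zip(treatment, outcome, propensity))
    control = sum((1 - t) * y / (1 - e) for t, y, e in zip(treatment, outcome, propensity))
    return (treated - control) / n


# Toy data with known propensities (in practice these are estimated from the confounders).
treatment = [1, 0, 1, 0]
outcome = [3.0, 1.0, 5.0, 2.0]
propensity = [0.5, 0.5, 0.5, 0.5]
ate = ipw_ate(treatment, outcome, propensity)
```

With a constant propensity of 0.5 and equal group sizes, the estimate reduces to the difference in group means.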
- classify(models: list[BaseC], target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSC] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42) -> MLClassificationReport
Conducts a comprehensive classification ML model benchmarking exercise. Observations with missing data will be dropped.
- Parameters:
models (list[BaseC]) – Models to be evaluated.
target (str) – The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, uses all variables except target as predictors.
feature_selectors (list[BaseFSC] | None) – Default: None. The feature selectors for voting selection. Feature selectors can be used to select the most important predictors. Feature selectors can also be specified at the model level. If specified here, the same feature selectors will be used for all models.
max_n_features (int | None) – Default: None. Maximum number of predictors to utilize. Ignored if feature_selectors is None. If None, then all features with at least 50% support are selected.
outer_cv (int | None) – Default: None. If not None, reports training scores via nested k-fold CV.
outer_cv_seed (int) – Default: 42. The random seed for the outer cross validation loop.
- Return type:
MLClassificationReport
- cluster(models: list[BaseClust], features: list[str] | None = None, dataset: Literal['train', 'all'] = 'all') -> ClusterReport
Conducts a clustering exercise.
- Parameters:
models (list[BaseClust]) – Models to be evaluated.
features (list[str] | None) – Default: None. The features to cluster on. If None, uses all the variables.
dataset (Literal['train', 'all']) – Dataset to fit models on. If “train”, only fits models on training data. Then, cluster predictions can be made on test data. If “all”, fits models on all data. By default, fits models on all data.
- df_all() -> DataFrame
Returns the working DataFrame.
- Returns:
The working DataFrame.
- Return type:
pd.DataFrame
- df_test() -> DataFrame
Returns the working test DataFrame.
- Returns:
The working test DataFrame.
- Return type:
pd.DataFrame
- df_train() -> DataFrame
Returns the working train DataFrame.
- Returns:
The working train DataFrame.
- Return type:
pd.DataFrame
- drop_highly_missing_vars(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, threshold: float = 0.5) -> Analyzer
Drops variables (columns) with missingness rate above a specified threshold.
- Parameters:
include_vars (list[str] | None) – Default: None. If not None, only the specified variables are considered for dropping. Otherwise, all variables are considered.
exclude_vars (list[str] | None) – Default: None. If not None, excludes the specified variables from the list of variables to drop (which is set to all variables by default).
threshold (float) – Default: 0.5. Proportion of missing values above which a column is dropped. For example, if threshold = 0.2, then columns with more than 20% missing values are dropped.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
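The threshold rule above can be sketched in plain Python, with each column held as a list and None standing in for a missing value. This is an illustration only; the helper name highly_missing_vars is hypothetical, and the strict "greater than" comparison follows the wording "more than 20% missing values".

```python
def highly_missing_vars(columns, threshold=0.5, include_vars=None, exclude_vars=None):
    """Return names of columns whose share of missing (None) values exceeds threshold."""
    candidates = list(columns) if include_vars is None else list(include_vars)
    if exclude_vars is not None:
        candidates = [name for name in candidates if name not in exclude_vars]
    to_drop = []
    for name in candidates:
        values = columns[name]
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate > threshold:  # strictly above the threshold
            to_drop.append(name)
    return to_drop


columns = {
    "age": [31, None, 45, 52],             # 25% missing
    "income": [None, None, None, 40],      # 75% missing
    "city": [None, "Oslo", None, "Rome"],  # 50% missing
}
```

With the default threshold of 0.5, only "income" is flagged here; "city" sits exactly at 50% and is kept under this reading.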
- dropna(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None) -> Analyzer
Drops observations (rows) with missing values from both the train and test DataFrames.
- Parameters:
include_vars (list[str] | None) – Default: None. List of columns along which to drop rows with missing values. If None, drops rows with missing values in all columns.
exclude_vars (list[str] | None) – Default: None. List of columns to ignore when checking rows for missing values. If None, no variables are excluded.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- eda(dataset: Literal['train', 'test', 'all'] = 'all') -> EDAReport
Constructs an EDAReport object for the working train dataset, the working test dataset, or both datasets combined.
- Parameters:
dataset (Literal['train', 'test', 'all']) – The dataset to be analyzed. By default, analyzes all data.
- Returns:
The EDAReport object contains a variety of exploratory data analysis methods, including summary statistics for numeric and categorical variables, t-tests, and data visualizations.
- Return type:
EDAReport
- engineer_categorical_var(name: str, numeric_var: str, level_names: list[str], thresholds: list[float], leq: bool = False) -> Analyzer
Engineers a new categorical variable/feature based on a list of thresholds.
- Parameters:
name (str) – The name of the new variable engineered.
numeric_var (str) – The name of the numeric variable.
level_names (list[str]) – The names of the levels of the new categorical variable. The first level is the lowest level, and the last level is the highest level.
thresholds (list[float]) – The (upper) thresholds for the levels of the new categorical variable. The thresholds must be in ascending order. For example, if thresholds = [0, 10, 20] and level_names = [“Low”, “Medium”, “High”, “Very High”], then the new variable takes the value “Low” for values less than 0, “Medium” for other values less than 10, “High” for other values less than 20, and “Very High” for values greater than or equal to 20.
leq (bool) – Default: False. If True, the thresholds are inclusive, i.e. a value equal to a threshold is assigned to the level below that threshold rather than the level above it.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
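The threshold-to-level mapping can be sketched with the standard library's bisect module. This is an illustration of the documented example, not TableMage's code: the helper name assign_level is hypothetical, and the treatment of leq=True (a value equal to a threshold stays in the lower level) is an assumed reading of "inclusive".

```python
from bisect import bisect_left, bisect_right


def assign_level(value, level_names, thresholds, leq=False):
    """Map a numeric value to a level using ascending upper thresholds."""
    if len(level_names) != len(thresholds) + 1:
        raise ValueError("need exactly one more level name than thresholds")
    # leq=False: a value equal to a threshold falls in the next level up.
    # leq=True: a value equal to a threshold stays in the lower level.
    index = bisect_left(thresholds, value) if leq else bisect_right(thresholds, value)
    return level_names[index]


levels = ["Low", "Medium", "High", "Very High"]
thresholds = [0, 10, 20]
```

For instance, assign_level(10, levels, thresholds) gives "High", while assign_level(10, levels, thresholds, leq=True) gives "Medium".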
- engineer_numeric_var(name: str, formula: str) -> Analyzer
Engineers a new variable/feature based on a formula. The formula can only involve numeric variables. Creates another numeric variable.
- Parameters:
name (str) – The name of the new variable engineered.
formula (str) –
Formula for the new feature. For example, “x1 + x2” would create a new feature that is the sum of the columns x1 and x2 in the DataFrame. All variables used must be numeric. Handles the following operations: addition (+), subtraction (-), multiplication (*), division (/), parentheses (), exponentiation (**), logarithm (log), exponential (exp), and square root (sqrt).
If the i-th unit is missing a value in any of the variables used in the formula, then the i-th unit of the new feature will be missing.
Examples
>>> analyzer = analyzer.engineer_numeric_var("x3", "x1 + x2")
>>> assert "x3" in analyzer.datahandler.vars()
>>> assert analyzer.datahandler.df_train()["x3"].equals(
...     analyzer.datahandler.df_train()["x1"] + analyzer.datahandler.df_train()["x2"]
... )
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
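The missing-value rule for formulas (a unit missing any input gets a missing result) can be sketched as follows. This is a simplified stand-in, not TableMage's parser: the helper name evaluate_formula is hypothetical, rows are plain dicts with None marking a missing value, and eval is used only for brevity and is not a secure sandbox.

```python
import ast
import math

ALLOWED_FUNCS = {"log": math.log, "exp": math.exp, "sqrt": math.sqrt}


def evaluate_formula(formula, rows):
    """Evaluate a formula for each row (a dict); return None where an input is missing."""
    tree = ast.parse(formula, mode="eval")
    # Names in the formula that are not functions are treated as variables.
    used_vars = {
        node.id for node in ast.walk(tree)
        if isinstance(node, ast.Name) and node.id not in ALLOWED_FUNCS
    }
    code = compile(tree, "<formula>", "eval")
    results = []
    for row in rows:
        if any(row[name] is None for name in used_vars):
            results.append(None)  # missing input -> missing output
        else:
            scope = {name: row[name] for name in used_vars}
            results.append(eval(code, {"__builtins__": {}, **ALLOWED_FUNCS}, scope))
    return results


rows = [
    {"x1": 1.0, "x2": 2.0},
    {"x1": None, "x2": 5.0},
    {"x1": 4.0, "x2": 0.5},
]
```

Here evaluate_formula("x1 + x2", rows) leaves the second unit missing because its x1 is missing.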
- force_binary(var: str, pos_label: str | None = None, ignore_multiclass: bool = True, rename: bool = True) -> Analyzer
Forces a variable to be binary (a 0/1-valued numeric variable). Does nothing if the data contains more than two classes unless ignore_multiclass is True and pos_label is specified, in which case all classes except pos_label are labeled with zero.
- Parameters:
var (str) – Name of the variable to force to binary.
pos_label (str | None) – Default: None. The positive label. If None, the most common class is labeled as the positive class.
ignore_multiclass (bool) – Default: True. If True, all classes except pos_label are labeled with zero. Otherwise raises ValueError.
rename (bool) – Default: True. If True, the variable is renamed to {var}::{pos_label}.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- force_categorical(vars: list[str]) -> Analyzer
Forces specified variables (columns) to have categorical values. That is, the variables’ values are converted to strings.
- Parameters:
vars (list[str]) – Names of the variables to force to categorical.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- force_numeric(vars: list[str]) -> Analyzer
Forces specified variables to numeric (float).
- Parameters:
vars (list[str]) – Names of the variables to force to numeric.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- impute(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, numeric_strategy: Literal['median', 'mean', '5nn', '10nn'] = 'median', categorical_strategy: Literal['most_frequent', 'missing'] = 'most_frequent') -> Analyzer
Imputes missing values. The imputer is fit on the train DataFrame and transforms both train and test DataFrames.
- Parameters:
include_vars (list[str] | None) – Default: None. List of variables in which to impute missing values. If None, imputes missing values in all columns.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from imputing missing values. If None, no variables are excluded.
numeric_strategy (Literal['median', 'mean', '5nn', '10nn']) – Default: ‘median’. Strategy for imputing missing values in numeric variables. Options: ‘median’ (impute with the median), ‘mean’ (impute with the mean), ‘5nn’ (impute with 5-nearest neighbors), ‘10nn’ (impute with 10-nearest neighbors).
categorical_strategy (Literal['most_frequent', 'missing']) – Default: ‘most_frequent’. Strategy for imputing missing values in categorical variables. Options: ‘most_frequent’ (impute with the most frequent value), ‘missing’ (impute with the string ‘missing’).
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
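The "fit on train, transform both" behavior can be sketched in plain Python. This is an illustration of the idea, not TableMage's imputer: the helper names fit_imputer and transform are hypothetical, None marks a missing value, and the nearest-neighbor strategies are left out.

```python
from collections import Counter
from statistics import median


def fit_imputer(train_values, strategy="median"):
    """Learn a fill value from the non-missing train values only."""
    observed = [v for v in train_values if v is not None]
    if strategy == "median":
        return median(observed)
    if strategy == "mean":
        return sum(observed) / len(observed)
    if strategy == "most_frequent":
        return Counter(observed).most_common(1)[0][0]
    if strategy == "missing":
        return "missing"
    raise ValueError(f"unsupported strategy: {strategy}")


def transform(values, fill_value):
    """Replace missing entries with the learned fill value."""
    return [fill_value if v is None else v for v in values]


train = [1.0, None, 3.0, 10.0]
test = [None, 100.0]
fill = fit_imputer(train)  # median of the observed train values
train_imputed = transform(train, fill)
test_imputed = transform(test, fill)
```

The test value 100.0 never influences the fill value, which is what keeps information from the test set out of the fitted transformation.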
- load_data_checkpoint(checkpoint_name: str | None = None) -> Analyzer
Loads a saved data checkpoint. If checkpoint_name is None, loads the original train and test DataFrames.
- Parameters:
checkpoint_name (str | None) – Default: None. The name of the checkpoint to load. If None, loads the original train and test DataFrames.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- logit(target: str | None = None, predictors: list[str] | None = None, alpha: float = 0.0, l1_weight: float = 0.0, threshold_strategy: Literal['f1', 'roc'] | None = None) -> LogitReport | MNLogitReport
Performs logistic regression. Units with missing data will be dropped.
- Parameters:
target (str | None) – Default: None. The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, all variables except target will be used as predictors.
alpha (float) – Default: 0. Regularization strength. Must be a non-negative float.
l1_weight (float) – Default: 0. The weight of the L1 penalty. Must be a float between 0 and 1.
threshold_strategy (Literal['f1', 'roc'] | None) – Default: None. The strategy for determining the threshold for binary classification. If None, the threshold is set to 0.5.
- Returns:
The appropriate regression report object is returned.
- Return type:
LogitReport | MNLogitReport
- numeric_vars() -> list[str]
Returns the numeric variables in the working train DataFrame.
- Returns:
The numeric variables.
- Return type:
list[str]
- ols(target: str | None = None, predictors: list[str] | None = None, alpha: float = 0.0, l1_weight: float = 0.0) -> OLSReport
Performs OLS regression. Units with missing data will be dropped.
- Parameters:
target (str | None) – Default: None. The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, all variables except target will be used as predictors.
alpha (float) – Default: 0. Regularization strength. Must be a non-negative float.
l1_weight (float) – Default: 0. The weight of the L1 penalty. Must be a float between 0 and 1.
- Returns:
The OLSReport object contains a variety of OLS regression methods, including summary statistics, model coefficients, and data visualizations.
- Return type:
OLSReport
- onehot(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, dropfirst: bool = True, keep_original: bool = False) -> Analyzer
One-hot encodes the specified variables (columns).
- Parameters:
include_vars (list[str] | None) – Default: None. List of variables to one-hot encode. If None, one-hot encodes all categorical variables.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from one-hot encoding. If None, no variables are excluded.
dropfirst (bool) – Default: True. If True, drops the first one-hot encoded column.
keep_original (bool) – Default: False. If True, keeps the original variables in the DataFrame.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
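The effect of dropfirst can be sketched in plain Python. This is an illustration only: the helper name onehot_column is hypothetical, and both the alphabetical ordering of levels and the "{variable}_{level}" column names are assumptions rather than TableMage's documented conventions.

```python
def onehot_column(values, var_name, dropfirst=True):
    """One-hot encode a list of category labels into a dict of 0/1 columns."""
    levels = sorted(set(values))  # assumption: levels are ordered alphabetically
    if dropfirst:
        levels = levels[1:]  # the dropped level becomes the all-zeros baseline
    return {
        f"{var_name}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


colors = ["red", "blue", "green", "blue"]
encoded = onehot_column(colors, "color")
```

With dropfirst=True, rows whose original value was the dropped level ("blue" here) are encoded as zeros in every remaining column.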
- regress(models: list[BaseR], target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSR] | None = None, max_n_features: int | None = None, outer_cv: int | None = None, outer_cv_seed: int = 42) -> MLRegressionReport
Conducts a comprehensive regression ML model benchmarking exercise. Observations with missing data will be dropped.
- Parameters:
models (list[BaseR]) – Models to be evaluated.
target (str) – The variable to be predicted.
predictors (list[str] | None) – Default: None. If None, uses all variables except target as predictors.
feature_selectors (list[BaseFSR] | None) – Default: None. The feature selectors for voting selection. Feature selectors can be used to select the most important predictors. Feature selectors can also be specified at the model level. If specified here, the same feature selectors will be used for all models.
max_n_features (int | None) – Default: None. Maximum number of predictors to utilize. Ignored if feature_selectors is None. If None, then all features with at least 50% support are selected.
outer_cv (int | None) – Default: None. If not None, reports training scores via nested k-fold CV.
outer_cv_seed (int) – Default: 42. The random seed for the outer cross validation loop.
- Return type:
MLRegressionReport
- remove_data_checkpoint(checkpoint_name: str) -> Analyzer
Deletes a saved checkpoint.
- Parameters:
checkpoint_name (str) – The name of the checkpoint to delete.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- save_data_checkpoint(checkpoint_name: str) -> Analyzer
Saves the current train and test DataFrames as a named checkpoint.
- Parameters:
checkpoint_name (str) – The name of the checkpoint.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- scale(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None, strategy: Literal['standardize', 'minmax', 'log', 'log1p', 'robust_standardize', 'normal_quantile', 'uniform_quantile'] = 'standardize') -> Analyzer
Scales the variables.
- Parameters:
include_vars (list[str] | None) – Default: None. List of variables to scale. If None, scales values in all columns.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude from scaling. If None, no variables are excluded.
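strategy (Literal['standardize', 'minmax', 'log', 'log1p', 'robust_standardize', 'normal_quantile', 'uniform_quantile']) – Default: ‘standardize’. The scaling strategy.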
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- select_features(target: str, predictors: list[str] | None = None, feature_selectors: list[BaseFSR] | list[BaseFSC] | None = None, max_n_features: int | None = None) -> VotingSelectionReport
Selects the most important features using a variety of feature selection methods. The feature selection methods can be used to select the most important predictors for regression or classification.
- Parameters:
target (str) – The target variable.
predictors (list[str] | None) – Default: None. The predictors to select from. If None, uses all variables except the target as predictors.
feature_selectors (list[BaseFSR] | list[BaseFSC] | None) – Default: None. The feature selection methods to use. If None, uses all feature selection methods.
max_n_features (int | None) – Default: None. Maximum number of features to select. If None, then all features with at least 50% support are selected.
- Returns:
Report object containing the results of the feature selection methods.
- Return type:
VotingSelectionReport
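The "at least 50% support" rule can be sketched as a simple vote count across selectors. This is one plausible reading, not TableMage's implementation: the helper name vote_select is hypothetical, and treating max_n_features as "take the top features by vote count" with alphabetical tie-breaking is an assumption.

```python
from collections import Counter


def vote_select(selections, max_n_features=None):
    """Combine per-selector feature picks into one list by vote count."""
    votes = Counter(feature for picked in selections for feature in picked)
    n_selectors = len(selections)
    # Rank by descending votes; break ties alphabetically for determinism.
    ranked = sorted(votes, key=lambda f: (-votes[f], f))
    if max_n_features is None:
        # Keep features chosen by at least half of the selectors.
        return [f for f in ranked if votes[f] / n_selectors >= 0.5]
    return ranked[:max_n_features]


# Picks from three hypothetical feature selectors.
selections = [
    ["age", "income", "height"],
    ["age", "income"],
    ["age", "weight"],
]
```

Here "age" has support 3/3 and "income" 2/3, so both clear the 50% bar, while "height" and "weight" (1/3 each) do not.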
- select_vars(include_vars: list[str] | None = None, exclude_vars: list[str] | None = None) -> Analyzer
Selects the specified variables.
- Parameters:
include_vars (list[str] | None) – Default: None. List of variables to include. If None, includes all variables.
exclude_vars (list[str] | None) – Default: None. List of variables to exclude. If None, no variables are excluded.
- Returns:
Returns self for method chaining.
- Return type:
Analyzer
- shape(dataset: Literal['train', 'test']) -> tuple[int, int]
Returns the shape of the working train or test DataFrame.
- Parameters:
dataset (Literal['train', 'test']) – The dataset to get the shape of.
- Returns:
The shape of the selected working DataFrame.
- Return type:
tuple[int, int]
- value_counts(var: str, dataset: Literal['train', 'test', 'both'] = 'both') -> Series
Returns the value counts of a variable in the working train DataFrame, the working test DataFrame, or both combined.
- Parameters:
var (str) – The variable to get the value counts of.
dataset (Literal['train', 'test', 'both']) – Default: ‘both’. The dataset to get the value counts of.
- Returns:
The value counts of the variable.
- Return type:
pd.Series