pyCADD.Dance package

Submodules

pyCADD.Dance.common module

class pyCADD.Dance.common.Dancer[source]

Bases: object

Data analyzer for CADD (Computer-Aided Drug Design).

A comprehensive data preprocessing and analysis tool for molecular datasets in computer-aided drug design workflows. Handles dataset merging, preprocessing, and preparation for machine learning models.
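
A minimal usage sketch (file paths, the fill value, and the output file name are illustrative assumptions, not package defaults):

    from pyCADD.Dance.common import Dancer

    dancer = Dancer()
    dancer.add_pos_dataset("actives_scores.csv")    # positive (active) samples
    dancer.add_neg_dataset("decoys_scores.csv")     # negative (decoy) samples

    # Merge the datasets and fill missing values (0 by default, via the 'value' keyword).
    dancer.prepare_data(fill_nan=True, value=0)

    # The file extension selects the output format (.csv, .pickle or .pkl).
    dancer.save("merged_dataset.csv")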

__init__() None[source]

Initialize the Dancer instance.

Creates necessary directories and initializes data storage attributes.

add_pos_dataset(csv_path: str) None[source]

Add a positive sample dataset.

Parameters:

csv_path (str) – Path to the positive sample dataset CSV file.

add_neg_dataset(csv_path: str) None[source]

Add a negative sample dataset.

Parameters:

csv_path (str) – Path to the negative sample dataset CSV file.

add_dataset(csv_path: str, *args, **kwargs) None[source]

Add a dataset.

Parameters:
  • csv_path (str) – Path to the dataset CSV file.

  • *args – Additional arguments passed to _add_dataset.

  • **kwargs – Additional keyword arguments passed to _add_dataset.

get_merged_data() DataFrame[source]

Return the merged dataset.

Returns:

The merged dataset DataFrame.

prepare_data(fill_nan: bool = True, *args, **kwargs) None[source]

Prepare the dataset for analysis.

Parameters:
  • fill_nan (bool) – Whether to fill missing values. Defaults to True.

  • *args – Additional arguments passed to self._fill_nan().

  • **kwargs – Additional keyword arguments passed to self._fill_nan(). The ‘value’ parameter can specify the fill value (default 0).

save_pickle(file_name: str, dataset: DataFrame = None) None[source]

Save dataset as pickle file.

Parameters:
  • file_name (str) – Name of the pickle file to save.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

save_csv(file_name: str, dataset: DataFrame = None) None[source]

Save dataset as CSV file.

Parameters:
  • file_name (str) – Name of the CSV file to save.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

save(file_name: str, dataset: DataFrame = None) None[source]

Save dataset in the appropriate format based on file extension.

Parameters:
  • file_name (str) – Name of the file to save. Must end with .csv, .pickle, or .pkl.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

Raises:

ValueError – If file extension is not supported.

class pyCADD.Dance.common.Matrix(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42)[source]

Bases: object

Result matrix for molecular data analysis.

A data structure for handling molecular datasets with train/test splitting and data preprocessing capabilities for machine learning workflows.
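
A brief usage sketch (the file path is an illustrative assumption; ‘activity’ is the documented default label column):

    from pyCADD.Dance.common import Matrix

    # Load a prepared dataset and split it for model training.
    matrix = Matrix.from_csv("merged_dataset.csv", test_size=0.25, random_seed=42)
    train_data, test_data = matrix.split_train_test_data(label_col="activity")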

__init__(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42) None[source]

Initialize Matrix with molecular data.

Parameters:
  • dataframe (DataFrame) – Input DataFrame containing molecular data.

  • test_size (float) – Proportion of data to use for testing. Defaults to 0.25.

  • random_seed (int) – Random seed for reproducible splits. Defaults to 42.

classmethod from_pickle(path: str, *args, **kwargs) Matrix[source]

Create Matrix instance from pickle file.

Parameters:
  • path (str) – Path to the pickle file.

  • *args – Additional arguments passed to Matrix constructor.

  • **kwargs – Additional keyword arguments passed to Matrix constructor.

Returns:

Matrix instance loaded from pickle file.

classmethod from_csv(path: str, *args, **kwargs) Matrix[source]

Create Matrix instance from CSV file.

Parameters:
  • path (str) – Path to the CSV file.

  • *args – Additional arguments passed to Matrix constructor.

  • **kwargs – Additional keyword arguments passed to Matrix constructor.

Returns:

Matrix instance loaded from CSV file.

classmethod from_splited_data(train_data: DataFrame, test_data: DataFrame) Matrix[source]

Create Matrix instance from pre-split data.

Parameters:
  • train_data (DataFrame) – Training dataset.

  • test_data (DataFrame) – Testing dataset.

Returns:

Matrix instance with pre-split train and test data.

split_train_test_data(test_size: float = None, random_seed: int = None, label_col: str = 'activity') tuple[source]

Split data into training and testing sets.

Parameters:
  • test_size (float, optional) – Proportion of the dataset to include in the test split. If None, uses the instance’s test_size.

  • random_seed (int, optional) – Random seed for reproducible splits. If None, uses the instance’s random_seed.

  • label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Tuple containing (train_data, test_data).

get_train_data(label_col: str = 'activity') DataFrame[source]

Get training data.

Parameters:

label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Training dataset DataFrame.

get_test_data(label_col: str = 'activity') DataFrame[source]

Get testing data.

Parameters:

label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Testing dataset DataFrame.

class pyCADD.Dance.common.Evaluator(matrix: Matrix, label_col: str = 'activity')[source]

Bases: object

Model performance evaluator.

A comprehensive evaluation toolkit for machine learning models in CADD workflows, intended for non-neural-network models; it supports repeated cross-validation and test-set evaluation.
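
A minimal sketch of the evaluation workflow, assuming a prepared dataset on disk and scikit-learn style classifiers (paths, names, and hyperparameters are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from pyCADD.Dance.common import Evaluator, Matrix

    matrix = Matrix.from_csv("merged_dataset.csv")
    evaluator = Evaluator(matrix, label_col="activity")

    # Register one or more classifiers under explicit names.
    evaluator.add_clf(RandomForestClassifier(n_estimators=200), clf_name="RF")

    # Repeated k-fold cross-validation, then a final test-set evaluation.
    evaluator.repeat_cv(n_repeats=30, k_folds=4, random_seed=42)
    evaluator.print_cv_results()
    test_results = evaluator.testset_eval()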

__init__(matrix: Matrix, label_col: str = 'activity') None[source]

Initialize the Evaluator with a data matrix.

Parameters:
  • matrix (Matrix) – Result matrix containing the molecular data.

  • label_col (str) – Name of the label column. Defaults to ‘activity’.

property gbt_default_params: dict

Get default GBT parameter space.

Returns:

Dictionary containing default Gradient Boosting Tree parameters.

property lr_default_params: dict

Get default Logistic Regression parameter space.

Returns:

Dictionary containing default Logistic Regression parameters.

property rf_default_params: dict

Get default Random Forest parameter space.

Returns:

Dictionary containing default Random Forest parameters.

static get_weights(y: Series) list[source]

Calculate weights for each class label in imbalanced datasets.

Parameters:

y (Series) – Series containing class labels.

Returns:

List of weights for each sample, inversely proportional to class frequency.
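
A small hedged illustration (the exact weight values are implementation-defined; only the inverse relation to class frequency is documented):

    import pandas as pd
    from pyCADD.Dance.common import Evaluator

    y = pd.Series([0, 0, 0, 1])           # imbalanced labels: three inactives, one active
    weights = Evaluator.get_weights(y)    # one weight per sample
    # Samples of the rarer class (label 1 here) receive proportionally larger weights.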

get_lr_default_params() dict[source]

Get default Logistic Regression parameter space.

Returns:

Dictionary containing default LR parameters.

get_rf_default_params() dict[source]

Get default Random Forest parameter space.

Returns:

Dictionary containing default RF parameters.

get_gbt_default_params() dict[source]

Get default Gradient Boosting Tree parameter space.

Returns:

Dictionary containing default GBT parameters.

load_params(path: str) dict[source]

Load parameters from file.

Parameters:

path (str) – Path to the parameter file.

Returns:

Dictionary containing loaded parameters.

save_params(file_name: str, params: dict) None[source]

Save parameters to file.

Parameters:
  • file_name (str) – Name of the parameter file.

  • params (dict) – Dictionary containing parameters to save.

search_params(clf: Any, params_grid: dict, method: str = 'grid', *args, **kwargs) dict[source]

Perform hyperparameter search for the model.

Parameters:
  • clf (Any) – Classifier instance.

  • params_grid (dict) – Parameter space dictionary.

  • method (str) – Optimization method: ‘grid’ for grid search or ‘random’ for random search. Defaults to ‘grid’.

  • *args – Additional arguments passed to hyperparam_tuning function.

  • **kwargs – Additional keyword arguments passed to hyperparam_tuning function.

Returns:

Dictionary containing the best parameters found.
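
Continuing the Evaluator sketch above, a hedged example of tuning a classifier over the documented default parameter space before registering it (the choice of LogisticRegression, and passing the result back to its constructor, are assumptions):

    from sklearn.linear_model import LogisticRegression

    best_params = evaluator.search_params(
        clf=LogisticRegression(max_iter=1000),
        params_grid=evaluator.lr_default_params,   # documented default LR parameter space
        method="grid",
    )
    # Assumes the returned dictionary maps directly to constructor keyword arguments.
    evaluator.add_clf(LogisticRegression(**best_params), clf_name="LR_tuned")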

add_clf(clf: Any, clf_name: str = None) None[source]

Add classifier instance to the evaluation dictionary.

Classifiers with the same name will be overwritten.

Parameters:
  • clf (Any) – Classifier instance to add.

  • clf_name (str, optional) – Name for the classifier. If None, uses clf.__class__.__name__.

del_clf(clf_name: str) None[source]

Delete classifier instance.

Parameters:

clf_name (str) – Name of the classifier to delete.

get_clf(clf_name: str) Any[source]

Get a single classifier instance by name.

Parameters:

clf_name (str) – Name of the classifier to retrieve.

Returns:

The classifier instance.

get_clfs_dict() dict[source]

Get the dictionary of all classifier instances.

Returns:

Dictionary mapping classifier names to instances.

print_classifier_info() None[source]

Print parameter information for all added classifiers.

Displays a formatted table showing classifier names, parameters, and their values.

repeat_cv(n_repeats: int = 30, k_folds: int = 4, random_seed: int = 42, score_func: Callable = roc_auc_score, use_train_set_only: bool = False) dict[source]

Perform repeated cross-validation on all added classifiers.

Parameters:
  • n_repeats (int) – Number of repetitions for cross-validation.

  • k_folds (int) – Number of folds for cross-validation.

  • random_seed (int) – Random seed for reproducible results.

  • score_func (Callable) – Evaluation function. Defaults to roc_auc_score.

  • use_train_set_only (bool) – Whether to use only the training set for cross-validation. If False, the complete dataset is used.

Returns:

Dictionary containing evaluation results:

  • SCP results for single-conformation performance

  • clf_cv_results for cross-validation results of the classifiers

Return type:

dict

Raises:

ValueError – If no classifiers have been added.

print_cv_results() None[source]

Print cross-validation results.

Displays formatted results including SCP scores and classifier performance.

Raises:

ValueError – If no cross-validation has been performed.

testset_eval() dict[source]

Evaluate performance using test set data.

Returns:

Dictionary containing classifier evaluation results on the test set.

Raises:

ValueError – If no classifiers have been added.

pyCADD.Dance.core module

pyCADD.Dance.core.hyperparam_tuning(model: Any, param_gird: dict, X: DataFrame, y: Series, scoring: str = 'roc_auc', cv: int = 5, n_jobs: int = -1, method: str = 'grid', save_dir: str = None, model_name: str = None) dict[source]

Hyperparameter optimization for machine learning models.

Parameters:
  • model (Any) – Model instance to optimize.

  • param_gird (dict) – Hyperparameter grid/distribution dictionary.

  • X (DataFrame) – Training feature data.

  • y (Series) – Training labels.

  • scoring (str) – Evaluation metric. Defaults to ‘roc_auc’.

  • cv (int) – Number of cross-validation splits. Defaults to 5.

  • n_jobs (int) – Number of parallel jobs. Defaults to -1 (use all processors).

  • method (str) – Optimization method: ‘grid’ for grid search or ‘random’ for random search. Defaults to ‘grid’.

  • save_dir (str, optional) – Directory to save parameter file. If None, no file is saved.

  • model_name (str, optional) – Name of the model for file naming.

Returns:

Dictionary containing optimized model parameters.

Raises:

ValueError – If method is not ‘grid’ or ‘random’.
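
A hedged usage sketch (the file path, label column, and parameter grid are illustrative; note that the documented signature spells the grid argument ‘param_gird’):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from pyCADD.Dance.core import hyperparam_tuning

    df = pd.read_csv("merged_dataset.csv")
    X = df.drop(columns="activity")
    y = df["activity"]

    param_grid = {
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
    }
    best_params = hyperparam_tuning(
        model=GradientBoostingClassifier(),
        param_gird=param_grid,
        X=X,
        y=y,
        scoring="roc_auc",
        cv=5,
        method="grid",
    )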

pyCADD.Dance.core.calc_scp_score(X: DataFrame, y_true: Series, lower_is_better: bool = True, score_func: Callable = roc_auc_score) dict[source]

Calculate Single Conformation Performance (SCP) scores.

Parameters:
  • X (DataFrame) – Feature data with conformations as columns.

  • y_true (Series) – True labels.

  • lower_is_better (bool) – Whether a lower metric value indicates better performance. Defaults to True.

  • score_func (Callable) – Evaluation function from sklearn.metrics. Defaults to roc_auc_score.

Returns:

Dictionary mapping conformation names to their SCP scores.
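
A hedged usage sketch (file path and column names are illustrative; the feature columns are assumed to hold one docking score per conformation):

    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from pyCADD.Dance.core import calc_scp_score

    df = pd.read_csv("merged_dataset.csv")
    scp_scores = calc_scp_score(
        X=df.drop(columns="activity"),
        y_true=df["activity"],
        lower_is_better=True,        # e.g. docking scores, where more negative is better
        score_func=roc_auc_score,
    )
    # e.g. {"conformation_A": 0.81, "conformation_B": 0.76, ...}  (illustrative values)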

pyCADD.Dance.metrics module

pyCADD.Dance.metrics.nef_score(y_true, y_score, percent: int = None) float[source]

Calculate Normalized Enrichment Factor (NEF) score.

NEF measures the enrichment of active compounds in the top-ranked subset compared to random selection.

Parameters:
  • y_true (array-like or pd.Series) – True binary labels (0 or 1) for samples.

  • y_score (array-like or pd.Series) – Predicted scores/probabilities for samples.

  • percent (int, optional) – Early enrichment percentage; NEF is calculated for the top percent% of ranked samples. If None, the ratio of actives to total samples (Ra = actives / total) is used instead.

Returns:

NEF score as a float value.
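
A short hedged usage sketch (the arrays below are illustrative; NEF is conventionally normalized so that a perfect ranking scores 1.0):

    import numpy as np
    from pyCADD.Dance.metrics import nef_score

    y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])   # 3 actives among 10 samples
    y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

    # NEF for the top 10% of ranked samples; with percent=None the active ratio
    # (3/10, i.e. 30%) would be used as the cutoff instead.
    nef_10 = nef_score(y_true, y_score, percent=10)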

Module contents

Data Analyzer for Computer-aided drug design.

This module provides tools for molecular data analysis and machine learning model evaluation in computer-aided drug design (CADD) workflows.