pyCADD.Dance package

Submodules

pyCADD.Dance.common module

class pyCADD.Dance.common.Dancer[source]

Bases: object

Data analyzer for CADD (Computer-Aided Drug Design).

A comprehensive data preprocessing and analysis tool for molecular datasets in computer-aided drug design workflows. Handles dataset merging, preprocessing, and preparation for machine learning models.
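
A minimal usage sketch (file paths, the fill value, and the output file name are illustrative assumptions, not package defaults):

    from pyCADD.Dance.common import Dancer

    dancer = Dancer()
    dancer.add_pos_dataset("actives_scores.csv")    # positive (active) samples
    dancer.add_neg_dataset("decoys_scores.csv")     # negative (decoy) samples

    # Merge the datasets and fill missing values (0 by default, via the 'value' keyword).
    dancer.prepare_data(fill_nan=True, value=0)

    # The file extension selects the output format (.csv, .pickle or .pkl).
    dancer.save("merged_dataset.csv")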

__init__() None[source]

Initialize the Dancer instance.

Creates necessary directories and initializes data storage attributes.

add_pos_dataset(csv_path: str) None[source]

Add a positive sample dataset.

Parameters:

csv_path (str) – Path to the positive sample dataset CSV file.

add_neg_dataset(csv_path: str) None[source]

Add a negative sample dataset.

Parameters:

csv_path (str) – Path to the negative sample dataset CSV file.

add_dataset(csv_path: str, *args, **kwargs) None[source]

Add a dataset.

Parameters:
  • csv_path (str) – Path to the dataset CSV file.

  • *args – Additional arguments passed to _add_dataset.

  • **kwargs – Additional keyword arguments passed to _add_dataset.

get_merged_data() DataFrame[source]

Return the merged dataset.

Returns:

The merged dataset DataFrame.

prepare_data(fill_nan: bool = True, *args, **kwargs) None[source]

Prepare the dataset for analysis.

Parameters:
  • fill_nan (bool) – Whether to fill missing values. Defaults to True.

  • *args – Additional arguments passed to self._fill_nan().

  • **kwargs – Additional keyword arguments passed to self._fill_nan(). The ‘value’ parameter can specify the fill value (default 0).

save_pickle(file_name: str, dataset: DataFrame = None) None[source]

Save dataset as pickle file.

Parameters:
  • file_name (str) – Name of the pickle file to save.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

save_csv(file_name: str, dataset: DataFrame = None) None[source]

Save dataset as CSV file.

Parameters:
  • file_name (str) – Name of the CSV file to save.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

save(file_name: str, dataset: DataFrame = None) None[source]

Save dataset in the appropriate format based on file extension.

Parameters:
  • file_name (str) – Name of the file to save. Must end with .csv, .pickle, or .pkl.

  • dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.

Raises:

ValueError – If file extension is not supported.

class pyCADD.Dance.common.Matrix(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42)[source]

Bases: object

Result matrix for molecular data analysis.

A data structure for handling molecular datasets with train/test splitting and data preprocessing capabilities for machine learning workflows.
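
A brief usage sketch (the file path is an illustrative assumption; ‘activity’ is the documented default label column):

    from pyCADD.Dance.common import Matrix

    # Load a prepared dataset and split it for model training.
    matrix = Matrix.from_csv("merged_dataset.csv", test_size=0.25, random_seed=42)
    train_data, test_data = matrix.split_train_test_data(label_col="activity")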

__init__(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42) None[source]

Initialize Matrix with molecular data.

Parameters:
  • dataframe (DataFrame) – Input DataFrame containing molecular data.

  • test_size (float) – Proportion of data to use for testing. Defaults to 0.25.

  • random_seed (int) – Random seed for reproducible splits. Defaults to 42.

classmethod from_pickle(path: str, *args, **kwargs) Matrix[source]

Create Matrix instance from pickle file.

Parameters:
  • path (str) – Path to the pickle file.

  • *args – Additional arguments passed to Matrix constructor.

  • **kwargs – Additional keyword arguments passed to Matrix constructor.

Returns:

Matrix instance loaded from pickle file.

classmethod from_csv(path: str, *args, **kwargs) Matrix[source]

Create Matrix instance from CSV file.

Parameters:
  • path (str) – Path to the CSV file.

  • *args – Additional arguments passed to Matrix constructor.

  • **kwargs – Additional keyword arguments passed to Matrix constructor.

Returns:

Matrix instance loaded from CSV file.

classmethod from_splited_data(train_data: DataFrame, test_data: DataFrame) Matrix[source]

Create Matrix instance from pre-split data.

Parameters:
  • train_data (DataFrame) – Training dataset.

  • test_data (DataFrame) – Testing dataset.

Returns:

Matrix instance with pre-split train and test data.

split_train_test_data(test_size: float = None, random_seed: int = None, label_col: str = 'activity') tuple[source]

Split data into training and testing sets.

Parameters:
  • test_size (float, optional) – Proportion of the dataset to include in the test split. If None, uses the instance’s test_size.

  • random_seed (int, optional) – Random seed for reproducible splits. If None, uses the instance’s random_seed.

  • label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Tuple containing (train_data, test_data).

get_train_data(label_col: str = 'activity') DataFrame[source]

Get training data.

Parameters:

label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Training dataset DataFrame.

get_test_data(label_col: str = 'activity') DataFrame[source]

Get testing data.

Parameters:

label_col (str) – Name of the label column. Defaults to ‘activity’.

Returns:

Testing dataset DataFrame.

class pyCADD.Dance.common.Evaluator(matrix: Matrix, label_col: str = 'activity')[source]

Bases: object

Model performance evaluator.

A comprehensive evaluation toolkit for machine learning models in CADD workflows, intended for non-neural-network models; it supports repeated cross-validation and test-set evaluation.
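
A minimal sketch of the evaluation workflow, assuming a prepared dataset on disk and scikit-learn style classifiers (paths, names, and hyperparameters are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from pyCADD.Dance.common import Evaluator, Matrix

    matrix = Matrix.from_csv("merged_dataset.csv")
    evaluator = Evaluator(matrix, label_col="activity")

    # Register one or more classifiers under explicit names.
    evaluator.add_clf(RandomForestClassifier(n_estimators=200), clf_name="RF")

    # Repeated k-fold cross-validation, then a final test-set evaluation.
    evaluator.repeat_cv(n_repeats=30, k_folds=4, random_seed=42)
    evaluator.print_cv_results()
    test_results = evaluator.testset_eval()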

__init__(matrix: Matrix, label_col: str = 'activity') None[source]

Initialize the Evaluator with a data matrix.

Parameters:
  • matrix (Matrix) – Result matrix containing the molecular data.

  • label_col (str) – Name of the label column. Defaults to ‘activity’.

property gbt_default_params: dict

Get default GBT parameter space.

Returns:

Dictionary containing default Gradient Boosting Tree parameters.

property lr_default_params: dict

Get default Logistic Regression parameter space.

Returns:

Dictionary containing default Logistic Regression parameters.

property rf_default_params: dict

Get default Random Forest parameter space.

Returns:

Dictionary containing default Random Forest parameters.

static get_weights(y: Series) list[source]

Calculate weights for each class label in imbalanced datasets.

Parameters:

y (Series) – Series containing class labels.

Returns:

List of weights for each sample, inversely proportional to class frequency.
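
A small hedged illustration (the exact weight values are implementation-defined; only the inverse relation to class frequency is documented):

    import pandas as pd
    from pyCADD.Dance.common import Evaluator

    y = pd.Series([0, 0, 0, 1])           # imbalanced labels: three inactives, one active
    weights = Evaluator.get_weights(y)    # one weight per sample
    # Samples of the rarer class (label 1 here) receive proportionally larger weights.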

get_lr_default_params() dict[source]

Get default Logistic Regression parameter space.

Returns:

Dictionary containing default LR parameters.

get_rf_default_params() dict[source]

Get default Random Forest parameter space.

Returns:

Dictionary containing default RF parameters.

get_gbt_default_params() dict[source]

Get default Gradient Boosting Tree parameter space.

Returns:

Dictionary containing default GBT parameters.

load_params(path: str) dict[source]

Load parameters from file.

Parameters:

path (str) – Path to the parameter file.

Returns:

Dictionary containing loaded parameters.

save_params(file_name: str, params: dict) None[source]

Save parameters to file.

Parameters:
  • file_name (str) – Name of the parameter file.

  • params (dict) – Dictionary containing parameters to save.

search_params(clf: Any, params_grid: dict, method: str = 'grid', *args, **kwargs) dict[source]

Perform hyperparameter search for the model.

Parameters:
  • clf (Any) – Classifier instance.

  • params_grid (dict) – Parameter space dictionary.

  • method (str) – Optimization method: ‘grid’ for grid search or ‘random’ for random search. Defaults to ‘grid’.

  • *args – Additional arguments passed to hyperparam_tuning function.

  • **kwargs – Additional keyword arguments passed to hyperparam_tuning function.

Returns:

Dictionary containing the best parameters found.
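
Continuing the Evaluator sketch above, a hedged example of tuning a classifier over the documented default parameter space before registering it (the choice of LogisticRegression, and passing the result back to its constructor, are assumptions):

    from sklearn.linear_model import LogisticRegression

    best_params = evaluator.search_params(
        clf=LogisticRegression(max_iter=1000),
        params_grid=evaluator.lr_default_params,   # documented default LR parameter space
        method="grid",
    )
    # Assumes the returned dictionary maps directly to constructor keyword arguments.
    evaluator.add_clf(LogisticRegression(**best_params), clf_name="LR_tuned")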

add_clf(clf: Any, clf_name: str = None) None[source]

Add classifier instance to the evaluation dictionary.

Classifiers with the same name will be overwritten.

Parameters:
  • clf (Any) – Classifier instance to add.

  • clf_name (str, optional) – Name for the classifier. If None, uses clf.__class__.__name__.

del_clf(clf_name: str) None[source]

Delete classifier instance.

Parameters:

clf_name (str) – Name of the classifier to delete.

get_clf(clf_name: str) Any[source]

Get a single classifier instance by name.

Parameters:

clf_name (str) – Name of the classifier to retrieve.

Returns:

The classifier instance.

get_clfs_dict() dict[source]

Get the dictionary of all classifier instances.

Returns:

Dictionary mapping classifier names to instances.

print_classifier_info() None[source]

Print parameter information for all added classifiers.

Displays a formatted table showing classifier names, parameters, and their values.

repeat_cv(n_repeats: int = 30, k_folds: int = 4, random_seed: int = 42, score_func: Callable = roc_auc_score, use_train_set_only: bool = False) dict[source]

Perform repeated cross-validation on all added classifiers.

Parameters:
  • n_repeats (int) – Number of repetitions for cross-validation.

  • k_folds (int) – Number of folds for cross-validation.

  • random_seed (int) – Random seed for reproducible results.

  • score_func (Callable) – Evaluation function. Defaults to roc_auc_score.

  • use_train_set_only (bool) – Whether to use only the training set for cross-validation. If False, the complete dataset is used.

Returns:

Dictionary containing evaluation results:

  • SCP results for single-conformation performance

  • clf_cv_results for cross-validation results of the classifiers

Return type:

dict

Raises:

ValueError – If no classifiers have been added.

print_cv_results() None[source]

Print cross-validation results.

Displays formatted results including SCP scores and classifier performance.

Raises:

ValueError – If no cross-validation has been performed.

testset_eval() dict[source]

Evaluate performance using test set data.

Returns:

Dictionary containing classifier evaluation results on the test set.

Raises:

ValueError – If no classifiers have been added.

pyCADD.Dance.core module

pyCADD.Dance.core.hyperparam_tuning(model: Any, param_gird: dict, X: DataFrame, y: Series, scoring: str = 'roc_auc', cv: int = 5, n_jobs: int = -1, method: str = 'grid', save_dir: str = None, model_name: str = None) dict[source]

Hyperparameter optimization for machine learning models.

Parameters:
  • model (Any) – Model instance to optimize.

  • param_gird (dict) – Hyperparameter grid/distribution dictionary.

  • X (DataFrame) – Training feature data.

  • y (Series) – Training labels.

  • scoring (str) – Evaluation metric. Defaults to ‘roc_auc’.

  • cv (int) – Number of cross-validation splits. Defaults to 5.

  • n_jobs (int) – Number of parallel jobs. Defaults to -1 (use all processors).

  • method (str) – Optimization method: ‘grid’ for grid search or ‘random’ for random search. Defaults to ‘grid’.

  • save_dir (str, optional) – Directory to save parameter file. If None, no file is saved.

  • model_name (str, optional) – Name of the model for file naming.

Returns:

Dictionary containing optimized model parameters.

Raises:

ValueError – If method is not ‘grid’ or ‘random’.
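
A hedged usage sketch (the file path, label column, and parameter grid are illustrative; note that the documented signature spells the grid argument ‘param_gird’):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from pyCADD.Dance.core import hyperparam_tuning

    df = pd.read_csv("merged_dataset.csv")
    X = df.drop(columns="activity")
    y = df["activity"]

    param_grid = {
        "n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [3, 5],
    }
    best_params = hyperparam_tuning(
        model=GradientBoostingClassifier(),
        param_gird=param_grid,
        X=X,
        y=y,
        scoring="roc_auc",
        cv=5,
        method="grid",
    )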

pyCADD.Dance.core.calc_scp_score(X: DataFrame, y_true: Series, lower_is_better: bool = True, score_func: Callable = roc_auc_score) dict[source]

Calculate Single Conformation Performance (SCP) scores.

Parameters:
  • X (DataFrame) – Feature data with conformations as columns.

  • y_true (Series) – True labels.

  • lower_is_better (bool) – Whether a lower metric value indicates better performance. Defaults to True.

  • score_func (Callable) – Evaluation function from sklearn.metrics. Defaults to roc_auc_score.

Returns:

Dictionary mapping conformation names to their SCP scores.
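
A hedged usage sketch (file path and column names are illustrative; the feature columns are assumed to hold one docking score per conformation):

    import pandas as pd
    from sklearn.metrics import roc_auc_score
    from pyCADD.Dance.core import calc_scp_score

    df = pd.read_csv("merged_dataset.csv")
    scp_scores = calc_scp_score(
        X=df.drop(columns="activity"),
        y_true=df["activity"],
        lower_is_better=True,        # e.g. docking scores, where more negative is better
        score_func=roc_auc_score,
    )
    # e.g. {"conformation_A": 0.81, "conformation_B": 0.76, ...}  (illustrative values)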

pyCADD.Dance.metrics module

pyCADD.Dance.metrics.nef_score(y_true, y_score, percent: int = None) float[source]

Calculate Normalized Enrichment Factor (NEF) score.

NEF measures the enrichment of active compounds in the top-ranked subset compared to random selection.

Parameters:
  • y_true (array-like or pd.Series) – True binary labels (0 or 1) for samples.

  • y_score (array-like or pd.Series) – Predicted scores/probabilities for samples.

  • percent (int, optional) – Early enrichment percentage; NEF is calculated for the top percent% of ranked samples. If None, the ratio of actives to total samples (Ra = actives / total) is used instead.

Returns:

NEF score as a float value.
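
A short hedged usage sketch (the arrays below are illustrative; NEF is conventionally normalized so that a perfect ranking scores 1.0):

    import numpy as np
    from pyCADD.Dance.metrics import nef_score

    y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])   # 3 actives among 10 samples
    y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

    # NEF for the top 10% of ranked samples; with percent=None the active ratio
    # (3/10, i.e. 30%) would be used as the cutoff instead.
    nef_10 = nef_score(y_true, y_score, percent=10)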

Module contents

Data Analyzer for Computer-aided drug design.

This module provides tools for molecular data analysis and machine learning model evaluation in computer-aided drug design (CADD) workflows.