pyCADD.Dance package
Subpackages
Submodules
pyCADD.Dance.common module
- class pyCADD.Dance.common.Dancer
Bases: object
Data analyzer for CADD (Computer-Aided Drug Design).
A comprehensive data preprocessing and analysis tool for molecular datasets in computer-aided drug design workflows. Handles dataset merging, preprocessing, and preparation for machine learning models.
- __init__() -> None
Initialize the Dancer instance.
Creates necessary directories and initializes data storage attributes.
- add_pos_dataset(csv_path: str) -> None
Add a positive sample dataset.
- Parameters:
csv_path (str) – Path to the positive sample dataset CSV file.
- add_neg_dataset(csv_path: str) -> None
Add a negative sample dataset.
- Parameters:
csv_path (str) – Path to the negative sample dataset CSV file.
- add_dataset(csv_path: str, *args, **kwargs) -> None
Add a dataset.
- Parameters:
csv_path (str) – Path to the dataset CSV file.
*args – Additional arguments passed to _add_dataset.
**kwargs – Additional keyword arguments passed to _add_dataset.
- get_merged_data() -> DataFrame
Return the merged dataset.
- Returns:
The merged dataset DataFrame.
- prepare_data(fill_nan: bool = True, *args, **kwargs) -> None
Prepare the dataset for analysis.
- Parameters:
fill_nan (bool) – Whether to fill missing values. Defaults to True.
*args – Additional arguments passed to self._fill_nan().
**kwargs – Additional keyword arguments passed to self._fill_nan(). The ‘value’ parameter can specify the fill value (default 0).
- save_pickle(file_name: str, dataset: DataFrame = None) -> None
Save dataset as pickle file.
- Parameters:
file_name (str) – Name of the pickle file to save.
dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.
- save_csv(file_name: str, dataset: DataFrame = None) -> None
Save dataset as CSV file.
- Parameters:
file_name (str) – Name of the CSV file to save.
dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.
- save(file_name: str, dataset: DataFrame = None) -> None
Save dataset in the appropriate format based on file extension.
- Parameters:
file_name (str) – Name of the file to save. Must end with .csv, .pickle, or .pkl.
dataset (DataFrame, optional) – Dataset to save. Defaults to self.merged_data.
- Raises:
ValueError – If file extension is not supported.
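The extension-based dispatch described for save() can be sketched as follows. This is a minimal standard-library illustration, not pyCADD's implementation: the names resolve_format and save_rows are hypothetical, and rows are plain dictionaries rather than a DataFrame.

```python
import csv
import pickle
from pathlib import Path


def resolve_format(file_name):
    """Map a file name to 'csv' or 'pickle' based on its extension (hypothetical helper)."""
    suffix = Path(file_name).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix in (".pickle", ".pkl"):
        return "pickle"
    # Mirrors the documented behaviour: unsupported extensions raise ValueError.
    raise ValueError(f"Unsupported file extension: {suffix!r}")


def save_rows(file_name, rows):
    """Write a non-empty list of dicts in the format implied by the file extension."""
    if resolve_format(file_name) == "csv":
        with open(file_name, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        with open(file_name, "wb") as handle:
            pickle.dump(rows, handle)
```

Resolving the format before opening any file means an unsupported name fails early, without leaving an empty file behind.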
- class pyCADD.Dance.common.Matrix(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42)
Bases: object
Result matrix for molecular data analysis.
A data structure for handling molecular datasets with train/test splitting and data preprocessing capabilities for machine learning workflows.
- __init__(dataframe: DataFrame, test_size: float = 0.25, random_seed: int = 42) -> None
Initialize Matrix with molecular data.
- Parameters:
dataframe (DataFrame) – Input DataFrame containing molecular data.
test_size (float) – Proportion of data to use for testing. Defaults to 0.25.
random_seed (int) – Random seed for reproducible splits. Defaults to 42.
- classmethod from_pickle(path: str, *args, **kwargs) -> Matrix
Create Matrix instance from pickle file.
- Parameters:
path (str) – Path to the pickle file.
*args – Additional arguments passed to Matrix constructor.
**kwargs – Additional keyword arguments passed to Matrix constructor.
- Returns:
Matrix instance loaded from pickle file.
- classmethod from_csv(path: str, *args, **kwargs) -> Matrix
Create Matrix instance from CSV file.
- Parameters:
path (str) – Path to the CSV file.
*args – Additional arguments passed to Matrix constructor.
**kwargs – Additional keyword arguments passed to Matrix constructor.
- Returns:
Matrix instance loaded from CSV file.
- classmethod from_splited_data(train_data: DataFrame, test_data: DataFrame) -> Matrix
Create Matrix instance from pre-split data.
- Parameters:
train_data (DataFrame) – Training dataset.
test_data (DataFrame) – Testing dataset.
- Returns:
Matrix instance with pre-split train and test data.
- split_train_test_data(test_size: float = None, random_seed: int = None, label_col: str = 'activity') -> tuple
Split data into training and testing sets.
- Parameters:
test_size (float, optional) – Proportion of the dataset to include in the test split. If None, uses the instance’s test_size.
random_seed (int, optional) – Random seed for reproducible splits. If None, uses the instance’s random_seed.
label_col (str) – Name of the label column. Defaults to ‘activity’.
- Returns:
Tuple containing (train_data, test_data).
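A seeded train/test split of labelled rows might look like the sketch below. It assumes the label column is used to stratify the split, which this page does not state, and it uses plain dictionaries and the standard library instead of DataFrames. The function name stratified_split is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_col="activity", test_size=0.25, random_seed=42):
    """Split rows into (train, test), keeping class proportions roughly equal."""
    rng = random.Random(random_seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_col]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)  # per-class share of the test set
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Calling the function twice with the same random_seed returns the same partition, which is the property the random_seed parameter is documented to provide.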
- class pyCADD.Dance.common.Evaluator(matrix: Matrix, label_col: str = 'activity')
Bases: object
Model performance evaluator.
A comprehensive evaluation toolkit for machine learning models in CADD workflows. Suitable for non-neural network models with cross-validation and test set evaluation.
- __init__(matrix: Matrix, label_col: str = 'activity') -> None
Initialize the Evaluator with a data matrix.
- Parameters:
matrix (Matrix) – Result matrix containing the molecular data.
label_col (str) – Name of the label column. Defaults to ‘activity’.
- property gbt_default_params: dict
Get default GBT parameter space.
- Returns:
Dictionary containing default Gradient Boosting Tree parameters.
- property lr_default_params: dict
Get default Logistic Regression parameter space.
- Returns:
Dictionary containing default Logistic Regression parameters.
- property rf_default_params: dict
Get default Random Forest parameter space.
- Returns:
Dictionary containing default Random Forest parameters.
- static get_weights(y: Series) -> list
Calculate weights for each class label in imbalanced datasets.
- Parameters:
y (Series) – Series containing class labels.
- Returns:
List of weights for each sample, inversely proportional to class frequency.
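Per-sample weights that are inversely proportional to class frequency can be sketched as below. The exact formula used by get_weights is not given on this page; this example assumes the common "balanced" form n_samples / (n_classes * class_count), and the function name inverse_frequency_weights is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample; rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Assumed formula: each class contributes the same total weight.
    return [n_samples / (n_classes * counts[label]) for label in labels]
```

With this formula a perfectly balanced dataset yields a weight of 1.0 for every sample.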
- get_lr_default_params() -> dict
Get default Logistic Regression parameter space.
- Returns:
Dictionary containing default LR parameters.
- get_rf_default_params() -> dict
Get default Random Forest parameter space.
- Returns:
Dictionary containing default RF parameters.
- get_gbt_default_params() -> dict
Get default Gradient Boosting Tree parameter space.
- Returns:
Dictionary containing default GBT parameters.
- load_params(path: str) -> dict
Load parameters from file.
- Parameters:
path (str) – Path to the parameter file.
- Returns:
Dictionary containing loaded parameters.
- save_params(file_name: str, params: dict) -> None
Save parameters to file.
- Parameters:
file_name (str) – Name of the parameter file.
params (dict) – Dictionary containing parameters to save.
- search_params(clf: Any, params_grid: dict, method: str = 'grid', *args, **kwargs) -> dict
Perform hyperparameter search for the model.
- Parameters:
clf (Any) – Classifier instance.
params_grid (dict) – Parameter space dictionary.
method (str) – Optimization method, either ‘grid’ (grid search) or ‘random’ (random search). Defaults to ‘grid’.
*args – Additional arguments passed to hyperparam_tuning function.
**kwargs – Additional keyword arguments passed to hyperparam_tuning function.
- Returns:
Dictionary containing the best parameters found.
- add_clf(clf: Any, clf_name: str = None) -> None
Add classifier instance to the evaluation dictionary.
Classifiers with the same name will be overwritten.
- Parameters:
clf (Any) – Classifier instance to add.
clf_name (str, optional) – Name for the classifier. If None, uses clf.__class__.__name__.
- del_clf(clf_name: str) -> None
Delete classifier instance.
- Parameters:
clf_name (str) – Name of the classifier to delete.
- get_clf(clf_name: str) -> Any
Get a single classifier instance by name.
- Parameters:
clf_name (str) – Name of the classifier to retrieve.
- Returns:
The classifier instance.
- get_clfs_dict() -> dict
Get the dictionary of all classifier instances.
- Returns:
Dictionary mapping classifier names to instances.
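The name-keyed classifier bookkeeping described by add_clf, del_clf, get_clf, and get_clfs_dict can be pictured with the following sketch. The class ClassifierRegistry is hypothetical and only mirrors the documented behaviour (default name from clf.__class__.__name__, overwrite on duplicate names). Raising KeyError for an unknown name is an assumption of this sketch, not documented behaviour.

```python
class ClassifierRegistry:
    """Hypothetical stand-in for the classifier dictionary kept by an evaluator."""

    def __init__(self):
        self._clfs = {}

    def add_clf(self, clf, clf_name=None):
        # Fall back to the class name when no explicit name is given.
        name = clf_name if clf_name is not None else clf.__class__.__name__
        self._clfs[name] = clf  # an existing entry with the same name is overwritten

    def del_clf(self, clf_name):
        del self._clfs[clf_name]  # KeyError for unknown names (assumption)

    def get_clf(self, clf_name):
        return self._clfs[clf_name]

    def get_clfs_dict(self):
        return self._clfs
```

Because names default to the class name, adding two instances of the same classifier class without explicit names keeps only the second one.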
- print_classifier_info() -> None
Print parameter information for all added classifiers.
Displays a formatted table showing classifier names, parameters, and their values.
- repeat_cv(n_repeats: int = 30, k_folds: int = 4, random_seed: int = 42, score_func: Callable = roc_auc_score, use_train_set_only: bool = False) -> dict
Perform repeated cross-validation on all added classifiers.
- Parameters:
n_repeats (int) – Number of repetitions for cross-validation.
k_folds (int) – Number of folds for cross-validation.
random_seed (int) – Random seed for reproducible results.
score_func (Callable) – Evaluation function. Defaults to roc_auc_score.
use_train_set_only (bool) – Whether to use only the training set for cross-validation. If False, the complete dataset is used. Defaults to False.
- Returns:
Dictionary containing evaluation results:
SCP: results for single conformation performance
clf_cv_results: cross-validation results of classifiers
- Return type:
dict
- Raises:
ValueError – If no classifiers have been added.
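The index bookkeeping behind repeated k-fold cross-validation can be sketched with the standard library as shown below. This is an illustration of the general technique only: the generator repeated_kfold_indices is hypothetical, it does not stratify by label, and whether repeat_cv stratifies its folds is not stated on this page.

```python
import random


def repeated_kfold_indices(n_samples, n_repeats=30, k_folds=4, random_seed=42):
    """Yield (repeat, fold, train_idx, test_idx) for every fold of every repetition."""
    rng = random.Random(random_seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle per repetition gives different folds
        folds = [indices[i::k_folds] for i in range(k_folds)]
        for fold_id, test_idx in enumerate(folds):
            train_idx = [i for f, fold in enumerate(folds) if f != fold_id for i in fold]
            yield repeat, fold_id, train_idx, test_idx
```

With the documented defaults (n_repeats=30, k_folds=4) each classifier would be fitted and scored 120 times, so the returned scores form a distribution and not a single number.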
pyCADD.Dance.core module
- pyCADD.Dance.core.hyperparam_tuning(model: Any, param_gird: dict, X: DataFrame, y: Series, scoring: str = 'roc_auc', cv: int = 5, n_jobs: int = -1, method: str = 'grid', save_dir: str = None, model_name: str = None) -> dict
Hyperparameter optimization for machine learning models.
- Parameters:
model (Any) – Model instance to optimize.
param_gird (dict) – Hyperparameter grid/distribution dictionary.
X (DataFrame) – Training feature data.
y (Series) – Training labels.
scoring (str) – Evaluation metric. Defaults to ‘roc_auc’.
cv (int) – Number of cross-validation splits. Defaults to 5.
n_jobs (int) – Number of parallel jobs. Defaults to -1 (use all processors).
method (str) – Optimization method, either ‘grid’ (grid search) or ‘random’ (random search). Defaults to ‘grid’.
save_dir (str, optional) – Directory to save parameter file. If None, no file is saved.
model_name (str, optional) – Name of the model for file naming.
- Returns:
Dictionary containing optimized model parameters.
- Raises:
ValueError – If method is not ‘grid’ or ‘random’.
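The difference between the ‘grid’ and ‘random’ methods can be illustrated with a small standard-library sketch. The functions candidate_params and best_params, the n_iter argument, and the score callable are all hypothetical. The real function scores candidates by cross-validation on X and y, which is replaced here by an arbitrary scoring callable.

```python
import itertools
import random


def candidate_params(param_grid, method="grid", n_iter=10, random_seed=42):
    """Enumerate ('grid') or sample ('random') parameter combinations."""
    keys = sorted(param_grid)
    if method == "grid":
        # Grid search: every combination of the listed values.
        combos = itertools.product(*(param_grid[k] for k in keys))
        return [dict(zip(keys, combo)) for combo in combos]
    if method == "random":
        # Random search: n_iter independent draws from the same value lists.
        rng = random.Random(random_seed)
        return [{k: rng.choice(param_grid[k]) for k in keys} for _ in range(n_iter)]
    raise ValueError(f"method must be 'grid' or 'random', got {method!r}")


def best_params(param_grid, score, method="grid", **kwargs):
    """Return the candidate with the highest score (score is a stand-in for CV scoring)."""
    return max(candidate_params(param_grid, method, **kwargs), key=score)
```

Grid search cost grows with the product of the list lengths, while random search cost is fixed by the number of draws, which is the usual reason to prefer it for large parameter spaces.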
- pyCADD.Dance.core.calc_scp_score(X: DataFrame, y_true: Series, lower_is_better: bool = True, score_func: Callable = roc_auc_score) -> dict
Calculate Single Conformation Performance (SCP) scores.
- Parameters:
X (DataFrame) – Feature data with conformations as columns.
y_true (Series) – True labels.
lower_is_better (bool) – Whether the values in X are a descending indicator, that is, whether lower values indicate better samples. Defaults to True.
score_func (Callable) – Evaluation function from sklearn.metrics. Defaults to roc_auc_score.
- Returns:
Dictionary mapping conformation names to their SCP scores.
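A per-conformation score of this kind can be sketched as one ROC AUC per column. The example below is a standard-library illustration, not pyCADD's code: roc_auc is a simple rank-based AUC standing in for sklearn's roc_auc_score, scp_scores is a hypothetical name, and it assumes that lower_is_better means the column values are negated before scoring.

```python
def roc_auc(y_true, y_score):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # Ties between a positive and a negative count as half a correct pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def scp_scores(columns, y_true, lower_is_better=True):
    """Score each conformation column on its own against the true labels."""
    sign = -1.0 if lower_is_better else 1.0  # assumed handling of descending indicators
    return {
        name: roc_auc(y_true, [sign * value for value in values])
        for name, values in columns.items()
    }
```

For docking-style scores, where more negative values suggest stronger binding, a column that ranks all actives below all inactives would receive an AUC of 1.0 under this assumption.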
pyCADD.Dance.metrics module
- pyCADD.Dance.metrics.nef_score(y_true, y_score, percent: int = None) -> float
Calculate Normalized Enrichment Factor (NEF) score.
NEF measures the enrichment of active compounds in the top-ranked subset compared to random selection.
- Parameters:
y_true (array-like or pd.Series) – True binary labels (0 or 1) for samples.
y_score (array-like or pd.Series) – Predicted scores/probabilities for samples.
percent (int, optional) – Early enrichment percentage. NEF is calculated for the top percent% of samples. If None, the ratio of actives to total samples (Ra = actives / total) is used as the cutoff.
- Returns:
NEF score as a float value.
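One common way to compute a normalized enrichment factor is sketched below. It assumes that higher scores rank first and that NEF is the number of actives found in the top subset divided by the maximum number that subset could hold. The exact formula used by nef_score is not given on this page, and the name nef_sketch is hypothetical.

```python
def nef_sketch(y_true, y_score, percent=None):
    """Normalized enrichment of actives among the top-ranked samples (assumed definition)."""
    total = len(y_true)
    actives = sum(y_true)
    if actives == 0:
        raise ValueError("NEF is undefined without any active samples")
    # Default cutoff: the fraction of actives in the dataset (Ra = actives / total).
    fraction = actives / total if percent is None else percent / 100
    n_top = max(1, round(total * fraction))
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    # Normalize by the best achievable number of actives in the top subset.
    return hits / min(n_top, actives)
```

Under this definition the value lies between 0 and 1, with 1 meaning the top subset contains as many actives as it possibly could.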
Module contents
Data Analyzer for Computer-aided drug design.
This module provides tools for molecular data analysis and machine learning model evaluation in computer-aided drug design (CADD) workflows.