Methods
The deconomix.methods submodule provides the models for DTD and ADTD and the respective methods to train them.
DTD Class
- class deconomix.methods.DTD(X_mat: DataFrame, Y_mat: DataFrame, C_mat: DataFrame)
Class that implements Digital Tissue Deconvolution via Loss-function Learning
Attributes
- X_mat : pd.DataFrame
Single-cell reference matrix X containing (average) profiles as columns per cell type. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing the generated artificial bulk profiles. Shape: genes x n_mixtures
- C_mat : pd.DataFrame
C matrix containing the relative composition of the bulk mixtures in Y. Shape: celltypes x n_mixtures
- run(iterations=1000, plot=False, path_plot=None, func='pearson')
Function that executes the training of a DTD model and saves the results in the model attributes.
Parameters
- iterations : int
How many training steps should be conducted.
- plot : bool
Whether to plot the development of the loss values during training.
Updates Attributes
- gamma : pd.DataFrame
Gene weights resulting from the training. Shape: genes x 1
- losses : list
List of loss values, one per training step.
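As a quick illustration of the expected input layout, the toy matrices below follow the documented shapes (all gene and cell-type names are made up). The actual training call is shown as a comment since it requires deconomix and real reference data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

genes = [f"gene_{i}" for i in range(50)]
celltypes = ["T", "B", "NK"]
n_mixtures = 20

# Reference profiles X: genes x celltypes
X = pd.DataFrame(rng.random((len(genes), len(celltypes))),
                 index=genes, columns=celltypes)

# Compositions C: celltypes x n_mixtures, columns sum to 1
C = rng.random((len(celltypes), n_mixtures))
C = pd.DataFrame(C / C.sum(axis=0), index=celltypes)

# Artificial bulk profiles Y = X @ C: genes x n_mixtures
Y = X @ C

# Training (requires deconomix):
# from deconomix.methods import DTD
# model = DTD(X, Y, C)
# model.run(iterations=1000)
# model.gamma   # learned gene weights, shape genes x 1
# model.losses  # one loss value per training step
```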
ADTD Class
- class deconomix.methods.ADTD(X_mat: DataFrame, Y_mat: DataFrame, gamma: DataFrame, lambda1: float = 1.0, lambda2: float = 1.0, max_iterations: int = 200, eps: float = 1e-08, C_static: bool = False, Delta_static: bool = False, gamma_offset: bool = True, delta_stepsize: int = 1)
Class that implements the Adaptive Digital Tissue Deconvolution algorithm.
Attributes
- X_mat : pd.DataFrame
Single-cell reference matrix X containing (average) profiles as columns per cell type. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing bulk profiles. Shape: genes x n_mixtures
- gamma : pd.DataFrame
Gene weights. Shape: genes x 1
- max_iterations : int
Maximum number of optimization steps.
- lambda1 : float
Hyperparameter for cellular composition estimation.
- lambda2 : float
Hyperparameter for reference profile adaption (gene regulation).
- eps : float
Stopping criterion based on the error.
- C_static : bool
If True, the cellular composition C is held fixed and not optimized.
- Delta_static : bool
If True, the reference profile adaption Delta is held fixed and not optimized.
- run(verbose=True)
Fit an ADTD model
Parameters
- verbose : bool
Whether a progress bar should be shown during training (default) or not (silent mode).
Updates Attributes
- C_est : pd.DataFrame
Estimated contribution of the referenced cell types in X to the composition of the mixtures in Y. Shape: referenced cell types x mixtures
- c_est : pd.DataFrame
Estimated contribution of the hidden background to the composition of the mixtures in Y. Shape: 1 x mixtures
- x_est : pd.DataFrame
Estimated consensus hidden background profile. Shape: genes x 1
- Delta_est : pd.DataFrame
Estimated element-wise adaption factors for the reference matrix X (gene regulation). Shape: genes x referenced cell types (same shape as X)
- setup()
Initialization step; still required to obtain a reasonable starting estimate for the hidden background profile x. Random initialization may be possible.
- update_C0()
Calculate C0 with lambda1 = 0 and lambda2 -> inf. The earlier reasoning that lambda1 -> inf yields the DTD solution no longer holds.
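To make the shapes of these estimates concrete: assuming ADTD reconstructs the bulks as Y ≈ (Delta ∘ X) C + x c (this model form is our reading of the attribute descriptions above, not taken from the package), the pieces multiply together as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_mix = 50, 3, 20

X = rng.random((n_genes, n_types))        # reference matrix, genes x celltypes
C_est = rng.random((n_types, n_mix))      # referenced cell-type contributions
c_est = rng.random((1, n_mix))            # hidden background contributions
x_est = rng.random((n_genes, 1))          # consensus hidden background profile
Delta_est = np.ones((n_genes, n_types))   # element-wise adaption factors, same shape as X

# Reconstruction under the assumed model form: element-wise adaption of X,
# mixed by C_est, plus the hidden background term.
Y_est = (Delta_est * X) @ C_est + x_est @ c_est
```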
Hyperparameter Search
Finding suitable hyperparameters is not trivial. We provide an extensive grid search in the example section of the repository (gridsearch.py, Dockerfile). As a low-cost alternative, you can use the following class to estimate a good lambda_2 in the limiting case lambda_1 -> infinity:
- class deconomix.methods.HPS(X_ref: DataFrame, Y_test: DataFrame, gamma: DataFrame, Y_all: DataFrame | None = None, kfold: int = 5, parallel: bool = True, workers: int = 5, plot: bool = False, n_points: int = 11, lambda_min: float = 1e-10, lambda_max: float = 1)
HPS: Hyperparameter Search for L2 Regularization
This class performs a hyperparameter search to determine a suitable value for the L2 regularization parameter (lambda2) in a scenario where L1 regularization (lambda1) is fixed to infinity. It leverages 5-fold cross-validation on bulk RNA-seq datasets to evaluate performance and determines optimal lambda2 values using various methods, including the 1SE rule and gradient analysis.
Parameters
- X_ref : pd.DataFrame
Single-cell reference matrix containing average profiles as columns per cell type. Shape: genes x celltypes.
- Y_test : pd.DataFrame
Test set bulk RNA-seq profiles. Shape: genes x n_mixtures.
- gamma : pd.DataFrame
Gene weights learned by DTD on a training set. Shape: genes x 1.
- Y_all : pd.DataFrame, optional
Complete bulk RNA-seq dataset. Defaults to Y_test if not provided. Shape: genes x n_mixtures.
- kfold : int, optional
Number of folds for cross-validation. Default is 5.
- parallel : bool, optional
If True, performs parallel computation for efficiency. Default is True.
- workers : int, optional
Number of workers to use in parallel mode. Default is 5.
- plot : bool, optional
If True, generates plots for the cross-validation loss curves. Default is False.
- n_points : int, optional
Number of lambda2 values to test on the logarithmic grid. Default is 11.
- lambda_min : float, optional
Minimum value of lambda2 for the grid search. Default is 1e-10.
- lambda_max : float, optional
Maximum value of lambda2 for the grid search. Default is 1.
Attributes
- lambda_min : float
Optimal lambda2 minimizing the average loss across folds.
- lambda_1se : float
Largest lambda2 within one standard error of the minimum average loss.
- Losses : pd.DataFrame
Cross-validation losses for all lambda2 values and folds.
- lambda_max_gradient : float
Optimal lambda2 based on the maximum gradient of the loss curve.
- static cross_validation_split_columns(df, n_splits=5)
Split the columns of a DataFrame into train and test sets for k-fold cross-validation.
Parameters
- df : pd.DataFrame
Input data with columns to split.
- n_splits : int, optional
Number of folds for cross-validation. Default is 5.
Returns
- list of tuples
A list where each tuple contains train and test DataFrame subsets for one fold.
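The split can be sketched in plain numpy/pandas; split_columns below is an illustrative stand-in, not the package's implementation, and the exact fold assignment in deconomix may differ:

```python
import numpy as np
import pandas as pd

def split_columns(df, n_splits=5):
    """Return (train, test) column subsets for k-fold CV over the columns of df."""
    all_idx = np.arange(df.shape[1])
    folds = np.array_split(all_idx, n_splits)
    splits = []
    for test_idx in folds:
        train_idx = np.setdiff1d(all_idx, test_idx)
        splits.append((df.iloc[:, train_idx], df.iloc[:, test_idx]))
    return splits

# 4 genes x 10 mixtures -> 5 folds with 2 test columns each
df = pd.DataFrame(np.arange(40).reshape(4, 10))
splits = split_columns(df, n_splits=5)
```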
- static evaluate_l2(j, l2, X_ref, train_data, gamma, test_data, Cc_all)
Evaluate the loss for a specific lambda2 value.
Parameters
- j : int
Index of the lambda2 value in the grid.
- l2 : float
Lambda2 regularization parameter to evaluate.
- X_ref : pd.DataFrame
Single-cell reference matrix.
- train_data : pd.DataFrame
Training bulk RNA-seq data for this fold.
- gamma : pd.DataFrame
Gene weights learned by DTD.
- test_data : pd.DataFrame
Test bulk RNA-seq data for this fold.
- Cc_all : np.ndarray
Concatenated matrix of bulk profiles for all samples.
Returns
- tuple
The index j and the computed loss for the given lambda2 value.
- static get_lambda_1se(Losses, polyfit=False)
Determine lambda2 using the 1SE rule.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- polyfit : bool, optional
If True, considers polynomial smoothing when calculating the 1SE rule. Default is False.
Returns
- float
Lambda2 value corresponding to the 1SE rule.
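The 1SE rule itself is simple to state; the sketch below assumes Losses holds one row per lambda2 value (index = lambda2) and one column per fold. This orientation is an assumption, and the function is an illustration rather than the package's implementation:

```python
import numpy as np
import pandas as pd

def lambda_1se(losses: pd.DataFrame) -> float:
    """Largest lambda2 whose mean CV loss is within one standard error of the minimum."""
    avg = losses.mean(axis=1)
    se = losses.std(axis=1, ddof=1) / np.sqrt(losses.shape[1])
    lam_min = avg.idxmin()
    threshold = avg[lam_min] + se[lam_min]
    # Among all lambdas whose average loss stays below the threshold, take the largest.
    return max(lam for lam, loss in avg.items() if loss <= threshold)

# Synthetic loss curve with its minimum near lambda2 = 1e-5
lambdas = np.logspace(-10, 0, 11)
rng = np.random.default_rng(2)
losses = pd.DataFrame(
    {f"fold_{k}": (np.log10(lambdas) + 5) ** 2 + rng.normal(0, 0.1, 11)
     for k in range(5)},
    index=lambdas,
)
lam = lambda_1se(losses)
```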
- static get_lambda_1se_polyfit(Losses, degree=5, num_points=1000)
Determine lambda2 using the 1SE rule with polynomial smoothing.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- degree : int, optional
Degree of the polynomial fit. Default is 5.
- num_points : int, optional
Number of points for lambda2 interpolation. Default is 1000.
Returns
- tuple
- lambda_1se : float
Lambda2 value corresponding to the 1SE rule.
- fine_lambdas : np.ndarray
Interpolated lambda2 values.
- smoothed_avgLoss : np.ndarray
Smoothed average loss values.
- static get_lambda_max_gradient(Losses)
Determine lambda2 based on the maximum gradient of the loss curve.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
Returns
- float
Lambda2 value at the point of maximum gradient.
- get_plot(Losses, polyfit=False)
Generate a plot of the cross-validation loss curve.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- polyfit : bool, optional
If True, includes a polynomial fit and highlights the lambda2 from the 1SE rule. Default is False.
- run()
Perform the hyperparameter search for lambda2.
Executes k-fold cross-validation and evaluates a range of lambda2 values to identify optimal hyperparameters based on loss minimization.
Populates the following attributes:
- Losses : pd.DataFrame
- lambda_min : float
- lambda_1se : float
- lambda_max_gradient : float
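Putting the pieces together, a typical workflow might look like the sketch below. The deconomix calls are shown as comments because they require the package and real data, and feeding hps.lambda_1se into ADTD this way is our suggested use rather than something prescribed by the API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy stand-ins with the documented shapes
X_ref = pd.DataFrame(rng.random((50, 3)))    # genes x celltypes
Y_test = pd.DataFrame(rng.random((50, 20)))  # genes x n_mixtures
gamma = pd.DataFrame(np.ones((50, 1)))       # genes x 1, e.g. from a trained DTD model

# from deconomix.methods import HPS, ADTD
# hps = HPS(X_ref, Y_test, gamma, kfold=5, n_points=11)
# hps.run()
# model = ADTD(X_ref, Y_test, gamma, lambda2=hps.lambda_1se)
# model.run()
# model.C_est, model.c_est, model.x_est, model.Delta_est
```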