Methods
The deconomix.methods submodule provides the models for DTD and ADTD and the respective methods to train them.
DTD Class
- class deconomix.methods.DTD(X_mat: DataFrame, Y_mat: DataFrame, C_mat: DataFrame)
Class that implements Digital Tissue Deconvolution via Loss-function Learning
Attributes
- X_mat : pd.DataFrame
Single-cell reference matrix X containing (average) profiles as columns per cell type. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing the generated artificial bulk profiles. Shape: genes x n_mixtures
- C_mat : pd.DataFrame
C matrix containing the relative composition of the bulk mixtures in Y. Shape: celltypes x n_mixtures
- run(iterations=1000, plot=False, path_plot=None, func='pearson')
Function that executes the training of a DTD model and saves the results in the model attributes.
Parameters
- iterations : int
How many training steps should be conducted.
- plot : bool
Whether to plot the development of the loss values during training.
Updates Attributes
- gamma : pd.DataFrame
Gene weights resulting from the training. Shape: genes x 1
- losses : list
List of loss values, one per training step.
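As a quick illustration of the expected input layout, the toy matrices below follow the documented shapes (all gene and cell-type names are made up). The actual training call is shown as a comment since it requires deconomix and real reference data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

genes = [f"gene_{i}" for i in range(50)]
celltypes = ["T", "B", "NK"]
n_mixtures = 20

# Reference profiles X: genes x celltypes
X = pd.DataFrame(rng.random((len(genes), len(celltypes))),
                 index=genes, columns=celltypes)

# Compositions C: celltypes x n_mixtures, columns sum to 1
C = rng.random((len(celltypes), n_mixtures))
C = pd.DataFrame(C / C.sum(axis=0), index=celltypes)

# Artificial bulk profiles Y = X @ C: genes x n_mixtures
Y = X @ C

# Training (requires deconomix):
# from deconomix.methods import DTD
# model = DTD(X, Y, C)
# model.run(iterations=1000)
# model.gamma   # learned gene weights, shape genes x 1
# model.losses  # one loss value per training step
```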
ADTD Class
- class deconomix.methods.ADTD(X_mat: DataFrame, Y_mat: DataFrame, gamma: DataFrame, lambda1: float = 1.0, lambda2: float = 1.0, max_iterations: int = 200, eps: float = 1e-08, C_static: bool = False, Delta_static: bool = False, gamma_offset: bool = True, delta_stepsize: int = 1)
Class that implements the Adaptive Digital Tissue Deconvolution algorithm.
Attributes
- X_mat : pd.DataFrame
Single-cell reference matrix X containing (average) profiles as columns per cell type. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing bulk profiles. Shape: genes x n_mixtures
- gamma : pd.DataFrame
Gene weights. Shape: genes x 1
- max_iterations : int
Maximum number of optimization steps.
- lambda1 : float
Hyperparameter for cellular composition estimation.
- lambda2 : float
Hyperparameter for reference profile adaption (gene regulation).
- eps : float
Stopping criterion based on the error.
- C_static : bool
If True, the cellular composition C is held fixed and not optimized.
- Delta_static : bool
If True, the reference profile adaption Delta is held fixed and not optimized.
- run(verbose=True)
Fit an ADTD model
Parameters
- verbose : bool
Whether a progress bar should be shown during training (default) or not (silent mode).
Updates Attributes
- C_est : pd.DataFrame
Estimated contribution of the referenced cell types in X to the composition of the mixtures in Y. Shape: referenced cell types x mixtures
- c_est : pd.DataFrame
Estimated contribution of the hidden background to the composition of the mixtures in Y. Shape: 1 x mixtures
- x_est : pd.DataFrame
Estimated consensus hidden background profile. Shape: genes x 1
- Delta_est : pd.DataFrame
Estimated element-wise adaption factors for the reference matrix X (gene regulation). Shape: genes x referenced cell types (same shape as X)
- setup()
Initialization step; still required to obtain a reasonable starting estimate for the hidden background profile x. Random initialization may be possible.
- update_C0()
Calculate C0 with lambda1 = 0 and lambda2 -> inf. The earlier reasoning that lambda1 -> inf yields the DTD solution no longer holds.
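To make the shapes of these estimates concrete: assuming ADTD reconstructs the bulks as Y ≈ (Delta ∘ X) C + x c (this model form is our reading of the attribute descriptions above, not taken from the package), the pieces multiply together as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_mix = 50, 3, 20

X = rng.random((n_genes, n_types))        # reference matrix, genes x celltypes
C_est = rng.random((n_types, n_mix))      # referenced cell-type contributions
c_est = rng.random((1, n_mix))            # hidden background contributions
x_est = rng.random((n_genes, 1))          # consensus hidden background profile
Delta_est = np.ones((n_genes, n_types))   # element-wise adaption factors, same shape as X

# Reconstruction under the assumed model form: element-wise adaption of X,
# mixed by C_est, plus the hidden background term.
Y_est = (Delta_est * X) @ C_est + x_est @ c_est
```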
Hyperparameter Search
Finding suitable hyperparameters is not trivial. We provide an extensive grid search in the example section of the repository (gridsearch.py, Dockerfile). As a low-cost alternative, you can use the following class to estimate a good lambda_2 in the limiting case lambda_1 -> infinity:
- class deconomix.methods.HPS(X_ref: DataFrame, Y_test: DataFrame, gamma: DataFrame, Y_all: DataFrame | None = None, kfold: int = 5, parallel: bool = True, workers: int = 5, plot: bool = False, n_points: int = 11, lambda_min: float = 1e-10, lambda_max: float = 1)
HPS: Hyperparameter Search for L2 Regularization
This class performs a hyperparameter search to determine a suitable value for the L2 regularization parameter (lambda2) in a scenario where L1 regularization (lambda1) is fixed to infinity. It leverages 5-fold cross-validation on bulk RNA-seq datasets to evaluate performance and determines optimal lambda2 values using various methods, including the 1SE rule and gradient analysis.
Parameters
- X_ref : pd.DataFrame
Single-cell reference matrix containing average profiles as columns per cell type. Shape: genes x celltypes.
- Y_test : pd.DataFrame
Test set bulk RNA-seq profiles. Shape: genes x n_mixtures.
- gamma : pd.DataFrame
Gene weights learned by DTD on a training set. Shape: genes x 1.
- Y_all : pd.DataFrame, optional
Complete bulk RNA-seq dataset. Defaults to Y_test if not provided. Shape: genes x n_mixtures.
- kfold : int, optional
Number of folds for cross-validation. Default is 5.
- parallel : bool, optional
If True, performs parallel computation for efficiency. Default is True.
- workers : int, optional
Number of workers to use in parallel mode. Default is 5.
- plot : bool, optional
If True, generates plots for the cross-validation loss curves. Default is False.
- n_points : int, optional
Number of lambda2 values to test on the logarithmic grid. Default is 11.
- lambda_min : float, optional
Minimum value of lambda2 for the grid search. Default is 1e-10.
- lambda_max : float, optional
Maximum value of lambda2 for the grid search. Default is 1.
Attributes
- lambda_min : float
Optimal lambda2 minimizing the average loss across folds.
- lambda_1se : float
Largest lambda2 within one standard error of the minimum average loss.
- Losses : pd.DataFrame
Cross-validation losses for all lambda2 values and folds.
- lambda_max_gradient : float
Optimal lambda2 based on the maximum gradient of the loss curve.
- static cross_validation_split_columns(df, n_splits=5)
Split the columns of a DataFrame into train and test sets for k-fold cross-validation.
Parameters
- df : pd.DataFrame
Input data with columns to split.
- n_splits : int, optional
Number of folds for cross-validation. Default is 5.
Returns
- list of tuples
A list where each tuple contains train and test DataFrame subsets for one fold.
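The split can be sketched in plain numpy/pandas; split_columns below is an illustrative stand-in, not the package's implementation, and the exact fold assignment in deconomix may differ:

```python
import numpy as np
import pandas as pd

def split_columns(df, n_splits=5):
    """Return (train, test) column subsets for k-fold CV over the columns of df."""
    all_idx = np.arange(df.shape[1])
    folds = np.array_split(all_idx, n_splits)
    splits = []
    for test_idx in folds:
        train_idx = np.setdiff1d(all_idx, test_idx)
        splits.append((df.iloc[:, train_idx], df.iloc[:, test_idx]))
    return splits

# 4 genes x 10 mixtures -> 5 folds with 2 test columns each
df = pd.DataFrame(np.arange(40).reshape(4, 10))
splits = split_columns(df, n_splits=5)
```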
- static evaluate_l2(j, l2, X_ref, train_data, gamma, test_data, Cc_all)
Evaluate the loss for a specific lambda2 value.
Parameters
- j : int
Index of the lambda2 value in the grid.
- l2 : float
Lambda2 regularization parameter to evaluate.
- X_ref : pd.DataFrame
Single-cell reference matrix.
- train_data : pd.DataFrame
Training bulk RNA-seq data for this fold.
- gamma : pd.DataFrame
Gene weights learned by DTD.
- test_data : pd.DataFrame
Test bulk RNA-seq data for this fold.
- Cc_all : np.ndarray
Concatenated matrix of bulk profiles for all samples.
Returns
- tuple
The index j and the computed loss for the given lambda2 value.
- static get_lambda_1se(Losses, polyfit=False)
Determine lambda2 using the 1SE rule.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- polyfit : bool, optional
If True, considers polynomial smoothing when calculating the 1SE rule. Default is False.
Returns
- float
Lambda2 value corresponding to the 1SE rule.
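The 1SE rule itself is simple to state; the sketch below assumes Losses holds one row per lambda2 value (index = lambda2) and one column per fold. This orientation is an assumption, and the function is an illustration rather than the package's implementation:

```python
import numpy as np
import pandas as pd

def lambda_1se(losses: pd.DataFrame) -> float:
    """Largest lambda2 whose mean CV loss is within one standard error of the minimum."""
    avg = losses.mean(axis=1)
    se = losses.std(axis=1, ddof=1) / np.sqrt(losses.shape[1])
    lam_min = avg.idxmin()
    threshold = avg[lam_min] + se[lam_min]
    # Among all lambdas whose average loss stays below the threshold, take the largest.
    return max(lam for lam, loss in avg.items() if loss <= threshold)

# Synthetic loss curve with its minimum near lambda2 = 1e-5
lambdas = np.logspace(-10, 0, 11)
rng = np.random.default_rng(2)
losses = pd.DataFrame(
    {f"fold_{k}": (np.log10(lambdas) + 5) ** 2 + rng.normal(0, 0.1, 11)
     for k in range(5)},
    index=lambdas,
)
lam = lambda_1se(losses)
```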
- static get_lambda_1se_polyfit(Losses, degree=5, num_points=1000)
Determine lambda2 using the 1SE rule with polynomial smoothing.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- degree : int, optional
Degree of the polynomial fit. Default is 5.
- num_points : int, optional
Number of points for lambda2 interpolation. Default is 1000.
Returns
- tuple
- lambda_1se : float
Lambda2 value corresponding to the 1SE rule.
- fine_lambdas : np.ndarray
Interpolated lambda2 values.
- smoothed_avgLoss : np.ndarray
Smoothed average loss values.
- static get_lambda_max_gradient(Losses)
Determine lambda2 based on the maximum gradient of the loss curve.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
Returns
- float
Lambda2 value at the point of maximum gradient.
- get_plot(Losses, polyfit=False)
Generate a plot of the cross-validation loss curve.
Parameters
- Losses : pd.DataFrame
Cross-validation losses for each lambda2 value.
- polyfit : bool, optional
If True, includes a polynomial fit and highlights the lambda2 from the 1SE rule. Default is False.
- run()
Perform the hyperparameter search for lambda2.
Executes k-fold cross-validation and evaluates a range of lambda2 values to identify optimal hyperparameters based on loss minimization.
Populates the following attributes:
- Losses : pd.DataFrame
- lambda_min : float
- lambda_1se : float
- lambda_max_gradient : float
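Putting the pieces together, a typical workflow might look like the sketch below. The deconomix calls are shown as comments because they require the package and real data, and feeding hps.lambda_1se into ADTD this way is our suggested use rather than something prescribed by the API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Toy stand-ins with the documented shapes
X_ref = pd.DataFrame(rng.random((50, 3)))    # genes x celltypes
Y_test = pd.DataFrame(rng.random((50, 20)))  # genes x n_mixtures
gamma = pd.DataFrame(np.ones((50, 1)))       # genes x 1, e.g. from a trained DTD model

# from deconomix.methods import HPS, ADTD
# hps = HPS(X_ref, Y_test, gamma, kfold=5, n_points=11)
# hps.run()
# model = ADTD(X_ref, Y_test, gamma, lambda2=hps.lambda_1se)
# model.run()
# model.C_est, model.c_est, model.x_est, model.Delta_est
```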