Utilities

The deconomix.utils submodule provides helper functions that support the standard (A)DTD workflow.

deconomix.utils.calculate_corr(C_true, C_est, hidden_ct=None, c_est=None)

Function that calculates the correlation of estimated composition for multiple bulks with the ground truth.

Parameters

C_true (pd.DataFrame): True cellular compositions for all mixtures analyzed in C_est. Features all cell types, even if one cell type is hidden. Shape: n_cell_types x n_mixtures
C_est (pd.DataFrame): Estimated cellular compositions for multiple bulk mixtures. Features all non-hidden cell types from the ground truth. Shape: b_cell_types x n_mixtures
hidden_ct (String, Optional): Label of the cell type in C_true that is hidden, i.e. not featured in C_est.
c_est (pd.DataFrame, Optional): Estimated hidden background contributions for all mixtures featured in C_est (ADTD). Shape: 1 x n_mixtures

Returns

correlations (pd.DataFrame): Spearman correlation values for C_true vs. C_est for each cell type. Shape: n_cell_types
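
As an illustration of what such a per-cell-type comparison involves, the following is a minimal sketch and not the library's implementation. The helper name spearman_per_cell_type, the toy values, and the way the hidden contribution is relabeled and appended are assumptions made for this example. Spearman correlation is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

def spearman_per_cell_type(C_true, C_est):
    """Row-wise Spearman correlation for (cell types x mixtures) frames.

    Hypothetical helper for illustration; only cell types present in
    both frames are compared.
    """
    shared = C_true.index.intersection(C_est.index)
    corr = {
        # Pearson correlation of ranks is the Spearman correlation
        ct: C_true.loc[ct].rank().corr(C_est.loc[ct].rank())
        for ct in shared
    }
    return pd.Series(corr, name="spearman")

mixtures = ["m1", "m2", "m3", "m4"]
C_true = pd.DataFrame(
    [[0.5, 0.2, 0.1, 0.40], [0.3, 0.5, 0.2, 0.35], [0.2, 0.3, 0.7, 0.25]],
    index=["T", "B", "Tumor"], columns=mixtures,
)
# Estimates for the non-hidden cell types only
C_est = pd.DataFrame(
    [[0.45, 0.25, 0.15, 0.35], [0.30, 0.40, 0.25, 0.45]],
    index=["T", "B"], columns=mixtures,
)
# Estimated hidden background (1 x n_mixtures), as an ADTD model might provide
c_est = pd.DataFrame([[0.25, 0.35, 0.60, 0.20]], index=["hidden"], columns=mixtures)

# Assumption: relabel the hidden row with the true label so it can be compared
hidden_ct = "Tumor"
full_est = pd.concat([C_est, c_est.rename(index={"hidden": hidden_ct})])

rho = spearman_per_cell_type(C_true, full_est)
print(rho)
```

Working on ranks makes the measure insensitive to monotone rescaling of the estimates, which is useful when estimated and true abundances are not on the same scale.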

deconomix.utils.calculate_estimated_composition(X_mat: DataFrame, Y_mat: DataFrame, gamma: DataFrame)

Function that calculates the estimated composition of Y based on a reference matrix X and the gene weight vector gamma generated by the DTD training.

Parameters

X_mat (pd.DataFrame): Single cell reference matrix X containing (average) profiles as columns per celltype. Shape: genes x celltypes
Y_mat (pd.DataFrame): Y matrix containing the generated artificial bulk profiles. Shape: genes x n_mixtures
gamma (pd.DataFrame): DataFrame containing the weights for the genes featured in X_mat and Y_mat. Usually generated by Loss-function Learning for Digital Tissue Deconvolution. Shape: genes x 1

Returns

C_estimated (pd.DataFrame): Matrix that contains estimated cellular abundances for the bulk profiles of the Y matrix.
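
A gene-weighted estimate of this kind can be sketched as a weighted least-squares problem, C = (X^T G X)^(-1) X^T G Y with G = diag(gamma). This is an assumption based on the general DTD formulation, not a statement about what DeconomiX does internally; the function name weighted_ls_composition and the toy numbers are hypothetical, and no non-negativity constraint or normalization is applied here.

```python
import numpy as np
import pandas as pd

def weighted_ls_composition(X_mat, Y_mat, gamma):
    """Solve the gene-weighted least-squares problem for C (illustrative only)."""
    genes = X_mat.index
    g = gamma.loc[genes].to_numpy().ravel()   # one weight per gene
    X = X_mat.to_numpy()
    Y = Y_mat.loc[genes].to_numpy()
    XtG = X.T * g                             # equals X.T @ diag(g)
    C = np.linalg.solve(XtG @ X, XtG @ Y)
    return pd.DataFrame(C, index=X_mat.columns, columns=Y_mat.columns)

genes = ["g1", "g2", "g3", "g4"]
X_mat = pd.DataFrame([[10, 1], [2, 8], [5, 5], [0, 3]],
                     index=genes, columns=["T", "B"], dtype=float)
C_known = pd.DataFrame([[0.7, 0.4, 0.1], [0.3, 0.6, 0.9]],
                       index=["T", "B"], columns=["m1", "m2", "m3"])
Y_mat = X_mat.dot(C_known)                    # noise-free bulks built from X and C
gamma = pd.DataFrame({"gamma": [1.0, 0.5, 2.0, 0.1]}, index=genes)

C_estimated = weighted_ls_composition(X_mat, Y_mat, gamma)
print(C_estimated.round(3))
```

Because the toy bulks are exact linear mixtures of the reference profiles, the known composition is recovered for any positive weights; with real data the weights determine which genes dominate the fit.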

deconomix.utils.fishers_exact_test(ranking, geneset, percentage=5.0, verbose=True)

Function that performs Fisher’s exact test to determine whether the genes in a given gene set are enriched in the top percentage of a ranked list of genes.

Parameters

ranking (list): Ranked list of genes, for instance the gene names of one column in a Delta matrix of an ADTD model after ordering it.
geneset (list): List of genes of interest, unordered, for instance a gene set corresponding to a specific pathway in one cell type.
percentage (float): Top percentage of the ranked list that is considered ‘highly’ ranked when building the contingency table for Fisher’s exact test. Default is 5 percent.
verbose (bool): Whether to print all results (default) or not.

Returns

results (dict): Dictionary containing the statistical results of the test, keys “Odds Ratio” and “p-Value”.
contingency_df (pd.DataFrame): Contingency table as pandas dataframe.
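
To make the test concrete, here is a small standard-library sketch of how a 2x2 contingency table can be built from a ranking and a gene set, and how a one-sided (enrichment) Fisher p-value follows from the hypergeometric distribution. The function name, the table layout, the rounding of the top group size, and the choice of a one-sided alternative are assumptions for illustration; the library may differ on each point.

```python
from math import comb

def fisher_enrichment_sketch(ranking, geneset, percentage=5.0):
    """One-sided Fisher's exact test for enrichment in the top of a ranking.

    Illustrative only; assumes gene names in `ranking` are unique.
    """
    n_top = max(1, int(len(ranking) * percentage / 100))  # size of the top group
    top, rest = set(ranking[:n_top]), set(ranking[n_top:])
    genes = set(geneset)
    a = len(top & genes)    # top-ranked and in the gene set
    b = len(top - genes)    # top-ranked, not in the gene set
    c = len(rest & genes)   # lower-ranked and in the gene set
    d = len(rest - genes)   # lower-ranked, not in the gene set
    N, K = a + b + c + d, a + c
    # P(at least `a` set genes in the top group) under the hypergeometric null
    p = sum(comb(K, x) * comb(N - K, n_top - x)
            for x in range(a, min(K, n_top) + 1)) / comb(N, n_top)
    odds = (a * d) / (b * c) if b * c else float("inf")
    return {"Odds Ratio": odds, "p-Value": p}, [[a, b], [c, d]]

ranking = [f"g{i}" for i in range(20)]   # g0 is ranked highest
geneset = ["g0", "g1", "g2", "g10"]
results, table = fisher_enrichment_sketch(ranking, geneset, percentage=25.0)
print(table, results)
```

In this toy case three of the four set genes fall into the top five positions, so the table is [[3, 2], [1, 14]] and the odds ratio is 21.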

deconomix.utils.gene_set_enrichment(Deltas, gene_set)

Perform gene set enrichment analysis using Mann-Whitney U tests.

This function evaluates whether a specified gene set (e.g., a family of genes) shows a statistically significant tendency to have higher or lower values in the given data, compared to other genes. The analysis is conducted across multiple datasets, each represented as a dataframe in Deltas.

Parameters

Deltas (list of pandas.DataFrame): A list of dataframes where each dataframe contains gene expression or delta values. Rows represent genes, and columns represent different cell types or conditions.
gene_set (str or list): A string specifying the indicator of a gene family or list with gene names. Genes in each dataframe that contain this string will be considered part of the target gene set.

Returns

all_results (list of pandas.DataFrame): A list of dataframes, one for each dataframe in Deltas. Each result dataframe contains the following columns:
- celltypes: The cell type or condition.
- p_values: P-values from the Mann-Whitney U test comparing the target gene set against all other genes.
- direction: Indicates whether the target gene set shows higher (“up”) or lower (“down”) values compared to the other genes.
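
The per-cell-type comparison described above can be sketched as follows. This is a simplified stand-in, not the library's code: it uses a two-sided Mann-Whitney U test with a normal approximation (no tie or continuity correction), and the function names, the substring matching rule, and the way the direction is derived from the U statistic are assumptions.

```python
import math
import pandas as pd

def mann_whitney_sketch(target, rest):
    """Two-sided Mann-Whitney U test via normal approximation (illustrative)."""
    n1, n2 = len(target), len(rest)
    ranks = pd.concat([target, rest]).rank()            # average ranks on ties
    u1 = ranks.iloc[:n1].sum() - n1 * (n1 + 1) / 2      # U statistic of the target set
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))                # two-sided p-value
    return p, ("up" if u1 > mean_u else "down")

def enrichment_sketch(delta, gene_set):
    """Test one (genes x cell types) frame; returns one row per cell type."""
    if isinstance(gene_set, str):
        in_set = delta.index.str.contains(gene_set, regex=False)
    else:
        in_set = delta.index.isin(gene_set)
    rows = []
    for ct in delta.columns:
        p, direction = mann_whitney_sketch(delta.loc[in_set, ct], delta.loc[~in_set, ct])
        rows.append({"celltypes": ct, "p_values": p, "direction": direction})
    return pd.DataFrame(rows)

genes = ["HLA-A", "HLA-B", "HLA-C"] + [f"G{i}" for i in range(1, 8)]
delta = pd.DataFrame(
    {"A": [8, 9, 10, 1, 2, 3, 4, 5, 6, 7], "B": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
    index=genes, dtype=float,
)
result = enrichment_sketch(delta, "HLA")
print(result)
```

For small gene sets an exact test is preferable to the normal approximation used here; the sketch only shows how target and background genes are separated and compared per column.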

deconomix.utils.load_example()

Function providing an example test and training set for a DeconomiX workflow. Data is downloaded from NCBI (Tirosh et al., https://doi.org/10.1126/science.aad0501) once and cached for further usage.

Returns

tir_test (pd.DataFrame): Matrix that contains multiple single cell RNA profiles for different cell types from a set of tumors. Each column label states the cell type. Shape: genes x n_samples
tir_train (pd.DataFrame): Matrix that contains multiple single cell RNA profiles for different cell types from a different set of tumors. Each column label states the cell type. Shape: genes x n_samples

deconomix.utils.plot_corr(C_true, C_est, title='', color='grey', hidden_ct=None, c_est=None, path=None)

Function that calculates and visualizes the correlation of estimated composition for multiple bulks with the ground truth.

Parameters

C_true (pd.DataFrame): True cellular compositions for all mixtures analyzed in C_est. Features all cell types, even if one cell type is hidden. Shape: n_cell_types x n_mixtures
C_est (pd.DataFrame): Estimated cellular compositions for multiple bulk mixtures. Features all non-hidden cell types from the ground truth. Shape: b_cell_types x n_mixtures
color (String): Sets the color of the correlation plots for the non-hidden cell types.
hidden_ct (String, Optional): Label of the cell type in C_true that is hidden, i.e. not featured in C_est.
c_est (pd.DataFrame, Optional): Estimated hidden background contributions for all mixtures featured in C_est (ADTD). Shape: 1 x n_mixtures
path (String, Optional): Path for saving the figure as a PDF.

Returns

correlations (pd.DataFrame): Spearman correlation values for C_true vs. C_est for each cell type. Shape: n_cell_types

deconomix.utils.simulate_data(scRNA_df: DataFrame, n_mixtures: int, n_cells_in_mix: int, n_genes: int | None = None, seed: int = 1, coverage_threshold: int = 5)

Function that generates artificial bulk mixtures Y from a collection of single cell profiles and saves the ground truth of their composition C. Furthermore, it derives a reference matrix X by averaging over provided example profiles per cell type.

Parameters

scRNA_df (pd.DataFrame): A matrix containing labeled single cell RNA profiles from multiple cell types. Please provide multiple examples per cell type and label each sample with a column label stating the cell type. Shape: genes x n_samples
n_mixtures (int): Number of artificial bulks to create.
n_cells_in_mix (int): Number of profiles from scRNA_df that are randomly mixed together in one artificial bulk.
n_genes (int, Optional): Filters genes from scRNA_df by variance across celltypes and restricts the gene space to the n_genes most variable genes.
seed (int): Random seed for the drawing process.
coverage_threshold (int): Minimum number of examples per cell type. If a cell type has fewer examples than the threshold, a warning is issued. Default is 5.

Returns

X_ref (pd.DataFrame): Single cell reference matrix X calculated by averaging over all examples of each cell type in scRNA_df. Shape: genes x celltypes
Y_mat (pd.DataFrame): Y matrix containing the generated artificial bulk profiles. Each column represents one artificial bulk, normalized by n_cells_in_mix. Shape: genes x n_mixtures
C_mat (pd.DataFrame): Composition matrix containing the true cellular abundance in percent for each artificial mixture according to the drawing process. Shape: cell types x n_mixtures
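
The drawing process described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the function name simulate_bulks_sketch is hypothetical, profiles are drawn with replacement, the composition is stored as fractions that sum to 1 per mixture, and gene filtering and the coverage warning are omitted.

```python
import numpy as np
import pandas as pd

def simulate_bulks_sketch(scRNA_df, n_mixtures, n_cells_in_mix, seed=1):
    """Mix random single-cell profiles into artificial bulks (illustrative)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(scRNA_df.columns)
    cell_types = sorted(set(labels))
    # Reference matrix: average profile per cell type
    X_ref = pd.DataFrame(
        {ct: scRNA_df.loc[:, labels == ct].mean(axis=1) for ct in cell_types}
    )
    Y_cols, C_cols = {}, {}
    for m in range(n_mixtures):
        # Assumption: draw column positions uniformly, with replacement
        picks = rng.integers(0, scRNA_df.shape[1], size=n_cells_in_mix)
        name = f"mixture_{m}"
        Y_cols[name] = scRNA_df.iloc[:, picks].sum(axis=1) / n_cells_in_mix
        counts = pd.Series(labels[picks]).value_counts()
        C_cols[name] = counts.reindex(cell_types, fill_value=0) / n_cells_in_mix
    return X_ref, pd.DataFrame(Y_cols), pd.DataFrame(C_cols)

# Toy input: 4 genes, 3 identical "T" profiles and 3 identical "B" profiles
scRNA_df = pd.DataFrame(
    [[10, 10, 10, 1, 1, 1], [2, 2, 2, 8, 8, 8], [5, 5, 5, 5, 5, 5], [0, 0, 0, 3, 3, 3]],
    index=["g1", "g2", "g3", "g4"],
    columns=["T", "T", "T", "B", "B", "B"],
    dtype=float,
)
X_ref, Y_mat, C_mat = simulate_bulks_sketch(scRNA_df, n_mixtures=5, n_cells_in_mix=10)
print(C_mat)
```

Since the toy profiles are identical within each cell type, every bulk equals X_ref multiplied by its composition column; with real single-cell data the bulks deviate from that product because individual cells differ from their cell-type average.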