Utilities
The deconomix.utils submodule of DeconomiX provides helper functions that support the standard (A)DTD workflow.
- deconomix.utils.calculate_corr(C_true, C_est, hidden_ct=None, c_est=None)
Function that calculates the correlation of estimated composition for multiple bulks with the ground truth.
Parameters
- C_true : pd.DataFrame
True cellular compositions for all mixtures analyzed in C_est. Features all cell types, even if one cell type is hidden. Shape: n_cell_types x n_mixtures
- C_est : pd.DataFrame
Estimated cellular compositions for multiple bulk mixtures. Features all non-hidden cell types from the ground truth. Shape: n_cell_types x n_mixtures (without the hidden cell type, if any)
- hidden_ct : String (Optional)
Label of the hidden cell type in C_true, that is, the cell type that is not featured in C_est.
- c_est : pd.DataFrame (Optional)
Estimated hidden background contributions for all mixtures featured in C_est (ADTD). Shape: 1 x n_mixtures
Returns
- correlations : pd.DataFrame
Spearman correlation values for C_true vs. C_est for each cell type. Shape: n_cell_types
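The computation described above can be sketched with plain pandas. This is a minimal illustration and not the deconomix implementation: the helper name spearman_per_cell_type and the toy data are hypothetical, and the helper returns a Series for brevity. It assumes that the hidden background estimate c_est is compared against the row of C_true labeled hidden_ct.

```python
import pandas as pd


def spearman_per_cell_type(C_true, C_est, hidden_ct=None, c_est=None):
    """Hypothetical helper: Spearman correlation per cell type (rows) across mixtures (columns)."""
    if hidden_ct is not None and c_est is not None:
        # Assumption: the 1 x n_mixtures background estimate stands in for the hidden cell type.
        hidden_row = c_est.copy()
        hidden_row.index = [hidden_ct]
        C_est = pd.concat([C_est, hidden_row])
    shared = [ct for ct in C_true.index if ct in C_est.index]
    # Spearman correlation equals the Pearson correlation of the ranks.
    return pd.Series(
        {ct: C_true.loc[ct].rank().corr(C_est.loc[ct].rank()) for ct in shared},
        name="spearman",
    )


# Toy data: three cell types, four mixtures, "Mal" is hidden from the estimate.
mixtures = ["m1", "m2", "m3", "m4"]
C_true = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.3, 0.2, 0.1], [0.4, 0.5, 0.5, 0.5]],
    index=["B", "T", "Mal"], columns=mixtures,
)
C_est = pd.DataFrame(
    [[0.12, 0.18, 0.35, 0.33], [0.60, 0.40, 0.25, 0.10]],
    index=["B", "T"], columns=mixtures,
)
c_est = pd.DataFrame([[0.30, 0.45, 0.50, 0.55]], index=["hidden"], columns=mixtures)

corr_visible = spearman_per_cell_type(C_true, C_est)
corr_all = spearman_per_cell_type(C_true, C_est, hidden_ct="Mal", c_est=c_est)
print(corr_all)
```

Without hidden_ct and c_est, only the cell types present in both matrices are compared.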
- deconomix.utils.calculate_estimated_composition(X_mat: DataFrame, Y_mat: DataFrame, gamma: DataFrame)
Function that calculates the estimated composition of Y based on a reference matrix X and the gene weight vector gamma generated by the DTD training.
Parameters
- X_mat : pd.DataFrame
Single cell reference matrix X containing (average) profiles as columns per cell type. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing the generated artificial bulk profiles. Shape: genes x n_mixtures
- gamma : pd.DataFrame
DataFrame containing the weights for the genes featured in X_mat and Y_mat. Usually generated by Loss-function Learning for Digital Tissue Deconvolution. Shape: genes x 1
Returns
- C_estimated : pd.DataFrame
Matrix that contains estimated cellular abundances for the bulk profiles of the Y matrix.
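As an illustration of how gene weights enter the estimate, the sketch below solves the gene-weighted least-squares problem C = (X^T G X)^(-1) X^T G Y with G = diag(gamma), which is the estimator used in Loss-function Learning for DTD. Whether deconomix applies further steps (for example normalization or clipping) is not stated here, so treat this as an assumption. The helper name estimate_composition and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def estimate_composition(X_mat, Y_mat, gamma):
    """Hypothetical helper: gene-weighted least-squares estimate of the composition."""
    genes = X_mat.index
    X = X_mat.to_numpy(dtype=float)                       # genes x celltypes
    Y = Y_mat.loc[genes].to_numpy(dtype=float)            # genes x n_mixtures
    g = gamma.loc[genes].to_numpy(dtype=float).ravel()    # one weight per gene
    XtG = X.T * g                                         # X^T diag(g) via broadcasting
    C = np.linalg.solve(XtG @ X, XtG @ Y)                 # celltypes x n_mixtures
    return pd.DataFrame(C, index=X_mat.columns, columns=Y_mat.columns)


# Toy data: bulks are built exactly as X @ C_true, so the estimate should recover C_true.
genes = [f"gene_{i}" for i in range(5)]
X_mat = pd.DataFrame(
    [[10, 1], [2, 8], [5, 5], [0, 3], [7, 0]], index=genes, columns=["B", "T"]
)
C_true = pd.DataFrame(
    [[0.2, 0.5, 0.9], [0.8, 0.5, 0.1]], index=["B", "T"], columns=["m1", "m2", "m3"]
)
Y_mat = X_mat.dot(C_true)
gamma = pd.DataFrame({"gamma": [1.0, 2.0, 0.5, 1.0, 3.0]}, index=genes)

C_hat = estimate_composition(X_mat, Y_mat, gamma)
print(C_hat.round(3))
```

On real data the bulks are not an exact linear combination of the reference profiles, and the weights then determine which genes dominate the fit.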
- deconomix.utils.fishers_exact_test(ranking, geneset, percentage=5.0, verbose=True)
Function that performs Fisher’s exact test to determine whether the genes in a given gene set are enriched in the top percentage of a ranked list of genes.
Parameters
- ranking : list
Ranked list of genes, for instance the gene names of one column in a Delta matrix of an ADTD model after ordering it.
- geneset : list
Unordered list of genes of interest, for instance a gene set corresponding to a specific pathway in one cell type.
- percentage : float
Top percentage of the ranked list that is considered ‘highly’ ranked. Default is 5 percent.
- verbose : bool
Whether to print all results (default) or not.
Returns
- results : dict
Dictionary containing the statistical results of the test, keys “Odds Ratio” and “p-Value”.
- contingency_df : pd.DataFrame
Contingency table as pandas dataframe.
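The test described above can be sketched with scipy.stats.fisher_exact on a 2 x 2 contingency table. This is an illustration under assumptions, not the deconomix code: the helper name top_fraction_enrichment, the table labels, the rounding of the top fraction, and the one-sided alternative are all choices made for this example.

```python
import pandas as pd
from scipy.stats import fisher_exact


def top_fraction_enrichment(ranking, geneset, percentage=5.0):
    """Hypothetical helper: is the gene set over-represented among the top-ranked genes?"""
    n_top = max(1, int(round(len(ranking) * percentage / 100)))
    top, rest = set(ranking[:n_top]), set(ranking[n_top:])
    members = set(geneset)
    table = pd.DataFrame(
        [
            [len(top & members), len(top - members)],
            [len(rest & members), len(rest - members)],
        ],
        index=["top", "rest"],
        columns=["in gene set", "not in gene set"],
    )
    # Assumption: one-sided test for enrichment in the top fraction.
    odds_ratio, p_value = fisher_exact(table.to_numpy(), alternative="greater")
    return {"Odds Ratio": odds_ratio, "p-Value": p_value}, table


# Toy data: 100 ranked genes, three of the four genes of interest sit in the top 5 percent.
ranking = [f"g{i}" for i in range(100)]
geneset = ["g0", "g1", "g2", "g50"]

results, contingency_df = top_fraction_enrichment(ranking, geneset, percentage=5.0)
print(contingency_df)
print(results)
```

Genes in geneset that do not occur in ranking are ignored in this sketch.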
- deconomix.utils.gene_set_enrichment(Deltas, gene_set)
Perform gene set enrichment analysis using Mann-Whitney U tests.
This function evaluates whether a specified gene set (e.g., a family of genes) shows a statistically significant tendency to have higher or lower values in the given data, compared to other genes. The analysis is conducted across multiple datasets, each represented as a dataframe in Deltas.
Parameters
- Deltas : list of pandas.DataFrame
A list of dataframes where each dataframe contains gene expression or delta values. Rows represent genes, and columns represent different cell types or conditions.
- gene_set : str or list
Either a string identifying a gene family or a list of gene names. If a string is given, genes in each dataframe whose names contain this string are considered part of the target gene set.
Returns
- all_results : list of pandas.DataFrame
A list of dataframes, one for each dataframe in Deltas. Each result dataframe contains the following columns:
  - celltypes: The cell type or condition.
  - p_values: P-values from the Mann-Whitney U test comparing the target gene set against all other genes.
  - direction: Indicates whether the target gene set shows higher (“up”) or lower (“down”) values compared to the other genes.
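The per-cell-type test can be sketched with scipy.stats.mannwhitneyu for a single dataframe. This is a minimal illustration, not the deconomix implementation: the helper name mwu_enrichment and the toy data are hypothetical, and deriving the direction from a comparison of medians is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def mwu_enrichment(Delta, gene_set):
    """Hypothetical helper: Mann-Whitney U test of a gene set vs. all other genes, per column."""
    if isinstance(gene_set, str):
        in_set = np.asarray(Delta.index.str.contains(gene_set, regex=False), dtype=bool)
    else:
        in_set = np.asarray(Delta.index.isin(gene_set), dtype=bool)
    rows = []
    for celltype in Delta.columns:
        target = Delta.loc[in_set, celltype]
        others = Delta.loc[~in_set, celltype]
        _, p_value = mannwhitneyu(target, others, alternative="two-sided")
        # Assumption: direction is taken from the medians of the two groups.
        direction = "up" if target.median() > others.median() else "down"
        rows.append({"celltypes": celltype, "p_values": p_value, "direction": direction})
    return pd.DataFrame(rows)


# Toy data: genes whose names contain "RPL" are high in column "B" and low in column "T".
genes = ["RPL3", "RPL5", "RPS6", "RPL10", "ACTB", "GAPDH", "CD3E", "CD19", "MS4A1", "PTPRC"]
Delta = pd.DataFrame(
    {
        "B": [2.0, 2.2, 1.0, 2.5, 1.1, 0.9, 0.95, 1.2, 0.8, 1.05],
        "T": [0.3, 0.2, 1.0, 0.25, 1.1, 0.9, 0.95, 1.2, 0.8, 1.05],
    },
    index=genes,
)

result = mwu_enrichment(Delta, "RPL")
print(result)
```

For a list of dataframes, the same helper can be applied to each element, for example with [mwu_enrichment(D, "RPL") for D in Deltas].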
- deconomix.utils.load_example()
Function providing an example test and training set for a DeconomiX workflow. Data is downloaded from NCBI (Tirosh et al., https://doi.org/10.1126/science.aad0501) once and cached for further usage.
Returns
- tir_test : pd.DataFrame
Matrix that contains multiple single cell RNA profiles for different cell types from a set of tumors. Each column label gives the cell type of the profile. Shape: genes x n_samples
- tir_train : pd.DataFrame
Matrix that contains multiple single cell RNA profiles for different cell types from a different set of tumors. Each column label gives the cell type of the profile. Shape: genes x n_samples
- deconomix.utils.plot_corr(C_true, C_est, title='', color='grey', hidden_ct=None, c_est=None, path=None)
Function that calculates and visualizes the correlation of estimated composition for multiple bulks with the ground truth.
Parameters
- C_true : pd.DataFrame
True cellular compositions for all mixtures analyzed in C_est. Features all cell types, even if one cell type is hidden. Shape: n_cell_types x n_mixtures
- C_est : pd.DataFrame
Estimated cellular compositions for multiple bulk mixtures. Features all non-hidden cell types from the ground truth. Shape: n_cell_types x n_mixtures (without the hidden cell type, if any)
- color : String
Sets the color of the correlation plots for the non-hidden cell types.
- hidden_ct : String (Optional)
Label of the hidden cell type in C_true, that is, the cell type that is not featured in C_est.
- c_est : pd.DataFrame (Optional)
Estimated hidden background contributions for all mixtures featured in C_est (ADTD). Shape: 1 x n_mixtures
- path : String (Optional)
Path under which the figure is saved as a PDF.
Returns
- correlations : pd.DataFrame
Spearman correlation values for C_true vs. C_est for each cell type. Shape: n_cell_types
- deconomix.utils.simulate_data(scRNA_df: DataFrame, n_mixtures: int, n_cells_in_mix: int, n_genes: int | None = None, seed: int = 1, coverage_threshold: int = 5)
Function that generates artificial bulk mixtures Y from a collection of single cell profiles and saves the ground truth of their composition C. Furthermore, it derives a reference matrix X by averaging over provided example profiles per cell type.
Parameters
- scRNA_df : pd.DataFrame
A matrix containing labeled single cell RNA profiles from multiple cell types. Please provide multiple examples per cell type and label each sample with a column label stating the cell type. Shape: genes x n_samples
- n_mixtures : int
Number of artificial bulks to create.
- n_cells_in_mix : int
Number of profiles from scRNA_df that are randomly mixed together in one artificial bulk.
- n_genes : int (Optional)
If set, filters genes from scRNA_df by variance across cell types and restricts the gene space to the n_genes most variable genes.
- seed : int
Random seed for the drawing process.
- coverage_threshold : int
Minimum number of examples per cell type. If the number of examples for a cell type is below the threshold, a warning is issued. (Default = 5).
Returns
- X_ref : pd.DataFrame
Single cell reference matrix X calculated by averaging over all examples of each cell type in scRNA_df. Shape: genes x celltypes
- Y_mat : pd.DataFrame
Y matrix containing the generated artificial bulk profiles. Each column represents one artificial bulk, normalized by n_cells_in_mix. Shape: genes x n_mixtures
- C_mat : pd.DataFrame
Composition matrix containing the true cellular abundance in percent for each artificial mixture according to the drawing process. Shape: cell types x n_mixtures
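The drawing process can be sketched with NumPy and pandas as follows. This is a simplified illustration, not the deconomix implementation: the helper name simulate_bulks and the toy data are hypothetical, profiles are drawn uniformly with replacement (an assumption), compositions are stored as fractions that sum to 1 per mixture, and the n_genes filter and the coverage check are omitted.

```python
import numpy as np
import pandas as pd


def simulate_bulks(scRNA_df, n_mixtures, n_cells_in_mix, seed=1):
    """Hypothetical helper: build a reference X, artificial bulks Y and true compositions C."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(scRNA_df.columns, dtype=object)
    cell_types = list(dict.fromkeys(labels))          # unique labels, order of appearance
    profiles = scRNA_df.to_numpy(dtype=float)         # genes x n_samples
    names = [f"mix_{m}" for m in range(n_mixtures)]

    # Reference matrix: mean profile per cell type.
    X_ref = pd.DataFrame(
        {ct: profiles[:, labels == ct].mean(axis=1) for ct in cell_types},
        index=scRNA_df.index,
    )
    Y = pd.DataFrame(0.0, index=scRNA_df.index, columns=names)
    C = pd.DataFrame(0.0, index=cell_types, columns=names)
    for name in names:
        # Assumption: single cell profiles are drawn uniformly with replacement.
        drawn = rng.choice(profiles.shape[1], size=n_cells_in_mix, replace=True)
        Y[name] = profiles[:, drawn].sum(axis=1) / n_cells_in_mix
        C[name] = [np.mean(labels[drawn] == ct) for ct in cell_types]
    return X_ref, Y, C


# Toy data: 6 genes, 3 single cell profiles for each of 3 cell types.
demo_rng = np.random.default_rng(0)
scRNA_df = pd.DataFrame(
    demo_rng.poisson(5.0, size=(6, 9)),
    index=[f"gene_{i}" for i in range(6)],
    columns=["B"] * 3 + ["T"] * 3 + ["Mal"] * 3,
)

X_ref, Y_mat, C_mat = simulate_bulks(scRNA_df, n_mixtures=4, n_cells_in_mix=10)
print(C_mat)
```

Because each bulk is built from individual profiles while X_ref holds cell type averages, Y_mat is only approximately equal to X_ref multiplied by C_mat.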