library("tidyverse")
library("SingleCellExperiment")
set.seed(1)
Multi-condition single-cell RNA-seq analysis with LEMUR
What you will learn in this lab
Our objective is to compare single cell RNA-seq data from multiple, heterogeneous tissue samples across conditions. An obvious approach is to divide the constituent cells into clusters meant to represent cell types and, for each cluster separately, do “pseudo-bulk” differential expression with methods such as edgeR or DESEq2. However, such discrete categorization tends to be an unsatisfactory model of the underlying biology. Here, we use latent embedding multivariate regression (LEMUR) (Constantin Ahlmann-Eltze and Huber (2025)), a model that operates without, or before, commitment to discrete categorization. LEMUR
- integrates data from different conditions,
- predicts each cell’s gene expression changes as a function of the conditions and its position in latent space and
- for each gene, identifies a compact neighborhood of cells with consistent differential expression.
Getting started
To start, we attach the tidyverse
and SingleCellExperiment
packages:
The call to set.seed
, to set the initial conditions (“seed”) of R’s random number generator, is optional. Here, it ensures that the results are exactly the same for everyone. Some of the methods here use stochastic sampling, but the expectation is that even with different random seeds, the results will be essentially the same. You can verify this by omiting the call to set.seed
or calling it with a different argument.
Example data
We consider a dataset by Kang et al. (2018). The authors measured the effect of interferon-\(\beta\) stimulation on blood cells from eight patients. The muscData
package provides the data as a SingleCellExperiment
. If downloading the data with the muscData
package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce = readRDS("PATH_TO_THE_FILE")
.
= muscData::Kang18_8vs8() sce
see ?muscData and browseVignettes('muscData') for documentation
loading from cache
sce
class: SingleCellExperiment
dim: 35635 29065
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
You can read about the number of genes and cells when printing the summary of the sce
. Alternatively, you can use nrow(sce)
, ncol(sce)
, or dim(sce)
to get these values.
In a SingleCellExperiment
object, the metadata (“data about data”) on the cells are accessed with colData(sce)
, and those on the genes with rowData(sce)
.
Follow-up question: why does the documentation for colData
(run ?colData
in the R console) talk about SummarizedExperiment
objects instead of SingleCellExperiment
?
First, to visually explore the data, we logarithm transform them to account for the heteroskedasticity of the counts (C. Ahlmann-Eltze and Huber (2023)), perform PCA to obtain for each cell a 50-dimensional, smoothed embedding vector, and then further call UMAP to map these vectors into a 2-dimensional representation for visualization on the screen. We use the scater
package, which adds a new slot called logcounts
and the two slots reducedDims(sce)
named PCA
and UMAP
to the SummarizedExperiment
object. Equivalent functionality exists in Seurat
and scanpy
.
= scater::logNormCounts(sce)
sce = order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
hvg = sce[hvg[1:500], ]
sce = scater::runPCA(sce, ncomponents = 50)
sce = scater::runUMAP(sce, dimred = "PCA") sce
The transformGamPoi package from Bioconductor provides a residual_transform
function.
assay(sce, "pearson_residuals") = transformGamPoi::residual_transform(sce, residual = "pearson", on_disk = FALSE)
Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.
Too many dimensions for PCA can also pose a problem. The objective of doing PCA with a number of components that is smaller than that of the original data, is to smooth out the sampling noise in the data. If too many PCA components are included, the smoothing is too weak, and the additional dimensions capture noise that can obscure biologically relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).
The scater package also provides a runTSNE
function
= scater::runTSNE(sce, dimred = "PCA") sce
On one hand, tSNE and UMAP are de facto standards for visualizing the cell type and state space in single cell data analysis. On the other hand, they are often criticized for failing to represent global structure and distorting differences in the data. A number of alternatives have been suggested, see e.g.:
- Force directed layout of the \(k\) nearest neighbor (kNN) graph (igraph),
- PHATE
- IVIS
- Unification of tSNE and UMAP using contrastive learning (Böhm, 2023)
In the below two UMAP plots, the cells’ positions strongly separate by treatment status (stim
), as well as by annotated cell type (cell
). One goal of this tutorial is to understand how different approaches use these covariates to integrate the data into shared embedding spaces or representations.
In this tutorial, we use direct calls to functions in the ggplot2
package to visualize our data. This is a little more verbose than calling higher-level, specialized wrapper functions (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")
), on the other hand, you see more directly what is going on.
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
colData(sce)$ind
column lists the patient identifier for each cell. Can you use this to create a separate plot for each patient?
One solution could be to use a for
loop and filter the data for each patient, but there is a much simpler solution! Adding facet_wrap() to the ggplot call will automatically create a subpanel for each patient:
coord_fixed()
in the above plots?
Dimension reduction methods such as UMAP put in considerable effort to approximate the distance relationships between the points by arranging them in the Euclidean 2D plane, so it would be a shame to distort this by incidental figure size settings and by distinct scalings of the \(x\)- and the \(y\)-axis.
This dataset already comes with cell type annotations. The cell type labels help interpret the results; however, for the purpose of this tutorial, we will not need them and, besides Figure 3, will ignore them.
= as_tibble(colData(sce)) |>
centroids mutate(umap = reducedDim(sce, "UMAP")) |>
summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell))
# Use `geom_text_repel` to avoid overlapping annotations
as_tibble(colData(sce)) |>
mutate(umap = reducedDim(sce, "UMAP")) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = paste(stim, cell)), size = 0.3) +
::geom_text_repel(data = centroids, aes(label = paste(stim, cell))) +
ggrepelcoord_fixed()
The following code is based on the Seurat Guided Clustering Tutorial.
# For more information about the conversion see `?as.Seurat.CellDataSet`
= Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj = Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj = Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
seur_obj # Subset to highly variable genes for memory efficiency
= seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj = Seurat::ScaleData(seur_obj)
seur_obj = Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj = Seurat::RunUMAP(seur_obj, dims = 1:10) seur_obj
::DimPlot(seur_obj, reduction = "umap", group.by = "stim") Seurat
A prelude on data integration
When we made Figure 1, we discussed that the cells’ embedding locations separate strongly both by treatment status and cell type. For many analyses, we would like to overlay the corresponding cells from the stimulated condition with the cells from the control condition. For example, for cell type assignment, we might want to annotate both conditions together, irrespective of treatment. This aspiration is called integration. The goal is a mathematical representation (e.g., a low-dimensional embedding) of the cells where the treatment status does not visibly affect the positions overall, and all the visible variance represents different cell types. Figure 4 shows an example.
There are many approaches to this end. The question is somewhat ill-defined and can be formalized into a mathematical algorithm in many different ways. Luecken et al. (2022) benchmarked several approaches. Here, we first show how to do an integration manually, then with harmony
. Later we will compare both to lemur
. If you are just interested in using lemur
, you can skip ahead to Section 5.
A 2D embedding like UMAP or tSNE gives us a visual impression if variation that comes from “uninteresting” covariates (here: treatment condition) is eliminated and variation that comes from covariates we care about is retained, but the results are not quantitative, which means we cannot objectively assess or compare the quality of the result.
One simple metric to measure integration success is to see how mixed the cells from the two conditions are. For example, we can count for each cell how many of its \(k\) nearest neighbors come from the same condition and how many come from the other condition(s), and then look at the distribution of this fraction. For more, see Luecken et al. (2022).
Follow-up challenge: Write a function to calculate the mixing metric.
Follow-up questions: Can you write an integration function that scores perfectly on this mixing metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such silly solutions?
Manual Projection
In transcriptomic data analysis, each cell is characterized by its gene expression profile, which here we operationalize as the vector of counts for the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a \(P\)-dimensional space (\(P\le500\)) such that overall, distance relationships between cells in the original space are optimally approximated by those in the reduced, \(P\)-dimensional space.
Although there are over 20,000 genes in the human genome, the sensitivity of the assay that was used for our dataset is such that only a much smaller number of genes are well-measured, i.e., have counts high enough to be informative. Whether we use the top 500 genes by mean or by standard deviation makes no big difference.
To make these concepts more tangible, consider the cartoon in Figure 5. The orange blob represents an optimal two-dimensional embedding of all cells from the control condition. Similarly, the blue blob represents the treated cells. Each blob is contained within a two-dimensional subspace (plane) of the ambient space, but in general, these are different. To integrate both conditions, we can map the two planes into each other, so that each point from the blue blob is mapped into the orange subspace, and vice versa.
We can implement this procedure in a few lines of R code:
- We create a matrix for the cells from the control and treated conditions,
- we fit PCA on the control cells,
- we center the data, and finally
- we project the cells from both conditions onto the subspace of the control condition
# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
= as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
ctrl_mat = as.matrix(logcounts(sce)[,sce$stim == "stim"])
stim_mat
= rowMeans(ctrl_mat)
ctrl_centers = rowMeans(stim_mat)
stim_centers
# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
= irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)
ctrl_pca
# With a little bit of linear algebra, we project the points onto the subspace of the control cells
= matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat $stim == "ctrl"] = t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] = t(ctrl_pca$rotation) %*% (stim_mat - stim_centers) integrated_mat[,sce
We check how successful we were by looking at a UMAP of the integrated embedding.
= uwot::umap(t(integrated_mat))
int_mat_umap
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
The overlap is not perfect, but better than in Figure 1.
"stim"
subspace instead?
The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.
= irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)
stim_pca
= matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2 $stim == "ctrl"] = t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] = t(stim_pca$rotation) %*% (stim_mat - stim_centers)
integrated_mat2[,sce
= uwot::umap(t(integrated_mat2))
int_mat_umap2
as_tibble(colData(sce)) |>
mutate(umap = int_mat_umap2) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
For this example, using the "stim"
condition as the reference leads to a worse integration.
The projection approach consists of three steps:
- Centering the data (e.g.,
ctrl_mat - ctrl_centers
). - Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (
irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation
). - Projecting the data from the other conditions onto that subspace (
t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)
).
For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.
# A complex experimental design
= lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
lm_fit # The `residuals` function returns the coordinates minus the mean at the condition.
= t(residuals(lm_fit))
centered_mat # Assuming that `is_reference_condition` contains a selection of the cells
= irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
ref_pca = t(ref_pca$rotation) %*% centered_mat int_mat
Harmony
Harmony is one popular tool for data integration (Korsunsky et al. 2019). Harmony is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 7). Unlike classical clustering algorithms, maximum diversity clustering not just minimizes the distance of each data point to a cluster center but also maximizes the mixing of conditions assigned to each cluster.
= harmony::RunHarmony(reducedDim(sce, "PCA"), colData(sce), vars_use = c("stim"))
harm_mat = uwot::umap(harm_mat)
harm_umap
as_tibble(colData(sce)) |>
mutate(umap = harm_umap) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = stim), size = 0.3) +
coord_fixed()
The integration method of Seurat is based on mutual nearest neighbors (MNN) by Haghverdi et al. (2018). However, Seurat calls the mutual nearest neighbors integration anchors. The main difference is that Seurat uses canonical correspondence analysis (CCA) instead of principal component analysis (PCA) for dimensionality reduction (Butler et al. 2018).
# This code only works with Seurat v5!
= Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj2 "originalexp"]] = split(seur_obj2[["originalexp"]], f = seur_obj2$stim)
seur_obj2[[= Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj2 = Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)
seur_obj2 = Seurat::ScaleData(seur_obj2)
seur_obj2 = Seurat::RunPCA(seur_obj2, verbose = FALSE)
seur_obj2
= Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
seur_obj2 = Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")
seur_obj2 ::DimPlot(seur_obj2, reduction = "umap", group.by = "stim") Seurat
Analysis with LEMUR
LEMUR is a tool to analyze multi-condition single-cell data. A typical analysis workflow with LEMUR goes through four steps (Figure 8):
- Covariate-adjusted dimensionality reduction.
- Prediction of the differential expression for each cell per gene.
- Identification of neighborhoods of cells with similar differential expression patterns.
- Pseudobulked differential expression testing per neighborhood.
These are implemented by the following code.
library("lemur")
# Step 1
= lemur(sce, design = ~ stim, n_embedding = 30)
fit = align_by_grouping(fit, fit$colData$cell)
fit
# Step 2
= test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))
fit
# Steps 3 and 4
= find_de_neighborhoods(fit, group_by = vars(ind, stim)) nei
In the next sections we discuss these steps in more detail.
LEMUR Integration (Step 1)
Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. However, there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.
lemur
achieves such counterfactuals by matching the subspaces of each condition (Constantin Ahlmann-Eltze and Huber 2025). Figure 9 illustrates the principle. In some sense this is a more sophisticated version of the manual matching that we saw in the previous section. The matching is a single affine transformation in contrast to shifts per cluster inferred by Harmony.
lemur
takes as input a SingleCellExperiment
object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we call align_by_grouping
and tell it to use the provided cell type annotations.
library("lemur")
= lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit = align_by_grouping(fit, fit$colData$cell, verbose = FALSE) fit
Making a UMAP plot of the embedding by lemur
shows that it successfully integrated the conditions (Figure 10).
= uwot::umap(t(fit$embedding))
lemur_umap
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = stim), size = 0.3) + coord_fixed()
We can cross-check our embedding with the provided cell type annotations by coloring each cell by cell type (using the information stored in the colData(sce)$cell
column).
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = cell), size = 0.3) + coord_fixed()
lemur
?
lemur
be refined with an automated tool?
If there are no cell type labels present, lemur
uses a modified version of the maximum diversity clustering of harmony
to automatically infer the cell relationships across conditions.
= lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit2 = align_harmony(fit2, verbose = FALSE) fit2
Differential expression analysis (Step 2)
The advantage of the lemur
integration is that we can predict what the expression profile of a cell from the control condition would have been, had it been stimulated, and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.
We call the test_de
function of lemur
to compare the expression values in the "stim"
and "ctrl"
conditions.
= test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl")) fit
We can now pick an individual gene (PLSCR1) and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 12).
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(expr = logcounts(fit)["PLSCR1",]) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = expr), size = 0.3) +
scale_color_viridis_c() +
facet_wrap(vars(stim)) +
coord_fixed()
as_tibble(colData(sce)) |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")["PLSCR1",]) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed()
This approach has the advantage over the traditional cluster/cell type-based analysis that it can detect smooth differential expression gradients.
glmGamPoi
is a differential expression tool inspired by edgeR
and DESeq2
that is specifically designed for single cell RNA-seq data.
= glmGamPoi::pseudobulk(fit, group_by = vars(cell, ind, stim)) psce
Aggregating assay 'counts' using 'rowSums2'.
Aggregating assay 'logcounts' using 'rowMeans2'.
Aggregating assay 'pearson_residuals' using 'rowMeans2'.
Aggregating assay 'DE' using 'rowMeans2'.
Aggregating reducedDim 'TSNE' using 'rowMeans2'.
Aggregating reducedDim 'PCA' using 'rowMeans2'.
Aggregating reducedDim 'UMAP' using 'rowMeans2'.
Aggregating reducedDim 'linearFit' using 'rowMeans2'.
Aggregating reducedDim 'embedding' using 'rowMeans2'.
# Filter out NAs
= psce[,! is.na(psce$cell)]
psce = glmGamPoi::glm_gp(psce, design = ~ stim * cell)
glm_fit = glmGamPoi::test_de(glm_fit, contrast = cond(stim = "stim", cell = "B cells") - cond(stim = "ctrl", cell = "B cells"))
glm_de |>
glm_de arrange(pval) |>
head()
name pval adj_pval f_statistic df1 df2 lfc
1 HERC5 9.370681e-36 4.685341e-33 329.5872 1 116.4645 5.152089
2 NT5C3A 7.953371e-35 1.988343e-32 313.5509 1 116.4645 3.760837
3 PLSCR1 1.370641e-32 2.284401e-30 277.2727 1 116.4645 4.032858
4 DYNLT1 1.018124e-31 1.272655e-29 263.9920 1 116.4645 2.913860
5 IRF7 2.424490e-30 2.424490e-28 243.9083 1 116.4645 3.543921
6 SPI1 7.996678e-30 6.663898e-28 236.6273 1 116.4645 -2.440995
lemur
predictions against the observed expression of a gene:
We use the pseudobulk
function from glmGamPoi
to calculate the average expression and latent embedding position per condition and cell type. The mean embedding and colData
is then fed to the predict
function of lemur
.
# You could choose any gene
= "HERC5"
gene_oi
= glmGamPoi::pseudobulk(fit, group_by = vars(stim, cell)) reduced_fit
Aggregating assay 'counts' using 'rowSums2'.
Aggregating assay 'logcounts' using 'rowMeans2'.
Aggregating assay 'pearson_residuals' using 'rowMeans2'.
Aggregating assay 'DE' using 'rowMeans2'.
Aggregating reducedDim 'TSNE' using 'rowMeans2'.
Aggregating reducedDim 'PCA' using 'rowMeans2'.
Aggregating reducedDim 'UMAP' using 'rowMeans2'.
Aggregating reducedDim 'linearFit' using 'rowMeans2'.
Aggregating reducedDim 'embedding' using 'rowMeans2'.
= predict(fit, newdata = colData(reduced_fit), embedding = t(reducedDim(reduced_fit, "embedding")))
lemur_pred_per_cell
as_tibble(colData(reduced_fit)) |>
mutate(obs = logcounts(reduced_fit)[gene_oi,]) |>
mutate(lemur_pred = lemur_pred_per_cell[gene_oi,]) |>
ggplot(aes(x = obs, y = lemur_pred)) +
geom_abline() +
geom_point(aes(color = stim)) +
labs(title = "`lemur` predictions averaged per cell type and condition")
Neighborhood identification (Step 3) and differential expression tests (Step 4)
To help identify the most interesting changes, lemur
can aggregate the results of each gene and identify the neighborhood of cells with the most prominent coherent pattern of differential gene expression. To conduct a statistically valid differential expression test, lemur
pseudobulks the counts for each replicate (here this is the patient).
The output is a data.frame
with one row for each gene. The neighborhood
column is a list column, where each element is a vector with the IDs of the cells that are inside the neighborhood. The other columns describe the significance of the expression change defined in the test_de
call.
= find_de_neighborhoods(fit, group_by = vars(ind, stim))
nei head(nei)
name neighborhood n_cells sel_statistic pval adj_pval
1 FTL AAACATAC.... 28564 -225.6253 2.384522e-02 3.057079e-02
2 ISG15 AAACATAC.... 5432 1111.8266 2.934118e-10 2.993998e-09
3 FTH1 AAACATAC.... 24625 -454.1447 5.463812e-06 1.588317e-05
4 TIMP1 AAACATAC.... 8700 -175.6102 1.076062e-04 2.152123e-04
5 CXCL10 AAACATAC.... 8001 307.0006 1.020590e-09 7.850693e-09
6 CCL2 AAACATAC.... 6257 306.5249 1.228543e-04 2.427951e-04
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 6.106872 1 17.73896 -0.3821229 0.070805014 0.15317816 -0.7358318
2 157.595633 1 17.73896 6.6163621 0.039042089 0.09859113 -1.1874899
3 40.861722 1 17.73896 -0.8966073 0.095400965 0.18853946 0.4510703
4 24.511826 1 17.73896 -0.8958956 0.006051525 0.02208586 0.8942429
5 134.736636 1 17.73896 7.3423406 0.525641275 0.66705746 -0.7214599
6 23.906300 1 17.73896 1.6440703 0.047479744 0.11413400 1.2994391
We can sort the table either by the pval
or the did_pval
. The pval
measures the evidence of an expression change between control and stimulated condition for all cells inside the neighborhood. In contrast, the did_pval
measures how much larger the expression difference is for the cells inside the neighborhood versus outside.
slice_min(nei, pval, n = 3)
name neighborhood n_cells sel_statistic pval adj_pval
1 EIF2AK2 AAACATAC.... 27854 580.0410 9.356820e-18 4.678410e-15
2 NT5C3A AAACATAC.... 11214 390.0381 2.887243e-17 7.218107e-15
3 HERC5 AAACATAC.... 27222 497.5872 4.859467e-17 8.099112e-15
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 1204.1678 1 17.73896 3.237861 3.132737e-01 4.527077e-01 0.40752003
2 1058.5008 1 17.73896 4.590639 5.037601e-08 2.798667e-06 -1.64130028
3 997.1973 1 17.73896 4.443825 9.357180e-01 9.747277e-01 0.07604668
slice_min(nei, did_pval, n = 3)
name neighborhood n_cells sel_statistic pval adj_pval
1 SSB AAACATAC.... 5210 452.3921 6.009359e-16 5.007799e-14
2 RPL23 AAACATAC.... 7766 -174.1778 1.311449e-11 2.261119e-10
3 PARK7 AAACATAC.... 7223 -207.7361 1.755118e-10 1.867147e-09
f_statistic df1 df2 lfc did_pval did_adj_pval did_lfc
1 746.8381 1 17.73896 2.983723 2.714391e-12 1.357195e-09 -2.113198
2 230.7645 1 17.73896 -1.951510 3.761147e-10 9.402869e-08 1.733275
3 167.9954 1 17.73896 -1.928855 1.183556e-09 1.594894e-07 1.531026
= slice_min(nei, did_pval)
top_did_gene
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
labs(title = "Top differentially expressed gene, by did_pval")
As you can see in the above plot, there is strong differential expression of this gene in one cluster of cells, and essentially none in the other cells.
To see which cells lemur
included in the neighborhood for each gene, we will first define a small helper function that converts the nei
dataframe–which contains one row per gene, and the neighbourhood of that gene as a vector of cell identifiers in the neighborhood
columns–into another dataframe that has one row for each gene and for each cell, and column inside
with a logical variable that indicates whether this cell is in the neighbourhood for that gene.
Future versions of lemur
may contain this function inside, so you dont need to worry about it.
= function(x, fit) {
neighborhoods_to_long_dataframe = x[["neighborhood"]]
nei = x[["name"]]
gene_names = unique(unlist(nei))
cell_names if (!missing(fit)) cell_names = union(cell_names, colnames(fit))
stopifnot(!is.null(nei), !is.null(gene_names), !is.null(cell_names))
= matrix(FALSE, nrow = length(cell_names), ncol = length(gene_names),
res dimnames = list(cell = cell_names, name = gene_names))
for (i in seq_along(gene_names))
= TRUE
res[nei[[i]], gene_names[i]]
as.data.frame.table(res, responseName = "inside", stringsAsFactors = FALSE) |> as_tibble()
}
With this function, we can annotate for any gene if a cell is included in its lemur
neighborhood.
= neighborhoods_to_long_dataframe(top_did_gene, fit)
cells_inside_df
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
left_join(cells_inside_df, by = c("cell_name" = "cell")) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
facet_wrap(vars(inside), labeller = label_both) +
labs(title = "Top differentially expressed gene, according to did_pval")
lemur
neighborhoods different from the cluster-based differential expression tests?
The lemur
neighborhoods are adaptive to the underlying expression patterns for each gene. This means that they provide optimal power to detect differential expression as well the set of cells for which the gene expression changes most.
The plot below illustrates this for the top three DE genes by pval
which all have very different neighborhoods. Note that the neighborhoods are convex in the low-dimensional embedding space, even though in the UMAP embedding they might not look like it. This is due to the non-linear nature of the UMAP dimensionality reduction.
= neighborhoods_to_long_dataframe(slice_min(nei, pval, n = 3), fit)
cells_inside_df_top3
as_tibble(colData(sce), rownames = "cell_name") |>
mutate(umap = lemur_umap) |>
mutate(de = as_tibble(t(assay(fit, "DE")[slice_min(nei, pval, n = 3)$name,]))) |>
unpack(de, names_sep = "_") %>%
pivot_longer(starts_with("de_"), names_sep = "_", names_to = c(".value", "gene_name")) |>
left_join(cells_inside_df_top3, by = c("gene_name" = "name", "cell_name" = "cell")) |>
ggplot(aes(x = umap[, 1], y = umap[, 2])) +
geom_point(aes(color = de), size = 0.3) +
scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
coord_fixed() +
facet_grid(vars(inside), vars(gene_name), labeller = label_both) +
labs(title = "Top 3 DE gene by pval")
Outlook
lemur
can handle arbitrary design matrices as input, which means that it can also handle more complicated experiments than the simple treatment / control comparison shown here. Just as with limma
, edgeR
or DESeq2
, you can model
- multiple covariates, e.g., the treatment of interest as well as known confounders (experiment batch, sex, age) or blocking variables,
- interactions, e.g., drug treatment in wild type and mutant,
- continuous covariates, e.g., time courses, whose effects can also be modeled using smooth non-linear functions (spline regression).
For more details, see Constantin Ahlmann-Eltze and Huber (2025). For any questions, whether technical/practical or conceptual, please post to the Bioconductor forum using the lemur
tag and follow the posting guide.
Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Rome
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] lemur_1.6.1 muscData_1.22.0
[3] ExperimentHub_2.16.0 AnnotationHub_3.16.0
[5] BiocFileCache_2.16.0 dbplyr_2.5.0
[7] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
[9] Biobase_2.68.0 GenomicRanges_1.60.0
[11] GenomeInfoDb_1.44.0 IRanges_2.42.0
[13] S4Vectors_0.46.0 BiocGenerics_0.54.0
[15] generics_0.1.4 MatrixGenerics_1.20.0
[17] matrixStats_1.5.0 lubridate_1.9.4
[19] forcats_1.0.0 stringr_1.5.1
[21] dplyr_1.1.4 purrr_1.0.4
[23] readr_2.1.5 tidyr_1.3.1
[25] tibble_3.3.0 ggplot2_3.5.2
[27] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 transformGamPoi_1.14.0
[3] gridExtra_2.3 rlang_1.1.6
[5] magrittr_2.0.3 scater_1.36.0
[7] RcppAnnoy_0.0.22 compiler_4.5.1
[9] RSQLite_2.4.1 DelayedMatrixStats_1.30.0
[11] png_0.1-8 vctrs_0.6.5
[13] pkgconfig_2.0.3 crayon_1.5.3
[15] fastmap_1.2.0 XVector_0.48.0
[17] labeling_0.4.3 scuttle_1.18.0
[19] rmarkdown_2.29 tzdb_0.5.0
[21] UCSC.utils_1.4.0 ggbeeswarm_0.7.2
[23] bit_4.6.0 xfun_0.52
[25] cachem_1.1.0 beachmat_2.24.0
[27] jsonlite_2.0.0 blob_1.2.4
[29] DelayedArray_0.34.1 BiocParallel_1.42.1
[31] irlba_2.3.5.1 parallel_4.5.1
[33] R6_2.6.1 stringi_1.8.7
[35] RColorBrewer_1.1-3 Rcpp_1.1.0
[37] knitr_1.50 splines_4.5.1
[39] Matrix_1.7-3 timechange_0.3.0
[41] tidyselect_1.2.1 abind_1.4-8
[43] yaml_2.3.10 viridis_0.6.5
[45] codetools_0.2-20 curl_6.4.0
[47] lattice_0.22-7 withr_3.0.2
[49] KEGGREST_1.48.1 evaluate_1.0.4
[51] Biostrings_2.76.0 pillar_1.10.2
[53] BiocManager_1.30.26 filelock_1.0.3
[55] BiocVersion_3.21.1 hms_1.1.3
[57] sparseMatrixStats_1.20.0 scales_1.4.0
[59] RhpcBLASctl_0.23-42 glue_1.8.0
[61] tools_4.5.1 BiocNeighbors_2.2.0
[63] ScaledMatrix_1.16.0 cowplot_1.1.3
[65] grid_4.5.1 AnnotationDbi_1.70.0
[67] GenomeInfoDbData_1.2.14 beeswarm_0.4.0
[69] BiocSingular_1.24.0 vipor_0.4.7
[71] cli_3.6.5 rsvd_1.0.5
[73] rappdirs_0.3.3 viridisLite_0.4.2
[75] S4Arrays_1.8.1 uwot_0.2.3
[77] glmGamPoi_1.20.0 gtable_0.3.6
[79] digest_0.6.37 SparseArray_1.8.0
[81] ggrepel_0.9.6 htmlwidgets_1.6.4
[83] farver_2.1.2 memoise_2.0.1
[85] htmltools_0.5.8.1 lifecycle_1.0.4
[87] httr_1.4.7 mime_0.13
[89] harmony_1.2.3 bit64_4.6.0-1