Multi-condition single-cell RNA-seq analysis with LEMUR

Author

Constantin Ahlmann-Eltze & Wolfgang Huber

Published

July 8, 2025

What you will learn in this lab

Our objective is to compare single cell RNA-seq data from multiple, heterogeneous tissue samples across conditions. An obvious approach is to divide the constituent cells into clusters meant to represent cell types and, for each cluster separately, do “pseudo-bulk” differential expression with methods such as edgeR or DESEq2. However, such discrete categorization tends to be an unsatisfactory model of the underlying biology. Here, we use latent embedding multivariate regression (LEMUR) (Constantin Ahlmann-Eltze and Huber (2025)), a model that operates without, or before, commitment to discrete categorization. LEMUR

integrates data from different conditions,
predicts each cell’s gene expression changes as a function of the conditions and its position in latent space and
for each gene, identifies a compact neighborhood of cells with consistent differential expression.

Figure 1 from Analysis of multi-condition single-cell data with latent embedding multivariate regression by Constantin Ahlmann-Eltze and Huber (2025). The goal is to identify cells that show gene expression changes associated with experimental perturbations or study conditions. — Figure 1 from *Analysis of multi-condition single-cell data with latent embedding multivariate regression* by Constantin Ahlmann-Eltze and Huber (2025). The goal is to identify cells that show gene expression changes associated with experimental perturbations or study conditions.

Getting started

To start, we attach the tidyverse and SingleCellExperiment packages:

library("tidyverse")
library("SingleCellExperiment")
set.seed(1)

The call to set.seed, to set the initial conditions (“seed”) of R’s random number generator, is optional. Here, it ensures that the results are exactly the same for everyone. Some of the methods here use stochastic sampling, but the expectation is that even with different random seeds, the results will be essentially the same. You can verify this by omiting the call to set.seed or calling it with a different argument.

Example data

We consider a dataset by Kang et al. (2018). The authors measured the effect of interferon-$\beta$ stimulation on blood cells from eight patients. The muscData package provides the data as a SingleCellExperiment. If downloading the data with the muscData package fails, you can also directly download the file from http://oc.embl.de/index.php/s/tpbYcH5P9NfXeM5 and load it into R using sce = readRDS("PATH_TO_THE_FILE").

sce = muscData::Kang18_8vs8()

see ?muscData and browseVignettes('muscData') for documentation

loading from cache

sce

class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

Question: How many genes and cells are in the data? How do you find the metadata about each cell / gene?

You can read about the number of genes and cells when printing the summary of the sce. Alternatively, you can use nrow(sce), ncol(sce), or dim(sce) to get these values.

In a SingleCellExperiment object, the metadata (“data about data”) on the cells are accessed with colData(sce), and those on the genes with rowData(sce).

Follow-up question: why does the documentation for colData (run ?colData in the R console) talk about SummarizedExperiment objects instead of SingleCellExperiment?

First, to visually explore the data, we logarithm transform them to account for the heteroskedasticity of the counts (C. Ahlmann-Eltze and Huber (2023)), perform PCA to obtain for each cell a 50-dimensional, smoothed embedding vector, and then further call UMAP to map these vectors into a 2-dimensional representation for visualization on the screen. We use the scater package, which adds a new slot called logcounts and the two slots reducedDims(sce) named PCA and UMAP to the SummarizedExperiment object. Equivalent functionality exists in Seurat and scanpy.

sce = scater::logNormCounts(sce)
hvg = order(MatrixGenerics::rowVars(logcounts(sce)), decreasing = TRUE)
sce = sce[hvg[1:500], ]
sce = scater::runPCA(sce, ncomponents = 50)
sce = scater::runUMAP(sce, dimred = "PCA")

Question: How would you do a sctransform-like transformation (i.e., Pearson residuals) without using Seurat?

The transformGamPoi package from Bioconductor provides a residual_transform function.

assay(sce, "pearson_residuals") = transformGamPoi::residual_transform(sce, residual = "pearson", on_disk = FALSE)

Question: What is the problem with using too few components for PCA? Is there also a problem with using too many?

Too few dimensions for PCA mean that it cannot capture enough of the relevant variation in the data. This leads to a loss of subtle differences between cell states.

Too many dimensions for PCA can also pose a problem. The objective of doing PCA with a number of components that is smaller than that of the original data, is to smooth out the sampling noise in the data. If too many PCA components are included, the smoothing is too weak, and the additional dimensions capture noise that can obscure biologically relevant differences. For more details see Fig. 2d in C. Ahlmann-Eltze and Huber (2023).

Challenge: How can you use tSNE instead of UMAP? What are alternative visualization methods?

The scater package also provides a runTSNE function

sce = scater::runTSNE(sce, dimred = "PCA")

On one hand, tSNE and UMAP are de facto standards for visualizing the cell type and state space in single cell data analysis. On the other hand, they are often criticized for failing to represent global structure and distorting differences in the data. A number of alternatives have been suggested, see e.g.:

Force directed layout of the $k$ nearest neighbor (kNN) graph (igraph),
PHATE
IVIS
Unification of tSNE and UMAP using contrastive learning (Böhm, 2023)

In the below two UMAP plots, the cells’ positions strongly separate by treatment status (stim), as well as by annotated cell type (cell). One goal of this tutorial is to understand how different approaches use these covariates to integrate the data into shared embedding spaces or representations.

In this tutorial, we use direct calls to functions in the ggplot2 package to visualize our data. This is a little more verbose than calling higher-level, specialized wrapper functions (e.g., scater::plotReducedDim(sce, "UMAP", color_by = "stim")), on the other hand, you see more directly what is going on.

as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) + 
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 1: UMAP of log transformed counts

Question: The colData(sce)$ind column lists the patient identifier for each cell. Can you use this to create a separate plot for each patient?

One solution could be to use a for loop and filter the data for each patient, but there is a much simpler solution! Adding facet_wrap() to the ggplot call will automatically create a subpanel for each patient:

as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) + geom_point(aes(color = stim), size = 0.3) + coord_fixed() +
    facet_wrap(vars(ind), ncol = 4)

Question: Why do we use coord_fixed() in the above plots?

Dimension reduction methods such as UMAP put in considerable effort to approximate the distance relationships between the points by arranging them in the Euclidean 2D plane, so it would be a shame to distort this by incidental figure size settings and by distinct scalings of the $x$- and the $y$-axis.

This dataset already comes with cell type annotations. The cell type labels help interpret the results; however, for the purpose of this tutorial, we will not need them and, besides Figure 3, will ignore them.

centroids = as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  summarize(umap = matrix(colMedians(umap), nrow = 1), .by = c(stim, cell))

# Use `geom_text_repel` to avoid overlapping annotations
as_tibble(colData(sce)) |>
  mutate(umap = reducedDim(sce, "UMAP")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = paste(stim, cell)), size = 0.3) +
    ggrepel::geom_text_repel(data = centroids, aes(label = paste(stim, cell))) +
    coord_fixed()

Alternative path: Do the preprocessing with Seurat (click to see results)

The following code is based on the Seurat Guided Clustering Tutorial.

# For more information about the conversion see `?as.Seurat.CellDataSet`
seur_obj = Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj = Seurat::NormalizeData(seur_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj = Seurat::FindVariableFeatures(seur_obj, selection.method = "vst", nfeatures = 500)
# Subset to highly variable genes for memory efficiency
seur_obj = seur_obj[Seurat::VariableFeatures(object = seur_obj),]
seur_obj = Seurat::ScaleData(seur_obj)
seur_obj = Seurat::RunPCA(seur_obj, verbose = FALSE)
seur_obj = Seurat::RunUMAP(seur_obj, dims = 1:10)

Seurat::DimPlot(seur_obj, reduction = "umap", group.by = "stim")

A prelude on data integration

When we made Figure 1, we discussed that the cells’ embedding locations separate strongly both by treatment status and cell type. For many analyses, we would like to overlay the corresponding cells from the stimulated condition with the cells from the control condition. For example, for cell type assignment, we might want to annotate both conditions together, irrespective of treatment. This aspiration is called integration. The goal is a mathematical representation (e.g., a low-dimensional embedding) of the cells where the treatment status does not visibly affect the positions overall, and all the visible variance represents different cell types. Figure 4 shows an example.

There are many approaches to this end. The question is somewhat ill-defined and can be formalized into a mathematical algorithm in many different ways. Luecken et al. (2022) benchmarked several approaches. Here, we first show how to do an integration manually, then with harmony. Later we will compare both to lemur. If you are just interested in using lemur, you can skip ahead to Section 5.

Question: In this tutorial we will just look at plots to assess integration success. Why is that suboptimal? How can we do better?

A 2D embedding like UMAP or tSNE gives us a visual impression if variation that comes from “uninteresting” covariates (here: treatment condition) is eliminated and variation that comes from covariates we care about is retained, but the results are not quantitative, which means we cannot objectively assess or compare the quality of the result.

One simple metric to measure integration success is to see how mixed the cells from the two conditions are. For example, we can count for each cell how many of its $k$ nearest neighbors come from the same condition and how many come from the other condition(s), and then look at the distribution of this fraction. For more, see Luecken et al. (2022).

Follow-up challenge: Write a function to calculate the mixing metric.

Follow-up questions: Can you write an integration function that scores perfectly on this mixing metric? Hint: it can be biologically completely useless. What else would you need to measure to protect against such silly solutions?

Manual Projection

Figure 5: Schematic picture of data from two conditions. The data from the treated condition is projected onto the subspace spanned by the control condition.

In transcriptomic data analysis, each cell is characterized by its gene expression profile, which here we operationalize as the vector of counts for the 500 most variable genes. Accordingly, each cell is a point in a 500-dimensional gene expression space. Principal component analysis (PCA) finds a $P$-dimensional space ($P\le500$) such that overall, distance relationships between cells in the original space are optimally approximated by those in the reduced, $P$-dimensional space.

Although there are over 20,000 genes in the human genome, the sensitivity of the assay that was used for our dataset is such that only a much smaller number of genes are well-measured, i.e., have counts high enough to be informative. Whether we use the top 500 genes by mean or by standard deviation makes no big difference.

To make these concepts more tangible, consider the cartoon in Figure 5. The orange blob represents an optimal two-dimensional embedding of all cells from the control condition. Similarly, the blue blob represents the treated cells. Each blob is contained within a two-dimensional subspace (plane) of the ambient space, but in general, these are different. To integrate both conditions, we can map the two planes into each other, so that each point from the blue blob is mapped into the orange subspace, and vice versa.

We can implement this procedure in a few lines of R code:

We create a matrix for the cells from the control and treated conditions,
we fit PCA on the control cells,
we center the data, and finally
we project the cells from both conditions onto the subspace of the control condition

# Each column from the matrix is the coordinate of a cell in the 500-dimensional space
ctrl_mat = as.matrix(logcounts(sce)[,sce$stim == "ctrl"])
stim_mat = as.matrix(logcounts(sce)[,sce$stim == "stim"])

ctrl_centers = rowMeans(ctrl_mat)
stim_centers = rowMeans(stim_mat)

# `prcomp` is R's name for PCA and IRLBA is an algorithm to calculate it.
ctrl_pca = irlba::prcomp_irlba(t(ctrl_mat - ctrl_centers), n = 20)

# With a little bit of linear algebra, we project the points onto the subspace of the control cells
integrated_mat = matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat[,sce$stim == "ctrl"] = t(ctrl_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat[,sce$stim == "stim"] = t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)

We check how successful we were by looking at a UMAP of the integrated embedding.

int_mat_umap = uwot::umap(t(integrated_mat))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Figure 6: UMAP of log transformed counts

The overlap is not perfect, but better than in Figure 1.

Challenge: What happens if you project on the "stim" subspace instead?

Schematic picture of data from two conditions using the stimulated condition as reference.

The projection is orthogonal onto the subspace, which means it matters which condition is chosen as reference.

stim_pca = irlba::prcomp_irlba(t(stim_mat - stim_centers), n = 20, center = FALSE)

integrated_mat2 = matrix(NA, nrow = 20, ncol = ncol(sce))
integrated_mat2[,sce$stim == "ctrl"] = t(stim_pca$rotation) %*% (ctrl_mat - ctrl_centers)
integrated_mat2[,sce$stim == "stim"] = t(stim_pca$rotation) %*% (stim_mat - stim_centers)

int_mat_umap2 = uwot::umap(t(integrated_mat2))

as_tibble(colData(sce)) |>
  mutate(umap = int_mat_umap2) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

For this example, using the "stim" condition as the reference leads to a worse integration.

Brain teaser: How can you make the manual projection approach work for any complex experimental designs?

The projection approach consists of three steps:

Centering the data (e.g., ctrl_mat - ctrl_centers).
Choosing a reference condition and calculating the subspace that approximates the data from the reference condition (irlba::prcomp_irlba(t(stim_mat - stim_centers))$rotation).
Projecting the data from the other conditions onto that subspace (t(ctrl_pca$rotation) %*% (stim_mat - stim_centers)).

For arbitrary experimental designs, we can perform the centering with a linear model fit. The second step remains the same. And after calculating the centered matrix, the third step is also straight forward. Below are the outlines how such a general procedure would work.

# A complex experimental design
lm_fit = lm(t(logcounts(sce)) ~ condition + batch, data = colData(sce))
# The `residuals` function returns the coordinates minus the mean at the condition.
centered_mat = t(residuals(lm_fit))
# Assuming that `is_reference_condition` contains a selection of the cells
ref_pca = irlba::prcomp_irlba(centered_mat[,is_reference_condition], ...)
int_mat = t(ref_pca$rotation) %*% centered_mat

Harmony

Harmony is one popular tool for data integration (Korsunsky et al. 2019). Harmony is relatively fast and can handle more complex experimental designs than just a treatment/control setup. It is built around maximum diversity clustering (Figure 7). Unlike classical clustering algorithms, maximum diversity clustering not just minimizes the distance of each data point to a cluster center but also maximizes the mixing of conditions assigned to each cluster.

harm_mat = harmony::RunHarmony(reducedDim(sce, "PCA"), colData(sce), vars_use = c("stim"))
harm_umap = uwot::umap(harm_mat)

as_tibble(colData(sce)) |>
  mutate(umap = harm_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = stim), size = 0.3) +
    coord_fixed()

Challenge: How much do the results change if you change the default parameters in Harmony?

Challenge: How do you do integration in Seurat?

The integration method of Seurat is based on mutual nearest neighbors (MNN) by Haghverdi et al. (2018). However, Seurat calls the mutual nearest neighbors integration anchors. The main difference is that Seurat uses canonical correspondence analysis (CCA) instead of principal component analysis (PCA) for dimensionality reduction (Butler et al. 2018).

# This code only works with Seurat v5!
seur_obj2 = Seurat::as.Seurat(muscData::Kang18_8vs8(), data = NULL)
seur_obj2[["originalexp"]] = split(seur_obj2[["originalexp"]], f = seur_obj2$stim)
seur_obj2 = Seurat::NormalizeData(seur_obj2, normalization.method = "LogNormalize", scale.factor = 10000)
seur_obj2 = Seurat::FindVariableFeatures(seur_obj2, selection.method = "vst", nfeatures = 500)
seur_obj2 = Seurat::ScaleData(seur_obj2)
seur_obj2 = Seurat::RunPCA(seur_obj2, verbose = FALSE)

seur_obj2 = Seurat::IntegrateLayers(object = seur_obj2, method = Seurat::CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca", verbose = FALSE)
seur_obj2 = Seurat::RunUMAP(seur_obj2, dims = 1:30, reduction = "integrated.cca")
Seurat::DimPlot(seur_obj2, reduction = "umap", group.by = "stim")

Analysis with LEMUR

LEMUR is a tool to analyze multi-condition single-cell data. A typical analysis workflow with LEMUR goes through four steps (Figure 8):

Covariate-adjusted dimensionality reduction.
Prediction of the differential expression for each cell per gene.
Identification of neighborhoods of cells with similar differential expression patterns.
Pseudobulked differential expression testing per neighborhood.

These are implemented by the following code.

library("lemur")
# Step 1
fit = lemur(sce, design = ~ stim, n_embedding = 30)
fit = align_by_grouping(fit, fit$colData$cell)

# Step 2
fit = test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

# Steps 3 and 4
nei = find_de_neighborhoods(fit, group_by = vars(ind, stim))

In the next sections we discuss these steps in more detail.

LEMUR Integration (Step 1)

Tools like MNN and Harmony take a PCA embedding and remove the effects of the specified covariates. However, there is no way to go back from the integrated embedding to the original gene space. This means that we cannot ask the counterfactual what the expression of a cell from the control condition would have been, had it been treated.

lemur achieves such counterfactuals by matching the subspaces of each condition (Constantin Ahlmann-Eltze and Huber 2025). Figure 9 illustrates the principle. In some sense this is a more sophisticated version of the manual matching that we saw in the previous section. The matching is a single affine transformation in contrast to shifts per cluster inferred by Harmony.

Figure 9: Schematic picture of data from two conditions with the respective linear subspace.

lemur takes as input a SingleCellExperiment object, the specification of the experimental design, and the number of latent dimensions. To refine the embedding, we call align_by_grouping and tell it to use the provided cell type annotations.

library("lemur")
fit = lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit = align_by_grouping(fit, fit$colData$cell, verbose = FALSE)

Making a UMAP plot of the embedding by lemur shows that it successfully integrated the conditions (Figure 10).

lemur_umap = uwot::umap(t(fit$embedding))

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = stim), size = 0.3) + coord_fixed()

Figure 10: UMAP plot of LEMUR embedding.

We can cross-check our embedding with the provided cell type annotations by coloring each cell by cell type (using the information stored in the colData(sce)$cell column).

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = cell), size = 0.3) + coord_fixed()

Figure 11: Cell from the same cell types are close together after the integration

Challenge: How much do the results change if you change the default parameters in lemur?

Challenge: How to can the embedding of lemur be refined with an automated tool?

If there are no cell type labels present, lemur uses a modified version of the maximum diversity clustering of harmony to automatically infer the cell relationships across conditions.

fit2 = lemur(sce, design = ~ stim, n_embedding = 30, verbose = FALSE)
fit2 = align_harmony(fit2, verbose = FALSE)

Differential expression analysis (Step 2)

The advantage of the lemur integration is that we can predict what the expression profile of a cell from the control condition would have been, had it been stimulated, and vice versa. Contrasting those predictions tells us how much the gene expression changes for that cell in any gene.

Differential expression with an invertible integration can be inferred as the difference between the prediction for a position from one condition and the prediction at the same position in the other condition.

We call the test_de function of lemur to compare the expression values in the "stim" and "ctrl" conditions.

fit = test_de(fit, contrast = cond(stim = "stim") - cond(stim = "ctrl"))

We can now pick an individual gene (PLSCR1) and plot the predicted log fold change for each cell to show how it varies as a function of the underlying gene expression values (Figure 12).

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(expr = logcounts(fit)["PLSCR1",]) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = expr), size = 0.3) +
    scale_color_viridis_c() +
    facet_wrap(vars(stim)) +
    coord_fixed()

as_tibble(colData(sce)) |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")["PLSCR1",]) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed()

Figure 12: Expression of *PLSCR1* in control and stim condition

Figure 13: Counterfactual: `lemur` predicts *PLSCR1* differential expression between control and stimulus for each cell

This approach has the advantage over the traditional cluster/cell type-based analysis that it can detect smooth differential expression gradients.

Challenge: Calculate the differential expression for each cell type using glmGamPoi.

glmGamPoi is a differential expression tool inspired by edgeR and DESeq2 that is specifically designed for single cell RNA-seq data.

psce = glmGamPoi::pseudobulk(fit, group_by = vars(cell, ind, stim))

Aggregating assay 'counts' using 'rowSums2'.

Aggregating assay 'logcounts' using 'rowMeans2'.

Aggregating assay 'pearson_residuals' using 'rowMeans2'.

Aggregating assay 'DE' using 'rowMeans2'.

Aggregating reducedDim 'TSNE' using 'rowMeans2'.

Aggregating reducedDim 'PCA' using 'rowMeans2'.

Aggregating reducedDim 'UMAP' using 'rowMeans2'.

Aggregating reducedDim 'linearFit' using 'rowMeans2'.

Aggregating reducedDim 'embedding' using 'rowMeans2'.

# Filter out NAs
psce = psce[,! is.na(psce$cell)]
glm_fit = glmGamPoi::glm_gp(psce, design = ~ stim * cell)
glm_de = glmGamPoi::test_de(glm_fit, contrast = cond(stim = "stim", cell = "B cells") - cond(stim = "ctrl", cell = "B cells"))
glm_de |>
  arrange(pval) |>
  head()

    name         pval     adj_pval f_statistic df1      df2       lfc
1  HERC5 9.370681e-36 4.685341e-33    329.5872   1 116.4645  5.152089
2 NT5C3A 7.953371e-35 1.988343e-32    313.5509   1 116.4645  3.760837
3 PLSCR1 1.370641e-32 2.284401e-30    277.2727   1 116.4645  4.032858
4 DYNLT1 1.018124e-31 1.272655e-29    263.9920   1 116.4645  2.913860
5   IRF7 2.424490e-30 2.424490e-28    243.9083   1 116.4645  3.543921
6   SPI1 7.996678e-30 6.663898e-28    236.6273   1 116.4645 -2.440995

Challenge: For each cell type compare the lemur predictions against the observed expression of a gene:

We use the pseudobulk function from glmGamPoi to calculate the average expression and latent embedding position per condition and cell type. The mean embedding and colData is then fed to the predict function of lemur.

# You could choose any gene
gene_oi = "HERC5"

reduced_fit = glmGamPoi::pseudobulk(fit, group_by = vars(stim, cell))

Aggregating assay 'counts' using 'rowSums2'.

Aggregating assay 'logcounts' using 'rowMeans2'.

Aggregating assay 'pearson_residuals' using 'rowMeans2'.

Aggregating assay 'DE' using 'rowMeans2'.

Aggregating reducedDim 'TSNE' using 'rowMeans2'.

Aggregating reducedDim 'PCA' using 'rowMeans2'.

Aggregating reducedDim 'UMAP' using 'rowMeans2'.

Aggregating reducedDim 'linearFit' using 'rowMeans2'.

Aggregating reducedDim 'embedding' using 'rowMeans2'.

lemur_pred_per_cell = predict(fit, newdata = colData(reduced_fit), embedding = t(reducedDim(reduced_fit, "embedding")))

as_tibble(colData(reduced_fit)) |>
  mutate(obs = logcounts(reduced_fit)[gene_oi,]) |>
  mutate(lemur_pred =  lemur_pred_per_cell[gene_oi,]) |>
  ggplot(aes(x = obs, y = lemur_pred)) +
    geom_abline() +
    geom_point(aes(color = stim)) +
    labs(title = "`lemur` predictions averaged per cell type and condition")

Neighborhood identification (Step 3) and differential expression tests (Step 4)

To help identify the most interesting changes, lemur can aggregate the results of each gene and identify the neighborhood of cells with the most prominent coherent pattern of differential gene expression. To conduct a statistically valid differential expression test, lemur pseudobulks the counts for each replicate (here this is the patient).

The output is a data.frame with one row for each gene. The neighborhood column is a list column, where each element is a vector with the IDs of the cells that are inside the neighborhood. The other columns describe the significance of the expression change defined in the test_de call.

Schematic of the neighborhood inference for a gene

nei = find_de_neighborhoods(fit, group_by = vars(ind, stim))
head(nei)

    name neighborhood n_cells sel_statistic         pval     adj_pval
1    FTL AAACATAC....   28564     -225.6253 2.384522e-02 3.057079e-02
2  ISG15 AAACATAC....    5432     1111.8266 2.934118e-10 2.993998e-09
3   FTH1 AAACATAC....   24625     -454.1447 5.463812e-06 1.588317e-05
4  TIMP1 AAACATAC....    8700     -175.6102 1.076062e-04 2.152123e-04
5 CXCL10 AAACATAC....    8001      307.0006 1.020590e-09 7.850693e-09
6   CCL2 AAACATAC....    6257      306.5249 1.228543e-04 2.427951e-04
  f_statistic df1      df2        lfc    did_pval did_adj_pval    did_lfc
1    6.106872   1 17.73896 -0.3821229 0.070805014   0.15317816 -0.7358318
2  157.595633   1 17.73896  6.6163621 0.039042089   0.09859113 -1.1874899
3   40.861722   1 17.73896 -0.8966073 0.095400965   0.18853946  0.4510703
4   24.511826   1 17.73896 -0.8958956 0.006051525   0.02208586  0.8942429
5  134.736636   1 17.73896  7.3423406 0.525641275   0.66705746 -0.7214599
6   23.906300   1 17.73896  1.6440703 0.047479744   0.11413400  1.2994391

We can sort the table either by the pval or the did_pval. The pval measures the evidence of an expression change between control and stimulated condition for all cells inside the neighborhood. In contrast, the did_pval measures how much larger the expression difference is for the cells inside the neighborhood versus outside.

slice_min(nei, pval, n = 3)

     name neighborhood n_cells sel_statistic         pval     adj_pval
1 EIF2AK2 AAACATAC....   27854      580.0410 9.356820e-18 4.678410e-15
2  NT5C3A AAACATAC....   11214      390.0381 2.887243e-17 7.218107e-15
3   HERC5 AAACATAC....   27222      497.5872 4.859467e-17 8.099112e-15
  f_statistic df1      df2      lfc     did_pval did_adj_pval     did_lfc
1   1204.1678   1 17.73896 3.237861 3.132737e-01 4.527077e-01  0.40752003
2   1058.5008   1 17.73896 4.590639 5.037601e-08 2.798667e-06 -1.64130028
3    997.1973   1 17.73896 4.443825 9.357180e-01 9.747277e-01  0.07604668

slice_min(nei, did_pval, n = 3)

   name neighborhood n_cells sel_statistic         pval     adj_pval
1   SSB AAACATAC....    5210      452.3921 6.009359e-16 5.007799e-14
2 RPL23 AAACATAC....    7766     -174.1778 1.311449e-11 2.261119e-10
3 PARK7 AAACATAC....    7223     -207.7361 1.755118e-10 1.867147e-09
  f_statistic df1      df2       lfc     did_pval did_adj_pval   did_lfc
1    746.8381   1 17.73896  2.983723 2.714391e-12 1.357195e-09 -2.113198
2    230.7645   1 17.73896 -1.951510 3.761147e-10 9.402869e-08  1.733275
3    167.9954   1 17.73896 -1.928855 1.183556e-09 1.594894e-07  1.531026

top_did_gene = slice_min(nei, did_pval)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    labs(title = "Top differentially expressed gene, by did_pval")

As you can see in the above plot, there is strong differential expression of this gene in one cluster of cells, and essentially none in the other cells.

To see which cells lemur included in the neighborhood for each gene, we will first define a small helper function that converts the nei dataframe–which contains one row per gene, and the neighbourhood of that gene as a vector of cell identifiers in the neighborhood columns–into another dataframe that has one row for each gene and for each cell, and column inside with a logical variable that indicates whether this cell is in the neighbourhood for that gene.

Future versions of lemur may contain this function inside, so you dont need to worry about it.

neighborhoods_to_long_dataframe = function(x, fit) {
  nei = x[["neighborhood"]] 
  gene_names = x[["name"]]
  cell_names = unique(unlist(nei))  
  if (!missing(fit)) cell_names = union(cell_names, colnames(fit))
  stopifnot(!is.null(nei), !is.null(gene_names), !is.null(cell_names))
  
  res = matrix(FALSE, nrow = length(cell_names), ncol = length(gene_names),
               dimnames = list(cell = cell_names, name = gene_names))
  for (i in seq_along(gene_names))
    res[nei[[i]], gene_names[i]] = TRUE
  
  as.data.frame.table(res, responseName = "inside", stringsAsFactors = FALSE) |> as_tibble()
}

With this function, we can annotate for any gene if a cell is included in its lemur neighborhood.

cells_inside_df = neighborhoods_to_long_dataframe(top_did_gene, fit)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = assay(fit, "DE")[top_did_gene$name,]) |>
  left_join(cells_inside_df, by = c("cell_name" = "cell")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    facet_wrap(vars(inside), labeller = label_both) +
    labs(title = "Top differentially expressed gene, according to did_pval")

Brain teaser: How are lemur neighborhoods different from the cluster-based differential expression tests?

The lemur neighborhoods are adaptive to the underlying expression patterns for each gene. This means that they provide optimal power to detect differential expression as well the set of cells for which the gene expression changes most.

The plot below illustrates this for the top three DE genes by pval which all have very different neighborhoods. Note that the neighborhoods are convex in the low-dimensional embedding space, even though in the UMAP embedding they might not look like it. This is due to the non-linear nature of the UMAP dimensionality reduction.

cells_inside_df_top3 = neighborhoods_to_long_dataframe(slice_min(nei, pval, n = 3), fit)

as_tibble(colData(sce), rownames = "cell_name") |>
  mutate(umap = lemur_umap) |>
  mutate(de = as_tibble(t(assay(fit, "DE")[slice_min(nei, pval, n = 3)$name,]))) |>
  unpack(de, names_sep = "_") %>%
  pivot_longer(starts_with("de_"), names_sep = "_", names_to = c(".value", "gene_name")) |>
  left_join(cells_inside_df_top3, by = c("gene_name" = "name", "cell_name" = "cell")) |>
  ggplot(aes(x = umap[, 1], y = umap[, 2])) +
    geom_point(aes(color = de), size = 0.3) +
    scale_color_gradient2(low = scales::muted("blue"), high = scales::muted("red")) +
    coord_fixed() +
    facet_grid(vars(inside), vars(gene_name), labeller = label_both) +
    labs(title = "Top 3 DE gene by pval")

Outlook

lemur can handle arbitrary design matrices as input, which means that it can also handle more complicated experiments than the simple treatment / control comparison shown here. Just as with limma, edgeR or DESeq2, you can model

multiple covariates, e.g., the treatment of interest as well as known confounders (experiment batch, sex, age) or blocking variables,
interactions, e.g., drug treatment in wild type and mutant,
continuous covariates, e.g., time courses, whose effects can also be modeled using smooth non-linear functions (spline regression).

For more details, see Constantin Ahlmann-Eltze and Huber (2025). For any questions, whether technical/practical or conceptual, please post to the Bioconductor forum using the lemur tag and follow the posting guide.

Session Info

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lemur_1.6.1                 muscData_1.22.0            
 [3] ExperimentHub_2.16.0        AnnotationHub_3.16.0       
 [5] BiocFileCache_2.16.0        dbplyr_2.5.0               
 [7] SingleCellExperiment_1.30.1 SummarizedExperiment_1.38.1
 [9] Biobase_2.68.0              GenomicRanges_1.60.0       
[11] GenomeInfoDb_1.44.0         IRanges_2.42.0             
[13] S4Vectors_0.46.0            BiocGenerics_0.54.0        
[15] generics_0.1.4              MatrixGenerics_1.20.0      
[17] matrixStats_1.5.0           lubridate_1.9.4            
[19] forcats_1.0.0               stringr_1.5.1              
[21] dplyr_1.1.4                 purrr_1.0.4                
[23] readr_2.1.5                 tidyr_1.3.1                
[25] tibble_3.3.0                ggplot2_3.5.2              
[27] tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 transformGamPoi_1.14.0   
 [3] gridExtra_2.3             rlang_1.1.6              
 [5] magrittr_2.0.3            scater_1.36.0            
 [7] RcppAnnoy_0.0.22          compiler_4.5.1           
 [9] RSQLite_2.4.1             DelayedMatrixStats_1.30.0
[11] png_0.1-8                 vctrs_0.6.5              
[13] pkgconfig_2.0.3           crayon_1.5.3             
[15] fastmap_1.2.0             XVector_0.48.0           
[17] labeling_0.4.3            scuttle_1.18.0           
[19] rmarkdown_2.29            tzdb_0.5.0               
[21] UCSC.utils_1.4.0          ggbeeswarm_0.7.2         
[23] bit_4.6.0                 xfun_0.52                
[25] cachem_1.1.0              beachmat_2.24.0          
[27] jsonlite_2.0.0            blob_1.2.4               
[29] DelayedArray_0.34.1       BiocParallel_1.42.1      
[31] irlba_2.3.5.1             parallel_4.5.1           
[33] R6_2.6.1                  stringi_1.8.7            
[35] RColorBrewer_1.1-3        Rcpp_1.1.0               
[37] knitr_1.50                splines_4.5.1            
[39] Matrix_1.7-3              timechange_0.3.0         
[41] tidyselect_1.2.1          abind_1.4-8              
[43] yaml_2.3.10               viridis_0.6.5            
[45] codetools_0.2-20          curl_6.4.0               
[47] lattice_0.22-7            withr_3.0.2              
[49] KEGGREST_1.48.1           evaluate_1.0.4           
[51] Biostrings_2.76.0         pillar_1.10.2            
[53] BiocManager_1.30.26       filelock_1.0.3           
[55] BiocVersion_3.21.1        hms_1.1.3                
[57] sparseMatrixStats_1.20.0  scales_1.4.0             
[59] RhpcBLASctl_0.23-42       glue_1.8.0               
[61] tools_4.5.1               BiocNeighbors_2.2.0      
[63] ScaledMatrix_1.16.0       cowplot_1.1.3            
[65] grid_4.5.1                AnnotationDbi_1.70.0     
[67] GenomeInfoDbData_1.2.14   beeswarm_0.4.0           
[69] BiocSingular_1.24.0       vipor_0.4.7              
[71] cli_3.6.5                 rsvd_1.0.5               
[73] rappdirs_0.3.3            viridisLite_0.4.2        
[75] S4Arrays_1.8.1            uwot_0.2.3               
[77] glmGamPoi_1.20.0          gtable_0.3.6             
[79] digest_0.6.37             SparseArray_1.8.0        
[81] ggrepel_0.9.6             htmlwidgets_1.6.4        
[83] farver_2.1.2              memoise_2.0.1            
[85] htmltools_0.5.8.1         lifecycle_1.0.4          
[87] httr_1.4.7                mime_0.13                
[89] harmony_1.2.3             bit64_4.6.0-1

References

Ahlmann-Eltze, C., and Wolfgang Huber. 2023. “Comparison of Transformations for Single-Cell RNA-Seq Data.” Nature Methods, February, 665–72. https://doi.org/10.1038/s41592-023-01814-1.

Ahlmann-Eltze, Constantin, and Wolfgang Huber. 2025. “Analysis of Multi-Condition Single-Cell Data with Latent Embedding Multivariate Regression.” Nature Genetics. https://doi.org/10.1038/s41588-024-01996-0.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36 (5): 411–20. https://doi.org/10.1038/nbt.4096.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature Biotechnology 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.

Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96. https://doi.org/10.1038/s41592-019-0619-0.

Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2022. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.