This session will guide you to use DESeq2 to identify differentially expressed genes and become familiar with design matrices. The session will also introduce concepts essential for working with real-world data, such as incorporating covariates and interaction terms into models.
The objectives of this session are:
Note: All the code in this tutorial will be performed in R (or Rstudio).
This workshop will use processed RNA-seq counts from this study, where they studied the transcriptional response to SARS-CoV-2 infection in three tissues (cornea, limbus, sclera). The dataset consists of 18 samples, with three replicates per tissue and condition (mock vs CoV-2).
The data can be found here:
ls ../data/
c2.cp.kegg_legacy.v2025.1.Hs.symbols.gmt
GSE164073_Eye_count_matrix.tsv
GSE164073_Eye_count_meta.tsv
h.all.v2025.1.Hs.symbols.gmt
res_Cornea_CoV2_vs_Mock.rds
library(DESeq2)
library(vsn) # Used for standard deviation vs mean plots
library(hexbin) # Used for standard deviation vs mean plots
library(pheatmap)
library(RColorBrewer)
library(EnhancedVolcano)
library(ggplot2)
library(ggbeeswarm)
library(ashr) # Used for log-fold shrinkage
library(knitr) # Used for formatting output
Let’s load the dataset that consists of a raw count matrix and a metadata.
raw_counts<- read.table("../data/GSE164073_Eye_count_matrix.tsv", header=T, row.names=1)
meta<- read.table("../data/GSE164073_Eye_count_meta.tsv", row.names=1)
kable(head(raw_counts[,1:5]))
MW1 | MW2 | MW3 | MW4 | MW5 | |
---|---|---|---|---|---|
A1BG | 91 | 131 | 86 | 77 | 69 |
A1BG-AS1 | 292 | 284 | 271 | 232 | 250 |
A1CF | 0 | 0 | 0 | 0 | 0 |
A2M | 255 | 273 | 263 | 150 | 176 |
A2M-AS1 | 8 | 9 | 9 | 11 | 16 |
A2ML1 | 0 | 0 | 0 | 0 | 0 |
kable(head(meta))
Tissue | Condition | |
---|---|---|
MW1 | cornea | mock |
MW2 | cornea | mock |
MW3 | cornea | mock |
MW4 | cornea | CoV2 |
MW5 | cornea | CoV2 |
MW6 | cornea | CoV2 |
The first step is to create a design matrix. Let’s start by creating a very simple design matrix.
# Converting Tissue into a factor
meta$Tissue<- factor(meta$Tissue)
# Adjusting levels for Condition, so that mock is the reference
meta$Condition<- factor(meta$Condition, levels=c("mock", "CoV2"))
meta_design<- model.matrix( ~ Condition, data=meta)
kable(head(meta_design))
(Intercept) | ConditionCoV2 | |
---|---|---|
MW1 | 1 | 0 |
MW2 | 1 | 0 |
MW3 | 1 | 0 |
MW4 | 1 | 1 |
MW5 | 1 | 1 |
MW6 | 1 | 1 |
This design matrix is answering the question: What are the overall differences between Mock and CoV2, without considering the tissues?
Now, let’s create a DESeq2 object.
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
colData = meta,
design= meta_design)
dds
class: DESeqDataSet
dim: 27946 18
metadata(1): version
assays(1): counts
rownames(27946): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(18): MW1 MW2 ... MW17 MW18
colData names(2): Tissue Condition
Removing low expressed counts is generally recommended for memory and speed reasons.
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]
dds
class: DESeqDataSet
dim: 14700 18
metadata(1): version
assays(1): counts
rownames(14700): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(0):
colnames(18): MW1 MW2 ... MW17 MW18
colData names(2): Tissue Condition
We can obtain a variance stabilizing transformation
# Obtain the variance stabilizing transformation
vsd <- vst(dds)
# Let's compare it to the normal transformation
ntd <- normTransform(dds)
meanSdPlot(assay(ntd), ranks=F) # Normal
meanSdPlot(assay(vsd), ranks=F) # VST
We observe that the vst removes the dependence of the variance on the mean, which is evident for the low expressed genes.
We can obtain distances between samples and then plot them as a heatmap.
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$Tissue, vsd$Condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
We observe that in general, samples from the same tissue cluster together, as expected.
We can also perform a PCA analysis of the VST counts.
plotPCA(vsd, intgroup=c("Tissue", "Condition")) +
geom_point(size=0.5)
using ntop=500 top features by variance
Again, we observe that the first PCA is mostly associated with tissue, although samples can also subcluster between the mock and CoV2 conditions.
The standard DE is performed by the DESeq function.
dds <- DESeq(dds)
using supplied model matrix
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
dds
class: DESeqDataSet
dim: 14700 18
metadata(1): version
assays(5): counts mu H cooks originalCounts
rownames(14700): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(23): baseMean baseVar ... maxCooks replace
colnames(18): MW1 MW2 ... MW17 MW18
colData names(4): Tissue Condition sizeFactor replaceable
resultsNames(dds)
[1] "Intercept" "ConditionCoV2"
Intercept
ConditionCoV2
res <- results(dds, name="ConditionCoV2") # This function provides the lof2Fold Changes and p-values for a given coefficient name
res
log2 fold change (MLE): ConditionCoV2
Wald test p-value: ConditionCoV2
DataFrame with 14700 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
A1BG 90.4475 -0.1439495 0.118021 -1.219695 0.222581 0.803126
A1BG-AS1 278.3007 -0.0480946 0.112076 -0.429126 0.667832 0.963881
A2M 873.3212 -0.8092580 0.741401 -1.091525 0.275042 0.833286
A2M-AS1 11.3404 0.4892287 0.256631 1.906348 0.056605 0.531853
A4GALT 775.8491 0.1133252 0.223626 0.506761 0.612322 0.951852
... ... ... ... ... ... ...
ZYG11A 17.6651 -0.3825263 0.3813743 -1.003021 0.315851 0.855794
ZYG11B 1385.1616 0.0993295 0.1068925 0.929246 0.352761 0.874910
ZYX 6722.2007 -0.2497730 0.1098916 -2.272903 0.023032 0.359851
ZZEF1 1364.0281 0.1296682 0.0923057 1.404769 0.160090 0.741104
ZZZ3 1187.3067 0.1200589 0.0744052 1.613582 0.106618 0.661463
Then, a results table can be generated. In this case, we provide a value in the ‘name’ argument in the results function, which correspondds to the CoV2 effect encoded in our design matrix. We will also explore other ways to access results in the follow up sections.
Shrinkage of the log-fold changes estimates is recommended for visualization and ranking of genes.
resLFC <- lfcShrink(dds, coef="ConditionCoV2", type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
The lfcShrink has various shrinkage methods, here we will use apeglm. Again, we look for the CoV2 effect model by setting up the coef argument in the lfcShrink function.
Let’s compare the MA plots between the shrunken FC and the regular FC.
plotMA(res)
plotMA(resLFC)
You can see that the shrunken log2 fold changes remove the noise associated with log2 fold changes from low count genes.
EnhancedVolcano(resLFC,
lab = rownames(resLFC),
x = 'log2FoldChange',
y = 'pvalue')
We can explore also the results using a Volcano Plot.
geneCounts <- plotCounts(dds, gene = "SOD2", intgroup = c("Condition","Tissue"),returnData = T)
ggplot(geneCounts, aes(x = Condition, y = log2(count), color = Tissue)) + geom_beeswarm(cex = 3)
We can also plot the normalized count values for a given gene.
sig_genes<- rownames(resLFC[with(resLFC, abs(log2FoldChange > 1) & padj < 0.05),])
colors <- colorRampPalette(c("darkblue","white","darkred"))(255)
pheatmap(assay(vsd[sig_genes,]), annotation_col=meta,col=colors, scale="row")
Finally, we can generate a heatmap for a subset of genes with significant DE changes (padj < 0.05 and log2FoldChange >1).
So far, we haven’t considered the different tissues into our model. One way to study the tissue-specific effects is through a mean group model.
meta$group<- factor(paste0(".", meta$Tissue, "_", meta$Condition))
meta_design<- model.matrix( ~ 0 + group, data=meta) # We remove the intercept in this model
kable(head(meta_design))
group.cornea_CoV2 | group.cornea_mock | group.limbus_CoV2 | group.limbus_mock | group.sclera_CoV2 | group.sclera_mock | |
---|---|---|---|---|---|---|
MW1 | 0 | 1 | 0 | 0 | 0 | 0 |
MW2 | 0 | 1 | 0 | 0 | 0 | 0 |
MW3 | 0 | 1 | 0 | 0 | 0 | 0 |
MW4 | 1 | 0 | 0 | 0 | 0 | 0 |
MW5 | 1 | 0 | 0 | 0 | 0 | 0 |
MW6 | 1 | 0 | 0 | 0 | 0 | 0 |
This design is asking the question: What is the mean expression level in each tissue-condition group? Then, via contrasts we can specify comparisons, like, what is the difference in expression between CoV2 and mock in Cornea
design(dds)<- meta_design
dds <- DESeq(dds)
using supplied model matrix
using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
dds
class: DESeqDataSet
dim: 14700 18
metadata(1): version
assays(5): counts mu H cooks originalCounts
rownames(14700): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(38): baseMean baseVar ... deviance maxCooks
colnames(18): MW1 MW2 ... MW17 MW18
colData names(4): Tissue Condition sizeFactor replaceable
resultsNames(dds)
[1] "group.cornea_CoV2" "group.cornea_mock" "group.limbus_CoV2"
[4] "group.limbus_mock" "group.sclera_CoV2" "group.sclera_mock"
group.cornea_CoV2
group.cornea_mock
group.limbus_CoV2
group.limbus_mock
group.sclera_CoV2
group.sclera_mock
We need to specify the contrast, asking what is the difference between CoV2 and mock in cornea. There are multiple ways to establish a contrast, one of them is to supply the contrast argument a list of two character vectors, a list of 2 character vectors: the names of the fold changes for the numerator, and the names of the fold changes for the denominator.
res<- results(dds, contrast=list("group.cornea_CoV2", "group.cornea_mock")) # Let's obtain the results, as these contain the wald st
resLFC <- lfcShrink(dds, contrast=list("group.cornea_CoV2", "group.cornea_mock"), type="ashr")
using 'ashr' for LFC shrinkage. If used in published research, please cite:
Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
https://doi.org/10.1093/biostatistics/kxw041
EnhancedVolcano(resLFC,
lab = rownames(resLFC),
x = 'log2FoldChange',
y = 'pvalue')
Another way is to use a a numeric contrast vector with one element for each element in ‘resultsNames(object)’ (most general case).
resultsNames(dds)
[1] "group.cornea_CoV2" "group.cornea_mock" "group.limbus_CoV2"
[4] "group.limbus_mock" "group.sclera_CoV2" "group.sclera_mock"
group.cornea_CoV2
group.cornea_mock
group.limbus_CoV2
group.limbus_mock
group.sclera_CoV2
group.sclera_mock
resLFC_2 <- lfcShrink(dds, contrast=c(1,-1,0,0,0,0), type="ashr")
using 'ashr' for LFC shrinkage. If used in published research, please cite:
Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
https://doi.org/10.1093/biostatistics/kxw041
identical(resLFC$log2FoldChange, resLFC_2$log2FoldChange)
[1] TRUE
We observe that the two approaches give us identical results.
For the next section (gene set enrichment analysis), let’s save the results from this model.
saveRDS(res, file="../results/res_Cornea_CoV2_vs_Mock.rds")
This notebook used material adapted from the following sources:
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] knitr_1.48 ashr_2.2-63
[3] ggbeeswarm_0.7.2 EnhancedVolcano_1.24.0
[5] ggrepel_0.9.6 ggplot2_3.5.1
[7] RColorBrewer_1.1-3 pheatmap_1.0.13
[9] hexbin_1.28.5 vsn_3.74.0
[11] DESeq2_1.46.0 SummarizedExperiment_1.36.0
[13] Biobase_2.66.0 MatrixGenerics_1.18.1
[15] matrixStats_1.4.1 GenomicRanges_1.58.0
[17] GenomeInfoDb_1.42.1 IRanges_2.40.0
[19] S4Vectors_0.44.0 BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4 vipor_0.4.7
[4] farver_2.1.2 fastmap_1.2.0 apeglm_1.28.0
[7] digest_0.6.37 mime_0.12 lifecycle_1.0.4
[10] statmod_1.5.0 invgamma_1.2 magrittr_2.0.3
[13] compiler_4.4.3 rlang_1.1.4 sass_0.4.9
[16] tools_4.4.3 utf8_1.2.4 yaml_2.3.10
[19] labeling_0.4.3 S4Arrays_1.6.0 DelayedArray_0.32.0
[22] plyr_1.8.9 abind_1.4-8 BiocParallel_1.40.2
[25] numDeriv_2016.8-1.1 withr_3.0.1 grid_4.4.3
[28] preprocessCore_1.68.0 fansi_1.0.6 colorspace_2.1-1
[31] MASS_7.3-64 scales_1.3.0 mvtnorm_1.3-3
[34] bbmle_1.0.25.1 cli_3.6.3 rmarkdown_2.28
[37] crayon_1.5.3 generics_0.1.3 httr_1.4.7
[40] bdsmatrix_1.3-7 cachem_1.1.0 affy_1.84.0
[43] zlibbioc_1.52.0 parallel_4.4.3 BiocManager_1.30.25
[46] XVector_0.46.0 vctrs_0.6.5 Matrix_1.7-2
[49] jsonlite_1.8.9 mixsqp_0.3-54 irlba_2.3.5.1
[52] beeswarm_0.4.0 locfit_1.5-9.12 limma_3.62.2
[55] jquerylib_0.1.4 affyio_1.76.0 glue_1.8.0
[58] emdbook_1.3.14 codetools_0.2-19 gtable_0.3.6
[61] UCSC.utils_1.2.0 munsell_0.5.1 tibble_3.2.1
[64] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
[67] truncnorm_1.0-9 R6_2.5.1 evaluate_1.0.1
[70] lattice_0.22-5 highr_0.11 SQUAREM_2021.1
[73] bslib_0.8.0 Rcpp_1.0.13 coda_0.19-4.1
[76] SparseArray_1.6.2 xfun_0.48 pkgconfig_2.0.3