This session will guide you to perform gene set enrichment analysis (GSEA) and over representation analysis (ORA).
The objectives of this session are:
Note: All the code in this tutorial will be performed in R (or Rstudio).
This workshop will use processed RNA-seq counts from this study, where they studied the transcriptional response to SARS-CoV-2 infection in three tissues (cornea, limbus, sclera). The dataset consists of 18 samples, with three replicates per tissue and condition (mock vs CoV-2).
We will also use the DESeq2 results from the previous step. If you do not have them, a copy is stored in the data folder.
The data can be found here:
ls ../data/
c2.cp.kegg_legacy.v2025.1.Hs.symbols.gmt
GSE164073_Eye_count_matrix.tsv
GSE164073_Eye_count_meta.tsv
h.all.v2025.1.Hs.symbols.gmt
res_Cornea_CoV2_vs_Mock.rds
Finally, we will also be using annotated gene sets from MsigDB.
library(DESeq2)
library(ggplot2)
library(fgsea)
library(knitr) # Used for formatting output
library(pheatmap)
library(GSA)
library(RColorBrewer)
set.seed(43) # For reproducibility
Let’s load the DESeq2 results and the Hallmark pathway gene set from MsigDB.
res<- readRDS("../data/res_Cornea_CoV2_vs_Mock.rds")
geneset_file <- GSA::GSA.read.gmt("../data/h.all.v2025.1.Hs.symbols.gmt")
geneSets <- geneset_file$genesets
names(geneSets)<- geneset_file$geneset.names
The core of the GSEA analysis is done by the fgsea function. To run it, we need to first obtain a vector to rank our genes.
head(res)
log2 fold change (MLE): group.cornea_CoV2 vs group.cornea_mock
Wald test p-value: group.cornea_CoV2 vs group.cornea_mock
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
A1BG 90.4475 -0.3234786 0.1858483 -1.740552 0.081762169 0.2592268
A1BG-AS1 278.3007 -0.0626454 0.1219786 -0.513577 0.607547777 0.8036976
A2M 873.3212 -0.4794032 0.1900402 -2.522642 0.011647706 0.0687980
A2M-AS1 11.3404 0.5897878 0.4671506 1.262522 0.206761050 0.4497418
A4GALT 775.8491 0.4498681 0.1159353 3.880337 0.000104312 0.0015732
AAAS 741.3456 -0.1308305 0.0878288 -1.489609 0.136327113 0.3482062
geneRanks<- res$stat
names(geneRanks)<- rownames(res)
head(geneRanks)
A1BG A1BG-AS1 A2M A2M-AS1 A4GALT AAAS
-1.740552 -0.513577 -2.522642 1.262522 3.880337 -1.489609
For GSEA, we will use the Wald test statistics stored in the stat column from the results, which are effectively the log2FoldChange/SE.
fgseaRes <- fgsea(pathways = geneSets,
stats = geneRanks,
eps = 0, # For better p-value estimation
minSize = 15,
maxSize = 500)
Let’s visualize the results.
head(fgseaRes[order(-NES,pval)])
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(geneSets[topPathways], geneRanks, fgseaRes,
gseaParam=0.5)
We can visualize also as a simple barplot.
df_gsea<- as.data.frame(fgseaRes)
df_gsea<- df_gsea[with(df_gsea, pathway %in% topPathways),]
df_gsea$pathway<- factor(df_gsea$pathway, levels=rev(topPathways)) # convert to factor and order the levels for plotting
ggplot(df_gsea, aes(x=pathway, y=NES, fill=-log10(padj)) )+
geom_bar(stat="identity") +
scale_fill_continuous(low="#ffdae0", high="red") +
coord_flip()
We observe enrichment for TNFa signaling, Interferon gamma response and EMT pathways in CoV2 condition, while proliferation pathway enrichment in the mock.
We can also visualize the enrichment of a specific pathway
plotEnrichment(geneSets[["HALLMARK_E2F_TARGETS"]],
geneRanks) + labs(title="E2F targets")
We observe very strong enrichment for the E2F targets gene set in the mock.
We can also visualize the leading-edge genes through a heatmap. Let’s first reload the counts and obtain the vst using DESeq2.
raw_counts<- read.table("../data/GSE164073_Eye_count_matrix.tsv", header=T, row.names=1)
meta<- read.table("../data/GSE164073_Eye_count_meta.tsv", row.names=1)
meta$Condition<- factor(meta$Condition, levels=c("mock", "CoV2"))
meta_design<- model.matrix( ~ Condition, data=meta)
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
colData = meta,
design= meta_design)
smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]
vsd <- vst(dds)
Now, we can visualize the leading edge genes for a particular pathway, in this case, the E2F targets.
leadingEdges<- unlist(df_gsea[with(df_gsea, pathway == "HALLMARK_E2F_TARGETS"),"leadingEdge"])
head(leadingEdges)
[1] "RRM2" "CDK1" "HMMR" "TOP2A" "DLGAP5" "RACGAP1"
RRM2
CDK1
HMMR
TOP2A
DLGAP5
RACGAP1
# Let's filter for only cornea samples
meta_subset<- meta[with(meta, Tissue == "cornea"),]
colors <- colorRampPalette(c("darkblue","white","darkred"))(255)
pheatmap(assay(vsd[head(leadingEdges,15),rownames(meta_subset)]), annotation_col=meta_subset,col=colors, scale="row")
Note: We will be running enrichR, which requires to establish a connection to the Enrichr website.
library(enrichR)
Welcome to enrichR
Checking connections ...
Enrichr ... Connection is Live!
FlyEnrichr ... Connection is Live!
WormEnrichr ... Connection is Live!
YeastEnrichr ... Connection is Live!
FishEnrichr ... Connection is Live!
OxEnrichr ... Connection is Live!
For running the next steps, please make sure that your connection is live:
websiteLive <- getOption("enrichR.live")
websiteLive
[1] TRUE
You can check the databases available for GSEA:
dbs <- listEnrichrDbs()
head(dbs)
Let’s do an enrichment analysis of genes upregulated in CoV2 using the Hallmark gene set.
sig_genes<- rownames(res[with(res, abs(log2FoldChange > 1) & padj < 0.05 & !is.na(padj)),])
# Defining a background as the set of genes tested
background<- rownames(res)
enriched <- enrichr(genes=sig_genes, databases=c("MSigDB_Hallmark_2020"), background=background)
Uploading data to Speedrichr...
- Your gene set... Done.
- Your background... Done.
Getting enrichment results...
- MSigDB_Hallmark_2020... Done.
Parsing results... Done.
head(enriched[["MSigDB_Hallmark_2020"]])
We can plot a barplot of the results
plotEnrich(enriched[["MSigDB_Hallmark_2020"]], showTerms = 20, numChar = 40, y = "Count", orderBy = "P.value")
Similar as GSEA, we also observe enrichment for TNF alpha Signaling via NFKB and EMT pathways in the CoV2 condition.
This notebook used material adapted from the following sources:
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] enrichR_3.4 RColorBrewer_1.1-3
[3] GSA_1.03.3 pheatmap_1.0.13
[5] knitr_1.48 fgsea_1.32.4
[7] ggplot2_3.5.1 DESeq2_1.46.0
[9] SummarizedExperiment_1.36.0 Biobase_2.66.0
[11] MatrixGenerics_1.18.1 matrixStats_1.4.1
[13] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1
[15] IRanges_2.40.0 S4Vectors_0.44.0
[17] BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] fastmatch_1.1-6 gtable_0.3.6 rjson_0.2.23
[4] xfun_0.48 bslib_0.8.0 lattice_0.22-5
[7] vctrs_0.6.5 tools_4.4.3 generics_0.1.3
[10] curl_5.2.3 parallel_4.4.3 tibble_3.2.1
[13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
[16] Matrix_1.7-2 data.table_1.15.4 lifecycle_1.0.4
[19] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.3
[22] munsell_0.5.1 codetools_0.2-19 htmltools_0.5.8.1
[25] sass_0.4.9 yaml_2.3.10 pillar_1.9.0
[28] crayon_1.5.3 jquerylib_0.1.4 BiocParallel_1.40.2
[31] DelayedArray_0.32.0 cachem_1.1.0 abind_1.4-8
[34] mime_0.12 tidyselect_1.2.1 locfit_1.5-9.12
[37] digest_0.6.37 dplyr_1.1.4 WriteXLS_6.7.0
[40] labeling_0.4.3 cowplot_1.1.3 fastmap_1.2.0
[43] grid_4.4.3 colorspace_2.1-1 cli_3.6.3
[46] SparseArray_1.6.2 magrittr_2.0.3 S4Arrays_1.6.0
[49] utf8_1.2.4 withr_3.0.1 scales_1.3.0
[52] UCSC.utils_1.2.0 rmarkdown_2.28 XVector_0.46.0
[55] httr_1.4.7 evaluate_1.0.1 rlang_1.1.4
[58] Rcpp_1.0.13 glue_1.8.0 jsonlite_1.8.9
[61] R6_2.5.1 zlibbioc_1.52.0