This workshop will guide you through the basics of ChIP-seq analysis with hands-on exercises. You will learn how to process ChIP-seq data: perform read alignment, peak calling, quality control, visualization through the genome browser, motif finding and gene set enrichment analysis.
In this tutorial, we are going to perform motif finding and gene set enrichment analysis. For both tasks, we are going to use a bed file from the data folder, which has the peak calls for HIF1a across the whole genome.
Note: The first section (Motif finding) will use the terminal. The second section will use R.
Let’s move to the scripts directory (or stay here if already there):
cd chip_seq/scripts
Let’s take a look at our bed file:
head ../data/complete/RCC4_Normoxia_HIF1a_PM14_Rep1.bed
chr1 827430 827761 test_peak_1 144 .
chr1 1019597 1019967 test_peak_2 324 .
chr1 1079402 1080118 test_peak_3 116 .
chr1 1505040 1505335 test_peak_4 121 .
chr1 1614069 1614459 test_peak_5 174 .
chr1 2208006 2208234 test_peak_6 35 .
chr1 2314901 2315258 test_peak_7 96 .
chr1 6701533 6701823 test_peak_8 64 .
chr1 6784797 6785021 test_peak_9 28 .
chr1 8202139 8202582 test_peak_10 508 .
Count the number of peaks:
wc -l ../data/complete/RCC4_Normoxia_HIF1a_PM14_Rep1.bed
2927 ../data/complete/RCC4_Normoxia_HIF1a_PM14_Rep1.bed
For motif finding, we are going to use MEME. For this tutorial, we are going to use the online interactive suite, but you can download/install meme locally by following the instructions in here. There is also a R package that connects the installation to R, you can check it here.
If the online job takes too long, you can also explore the results in here:
ls ../answers/meme/HIF1a
background
centrimo_msgs.txt
centrimo_out
combined.meme
fimo_out_1
fimo_out_2
fimo_out_3
fimo_out_4
fimo_out_5
meme-chip.html
meme_out
meme_tomtom_out
messages.txt
motif_alignment.txt
progress_log.txt
RCC4_Normoxia_HIF1a_PM14_Rep1.bed
seqs-centered
spamo1_msgs.txt
spamo2_msgs.txt
spamo3_msgs.txt
spamo4_msgs.txt
spamo5_msgs.txt
spamo_out_1
spamo_out_2
spamo_out_3
spamo_out_4
spamo_out_5
streme_out
streme_tomtom_out
summary.tsv
Open the meme chip results.
Note: The next chunks show R code.
For gene set enrichment analysis, we are going to use GREAT. For this tutorial, we are going to use the R package that allows you to use GREAT, you can check it here.
Loading library
library(rGREAT)
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
------------------
Note: On Aug 19 2019 GREAT released version 4 where it supports `hg38`
genome and removes some ontologies such pathways. `submitGreatJob()`
still takes `hg19` as default. `hg38` can be specified by the `species
= 'hg38'` argument. To use the older versions such as 3.0.0, specify as
`submitGreatJob(..., version = '3.0.0')`.
From rGREAT version 1.99.0, it also implements the GREAT algorithm and
it allows to integrate GREAT analysis with the Bioconductor annotation
ecosystem. By default it supports more than 500 organisms. Check the
new function `great()` and the new vignette.
------------------
inp_bed<- read.table("../data/complete/RCC4_Normoxia_HIF1a_PM14_Rep1.bed")
colnames(inp_bed)<- c("chr", "start", "end", "name", "score", "strand")
head(inp_bed)
gr<- makeGRangesFromDataFrame(inp_bed)
job = submitGreatJob(gr, species="hg38")
job
Submit time: 2023-10-22 21:56:36
Note the results may only be avaiable on GREAT server for 24 hours.
Version: 4.0.4
Species: hg38
Inputs: 2927 regions
Mode: Basal plus extension
Proximal: 5 kb upstream, 1 kb downstream,
plus Distal: up to 1000 kb
Include curated regulatory domains
Enrichment tables for following ontologies have been downloaded:
None
tbl = getEnrichmentTables(job)
The default enrichment table does not contain informatin of associated
genes for each input region. You can set `download_by = 'tsv'` to
download the complete table, but note only the top 500 regions can be
retreived. See the following link:
https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655401/Export#Export-GlobalExport
Except the additional gene-region association column if taking 'tsv' as
the source of result, all other columns are the same if you choose
'json' (the default) as the source. Or you can try the local GREAT
analysis with the function `great()`.
names(tbl)
[1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
GO Molecular Function
GO Biological Process
GO Cellular Component
head(tbl[["GO Biological Process"]])
head(tbl[["GO Molecular Function"]])
head(tbl[["GO Cellular Component"]])
In differential gene expression analysis, volcano plot is used to visualize relations between log2 fold change and (adjusted) p-values. Similarly, we can also use volcano plot to visualize relations between fold enrichment and (adjusted) p-values for the enrichment analysis. The plot is made by the function plotVolcano():
plotVolcano(job, ontology = "GO Biological Process")
plotRegionGeneAssociations(job)
getRegionGeneAssociations(job)
GRanges object with 2914 ranges and 2 metadata columns:
seqnames ranges strand | annotated_genes dist_to_TSS
<Rle> <IRanges> <Rle> | <CharacterList> <IntegerList>
[1] chr1 827430-827761 * | OR4F16,SAMD11 -140922,-98143
[2] chr1 1019597-1019967 * | AGRN -341
[3] chr1 1079402-1080118 * | RNF223,C1orf159 -5453,36321
[4] chr1 1505040-1505335 * | ATAD3A,ATAD3B -6988,33418
[5] chr1 1614069-1614459 * | MIB2 -1179
... ... ... ... . ... ...
[2910] chrX 149540764-149541085 * | CXorf40A 282
[2911] chrX 151974673-151974942 * | GABRE -128
[2912] chrX 153689106-153689420 * | SLC6A8,BCAP31 1164,34831
[2913] chrX 154378497-154378733 * | FLNA,EMD -3989,-582
[2914] chrX 154398747-154399024 * | RPL10 655
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
ls ../data/complete/RCC4_Normoxia_HIF2a_PM9_Rep1.bed
../data/complete/RCC4_Normoxia_HIF2a_PM9_Rep1.bed
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rGREAT_2.2.0 GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
[4] IRanges_2.34.1 S4Vectors_0.38.2 BiocGenerics_0.46.0
loaded via a namespace (and not attached):
[1] DBI_1.1.3
[2] bitops_1.0-7
[3] biomaRt_2.56.1
[4] rlang_1.1.1
[5] magrittr_2.0.3
[6] GetoptLong_1.0.5
[7] matrixStats_1.0.0
[8] compiler_4.3.1
[9] RSQLite_2.3.1
[10] GenomicFeatures_1.52.2
[11] png_0.1-8
[12] vctrs_0.6.3
[13] stringr_1.5.0
[14] pkgconfig_2.0.3
[15] shape_1.4.6
[16] crayon_1.5.2
[17] fastmap_1.1.1
[18] dbplyr_2.3.4
[19] XVector_0.40.0
[20] ellipsis_0.3.2
[21] utf8_1.2.3
[22] Rsamtools_2.16.0
[23] promises_1.2.1
[24] rmarkdown_2.24
[25] bit_4.0.5
[26] xfun_0.40
[27] zlibbioc_1.46.0
[28] cachem_1.0.8
[29] jsonlite_1.8.7
[30] progress_1.2.2
[31] blob_1.2.4
[32] highr_0.10
[33] later_1.3.1
[34] DelayedArray_0.26.7
[35] BiocParallel_1.34.2
[36] parallel_4.3.1
[37] prettyunits_1.1.1
[38] R6_2.5.1
[39] bslib_0.5.1
[40] stringi_1.7.12
[41] RColorBrewer_1.1-3
[42] rtracklayer_1.60.1
[43] jquerylib_0.1.4
[44] Rcpp_1.0.11
[45] SummarizedExperiment_1.30.2
[46] iterators_1.0.14
[47] knitr_1.43
[48] httpuv_1.6.11
[49] Matrix_1.6-0
[50] tidyselect_1.2.0
[51] abind_1.4-5
[52] yaml_2.3.7
[53] doParallel_1.0.17
[54] codetools_0.2-19
[55] curl_5.0.2
[56] lattice_0.21-8
[57] tibble_3.2.1
[58] Biobase_2.60.0
[59] shiny_1.7.5
[60] KEGGREST_1.40.1
[61] evaluate_0.21
[62] BiocFileCache_2.8.0
[63] xml2_1.3.5
[64] circlize_0.4.15
[65] Biostrings_2.68.1
[66] pillar_1.9.0
[67] filelock_1.0.2
[68] MatrixGenerics_1.12.3
[69] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[70] DT_0.30
[71] foreach_1.5.2
[72] generics_0.1.3
[73] RCurl_1.98-1.12
[74] hms_1.1.3
[75] xtable_1.8-4
[76] glue_1.6.2
[77] tools_4.3.1
[78] BiocIO_1.10.0
[79] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
[80] GenomicAlignments_1.36.0
[81] XML_3.99-0.14
[82] grid_4.3.1
[83] AnnotationDbi_1.62.2
[84] colorspace_2.1-0
[85] GenomeInfoDbData_1.2.10
[86] restfulr_0.0.15
[87] cli_3.6.1
[88] rappdirs_0.3.3
[89] fansi_1.0.4
[90] S4Arrays_1.0.6
[91] dplyr_1.1.3
[92] sass_0.4.7
[93] digest_0.6.33
[94] org.Hs.eg.db_3.17.0
[95] rjson_0.2.21
[96] htmlwidgets_1.6.2
[97] memoise_2.0.1
[98] htmltools_0.5.6
[99] lifecycle_1.0.3
[100] httr_1.4.7
[101] GlobalOptions_0.1.2
[102] GO.db_3.17.0
[103] mime_0.12
[104] bit64_4.0.5