This workshop will guide you through the basics of ChIP-seq analysis with hands-on exercises. You will learn how to process ChIP-seq data: perform read alignment, peak calling, quality control, visualization through the genome browser, motif finding and gene set enrichment analysis.
In this tutorial, we are going to focus on quality control.
Note: All the code in this tutorial will be performed in R. You can follow up by typing R in your terminal or in Rstudio.
For this tutorial, we are going to use the R library ChIPQC
library(ChIPQC)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# You can install R libraries with the following lines:
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("ChIPQC")
# BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
You can find the manual in here.
This tutorial assumes that you are in the scripts directory, check the path with
Let’s move to the scripts directory (or stay here if already there):
setwd("chip_seq/scripts")
We’ve already prepared the sample sheet needed by ChIPQC. You can read it like this
df<- read.table("../data/qc/qc.tsv", header=T)
df
Then, we construct the ChIPQC Experiment object
chip<- ChIPQC(df, annotation="hg38")
HIF1a_Rep1 RCC4 HIF1a Normoxia None 1 bed
HIF1a_Rep2 RCC4 HIF1a Normoxia None 2 bed
Checking chromosomes:
[1] "chr6"
Compiling annotation...
Computing metrics for 3 samples...
Executing in parallel - max 6 cores.
Let’s visualize the results
chip
Samples: 3 : HIF1a_Rep1 HIF1a_Rep2 Input_Rep1
Tissue Factor Condition Treatment Control Replicate Peaks
HIF1a_Rep1 RCC4 HIF1a Normoxia None Input_Rep1 1 22
HIF1a_Rep2 RCC4 HIF1a Normoxia None Input_Rep1 2 28
Reads Map% Filt% Dup% ReadL FragL RelCC SSD RiP%
HIF1a_Rep1 192884 100 0 0 75 242 6.73 1.44 3.12
HIF1a_Rep2 213500 100 0 0 75 241 7.95 1.29 2.05
Input_Rep1 334415 100 0 0 75 183 7.37 1.40 NA
FragL This is the estimated mean fragment length that is calculated from the data by systematically shifting the reads on each strand towards each other until the highest degree of cross-coverage is obtained.
RelCC RelCC is a metric of ChIP-enrichment. It is calculated by comparing the maximal cross coverage peak (at the shift size corresponding to the fragment length) to the cross coverage at a shift size corresponding to the read length. Generally a value of 1 or greater is recommended.
RiP% RiP stands for percentage of reads in peaks, also known as FRIP. This is again an indicator of good enrichments. ChIPs around 1% or higher usually indicate successful enrichment.
SSD SSD is another metric of ChIP-enrichment. It is looking at the density of positions with different pileup values. The higher the value the more successful the ChIP was. SSD values close to 1 are generally correlated of poorly enriched samples, while successful ChIPs can expect values around 1.5, with highly enriched samples having SSD values of 2 or higher. It is computed by looking at the standard deviation of signal pile-up along the genome normalised to the total number of reads. An enriched sample typically has regions of significant pile-up so a higher SSD is more indicative of better enrichment. SSD scores are dependent on the degree of total genome wide signal pile-up and so are sensitive to regions of high signal found with Blacklisted regions as well as genuine ChIP enrichment.
We can plot the cross-correlation profile:
plotCC(chip)
We don’t observe easily the ‘phantom-peak’, but there seems to be enrichment around the fragment length, which might indicate that these samples are of good quality.
We can plot other QCs associated with this sample.
A heatmap plot showing relative enrichment of reads around annotated genomic features.
plotRegi(chip)
Eeach peak is centered on its summit (point of highest pileup after extending the reads to the calculated fragment length), and the pileup values at bases in a window upstream and downstream of the summits is computed and averaged for all peaks in the sample. Good ChIPs will show distinctive patterns of enrichment in these peaks, while associated controls will be relatively flat.
plotPeakProfile(chip)
The correlation heatmap is based on correlation values for all the peak scores for each sample.
plotCorHeatmap(chip)
Compute a PCA based on all the peak scores for each sample.
plotPrincomp(chip)
ls ../data/HIF2a_Rep1/
HIF2a_Rep1.sorted.rmdup.bam
HIF2a_Rep1.sorted.rmdup.bam.bai
macs2
ls ../data/HIF2a_Rep2/
HIF2a_Rep2.sorted.rmdup.bam
HIF2a_Rep2.sorted.rmdup.bam.bai
macs2
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0
[2] GenomicFeatures_1.52.2
[3] AnnotationDbi_1.62.2
[4] ChIPQC_1.36.1
[5] BiocParallel_1.34.2
[6] DiffBind_3.10.1
[7] SummarizedExperiment_1.30.2
[8] Biobase_2.60.0
[9] MatrixGenerics_1.12.3
[10] matrixStats_1.0.0
[11] GenomicRanges_1.52.1
[12] GenomeInfoDb_1.36.4
[13] IRanges_2.34.1
[14] S4Vectors_0.38.2
[15] BiocGenerics_0.46.0
[16] ggplot2_3.4.3
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3
[2] jsonlite_1.8.7
[3] magrittr_2.0.3
[4] farver_2.1.1
[5] rmarkdown_2.24
[6] BiocIO_1.10.0
[7] zlibbioc_1.46.0
[8] vctrs_0.6.3
[9] memoise_2.0.1
[10] Rsamtools_2.16.0
[11] RCurl_1.98-1.12
[12] SQUAREM_2021.1
[13] mixsqp_0.3-48
[14] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[15] htmltools_0.5.6
[16] S4Arrays_1.0.6
[17] progress_1.2.2
[18] curl_5.0.2
[19] truncnorm_1.0-9
[20] sass_0.4.7
[21] KernSmooth_2.23-22
[22] bslib_0.5.1
[23] htmlwidgets_1.6.2
[24] plyr_1.8.8
[25] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
[26] cachem_1.0.8
[27] GenomicAlignments_1.36.0
[28] mime_0.12
[29] lifecycle_1.0.3
[30] pkgconfig_2.0.3
[31] Matrix_1.6-0
[32] R6_2.5.1
[33] fastmap_1.1.1
[34] GenomeInfoDbData_1.2.10
[35] digest_0.6.33
[36] systemPipeR_2.6.3
[37] numDeriv_2016.8-1.1
[38] colorspace_2.1-0
[39] chipseq_1.50.0
[40] ShortRead_1.58.0
[41] irlba_2.3.5.1
[42] RSQLite_2.3.1
[43] hwriter_1.3.2.1
[44] labeling_0.4.2
[45] invgamma_1.1
[46] filelock_1.0.2
[47] fansi_1.0.4
[48] httr_1.4.7
[49] abind_1.4-5
[50] compiler_4.3.1
[51] bit64_4.0.5
[52] withr_2.5.0
[53] DBI_1.1.3
[54] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
[55] highr_0.10
[56] gplots_3.1.3
[57] biomaRt_2.56.1
[58] MASS_7.3-60
[59] rappdirs_0.3.3
[60] DelayedArray_0.26.7
[61] rjson_0.2.21
[62] GreyListChIP_1.32.1
[63] gtools_3.9.4
[64] caTools_1.18.2
[65] tools_4.3.1
[66] glue_1.6.2
[67] restfulr_0.0.15
[68] grid_4.3.1
[69] reshape2_1.4.4
[70] generics_0.1.3
[71] gtable_0.3.3
[72] BSgenome_1.68.0
[73] hms_1.1.3
[74] xml2_1.3.5
[75] utf8_1.2.3
[76] XVector_0.40.0
[77] ggrepel_0.9.4
[78] pillar_1.9.0
[79] stringr_1.5.0
[80] emdbook_1.3.13
[81] limma_3.56.2
[82] dplyr_1.1.3
[83] BiocFileCache_2.8.0
[84] lattice_0.21-8
[85] rtracklayer_1.60.1
[86] bit_4.0.5
[87] deldir_1.0-9
[88] tidyselect_1.2.0
[89] locfit_1.5-9.8
[90] Biostrings_2.68.1
[91] amap_0.8-19
[92] knitr_1.43
[93] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
[94] xfun_0.40
[95] stringi_1.7.12
[96] yaml_2.3.7
[97] evaluate_0.21
[98] codetools_0.2-19
[99] interp_1.1-4
[100] bbmle_1.0.25
[101] TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
[102] tibble_3.2.1
[103] cli_3.6.1
[104] munsell_0.5.0
[105] jquerylib_0.1.4
[106] Rcpp_1.0.11
[107] dbplyr_2.3.4
[108] coda_0.19-4
[109] png_0.1-8
[110] bdsmatrix_1.3-6
[111] XML_3.99-0.14
[112] parallel_4.3.1
[113] blob_1.2.4
[114] prettyunits_1.1.1
[115] latticeExtra_0.6-30
[116] jpeg_0.1-10
[117] bitops_1.0-7
[118] mvtnorm_1.2-3
[119] apeglm_1.22.1
[120] scales_1.2.1
[121] Nozzle.R1_1.1-1.1
[122] crayon_1.5.2
[123] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
[124] rlang_1.1.1
[125] ashr_2.2-63
[126] KEGGREST_1.40.1
[127] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2