This workshop will guide you through the basics of ChIP-seq analysis with hands-on exercises. You will learn how to process ChIP-seq data: perform read alignment, peak calling, quality control, visualization through the genome browser, motif finding and gene set enrichment analysis.
Note: All the code in this tutorial will be performed in the terminal.
One of the most common useful ways to visualize our data is through the UCSC genome browser. We can visualize the peaks that we’ve generated with macs and see directly the regions of the genome where our protein of interest is binding.
Let’s move to the scripts directory (or stay here if already there):
cd chip_seq/scripts
To do this, we need to upload the .bed files that we got from macs2
head ../results/macs2/HIF1a_rep1_peaks.bed
chr6 41734187 41734560 HIF1a_rep1_peak_1 181 .
chr6 41787043 41787581 HIF1a_rep1_peak_2 160 .
chr6 42452936 42453274 HIF1a_rep1_peak_3 194 .
chr6 42929347 42929669 HIF1a_rep1_peak_4 135 .
chr6 43013743 43014397 HIF1a_rep1_peak_5 100 .
chr6 43181975 43182279 HIF1a_rep1_peak_6 109 .
chr6 43667941 43668609 HIF1a_rep1_peak_7 4365 .
chr6 43725079 43725412 HIF1a_rep1_peak_8 102 .
chr6 43768993 43770080 HIF1a_rep1_peak_9 968 .
chr6 43798896 43799239 HIF1a_rep1_peak_10 322 .
In the genome browser, visualize the coordinates of a given peak in your .bed file.
Remember that in the UCSC genome browser you have to specify the positions in this format: chr:start-end.
One of the most common/useful formats to visualize the data is the bigwig format. To generate this, we are going to use deeptools.deepTools is a suite of python tools particularly developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq. It offers a lot of tools for processing BAM and bigwig files.
We are going to convert our .BAM file ( from our alignment step) into a .bigwig file
ls ../results/HIF1a_Rep1.sorted.rmdup.bam
../results/HIF1a_Rep1.sorted.rmdup.bam
To convert, we are going to use the bamCoverage command
bamCoverage --help
usage: bamCoverage -b reads.bam -o coverage.bw
help: bamCoverage -h / bamCoverage --help
This tool takes an alignment of reads or fragments as input (BAM file) and
generates a coverage track (bigWig or bedGraph) as output. The coverage is
calculated as the number of reads per bin, where bins are short consecutive
counting windows of a defined size. It is possible to extended the length of
the reads to better reflect the actual fragment length. *bamCoverage* offers
normalization by scaling factor, Reads Per Kilobase per Million mapped reads
(RPKM), counts per million (CPM), bins per million mapped reads (BPM) and 1x
depth (reads per genome coverage, RPGC).
Required arguments:
--bam BAM file, -b BAM file
BAM file to process (default: None)
Output:
--outFileName FILENAME, -o FILENAME
Output file name. (default: None)
--outFileFormat {bigwig,bedgraph}, -of {bigwig,bedgraph}
Output file type. Either "bigwig" or "bedgraph".
(default: bigwig)
Optional arguments:
--help, -h show this help message and exit
--scaleFactor SCALEFACTOR
The computed scaling factor (or 1, if not applicable)
will be multiplied by this. (Default: 1.0)
--MNase Determine nucleosome positions from MNase-seq data.
Only 3 nucleotides at the center of each fragment are
counted. The fragment ends are defined by the two mate
reads. Only fragment lengthsbetween 130 - 200 bp are
considered to avoid dinucleosomes or other artifacts.
By default, any fragments smaller or larger than this
are ignored. To over-ride this, use the
--minFragmentLength and --maxFragmentLength options,
which will default to 130 and 200 if not otherwise
specified in the presence of --MNase. *NOTE*: Requires
paired-end data. A bin size of 1 is recommended.
(default: False)
--Offset INT [INT ...]
Uses this offset inside of each read as the signal.
This is useful in cases like RiboSeq or GROseq, where
the signal is 12, 15 or 0 bases past the start of the
read. This can be paired with the --filterRNAstrand
option. Note that negative values indicate offsets
from the end of each read. A value of 1 indicates the
first base of the alignment (taking alignment
orientation into account). Likewise, a value of -1 is
the last base of the alignment. An offset of 0 is not
permitted. If two values are specified, then they will
be used to specify a range of positions. Note that
specifying something like --Offset 5 -1 will result in
the 5th through last position being used, which is
equivalent to trimming 4 bases from the 5-prime end of
alignments. Note that if you specify --centerReads,
the centering will be performed before the offset.
(default: None)
--filterRNAstrand {forward,reverse}
Selects RNA-seq reads (single-end or paired-end)
originating from genes on the given strand. This
option assumes a standard dUTP-based library
preparation (that is, --filterRNAstrand=forward keeps
minus-strand reads, which originally came from genes
on the forward strand using a dUTP-based method).
Consider using --samExcludeFlag instead for filtering
by strand in other contexts. (default: None)
--version show program's version number and exit
--binSize INT bp, -bs INT bp
Size of the bins, in bases, for the output of the
bigwig/bedgraph file. (Default: 50)
--region CHR:START:END, -r CHR:START:END
Region of the genome to limit the operation to - this
is useful when testing parameters to reduce the
computing time. The format is chr:start:end, for
example --region chr10 or --region
chr10:456700:891000. (default: None)
--blackListFileName BED file [BED file ...], -bl BED file [BED file ...]
A BED or GTF file containing regions that should be
excluded from all analyses. Currently this works by
rejecting genomic chunks that happen to overlap an
entry. Consequently, for BAM files, if a read
partially overlaps a blacklisted region or a fragment
spans over it, then the read/fragment might still be
considered. Please note that you should adjust the
effective genome size, if relevant. (default: None)
--numberOfProcessors INT, -p INT
Number of processors to use. Type "max/2" to use half
the maximum number of processors or "max" to use all
available processors. (Default: 1)
--verbose, -v Set to see processing messages. (default: False)
Read coverage normalization options:
--effectiveGenomeSize EFFECTIVEGENOMESIZE
The effective genome size is the portion of the genome
that is mappable. Large fractions of the genome are
stretches of NNNN that should be discarded. Also, if
repetitive regions were not included in the mapping of
reads, the effective genome size needs to be adjusted
accordingly. A table of values is available here: http
://deeptools.readthedocs.io/en/latest/content/feature/
effectiveGenomeSize.html . (default: None)
--normalizeUsing {RPKM,CPM,BPM,RPGC,None}
Use one of the entered methods to normalize the number
of reads per bin. By default, no normalization is
performed. RPKM = Reads Per Kilobase per Million
mapped reads; CPM = Counts Per Million mapped reads,
same as CPM in RNA-seq; BPM = Bins Per Million mapped
reads, same as TPM in RNA-seq; RPGC = reads per
genomic content (1x normalization); Mapped reads are
considered after blacklist filtering (if applied).
RPKM (per bin) = number of reads per bin / (number of
mapped reads (in millions) * bin length (kb)). CPM
(per bin) = number of reads per bin / number of mapped
reads (in millions). BPM (per bin) = number of reads
per bin / sum of all reads per bin (in millions). RPGC
(per bin) = number of reads per bin / scaling factor
for 1x average coverage. None = the default and
equivalent to not setting this option at all. This
scaling factor, in turn, is determined from the
sequencing depth: (total number of mapped reads *
fragment length) / effective genome size. The scaling
factor used is the inverse of the sequencing depth
computed for the sample to match the 1x coverage. This
option requires --effectiveGenomeSize. Each read is
considered independently, if you want to only count
one mate from a pair in paired-end data, then use the
--samFlagInclude/--samFlagExclude options. (Default:
None)
--exactScaling Instead of computing scaling factors based on a
sampling of the reads, process all of the reads to
determine the exact number that will be used in the
output. This requires significantly more time to
compute, but will produce more accurate scaling
factors in cases where alignments that are being
filtered are rare and lumped together. In other words,
this is only needed when region-based sampling is
expected to produce incorrect results. (default:
False)
--ignoreForNormalization IGNOREFORNORMALIZATION [IGNOREFORNORMALIZATION ...], -ignore IGNOREFORNORMALIZATION [IGNOREFORNORMALIZATION ...]
A list of space-delimited chromosome names containing
those chromosomes that should be excluded for
computing the normalization. This is useful when
considering samples with unequal coverage across
chromosomes, like male samples. An usage examples is
--ignoreForNormalization chrX chrM. (default: None)
--skipNonCoveredRegions, --skipNAs
This parameter determines if non-covered regions
(regions without overlapping reads) in a BAM file
should be skipped. The default is to treat those
regions as having a value of zero. The decision to
skip non-covered regions depends on the interpretation
of the data. Non-covered regions may represent, for
example, repetitive regions that should be skipped.
(default: False)
--smoothLength INT bp
The smooth length defines a window, larger than the
binSize, to average the number of reads. For example,
if the --binSize is set to 20 and the --smoothLength
is set to 60, then, for each bin, the average of the
bin and its left and right neighbors is considered.
Any value smaller than --binSize will be ignored and
no smoothing will be applied. (default: None)
Read processing options:
--extendReads [INT bp], -e [INT bp]
This parameter allows the extension of reads to
fragment size. If set, each read is extended, without
exception. *NOTE*: This feature is generally NOT
recommended for spliced-read data, such as RNA-seq, as
it would extend reads over skipped regions. *Single-
end*: Requires a user specified value for the final
fragment length. Reads that already exceed this
fragment length will not be extended. *Paired-end*:
Reads with mates are always extended to match the
fragment size defined by the two read mates. Unmated
reads, mate reads that map too far apart (>4x fragment
length) or even map to different chromosomes are
treated like single-end reads. The input of a fragment
length value is optional. If no value is specified, it
is estimated from the data (mean of the fragment size
of all mate reads). (default: False)
--ignoreDuplicates If set, reads that have the same orientation and start
position will be considered only once. If reads are
paired, the mate's position also has to coincide to
ignore a read. (default: False)
--minMappingQuality INT
If set, only reads that have a mapping quality score
of at least this are considered. (default: None)
--centerReads By adding this option, reads are centered with respect
to the fragment length. For paired-end data, the read
is centered at the fragment length defined by the two
ends of the fragment. For single-end data, the given
fragment length is used. This option is useful to get
a sharper signal around enriched regions. (default:
False)
--samFlagInclude INT Include reads based on the SAM flag. For example, to
get only reads that are the first mate, use a flag of
64. This is useful to count properly paired reads only
once, as otherwise the second mate will be also
considered for the coverage. (Default: None)
--samFlagExclude INT Exclude reads based on the SAM flag. For example, to
get only reads that map to the forward strand, use
--samFlagExclude 16, where 16 is the SAM flag for
reads that map to the reverse strand. (Default: None)
--minFragmentLength INT
The minimum fragment length needed for read/pair
inclusion. This option is primarily useful in ATACseq
experiments, for filtering mono- or di-nucleosome
fragments. (Default: 0)
--maxFragmentLength INT
The maximum fragment length needed for read/pair
inclusion. (Default: 0)
bamCoverage --bam ../results/HIF1a_Rep1.sorted.rmdup.bam -o ../results/HIF1a_Rep1.bigwig --normalizeUsing BPM --extendReads
normalization: BPM
bamFilesList: ['../results/HIF1a_Rep1.sorted.rmdup.bam']
binLength: 50
numberOfSamples: None
blackListFileName: None
skipZeroOverZero: False
bed_and_bin: False
genomeChunkSize: None
defaultFragmentLength: 223
numberOfProcessors: 1
verbose: False
region: None
bedFile: None
minMappingQuality: None
ignoreDuplicates: False
chrsToSkip: []
stepSize: 50
center_read: False
samFlag_include: None
samFlag_exclude: None
minFragmentLength: 0
maxFragmentLength: 0
zerosToNans: False
smoothLength: None
save_data: False
out_file_for_raw_data: None
maxPairedFragmentLength: 892
Let’s see the parameters of the call:
Now, we should have our .bigwig file:
ls ../results/HIF1a_Rep1.bigwig
../results/HIF1a_Rep1.bigwig
UCSC doesn’t allow to ‘upload’ directly your .bigwig file. Instead, you have to store them in a web-accessible http, https, or ftp location. One of the options to store .bigwig files is Cyverse.
Because data upload to servers is time-consuming, for this step, we are going to pretend we’ve already uploaded to data to Cyverse.
Now, open your UCSC genome browser and click in the Custom Tracks button. Then, you will add the following custom track:
track type=bigWig name="HIF1a_rep1" description="HIF1a bigwig" bigDataUrl=https://data.cyverse.org/dav-anon/iplant/home/arielmadr23/27_HIF_Smythies_2019/myHub/hg38/RCC4_Normoxia_HIF1a_PM14_Rep1.bw
You can read more about how to upload data to the UCSC genome browser in here:
Deeptools also offers different tools for visualizing ChIP-seq data, including the generation of Heatmaps and different summary plots. For now, let’s try the plotHeatmap function.
mkdir -p ../results/deeptools
The first step is to compute the signal using computeMatrix. computeMatrix has two modes: * scale regions: In the scale-regions mode, all regions in the BED file are stretched or shrunken to the length (in bases) indicated by the user. * reference-point: Reference-point refers to a position within a BED region (e.g., the starting point). In this mode, only those genomicpositions before (upstream) and/or after (downstream) of the reference point will be plotted. You can read more about them here.
computeMatrix reference-point -S ../results/HIF1a_Rep1.bigwig \
--referencePoint center \
-R ../results/macs2/HIF1a_rep1_summits.bed \
--beforeRegionStartLength 3000 \
--afterRegionStartLength 3000 \
--skipZeros \
-o ../results/deeptools/matrix.mat.gz
plotHeatmap -m ../results/deeptools/matrix.mat.gz \
--heatmapHeight 5 \
--refPointLabel 'Peak summit' \
--regionsLabel 'HIF1a peaks' \
--plotTitle 'Peak summits (HIF1a)' \
-out ../results/deeptools/HIF1a_heatmap.png
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.33 R6_2.5.1 fastmap_1.1.1 xfun_0.40
[5] cachem_1.0.8 knitr_1.43 htmltools_0.5.6 rmarkdown_2.24
[9] cli_3.6.1 sass_0.4.7 jquerylib_0.1.4 compiler_4.3.1
[13] tools_4.3.1 mime_0.12 evaluate_0.21 bslib_0.5.1
[17] yaml_2.3.7 rlang_1.1.1 jsonlite_1.8.7