This workshop will guide you through the basics of ChIP-seq analysis with hands-on exercises. You will learn how to process ChIP-seq data: perform read alignment, peak calling, quality control, visualization through the genome browser, motif finding and gene set enrichment analysis.

In this tutorial, we are going to focus on peak calling.

Note: All the code in this tutorial will be performed in the terminal.

01 - Peak Calling with MACS

Let’s move to the scripts directory (or stay here if already there):

cd chip_seq/scripts

The main command that we are going to use with macs2 is the callpeak function. The manual can be found here.

macs2 callpeak --help

usage: macs2 callpeak [-h] -t TFILE [TFILE ...] [-c [CFILE [CFILE ...]]]
                      [-f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}]
                      [-g GSIZE] [-s TSIZE] [--keep-dup KEEPDUPLICATES]
                      [--outdir OUTDIR] [-n NAME] [-B] [--verbose VERBOSE]
                      [--trackline] [--SPMR] [--nomodel] [--shift SHIFT]
                      [--extsize EXTSIZE] [--bw BW] [--d-min D_MIN]
                      [-m MFOLD MFOLD] [--fix-bimodal] [-q QVALUE | -p PVALUE]
                      [--scale-to {large,small}] [--down-sample] [--seed SEED]
                      [--tempdir TEMPDIR] [--nolambda] [--slocal SMALLLOCAL]
                      [--llocal LARGELOCAL] [--max-gap MAXGAP]
                      [--min-length MINLEN] [--broad]
                      [--broad-cutoff BROADCUTOFF] [--cutoff-analysis]
                      [--call-summits] [--fe-cutoff FECUTOFF]
                      [--buffer-size BUFFER_SIZE] [--to-large] [--ratio RATIO]

optional arguments:
  -h, --help            show this help message and exit

Input files arguments:
  -t TFILE [TFILE ...], --treatment TFILE [TFILE ...]
                        ChIP-seq treatment file. If multiple files are given
                        as '-t A B C', then they will all be read and pooled
                        together. REQUIRED.
  -c [CFILE [CFILE ...]], --control [CFILE [CFILE ...]]
                        Control file. If multiple files are given as '-c A B
                        C', they will be pooled to estimate ChIP-seq
                        background noise.
  -f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}, --format {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}
                        Format of tag file, "AUTO", "BED" or "ELAND" or
                        "ELANDMULTI" or "ELANDEXPORT" or "SAM" or "BAM" or
                        "BOWTIE" or "BAMPE" or "BEDPE". The default AUTO
                        option will let MACS decide which format (except for
                        BAMPE and BEDPE which should be implicitly set) the
                        file is. Please check the definition in README. Please
                        note that if the format is set as BAMPE or BEDPE,
                        MACS2 will call its special Paired-end mode to call
                        peaks by piling up the actual ChIPed fragments defined
                        by both aligned ends, instead of predicting the
                        fragment size first and extending reads. Also please
                        note that the BEDPE only contains three columns, and
                        is NOT the same BEDPE format used by BEDTOOLS.
                        DEFAULT: "AUTO"
  -g GSIZE, --gsize GSIZE
                        Effective genome size. It can be 1.0e+9 or 1000000000,
                        or shortcuts:'hs' for human (2.7e9), 'mm' for mouse
                        (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
                        fruitfly (1.2e8), Default:hs
  -s TSIZE, --tsize TSIZE
                        Tag size/read length. This will override the auto
                        detected tag size. DEFAULT: Not set
  --keep-dup KEEPDUPLICATES
                        It controls the behavior towards duplicate tags at the
                        exact same location -- the same coordination and the
                        same strand. The 'auto' option makes MACS calculate
                        the maximum tags at the exact same location based on
                        binomal distribution using 1e-5 as pvalue cutoff; and
                        the 'all' option keeps every tags. If an integer is
                        given, at most this number of tags will be kept at the
                        same location. Note, if you've used samtools or picard
                        to flag reads as 'PCR/Optical duplicate' in bit 1024,
                        MACS2 will still read them although the reads may be
                        decided by MACS2 as duplicate later. If you plan to
                        rely on samtools/picard/any other tool to filter
                        duplicates, please remove those duplicate reads and
                        save a new alignment file then ask MACS2 to keep all
                        by '--keep-dup all'. The default is to keep one tag at
                        the same location. Default: 1

Output arguments:
  --outdir OUTDIR       If specified all output files will be written to that
                        directory. Default: the current working directory
  -n NAME, --name NAME  Experiment name, which will be used to generate output
                        file names. DEFAULT: "NA"
  -B, --bdg             Whether or not to save extended fragment pileup, and
                        local lambda tracks (two files) at every bp into a
                        bedGraph file. DEFAULT: False
  --verbose VERBOSE     Set verbose level of runtime message. 0: only show
                        critical message, 1: show additional warning message,
                        2: show process information, 3: show debug messages.
                        DEFAULT:2
  --trackline           Tells MACS to include trackline with bedGraph files.
                        To include this trackline while displaying bedGraph at
                        UCSC genome browser, can show name and description of
                        the file as well. However my suggestion is to convert
                        bedGraph to bigWig, then show the smaller and faster
                        binary bigWig file at UCSC genome browser, as well as
                        downstream analysis. Require -B to be set. Default:
                        Not include trackline.
  --SPMR                If True, MACS will SAVE signal per million reads for
                        fragment pileup profiles. It won't interfere with
                        computing pvalue/qvalue during peak calling, since
                        internally MACS2 keeps using the raw pileup and
                        scaling factors between larger and smaller dataset to
                        calculate statistics measurements. If you plan to use
                        the signal output in bedGraph to call peaks using
                        bdgcmp and bdgpeakcall, you shouldn't use this option
                        because you will end up with different results.
                        However, this option is recommended for displaying
                        normalized pileup tracks across many datasets. Require
                        -B to be set. Default: False

Shifting model arguments:
  --nomodel             Whether or not to build the shifting model. If True,
                        MACS will not build model. by default it means
                        shifting size = 100, try to set extsize to change it.
                        It's highly recommended that while you have many
                        datasets to process and you plan to compare different
                        conditions, aka differential calling, use both
                        'nomodel' and 'extsize' to make signal files from
                        different datasets comparable. DEFAULT: False
  --shift SHIFT         (NOT the legacy --shiftsize option!) The arbitrary
                        shift in bp. Use discretion while setting it other
                        than default value. When NOMODEL is set, MACS will use
                        this value to move cutting ends (5') towards 5'->3'
                        direction then apply EXTSIZE to extend them to
                        fragments. When this value is negative, ends will be
                        moved toward 3'->5' direction. Recommended to keep it
                        as default 0 for ChIP-Seq datasets, or -1 * half of
                        EXTSIZE together with EXTSIZE option for detecting
                        enriched cutting loci such as certain DNAseI-Seq
                        datasets. Note, you can't set values other than 0 if
                        format is BAMPE or BEDPE for paired-end data. DEFAULT:
                        0.
  --extsize EXTSIZE     The arbitrary extension size in bp. When nomodel is
                        true, MACS will use this value as fragment size to
                        extend each read towards 3' end, then pile them up.
                        It's exactly twice the number of obsolete SHIFTSIZE.
                        In previous language, each read is moved 5'->3'
                        direction to middle of fragment by 1/2 d, then
                        extended to both direction with 1/2 d. This is
                        equivalent to say each read is extended towards 5'->3'
                        into a d size fragment. DEFAULT: 200. EXTSIZE and
                        SHIFT can be combined when necessary. Check SHIFT
                        option.
  --bw BW               Band width for picking regions to compute fragment
                        size. This value is only used while building the
                        shifting model. Tweaking this is not recommended.
                        DEFAULT: 300
  --d-min D_MIN         Minimum fragment size in basepair. Any predicted
                        fragment size less than this will be excluded.
                        DEFAULT: 20
  -m MFOLD MFOLD, --mfold MFOLD MFOLD
                        Select the regions within MFOLD range of high-
                        confidence enrichment ratio against background to
                        build model. Fold-enrichment in regions must be lower
                        than upper limit, and higher than the lower limit. Use
                        as "-m 10 30". This setting is only used while
                        building the shifting model. Tweaking it is not
                        recommended. DEFAULT:5 50
  --fix-bimodal         Whether turn on the auto pair model process. If set,
                        when MACS failed to build paired model, it will use
                        the nomodel settings, the --exsize parameter to extend
                        each tags towards 3' direction. Not to use this
                        automate fixation is a default behavior now. DEFAULT:
                        False

Peak calling arguments:
  -q QVALUE, --qvalue QVALUE
                        Minimum FDR (q-value) cutoff for peak detection.
                        DEFAULT: 0.05. -q, and -p are mutually exclusive.
  -p PVALUE, --pvalue PVALUE
                        Pvalue cutoff for peak detection. DEFAULT: not set.
                        -q, and -p are mutually exclusive. If pvalue cutoff is
                        set, qvalue will not be calculated and reported as -1
                        in the final .xls file.
  --scale-to {large,small}
                        When set to 'small', scale the larger sample up to the
                        smaller sample. When set to 'larger', scale the
                        smaller sample up to the bigger sample. By default,
                        scale to 'small'. This option replaces the obsolete '
                        --to-large' option. The default behavior is
                        recommended since it will lead to less significant
                        p/q-values in general but more specific results. Keep
                        in mind that scaling down will influence control/input
                        sample more. DEFAULT: 'small', the choice is either
                        'small' or 'large'.
  --down-sample         When set, random sampling method will scale down the
                        bigger sample. By default, MACS uses linear scaling.
                        Warning: This option will make your result unstable
                        and irreproducible since each time, random reads would
                        be selected. Consider to use 'randsample' script
                        instead. <not implmented>If used together with --SPMR,
                        1 million unique reads will be randomly picked.</not
                        implemented> Caution: due to the implementation, the
                        final number of selected reads may not be as you
                        expected! DEFAULT: False
  --seed SEED           Set the random seed while down sampling data. Must be
                        a non-negative integer in order to be effective.
                        DEFAULT: not set
  --tempdir TEMPDIR     Optional directory to store temp files. DEFAULT: /tmp
  --nolambda            If True, MACS will use fixed background lambda as
                        local lambda for every peak region. Normally, MACS
                        calculates a dynamic local lambda to reflect the local
                        bias due to the potential chromatin accessibility.
  --slocal SMALLLOCAL   The small nearby region in basepairs to calculate
                        dynamic lambda. This is used to capture the bias near
                        the peak summit region. Invalid if there is no control
                        data. If you set this to 0, MACS will skip slocal
                        lambda calculation. *Note* that MACS will always
                        perform a d-size local lambda calculation while the
                        control data is available. The final local bias would
                        be the maximum of the lambda value from d, slocal, and
                        llocal size windows. While control is not available, d
                        and slocal lambda won't be considered. DEFAULT: 1000
  --llocal LARGELOCAL   The large nearby region in basepairs to calculate
                        dynamic lambda. This is used to capture the surround
                        bias. If you set this to 0, MACS will skip llocal
                        lambda calculation. *Note* that MACS will always
                        perform a d-size local lambda calculation while the
                        control data is available. The final local bias would
                        be the maximum of the lambda value from d, slocal, and
                        llocal size windows. While control is not available, d
                        and slocal lambda won't be considered. DEFAULT: 10000.
  --max-gap MAXGAP      Maximum gap between significant sites to cluster them
                        together. The DEFAULT value is the detected read
                        length/tag size.
  --min-length MINLEN   Minimum length of a peak. The DEFAULT value is the
                        predicted fragment size d. Note, if you set a value
                        smaller than the fragment size, it may have NO effect
                        on the result. For BROAD peak calling, try to set a
                        large value such as 500bps. You can also use '--
                        cutoff-analysis' option with default setting, and
                        check the column 'avelpeak' under different cutoff
                        values to decide a reasonable minlen value.
  --broad               If set, MACS will try to call broad peaks using the
                        --broad-cutoff setting. Please tweak '--broad-cutoff'
                        setting to control the peak calling behavior. At the
                        meantime, either -q or -p cutoff will be used to
                        define regions with 'stronger enrichment' inside of
                        broad peaks. The maximum gap is expanded to 4 * MAXGAP
                        (--max-gap parameter). As a result, MACS will output a
                        'gappedPeak' and a 'broadPeak' file instead of
                        'narrowPeak' file. Note, a broad peak will be reported
                        even if there is no 'stronger enrichment' inside.
                        DEFAULT: False
  --broad-cutoff BROADCUTOFF
                        Cutoff for broad region. This option is not available
                        unless --broad is set. If -p is set, this is a pvalue
                        cutoff, otherwise, it's a qvalue cutoff. Please note
                        that in broad peakcalling mode, MACS2 uses this
                        setting to control the overall peak calling behavior,
                        then uses -q or -p setting to define regions inside
                        broad region as 'stronger' enrichment. DEFAULT: 0.1
  --cutoff-analysis     While set, MACS2 will analyze number or total length
                        of peaks that can be called by different p-value
                        cutoff then output a summary table to help user decide
                        a better cutoff. The table will be saved in
                        NAME_cutoff_analysis.txt file. Note, minlen and maxgap
                        may affect the results. WARNING: May take ~30 folds
                        longer time to finish. The result can be useful for
                        users to decide a reasonable cutoff value. DEFAULT:
                        False

Post-processing options:
  --call-summits        If set, MACS will use a more sophisticated signal
                        processing approach to find subpeak summits in each
                        enriched peak region. DEFAULT: False
  --fe-cutoff FECUTOFF  When set, the value will be used to filter out peaks
                        with low fold-enrichment. Note, MACS2 use 1.0 as
                        pseudocount while calculating fold-enrichment.
                        DEFAULT: 1.0

Other options:
  --buffer-size BUFFER_SIZE
                        Buffer size for incrementally increasing internal
                        array size to store reads alignment information. In
                        most cases, you don't have to change this parameter.
                        However, if there are large number of
                        chromosomes/contigs/scaffolds in your alignment, it's
                        recommended to specify a smaller buffer size in order
                        to decrease memory usage (but it will take longer time
                        to read alignment files). Minimum memory requested for
                        reading an alignment file is about # of CHROMOSOME *
                        BUFFER_SIZE * 8 Bytes. DEFAULT: 100000

Obsolete options:
  --to-large            Obsolete option. Please use '--scale-to large'
                        instead.
  --ratio RATIO         Obsolete option. Originally designed to normalize
                        treatment and control with customized ratio, now it
                        won't have any effect.

We will use the BAM file from our previous step, and an Input control BAM that has been already pre-processed.

ls  ../data/Input_Rep1/

Input_Rep1.sorted.rmdup.bam
Input_Rep1.sorted.rmdup.bam.bai

If you didn’t compute the BAM for the alignment step, you can find it in here:

ls ../data/HIF1a_Rep1/HIF1a_Rep1.sorted.rmdup.bam

../data/HIF1a_Rep1/HIF1a_Rep1.sorted.rmdup.bam

Let’s create an output directory for storing MACS results

mkdir -p ../results/macs2

Now we are ready to call MACS

macs2 callpeak -t ../results/HIF1a_Rep1.sorted.rmdup.bam -c ../data/Input_Rep1/Input_Rep1.sorted.rmdup.bam -f BAMPE -g hs -n HIF1a_rep1 -q 0.05 --outdir ../results/macs2

INFO  @ Sun, 22 Oct 2023 21:42:17: 
# Command line: callpeak -t ../results/HIF1a_Rep1.sorted.rmdup.bam -c ../data/Input_Rep1/Input_Rep1.sorted.rmdup.bam -f BAMPE -g hs -n HIF1a_rep1 -q 0.05 --outdir ../results/macs2
# ARGUMENTS LIST:
# name = HIF1a_rep1
# format = BAMPE
# ChIP-seq file = ['../results/HIF1a_Rep1.sorted.rmdup.bam']
# control file = ['../data/Input_Rep1/Input_Rep1.sorted.rmdup.bam']
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# The maximum gap between significant sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned as the predicted fragment length "d".
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is on
 
INFO  @ Sun, 22 Oct 2023 21:42:17: #1 read fragment files... 
INFO  @ Sun, 22 Oct 2023 21:42:17: #1 read treatment fragments... 
INFO  @ Sun, 22 Oct 2023 21:42:17: 95298 fragments have been read. 
INFO  @ Sun, 22 Oct 2023 21:42:17: #1.2 read input fragments... 
INFO  @ Sun, 22 Oct 2023 21:42:18: 166484 fragments have been read. 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 mean fragment size is determined as 238.2 bp from treatment 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 note: mean fragment size in control is 210.8 bp -- value ignored 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 fragment size = 238.2 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  total fragments in treatment: 95298 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 user defined the maximum fragments... 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 filter out redundant fragments by allowing at most 1 identical fragment(s) 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  fragments after filtering in treatment: 95227 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  Redundant rate of treatment: 0.00 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  total fragments in control: 166484 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 user defined the maximum fragments... 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 filter out redundant fragments by allowing at most 1 identical fragment(s) 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  fragments after filtering in control: 166150 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1  Redundant rate of control: 0.00 
INFO  @ Sun, 22 Oct 2023 21:42:18: #1 finished! 
INFO  @ Sun, 22 Oct 2023 21:42:18: #2 Build Peak Model... 
INFO  @ Sun, 22 Oct 2023 21:42:18: #2 Skipped... 
INFO  @ Sun, 22 Oct 2023 21:42:18: #3 Call peaks... 
INFO  @ Sun, 22 Oct 2023 21:42:18: #3 Pre-compute pvalue-qvalue table... 
INFO  @ Sun, 22 Oct 2023 21:42:19: #3 Call peaks for each chromosome... 
INFO  @ Sun, 22 Oct 2023 21:42:19: #4 Write output xls file... ../results/macs2/HIF1a_rep1_peaks.xls 
INFO  @ Sun, 22 Oct 2023 21:42:19: #4 Write peak in narrowPeak format file... ../results/macs2/HIF1a_rep1_peaks.narrowPeak 
INFO  @ Sun, 22 Oct 2023 21:42:19: #4 Write summits bed file... ../results/macs2/HIF1a_rep1_summits.bed 
INFO  @ Sun, 22 Oct 2023 21:42:19: Done!

-t: treatment file.
-c: control file
-f: Format of the tag file. Since our data is paired-end, we are using BAMPE. Note that by specifying that the data is paired-end, MACS will use the actual fragment size from the reads and will not estimate it.
-g: Effective genome size. It has presets for various genomes.
-n: Experiment name, used for the output file names

Let’s take a look at the MACS output:

ls ../results/macs2

HIF1a_rep1_peaks.narrowPeak
HIF1a_rep1_peaks.xls
HIF1a_rep1_summits.bed

The main result of macs will be a list of peaks called as statistically significant. The .xls file can be opened and explored with excel. The .narrowPeak is a BED6+4 format file that contains the peak locations together with peak summit, p-value, and q-value. The _summits.bed file is in BED format, which contains the peak summits locations for every peak.

head ../results/macs2/HIF1a_rep1_peaks.narrowPeak

chr6    41734187    41734560    HIF1a_rep1_peak_1   181 .   5.15392 22.2925 18.1868 181
chr6    41787043    41787581    HIF1a_rep1_peak_2   160 .   5.07169 20.1635 16.0937 158
chr6    42452936    42453274    HIF1a_rep1_peak_3   194 .   4.60577 23.54   19.4193 214
chr6    42929347    42929669    HIF1a_rep1_peak_4   135 .   4.93615 17.5521 13.5385 155
chr6    43013743    43014397    HIF1a_rep1_peak_5   100 .   4.03138 13.9432 10.0151 477
chr6    43181975    43182279    HIF1a_rep1_peak_6   109 .   4.46358 14.922  10.9685 106
chr6    43667941    43668609    HIF1a_rep1_peak_7   4365    .   34.633  443.987 436.589 342
chr6    43725079    43725412    HIF1a_rep1_peak_8   102 .   4.01653 14.1664 10.2305 132
chr6    43768993    43770080    HIF1a_rep1_peak_9   968 .   13.7479 101.454 96.8855 238
chr6    43798896    43799239    HIF1a_rep1_peak_10  322 .   7.87035 36.5134 32.2785 185

We can create a BED file by ommiting some columns(this will be useful for the next step).

cut -f 1-6 ../results/macs2/HIF1a_rep1_peaks.narrowPeak > ../results/macs2/HIF1a_rep1_peaks.bed

Column 9 represents the -log10qvalue at peak summit. We can also take a look at the coordinates of the summits of the peaks ( regions of highest signal)

head ../results/macs2/HIF1a_rep1_summits.bed

chr6    41734368    41734369    HIF1a_rep1_peak_1   18.1868
chr6    41787201    41787202    HIF1a_rep1_peak_2   16.0937
chr6    42453150    42453151    HIF1a_rep1_peak_3   19.4193
chr6    42929502    42929503    HIF1a_rep1_peak_4   13.5385
chr6    43014220    43014221    HIF1a_rep1_peak_5   10.0151
chr6    43182081    43182082    HIF1a_rep1_peak_6   10.9685
chr6    43668283    43668284    HIF1a_rep1_peak_7   436.589
chr6    43725211    43725212    HIF1a_rep1_peak_8   10.2305
chr6    43769231    43769232    HIF1a_rep1_peak_9   96.8855
chr6    43799081    43799082    HIF1a_rep1_peak_10  32.2785

We can also check the number of significant peaks called

wc -l ../results/macs2/HIF1a_rep1_peaks.narrowPeak

22 ../results/macs2/HIF1a_rep1_peaks.narrowPeak

02 - Peak overlap between replicates

One of the most common problems in dealing with peaks in ChIP-seq analysis is to check if two sets of peaks overlap. To work through that, we are going to use bedtools intersect. You can find information about the tool in here.

For this exercise, we are going the overlap between peaks called by MACS2 for two replicates, one of them is the one we have processed so far, and the other one can be found in the data folder

ls ../data/HIF1a_Rep2/macs2/

HIF1a_rep2_peaks.bed
HIF1a_rep2_peaks.narrowPeak
HIF1a_rep2_peaks.xls
HIF1a_rep2_summits.bed

head ../data/HIF1a_Rep2/macs2/HIF1a_rep2_peaks.narrowPeak

chr6    40587363    40588002    HIF1a_rep2_peak_1   101 .   5.61209 14.3279 10.1769 102
chr6    41734211    41734546    HIF1a_rep2_peak_2   235 .   7.59926 28.0224 23.5614 167
chr6    41787067    41787523    HIF1a_rep2_peak_3   164 .   6.21755 20.807  16.4506 181
chr6    42453021    42453358    HIF1a_rep2_peak_4   155 .   6.51948 19.8864 15.5494 157
chr6    42890670    42891118    HIF1a_rep2_peak_5   52  .   4.54819 9.16283 5.21816 344
chr6    42929320    42929657    HIF1a_rep2_peak_6   89  .   5.23768 13.0709 8.96026 149
chr6    43013856    43014357    HIF1a_rep2_peak_7   102 .   5.19112 14.4096 10.257  127
chr6    43181955    43182229    HIF1a_rep2_peak_8   54  .   4.37834 9.4117  5.4488  123
chr6    43575833    43576214    HIF1a_rep2_peak_9   176 .   8.06929 22.0725 17.6879 211
chr6    43668045    43668540    HIF1a_rep2_peak_10  2533    .   21.5658 261.055 253.316 242

Let’s check the intersect between replicate 1 and replicate 2 with the default behaviour

bedtools intersect -a ../results/macs2/HIF1a_rep1_peaks.narrowPeak -b ../data/HIF1a_Rep2/macs2/HIF1a_rep2_peaks.narrowPeak

chr6    41734211    41734546    HIF1a_rep1_peak_1   181 .   5.15392 22.2925 18.1868 181
chr6    41787067    41787523    HIF1a_rep1_peak_2   160 .   5.07169 20.1635 16.0937 158
chr6    42453021    42453274    HIF1a_rep1_peak_3   194 .   4.60577 23.54   19.4193 214
chr6    42929347    42929657    HIF1a_rep1_peak_4   135 .   4.93615 17.5521 13.5385 155
chr6    43013856    43014357    HIF1a_rep1_peak_5   100 .   4.03138 13.9432 10.0151 477
chr6    43181975    43182229    HIF1a_rep1_peak_6   109 .   4.46358 14.922  10.9685 106
chr6    43668045    43668540    HIF1a_rep1_peak_7   4365    .   34.633  443.987 436.589 342
chr6    43725079    43725385    HIF1a_rep1_peak_8   102 .   4.01653 14.1664 10.2305 132
chr6    43769014    43770018    HIF1a_rep1_peak_9   968 .   13.7479 101.454 96.8855 238
chr6    43798896    43799222    HIF1a_rep1_peak_10  322 .   7.87035 36.5134 32.2785 185
chr6    43923520    43923979    HIF1a_rep1_peak_11  1104    .   15.5653 115.05  110.41  251
chr6    43990948    43991196    HIF1a_rep1_peak_12  139 .   5.26933 17.9401 13.9195 131
chr6    44010071    44010351    HIF1a_rep1_peak_14  1303    .   16.3591 135.086 130.349 1005
chr6    44010837    44011239    HIF1a_rep1_peak_14  1303    .   16.3591 135.086 130.349 1005
chr6    44058536    44058841    HIF1a_rep1_peak_15  180 .   5.83929 22.1368 18.0332 225
chr6    44073393    44073738    HIF1a_rep1_peak_16  2501    .   36.4275 255.43  250.181 264
chr6    44075173    44075528    HIF1a_rep1_peak_17  1175    .   16.5665 122.233 117.549 220
chr6    44076549    44077451    HIF1a_rep1_peak_18  2887    .   31.5892 294.173 288.734 590
chr6    44237388    44237938    HIF1a_rep1_peak_20  222 .   6.61987 26.4474 22.2977 180
chr6    46653087    46653380    HIF1a_rep1_peak_22  40  .   3.56876 7.73921 4.02043 110

By default, if an overlap is found, bedtools intersect reports the shared interval between the two overlapping features. We can also return not only the shared interval, but the original entries of A for the overlap:

bedtools intersect -a ../results/macs2/HIF1a_rep1_peaks.narrowPeak -b ../data/HIF1a_Rep2/macs2/HIF1a_rep2_peaks.narrowPeak -wa

chr6    41734187    41734560    HIF1a_rep1_peak_1   181 .   5.15392 22.2925 18.1868 181
chr6    41787043    41787581    HIF1a_rep1_peak_2   160 .   5.07169 20.1635 16.0937 158
chr6    42452936    42453274    HIF1a_rep1_peak_3   194 .   4.60577 23.54   19.4193 214
chr6    42929347    42929669    HIF1a_rep1_peak_4   135 .   4.93615 17.5521 13.5385 155
chr6    43013743    43014397    HIF1a_rep1_peak_5   100 .   4.03138 13.9432 10.0151 477
chr6    43181975    43182279    HIF1a_rep1_peak_6   109 .   4.46358 14.922  10.9685 106
chr6    43667941    43668609    HIF1a_rep1_peak_7   4365    .   34.633  443.987 436.589 342
chr6    43725079    43725412    HIF1a_rep1_peak_8   102 .   4.01653 14.1664 10.2305 132
chr6    43768993    43770080    HIF1a_rep1_peak_9   968 .   13.7479 101.454 96.8855 238
chr6    43798896    43799239    HIF1a_rep1_peak_10  322 .   7.87035 36.5134 32.2785 185
chr6    43923513    43923979    HIF1a_rep1_peak_11  1104    .   15.5653 115.05  110.41  251
chr6    43990916    43991196    HIF1a_rep1_peak_12  139 .   5.26933 17.9401 13.9195 131
chr6    44010031    44011443    HIF1a_rep1_peak_14  1303    .   16.3591 135.086 130.349 1005
chr6    44010031    44011443    HIF1a_rep1_peak_14  1303    .   16.3591 135.086 130.349 1005
chr6    44058536    44058996    HIF1a_rep1_peak_15  180 .   5.83929 22.1368 18.0332 225
chr6    44073262    44073853    HIF1a_rep1_peak_16  2501    .   36.4275 255.43  250.181 264
chr6    44075148    44075610    HIF1a_rep1_peak_17  1175    .   16.5665 122.233 117.549 220
chr6    44076130    44077532    HIF1a_rep1_peak_18  2887    .   31.5892 294.173 288.734 590
chr6    44237299    44237938    HIF1a_rep1_peak_20  222 .   6.61987 26.4474 22.2977 180
chr6    46653087    46653436    HIF1a_rep1_peak_22  40  .   3.56876 7.73921 4.02043 110

03 - Exercises

1. Perform the peak calling steps for the HIF2a data. If you didn’t complete the alignment step for HIF2a, a processed .bam file can be found in here:

ls ../data/HIF2a_Rep1/HIF2a_Rep1.sorted.rmdup.bam

../data/HIF2a_Rep1/HIF2a_Rep1.sorted.rmdup.bam

You can use the same input control as for HIF1a.

1. How would you run MACS without a input ChIP-seq control? Test this using the HIF1a data.
1. By default, bedtools intersect will report an overlap between A and B so long as there is at least one base pair is overlapping. What argument would you change to overlap by at least 50 percent? Test this using replicate 1 and replicate 2 of the HIF1a data.
1. Compare the overlap between the HIF1a replicates and between HIF1a and HIF2a.

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.33   R6_2.5.1        fastmap_1.1.1   xfun_0.40      
 [5] cachem_1.0.8    knitr_1.43      htmltools_0.5.6 rmarkdown_2.24 
 [9] cli_3.6.1       sass_0.4.7      jquerylib_0.1.4 compiler_4.3.1 
[13] tools_4.3.1     mime_0.12       evaluate_0.21   bslib_0.5.1    
[17] yaml_2.3.7      rlang_1.1.1     jsonlite_1.8.7

LS0tCnRpdGxlOiAiTWlDTTogMDIgLSBQZWFrIENhbGxpbmciCmF1dGhvcjogIkFyaWVsIE1hZHJpZ2FsIEFndWlycmUiCmRhdGU6ICIyMDIzLTEwLTI1IgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIGRmX3ByaW50OiBwYWdlZAogICAgY29kZV9mb2xkaW5nOiBzaG93CiAgICB0b2M6IHllcwogICAgdG9jX2Zsb2F0OiAKICAgICAgY29sbGFwc2VkOiBmYWxzZQogICAgICBzbW9vdGhfc2Nyb2xsOiBmYWxzZQotLS0KCmBgYHtyLGVjaG89RkFMU0V9Cmh0bWx0b29sczo6aW1nKHNyYyA9IGtuaXRyOjppbWFnZV91cmkoImltYWdlcy9taWNtX2NvbG9yX2xvZ28ucG5nIiksIAogICAgICAgICAgICAgICBhbHQgPSAnbG9nbycsIAogICAgICAgICAgICAgICBzdHlsZSA9ICdwb3NpdGlvbjphYnNvbHV0ZTsgdG9wOjMwcHg7IHJpZ2h0OjA7IHBhZGRpbmc6MTBweDsgbWF4LXdpZHRoOjUwJTsnKQpgYGAKClRoaXMgd29ya3Nob3Agd2lsbCBndWlkZSB5b3UgdGhyb3VnaCB0aGUgYmFzaWNzIG9mIENoSVAtc2VxIGFuYWx5c2lzIHdpdGggaGFuZHMtb24gZXhlcmNpc2VzLiBZb3Ugd2lsbCBsZWFybiBob3cgdG8gcHJvY2VzcyBDaElQLXNlcSBkYXRhOiBwZXJmb3JtIHJlYWQgYWxpZ25tZW50LCBwZWFrIGNhbGxpbmcsIHF1YWxpdHkgY29udHJvbCwgdmlzdWFsaXphdGlvbiB0aHJvdWdoIHRoZSBnZW5vbWUgYnJvd3NlciwgbW90aWYgZmluZGluZyBhbmQgZ2VuZSBzZXQgZW5yaWNobWVudCBhbmFseXNpcy4KCkluIHRoaXMgdHV0b3JpYWwsIHdlIGFyZSBnb2luZyB0byBmb2N1cyBvbiBwZWFrIGNhbGxpbmcuIAoKKipOb3RlKio6IEFsbCB0aGUgY29kZSBpbiB0aGlzIHR1dG9yaWFsIHdpbGwgYmUgcGVyZm9ybWVkIGluIHRoZSB0ZXJtaW5hbC4gCgojIDAxIC0gUGVhayBDYWxsaW5nIHdpdGggTUFDUwoKTGV0J3MgbW92ZSB0byB0aGUgc2NyaXB0cyBkaXJlY3RvcnkgKG9yIHN0YXkgaGVyZSBpZiBhbHJlYWR5IHRoZXJlKToKYGBge2Jhc2gsIGV2YWw9Rn0KY2QgY2hpcF9zZXEvc2NyaXB0cwpgYGAKClRoZSBtYWluIGNvbW1hbmQgdGhhdCB3ZSBhcmUgZ29pbmcgdG8gdXNlIHdpdGggbWFjczIgaXMgdGhlIGNhbGxwZWFrIGZ1bmN0aW9uLiBUaGUgbWFudWFsIGNhbiBiZSBmb3VuZCBbaGVyZV0oaHR0cHM6Ly9weXBpLm9yZy9wcm9qZWN0L01BQ1MyLykuIApgYGB7YmFzaH0KbWFjczIgY2FsbHBlYWsgLS1oZWxwCmBgYApXZSB3aWxsIHVzZSB0aGUgQkFNIGZpbGUgZnJvbSBvdXIgcHJldmlvdXMgc3RlcCwgYW5kIGFuIElucHV0IGNvbnRyb2wgQkFNIHRoYXQgaGFzIGJlZW4gYWxyZWFkeSBwcmUtcHJvY2Vzc2VkLiAKYGBge2Jhc2h9CmxzICAuLi9kYXRhL0lucHV0X1JlcDEvCmBgYApJZiB5b3UgZGlkbid0IGNvbXB1dGUgdGhlIEJBTSBmb3IgdGhlIGFsaWdubWVudCBzdGVwLCB5b3UgY2FuIGZpbmQgaXQgaW4gaGVyZToKYGBge2Jhc2h9CmxzIC4uL2RhdGEvSElGMWFfUmVwMS9ISUYxYV9SZXAxLnNvcnRlZC5ybWR1cC5iYW0KYGBgCkxldCdzIGNyZWF0ZSBhbiBvdXRwdXQgZGlyZWN0b3J5IGZvciBzdG9yaW5nIE1BQ1MgcmVzdWx0cwpgYGB7YmFzaH0KbWtkaXIgLXAgLi4vcmVzdWx0cy9tYWNzMgpgYGAKTm93IHdlIGFyZSByZWFkeSB0byBjYWxsIE1BQ1MKYGBge2Jhc2h9Cm1hY3MyIGNhbGxwZWFrIC10IC4uL3Jlc3VsdHMvSElGMWFfUmVwMS5zb3J0ZWQucm1kdXAuYmFtIC1jIC4uL2RhdGEvSW5wdXRfUmVwMS9JbnB1dF9SZXAxLnNvcnRlZC5ybWR1cC5iYW0gLWYgQkFNUEUgLWcgaHMgLW4gSElGMWFfcmVwMSAtcSAwLjA1IC0tb3V0ZGlyIC4uL3Jlc3VsdHMvbWFjczIKYGBgCiogLXQ6IHRyZWF0bWVudCBmaWxlLiAKKiAtYzogY29udHJvbCBmaWxlCiogLWY6IEZvcm1hdCBvZiB0aGUgdGFnIGZpbGUuIFNpbmNlIG91ciBkYXRhIGlzIHBhaXJlZC1lbmQsIHdlIGFyZSB1c2luZyBCQU1QRS4gTm90ZSB0aGF0IGJ5IHNwZWNpZnlpbmcgdGhhdCB0aGUgZGF0YSBpcyBwYWlyZWQtZW5kLCBNQUNTIHdpbGwgdXNlIHRoZSBhY3R1YWwgZnJhZ21lbnQgc2l6ZSBmcm9tIHRoZSByZWFkcyBhbmQgd2lsbCBub3QgZXN0aW1hdGUgaXQuIAoqIC1nOiBFZmZlY3RpdmUgZ2Vub21lIHNpemUuIEl0IGhhcyBwcmVzZXRzIGZvciB2YXJpb3VzIGdlbm9tZXMuCiogLW46IEV4cGVyaW1lbnQgbmFtZSwgdXNlZCBmb3IgdGhlIG91dHB1dCBmaWxlIG5hbWVzCgpMZXQncyB0YWtlIGEgbG9vayBhdCB0aGUgTUFDUyBvdXRwdXQ6CmBgYHtiYXNofQpscyAuLi9yZXN1bHRzL21hY3MyCmBgYApUaGUgbWFpbiByZXN1bHQgb2YgbWFjcyB3aWxsIGJlIGEgbGlzdCBvZiBwZWFrcyBjYWxsZWQgYXMgc3RhdGlzdGljYWxseSBzaWduaWZpY2FudC4gVGhlIC54bHMgZmlsZSBjYW4gYmUgb3BlbmVkIGFuZCBleHBsb3JlZCB3aXRoIGV4Y2VsLiBUaGUgLm5hcnJvd1BlYWsgaXMgYSBCRUQ2KzQgZm9ybWF0IGZpbGUgdGhhdCBjb250YWlucyB0aGUgcGVhayBsb2NhdGlvbnMgdG9nZXRoZXIgd2l0aCBwZWFrIHN1bW1pdCwgcC12YWx1ZSwgYW5kIHEtdmFsdWUuIFRoZSBfc3VtbWl0cy5iZWQgZmlsZSBpcyBpbiBCRUQgZm9ybWF0LCB3aGljaCBjb250YWlucyB0aGUgcGVhayBzdW1taXRzIGxvY2F0aW9ucyBmb3IgZXZlcnkgcGVhay4gCmBgYHtiYXNofQpoZWFkIC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9wZWFrcy5uYXJyb3dQZWFrCmBgYApXZSBjYW4gY3JlYXRlIGEgQkVEIGZpbGUgYnkgb21taXRpbmcgc29tZSBjb2x1bW5zKHRoaXMgd2lsbCBiZSB1c2VmdWwgZm9yIHRoZSBuZXh0IHN0ZXApLgpgYGB7YmFzaH0KY3V0IC1mIDEtNiAuLi9yZXN1bHRzL21hY3MyL0hJRjFhX3JlcDFfcGVha3MubmFycm93UGVhayA+IC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9wZWFrcy5iZWQKYGBgCkNvbHVtbiA5IHJlcHJlc2VudHMgdGhlIC1sb2cxMHF2YWx1ZSBhdCBwZWFrIHN1bW1pdC4KV2UgY2FuIGFsc28gdGFrZSBhIGxvb2sgYXQgdGhlIGNvb3JkaW5hdGVzIG9mIHRoZSBzdW1taXRzIG9mIHRoZSBwZWFrcyAoIHJlZ2lvbnMgb2YgaGlnaGVzdCBzaWduYWwpCmBgYHtiYXNofQpoZWFkIC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9zdW1taXRzLmJlZApgYGAKV2UgY2FuIGFsc28gY2hlY2sgdGhlIG51bWJlciBvZiBzaWduaWZpY2FudCBwZWFrcyBjYWxsZWQKYGBge2Jhc2h9CndjIC1sIC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9wZWFrcy5uYXJyb3dQZWFrCmBgYAojIDAyIC0gUGVhayBvdmVybGFwIGJldHdlZW4gcmVwbGljYXRlcwoKT25lIG9mIHRoZSBtb3N0IGNvbW1vbiBwcm9ibGVtcyBpbiBkZWFsaW5nIHdpdGggcGVha3MgaW4gQ2hJUC1zZXEgYW5hbHlzaXMgaXMgdG8gY2hlY2sgaWYgdHdvIHNldHMgb2YgcGVha3Mgb3ZlcmxhcC4gVG8gd29yayB0aHJvdWdoIHRoYXQsIHdlIGFyZSBnb2luZyB0byB1c2UgYmVkdG9vbHMgaW50ZXJzZWN0LiBZb3UgY2FuIGZpbmQgaW5mb3JtYXRpb24gYWJvdXQgdGhlIHRvb2wgaW4gW2hlcmVdKGh0dHBzOi8vYmVkdG9vbHMucmVhZHRoZWRvY3MuaW8vZW4vbGF0ZXN0L2NvbnRlbnQvdG9vbHMvaW50ZXJzZWN0Lmh0bWwpLiAKCiFbXShpbWFnZXMvYmVkdG9vbHNfaW50ZXJzZWN0LnBuZykKCkZvciB0aGlzIGV4ZXJjaXNlLCB3ZSBhcmUgZ29pbmcgdGhlIG92ZXJsYXAgYmV0d2VlbiBwZWFrcyBjYWxsZWQgYnkgTUFDUzIgZm9yIHR3byByZXBsaWNhdGVzLCBvbmUgb2YgdGhlbSBpcyB0aGUgb25lIHdlIGhhdmUgcHJvY2Vzc2VkIHNvIGZhciwgYW5kIHRoZSBvdGhlciBvbmUgY2FuIGJlIGZvdW5kIGluIHRoZSBkYXRhIGZvbGRlcgpgYGB7YmFzaH0KbHMgLi4vZGF0YS9ISUYxYV9SZXAyL21hY3MyLwpgYGAKYGBge2Jhc2h9CmhlYWQgLi4vZGF0YS9ISUYxYV9SZXAyL21hY3MyL0hJRjFhX3JlcDJfcGVha3MubmFycm93UGVhawpgYGAKTGV0J3MgY2hlY2sgdGhlIGludGVyc2VjdCBiZXR3ZWVuIHJlcGxpY2F0ZSAxIGFuZCByZXBsaWNhdGUgMiB3aXRoIHRoZSBkZWZhdWx0IGJlaGF2aW91cgpgYGB7YmFzaH0KYmVkdG9vbHMgaW50ZXJzZWN0IC1hIC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9wZWFrcy5uYXJyb3dQZWFrIC1iIC4uL2RhdGEvSElGMWFfUmVwMi9tYWNzMi9ISUYxYV9yZXAyX3BlYWtzLm5hcnJvd1BlYWsKYGBgCkJ5IGRlZmF1bHQsIGlmIGFuIG92ZXJsYXAgaXMgZm91bmQsIGJlZHRvb2xzIGludGVyc2VjdCByZXBvcnRzIHRoZSBzaGFyZWQgaW50ZXJ2YWwgYmV0d2VlbiB0aGUgdHdvIG92ZXJsYXBwaW5nIGZlYXR1cmVzLgpXZSBjYW4gYWxzbyByZXR1cm4gbm90IG9ubHkgdGhlIHNoYXJlZCBpbnRlcnZhbCwgYnV0IHRoZSBvcmlnaW5hbCBlbnRyaWVzIG9mIEEgZm9yIHRoZSBvdmVybGFwOgpgYGB7YmFzaH0KYmVkdG9vbHMgaW50ZXJzZWN0IC1hIC4uL3Jlc3VsdHMvbWFjczIvSElGMWFfcmVwMV9wZWFrcy5uYXJyb3dQZWFrIC1iIC4uL2RhdGEvSElGMWFfUmVwMi9tYWNzMi9ISUYxYV9yZXAyX3BlYWtzLm5hcnJvd1BlYWsgLXdhCmBgYAojIDAzIC0gRXhlcmNpc2VzCiAKKiAxLiBQZXJmb3JtIHRoZSBwZWFrIGNhbGxpbmcgc3RlcHMgZm9yIHRoZSBISUYyYSBkYXRhLiBJZiB5b3UgZGlkbid0IGNvbXBsZXRlIHRoZSBhbGlnbm1lbnQgc3RlcCBmb3IgSElGMmEsIGEgcHJvY2Vzc2VkIC5iYW0gZmlsZSBjYW4gYmUgZm91bmQgaW4gaGVyZTogCmBgYHtiYXNofQpscyAuLi9kYXRhL0hJRjJhX1JlcDEvSElGMmFfUmVwMS5zb3J0ZWQucm1kdXAuYmFtCmBgYApZb3UgY2FuIHVzZSB0aGUgc2FtZSBpbnB1dCBjb250cm9sIGFzIGZvciBISUYxYS4KCiogMi4gSG93IHdvdWxkIHlvdSBydW4gTUFDUyB3aXRob3V0IGEgaW5wdXQgQ2hJUC1zZXEgY29udHJvbD8gVGVzdCB0aGlzIHVzaW5nIHRoZSBISUYxYSBkYXRhLgoKKiAzLiBCeSBkZWZhdWx0LCBiZWR0b29scyBpbnRlcnNlY3Qgd2lsbCByZXBvcnQgYW4gb3ZlcmxhcCBiZXR3ZWVuIEEgYW5kIEIgc28gbG9uZyBhcyB0aGVyZSBpcyBhdCBsZWFzdCBvbmUgYmFzZSBwYWlyIGlzIG92ZXJsYXBwaW5nLiBXaGF0IGFyZ3VtZW50IHdvdWxkIHlvdSBjaGFuZ2UgdG8gb3ZlcmxhcCBieSBhdCBsZWFzdCA1MCBwZXJjZW50PyBUZXN0IHRoaXMgdXNpbmcgcmVwbGljYXRlIDEgYW5kIHJlcGxpY2F0ZSAyIG9mIHRoZSBISUYxYSBkYXRhLiAKCiogNC4gQ29tcGFyZSB0aGUgb3ZlcmxhcCBiZXR3ZWVuIHRoZSBISUYxYSByZXBsaWNhdGVzIGFuZCBiZXR3ZWVuIEhJRjFhIGFuZCBISUYyYS4KCmBgYHtyfQpzZXNzaW9uSW5mbygpCmBgYAoK

MiCM: 02 - Peak Calling

Ariel Madrigal Aguirre

2023-10-25

01 - Peak Calling with MACS

02 - Peak overlap between replicates

03 - Exercises