class: center, middle, inverse, title-slide # Epigenomics, with focus on the methylome ### Mehran Karimzadeh and Martina Steiner ### June 20th, 2023 --- <style type="text/css"> .small { font-size: 65%; } code.r{ font-size=8px; } code.python{ font-size=3px; } code.bash{ font-size=8px; } .medium { font-size: 80%; } .smallcode { font-size:50% } .smallcode .remark-code { font-size: 50% } </style> This tutorial was created by Dr. Mehran Karimzadeh for MMG3003Y, and adapted by Dr. Martina Steiner. You will work through a series of quizzes on Quercus, and go through a pipeline of software on the teach cluster. This will prepare you for this week's assignment. Work through the following Quizzes now: 1. EpigeneticsII_WGBS-context 2. EpigeneticsII_methods 3. EpigeneticsII_bisulfite-conversion 4. EpigeneticsII_computational-steps --- ## Background 1: bash * Log into the teach cluster as described in the RNAseq tutorial * Avoid computation on the login node #describe why * Request an interactive session ```bash salloc samtools ... ``` -- * Write a bash script and submit using _sbatch_ ```bash echo -e '#!/bin/sh' > myBashScriptThatIWillSubmitToRunOnSlurm.sh echo "samtools ..." >> myBashScriptThatIWillSubmitToRunOnSlurm.sh sbatch -c 1 --mem=4G -t 1:00:00 myBashScriptThatIWillSubmitToRunOnSlurm.sh ``` --- ## For loops in bash ```bash for i in {0..99} do echo "Variable i is set to $i" done ``` -- * Work through the following Quiz on Quercus now 4. EpigeneticsII_bash 5. EpigeneticsII_pipeline --- ## Background 2: Bisulfite sequencing pipeline * Pipeline outline -- + Download the reference genome -- + Index the reference genome -- + Copy fastq files from a shared folder -- + Trimming fastq files with trim galore -- + Aligning with bwa-meth (copy the BAM files from a shared folder) -- + Extracting CpG methylation estimates -- + Visualizing methylation at CpG islands (supplementary) -- + Identifying differentially methylated regions -- + Investigating genes affected by differential methylation --- * Tutorial steps: + Tutorial step 1: Data access, trimming and quality control + Tutorial step 2: Alignment + Tutorial step 3: Extracting estimates + Tutorial step 4: Identify differentially methylated regions <img src="mw0.jpg" width="800px" /> --- ## Quality control * Each step of the process requires a quality control -- * "FASTQC" to assess quality of FASTQ files before and after removing the adapters -- * "samtools flagstat" to assess the percent of uniquely aligned reads -- * Enrichment of differentially methylated regions in promoters, CpG islands, etc. --- ## Troubleshooting guide Throughout this tutorial, you may see that you do not see the same output as the instructor. Feel free to ask for help. Some common reasons include: * You are not logged into the teaching cluster. * You can type `echo $HOSTNAME` to see if it returns `teach01.scinet.local` or not * Your session got disconnected and the environmental variables that you defined in the earlier steps are not initialized. * You can check if a variable is initialized by typing `echo $VARIABLENAME`. In the case of arrays, to see their elements, you can type `echo ${ARRAYNAME[@]}`. * If these are not initialized, nothing will be printed and that means you need to go back to the first occurence of the variable or array and re-run the command. --- ## Tutorial step 1: Data access — Indexing BWA-meth * Download FASTA file of chr22 ```bash mkdir -p $SCRATCH/Ref cd $SCRATCH/Ref wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz zcat Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz | sed 's/>22/>chr22/' > Homo_sapiens.GRCh38.dna.chromosome.22.fa ``` * Index FASTA file ```bash salloc module load anaconda3 gcc java fastqc cutadapt trimgalore bwa samtools bwameth htslib methyldackel bwameth.py index Homo_sapiens.GRCh38.dna.chromosome.22.fa ``` * Expected run time is 3 minutes <img src="mw1.jpg" width="800px" /> --- ## Tutorial step 1: Data access — Download CpG islands * Visit http://genome.ucsc.edu * Select the 3rd tool, _Table Browser_ * Select the correct genome assembly, and group _Regulation_ * Select the _CpG Islands_ track * Under position, type chr22 * Select the output format as _BED - browser extensible data_ * Save the file as: _hg38CpgIslandsForChr22.bed_ * Use scp to transfer the file to $SCRATCH/Datasets ```bash # From the local computer scp ~/Downloads/hg38CpgIslandsForChr22.bed username@teach.scinet.utoronto.ca:/scratch/t/teachmmg3003/mmg3003ta002/Datasets ``` <img src="mw1.jpg" width="800px" /> --- ## Tutorial step 1: Data access — Downloading necessary files * The following directory contains all of the pipeline for chromosome 22: ```bash /scratch/m/mhoffman/karimzad ``` * You can copy the fastq files from that directory to your scratch folder: ```bash cp -rf /scratch/m/mhoffman/karimzad/newFastqFilesChr22 $SCRATCH ``` * This folder contains all of the files from the pipeline we process. You can copy them the same way if you have issues running the commands. <img src="mw1.jpg" width="800px" /> --- ## What are these samples * H1-hESC is a human embryonic stem cell line which has been profiled extensively by the ENCODE consortium * The left ventricle embryonic tissue is obtained from a human embryo * By comparing these two tissues, we _may_ identify which regions of chr22 must be (de)methylated for differentiating the embryonic stem cell towards a heart muscle progeny. --- ## Tutorial step 1: Trimming — Trim the FASTQ files * Write a script to trim paired-end FASTQ files with trim galore in a new folder called trimmedFastqs ```bash LOGDIR=$SCRATCH/Logs # Creates a new variable SCRIPTDIR=$SCRATCH/Scripts # For scripts that run different programs mkdir -p $SCRIPTDIR $LOGDIR # Creating multiple folders simultaneously cd $SCRIPTDIR # Change diretory FASTQDIR=$SCRATCH/newFastqFilesChr22 # PAth to our untrimmed fastq files FQFOLDERS=($(ls $FASTQDIR)) # Arrays OUTMAIN=$SCRATCH/trimmedFastqsChr22 for FQFOLDER in ${FQFOLDERS[@]} do FQ1=$FASTQDIR/$FQFOLDER/$FQFOLDER\__1.fastq.gz FQ2=$FASTQDIR/$FQFOLDER/$FQFOLDER\__2.fastq.gz OUTDIR=$OUTMAIN/$FQFOLDER mkdir -p $OUTDIR echo -e '#!/bin/sh' > $SCRATCH/Scripts/$FQFOLDER\_TrimGalore.sh echo "module load anaconda3 gcc java fastqc cutadapt trimgalore bwa samtools bwameth htslib methyldackel" >> $SCRATCH/Scripts/$FQFOLDER\_TrimGalore.sh echo "trim_galore --fastqc --paired --gzip -o $OUTDIR $FQ1 $FQ2" >> $SCRATCH/Scripts/$FQFOLDER\_TrimGalore.sh sbatch -c 1 -t 1:00:00 -e $LOGDIR/$FQFOLDER\_TrimGalore.%A.ERR -o $LOGDIR/$FQFOLDER\_TrimGalore.%A.LOG $SCRIPTDIR/$FQFOLDER\_TrimGalore.sh done ``` <img src="mw1.jpg" width="800px" /> --- ## Tutorial step 1: Quality control — FASTQC reports * Use _scp_ to download fastqc files to your local computer ```bash STUDENTID=05 scp -r mmg3003student$STUDENTID@teach.scinet.utoronto.ca:/scratch/t/teachmmg3003/mmg3003student$STUDENTID/trimmedFastqsChr22/*/*fastqc* ~/Downloads ``` <img src="mw1.jpg" width="800px" /> --- ## Tutorial step 2: Alignment — Align with BWA-Meth * Write a script to generate <filename>_Align.sh scripts for aligning fastq files and submit them to cluster with sbatch ```bash LOGDIR=$SCRATCH/Logs; SCRIPTDIR=$SCRATCH/Scripts; cd $SCRIPTDIR REF=$SCRATCH/Ref/Homo_sapiens.GRCh38.dna.chromosome.22.fa FASTQDIR=$SCRATCH/trimmedFastqsChr22 BAMDIR=$SCRATCH/trimmedAlignedBamsChr22 mkdir -p $BAMDIR mkdir -p $SCRIPTDIR SAMPLES=($(ls $FASTQDIR)) for SAMPLE in ${SAMPLES[@]} do FQ1=$(ls $FASTQDIR/$SAMPLE | grep val_1.fq.gz) FQ2=$(ls $FASTQDIR/$SAMPLE | grep val_2.fq.gz) echo -e '#!/bin/sh' > $SCRIPTDIR/$SAMPLE\_Align.sh echo "module load anaconda3 gcc java fastqc cutadapt trimgalore bwa samtools bwameth htslib methyldackel" >> $SCRIPTDIR/$SAMPLE\_Align.sh echo "bwameth.py --reference $REF $FASTQDIR/$SAMPLE/$FQ1 $FASTQDIR/$SAMPLE/$FQ2 | samtools view -bS -F 4 > $BAMDIR/$SAMPLE.bam" >> $SCRIPTDIR/$SAMPLE\_Align.sh # sbatch -c 1 -t 4:00:00 -e $LOGDIR/$SAMPLE\_Align.%A.ERR -o $LOGDIR/$SAMPLE\_Align.%A.LOG $SCRIPTDIR/$SAMPLE\_Align.sh done ``` <img src="mw2.jpg" width="800px" /> --- ## Tutorial step 2: Alignment — Copy the aligned bam files * It takes 4 hours of CPU time to align the FASTQ files to chr22. * Assignment: Similar to FASTQ files, copy the folder containing bam files of chr22 to your _$SCRATCH_ NOW! <img src="mw2.jpg" width="800px" /> --- ## ## Tutorial step 2: Alignment — Sort and index bam files * MethylDackel requires sorted and indexed bam files * Write a script to sort and index each bam file ```bash LOGDIR=$SCRATCH/Logs; SCRIPTDIR=$SCRATCH/Scripts; cd $SCRIPTDIR BAMDIR=$SCRATCH/trimmedAlignedBamsChr22 BAMFILES=($(ls $BAMDIR | grep .bam | grep -v bam.bai | grep -v sorted)) for BAMFILE in ${BAMFILES[@]} do SAMPLENAME=$(echo $BAMFILE | sed 's/.bam//') echo -e '#!/bin/sh' > $SCRATCH/Scripts/$SAMPLENAME\_sortAndIndex.sh echo "module load anaconda3 gcc java fastqc cutadapt trimgalore bwa samtools bwameth htslib methyldackel" >> $SCRATCH/Scripts/$SAMPLENAME\_sortAndIndex.sh echo "samtools sort $BAMDIR/$BAMFILE -o $BAMDIR/$SAMPLENAME\_sorted.bam" >> $SCRATCH/Scripts/$SAMPLENAME\_sortAndIndex.sh echo "samtools index $BAMDIR/$SAMPLENAME\_sorted.bam" >> $SCRATCH/Scripts/$SAMPLENAME\_sortAndIndex.sh sbatch -c 1 -t 1:00:00 -e $LOGDIR/sortIndex.%A.ERR -o $LOGDIR/sortIndex.%A.LOG $SCRATCH/Scripts/$SAMPLENAME\_sortAndIndex.sh done ``` <img src="mw2.jpg" width="800px" /> --- ## Tutorial step 3: Extracting estimates — Run MethylDackel * MethylDackel uses BAM files to extract cytosine methylation counts ```bash LOGDIR=$SCRATCH/Logs; SCRIPTDIR=$SCRATCH/Scripts; cd $SCRIPTDIR BAMDIR=$SCRATCH/trimmedAlignedBamsChr22; OUTMAIN=$SCRATCH/methylDackelOutputChr22 BAMFILES=($(ls $BAMDIR | grep sorted | grep -v bai | grep bam)) REF=$SCRATCH/Ref/Homo_sapiens.GRCh38.dna.chromosome.22.fa for BAMFILE in ${BAMFILES[@]} do SAMPLENAME=$(echo $BAMFILE | sed 's/_sorted.bam//') OUTDIR=$OUTMAIN/$SAMPLENAME mkdir -p $OUTDIR echo -e '#!/bin/sh' > $SCRIPTDIR/MethylDackel_$SAMPLENAME.sh echo "module load anaconda3 gcc java fastqc cutadapt trimgalore bwa samtools bwameth htslib methyldackel" >> $SCRIPTDIR/MethylDackel_$SAMPLENAME.sh echo "MethylDackel extract --fraction --mergeContext $REF $BAMDIR/$BAMFILE -o $OUTDIR/$SAMPLENAME\_" >> $SCRIPTDIR/MethylDackel_$SAMPLENAME.sh sbatch -c 1 -t 1:00:00 -e $LOGDIR/Meth.%A.ERR -o $LOGDIR/Meth.%A.LOG $SCRIPTDIR/MethylDackel_$SAMPLENAME.sh done ``` <img src="mw3.jpg" width="800px" /> --- ## Tutorial step 3: Extracting estimates — Explore the output of MethylDackel * What does each column of MethylDackel output represent? ```r track type="bedGraph" description="/scratch/t/teachmmg3003/mmg3003ta002/methylDackelOutput/H1-hESC_rep1/H1-hESC_rep1_ CpG methylation levels" chr22 10513864 10513865 0 chr22 10513906 10513907 1 chr22 10515169 10515170 0 ``` <img src="mw3.jpg" width="800px" /> --- ## Tutorial step 3: Extracting estimates — bedGraph is not efficient * bedGraph is a user-readable file format * Storing genomic signal in bedGraph format takes too much space and is computationally inefficient for random data retrieval * bigWig format, however, can store and retrieve genomic signals efficiently * Here we will download a program called bedGraphToBigWig and use it to convert bedGraph files ```bash mkdir -p ~/software/bin cd ~/software/bin wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig # Give yourself permission to run this program chmod u+x bedGraphToBigWig ``` <img src="mw3.jpg" width="800px" /> --- ## Tutorial step 3: Extracting estimates — Finding size of chromosomes * bedGraphToBigWig requires a file with information of how long each chromosome is ```bash cd ~/software/bin wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/fetchChromSizes chmod u+x fetchChromSizes ./fetchChromSizes hg38 > $SCRATCH/Ref/hg38.chromsizes ``` * How else can we extract chromosome sizes from a fasta file? <img src="mw3.jpg" width="800px" /> --- ## Tutorial step 3: Extracting estimates — Convert bedGraph to bigWig * Write a script to convert output of MethylDackel from bedGraph to bigWig ```bash salloc MAINDIR=$SCRATCH/methylDackelOutputChr22 SAMPLES=($(ls $MAINDIR)) for SAMPLE in ${SAMPLES[@]} do BDG=$(ls $MAINDIR/$SAMPLE | grep bedGraph) BW=$(echo $BDG | sed 's/bedGraph/bigWig/') ~/software/bin/bedGraphToBigWig $MAINDIR/$SAMPLE/$BDG $SCRATCH/Ref/hg38.chromsizes $MAINDIR/$SAMPLE/$BW done ``` <img src="mw3.jpg" width="800px" /> --- ## Tutorial step 4: Identify DMRs — Identify differentially methylated regions * There are various software for identifying differentially methylated regions * Here we will use [https://dx.doi.org/10.1101%2Fgr.196394.115](Metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data) * Metilene requires a union file of bedGraphs we generated earlier with MethylDackel with the following columns: ```r Chrom Start End G1_1 G1_2 G2_1 G2_2 ``` --- * We can generate the input file this way: ```bash MAINDIR=$SCRATCH/methylDackelOutputChr22 SAMPLES=($(ls $MAINDIR)) BGS=() HEADER=(chr start end) for SAMPLE in ${SAMPLES[@]} do HEADER+=($SAMPLE) BG=$(ls $MAINDIR/$SAMPLE | grep bedGraph) BGS+=($MAINDIR/$SAMPLE/$BG) done module load gcc/7.3.0 bedtools echo -e ${HEADER[@]} | tr " " "\t" > $SCRATCH/methylDackelOutputChr22/mergedOutputs_unionbedg.bed bedtools unionbedg -i ${BGS[@]} >> $SCRATCH/methylDackelOutputChr22/mergedOutputs_unionbedg.bed ``` <img src="mw4.jpg" width="800px" /> --- ## Tutorial step 4: Identify DMRs — Metilene ```bash module load metilene OUTDIR=$SCRATCH/metileneOutputChr22 mkdir -p $OUTDIR echo -e "Chrom\tStart\tEnd\tqVal\tmeanDiff\tnumCpgs\tpMWU\tp2DKS\tmeanG1\tmeanG2" > $OUTDIR/MetileneDMR.bed metilene -a "H1-hESC" -b "leftVentricle" $SCRATCH/methylDackelOutputChr22/mergedOutputs_unionbedg.bed >> $OUTDIR/MetileneDMR.bed ``` <img src="mw4.jpg" width="800px" /> --- ## Tutorial step 5: Pathway analysis How can you go from a list of genomic regions to pathways? -- + Find which genes overlap your DMRs -- + Find which biological pathways are regulated by those genes -- + Compute statistical significance for such an overlap -- An automated way to do these steps would be GREAT! http://great.stanford.edu/public/html/ <img src="mw5.jpg" width="800px" /> --- # Supplementary slides --- ## How can we explore hundreds of genomic regions for specific features, enrichments, etc.? * DeepTools has a program called _computeMatrix_ * computeMatrix accepts signal files (e.g. in bigWig) and genomic region annotations (e.g. in BED or GTF) to calculate summary statistics * computeMatrix has two modules: + reference-point: Obtains measures for entries of BED file (as reference) as well as their upstream and downstream + scale-regions: Calculates summary measures for BED files by shrinking each entry to a user-defined length --- <img src="computeMatrix1.png" width="800px" /> <img src="mw3.jpg" width="800px" /> --- ## How is methylation signal around CpG islands? * Write a script to execute computeMatrix reference-point on CpG island BED file and the four bigWig files ```bash MAINDIR=$SCRATCH/methylDackelOutputChr22 SAMPLES=($(ls $MAINDIR)) BWS=() for SAMPLE in ${SAMPLES[@]} do BW=$(ls $MAINDIR/$SAMPLE | grep bigWig) BWS+=($MAINDIR/$SAMPLE/$BW) done module load anaconda2/5.1.0 deeptools/3.2.1-anaconda2 OUTDIR=$SCRATCH/methylationMatricesChr22 mkdir -p $OUTDIR computeMatrix reference-point -R $SCRATCH/Datasets/hg38CpgIslandsForChr22.bed -S ${BWS[@]} -o $OUTDIR/mergedMethylationAroundIslands.tsv.gz --referencePoint TES --skipZeros --sortRegions descend --sortUsingSamples 1 plotProfile -m $OUTDIR/mergedMethylationAroundIslands.tsv.gz -out $OUTDIR/mergedMethylationAroundIslands_1.pdf plotProfile -m $OUTDIR/mergedMethylationAroundIslands.tsv.gz --perGroup --plotType heatmap -out $OUTDIR/mergedMethylationAroundIslands_2.pdf ``` <img src="mw3.jpg" width="800px" /> ---