Variant calls with GATK
I had to replace a CLC Bio variant calling pipeline with an open-source equivalent. These notes are mainly an update of another blog post Variant calling with GATK.
Tools Setup
-
Build and install (to ~/.local) bwa
-
Build and install samtools
-
Download and extract GATK under non-comercial license
-
Download and extract Picard Tools
-
Picard requires Java 1.8 so update to OpenJDK-1.8.72
-
Set environment variable to reference Picard
:::shell export PICARD=~/Downloads/picard-tools-2.1.0/picard.jar
Pipeline
-
Index the reference
:::shell bwa index PAO1_reference.fasta
-
Sort the reference
:::shell samtools faidx PAO1_reference.fasta
-
Create sequence dictionary
:::shell java -jar $PICARD CreateSequenceDictionary \ REFERENCE=PAO1_reference.fasta \ OUTPUT=PAO1_reference.dict
-
Align reads using bwa
:::shell bwa mem PAO1_reference.fasta \ genomes/AZPAE12150_S4_L001_R1_001.fastq.gz \ genomes/AZPAE12150_S4_L001_R2_001.fastq.gz > \ pipeline_products/AZPAE12150_S4_L001.aln.sam
-
Sort SAM file, add fake ReadGroup IDs (otherwise will fail at GATK RealignerTargetCreator below) and convert to bam
:::shell java -jar $PICARD AddOrReplaceReadGroups \ I=pipeline_products/AZPAE12150_S4_L001.aln.sam \ O=pipeline_products/AZPAE12150_S4_L001.sorted.bam \ RGLB=kos1 \ RGPL=illumina \ RGPU=unit1 \ RGSM=AZPAE12150 \ SORT_ORDER=coordinate
-
Mark duplicates
:::shell java -jar $PICARD MarkDuplicates \ I=pipeline_products/AZPAE12150_S4_L001.sorted.bam\ O=pipeline_products/AZPAE12150_S4_L001.dedup.bam \ METRICS_FILE=pipeline_products/metrics.txt
-
Build bam index
:::shell java -jar $PICARD BuildBamIndex INPUT=pipeline_products/AZPAE12150_S4_L001.dedup.bam
-
Create realignment targets
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.dedup.bam \ -o AZPAE12150_S4_L001.targetintervals.list
-
Indel realignment
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T IndelRealigner \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.dedup.bam \ -targetIntervals AZPAE12150_S4_L001.targetintervals.list \ -o AZPAE12150_S4_L001.realigned.bam
-
Call variants (takes ~10 minutes to run on a pair of 150M fastq.gz file )
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.realigned.bam \ -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 \ -o AZPAE12150_S4_L001.raw.vcf
-
Filter out SNPs (optional)
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T SelectVariants \ -R ../PAO1_reference.fasta -V AZPAE12150_S4_L001.raw.vcf -selectType SNP \ -o AZPAE12150_S4_L001.snps.vcf
-
Convert to a tab separated file for easy dataframe loading
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T VariantsToTable \ -R ../PAO1_reference.fasta -V AZPAE12150_S4_L001.snps.vcf \ -F POS -F REF -F ALT -F QUAL -o AZPAE12150_S4_L001.snps.tsv
Wrapping in a script
I also put together a Julia script that automates the whole pipeline.
Performance
I don’t have the experience to grade the quality of the SNP calling. Anecdotally I seemed to run into a reported issue where it seemed to miss SNPs at the beginning of a gene with lots of repeats.
In comparison to the CLC pipeline, some SNP set union/differences showed there
were about 27k they both called; CLC called a few hundred that GATK did not and
GATK called a couple thousand that CLC did not. Anecdotally, the extra calls
from GATK seemed to come from low depth of coverage region however some
primitive attempts to filter those using SelectVariants
did not seem to
improve the overlap.