Variant calls with GATK
I had to replace a CLC Bio variant calling pipeline with an open-source equivalent. These notes are mainly an update of another blog post Variant calling with GATK.
Tools Setup
Build and install (to ~/.local) bwa
Build and install samtools
Download and extract GATK under non-comercial license
Download and extract Picard Tools
Picard requires Java 1.8 so update to OpenJDK-1.8.72
Set environment variable to reference Picard
:::shell export PICARD=~/Downloads/picard-tools-2.1.0/picard.jar
Pipeline
Index the reference
:::shell bwa index PAO1_reference.fasta
Sort the reference
:::shell samtools faidx PAO1_reference.fasta
Create sequence dictionary
:::shell java -jar $PICARD CreateSequenceDictionary \ REFERENCE=PAO1_reference.fasta \ OUTPUT=PAO1_reference.dict
Align reads using bwa
:::shell bwa mem PAO1_reference.fasta \ genomes/AZPAE12150_S4_L001_R1_001.fastq.gz \ genomes/AZPAE12150_S4_L001_R2_001.fastq.gz > \ pipeline_products/AZPAE12150_S4_L001.aln.sam
Sort SAM file, add fake ReadGroup IDs (otherwise will fail at GATK RealignerTargetCreator below) and convert to bam
:::shell java -jar $PICARD AddOrReplaceReadGroups \ I=pipeline_products/AZPAE12150_S4_L001.aln.sam \ O=pipeline_products/AZPAE12150_S4_L001.sorted.bam \ RGLB=kos1 \ RGPL=illumina \ RGPU=unit1 \ RGSM=AZPAE12150 \ SORT_ORDER=coordinate
Mark duplicates
:::shell java -jar $PICARD MarkDuplicates \ I=pipeline_products/AZPAE12150_S4_L001.sorted.bam\ O=pipeline_products/AZPAE12150_S4_L001.dedup.bam \ METRICS_FILE=pipeline_products/metrics.txt
Build bam index
:::shell java -jar $PICARD BuildBamIndex INPUT=pipeline_products/AZPAE12150_S4_L001.dedup.bam
Create realignment targets
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.dedup.bam \ -o AZPAE12150_S4_L001.targetintervals.list
Indel realignment
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T IndelRealigner \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.dedup.bam \ -targetIntervals AZPAE12150_S4_L001.targetintervals.list \ -o AZPAE12150_S4_L001.realigned.bam
Call variants (takes ~10 minutes to run on a pair of 150M fastq.gz file )
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \ -R ../PAO1_reference.fasta -I AZPAE12150_S4_L001.realigned.bam \ -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 \ -o AZPAE12150_S4_L001.raw.vcf
Filter out SNPs (optional)
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T SelectVariants \ -R ../PAO1_reference.fasta -V AZPAE12150_S4_L001.raw.vcf -selectType SNP \ -o AZPAE12150_S4_L001.snps.vcf
Convert to a tab separated file for easy dataframe loading
:::shell java -jar ~/Downloads/GATK/GenomeAnalysisTK.jar -T VariantsToTable \ -R ../PAO1_reference.fasta -V AZPAE12150_S4_L001.snps.vcf \ -F POS -F REF -F ALT -F QUAL -o AZPAE12150_S4_L001.snps.tsv
Wrapping in a script
I also put together a Julia script that automates the whole pipeline.
Performance
I don’t have the experience to grade the quality of the SNP calling. Anecdotally I seemed to run into a reported issue where it seemed to miss SNPs at the beginning of a gene with lots of repeats.
In comparison to the CLC pipeline, some SNP set union/differences showed there
were about 27k they both called; CLC called a few hundred that GATK did not and
GATK called a couple thousand that CLC did not. Anecdotally, the extra calls
from GATK seemed to come from low depth of coverage region however some
primitive attempts to filter those using SelectVariants
did not seem to
improve the overlap.