Variant calling is the process of identifying differences between two genome samples. Usually differences are limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). Larger structural variation such as inversions, duplications and large deletions are not typically covered by “variant calling”.
Imagine that you have been asked to find the differences between a sample that has been sequenced and a known genome. For example: You have a new sample from a patient and you want to see if it has any differences from a well known reference genome of the same species. Typically, you would have a couple of fastq read files sent back to you from the sequencing provider and either an annotated or non annotated reference genome.
In this tutorial, we will use the tool “Snippy” (see author development repository) to find high confidence differences (indels or SNPs) between our known genome and our reads. Snippy uses one tool to align the reads to the reference genome, and another tool to decide (“call”) if any of the resulting discrepancies are real variants or technical artifacts that can be ignored. Finally, Snippy uses another tool to check what effect these differences have on the predicted genes - truncation, frame shift or if the changes are synonymous.
For the read alignment (read mapping) step, Snippy uses BWA MEM with a custom set of settings which are very suitable to aligning reads for microbial type data. For the variant calling step, Snippy uses Freebayes with a custom set of settings. snpeff is then used to describe what the predicted changes do in terms of the genes themselves.
The Galaxy wrapper for Snippy has the ability to change some of the underlying tool settings in the advanced section but it is not recommended.
The data for today is a subset of a real dataset from a Staphylococcus aureus bacteria.
We have a closed genome sequence and an annotation for our “wildtype” strain.
We have used a whole genome shotgun approach to produce a set of short sequence reads on an Illumina DNA sequencing instrument for our mutant strain.
The reads are paired-end
Each read is on average 150 bases
The reads would cover the original wildtype genome to a depth of 19x
The files we will be using are:
mutant_R1.fastq & mutant_R2.fastq - the read files in fastq format.
wildtype.fna - The sequence of the reference strain in fasta format.
wildtype.gbk - The reference strain with gene and other annotations in genbank format.
wildtype.gff - The reference strain with gene and other annotations in gff3 format.
Click galaxy-uploadUpload Data at the top of the tool panel
Select galaxy-wf-editPaste/Fetch Data
Paste the link(s) into the text field
Press Start
Close the window
Find variants with Snippy
We will now run the Snippy tool on our reads, comparing it to the reference.
Snippy is a tool for rapid bacterial SNP calling and core genome alignments. Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It will find both substitutions (snps) and insertions/deletions (indels).
If we give Snippy an annotated reference, it will silently run a tool called SnpEff which will figure out the effect of any changes on the genes and other features. If we just give Snippy the reference sequence alone without the annotations, it will not run SnpEff.
We have an annotated reference and so will use it in this case.
Hands On: Run Snippy
Snippytool with the following parameters
“Reference File” to the wildtype.gbk file (if the genbank file is not selectable, make sure to change its datatype to ‘genbank’)
“Single or Paired-end reads” to Paired
“Select first set of reads” to mutant_R1.fastq
“Select second set of reads” to mutant_R2.fastq
Select all outputs
Examine Snippy output
Snippy has taken the reads, mapped them against the reference using BWA MEM, looked through the resulting BAM file and found differences using some fancy Bayesian statistics (Freebayes), filtered the differences for sensibility and finally checked what effect these differences will have on the predicted genes and other features in the genome.
It produces quite a bit of output, there can be up to 10 output files.
Filename
Description
snps vcf file
The final annotated variants in VCF format
snps gff file
The variants in GFF3 format
snps table
A simple tab-separated summary of all the variants
snps summary
A summary of the SNPs called
log file
A log file with the commands run and their outputs
aligned fasta
A version of the reference but with - at position with depth=0 and N for 0 < depth < –mincov (does not have variants)
consensus fasta
A version of the reference genome with all variants instantiated
mapping depth
A table of the mapping depth
mapped reads bam
A BAM file containing all of the mapped reads
outdir
A tarball of the Snippy output directory for input into Snippy-core if required
We will now have a look at the contents of the SNP table file (snippy on data XX, data XX and data XX table):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
CHROM POS TYPE REF ALT EVIDENCE FTYPE STRAND NT_POS AA_POS EFFECT LOCUS_TAG GENE PRODUCT
Wildtype 24388 snp A G G:22 A:0 CDS + 1/702 1/233 initiator_codon_variant c.1A>G p.Met1? WILD_00022 walR Transcriptional regulatory protein WalR
Wildtype 29479 snp T G G:21 T:0 CDS + 39/792 13/263 synonymous_variant c.39T>G p.Gly13Gly WILD_00026 yycJ Putative metallo-hydrolase YycJ
Wildtype 47299 snp T A A:24 T:0 CDS + 54/1758 18/585 stop_gained c.54T>A p.Cys18* WILD_00043 mecR1 Methicillin resistance mecR1 protein
Wildtype 102969 snp G C C:16 G:0 CDS - 87/1281 29/426 synonymous_variant c.87C>G p.Gly29Gly WILD_00093 spa Immunoglobulin G-binding protein A
Wildtype 103048 snp T A A:20 T:0 CDS - 8/1281 3/426 missense_variant c.8A>T p.Lys3Met WILD_00093 spa Immunoglobulin G-binding protein A
Wildtype 103379 del GAA GA GA:11 GAA:0
Wildtype 106602 snp T G G:21 T:0 CDS - 993/993 331/330 stop_lost&splice_region_variant c.993A>C p.Ter331Tyrext*? WILD_00097 yfiY putative siderophore-binding lipoprotein YfiY
Wildtype 109833 snp T A A:16 T:0 CDS + 1/1755 1/584 initiator_codon_variant c.1T>A p.Leu1? WILD_00100 iucC_1 Aerobactin synthase
Wildtype 114540 del ATT AT AT:25 ATT:0 CDS + 1717/1737 573/578 frameshift_variant c.1717delT p.Cys573fs WILD_00102 iucA N(2)-citryl-N(6)-acetyl-N(6)-hydroxylysine synthase
Wildtype 129881 mnp GT AA AA:18 GT:0 CDS + 55/708 19/235 missense_variant c.55_56delGTinsAA p.Val19Asn WILD_00117 deoD Purine nucleoside phosphorylase DeoD-type
Wildtype 138877 snp G C C:14 G:0 CDS + 1119/1545 373/514 missense_variant c.1119G>C p.Trp373Cys WILD_00125 hypothetical protein
Wildtype 138920 snp A G G:10 A:0 CDS + 1162/1545 388/514 missense_variant c.1162A>G p.Lys388Glu WILD_00125 hypothetical protein
Wildtype 160547 del GTC GC GC:18 GTC:0
Wildtype 160552 del CTA CA CA:20 CTA:0
Wildtype 190866 del GTT GT GT:18 GTT:0 CDS - 28/1356 10/451 frameshift_variant c.28delA p.Asn10fs WILD_00166 brnQ Branched-chain amino acid transport system 2 carrier protein
Question
Which types of variants have been found?
What is the third variant called?
What is the product of the mutation?
What might be the result of such a mutation?
In the 3rd column, you have “snp” for SNP, “del” for deletion, “mnp” for
This is a T→A mutation, causing a stop codon
On the 14th column, we see that The product of this gene is a methicillin resistance protein. Methicillin is an antibiotic.
This will cause a truncation in the Methicillin gene and a loss of resistance in the organism.
View Snippy output in JBrowse
We could go through all of the variants like this and read them out of a text table, but this is onerous and doesn’t really give the context of the changes very well. It would be much nicer to have a visualisation of the SNPs and the other relevant data. In Galaxy we can use a tool called JBrowse.
Hands On: Run JBrowse
JBrowsetool with the following parameters
“Reference genome to display” to Use a genome from history
“Select the reference genome” to wildtype.fna
This sequence will be the reference against which annotations are displayed
“Produce Standalone Instance” to Yes
“Genetic Code” to 11: The Bacterial, Archaeal and Plant Plastid Code
“JBrowse-in-Galaxy Action” to New JBrowse Instance
“Track Group”
We will now set up three different tracks - these are datasets displayed underneath the reference sequence (which is displayed as nucleotides in FASTA format). We will choose to display the sequence reads (the .bam file), the variants found by snippy (the .gff file) and the annotated reference genome (the wildtype.gff)
Track 1 - sequence reads: Click on Insert Track Group and fill it with
“Track Category” to sequence reads
Click on Insert Annotation Track and fill it with
“Track Type” to BAM Pileups
“BAM Track Data” to snippy bam file
“Autogenerate SNP Track” to Yes
“Track Visibility” to On for new users
Track 2 - variants: Click on Insert Track Group and fill it with
“Track Category” to variants
Click on Insert Annotation Track and fill it with
“Track Type” to GFF/GFF3/BED Features
“GFF/GFF3/BED Track Data” to snippy snps gff file
“Track Visibility” to On for new users
Track 3 - annotated reference: Click on Insert Track Group and fill it with
“Track Category” to annotated reference
Click on Insert Annotation Track and fill it with
“Track Type” to GFF/GFF3/BED Features
“GFF/GFF3/BED Track Data” to wildtype.gff
“JBrowse Track Type [Advanced]” to Canvas Features
Click on “JBrowse Styling Options [Advanced]”
“JBrowse style.label” to product
“JBrowse style.description” to product
“Track Visibility” to On for new users
A new dataset will be created in your history, containing the JBrowse interactive visualisation. We will now view its contents and play with it by clicking the galaxy-eye (eye) icon of the JBrowse on data XX and data XX - Complete dataset. The JBrowse window will appear in the centre Galaxy panel.
Hands On: Inspecting the SNPs using JBrowse
Display all the tracks and practice maneuvering around
Click on the tick boxes on the left to display the tracks
Zoom out by clicking on the minus button to see sequence reads and their coverage (the grey graph)
Figure 2: Inspection of the STOP SNP using JBrowse
Question
What is the correct codon at this position?
What is the mutation found here?
The correct codon at this position is TGT, coding for the amino acid Cysteine (middle row of the amino acid translations).
The mutation of T → A turns this triplet into TGA, a stop codon.
Conclusion
By running a tool such as Snippy on your read files and reference genome, we can find where the biologically important changes between genomes of different strains occur and perhaps what they mean to the phenotype.
You've Finished the Tutorial
Please also consider filling out the Feedback Form as well!
Key points
We used a tool called Snippy to call variants between our reads and our reference genome.
As our reference genome had annotations, we could see what effect the changes have on the features as annotated in the reference and therefore make inferences on the possible changes to the phenotype.
We used the JBrowse genome browser to visualise what these changes look like.
Frequently Asked Questions
Have questions about this tutorial? Have a look at the available FAQ pages and support channels
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{variant-analysis-microbial-variants,
author = "Anna Syme and Simon Gladman and Torsten Seemann",
title = "Microbial Variant Calling (Galaxy Training Materials)",
year = "",
month = "",
day = "",
url = "\url{https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/microbial-variants/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol}
}
Congratulations on successfully completing this tutorial!
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.
3 stars:
Liked: Files were not taking too much time to load/process.
January 2024
3 stars:
Liked: I think it clearly explained.
Disliked: Thanks for the tutorial it is good. But I would add some info about the interpretation of the output. I don't understand for example this case that is shown in the tutorial: % head -5 mysnps/snps.tab CHROM POS TYPE REF ALT EVIDENCE FTYPE STRAND NT_POS AA_POS LOCUS_TAG GENE PRODUCT EFFECT chr 5958 snp A G G:44 A:0 CDS + 41/600 13/200 ECO_0001 dnaA replication protein DnaA missense_variant c.548A>C p.Lys183Thr REF is A and ALT is G, but then the change is c.548A>C
July 2021
5 stars:
Liked: The step by step procedure
Disliked: If you have time, please add further analysis of the SNP (when we search snippy in galaxy, it shows two snippy programs, only of the them is showing right result) please try to correct it (if m not mistaken)
April 2021
3 stars:
Disliked: This tutorial works with the input files. However, working with non-model organisms is a problem. Download the genbank entry is nice, but which one - on BioSample or else? What should it look like? Any requirements? Only the easy way is shown. How can a reference genome fasta be used along with a gff3 file? That does not work at all with files other than built in or sample files. The tutorial is meant well, but it should be improved to enable also people not working with yeast, mouse or human to use it and prevent problems. Thanks a lot!
July 2019
4 stars:
Disliked: The step "Select all outputs" in "Hands-on: Run Snippy" is not obvious what to do exactly.
November 2018
5 stars:
Liked: This was very straight forward and ran well (once it was clear that it only runs on the eu site). listing the file types was useful
Disliked: maybe link to descriptions of the various file type generated/used