Calling very rare variants

Author(s) Anton Nekrutenko avatar Anton NekrutenkoNick Stoler avatar Nick Stoler
Reviewers Saskia Hiltemann avatarBjörn Grüning avatarNicola Soranzo avatarHelena Rasche avatarBérénice Batut avatarMaria Doyle avatarAnthony Bretaudeau avatarMateo Boudet avatarAlexandre Cormier avatarWilliam Durand avatarNiall Beard avatar
Overview
Creative Commons License: CC-BY Questions:
  • What frequency of variants is so low that it is obscured by sequencing error rate?

  • What are the different types of consensus sequences produced from duplex sequencing?

Objectives:
  • Processing raw duplex sequencing data into consensus sequences

  • Find rare variants without relying on diploid assumptions

Requirements:
Time estimation: 3 hours
Supporting Materials:
Published: Feb 22, 2017
Last modification: Dec 9, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL: https://gxy.io/GTN:T00310
version Revision: 21

This page explains how to perform discovery of low frequency variants from duplex sequencing data. As an example we use the ABL1 dataset published by Schmitt and colleagues (SRA accession SRR1799908).

Agenda
  1. Background
  2. How to use this tutorial
  3. Generating consensus sequences
    1. Getting data in and assessing quality
    2. Processing reads into duplex consensus sequences with Du Novo
    3. Filtering consensuses
  4. Calling variants with duplex consensus sequences
    1. Mapping the reads
    2. Calling the variants
    3. Results
  5. Calling variants with single strand consensus sequences
    1. Re-running analyses with workflows
  6. Conclusion

Background

Finding rare variants

Most popular variant callers focus on the common case of sequencing a diploid individual to find heterozygous and homozygous variants. This is a well-studied problem with its own challenges, but at least you can expect your variants to be present in either 100%, 50%, or 0% of your sample DNA. If you observe a variant present in 99%, 56%, or 2% of the reads at a site, you can probably assume the allele is actually present at 100%, 50%, or 0%, respectively, in your sample.

But in this tutorial, we’re looking for rare variants. So our true frequency might actually be 13%, 1%, or even 0.4%. The challenge then becomes distinguishing these situations from sequencing errors. Next-generation sequencers produce noise at this level, making it challenging to make this distinction in data produced with standard resequencing methods.

Duplex sequencing

Duplex sequencing is a method that addresses the problem of distinguishing sequencing signal from noise. It can increase sequencing accuracy by over four orders of magnitude. Duplex sequencing uses randomly generated oligomers to uniquely tag each fragment in a sample after random shearing. The tagged fragments are then PCR amplified prior to sequencing, so that many reads can be obtained from each original molecule. The tags in each read can then be used to identify which original fragment the read came from. Identifying multiple reads from each fragment allows building a consensus of the original sequence of the fragment, eliminating errors.

The key to duplex sequencing, as opposed to other types of consensus-based methods (review here), is that both ends of the original fragment are tagged such that its strands can be distinguished. Knowing which strand each read comes from allows us to recognize errors even in the first round of PCR.

Processing the raw reads into consensus sequences consists of four main steps:

  1. Group reads by their tags.
  2. Align reads in the same tag group.
  3. Build single-strand consensus sequences (SSCS) of reads coming from the same original strand.
  4. Build duplex consensus sequences (DCS) from pairs of SSCS.

Du Novo is a tool which can carry out these steps. Unlike most other such tools, it can do so without the use of a reference sequence, and it can correct for errors in the tags which can contribute to data loss.

Comment: Terminology

Du Novo processes the tags from each fragment by concatenating them into a single barcode.

  • For a standard protocol with two 12bp tags, this results in a 24bp barcode which identifies each family.

Schmitt et al. 2012 provides this overview of the whole method: duplex.

Figure 1: The logic of duplex sequencing. The computational process is shown in part C.

The value of single-strand consensus sequences

The DCSs have the ultimate accuracy, yet the SSCSs can also be very useful when ampliconic DNA is used as an input to a duplex experiment. Let us illustrate the utility of SSCSs with the following example. Suppose one is interested in quantifying variants in a virus that has a very low titer in body fluids. Since the duplex procedure requires a substantial amount of starting DNA (between between 0.2 and 3 micrograms) the virus needs to be enriched. This can be done, for example, with a PCR designed to amplify the entire genome of the virus. Yet the problem is that during the amplification heterologous strands will almost certainly realign to some extent forming heteroduplex molecules:

hd.

Figure 2: Heteroduplex formation in ampliconic templates. Image by Barbara Arbeithuber from Stoler et al. 2016. Here there are two distinct types of viral genomes: carrying A and G. Because the population of genomes is enriched via PCR, heteroduplex formation takes place, skewing frequency estimates performed using DCSs.

In the image above there are two alleles: green (A) and red (G). After PCR a fraction of molecules are in heteroduplex state. If this PCR-derived DNA is now used as the starting material for a DS experiment, the heteroduplex molecules will manifest themselves as having an N base at this site (because Du Novo interprets disagreements as Ns during consensus generation). So, DSCs produced from this dataset will have A, G, and N at the polymorphic site. Yet, SSCSs will only have A and G. Thus SSCS will give a more accurate estimate of the allele frequency at this site in this particular case. In Du Novo SSCSs are generated when the param-check Output single-strand consensus sequences option of tool Du Novo: Make consensus reads tool is set to Yes (see below).

How to use this tutorial

The entire analysis described here is accessible as a Galaxy history that you can copy and play with.

Comment: Running the tools
  • Leave all parameters on their default settings, unless instructed otherwise.
    Comment: Helping Du Novo

    But if you’d like to help improve Du Novo, consider checking Yes under param-check Send usage data.

This analysis can be divided into three parts:

  1. Generating consensus sequences
  2. Calling variants with duplex consensus sequences
  3. Calling variants with single strand consensus sequences

Here are the steps, displayed as the Galaxy history you’ll end up with if you follow the instructions:

  • Note: Galaxy histories show the first step at the bottom!

steps.

Figure 3: Analysis outline

Generating consensus sequences

The starting point of the analysis is sequencing reads (in FASTQ format) produced from a duplex sequencing library.

Getting data in and assessing quality

You can obtain the data from Schmitt et al. 2015 from Zenodo, or just import the data from the Galaxy Data library under the “GTN Materials” folder.

Hands-on: Importing the raw data
  1. Import the datasets from Zenodo, or from a data library.

    https://zenodo.org/record/3554549/files/SRR1799908_forward.fastq
    https://zenodo.org/record/3554549/files/SRR1799908_reverse.fastq
    
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  2. Rename galaxy-pencil the datasets to

    • SRR1799908_forward
    • SRR1799908_reverse
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

This creates two datasets in our galaxy history: one for forward reads and one for reverse.

We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse):

Hands-on: Evaluating input read quality
  1. FastQC ( Galaxy version 0.74+galaxy1) with the following parameters:
    • param-file Short read data from your current history: One of the raw FASTQ datasets (SRR1799908_forward)
    • Leave the rest of the options as their defaults.
  2. Repeat with the other FASTQ file (SRR1799908_reverse)

This created two datasets in our galaxy history: one for forward reads and one for reverse. We then evaluated the quality of the data by running FastQC on both datasets (forward and reverse). You can read about using tool FastQC in the dedicated quality-control tutorial.

This gave us the following plots:

Quality scores across all bases: foward. Quality scores across all bases: reverse.

Figure 4: FastQC assessment of the quality of the raw reads. Left: Forward reads. Right: Reverse reads.

One can see that these data are of excellent quality and no additional processing is required before we can start the actual analysis.

Processing reads into duplex consensus sequences with Du Novo

Now we are ready to collapse the raw reads into duplex consensus sequences.

Sorting reads into families

The tool Du Novo: Make families tool will separate the 12bp tags from each read pair and concatenate them into a 24bp barcode. Then, it will use the barcodes to sort the reads into families that all descend from the same original fragment.

Hands-on: Sorting reads into families
  1. Du Novo: Make families ( Galaxy version 3.0.2) with the following parameters:
    • param-file Sequencing reads, mate 1: The forward raw reads (SRR1799908_forward)
    • param-file Sequencing reads, mate 2: The reverse raw reads (SRR1799908_reverse)
    • param-text Tag length: 12

Correcting barcodes

The grouping reads based on barcode relies on exact barcode matches. Any PCR or sequencing error in the barcode sequence will prevent the affected reads from being joined with their other family members.

Du Novo includes a tool which can correct most of these errors and recover the affected reads. This can increase the final yield of duplex consensus reads by up to 11% (Stoler et al. 2018, in preparation).

Hands-on: Correcting barcodes
  1. Du Novo: Correct barcodes ( Galaxy version 3.0.2) with the following parameters:
    • param-file Input reads: The output of tool Make families
    • param-text Maximum differences: 3

Aligning families

After grouping reads that came from the same original fragment, we need to align them with each other. This next tool will perform a multiple sequence alignment on each family.

Comment: Analysis bottleneck

This is by far the most time-consuming step.

On this dataset, it took 2 hours to complete when run on Galaxy Main.

  • At the time, Galaxy allocated 6 cores to the job.
Hands-on: Aligning families
  1. Du Novo: Align families ( Galaxy version 3.0.2) with the following parameters:
    • param-file Input reads: The output of tool Correct barcodes
    • param-select Multiple sequence aligner: Kalign2

Making consensus sequences

Now, we need to collapse the aligned reads into consensus sequences. This next tool will process each group of aligned reads that came from the same single-stranded family into a consensus. Then it will align the consensus sequences from the two strands of each original molecule, and call a consensus between them.

Normally, the tool only produces the final double-stranded consensus sequences. But we will make use of the single-stranded consensus sequences later, so we’ll tell it to keep those as well.

Hands-on: Making consensus sequences
  1. Du Novo: Make consensus reads ( Galaxy version 3.0.2) with the following parameters:
    • param-file Aligned input reads: The output of tool Align families
    • param-text Minimum reads for a consensus sequence: 3
    • param-text Consensus % threshold: 0.7
    • param-select Output format: FASTQ
    • param-check Output single-strand consensus sequences as well: Yes
  2. This tool will produce 4 output files, in two sets of paired-end FASTQ files:
    • one pair for the single-stranded consensus sequences (called SSCS mate 1 and 2)
    • one pair for the double-stranded consensus sequences (called mate 1 and 2).
Comment: Setting output formats

You may have to set the datatype of the outputs from tool Du Novo: Make consensus reads tool.

Versions below 2.16 only set the datatype to fastq, not the more specific fastqsanger. Many tools (like tool Map with BWA-MEM) won’t accept FASTQ input without it specifying what subtype it is.

  1. In your history, click on the pencil icon next to the dataset name.
  2. Click on the Datatypes tab.
  3. In the Change datatype pane, click on the dropdown where it says fastq.
  4. Enter fastqsanger, then click the Change datatype button in the upper right of the pane.

There is no easy way to assign a PHRED score to a consensus base derived from many duplex reads.

So Du Novo does not attempt to give a meaningful score. It assigns the same arbitrary score to all bases.

  • It produces FASTQ for compatibility, but the output contains no more information than a FASTA file.

You may have noticed the param-text Output PHRED score parameter in the tool Du Novo: Make consensus reads tool. This allows you to specify which score to assign to (all) the bases.

Filtering consensuses

You may have realized that when calling a “consensus” between two sequences, if the two disagree on a base, there’s no way to know which is correct. So in these situations, Du Novo uses the IUPAC ambiguity letter for the two different bases (e.g. W = A or T). Also, when calling single-stranded consensus sequences, if there aren’t enough high-quality bases to call a position (in the above hands-on, we set this threshold to 70%), it gives an N.

This information could be useful for some analyses, but not for our variant calling. The tool tool Sequence Content Trimmer will help with filtering these out. With the settings below, it will move along the read, tracking the frequency of ambiguous (non-ACGT) bases in a 10bp window. If it sees more than 2 ambiguous bases in a window, it will remove the rest of the read, starting with the first offending base in the window. We’ll also tell it to remove entirely any read pair containing a read that got trimmed to less than 50bp.

Hands-on: Filtering the consensus sequences
  1. Sequence Content Trimmer ( Galaxy version 0.2.3) with the following parameters:
    • param-select Paired reads?: Paired
    • param-file Input reads (mate 1): The double-stranded output of tool Make consensus reads (mate 1)
    • param-file Input reads (mate 2): The double-stranded output of tool Make consensus reads (mate 2)
    • param-text Bases to filter on: ACGT
    • param-text Frequency threshold: 0.2
    • param-text Size of the window: 10
    • param-check Invert filter bases: Yes
    • param-check Set a minimum read length: Yes
    • param-text Minimum read length: 50

Calling variants with duplex consensus sequences

At this point we have trimmed DCSs. We can now proceed to call variants. This involves aligning the variants against the reference genome, then counting variants.

We’re not specifically interested in the reference sequence, since all we care about is sequence content of the consensus reads. But we’ll be using the reference sequence to figure out where all the reads come from. This lets us stack them on top of each other, with equivalent bases lined up in columns. Then we can step through each column, count how many times we see each base, and and compile a list of variants.

Mapping the reads

Align against the genome with BWA-MEM

Here, we’ll use tool Map with BWA-MEM to map the DCS reads to the human reference genome.

Hands-on: Align with BWA-MEM
  1. Map with BWA-MEM ( Galaxy version 0.7.18) with the following parameters:
    • param-select Using reference genome?: Human (Homo sapiens) (b38): hg38
    • param-file Select first set of reads: The first output from the tool Sequence Content Trimmer
    • param-file Select second set of reads: The second output from the tool Sequence Content Trimmer

Left Aligning indels

To normalize the positional distribution of indels we use the tool BamLeftAlign utility from the FreeBayes package. You can find it in the NGS: Variant Analysis section. This is necessary to avoid erroneous polymorphisms flanking regions with indels (e.g., in low complexity loci):

Hands-on: Left-align indels
  1. BamLeftAlign ( Galaxy version 1.3.8) with the following parameters:
    • param-file Select alignment file in BAM format: The output of tool Map with BWA-MEM
    • param-select Using reference genome: Human (Homo sapiens): hg38
      • The same genome we aligned to.

Calling the variants

Now we’ll use our aligned consensus reads to find variants.

Normally, in a diploid resequencing experiment, you would call variants relative to the reference. So, you’d report sites which are different from the reference (and whether they’re hetero- or homozygous).

In our case, we’re interested in rare variants. So what we’ll report is the sites where there is more than one allele, and what the frequency is of the less-common allele (the minor allele). This has the potential to include every small sequencing error (even though we’re using duplex, there still are errors). So to reduce the noise, we’ll set a lower threshold at 1% minor allele frequency (MAF).

Finding variants in the alignment

To identify sites containing variants we use the tool Naive Variant Caller (NVC) tool from the NGS: Variant Analysis section. This reads the alignment and counts the number of bases of each type at each site.

Hands-on: Count the variants
  1. Naive Variant Caller (NVC) ( Galaxy version 0.0.4) with the following parameters:
    • param-file BAM file: The output of tool BamLeftAlign
    • param-select Using reference genome: hg38
      • The same genome we aligned to.
    • param-check Insert Restrict to regions: Click to add a region.
    • param-text Chromosome: chr9
      • ABL1 is on chr9. Restricting it to this region saves some processing time.
    • param-text Minimum base quality: 0
    • param-text Minimum mapping quality: 20
    • param-text Ploidy: 1
      • Ploidy is irrelevant here as it is a mixture of multiple genomes.

The tool Naive Variant Caller (NVC) generates a VCF file that can be viewed at genome browsers such as IGV. Yet one rarely finds variants by looking at genome browsers. We’ll want to use tools to search for variants that fit our criteria.

Finding minor alleles

Now we’ll want to parse the VCF produced by the NVC, determine what the major and minor allele is at each site, and calculate their frequencies. The tool Variant Annotator from the NGS: Variant Analysis section can do this.

Hands-on: Read the variants file
  1. Variant Annotator ( Galaxy version 1.3.2) with the following parameters:
    • param-file Input variants from Naive Variants Detector: The output of tool Naive Variant Caller (NVC)
    • param-text Minor allele frequency threshold: 0
    • param-text Coverage threshold: 10
    • param-check Output stranded base counts: Yes
      • To be able to filter for strand bias.

Filtering out the noise

Now we have a file containing the base counts for every site covered by at least 10 reads. We’d like to filter through this data to find sites with a reasonable chance of being a real variant, not sequencing error.

The tool Variant Annotator produces a simple tab-delimited file, with one site per line. We can use the tool Filter tool from the Filter and Sort section to process this kind of file. We’ll use the filter c16 >= 0.01 to remove lines where the value in column 16 is less than 0.01. Column 16 contains the minor allele frequency, so this will remove all sites with a MAF less than 1%.

Hands-on: Filter the raw variants list
  1. Filter - data on any column using simple expressions with the following parameters:
    • param-file Filter: The output of tool Variant Annotator
    • param-text With following condition: c16 >= 0.01
    • param-check Number of header lines to skip: 1

Results

Now we’re down to just two sites:

Position (chr9) Major allele Minor allele MAF
Column 3 Column 14 Column 15 Column 16
130,872,141 G A 0.01259
130,880,141 A G 0.47764

The polymorphism we are interested in (and the one reported by Schmitt et al. 2015) is at the position 130,872,141 and has a frequency of 1.3%. The other site (position 130,880,141) is a known common variant rs2227985, which is heterozygous in this sample.

Calling variants with single strand consensus sequences

To analyze the SSCS data, go back to the Filtering consensuses step, and replace the input to the tool Sequence Content Trimmer with the single-stranded output of tool Make consensus reads instead of the double-stranded output. Then continue through the same steps as before, with the same parameters, to the end.

Then you will have a new set of variants, but produced with the SSCS reads, giving you a higher sensitivity (fewer false negatives), but lower specificity (more false positives).

There’s a shortcut to avoid setting every parameter the second time you run a tool.

  1. In your history, click on an output of the first run to expand it.
  2. Click on the button with the circular “re-run” arrows.
  3. Now the parameters will all be the same as the last run. All you have to do is change the input file(s).

Re-running analyses with workflows

Instead of manually re-running all the tools in the variant calling section, you can use a workflow to automatically run the same tools, but on the SSCS reads. Workflows let you run a chain of tools on different input data with a single click of a button. You can find more information on using workflows in the Galaxy 101 introductory tutorial.

We’ve prepared two workflows which split the above analysis into two steps:

  1. Using Du Novo to create consensus sequences from raw reads.
    • This will generate trimmed DCS and SSCS files from raw sequencing data.
    • This does not include the FastQC step. You should always run FastQC on your raw reads first, to check the quality of your sequencing run before proceeding with the analysis.
    Comment: Helping Du Novo

    The param-check Send usage data option is left off in the above workflow. This is because we want to make sure you only share data knowingly.

    But again, if you’d like to help improve Du Novo, consider turning it on.

  2. Calling variants from consensus sequences.
    • This takes a pair of FASTQ files and calls variants using them.
    • If you’d like variants from both DCS and SSCS, you’ll have to run this twice, once on each.
    • N.B. Remember that this workflow is designed for the above ABL1 analysis. If you want to use it for any other dataset, you’ll have to change the relevant options.

You can use the variant calling workflow to call variants using the SSCS instead of the DCS.

Conclusion

You should now understand duplex sequencing, rare variants, and be able to process the former to find the latter.

If things don’t work…

…you need to complain. Use Galaxy’s Help Forum to do this.