+ - 0:00:00
Notes for current slide

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press P again to switch presenter notes off

Press C to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other.

Useful when presenting.

Notes for next slide



De Bruijn Graph Assembly



last_modification Updated:   purlPURL: gxy.io/GTN:S00032

text-document Plain-text slides |

Tip: press P to view the presenter notes | arrow-keys Use arrow keys to move between slides
1 / 44

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press P again to switch presenter notes off

Press C to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other.

Useful when presenting.

Requirements

Before diving into this slide deck, we recommend you to have a look at:

2 / 44

question Questions

  • What are the factors that affect genome assembly?

  • How does Genome assembly work?

3 / 44

objectives Objectives

  • Perform an optimised Velvet assembly with the Velvet Optimiser

  • Compare this assembly with those we did in the basic tutorial

  • Perform an assembly using the SPAdes assembler.

4 / 44

De novo Genome Assembly

Part 2: De Bruijn Graph Assembly

With thanks to T Seemann, D Bulach, I Cooke and Simon Gladman

5 / 44

de Bruijn Graphs

  • Named after Nicolaas Govert de Bruijn
  • Directed graph representing overlaps between sequences of symbols
  • Sequences can be reconstructed by moving between nodes in graph

black and white image of a man wearing glasses

6 / 44

de Bruijn Graphs

  • A directed graph of sequences of symbols
  • Nodes in the graph are k-mers
  • Edges represent consecutive k-mers (which overlap by k-1 symbols)
Consider the 2 symbol alphabet (0 & 1) de Bruijn Graph for k =3 every permutation of 3 0s and/or 1s is shown, each is a node. Lines connect ones that share at least 2 overlap.
7 / 44

Producing sequences

  • Sequences of symbols are produced by moving through the graph
Same graph as before with arrows listing a path through the graph. Nodes 111, 110, 100, and 000 combine into 111000.
8 / 44

K-mers? Special K Logo


  • To be able to use de Bruijn graphs, we need reads of length L to overlap by L-1 bases.
  • Not all reads will overlap another read perfectly.
    • Read errors
    • Coverage "holes"
  • Not all reads are the same length (depending on technology and quality cleanup)

To help us get around these problems, we use all k-length subsequences of the reads, these are the k-mers.

9 / 44

What are K-mers? Special K Logo

k=12 kmers show three sequences with length 12 overlapping. For k=5 10 different shorter sequences are shown.
10 / 44

K-mers de Bruijn graph Wooden blocks with letters A, B, C.

An example of happiness is given with 4 fragments, happi, pine, iness, appin.
11 / 44

K-mers de Bruijn graph Wooden blocks with letters A, B, C.

all 4-mers are shown from those 5-mers, so happi becomes happ and appi. These are then uniqued.
12 / 44

K-mers de Bruijn graph Wooden blocks with letters A, B, C.

A graph is drawn based on overlaps between those 4-mers.
13 / 44

K-mers de Bruijn graph Wooden blocks with letters A, B, C.

The graph is collapsed into the word happiness, success.
14 / 44

The problem of repeats Repeat icon

A second example with the word mississippi is shown. The fragments are now missis, ssissi, ssippi.
15 / 44

The problem of repeats Repeat icon

All 4-mers are generated so missis becomes miss, issi, and ssis. These are then uniqued to get 7 unique 4-mers.
16 / 44

The problem of repeats Repeat icon

A graph is generated but this graph is more complex and has a loop.
17 / 44

The problem of repeats Repeat icon

This graph is walked and now the results are shown, mississippi, mississississippi, or more could be found by following the loop repeatedly.
18 / 44

Different k Letter K

The same mississippi example from the start.
19 / 44

Different k Letter K

Instead of 4-mers, we now produce 5-mers.
20 / 44

Different k Letter K

The graph is produced, and now there are two disconnected sets of nodes.
21 / 44

Different k Letter K

When collapsing this graph we see two results, MISSISSIS and SSIPPI.
22 / 44

Choose k wisely An old man from some movie.

  • Lower k

    • More connections
    • Less chance of resolving small repeats
    • Higher k-mer coverage
  • Higher k

    • Less connections
    • More chance of resolving small repeats
    • Lower k-mer coverage

Optimum value for k will balance these effects.

23 / 44

Read errors hazard symbol

Example 3, happiness again, with fragments happi, iness, aplin, pinet Blank image
24 / 44

Read errors hazard symbol

Happiness example 5-mers all 3-mers are produced
25 / 44

Read errors hazard symbol

Happiness example 5-mers The three mers are collapsed into an overlap graph
26 / 44

Read errors hazard symbol

Happiness example 5-mers The graph is coloured and six contigs are produced based on overlaps.
27 / 44

More coverage hazard symbol

  • Errors won't be duplicated in every read
  • Most reads will be error free
  • We can count the frequency of each k-mer
  • Annotate the graph with the frequencies
  • Use the frequency data to clean the de Bruijn graph

More coverage depth will help overcome errors!

28 / 44

Read errors revisited hazard symbol

Same happiness example, all three mers are present, but now the counts are shown. One path has many more counts than the other. Text asks which path looks most valid? why?
29 / 44

Another parameter - coverage cutoff

  • At what point is a low coverage indicative of an error?
  • Can we ignore low coverage nodes and paths?
  • This is a new assembly parameter

Coverage cutoff

30 / 44

de Bruijn graph assembly process

  1. Select a value for k
  2. "Hash" the reads (make the kmers)
  3. Count the kmers
  4. Make the de Bruijn graph
  5. Perform graph simplification steps - use cov cutoff
  6. Read off contigs from simplified graph
31 / 44

Graph simplification

Step 1: Chain merging

Looking at a doubly connected graph with forward/reverse sequences and piles of overlaps. Nodes are connected with lines. One node is labelled as already merged, and two nodes sharing a prefix are labelled as further merging possible.

32 / 44

Graph simplification

Step 2: Tip clipping

a graph with every node doubled for forward/reverse strands is shown, clip tips if the length of the tip of the node is less than 2 times k.

33 / 44

Graph simplification

Three graphs are shown, B, C, and D. B is the most complex, C combines two nodes B and B prime, while D collapses even more nodes to further simplify the graph.

Step 3: Bubble collapsing

  • Detect redundant paths through graph
  • Compare the paths using sequence alignment
  • If similar, merge the paths

Image: Zerbino & Birney 2008

34 / 44

Graph simplification

Step 4: Remove low coverage nodes

  • Remove erroneous nodes and connections using the "coverage cutoff"
  • Genuine short nodes will have a high coverage
35 / 44

Make contigs

  • Find an unbalanced node in the graph
  • Follow the chain of nodes and "read off" the bases to produce the contigs
  • Where there is an ambiguous divergence/convergence, stop the current contig and start a new one.
  • Re-trace the reads through the contigs to help with repeat resolution
36 / 44

Velvet

Velvet has two separate programs:

  • Velveth
    • Makes the k-mers and
    • Efficiently counts (hashes) them
    • All in O(N) time
  • Velvetg
    • Makes the graph - O(U) time. U = unique k-mers.
    • Simplifies it
    • Makes contigs - O(E) time. E = edges in graph

But: You need to choose k and c wisely!

37 / 44

Velvet - Paired end scaffolding

  • Breadcrumb algorithm

Two boxes A and B are at opposite ends, there are many nodes along a line between them, and some sitting outside of the graph or otherwise weirdly connected.

38 / 44

Extensions of the idea

39 / 44

SPAdes spades logo

  • de Bruijn graph assembler by Pavel Pevzner's group out of St. Petersburg
  • Uses multiple k-mers to build the graph
    • Graph has connectivity and specificity
    • Usually use a low, medium and high k-mer size together.
  • Performs error correction on the reads first

  • Maps reads back to the contigs and scaffolds as a check

  • Under active development

  • Much slower than Velvet

  • Should be used in preference to Velvet now.
40 / 44

A move back to OLC

  • New long read technologies
    • PacBio and MinIon
  • Assemblers: HGap, CANU
    • Use overlap, layout consensus approach
  • CANU can perform hybrid assemblies with long and short reads

image of a minion connected to a computer via a cable. Below a man stands next to a very large machine.

41 / 44

Bandage

  • Assembly graph viewer and manipulator
  • Written by Ryan Wick of Centre for Systems Genomics - Uni. Melbourne, Australia
A screenshot of the bandage gui showning drawing options on left and a middle window with a close up of a genome assembly graph.
42 / 44

keypoints Key points

  • We learned about how the choice of k-mer size will affect assembly outcomes

  • We learned about the strategies that assemblers use to make reference genomes

  • We performed a number of assemblies with Velvet and SPAdes.

  • You should use SPAdes or another more modern assembler than Velvet for actual assemblies now.

43 / 44

Thank You!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Author(s) orcid logoSimon Gladman avatar Simon Gladman
Reviewers Helena Rasche avatarCristóbal Gallardo avatarAnne Fouilloux avatarYvan Le Bras avatarSaskia Hiltemann avatarNicola Soranzo avatarBérénice Batut avatarNiall Beard avatar
Galaxy Training Network

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

44 / 44

Requirements

Before diving into this slide deck, we recommend you to have a look at:

2 / 44
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow