Before diving into this slide deck, we recommend that you have a look at:
Goal: Reconstruct the sequences of a complete genome, or as close as possible to the complete genome, from sequences of DNA fragments (the "reads").
Genome assembly consists of aligning and reconstructing these fragments to form a continuous sequence (that of the chromosomes) or a set of contiguous sequences (called contigs or scaffolds).
https://www.hudsonalpha.org/sequencing-from-scratch-reference-genomes-and-de-novo-sequence-assembly/
https://www.nature.com/articles/s44185-024-00054-6
Higher ploidy makes a genome harder to assemble, so the sequencing depth must be increased.
https://commons.wikimedia.org/w/index.php?curid=19537795
Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011).
Heterozygous: a locus-specific term. A diploid (2N) organism is heterozygous when it carries two different alleles of a particular gene at the same locus.
Heterozygosity is a metric indicating the probability that an individual is heterozygous at a particular locus.
Higher heterozygosity makes a genome harder to assemble, so the sequencing depth must be increased.
https://www.genome.gov/genetics-glossary/heterozygous
Contig: a contiguous sequence in an assembly. A contig does not contain long stretches of unknown sequences (aka assembly gaps).
Contigs are usually generated from long-read data.
Scaffold: a sequence consisting of one or more contigs connected by assembly gaps of typically inexact size. A scaffold is also called a supercontig, though this term is rarely used nowadays.
Scaffolds are usually generated using Hi-C data.
Assembly: a set of contigs or scaffolds.
1 node = 1 read
1 edge = 1 overlap
Determine the best path through the graph
Remove redundant information
Process repeated many times
Sequences combined to form the final sequence
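The overlap-based steps above can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy merge of the largest overlap, which is a simplification for illustration and not how production assemblers choose a path. The function names and the `min_overlap` parameter are hypothetical.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Nodes are reads; each edge (a, b) stores the overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[(a, b)] = length
    return graph


def greedy_layout(reads, min_overlap=3):
    """Repeatedly merge the two sequences sharing the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        graph = build_overlap_graph(seqs, min_overlap)
        if not graph:
            break  # no remaining overlaps: the sequences stay separate
        (a, b), length = max(graph.items(), key=lambda item: item[1])
        seqs.remove(a)
        seqs.remove(b)
        seqs.append(a + b[length:])  # drop the redundant overlapping bases
    return seqs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_layout(reads))
```

With these three toy reads, each merge removes the redundant overlapping bases, and the process repeats until a single sequence remains.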
1 node = 1 k-mer
1 edge = 1 overlap
Find the path that consistently traverses the graph
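A k-mer graph of this kind can be sketched as follows. This is a toy illustration under strong assumptions: error-free reads, no repeated k-mers, and a single linear path. Nodes are k-mers and edges link k-mers that overlap by k-1 bases within a read, as on the slide. The function names are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_kmer_graph(reads, k):
    """Nodes are k-mers; an edge joins consecutive k-mers of a read."""
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for part in parts:
            graph[part]  # make sure every k-mer exists as a node
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return dict(graph)


def walk_unambiguous_path(graph):
    """Follow the graph from its single start node while the path is unique."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("this sketch expects exactly one start node")
    node = starts[0]
    sequence = node
    visited = {node}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in visited:
            break  # stop on a cycle instead of looping forever
        visited.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
graph = build_kmer_graph(reads, k=4)
print(walk_unambiguous_path(graph))
```

Real genomes contain repeats and sequencing errors, which create branches and cycles that this walk does not resolve.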
Hi-C: captures interactions between different parts of a genome by measuring the physical proximity of DNA segments in the nucleus:
Closely interacting DNA regions are cross-linked.
The DNA is digested, labeled, and ligated to create hybrid fragments.
These fragments are sequenced to reveal which parts of the genome were spatially close, even if they are distant in the linear sequence.
Hi-C allows for the transition from the assembly of fragmented contigs to:
This mainly depends on the quantity and quality of the DNA as well as the cost of the experiment, but many parameters need to be considered before performing an NGS experiment:
Sequencing technology for assembly:
Sequencing technology for scaffolding:
Optical mapping: a technique that physically locates specific restriction enzyme sites or sequence motifs to produce DNA sequence fingerprints. Providers: BioNano, BGI
Mate pair (deprecated)
Typical sequencing strategies: EBP (Earth BioGenome Project) recommendation
Different levels of assembly exist today:
It can still contain gaps in between the scaffolds (shown as "NNNNNNNN" in the assembly)
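As a small illustration of how such gaps appear in practice, the sketch below splits a scaffold sequence at runs of "N" to recover its contigs and the recorded gap lengths. The function name and the `min_gap` parameter are hypothetical, and the gap lengths written in an assembly are often placeholders, not measured sizes.

```python
import re


def split_scaffold(scaffold, min_gap=1):
    """Return (contigs, gap_lengths) for a scaffold whose gaps are runs of N."""
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = [part for part in pattern.split(scaffold) if part]
    gaps = [len(match.group()) for match in pattern.finditer(scaffold)]
    return contigs, gaps


contigs, gaps = split_scaffold("ACGTNNNNNNNNGGCCTTNNNAA")
print(contigs)
print(gaps)
```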
Haplotig: a contig that comes from the same haplotype. In an unphased assembly, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome.
Primary assembly: a complete assembly with long stretches of phased blocks.
Alternate assembly: an incomplete assembly consisting of haplotigs in heterozygous regions. An alternate assembly always accompanies a primary assembly. It is not useful by itself as it is fragmented and incomplete.
Haplotype-resolved assembly: sets of complete assemblies consisting of haplotigs, representing an entire diploid/polyploid genome.
To be successful, you must have sufficient computing resources (CPUs, RAM, walltime and storage).
FASTA: a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).
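To make the FASTA layout concrete, here is a minimal parsing sketch. It assumes the common convention of a header line starting with ">" followed by one or more sequence lines, and it keeps only the first word of the header as the record name. The function name and the example records are hypothetical; for real data, an established library is usually the safer choice.

```python
def parse_fasta(text):
    """Map each record name to its sequence, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


example = """>contig_1 hypothetical example
ACGT
TTGA
>contig_2
GGCC
"""
print(parse_fasta(example))
```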
FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (Phred). Each sequence letter and each quality score is encoded with a single ASCII character for brevity. It is the standard sequencing output for Illumina and MGI sequencers.
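The single-character quality encoding can be illustrated with a short sketch. It assumes four-line records and the Phred+33 offset (the ASCII code minus 33 gives the score), which is the usual convention for current Illumina output but is an assumption here. The function names are hypothetical.

```python
def phred_to_error(score):
    """Probability that a base call is wrong for a given Phred score."""
    return 10 ** (-score / 10)


def parse_fastq(text, offset=33):
    """Return a list of (name, sequence, scores) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - offset for char in quality]  # assumes Phred+33
        records.append((header[1:], sequence, scores))
    return records


example = "@read_1\nACGT\n+\nII5!\n"
for name, sequence, scores in parse_fastq(example):
    print(name, sequence, scores)
```

A Phred score of 20 corresponds to an error probability of 1 in 100, and a score of 40 to 1 in 10,000.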
SAM (Sequence Alignment Map): a text-based format, developed by Heng Li, Bob Handsaker et al., originally for storing biological sequences aligned to a reference sequence.
BAM (Binary Alignment Map): the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of the SAM format. It's the standard sequencing output for PacBio sequencers.
CRAM (Compressed Reference-oriented Alignment Map): a compressed columnar file format for storing biological sequences aligned to a reference sequence.
Image licensed CC-BY 4.0 Hosseini et al. 2016
N50: given a set of sequences of varying lengths, the N50 is the length L of the shortest contig such that contigs of length L or longer cover at least 50% of the assembly.
L50: given a set of sequences of varying lengths, the L50 is the smallest number of sequences whose summed length makes up at least 50% of the assembly.
N50 describes a sequence length whereas L50 describes a number of sequences.
Example:
N50 = 8 and L50 = 4
Alhakami, H., Mirebrahim, H., & Lonardi, S. (2017). A comparative evaluation of genome assembly reconciliation tools. Genome biology, 18(1), 1-14.
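The two definitions can be computed together, as in the sketch below. The contig lengths are hypothetical and are not those of the slide's example; the function name is also an assumption.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the shortest contig needed to reach 50%,
            # `count` is how many contigs were needed to get there.
            return length, count
    raise ValueError("at least one sequence is required")


# Hypothetical contig lengths: total 300, so the 50% threshold is 150.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```

Here the two longest contigs (80 + 70) already reach 150, so the N50 is 70 and the L50 is 2.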
BUSCO: Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs
Quantitative assessment of genome assembly based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.
Tip: Reference databases are constructed using known genomes. Species with few or no closely related genomes available can score very poorly.
The quality of an assembly is often validated by using other data from the same individual or from other individuals (RNA-Seq alignment, Hi-C alignment, DNA-Seq alignment, ...).
The positions of the telomeric repeats in the chromosome assemblies are also useful for evaluating correctness.
The identification of organelles (mitochondria, chloroplast, ...) can also inform us about the quality of the assembly in terms of completeness. However, the structure of the organelles may lead the assembler to treat them as repeats and discard them.
In the case of diploid organisms, one of the classical problems of assemblies is the retention of both haplotypes. This produces characteristic BUSCO / k-mer / assembly size metrics, which can be corrected by removing ("purging") the haplotigs.
We learned the importance of preparing the project to ensure its success
We learned the importance of surrounding ourselves with all the people who have knowledge of the different parts of the project (wet lab, sequencing, bioinformatics,...)
We learned the definitions of bioinformatics terms used in genome assembly
We have seen the bioinformatics file formats used for these analyses
To go further: Deeper look into Genome Assembly algorithms
We have seen the bioinformatics tools to assess the quality of an assembly
To go further: Genome assembly quality control
This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!
Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.