
Genome assembly and assembly QC - Introduction short version

Published: Apr 22, 2025
Last Updated: Apr 23, 2025

Genome assembly

.left[ Goal: Reconstruct the sequences of a complete genome, or as close as possible to the complete genome, from sequences of DNA fragments (the “reads”).

Genome assembly consists of aligning and reconstructing these fragments to form a continuous sequence (that of the chromosomes) or a set of contiguous sequences (called contigs or scaffolds). ]

.image-1[ Assembly overview ]

.footnote[https://www.hudsonalpha.org/sequencing-from-scratch-reference-genomes-and-de-novo-sequence-assembly/]


Genome assembly vs alignment

.image-1[ Alignment vs assembly ]

.footnote[https://www.hudsonalpha.org/sequencing-from-scratch-reference-genomes-and-de-novo-sequence-assembly/]


Steps before starting a genome project

.left[


Steps before starting a genome project - ERGA model

.image-20[ ERGA process ]

.footnote[https://www.nature.com/articles/s44185-024-00054-6]


Genome information: Genome availability, expected genome size, ploidy, etc

.pull-left[ How to collect information?

.image-5[ Heterozygous genotype GOAT logo ] ]

.pull-right[ .image-50[ variation in estimated genome sizes in base pairs ]]

Higher ploidy -> harder to assemble => increase the sequencing depth

.footnote[https://commons.wikimedia.org/w/index.php?curid=19537795

Daniel Hartl. Essential Genetics: A Genomics Perspective. Jones & Bartlett Learning. p. 177. ISBN 978-0-7637-7364-9. (2011). ]
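
The link between genome size and sequencing depth can be sketched with the usual coverage relation, depth = total sequenced bases / genome size. The function names and the 500 Mb / 30x values below are illustrative assumptions, not recommendations from this training.

```python
def required_bases(genome_size_bp: int, target_depth: float) -> float:
    """Total bases to sequence so that depth = bases / genome size."""
    return genome_size_bp * target_depth


def achieved_depth(total_bases: float, genome_size_bp: int) -> float:
    """Average depth obtained from a given sequencing yield."""
    return total_bases / genome_size_bp


# Hypothetical example: a 500 Mb (haploid) genome sequenced to 30x.
genome_size = 500_000_000
bases_needed = required_bases(genome_size, 30)
print(f"{bases_needed / 1e9:.1f} Gb needed")  # 15.0 Gb needed
```

Raising the target depth for a polyploid or highly heterozygous genome scales the required yield linearly.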


Genome information: Heterozygosity level & Others

.pull-left[ .left[ Heterozygous: at a given locus, a diploid (2N) organism carries two different alleles of a particular gene.


Heterozygosity: a metric indicating the probability that an individual is heterozygous at a particular locus.

]]

.pull-right[ .image-100[ Heterozygous genotype ]]

Higher heterozygosity -> harder to assemble => increase the sequencing depth

.footnote[https://www.genome.gov/genetics-glossary/heterozygous]


DNA extraction tips

.left[

]

Sequencing / Bioinformatics


Bioinformatics steps - Definitions

.image-5[ Illustration of the working principle of scaffolding ]

.left[ Contig: a contiguous sequence in an assembly. A contig does not contain long stretches of unknown sequence (assembly gaps). Contigs are usually generated from long-read data.
Scaffold: a sequence consisting of one or more contigs connected by assembly gaps, typically of inexact size. A scaffold is also called a supercontig, though this term is rarely used nowadays. Scaffolds are usually generated using Hi-C data.
Assembly: a set of contigs or scaffolds. ]
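
The relation between scaffolds, contigs and gaps can be illustrated with a minimal sketch that cuts a scaffold at runs of `N`. The function name `split_scaffold` and the toy sequence are assumptions made for this example.

```python
import re


def split_scaffold(scaffold: str, min_gap: int = 1):
    """Split a scaffold into contigs at runs of N (assembly gaps).

    Returns the contigs and the length of each gap between them.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs = [c for c in gap.split(scaffold) if c]
    gap_lengths = [len(m.group()) for m in gap.finditer(scaffold)]
    return contigs, gap_lengths


# Toy scaffold: three contigs joined by gaps of 10 and 5 unknown bases.
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "TTGA"
contigs, gap_lengths = split_scaffold(scaffold)
print(contigs)      # ['ACGTACGT', 'GGCCTTAA', 'TTGA']
print(gap_lengths)  # [10, 5]
```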


Assembly algorithms - Overlap-Layout-Consensus (OLC)

.pull-left[ .left[ 1 node = 1 read

1 edge = 1 overlap

Determine the best path through the graph

Remove redundant information

Process repeated many times

Sequences combined to form the final sequence ]]

.pull-right[ .image-100[ Overlap-layout-consensus genome assembly algorithm ]]

.footnote[https://www.researchgate.net/figure/Overlap-layout-consensus-genome-assembly-algorithm-Reads-are-provided-to-the-algorithm_fig2_26266221 ]
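
The overlap and merge idea can be sketched on toy reads. This is a greedy simplification written for illustration: real OLC assemblers tolerate sequencing errors, build a full overlap graph and compute a consensus, none of which is done here. The function names and reads are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # no remaining overlaps: several contigs are returned
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```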


Assembly algorithms - De Bruijn Graphs

.pull-left[ .left[ 1 node = 1 k-mer

1 edge = 1 overlap of k-1 bases between two k-mers

Find the path that consistently traverses the graph ]]

.pull-right[ .image-100[ Illustration of de Bruijn graph-based assembly ]]

.footnote[https://www.researchgate.net/figure/Illustration-of-de-Bruijn-graph-based-assembly_fig1_229437536 ]
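
A minimal sketch of the graph described on this slide, with k-mers as nodes and edges between k-mers that follow each other in a read. It assumes error-free reads and no repeats; the function names and reads are hypothetical, and production assemblers do far more (error correction, bubble and tip removal).

```python
from collections import defaultdict


def kmers(seq: str, k: int):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k: int):
    """Nodes are k-mers; an edge links two k-mers that are consecutive in a read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph[km]  # register every k-mer as a node, even without successors
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)  # left and right overlap by k-1 bases
    return dict(graph)


def walk(graph, start: str) -> str:
    """Follow edges from start while the path is unambiguous and spell the sequence."""
    seq, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break  # stop on a cycle
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    return seq


graph = build_graph(["ATGGCG", "GGCGTA", "CGTACC"], k=4)
print(walk(graph, "ATGG"))  # ATGGCGTACC
```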


Assembly - Scaffolding (and manual curation)

.left[ Hi-C: captures interactions between different parts of a genome by measuring the physical proximity of DNA segments in the nucleus:
Closely interacting DNA regions are cross-linked.
The DNA is digested, labeled, and ligated to create hybrid fragments.
These fragments are sequenced to reveal which parts of the genome were spatially close, even if they are distant in the linear sequence.

Hi-C allows for the transition from the assembly of fragmented contigs to:

.image-50[ Arima ]

.footnote[https://arimagenomics.com/applications/genome-assembly/ ]

Sequencing steps - The options

.left[ The choice mainly depends on the quantity and quality of the DNA and on the cost of the experiment, but many parameters need to be considered before performing an NGS experiment:


Sequencing steps - The technologies

.left[ Sequencing technology for assembly:


Sequencing technology for scaffolding:


Typical sequencing strategies - EBP (Earth BioGenome Project) recommendation

]


Bioinformatics steps - Assembly quality

.left[ Different levels of assembly exist today:

It can still contain gaps in between the scaffolds (shown as “NNNNNNNN” in the assembly)


Bioinformatics steps - Definitions

.left[ Haplotig: a contig whose sequence comes from a single haplotype. In an unphased assembly, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome.
Primary assembly: a complete assembly with long stretches of phased blocks.
Alternate assembly: an incomplete assembly consisting of haplotigs in heterozygous regions. An alternate assembly always accompanies a primary assembly. It is not useful by itself as it is fragmented and incomplete.
Haplotype-resolved assembly: sets of complete assemblies consisting of haplotigs, representing an entire diploid/polyploid genome. ]

.image-60[ Illustration of the assembly types ]


Computational resources and requirements

.left[ To be successful, you must have sufficient computing resources (CPUs, RAM, walltime, and storage).


Bioinformatics data formats

.left[ FASTA: a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. ]

.image-100[ Fasta format description ] Image licensed CC-BY 4.0 Hosseini et al. 2016

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
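
As a minimal illustration of the FASTA layout (a `>` header line followed by one or more sequence lines), the sketch below reads records from a string. The function name `parse_fasta` and the records are made up for the example; for real work an established parser is preferable.

```python
def parse_fasta(text: str) -> dict:
    """Return {record identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # The identifier is the first word; the rest is a free description.
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = ">contig_1 toy example\nACGT\nACGT\n>contig_2\nGGCC\n"
print(parse_fasta(example))  # {'contig_1': 'ACGTACGT', 'contig_2': 'GGCC'}
```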


Bioinformatics data formats

.left[ FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (Phred). Each sequence letter and each quality score is encoded with a single ASCII character for brevity. It is the standard sequencing output for Illumina and MGI sequencers. ]

.image-100[ Fastq format description ] Image licensed CC-BY 4.0 Hosseini et al. 2016

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
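
The single-character quality encoding can be decoded as sketched below, assuming the common Phred+33 offset (ASCII code minus 33) and strict four-line records. The function names and the read are hypothetical.

```python
def parse_fastq(text: str):
    """Yield (name, sequence, Phred scores) for each 4-line FASTQ record."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: the quality score is the ASCII code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def error_probability(phred: int) -> float:
    """Probability that a base call is wrong: 10^(-Q/10)."""
    return 10 ** (-phred / 10)


example = "@read_1\nACGT\n+\nII5!\n"
for name, seq, quals in parse_fastq(example):
    print(name, seq, quals)  # read_1 ACGT [40, 40, 20, 0]
```

A score of 20 corresponds to a 1% error probability, and 40 to 0.01%.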


Bioinformatics data formats

SAM (Sequence Alignment Map): a text-based format, originally developed by Heng Li, Bob Handsaker et al., for storing biological sequences aligned to a reference sequence.
BAM (Binary Alignment Map): the lossless, compressed binary representation of the SAM format. It is the standard sequencing output for PacBio sequencers.
CRAM (Compressed Reference-oriented Alignment Map): a compressed columnar file format for storing biological sequences aligned to a reference sequence.

.pull-left[ .image[ SAM format description ]]

.pull-right[ Image licensed CC-BY 4.0 Hosseini et al. 2016 ]

.footnote[Hosseini, M., Pratas, D. & Pinho, A. J. A Survey on Data Compression Methods for Biological Sequences. Information 7, 56 (2016).]
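
A SAM alignment line is tab-separated, with eleven mandatory fields followed by optional tags. The sketch below parses only the mandatory fields and tests two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The class and function names and the alignment line are invented for illustration; BAM and CRAM are binary and need a dedicated library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # Phred+33 quality string


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one alignment line (not a header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)


def is_reverse(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)


rec = parse_sam_line("read_1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(rec.rname, rec.pos, is_unmapped(rec), is_reverse(rec))  # chr1 100 False True
```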


After the assembly, how do we assess its quality?

.image-100[ The 3C for genome assembly quality control ]


Continuity : N50

.left[ N50: given a set of sequences of varying lengths, the N50 is the length L of the shortest contig such that contigs of length L or longer cover at least 50% of the assembly.
L50: given a set of sequences of varying lengths, the L50 is the smallest number of sequences whose summed length makes up at least 50% of the assembly.
N50 describes a sequence length, whereas L50 describes a number of sequences.

Example:

.image-100[ Schematic explanation of N50 ]

N50 = 8 and L50 = 4

.footnote[Alhakami, H., Mirebrahim, H., & Lonardi, S. (2017). A comparative evaluation of genome assembly reconciliation tools. Genome biology, 18(1), 1-14.]
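
Both definitions can be computed in a single pass over the lengths sorted from longest to shortest. This is a minimal sketch; the function name and the contig lengths are made up and are unrelated to the figure above.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # length is the N50, count is the L50
            return length, count
    raise ValueError("no sequence lengths given")


# Hypothetical assembly of 7 contigs, 300 bases in total.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(contig_lengths))  # (70, 2)
```

Here the two longest contigs (80 + 70 = 150) already reach half of the 300 bases, so N50 is 70 and L50 is 2.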


.pull-right[ .image-55[ Quast report ]]

.pull-left[

Tool to evaluate continuity : QUAST


Completeness : BUSCO score

.pull-left[ BUSCO: Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs
Quantitative assessment of genome assembly based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.

.image-70[ Formula to estimate assembly completeness for core genes ] ]

.pull-right[ .image-70[ Example of BUSCO plot for Nosema species (Microsporidia) ]]

.footnote[Tip: reference databases are built from known genomes, so species with few or no closely related genomes available can score very poorly.]
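
Completeness percentages follow directly from the number of orthologs in each category (complete single-copy, complete duplicated, fragmented, missing). The sketch below is an assumption-laden illustration: the function name is hypothetical, the counts are invented, and the one-line layout is only modelled on the style of BUSCO short summaries.

```python
def busco_summary(single: int, duplicated: int, fragmented: int, missing: int) -> str:
    """Format ortholog counts as percentages of the total number searched."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no orthologs counted")

    def pct(count: int) -> float:
        return 100 * count / total

    complete = single + duplicated
    return (f"C:{pct(complete):.1f}%[S:{pct(single):.1f}%,D:{pct(duplicated):.1f}%],"
            f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}")


# Hypothetical counts for a lineage dataset of 1000 orthologs.
print(busco_summary(single=930, duplicated=20, fragmented=30, missing=20))
# C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:1000
```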


Correctness

**Proportion of the assembly that is free from mistakes**


Evaluation against reference genome (or second haplotype)

.image-30[ Example of a dot plot between 2 genomes. ]
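
The points of a dot plot can be approximated by listing the positions where two sequences share an exact k-mer. This minimal sketch compares the forward strand only and uses toy sequences; the function name is hypothetical, and dedicated whole-genome aligners are used in practice.

```python
from collections import defaultdict


def kmer_matches(ref: str, query: str, k: int):
    """Return (ref position, query position) pairs of shared exact k-mers."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(query) - k + 1)
            for i in index.get(query[j:j + k], [])]


# Identical sequences give points on the main diagonal.
print(kmer_matches("ACGTTGCA", "ACGTTGCA", k=4))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

Points that leave the diagonal indicate rearrangements, insertions or deletions between the two sequences.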


Assembly QC Tips


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

References

  1. Hosseini, M., D. Pratas, and A. Pinho, 2016 A Survey on Data Compression Methods for Biological Sequences. Information 7: 56. 10.3390/info7040056

Funding

These individuals or organisations provided funding support for the development of this resource

Gallantries
This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.