Genome assembly using PacBio data

Overview
Creative Commons License: CC-BY
Questions:
  • How do you perform a genome assembly with PacBio data?

  • How do you check assembly quality?

Objectives:
  • Assemble a genome with PacBio data

  • Assess assembly quality

Time estimation: 6 hours
Level: Intermediate
Published: Nov 29, 2021
Last modification: Apr 23, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00033
Rating: 5.0 (1 recent rating, 3 all time)
Revision: 11

In this tutorial, we will assemble the genome of Aspergillus niger, a fungal species in the genus Aspergillus, from PacBio sequencing data. These data will be downloaded from ENA. We will then assess the quality of the resulting assembly, in particular by comparing it to a reference assembly that is also available on ENA.

Agenda

In this tutorial, we will cover:

  1. Get data
    1. Get data from ENA
    2. Get reference genome from ENA
  2. Genome Assembly
  3. Quality assessment
    1. Genome assemblies comparison with Quast
    2. Genome assembly assessment with BUSCO
  4. Conclusion

Get data

We will use long-read sequencing data: PacBio HiFi (High Fidelity) reads of the Aspergillus niger genome. These data are available on ENA. Later, we will also use a reference genome assembly downloaded from ENA.

Get data from ENA

Hands On: Data upload from ENA
  1. Create a new history for this tutorial
  2. Import the files from ENA

    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR317/012/SRR31719412/SRR31719412_subreads.fastq.gz
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename the datasets
  4. Check that the datatype is fastq.gz

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select fastq.gz from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
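
If you also keep a local copy of the reads, a quick sanity check outside Galaxy can confirm that the file really is gzipped FASTQ. The following is a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function names and the local path in the comment are hypothetical and not part of the tutorial.

```python
import gzip
import io


def fastq_stats(handle):
    """Return (read_count, total_bases) for a FASTQ text stream.

    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 0 and not line.startswith("@"):
            raise ValueError(f"line {i + 1}: expected a header starting with '@'")
        if i % 4 == 1:
            reads += 1
            bases += len(line.strip())
    return reads, bases


def fastq_gz_stats(path):
    """Same as fastq_stats, but reads a gzip-compressed file from disk."""
    with gzip.open(path, "rt") as handle:
        return fastq_stats(handle)


# Tiny in-memory example (made-up reads, not the ENA data).
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(fastq_stats(demo))  # (2, 10)

# On a real download (hypothetical local path):
# print(fastq_gz_stats("SRR31719412_subreads.fastq.gz"))
```

A header that does not start with "@" raises an error, which is usually a sign that the file is not FASTQ or that the download was corrupted.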

Get reference genome from ENA

Hands On: Data upload from ENA
  1. The reference genome is available on ENA: ASM4765177v1 assembly for Aspergillus niger
  2. Download the WGS Set FASTA (JBKZXA01.fasta.gz) to your computer. Make sure you download the WGS Set FASTA (JBKZXA01.fasta.gz) and NOT the ALL Set FASTA
  3. Upload this file on Galaxy (Upload –> Choose local file –> Select the file –> Start –> Close)
  4. Check that the datatype is fasta.gz

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select fasta.gz from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Genome Assembly

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" (CYOT) section (also known as "Choose Your Own Analysis" (CYOA)), where you can select between multiple paths: assembly with Hifiasm or assembly with Flye. Follow either path, or both if you want to compare the two assemblers.

Hifiasm and Flye are two well-known assemblers.

Hifiasm is a fast haplotype-resolved de novo assembler initially designed for PacBio HiFi reads. Hifiasm writes its assembly graphs in the GFA format, so a conversion step to FASTA is necessary. The GFA 1.0 format is a tab-delimited text format for describing a set of sequences and their overlaps.
Hifiasm produces arguably the best single-sample telomere-to-telomere assemblies combining HiFi, ultralong and Hi-C reads, and it is one of the best haplotype-resolved assemblers for trio-binning assembly given parental short reads. For a human genome, hifiasm can produce a telomere-to-telomere assembly in one day.

Hands On: Assembly with Hifiasm
  1. Hifiasm (Galaxy version 0.25.0+galaxy0) with the following parameters:
    • “Mode”: Standard
    • “Input reads”: the raw data (fastq.gz)
    • “Output log file”: Set to yes

    The tool produces five datasets: Haplotype-resolved raw unitig graph, Haplotype-resolved processed unitig graph without small bubbles, Primary assembly contig graph, Alternate assembly contig graph, [hap1]/[hap2] contig graph.

  2. GFA to FASTA (Galaxy version 0.1.2) with the following parameters:
    • “Input GFA file”: primary assembly contig graph
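
To make the conversion step less opaque, here is a minimal sketch of what turning a GFA 1 file into FASTA involves: every segment ("S") line carries a name and a sequence, and those become FASTA records, while link ("L") and header ("H") lines are skipped. This is an illustration only, not the implementation of the Galaxy GFA to FASTA tool; the function name and the example contig names are assumptions.

```python
def gfa_to_fasta(gfa_lines, width=60):
    """Yield FASTA lines for every segment (S) record found in GFA 1 lines.

    Segments whose sequence is '*' (sequence not stored in the file) are skipped.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # headers, links and other record types carry no sequence
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue
        yield f">{name}"
        # Wrap the sequence to a fixed line width, as is conventional in FASTA.
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Made-up three-record graph: two segments joined by one link.
example = [
    "H\tVN:Z:1.0",
    "S\tptg000001l\tACGTACGT\tLN:i:8",
    "S\tptg000002l\tGGCC",
    "L\tptg000001l\t+\tptg000002l\t-\t0M",
]
print("\n".join(gfa_to_fasta(example)))
```

Note that the overlap information in the link lines is lost in this conversion, which is why the graph files are still worth keeping for inspection.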
Question

What are the different output datasets from Hifiasm?

  • Haplotype-resolved raw unitig graph: This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
  • Haplotype-resolved processed unitig graph without small bubbles: This graph ‘pops’ small bubbles in the raw unitig graph; small bubbles might be caused by somatic mutations or noise in data, which are not the real haplotype information.
  • Primary assembly contig graph: This graph includes a complete assembly with long stretches of phased blocks, though there may be some haplotype collapse.
  • Alternate assembly contig graph: This graph consists of all contigs that are discarded from the primary contig graph.
  • [hap1]/[hap2] contig graph: Each graph consists of phased contigs (output only with Hi-C phasing enabled).

Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly. More information about the Flye assembler is available in the Flye documentation.

Hands On: Assembly with Flye
  1. Flye (Galaxy version 2.9.5+galaxy1) with the following parameters:
    • “Input reads”: the raw data (fastq.gz)
    • “Mode”: PacBio HiFi
    • “Number of polishing iterations”: 1
    • “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointig assembly
    • “Generate a log file”: Set to yes

    The tool produces four datasets: consensus, assembly graph, graphical fragment assembly and assembly info

Question

What are the different output datasets?

  • The first dataset (consensus) is a fasta file containing the final assembly (1461 contigs). You may notice that the result (number of contigs) you obtained is slightly different from the one presented here. This is because the Flye assembly algorithm does not always give exactly the same results.
  • The second and third datasets are assembly graph files. These graphs represent the final assembly of a genome; they are based on the reads and their overlap information. Some tools, such as Bandage, let you visualize the assembly graph.
  • The fourth dataset is a tabular file (assembly_info) containing extra information about contigs/scaffolds.
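
The assembly_info table can be summarized with a few lines of code if you download it. The sketch below assumes a tab-separated file whose header includes columns named `length` and `cov.`; that layout is an assumption to check against the header of your own file. The function name and the demo rows are made up for illustration.

```python
import csv
import io


def summarize_assembly_info(handle):
    """Summarize a Flye-style assembly_info table read from a text stream.

    Assumes a tab-separated header containing 'length' and 'cov.' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    lengths = []
    weighted_cov = 0.0
    for row in reader:
        length = int(row["length"])
        lengths.append(length)
        # Weight each contig's coverage by its length.
        weighted_cov += length * float(row["cov."])
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_length": total,
        "longest": max(lengths, default=0),
        "mean_coverage": weighted_cov / total if total else 0.0,
    }


# Made-up two-contig table, not real Flye output.
demo = io.StringIO(
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t3000\t40\tN\tN\t1\t*\t1\n"
    "contig_2\t1000\t20\tN\tN\t1\t*\t2\n"
)
print(summarize_assembly_info(demo))
```

The length-weighted mean gives long contigs more influence than short ones, which is usually closer to the genome-wide coverage than a plain average over contigs.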

Quality assessment

Genome assemblies comparison with Quast

One way to compute assembly metrics is QUAST (QUality ASsessment Tool), which evaluates genome assemblies by computing various metrics. See the Quast manual for details. Quast can compare your assembly with the reference genome and report statistics such as the genome fraction. However, when a reference genome is specified, the Quast results do not display the assembly N50. In this tutorial, you will therefore generate two Quast reports: one specifying a reference genome and one without.
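
Since N50 is central to the comparison, it may help to see how it is computed. N50 is the length of the contig at which, going from the longest contig down, the running total first reaches half of the assembly size. The following is a minimal sketch of that definition with a hypothetical function name and made-up contig lengths; it is not Quast's code.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half the total.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Total is 300, so half is 150: 80 + 70 reaches it, and N50 is 70.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

A higher N50 at a similar total size indicates a more contiguous assembly, which is why it is reported alongside the number of contigs.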

Hands On: Quast
  1. Quast (Galaxy version 5.0.2+galaxy3) specifying a reference genome with the following parameters:
    • “Assembly mode?”: Individual assembly (1 contig file per sample)
    • “Use customized names for the input files?”: No, use dataset names
      • “Contigs/scaffolds file”: JBKZXA01.fasta.gz (reference assembly), fasta file (output of GFA to FASTA tool) and/or consensus (output of Flye tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: Yes
        • “Reference genome”: JBKZXA01.fasta.gz (reference assembly)
      • “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...
  2. Quast (Galaxy version 5.0.2+galaxy3) without specifying a reference genome with the following parameters:
    • “Assembly mode?”: Individual assembly (1 contig file per sample)
    • “Use customized names for the input files?”: No, use dataset names
      • “Contigs/scaffolds file”: JBKZXA01.fasta.gz (reference assembly), fasta file (output of GFA to FASTA tool) and/or consensus (output of Flye tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: No
      • “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...
Question

Compare the different metrics obtained for the assembly you generated and the reference genome.

We compare the metrics of the three assemblies:

  • The reference genome (JBKZXA01.fasta.gz): 8 contigs, N50 = 3.5 Mb, max contig length = 6.2 Mb, size = 35.4 Mb, 49.51% GC
  • The Hifiasm assembly: 105 contigs, N50 = 4.9 Mb, max contig length = 7.1 Mb, size = 42.1 Mb, 47.91% GC
  • The Flye assembly: 13 contigs, N50 = 4.9 Mb, max contig length = 6.8 Mb, size = 38.7 Mb, 49.31% GC

Genome assembly assessment with BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative measure of genome assembly completeness based on evolutionarily informed expectations of gene content. Details about this tool are available on the BUSCO website.

Hands On: BUSCO on assembly
  1. Busco (Galaxy version 5.8.0+galaxy0) with the following parameters:
    • “Tool version”: Galaxy Version 5.8.0+galaxy0
    • “Sequences to analyse”: select Multiple datasets, then JBKZXA01.fasta.gz (reference assembly), fasta file (output of GFA to FASTA tool) and/or consensus (output of Flye tool)
    • “Auto-detect or select lineage”: Select lineage
      • “Lineage”: Fungi
      • “Which outputs should be generated”: short summary text; summary image
Question

Compare the number of BUSCO genes identified in the generated assembly and in the reference genome. What do you observe?

The short summary generated by BUSCO indicates that the reference genome (JBKZXA01.fasta.gz) contains:

  1. 751 complete BUSCOs (749 single-copy and 2 duplicated),
  2. 0 fragmented BUSCOs,
  3. 7 missing BUSCOs.

The short summary generated by BUSCO indicates that the Hifiasm assembly contains:

  1. 756 complete BUSCOs (754 single-copy and 2 duplicated),
  2. 0 fragmented BUSCOs,
  3. 2 missing BUSCOs.

The short summary generated by BUSCO indicates that the Flye assembly contains:

  1. 755 complete BUSCOs (753 single-copy and 2 duplicated),
  2. 0 fragmented BUSCOs,
  3. 3 missing BUSCOs.

The BUSCO analysis confirms that the Hifiasm and Flye assemblies are of similar quality to each other and to the reference genome, with similar numbers of complete, fragmented and missing BUSCO genes.
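
BUSCO results are often quoted as percentages rather than raw counts. As a small worked example, the counts listed above can be converted with the arithmetic below; the dictionary layout and function name are illustrative choices, and the numbers are the ones reported in this tutorial.

```python
# Counts copied from the BUSCO short summaries in this tutorial.
busco_counts = {
    "reference": {"single": 749, "duplicated": 2, "fragmented": 0, "missing": 7},
    "hifiasm": {"single": 754, "duplicated": 2, "fragmented": 0, "missing": 2},
    "flye": {"single": 753, "duplicated": 2, "fragmented": 0, "missing": 3},
}


def busco_percentages(counts):
    """Convert BUSCO category counts into percentages of all searched BUSCOs."""
    total = sum(counts.values())
    complete = counts["single"] + counts["duplicated"]
    return {
        "total": total,
        "complete_pct": round(100 * complete / total, 1),
        "fragmented_pct": round(100 * counts["fragmented"] / total, 1),
        "missing_pct": round(100 * counts["missing"] / total, 1),
    }


for name, counts in busco_counts.items():
    print(name, busco_percentages(counts))
```

All three datasets sum to the same total number of searched BUSCOs, which is a quick check that the counts were copied consistently.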

Conclusion

This pipeline shows how to generate and evaluate a genome assembly from PacBio long-read data. Once you are satisfied with your genome sequence, you might want to annotate it: have a look at the RepeatMasker and Funannotate tutorials to learn how to do it!