Assembly of the mitochondrial genome from PacBio HiFi reads

Overview
Questions:
  • How do you assemble the mitochondrial genome from PacBio HiFi reads?

Objectives:
  • Generate a mitochondrial assembly
  • Understand the outputs of MitoHiFi

Time estimation: 1 hour
Published: Sep 3, 2024
Last modification: Sep 3, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00453
Revision: 1

Introduction

This tutorial will show you how to assemble a mitochondrial genome from PacBio HiFi data using MitoHiFi (Uliano-Silva et al. 2023). Combined with the tutorials “Using the VGP workflows to assemble a vertebrate genome with HiFi and Hi-C data” and “Decontamination of a genome assembly”, this allows you to produce a reference assembly for both the nuclear and the mitochondrial DNA of a vertebrate species.

This tutorial uses data from the zebra finch (Taeniopygia guttata) generated by the Vertebrate Genomes Project. To make the tutorial run faster, we downsampled the reads that did not align to the mitochondrial genome.
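
The downsampling idea can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used to prepare the tutorial data: the read names, the set of mitochondrial read names, the 10% keep fraction, and the helper `downsample_non_mito` are all hypothetical.

```python
import random

def downsample_non_mito(read_ids, mito_ids, keep_fraction, seed=0):
    """Keep every mitochondrial read and a random fraction of the others."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for read_id in read_ids:
        if read_id in mito_ids:
            kept.append(read_id)   # mitochondrial reads are always kept
        elif rng.random() < keep_fraction:
            kept.append(read_id)   # other reads survive with probability keep_fraction
    return kept

# Hypothetical example: 1000 reads, of which 20 aligned to the mitogenome
all_reads = [f"read_{i}" for i in range(1000)]
mito_reads = {f"read_{i}" for i in range(20)}
subset = downsample_non_mito(all_reads, mito_reads, keep_fraction=0.1)
print(len(subset))
```

Because every mitochondrial read is retained, the mitogenome assembly is unaffected while the dataset becomes much smaller.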

Comment: Run this analysis on "real" data

If you want to run this analysis on a real sequencing library generated by the Vertebrate Genomes Project, you can find the PacBio HiFi data on Genome Ark and import it into Galaxy. Genome Ark is available as a remote repository on the three main public Galaxy instances (.org, .eu, .org.au).

As an alternative to uploading the data from a URL or your computer, the files may also have been made available through Choose remote files:

  1. Click on Upload Data on the top of the left panel
  2. Click on Choose remote files and scroll down to find your data folder or type the folder name in the search box on the top.

    • Look for your data under: Genome Ark -> species -> Taeniopygia_guttata -> bTaeGut2 -> genomic_data -> pacbio_hifi -> *_reads.fasta.gz
  3. Click on OK
  4. Click on Start
  5. Click on Close
  6. The dataset will begin loading in your history.

The assembly uses the wrapped workflow MitoHiFi. MitoHiFi:

  • Extracts mitochondrial reads (based on a BLAST search against an existing reference mitogenome) and uses Hifiasm (Cheng et al. 2021) to assemble them.
  • Removes nuclear mitochondrial DNA sequences (NUMTs) from the potential mitogenome contigs
  • Generates a circularized and annotated genome for all potential mitogenome contigs
  • Selects a representative for the final mitochondrial assembly

Agenda

In this tutorial, we will cover:

  1. Introduction
    1. Get data
  2. Download the mitogenome for a related species
    1. Assemble the mitochondrial genome with MitoHiFi
  3. Conclusion

Get data

Hands-on: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    https://zenodo.org/records/13345315/files/PacBio_reads.fastqsanger.gz
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

  3. Check that the datatype is fastqsanger.gz

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • In Assign Datatype, select fastqsanger.gz from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
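
If you also keep a local copy of the file, a quick check like the one below can confirm that it really is gzip-compressed FASTQ before you upload it. This is a minimal sketch, assuming the usual four-line FASTQ layout; `looks_like_fastq_gz` is a hypothetical helper, and the demo record is made up so that the example runs without downloading anything.

```python
import gzip
import os
import tempfile

def looks_like_fastq_gz(path):
    """Return True if the file is gzip-compressed and its first record looks like FASTQ."""
    with open(path, "rb") as handle:
        if handle.read(2) != b"\x1f\x8b":   # gzip magic bytes
            return False
    with gzip.open(path, "rt") as handle:
        header = handle.readline()
        sequence = handle.readline().strip()
        separator = handle.readline()
        quality = handle.readline().strip()
    return (
        header.startswith("@")
        and separator.startswith("+")
        and len(sequence) > 0
        and len(sequence) == len(quality)   # one quality character per base
    )

# Self-contained demo on a tiny made-up record written to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastqsanger.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@read_1\nACGTACGT\n+\nIIIIIIII\n")
print(looks_like_fastq_gz(demo_path))
```

To check the tutorial file instead, call `looks_like_fastq_gz` with the path to your downloaded copy of PacBio_reads.fastqsanger.gz.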

Download the mitogenome for a related species

To assemble the mitogenome from our PacBio data, MitoHiFi needs a reference mitogenome from a related species to fish out the mitochondrial reads (the reads are compared against this reference with BLAST). To download this reference, we use the MitoHiFi tool with the operation “Find a close-related mitochondrial reference genome”.

  1. MitoHiFi (Galaxy version 3+galaxy0) with the following parameters:
    • “Operation type selector”: Find a close-related mitochondrial reference genome
      • “Species name”: Taeniopygia guttata (enter the Latin name of the species you are assembling)
      • “Email”: your.email@service.com (enter your own email address)
      • “Minimal appropriate length”: 15000. Vertebrate mitochondrial genomes are typically at least 14 kbp long, so a value in this range ensures that the reference we retrieve is a complete mitogenome.
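
Outside Galaxy, this step corresponds to MitoHiFi's `findMitoReference.py` helper script. The sketch below only assembles and prints the command; it does not run it. The flag names follow the MitoHiFi README as we understand it, so treat them as an assumption and check them against `findMitoReference.py --help` for your installed version. The output folder name is a placeholder.

```python
import shlex

def build_find_reference_command(species, email, outfolder, min_length):
    """Assemble (but do not run) a findMitoReference.py call as an argument list."""
    return [
        "findMitoReference.py",
        "--species", species,
        "--email", email,
        "--outfolder", outfolder,
        "--min_length", str(min_length),
    ]

command = build_find_reference_command(
    species="Taeniopygia guttata",
    email="your.email@service.com",
    outfolder="reference_mitogenome",   # hypothetical folder name
    min_length=15000,
)
print(shlex.join(command))   # shell-safe rendering; the species name gets quoted
```

Keeping the command as a list rather than a single string avoids quoting problems with the space in the species name if you later pass it to `subprocess.run`.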

Assemble the mitochondrial genome with MitoHiFi

Hands-on: Assemble the mitochondrial genome
  1. Create a collection with your PacBio HiFi Reads

    • Click on Select Items at the top of the history panel
    • Check the fastqsanger.gz dataset containing the HiFi reads
    • Click 1 of N selected and choose Build Dataset List
    • Enter PacBio Reads as the name of your collection
    • Click Create collection to build your collection
    • Click on the checkmark icon at the top of your history again

  2. MitoHiFi (Galaxy version 3+galaxy0) with the following parameters:

    • “Operation type selector”: Run MitoHiFi
      • “Input mode”: Pacbio Hifi Reads
        • “Pacbio Hifi reads”: PacBio Reads (Input dataset collection)
      • “Close-related mitogenome in fasta format”: MitoHiFi on : reference genome (FASTA) (output of MitoHiFi tool)
      • “Close-related mitogenome in genbank format”: MitoHiFi on : reference genome (genbank) (output of MitoHiFi tool)
      • “Genetic code”: Vertebrate mitochondrial code
      • In “Advanced options”:
        • “Blast percentage identity”: 70. This setting filters the potential mitochondrial contigs: with a value of 70, we retain contigs for which at least 70% of their length is covered by the BLAST match. Lower this value if you expect more sequence divergence among the mitogenomes of your taxa, and raise it if you expect less.
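
To make the meaning of this threshold concrete, here is a minimal sketch of the filter as described above. It is not MitoHiFi's own code: the contig names, lengths, and aligned lengths are invented, and `passes_blast_filter` is a hypothetical helper.

```python
def passes_blast_filter(contig_length, aligned_length, min_percent=70):
    """True if at least min_percent of the contig lies within the BLAST match."""
    percent_in_match = 100 * aligned_length / contig_length
    return percent_in_match >= min_percent

# Invented contigs: (name, contig length in bp, bp covered by the BLAST match)
contigs = [
    ("contig_A", 16800, 16500),   # almost fully covered: likely mitochondrial
    ("contig_B", 40000, 9000),    # mostly unmatched: e.g. nuclear sequence containing a NUMT
    ("contig_C", 17000, 11900),   # exactly 70% covered
]

retained = [name for name, length, aligned in contigs
            if passes_blast_filter(length, aligned)]
print(retained)   # ['contig_A', 'contig_C']
```

Lowering `min_percent` in this sketch would let `contig_B` through, which mirrors the trade-off in the tool: a more permissive threshold tolerates more divergence but also admits more non-mitochondrial contigs.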

Outputs of MitoHiFi:

  • Final mitogenome (FASTA). The mitochondrial genome circularized and rotated to start at tRNA-Phe.
  • Final mitogenome (genbank). The final mitogenome annotated in GenBank format.
  • Final mitogenome annotation (png). The predicted genes in the final mitogenome.
  • Final mitogenome coverage (png). The sequencing coverage along the final mitogenome.
  • Contigs stats (TSV). Contains the statistics of your assembled mitogenomes such as the number of genes, size, whether it was circularized or not, if the sequence has frameshifts, and other metrics.
  • Reads mapped to close-related mtDNA (FASTA). All reads that mapped to the related mitogenome, and the subset remaining after filtering by size.
  • Hifiasm contigs (fasta). The results of running Hifiasm on the mitochondrial reads.
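
Because the mitogenome is circular, "rotating" it only changes where the linear FASTA record starts, not the sequence itself. A minimal sketch of that operation, using a made-up sequence and start position (in a real assembly the start would be the annotated position of tRNA-Phe):

```python
def rotate_circular(sequence, new_start):
    """Rotate a circular sequence so that it begins at 0-based index new_start."""
    new_start %= len(sequence)
    return sequence[new_start:] + sequence[:new_start]

toy_genome = "AAACCCGGGTTT"      # made-up circular sequence
rotated = rotate_circular(toy_genome, 6)
print(rotated)                   # GGGTTTAAACCC
```

Starting every mitogenome at the same gene makes assemblies from different individuals or species directly comparable.
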

Image produced by MitoHiFi showing the genes annotated in our mitogenome.

Figure 1: Final mitogenome annotation
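
If you download the Contigs stats (TSV) output, you can inspect it with a few lines of Python. The sketch below uses a small made-up table so that it runs on its own; the column names are illustrative assumptions rather than the tool's guaranteed header, so check the first line of your own file and adjust them.

```python
import csv
import io

# Made-up table standing in for the Contigs stats output; the column names
# are assumptions for illustration only.
example_tsv = (
    "contig_id\tlength_bp\tnumber_of_genes\twas_circular\tframeshifts_found\n"
    "contig_1\t16853\t37\tTrue\tNo frameshift found\n"
    "contig_2\t15210\t33\tFalse\tNo frameshift found\n"
)

rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))

# Keep circularized contigs and sort them by gene count, highest first
circular = [row for row in rows if row["was_circular"] == "True"]
circular.sort(key=lambda row: int(row["number_of_genes"]), reverse=True)

for row in circular:
    print(row["contig_id"], row["length_bp"], row["number_of_genes"])
```

To read a downloaded file instead, replace `io.StringIO(example_tsv)` with an open file handle, for example `open("contigs_stats.tsv", newline="")`, where the file name is whatever you saved the dataset as.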

Conclusion

In this tutorial, we learned how to assemble the mitochondrial genome using PacBio HiFi reads and MitoHiFi. You can try this tutorial on your own data using the full HiFi read set you’d use for the nuclear genome assembly, since the filtering for mitochondrial reads happens within MitoHiFi.