Marine Omics identifying biosynthetic gene clusters

Author(s): Marie Josse
Overview
Creative Commons License: CC-BY
Questions:
  • Which biological questions are addressed by the tutorial?

  • Which bioinformatics techniques are important to know for this type of data?

Objectives:
  • Follow a marine omics analysis

  • Learn to conduct a secondary metabolite biosynthetic gene cluster (SMBGC) annotation

  • Discover SanntiS, a tool for identifying BGCs in genomic and metagenomic data

  • Manage fasta files

Requirements:
Time estimation: 3 hours
Supporting Materials:
Published: Aug 19, 2024
Last modification: Sep 16, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00450
Revision: 2

Introduction

In the first part of this tutorial, you will learn how to produce a protein FASTA file from a nucleotide FASTA file using Prodigal, a tool that predicts protein-coding genes from DNA sequences.

Then, you will use InterProScan to create a tabular file. InterProScan is a batch tool for querying the InterPro database. It helps identify and predict the functions of proteins by comparing them against known databases.

Finally, you will discover SanntiS, which is used both to build a GenBank file and, above all, to detect and annotate biosynthetic gene clusters (BGCs).

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Get data
  3. Import and launch the workflow
  4. Prodigal Gene Predictor: generate a protein fasta file
    1. Prodigal Gene Predictor
    2. SanntiS for building a Genbank file
    3. Regex Find And Replace
  5. InterProScan
  6. SanntiS for annotating biosynthetic gene clusters
  7. Conclusion
  8. Extra information

Get data

The FASTA file used in this example is intentionally a very small fraction of a genome (spanning exactly 1 BGC). Its main purpose is to quickly check that all the pieces of the workflow are working.

Hands-on: Data Upload
  1. Create a new history for this tutorial and give it a name (for example “Marine Omics: SMBGC annotation”) for you to find it again later if needed.

    To create a new history, simply click the new-history icon at the top of the history panel.

  2. Import the file with this link https://figshare.com/ndownloader/files/48574534 and name it BGC0001472.fna

    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

  3. Rename the dataset to BGC0001472.fna

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

Hands-on: Choose Your Own Tutorial

This is a "Choose Your Own Tutorial" section, where you can select between multiple paths. Click one of the buttons below to select how you want to follow the tutorial

Do you want to run the workflow or discover the tools one by one?

Import and launch the workflow

Hands-on: Import the workflow
  • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
  • Option 1: use the URL
    • Click on Import at the top-right of the screen
    • Paste the URL of the workflow into the box labelled “Archived Workflow URL”: https://earth-system.usegalaxy.eu/u/marie.josse/w/marine-omics-identifying-biosynthetic-gene-clusters
  • Option 2: use the workflow name
    • Click on Public workflows at the top-right of the screen
    • Search for Marine Omics identifying biosynthetic gene clusters
    • In the workflow preview box click on Import
  • Click the Import workflow button
Hands-on: Run the workflow
  • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
  • Click on the Run workflow button next to your workflow
  • Important: select Yes for “Workflow semi automatic”
  • Configure the workflow as needed with the 2 datasets you uploaded right before (BGC0001472.fna)
  • Click the Run Workflow button at the top-right of the screen
  • You may have to refresh your history to see the queued jobs

Now you don’t have to do anything else. You should see all the different steps of the workflow appear in your history. When the workflow is fully completed you should have the following history.

Image of the history with all the steps of the workflow.

Figure 1: History

Workflows are a powerful Galaxy feature that lets you scale up your analysis by running it end to end with a single click. A workflow invocation is one run (execution) of a workflow. To keep the provenance of an invocation, you can export it from Galaxy as an RO-Crate that follows the Workflow Run Crate profile.

After the workflow has completed, we can export the RO-Crate. The crate does not appear in your history, but can be accessed from the User -> Workflow Invocations menu on the top bar.

Hands-on: Extract a RO-Crate
  1. In the top right of your history, go to History options -> Show Invocations
Image of the history options.

Figure 2: History options

Our latest workflow run should be listed at the top.

  2. Click on it to expand it:
Image of the workflow invocation.

Figure 3: Workflow invocation
  3. Click on the Export tab in the expanded view of the workflow invocation. You should see a page that contains three download options:
    • Research Object Crate (RO-Crate)
    • BioCompute Object
    • File
  4. Click on the Generate option of the RO-Crate box (1st box)
Image of the RO-Crate download.

Figure 4: RO-Crate

Great work! You have created a Workflow Run Crate. This makes it easy to track the provenance of the executed workflow.
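
If you want to look inside the downloaded crate, the sketch below lists the entities described in its metadata file. It assumes the export is a zip archive with a ro-crate-metadata.json file at its root, as the RO-Crate specification describes. The function names and the toy crate built here are illustrative only and do not come from Galaxy.

```python
import io
import json
import zipfile


def list_crate_entities(crate_bytes):
    """Return (id, type) pairs found in the crate's ro-crate-metadata.json."""
    with zipfile.ZipFile(io.BytesIO(crate_bytes)) as archive:
        # Assumption: the metadata file sits at the root of the archive.
        with archive.open("ro-crate-metadata.json") as handle:
            metadata = json.load(handle)
    return [(e.get("@id"), e.get("@type")) for e in metadata.get("@graph", [])]


def make_toy_crate():
    """Build a tiny in-memory crate so the example runs without a real export."""
    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
            {"@id": "./", "@type": "Dataset"},
        ],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ro-crate-metadata.json", json.dumps(metadata))
    return buffer.getvalue()


if __name__ == "__main__":
    for entity_id, entity_type in list_crate_entities(make_toy_crate()):
        print(entity_id, entity_type)
```

With a real export, you would read the downloaded file in binary mode and pass its bytes to list_crate_entities.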

Prodigal Gene Predictor: generate a protein fasta file

Prodigal Gene Predictor

Prodigal is a tool that predicts protein-coding genes from DNA sequences. It takes a nucleotide FASTA file as input and identifies regions that are likely to code for proteins. The output is a protein FASTA file where each sequence represents a predicted protein. Some of these sequences end with an asterisk (*), which marks the end of a complete protein sequence identified by Prodigal. This asterisk is added when Prodigal detects a full protein-coding region that ends with a stop codon. Sequences without an asterisk either represent partial proteins or do not end in a typical stop codon.
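
To make the role of the asterisk concrete, here is a minimal sketch that reads a protein FASTA file and separates the sequences that end with * from those that do not. The toy sequences and names (gene_1, gene_2) are invented for illustration and are not taken from the tutorial dataset.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines, so concatenate them.
            records[header] += line
    return records


def split_by_stop(records):
    """Return (headers ending with '*', headers without a trailing '*')."""
    with_stop = [h for h, seq in records.items() if seq.endswith("*")]
    without_stop = [h for h, seq in records.items() if not seq.endswith("*")]
    return with_stop, without_stop


if __name__ == "__main__":
    # Invented toy input: gene_1 ends with a stop marker, gene_2 does not.
    toy = ">gene_1\nMKTAYIAK\nQRQISF*\n>gene_2\nMSLLTEVE\n"
    with_stop, without_stop = split_by_stop(read_fasta(toy))
    print("ends with *:", with_stop)
    print("no trailing *:", without_stop)
```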

Hands-on: Run prodigal
  1. Prodigal Gene Predictor ( Galaxy version 2.6.3+galaxy0) with the following parameters:
    • “Specify input file”: BGC0001472.fna (Input the nucleotide fasta file)
    • “Specify mode”: Meta : Anonymous sequences, analyze using preset training files, ideal for metagenomic data or single short sequences

    You don’t need to change any other parameters; leave them on their default values.

  2. Click on Run Tool

You should see 4 new outputs in your history. One of them is named Prodigal Gene Predictor on data 1 : protein translations file. Click on it, then click on the eye icon to view its content.

Image of the protein fasta file from prodigal and of the new history with all the new outputs.

Figure 5: Prodigal outputs

Notice that each sequence shown here ends with a *. We will remove this asterisk later. First, we will use this protein file to build the GenBank file that SanntiS needs to make an SMBGC annotation.

SanntiS for building a Genbank file

This step combines the original nucleotide sequences with the protein sequences predicted by Prodigal to create a GenBank format file. This format is widely used for storing and organising information about DNA sequences and their annotations. In this step, the DNA sequences and their corresponding coding regions are transformed into a format that is suitable for SanntiS.
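
Linking each protein back to its coding region requires coordinates. Prodigal writes these into the header of every protein record, and the sketch below parses one such header. It assumes the usual Prodigal layout (name # start # end # strand # attributes). The header shown is an invented example, and this is not a description of how SanntiS itself reads the file.

```python
def parse_prodigal_header(header):
    """Extract contig, coordinates and strand from a Prodigal FASTA header."""
    # Assumed layout: "<contig>_<n> # <start> # <end> # <strand> # key=value;..."
    name, start, end, strand, attributes = [
        part.strip() for part in header.lstrip(">").split("#", 4)
    ]
    contig, _, gene_number = name.rpartition("_")
    attrs = dict(
        item.split("=", 1) for item in attributes.split(";") if "=" in item
    )
    return {
        "contig": contig,
        "gene_number": int(gene_number),
        "start": int(start),
        "end": int(end),
        "strand": "+" if strand == "1" else "-",
        "partial": attrs.get("partial"),
    }


if __name__ == "__main__":
    # Invented header, for illustration only.
    example = ">contig_1_2 # 337 # 2799 # -1 # ID=1_2;partial=00;start_type=ATG"
    print(parse_prodigal_header(example))
```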

Hands-on: Build Genbank
  1. SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:
    • “Do you want to build a genbank or to make a SMBGC Annotation?”: Build genbank
      • “Input a nucleotide fasta file”: BGC0001472.fna (Input the nucleotide fasta file)
      • “Input a protein fasta file”: Prodigal Gene Predictor on data 1 : protein translations file (output of Prodigal Gene Predictor tool)
  2. Click on Run Tool
Image of the Genbank file produced by SanntiS.

Figure 6: Genbank file

Regex Find And Replace

Remember the asterisk (*) we noticed earlier in the protein FASTA file?

Now is the time to remove it!

The asterisks at the end of some protein sequences are informative but can cause issues with some analysis tools. In this step, we remove these asterisks to produce a clean protein FASTA file, making it ready for further analysis.

Hands-on: Remove *
  1. Regex Find And Replace ( Galaxy version 1.0.3) with the following parameters:
    • “Select lines from”: Prodigal Gene Predictor on data 1 : protein translations file (output of Prodigal Gene Predictor tool)
    • In “Check”:
      • “Insert Check”
        • “Find Regex”: \*$
        • “Replacement”: (leave this box empty)
  2. Click on Run Tool

Check that the asterisks were removed correctly.
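
The same substitution can be sketched outside Galaxy to see what the pattern \*$ does: it matches a literal asterisk only when it is the last character of a line. The function name and the toy input below are illustrative, not part of the Galaxy tool.

```python
import re

# A literal asterisk anchored at the end of a line.
STOP_AT_END = re.compile(r"\*$")


def strip_trailing_asterisks(fasta_text):
    """Remove a single trailing '*' from every line of a FASTA text."""
    cleaned = [STOP_AT_END.sub("", line) for line in fasta_text.splitlines()]
    return "\n".join(cleaned) + "\n"


if __name__ == "__main__":
    # Invented toy input for illustration.
    toy = ">p1\nMKT\nLLV*\n>p2\nMAA\n"
    cleaned = strip_trailing_asterisks(toy)
    print(cleaned)
    # A quick check similar to the manual one above.
    print("asterisk left:", "*" in cleaned)
```

Because the pattern is anchored with $, an asterisk in the middle of a line is left untouched.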

InterProScan

InterProScan is a tool that helps identify and predict the functions of proteins by comparing them to known databases. This tool analyses the protein sequences and produces an output file containing detailed information about the possible functions, domains, and families associated with these proteins.

Hands-on: Run InterProScan
  1. InterProScan ( Galaxy version 5.59-91.0+galaxy3) with the following parameters:
    • “Protein FASTA File”: Regex Find And Replace on data ** (output of Regex Find And Replace tool)
    • “Use applications with restricted license, only for non-commercial use?”: No

    You can leave all the other parameters on their default values.
  2. Click on Run Tool
Image of InterProScan output in tabular (tsv) in the history.

Figure 7: InterProScan output
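
To get a feel for the tabular output, the sketch below groups signature matches by protein. It assumes the column order documented for InterProScan TSV output (protein accession, MD5, length, analysis, signature accession, signature description, start, stop, score, status, date, InterPro accession, InterPro description). The rows and accessions are placeholders, not real results.

```python
import csv
import io
from collections import defaultdict


def signatures_per_protein(tsv_text):
    """Group (analysis, signature, start, stop, interpro) hits by protein."""
    hits = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 8:
            continue  # skip empty or truncated lines
        protein, analysis, signature = row[0], row[3], row[4]
        start, stop = int(row[6]), int(row[7])
        # Column 12 holds the InterPro accession, or "-" when there is none.
        interpro = row[11] if len(row) > 11 and row[11] != "-" else None
        hits[protein].append((analysis, signature, start, stop, interpro))
    return dict(hits)


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    toy = (
        "prot_1\tmd5a\t350\tPfam\tSIG0001\tdomain A\t5\t250\t1e-50\tT\t01-01-2024\tIPR_X\tentry X\n"
        "prot_1\tmd5a\t350\tSMART\tSIG0002\tdomain B\t260\t340\t2e-10\tT\t01-01-2024\t-\t-\n"
        "prot_2\tmd5b\t120\tPfam\tSIG0003\tdomain C\t1\t100\t3e-20\tT\t01-01-2024\tIPR_Y\tentry Y\n"
    )
    for protein, matches in signatures_per_protein(toy).items():
        print(protein, matches)
```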

SanntiS for annotating biosynthetic gene clusters

SanntiS is a tool specifically designed to detect and annotate biosynthetic gene clusters (BGCs). It uses neural networks trained on InterPro signatures to achieve high accuracy in identifying BGCs in both genomic and metagenomic datasets 1. A significant benefit of using SanntiS is that your results will be comparable with a large number of datasets, including over 5,000 marine metagenomic assemblies archived in the MGnify resource. This tool provides valuable insights into the biosynthetic potential of organisms or environmental samples. The final output is a GFF file containing detailed BGC annotations, which can be used for further analyses or applications.

Hands-on: Identify biosynthetic gene clusters
  1. SanntiS biosynthetic gene clusters ( Galaxy version 0.9.3.5+galaxy1) with the following parameters:
    • “Do you want to build a genbank or to make a SMBGC Annotation?”: Run SanntiS
      • “Input the tabular file from InterProScan”: InterProScan on data ** (output of InterProScan tool)
      • “Input a Genbank file”: SanntiS output data genbank (output of SanntiS biosynthetic gene clusters tool)
  2. Click on Run Tool

Finally, you should have one gff3 file in your history under SanntiS output data.

Image of SanntiS output.

Figure 8: SanntiS output
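
For further analyses, the GFF3 file can be read with a few lines of code. The sketch below relies only on the generic GFF3 layout (nine tab-separated columns, with key=value attributes in the last one). The example line, the feature type, and the attribute names are hypothetical and are not copied from real SanntiS output.

```python
from urllib.parse import unquote


def parse_gff3(text):
    """Parse GFF3 text into a list of feature dictionaries."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attributes = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
        features.append({
            "seqid": cols[0],
            "source": cols[1],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attributes,
        })
    return features


if __name__ == "__main__":
    # Hypothetical feature line for illustration only.
    toy = (
        "##gff-version 3\n"
        "contig_1\texample_source\tCLUSTER\t100\t5000\t.\t.\t.\t"
        "ID=cluster_1;example_key=example%20value\n"
    )
    for feature in parse_gff3(toy):
        length = feature["end"] - feature["start"] + 1
        print(feature["seqid"], feature["type"], length, feature["attributes"])
```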

Conclusion

You now know how to conduct a Marine Omics analysis.

Extra information

More tutorials and other Earth-System related trainings are coming soon. Keep an eye open if you are interested!