Plotting a Microbial Genome with Circos

Author(s): Helena Rasche
Overview
Creative Commons License: CC-BY
Questions:
  • How can I visualise common genomic datasets like GFF3, BigWig, and VCF?

Objectives:
  • Plot an E. coli genome in Galaxy, with tracks for the annotations, sequencing data, and variants.

Requirements:
Time estimation: 30 minutes
Level: Intermediate
Supporting Materials:
Published: Nov 8, 2023
Last modification: Sep 17, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00373
Rating: 5.0 (1 recent ratings, 1 all time)
Revision: 3

Once you’re comfortable with Circos in Galaxy, you might want to explore real-world use cases, such as making a simple genome annotation plot like one you might publish alongside a genome annotation paper.

Agenda

In this tutorial, we will cover:

  1. Get data
  2. Fast Option: Using a Workflow
  3. Manual Configuration
  4. Conclusion

There are a few common data formats and potentially required transformations for working with genome annotation data in Galaxy:

Data type | Transformations | Visualisation Options
Genome Annotations | Circos Interval to Tiles, Circos Interval to Text Labels | Tiles (classic gene box-type viz), Text labels (used for annotating the important gene names), or a histogram (feature density)
Sequencing | BAM Coverage → Circos BigWig to Scatter | Histogram is probably the most common visualisation option
Variants | From BAM + variant calling, cut tool | Select specific columns of the VCF file (i.e. chromosome, position, position, quality)

Get data

Hands-on: Data Upload
  1. Create and name a new history for this tutorial.

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Import the datasets we will visualize:

    https://zenodo.org/record/3591856/files/genome.fa
    https://zenodo.org/record/3591856/files/dna%20sequencing%20coverage.bw
    https://zenodo.org/record/3591856/files/RNA-Seq%20coverage%201.bw
    https://zenodo.org/record/3591856/files/RNA-Seq%20coverage%202.bw
    https://zenodo.org/record/3591856/files/genes%20(NCBI).gff3
    https://zenodo.org/record/3591856/files/variants.vcf
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window
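The file names in these URLs are percent-encoded (`%20` stands for a space). As a minimal sketch, the following shows how the readable names used later in this tutorial relate to the links. The helper name `dataset_name` is hypothetical, and it is an assumption that your Galaxy server names the imported datasets after the decoded file names; rename them in the history if it does not.

```python
from urllib.parse import unquote, urlsplit

# URLs copied from the import list in this tutorial.
DATASET_URLS = [
    "https://zenodo.org/record/3591856/files/genome.fa",
    "https://zenodo.org/record/3591856/files/dna%20sequencing%20coverage.bw",
    "https://zenodo.org/record/3591856/files/RNA-Seq%20coverage%201.bw",
    "https://zenodo.org/record/3591856/files/RNA-Seq%20coverage%202.bw",
    "https://zenodo.org/record/3591856/files/genes%20(NCBI).gff3",
    "https://zenodo.org/record/3591856/files/variants.vcf",
]


def dataset_name(url: str) -> str:
    """Return the percent-decoded file name at the end of a URL."""
    path = urlsplit(url).path
    return unquote(path.rsplit("/", 1)[-1])


if __name__ == "__main__":
    for url in DATASET_URLS:
        print(dataset_name(url))
```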

This is a subset [1] of Escherichia coli str. K-12 substr. MG1655 and its associated sequencing data.

And we’ll be using that data to make the following plot:

Plot of NC_000913 showing a range of datasets, from outside to in: a pink band representing the genome, two RNA-Seq datasets and a DNA-Seq coverage dataset showing coverage spikes around 10 and 20 kb. Next inside is an overlapping set of genes, presumably on the plus strand, in dark blue, overlapped with lilac triangles representing variants. Next is a set of gene names, and finally a set of minus-strand genes before a GC skew map in red and blue.

Figure 1: Resulting Circos Plot showing a set of gene annotations, RNA-Seq, DNA-Seq coverage, as well as variants.

Fast Option: Using a Workflow

If you would like to try the ‘fast’ option, once you’ve imported the datasets, you can run the following workflow:

Hands-on: Run workflow
  1. Import the workflow into Galaxy

    Hands-on: Importing and launching a GTN workflow
    Launch the Circos for E. Coli workflow (View on GitHub, Download workflow).
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on Import at the top-right of the screen
    • Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/visualisation/tutorials/circos-microbial/workflows/main_workflow.ga
    • Click the Import workflow button

    Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

    Video: Importing a workflow from URL

  2. Run the workflow using the following parameters:

    • param-file “Genome”: genome.fa
    • param-file “Genes”: genes (NCBI).gff3
    • param-file “RNA Seq Coverage (1)”: RNA-Seq coverage 1.bw
    • param-file “RNA Seq Coverage (2)”: RNA-Seq coverage 2.bw
    • param-file “DNA Sequencing Coverage”: dna sequencing coverage.bw
    • param-file “Variants”: variants.vcf
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on the Run workflow button next to your workflow
    • Configure the workflow as needed
    • Click the Run Workflow button at the top-right of the screen
    • You may have to refresh your history to see the queued jobs

Alternatively, you can run the pre-processing steps and configure Circos manually as follows:

Manual Configuration

We’ll calculate the GC skew first from the genome sequence:

Hands-on: GC Skew
  1. GC Skew (Galaxy version 0.69.8+galaxy9) with the following parameters:
    • “Source for reference genome”: Use a genome from history
      • param-file “Select a reference genome”: genome.fa (Input dataset)
    • “Window size”: 200
    Comment: Window size

    Finding the optimal window size is often a process of trial and error: you are looking for the right balance between too many data points and the expected smooth curve that indicates forward or reverse strand genes.
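To make the effect of the window size concrete, here is a minimal sketch of a windowed GC skew calculation. It assumes the commonly used definition (G - C) / (G + C) over consecutive, non-overlapping windows; the Galaxy tool may differ in details such as window overlap or output format, and the function name `gc_skew` is hypothetical.

```python
def gc_skew(sequence: str, window: int = 200):
    """Yield (start, end, skew) for consecutive, non-overlapping windows.

    Skew is (G - C) / (G + C); windows without any G or C report 0.0.
    """
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, start + len(chunk), skew


if __name__ == "__main__":
    # Tiny made-up sequence, window of 4 bases, purely for illustration.
    for start, end, skew in gc_skew("GGGGCCATATAT", window=4):
        print(start, end, skew)
```

A smaller window yields more, noisier points; a larger window yields fewer, smoother ones, which is the trade-off described in the comment above.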

Preparing BigWig Files

With that file available, we’re ready to convert these datasets into a format Circos can understand. We store the files natively in BigWig because it is a very space-efficient format. Circos, however, only processes text files, and expects a dataset with the following structure:

Column | Value
1 | Chromosome name
2 | Start
3 | End
4 | Value

We’ll use a tool to convert the BigWig files into this Circos-preferred format.
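As an illustration of the four-column structure, the sketch below parses and writes single data lines. The names `DataPoint`, `parse_line`, and `format_point` are hypothetical, and the example record uses made-up values; the chromosome name NC_000913 is taken from the figure description.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    """One row of a Circos 2D data file."""
    chrom: str
    start: int
    end: int
    value: float


def parse_line(line: str) -> DataPoint:
    """Parse a whitespace-separated 'chrom start end value' line."""
    chrom, start, end, value = line.split()[:4]
    return DataPoint(chrom, int(start), int(end), float(value))


def format_point(point: DataPoint) -> str:
    """Write a data point back out as a tab-separated line."""
    return f"{point.chrom}\t{point.start}\t{point.end}\t{point.value:g}"


if __name__ == "__main__":
    point = parse_line("NC_000913 0 200 12.5")
    print(point)
    print(format_point(point))
```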

Hands-on: Dataset Pre-processing
  1. Circos: bigWig to Scatter (Galaxy version 0.69.8+galaxy9) with the following parameters:
    • param-files “Data file”:
      • output of GC Skew tool
      • RNA-Seq coverage 1.bw (Uploaded Dataset)
      • RNA-Seq coverage 2.bw (Uploaded Dataset)
      • dna sequencing coverage.bw (Uploaded Dataset)
    Comment: Multi-select to automate processing

    Multi-select allows you to process several datasets at once in Galaxy.

Comment: Creating BigWig files from coverage

You can use a tool like bamCoverage (Galaxy version 3.5.4+galaxy0) to create a bigWig coverage file from a BAM or CRAM sequencing dataset.

Preparing Variant Calls

Variant calls in VCF format can easily be transformed into the same four-column format that we converted the BigWig files to.

Hands-on: Dataset Pre-processing
  1. Cut with the following parameters:
    • “Cut columns”: c1,c2,c2,c6
    • param-file “From”: variants.vcf (Uploaded dataset)
    Question
    1. Why these columns? What do they represent?
    2. Why is c2 selected twice?
    Solution
    1. c1 is the chromosome name, c2 is the position of the variant, and c6 is the quality column.
    2. c2 is used twice because in Circos there are no ‘point’ values; everything has a start and an end. So here we re-use the start position to represent a 1 base long feature.
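The column selection can be sketched in a few lines of Python. This is only an illustration of what selecting c1,c2,c2,c6 produces; the function name `vcf_to_circos` is hypothetical, the example record is made up, and skipping the `#` header lines is an assumption of this sketch rather than a description of how the Cut tool treats them.

```python
def vcf_to_circos(lines):
    """Yield 'chrom pos pos qual' rows, mirroring Cut with c1,c2,c2,c6."""
    for line in lines:
        # Assumption of this sketch: header and blank lines are dropped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, qual = fields[0], fields[1], fields[5]
        # The position is repeated so the variant has a start and an end.
        yield "\t".join([chrom, pos, pos, qual])


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "NC_000913\t1234\t.\tA\tG\t228.0\tPASS\t.\n",  # made-up record
    ]
    for row in vcf_to_circos(example):
        print(row)
```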

Preparing Gene Annotations

Gene annotations (GFF3, BED, GTF), known as “intervals” in the Circos world, can be converted into a couple of different formats, namely text labels and tiles.

Hands-on: Prepare gene calls
  1. Circos: Interval to Circos Text Labels (Galaxy version 0.69.8+galaxy9) with the following parameters:
    • “Data Format”: GFF3
      • param-file “GFF3 File”: genes (NCBI).gff3 (Input dataset)
      • “GFF3 Attribute to pull value from”: Name
  2. GFF-to-BED with the following parameters:
    • param-file “Convert this dataset”: genes (NCBI).gff3 (Input dataset)
  3. Circos: Interval to Tiles (Galaxy version 0.69.8+galaxy9) with the following parameters:
    • “Data Format”: BED6+
      • param-file “BED File (BED6+ only)”: output of GFF-to-BED tool
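To give a feel for what these conversions produce, here is a rough sketch that turns GFF3 gene records into text-label rows (using the Name attribute) and tile rows. Everything here is illustrative: the function names are hypothetical, restricting to `gene` features is an assumption, coordinates are passed through unchanged, and writing the strand as a `strand=1` / `strand=-1` option is an assumption about the tile output, not a description of what the Galaxy tools emit.

```python
from urllib.parse import unquote

# Assumed numeric encoding of the GFF3 strand column.
STRAND_VALUES = {"+": 1, "-": -1}


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def _records(lines, feature_type):
    """Yield the nine GFF3 columns of each matching feature line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == feature_type:
            yield cols


def gff3_to_labels(lines, attribute="Name", feature_type="gene"):
    """Yield 'chrom start end label' rows for features carrying the attribute."""
    for cols in _records(lines, feature_type):
        label = parse_attributes(cols[8]).get(attribute)
        if label:
            yield f"{cols[0]} {cols[3]} {cols[4]} {label}"


def gff3_to_tiles(lines, feature_type="gene"):
    """Yield 'chrom start end strand=N' rows (assumed option format)."""
    for cols in _records(lines, feature_type):
        strand = STRAND_VALUES.get(cols[6], 0)
        yield f"{cols[0]} {cols[3]} {cols[4]} strand={strand}"


if __name__ == "__main__":
    # Illustrative record only.
    gff = ["NC_000913\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene0;Name=thrL\n"]
    print(list(gff3_to_labels(gff)))
    print(list(gff3_to_tiles(gff)))
```

A numeric strand value of this kind is what a rule such as “strand less than 0” in a plot configuration could compare against.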

Making the Plot

With our gene calls, variant calls, and sequencing depth prepared, we’re ready to run Circos! As this is a ‘near-final’ Circos plot, it requires a complicated configuration. Normally you would only arrive at a configuration like this after many iterations. It took the tutorial author around 20 executions of the Circos tool to produce this plot.

Hands-on: Circos
  1. Circos (Galaxy version 0.69.8+galaxy9) with the following parameters:
    • In “Karyotype”:
      • “Reference Genome Source”: FASTA File from History (can be slow, generate a length file to improve execution time.)
        • param-file “Source FASTA Sequence”: genome.fa (Uploaded dataset)
    • In “Ideogram”:
      • “Chromosome units”: Kilobases
      • “Spacing Between Ideograms (in chromosome units)”: 0.3
      • “Thickness”: 10.0
      • In “Labels”:
        • “Label Font Size”: 48
    • In “2D Data Tracks”:
      • In “2D Data Plot”:
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.98
          • “Inside Radius”: 0.92
          • “Plot Type”: Histogram
            • param-file “Histogram Data Source”: output of Circos: bigWig to Scatter on RNA Seq Coverage 2 tool
            • In “Plot Format Specific Options”:
              • “Fill Color”: #f08fa4
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.92
          • “Inside Radius”: 0.86
          • “Plot Type”: Histogram
            • param-file “Histogram Data Source”: output of Circos: bigWig to Scatter on RNA Seq Coverage 1 tool
            • In “Plot Format Specific Options”:
              • “Fill Color”: #8ff0a4
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.86
          • “Inside Radius”: 0.8
          • “Plot Type”: Histogram
            • param-file “Histogram Data Source”: output of Circos: bigWig to Scatter on DNA sequencing coverage tool
            • In “Plot Format Specific Options”:
              • “Fill Color”: #ffbe6f
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.79
          • “Inside Radius”: 0.6
          • “Z-index”: 10 (This is used to plot over the genes which are added later.)
          • “Plot Type”: Scatter
            • param-file “Scatter Data Source”: output of cut on variants.vcf tool
            • In “Plot Format Specific Options”:
              • “Glyph”: Triangle
              • “Glyph Size”: 6
              • “Fill Color”: #dc8add
              • “Stroke Thickness”: 0
          • In “Axes”:
            • In “Axis”:
              • param-repeat “Insert Axis”
                • “Radial Position”: Absolute position (values match data values)
                  • “Spacing”: 5000.0
                • “y1”: 40000.0
                • “Color”: #1a5fb4
                • “Color Transparency”: 0.4
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.6
          • “Inside Radius”: 0.55
          • “Plot Type”: Text Labels
            • param-file “Text Data Source”: output of Circos: Interval to Text on genes (NCBI).gff tool
            • In “Plot Format Specific Options”:
              • “Label Size”: 18
              • “Show Link”: No
              • “Snuggle Labels”: Yes
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.7
          • “Inside Radius”: 0.6
          • “Plot Type”: Tiles
            • param-file “Tile Data Source”: output of Circos: Interval to Tiles on genes (NCBI).gff tool
            • In “Plot Format Specific Options”:
              • “Fill Color”: #1c71d8
              • “Overflow Behavior”: Hide: overflow tiles are not drawn
          • In “Rules”:
            • In “Rule”:
              • param-repeat “Insert Rule”
                • In “Conditions to Apply”:
                  • param-repeat “Insert Conditions to Apply”
                    • “Condition”: Based on qualifier value (when available)
                      • “Qualifier name”: strand
                      • “Condition”: Less than (numeric)
                      • “Qualifier value to compare against”: 0
                • In “Actions to Apply”:
                  • param-repeat “Insert Actions to Apply”
                    • “Action”: Change Visibility
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.53
          • “Inside Radius”: 0.45
          • “Plot Type”: Tiles
            • param-file “Tile Data Source”: output of Circos: Interval to Tiles on genes (NCBI).gff tool
            • In “Plot Format Specific Options”:
              • “Overflow Behavior”: Hide: overflow tiles are not drawn
          • “Orient Inwards”: Yes
          • In “Rules”:
            • In “Rule”:
              • param-repeat “Insert Rule”
                • In “Conditions to Apply”:
                  • param-repeat “Insert Conditions to Apply”
                    • “Condition”: Based on qualifier value (when available)
                      • “Qualifier name”: strand
                      • “Condition”: Greater than (numeric)
                      • “Qualifier value to compare against”: 0
                • In “Actions to Apply”:
                  • param-repeat “Insert Actions to Apply”
                    • “Action”: Change Visibility
              • param-repeat “Insert Rule”
                • In “Conditions to Apply”:
                  • param-repeat “Insert Conditions to Apply”
                    • “Condition”: Apply to Every Point
                • In “Actions to Apply”:
                  • param-repeat “Insert Actions to Apply”
                    • “Action”: Change Fill Color for all points
                      • “Fill Color”: #99c1f1
        • param-repeat “Insert 2D Data Plot”
          • “Outside Radius”: 0.45
          • “Inside Radius”: 0.35
          • “Plot Type”: Histogram
            • param-file “Histogram Data Source”: output of Circos: bigWig to Scatter on the GC Skew Plot tool
            • In “Plot Format Specific Options”:
              • “Fill Color”: #ff5757
          • In “Rules”:
            • In “Rule”:
              • param-repeat “Insert Rule”
                • In “Conditions to Apply”:
                  • param-repeat “Insert Conditions to Apply”
                    • “Condition”: Based on value (ONLY for scatter/histogram/heatmap/line)
                      • “Points below this value”: 0.0
                • In “Actions to Apply”:
                  • param-repeat “Insert Actions to Apply”
                    • “Action”: Change Fill Color for all points
                      • “Fill Color”: #5092f7
    • In “Ticks”:
      • “Skip first label”: Yes
      • In “Tick Group”:
        • param-repeat “Insert Tick Group”
          • “Tick Spacing”: 10.0
          • “Tick Size”: 20.0
          • “Show Tick Labels”: Yes
        • param-repeat “Insert Tick Group”
          • “Tick Size”: 15.0
          • “Show Tick Labels”: No
        • param-repeat “Insert Tick Group”
          • “Tick Spacing”: 0.25
          • “Color”: #9a9996
          • “Show Tick Labels”: No
    Comment: Circos is complicated

    Please check your parameters carefully, and expect to make mistakes. If something looks wrong, just modify your parameters and re-run the tool! This example is probably overwhelming, but creating a Circos plot from scratch will be less so: it will be your own data, which you know better, and you will add one track at a time.

    gif of a circos plot being iteratively modified to reach something that looks like the final plot.
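With this many tracks, it can help to sanity-check the radii before re-running the tool. The sketch below lists the outside and inside radii from the configuration above and reports which tracks share radial space. The track labels and function names are hypothetical, and it assumes the radii are fractions of the ideogram radius.

```python
# (label, outside radius, inside radius), copied from the settings above.
TRACKS = [
    ("RNA-Seq coverage 2 (histogram)", 0.98, 0.92),
    ("RNA-Seq coverage 1 (histogram)", 0.92, 0.86),
    ("DNA sequencing coverage (histogram)", 0.86, 0.80),
    ("Variants (scatter)", 0.79, 0.60),
    ("Gene names (text labels)", 0.60, 0.55),
    ("Genes (tiles, first track)", 0.70, 0.60),
    ("Genes (tiles, second track)", 0.53, 0.45),
    ("GC skew (histogram)", 0.45, 0.35),
]


def check_tracks(tracks):
    """Return a message for every track whose radii are inverted or negative."""
    problems = []
    for name, outside, inside in tracks:
        if not 0 <= inside < outside:
            problems.append(f"{name}: inside radius must be smaller than outside radius")
    return problems


def overlapping_pairs(tracks):
    """Return pairs of tracks whose radial bands overlap (touching is fine)."""
    pairs = []
    for i, (name_a, out_a, in_a) in enumerate(tracks):
        for name_b, out_b, in_b in tracks[i + 1:]:
            # Two bands overlap when each begins inside the other's extent.
            if in_a < out_b and in_b < out_a:
                pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    print(check_tracks(TRACKS))
    for pair in overlapping_pairs(TRACKS):
        print("overlap:", pair)
```

Here the only overlap reported is between the variants and the first gene track, which matches the configuration: the scatter plot is given a Z-index of 10 so that it is drawn over the genes.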

Congratulations on plotting a microbial genome subset in Circos!

Conclusion

Plotting with Circos is essentially infinitely customisable, but here we offer suggestions for a default plotting workflow.

  [1] Reduced for faster plotting and faster data download.