Checking expected species and contamination in bacterial isolate

Author(s) orcid logoBérénice Batut avatar Bérénice Batut
Editor(s) orcid logoClea Siguret avatar Clea Siguret
Creative Commons License: CC-BY Questions:
  • What are the species in bacterial isolate sequencing data?

  • Run a series of tool to identify species in bacterial isolate sequencing data

  • Visualize the species abundance

Time estimation: 1 hour
Level: Introductory Introductory
Supporting Materials:
Published: Mar 4, 2024
Last modification: Jun 27, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
purl PURL:
rating Rating: 5.0 (1 recent ratings, 1 all time)
version Revision: 4

Sequencing (determining of DNA/RNA nucleotide sequence) is used all over the world for all kinds of analysis. The product of these sequencers are reads, which are sequences of detected nucleotides.

When sequencing bacterial isolates, it might be good to check if the expected species or strains can be identified in the data or if there is any contamination.

To illustrate the process, we take quality-controlled data of a bacterial genome (KUN1163 sample) from sequencing data produced in “Complete Genome Sequences of Eight Methicillin-Resistant Staphylococcus aureus Strains Isolated from Patients in Japan” (Hikichi et al. 2019).

Methicillin-resistant Staphylococcus aureus (MRSA) is a major pathogen causing nosocomial infections, and the clinical manifestations of MRSA range from asymptomatic colonization of the nasal mucosa to soft tissue infection to fulminant invasive disease. Here, we report the complete genome sequences of eight MRSA strains isolated from patients in Japan.


In this tutorial, we will cover:

  1. Galaxy and data preparation
  2. Taxonomic profiling
  3. Species extraction
  4. Contamination identification
  5. Conclusion

Galaxy and data preparation

Any analysis should get its own Galaxy history. So let’s start by creating a new one and get the data (forward and reverse quality-controlled sequences) into it.

Hands-on: Prepare Galaxy and data
  1. Create a new history for this analysis

    To create a new history simply click the new-history icon at the top of the history panel:

    UI for creating new history

  2. Rename the history

    1. Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
    2. Type the new name
    3. Click on Save
    4. To cancel renaming, click the galaxy-undo “Cancel” button

    If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

  3. Import the quality-controlled sequences from Zenodo or from Galaxy shared data libraries:
    • Copy the link location
    • Click galaxy-upload Upload Data at the top of the tool panel

    • Select galaxy-wf-edit Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Shared data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

Taxonomic profiling

To find out which microorganisms are present, we will compare the reads of the sample to a reference database, i.e. sequences of known microorganisms stored in a database, using Kraken2 (Wood et al. 2019).

In the \(k\)-mer approach for taxonomy classification, we use a database containing DNA sequences of genomes whose taxonomy we already know. On a computer, the genome sequences are broken into short pieces of length \(k\) (called \(k\)-mers), usually 30bp.

Kraken examines the \(k\)-mers within the query sequence, searches for them in the database, looks for where these are placed within the taxonomy tree inside the database, makes the classification with the most probable position, then maps \(k\)-mers to the lowest common ancestor (LCA) of all genomes known to contain the given \(k\)-mer.


Kraken2 uses a compact hash table, a probabilistic data structure that allows for faster queries and lower memory requirements. It applies a spaced seed mask of s spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the k-mer.

You can find more information about the Kraken2 algorithm in the paper Improved metagenomic analysis with Kraken 2.

For this tutorial, we will use the PlusPF database which contains the Standard (archaea, bacteria, viral, plasmid, human, UniVec_Core), protozoa and fungi data.

Hands-on: Assign taxonomic labels with Kraken2
  1. Kraken2 ( Galaxy version 2.1.1+galaxy1) with the following parameters:
    • “Single or paired reads”: Paired
      • param-file “Forward strand”: DRR187559_after_fastp_1
      • param-file “Reverse strand”: DRR187559_after_fastp_2
    • “Minimum Base Quality”: 10
    • In “Create Report”:
      • “Print a report with aggregrate counts/clade to file”: Yes
    • “Select a Kraken2 database”: PlusPF-16
  • Classification: tabular files with one line for each sequence classified by Kraken and 5 columns:

    1. C/U: a one letter indicating if the sequence classified or unclassified
    2. Sequence ID as in the input file
    3. NCBI taxonomy ID assigned to the sequence, or 0 if unclassified
    4. Length of sequence in bp (read1|read2 for paired reads)
    5. A space-delimited list indicating the lowest common ancestor (LCA) mapping of each k-mer in the sequence

      For example, 562:13 561:4 A:31 0:1 562:3 would indicate that:

      1. The first 13 k-mers mapped to taxonomy ID #562
      2. The next 4 k-mers mapped to taxonomy ID #561
      3. The next 31 k-mers contained an ambiguous nucleotide
      4. The next k-mer was not in the database
      5. The last 3 k-mers mapped to taxonomy ID #562

      |:| indicates end of first read, start of second read for paired reads

     Column 1	Column 2	Column 3	Column 4	Column 5
     C 	DRR187559.1 	1280 	164|85 	0:1 1279:1 0:41 1279:10 0:5 1280:5 1279:1 0:1 1279:1 0:7 1279:5 0:2 1279:6 0:12 1279:5 0:19 1279:2 0:6 |:| 0:39 1279:2 0:6 1279:3 0:1
     C 	DRR187559.2 	1280 	70|198 	0:2 1279:5 0:29 |:| 0:52 1279:5 0:13 1279:3 0:23 1279:2 0:45 1280:1 0:9 1280:3 A:8
     C 	DRR187559.3 	1279 	106|73 	0:4 1279:3 0:36 1279:4 0:10 1279:5 0:3 1279:5 0:2 |:| 0:39
     C 	DRR187559.4 	1279 	121|189 	1279:6 0:17 1279:4 0:28 1279:2 0:30 |:| 0:7 1279:5 0:19 1279:5 0:20 1279:1 0:8 1279:1 0:1 1279:6 0:25 1279:2 0:44 A:11
     C 	DRR187559.5 	1279 	68|150 	1279:2 0:20 1279:3 0:9 |:| 0:10 1279:3 0:24 1279:2 0:9 1279:5 0:21 1279:5 0:20 1279:5 0:9 1279:1 0:2
     C 	DRR187559.6 	1280 	137|246 	0:2 1280:5 0:28 1279:1 0:28 1279:2 0:8 1279:2 0:23 1279:1 0:3 |:| 1279:1 0:2 1279:3 0:61 1279:2 0:14 1279:2 0:97 A:30 
    1. Is the first sequence in the file classified or unclassified?
    2. What is the taxonomy ID assigned to the first classified sequence?
    3. What is the corresponding taxon?
    1. classified
    2. 1280
    3. 1280 corresponds to Staphylococcus aureus .
  • Report: tabular files with one line per taxon and 6 columns or fields

    1. Percentage of fragments covered by the clade rooted at this taxon
    2. Number of fragments covered by the clade rooted at this taxon
    3. Number of fragments assigned directly to this taxon
    4. A rank code, indicating
      • (U)nclassified
      • (R)oot
      • (D)omain
      • (K)ingdom
      • (P)hylum
      • (C)lass
      • (O)rder
      • (F)amily
      • (G)enus, or
      • (S)pecies

      Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., G2 is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.

    5. NCBI taxonomic ID number
    6. Indented scientific name
     Column 1	Column 2	Column 3	Column 4	Column 5	Column 6
     0.24 	1065 	1065 	U 	0 	unclassified
     99.76 	450716 	14873 	R 	1 	root
     96.44 	435695 	2 	R1 	131567 	cellular organisms
     96.43 	435675 	3889 	D 	2 	Bacteria
     95.56 	431709 	78 	D1 	1783272 	Terrabacteria group
     95.53 	431578 	163 	P 	1239 	Firmicutes
     95.49 	431390 	4625 	C 	91061 	Bacilli
     94.38 	426383 	1436 	O 	1385 	Bacillales
     94.04 	424874 	2689 	F 	90964 	Staphylococcaceae
     93.44 	422124 	234829 	G 	1279 	Staphylococcus 
    1. How many taxa have been found?
    2. What are the percentage on unclassified?
    3. What are the domains found?
    1. 621, as the number of lines
    2. 0.24%
    3. Only Bacteria

Species extraction

In Kraken output, there are quite a lot of identified taxa with different levels. To extract the species level, we will use Bracken.

Bracken (Bayesian Reestimation of Abundance after Classification with Kraken) is a “simple and worthwile addition to Kraken for better abundance estimates” (Ye et al. 2019). Instead of only using proportions of classified reads, it takes a probabilistic approach to generate final abundance profiles. It works by re-distributing reads in the taxonomic tree: “Reads assigned to nodes above the species level are distributed down to the species nodes, while reads assigned at the strain level are re-distributed upward to their parent species” (Lu et al. 2017).

Hands-on: Extract species with Bracken
  1. Bracken ( Galaxy version 2.9+galaxy0) with the following parameters:
    • param-collection “Kraken report file”: Report output of Kraken
    • “Select a kmer distribution”: PlusPF, same as for Kraken

      It is important to choose the same database that you also chose for Kraken2

    • “Level”: Species

Bracken generates one output as a table with 7 columns:

  • Taxon name
  • Taxonomy ID
  • Level ID (S=Species, G=Genus, O=Order, F=Family, P=Phylum, K=Kingdom)
  • Kraken assigned reads
  • Added reads with abundance re-estimation
  • Total reads after abundance re-estimation
  • Fraction of total reads
  1. How many species have been found?
  2. Which the species has been the most identified? And in which proportion?
  3. What are the other species?
  1. 51 (52 lines including 1 line with header)
  2. Staphylococcus aureus with 95.5% of the reads.
  3. Most of the other species are from Staphylococcus genus, so same as Staphylococcus aureus. The other species in really low proportion.

As expected Staphylococcus aureus represents most of the reads in the data.

Contamination identification

To explore Kraken report and specially to detect more reliably minority organisms or contamination, we will use Recentrifuge (Martı́ Jose Manuel 2019).

Hands-on: Identify contamination
  1. Recentrifuge ( Galaxy version 1.12.1+galaxy0) with the following parameters:
    • param-file “Select taxonomy file tabular formated”: report of Kraken2 tool
    • “Type of input file (Centrifuge, CLARK, Generic, Kraken, LMAT)”: Kraken
    • In “Database type”:
      • “Cached database whith taxa ID”: latest
    • In “Output options”:
      • “Type of extra output to be generated (default on CSV)”: TSV
      • “Remove the log file”: Yes
    • In ” Fine tuning of algorithm parameters”:
      • “Strain level instead of species as the resolution limit for the robust contamination removal algorithm; use with caution, this is an experimental feature”: Yes

Recentrifuge generates 3 outputs:

  • A statistic table with general statistics about assignations

    1. How many sequences have been used?
    2. How many sequences have been classified?
    1. 451,780
    2. 450,715
  • A data table with a report for each taxa

    1. How many taxa have been kept?
    2. What is the lowest level in the data?
    1. 185 (189 lines including 3 header lines)
    2. The lowest level is strain.
  • A HTML report with a Krona chart

    1. What is the percentage of classified sequences?
    2. When clicking on Staphylococcus aureus, what can we say about the strains?
    3. Is there any contamination?
    1. 99.8%
    2. 99% of sequences assigned to Staphylococcus aureus are not assigned to any strains, probably because they are too similar to several strains. Staphylococcus aureus subsp. aureus JKD6159 is the strain with the most classified sequences, but only 0.3% of the sequences assigned to Staphylococcus aureus.
    3. There is no sign of a possible contamination. Most sequences are classified to taxon on the Staphylococcus aureus taxonomy. Only 3% of the sequences are not classified to Staphylococcus.


In this tutorial, we checked bacterial isolate sequencing data for expected species and potential contamination.