Decontamination of a genome assembly

Author(s)	Delphine Lariviere
Reviewers

Overview
Questions:

How to remove contaminant sequences from your assembly?

Objectives:

Remove contaminant sequences from an assembly

Identify contaminant species

Remove Mitochondrial DNA from an assembly

Requirements:

Introduction to Galaxy Analyses

slides Slides: Quality Control

tutorial Hands-on: Quality Control

Time estimation: 1 hour 30 minutes

Supporting Materials:

Datasets

Workflows

FAQs

video Recordings

video Tutorial (September 2024) - 29m

video View All

instances Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

Possibly Working

UseGalaxy.cz

Published: Sep 4, 2024

Last modification: Jun 4, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00452

rating Rating: 3.5 (2 recent ratings, 9 all time)

version Revision: 5

When sequencing a genome, it is common that contamination from a foreign organism get mixed with the genomic material of our species of interest. For instance, if you are processing a whole body sample of an insect then you will sequence not only the insect, but everything on and inside of it. When building a reference genome, it is important to separate these contaminants from the genome of our species. Foreign DNA sequences could cause false positive identification when running BLAST analyses, the misidentification of genes that don’t actually belong to the species, or they can be incorporated as ‘reference’ sequence for that species in the public archives when those sequences did not actually belong to that species.

In this tutorial, we will learn how to separate these contaminants from a genome assembly of a vertebrate. It will also include the removal of mitogenome contigs so that the final genome assembly contains only nuclear DNA. (A separate tutorial exists for assembling the mitogenome.)

Agenda

In this tutorial, we will cover:

Get data

Genome Masking

Identify Non-Target Contaminants

Investigate our contaminants

Get the list of contaminants to remove from our assembly

Identify mitochondrial DNA

Remove contaminants and mitochondrial DNA from the assembly

Conclusion

Get data

For this tutorial, we are using the genome assembly of the zebra finch (Taeniopygia guttata) generated by the Vertebrate Genome Project. We downsized the assembly to 50 of the largest scaffolds to keep the runtime reasonable, and added some contaminants for the sake of this tutorial.

Comment: Run this analysis on a "real" assembly

If you want to run this analysis on a real assembly generated by the Vertebrate Genome Project, you can find the scaffolded assembly on Genome Ark as a remote repository and upload it to Galaxy (available on the three main Public Galaxy instances: .org, .eu, .org.au).

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a Choose remote files:

Click on Upload Data on the top of the left panel

Click on Choose remote files and scroll down to find your data folder or type the folder name in the search box on the top.

Look for your data under:Genome Ark -> species -> Taeniopygia_guttata -> bTaeGut2 -> assembly_vgp_hic_2.0 -> bTaeGut2.hic.hap1.s2.fasta

click on OK

Click on Start

Click on Close

You can find the dataset has begun loading in you history.

The analysis described in this tutorial will work with any assembly in FASTA format. If your initial dataset is compressed, use the tool Convert compressed file to uncompressed. first.

Hands On: Data Upload
Create a new history for this tutorial
Import the files from Zenodo:
https://zenodo.org/records/13367433/files/Contaminated_assembly.fasta
Copy the link location

Click galaxy-upload Upload at the top of the activity panel

Select galaxy-wf-edit Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Rename the dataset : Contaminated Assembly

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Check that the datatype is fasta

Click on the galaxy-pencil pencil icon for the dataset to edit its attributes

In the central panel, click galaxy-chart-select-data Datatypes tab on the top

In the galaxy-chart-select-data Assign Datatype, select fasta from “New Type” dropdown

Tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

Genome Masking

Genome masking refers to hiding regions of low complexity in the genome, such as repeats. Masking a genome reduces the noise and improves the efficacy of the decontamination by emphasizing the regions that contribute the most to the classification (Saini et al. 2016, Caetano-Anolles).

The first step of our analysis is therefore to mask our genome assembly. There are two types of masking, depending on the requirements of the tools we are using:

Soft Masking The masked regions are replaced by lower case letters. This is the prefered method of masking when the tools understand it, because it does not cause any loss of information.
Hard Masking The masked regions are replaced by a wildcard letter: Ns for nucleotide sequences, Xs for proteins. Use this technique if you are unsure whether the tools you will be using knows to ignore lower case letters.

Repeat masking is important for a lot of analyses like sequence alignments, classification, gene annotation. Repetitive and low complexity regions cause problems for search and clustering algorithms based on patterns. For gene annotation, repeats can cause non-specific gene hits.

In this tutorial we will use Kraken2 to classify our contaminant scaffolds, Blastn to identify the mitochondrial sequences, and DustMasker to mask our genome. Dustmasker generates soft masked sequences, while Kraken2 and Blast needs hard masked sequences, so we will need to do some file manipulations along the way.

Before running the analysis, ensure the sequences in your assembly are uppercase. This will ensure the soft masking can be done properly. If your sequences are not uppercase, then run the optional step “Convert the assembly file to upper case”. We will then convert the lower case letter to Ns to hard mask the sequences for Kraken2 and Blast.

Because soft masking uses lowercase to denote masked base pairs, the input sequence should not have lowercase sequences because then you cannot tell afterwards whether a sequence is lowercase because it started that way or because it was soft masked.

Text transformation ( Galaxy version 1.1.1) with the following parameters:

param-file “File to process”: Contaminated assembly

“SED Program”: s/^[^>].+/\U&/g (Convert letters in the sequence to uppercase)

Hands On: Mask the Genome with Dustmasker

NCBI BLAST+ dustmasker ( Galaxy version 2.14.1+galaxy1) with the following parameters:

“Subject database/sequences”: FASTA file from your history

param-file “Nucleotide FASTA subject file to use instead of a database”: Contaminated assembly (or the output of Text transformation tool if you converted your sequences to upper case)

“DUST level”: 40

“Output format”: FASTA

Rename your file Soft Masked assembly

Take a peek at your genome assembly. You can see that the beginning of the first scaffold has been converted to lower case, which masks repeats corresponding to the telomeres in the sequences.

Hands On: Soft Masking to Hard Masking

Text transformation ( Galaxy version 1.1.1) with the following parameters:

param-file “File to process”: Soft Masked assembly (output of NCBI BLAST+ dustmasker tool)

“SED Program”: /^>/!y/atcgn/NNNNN/: In sequence lines (not starting with “>”), replace each instance of lower case bases (a t c or g), including undefined ones (n), to Ns).

Rename your file Hard Masked assembly

Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.

Finding

Below are just a few examples of basic expressions:

Regular expression Matches

abc an occurrence of abc within your data

(abc|def) abc or def

[abc] a single character which is either a, b, or c

[^abc] a character that is NOT a, b, nor c

[a-z] any lowercase letter

[a-zA-Z] any letter (upper or lower case)

[0-9] numbers 0-9

\d any digit (same as [0-9])

\D any non-digit character

\w any alphanumeric character

\W any non-alphanumeric character

\s any whitespace

\S any non-whitespace character

. any character

\. literal . (period)

{x,y} between x and y repetitions

^ the beginning of the line

$ the end of the line

Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.

Examples

Regular expression matches

\d{4} 4 digits (e.g. a year)

chr\d{1,2} chr followed by 1 or 2 digits

.*abc$ anything with abc at the end of the line

^$ empty line

^>.* Line starting with > (e.g. Fasta header)

^[^>].* Line not starting with > (e.g. Fasta sequence)

Replacing

Sometimes you need to capture the exact value you matched on, in order to use it in your replacement, we do this using capture groups (...), which we can refer to using \1, \2 etc for the first and second captured values. If you want to refer to the whole match, use &.

Regular expression Input Captures

chr(\d{1,2}) chr14 \1 = 14

(\d{2}) July (\d{4}) 24 July 1984 \1 = 24, \2 = 1984

An expression like s/find/replacement/g indicates a replacement expression, this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g) which means it doesn’t stop after the first match.

Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.

You can also use replacement modifier such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.

Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.

There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.

Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.

Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.

Tip: Cyrilex is a visual regular expression tester.

Comment: Unsure about your regex?

Ask ChatGPT about it. E.g. What is the sed command to convert a text to upper case? or test with a regex tester website (Warning, some of these testers do not understand the lowercase and uppercase transformations)

Regular expression	Matches
`abc`	an occurrence of `abc` within your data
`(abc\|def)`	`abc` or `def`
`[abc]`	a single character which is either `a`, `b`, or `c`
`[^abc]`	a character that is NOT `a`, `b`, nor `c`
`[a-z]`	any lowercase letter
`[a-zA-Z]`	any letter (upper or lower case)
`[0-9]`	numbers 0-9
`\d`	any digit (same as `[0-9]`)
`\D`	any non-digit character
`\w`	any alphanumeric character
`\W`	any non-alphanumeric character
`\s`	any whitespace
`\S`	any non-whitespace character
`.`	any character
`\.`	literal . (period)
`{x,y}`	between x and y repetitions
`^`	the beginning of the line
`$`	the end of the line

Regular expression	matches
`\d{4}`	4 digits (e.g. a year)
`chr\d{1,2}`	`chr` followed by 1 or 2 digits
`.*abc$`	anything with `abc` at the end of the line
`^$`	empty line
`^>.*`	Line starting with `>` (e.g. Fasta header)
`^[^>].*`	Line not starting with `>` (e.g. Fasta sequence)

Regular expression	Input	Captures
`chr(\d{1,2})`	`chr14`	`\1 = 14`
`(\d{2}) July (\d{4})`	24 July 1984	`\1 = 24`, `\2 = 1984`

Take a look at your dataset and see that the lower case bases have been converted to Ns.

Identify Non-Target Contaminants

Kraken2 is a tool for taxonomic classification. It matches the kmers found in our assemblies to the ones stored in its databases. For the purpose of decontamination, we only need to identify possible contaminants, not the sequences belonging to our sampled organisms. Therefore, we only need to use a database containing likely contaminants such as bacteria, virus, and human (always wear gloves, folks!). The best option at the time of this tutorial’s writing is the PlusPF database which contains the RefSeq Standard (archaea, bacteria, viral, plasmid, human, UniVec_Core) plus protozoa and fungi data.

Hands On: Identify the taxonomy of our sequences

Kraken2 ( Galaxy version 2.1.1+galaxy1) with the following parameters:

“Single or paired reads”: Single

param-file “Input sequences”: Hard Masked assembly (output of Text transformation tool)

“Print scientific names instead of just taxids”: Yes

“Confidence”: 0.3

“Split classified and unclassified outputs?”: Yes

“Select a Kraken2 database”: PlusPF (2021-05-17)

Kraken provides us with three outputs:

A table containing the classification information for each sequence
A fasta file containing the classified sequences (our contaminants)
A fasta file containing the unclassified sequences (our precious zebra finch)

We could decide to use the fasta file of unclassified sequences as our assembly, but remember that our genome is hard masked. That means that we lost information. Because we don’t use it now doesn’t mean we never will. So instead of using the masked genome, we will remove the classified sequences from our original assembly.

But first, we are naturally curious people, and we want to know what type of contaminants we had in our sample.

Investigate our contaminants

Take a look at the classification output of Kraken2. You can see five columns containing:

C / U: a one letter code indicating that the sequence was either classified or unclassified.
The sequence ID.
The taxonomy ID Kraken2 used to label the sequence; this is 0 if the sequence is unclassified.
The length of the sequence in bp.
A space-delimited list indicating the LCA mapping of each k-mer in the sequence(s). For example, “562:13 561:4 A:31 0:1” would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database

To identify our contaminants, we only need the information from the first three columns, and only for the sequences that are classified. So we will start by extracting our three columns and then filtering out unclassified sequences.

Hands On: Extract classification information

Cut with the following parameters:

“Cut columns”: c1,c2,c3

“Delimited by”: Tab

param-file “From”: Kraken2 on data X: Classification (output of Kraken2 tool)

Comment: Columns definitions

The three columns that we extracted contain:

Whether the sequence is classified C or unclassified U

The sequence name

The sequence classification

Comment: What if my organism is in the database?

The filtering of contaminants is based on their presence in the database. Any sequence classified by Kraken2 is considered as a contaminant. If you are working on an organism that is present in the database, this filtering will not work.

If you can’t find a database that include contaminants but exclude your organism, you will need to process the outputs of Kraken differently. Instead of filtering the sequences on classified/unclassified, select the sequences that do not match the taxonomic ID of your species. That is true for the rest of the section.

Hands On: Remove unclassified sequences

Filter with the following parameters:

param-file “Filter”: Cut on data X (output of Cut tool)

“With following condition”: c1!='U'

Rename the output Contaminants informations

Question

How many different contaminants are in our sample?

We found two contaminants in our sample: Salmonella and Yersinia phage phiR1-37.

Get the list of contaminants to remove from our assembly

In order to remove the contaminant sequences from our assembly, we need a list of the contaminants names, that we will get by extracting the contaminant IDs from the Kraken output.

Hands On: Get contaminant sequence names

Cut with the following parameters:

“Cut columns”: c2

“Delimited by”: Tab

param-file “From”: Contaminants informations (output of Filter tool)

Rename the output List of Contaminants

Now that we have our list of contaminants, we will go through a similar process to extract mitochondrial sequences.

Identify mitochondrial DNA

We start with running Blastn against a database containing mitochondrial sequences.

Hands On: Blastn against RefSeq Mitochondrion

NCBI BLAST+ blastn ( Galaxy version 2.14.1+galaxy1) with the following parameters:

param-file “Nucleotide query sequence(s)”: Hard Masked assembly (output of Text transformation tool)

“Subject database/sequences”: Locally installed BLAST database

“Nucleotide BLAST database”: RefSeq Mitochondrion

“Type of BLAST”: blastn - Traditional BLASTN requiring an exact match of 11, for somewhat similar sequences

“Output format”: Tabular (select which columns)

“Standard columns”: qseqid, sseqid, length, qstart, qend, evalue

“Extended columns”: qlen

“Miscellaneous columns”: qcovs, qcovhsp

“Advanced Options”: Hide Advanced Options

This step may take a little while, so you have time to make yourself a cup of coffee.

Next, we parse the result of the Blastn to extract a report on the alignements between our assemblies and the database, and a list of the scaffolds aligning with mitochondrial sequences.

Hands On: Extract alignement informations from the Blast output

Parse mitochondrial blast ( Galaxy version 1.0.2+galaxy0) with the following parameters:

param-file “Tabular file generated by mito-blast”: blastn Text transformation on data X vs 'refseq_mitochondrion' (output of NCBI BLAST+ blastn tool)

Rename the output List of Mitochondrial sequences

Question

How many scaffolds belonging to the mitochondria are in our sample?

We found two mitochondrial scaffolds in our assembly.

Remove contaminants and mitochondrial DNA from the assembly

From the previous sections, we obtain two files containing a list of sequences belonging to contaminants and one belonging to the mitochondria. We will merge these two lists to clean up our original assembly.

Hands On: Concatenate lists of scaffolds to remove

Concatenate datasets ( Galaxy version 0.1.1) with the following parameters:

param-files “Datasets to concatenate”: List of Contaminants, List of Mitochondrial sequences

Rename the output Sequences to remove

Now we will use gfastats, a tool box for genome assemblies with capability for manipulations and statistics generation.

Hands On: Remove scaffolds from the original genome assembly

gfastats ( Galaxy version 1.3.6+galaxy0) with the following parameters:

param-file “Input file”: output (Input dataset)

“Specify target sequences”: Enabled

param-file “Exclude specific intervals”: Sequences to remove (output of Concatenate datasets tool). This input can be a bed file with sequence coordinates, or, in our case, a simple list of sequence IDs to remove from the fasta file.

“Tool mode”: Genome assembly manipulation

“Output format”: FASTA.gz

Conclusion

In this tutorial we used taxonomic classification to identified contaminants and Blast to identify mitochondrial sequences in a genome assembly. We then removed these sequences to keep only the nuclear DNA of our species of interest.

Now that we have removed the contaminants and mitochondrial DNA from our assembly, it is ready for manual curation!

You've Finished the Tutorial

Key points

Assembly decontamination is important to avoid false identification of genes, blast hits…

Frequently Asked Questions

Have questions about this tutorial? Have a look at the available FAQ pages and support channels

References

Saini, H., S. P. Lal, V. V. Naidu, V. W. Pickering, G. Singh et al., 2016 Gene masking-a technique to improve accuracy for cancer classification with high dimensionality in microarray data. BMC Medical Genomics 9: 261–269. 10.1186/s12920-016-0233-2
Caetano-Anolles, D. Masked reference genomes. https://gatk.broadinstitute.org/hc/en-us/articles/17295731870235-Masked-reference-genomes

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Delphine Lariviere, Decontamination of a genome assembly (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/assembly-decontamination/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{assembly-assembly-decontamination,
author = "Delphine Lariviere",
	title = "Decontamination of a genome assembly (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/assembly-decontamination/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/assembly/tutorials/assembly-decontamination/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: gfastats
  owner: bgruening
  revisions: 3ef480892a9f
  tool_panel_section_label: FASTA/FASTQ
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: d698c222f354
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: d698c222f354
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: text_processing
  owner: bgruening
  revisions: 86755160afbf
  tool_panel_section_label: Text Manipulation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_blast_plus
  owner: devteam
  revisions: d999e774072a
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_blast_plus
  owner: devteam
  revisions: cbf3f518b668
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_blast_plus
  owner: devteam
  revisions: d999e774072a
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_blast_plus
  owner: devteam
  revisions: cbf3f518b668
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: kraken2
  owner: iuc
  revisions: 20e2f64aa1fe
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: kraken2
  owner: iuc
  revisions: cdee7158adf3
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: parse_mito_blast
  owner: iuc
  revisions: 3c9ad4adf8d2
  tool_panel_section_label: Extract Features
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

t{ hist[0] | to_stars }} 4

t{ hist[0] | to_stars }} 3

t{ hist[0] | to_stars }} 1

t{ hist[0] | to_stars }} 1

June 2025

4 stars: Liked: It was esasy to follow. It gave a good understanding of the concept of data cleaning in the sense that you need to think about possible contaminants in your assembly and where they could possibly come from.

3 stars: Liked: I like the workflow , it is easy to find and the explanation is good because ypu understand what are the tools for, so the steps make sense. Disliked: The Parse tool is no longer available on galaxy, I tried to use another tools but it is not working

November 2024

1 stars: Disliked: gfastats was not working and I could not remove the sequences at the end of the tutorial

4 stars: Liked: everything except the two points below in the "what could be improved" section Disliked: In the Parse mitochondrial blast section, I got two output files. The tutorial doesn't mention that there are two, and it tells me to rename the output, but doesn't tell me which one. In the final step (gfastats), it's not clear what the input file should be. The tutorial just says, "output (Input dataset)"...which unless I missed something isn't helpful at all.

October 2024

4 stars: Disliked: some details are lacking in instructions

5 stars: Disliked: blast took more than one hour, maybe the step can be omitted in the tutorial and the results file can be provided?