View markdown source on GitHub

Essential genes detection with Transposon insertion sequencing

Contributors

Questions

Objectives

last_modification Published: Jul 2, 2019
last_modification Last Updated: Jul 2, 2019

Motivation

Transposon insertion sequencing is a technique used to functionally annotate bacterial genomes. The genome is saturated by transposon insertions. The insertion of a transposon being disruptive for the region, the analysis of insertion frequency provides information on how the bacteria fitness change due to this disruption :

Speaker Notes

Identifying the impact of mutations on bacterial fitness allows to classify certain gene as essential for the bacteria in different conditions.


Transposon insertion sequencing method


Data production

.center[Description of tnseq method

From Chao et al.]

Speaker Notes The initial population genomes are mutated so that the genome is saturated with transposon insertions. A library is saturated if in the genomes across the whole population of bacteria, each potential insertion site has at least one insertion. The population is then divided into several media containing different growth conditions. After growth, the regions flanking the insertion are amplified and sequenced, allowing to determine the location of the insertion.


Analysis

.center[ Description of tnseq method

From Chao et al. ]

Speaker Notes

After alignement to the reference genome, the resulting data will show a discrete repartition of reads on each TA site. If a gene present several insertions, like the two leftmost genes in Condition A, it means that its disruption has little or no impact to the bacterial growth. On the other hand, when a gene shows no insertions at all, like the rightmost gene in Condition A, is means that any disruption in this gene killed the bacteria, meaning its a gene essential to bacteria survival. If the library is sufficiently saturated, there is a clear threshold between essential and non-essential genes when you analyze the insertion rate per gene.


Note:

.left[


Experiment parameters


Types of transposon

Speaker Notes

In this tutorial, we are using mariner transposon, that target TA sequences.


Library complexity

Speaker Notes

The advantage of a target specific transposon, like the mariner, in opposition of a Tn5-based transposon inserting randomly, it that the limited number of insertion sites makes it easier to build high complexity libraries.


Structure of the tranposon constructs


Desired Properties of the constructs

Speaker Notes


Structure of the modified mariner transposon Himar1

.center[ Structure of the transpson construct, containing several barcodes and adapters

From Santiago et al. ]

Speaker Notes

Note: Because of this complex tranposon structure, the reads obtained after sequencing contain a large portion of tranposon sequence for a 16-17 bp genomic sequence. This will necessitate several step of pre-processing to extract this genomic sequence.


Tnseq analysis

Speaker Notes With the count of insertions at every insertion site, there is several methods existing to identify essential genes of regions. They can be divided in two major categories :

  1. Annotation dependent : The read counts and/or insertion frequency are calculated across defined regions (genes, promoters … )
  2. Annotation independent : The read counts and/or disrupted sites are considered across the whole genome, independently of defined structures.

Methods of Analysis

.center[ Different types of TnSeq Analyses methods

From Chao et al. ]

Speaker Notes

Annotation dependent method The total read count an/or percentage of disrupted site are computed per annotated regions. The values are then compared to the rest of the genome to classify the genes into the categories essential or non-essential. Annotation independent method The total read count and/or disrupted sites are computed independently of annotated regions. One of these methods is using a sliding window. Each window is then classified into the categories essential or non-essential. After the windows have been classified, they are linked annotations, and the genes/regions can be classified as essential, non-essential, or domain essential according to the classification of the windows they cover. The same classification can be done using HMM based methods instead of sliding windows. In that case, each insertion site will be predicted as essential or non essential. From Chao et al.


TRANSIT Tool

.left[

From DeJesus et al.

Speaker Notes

We are presenting only the Himar1 methods, other tn5 specific methods existing

Gumbel The Gumbel can be used to determine which genes are essential in a single condition. It does a gene-by-gene analysis of the insertions at TA sites with each gene, makes a call based on the longest consecutive sequence of TA sites without insertion in the genes, calculates the probability of this using a Bayesian model.

HMM The HMM method can be used to determine the essentiality of the entire genome, as opposed to gene-level analysis of the other methods. It is capable of identifying regions that have unusually high or unusually low read counts (i.e. growth advantage or growth defect regions), in addition to the more common categories of essential and non-essential.

Re-sampling The re-sampling method is a comparative analysis the allows that can be used to determine conditional essentiality of genes. It is based on a permutation test, and is capable of determining read-counts that are significantly different across conditions.


Workflow of Analysis of Tnseq Data


Agenda : From reads to essential genes annotations

  1. Removing all non genomic sequences from the sequenced reads
    1. Data Structure
    2. Separating reads from different experimental conditions
    3. Removing Adapter sequence
    4. Separate reads from different transposon constructs
    5. Removing remaining transposon sequence.
  2. Counting the number of insertion per TA sites
    1. Aligning the reads to a reference genome
    2. Getting coverage of the genome
    3. Getting TA sites positions
    4. Merging overall coverage and TA sites to get the coverage of each TA sites
  3. Predicting Essential Genes with Transit
    1. Predict the essentiality of genes
    2. Understand the output

1. Removing all non genomic sequences from the sequenced reads

.left[#### 1. Data Structure]

Workflow for preprocessing the read before alignment

Speaker Notes

The experimental design of transposon insertion sequencing produces raw reads containing a lot of adapters and foreign sequences that has been used to insert and target the transposon. In order to obtain the core reads that contain only genomic sequence, we have a number of steps to do to remove them and divide the reads per experimental condition and type of transposon.

The pre-processing of the data will be done through several steps of Cutadapt software, first we sill separate the reads of each experimental condition based on a 8 bp barcode at the beginning of the read (Illumina demultiplexing). The tail of each set of read is then removed. It immediately follows the 3bp barcode specific to transposon constructs, and contains illumina adapter sequence and downstream. To be sure all our reads have been trimmed correctly we filter out the reads too large. We then separate the reads per transposon construct and then remove the remaining transposon sequence containing MmeI.

1. Removing all non genomic sequences from the sequenced reads

2. Separating reads from different experimental conditions

We start by dividing the initial data set by experimental conditions thanks to a 8bp barcode added by the Illumina multiplexing protocol. To do that we use Cutadapt with the fastq faile containing the reads and a fasta file containing the barcodes for each condition :

.center[ Barcode data:

 >control
 ^CTCAGAAG
 >condition
 ^GACGTCAT

]

The ‘^’ at the beginning of the sequence means we want to anchor the barcode at the beginning of the read. To know more about the symbols used by cutadapt, see cutadapt manual

Speaker Notes The output is a collection of the different conditions dataset, here control and condition, and a report text file. You can look at the report and see that 100% of the reads has been trimmed.


1. Removing all non genomic sequences from the sequenced reads

3. Removing Adapter sequence

4. Removing Adapter sequence

5. Removing remaining transposon sequence

Speaker Notes

To remove the reads that might not have been trimmed because of too many mismatches or other reasons, we will then filter the reads by size. Because we know the approximate size of the remaining sequences, we can filter the reads based on this estimation.

4.

We have now removed the transposon sequences that was outside of the 3 bp barcode specific to the type of construct. The constructs used in this experiment contain different strengths and directions of promoters. It means that in addition of disrupting a gene at the location of the insertion, it will modify the expression of either upstream or downstream regions. The analysis of this modification will be studied in another training material, but for now we will consider it does not impact the essentiality analysis and we will use the different constructs as replicates. We therefore need to separate the reads based on the construct specific 3bp barcodes.

All these steps are performed with the Cutadapt tool


2. Counting the number of insertion per TA sites

1. Aligning the reads to a reference genome


2. Counting the number of insertion per TA sites

2. Getting coverage of the genome

.center[ The read align to the genome with its 3' end covering half of the TA site ]

Speaker Notes

In our case, the reads cover the flanking region on one side of the TA site where the transposon inserted. That means we do not want to have the coverage across the whole reads, as it could cover several TA sites, but only the coverage at the end of the read.

The sequenced read cover the 5’ region flanking the site of insertion. To assign the read to its correct insertion, we need to compute the coverage at the 3’ end of the read.


2. Counting the number of insertion per TA sites

3. Getting TA sites positions

Speaker Notes

As we are not considering strand separately in this analyses, we will consider both count as attached to the leftmost base of the TA site. To do that we will create two list of TA site positions, listing the 5’ end of each TA site for forward and reverse strand, and then merge them to get a global count per TA site.

In order to get the coverage on each TA site we need to prepare a file containing the position of each TA site. Depending on the direction of insertion the coverage will be counted on the leftmost position of the TA site or on the rightmost. As we are not considering strand separately in this analyses, we will consider both count as attached to the leftmost base of the TA site. To do that we will create two list of TA site positions, listing the 5’ end of each TA site for forward and reverse strand, and then merge them to get a global count per TA site.

The coordinates provided by wordmatch are based on count 1. Meaning the first nucleotide is counted as number 1. However, bamCoverage count the first nucleotide as nucleotide number 0. To be able to compare the two results, we need to shift the TA site positions by 1.


2. Counting the number of insertion per TA sites

4. Merging overall coverage and TA sites to get the coverage of each TA sites

Speaker Notes

Now that have the files containing the coordinates of our TA sites for both strands. We will cross them with the coverage files to get the coverage on both sides of each TA site.

We now have a read count for each nucleotides of the TA sites. The insertions counted in the forward strand files are in a different direction than those counted on the reverse strand file. In our case, we are only interested in studying the gene disruption, and therefore we do not want to consider the directions separately. We will add the forward and reverse count together to get a total count per TA site. In order to do that we need to assign the count of both strand to the same nucleotide. We will do that by shifting by one position the count on the reverse strand.


3. Predicting Essential Genes with Transit

1. Predict the essentiality of genes with Transit

The output of Transit is a tabulated file containing the following collumns (you can find more information on Transit manual page):

We can obtain the list of genes predicted as essential by filtering on the essentiality call.

Speaker Notes

Now that we have the counts of insertions per TA site, we can use them to predict gene esssentiality with Transit. In order to do that, we need to create a an annotation file in the prot_table format, specifique to the Transit tool. You can create this file from a gff3 from GenBank.

Now that we have prepared the annotation file, we can use the count per TA site to predict essential genes using Transit tool. We will modify a couple parameters from the default settings :


3. Predicting Essential Genes with Transit

2. Understand the output

.left[

Orf Name Desc k n r s zbar Call
SAOUHSC_00002 - DNA polymerase III subunit 2 96 94 1102 1.000000 E
SAOUHSC_00008 - histidine ammonia- 54 128 8 95 0.000000 NE
SAOUHSC_00028 - hypothetical protein 8 18 4 24 -1.000000 S
SAOUHSC_00227 - hypothetical protein 21 162 24 267 0.970900 U

]

Speaker Notes

ORF Gene ID.

Name Name of the gene.

Description Gene description.

k Number of Transposon Insertions Observed within the ORF.

n Total Number of TA dinucleotides within the ORF.

r Length of the Maximum Run of Non-Insertions observed.

s Span of nucleotides for the Maximum Run of Non-Insertions.

zbar Posterior Probability of Essentiality.

Call Essentiality call for the gene. Depends on FDR corrected thresholds. E=Essential U=Uncertain, NE=Non-Essential, S=too short

Note: Technically, Bayesian models are used to calculate posterior probabilities, not p-values (which is a concept associated with the frequentist framework). However, we have implemented a method for computing the approximate false-discovery rate (FDR) that serves a similar purpose. This determines a threshold for significance on the posterior probabilities that is corrected for multiple tests. The actual thresholds used are reported in the headers of the output file (and are near 1 for essentials and near 0 for non-essentials). There can be many genes that score between the two thresholds (t1 < zbar < t2). This reflects intrinsic uncertainty associated with either low read counts, sparse insertion density, or small genes. If the insertion_density is too low (< ~30%), the method may not work as well, and might indicate an unusually large number of Uncertain or Essential genes.


Take Home


Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! Galaxy Training Network Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.