Library Generation for DIA Analysis

Author(s)	Matthias Fahrner Melanie Föll

Overview
Questions:

How does the peptide identification work in data independent acquisition (DIA) mass spectrometry (MS) data?

What is a spectral library and how to generate a spectral library using Maxquant, diapysef and OpenSwath tools?

Objectives:

Generating a spectral library from data dependent acquisition (DDA) MS data

Understanding DIA data principles and characteristics

Optimizing and refining a spectral library for the analysis of DIA data

Requirements:

Introduction to Galaxy Analyses

Time estimation: 1 hour

Level: Intermediate


Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

Possibly Working

UseGalaxy.org.au

UseGalaxy.be

UseGalaxy.cz

UseGalaxy.no

Published: Dec 2, 2020

Last modification: Jun 2, 2025

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00212


Revision: 12

The proteome refers to the entirety of proteins in a biological system (e.g. cell, tissue, organism). Proteomics is the large-scale experimental analysis of proteins and proteomes, most often performed by mass spectrometry, which enables great sensitivity and throughput. Especially for complex protein mixtures, bottom-up mass spectrometry is the standard approach. In bottom-up proteomics, proteins are digested with a specific protease into peptides, and the measured peptides are reassembled in silico into the corresponding proteins. Inside the mass spectrometer, the intact peptides are measured (MS1 level) and are then fragmented into smaller fragment ions, which are measured again (MS2 level). This is referred to as tandem mass spectrometry (MS/MS). Peptides are identified by peptide spectrum matching, in which theoretical spectra generated from the input protein database (FASTA file) are compared with the measured MS2 spectra. Peptide quantification is most often performed by measuring the area under the curve of the MS1 level peptide peaks, but special techniques such as TMT and DIA allow peptides to be quantified at the MS2 level. Nowadays, bottom-up tandem mass spectrometry approaches allow for the identification and quantification of several thousand proteins.

To enable the identification and quantification of such high numbers of proteins within a limited measurement time, MS methods and data acquisition have been extensively optimized. The most commonly used technique is data dependent acquisition (DDA). Here, only the most abundant (top N) peptides from each MS1 scan are selected for subsequent fragmentation, because their high-intensity signals offer the best chance of successful identification. Afterwards, the respective mass-over-charge (m/z) values are put on an exclusion list, which enables the selection of less abundant peptides from the following MS1 scan for the next cycle of top N fragmentation (MS2) scans. Since this acquisition method is highly dependent on the abundance of the respective peptides at a given retention time, it is referred to as data dependent acquisition. This dependency on both time and intensity means that the set of identified peptides differs from measurement to measurement.
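
The top N selection with an exclusion list can be illustrated with a small sketch. This is a deliberately simplified model and not how any particular instrument is implemented: the function name `select_top_n` and the toy peak lists are hypothetical, m/z values are matched exactly instead of within a tolerance, and excluded values never expire, whereas real instruments typically use time-limited exclusion windows.

```python
def select_top_n(ms1_scans, top_n):
    """Return, for each MS1 scan, the precursor m/z values picked for fragmentation."""
    excluded = set()  # simplified exclusion list: m/z values that were already fragmented
    selections = []
    for scan in ms1_scans:
        # Only peaks that are not on the exclusion list are candidates.
        candidates = [(mz, intensity) for mz, intensity in scan if mz not in excluded]
        # Pick the N most intense candidates.
        candidates.sort(key=lambda peak: peak[1], reverse=True)
        picked = [mz for mz, _ in candidates[:top_n]]
        excluded.update(picked)
        selections.append(picked)
    return selections


# Hypothetical toy data: two consecutive MS1 scans showing the same four peptides
# as (m/z, intensity) pairs.
scan = [(400.2, 1000.0), (512.3, 800.0), (623.4, 50.0), (701.5, 20.0)]
selections = select_top_n([scan, scan], top_n=2)
print(selections)  # [[400.2, 512.3], [623.4, 701.5]]
```

In the second scan the two low-abundance peptides are selected only because the two abundant ones are already excluded, which is why the outcome depends on both intensity and timing.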

Another, more recently developed method is the so-called data independent acquisition (DIA). Due to some unique characteristics of this acquisition method, it requires spectral libraries, which contain the peptide sequences as well as their respective fragment spectrum (MS2) information. Here we will use DDA data to build a spectral library, which can then be used for the identification of peptides in DIA data from the same samples using e.g. OpenSwathWorkflow (see the OpenSwathWorkflow tutorial).

**Figure 1**: Label-free proteomics using liquid chromatography tandem mass spectrometry (LC-MS/MS). Principles of data dependent acquisition (DDA).

High quality spectral libraries can be generated by fractionation and separate measurement of a representative sample or mixture for a cohort. During the analysis, the peptide and protein identifications of all fractions can then be combined into comprehensive spectral libraries. Such spectral libraries tend to be relatively large and offer great sensitivity (even for low abundant peptides), but they also contain peptides which are only observable due to the prior fractionation. Another way of generating spectral libraries is to measure the same samples using DDA as well as DIA. Here the DDA results are analyzed and a sample-specific spectral library is generated, which can then be used for the analysis of the DIA data. The resulting spectral libraries are often smaller and contain only peptides which are observable in complex and non-fractionated samples.

A plethora of software solutions have been developed for the analysis of DDA proteomics data. MaxQuant is one of the most popular proteomics software packages because it is free, easy to use and offers functionality for nearly all kinds of proteomics data analysis challenges (Cox and Mann 2008). Mass spectrometry raw data is normally obtained in a vendor-specific, proprietary file format. MaxQuant can take those raw files directly as input. For peptide identification, MaxQuant uses a search engine called “Andromeda”.

The dataset in this training consists of different spike-in mixtures with constant amounts of human peptides and varying amounts of E. coli peptides. Each sample is a non-fractionated MS measurement (DDA) of a different spike-in ratio. To facilitate the analysis and directly combine the identifications of all measurements, we will define the different measurements as fractions of one sample, as would be the case for e.g. high pH, SCX or HILIC fractionated samples.

Agenda

In this tutorial, we will cover:

Get data

MaxQuant Analysis of DDA data

Quality control results

Filter for unique peptides in the evidence and msms output

diapysef library generation

Spectral library refinement with OpenSwathAssayGenerator

Adding decoy sequences with OpenSwathDecoyGenerator

Converting the spectral library with TargetedFileConverter

Conclusion

Get data

Hands On: Data upload
Create a new history for this tutorial and give it a meaningful name

To create a new history, simply click the new-history icon at the top of the history panel.
Import the fasta and raw files as well as the experimental annotation and the iRTassays file from Zenodo:
https://zenodo.org/record/4293493/files/Human_database_plus_iRT.fasta
https://zenodo.org/record/4293493/files/Ecoli_database.fasta
https://zenodo.org/record/4293493/files/iRTassays.tsv
https://zenodo.org/record/4293493/files/HEK_Ecoli_exp_design.txt
https://zenodo.org/record/4293493/files/Sample1.raw
https://zenodo.org/record/4293493/files/Sample2.raw
https://zenodo.org/record/4293493/files/Sample3.raw
https://zenodo.org/record/4293493/files/Sample4.raw
https://zenodo.org/record/4293493/files/Sample5.raw
Copy the link location

Click Upload at the top of the activity panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window
Once the files are green, rename the experimental design file to ‘HEK_Ecoli_exp_design’, the iRT transition file to ‘iRTassays’ and the raw files to ‘Sample1.raw’, ‘Sample2.raw’, ‘Sample3.raw’, ‘Sample4.raw’ and ‘Sample5.raw’

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Once the files are green, generate a collection for all .raw files (and name it DDA_data) as well as for both .fasta files (and name it FASTA)

Click on Select Items at the top of the history panel

Check all the datasets in your history you would like to include

Click n of N selected and choose Advanced Build List

You are now in the collection building wizard. Choose Flat List and click the ‘Next’ button in the bottom right corner.

Double-click on the file names to edit them. For example, remove file extensions or common prefixes/suffixes to clean up the names.

Enter a name for your collection

Click Build to build your collection

Click on the checkmark icon at the top of your history again

MaxQuant Analysis of DDA data

The MaxQuant Galaxy implementation contains the most important MaxQuant parameters. As an alternative, the MaxQuant (using mqpar.xml) tool can be used with a preconfigured mqpar.xml file.

Hands On: MaxQuant Analysis

MaxQuant (Galaxy version 2.0.3.0+galaxy0) with the following parameters:

In “Input Options”:

“FASTA files”: FASTA collection

“identifier parse rule”: >([^ ]*)

“description parse rule”: ^>.*\|.*\|[^ ]+ (.*) OS.*$

In “Search Options”:

“Specify an experimental design template”: HEK_Ecoli_exp_design.txt

“minimum unique peptides”: 1

“Match between runs”: Yes

In “Protein quantification”:

“Use only unmodified peptides”: No

In “Parameter Group”:

“Infiles”: DDA_data collection

“missed cleavages”: 1

“variable modifications”: Oxidation (M)

“Generate PTXQC (proteomics quality control pipeline) report? (experimental setting)”: True

In “Output Options”:

“Select the desired outputs.”: Protein Groups, Peptides, mqpar.xml, Evidence, MSMS

Comment: Protein Groups

Proteins that share all their peptides with other proteins cannot be unambiguously identified. Therefore, MaxQuant groups such proteins into one protein group, and only one common quantification is calculated. The properties of the different proteins are separated by semicolons.

Here we used a variation of the “identifier parse rule” to allow for the organism label to be present in the spectral library. For investigations of single organisms, and to keep only the UniProt identifier, one can adjust the “identifier parse rule” accordingly. More details on the different MaxQuant parameters can be found in the MaxQuant tutorial.
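
To see what the two parse rules extract, they can be tried on a FASTA header with Python's `re` module. This is only an illustration: the header below is a made-up UniProt-style example rather than an entry from the tutorial databases, and it assumes that these simple patterns behave the same way in Python as in MaxQuant's own regular expression engine.

```python
import re

# The two parse rules exactly as entered in the MaxQuant tool form.
IDENTIFIER_RULE = r">([^ ]*)"
DESCRIPTION_RULE = r"^>.*\|.*\|[^ ]+ (.*) OS.*$"

# Hypothetical UniProt-style header, for illustration only.
header = ">sp|P00000|EXMPL_HUMAN Example protein OS=Homo sapiens OX=9606"

identifier = re.search(IDENTIFIER_RULE, header).group(1)
description = re.search(DESCRIPTION_RULE, header).group(1)

print(identifier)   # sp|P00000|EXMPL_HUMAN
print(description)  # Example protein
```

With this rule the identifier is everything up to the first space, so in this example the entry name with its organism suffix (`_HUMAN`) stays part of the identifier.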

In case the MaxQuant run is not yet finished, the results can be downloaded from Zenodo in order to continue the tutorial.
Import the files from Zenodo
https://zenodo.org/record/4293493/files/MaxQuant_PTXQC_report.pdf
https://zenodo.org/record/4293493/files/MaxQuant_Protein_Groups.tabular
https://zenodo.org/record/4293493/files/MaxQuant_Peptides.tabular
https://zenodo.org/record/4293493/files/MaxQuant_Evidence.tabular
https://zenodo.org/record/4293493/files/MaxQuant_MSMS.tabular
https://zenodo.org/record/4293493/files/MaxQuant_mqpar.xml

Quality control results

To get a first overview of the MaxQuant results, the PTXQC report is helpful. Click on the eye icon of the PTXQC pdf file to open it in Galaxy. Browsing through the different plots already gives an indication of how many peptides and proteins were identified, as well as of the data quality.

The PTXQC software (Bielow et al. 2015) was built to enable direct proteomics quality control from MaxQuant result files. This quality control can be directly used in the Galaxy MaxQuant wrapper by setting “Generate PTXQC” to True. This will generate a pdf file with multiple quality control plots. Be aware that the cutoffs set in PTXQC might not be applicable to your experiment and mass spectrometer type and therefore “under performing” and “fail” do not necessarily mean that the quality is poor.

Question

How many proteins were found in total?

How many peptides were found in total?

Approximately 5,370 proteins were found (see ProteinGroups output)

Approximately 34,970 peptides were found (see Peptides output)

Filter for unique peptides in the evidence and msms output

Hands On: Filtering the MaxQuant search results for unique peptides

Filter with the following parameters:

“Infile”: MaxQuant_Evidence

“With following condition”: len(c9.split(';')) < 2

“Number of header lines to skip”: 1

Rename the output file to ‘Filter on MaxQuant_Evidence’

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field

Click the Save button

Filter with the following parameters:

“Infile”: MaxQuant_MSMS

“With following condition”: len(c12.split(';')) < 2

“Number of header lines to skip”: 1

Rename the output file to ‘Filter on MaxQuant_MSMS’

Comment: Unique Features / peptides

Peptides which occur in only one protein are considered unique for this specific protein. To avoid ambiguous protein mapping later, we filter the MaxQuant search results for unique peptides only. CAUTION: This increase in specificity (each peptide originates from only one protein) will reduce the size of the spectral library and might lead to decreased sensitivity during the DIA analysis.
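
The filter condition `len(cN.split(';')) < 2` keeps a row only if column N holds a single entry, i.e. no semicolon-separated list of several proteins. A minimal sketch of the same logic in plain Python is shown below. The function `filter_unique` and the three-column table are hypothetical, and the sketch assumes that header lines should be copied through unchanged.

```python
import csv
import io


def filter_unique(table_text, column, header_lines=1):
    """Keep rows whose 1-based `column` holds fewer than two ';'-separated entries."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter="\t"))
    header, body = rows[:header_lines], rows[header_lines:]
    # Same test as the Galaxy condition len(cN.split(';')) < 2.
    kept = [row for row in body if len(row[column - 1].split(";")) < 2]
    return header + kept


# Hypothetical toy table; here the protein column is column 2.
table = (
    "Sequence\tProteins\tIntensity\n"
    "PEPTIDEK\tP1\t100\n"
    "SHAREDK\tP1;P2\t200\n"
    "ANOTHERR\tP3\t300\n"
)
filtered = filter_unique(table, column=2)
print(len(filtered) - 1)  # 2 data rows remain
```

The row for `SHAREDK` is dropped because it maps to two proteins, which is the behaviour the tutorial relies on for the evidence (column 9) and msms (column 12) tables.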

Question

How many lines (unique features) remain after the filtering in the evidence and the msms output?

~100,000 lines in the evidence and ~98,000 lines in the msms.

diapysef library generation

Hands On: Generation of a spectral library using the unique MaxQuant search results and indexed retention time (iRT) peptides

diapysef library generation (Galaxy version 0.3.5.0) with the following parameters:

In “MaxQuant output file msms.txt”:

“Infile”: Filter on MaxQuant_MSMS

In “MaxQuant output file evidence.txt”:

“Infile”: Filter on MaxQuant_Evidence

In “Retention time alignment method”:

“Infile”: iRTassays.tsv

Comment: Indexed retention time (iRT) peptides

To allow for improved alignment of the different measurements, synthetic (non-endogenous) peptides were spiked into all samples prior to the MS measurement. First, the measured retention times (RTs) of those spiked-in peptides are extracted from the MaxQuant search results. Then a linear regression through the 11 synthetic peptides is fitted, using the measured RTs and arbitrarily assigned reference values (ranging from -26 to 99). Based on this linear fit, indexed retention times (iRTs) are assigned to all other peptides.
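
The regression step can be sketched in a few lines of Python. This is not the diapysef implementation, only an ordinary least-squares fit under stated assumptions: the retention times and reference values below are invented so that the fit is exact, and `fit_line` and `rt_to_irt` are hypothetical helper names.

```python
def fit_line(x_values, y_values):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: measured RTs (minutes) of the spiked-in peptides in one run
# and the reference iRT values assigned to them.
measured_rt = [10.0, 20.0, 30.0, 40.0]
reference_irt = [-20.0, 10.0, 40.0, 70.0]

slope, intercept = fit_line(measured_rt, reference_irt)


def rt_to_irt(rt):
    """Convert a measured retention time of any other peptide into an iRT value."""
    return slope * rt + intercept


print(round(rt_to_irt(25.0), 3))  # 25.0
```

Because each run gets its own fit, peptides measured at different absolute retention times in different runs end up on one common iRT scale.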

Question

How many of the 11 synthetic peptides were found in the first DDA (Sample1) file?

How many were found in the second DDA (Sample2) file?

8 iRT peptides were found in the first DDA file

9 iRT peptides were found in the second DDA file

Spectral library refinement with OpenSwathAssayGenerator

Hands On: Spectral library optimization and refinement using OpenSwathAssayGenerator

OpenSwathAssayGenerator (Galaxy version 2.6+galaxy0) with the following parameters:

“Output file type – default: determined from file extension or content”: tabular (tsv)

“MZ threshold in Thomson for precursor ion selection”: 0.015

“upper MZ limit for precursor ions”: 1000.0

“MZ threshold in Thomson for fragment ion annotation”: 0.015

Comment: Adjustment of the Spectral library

The spectral library coming from diapysef library generation contains all observed fragment ions from the DDA runs, resulting in a relatively large spectral library. Large libraries can lead to increased processing times as well as fewer identifications after adjusting based on the False Discovery Rate (FDR). Thus, it is recommended to optimize and refine the spectral library by e.g. keeping only peptides with at least 6 transitions (increased confidence) and also limiting the maximum to 6 transitions (avoiding inflated libraries). Furthermore, the precursor m/z range can be adjusted (here 400 - 1000 m/z) to cover the same m/z range as the DIA measurements.
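
The effect of these settings can be sketched as follows. This is not the OpenSwathAssayGenerator code, only an illustration under stated assumptions: the function `refine_library`, the dictionary keys and the toy library are hypothetical, and the sketch assumes that the most intense transitions are the ones kept.

```python
from collections import defaultdict


def refine_library(transitions, min_transitions=6, max_transitions=6,
                   precursor_lower=400.0, precursor_upper=1000.0):
    """Filter a list of transition dicts (keys: peptide, precursor_mz, intensity)."""
    by_peptide = defaultdict(list)
    for transition in transitions:
        # Drop precursors outside the m/z range covered by the DIA measurement.
        if precursor_lower <= transition["precursor_mz"] <= precursor_upper:
            by_peptide[transition["peptide"]].append(transition)

    refined = []
    for group in by_peptide.values():
        # Require a minimum number of transitions per peptide.
        if len(group) < min_transitions:
            continue
        # Keep only the most intense transitions, up to the maximum.
        group.sort(key=lambda t: t["intensity"], reverse=True)
        refined.extend(group[:max_transitions])
    return refined


def make(peptide, precursor_mz, count):
    """Build `count` hypothetical transitions with intensities 1.0 .. count."""
    return [{"peptide": peptide, "precursor_mz": precursor_mz, "intensity": float(i)}
            for i in range(1, count + 1)]


# Hypothetical toy library: A has 8 transitions, B has only 4, C lies above 1000 m/z.
library = make("PEPTIDEA", 500.0, 8) + make("PEPTIDEB", 600.0, 4) + make("PEPTIDEC", 1200.0, 7)
refined = refine_library(library)
print(len(refined))  # 6
```

Only the six most intense transitions of the first peptide survive, which shows how strongly these filters can shrink a library.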

Question

How many transitions does the refined spectral library contain and how many were removed?

The refined library contains ~170,000 lines, and almost 700,000 lines were removed.

Adding decoy sequences with OpenSwathDecoyGenerator

Hands On: Adding decoy transitions to the spectral library

OpenSwathDecoyGenerator (Galaxy version 2.6+galaxy0) with the following parameters:

“Output file type – default: determined from file extension or content”: tabular (tsv)

“Advanced Options”: Show Advanced Options

“MZ threshold in Thomson for fragment ion annotation”: 0.015

Comment: Decoy transitions

To enable correct false discovery rate (FDR) computation later on, we add computationally generated decoy sequences to the spectral library. Those “non-observed” sequences can be generated from the observed sequences with slight modifications. The most commonly used methods are shuffle (randomly rearranging the observed amino acid sequence) and reverse (reversing the observed sequence). The artificially generated transitions are labelled as decoys and are later considered known false positives. Example: applying an FDR of 1 % allows only 1 such decoy transition per 100 identifications. One can then estimate that the remaining 99 non-decoy identifications also contain about 1 % false positive hits.
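
Both decoy methods and the FDR estimate from the example can be written down in a few lines. This is a simplified sketch and not the OpenSwathDecoyGenerator algorithm, which works on transitions and offers further options. The helper names and the example peptide are hypothetical.

```python
import random


def reverse_decoy(sequence):
    """Reverse method: read the observed sequence backwards."""
    return sequence[::-1]


def shuffle_decoy(sequence, seed=0):
    """Shuffle method: randomly rearrange the residues (seeded for reproducibility)."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def estimate_fdr(n_targets, n_decoys):
    """Estimate the FDR among target hits as the ratio of decoy hits to target hits."""
    return n_decoys / n_targets


target = "PEPTIDEK"  # hypothetical observed peptide
print(reverse_decoy(target))                            # KEDITPEP
print(sorted(shuffle_decoy(target)) == sorted(target))  # True: same residues, new order
print(round(estimate_fdr(99, 1), 4))                    # 0.0101
```

The last line reproduces the example in the text: 1 decoy among 100 identifications leaves 99 targets with an estimated false positive rate of about 1 %.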

Question

How many lines do you expect after running the OpenSwathDecoyGenerator on your spectral library containing only observed transitions?

We expect double the amount of lines (~340,000), since we generate one decoy (non-observed) per target (observed) sequence.

Converting the spectral library with TargetedFileConverter

Hands On: Converting the final spectral library from .tsv to the sqlite .pqp format

TargetedFileConverter (Galaxy version 2.6+galaxy0) with the following parameters:

“Output file type – default: determined from file extension or content”: pqp

Comment: Finding the right format for the spectral library.

Generally, there is a broad variety of data formats for spectral libraries, including .tsv, .dlib, .pqp and more. Depending on the DIA analysis software, one or multiple formats are supported and can be used. In OpenSwathWorkflow, the spectral library needs to be in .pqp format to be able to combine multiple runs after the DIA analysis and before applying the FDR scoring. This is particularly useful when multiple runs are to be compared later on.
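
Since .pqp is an SQLite-based format, a downloaded library can be inspected with Python's built-in `sqlite3` module. The sketch below lists every table with its row count without assuming any particular schema. The function name `table_row_counts` is hypothetical, and the demo file is a stand-in with a single made-up table, not a real spectral library.

```python
import os
import sqlite3
import tempfile


def table_row_counts(path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    connection = sqlite3.connect(path)
    try:
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names}
    finally:
        connection.close()


# Demo on a stand-in file; the PEPTIDE table and its columns are assumptions
# made for illustration, not a description of the actual .pqp schema.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.pqp")
    demo = sqlite3.connect(demo_path)
    demo.execute("CREATE TABLE PEPTIDE (ID INTEGER, DECOY INTEGER)")
    demo.executemany("INSERT INTO PEPTIDE VALUES (?, ?)", [(1, 0), (2, 1)])
    demo.commit()
    demo.close()
    counts = table_row_counts(demo_path)

print(counts)  # {'PEPTIDE': 2}
```

Pointing `table_row_counts` at the .pqp file downloaded from your Galaxy history gives a quick overview of what the library contains.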

Question

How many peptides and proteins does the final library contain?

How many target peptides and proteins are in the library?

The final spectral library contains over 56,000 peptides covering over 10,200 proteins.

Since we added equal numbers of decoy sequences we expect to have only half of the peptides and proteins (~28,000 and ~5,100) which are real targets.

Conclusion

**Figure 2**: All-in-one workflow for DIA analysis in Galaxy. The generation of a spectral library is highlighted in red.

Using data dependent acquisition (DDA) mass spectrometry data, we generated a spectral library which can be used for the analysis of data independent acquisition (DIA) data. The quality and content of the spectral library are crucial for the subsequent DIA data analysis. Thus, in some analyses it might be beneficial or even necessary to use slightly adjusted spectral libraries, containing e.g. semi-tryptic peptides (see enzyme in MaxQuant), non-unique peptides (skip filtering) or longer peptides (higher missed cleavages).

**Figure 3**: Galaxy workflow containing all necessary steps for the generation of a spectral library.

You've Finished the Tutorial

Key points

MaxQuant, diapysef and OpenSwath tools enable the generation of spectral libraries from DDA data

Generated spectral libraries can be used for the analysis of DIA data


References

Cox, J., and M. Mann, 2008 MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology 26: 1367–1372. 10.1038/nbt.1511
Bielow, C., G. Mastrobuoni, and S. Kempa, 2015 Proteomics Quality Control: Quality Control Software for MaxQuant Results. Journal of Proteome Research 15: 777–787. 10.1021/acs.jproteome.5b00780


Citing this Tutorial

Matthias Fahrner, Melanie Föll, Library Generation for DIA Analysis (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{proteomics-DIA_lib_OSW,
author = "Matthias Fahrner and Melanie Föll",
	title = "Library Generation for DIA Analysis (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: diapysef
  owner: galaxyp
  revisions: 8eed818f5d48
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: maxquant
  owner: galaxyp
  revisions: 37d669de2828
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: maxquant
  owner: galaxyp
  revisions: 9c52362ae7bb
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_openswathassaygenerator
  owner: galaxyp
  revisions: df72ca515420
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_openswathassaygenerator
  owner: galaxyp
  revisions: 2a9ff56cb279
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_openswathdecoygenerator
  owner: galaxyp
  revisions: 965413f29bf5
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_openswathdecoygenerator
  owner: galaxyp
  revisions: f861ec9fbe59
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_targetedfileconverter
  owner: galaxyp
  revisions: aa62f8f48dd4
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: openms_targetedfileconverter
  owner: galaxyp
  revisions: dd71e020e2aa
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
