Peptide and Protein Quantification via Stable Isotope Labelling (SIL)
Author(s) | Florian Christoph Sigloch Björn Grüning Matthias Fahrner |
Reviewers |
OverviewQuestions:Objectives:
What are MS1 features?
How to quantify based on MS1 features?
How to map MS1 features to MS2 identifications?
How to evaluate and optimize the results?
Requirements:
MS1 feature quantitation and mapping of quantitations to peptide and protein IDs.
- Introduction to Galaxy Analyses
- tutorial Hands-on: Protein FASTA Database Handling
- tutorial Hands-on: Peptide and Protein ID using OpenMS tools
Time estimation: 1 hourLevel: Advanced AdvancedSupporting Materials:Published: Mar 17, 2017Last modification: Nov 9, 2023License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00230version Revision: 31
To compare protein amounts in different samples from MS/MS data, two different experiment setups exist. Firstly, unmodified proteins can be measured in separate runs at one sample per MS-run. Secondly, proteins of samples to compare can be labelled with small chemical tags, mixed, and measured side-by-side in a single MS-run. There are two types of chemical tags:
- Isobaric tags display the same mass on first hand, but fragment during the generation of the MS/MS spectra to yield reporter ions of different mass. The intensity of those reporter ions can be compared in MS/MS spectra. There are two types of isobaric tags commercially available: tandem mass tags (TMT) and isobaric tags for relative and absolute quantitation (iTRAQ).
- Isotopic tags are chemically identical, but differ in their mass due to incorporated stable isotopes. Examples of different isotopic tags for stable isotope labelling (SIL) are ICAT, SILAC, dimethylation, or heavy oxygen (18O).
This tutorial deals with protein quantitation via stable isotope labelling (SIL). For isotopic tags, quantitation can be achieved by comparing the intensity of MS1 peptide mass traces. The whole MS1 profile of a peptide, i.e. the intensities of all its isotopic peaks over time, is called a peptide feature (Figure 1a). Incorporation of stable isotopes results in different peptide masses on MS1 level, which give rise to coeluting ion traces in the TIC with a mass difference typical for each different chemical tag (Figure 1b). Figure originally published in Nilse et al, 2015.
In this tutorial, we will use tools of the OpenMS suite to identify and quantify peptides and proteins.
Prerequisites
If you are in the planning phase of your quantitative proteomics experiment, you may want to consider our tutorial on different quantitation methods first.
To learn about protein identification in Galaxy, please consider our OpenMS-based peptide ID tutorial.
Hands-on: IntroductionIn the hands-on section of this tutorial, we will use a quantitative comparison of HEK cell lysate as a test dataset. In this experiment, HEK cells were once labelled with light, once with heavy SILAC. Both cultures were lysed simultaneously and the cell lysates were mixed in a certain ratio. A detailed description of the full dataset is available in the PRIDE archive.
Your objective in this hands-on-tutorial is to find out the correct mixing ratio of the test sample.
To speed up analysis, the input dataset was filtered to include only those data acquired in second 2000-3000 of the original LC gradient.
AgendaIn this tutorial, we will deal with:
Peptide and Protein Identification
In this tutorial, peptide identification will be performed using the workflow of the previous Peptide ID Tutorial. Alternatively one can perform the protein identification step by step in the Peptide ID Tutorial using the SILAC dataset from zenodo but beware to specify the labels in the param_variable_modifications
of XTandemAdapter tool.
A common problem in mass spectrometry are misassigned mono-isotopic precursor peaks. Most search engines allow for some adaptation of the monoisotopic peak and we will use this by leaving By default, misassignment to the first and second isotopic 13C peak are also considered
at No
.
Hands-on: Data upload
Create a new history for this SILAC Proteome exercise
To create a new history simply click the new-history icon at the top of the history panel:
Import the mzml file, containing the measured mass spectra from Zenodo or a data library:
https://zenodo.org/record/1051552/files/HEK_SILAC-K6R6_ST905_part.mzml
- Copy the link location
Click galaxy-upload Upload Data at the top of the tool panel
- Select galaxy-wf-edit Paste/Fetch Data
Paste the link(s) into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies tutorial data will be provided in a folder named GTN - Material –> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History galaxy-dropdown near the top and select as Datasets from the dropdown menu
In the pop-up window, choose
- “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
CommentThe data have been preprocessed during the conversion from the machine raw file. We used background removal on MS1 and MS2 level, and MS2 deisotoping.
Import the human protein database (including cRAP contaminants and decoys) from zenodo or a data library:
https://zenodo.org/record/892005/files/Human_database_including_decoys_%28cRAP_added%29.fasta
Import the workflow
https://training.galaxyproject.org//training-material/topics/proteomics/tutorials/protein-id-oms/workflows/workflow.ga
Modify the workflow
- Connect the
mzML
input directly to the XTandemAdapter tool node and delete PeakPickerHiRes.Change the XTandemAdapter tool parameters: Add the param_variable_modifications
Label:13C(6) (K)
andLabel:13C(6) (R)
- Run the workflow with
- the mzML dataset
1: mzML dataset
- the human FASTA database
2: protein FASTA database
If you want to learn more about Galaxy workflows, please consult the Galaxy Introduction
Question
- How many peptides and proteins were successfully identified?
- 2217 non-redundant peptides and 763 proteins were identified. Numbers may slightly vary depending on the versions of the tools and the used FASTA file.
MS1 Feature Detection
MS1 feature detection is a critical step in quantitative workflows. In principle, there are two different ways to define features:
- Feature detection solely based upon MS1 data (mzML/raw files) without prior knowledge of peptide identifications (IDs).
- Advantage: Feature results can be used to assist in peptide identification.
- Drawback: Not all peptide identifications can be mapped to features, thus not every identified peptide can be quantified.
- Feature detection based upon peptide IDs.
- Advantage: Most peptide IDs trigger a feature detection.
- Drawback: Feature results cannot be used to improve peptide ID.
The OpenMS suite provides several tools (FeatureFinders) for MS1 feature detection. For SIL we have to use FeatureFinderMultiplex, which does not take peptide IDs as an input.
Hands-on: MS1 Feature Detection
- Run FeatureFinderMultiplex ( Galaxy version 2.6+galaxy0) with
- param-file “LC-MS dataset in centroid or profile mode”: mzML file
- In “algorithmic parameters”
- “Labels used for labelling the samples”:
[ ][Arg6,Lys6]
- “m/z tolerance for search of peak patterns”:
10
- “Maximum number of missed cleavages due to incomplete digestion”:
1
- “Optional outputs”:
out_multiplets (Optional output file containing all detected peptide groups
Comment: Multiple labels per peptideWhen using SILAC-KR or dimethyl-labelling and trypsin digestion, exactly one labelled amino acid per peptide is expected. The only labelled amino acids are lysine (K) and arginine (R) and trypsin cuts after each of them. However, a small percentage of missed cleavage normally occur also in those datasets. Setting “Maximum number of missed cleavages due to incomplete digestion” to
1
will be sufficient to deal with most missed cleavages.When using other enzymes (e.g. Lys-C) or other labels (e.g. \(^{18}O\)), several labelled amino acids per peptide are expected. You can search for those features by increasing the parameter “Maximum number of missed cleavages due to incomplete digestion”.
Mapping Identifications to Features
We now have feature quantifications for MS1 elution peaks, peptide identifications for the MS2 spectra and protein identifications.
The next step is to map the MS2-based peptide identifications to the quantified MS1 precursor peaks (“peptide features”). This will enable the quantification of identified peptides.
For labelled data, it is necessary to map peptide identifications to consensus features (i.e. a pair of one light peptide feature with one matching heavy feature in the correct m/z distance).
For consensusXML
, IDMapper uses the consensus centroids, not the feature boundaries for mapping. Therefore, the RT tolerance has to be set higher than for mapping to featureXML
. A good starting value is half the expected typical retention time.
Sometimes several peptide identifications are mapped to a feature. The tool IDConflictResolver filters the mapping so that only the identification with the best score is associated to each feature. Another refinement of the quantitative result is obtained by removing falsely mapped identifications e.g. light identification mapped to heavy feature. This step is performed by the MultiplexResolver tool that returns a first file with the correctly mapped peptides and as a second output the falsly mapped peptides.
Finally, the correctly mapped peptides will be combined into protein quantifications with the ProteinQuantifier tool.
Hands-on: Quant to ID matching
- Run IDMapper ( Galaxy version 2.6+galaxy0) with
- param-file “Protein/peptide identifications file”: output of FalseDiscoveryRate
- param-file “Feature map/consensus map file”: output of FeatureFinderMultiplex
- “RT tolerance (in seconds) for the matching of peptide identifications and (consensus) features”:
20
- “m/z tolerance (in ppm or Da) for matching of peptide identifications and (consensus) features”:
10
- In “Additional options for consensusXML input”
- “Match using RT and m/z of sub-features instead of consensus RT and m/z”:
Yes
- “Store the map index of the sub-feature in the peptide ID”:
Yes
- Run IDConflictResolver ( Galaxy version 2.6+galaxy0) with
- param-file “Input file”: output of IDMapper
- Run MultiplexResolver ( Galaxy version 2.5+galaxy0) with
- param-file “Peptide multiplets with assigned sequence information”: output of IDConflictResolver
- “Labels used for labelling the samples”:
[ ][Arg6,Lys6]
- “Maximum number of missed cleavages due to incomplete digestion”:
1
- Run ProteinQuantifier ( Galaxy version 2.6+galaxy0) with
- param-file “Input file”: first output of MultiplexResolver
- param-file “Protein inference results”: output of FalseDiscoveryRate
- “Calculate protein abundance from this number of proteotypic peptides (most abundant first; ‘0’ for all)”:
0
- “Averaging method used to compute protein abundances from peptide abundances”:
sum
- “Add the log2 ratios of the abundance values to the output”:
Yes
- “Optional outputs”:
out (Output file for protein abundances)
andpeptide_out (Output file for peptide abundances)
Comment: ProteinQuantifier parametersPeptide quantitation algorithms are more precise for high abundant peptides. Therefore, it is recommended to base protein quantitations on those peptides. In ProteinQuantifier, you may restrict the calculation of protein abundances to the most abundant peptides by using the option “Calculate protein abundance from this number of proteotypic peptides”. However, we recommend to use the averaging method
sum
instead. By using this option, protein ratios are based on the sum of all peptide abundances. Thus, highly abundant peptides thus have more influence on protein abundance calculation than low abundant peptides. A simple sum-of-intensities algorithm provided the best estimates of true protein ratios in a comparison of several protein quantitation algorithms (Carrillo et al., Bioinformatics, 2009).
Descriptive Statistics and Plotting of Protein Quantitations
ProteinQuantifier produces two output tables: the first one gives information about the quantified proteins, the second one gives information about the quantified peptides. For proteins, we added a log-transformed ratio to the output, which is saved in column 8 of the protein table. The ratio is calculated as log2 (abundance2/abundance1), which is sometimes called the fold change (FC) ratio.
To get a quick overview of the results, you can calculate basic descriptive statistics and plot the data as a histogram.
Comment lines in the beginning of a tabular
file may sometimes cause errors, therefore we will remove them with the tool Select last lines from a dataset (tail).
Hands-on: Descriptive Statistics
- Run Summary Statistics with
- param-file “Summary statistics on”: protein table output (first file) of ProteinQuantifier
- “Column or expression”:
c8
- Run Select last lines from a dataset (tail) ( Galaxy version 1.1.0) with
- param-file “Text file”: protein table output (first file) of ProteinQuantifier
- “Operation”:
Keep everything from this line on
- “Number of lines”:
4
- Run Histogram ( Galaxy version 1.0.4) with
- param-file “Dataset”: output of Select last
- “Numerical column for x axis”:
Column: 8
- “Number of breaks (bars)”:
20
- “Plot title” and “Label for x axis” to something meaningful.
Comment: Calculating descriptive statistics for peptidesThe peptide table output of ProteinQuantifier does not give the log-transformed ratio for each peptide. Nonetheless, you may calculate basic statistics of the FC values by running Summary Statistics with “Column or expression” set to
log(c6/c5,2)
.
Question
- How many peptides and proteins were successfully quantified?
- What might have been the mixing ratio of the input dataset?
- In the histogram, there is a second local maximum at about FC 0. What might that mean?
- With the above parameters, you should have quantified 870 peptides and 421 proteins (slight variations are expected depending on tool version or fasta file)
- In the histogram, you see that the peak of the density curve is between -1.1 and -1.2. In the summary statistics, you can see that the mean protein ratio was -0.98. An FC of -1 indicates that the unlabelled proteins were twice as abundant as their heavy-labelled counterparts. Indeed, the mixing ratio of the dataset was 2 parts light labelled HEK cell lysate and 1 part heavy labelled HEK cell lysate.
- Some proteins were quantified with an FC close to 0. These may stem from incomplete SILAC labelling. Even after two weeks of cell culture in SILAC medium, some proteins with a very low turnover rate may remain unlabelled.
Evaluation and Optimization of Quantitation Results
Protein quantitation is a multi-step procedure. Many parameters of different steps influence the final results. Therefore, it is recommended to optimize the tool parameters for each dataset and to carefully evaluate quantitation results. While the total number of quantified proteins is a first important parameter for optimization, it is also necessary to visualize the results and check for correct feature finding and ID mapping.
Galaxy does not provide a tool for proteomics visualization, we recommend to use the OpenMS Viewer TOPPView. Basic TOPPView tutorials are available as videos and a more comprehensive tutorial as HTML or PDF.
For the optimization of tool parameters, it is recommended not to work with a complete LC-MS/MS run. Instead, we will use FileFilter to extract a small RT-slice of our input dataset, i.e. a fraction of the original dataset that was measured during a short period of time. Reducing the test data reduces the time needed for analysis and facilitates visual examination of the data.
Using Galaxy Workflows enables us to quickly re-run a full analysis with changed parameters. To learn about Galaxy Workflows, consult this tutorial.
Cave: Be aware that only very small parts of your dataset can be checked by visual examination. To minimize biases, try to look at the same areas / features of each result file.
Hands-on: Data reduction and visual evaluation with TOPPView
- Run FileFilter tool with
- param-file “Input file”:
mzML
file- “Retention time range to extract”:
2000:2200
Extract a workflow out of your history or import the premade workflow
Run the whole workflow again with default settings on the reduced
mzML
file.- Run FileFilter tool with
- param-file “Input file”: IDConflictResolver output
- “Remove features without annotations”:
Yes
Rename the FileFilter output to
Annotated features
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
- Run FileFilter tool with
- param-file “Input file”: IDConflictResolver output
- “Remove features without annotations”:
Yes
Rename the FileFilter output to
UNannotated features
- Download locally the following files:
- Spectra:
mzML
file- peptide IDs: IDScoreSwitcher
idXML
output file- features: FeatureFinderMultiplex
featureXML
output file- consensus features: FileFilter
consensusXML
output files (“Annotated” and “UNannotated” features)- Open TOPPView
- Open the
mzML
file in TOPPView with
- Open as set to
new window
- Map view set to
2D
- Low intensity cutoff set to
off
- Open all other downloaded files in TOPPView with
- Open as set to
new layer
- Activate the
mzML
layer and click onShow projections
- Activate the
consensusXML
layers and click onShow consensus feature element positions
- Evaluate your data analysis, by
- zooming into a specific region (hold
Ctrl
and use the mouse to zoom)- measuring m/z and RT distances (select the
mzML
layer, holdShift
and use the mouse to measure)- displaying an area in 3D view (
right-click
into the 2D View and selectSwitch to 3D view
)- switching on and off the display of single layers (
left-click
at the tick-boxes in the window “Layers”)
Examples
-
Displaying annotated vs. UNannotated features: visualize annotated (= mapped) and unannotated (= unmapped) features by switching between activating only the “Annotated_features.consensusxml” or only the “UNannotated_features.consensusxml” layer
-
Correct mapping: a feature was detected, a peptide was identified and the two were mapped.
-
No feature detected for a contaminant. Contaminants are often not labelled, but occur only in their unlabelled isoform. Therefore, they do not give rise to a consensus feature in FeatureFinderMultiplex.
Hands-on: Check a possible contaminant
- Run TextExporter tool with
- param-file “Input file”: IDFilter output file
- Run Search in textfiles (grep) tool with
- param-file“Select lines from”: output of TextExporter
- “Regular Expression”: the peptide sequence (e.g.
MFLSFPTTK
)- Check if the peptide was mapped to a protein marked with “CONTAMINANT”.
Typical Problems
Three problems typically hamper correct peptide mapping:
-
A feature is detected, but no peptide identification is nearby.
- Possible cause: This may be caused by imperfect peptide identification. However, it is never expected that every single MS2-spectrum leads to an identification. The protein might be missing in the database, or the peptide may carry a modification that was not included in the search.
- Possible solution: Improve your search engine settings.
-
A peptide was identified, but no feature is nearby.
- Possible cause:
- The elution peaks of the peptide may be distorted. This is typical for low intensity peptides. If a lot of peptides have distorted elution peaks this may be a sign of spray instability.
- The peptide is a contaminant.
- Possible solution:
- Lower the FeatureFinderMultiplex parameters Two peptides in a multiplet are expected to have the same isotopic pattern and/or The isotopic pattern of a peptide should resemble the averagine model at this m/z position or broaden the Range of isotopes per peptide in the sample (in Advanced Options).
- No optimization of parameters is needed (see example above)
- Possible cause:
-
A peptide was identified and a feature was detected nearby, but the two are not mapped to each other.
- Possible cause:
- The MS2 event and the feature are too far apart to be mapped.
- The precursor of the MS2 was not correctly assigned to the mono-isotopic peak.
- The detected feature is too small in RT dimension and covers only a part of the peptide peaks.
- Possible solution:
- Increase the IDMapper parameter RT tolerance (in seconds) for the matching of peptide identifications and (consensus) features.
- Feature size in RT dimension cannot be directly corrected, use solution 1 instead.
- Possible cause:
Two problems typically disturb correct peptide quantitation:
- A peptide is mapped to the wrong feature.
- Possible cause: Co-eluting peptides of a similar mass may be falsely mapped to a nearby feature, if the correct peptide did not lead to an identification or was identified only with a low score. In high-resolution data, this problem of is very limited. Co-eluting peptides can normally be distinguished by slightly different m/z values. In low-resolution data, wrong assignment may occur more often.
- Possible solution: If a high value is used for the precursor mass tolerance, try to keep the RT tolerance low to avoid false mapping.
- Background noise (1) or co-eluting peptides (2) are incorporated in a feature.
- Possible solution:
- Use noise-filtering either during pre-processing or by increasing the FeatureFinderMultiplex parameter Lower bound for the intensity of isotopic peaks
- Reduce the FeatureFinderMultiplex parameter m/z tolerance for search of peak patterns.
- Possible solution:
Question
- How many peptides could not be mapped to MS1 features? (Click on the IDMapper output and look at the tool’s infobox.)
- How many features could not be mapped to a peptide identification? (Click on the ProteinQuantifier output and look at the tool’s infobox.)
- Which problems are most prominent in the test dataset?
- 1,395 peptide IDs could not be mapped to a feature.
- 1,898 features, corresponding to 949 consensus features could not be mapped to a peptide identification.
- The mapping of peptide IDs to features seems to have worked mostly fine. The main problems seem to be (1) missing peptide identifications, (2) missing features where a peptide was identified and (3) features that span a shorter RT range than the corresponding peptide’s elution peak.
Optimization of Quantitation Results
For optimization, it is critical to modify only one parameter at a time. Also, it is recommended to optimize the tools in the order of their position in the workflow.
In the test dataset, several peptides were identified, but not quantified. Some of the peptides were even identified in the unlabelled, as well as in the labelled form. To optimize the feature detection, we will relax the parameters of FeatureFinderMultiplex.
Hands-on: Optimize Feature Detection
- Run the whole workflow again:
- Change the FeatureFinderMultiplex parameter “Range of isotopes per peptide in the sample” from
3:6
to2:6
.- Integrate the HighResPrecursorMassCorrector tool into the workflow.
- Use the mzML file and the featureXML from the FeatureFinderMultiplex tool as input
- Set Additional retention time tolerance added to feature boundaries between
0.0
and10.0
- Connect the output with the search engine.
- Compare the number of identified proteins, unmatched features and unmapped peptides for each parameter setting.
- Visualize the results with TOPPView to check for correct feature detection and feature-to-peptide mapping.
When running a workflow, you may send the results to a new history. This helps keeping track of different parameter settings.
QuestionWhich parameter improved the number of quantified proteins?
Both changes led to more quantified proteins. Increasing the isotope range led to 26 \% more protein quantitations, increasing the RT tolerance led to 7 \% more protein quantitations.