Computational analysis of shotgun proteomics data
Michael J MacCoss

Proteomics technology is progressing at an incredible rate. The latest generation of tandem mass spectrometers can now acquire tens of thousands of fragmentation spectra in a matter of hours. Furthermore, quantitative proteomics methods have been developed that incorporate a stable isotope-labeled internal standard for every peptide within a complex protein mixture for the measurement of relative protein abundances. These developments have opened the doors for ‘shotgun’ proteomics, yet have also placed a burden on the computational approaches that manage the data. With each new method that is developed, the quantity of data that can be derived from a single experiment increases. To deal with this increase, new computational approaches are being developed to manage the data and assess false positives. This review discusses current approaches for analyzing proteomics data by mass spectrometry and identifies present computational limitations and bottlenecks.

Addresses
Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, K307, Box 357730, Seattle, WA 98195-7730, USA
Corresponding author: MacCoss, MJ ([email protected])

Current Opinion in Chemical Biology 2005, 9:88–94
This review comes from a themed issue on Proteomics and genomics
Edited by Benjamin F Cravatt and Thomas Kodadek
Available online 8th January 2005
1367-5931/$ – see front matter © 2005 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.cbpa.2004.12.010
Introduction
The proteomics field, as with other technology-rich scientific disciplines, often progresses with a leap in technology followed by numerous smaller advances to fully exploit the initial improvement. Although many of the advances in proteomics technology have obviously been driven by mass spectrometry (MS) hardware and methodology developments, the capabilities of computational tools have also propelled proteomics forward. One of the greatest paradigm shifts for proteomics has been the development of computer programs for searching uninterpreted tandem mass spectra of peptides against either a protein or nucleotide sequence database. The use of database searching algorithms such as SEQUEST [1] and MASCOT [2] (Box 1) combined with the significant increase in available genomic sequence information has made the
interpretation of tandem mass spectra of peptides routine. The automation provided by these programs has extended the use of tandem MS (MS/MS) beyond the chemical sciences and into the biological sciences. Without the development of database searching software, the influx of methods and technology away from ‘one protein, one analysis’ and towards the shotgun analysis of thousands of proteins in a single experiment would not have been possible. Over the past several years there have been enormous technological strides in the high-throughput analysis of peptides from unfractionated mixtures using MS. The development of novel sample preparation strategies, stable isotope labeling for quantitation, and microcapillary multidimensional separations of peptides combined with the recent influx of faster and more sensitive commercially available mass spectrometers has further increased the burden on the development of computational tools to facilitate the analysis of these data. For example, a 24 h MudPIT experiment [3] analyzed on a mass spectrometer capable of a 3–5 Hz scan rate can acquire 250 000–430 000 data-dependent MS/MS spectra. To manage these developments, equally significant improvements in computational analysis of MS data have been reported. Unfortunately, it is not possible to discuss all the emerging software to manage proteomics data within the page limits of this review and, therefore, the strengths and present limitations of a selection of computational tools are presented.
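The spectrum counts quoted above follow from simple arithmetic, sketched below. The sketch assumes, as a simplification not stated in the text, that every scan in the run is a data-dependent MS/MS scan; the function name `spectra_acquired` is illustrative only.

```python
def spectra_acquired(hours, scans_per_second):
    """Upper bound on spectra acquired, assuming every scan is an MS/MS scan."""
    seconds = hours * 3600
    return int(seconds * scans_per_second)


# A 24 h MudPIT run at the two scan rates mentioned in the text.
low = spectra_acquired(24, 3)   # 259200
high = spectra_acquired(24, 5)  # 432000
print(low, high)
```

These values agree with the quoted range of 250 000–430 000 spectra once rounded.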
Box 1. Web sites for further information on the programs highlighted in this article.

Database searching algorithms
  SEQUEST    http://www.sequest.org
  MASCOT     http://www.matrixscience.com
  ProbID     http://projects.systemsbiology.net/probid/
  PEP_PROBE  NA
Filtering and assembly software
  DTASelect  http://fields.scripps.edu/DTASelect/
  INTERACT   http://sashimi.sourceforge.net/
Quantitative analysis software
  ASAPRatio  http://sashimi.sourceforge.net
  RelEx      http://fields.scripps.edu/relex

Database searching algorithms
MS-based proteomics is made possible largely because of the development of computer algorithms capable of matching an uninterpreted fragmentation spectrum of a peptide against an amino acid sequence in a protein database. In general, these database-searching algorithms can be divided into two main classes: first, software that correlates the experimentally acquired spectra against theoretically predicted spectra from a sequence database; and second, software that derives the probability that the fragment ions in the spectrum could be derived by random chance from sequences within the database.

The first of these algorithms, SEQUEST, was developed over a decade ago, and it is a testament to the quality of the approach that it remains so successful today [1]. SEQUEST can be placed into the first class of database searching algorithms described above. It uses a cross-correlation function (XCorr) to assess the quality of the match between an experimentally derived and a predicted spectrum, and the results are ranked on the basis of these scores. The XCorr, although independent of database size, is dependent on the number of spectral features in the experimental and theoretical spectra being compared, so the same XCorr value for two different peptide sequences might not necessarily reflect a similar closeness of fit. Nevertheless, there have been several approaches to normalize XCorr across different peptide sequences and charge states [4–7]. These normalizations are easy to implement computationally and alleviate the differences between peptide sequences without losing the sensitivity of the algorithm. The optimal normalization results in the comparison of all SEQUEST results on the same scale, where 1.0 is a perfect match and 0.0 signifies no match, and the normalized XCorr is consistent for all peptides independent of peptide length and peptide charge state [5].

The second class of database-searching algorithms uses probability-based matching to derive the correct peptide sequence from within a database using an MS/MS spectrum. The most widely used of these algorithms is the MASCOT search engine [2,8]. MASCOT incorporates the MOWSE probability model [9], which was initially derived for peptide mass fingerprinting data and scores the frequency of peptide (M+H)+ distributions from the protein database. It is likely that MASCOT implements a similar approach for MS/MS fragmentation spectra, but the details have never been reported. The programs ProbID [10] and PEP_PROBE [11] are also probability-based algorithms, but use a Bayesian model and hypergeometric model, respectively, instead of a frequency model to calculate the probability that the candidate peptide is a true match as opposed to a random match.
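The idea of a cross-correlation score placed on a 0-to-1 scale, as described for the normalized XCorr above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the SEQUEST implementation: spectra are assumed to be already binned onto a common m/z grid, SEQUEST's spectral preprocessing is omitted, and dividing by the geometric mean of the two autocorrelations is one plausible normalization rather than necessarily the one used in [5]. The function names and the `max_lag` default are illustrative.

```python
import math


def xcorr(x, y, max_lag=75):
    """Zero-lag dot product minus the mean dot product over nonzero lags."""
    if len(x) != len(y):
        raise ValueError("spectra must be binned onto the same m/z grid")
    n = len(x)

    def dot_at(lag):
        # Pair x[i] with y[i + lag], skipping bins that fall off either end.
        lo, hi = max(0, -lag), min(n, n - lag)
        return sum(x[i] * y[i + lag] for i in range(lo, hi))

    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    background = sum(dot_at(lag) for lag in lags) / len(lags)
    return dot_at(0) - background


def normalized_xcorr(x, y, max_lag=75):
    """Scale the cross-correlation by the two autocorrelations (assumed normalization)."""
    denom = math.sqrt(xcorr(x, x, max_lag) * xcorr(y, y, max_lag))
    return xcorr(x, y, max_lag) / denom if denom > 0 else 0.0


# Hypothetical binned intensities, for illustration only.
experimental = [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]
unrelated = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(normalized_xcorr(experimental, experimental))
print(normalized_xcorr(experimental, unrelated))
```

A spectrum compared with itself scores 1.0, whereas spectra with no shared peaks score close to 0.0, regardless of how many peaks each spectrum contains.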
Database-searching algorithms will always return the best-matching peptide within the database for a given fragmentation spectrum even if the peptide sequence of the spectrum is not in the database, contains an unanticipated modification, or if the spectrum is of a nonpeptide molecule. If it were possible to uniquely separate
true and false positives on the basis of the score returned by the database-searching program, then each identified peptide would indicate the presence of the corresponding protein. However, current scoring methods result in a large overlap between correct and incorrect peptide identifications. Thus, to ensure that a large fraction of the true positive identifications are retained, a score threshold is selected where a percentage of the peptide identifications may be incorrect, and these incorrect peptide sequences will be the result of a match against a random sequence in the database. These random matches are not a failure of the database-searching program, but instead a result of the complexity of the problem.

Probability-based matching approaches are powerful because they provide a measure of the likelihood that the identified peptide sequence could have been derived by random chance. Nevertheless, the XCorr algorithm used by SEQUEST is surprisingly robust for low signal-to-noise spectra and remains a primary search algorithm. Thus, it is unsurprising that several research groups have used SEQUEST XCorr values to derive empirical probability scores [5,7,12]. A particularly powerful approach is to use not only XCorr but also many different scoring criteria and spectral features to distinguish between true and false positive results. Keller et al. used a linear discriminant analysis to learn the difference between correct and incorrect peptide identifications [7]. This approach uses a normalized XCorr score combined with the number of tryptic termini from the peptide returned from the SEQUEST search to improve the discrimination between correct and incorrect identifications and then to compute empirical probabilities on the basis of the discriminant score. An alternative approach by Anderson et al.
used a set of known true and false positive peptide identifications to train a machine learning algorithm, called a support vector machine (SVM), to determine a decision boundary using a combination of 13 different ‘features’ [13]. The decision boundary was then used with the same features derived from unknown SEQUEST peptide identifications to classify the output into either true or false positive identifications. The SVM approach is very powerful and when used to evaluate SEQUEST search results provides the best results to date in approaching a complete separation between correct and incorrect peptide identifications.

With the diversity of scoring algorithms available for the interpretation of MS/MS spectra of peptides, where and how do we place thresholds to minimize the false positives without losing true positive peptide identifications? Although an ‘expert’ user may be able to distinguish between the correct and incorrect results by manual validation of the spectrum and the respective peptide sequence returned from the database search, this complicates the reproducibility and exchange of data between laboratories. How can the same data integrity be maintained between laboratories without adding a systematic bias from differences in personnel experience? Given the broad differences between different probability models, at a minimum each user should understand the approach used to derive the confidence of the peptide identification and be aware of its respective strengths and limitations.
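To make the idea of combining several features into one discriminant score concrete, the following is a minimal sketch of a two-feature Fisher linear discriminant, using normalized XCorr and the number of tryptic termini as in the description of the approach of Keller et al. above. It is not their implementation: the training values are invented for illustration, only two features are used, and the step that converts discriminant scores into empirical probabilities is omitted. All function names are hypothetical.

```python
def fisher_discriminant(correct, incorrect):
    """Fit w = Sw^-1 (mean_correct - mean_incorrect) for two-feature rows."""
    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

    def scatter(rows, mu):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        return s

    mu_c, mu_i = mean(correct), mean(incorrect)
    sc, si = scatter(correct, mu_c), scatter(incorrect, mu_i)
    # Pooled within-class scatter matrix and its 2x2 inverse.
    sw = [[sc[a][b] + si[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_c[0] - mu_i[0], mu_c[1] - mu_i[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Place the decision threshold at the projected midpoint of the class means.
    threshold = sum(w[j] * (mu_c[j] + mu_i[j]) / 2 for j in range(2))
    return w, threshold


def is_correct(features, w, threshold):
    """Classify one identification by its discriminant score."""
    return w[0] * features[0] + w[1] * features[1] > threshold


# Hypothetical training data: (normalized XCorr, number of tryptic termini).
correct_ids = [(0.40, 2), (0.50, 2), (0.40, 1), (0.50, 1)]
incorrect_ids = [(0.15, 0), (0.25, 0), (0.15, 1), (0.25, 1)]
w, threshold = fisher_discriminant(correct_ids, incorrect_ids)
print(is_correct((0.30, 2), w, threshold), is_correct((0.30, 0), w, threshold))
```

In this toy example, two identifications with the same borderline XCorr are separated by the second feature, which is the benefit of scoring on more than one criterion.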
Deriving protein identifications from database search results of peptide mass spectra
Because database search algorithms function on the level of peptides, additional software is needed to assemble these short peptide sequences into protein sequences. This process is complicated by the identification of nonunique peptide sequences present in several different proteins and the generation of very large datasets containing individual peptides over a broad range of confidences. Thus, these computer programs have the task of filtering through the noise and deriving the smallest number of protein sequences that can be assembled from the accurately identified peptide sequences. The programs DTASelect [14] and INTERACT [15] are two widely used computer programs for assembling peptide identifications into protein identifications. These programs allow the use of selection filters on both the peptide and protein identification level. An important capability of both DTASelect’s companion software Contrast and INTERACT is the filtering and comparison of peptide identifications between multiple experiments. The filtering capabilities of these programs do not provide probability-based measures directly per se, but they can be used to apply cutoff filters that have a certain confidence derived empirically from a training set [5].
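The task of deriving a small set of proteins that accounts for all accepted peptides can be illustrated with a greedy set-cover heuristic. This is only a sketch of the general idea and is not claimed to be the algorithm used by DTASelect or INTERACT; the greedy choice does not always yield the true minimum, and the protein names and peptide sequences below are hypothetical.

```python
def minimal_protein_set(protein_to_peptides):
    """Greedily pick proteins until every identified peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Take the protein explaining the most still-unexplained peptides;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical peptide-to-protein assignments after filtering.
assignments = {
    "P1": {"AAK", "LLR", "GGK"},
    "P2": {"LLR", "GGK"},   # all peptides are shared with P1
    "P3": {"TTR"},
}
print(minimal_protein_set(assignments))  # ['P1', 'P3']
```

Here P2 is dropped because its peptides are nonunique and already explained by P1.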
The use of probabilities derived empirically using a training dataset is a powerful approach where the total number of spectra is small and the relative size of the database searched is very large. However, an often overlooked consequence of large MS/MS spectra datasets can be the increased rate of false positive protein identifications. Because only 10–20% of the acquired MS/MS spectra result in a true peptide identification and the scores for the incorrect and correct peptide identifications overlap, a threshold is chosen that will almost always include several incorrect peptide identifications. These incorrect peptide identifications will be randomly distributed throughout loci within a sequence database and, thus, are generally insignificant when multiple independent peptide identifications are required to identify a protein sequence because the probability of obtaining multiple random incorrect peptide identifications to the same protein locus is reduced [16,17]. Although the chance of obtaining a single incorrect peptide identification with a high score is increased with a large database [18], the expectation that a given locus will contain multiple random peptide identifications is decreased as the number of loci in the database increases [19]. Ultimately the large number of spectra acquired by multidimensional chromatography–MS/MS combined with the use of a small database will result in false-positive protein identifications from multiple peptide ‘hits’ that would not have resulted from smaller datasets and/or larger protein databases [19]. The advantages of faster scanning mass spectrometers and multidimensional separations are obvious; however, the total number of MS/MS spectra and the protein database size must be considered to accurately assess the number of peptides that are needed to exceed the expectation factor that a given locus could have been derived from random incorrect peptide identifications.

To estimate the likelihood that a given protein could arise from a spurious random match, Moore et al. used a reversed database concatenated to the forward database to estimate the false positive identification rate [12]. Because of its simplicity, this approach has since been implemented by several other research groups [20,21]. Figure 1 illustrates how a reversed database can be used to minimize the false positive rate by selecting a threshold score that keeps the reversed/forward protein ratio < 5%. One potential limitation of the reversed database is that because the sequences are only reversed, many of the sequence characteristics of the forward database are maintained. Thus, care must be taken because several of the ‘false positives’ may actually be true positive peptide identifications, and under some circumstances can result in an exaggerated false positive rate. Nevertheless, the reversed database is very useful in estimating the effect of selected search and filtering criteria and, although not perfect, will only result in an overly conservative protein identification list.

Figure 1. [Plot of the number of proteins identified (0–1400) against the normalized XCorr threshold (0.1–0.6) for proteins from forward sequences and proteins from reversed sequences; partial tryptic, 2 peptides per locus; the 95% confidence threshold is marked.] Use of a reversed protein database concatenated to the normal database to estimate the confidence of the identified proteins. Assuming that the incorrect peptide identifications will be sporadically distributed throughout the protein database, then the identification of proteins from a reversed database can be used to estimate the false positive rate on proteins from the normal database using any selection threshold or criteria.
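The threshold selection described for the reversed database can be sketched as follows: for each candidate score cutoff, count the proteins identified from forward and from reversed sequences, and keep the lowest cutoff at which the reversed/forward ratio stays below 5%. The data structure, function name, and example scores are assumptions made for illustration; they are not taken from the cited work.

```python
def lowest_passing_threshold(hits, max_ratio=0.05):
    """hits: list of (score, is_reversed). Return the lowest cutoff whose
    reversed/forward ratio is below max_ratio, or None if none qualifies."""
    passing = []
    for cutoff in sorted({score for score, _ in hits}):
        forward = sum(1 for s, is_rev in hits if s >= cutoff and not is_rev)
        reversed_ = sum(1 for s, is_rev in hits if s >= cutoff and is_rev)
        if forward > 0 and reversed_ / forward < max_ratio:
            passing.append(cutoff)
    return min(passing) if passing else None


# Hypothetical protein-level scores from a concatenated forward/reversed search.
forward_hits = [(s, False) for s in [0.6] * 10 + [0.5] * 10 + [0.4] * 10 + [0.3] * 10]
reversed_hits = [(s, True) for s in [0.4] * 1 + [0.3] * 5]
hits = forward_hits + reversed_hits
print(lowest_passing_threshold(hits))  # 0.4
```

At a cutoff of 0.4 the ratio is 1/30 (about 3%), whereas lowering the cutoff to 0.3 raises it to 6/40 (15%), so 0.4 is the lowest acceptable cutoff in this example.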
Quantitative analysis of proteomics data
Because most proteins are not generally expressed in an all-or-nothing fashion, methods have been developed to extend proteomics technologies beyond qualitative profiling to include the measurement of quantitative changes. Most quantitative MS methods are based on the measurement of the analyte of interest relative to a corresponding stable isotope-labeled analog of the same molecule. For quantitative proteomics, an internal standard can be produced for peptides measured by ‘shotgun’ proteomics by incorporating ‘heavy’ stable isotope atoms into proteins or peptides using metabolic labeling, chemical derivatization, enzymatic catalysis or isotopic exchange. The peptides containing the natural abundance atoms and enriched ‘heavy’ atoms are then identified by LC–MS/MS and software can then be used to evaluate the relative abundance between the unlabeled and labeled peptide pairs.

Several computer programs exist for the analysis of quantitative proteomics data using stable isotope labeling [15,22,23]. These programs first derive extracted ion chromatograms from the data for the peptide pairs (Figure 2) and compute relative abundances from the background-subtracted areas of the chromatograms. Software to automatically derive the area under a chromatographic peak is often the Achilles heel of any quantitative MS analysis. These programs must determine where the peak starts and ends, while also determining the contribution from background on which the peak is superimposed. Although the objective and reproducible assessment of peak area is reliable for high signal-to-noise (S/N) chromatograms, peaks of even modest S/N are difficult to integrate reproducibly. Unfortunately, the selection of the ‘correct’ start and stop points is highly subjective. Any objective criteria that can be used to evaluate peak locations are often dependent on peak shape and will likely differ substantially between the thousands of peptides identified in a μLC/μLC/MS/MS run. Figure 3 illustrates two potentially different start and stop points for the same chromatogram, each with a different resulting peak area. Neither peak integration is necessarily better than the other, and defining these criteria will depend on the S/N of the chromatogram, the chromatogram peak shape, and even the biases of different users.

Figure 2. [Extracted ion chromatograms (relative abundance versus time) for the unlabeled and 15N-enriched forms of the peptide DIDIEYHQNK, with the corresponding mass spectrum over m/z 1270–1295.] Mass spectrometry can be used for relative protein quantitation by determining the ion-current ratio of extracted ion chromatograms between unlabeled and stable isotope labeled peptides.

Figure 3. [Two panels showing the same chromatogram (relative abundance versus time) integrated with different start and stop points.] Assessment of a chromatogram’s start point (green lines) and stop point (red lines) is the most challenging aspect of any quantitative proteomics algorithm. The complicating factor of any quantitative mass spectrometry approach, as demonstrated with these two graphs, is that a single chromatogram can be integrated using different criteria, with each criterion resulting in a different peak area (shaded yellow).

One elegant approach for evaluating the relative abundance between two isotopomers was first described by Thorne et al. [24] nearly two decades ago. This algorithm uses a least squares regression to evaluate the background-subtracted mass spectrometer ion-current ratio from two extracted ion chromatograms. In a single calculation, the slope of the regression provides a measure of the background-subtracted ratio, the intercept provides a measure of the ratio of the two backgrounds, and the correlation coefficient provides a measure of the ratio quality (Figure 4). Using this approach, both chromatograms are handled simultaneously, and the quality of the ratio is independent of the chromatogram peak shape and is only marginally affected by the algorithm’s chosen start and stop points. Furthermore, methods that use traditional peak integration where the two chromatograms are treated independently will always be limited by the ability of the software to detect and integrate the extracted ion chromatogram of the lowest S/N. By contrast, because the least squares regression algorithm considers both chromatograms simultaneously, the peak detection only needs to be performed on the isotopomer with the greatest intensity. Thus, there are several advantages to this approach and, to date, the only computer program that has implemented the least squares regression for calculating MS ion-current ratios in proteomics is RelEx [22].

Figure 4. Correlation between two different ion chromatograms provides a promising alternative to traditional peak integration for calculating relative abundances by mass spectrometry. The area in yellow is the region where the two chromatograms are correlated using a least squares regression. The upper and lower chromatograms illustrate the effect of varying the window used for calculating the regression. Because the labeled and unlabeled chromatograms are handled simultaneously, this approach is only minimally affected by the correlation start and stop points, and the ratio calculation is surprisingly tolerant of noisy data.
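The regression-based ratio described above can be sketched with an ordinary least squares fit of one chromatogram's intensities against the other's, point by point across the chosen window. This is an illustrative sketch under simple assumptions (both chromatograms sampled at the same time points, constant backgrounds) and is not the RelEx implementation; the function name and the intensity values are hypothetical.

```python
import math


def regression_ratio(unlabeled, labeled):
    """Least squares fit of labeled = slope * unlabeled + intercept.

    Returns (slope, intercept, r): the slope estimates the background-subtracted
    ion-current ratio and r indicates the quality of that ratio.
    """
    n = len(unlabeled)
    mx, my = sum(unlabeled) / n, sum(labeled) / n
    sxx = sum((a - mx) ** 2 for a in unlabeled)
    syy = sum((b - my) ** 2 for b in labeled)
    sxy = sum((a - mx) * (b - my) for a, b in zip(unlabeled, labeled))
    if sxx == 0 or syy == 0:
        raise ValueError("a flat chromatogram carries no ratio information")
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical peak profile riding on different constant backgrounds.
peak = [0, 2, 20, 70, 140, 80, 25, 4, 0]
unlabeled = [p + 10 for p in peak]      # background of 10
labeled = [2 * p + 5 for p in peak]     # twice as abundant, background of 5
slope, intercept, r = regression_ratio(unlabeled, labeled)
print(slope, intercept, r)
```

The slope recovers the twofold ratio even though neither background was subtracted beforehand and no peak boundaries were chosen, because the constant offsets are absorbed by the intercept.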
As with software to assemble protein identifications, quantitative proteomics software needs to be able to derive a final protein ratio. Calculating protein ratios is complicated because several of the identified peptides are not unique and cannot be used to estimate the final protein ratio. Additionally, many ratios calculated by automated methods have obvious outliers that increase the error and reduce the sensitivity to small changes in relative abundance. Both ASAPRatio [23] and RelEx [22] apply a Dixon’s Q-test to remove outliers before calculating the final protein ratio.

Conclusions
Recent advances in methodologies and MS instrumentation have enabled the direct analysis of complex protein mixtures and have paved the way for ‘shotgun’ proteomics. To manage each new advance in methodology, advances in software have evolved to meet the computational demands of each new analysis. Unfortunately, the analysis of MS data is still far from perfect and numerous additional developments need to occur before proteome analysis becomes truly global and routine. Because no one approach has addressed the present needs, there are often complementary algorithms to handle the analysis of ‘shotgun’ proteomics data, each with its own strengths and weaknesses. Until this field matures, it is important that users fully understand the approach that they use and recognize the limitations of the respective algorithms.

Acknowledgements
The author gratefully acknowledges financial support from the American Society for Mass Spectrometry, the University of Washington Royalty Research Fund, and National Institutes of Health Grants P41-RR11823 and P30-AG13280.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as: of special interest; of outstanding interest.

1. Eng JK, McCormack AL, Yates JR III: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5:976-989.
2. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20:3551-3567.
3. Washburn MP, Wolters D, Yates JR III: Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 2001, 19:242-247.
4. Yates JR III, Morgan SF, Gatlin CL, Griffin PR, Eng JK: Method to compare collision-induced dissociation spectra of peptides: potential for library searching and subtractive analysis. Anal Chem 1998, 70:3557-3567.
5. MacCoss MJ, Wu CC, Yates JR III: Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal Chem 2002, 74:5593-5599.
6. Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 2004, 4:59-77.
7. Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002, 74:5383-5392.
8. Creasy DM, Cottrell JS: Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics 2002, 2:1426-1434.
9. Pappin DJ, Hojrup P, Bleasby AJ: Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 1993, 3:327-332.
10. Zhang N, Aebersold R, Schwikowski B: ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics 2002, 2:1406-1412.
11. Sadygov RG, Yates JR III: A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem 2003, 75:3792-3798.
12. Moore RE, Young MK, Lee TD: Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom 2002, 13:378-386.
    The authors are the first to describe the use of a reversed protein database to evaluate the effectiveness of a new scoring routine.
13. Anderson DC, Li W, Payan DG, Noble WS: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res 2003, 2:137-146.
    An approach that uses a support vector machine to classify correct and incorrect protein identifications by SEQUEST using 13 independent features.
14. Tabb DL, McDonald WH, Yates JR III: DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res 2002, 1:21-26.
15. Han DK, Eng J, Zhou H, Aebersold R: Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 2001, 19:946-951.
16. Wu CC, MacCoss MJ, Howell KE, Yates JR: A method for the comprehensive proteomic analysis of membrane proteins. Nat Biotechnol 2003, 21:532-538.
17. Wu CC, MacCoss MJ, Mardones G, Finnigan C, Mogelsvang S, Yates JR III, Howell KE: Organellar proteomics reveals golgi arginine dimethylation. Mol Biol Cell 2004, 15:2907-2919.
18. Eriksson J, Fenyo D: The statistical significance of protein identification results as a function of the number of protein sequences searched. J Proteome Res 2004, 3:979-982.
19. Sadygov RG, Liu H, Yates JR: Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem 2004, 76:1664-1671.
    The first algorithm to consider the total number of spectra from a dataset in addition to the database size in the estimate of protein confidence.
20. Elias JE, Gibbons FD, King OD, Roth FP, Gygi SP: Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat Biotechnol 2004, 22:214-219.
21. Resing KA, Meyer-Arendt K, Mendoza AM, Aveline-Wolf LD, Jonscher KR, Pierce KG, Old WM, Cheung HT, Russell S, Wattawa JL et al.: Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem 2004, 76:3556-3568.
22. MacCoss MJ, Wu CC, Liu H, Sadygov R, Yates JR III: A correlation algorithm for the automated quantitative analysis of shotgun proteomics data. Anal Chem 2003, 75:6912-6921.
    Proteomics software that uses a least squares regression for the calculation of the ion-current ratios from extracted ion chromatograms of unlabeled and stable isotope labeled peptides. This software is compatible with all chemical and metabolic labeling approaches.
23. Li XJ, Zhang H, Ranish JA, Aebersold R: Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal Chem 2003, 75:6648-6657.
    The authors describe quantitative proteomics software that calculates the isotopomer ratios of peptides identified by shotgun proteomics. The software uses the ratios from all charge states and weights the peptide ratios depending on signal intensity to calculate protein ratios.
24. Thorne GC, Gaskell SJ, Payne PA: Approaches to the improvement of quantitative precision in selected ion monitoring: High resolution applications. Biomed Mass Spectrom 1984, 11:415-420.