Biochimica et Biophysica Acta 1646 (2003) 1 – 10 www.bba-direct.com
Review
Large-scale protein identification using mass spectrometry
Dayin Lin a, David L. Tabb b, John R. Yates III a,*
a Department of Cell Biology, The Scripps Research Institute, 10550 North Torrey Pines Rd., La Jolla, CA 92037, USA
b Department of Genome Sciences, University of Washington, Box 357730 Seattle, WA 98195, USA
Received 2 September 2002; received in revised form 25 November 2002; accepted 3 December 2002
Abstract
Recent achievements in genomics have created an infrastructure of biological information. The enormous success of genomics promptly induced a subsequent explosion in proteomics technology, the emerging science for systematic study of proteins in complexes, organelles, and cells. Proteomics is developing powerful technologies to identify proteins, to map proteomes in cells, to quantify the differential expression of proteins under different states, and to study aspects of protein-protein interaction. The dynamic nature of protein expression, protein interactions, and protein modifications requires measurement as a function of time and cellular state. These types of studies require many measurements and thus high-throughput protein identification is essential. This review will discuss aspects of mass spectrometry with emphasis on methods and applications for large-scale protein identification, a fundamental tool for proteomics.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Protein identification; Proteomic; Large-scale; Mass spectrometry
1. Introduction
The last decade in science belongs to genome sequencing. Besides the human genome sequencing project [1,2], many other scientifically, medically, and economically interesting organisms were sequenced. As of April 2002, the Entrez database of the National Center for Biotechnology Information had a collection of completed or partially completed genomes representing more than 800 different organisms (http://www.ncbi.nlm.nih.gov/). More recently, two separate research groups reported high-quality draft sequences of the rice genome [3,4], an agricultural research milestone that will lead to advances in studies of this important food grain, which provides more calories than any other single food in the world [5]. Interestingly, a greater number of genes were predicted in the rice genome than were found in the rough draft of the human genome. A less complicated organism (fewer cell types, shorter life span) such as rice might be expected to contain fewer genes, but rice leads an immobile existence and therefore must respond to a greater number of environmental and biotic challenges
* Corresponding author. Tel.: +1-858-784-8862; fax: +1-858-784-8883. E-mail address:
[email protected] (J.R. Yates).
and presumably needs more genes to do so. A recent interpretation of the fewer-than-expected human genes suggests that this may result from the need for an advanced immune system to fight off disease [6]. Distinguishing between self and non-self may ultimately limit the number of different gene products the immune system can tolerate and still provide adequate protection [6]. The implication is that the human genome has had to develop alternate methods to create functional diversity, and this may be accomplished through covalent modification of proteins and alternate splicing of genes. Thus, understanding the human biological system will require sophisticated protein biochemical methods. Most certainly pleiotropy will play a significant role in creating the functional diversity required to produce the 250+ different cell types found in humans. Fortunately, the rapid completion of genome sequences has provided the essential information to link genes to gene products: proteins, the building blocks for cellular functions. A completed genome sequence provides the infrastructure for high-throughput and large-scale protein analysis, and these techniques will be essential to establish the functional framework of all proteins of an organism. New technologies, based on the use of genomic information, are accelerating the pace of biological discovery. Protein biochemistry has been a beneficiary of these advances which
1570-9639/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S1570-9639(02)00546-0
D. Lin et al. / Biochimica et Biophysica Acta 1646 (2003) 1–10
have come about through synergies of genomic data and advanced mass spectrometers. Current high-sensitivity methods using mass spectrometry for protein identification have vastly lowered detection limits [7], permitting the facile analysis of proteins with low cell copy numbers from fewer cells [8]. An increasing level of automation is allowing the development of high-speed proteomic experiments. In addition, the quality of bioinformatics continues to improve, making predictions of gene products deduced from genome information increasingly accurate. With this database infrastructure in place, algorithms designed to match mass spectrometry data to known protein sequences have become possible [9,10]. Isolated proteins can be matched to protein sequences by peptide mass fingerprinting, while more complex mixtures of proteins can be analyzed through tandem mass spectrometry. These improvements in instrumentation and automation make possible experiments far greater in scale than ever before. Proteomics can be used to help define the functions and interrelationships of proteins in an organism [11,12]. The basic experiments to achieve this aim are commonly practiced in biochemistry and molecular biology. In the past, the rate-limiting step has been the analytical process through which the proteins observed in an experiment such as an immunoprecipitation are identified. To inventory a cell’s complete protein content and dissect its protein interaction network requires protein identification technologies that can operate at extremely high throughput and sensitivity. Further complicating the analysis of proteins are the chemical modifications that may occur either co-translationally or post-translationally to modify or regulate their functions. Covalent modification events include phosphorylation, methylation, glycosylation, prenylation, formylation, and acetylation.
Other modifications of proteins include proteolytic cleavage, oxidation of some amino acids, or crosslinking events. All processing adds greatly to the complexity of protein analysis. Modifications may also change both chemical and physical properties of individual proteins, such as chromatographic behavior, mass, and ionization efficiency. Although the complexity of the proteome is a challenge for its analysis, techniques to handle structural and modification features have evolved to make mass spectrometry an important technology for large-scale protein identification [9].
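The effect of covalent modification on mass can be illustrated with a short sketch. The monoisotopic mass shifts below are standard values for the listed modifications; the function name, the tolerance, and the limit of three copies per modification are assumptions made for illustration, not part of any published tool.

```python
# Monoisotopic mass shifts (Da) added to a peptide or protein by some of the
# covalent modifications named in the text.
MOD_SHIFTS = {
    "phosphorylation": 79.96633,  # + HPO3
    "acetylation": 42.01057,      # + C2H2O
    "methylation": 14.01565,      # + CH2
    "formylation": 27.99491,      # + CO
    "oxidation": 15.99491,        # + O
}


def explain_mass_shift(observed, predicted, tol=0.01, max_count=3):
    """List (modification, count) pairs whose total shift explains the
    difference between observed and predicted mass within tol (Da).

    Hypothetical helper: only one kind of modification is considered at a
    time, with up to max_count copies.
    """
    delta = observed - predicted
    hits = []
    for name, shift in MOD_SHIFTS.items():
        for n in range(1, max_count + 1):
            if abs(delta - n * shift) <= tol:
                hits.append((name, n))
    return hits


if __name__ == "__main__":
    # A predicted mass of 1000 Da carrying two phosphate groups.
    print(explain_mass_shift(1000.0 + 2 * 79.96633, 1000.0))
```

Note that one acetylation (42.011 Da) and three methylations (42.047 Da) differ by only about 0.036 Da, so a loose tolerance returns both candidates; this is one reason mass accuracy matters when modifications are assigned from mass alone.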
2. A paradigm shift: from sequencing to identification
Edman sequencing has been the gold standard for protein sequencing for the last 25 years [13]. For unambiguous sequencing, a protein is purified to homogeneity prior to sequencing. The sample then undergoes cycles of Edman degradation reactions, each of which removes the N-terminal amino acid; the released residue is collected at the end of each cycle and identified by liquid chromatography [14]. Each cycle requires 30–60 min. Edman sequencing is complicated by a blocked N-terminus and incomplete purification [15]. A recent innovation in the use of Edman degradation has improved the analysis of simple protein mixtures by deconvoluting the mixed sequence from each cycle using sequences from databases [16]. Other traditional approaches for protein identification include the use of antibodies to perform Western blots [17]. Antibody use, however, is impaired by non-specific binding and by the fact that antibodies are not available for all proteins [18]. As genome sequence information has accumulated, the paradigm has shifted from sequencing to identification. This shift has been facilitated by advances in ionization and mass analysis techniques for mass spectrometry and by the ability to correlate mass spectrometry data of peptides and proteins to sequences in databases.
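The idea of identification by correlation with a database can be sketched with a toy version of peptide mass fingerprinting. This is a minimal illustration under stated assumptions, not the scoring scheme of any of the published search tools: the function names, the 0.2 Da tolerance, the simple count-of-matches score, and the two made-up sequences are all hypothetical. Cleavage follows the usual trypsin rule (after K or R, not before P) with no missed cleavages.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # H2O added to the residue sum of a free peptide


def tryptic_peptides(sequence):
    """In silico trypsin digest: cleave after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def score_protein(sequence, observed_masses, tol=0.2):
    """Count observed masses that match a predicted tryptic peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(m - t) <= tol for t in predicted) for m in observed_masses
    )


def identify(database, observed_masses, tol=0.2):
    """Return the name of the database entry with the most matching masses."""
    return max(
        database,
        key=lambda name: score_protein(database[name], observed_masses, tol),
    )


if __name__ == "__main__":
    # Two invented sequences standing in for a protein database.
    db = {"protA": "MKWVTFISLLR", "protB": "GASPVKTCLNDR"}
    measured = [peptide_mass(p) for p in tryptic_peptides(db["protA"])]
    print(identify(db, measured))
```

Real search programs weight matches by peptide frequency and mass accuracy rather than simply counting them, which is why the quality of a fingerprint depends on sample purity and the number of peptide masses obtained.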
3. Ionization techniques for biomolecules in mass spectrometry
Mass spectrometry analysis of peptides and proteins relies exclusively on soft ionization techniques that create intact gas-phase ions from biomolecules. The creation of intact molecular ions enables accurate measurement of molecular weight. The electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) techniques were developed more than 10 years ago and revolutionized the analysis of biomolecules [19–22]. Biomolecules are often polar and charged, thus conversion of solution or solid-phase ions to gas-phase ions is not an energetically favorable process. ESI and MALDI methods are currently the principal methods for peptide/protein ionization and they have been linked to high-throughput sample preparation techniques [23].
3.1. Electrospray ionization
ESI operates at atmospheric pressure and produces tiny charged solvent droplets when a high electric potential (2–5 kV) is set between a capillary and the inlet to a mass spectrometer. By using a drying gas or heat in the atmospheric pressure interface, the charged droplets shrink and eventually desolvated ions are desorbed from the droplet [24]. The ESI process can be used to produce negative or positive ions, but typically peptides or proteins are analyzed as positive ions using the capillary as an anode and the MS inlet as the cathode. ESI uses a steady stream of solvent to produce a continuous beam of ions and thus is readily coupled to HPLC. When operating at high solvent flow rates of 100 µl/min to 1 ml/min, a nebulization gas (or sheath gas) is needed to assist the solvent dispersion. At low flow rates, electrospray can be induced by electric potential alone. Two operating regimes have been defined at low flow rates. The first is microelectrospray, where the flow rate is approximately 100–500 nl/min; and the second is nanoelectrospray, where a flow rate of 100 nl/min or less is used [25,26].
These low flow rates create smaller droplets and a much reduced Taylor cone, allowing the bulk of the spray to be directed into the mass spectrometer’s inlet [26]. As a result, much less sample is required than at higher flow rates because of the concentration-sensitive nature of electrospray. Another characteristic of ESI is the production of multiply charged ions, which lowers the m/z values (relative to singly charged species, z = 1) for higher molecular weight compounds and thus allows measurement of m/z values on mass spectrometers with limited m/z ranges. Furthermore, multiple protonation of peptides and proteins promotes more facile amide bond fragmentation when the ions are activated for dissociation. These features of ESI have led to its widespread adoption for proteomics research.
3.2. Matrix-assisted laser desorption/ionization
MALDI uses energy from lasers rather than electrical potential to ionize biomolecules. Small UV-absorbing matrix molecules are co-crystallized with peptides/proteins on a sample plate [22]. Ionization occurs when these matrix molecules absorb the energy provided by a laser (usually 337 nm). Release of the energy causes a rapid thermal expansion of matrix and analyte into the gas phase. Proton transfer from analyte to matrix may result in charge reduction to the singly charged ion observed in the gas phase [27]. The most commonly used matrix molecules are α-cyano-4-hydroxycinnamic acid for peptides and polypeptides less than 5000 Da, and 3,5-dimethoxy-4-hydroxycinnamic acid, or sinapinic acid, for proteins. MALDI produces predominantly singly charged ions and is less sensitive to salts in the buffer than ESI, although salt and matrix adducts of analyte ions can form. Because MALDI depends on pulsed laser radiation, ions are created in bunches or packets; consequently, mass analyzers capable of analyzing ions created in an intermittent fashion are required.
The most common mass analyzers used with MALDI are time-of-flight (TOF) mass spectrometers, although there is growing interest in using this ionization technique with ion trap mass spectrometers (described below). A recent innovation in MALDI sources is the operation of this device at atmospheric pressure [28]. This method of operation simplifies source design and allows the use of MALDI with sources originally designed for ESI. Thus, the two different ionization methods can be used interchangeably with the same instrument.
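The way multiple protonation in ESI lowers m/z values can be worked through numerically. The sketch below assumes ideal positive-mode ions of the form [M + zH]z+; the function names and the 17,000 Da example mass are illustrative assumptions only.

```python
PROTON = 1.007276  # mass of a proton (Da)


def mz(neutral_mass, z):
    """m/z of an ion carrying z protons: (M + z * m_proton) / z."""
    return (neutral_mass + z * PROTON) / z


def charge_states_in_range(neutral_mass, mz_max, z_max=60):
    """Charge states whose m/z falls at or below the analyzer's upper limit."""
    return [z for z in range(1, z_max + 1) if mz(neutral_mass, z) <= mz_max]


def mass_from_adjacent_peaks(mz_low, mz_high):
    """Recover charge and neutral mass from two adjacent charge-state peaks.

    mz_high carries charge z and mz_low carries charge z + 1, so
    z = (mz_low - m_proton) / (mz_high - mz_low).
    """
    z = round((mz_low - PROTON) / (mz_high - mz_low))
    return z, z * (mz_high - PROTON)


if __name__ == "__main__":
    protein = 17000.0  # hypothetical protein mass in Da
    print(charge_states_in_range(protein, mz_max=2000.0)[:3])
    print(mass_from_adjacent_peaks(mz(protein, 11), mz(protein, 10)))
```

With these assumptions, a protein far heavier than the 2000 m/z limit of an analyzer becomes measurable once it carries enough protons, and the spacing of neighbouring peaks in the charge-state envelope is sufficient to recover its neutral mass.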
4. Mass analyzers for proteomic mass spectrometry
As ions exit the ion source, they pass into a mass analyzer. The mass analyzer is responsible for separating ions by their mass-to-charge (m/z) ratios. Mass analyzers use electric and/or magnetic fields to manipulate ions in a mass-dependent manner. Quadrupole (Q) mass analyzers use a radiofrequency (RF) voltage applied to four metal rods with RF voltage of alternate polarity placed on opposite rods. A direct current (DC) voltage is superimposed on the rods. The ratio of
RF to DC voltage selectively stabilizes the trajectory of ions of a particular m/z value as they pass through the analyzer. Ion current is recorded at a detector as ions exit the analyzer [29]. Quadrupole ion trap (IT) analyzers create three-dimensional RF fields to trap ions in the center of a ring electrode [29,30]. The field can be manipulated to selectively eject ions of a particular m/z value to a detector to record the m/z ratios or to selectively retain a particular m/z value for collision-activated dissociation (CAD). CAD fragments ions by exciting the trapped ions to increase their motion, causing hundreds to thousands of ion-molecule collisions with the helium bath gas. The resulting fragment ions are then sequentially ejected to the detector [30]. TOF mass analyzers accelerate a packet of ions with a set of electric potentials and differentiate them by the time they take to traverse a flight tube [31]. An m/z value can be calculated from the time required to move from the ion source to the detector. Magnetic fields can also be used to measure m/z values by trapping ions in a static magnetic field (ion cyclotron resonance mass spectrometry, also called Fourier transform mass spectrometry, FTMS). Ions rotate around the magnetic field with a cyclotron frequency related to their m/z value. By perturbing ion motion in the cell, the ion motion goes into coherence and an electric signal can be measured between two detector plates (detector plates detect an ion current as the packet of ions moves from one side of the cell to the other). As the motion of the ions decays back to their natural cyclotron frequencies, the signal deteriorates. By recording the signal decay, frequencies can be calculated using a Fourier transform to calculate very accurate m/z values [32]. Hybrid mass spectrometers that combine different types of mass analyzers have been constructed to produce unique capabilities.
For example, quadrupole-TOFs are instruments that combine a quadrupole mass analyzer for ion selection in a tandem mass spectrometer mode and a TOF analyzer to record m/z values with high resolution [33]. Ions are dissociated in a collision cell between the two mass analyzers. More recently, TOF mass analyzers have been combined to create tandem mass spectrometers (TOF-TOF). In this instrument, ions of a chosen m/z value are selected by their flight time and all others are deflected from the flight path. The selected ions then pass into a collision cell and undergo high-energy collisions with an inert gas such as helium, causing fragmentation of the ions. This prompt fragmentation of ions is referred to as collision-induced dissociation (CID). All of the above mass spectrometers are effective for proteomic studies; they differ in mass range, mass accuracy, sensitivity, resolution, and cost. Ion trap and TOF mass analyzers are compatible with MALDI because they trap packets of ions or analyze them simultaneously, respectively.
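The relationships between m/z and the measured quantity in TOF and FTMS analyzers can be sketched from first principles. The code below assumes an idealized linear TOF (all ions start at rest, are accelerated through a single potential V, and drift through a field-free tube of length L) and an ideal cyclotron motion in a uniform field B; real instruments add reflectrons, delayed extraction, and calibration terms that are not modelled here. Function names and example values are illustrative assumptions.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge (C)
DALTON = 1.66053906660e-27   # one dalton in kg


def tof_flight_time(mz, accel_voltage, length):
    """Flight time (s) of an ion with the given m/z (Da per charge).

    From z*e*V = (1/2)*m*v^2 and t = L / v:  t = L * sqrt(m / (2*z*e*V)).
    """
    return length * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * accel_voltage))


def tof_mz(flight_time, accel_voltage, length):
    """Invert the relation above: m/z = 2*e*V*t^2 / L^2, returned in Da."""
    return (2.0 * E_CHARGE * accel_voltage * flight_time ** 2
            / (length ** 2 * DALTON))


def cyclotron_frequency(mz, b_field):
    """Unperturbed cyclotron frequency (Hz): f = z*e*B / (2*pi*m)."""
    return E_CHARGE * b_field / (2.0 * math.pi * mz * DALTON)


if __name__ == "__main__":
    # Hypothetical settings: 20 kV acceleration, 1 m flight tube, 7 T magnet.
    t = tof_flight_time(1000.0, 20000.0, 1.0)
    print(f"flight time: {t * 1e6:.1f} microseconds")
    print(f"cyclotron frequency: {cyclotron_frequency(1000.0, 7.0) / 1e3:.1f} kHz")
```

Because flight time scales with the square root of m/z while cyclotron frequency scales inversely with m/z, both analyzers turn a mass measurement into a time or frequency measurement, which can be made very precisely.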
5. Overview of protein identification by mass analysis
Complete protein identification involves correlating data from a protein with its gene sequence and identifying the
site and type of modifications that may be on the protein (Fig. 1). To unambiguously correlate an isolated protein’s sequence to the sequence predicted through genome sequencing, data indicative of amino acid sequence must be derived from the protein. During the last several years, two different approaches for MS-based protein identification from complex mixtures have been maturing: “top-down” and “bottom-up” [34].
5.1. Top-down approach
The “top-down” strategy starts with an intact protein and cleaves the protein in the gas phase rather than in solution. The protein is fragmented inside the mass spectrometer to create a ladder of ions indicative of the sequence. The differences in m/z values between fragment ions define the position and sequence of the amino acids in the protein. Several methods and mass spectrometers are used for “top-down” analysis. ESI-FTMS is able to analyze intact proteins because of its capability for high resolution. Proteins are fragmented in this device using electron capture dissociation (ECD) [35]. Other methods of ion dissociation have been used in FTMS instruments, but ECD has the potential to be a general method for protein fragmentation [36]. FTMS can isolate individual m/z values (a specific charge state of a protein, for example) and thus simple protein mixtures can be introduced into the mass spectrometer and then fragmented separately. Proteins need not be purified to homogeneity, so the analysis depends less on chromatographic separation prior to MS analysis [37]. The high mass accuracy of FTMS instruments can detect protein sequence errors and protein post-translational modifications [38–41]. Even as ECD improves, protein fragmentation is still not as general as peptide fragmentation. Computer algorithms to use the data derived from “top-down” methods to identify protein sequences from databases have been developed [37,42]. However, the observed molecular weight of a protein may differ from the predicted molecular weight because of sequence errors, post-translational modifications, or proteolytic processing. The use of protein fragmentation techniques such as ECD has facilitated protein identification using intact proteins and databases [43]. After a protein has been identified, an accurate molecular weight can be enormously useful in pinpointing the extent of modification of the protein. This technique will have a significant role in the identification of small proteins (<10 kDa) because they are difficult to predict through bioinformatics and there is much to be learned about their biological roles.
5.2. Bottom-up approach
The “bottom-up” approach identifies proteins by tandem mass spectrometry analysis of peptides derived from digestion of mixtures of intact proteins [44,45]. The resulting peptide mixture is chromatographically fractionated and introduced into a tandem mass spectrometer. Tandem mass spectrometers can select individual peptide ions for analysis from mixtures of ions. The fragmentation pattern derived for each peptide is indicative of the amino acid sequence of the peptide. This pattern can be used to search sequence databases without first interpreting the sequence from the spectrum. Amino acid sequences are selected for further analysis from the database on the basis of the molecular weight of the peptide. The predicted fragmentation pattern for each peptide
Fig. 1. Schematic showing the “top-down” and “bottom-up” approaches to protein analysis and identification. In the “top-down” approach, the intact protein is fragmented in the gas phase to create sequence-specific fragment ions to aid in sequence analysis. The “bottom-up” approach uses proteolytic enzymes to create peptides for sequence analysis, usually by using multidimensional liquid chromatography in combination with tandem mass spectrometry.
Table 1
Publicly available algorithms for protein identification from mass spectrometry data
Peptide mass mapping database matching tools
  Mascot      http://www.matrixscience.com
  Mowse       http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse
  MS-Fit      http://www.prospector.ucsf.edu
  Pepsea      http://www.pepsea.protana.com
  ProFound    http://www.proteometrics.com
Tandem mass spectrum database matching tools
  Mascot      http://www.matrixscience.com
  MS-Tag      http://www.prospector.ucsf.edu
  Pepsea      http://www.pepsea.protana.com
  SEQUEST     http://www.fields.scripps.edu/sequest
  Sonar       http://www.proteometrics.com
is then compared to that of the spectrum. When a preponderance of the fragment ions match, it is considered a good fit. Several algorithms have been developed to automate this process for protein identification (see Table 1). Several advantages exist in using tandem mass spectrometry data for protein identification. For instance, there is a higher level of certainty in the identification of proteins since the method relies on several pieces of highly specific information. Also, it is straightforward to identify the type and site of modification [46]. Finally, proteins can be identified in mixtures [44,47,48].
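The matching procedure described in Section 5.2 (select candidate sequences by peptide molecular weight, predict their fragments, and count agreements with the observed spectrum) can be sketched as follows. This is a deliberately simplified illustration and not the scoring function of SEQUEST, Mascot, or any other program in Table 1: only singly charged b- and y-ions are predicted, peak intensities are ignored, and the function names and tolerances are assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def neutral_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Predicted m/z values of singly charged b-ions followed by y-ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal piece
    return b_ions + y_ions


def count_matches(peptide, spectrum, tol=0.5):
    """Number of predicted fragments found in the observed peak list."""
    return sum(
        any(abs(frag - peak) <= tol for peak in spectrum)
        for frag in fragment_ions(peptide)
    )


def best_candidate(candidates, precursor_mass, spectrum, mass_tol=1.0, tol=0.5):
    """Filter candidates by precursor mass, then rank by fragment matches."""
    in_window = [
        p for p in candidates
        if abs(neutral_mass(p) - precursor_mass) <= mass_tol
    ]
    return max(
        in_window, key=lambda p: count_matches(p, spectrum, tol), default=None
    )


if __name__ == "__main__":
    # Invented candidates; the spectrum is simulated from the first one.
    candidates = ["PEPTIDEK", "PEPTIDKE", "GASVK"]
    spectrum = fragment_ions("PEPTIDEK")
    print(best_candidate(candidates, neutral_mass("PEPTIDEK"), spectrum))
```

The two isobaric candidates in the example pass the molecular weight filter together and are separated only by their fragment ions, which illustrates why tandem mass spectra carry more specific information than peptide masses alone.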
6. Large-scale separation methods in proteomics
Cells and tissues contain thousands of proteins spanning a wide range of abundance. Mixtures of proteins or components of cells and tissues can be fractionated by many different methods including centrifugation, ion exchange chromatography, size exclusion chromatography, affinity chromatography, reversed-phase liquid chromatography, and gel electrophoresis (GE). Multiple fractionation steps are often required to simplify mixtures prior to analysis. Ideally, a method that will separate a maximum number of proteins in the fewest possible steps is desired. Two major strategies for proteomic analysis, two-dimensional gel electrophoresis (2DGE) and 2D liquid chromatography, are illustrated in Fig. 1.
6.1. Two-dimensional gel electrophoresis
2DGE is an important method for proteome analysis because a large number of proteins can be separated in a single experiment [49]. Proteins are separated in a first dimension by charge using isoelectric focusing (IEF). The charge-focused proteins are then separated in the second dimension by size using an SDS-polyacrylamide gel. Proteins are then visualized by staining the gel with hydrophobic dyes like Coomassie, silver ions, or fluorophores. Excess reagent is washed from the polyacrylamide gel and the stains adhere principally to the proteins, fixing their position
in the gel and allowing determination of pI and molecular weight. The intensity of the stained protein also provides a measure of how much protein is present and provides a means for differential analysis. A new strategy for this type of analysis involves the use of two fluorophores with different emission wavelengths, which allows two protein mixtures, each labeled with a different fluorophore, to be mixed before separation [50]. After separation, the intensity of fluorescence at each wavelength is determined, providing a relative measure of how much of each protein is present in each of the samples. 2DGE is very effective at separating charge and size isoforms of the same protein, providing a measure of how a protein may be modified. 2DGE provides separation and visualization of the protein mixture but does not explicitly identify proteins (data from past experiments can be used in theory, although care must be exercised). A second analytical step must be employed to identify the proteins, in which proteins are excised from the gel, subjected to proteolytic digestion, and identified or sequenced. 2DGE can produce a wealth of information from each separation and is very effective for differential analyses. The strength of 2DGE is that it can separate up to 10,000 proteins [51]. This separation involves the use of narrower pH gradients (e.g. 5–6 and 6–7, rather than 4–10) during the isoelectric focusing stage, and the use of larger amounts of proteins to increase the dynamic range of the separation. After the second dimension separation, the gels are aligned to observe the whole separation. However, several drawbacks can limit its effectiveness. Excising proteins from a gel manually is time-consuming, and when hundreds or more spots need to be processed from a single gel, automation becomes necessary. Robotic systems are available to minimize the amount of labor required to perform these steps. Many of the drawbacks of 2DGE stem from the physical limitations of the separation.
Most 2-D gels can only focus proteins with a pI between 4 and 10, thus excluding very acidic and basic proteins. Proteins with molecular weights below 15 kDa or above 200 kDa either run off the gel or cannot move inside it unless special gels are used, which generally requires one analysis to resolve high molecular weight proteins and another to resolve small proteins. Buffers necessary for isoelectric focusing limit the type of detergents that can be used, limiting conditions to solubilize membrane proteins [52–54]. Integral membrane proteins do not appear on the gel [55]. Low abundance proteins are often not observed on the gel [56]. Several improvements to this technique, however, have addressed some of these issues [57,58]. Recent efforts have focused on optimizing solubilization conditions for membrane proteins and staining techniques. Mass spectrometry is now routinely used to identify proteins separated by 2DGE. Two different mass spectrometry techniques are commonly used to identify proteins separated by 2DGE. Because of the potential for high-throughput analysis, MALDI-TOF is the most attractive mass spectrometry method to
identify proteins from 2DGE [59]. Proteins are separated to homogeneity, typically by 2DGE, and then digested with trypsin or another enzyme into characteristic peptides. A key component of mass mapping is cleavage of the proteins into predictable peptide fragments, although less specific proteases such as thermolysin have been used to cleave proteins for mass mapping [60]. Mass spectrometry (often MALDI-TOF) measures the m/z values of the peptides, and an algorithm searches a protein database for proteins that would produce peptides of these molecular weights when cleaved with the same protease (see Table 1 for several programs used for this purpose). The quality of the result depends on the purity of the sample, the accuracy of the measurement, and the number of peptide masses obtained. The peptide mass mapping method can be automated when coupled with robotic equipment that can deliver processed samples from spots on 2-D gels. Mass mapping is an attractive choice for proteomic studies involving 2DGE. Shevchenko et al. [61] reported the identification of 150 yeast proteins by 2DGE using MALDI peptide mass mapping. In 1999, a 2-D yeast reference map was established, and a combined 374 proteins were identified from several different labs [62]. In an analysis of the H. pylori proteome, 1863 spots were observed from multiple 2DGE separations of prefractionated cellular material. Despite the observation of so many spots, only 126 of the proteins were identified [63]. Single dimension gel electrophoresis can also be used for large-scale protein identification studies. Two recent analyses of immunoprecipitated yeast protein complexes using SDS-PAGE and mass spectrometry identified large numbers of protein complexes [11,12].
6.2. Multidimensional protein identification technology
An alternate approach for the analysis of protein mixtures uses a different paradigm to identify the proteins contained in a mixture. This strategy uses the ability of tandem mass
spectrometers to select and analyze peptides from protein mixtures which have been proteolytically digested. The peptides then serve as surrogate markers for the protein sequence [44]. Proteins are identified by searching the resulting tandem mass spectra through sequence databases. Because digested protein mixtures create complex mixtures of peptides, high-performance separation techniques are required to resolve the peptides prior to entering the tandem mass spectrometer [44,47,48]. This approach, when combined with multidimensional liquid chromatography and database searching, is called Multidimensional Protein Identification Technology, or MudPIT. This process fully automates the separation and identification of proteins from complex mixtures [64,65]. The first step in any MudPIT experiment is the enzymatic digestion of a protein mixture to produce a more complex peptide mixture. A specially packed biphasic liquid chromatography column (see Fig. 2) using a strong cation exchange (SCX) support as the initial phase and reversed-phase (RP) material as the second phase is used to separate peptides. The column is constructed from fused silica and has an internal diameter of 100 µm and thus eluent is delivered to the mass spectrometer at low flow rates (100–200 nl/min). Peptides are stepped from the SCX phase onto the RP phase in a series of salt steps that increase in concentration. A subsequent RP gradient separates the peptides and delivers the peptides into the mass spectrometer after each salt step. Since each phase separates peptides by orthogonal chemical properties, a high-resolution separation among the peptides is achieved with the aim that the mass spectrometer will analyze a more manageable number of peptides at any point during the separation [66]. MudPIT analysis of proteins from a cell can detect proteins over a wide range of pI, abundance, and subcellular localization [65].
The MudPIT method was used to identify proteins from a yeast whole cell lysate by Washburn et al. [65]. From three fractions (soluble proteins, peripheral membrane proteins, integral membrane proteins) of the cell lysate, a total of 1484
Fig. 2. Preparation of a biphasic column for LC/LC/MS/MS analysis. A piece of fused silica capillary is pulled to have an opening of 5 µm. The column is packed by using a pressure bomb to force packing materials into the column. RP and SCX materials are used sequentially to make a biphasic column.
Table 2
Comparison between 2DGE and MudPIT methods
2DGE
  Pro
    • automation possible
    • pre-cast gels available
    • good resolving power for proteins
    • image data is quantitative
    • widely used by researchers
  Con
    • large quantity of sample needed
    • unsuitable for proteins with extreme pIs or molecular weights or of low abundance
    • membrane proteins are hard to detect
MudPIT
  Pro
    • unbiased analysis of all proteins
    • less sample is needed
    • highly resolving 2-D separation for peptides
    • large number of proteins can be detected
  Con
    • requires custom equipment for column packing
    • lack of direct quantitative information
    • no commercially available system
proteins were identified. More significantly, low abundance proteins, membrane proteins, small (less than 10 kDa) or large (greater than 180 kDa) proteins, and proteins with extreme pI values were all well represented in the data set. MudPIT appears to introduce very little bias in its sampling of proteins from the cell but a key element is the sample preparation procedure (see Table 2). MudPIT type analyses require high throughput database searching techniques. Sadygov et al. [67] describe a multiprocessing database searching algorithm to increase the search speeds for the large number of spectra produced in a MudPIT experiment. A second requirement is the need for computer algorithms to assemble the data from the database search. Tabb et al. [68]
described a program, DTASelect, to assemble, filter, and compare the results of database searches.

Fig. 3. Schemes for large-scale protein identification methods. The gel-based 2DGE approach separates proteins from cell lysate by molecular weight and isoelectric point. Each spot can then be excised and enzymatically digested for MS analysis. The liquid-based MudPIT method digests the proteins in the lysate and separates peptides on a biphasic LC column for MS/MS analysis. Database searches are conducted for both approaches after MS data are collected.

6.3. Alternate multidimensional separations

The use of tandem mass spectrometry to analyze peptides derived from digested protein mixtures has created an alternate approach for the analysis of experiments (see Table 2 for comparison) [56]. Besides the direct on-line protocol of MudPIT, off-line multidimensional separation methods have been used to achieve the same basic aim as MudPIT [65]. Commercially available columns (ion exchange or affinity) were used to fractionate peptides before each fraction was subjected to LC/MS/MS analysis [69]. Gygi et al. [70] used two dimensions of off-line separation of analyte before using LC/MS/MS. In this method, proteins from S. cerevisiae were reduced and labeled at cysteines with a biotin reagent, followed by trypsin digestion [71]. Peptides are then separated by a strong cation-exchange step; each collected fraction is then passed through an avidin column to selectively isolate only biotin-tagged cysteine-containing peptides; and finally, LC/MS/MS is performed on fractions collected in the affinity step. This sequence of separations is designed to increase the detection of low abundance proteins. The affinity step is meant to reduce sample complexity, but the addition of multiple off-line steps requires 5–10 times the amount of material used in a MudPIT analysis. The drawbacks of off-line separations are decreased automation and increased sample loss; this protocol requires milligram quantities of sample to make up for the sample loss between chromatography steps.

Conrads et al. [72] use the high mass accuracy of FTMS instruments to profile peptides from digested protein mixtures. Measurement of m/z values with high accuracy allows facile comparison between analyses and provides, in some instances, a mechanism to identify proteins. To further ensure unambiguous identification of the peptides, the complex peptide mixture is analyzed by multiple LC/MS/MS experiments using “gas-phase” fractionation on an ion trap mass spectrometer [73] (Fig. 3). By reducing the scan range of the mass spectrometer (e.g. m/z 500–800), the number of ions available for data-dependent data acquisition is reduced, allowing the acquisition of more peptide ions than normally occurs during a single-dimension chromatographic separation. By repeating the process over different mass ranges (e.g. 800–1000, 1000–1200, etc.), more thorough coverage of the proteome is possible. In the study by Lipton et al. [73] of D. radiodurans, an impressive 61% of the proteome was covered by analyzing cultures grown under a variety of conditions.

A variation of the “bottom-up” approach that combines features of the “top-down” approach is employed by Chong et al. [74]. In this approach, proteins are separated by multidimensional liquid chromatography and fractions are collected for digestion. Liquid isoelectric focusing has been used together with reversed-phase liquid chromatography to effect high-resolution separations [75]. After digestion, fractions are analyzed by MALDI-TOF to identify the proteins present. Alternatively, the molecular weights of proteins can be measured directly in the mass spectrometer to create three-dimensional maps of proteins based on pI, hydrophobicity and molecular weight to compare cells of different states [76].
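The benefit of “gas-phase” fractionation described above can be illustrated with a toy model. The sketch below treats data-dependent acquisition as simply picking the N most intense precursor ions within the scanned m/z range. The ion list, the value of top_n, and the function names are hypothetical and chosen only for illustration; real instrument acquisition logic is considerably more involved. The windows reuse the m/z ranges quoted in the text.

```python
from typing import List, Tuple

Ion = Tuple[float, float]  # (m/z, intensity); values below are made up


def select_precursors(ions: List[Ion], mz_low: float, mz_high: float,
                      top_n: int) -> List[Ion]:
    """Return the top_n most intense ions with mz_low <= m/z < mz_high."""
    in_window = [ion for ion in ions if mz_low <= ion[0] < mz_high]
    in_window.sort(key=lambda ion: ion[1], reverse=True)
    return in_window[:top_n]


def gas_phase_fractionation(ions: List[Ion],
                            windows: List[Tuple[float, float]],
                            top_n: int) -> List[Ion]:
    """Repeat the selection once per m/z window (one run per window)."""
    selected: List[Ion] = []
    for low, high in windows:
        selected.extend(select_precursors(ions, low, high, top_n))
    return selected


# Hypothetical co-eluting peptide ions: intense ones sit at low m/z.
ions = [(520.3, 9e5), (610.8, 7e5), (745.4, 8e5),
        (830.1, 2e5), (905.6, 1e5), (1010.2, 5e4), (1150.7, 3e4)]

# A single run over the full range only reaches the three most intense ions.
full_range = select_precursors(ions, 500.0, 1200.0, top_n=3)

# Three runs over narrower windows also reach the weaker, higher-m/z ions.
windows = [(500.0, 800.0), (800.0, 1000.0), (1000.0, 1200.0)]
fractionated = gas_phase_fractionation(ions, windows, top_n=3)

print(len(full_range), len(fractionated))  # prints: 3 7
```

In this toy case the narrower windows let low-intensity ions be selected that would otherwise lose out to more intense ions elsewhere in the range, at the cost of three runs instead of one.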
7. Future emphases

As challenges in protein identification are steadily overcome, the focus in proteomics is shifting to other important areas. Three areas of particular emphasis are “de novo” peptide sequencing, post-translational modification identification, and quantitation of peptides by mass spectrometry.

De novo peptide sequencing by mass spectrometry differs substantially from the peptide identification process as embodied in SEQUEST and other algorithms, which rely on the presence of the target peptide sequence in a supplied database. The de novo challenge is to infer a peptide sequence directly from the spectrum without external hints as to what the sequence may be. Several attempts have been made, but success has been limited by data quality (such as mass accuracy and signal-to-noise ratio) and hindered by the complexity of accurately predicting fragment ion spectra [77,78]. It is unlikely that de novo sequencing will ever be as sensitive as database searching methods.

Post-translational modifications play an important role in determining protein function, but identifying peptides containing these modifications poses many challenges. Modified peptides, such as those bearing the important phosphorylation modification, may produce fragment ion spectra of greater complexity than normal peptides, complicating their interpretation. At the same time, augmented forms of existing database matching algorithms typically take far longer to run on spectra when modification possibilities are taken into account. Progress has been made in solving this problem, but much work remains to be done [79–84].

An increasing emphasis in proteomics is the quantitation of protein content rather than simple determination of presence or absence [85]. Several methods exist to quantitate proteins, and all rely on labeling proteins with a mass label to differentiate the same peptides from two states of the cells for relative quantitation. The first method uses metabolic labeling to incorporate 15N into peptides [86]. A second method uses proteolytic digestion in the presence of 18O water to incorporate a label [87]. The last method uses covalent labeling to incorporate the mass label [71,88–90]. Covalent labeling is targeted at reactive amino acid residues, and generally the label contains deuterium atoms in place of hydrogen atoms. Labels have been developed for amino and sulfhydryl groups. These methods allow the measurement of relative quantitative states. Recently, Stemmann et al. [91] have developed a method for the measurement of absolute quantities of peptides. Because the method requires the addition of a known quantity of a peptide standard, it is not suitable for proteome-wide measurements. Biology is moving to a more quantitative description of cellular processes and mass spectrometry is poised to play a large role, but all of this relies on high throughput identification of proteins.

Acknowledgements

The authors acknowledge support from NIH grant R33CA8165-01, NIH grant RR11823-05, and a National Science Foundation Graduate Research Fellowship.
References
[1] I.H.G.S. Consortium, Nature 409 (2001) 860–921.
[2] J.C. Venter, M.D. Adams, G.G. Sutton, A.R. Kerlavage, H.O. Smith, M. Hunkapiller, Science 280 (1998) 1540–1542.
[3] S.A. Goff, D. Ricke, T.H. Lan, G. Presting, R. Wang, M. Dunn, J. Glazebrook, A. Sessions, P. Oeller, H. Varma, D. Hadley, D. Martin, C. Martin, F. Katagiri, B.M. Lange, T. Moughamer, Y. Xia, P. Budworth, J. Zhong, T. Miguel, U. Paszkowski, S. Zhang, M. Colbert, W.L. Sun, L. Chen, B. Cooper, S. Park, T.C. Wood, L. Mao, P. Quail, R. Wing, R. Dean, Y. Yu, A. Zharkikh, R. Shen, S. Sahasrabudhe, A. Thomas, R. Cannings, A. Gutin, D. Pruss, J. Reid, S. Tavtigian, J. Mitchell, G. Eldredge, T. Scholl, R.M. Miller, S. Bhatnagar, N. Adey, T. Rubano, N. Tusneem, R. Robinson, J. Feldhaus, T. Macalma, A. Oliphant, S. Briggs, Science 296 (2002) 92–100.
[4] J. Yu, S. Hu, J. Wang, G.K. Wong, S. Li, B. Liu, Y. Deng, L. Dai, Y. Zhou, X. Zhang, M. Cao, J. Liu, J. Sun, J. Tang, Y. Chen, X. Huang, W. Lin, C. Ye, W. Tong, L. Cong, J. Geng, Y. Han, L. Li, W. Li, G. Hu, J. Li, Z. Liu, Q. Qi, T. Li, X. Wang, H. Lu, T. Wu, M. Zhu, P. Ni, H. Han, W. Dong, X. Ren, X. Feng, P. Cui, X. Li, H. Wang, X. Xu, W. Zhai, Z. Xu, J. Zhang, S. He, J. Xu, K. Zhang, X. Zheng, J. Dong, W. Zeng, L. Tao, J. Ye, J. Tan, X. Chen, J. He, D. Liu, W. Tian, C. Tian, H. Xia, Q. Bao, G. Li, H. Gao, T. Cao, W. Zhao, P. Li, W. Chen, Y. Zhang, J. Hu, S. Liu, J. Yang, G. Zhang, Y. Xiong, Z. Li, L. Mao, C. Zhou, Z. Zhu, R. Chen, B. Hao, W. Zheng, S. Chen, W. Guo, M. Tao, L. Zhu, L. Yuan, H. Yang, Science 296 (2002) 79–92.
[5] D. Kennedy, Science 296 (2000) 13.
[6] A.J.T. George, Trends Immunol. 23 (2002) 351–355.
[7] M.E. Belov, M.V. Gorshkov, H.R. Udseth, G.A. Anderson, R.D. Smith, Anal. Chem. 72 (2000) 2271–2279.
[8] D.A. Wolters, M.P. Washburn, J.R. Yates III, Anal. Chem. 73 (2001) 5683–5690.
[9] J.R. Yates III, J. Mass Spectrom. 33 (1998) 1–19.
[10] J.R. Yates III, Trends Genet. 16 (2000) 5–8.
[11] A.C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J.M. Rick, A.M. Michon, C.M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.A. Heurtier, R.R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, G. Superti-Furga, Nature 415 (2002) 141–147.
[12] Y. Ho, A. Gruhler, A. Heilbut, G.D. Bader, L. Moore, S.L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A.R. Willems, H. Sassi, P.A. Nielsen, K.J. Rasmussen, J.R. Andersen, L.E. Johansen, L.H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B.D. Sorensen, J. Matthiesen, R.C. Hendrickson, F. Gleeson, T. Pawson, M.F. Moran, D. Durocher, M. Mann, C.W. Hogue, D. Figeys, M. Tyers, Nature 415 (2002) 180–183.
[13] J.E. Shively, EXS 88 (2000) 99–117.
[14] J.M. Walker, Methods Mol. Biol. 32 (1994) 329–334.
[15] T.C. Farries, A. Harris, A.D. Auffret, A. Aitken, Eur. J. Biochem. 196 (1991) 679–685.
[16] A.J. Mackey, T.A. Haystead, W.R. Pearson, Mol. Cell Proteomics 1 (2002) 139–147.
[17] P. Shalhoub, S. Kern, S. Girard, L. Beretta, Dis. Markers 17 (2001) 217–223.
[18] D.E. Rebeski, E.M. Winger, Y.K. Shin, M. Lelenta, M.M. Robinson, R. Varecka, J.R. Crowther, J. Immunol. Methods 226 (1999) 85–92.
[19] M. Barber, R.S. Bordoli, R.D. Sedgwick, A.N. Tyler, B.W. Bycroft, Biochem. Biophys. Res. Commun. 101 (1981) 632–638.
[20] H.R. Morris, M. Panico, M. Barber, R.S. Bordoli, R.D. Sedgwick, A. Tyler, Biochem. Biophys. Res. Commun. 101 (1981) 623–631.
[21] J.B. Fenn, M. Mann, C.K. Meng, S.F. Wong, C.M. Whitehouse, Science 246 (1989) 64–71.
[22] F. Hillenkamp, M. Karas, R.C. Beavis, B.T. Chait, Anal. Chem. 63 (1991) 1193A–1203A.
[23] G. Siuzdak, Proc. Natl. Acad. Sci. U. S. A. 91 (1994) 11290–11297.
[24] A.P. Jonsson, Cell. Mol. Life Sci. 58 (2001) 868–884.
[25] M.R. Emmett, R.M. Caprioli, J. Am. Soc. Mass Spectrom. 5 (1994) 605–613.
[26] M.S. Wilm, M. Mann, Int. J. Mass Spectrom. Ion Process. 136 (1994) 2–3.
[27] R. Kruger, A. Pfenninger, I. Fournier, M. Gluckmann, M. Karas, Anal. Chem. 73 (2001) 5812–5821.
[28] V.V. Laiko, M.A. Baldwin, A.L. Burlingame, Anal. Chem. 72 (2000) 652–657.
[29] K. Busch, G.L. Glish, S.A. McLuckey, Mass Spectrometry/Mass Spectrometry: Techniques and Applications of Tandem Mass Spectrometry, 1st ed., VCH, New York, 1988.
[30] K.R. Jonscher, J.R. Yates III, Anal. Biochem. 244 (1997) 1–15.
[31] R.J. Cotter, Biomed. Environ. Mass Spectrom. 18 (1989) 513–532.
[32] A.G. Marshall, C.L. Hendrickson, G.S. Jackson, Mass Spectrom. Rev. 17 (1998) 1–35.
[33] H.R. Morris, T. Paxton, A. Dell, J. Langhorne, M. Berg, R.S. Bordoli, J. Hoyes, R.H. Bateman, Rapid Commun. Mass Spectrom. 10 (1996) 889–896.
[34] J.R. Kettman, J.R. Frey, I. Lefkovits, Biomol. Eng. 18 (2001) 207–212.
[35] F.W. McLafferty, M.W. Senko, Stem Cells 12 (1994) 68–73.
[36] E. Mortz, P.B. O’Connor, P. Roepstorff, N.L. Kelleher, T.D. Wood, F.W. McLafferty, M. Mann, Proc. Natl. Acad. Sci. U. S. A. 93 (1996) 8264–8267.
[37] J.L. Stephenson, S.A. McLuckey, G.E. Reid, J.M. Wells, J.L. Bundy, Curr. Opin. Biotechnol. 13 (2002) 57–64.
[38] N.L. Kelleher, S.V. Taylor, D. Grannis, C. Kinsland, H.J. Chiu, T.P. Begley, F.W. McLafferty, Protein Sci. 7 (1998) 1796–1801.
[39] N.L. Kelleher, R.A. Zubarev, K. Bush, B. Furie, B.C. Furie, F.W. McLafferty, C.T. Walsh, Anal. Chem. 71 (1999) 4250–4253.
[40] S.K. Sze, Y. Ge, H. Oh, F.W. McLafferty, Proc. Natl. Acad. Sci. U. S. A. 99 (2002) 1774–1779.
[41] Y. Ge, B.G. Lawhorn, M. ElNaggar, E. Strauss, J.H. Park, T.P. Begley, F.W. McLafferty, J. Am. Chem. Soc. 124 (2002) 672–678.
[42] D.M. Horn, R.A. Zubarev, F.W. McLafferty, Proc. Natl. Acad. Sci. U. S. A. 97 (2000) 10313–10317.
[43] F. Meng, B.J. Cargile, L.M. Miller, A.J. Forbes, J.R. Johnson, N.L. Kelleher, Nat. Biotechnol. 19 (2001) 952–957.
[44] J.K. Eng, A.L. McCormack, J.R. Yates III, J. Am. Soc. Mass Spectrom. 5 (1994) 976–989.
[45] M. Mann, M. Wilm, Anal. Chem. 66 (1994) 4390–4399.
[46] J.R. Yates, J.K. Eng, A.L. McCormack, D. Schieltz, Anal. Chem. 67 (1995) 1426–1436.
[47] A.L. McCormack, J.K. Eng, J.R. Yates III, Methods 6 (1994) 273–274 (A companion to Methods in Enzymology).
[48] A.L. McCormack, D.M. Schieltz, B. Goode, S. Yang, G. Barnes, D. Drubin, J.R. Yates III, Anal. Chem. 69 (1997) 767–776.
[49] S.E. Ong, A. Pandey, Biomol. Eng. 18 (2001) 195–205.
[50] F. Von Eggeling, A. Gawriljuk, W. Fiedler, G. Ernst, U. Claussen, J. Klose, I. Romer, Int. J. Mol. Med. 8 (2001) 373–377.
[51] J. Klose, U. Kobalz, Electrophoresis 16 (1995) 1034–1059.
[52] L. Vuillard, N. Marret, T. Rabilloud, Electrophoresis 16 (1995) 295–297.
[53] T. Rabilloud, Electrophoresis 17 (1996) 813–829.
[54] T. Rabilloud, C. Adessi, A. Giraudel, J. Lunardi, Electrophoresis 18 (1997) 307–316.
[55] N. Galeva, M. Altermann, Proteomics 2 (2002) 713–722.
[56] S.P. Gygi, G.L. Corthals, Y. Zhang, Y. Rochon, R. Aebersold, Proc. Natl. Acad. Sci. U. S. A. 97 (2000) 9390–9395.
[57] P.G. Righetti, A. Castagna, B. Herbert, Anal. Chem. 73 (2001) 320A–326A.
[58] G. Candiano, L. Musante, M. Bruschi, G.M. Ghiggeri, B. Herbert, F. Antonucci, P.G. Righetti, Electrophoresis 23 (2002) 292–297.
[59] J.S. Cottrell, Pept. Res. 7 (1994) 115–124.
[60] S.J. Bark, N. Muster, J.R. Yates III, G. Siuzdak, J. Am. Chem. Soc. 123 (2001) 1774–1775.
[61] A. Shevchenko, O.N. Jensen, A.V. Podtelejnikov, F. Saglicco, M. Wilm, O. Vorm, P. Mortensen, A. Shevchenko, H. Boucherie, M. Mann, Proc. Natl. Acad. Sci. U. S. A. 93 (1996) 14440–14445.
[62] M. Perrot, F. Sagliocco, T. Mini, C. Monribot, U. Schneider, A. Shevchenko, M. Mann, P. Jeno, H. Boucherie, Electrophoresis 20 (1999) 2280–2298.
[63] P.R. Jungblut, D. Bumann, G. Haas, U. Zimny-Arndt, P. Holland, S. Lamer, F. Siejak, A. Aebischer, T.F. Meyer, Mol. Microbiol. 36 (2000) 710–725.
[64] A.J. Link, J. Eng, D.M. Schieltz, E. Carmack, G.J. Mize, D.R. Morris, B.M. Garvik, J.R. Yates III, Nat. Biotechnol. 17 (1999) 676–682.
[65] M.P. Washburn, D. Wolters, J.R. Yates III, Nat. Biotechnol. 19 (2001) 242–247.
[66] D. Lin, A.J. Alpert, J.R. Yates III, Am. Genomic/Proteomic Technol. 1 (2001) 38–46.
[67] R.G. Sadygov, J. Eng, E. Durr, A. Saraf, H. McDonald, M.J. MacCoss, J.R. Yates III, J. Proteome Res. 1 (2002) 211–215.
[68] D.L. Tabb, W.H. McDonald, J.R. Yates III, J. Proteome Res. 1 (2002) 21–26.
[69] E.C. Koc, W. Burkhart, K. Blackburn, M.B. Moyer, D.M. Schlatzer, A. Moseley, L.L. Spremulli, J. Biol. Chem. 276 (2001) 43958–43969.
[70] S.P. Gygi, B. Rist, T.J. Griffin, J. Eng, R. Aebersold, J. Proteome Res. 1 (2002) 47–54.
[71] S.P. Gygi, B. Rist, S.A. Gerber, F. Turecek, M.H. Gelb, R. Aebersold, Nat. Biotechnol. 17 (1999) 994–999.
[72] T.P. Conrads, G.A. Anderson, T.D. Veenstra, L. Pasa-Tolic, R.D. Smith, Anal. Chem. 72 (2000) 3349–3354.
[73] M.S. Lipton, L. Pasa-Tolic, G.A. Anderson, D.J. Anderson, D.L. Auberry, J.R. Battista, M.J. Daly, J. Fredrickson, K.K. Hixson, H. Kostandarithes, C. Masselon, L.M. Markillie, R.J. Moore, M.F. Romine, Y. Shen, E. Stritmatter, N. Tolic, H.R. Udseth, A. Venkateswaran, K.K. Wong, R. Zhao, R.D. Smith, Proc. Natl. Acad. Sci. U. S. A. 99 (2002) 11049–11054.
[74] B.E. Chong, R.L. Hamler, D.M. Lubman, S.P. Ethier, A.J. Rosenspire, F.R. Miller, Anal. Chem. 73 (2001) 1219–1227.
[75] D.B. Wall, M.T. Kachman, S.S. Gong, S.J. Parus, M.W. Long, D.M. Lubman, Rapid Commun. Mass Spectrom. 15 (2001) 1649–1661.
[76] D.B. Wall, S.J. Parus, D.M. Lubman, J. Chromatogr., B, Anal. Technol. Biomed. Life Sci. 774 (2002) 53–58.
[77] V. Dancik, T.A. Addona, K.R. Clauser, J.E. Vath, P.A. Pevzner, J. Comput. Biol. 6 (1999) 327–342.
[78] J.A. Taylor, R.S. Johnson, Rapid Commun. Mass Spectrom. 11 (1997) 1067–1075.
[79] K.R. Jonscher, J.R. Yates, J. Biol. Chem. 272 (1997) 1735–1741.
[80] M.G. de Carvalho, A.L. McCormack, E. Olson, F. Ghomashchi, M.H. Gelb, J.R. Yates, C.C. Leslie, J. Biol. Chem. 271 (1996) 6987–6997.
[81] Y. Oda, T. Nagasu, B.T. Chait, Nat. Biotechnol. 19 (2001) 379–382.
[82] A. Schlosser, R. Pipkorn, D. Bossemeyer, W.D. Lehmann, Anal. Chem. 73 (2001) 170–176.
[83] H. Zhou, J.D. Watts, R. Aebersold, 2000.
[84] M.J. MacCoss, W.H. McDonald, A. Saraf, R. Sadygov, J.M. Clark, J.J. Tasto, K.L. Gould, D. Wolters, M. Washburn, A. Weiss, J.I. Clark, J.R. Yates III, Proc. Natl. Acad. Sci. U. S. A. 99 (2000) 7900–7905.
[85] M.P. Washburn, R. Ulaszek, C. Deciu, D.M. Schieltz, J.R. Yates III, Anal. Chem. 74 (2002) 1650–1657.
[86] Y. Oda, K. Huang, F.R. Cross, D. Cowburn, B.T. Chait, Proc. Natl. Acad. Sci. U. S. A. 96 (1999) 6591–6596.
[87] X. Yao, A. Freas, J. Ramirez, P.A. Demirev, C. Fenselau, Anal. Chem. 73 (2001) 2836–2842.
[88] M. Munchbach, M. Quadroni, G. Miotto, P. James, Anal. Chem. 72 (2000) 4047–4057.
[89] M. Geng, J. Ji, F.E. Regnier, J. Chromatogr., B 870 (2000) 295–313.
[90] J. Ji, A. Chakraborty, M. Geng, X. Zhang, A. Amini, M. Bina, F. Regnier, J. Chromatogr., B, Biomed. Sci. Appl. 745 (2000) 197–210.
[91] O. Stemmann, H. Zou, S.A. Gerber, S.P. Gygi, M.W. Kirschner, Cell 107 (2001) 715–726.