From the analysis of protein complexes to proteome-wide linkage maps

Pierre Legrain*†, Jean-Luc Jestin# and Vincent Schächter*‡

Recent advances in genomics have led to the accumulation of an unprecedented amount of data about genes. Proteins, not genes, however, sustain function. The traditional approach to protein function analysis has been the design of smart genetic assays and powerful purification protocols to address very specific questions concerning cellular mechanisms. Lately, a number of proteome-wide functional strategies have emerged, giving rise to a new field in biology, proteomics, that addresses the biology of a cell as a whole.

Addresses
*Hybrigenics, 180 Avenue Daumesnil, Paris 75012, France
† e-mail: [email protected]
‡ e-mail: [email protected]
# Unité de Chimie Organique, Institut Pasteur, 25 rue du Dr Roux, Paris 75724 Cedex 15, France; e-mail: [email protected]

Current Opinion in Biotechnology 2000, 11:402–407

0958-1669/00/$ (see front matter) © 2000 Elsevier Science Ltd. All rights reserved.

Abbreviations
2DE two-dimensional gel electrophoresis
MS mass spectrometry

Introduction

Several technologies to study specific cellular functions or processes have been around for many years, such as enzymatic assays, complex purification or subcellular localization. In most cases, the studies have focused on a small set of genes or proteins. The increasing amount of genomic data, however, has led to the availability of more new biological objects than could reasonably be studied using classical genetic or biochemical means. An initial approach was to screen for new essential genes, assuming that these genes would be more interesting than others. It soon became clear, however, that most genes could be deleted without obvious changes in phenotype and that it was necessary to study combinations of proteins or subtle phenotypes to understand cellular functioning. Geneticists had been studying mutant phenotypes and grouping genes to build pathways. New technologies were developed, such as the yeast two-hybrid or phage-display assays that allow the detection of protein–protein interactions in order to build protein interaction maps, or synthetic lethality screens that identify genes whose products are functionally related. Biochemists have purified proteins associated with an activity in order to characterize the components of biochemical complexes. Two-dimensional gel electrophoresis (2DE) was coupled to protein mass spectrometry (MS) analysis to characterize the components of a complex or even to identify exhaustively most components of an expressed proteome, yielding protein expression maps. More recently, bioinformaticists developed algorithms to group proteins according to their sequences, assuming that such clustering of proteins would lead to functional grouping (see below for discussion). It is expected that the combination of several of these techniques applied at a proteome scale will lead to completely different approaches for functional analysis, integrating enzymatic complexes and cascades (e.g. metabolic pathways or maps) in a cell architecture and leading to an integrated view of cell functioning. We present here various approaches to link proteins together in order to build functional networks.

Protein interaction maps

Two-hybrid in yeast

Since its original description in 1989 [1], the yeast two-hybrid assay has been used extensively to identify protein–protein interactions. Initially designed as an assay for the detection of an interaction between two known proteins, it was rapidly developed as a screening assay to find partners for a protein of interest [2]. There are currently many variations around the same principle (for reviews see [3–5]). In most cases, the goal has been to find partners for one or several proteins and to build hypothesis-driven experiments from the two-hybrid data. Recently, several groups have developed strategies that allow construction of functional networks based on protein interaction maps [6–8]. These approaches have attempted to solve some of the technical drawbacks inherent to the two-hybrid approach, namely false positives (proteins that interact non-specifically) and false negatives (protein–protein interactions that are not detected in a classical yeast two-hybrid approach), in order to improve the quality of the data obtained as well as to provide as complete a description as possible of the protein–protein interactions occurring within a cell. Indeed, yeast two-hybrid strategies have been used successfully to decipher both stable biochemical complexes [9,10] and cellular pathways, such as the cell cycle or splicing [6,7,11•]. With the recent tremendous increase in genome sequence data (with still more to come) it becomes attractive to consider these protein interaction maps as a first step toward the analysis of protein function. Indeed, genome-wide approaches have been reported on T7 phage [12], yeast [7,11•,13•,14•], the human hepatitis C virus [15•] and Caenorhabditis elegans [16•]. How these various experimental approaches will compare to each other is still an open question. As a whole, however, they will provide the scientific community with efficient and accurate tools to approach protein function on a large scale and in a systematic way. In addition, they will help pinpoint many more proteins as new potential targets for drugs, while protein–protein interactions themselves could become promising targets for drug design in their own right [17].
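The step from pairwise two-hybrid results to an interaction map can be illustrated with a minimal sketch. It assumes that screen results are available as (bait, prey) pairs and treats interactions as undirected; the protein names and the helper functions `build_interaction_map` and `connected_groups` are hypothetical and are not taken from any of the cited studies.

```python
from collections import defaultdict


def build_interaction_map(pairs):
    """Return an undirected adjacency map from (bait, prey) pairs."""
    graph = defaultdict(set)
    for bait, prey in pairs:
        graph[bait].add(prey)
        graph[prey].add(bait)
    return graph


def connected_groups(graph):
    """Group proteins linked directly or through intermediate partners."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical screen results (placeholder names, not real data).
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
interaction_map = build_interaction_map(pairs)
groups = connected_groups(interaction_map)
```

Connected groups are only a crude first reading of such a map: a false positive can merge two unrelated groups, and a false negative can split a genuine complex.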


Other in vivo two-hybrid assays

During the past few years, several two-hybrid assays have been developed in organisms other than yeast. In order to build protein-linkage maps, a sufficient throughput at a cost-effective level must be achieved. In this respect, Escherichia coli is potentially an even more suitable host than yeast for protein–protein interaction screens. Indeed, a transcriptional activation assay in E. coli was published, very similar to the yeast assay, showing that a simple contact between a DNA-bound protein and the RNA polymerase mediated through two arbitrary polypeptides could elicit transcription [18]. Two other assays have been published based on the reconstitution of an enzyme, leading to a measurable enzymatic activity [19] or to the activation of a signal transduction pathway [20]. Although these technologies are not yet applicable to screening for protein–protein interactions, one of them has been used for screening variants in leucine-zipper libraries [21] by reconstituting an active dihydrofolate reductase (DHFR) enzyme made of two separate inactive polypeptides through a leucine-zipper structure. Despite the absence of many post-translational modifications in bacteria, such systems might present many advantages for high-throughput screening compared to yeast: the protein complex is built in the cytoplasm, not on DNA, avoiding transcriptional autoactivation; also, screening assays might be more cost effective because of the short generation time of bacteria and the easier, less expensive molecular biology in E. coli compared to yeast. Protein–protein interactions can also be screened in mammalian cells, allowing a more natural environment for many metazoan proteins [22]. Whether this assay is amenable to large-scale screening remains to be seen.

Phage display technologies

Phage display provides a physical link between a polypeptide and its coding gene. The polypeptide is expressed as a fusion with a phage coat protein on the surface of the phage particle, which contains the corresponding gene fusion. The phage-displayed polypeptide can be selected through interaction with a target using affinity chromatography and further characterized by amplification and sequencing of the corresponding gene. Although no protein–protein interaction map using phage display has been established so far, the technology might well have the necessary scaling potential: cDNA libraries have been successfully expressed on phage T7 [23], filamentous phage [24] and phage lambda [25], thereby providing a means to identify proteins interacting in vitro with given targets. An E. coli genomic library displayed on filamentous phage has also been constructed and tested in a model system [26].

Protein expression maps

Proteome-wide protein identification

Protein expression maps aim to identify proteins localized in specific protein complexes, in organelles or in cells. A typical approach consists of the separation of the various proteins of an extract by gel electrophoresis followed by mass spectrometric analysis of protein gel spots, providing precise identification of polypeptides by unique assignments with their corresponding DNA sequences through the use of sequence databases [27]. Recent optimizations of the various steps have provided one of the most powerful approaches in proteomics.

First, accurate purification method(s) for proteins, protein complexes or organelles are required. These mainly include centrifugation in density gradients [28,29], exclusion chromatography [30], and affinity chromatography using, for example, peptide tags [31], antibodies (immunoprecipitation) [32,33•,34] or substrates [35]. A tandem affinity purification (TAP) involving a combination of two high-affinity tags linked to the protein of interest permits a very efficient purification in two steps from a crude extract [36•]. The technique has been suggested as a general method for protein complex purification under mild conditions after expression in a natural environment [36•]. It remains to be demonstrated that this technique is applicable to organisms other than yeast.

Second, sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS–PAGE) is typically combined with isoelectric focusing (IEF), which separates the various proteins according to their isoelectric point on a gel with an immobilized pH gradient, to provide a 2DE technique, which is thus far the most commonly used multidimensional protein separation method in proteomics [32,33•,35,37]. Recently, innovative sample preparations have made membrane-associated proteins amenable to 2DE analyses [38]. As this technique is becoming standardized and reproducible in different laboratories, annotated databases of 2DE images have been created for various proteomes [37,39].

Third, the separated protein spots on the gel can be excised and the proteins proteolytically digested in-gel. Peptides are subsequently eluted from the gel and analysed by MS. High throughput is achieved by automated matrix-assisted laser desorption/ionization (MALDI), providing a list of masses for the peptides. Matching this list against the list of calculated peptide masses from an appropriate protein sequence database characterizes the isolated protein. This method alone was found to be sufficient in several proteomic studies [28,33•]. Electrospray ionization (ESI) MS requires a preliminary peptide purification step and provides both the peptide mass list (ESI MS) and amino acid sequences of selected peptides on tandem mass spectra (ESI MS/MS): this allows unambiguous identification of a protein’s sequence by database searching [29–32,35]. High-throughput methods have been designed to identify various post-translational modifications of proteins by MS [40•].
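The matching of a measured peptide mass list against calculated masses can be sketched as follows. This is a simplified illustration, not the procedure used in the cited studies: it assumes digestion with trypsin (cleavage after K or R, except before P), no missed cleavages, unmodified peptides and neutral monoisotopic masses. The function names, the 0.05 Da tolerance and the toy database are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(seq):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def score(observed, seq, tol=0.05):
    """Count observed masses that match a calculated peptide mass."""
    calculated = [peptide_mass(p) for p in tryptic_digest(seq)]
    return sum(any(abs(m - c) <= tol for c in calculated) for m in observed)


def best_match(observed, database, tol=0.05):
    """Return the database entry explaining the most observed masses."""
    return max(database, key=lambda name: score(observed, database[name], tol))


# Toy database and simulated measurement (hypothetical sequences).
database = {"protA": "GKAAR", "protB": "WWKYYR"}
observed = [peptide_mass("GK"), peptide_mass("AAR") + 0.01]
hit = best_match(observed, database)
```

Counting matched masses is the simplest possible score; it is used here only to keep the sketch short, and a real search would also have to weigh database size, missed cleavages and modifications.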



Proteome-wide protein quantification: towards differential maps

Further proteome-wide characterization, which requires accurate large-scale quantification, allows the production of global maps of differentially expressed proteins. Radioactive labeling of metabolites yields protein extracts that can be analyzed quantitatively on 2D gels by scintillation counting [41]. Stable isotope metabolic labeling of distinct protein pools followed by MS analysis of proteins purified from 2D gels provides a further method for quantitative and differential analysis of protein expression [42]. An analogous strategy consists of direct MS analysis of purified proteins: cell extracts are treated with alkylating reagents which are tagged for affinity isolation and labeled by different stable isotopes for MS characterization of the distinct cell types or cell states studied [43••]. This strategy does not make use of 2DE technology and thereby avoids its main limitations, that is, its inability to detect and quantify rare proteins and the difficulty of comparing gels. Differential protein expression maps have been applied, for example, to the elucidation of signal transduction pathways [42,44], the characterization of distinct cell types [45,46], the identification of virulence factors of pathogenic bacteria [47], of parasite-encoded membrane proteins [38] and of proteins specifically associated with human diseases [48••], and the determination of the effects that environmental changes have on protein expression [43••,49].
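The quantitative reading of stable isotope labeling can be illustrated with a short sketch. It assumes that, for each protein, MS yields pairs of signal intensities for the light-labeled and heavy-labeled forms of the same peptide, and it summarizes them as a median log2 ratio. The data layout, the function names and the two-fold threshold are hypothetical choices made for illustration, not parameters from the cited work.

```python
import math
from statistics import median


def protein_log2_ratios(peptide_pairs):
    """Median log2(heavy/light) per protein.

    peptide_pairs maps a protein name to a list of
    (light_intensity, heavy_intensity) tuples, one per peptide.
    """
    ratios = {}
    for protein, pairs in peptide_pairs.items():
        logs = [
            math.log2(heavy / light)
            for light, heavy in pairs
            if light > 0 and heavy > 0  # skip peptides missing in one state
        ]
        if logs:
            ratios[protein] = median(logs)
    return ratios


def differential(ratios, threshold=1.0):
    """Proteins whose abundance changes at least 2**threshold-fold."""
    return sorted(p for p, r in ratios.items() if abs(r) >= threshold)


# Hypothetical intensities for two proteins in two cell states.
peptide_pairs = {
    "protX": [(100.0, 400.0), (50.0, 200.0), (10.0, 45.0)],
    "protY": [(100.0, 110.0), (80.0, 75.0)],
}
ratios = protein_log2_ratios(peptide_pairs)
changed = differential(ratios)
```

The median is used so that a single aberrant peptide pair does not dominate the estimate for a protein.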

Other protein linkage maps

Transcriptome maps

The availability of completely sequenced genomes has permitted the production of microarrays of all the predicted genes of a particular organism. By performing hybridization experiments with probes prepared from RNAs of cells grown under varying conditions, it has been possible to produce genome-wide quantitative patterns of gene expression, and by inference maps of co-expressed proteins. Protein expression does not, however, always parallel mRNA expression [41]. In addition, different techniques used in gene expression profiling are far from being directly comparable to one another. Nevertheless, we believe that standardized and systematic studies on gene expression will provide rich foundations for further functional studies.

Phenotypic mutant linkage maps

The availability of completely sequenced genomes also paves the way for systematic gene disruption experiments. A list of essential genes can be compiled [50•], and more importantly, systematic analysis of viable disrupted mutants allows the grouping of genes according to mutant phenotypes [51•,52••]. Such studies have been performed not only on small bacterial genomes and yeast, but also on higher eukaryotes with a collection of tagged mouse embryonic stem cells [53]. This should ultimately lead to databases of mutant phenotypes directly related to a given mutant genotype.

Prediction-based linkage maps

Whereas the accumulated knowledge on protein linkage maps has been derived so far mainly from direct exploitation of biochemical and genetic experiments, recent approaches attempt to bypass the collection of specific experimental data altogether by predicting functional links between proteins through computational means. These approaches are founded on various biological hypotheses (phylogeny, structure, etc.) and tested on different sets of experimental data (sequences, gene expression, etc.), these data being available in databases of various kinds, bibliographical included. Promising attempts to automatically extract protein–protein interaction information from scientific text, for example, by applying parsing techniques based on a restricted vocabulary to Medline abstracts [54•], need to be refined in accuracy and scaled up to produce sizeable maps. Other works endeavor to identify functional links based on the ordering of related genes on genome sequences, following the notion that gene proximity is a result of selective pressure to associate genes that are coregulated and thus potential interacting partners [55,56]. The so-called ‘domain-fusion’ method [57•,58] also relies on sequence data to predict interactions; it is based on the idea that a configuration where a protein (the ‘composite protein’) contains two domains orthologous to two full-length proteins of a given species (the ‘component proteins’) hints at the existence of an interaction between the component proteins. Although comparisons with biochemical data obtained independently validate each of these prediction schemes to a certain degree, all have theoretical limitations, and false negatives as well as false positives abound. Furthermore, the predicted functional links may correspond to indirect functional associations, such as involvement in the same pathway or complex, as well as to direct molecular interactions. Given the present state of the art, additional information or biological validation is thus required to reach conclusive evidence on both the existence and the nature of a given functional link. One way to reduce this uncertainty is to combine independent prediction methods, or better yet, methods based on different types of experimental data. Links between proteins showing similar phylogenetic profiles, correlated mRNA expression patterns, or which participate in the same pathways have been compounded with domain-fusion predicted links, resulting in a higher confidence composite linkage map [59••].
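The logic of the domain-fusion idea, and of combining independent sources of links, can be sketched in a few lines. This is a deliberately simplified illustration, not the published method: it assumes that sequence comparison has already assigned each protein to precomputed domain families, which is the hard part in practice. All names (`domain_fusion_links`, `combine_evidence`, the family and protein identifiers) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def domain_fusion_links(composites, components):
    """Predict links between component proteins.

    composites: {composite_protein: set of domain family ids}, taken from
                any reference genome.
    components: {protein: domain family id} for full-length proteins of
                the species of interest.
    """
    by_family = defaultdict(set)
    for protein, family in components.items():
        by_family[family].add(protein)

    links = set()
    for domains in composites.values():
        # Every pair of families fused in one composite protein suggests
        # a link between the corresponding component proteins.
        for fam_a, fam_b in combinations(sorted(domains), 2):
            for a in by_family.get(fam_a, ()):
                for b in by_family.get(fam_b, ()):
                    if a != b:
                        links.add(frozenset((a, b)))
    return links


def combine_evidence(*link_sets):
    """Count how many independent methods support each link."""
    support = Counter()
    for links in link_sets:
        support.update(links)
    return support


# Hypothetical inputs.
composites = {"fusedAB": {"famA", "famB"}}
components = {"protA": "famA", "protB": "famB", "protC": "famC"}
fusion_links = domain_fusion_links(composites, components)

# Links from a second, independent method (placeholder values).
profile_links = {frozenset(("protA", "protB")), frozenset(("protB", "protC"))}

support = combine_evidence(fusion_links, profile_links)
high_confidence = {link for link, n in support.items() if n >= 2}
```

Requiring support from two methods is one simple way to trade coverage for confidence. As noted above, even a well-supported link may reflect an indirect functional association and not necessarily a physical interaction.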

Conclusions

Proteomics aims to determine the nature and the quantity of proteins present in biological samples, to identify linkage between proteins and ultimately to understand the function of these proteins. The approaches described in this review (ranging from yeast two-hybrid or phage-display assays for interaction map construction, to the combined use of 2DE and MS for protein expression map building, to the faster but less conclusive computational prediction techniques) evolve from different technological backgrounds, but all stem from an effort to scale up from ad hoc, ‘local’ protein analysis to global, proteome-wide pictures.


Acknowledgements

We thank D Strosberg, S Whiteside and J Wojcik for critical review of the manuscript. We are also grateful to A Danchin for many stimulating discussions.

