Chapter 3
SAGE EXPRESSION MAPPING
Table of contents
1. Introduction
2. Methodology
3. Build process for NCBI SAGE database
4. Result analysis
5. Applications
6. References
Edited by: Nishith A.V. Published by: Anand M. B.

1. Introduction:
Serial analysis of gene expression (SAGE) is a technique used by molecular biologists to produce a snapshot of the messenger RNA population in a sample of interest. The original technique was developed by Dr. Victor Velculescu at the Oncology Center of Johns Hopkins University and was published in the journal Science in 1995. Several variants have been developed since, most notably LongSAGE (a more robust version), RL-SAGE, and the most recent, SuperSAGE, whose increased tag length of 25-27 bp enables very precise annotation of existing genes and discovery of new genes within genomes.

SAGE allows the quantitative measurement of gene expression by measuring large numbers of transcripts from tissues of interest. Short tags of 9-11 bp of DNA are isolated from the 3' end of transcripts, sequenced, and assigned to genes. SAGE exploits large-scale sequencing but is also very effective for small-scale sequencing. It provides expression data for thousands of gene products without requiring a priori knowledge of them, and it identifies expressed genes.

2. Methodology:
Three principles underlie the SAGE methodology:
1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript;
2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and
3. Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
RNA is isolated from a source of interest and converted to cDNA with a biotinylated oligo(dT) primer.
A restriction enzyme (the "anchoring enzyme") is used to digest the total population of transcripts so that only short fragments are isolated, and the tight interaction between biotin and streptavidin allows the 3' end of each transcript to be tethered to streptavidin beads. Two populations of linkers are added, allowing the cDNA to be digested with a specialized restriction enzyme that releases the linker with a short fragment of cDNA (the "tag"). Tags are concatenated, cloned, and sequenced. This process results in the description of thousands of tags from a biological source. NCBI maintains a SAGE site.

A variety of SAGE libraries have been constructed. Each tag in a library is likely to correspond to a single gene. For a 9-bp tag, there are 4^9, or 262,144, transcripts that can be distinguished, assuming a random nucleotide distribution at the tag site. In practice, tags are mapped to genes using UniGene. In some cases a tag may be present on more than one gene. In other cases, a gene may have more than one tag (e.g., alternative splicing of a transcript may yield multiple tags for that gene). An assumption of SAGE is that the number of tags found in a SAGE library is directly proportional to the number of mRNA molecules in the biological sample.

SAGE has been used to describe the properties of the yeast transcriptome. The expression of 4665 genes was characterized, the majority of which had not been functionally characterized. Consistent with the analysis of UniGene clusters, these data showed that many transcripts are expressed only rarely. The number of distinct transcripts that were expressed in each cell type ranged from about 14,000 to 20,000, and the expression level ranged from one copy per cell to 5300 copies per cell.
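The tag arithmetic and the tag-extraction step described above can be sketched in Python. This is a minimal illustration, not the software used in practice. The text does not name the anchoring enzyme, so the recognition site CATG, the 10-bp tag length, the function names tag_space and extract_tag, and the example transcript are all assumptions made for the sketch.

```python
def tag_space(tag_len):
    """Number of distinct tags of a given length (4 possible bases per position)."""
    return 4 ** tag_len


def extract_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag that follows the 3'-most anchoring site, or None.

    Assumes `transcript` is written 5' to 3' and that `anchor` is the
    anchoring-enzyme recognition site (CATG is an assumption of this sketch).
    """
    site = transcript.rfind(anchor)  # 3'-most occurrence of the site
    if site == -1:
        return None  # transcript has no anchoring site
    start = site + len(anchor)
    tag = transcript[start:start + tag_len]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_len else None


print(tag_space(9))  # size of the tag space for a 9-bp tag

# Made-up transcript with two anchoring sites; only the 3'-most one is used.
example = "AAA" + "CATG" + "TTTTTTTTTT" + "CATG" + "ACGTACGTAC" + "GGGAAAA"
print(extract_tag(example))
```

Using the 3'-most site mirrors the requirement that each tag come from a unique, defined position within its transcript.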
3. Build process for NCBI SAGE database:
SAGE libraries can be electronically queried at the NCBI website, allowing the comparison of gene expression in any tissues for which SAGE libraries have been generated. The website includes tag data from SAGE libraries and annotation data in which tags are mapped to genes. SAGE libraries can be selected in a manner similar to using digital differential display. In a comparison of lung and brain libraries, for example, the genes that correspond to tags differentially present in lung include surfactant, pronapsin A, and secretoglobin, with hundreds of tags in lung but none in brain. Assorted brain-enriched transcripts are also identified. Examination of surfactant shows that the mapping of this particular tag (TGCCAGGTCT) to the surfactant gene appears unambiguous, and 50 tags corresponding to surfactant have been identified selectively in a lung library.
4. Result analysis:
The output of SAGE is a list of short sequence tags and the number of times each tag is observed. Using sequence databases, a researcher can usually determine, with some confidence, the original mRNA (and therefore the gene) from which the tag was extracted. Statistical methods can be applied to tag and count lists from different samples in order to determine which genes are more highly expressed. For example, a normal tissue sample can be compared against a corresponding tumor to determine which genes tend to be more (or less) active.
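One simple way to compare tag counts between two libraries is to normalize each count by library size and take a log ratio. The sketch below only illustrates that idea and is not the statistical method used by any particular SAGE tool. The function names, the pseudocount of 0.5, and the library sizes of 100,000 tags are assumptions; the counts of 50 (lung) and 0 (brain) follow the surfactant tag example given earlier.

```python
import math


def tags_per_million(count, library_size):
    """Scale a raw tag count to a library of one million tags."""
    return count * 1_000_000 / library_size


def log2_ratio(count_a, size_a, count_b, size_b, pseudocount=0.5):
    """Log2 ratio of normalized abundance in library A versus library B.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a tag is absent from one library.
    """
    a = tags_per_million(count_a + pseudocount, size_a)
    b = tags_per_million(count_b + pseudocount, size_b)
    return math.log2(a / b)


# Surfactant tag TGCCAGGTCT: 50 tags in lung, none in brain.
# Both library sizes are hypothetical.
lung_count, lung_size = 50, 100_000
brain_count, brain_size = 0, 100_000

print(tags_per_million(lung_count, lung_size))
print(round(log2_ratio(lung_count, lung_size, brain_count, brain_size), 2))
```

A positive ratio means the tag is relatively more abundant in the first library. A real analysis would add a significance test, since small counts are noisy.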
The result is a long list of nucleotides that has to be analyzed by computer. This analysis will do several things: count the tags, determine which ones come from the same RNA molecule, and figure out which ones come from known, well-studied genes and which ones are new.

[Figure: example of a concatemer, a nucleotide sequence with CATG sites and the tags labelled TAG 1 to TAG 6]
A computer program generates a list of tags and tells how many times each one has been found in the cell:

Tag_Sequence    Count
ATCTGAGTTC      1075
GCGCAGACTT      125
TCCCCGTACA      112
TAGGACGAGG      92
GCGATGGCGG      91
TAGCCCAGAT      83
GCCTTGTTTA      80
GCGATATTGT      66
TACGTTTCCA      66
TCCCGTACAT      66
TCCCTATTAA      66
GGATCACAAT      55
AAGGTTCTGG      54
CAGAACCGCG      50
GGACCGCCCC      48
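The counting step can be sketched as follows. This is a simplified illustration under stated assumptions, not the program that produced the list above: it assumes concatemers are built from ditags separated by CATG sites, that each ditag holds one 10-bp tag at its start and a second tag, read on the opposite strand, at its end, and it ignores sequencing errors and quality filtering. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer, anchor="CATG", tag_len=10):
    """Split a concatemer at anchoring sites and pull two tags from each ditag."""
    tags = []
    for ditag in concatemer.split(anchor):
        if len(ditag) < 2 * tag_len:
            continue  # skip empty pieces and fragments too short for two tags
        tags.append(ditag[:tag_len])                       # tag at the left end
        tags.append(reverse_complement(ditag[-tag_len:]))  # tag at the right end
    return tags


def count_tags(concatemers):
    """Tally every tag found across a collection of concatemer sequences."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(tags_from_concatemer(concatemer))
    return counts


# Made-up concatemer holding two ditags; tag sequences are taken from the list above.
demo = ("CATG" + "ATCTGAGTTC" + reverse_complement("GCGCAGACTT")
        + "CATG" + "ATCTGAGTTC" + "AA" + reverse_complement("TCCCCGTACA")
        + "CATG")
for tag, count in count_tags([demo]).most_common():
    print(tag, count)
```

A Counter keeps the tally in one pass and its most_common method returns the tags sorted by abundance, which matches the layout of the list above.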
The next step is to identify the RNA and the gene that produced each of the tags.
This is done by comparing the tags to a database containing all known genes from the organism. The following list shows the results of a comparison.

Tag_Sequence    Count   Gene Name
ATATTGTCAA      5       translation elongation factor 1 gamma
AAATCGGAAT      2       T-complex protein 1, z-subunit
ACCGCCTTCG      1       no match
GCCTTGTTTA      81      rpa1 mRNA fragment for r ribosomal protein
GTTAACCATC      45      ubiquitin 52-AA extension protein
CCGCCGTGGG      9       SF1 protein (SF1 gene)
TTTTTGTTAA      99      NADH dehydrogenase 3 (ND3) gene
GCAAAACCGG      63      rpL21
GGAGCCCGCC      45      ribosomal protein L18a
GCCCGCAACA      34      ribosomal protein S31
GCCGAAGTTG      50      ribosomal protein S5 homolog (M(1)15D)
TAACGACCGC      4       BcDNA.GM12270
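The comparison step amounts to a lookup of each tag in a tag-to-gene table. The sketch below is a minimal illustration, assuming the database has already been reduced to a Python dictionary. The names TAG_TO_GENE and annotate are hypothetical, and the three entries are copied from the list above. It also assumes one gene per tag, although a tag may in practice be present on more than one gene.

```python
# Hypothetical in-memory stand-in for a gene database (entries from the list above).
TAG_TO_GENE = {
    "ATATTGTCAA": "translation elongation factor 1 gamma",
    "GTTAACCATC": "ubiquitin 52-AA extension protein",
    "GCAAAACCGG": "rpL21",
}


def annotate(tag_counts, tag_to_gene):
    """Return (tag, count, gene) rows, most abundant tag first.

    Tags missing from the lookup table are reported as "no match".
    """
    ranked = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)
    return [(tag, count, tag_to_gene.get(tag, "no match")) for tag, count in ranked]


observed = {"ATATTGTCAA": 5, "ACCGCCTTCG": 1, "GCAAAACCGG": 63}
for tag, count, gene in annotate(observed, TAG_TO_GENE):
    print(f"{tag}\t{count}\t{gene}")
```

To allow for tags shared by several genes, the dictionary values could be lists of gene names instead of single strings.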
If there is a complete genome for the organism used in the experiment, there should theoretically be a match for every tag. Some of the genes will have been studied before; they will have been given names, and researchers will know something about how they function in cells. But many genes have only just been discovered, and we know nothing about their functions. One of the fascinating things about SAGE is that it can also be used to discover new genes in organisms for which there is no complete genome. If a sequence doesn’t match a known gene, it may come from a gene that hasn’t been discovered before, although a sequencing error can also produce an unmatched tag.

The profiles of different types of cells (for example, a muscle cell and a brain cell) will be very different. The profile of a cancerous cell, or one which has been infected, will also deviate from that of a normal cell.
By monitoring the complete activity of the genome, SAGE should give researchers strong clues about patterns of gene activity that contribute to a particular disease.
5. Applications:
SAGE was originally conceived for use in cancer studies. It has since been used successfully to describe the transcriptomes of other diseases and of a wide variety of organisms. The method can be applied to studies of virtually any biological phenomenon for which changes in cellular transcription are responsible. SAGE can not only give a global gene expression profile of a particular type of cell or tissue, but also help identify the set of genes specific to a cellular condition by comparing the profiles constructed for a pair of cells kept under different conditions. SAGE has been used for analysis of the human transcriptome, for the serial microanalysis of renal transcriptomes, and for the characterization of gene expression in colorectal adenomas and cancers.
6. References:
1. Webliography:
1> http://en.wikipedia.org/wiki/Serial_analysis_of_gene_expression
2> http://www.embl-heidelberg.de/info/sage/
3> http://www.ncbi.nlm.nih.gov/pubmed/11251221
4> http://www.sagenet.org/findings/index.html
2. Bibliography:
1> Bioinformatics and Functional Genomics, Jonathan Pevsner