Transcriptomics Notes 2

  • Uploaded by: Anand Bambhania
  • April 2020

Chapter 2

EST CLUSTERING AND ASSEMBLY

Table of contents
1. Introduction
2. EST clustering
   2.1 Algorithms used for similarity measurement
   2.2 Types of clustering
   2.3 EST clustering databases
3. EST assembling
   3.1 EST assembling software
   3.2 Cluster joining
   3.3 Drawbacks of EST assembly
4. Tools
5. References

1. Introduction:
• ESTs (Expressed Sequence Tags) are partial sequences of cDNA clones, obtained as single-pass reads from the 5' and/or 3' ends of the clones.
• Individual clones are picked from the library, and one sequence is generated from each end of the cDNA insert. Thus, each clone normally has a 5' and a 3' EST associated with it.
• The sequences average ~360 bases in length.
• Because ESTs are short, they generally represent only fragments of genes, not complete coding sequences.
• Many sequencing centers have automated the process of EST generation, producing ESTs at a rapid rate.
• ESTs are submitted to all three international sequence databases (GenBank, EMBL, and DDBJ).

Edited by: Priya J.A., Densi V.B., Pooja P.S., Rohit G.S. Published by: Anand M.B.


2. EST clustering:
• The goal of the clustering process is to place overlapping ESTs that tag the same transcript of the same gene into a single cluster.
• For clustering, we measure the similarity (distance) between any two sequences.
• The distance is then reduced to a simple binary value: accept or reject the two sequences in the same cluster.
• EST data are large and highly redundant. As of mid-2000, GenBank contained just under 1.9 million human EST records; for example, dbEST contains more than 200 ESTs for human alpha-fetoprotein alone.
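The accept/reject idea above can be sketched in a few lines. This is only an illustration under stated assumptions: hamming_fraction is a toy stand-in for a real overlap-aware measure (it is not what any EST clustering tool uses), the threshold value is arbitrary, and the grouping is simple transitive (single-linkage) clustering with a union-find structure.

```python
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Toy distance (assumption): fraction of mismatches over the shorter sequence.

    Stands in for a real measure such as d2 or an alignment score.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n


def cluster(seqs: dict[str, str], threshold: float = 0.1) -> list[set[str]]:
    """Group sequence IDs transitively: a pair is accepted if distance <= threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        # The distance is reduced to a binary decision: accept or reject.
        if hamming_fraction(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Because acceptance is transitive, two ESTs that never pass the test directly can still land in one cluster through a third EST that overlaps both.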

2.1 Algorithms used for similarity measurement:

• Pairwise alignment algorithms:
  1) Smith-Waterman: the most sensitive, but time-consuming (e.g. cross_match).
  2) Heuristic algorithms, such as BLAST and FASTA, which trade some sensitivity for speed.


• Non-alignment based scoring methods:
  1) d2 cluster algorithm: based on word comparison and composition (word identity and multiplicity).

• Pre-indexing methods.

• Purpose-built alignment-based clustering methods.
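To make the cost of the most sensitive option concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match, mismatch, gap) and the linear gap penalty are assumptions chosen for illustration; production tools use more elaborate scoring schemes. The nested loops show why the method takes time proportional to the product of the two sequence lengths.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For two ESTs, a high local score over a long region is evidence of overlap; a clustering tool would then compare that score with a threshold.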

2.2 Types of clustering:

CLUSTERING TYPES

STRINGENT CLUSTERING:
1) Greater initial fidelity;
2) One pass;
3) Lower coverage of expressed gene data;
4) Lower cluster inclusion of expressed gene forms;
5) Shorter consensus sequences.

LOOSE CLUSTERING:
1) Lower initial fidelity;
2) Multi-pass;
3) Greater coverage of expressed gene data;
4) Greater cluster inclusion of alternate expressed forms;
5) Longer consensus sequences;
6) Risk of including paralogs in the same gene index.


CLUSTERING

SUPERVISED EST CLUSTERING:
ESTs are classified with respect to known sequences or seeds.

UNSUPERVISED EST CLUSTERING:
ESTs are classified without any prior knowledge.

2.3 EST clustering databases:

• Four EST clustering databases are described here:
  o UniGene
  o TGI (TIGR Gene Index)
  o STACK
  o trEST
• UniGene uses a combination of supervised and unsupervised methods with variable levels of stringency. No consensus sequences are produced.
• The TIGR Gene Index uses a stringent, supervised clustering method, which generates shorter consensus sequences and separates splice variants.
• STACK uses a loose, unsupervised clustering method, producing longer consensus sequences and including splice variants in the same index.
• trEST is an attempt to produce contigs from clusters of ESTs and to translate them into proteins.

3. EST assembling:

• A multiple alignment can be generated for each cluster (this is known as assembly), and consensus sequences can then be generated from it (this is known as processing).


• Assembly and processing result in the production of consensus sequences and singletons.
• Since ESTs represent gene transcripts, they contain far fewer repeats than genomic sequence, although repeat elements can still occur in transcripts.
• EST assembly is complicated by features such as:
  o (cis-)alternative splicing,
  o trans-splicing,
  o single-nucleotide polymorphisms,
  o recoding,
  o post-transcriptional modification.
• These differences from genomic data make the new generation of assemblers less applicable to EST assembly.
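The processing step, turning a multiple alignment into a consensus sequence, can be sketched as a column-wise majority vote. This is a deliberately simplified assumption: real assemblers also weigh base quality and strand, and the function name and the "-" gap character are illustrative choices, not taken from any particular tool.

```python
from collections import Counter


def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over equal-length, gap-padded alignment rows."""
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(row) != width for row in aligned_reads):
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*aligned_reads):
        # Count only real bases; gap padding does not vote.
        counts = Counter(base for base in column if base != gap)
        if counts:  # columns made only of gaps contribute nothing
            out.append(counts.most_common(1)[0][0])
    return "".join(out)
```

A single-nucleotide polymorphism or a sequencing error shows up as a column with a minority base; with a plain majority vote that information is lost, which is one reason EST consensus sequences need careful interpretation.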

3.1 EST assembling software:

• Phrapview: provides a "global" view of the assembly, complementing the individual base and trace view provided by consed.

• GAP4: an interactive program used for working on data from sequencing projects.

3.2 Cluster joining:
• All ESTs generated from the same cDNA clone correspond to a single gene.
• The original cDNA clone information is generally available (~90%).
• Using the cDNA clone information together with the 5' and 3' read information, clusters can be joined.
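Cluster joining can be sketched as follows: if the 5' read of a clone sits in one cluster and the 3' read of the same clone in another, the two clusters are merged. The function name join_clusters, the clone_of mapping, and the EST names in the usage note are hypothetical; ESTs without clone information are simply left where they are.

```python
def join_clusters(clusters: list[set[str]], clone_of: dict[str, str]) -> list[set[str]]:
    """Merge clusters that contain ESTs derived from the same cDNA clone."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        # Follow parent links to the representative cluster index.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_cluster_for_clone: dict[str, int] = {}
    for idx, members in enumerate(clusters):
        for est in members:
            clone = clone_of.get(est)
            if clone is None:
                continue  # clone annotation missing for this EST
            if clone in first_cluster_for_clone:
                parent[find(idx)] = find(first_cluster_for_clone[clone])
            else:
                first_cluster_for_clone[clone] = idx

    merged: dict[int, set[str]] = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(members)
    return list(merged.values())
```

For example, with clusters [{"c1_5p", "x"}, {"c1_3p"}, {"y"}] and both c1 reads mapped to clone "c1", the first two clusters are joined and the third is left alone.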


3.3 Drawbacks of EST assembly:
• The main drawback of EST assembly is that it does not usually permit the determination of a complete cDNA sequence, because most genes are too large to be covered by end sequencing.
• In addition, sequence quality drops towards the end of the sequence reads, which can prevent assembly programs from joining overlapping sequences into a single contig.
• Even when full-length contigs are generated, they are likely to contain errors, especially in regions where only low-quality data are available.

4. Tools:

4.1 EST clustering tools:

1> WCD:

• In order to decide whether two ESTs have a sufficiently large approximate overlap, we have to decide:
  o how long the overlap should be;
  o how we measure similarity or difference; and
  o what the error threshold should be.



• All of these are parameters of the clustering process.
• In addition, wcd provides a number of different features and ways in which the user can control clustering. Most of these are parameterisable.
• wcd provides two ways of comparing ESTs for overlap. The default is the d2 distance function, a biologically validated distance function for EST comparison that is particularly insensitive to repeats and rearrangement.
• wcd has an efficient, published implementation of the d2 distance function. The user can specify the word length, window size and error threshold.
• A memory-efficient implementation of edit distance is also provided. The user may supply the penalty matrix and threshold or use the defaults provided.
• The computations of d2 and edit distance are both expensive.
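A naive sketch of a d2-style distance may help show where the three parameters (word length, window size, threshold) come in and why the computation is expensive. It assumes the usual formulation: within a pair of windows, sum the squared differences of word counts, then take the minimum over all window pairs. This is not wcd's implementation, which is far more efficient, and the values k=6 and window=100 are illustrative only.

```python
from collections import Counter


def word_counts(s: str, k: int) -> Counter:
    """Count every overlapping word of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def d2_window(u: str, v: str, k: int) -> int:
    """Sum of squared differences in word multiplicity between two windows."""
    cu, cv = word_counts(u, k), word_counts(v, k)
    return sum((cu[w] - cv[w]) ** 2 for w in cu.keys() | cv.keys())


def d2(x: str, y: str, k: int = 6, window: int = 100) -> int:
    """Minimum window-to-window score over all window pairs (naive, quadratic)."""
    wx, wy = min(window, len(x)), min(window, len(y))
    return min(
        d2_window(x[i:i + wx], y[j:j + wy], k)
        for i in range(len(x) - wx + 1)
        for j in range(len(y) - wy + 1)
    )
```

Two ESTs would be placed in the same cluster when d2(x, y) falls at or below the chosen error threshold. The score depends on word content rather than word order, which is consistent with the insensitivity to rearrangement noted above.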



• wcd provides parameterisable heuristics which filter out unnecessary comparisons and speed up the clustering. Empirical testing has shown that the default parameters of the heuristics are very conservative: they do not affect the quality of the results, while speeding up the clustering by an order of magnitude.
• However, more aggressive parameters can speed up the clustering significantly, and they have a much smaller impact on the quality of the results than small changes to other parameters.
• In practice, clustering often has to be performed several times as the user explores different parameters or isolates problems in the data. Using aggressive heuristics for this early phase is particularly useful.
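The idea of a cheap filter that skips unnecessary comparisons can be illustrated with a shared-word screen: only pairs that have a minimum number of words in common go on to the expensive distance computation. This is an assumed, generic heuristic for illustration, not a description of the heuristics wcd actually implements; the name shares_enough_words and the default values are hypothetical, and reverse-complement matches are ignored for brevity.

```python
def shares_enough_words(x: str, y: str, k: int = 8, min_shared: int = 5) -> bool:
    """Cheap screen: do x and y share at least min_shared distinct words of length k?"""
    words_x = {x[i:i + k] for i in range(len(x) - k + 1)}
    shared = set()
    for j in range(len(y) - k + 1):
        word = y[j:j + k]
        if word in words_x:
            shared.add(word)
            if len(shared) >= min_shared:
                return True  # stop early: the pair deserves a full comparison
    return False
```

Raising min_shared makes the filter more aggressive: more pairs are skipped and clustering is faster, at the risk of missing some true overlaps. That is the kind of trade-off described above for exploratory runs.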


2> JESAM:

• A multi-stage pipeline was developed to discover all the arrangements of all pairs of sequences where the alignment could be consistent with the sequences being cognate and contiguous.
• The algorithm's first stages reduce the total number of pairwise sequence alignments whilst aiming to maintain overlap sensitivity and alignment accuracy.
• To ensure that the published alignments were both biologically useful and mathematically optimal, it was thought necessary to use a dynamic programming algorithm with a sophisticated gap penalty scheme.
• However, computation time makes it impractical to compare all sequences against each other for large EST datasets with workstation implementations of derivatives of the Smith-Waterman algorithm.
• Specialized hardware was considered an unacceptable solution because of perceived problems of cost, availability, portability, and ease of algorithm development.
• The JESAM alignment algorithm therefore uses dynamic programming only for the final alignments, relying on the gross overall overlap being easy to find, because the goal was only to discover potentially overlapping subsequences, not distant homologues mutated apart through millennia.

3> TGICL:

• TGICL uses stringent pairwise comparisons between input sequences to group those sharing significant regions of near identity. Individual assembly of each cluster has the advantage of producing larger, more complete consensus sequences while eliminating potentially misclustered sequences.
• In its simplest application, TGICL takes a single parameter: an input multi-FASTA file. TGICL's final output is one or more ACE files containing CAP3 assemblies and a list of singletons.
• Prior to running TGICL, the input dataset should be cleaned to remove contaminating sequences, including vector, adapter, and bacterial sequences, which can lead to misclustering and misassembly.
• This can be done using either a stringent program such as Lucy or a sequence-trimming script such as SeqClean with filtering databases such as NCBI's UniVec.
• Known repeats should also be masked using RepeatMasker with the lower-case masking option; TGICL excludes masked regions during its initial word-hashing phase.
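The effect of lower-case masking on a word-hashing step can be sketched like this: words that contain any lower-case (masked) base are not indexed, so masked repeats cannot seed a match between two sequences. This is only an illustration of the principle, not TGICL's code; the function name index_unmasked_words and the word length are assumptions.

```python
from collections import defaultdict


def index_unmasked_words(seqs: dict[str, str], k: int = 12) -> dict[str, set[str]]:
    """Map each word of length k made only of upper-case bases to the IDs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Any lower-case (soft-masked) base disqualifies the whole word.
            if word.isupper():
                index[word].add(name)
    return dict(index)
```

Sequences that end up under the same word become candidates for pairwise comparison. The masked bases remain in the sequence, so they can still take part in the later alignment and assembly.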

4.2 EST assembly tools:

1> Phrapview:

• Phrapview is distributed along with the phrap assembly engine and is a graphical viewer for phrap assemblies.
• It is intended to provide a "global" view of the assembly, complementing the individual base and trace view provided by consed.
• This global view focuses on information pertaining to possible incorrectness, incompleteness, or nonuniqueness of the phrap-generated assembly.
• Phrapview displays depth of coverage, forward-reverse read pairs, significant pairwise matches involving reads in different locations in the assembly, and chimeric reads.
• The input to phrapview is a .view file, which is produced by running phrap with the View option. (Note that phrapview does not perform any of the analyses itself; rather, it provides a way of displaying a file that contains an already completed analysis of the project.)
• A screen dump of a typical phrapview display of a 40-kb cosmid sequencing project is shown in the figure.


2> GAP4:

• Gap4 is an interactive program used for working on data from sequencing projects. It contains a comprehensive set of functions, many of which present their results graphically.


• Others, such as the Experiment Suggestion functions, produce textual output ready for parsing by external programs.
• One of its important components, used by many of the other functions, is the consensus algorithm. The gap4 database does not store the consensus sequence; rather, it is calculated whenever it is needed. When appropriate, it can be calculated separately for each strand, and, in the Contig Editor and Contig Joining Editor, it is instantly updated for each edit made.
• When phred-style confidence values are available, the algorithm uses them with strand and chemistry data to calculate a confidence value for each base in the consensus.
• At the end of a project, the algorithm can produce a FASTA-format file or an Experiment file containing the consensus and its confidence values.
• Preprocessing programs used by pregap4 and routines within gap4 can add annotations to readings (for example, the position of an Alu segment or a custom primer). Throughout the text, these annotations are referred to as "tags."

3> PHRAP:

• phrap is a program for assembling shotgun DNA sequence data. Among other features, it allows use of the entire read and not just the trimmed high-quality part.
• It uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats.
• It constructs the contig sequence as a mosaic of the highest-quality read segments rather than a consensus, provides extensive assembly information to assist in troubleshooting assembly problems, and handles large datasets.

4> CAP3:

• The CAP3 program includes a number of improvements and new features.
• A capability to clip 5' and 3' low-quality regions of reads is included in the CAP3 program.
• Base quality values produced by PHRED are used in the computation of overlaps between reads, the construction of multiple sequence alignments of reads, and the generation of consensus sequences.
• Efficient algorithms are employed to identify and compute overlaps between reads. Forward-reverse constraints are used to correct assembly errors and link contigs.
• Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets.
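Clipping of low-quality 5' and 3' regions can be sketched with a simple window rule: keep the span between the first and the last window in which every base meets a quality cutoff. The rule, the function name clip_low_quality, and the default cutoff and window values are assumptions for illustration; CAP3's actual clipping criteria are different and more refined.

```python
def clip_low_quality(seq: str, quals: list[int], cutoff: int = 20,
                     window: int = 10) -> tuple[str, list[int]]:
    """Trim both ends of a read to the span bounded by fully high-quality windows."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have the same length")
    n = len(seq)
    w = min(window, n)
    if w == 0:
        return "", []
    # Start positions of windows in which every base has quality >= cutoff.
    good_starts = [i for i in range(n - w + 1) if min(quals[i:i + w]) >= cutoff]
    if not good_starts:
        return "", []  # no reliable region at all
    start = good_starts[0]
    end = good_starts[-1] + w
    return seq[start:end], quals[start:end]
```

For instance, a 12-base read whose first two and last two bases have quality 5 and whose middle bases have quality 30 is trimmed to the middle eight bases when cutoff=20 and window=4.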

5. References:

(a) Headings 1 to 3 (by Priya J.A., Densi V.B.):
1. Webliography
   1> http://www.ch.embnet.org/CoursEMBnet/Pages02/slides/est_clustering.pdf
   2> http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1874618
2. Bibliography
   1> Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (2nd edition), by Andreas D. Baxevanis and B. F. Francis Ouellette, pp. 284, 288, 308 & 311.

(b) Heading 4 (by Pooja P.S., Rohit G.S.):
1. Webliography
   1> http://www.genome.org/cgi/content/full/9/9/868#References
   2> http://www.phrap.org/phredphrapconsed.html#block_phrap
   3> http://genome.wustl.edu/est/esthmpg.html
2. Bibliography
   1> Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (2nd edition), by Andreas D. Baxevanis and B. F. Francis Ouellette, pp. 288, 309 & 311.
