eDICTIONARY OF BIOINFORMATICS ARUN JAGOTA Bioinformatics By The Bay Press http://bioinformaticsbythebay.hypermart.net
©Copyright, Arun Jagota, 2001. For personal use only.
Preface This dictionary describes a number of terms, concepts, tools, and databases in Bioinformatics. Descriptions range from a few sentences to a few paragraphs each. (Neither URLs, nor references are included. The reader can easily find more information about a term by querying a search engine on the web with the term.) The terms described are specific to bioinformatics; molecular biology or statistical concepts not known to the author to be used significantly in bioinformatics (to date) are excluded. If some fonts are not coming out right, make sure you are reading in Acrobat Reader (or Adobe Acrobat) 4.0 or higher. About the Author Arun Jagota teaches a number of bioinformatics courses in the national award winning certificate program at the University of California, Santa Crux Extension. He is a PhD in Computer Science from the State University of New York at Buffalo in 1993. He has been a visiting faculty member in the Department of Mathematical Sciences at the University of Memphis, in the Department of Computer Science at the University of North Texas, an affiliated faculty member in the Department of Computer Science at the University of California, Santa Cruz, and an adjunct faculty member in the Department of Mathematical Sciences at San Jose State University. He is presently a part-time instructor and research scientist at the University of California, Santa Cruz, and an adjunct faculty member in the Department of Applied Mathematics as well as at the Department of Computer Engineering at Santa Clara University. Arun Jagota has taught more than thirty different courses in Computer Science, some in Discrete Mathematics, and a few in Probability and Statistics.
2
AAComp{Ident,Sim} These are search tools at the ExPasy server that use the composition of an amino acid sequence, rather than the sequence itself, to find sequences in a database (SWISS-PROT) with similar composition. (The composition of a query is searched against a database of the compositions of sequences in SWISS-PROT.) AACompIdent inputs the composition of the query and is useful when the query sequence is not known. In fact, it may be used to identify the sequence from its composition. AACompSim inputs a query sequence and computes its composition internally.
ACCESSION NUMBER This is a unique identifier to a Genbank sequence record. The typical format is 1-2 letters followed by 4-6 digits. Here are a few fictitious though syntactically correct examples: AG123456, BF43251.
AFFINE GAP COST This is a model for scoring gaps in a pairwise sequence alignment. It has two parameters, a gapopen penalty o applied once to each contiguous block of gaps, and a gap-extension penalty e < o applied to each gap in a block except the first one. For example, a gap block ---- would incur a cost of 6 if o = 3 and e = 1 .
ASN.1 This is the data model that underlies the set of integrated databases at NCBI. This model is similar to object-oriented data structures and databases, and offers rich support for nested and interrelated data structures. The richness of this model facilitates the seamless and well-navigable access to these databases that the NCBI web browser Entrez provides. Contrast this model to the flat-file format of Genbank, which has its deficiencies.
BLAST Basic Local Alignment Search Tool. This is a popular sequence-based database searching program. It performs, one-by-one, a heuristic, local alignment of the query sequence to each sequence in the specified database. (Heuristic means non-optimal. Local means as opposed to global.) BLAST is much faster than dynamic programming, while remaining effective. (Dynamic programming computes optimal local alignments.) BLAST returns a list of hits ranked by E-value. The lower the E-value of a hit, the more significant it is.
BLASTN This is the version of BLAST that searches a nucleotide sequence against a nucleotide database. It uses a primitive substitution matrix (match=1, mismatch=-2).
BLASTP This is the version of BLAST that searches a protein sequence against a protein database. It supports searches using the BLOSUM or the PAM family of substitution matrices. BLOSUM62 is the most popular matrix used in BLAST searches.
3
BLASTX This is the version of BLAST that searches a nucleotide sequence against a protein database by translating the nucleotide sequence in all frames.
BLAST2SEQUENCES This is a version of BLAST which locally aligns two sequences, instead of searching one sequence against a database.
BLOCKS This is a collection of blocks, where a block is an ungapped multiple alignment of highly similar fragments of protein sequences. A block represents a motif with some structural or functional significance. The popular BLOSUM62 substitution matrix is constructed from this database.
BLOSUM This is a family of substitution matrices for scoring alignments, for example when doing BLAST searches. BLOSUM62 is the most widely used member of this family.
BOOTSTRAP This is a statistical method used to assess the robustness of phylogenetic trees produced by various tree-building methods. The original multiple alignment from which a phylogenetic tree was first produced is used to generate numerous bootstrap alignments. Each bootstrap alignment is obtained from the original multiple alignment by sampling without replacement one column at random from the n columns in the original alignment, n times. Thus a bootstrap alignment has the same number of columns as the original alignment. A phylogenetic tree is built from each bootstrap alignment. Once all the trees have been built, these trees may be compared to yield a consensus tree or consensus subtrees. Well-conserved branches (or more generally, subtrees) are often deemed as reliable.
BRANCH-AND-BOUND This is a method for speeding up brute-force search for an optimal solution to a combinatorial optimization problem. A combinatorial optimization problem is defined by a set S of solutions, with a quality q ( s ) assigned to every solution s in S . A brute-force search generates and tests each solution in S to find a solution with maximum quality. Branch-and-bound uses the following clever idea to prune the search space. Suppose that a subset of solutions has been visited and a best solution in this set found with quality p . Consider another subset Q of unvisited solutions, and an
q on the quality of any solution in Q . If p ≥ q , then no solution in Q need be visited, since our present best solution cannot be improved by visiting Q . The better the upper bound we can place on Q and the larger the size of the set Q , the more the speedup will be. upper bound
In bioinformatics, the branch-and-bound method has been used to optimally thread a protein sequence through a tertiary structure template and to find a maximum parsimony phylogenetic tree.
4
CATH This is a database which classifies protein domain structures in PDB. Each structure is classified at four levels. One of the levels is Class, which characterizes the secondary structure content of the structure as mainly-alpha (contains mainly alpha-helices), mainly-beta (contains mainly betasheets), alpha/beta (contains alternating alpha/beta structures, or many of each), or neither (structure has low secondary structure content). Another level is Architecture, which characterizes the overall shape of the domain structure in terms of the orientation of the secondary structure elements. A third level is Topology, which characterizes the fold family of the structure based on the orientation and the inter-connectivity of the secondary structure elements. A fourth level is Homologous Superfamily, which groups together protein domain structures that appear to be evolutionarily related.
CE This is a tertiary structure-to-structure alignment method for structures in PDB. It is accessible from the PDB web site. CE builds a long alignment from short alignments called aligned fragment pairs (AFPs). An AFP is a short fragment in one structure aligned with a short fragment in the other, based on local geometry. CE first picks the most similar AFP, then extends the alignment by adding more AFPs to it. The AFPs in the extended alignment are required to form a continuous chain, with no overlapping fragments but with gaps permitted, as illustrated below.
The two axes are the protein structures, the solid diagonal lines are the AFPs, and the continuous chain of the three AFPs (with intervening gaps) forms the alignment. The dashed lines show the projection of this alignment onto the two structures.
CLUSTALW This is a popular multiple alignment tool based on the progressive alignment method.
CLUSTER ANALYSIS This is a computational and statistical procedure for partitioning a data set into subsets of similar items. It is often used to group together genes with similar expression patterns (or experiments with similar response patterns of genes) from microarray gene expression data. It is also used to group together proteins with similar sequences or similar structure (many protein classification databases are constructed this way).
5
CLUSTR This is a database of an automated clustering of the proteins in SWISS-PROT plus TrEMBL into groups of related proteins. The clustering is based on a similarity matrix of all pairs of protein sequences. Similarity of two sequences is measured as the score of an optimal local alignment, as computed via the Smith-Waterman algorithm.
CONSERVED DOMAIN DATABASE This is a database at NCBI which may be used to locate domains in a protein sequence. Domain motifs, represented as position-specific weight matrix profiles, are scanned against the query sequence to find which motifs hit where.
Cn3D This is a 3D structure viewer for protein structures in MMDB. It includes the ability to display structure-to-structure alignments and structure-based sequence-to-sequence alignments. In contrast to conventional multiple alignments, these ones are of a master (the query structure) to slaves (the structures aligned to the query structure). It uses VAST to find and align slave structures to the master structure. Cn3D also permits one to annotate the view of a structure and also has other improved features (e.g. ability to view correlated disorder in Xray structures, something often ignored by earlier viewers.
CPHMODELS This is a homology-based program for predicting the tertiary structure of a protein from its sequence. The sequence is compared with other sequences to find a highly similar sequence whose tertiary structure is known. This structure is returned as the predicted structure of the original sequence.
DALI This is a tertiary structure-to-structure alignment method and program. One can query PDB with a structure and find similar structures in PDB, using DALI. Pre-computed alignments are stored in the FSSP database. DALI works by aligning the distance matrices of the two protein structures. (The distance matrix of a protein tertiary structure stores the distances between all pairs of residues, specifically the distances between their alpha-carbons.) DALI compares the two distance matrices by finding highly similar "aligned submatrix pairs", then putting together many such pairs to form an alignment. This is illustrated below.
The outer boxes denote the distance matrices of the two structures. The enclosed boxes denote the submatrices. Solid submatrices in the two structures form one aligned pair, the dashed 6
submatrices form another pair. DALI scores an aligned pair, i.e. a pair of submatrices, one from each structure, by quantifying the similarity of the submatrices being compared, specifically by aggregating, component-wise, the values in their cells.
DIRICHLET DISTRIBUTION This is a continuous distribution that is conjugate to the multinomial distribution. It is used to specify a prior distribution over probability parameters. Let us explain this distribution and this use of it with the following example. We toss a coin n times, and observe k heads. The maximum likelihood estimate (of the probability that the coin will hand a head on the next toss) is k / n . When n is small, this estimate is very unreliable. The Bayesian alternative would be to assume a prior distribution over the probability parameter PH that the coin lands a head in any one toss. From the
n tosses, one could then compute the posterior
data, in which k heads have been observed in density
P ( PH | k , n ) using Bayes rule. (In practice, this is not usually done, rather the mean of the
posterior is computed, which is much simpler.) The posterior is affected by our choice of the prior distribution (which models our belief of what we expect to happen when a coin is tossed) as well as by our observations: k heads in n tosses. For instance, if assume, reasonably, that all coins are fair unless we are shown evidence to the contrary, then this prior distribution would be concentrated around PH → 0.5 . On the other hand, we might equally reasonably assume that all possible values of PH , ranging from 0 to 1 , are equally likely. (In other words, that all degrees of skewness of the coin are equally likely.) In this case, we would have a flat Dirichlet density. More generally, we can span the spectrum of flat Dirichlet densities (all degrees of skewness of the coin equally likely) to biased Dirichlet densities (which concentrate the prior probabilities on particular values of PH ) by certain parameters, α , which control the shape of the distribution. It turns out that, when the prior distribution distribution
P ( PH ) is a Dirichlet distribution, the posterior
P ( PH | k , n ) is also a Dirichlet distribution.
In bioinformatics, a Dirichlet distribution (and sometimes a mixture of Dirichlet distributions) is often used to model the prior distribution of the emission probabilities of alphabet symbols (the four nucleotides in the case of DNA sequences) in a column in a multiple alignment, or in a position in a functional site. This is usually done for protein sequences, although our example below is a DNA one. For example, let us suppose that we would like to build a position-weight matrix profile for a set of aligned donor site sequences. Consider the position +1 in this profile, the first base into the intron. We expect in advance that this position will almost surely contain a G . We will choose the Dirichlet distribution to model the prior distribution of the probability parameters of the four nucleotides at this position in such a way that it is concentrated around PG → 1 . This way, the posterior probability
P ( PG → 1| data ) will remain close to 1 unless we find enough sequences in which there is a letter other than G at this position (which is extremely unlikely). We will use flatter Dirichlet distributions at other columns in the multiple alignment where we do not have a prior expectation of a strong bias in favor of some nucleotides over others. The approach described in the previous example works well even when there is limited data. For instance, if we constructed a position-weight matrix profile from an aligned set of just (say) three sequences, the profile would reveal that we can trust the +1 column of the profile (even though our 7
multiple alignment only has three sequences) because there is a strong prior expectation of a G at this position, and our method has exploited this knowledge. On the other hand, our profile would reveal a wider spread at some positions further away from the site boundary, which would indicate that we don't have enough data to offset the flatter prior that reflects our uncertainty of what compositions we would expect at these positions.
DOT MATRIX This is a nice way to visualize a local alignment, especially many alignments of pairs of, possibly overlapping, fragments in the two sequences, as illustrated below.
The horizontal and vertical axes correspond to the two sequences. Solid diagonal lines denote aligned fragments.
DDBJ This is the Japanese sibling of Genbank.
DYNAMIC PROGRAMMING This is an algorithm for computing an optimal local or global alignment of two sequences.
EMBL This is the European sibling of Genbank, an annotated collection of all public domain DNA and protein sequences.
ENSEMBL This is a web server which provides access to automatically generated annotations of the genomes of complex organisms such as humans. One can search the human genome, browse chromosomes, find genes, find genomic sequences similar to a given protein sequence, etc.
ENTREZ This is a text-based search engine for bioinformatics databases, at NCBI. It provides access to a literature database (PubMed), nucleotide database (Genbank), protein sequence database, 3D structure database (MMDB), Genome database (complete genome assemblies), population sets, and Online Mendelian Inheritance in Man (OMIM).
8
ENTROPY This is an information-theoretic concept that quantifies the amount of disorder in a probability function. For example, the entropy H ( P ) of random DNA, i.e. PA = PC = PG = PT , is 2 . Entropy ranges from 0 to
log 2 Σ , where Σ denotes the number of symbols in the alphabet Σ .
Entropy is useful in quantifying the degree of conservedness in a column in a multiple alignment (or in a position in a functional site). The lower the entropy, the more conserved the column is. Also see information content, sequence logos.
E-VALUE This is the expected number of chance hits that score as well or better than the given hit. The lower the E-value of a hit, the more significant the hit is. BLAST reports the E-value of a hit, and ranks hits by their E-values.
EXTREME VALUE DISTRIBUTION This is the distribution used by BLAST to compute its E-values. This distribution models the score of a highest-scoring hit of a query sequence to a database composed of random sequences. For concreteness, consider the setting of ungapped global alignment. We have a query sequence q of length n , to be compared individually against m sequences in a database D composed of random sequences, each also of length n . For simplicity, let us define our score function S ( x, y ) that quantifies the similarity of two equal-length sequences x and y as percent identity. For any fixed sequence d in the random database, the distribution of S ( q, d ) is binomial, and may be approximated by a normal distribution. On the other hand, the distribution of the best score,
S (q, d * ) = max d in D ( S (q, d ) ) is not normal, it is an extreme-value distribution associated with
the normal distribution.
FASTA This is a program similar to BLAST for heuristic local alignment and for sequence-based database searching.
FASTA FORMAT This is a widely used format for DNA and protein sequences. Many search tools accept sequences in this format. For each sequence, there is a header line beginning with > followed by one or more lines containing the sequence itself, as illustrated below. >sequence header ACAGAAA >Sequence header >ACAAAGA …
9
FRAGMENT ASSEMBLY PROBLEM A somewhat idealized version of the problem is as follows. We wish to determine the sequence X = X 1 X 2 ... X n of a DNA molecule, say 1000s of base pairs long. Experimentally, we sequence its fragments (typically about 400 bases each) at random, unknown locations. Moreover, some of these fragments may come from the forward strand, others from the backward strand. We don't know which fragment comes from which strand. The problem is to chain these fragments together in such a way (based on their overlaps) that we get a good reconstruction of X , as illustrated below. Reconstructed
X
Fragments
Note that some fragments may need to be reverse-complemented in the assembly process. If the strands of all the fragments are known, then the problem may be modeled as one of finding the shortest common superstring of the fragments. A greedy algorithm is commonly used as the center-piece of a fragment assembly algorithm. The greedy method combines, at each stage, the two fragments with the highest overlap and replaces them with their overlapped concatenation (contig), as depicted below. Contig Fragments
FSSP This is a database of all-against-all structure-to-structure alignments of structures in PDB, alignments done via the program Dali. In more detail, all protein chains from PDB of length at least 30 are taken and partitioned into two sets: a representative set R and a set of sequence homologs H to structures in R . The sequences of any two structures in R have no more than 25% identity. The sequence of each structure in H has identity of more than 25% to the sequence of at least one structure in H . First, all pairs of structures in R are aligned by Dali. Next, each structure in R is aligned with its sequence homologs in H .
GCG This is a large commercial package for sequence analysis containing more than one hundred programs for various tasks: pairwise alignment, multiple alignment, phylogenetic analysis, format conversion, fragment assembly, protein structure prediction, etc. It is also called the Wisconsin package.
10
GENBANK This is the premier database of nucleotide sequences. It is in flat file format.
GENSCAN This is a popular genefinding program. It uses hidden Markov models.
GLOBAL ALIGNMENT This is a full alignment of two DNA or protein sequences, with gaps inserted in one or both sequences, as needed.
GRAIL This is a popular genefinding program. It uses neural networks to combine information about predicted local sites such as splice sites with predicted coding regions.
HIDDEN MARKOV MODEL This is a sophisticated probabilistic model for biological sequences. It is especially useful in describing a multiple alignment of protein sequences, with gaps in the alignment. See Profile Hidden Markov Models. Many modern genefinders also use hidden Markov models.
HITS This is a database of protein domains and motifs. Motifs take the form of regular expressions (from Prosite), generalized profiles (from Prosite), and Hidden Markov Models (from Pfam).
HMMER This is a program for building a profile hidden Markov model as well as a multiple alignment from a set of protein sequences.
HSSP This is a database of homology-derived secondary structure of proteins produced by aligning protein sequences of unknown structure to the sequences of known structures.
INFORMATION CONTENT This is the amount of information in a probability function P . It is defined as IC( P) = H ( R) − H ( P) . Here H denotes entropy, and R is a reference probability function, typically one that assigns the same probability to each outcome in the sample space. For example, if the sample space is the DNA alphabet and R is the uniformly random probability function, H ( R) equals 2 . Information content is used to quantify the amount of information in a column of a multiple alignment (or of a position in a functional site). Information content is used in sequence logos , a
11
visualization tool for multiple alignments that reveals the information content of each column, as well as its composition, in a visual format.
JUKES-CANTOR MODEL This is a time-dependent substitution matrix for nucleotide sequences. The matrix contains just two time-dependent quantities P (α | α , t ) and P ( β | α , t ) which are independent of α and β . Here
α denotes a nucleotide and β a nucleotide different from α . The JC substitution matrix does not distinguish different types of substitutions, say transitions versus transversions.
KEGG This is a database of biochemical pathways.
KIMURA MODEL This is a variant of the Jukes-Cantor substitution matrix for nucleotide sequences which distinguishes between transition substitutions and transversion substitutions.
LALIGN This is a tool for local alignment of two sequences, where the local alignment is allowed to be fragmented.
LIGAND This is a chemical database of enzyme reactions.
LINEAR CORRELATION COEFFICIENT This is a normalized measure of the correlation
ρ ( X , Y ) between two random variables X and
Y . It is defined as covariance( X , Y ) (standard deviation( X ) × standard deviation(Y ) )
ρ ( X , Y ) ranges from −1 to +1 , with −1 indicating that X and Y are maximally anti-correlated, +1 indicating that X and Y are maximally positively correlated, and 0 indicating that X and Y are uncorrelated. Linear correlation coefficient is widely used to compare the expression vectors of genes (or experiments) in microarray data, especially during cluster analysis.
LOCAL ALIGNMENT This is the process of finding and aligning highly similar regions of two DNA or protein sequences.
12
LOW-COMPLEXITY REGION This is a portion of a DNA or a protein sequence with many repeats of short sequences or other forms of biased composition. It can cause database searching programs like BLAST to give false hits. It is a good idea to have the low-complexity regions removed from a query sequence before doing a BLAST search (this is the default action in BLAST).
LPFC Library of Protein Family Cores. Think of this as a collection of structure profiles. A structure profile corresponds to a structurally conserved region and is formed by a structural alignment of proteins in a family. Functional sites in a 3D structure may be detected by comparison with structure profiles. These profiles are also useful in threading.
MARKOV MODEL This is a probability model that permits bounded dependencies. Specifically, in a first-order Markov model, the probability of a letter at a particular position in a sequence is permitted to depend on the
P ( X i | X i −1 ,..., X 1 ) = P ( X i | X i −1 ) , where X i denotes the ith letter in a sequence X . In a kth -order Markov model, letter at the previous position. That is,
P ( X i | X i −1 ,..., X 1 ) = P ( X i | X i −1 ,..., X i − k ) In bioinformatics, Markov models are used to model short well-conserved patterns called signals as well as the statistics of variable-length sequences called content. (A donor splice site is an example of a signal, an exon an example of content.)
MEME This is a tool for automatically discovering motifs (represented by ungapped position-weight matrix profiles) from a set of related protein or DNA sequences. The motif discovery algorithm is based on fitting a two-component mixture model to the given set of sequences, using the EM algorithm. One component describes the motif by a fixed-width position-weight matrix profile. The other component models background, i.e. all other positions in the sequences.
META-MEME This is a tool that inputs a set of related protein sequences and a set of motifs in them (discovered by MEME, a related tool) and produces a hidden Markov model. The HMM may then be used to multiply align the protein sequences, or to search (the HMM) against a database of sequences. These uses are similar to those of profile HMMs.
MMDB Molecular Modeling Database. This is the NCBI database of protein tertiary structures. Think of it as a value-added, validated version of PDB. The distinguishing features of MMDB are explicit and consistent chemical graph information following extensive validation, uniformly derived secondary structure definitions and domain descriptions, citation matching to MEDLINE, taxonomy assignment to each protein, and structure neighbors for all structure chains and individual domains. Structure neighbors of each MMDB structure in the database are pre-computed using the VAST 13
structure-to-structure alignment tool. Sequence neighbors of each MMDB structure in the database are pre-computed using BLAST on the sequence of the structure. The Cn3D viewer is useful for viewing not only a structure but also alignments of a structure to other structures as well as corresponding alignments of their sequences.
MULTINOMIAL DISTRIBUTION This is the generalization of the binomial distribution to more than two outcomes. For example, given the background probabilities P ≡p PA , PC , PG , PT f of A, C , G , T , this computes the
probability that a sequence of length n in which the letter at each position is generated randomly according to P , and independently of other positions, has a composition vector p #As,#Cs,#Gs,#Ts f . (Here all four compositions must add up to n .) This distribution is suited to modeling the composition of DNA or protein sequences. It has applications to scoring the composition of a column in a multiple alignment to a model of the expected composition, to quantifying codon usage bias in protein sequences, to trying to distinguish exons from introns by composition, and others.
MULTIPLE ALIGNMENT This is a global alignment of more than two DNA or protein sequences. The alignment is typically scored by scoring each column of the alignment and adding up the column scores. Gaps costs may be incurred in the score of a column, or separately. One way to score a column is by its degree of conservedness (more conserved columns indicate better-scoring alignments). This can be done by computing the information content of the column. A more widely used method is called the sum-of-pairs method. In this method, the score of a column is the sum of the scores of all distinct pairs of letters in the column, where a pair of letters is scored via a substitution matrix (such as BLOSUM or PAM in the case of protein sequences). Multiple alignments have many uses including towards building descriptors for functional sites, for domains, and for families. A descriptor built from a multiple alignment is often used to find more sequences that fit the descriptor (more sequences that belong to a functional site, domain, or protein family, as the case may be) and add these to the alignment. A multiple alignment is also a first step in phylogenetic tree-building.
MUTUAL INFORMATION
M ( X , Y ) , quantifies the amount of dependence between two symbolic random variables X and Y . (Symbolic here means that X and Y need not take numeric values. For example, X and Y could range over the nucleotides A, C , G , T .) M ( X , Y ) is defined as the relative entropy H ( P ( X , Y ) || P ( X ) P (Y ) ) , where P ( X , Y ) is the joint distribution of X and This quantity,
Y and P( X ) and P(Y ) are its marginal distributions. M ( X , Y ) is always nonnegative, the more positive it is, the more dependent X and Y are. Mutual information has been used in RNA structure analysis, specifically in quantifying the basepairing strength of two columns iand j in a multiple alignment of RNA sequences. The higher the mutual information of these two columns, the more likely it is that they are base-pairing columns. This calculation assists in the prediction of RNA secondary structure from a multiple alignment.
14
Mutual information may also be used to reveal and quantify the amount of dependence between (adjacent or non-adjacent) positions in a functional site, for example a donor splice site. Mutual information may also be used to quantify the dependence between pairs of residues in near proximity, in protein tertiary structures.
NEIGHBOR-JOINING METHOD This is a phylogenetic tree-building method that is "one notch above" the UPGMA tree-building method.
NEURAL NETWORK This is an algorithm for learning from a given set of examples, loosely inspired by real neural circuits in the brain. Neural networks are widely used for pattern recognition, for instance to classify an input vector into one of k classes. A neural network is trained on a set of labeled examples, i.e. the correct class of each example is given to the network during training. After training, the neural network is used to classify new examples. In bioinformatics, neural networks have found use in protein secondary structure prediction, in predicting short semi-conserved sites such as donor and acceptor splice sites, in genefinding, and other areas.
NUSSINOV RNA FOLDING ALGORITHM This is a dynamic programming algorithm for computing the maximum number of nested canonical base pairs in an RNA sequence. The sequence ACAACGU has two nested canonical base pairs, the outer pair being A-U and the inner pair being C-G and this is the maximum number in this sequence. The sequence ACAGCGU on the other hand has three nested canonical base pairs.
PARSIMONY This is a character-based method for constructing a phylogenetic tree from a multiply aligned set of sequences. The parsimony of a tree is defined as the minimum number of substitutions required to produce a given set of sequences, placed in a particular way at the leaves of the tree. Here is an example on the DNA alphabet, for a fixed tree on whose leaves one-letter sequences are placed.
C
C
A
C
The minimum number of substitutions on this tree to yield the leaves is 1, achieved by placing the same 1-letter sequence, C, at all the internal nodes of the tree, including the root. The parsimony of a tree may be computed efficiently by a clever method. It is much more time consuming to find the tree which has maximum parsimony (minimum number of substitutions). A 15
naïve method would enumerate all trees. This would be prohibitively time consuming since the number of trees is exponential in the number of sequences. A branch-and-bound method, which prunes the search space, is used. It is still slow, though.
PAM Point Accepted Mutation. This is a family of protein substitution matrices for scoring alignments, especially when doing BLAST searches. One PAM is a unit of evolutionary divergence in which 1% of the residues have changed. PAMX is thus the matrix for "X PAM units". First, a PAM matrix is constructed for a small value of X from a correct multiple alignment of closely related protein sequences. This PAM is then extrapolated to construct PAMY, where Y >> X, by essentially matrix multiplication. To align highly divergent sequences use PAM250.
PAML This is a software package for Phylogenetic Analysis By Maximum Likelihood. According to the author, this package is not good for building trees, but is good for evaluating and comparing trees built by other methods such as PAUP.
PAUP Phylogenetic Analysis Using Parsimony. This is a software package for constructing a phylogenetic tree from biomolecular sequences or other data. It includes distance-based and maximumlikelihood methods as well.
PDB Protein Data Bank. This is the repository of experimentally determined tertiary structures of proteins. For each structure, it stores the 3D coordinates of the atoms in the structure, and various types of annotations including the references (authors, titles, journals, etc), bonding information, secondary structures, and so on.
Pfam This is a database of profile Hidden Markov Model motifs. Each motif is constructed from a seed multiple alignment of highly similar sequences, which grows to include more distant members iteratively.
PHD This is a class of neural network methods for predicting secondary structure (PHDsec), solvent accessibility (PHDacc), and transmembrane helices (PHDhtm). Predictions are made as follows. First, using blastp, the sequence x is searched against SWISSPROT to find similar sequences. From the hits, a multiple alignment is generated by a program called MaxHom. From this multiple alignment, a profile is built. This profile is then searched against SWISS-PROT, to find more related sequences. (During this process, motifs in x are found by searching the ProSite database, domains in x found by searching the ProDom database, and lowcomplexity regions in x marked using the program SEG.) The final profile is fed to PHDsec which makes the secondary structure prediction. Each residue is labeled as being in an alpha-helix, a beta-sheet, or neither. Confidence values are also returned. 16
The reported accuracy of PHDsec is about 73%. This same profile may be fed to PHDacc which makes a solvent accessibility prediction. The output is ten states of relative accessibility. This same profile may be fed to PHDhtm which predicts transmembrane helices. The output is two states: each residue is tagged as being in a transmembrane helix or not. The string of outputs is refined by a dynamic programming type method, to filter out errors using context. (For instance, if the predictions of three consecutive residues are TNT where T denotes "in transmembrane helix" and N not, then the N will be replaced by T in the refinement phase.) The reported accuracy of PHDhtm is about 95%.
PHI-BLAST Pattern Hit Initiated BLAST. This is a version of BLAST that inputs a protein sequence and a regular expression pattern in it. It may be used to find other protein sequences that not only contain the same pattern but are also similar to the input protein sequence in the proximity of the occurrence of the pattern, in the two sequences. This is illustrated below. Query sequence P Similar regions P Hit sequence
PHYLIP Phylogeny Inference Package. This is a free software package for constructing a phylogenetic tree from molecular sequences, gene sequences, or other data. It supports parsimony, distance matrix methods, and likelihood methods. It also supports bootstrap analysis and consensus tree building.
PHYLOGENETIC ANALYSIS This is the process of building a phylogenetic tree from a given set of sequences (or other data). The first step is to do a multiple alignment of the sequences, using CLUSTALW perhaps. The second step is to clean up the multiple alignment (remove outlying sequences, handle gaps, downweight overrepresented sequences, etc). The third step is to build one or more trees, using a distance-based method, a parsimony method, or a likelihood-based method. The fourth step (which may interact with earlier steps) is to assess the quality of the built trees, perhaps using bootstrap.
PIR This is a well-annotated and nonredundant protein sequence database. A closely related database is iProClass, which has classified more than 200,000 protein sequences from PIR and SWISSPROT into about 30,000 superfamilies, about 100,000 families, about 2,500 domains, and so on. Interestingly, iProClass is implemented on a relational database system.
17
POSITION-WEIGHT MATRIX PROFILE This is a matrix that describes the composition of each column in a multiple alignment (or of a functional site). Here is a DNA example. Consider the ungapped alignment
ACTAG, ATCCG, TTCAG One version of a profile associated with this aligned set is the 3 × 5 matrix
P given by
P[ A,1] = P[ A,5] = 2; P[C , 2] = P[C , 4] = 1; P[C ,3] = 2; P[G,5] = 3; P[T ,1] = P[T ,3] = 1; P[T , 2] = 2 with the rest of the entries being zero. In practice, what are stored in the profile are not counts, but rather log-probabilities. For instance, the log-probability version of P[ A,1] would be log 2 / 3 . Let X denote a sequence and is scored as follows:
P a profile, both of the same length n . The alignment of X to P
S ( X , P) =
∑ P ( X ,i) i
i =1,..., n
The score of the sequence ACCAG against the version of the profile
P that stores the counts is
P[ A,1] + P[C , 2] + P[C ,3] + P[ A, 4] + P[G,5] = 2 + 1 + 2 + 2 + 2 = 9 P stores log-probabilities, the score S ( X , P) has a simple probabilistic interpretation: it is the log-probability of generating the sequence X from the profile P under the assumption that the positions in P are independent. When
PREDICTPROTEIN This web server features the class of PHD programs for predicting secondary structure, solvent accessibility, and transmembrane helices, as well as programs for prediction of tertiary structure, coiled-coil regions, etc. At the same site, in the Meta PredictProtein section, is another secondary structure prediction program, JPRED, which takes a consensus between a number of methods. There are also three other programs for predicting transmembrane helices, TMHMM, TOPRED, and DAS. The tertiary structure prediction programs include TOPITS, SWISS-MODEL, and CPHmodels.
PRINCIPAL COMPONENTS ANALYSIS This is a method of visualizing high-dimensional data by transforming it into a very low-dimensional space (usually 1, 2, or 3D). The transformation is achieved (by rotating and translating the axes of the original space) in such a way that the first few axes describe the data as best as possible. The remaining axes may then be "thrown away", with possibly some (but hopefully not a lot) of loss of information. This is illustrated in the example below.
18
Here the original data is in the two-dimensional space x-y but most of it may be described in the one-dimensional space PC1. Principal components analysis is often used to visualize microarray data because of its high dimensionality (think of the expression levels of thousands of genes in a single experiment as forming a vector in a thousands-dimensional space).
PRODOM This is a database of protein domains.
PROFILE HIDDEN MARKOV MODELS This is a sophisticated probabilistic method, employing a hidden Markov model, to multiply align a set of protein sequences. The trained hidden Markov model itself serves as a descriptor for the alignment. New sequences may be aligned by aligning them to the HMM. The HMM is superior to a simple (position-weight matrix) profile in that it explicitly models gap columns in an alignment, as illustrated below. Insert column 2 A
T
T C
ACTTA ATTT ATTA ATTG ATCA
1
3
4
G A
5
This HMM has five states. The directed arcs denote transitions among states. They have probabilities on them, not shown here. The circular states denote match (i.e., non-gap) columns. The square state denotes an insert column. Each state emits nucleotides. The first state emits only As, the second state emits all four nucleotides with equal probability, the third state emits only Ts, the fourth state emits Ts with higher probability than Cs but no other nucleotides, and the fifth state emits As with higher probability than Gs but no other nucleotides. Imagine that the five sequences shown to the left are the ones from which this HMM was trained. Now consider a new sequence, ATCG. The alignment algorithm will realize that A is best interpreted as being emitted from state 1, T from state 3, C from state 4, and G from state 5. The alignment algorithm will thus produce ATCG, the aligned version of ATCG. The Pfam database of HMM profiles is a profile HMM of this type. See Pfam.
19
PROGRESSIVE ALIGNMENT This is a class of heuristic methods for multiple alignment. A multiple alignment is built progressively by doing repeated pairwise alignments. Here is an illustrative example of a progressive alignment method. First, the two sequences that are most similar in the given set are identified, and aligned. Next, a third sequence that is most similar to the aligned set in step 1 is identified, and aligned to this set. Next, a fourth sequence that is most similar to the current aligned set is identified, and aligned to the current set. And so on.
PROSITE This is a database of short biologically significant motifs in protein sequences. The motifs are stored as regular expression patterns and also as position-weight matrix profiles. The PROSITE web server supports tools for scanning a protein sequence to find patterns in the patterns database that occur in the sequence, and profiles in the profiles database that score sufficiently well against portions of the sequence. The PROSITE web server also supports tools for finding sequences in a sequence database (SWISS-PROT) that match a given pattern.
PROTOMAP This is a database of an automated hierarchical clustering of the proteins in SWISS-PROT into groups of related proteins. The clustering uses transitivity (if protein A is strongly similar to protein B, and B strongly similar to protein C, then A is judged to be similar to C), with similarity measured in the usual way via substitution matrices (BLOSUM50 and BLOSUM62).
PSI-BLAST Position-Specific Iterated BLAST. This is a variant of BLAST for doing a heuristic version of a profile search. PSI-BLAST inputs a query sequence and first attempts to find strong hits in a database via the usual gapped BLAST. The hits are then multiply aligned, and a profile constructed from this alignment. Next, this profile is searched against the database to try to pick up more hits. PSI-BLAST is designed with the hope that this method will better facilitate the detection of remote homologues. On the other hand, this method is more prone to false hits than the usual BLASTs. False positives that get included in the first alignment may tarnish subsequent hits.
PSEUDO-COUNT METHOD This is a method of incorporating prior knowledge into a multiple alignment in the form of pseudocounts. Let us illustrate it with a simple example. Consider n independent tosses of a coin in which
k heads are observed. The maximum likelihood estimate of the probability PH that the next toss lands a head is k n . This estimate is very unreliable when n is small. (For example, when n equals 1 , the maximum likelihood estimate sets PH to be 0 or to 1 , both unreasonable values.) If we assume that the coin is fair unless we see enough evidence to the contrary---a reasonable assumption---we can generate virtual examples (under this assumption) to increase the number of examples. We do this as follows. We make w virtual tosses, each with a coin assumed to be fair. The expected number of heads in the w virtual tosses is then 0.5w . We now compute the maximum likelihood estimate on this enlarged data set, which yields
PH = ( k + 0.5w ) ( n + w )
The parameter w allows us to control the trade-off between how much we trust the data versus how much we trust our prior assumption.
20
In the previous paragraph, we may replace the assumed prior probability of landing heads, oneprior
half, by any prior probability PH . We may even replace it with a distribution over prior probabilities, usually the so-called Dirichlet distribution. In bioinformatics, the pseudo-count method is used to construct position-weight matrix profiles for multiple alignments, or for functional sites, when only limited data is available but some prior knowledge is also available.
RASMOL This is a widely used and powerful free viewer for PDB structures. RasMol allows one to manipulate the view of the structure being viewed by a combination of mouse operations, dropdown menus, and command-line commands. RasMol supports a powerful set of commands, entered in a separate command-line window.
RELATIVE ENTROPY This is an information-theoretic concept that quantifies the extent of the difference between two probability functions. For example, let PA , PC , PG , PT and QA , QC , QG , QT denote two probability functions on the DNA alphabet. The relative entropy
H ( P || Q) quantifies how different Q is from
P . Relative entropy is always nonnegative. Relative entropy is not necessarily symmetric, it is rarely the case that H ( P || Q ) equals H (Q || P ) . Relative entropy may be used to compare the composition of a column in a multiple alignment to the expected composition at that column (assuming one exists). This in turn may be used to score how well the column is aligned.
REPEAT MASKER This is a tool that removes (masks) repeats from a DNA sequence. (Repeats are low-complexity regions that can produce spurious hits when the sequence containing repeats is searched against databases using BLAST or other tools. By default, BLAST masks these in the query sequence before initiating a database search.)
RESTRICTION MAP CONSTRUCTION Restriction enzymes cut foreign DNA at locations called restriction sites. A restriction map is a map of these locations in the DNA. These locations are not determined experimentally. What is determined experimentally are some properties of fragments formed after the DNA has been cut by various restriction enzymes in various ways. From this data, a restriction map is then constructed by a nontrivial computer algorithm. Graph theory and algorithms are often used.
SCOP Structural Classification of Proteins. This is a database which hierarchically classifies proteins at four levels: classes, folds, superfamilies, and families. Proteins are classified from their 3D structure, taken from PDB. The database is constructed by manual comparison of structures, aided by software tools as found useful. Sometimes proteins are grouped together if they share a common domain, other times a wider context of multiple domains is used.
21
Proteins are placed together in the same family if they are found to be clearly evolutionarily related. Usually this determination is made only when the percentage identity of their aligned sequences is found to be 30% or greater. Proteins are placed together in the same superfamily if they are judged to be probably evolutionarily related. Usually this determination is made from commonality in the structural and functional features of the two proteins, even though their sequences have low identity (significantly less than 30%). Proteins are placed together in the same fold category if they have major secondary structure elements in the same order and topology. Proteins in the same fold category are structurally but not necessarily evolutionarily related.
SEG This is a program that identifies low-complexity regions in protein sequences, and replaces the amino acids at these positions by Xs. A low-complexity region is one with repeats or other types of biased composition. A low-complexity region in a sequence input to BLAST can lead to false hits that score well (have low E-values) but are unrelated to the query sequence except in that they have similar low-complexity regions. It is therefore a good idea to filter out low-complexity regions from a sequence before doing a BLAST query.
SEQUENCE LOGO This is a method for visualizing the composition of the various columns in a multiple alignment, or that of various positions in a functional site. Consider the case of nucleotide sequences. The height of each position in a sequence logo (a position corresponds to a column in the multiple alignment) indicates the information content of the corresponding column in the multiple alignment. In our case, this would range from 0 to 2, with 0 indicating that this column has zero information content and 2 indicating that this column has the maximum possible information content, 2 bits. Each position in a sequence logo also displays, in addition to the height, the relative contributions of each of the four nucleotides A, C, G, T to the total information content (i.e., this height).
SRS This is a text-based search engine for over 80 databases, supported by EBI. It comes with a powerful query language that is reminiscent of query languages for relational databases. Check out especially the link operators, resembling the join operator in relational databases.
STOCHASTIC CONTEXT-FREE GRAMMAR This is a probabilistic version of a context-free grammar. It is a generalization of a hidden Markov model to the context-free setting. It is suited to modeling palindrome languages such as those arising in RNA secondary structures (nested base pairings).
SUBSTITUTION MATRIX This is a 20-by-20 matrix S where
S (a, b) indicates the likelihood of a mutatation of a residue a
to residue b (or vice-versa). It is used to score an alignment of two sequences, most significantly the alignment of a database hit to the query sequence. The BLOSUM and PAM family of substitution matrices are the most widely used, in tools such as BLAST and many others.
22
SUFFIX TREE This is a clever data structure used to speed up the search for exact matches of a query sequence to fragments in a database. The database is preprocessed and stored in the form of a suffix tree. Locating the first match then only takes time proportional to the length of the query sequence, which is amazingly fast. Suffix trees have other uses, including finding repeats in a sequence.
SWISS-PROT This is a curated database of protein sequences with a high level of annotation (functions of a protein, domains in it, etc) and low redundancy.
SWISS-MODEL This is a program for predicting the tertiary structure of a protein from its sequence, via the prediction by homology method. The query sequence x is first searched against sequences in the crystallographic database ExPdb. If a clear homolog y is found (i.e., y has sufficient identity with x) an atomic model for the predicted structure of x is then built, using y. If no clear homolog is found, the process is aborted.
TANDEM REPEAT FINDER This is a tool that finds tandem repeats---two or more exact or near-exact copies of the same sequence fragment that are adjacent---in a nucleotide sequence.
TBLASTN This is the version of BLAST that searches a protein query sequence against a version of a nucleotide database translated on-the-fly in all reading frames.
TBLASTX This is the version of BLAST that searches a nucleotide sequence translated in all reading frames against a nucleotide database translated on-the-fly in all reading frames.
tRNA-SCAN-SE
THREADING This is an alignment of a protein sequence to the 3D structure of another protein.
TOPITS This is a program for predicting the tertiary structure of a protein from its sequence. It uses a database of secondary structure strings derived from tertiary structures in PDB. (This database has one such string for each tertiary structure in PDB.) First, this program uses the PHDsec program to predict the secondary structure string sse(x) of a protein sequence x. Next, it aligns, one by one, this predicted string sse( x) with each secondary structure string in the database. (Alignments are done via dynamic programming.) Finally, it picks a high-scoring alignment (sse(x), sse(y)) where 23
sse(y) is a string from the database derived from tertiary structure y from PDB. Finally, it predicts the structure of x to be y.
TREMBL This is an automatically annotated adjunct to SWISS-PROT.
UNGAPPED ALIGNMENT This is the process of finding highly similar regions of two DNA or protein sequences. Gaps are not permitted to be inserted.
UPGMA This is a distance-based method for building a phylogenetic tree from a multiply aligned set of sequences. It performs a hierarchical clustering of the given data set, which yields this tree.
VAST This is a structure-to-structure alignment method for structures in MMDB. It works by aligning the secondary structure elements in the two proteins.
24