BIO4320 Lecture Materials, Prepared by Dr. Hon-Ming Lam
Basic Principles of BLAST Analysis
Additional information can be obtained from the information pages at www.ncbi.nlm.nih.gov/Blast
Analyzing the Sequenced Genes
• Structure prediction
  – Secondary structure of DNA and RNA
  – Possible 3-D structure of proteins
• Identity of the encoded gene/gene product
  – Prediction of general physical properties (e.g. M.W., pI; may be important for proteomic analysis)
  – Database (e.g. GenBank) search based on sequence homology
• Possible function of the encoded gene product
  – Search for signature domains or functional motifs using consensus patterns (based on statistics)
• Possible location of the encoded gene product
  – Prediction of subcellular localization by consensus patterns
• Prediction of evolutionary relationship
  – Multiple alignment, clustering, etc.
• Gene prediction from genomic sequences
  – Prediction of coding regions and location of introns
  – Prediction of promoter regions
• Prediction of regulatory sites
  – Prediction of consensus cis-acting regulatory elements
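• blastn (nucleotide query vs. nucleotide database): good for finding high-scoring, closely related matches; not suited to comparing distantly related sequences
• blastp (protein query vs. protein database): uses a substitution matrix to detect distant relationships; SEG can be used to filter low-complexity regions
• blastx (translated nucleotide query vs. protein database): useful for new DNA sequences and for analysis of ESTs
• tblastn (protein query vs. translated nucleotide database): searches for coding regions that are not annotated in the database
• tblastx (translated nucleotide query vs. translated nucleotide database): useful for analysis of ESTs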
BLAST Search
• www.ncbi.nlm.nih.gov/Blast
• Basic Local Alignment Search Tool
• Uses a heuristic algorithm that seeks local (instead of global) alignments; able to detect relationships among sequences that share similarity only in isolated regions
• The initial search is done for a word of length “W” that scores at least “T” when compared to the query using a substitution matrix
• Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of “S”
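The seed-and-extend idea described above can be sketched in a few lines of Python. This is only a toy illustration, not the NCBI implementation: the match/mismatch scores (+5/-4) stand in for a real substitution matrix, the default values of W, T and S are arbitrary, the extension is ungapped, and the "x_drop" cutoff used to stop a failing extension is an assumption that the slides do not mention. All function names are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy stand-in for a substitution matrix (assumed values)."""
    return match if a == b else mismatch


def word_hits(query, subject, W, T):
    """Return (query_pos, subject_pos, score) for every word pair scoring >= T."""
    hits = []
    for i in range(len(query) - W + 1):
        for j in range(len(subject) - W + 1):
            s = sum(pair_score(query[i + k], subject[j + k]) for k in range(W))
            if s >= T:
                hits.append((i, j, s))
    return hits


def extend(query, subject, i, j, W, seed_score, x_drop=10):
    """Extend a word hit in both directions without gaps.

    Extension in a direction stops at a sequence end or once the running
    score falls more than x_drop below the best score seen (assumed rule).
    Returns (best_score, query_start, query_end_exclusive).
    """
    best = score = seed_score
    right = n = 0
    while i + W + n < len(query) and j + W + n < len(subject):
        score += pair_score(query[i + W + n], subject[j + W + n])
        n += 1
        if score > best:
            best, right = score, n
        elif best - score > x_drop:
            break

    score = best  # continue leftwards from the best right-hand extension
    left = n = 0
    while i - n - 1 >= 0 and j - n - 1 >= 0:
        score += pair_score(query[i - n - 1], subject[j - n - 1])
        n += 1
        if score > best:
            best, left = score, n
        elif best - score > x_drop:
            break

    return best, i - left, i + W + right


def blast_like_search(query, subject, W=4, T=20, S=30):
    """Report extended hits whose score reaches the threshold S."""
    results = set()
    for i, j, s in word_hits(query, subject, W, T):
        best, q_start, q_end = extend(query, subject, i, j, W, s)
        if best >= S:
            subject_start = j - (i - q_start)
            results.add((best, q_start, q_end, subject_start))
    return sorted(results, reverse=True)


hits = blast_like_search("ACGTACGTGG", "TTTACGTACGTCCC")
```

Several different word hits inside the same similar region extend to the same alignment, which is why the results are collected in a set before being sorted by score.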
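Word Size = Word Length = 11
Expect = The statistical significance threshold for reporting matches against database sequences; the default value is 10, meaning that about 10 matches with scores this good are expected to be found merely by chance
$E = K m n\, e^{-\lambda S}$ [S: raw alignment score; m: effective length of the query; n: total number of bases of the database; λ and K: normalizing parameters]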
Bit Score
The value S' is derived from the raw alignment score S, taking into account the statistical properties of the scoring system used. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
$S' = (\lambda S - \ln K) / \ln 2$ [λ and K are normalizing parameters]
E Value
Expectation value. The number of different alignments with scores equivalent to or better than S' that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
$E = m n\, 2^{-S'}$ [m: effective length of the query; n: total number of bases of the database]
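As a worked illustration of the two formulas, the short Python sketch below converts a raw score to a bit score and then to an E value. Substituting S' = (λS − ln K)/ln 2 into E = mn·2^(−S') gives E = K·m·n·e^(−λS), so the code also computes E directly from the raw score to show that the two routes agree. The numerical values of λ, K, m, n and S are illustrative placeholders only, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


def e_value_from_raw(raw_score, lam, K, m, n):
    """E = K * m * n * e^(-lambda * S), the same quantity written with the raw score."""
    return K * m * n * math.exp(-lam * raw_score)


# Illustrative placeholder values (assumptions, not taken from the lecture)
lam, K = 0.267, 0.041      # normalizing parameters of the scoring system
m, n = 250, 5_000_000      # effective query length, total database length
S = 80                     # raw alignment score

bits = bit_score(S, lam, K)
e_bits = e_value_from_bits(bits, m, n)
e_raw = e_value_from_raw(S, lam, K, m, n)
print(f"bit score = {bits:.1f}, E (from bits) = {e_bits:.3g}, E (from raw) = {e_raw:.3g}")
```

Because each additional bit halves E, raising the bit score by 10 lowers the expected number of chance matches by a factor of about 1000 for the same query and database.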
CDD Search
Compares protein sequences to the Conserved Domain Database (CDD). The CDD is a collection of functional and/or structural domains derived from two popular collections, SMART and Pfam, plus contributions from colleagues at NCBI.
PSI-BLAST
Position-Specific Iterated BLAST is a feature of BLAST 2.0 in which a profile (or position-specific scoring matrix, PSSM) is constructed automatically from a multiple alignment of the highest-scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is then used to perform a second (third, etc.) BLAST search, and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.
PSSM
Position-specific scoring matrix. Based on a profile: a table that lists the frequency of each amino acid at each position of a protein sequence, calculated from multiple alignments of sequences containing a domain of interest. The PSSM gives the log-odds score for finding a particular amino acid at each position of a target sequence.
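A minimal Python sketch of how a profile can be turned into log-odds scores is given below. It is a simplification under stated assumptions: the alignment is gap-free, a toy four-letter alphabet with uniform background frequencies replaces the 20 amino acids, and a simple pseudocount (not described in the slides) keeps unobserved residues from producing log(0). The function names are hypothetical, and PSI-BLAST itself uses a more elaborate weighting scheme.

```python
import math


def build_pssm(aligned_seqs, background, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free aligned sequences.

    score[pos][aa] = log2(observed frequency of aa at pos / background frequency of aa)
    """
    alphabet = sorted(background)
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        column = [seq[pos] for seq in aligned_seqs]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for aa in alphabet:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background[aa])
        pssm.append(scores)
    return pssm


def score_sequence(pssm, target):
    """Sum the position-specific scores of a target of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, target))


# Toy alphabet and uniform background (assumptions for illustration)
background = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}
alignment = ["ACDA", "ACEA", "ACDA"]

pssm = build_pssm(alignment, background)
conserved = pssm[0]["A"]   # position 0 is A in every sequence
variable = pssm[2]["D"]    # position 2 is D in two sequences and E in one
print(round(conserved, 2), round(variable, 2))
print(score_sequence(pssm, "ACDA") > score_sequence(pssm, "CADE"))
```

In this toy alignment the fully conserved first position scores higher for its residue than the variable third position does, matching the statement that conserved positions receive high scores.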