The Poor Beginners’ Guide to Bioinformatics
What we have – and don’t have...
• have: a computer connected to the Internet (incl. Web browser)
• have: a text editor (Notepad or better)
• have: public databases of genomic sequences
• have: public databases of cDNA + EST
• have: public databases of protein sequences, structures and motifs
• don’t have: money for specialised software packages
• have: public servers capable of (almost) anything we wish to do
Dealing with a sequence: model tasks
• basic (DNA) sequence manipulation: restriction analysis, translation…
• sequence similarity and pattern/motif searches
• gene building: modelling exon-intron structures
• protein domain searches, structure analysis
• construction and interpretation of sequence alignments
Notes on basic sequence handling
Make sure you have the correct format. FASTA format is (almost) always correct:
>sequencename
thisisasequenceinfastaformat
If FASTA is not accepted, you can always use raw data. If things don’t work, check for gaps in the sequence, empty lines, and the file extension. BEWARE OF MICROSOFT!
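The checks above (gaps, empty lines, Windows line endings) can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the function name `read_fasta` and the fallback names "raw" and "unnamed" are assumptions made for the example.

```python
def read_fasta(text):
    """Parse FASTA (or raw) text into {name: sequence}, tolerating common problems."""
    records = {}
    name = None
    # splitlines() copes with Unix (\n), Windows (\r\n) and old Mac (\r) line endings
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else "unnamed"  # hypothetical fallback name
            records[name] = ""
        else:
            if name is None:
                name = "raw"  # no header line: treat the input as raw sequence data
                records[name] = ""
            # keep letters only: drops spaces, digits and gap characters
            records[name] += "".join(ch for ch in line if ch.isalpha()).upper()
    return records


messy = ">sequencename\r\nthisisasequ ence\r\n\r\ninfasta-format\r\n"
print(read_fasta(messy))
```

Keeping letters only is a deliberately blunt choice; it would also discard a stop symbol such as `*` in a protein sequence, so adjust the filter if you need it.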
Defining a gene family…
• By overall domain structure (domain diagram: FH3?, FH1, FH2)
• By domain sequence
• Based on a peptide motif: L-X-X-G-N-X-[ML]-N
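A peptide motif written this way maps almost directly onto a regular expression, which is handy for a quick local check before going to a server. A minimal sketch, assuming X means "any residue" and that the sequence is in upper case; the helper names and the test protein are made up for illustration.

```python
import re


def motif_to_regex(motif):
    """Convert a dash-separated motif such as L-X-X-G-N-X-[ML]-N to a regex string."""
    parts = []
    for element in motif.split("-"):
        if element.upper() == "X":
            parts.append(".")      # X = any residue
        else:
            parts.append(element)  # a single residue or a [..] class is valid regex as-is
    return "".join(parts)


def find_motif(motif, protein):
    """Return (1-based position, matched peptide) for every hit, overlaps included."""
    # the lookahead (?=...) lets overlapping matches be reported
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# invented test sequence, not a real protein
protein = "MKLAAGNSMNPQLTTGNALNK"
print(motif_to_regex("L-X-X-G-N-X-[ML]-N"))
print(find_motif("L-X-X-G-N-X-[ML]-N", protein))
```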
Sequence comparison-based searches
• Entrez “related sequences”
  • easy identification of “false starts”
  • no organism selection
• BLAST/FASTA
  • all DNA/protein combinations
  • taxonomy selection possible
  • statistical data provided
  • domain structure comparison available
  • divergent motifs may be missed
Two methods are better than one.
Notes on all sequence comparisons, searches, alignments…
Start with defaults (the authors know what they are doing)…
… BUT don’t be afraid to vary the parameters.
Choose a reasonable scoring matrix:
• Distant sequences: low BLOSUM, high PAM
• Closely related sequences: low PAM, high BLOSUM
Motif-based searches
sensitive; no statistics; only protein databases can be searched
• TAIR PatMatch
  • Arabidopsis-specific
  • problematic user interface
• ISREC - INSECTS
  • admirable technology
  • access to SwissProt and TrEMBL
  • no organism selection
Some genes are more alike than others…
• A number of splicing prediction servers are available
• Agreement of different methods is a good sign but not an absolute measure
• Always align ESTs if possible
• Beware of non-conventional intron boundaries (GC-AG instead of GT-AG)
• Plant data for predicting transcription start sites and transcription factor binding sites are limited
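Once you have a gene model, the intron boundaries can be checked mechanically. The sketch below is one possible way to do it, assuming intron coordinates are 1-based and inclusive on the genomic sequence; the function name and the toy gene are invented for the example.

```python
def classify_introns(genomic, introns):
    """Report the boundary dinucleotides of each intron.

    introns: list of (start, end) pairs, 1-based inclusive (an assumed convention).
    """
    genomic = genomic.upper()
    report = []
    for start, end in introns:
        donor = genomic[start - 1:start + 1]  # first two bases of the intron
        acceptor = genomic[end - 2:end]       # last two bases of the intron
        if (donor, acceptor) == ("GT", "AG"):
            kind = "canonical"
        elif (donor, acceptor) == ("GC", "AG"):
            kind = "non-conventional (GC-AG)"
        else:
            kind = "unusual, check the EST alignment"
        report.append((start, end, donor + "-" + acceptor, kind))
    return report


# toy gene: exon1 + intron1 (GT..AG) + exon2 + intron2 (GC..AG) + exon3
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATTC" + "GCAAGTTTGCAG" + "TAA"
for row in classify_introns(genomic, [(7, 18), (25, 36)]):
    print(row)
```

A flagged boundary is not proof of an error; it only tells you where an EST alignment deserves a second look.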
Searching for known domains/motifs
• Searching for PROSITE patterns – allowing ambiguities
• PROSITE and Pfam profile searches
• SMART, CDsearch (domains and more)
Predicting protein localisation
• predicting signal peptides/anchors
• 2 methods available
• possibility to predict organelle localisation
• transmembrane segments prediction
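To get a feeling for what a transmembrane predictor looks at, a classic sliding-window hydropathy plot is easy to compute yourself. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cut-off of 1.6; this is a textbook technique swapped in for illustration, not the method used by any particular server, and the function names are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window along the protein (empty if too short)."""
    scores = [KD[aa] for aa in protein.upper()]  # KeyError on non-standard residues
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows hydrophobic enough to span a membrane."""
    return [(i + 1, round(avg, 2))
            for i, avg in enumerate(hydropathy_profile(protein, window))
            if avg >= threshold]


# invented example: a hydrophobic stretch flanked by charged residues
toy = "DDDD" + "L" * 19 + "KKKK"
print(candidate_tm_windows(toy))
```

Such a profile only flags hydrophobic stretches; a signal peptide or a buried helix in a soluble protein can score just as high, which is one reason to compare with a dedicated server.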
Alignment: “manual” or automated?
“Manual”:
• locally installed, free, for Mac and PC
• interactive domain definition
• statistical data provided
• may produce false-positive blocks (read the on-line manual!)
Automated:
• “objective” results
• a number of servers available
• recommended for well-conserved proteins
• empiric parameters (e.g. gap penalties)
• bad for divergent sequences
Phylogenetic analyses
• Two methods are better than one.
• Your phylogeny cannot be better than your alignment.
• Gaps are no data.
• Always do bootstrapping (100-500 cycles).
• Certain questions cannot be answered from an unrooted tree.
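Two of these points, "gaps are no data" and bootstrapping, come down to simple operations on alignment columns. A minimal sketch, assuming the alignment is a dictionary of equal-length strings with "-" as the gap character; the function names are hypothetical, and building a tree from each replicate is left to your phylogeny program.

```python
import random


def drop_gapped_columns(alignment):
    """Remove every alignment column that contains a gap ('-') in any sequence."""
    columns = [col for col in zip(*alignment.values()) if "-" not in col]
    return {name: "".join(col[i] for col in columns)
            for i, name in enumerate(alignment)}


def bootstrap_replicates(alignment, n_replicates=100, seed=None):
    """Yield pseudo-alignments made by resampling columns with replacement."""
    rng = random.Random(seed)
    names = list(alignment)
    length = len(alignment[names[0]])
    for _ in range(n_replicates):
        picks = [rng.randrange(length) for _ in range(length)]
        # the same column picks are applied to every sequence
        yield {name: "".join(alignment[name][i] for i in picks) for name in names}


aln = {"a": "AC-GT", "b": "ACGGT", "c": "A-GGT"}
clean = drop_gapped_columns(aln)
print(clean)
for replicate in bootstrap_replicates(clean, n_replicates=3, seed=1):
    print(replicate)
```

Each replicate has the same length as the input, so it can be fed to the same tree-building method; the bootstrap value of a branch is then the fraction of replicate trees that contain it.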
Points to take off...
• go to the Bioinformatics page http://www2.rhul.ac.uk/~ujba110/Bioinfo.htm
• select your exercise (A, B, C, D, E)
• … and enjoy it!
If you mean it seriously:
• create your own bookmarks (seed provided on the course web page)