Practical Application: Web Programming and Bioinformatics
1
14. Web Programming and Bioinformatics In Section 3.2, we learned how the syntax of Pascal programming language can be defined with a CFG. (Actually, Appendix A shows the whole definition of Pascal programming language in terms of a syntax flow graph.) Other programming languages can also be defined formally with a grammar, except for a few exceptional cases of context dependency, such as double declaration of a variable in a block and using numeric labels in DO-loop in FORTRAN. This chapter first shows that the major parts of the popular Web programming language HTML (Hypertext Markup Language) and XML (Extensible Markup Language), which is designed for e-commerce, can also be defined in terms of a CFG. Then the chapter presents a potential application of the formal languages to molecular biology.
14.1 Hyper Text Markup Language (HTML) 415 14.2 Document Type Definition (DTD) and XML 418 14.3 Genetic code and grammar 425
2
Web Programming and Bioinformatics
14.1 Hyper Text Markup Language (HTML) In this section, to see how our knowledge in the formal languages could be applicable to investigating the properties of programming languages, we shall first examine HTML. The left box below shows an informal definition (itemized for convenience) of an HTML list that would commonly appear in a text. To the right are CFG rules translated from the informal definition. (1) Char denotes a character. (2) Text is a sequence of characters. (3) Doc is a sequence of Elements. (4) For a string x, we call <x> a tag, and the matching tag of <x>. (5) Element is a Text, or a Doc inbetween a matching tag, or a Doc with a tag at the front. (6) ListItem is a document with tag
at the front. ListItem is an item of a list. (7) List is a sequence of zero or
(1) Char → a | A | . . . . . | z | Z |.. (2) Text → Char Text | ε (3) Doc → Element Doc | ε (4) Tag → | (5) Element → Text | a Doc in-between a matching tag
| Doc (6) ListItem → Doc (7) List → ListItem List | ε 3
HTML
Web Programming and Bioinformatics
Notice that in the CFG rules, the bold-faced Italic words denote nonterminal symbols. The blank between two nonterminals (e.g., Char Text) is used to delimit the two nonterminals (not for the terminal symbol blank) and hence, can be ignored. While translating, we found a context dependent part of the language, i.e., “Element is a Doc between a matching tag,” that is impossible to translated into a context-free grammar rule (see the highlighted part). (1) Char denotes a character. (2) Text is a sequence of characters. (3) Doc is a sequence of Elements. (4) For a string x, we call <x> a tag, and the matching tag of <x>. (5) Element is a Text, or a Doc inbetween a matching tag, or a Doc with a tag at the front. (6) ListItem is a document with tag at the front. ListItem is an item of a list. (7) List is a sequence of zero or
(1) Char → a | A | . . . . . | z | Z | . . (2) Text → CharText | ε (3) Doc → ElementDoc | ε (4) Tag → | (5) Element → Text | a Doc between a matching tag
| TextDoc (6) ListItem → Doc (7) List → ListItem List | ε 4
HTML
Web Programming and Bioinformatics
According to the informal definition, a document Doc should appear between a matching tag, like <x>Doc , where x is a text (i.e., a string). In the formal definition, we need a rule with Element which can generate <x>Doc for an arbitrary Text x. Notice that Element → Doc does not work, because we cannot guarantee the left side Text and the right side Text generate identical strings. Actually, it is not hard to see that the set of documents with matching tags has the same context dependency as in the language L = {xcx | x ∈ {a, b}*}. Using the pumping lemma for context-free languages, we can prove that this language L is not context-free (see Exercise 12.4). This implies that the list part of HTML is not context-free. However, if we restrict the matching tags to a finite set of strings, say {w1, w2, . . , wk} (e.g., <EM>Doc, Doc
, etc.), HTML list can be defined in terms of a “pure” CFG as shown below. (1) Char → a | A | . . . . . | z | Z | . . . (2) Text → Char Text | ε (3) Doc → Element Doc | ε (4) Tag → | (5) Element → Text | Doc | <w1>Doc | . . . . | <wk>Doc (6) ListItem → <“LI”>Doc
(7) List → ListItem List | ε 5
Web Programming and Bioinformatics
14.2 DTD and XML The major objective of HTML is to define the format of a document, and hence it is not easy to describe the semantics of the text, that is needed in the area of ecommerce. XML has been introduced to overcome this drawback. Actually, XML is the language (i.e., the set of documents) written according to a DTD (Document Type Definition), which can be transformed into a CFG. All DTD’s have the following form. , where each element definition has the following format. This format can be transformed into the following CFG rules, where DTD corresponds to the start symbol S. DTD → List-of-Element-Definitions → Element-Definition List-of-Element-Definitions | Element-Definition Element-Definition → 6
DTD and XML
Web Programming and Bioinformatics
DTD → List-of-Element-Definitions → Element-Definition List-of-Element-Definitions | Element-Definition Element-Definition → The List-of-Element-Definitions and Element-Description are, respectively, delimited by the pairs of the matching tags and , and <element-name> and . Element description is a variation of the regular expression that is defined as follows. (1) Both (a) and (b) below are element descriptions. (a) An element-name. (b) Any text (with no tag) that is denoted by the special term #PCDATA. (2) If E1 and E2 are element descriptions, E1*, E1+, E1?, (E1, E2), and E1 | E2 are element descriptions. Recall that E1+ = E1(E1)*, E1? = E1+ε , E1 | E2 = E1+E2, and E1,E2 correspond to the concatenation of regular expression E1 and E2. We know that every language expressible with regular expression can also be defined in terms of a regular grammar, and every regular grammar is a CFG. Hence, Element-Description is context-free. 7
DTD and XML
Web Programming and Bioinformatics
Here is a popular example of a DTD, which defines the specification of PC’s. ]>
8
Web Programming and Bioinformatics
DTD and XML
Here, we show how to transform each element definition of PcSpecs into CFG rules (shown in the box). PCS → PCSS
PCSS → PC PCSS | ε
PC → MODEL PRICE PROCESSOR RAM DISK+ DISK+ → DISKS DISKS → DISK DISKS | DISK
MODEL → <MODEL>Text PRICE → Text
9
Web Programming and Bioinformatics
DTD and XML
PROCESSOR → MANF MODEL SPEED
MANF (\#PCDATA)> MODEL (\#PCDATA)>
MANF → <MANF>Text MODEL → <MODEL>Text
..... DISK → HARDDISK | CD | DVD
..... 10
DTD and XML
Web Programming and Bioinformatics
Here is an example of XML document written according to the DTD PcSpecs. <MODEL>1234 $3000 512 <MANF>Superdc <MODEL>xx1000 <SIZE>62Gb <SPEED>32x . . . . 11
Web Programming and Bioinformatics
DTD and XML
Again in XML we see the same context dependency caused by the matching tags. As in the previous example, if we restrict the matching tags to those chosen from a finite set of reserved words, such as MODES, PRICE, DISK, etc., XML turns out to be a CFL.
Warning Signs
Break Time
- On children's alphabet blocks: Letters may be used to construct words, phrases and sentences that may be deemed offensive. - On a microscope: Objects are smaller and less alarming than they appear. - On a revolving door: Passenger compartments for individual use only. - On work gloves: For best results, do not leave at crime scene. - On a palm sander: Not to be used to sand palms. - On a calendar: Use of term "Sunday" for reference only. No meteorological warranties express or implied. -On a blender: Not for use as an aquarium. - Dennis -
12
Web Programming and Bioinformatics
14.3 Gene code and grammar Quite a few linguistic terminologies, such as code, express, translate, edit, etc., have been adopted by biologists, especially in the field of molecular biology, ever since James Watson and Francis Crick proposed the structure of DNA (deoxyribonucleic acid) strands in 1953. This implies that in biology there are problems that we can approach with the knowledge of automata and formal languages. As an example, this section shows how it is possible to express genes in terms of a grammar. Before the example, we need a very brief introduction to the process of protein synthesis. A genome is a long double helix DNA strand, consisting of four chemical bases; adenine, thymine, cytosine and guanine, denoted by A, T, C, and G, respectively.
Genome (double helix)
DNA molecule
A
G
T
C
T
C
A
G
backbone bases 13
Gene Grammar
Web Programming and Bioinformatics
Each base in a double helix forms one of the four complementary pairs, A-T, T-A, C-G, and G-C as the following example shows. (Thus, given a DNA single strand, we can figure out its complementary strand.) ACTTAACGGCGAT TGAATTGCCGCTA Proteins are synthesized by the following three steps; (1) a single strand segment (i.e., gene) of the genome is transcribed into an RNA. The RNA is the complementary strand of the source strand, with every base T replaced by U (uracil). Hence, RNA’s consist of four bases, A, C, G and U. 5’ end
TCAAGCT 5’ end UCAAGCU
3’ end
3’ end
AGTTCGA RNA
(1) RNA Transcription 14
Web Programming and Bioinformatics
Gene Grammar
(2) In eukaryotes, only useful substrings for protein synthesis, called exons, are spliced (i.e., cut off and concatenated) to form a mRNA (messenger RNA). (3) Finally, ribosomes, large molecular assemblies traveling along the mRNA, translate the mRNA into a protein, which is an amino-acid sequence.
Transcribed RNA
exons
introns
(2) splicing Messenger RNA (3) Translation Protein
15
Gene Grammar
Web Programming and Bioinformatics
There are 20 different amino acids. Hence, proteins can be represented as a string of 20 characters. The translation is done by triplets (i.e., groups of three adjacent bases), also called codons, according to the table shown below. The translation begins with the start codon ATG, and continues until it meets one of the stop codons, TAA, TAG, and TGA (see the table below). For every codon scanned, a specific amino acid is attached to the next position of the protein being synthesized. Notice that because there are 4× 4× 4 = 64 different codons and 20 amino acids, several codons are translated into the same amino acid. (Following convention, we shall represent codons using character T rather than U.) Ala
Arg
GCA
AGA
Asp
GCG
GAT AGG GAC
GCT
CGA
GCC
CGG
Ans
Cys
AAT TGT AAC TGC
Glu
Gln Gly
GAA CAA GGA GAG CAG GGG GGT GGC
His
Ile Leu Lys Met
Phe Pro Ser Thr Trp Tyr Var Stop
CAT ATA TTA AAA ATG TTT CAC ATT TTG AAG (start) TTC ATC CTA CTG CTT CTC
CCA CCG CCT CCC
AGT AGC TCA TCG TCT TCC
ACA TGG TAT GTA ACG TAC GTG ACT GTT ACC GTC
TAA TAG TGA
CGT CGC
16
Gene Grammar
Web Programming and Bioinformatics
This three step process for protein synthesis is called the central dogma in molecular biology, because with so many exceptional cases, it is no longer accepted as a valid theory. However, to show a potential application of the formal languages and automata theory to molecular biology, we shall present an example based on this dogma. As we did in Section 14.1 for HTML, we will transform the informal definition of genes into a formal definition in terms of a grammar. (This example is an excerpt, lightly edited, from David Sears’ paper “The Linguistics of DNA,” which appeared in the journal American Scientist, Vol. 80, 1992.) • The transcript, which is the part of the gene that is copied into a messenger RNA, has flanking 5’- and 3’-untranslated-regions around a coding-region initiated by a start-codon. → → <5’-untranslated-region><start-codon> <3’-untranslated-region>
17
Gene Grammar
Web Programming and Bioinformatics
• The coding-region consists of a codon followed by more coding-region, a recursion that ultimately terminates in a stop-codon. The coding-region also allows for a splice at any point in the recursion, resulting in the insertion of an intron, which is bracketed by a splicing signals GT and AG. → | <splice> | <stop-codon> <splice> → → GTAG • A stop-codon is any of the three triplets TAA, TAG and TGA, whereas a codon is any of the remaining 61 possible triplets that are translated into an amino acid according to the rule given in the codon table. <start-codon> → <Met>
<stop-codon> → TAA | TAG | TGA
→ | | | | | | | | | | | | <Met>| | | <Ser> | | | | 18
Gene Grammar
Web Programming and Bioinformatics
• The first two bases of a codon partially specify the amino acid into which it is translated. Every codon starting with GC (CG) is translated into alanine (respectively, arginine), independent of the third base. Every codon starting with GA is translated into aspartic acid if the third base is a pyrimidine (i.e., C or T). Otherwise, if the third base is a purine (i.e., A or G), it is translated into glutamic acid. Here are some formal representations of such translation rules specified in the codon table. → GC → CG → GA → TG → GA → CA → A | C | G | T → C | T → A | G All the grammar rules presented up to this point are context-free. But the following biological fact requires a rule from the upper level in the Chomsky hierarchy. • Introns can be transposed to the left or right. →
→
The following slide shows a grammar that specifies the set of “genes” that we have developed in this section. 19
Web Programming and Bioinformatics
Gene Grammar →
→ <5’-untranslated-region><start-codon><3’-untranslated-region> → | <splice> | <stop-codon> -------------------------------------------------------------------------- → | | | | | | | | | | | | <Met>| | | <Ser>| | | | <start-codon> → <Met>
<stop-codon> → TAA | TAG | TGA
---------------------------------------------------------------------------- → GC
→ CG
→ GA
→ TG
→ GA
→ CA
→ GG
→ CA
→ AT | ATA <Met> → ATG
→ TT | CT → TT<primidine>
<Ser> → AG<primidine> | TC
→ TGG
→ CC → TA
→ AA → AC → GT
------------------------------------------------------------------------------ → A | T | G | C
→ C | T
→ A | G
------------------------------------------------------------------------------<splice> →
→ GTAG
------------------------------------------------------------------------------ →
→
20
Gene Grammar
Web Programming and Bioinformatics
The grammar that we have just developed is incomplete because no rules are defined for the nonterminal symbols <5’-untranslated-region>, <3’-untranslated-region> and . The two untranslated regions flanking the coding-region usually contain segments regulating the gene expression. The contents and the function of these parts are not yet completely understood. The intron-body parts generally consist of a long repeating string, whose exact role is also under investigation. We need more research before we can present a formal model for these regions.
Love
Break Time
What else is love but understanding and rejoicing in the fact that another person lives, acts, and experiences otherwise than we do….? - Friedrich Nietzsche – Just because you love someone doesn’t mean you have to be involved with them. Love is not a bandage to cover wounds. - Hugh Elliot -
21