Msc Project

NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA BY COMPARATIVE GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University In partial fulfillment for the requirement of Masters of Science in Biotechnology

Submitted by Reg No: A242009

SCHOOL OF BIOTECHNOLOGY MADURAI KAMARAJ UNIVERSITY MADURAI 625 021

May 2004

To THE SMALL AND POWERFUL Non-coding RNA

DECLARATION

I declare that this dissertation entitled Non-coding RNA prediction of clinically important Mycoplasma using comparative genome analysis submitted by me in partial fulfillment for the requirement of Masters of Science in Biotechnology to the Madurai Kamaraj University is based on the work carried out by me in the School of Biotechnology, Madurai Kamaraj University, Madurai under the guidance and supervision of Dr. Z. A. Rafi, Reader, School of Biotechnology, Madurai Kamaraj University, Madurai. I also declare that this dissertation or any part of it has not been submitted elsewhere for any other degree or diploma.

Madurai-21 May 7, 2004

Regn. No.:A242009

ACKNOWLEDGEMENTS

I owe my gratitude to Dr. Z. A. Rafi for his guidance and supervision in this project. His care and concern have been the driving force for me all through this work. I am thankful for his constant advice and encouragement.

I am thankful to Prof. S. Krishnaswamy for introducing me to the field of Bioinformatics.

I would also like to thank my classmates Anurag, Basanth, Dinesh, Geeta, Hridesh, Kaiser, Netrapal, Subhanjan, Sucharitha, Vijay, for their support and company during the past two years, that made my stay in Madurai a memorable one. I would like to thank Deepak for his help in creating a C programme.

My special thanks are due to my roommate and friend Santosh for his constructive criticism of my mistakes. I acknowledge my special friend Ayushi, who has been a rich source of encouragement and entertainment during the last phase at MKU.

I am indebted to the entire School of Biotechnology for making my M.Sc. an intellectually stimulating experience.

I also acknowledge the Dept. of Science and Technology, Government of India, for its financial support over the last five years through the Kishore Vaigyanik Protsahan Yojana, and the Dept. of Biotechnology, Government of India, for supporting this project.

CONTENTS

1. Briefing
2. Introduction
3. Review of Literature
4. Materials
5. Methods
6. Results
7. Discussion
8. References

BRIEFING

Small untranslated RNA molecules are found in all kingdoms of life. Many of those discovered to date are conserved between closely related organisms and have a characteristic secondary structure. They have been found to regulate diverse functions, mainly gene expression. Non-coding RNAs (ncRNAs) are difficult to detect biochemically or to predict by traditional sequence analysis.

To search for ncRNAs that may play an important role in the life cycle of pathogenic Mycoplasma, we used a well-established computational strategy that distinguishes conserved RNA secondary structures from a background of other conserved sequences using probabilistic models of expected mutational patterns in pairwise sequence alignments.

We report here a complete genome screen for ncRNAs carried out with this method on the six completely sequenced Mycoplasma genomes available, using comparative sequence analysis. The screen yielded several putative ncRNAs.

The majority of the predicted ncRNA sequences are 130-160 nucleotides long, and the number of ncRNAs predicted was proportional to genome size for all but one genome. Our candidate ncRNAs showed similarity to a few of the biochemically characterized ncRNAs of bacteria as well as eukaryotes, which suggests that these ncRNAs are broadly conserved across the kingdoms of life. This finding makes our putative ncRNAs suitable candidates for drug discovery and developmental studies of Mycoplasma.


INTRODUCTION

The central dogma of molecular biology defines a general pathway for the expression of genetic information: information stored in DNA is transcribed into transient mRNA and decoded on ribosomes, with the help of adapter RNAs, to produce proteins, which in turn perform all enzymatic and structural functions in the cell. According to this view, RNAs play a rather accessory role and the complexity of a given organism is defined by the constellation of proteins encoded by its genome. However, the discovery of RNAs performing enzymatic and other functional roles in the cell complicated this picture.

The discovery of the catalytic nature of RNase P and of the self-splicing activity of group I introns suggested that the functions of RNA go far beyond a passive role in the expression of protein-coding genes. More recent discoveries have attributed a variety of regulatory roles to RNA, including control of plasmid replication, transposition in prokaryotes and eukaryotes, phage development, viral replication, bacterial virulence, global circuits in bacteria in response to environmental changes, and developmental control in lower eukaryotes.

These functions suggest that RNAs once considered non-functional are not merely molecular fossils. Analyses of several sequenced genomes suggest that protein-coding genes alone are not enough to account for the complexity of higher organisms. Genomic analysis has shown that as the complexity of an organism increases, the protein-coding fraction of its genome decreases. It is estimated that about 98% of the transcriptional output of eukaryotic genomes, and up to 10% of that of prokaryotic genomes, is RNA that does not encode any protein.


In this context, ncRNAs are defined as heterogeneous transcripts that have a wide functional spectrum. Broadly, ncRNAs can be divided into two classes:

1. Housekeeping RNAs, which are constitutively expressed and required for the normal functions and viability of the cell.
2. Regulatory ncRNAs, which by contrast are expressed at certain stages of an organism's development or cell differentiation, or in response to external stimuli.

Many of these ncRNAs were discovered by chance while researchers were studying individual genetic systems. ncRNA species have been difficult to detect by targeted experimental procedures or by traditional computational approaches.

An attempt has been made in the present study to screen for ncRNAs in the completely sequenced genomes of clinically important Mycoplasma* by comparative sequence analysis.

*M.penetrans, M.mycoides, M.gallisepticum, M.pulmonis, M.pneumoniae, M.genitalium


LITERATURE

Gene-finding algorithms generally assume that the target is a protein-coding gene that produces mRNA, and so they fail to detect ncRNAs. However, a few computational strategies have recently emerged to detect ncRNAs. They can be classified into the following four categories:

Sequence similarity analysis: A newly sequenced genome is searched for similarity to known ncRNAs [Lowe et al., 1991; Lowe et al., 1999; Zwieb et al, 1999].

Transcriptional signal analysis: This approach is based on the fact that ncRNAs are transcribed but not translated. It systematically searches for ncRNA genes that have transcriptional signals but no translational signals [Argman et al., 2001; Olivas et al., 1997].

Statistical analysis: This involves comparing the base composition statistics of non-coding regions with those of coding regions [Shattner, 2002].

Comparative genomic analysis: Sequences conferring important characteristics are conserved across related genomes, and the same is assumed to hold for ncRNAs. A comparative analysis of related genomes is therefore used to screen for ncRNAs [Elena Rivas 2001; Elena Rivas et al., 2001; Wassarman et al., 1999].

The aim of the current study is to find ncRNAs that may play an important role in the pathogenesis of clinically important Mycoplasma. The study was carried out using the comparative genomic analysis approach. These organisms were selected because they share a common disease-causing ability, so a comparative genomic analysis is expected to highlight the group of ncRNAs that contribute to pathogenesis.


To predict ncRNAs by comparative genomics we used the computational tool QRNA [Elena Rivas 2001], which is the heart of our project. The following section describes how it evolved and how it works.

Earlier RNA gene-finding approaches had been explored, but with limited success [Elena Rivas 2000]. An early hypothesis was that biologically functional RNA structures may have more stable predicted secondary structures than would be expected for a random sequence of the same base composition [Chen JH et al., 1990; Le SY et al., 1988; 1990]. Although this hypothesis is true to a certain extent, it has been reported that predicted stability alone does not give a usable signal: the predicted stability of structural RNAs is not sufficiently distinguishable from that of random sequence to serve as the basis for a reliable ncRNA gene-finding algorithm [Elena Rivas 2000]. Nonetheless, conserved RNA secondary structure remained the best hope for an exploitable statistical signal in ncRNA genes. Hence, these approaches were coupled with comparative sequence analysis to obtain additional statistical signals [Elena Rivas 2001].

Comparative sequence analysis for ncRNA genes has its basis in work that used the BLASTN programme to locate genomic regions with significant sequence similarity between two related bacterial species. A computational tool, CRITICA, analyzed the pattern of mutation in these ungapped, aligned conserved regions for evidence of coding structure [Badger 1999]. For example, mutations to synonymous codons get positive scores, while aligned triplets that translate to dissimilar amino acids get negative scores. The programme then extends any ungapped seed alignments classified as coding into complete ORFs.


QRNA is an extension of CRITICA that identifies structural RNA regions. The extensions include:

1. using fully probabilistic models;
2. adding a third model of pairwise alignments constrained by structural RNA evolution;
3. allowing gapped alignments; and
4. allowing for the possibility that only part of the pairwise alignment represents a coding region or structural RNA, because a primary sequence alignment may extend into flanking non-coding or non-structural conserved sequence.

These extensions add complexity to the approach. Its construction is guided by probabilistic modelling methods and formal languages: pair hidden Markov models and a pair stochastic context-free grammar are used to produce three evolutionary models, for coding sequence, structural RNA, or something else. Given these three probabilistic models and a pairwise sequence alignment to be tested, QRNA can calculate the Bayesian posterior probability that the alignment should be classified as coding, structural RNA or something else.
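The classification step can be illustrated with a minimal Python sketch. It is not QRNA's own code: it assumes that the three model scores are log-likelihood scores in bits (base 2) and that the three models have equal prior probability, and the function name is hypothetical. The example scores are the OTH, COD and RNA values from the qrna output reproduced in Fig7; under these assumptions the computed log-odds come out close to the "sigmoidal" scores reported there.

```python
import math


def posterior_from_scores(scores_bits, priors=None):
    """Posterior probability and posterior log-odds (in bits) for each model.

    scores_bits maps a model name to its score in bits (log2 scale).
    priors maps a model name to its prior; uniform priors are ASSUMED if omitted.
    """
    models = list(scores_bits)
    if priors is None:
        priors = {m: 1.0 / len(models) for m in models}

    # log2(prior * likelihood) for each model
    joint = {m: scores_bits[m] + math.log2(priors[m]) for m in models}

    def log2_sum(values):
        # log2 of a sum of powers of two, computed stably
        top = max(values)
        return top + math.log2(sum(2.0 ** (v - top) for v in values))

    log_total = log2_sum(list(joint.values()))
    posterior = {m: 2.0 ** (joint[m] - log_total) for m in models}

    # log-odds of each model against the other two combined
    log_odds = {
        m: joint[m] - log2_sum([joint[k] for k in models if k != m])
        for m in models
    }
    return posterior, log_odds


# Scores of the first window shown in Fig7
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}
posterior, log_odds = posterior_from_scores(scores)
winner = max(posterior, key=posterior.get)
print("winner =", winner)
for model in scores:
    print(model, round(posterior[model], 4), round(log_odds[model], 3))
```

With equal priors the prior terms cancel, so the winner is simply the model with the highest score; unequal priors would shift the decision boundary.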

QRNA screens for conserved RNA secondary structures. It detects various non-genic sequences with conserved RNA structures, including rho-independent terminators, rRNA spacers, transcriptional attenuators in ribosomal protein and amino acid biosynthetic operons, other cis-regulatory RNA structures, and even certain repetitive elements that form pseudoknots, stem loops, palindromic sequences, etc.

The predicted targets are referred to as ncRNA genes, but it must be understood that this really means a conserved RNA secondary structure that may or may not turn out to be an independent functional ncRNA gene upon further analysis.


MATERIALS

System configuration

Hardware specification:
• Machine name: Pentium IV
• CPU speed: 2.8 GHz
• RAM: 512 MB
• Hard disk: 80 GB

Operating system specifications:
• Red Hat Linux 9.0
• Microsoft Windows XP

Packages installed and applications used:
• Red Hat Linux 9.0
• EMBOSS-2.8.0
• WU BLAST 2.0
• QRNA
• Microsoft Office
• Perl 5.0

Selected genomes for the study:
• Mycoplasma penetrans
• Mycoplasma mycoides
• Mycoplasma gallisepticum
• Mycoplasma pulmonis
• Mycoplasma pneumoniae
• Mycoplasma genitalium


METHODS

Downloading the genomes of Mycoplasma: For each of the selected organisms, a folder containing the genome in various formats was downloaded from the NCBI ftp site ftp://ftp.ncbi.nlm.nih/Bacteria/Mycoplasma_Species. The formats required are the fasta format of the whole genome nucleotide sequence (accession_number.fna file) and the protein table format, which lists the start and end coordinates of the protein coding regions (accession_number.ptt file).

Preparing range file of intergenic regions: Preparing the range file involves three steps, starting from the coordinates of the protein coding regions.
• Getting coordinates of the protein coding regions: The protein table file was opened in Microsoft Word and the options Convert: Table to Text and Text to Table were used to make a table containing only the coordinates of the protein coding regions.
• Getting coordinates for intergenic regions: The protein coding coordinates were pasted into a Microsoft Excel file and simple arithmetic was used to obtain the coordinates of the intergenic regions.
• Making a range file: The intergenic region coordinates were copied into a Notepad file, which was given as input to a C programme that selects only the intergenic regions longer than 49 nucleotides, because biochemically characterized ncRNA genes have a minimum length of 50 nucleotides. The resulting range file was used as input for the emboss applications.
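The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Word, Excel and C workflow actually used in this study; the function name is hypothetical, coordinates are assumed to be 1-based and inclusive as in the protein table, and the region after the last gene is not handled because the genome length is not passed in. The gene coordinates are the first six entries of the M.genitalium protein table shown in Fig1.

```python
def intergenic_regions(genes, min_length=50):
    """Return (start, end, length) for gaps between genes of at least min_length.

    genes is a list of (start, end) pairs, 1-based and inclusive.
    Overlapping or adjacent genes give a non-positive gap, which is skipped.
    """
    regions = []
    covered_to = 0  # last genome position covered by any gene seen so far
    for start, end in sorted(genes):
        gap_start = covered_to + 1
        gap_end = start - 1
        length = gap_end - gap_start + 1
        if length >= min_length:
            regions.append((gap_start, gap_end, length))
        covered_to = max(covered_to, end)
    return regions


# First protein coding regions of M.genitalium (Fig1)
genes = [(735, 1829), (1829, 2761), (2846, 4798),
         (4813, 7323), (7295, 8548), (8552, 9184)]

for start, end, length in intergenic_regions(genes):
    print(start, end, length)
```

For these six genes the sketch keeps the regions 1..734 and 2762..2845, which agrees with the first rows of the filtered list in Fig2. Tracking the furthest covered position, rather than only the previous gene's end, keeps the result correct when one gene lies entirely inside another.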

Extracting the intergenic regions from the genomes: The extractseq application in the emboss suite was used to obtain each intergenic region of the genome as a separate sequence in fasta format. This procedure was repeated for each genome.
Syntax: extractseq -regions @rangefile -separate
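What this extraction step produces can be illustrated with a small pure-Python sketch. It is only an assumption-laden stand-in for extractseq, not a description of how that programme is implemented: the function name is hypothetical, the header pattern accession_start_end imitates the headers seen in Fig4, and the toy genome is invented for the example.

```python
def extract_regions(genome, ranges, accession, description="", width=60):
    """Return fasta text with one record per (start, end) range.

    genome is the whole sequence as a string; ranges are 1-based inclusive.
    """
    records = []
    for start, end in ranges:
        seq = genome[start - 1:end]  # convert 1-based inclusive to a Python slice
        header = f">{accession}_{start}_{end} {description}".rstrip()
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append("\n".join([header] + lines))
    return "\n".join(records)


# Toy 20-nucleotide genome, invented for illustration only
toy_genome = "ACGT" * 5
print(extract_regions(toy_genome, [(1, 4), (6, 10)], "TOY", "intergenic sequence"))
```

Sequence lines are wrapped at 60 characters, which is the line width of the fasta records in Fig4.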


Making genome databases: The database formatting programme supplied with the WU BLAST 2.0 suite was used to make the databases. Each database consisted of five genomes, that is, all of the selected genomes except the one to be used as the BLAST query against that database.
Syntax: xdformat -n -o database_name
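The leave-one-out grouping described here is simple enough to state as code. The sketch below only works out which genomes belong in each database; it does not run xdformat, and the function name is hypothetical.

```python
def leave_one_out_databases(genomes):
    """Map each query genome to the list of the other genomes."""
    return {query: [g for g in genomes if g != query] for query in genomes}


genomes = [
    "M.gallisepticum", "M.genitalium", "M.mycoides",
    "M.penetrans", "M.pneumoniae", "M.pulmonis",
]

for query, members in leave_one_out_databases(genomes).items():
    print(query, "is searched against:", ", ".join(members))
```

Excluding the query genome from its own database ensures that every similarity hit comes from a different organism, which is what a comparative analysis needs.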

Similarity search: The intergenic regions of each genome were searched for similarity with the blastn programme of the WU BLAST 2.0 suite, using default parameters, against the database that does not contain that organism's genome.
Syntax: blastn database_name nucleotide_query >output_file_name

Parsing WU BLAST 2.0 outputs: The output file of WU BLAST 2.0 has to be parsed by a perl script. The parsing was done with default parameters using blastn2qrnadepth.pl, which is available with the QRNA-2.0.1 suite. The parsing gives three output files; the one with the file_name.q extension is used as input for the QRNA application.
Syntax: blastn2qrnadepth.pl -g query_organism file_name

Non-coding RNA prediction: The file_name.q file obtained from the parsed blast output was used as input for QRNA with a window size of 150 nucleotides, moved 50 nucleotides each time. The option -B was used to avoid false positive scores.
Syntax: qrna -w 150 -x 50 -B input_file_name > output_file_name
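The scanning-window idea can be made concrete with a short Python sketch that lists the alignment columns covered by each window. It is a generic sliding-window enumeration, not qrna's own code: the function name is hypothetical, and the handling of the last, shorter window is an assumption, since the way qrna treats the tail of an alignment is not described here.

```python
def sliding_windows(alignment_length, window=150, step=50):
    """Return (first, last) 0-based inclusive column ranges of each window."""
    if alignment_length <= 0:
        return []
    if alignment_length <= window:
        return [(0, alignment_length - 1)]
    windows = []
    start = 0
    while start + window <= alignment_length:
        windows.append((start, start + window - 1))
        start += step
    # ASSUMPTION: a final shorter window covers whatever columns remain
    if windows[-1][1] < alignment_length - 1:
        windows.append((start, alignment_length - 1))
    return windows


# The alignment shown in Fig7 is 664 columns long
for first, last in sliding_windows(664):
    print(first, last)
```

The first window printed is 0 to 149, the same range as the first window in Fig7. Because the step is one third of the window size, most columns are scored in three overlapping windows, so a structure that straddles a window boundary is still seen whole in a neighbouring window.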

Extraction of loci identified as ncRNA: The perl script phase_count_fast.pl was used with default parameters to prune the QRNA output down to the independent genomic regions that are identified as RNAs. The nucleotide sequences of the predicted ncRNAs were extracted by the same procedure used for extracting the intergenic regions.
Syntax: phase_count_fast.pl file_name query_org database_org

RESULTS

Intergenic regions are a rich source of ncRNAs. As a first step, the contribution of the intergenic regions to the genome of each organism was calculated. Graph 1 shows the length of the selected genomes and Graph 2 displays the percentage of intergenic regions in each genome. The graphs make clear that intergenic sequence makes up a small fraction of each genome compared with the protein coding regions. This agrees with the common feature of prokaryotes, which possess only a small percentage of intergenic sequence [Mattick 2001].

The number of intergenic sequences found was high, and many of them were short stretches. Since biochemically characterized ncRNA genes have a minimum length of 50 nucleotides, only stretches of 50 nucleotides or more were considered. This filtering was done by an in-house C programme. Graph 3 displays the number of intergenic regions before and after filtering. Nearly half of the intergenic regions were eliminated by this criterion.

The current analysis is based on the prediction of conserved secondary structures and on comparative genomics, which relies on similarity between the existing genomes. Hence, databases of groups of the organisms under study were created. Each database was a collection of the genomes of the five other organisms, excluding the one under study, and the organism under study was searched for similarity against that database. Table 1 lists each organism and the contents of the database against which it was searched. The table also gives the number of similarity hits that were fed into QRNA after processing with the perl script blastn2qrnadepth.pl. The perl script filters out hits below the threshold level, as described in the Methods. The resulting counts show the relative degree of similarity between the organisms with respect to genome size. The results of the perl script are displayed in Graph 4. The graph indicates that, for almost all of the selected genomes, the number of similarity hits increased in proportion to genome size; the exception was M.gallisepticum. This suggests that the sequence of this particular organism may differ in character from those of the other selected organisms.

The similarity hits above the set threshold were evaluated by QRNA using a window scanning approach. A window size of 150 nucleotides and a step of 50 nucleotides were chosen to minimize the CPU time taken by QRNA.

The putative ncRNAs reported by QRNA for each organism are shown in Graph 5. Here again, the number of ncRNAs predicted increased in proportion to genome size, except for M.gallisepticum.

The spread of the lengths of the putative ncRNAs is plotted in Graph 6. The graph shows the range, i.e. the shortest and the longest ncRNA predicted for each organism, together with the average length, indicated by the horizontal line.

(1)
Mycoplasma genetalium G37 complete genome - 0..580074
480 proteins
Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    DNA polymerase III, subunit beta (dnaN)
1829..2761    +       310     1045670  MG002  ..       ..    ..   dnaJ-like protein
2846..4798    +       650     1045671  MG003  ..       ..    ..   DNA gyrase subunit B (gyrB)
4813..7323    +       836     1045672  MG004  ..       ..    ..   DNA gyrase subunit A (gyrA)
7295..8548    +       417     1045673  MG005  ..       ..    ..   seryl-tRNA synthetase (serS)
8552..9184    +       210     1045674  MG006  ..       ..    ..   thymidylate kinase (tmk)
……

(2)
735    1829
1829   2761
2846   4798
4813   7323
7295   8548
8552   9184
……

(3)
2762   2845
4799   4812
7324   7294
8549   8551
……

Fig1: (1) Protein table format of the Mycoplasma genitalium genome, showing the annotation of the protein coding regions and the names of the characterized and putative proteins. (2) Coordinates of the protein coding regions alone, obtained after a series of conversions with the Table to Text and Text to Table options in Microsoft Word. (3) Coordinates of the intergenic sequences alone, obtained by simple arithmetic in Microsoft Excel.

Organism          Genome size (nucleotides)
M.genitalium      580,074
M.pneumoniae      816,394
M.pulmonis        963,879
M.gallisepticum   996,422
M.mycoides        1,211,703
M.penetrans       1,358,633

[Bar chart "Genome Size Comparison": genome length (0 to 1,500,000) for M.gen, M.pne, M.pul, M.gal, M.myc and M.pen]

M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans

Graph1: GENOME LENGTH COMPARISON OF THE MYCOPLASMA

[Stacked bar chart: percentage (0% to 100%) of intergenic region and protein coding region for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen]

Graph2: BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC REGION IN THE GENOME OF MYCOPLASMA

(1)
Starting   Ending   Length
1          734      734
2762       2845     84
4799       4812     14
7324       7294     0
8549       8551     3
9185       9156     0
9922       9923     2
11253      11251    0
12041      12068    28
12726      12701    0
13566      13569    4
14434      14395    0
15317      15555    239
…….

(2)
Starting   Ending   Length
1          734      734
2762       2845     84
15317      15555    239
…….

Fig2: (1) Intergenic sequence coordinates and their lengths in Mycoplasma genitalium, as obtained by simple arithmetic in Microsoft Excel. Intergenic regions range in length from 1 nucleotide to thousands of nucleotides (not shown here). (2) Intergenic regions after filtering by the C programme, which removes the regions shorter than 50 nucleotides. The number of intergenic regions decreases considerably after filtering.

#this is Mycoplasma genetalium G37 range file
1        734
2762     2845
15317    15555
19760    19824
20356    20543
28449    28650
36714    36977
38979    39127
47423    47580
……

Fig3: The first few coordinates of the range file created for Mycoplasma genitalium for use in the emboss application.


[Bar chart "Curing of Intergenic Regions": number of intergenic regions (0 to 1200) before and after filtering, for each organism]

Organism   Before   After
M.pen      1037     643
M.myc      1016     572
M.gal      726      290
M.pul      782      376
M.pne      689      282
M.gen      480      122

Graph3: GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS ONLY THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES


>L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATT
ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATAC
ATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATT
TTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA
CTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAAT
ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGT
AGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTG
AACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCAT
TAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA
GCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATA
ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAA
CCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAA
AATTTTGGGAAAAA
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGC
AAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTT
AATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAA
AGCAA
>L43967_20356_20543 Mycoplasma genetalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAA
GGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTA
AAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATT
TAGCAGAA

Fig4: Result of the extractseq application in the emboss suite, which gives the sequences of interest. The figure shows, in fasta format, the first few intergenic sequences of Mycoplasma genitalium obtained from extractseq given the range file and the whole genome sequence as input.


Organism          Organisms in database                                                  Database created   No. of Blastn hits
M.penetrans       M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis    ggmpnpudb          1852
M.mycoides        M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis   ggpppdb            1787
M.gallisepticum   M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis        gempppdb           850
M.pulmonis        M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae   ggmpepndb          1012
M.pneumoniae      M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis     ggmpepudb          565
M.genitalium      M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis     gampppdb           386

Table1: The databases created with the WU BLAST 2.0 application, the organism that was searched for similarity against each database, and the number of blastn hits obtained.


BLASTN 2.0MP-WashU [03-Mar-2004] [linux24-i686-ILP32F64 2004-03-03T16:23:09]

Copyright (C) 1996-2004 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.

Reference: Gish, W. (1996-2004) http://blast.wustl.edu

Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak protein similarities
encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query= L43967_1_734 Mycoplasma genetalium G37 intergenic sequence
       (734 letters; record 1)

Database: gal.fasta
          5 sequences; 5,347,031 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

WARNING: hspmax=1000 was exceeded by 1 of the database sequences, causing the
         associated cutoff score, S2, to be transiently set as high as 81.

                                                                      Smallest Sum
                                                              High    Probability
Sequences producing High-scoring Segment Pairs:               Score   P(N)       N

gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence    692     1.8e-26    1
emb|BX293980.1| Mycoplasma mycoides subsp. mycoides SC ge...  675     1.1e-25    1
dbj|BA000026| Mycoplasma penetrans, intergenic sequence       602     2.1e-22    1
emb|AL445566| Mycoplasma pulmonis (strain UAB CTIP) inter...  539     1.3e-19    2
gb|AE015450.1| Mycoplasma gallisepticum strain R intergen...  528     4.6e-19    1

>gb|U00089| Mycoplasma pneumoniae M129, intergenic sequence
        Length = 816,394

  Plus Strand HSPs:

 Score = 692 (109.9 bits), Expect = 1.8e-26, P = 1.8e-26
 Identities = 410/664 (61%), Positives = 410/664 (61%), Strand = Plus / Plus

Query:    90 TTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACT-TGATA 148
Sbjct:   130 TAAATACTAATCTTCTATATAGTATAGAGAAACTTTTTCT-TTAACATAATATTATCTTA 188

Query:   149 AGTATTATTTAGATATTAGACAAAT-ACTAATTTTA-TATTGCTTTAATACT-TAATAAA 205
Sbjct:   189 A-TATTATTTACCTACTA-ATAGCTTAATATTATTAGTATTTATTTAGTATTATGCTAA- 245

Query:   206 TACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAA-C-AATATTATTAC-AAT 262
Sbjct:   246 TACTATGCAGATATTATCTTAATATTA-TCTA-TAGTATTAGGCTAATATTATTCTTAAT 303

Query:   263 ATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATA 322
Sbjct:   304 ATT-TAT--TAAGGTA-CTAA-AGCATTACCTA-TAGGTGA-TATTATGACAATACTAAA 356

Query:   323 ATAAT-ATTTCTAC-ATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGAT 380
Sbjct:   357 GTGGTTAGTATTATTAGGGTATTAT-TCAA-AGTAT-TCTCCAACACTATTCCCTTAGCT 413

Fig5: Output of the blastn programme from WU BLAST 2.0, run with the intergenic sequences of Mycoplasma genitalium against the database containing the intergenic sequences of the other five Mycoplasma genomes: M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis. (The alignment is only partially shown.) The blastn programme was run with default parameters.


[Chart: number of Blast hits for each genome]

Organism   No. of Blast hits
M.gen      386
M.pne      565
M.pul      1012
M.gal      850
M.myc      1787
M.pen      1852

Graph4: GRAPH SHOWING NUMBER OF BLAST HITS FOR EACH GENOME


>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATT
T
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATT
GGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCA
ACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATT
T
>L43967_15317_15555-1<239-Mycoplasma
AAAATTA-GCTGTTTTGATTATAAAAACTG-TTAAAATTCGAGGGGTATTTTGGTTAAAT
TAAAATTGTTATAATAATTA-GTTA-AATACCCTTAGGATT-ATGCTTGGAAGTATAGCT
CAGTTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTC
CACCAATAAT---GGGGATGTAGCTCAACTGATAGAGCACCTGATTTGCACTCAGGAGGT
TGAGGGT
>gb-AE015450.1--417273>417511-Mycoplasma
AATTTTACGC-GTTGTTATTACCAATCGAAATTAAAAATTAAGCAG-ATATTCTTTAA-
TGAGCT-GA-AT--TAATTATGTTATAATTCATATGGCAATCACGACTGGAAGTATAGCT
CAGCTGGTTAGAGCACACCCCTGATAAGGGTGAGGTCGATGGTTCAAGTCCATTTACTTC
CACCAGTTTTTTTGGGGACGTAGCTCAATTGATAGAGCACCTGATTTGCACTCAGGAGGT
CGAGGGT
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAA
AT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAA
AT

Fig6.1: One of the output files of the perl script blastn2qrnadepth.pl, run with the blastn result of the M.genitalium intergenic sequences versus the Mycoplasma database as input. This first file has the extension .q (here genblast.q) and is the file used as input for the qrna programme in the QRNA-2.0.1 suite. It consists of a collection of sequences in fasta format, in which each consecutive pair of sequences forms the two components of an alignment, with the gaps left in place.
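Because each pair of records in the .q file is an alignment with its gaps in place, simple per-column statistics can be computed directly from a pair. The Python sketch below is a hedged illustration of that idea, not code from the QRNA suite: the function name and the toy sequences are hypothetical, and the definitions (identity, mutation and gap percentages over the columns that remain after columns gapped in both sequences are dropped) are an assumption about how such figures are derived. The assumption is at least consistent with the numbers in this study: blastn reports 410 identities in 664 columns (Fig5), which is 61.75%, the alignment ID that qrna prints for the same alignment (Fig7).

```python
def alignment_composition(seq_x, seq_y, gap_chars="-."):
    """Return (identity %, mutation %, gap %) for a gapped pairwise alignment.

    Columns in which both sequences have a gap are ignored.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have the same length")
    identical = mutated = gapped = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        a_gap = a in gap_chars
        b_gap = b in gap_chars
        if a_gap and b_gap:
            continue  # common gap column, dropped
        if a_gap or b_gap:
            gapped += 1
        elif a == b:
            identical += 1
        else:
            mutated += 1
    total = identical + mutated + gapped
    if total == 0:
        return (0.0, 0.0, 0.0)
    return tuple(round(100.0 * n / total, 2) for n in (identical, mutated, gapped))


# Toy alignment, invented for illustration: 4 identities, 1 mutation, 1 gap
print(alignment_composition("ACGT-A", "ACCTTA"))
```

Both "-" and "." are accepted as gap characters because the .q file uses "-" while the alignment rows printed by qrna use ".".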


FILE: genblast
DIR: /home/kalyankpy/coput2/blast//

FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100

SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift =1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121
After First trimming: 88
After Second trimming: 2

57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152
After First trimming: 3
After Second trimming: 3

72-QUERY: L43967_386409_386461 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 292
After First trimming: 0
After Second trimming: 0

68-QUERY: L43967_364415_364533 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 155
After First trimming: 1
After Second trimming: 1
……

Total #Queries          122
Total #Alignments       53927   ave_len = 309.5
After first trimming    18851   ave_len = 552.6
After second trimming   386     ave_len = 404.2

Fig6.2: The second output file of the perl script blastn2qrnadepth.pl, run with the blastn result of the M.genitalium intergenic sequences versus the Mycoplasma database as input. This file has the extension .q.rep (here genblast.q.rep) and reports the BLASTN alignments that were pruned, according to the options used in the perl script, in the process of creating the .q file.

#--------------------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#--------------------------------------------------------------------------------
# PAM model = BLOSUM62
#--------------------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#--------------------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#--------------------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#--------------------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58]

length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT

L43967_1_734-90 TAAA.CAAATGAGAAATATAGTAATAAAGCAAATT.TTTTCACCAT.TTT
gb-U00089--130> CAGTGCACATA.CCTATTCGCTAGTTAA.ACGATAAAGTTAAAGAAATTT

L43967_1_734-90 TTTATTATATCA.AAATTTAAAGAAAAATCTGAAAATTATCTATAATGTG
gb-U00089--130> TTCTTTATATTCTAAATTT.AAAAATCTTCTCAATATAATACATAAT.TC

LOCAL_DIAG_VITERBI -- [Inside SCFG]

OTH ends *(+) = (0..[150]..149)
OTH ends  (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)
COD ends  (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)
RNA ends  (-) = (0..[150]..149)
winner = OTH

           OTH =  184.281             COD =  166.408             RNA =  179.710
logoddspostOTH =    0.000  logoddspostCOD =  -17.873  logoddspostRNA =   -4.571
  sigmoidalOTH =    4.571    sigmoidalCOD =  -17.932    sigmoidalRNA =   -4.571

Fig7: The qrna output file obtained with the command: qrna -w 150 -x 50 -a -B genblast.q. qrna is a C programme written to evaluate a given alignment for its ability to form a structural RNA. The figure above is a partial output of a qrna run with the scanning-window option (here window size = 150 and slide size = 50 nucleotides).

• Every new blast alignment starts with two lines: “>Query_name” followed by “>Subject_name”.
• “Divergence time” indicates the particular time parameterization of QRNA used. By default QRNA decides on the divergence time (in this case 0.401) given the percentage identity of the alignment (61.75%).
• Each newly analyzed window starts with the line “length alignment:”. For each window and for each sequence in the alignment there is a line of the form: posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43). The first pair of numbers gives the first and last coordinates of the window with respect to the beginning of the alignment. The pair of numbers in brackets gives the mapping of the window into the coordinate system of sequence X (after removing the gaps). The adjacent number in parentheses is the length of that segment in sequence X. Finally, the four decimal numbers in parentheses are the fractions of A, C, G and T respectively in the segment of sequence X involved in that particular window.
• For each model and for each strand, the output lists the actual local regions (there can be more than one per model and strand) that score according to the model. The notation is (from..[Length]..to). Coordinates for both strands are given relative to the positive strand. The * indicates the strand with the strongest signal for a given model.
• For the given scoring algorithm (here local Viterbi, the default) there are three rows of numbers:
Row 1: The scores of the alignment under each of the three models. The null model is a fourth model, which assumes that the two sequences in the alignment are independent of each other.
Row 2: The two (COD and RNA) log-odds posterior probabilities with respect to the OTH model.
Row 3: The three sigmoidal scores, each calculated using the other two models as null models. The model with the highest sigmoidal score is the winner.
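The relationship between the three rows can be checked numerically. The Python sketch below assumes that row 2 is each row-1 score minus the OTH score, and that each sigmoidal score is the base-2 log-odds of one model against the sum of the other two with equal prior weight. This is an assumption about how QRNA combines the scores, although it reproduces the values printed in Fig7. The function names are illustrative and are not part of QRNA.

```python
import math


def log2_sum(a, b):
    """Return log2(2**a + 2**b) without overflowing for large scores."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def qrna_rows(oth, cod, rna):
    """Derive rows 2 and 3 from the row-1 scores (all values in bits).

    Assumes equal prior weight for the three models (an assumption).
    """
    scores = {"OTH": oth, "COD": cod, "RNA": rna}
    # Row 2: log-odds of each model with respect to OTH.
    logodds_post = {m: s - oth for m, s in scores.items()}
    # Row 3: each model against the other two taken together.
    sigmoidal = {}
    for m, s in scores.items():
        others = [v for k, v in scores.items() if k != m]
        sigmoidal[m] = s - log2_sum(others[0], others[1])
    winner = max(sigmoidal, key=sigmoidal.get)
    return logodds_post, sigmoidal, winner


# Row-1 scores of the first window shown in Fig7.
logodds_post, sigmoidal, winner = qrna_rows(184.281, 166.408, 179.710)
for model in ("OTH", "COD", "RNA"):
    print(model, round(logodds_post[model], 3), round(sigmoidal[model], 3))
print("winner =", winner)
```

Working in log space keeps the calculation stable even when the raw scores are hundreds of bits apart.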

---------------Some General Statistics------------------
FILE: ./genblast.2qrna
method: LOCAL_DIAG_VITERBI
Cutoff: 5
max id: 100
# blastn hits: 386
# windows: 2574
---------------Statistics by Windows--------------------
# windows: 2574
RNA>0: 41/2574
RNA>cutoff: 2/2574
COD>0: 2/2574
COD>cutoff: 0/2574
in phases: 2045/2574
RNA: 2/2045
COD: 0/2045
OTH: 2043/2045
in transitions: 0/2574
RNA/COD: 0/0
RNA/OTH: 0/0
COD/OTH: 0/0
RNA/COD/OTH: 0/0
---------------Statistics for RNA loci ():------------------
# loci: 1
ave_length: 196.00
1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19

Fig8: The output of the phase_count_fast.pl perl script, which extracts the RNA loci (those with an RNA score larger than the cutoff set with option -u; here -u is left at its default of 5). The script identified 1 independent locus in Mycoplasma genitalium that scores as RNA above 5 bits, out of the 2574 windows from the 386 blastn alignments. The coordinates of each locus are listed in the following form:

num-loci name_seq(seq_length) loc_from loc_to (loc_length) number_wind type_loc COD_sc RNA_sc

Therefore,

1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19

means that the first M.genitalium locus corresponds to the intergenic sequence named L43967_167180_175806-Mycoplasma. The locus has a length of 197 nucleotides and covers the region from 7457 to 7653. Two different windows contributed to this RNA locus; the average sigmoidal score for the coding model is -39.20 bits, while the average sigmoidal score for the RNA model is 26.19 bits.
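A locus line of this form can be split into named fields for further processing. The following Python sketch is only an illustration of the field layout described above; the Locus class, the regular expression and the function name are hypothetical and are not part of the QRNA scripts.

```python
import re
from dataclasses import dataclass


@dataclass
class Locus:
    index: int
    name: str
    start: int
    end: int
    length: int
    n_windows: int
    kind: str
    cod_score: float
    rna_score: float


# One locus line: index, name, from, to, (length), windows, type, COD, RNA.
LOCUS_RE = re.compile(
    r"^(\d+)-loci\s+(\S+)\s+(\d+)\s+(\d+)\s+\((\d+)\)\s+(\d+)\s+(\S+)"
    r"\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)$"
)


def parse_locus(line):
    """Parse one locus line of the phase_count_fast.pl report."""
    match = LOCUS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a locus line: {line!r}")
    idx, name, start, end, length, n_win, kind, cod, rna = match.groups()
    return Locus(int(idx), name, int(start), int(end), int(length),
                 int(n_win), kind, float(cod), float(rna))


line = "1-loci L43967_167180_175806-Mycoplasma 7457 7653 (197) 2 RNA -39.20 26.19"
locus = parse_locus(line)
print(locus.name, locus.start, locus.end, locus.length, locus.rna_score)
```

With the fields separated, loci from several genomes can be filtered by type or by RNA score without re-reading the report by eye.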

[Graph5: bar chart. y-axis: No. of ncRNAs (0 to 60); x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen; bar value labels in extraction order: 52, 40, 39, 12, 4, 1.]

Graph5: REPRESENTATION OF THE PUTATIVE ncRNAS PREDICTED BY QRNA

[Graph6: Range of Non-coding RNA. y-axis: Length (nt), 0 to 350; x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen.]

Graph6: GRAPH SHOWING THE LENGTH RANGE OF NON-CODING RNAs. (Vertical bars represent the spread of scores and horizontal bars represent the average.)

Fig9: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each figure is a histogram of the number of alignments binned in each percentage-identity interval. The green histogram shows the total number of windows analyzed. The blue histogram shows the windows that score as RNA or coding sequence above a cutoff of 0 bits. a) The number of sequences scored as coding regions in the windows analyzed. b) The number of sequences scored as RNAs in the windows analyzed.

Fig10: Analysis of the BLASTN alignments between M.genitalium intergenic sequences and the intergenic sequence database of M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae and M.pulmonis. Alignments have been grouped by percentage identity. Each figure shows the scores of all the alignments as a function of the percentage identity of the alignments. “*” represents the average of the RNA or coding sequence scores. The error bars correspond to one standard deviation. a) The average of the scores of the windows scored as coding regions. b) The average of the scores of the windows scored as RNAs.
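The grouping used in both figures (bin the windows by percentage identity, then count them or average their scores) can be sketched as follows. This is a minimal Python illustration, not the plotting script distributed with QRNA: the 5% bin width, the use of the population standard deviation and the example windows are all assumptions made here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def bin_by_identity(windows, bin_width=5):
    """Group (percent_identity, score) pairs into identity bins.

    Returns {bin_lower_edge: (count, mean_score, sd_score)}.
    """
    bins = defaultdict(list)
    for pct_id, score in windows:
        # 100% identity is folded into the top bin, e.g. [95, 100].
        lower = min(int(pct_id // bin_width) * bin_width, 100 - bin_width)
        bins[lower].append(score)
    return {
        lower: (len(scores), mean(scores), pstdev(scores))
        for lower, scores in sorted(bins.items())
    }


# Invented example windows: (percentage identity, sigmoidal score in bits).
example = [(61.3, -4.6), (63.0, 2.0), (72.5, -10.0), (100.0, 1.0), (96.0, 3.0)]
summary = bin_by_identity(example)
for lower, (count, avg, sd) in summary.items():
    print(f"{lower}-{lower + 5}% ID: n={count} mean={avg:.2f} sd={sd:.2f}")
```

The counts per bin correspond to the histograms of Fig9, and the mean with its standard deviation corresponds to the points and error bars of Fig10.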


[Fig 9a plot. Title: genblast.qrna.COD.id--sigmoidal LOD; header: = 303 +/- 198 ID=[100:0] total_counts [361] real COD-phase_counts_above: 0 [16//361]; y-axis: NUMBER OF WINDOWS // qrna 2.0.1 (0 to 30); x-axis: % ID (100 to 50).]

Fig 9a: Figure showing the number of sequences scored as coding regions in the windows analyzed.

[Fig 9b plot. Title: genblast.qrna.RNA.id--sigmoidal LOD; header: = 303 +/- 198 ID=[100:0] total_counts [361] real RNA-phase_counts_above: 0 [28//361]; y-axis: NUMBER OF WINDOWS // qrna 2.0.1 (0 to 30); x-axis: % ID (100 to 50).]

Fig 9b: Figure showing the number of sequences scored as RNAs in the windows analyzed.

[Fig 10a plot. Title: genblast.qrna.COD.id--sigmoidal LOD; header: = 303 +/- 198 ID=[100:0] ave COD lodscore above: 0 [16//361]; y-axis: COD sigmoidal LODSCORE // qrna 2.0.1 (-80 to 60); x-axis: % ID (100 to 50).]

Fig 10a: Figure showing the average of the scores scored as coding regions in the windows analyzed.

[Fig 10b plot. Title: genblast.qrna.RNA.id--sigmoidal LOD; header: = 303 +/- 198 ID=[100:0] ave RNA lodscore above: 0 [28//361]; y-axis: RNA sigmoidal LODSCORE // qrna 2.0.1 (-60 to 20); x-axis: % ID (100 to 50).]

Fig 10b: Figure showing the average of the scores scored as RNAs in the windows analyzed.

DISCUSSION

The intergenic regions in prokaryotes are small; however, they have long been known to play a significant role in these organisms. The percentage of intergenic regions in the Mycoplasma genomes varied from 9.2% in M.genitalium (smallest) to 18% in M.mycoides (largest). The number of intergenic regions ranged from 122 (lowest, in M.genitalium) to 643 (highest, in M.mycoides). The average length of the intergenic regions ranged from 234 nucleotides (in M.penetrans) to 441 nucleotides (in M.genitalium). This indicates that the average length of intergenic regions is greater in a smaller genome than in a larger one. This could be due to the large number of short interspersed regions (intergenic regions of only a few nucleotides) in M.penetrans, which reduces the average length.
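The three quantities compared above (number of intergenic regions, their average length and their share of the genome) follow directly from the gene coordinates of an annotated genome. The Python sketch below shows one way to derive them; it is not the C programme used in this work, it treats the genome as linear and ignores strand, and the toy coordinates are invented for illustration.

```python
def intergenic_regions(genes, genome_length):
    """Return the (start, end) stretches not covered by any gene.

    Coordinates are 1-based and inclusive; overlapping genes are merged.
    """
    regions = []
    prev_end = 0
    for start, end in sorted(genes):
        if start > prev_end + 1:
            regions.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    if prev_end < genome_length:
        regions.append((prev_end + 1, genome_length))
    return regions


def summarize(regions, genome_length):
    """Count, average length and genome percentage of the regions."""
    lengths = [end - start + 1 for start, end in regions]
    total = sum(lengths)
    return {
        "count": len(lengths),
        "average_length": total / len(lengths) if lengths else 0.0,
        "percent_of_genome": 100.0 * total / genome_length,
    }


# Invented toy annotation on a 1000 nt genome.
toy_genes = [(101, 400), (380, 700), (901, 1000)]
toy_regions = intergenic_regions(toy_genes, 1000)
toy_summary = summarize(toy_regions, 1000)
print(toy_regions, toy_summary)
```

Because many very short gaps raise the count without adding much total length, a genome with numerous short interspersed regions ends up with a lower average length, which is the effect suggested for M.penetrans.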

QRNA was run with the option of shuffling the sequence. This estimates the false positives that could arise from the given sequence composition and length alone. Earlier results from similar ncRNA predictions in E.coli showed 85% true positives (Rivas and Eddy 2001). The loci predicted in the present study are regions of conserved secondary structure that include ncRNAs; they need not each be an individual ncRNA.
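One common way to build such a shuffled control for a pairwise alignment is to permute its columns. This keeps the base composition, the length, the percentage identity and the gap content unchanged while destroying positional patterns such as compensatory base-pair changes. The Python sketch below illustrates that idea only; it is not QRNA's own shuffling code, and whether QRNA shuffles in exactly this way is an assumption. The two short aligned strings are invented.

```python
import random


def shuffle_alignment_columns(seq_x, seq_y, seed=None):
    """Return both aligned sequences with their columns permuted together."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have the same length")
    rng = random.Random(seed)
    columns = list(zip(seq_x, seq_y))
    rng.shuffle(columns)
    new_x = "".join(col[0] for col in columns)
    new_y = "".join(col[1] for col in columns)
    return new_x, new_y


def percent_identity(seq_x, seq_y):
    """Percentage of columns in which both sequences carry the same base."""
    same = sum(1 for a, b in zip(seq_x, seq_y) if a == b and a not in "-.")
    return 100.0 * same / len(seq_x)


# Invented toy alignment (gaps shown as '-').
x = "ACGT-ACGTTA"
y = "ACGTTAC-TTA"
sx, sy = shuffle_alignment_columns(x, y, seed=1)
print(sx, sy, percent_identity(x, y), percent_identity(sx, sy))
```

A window that still scores as RNA after this kind of shuffling owes its score to composition rather than to a conserved structure, which is why such windows are counted as false positives.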

To assess the significance of the prediction, the predicted loci were searched for similarity against the already known and biochemically characterized ncRNAs obtained from the ncRNA database at http://biobases.ibch.poznan.pl/nc (updated till 2002).

The putative non-coding RNAs were searched against known Mycoplasma ncRNA data (only two ncRNAs have been characterized in Mycoplasma capricolum). The results indicated that one of the putative ncRNAs from the current study showed a good percentage of identity (60%) with one of the two biochemically characterized Mycoplasma ncRNAs, viz. the Mc_MCS4 ncRNA from Mycoplasma capricolum. Mc_MCS4 has already been shown to have extensive similarity with the eukaryotic U6 snRNA. This strengthens the case that our candidate ncRNA is a functional entity.

Since the number of ncRNAs in the biochemically determined database was small, the database was expanded to include other prokaryotic ncRNAs.

The results indicated that a stretch of nucleotides in the putative ncRNA showed significant similarity to MicF RNAs from E.coli, S.typhi and K.pneumoniae. Since MicF is known to regulate the expression of OmpF, and the stretch of nucleotides showing similarity is conserved across all these species, the putative ncRNA stretch may be a MicF counterpart in Mycoplasma. Another ncRNA showing significant similarity to E.coli OxyS RNA was also found. OxyS RNA is known to modulate gene expression in response to hydrogen peroxide, a common chemical produced by mammals in response to infection. This suggests that a defense mechanism of this kind operates in Mycoplasma.

The database was further expanded to include eukaryotic ncRNAs, comprising the characterized miRNAs, development-regulating RNAs and protein-function-modifying RNAs. The putative ncRNAs were found to have more than 60% identity with a number of miRNAs from mouse, human, A.thaliana and C.elegans. Fig. 11a shows a blastn hit with 71% identity against one of the putative ncRNAs from M.mycoides. This indicates that the putative ncRNA has a conserved secondary structure similar to the well-characterized stem-loop region of the C.briggsae miRNA. Fig. 11b shows a blastn hit with 63% identity between a putative ncRNA from the same M.mycoides and a characterized development-regulating RNA of Homo sapiens.


>cbr-mir-268 MI0000541 Caenorhabditis briggsae miR-268 stem-loop
Length = 79
Minus Strand HSPs:
Score = 95 (20.3 bits), Expect = 0.22, P = 0.19
Identities = 33/46 (71%), Positives = 33/46 (71%), Strand = Minus / Plus

Query: 64 CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC 21
          || || | || | ||  || |   |||||| || ||||||||||||
Sbjct: 34 CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC 78

Fig 11a: A blastn hit with 71% identity obtained for one of the putative ncRNAs from M.mycoides. This indicates that the putative ncRNA has a conserved secondary structure similar to the well-characterized stem-loop region of the C.briggsae miRNA.
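The 33/46 (71%) identity reported by blastn can be recomputed directly from the two aligned strings of Fig 11a. The Python sketch below counts the columns in which both sequences carry the same base. The function name is illustrative, and truncating the percentage to a whole number is an assumption made here to match the value printed in the report.

```python
def count_identities(query, sbjct):
    """Return (identical_columns, alignment_length) for two aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    return identical, len(query)


# Aligned strings copied from the Fig 11a hit (gaps shown as '-').
query = "CAAAC-CTCTAAACTT-CTAAGAACTTCTTCTTCTTCTTCTTCTTC"
sbjct = "CAGACACACTCA-CTGACTCACTGCTTCTTGTTTTTCTTCTTCTTC"

identical, length = count_identities(query, sbjct)
percent = 100 * identical // length  # truncated to a whole percentage
print(f"Identities = {identical}/{length} ({percent}%)")
```

The alignment length in the denominator includes the gapped columns, so the same pair of sequences can give a different percentage if the aligner places the gaps differently.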

Significant hits were also found with the development-regulating ncRNAs, including those from Homo sapiens.

>Hs_NTT
Length = 17,572
Plus Strand HSPs:
Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query:   11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
            ||  |||| |  || |||  |  | || |  |||| | |||  |  ||||  ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query:   66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
            ||||  | ||  ||| |||| |   ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427

Fig 11b: A sample hit with about 63% identity between a putative ncRNA from the same M.mycoides and a characterized development-regulating RNA of Homo sapiens.


These results indicate that the ncRNAs are conserved across other kingdoms of life. Since the ncRNAs are generally conserved across a wide spectrum of organisms, they may play varied roles in different cellular processes, though these roles are yet to be proved biochemically (which remains a challenging task).

Because the very existence and the expression profile of ncRNAs are not predictable, their functional analysis remains challenging. With the predicted ncRNAs in hand, this task can be approached with a reduced burden.

REFERENCES

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25:3389
2. Argaman L, Hershberg R, Vogel J, Bejerano G, Wagner EG, Margalit H, Altuvia S: Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current Biology 2001, 11:941
3. Badger JH, Olsen GJ: CRITICA: Coding Region Identification Tool Invoking Comparative Analysis. Molecular Biology and Evolution 1999, 16:512
4. Caprara MG, Nilsen TW: RNA: versatility in form and function. Nature Structural Biology 2000, 7:831
5. Rivas E, Eddy SR: Secondary structure alone is generally not statistically significant for the detection of non-coding RNAs. Bioinformatics 2000, 16:583
6. Rivas E, Eddy SR: QRNA: A non-coding RNA genefinder using comparative genome sequence analysis (ftp://ftp.genetics.wustl.edu/pub/eddy/software/qrna.tar.z) 2001
7. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational identification of non-coding RNAs in Escherichia coli by comparative genomics. Current Biology 2001, 11:1369
8. Rivas E, Eddy SR: Non-coding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2:8
9. Erdmann VA, Barciszewska MZ, Szymanski M, Hochberg A, de Groot N, Barciszewski J: The non-coding RNAs as riboregulators. Nucleic Acids Research 2001, 29:189
10. Gish W: WU-BLAST 2.0 (http://blast.wustl.edu/) 2003
11. Huttenhofer A, Kiefmann M, Meier-Ewert S, O’Brien J, Lehrach H, Bachellerie JP, Brosius J: RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO Journal 2001, 20:2943
12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 1997, 25:955
13. Lowe TM, Eddy SR: A computational screen for methylation guide snoRNAs in yeast. Science 1999, 283:1168
14. Szymanski M, Barciszewski J: Beyond the proteome: non-coding regulatory RNAs. Genome Biology 2002, 3:0005.1
15. Mattick JS: Non-coding RNAs: the architects of eukaryotic complexity. EMBO Reports 2001, 2:986
16. Olivas WM, Muhlrad D, Parker R: Analysis of the yeast genome: identification of new non-coding and small ORF-containing RNAs. Nucleic Acids Research 1997, 25:4619
17. Eddy SR: Non-coding RNA genes. Current Opinion in Genetics and Development 1999, 9:695
18. Eddy SR: Non-coding RNA genes and the modern RNA world. Nature Reviews Genetics 2001, 2:919
19. Schattner P: Searching for RNA genes using base-composition statistics. Nucleic Acids Research 2002, 30:2076
20. Wassarman KM, Zhang A, Storz G: Small RNAs in Escherichia coli. Trends in Microbiology 1999, 7:37
21. Zwieb C, Wower I, Wower J: Comparative sequence analysis of tmRNA. Nucleic Acids Research 1999, 27:2063
