Research Proposal: Bioinformatics Approach to the Evaluation of Transcription Factor Genes and Diseases (Cancer)
Brijesh Singh Yadav
(Senior Research Associate, URC, Allahabad)
E-Mail:
[email protected]
Problem Statement: The purpose of the proposed research is to develop a computational approach that quantitatively evaluates associations between transcription factor-encoding genes and human diseases, based on available literature evidence. The approach will analyze a set of candidate genes and determine which genes are linked to human diseases, which properties are involved in these gene-disease linkages, and which clusters of similar genes are involved in particular diseases. During the course of the research, I shall explore methods for recapitulating existing associations and predicting novel associations from diverse forms of data pertaining to genes and diseases. These methods will score the resulting associations quantitatively, and the resulting analyses will be validated to determine how effective the methods are.
Background: Identifying the functional causes and contributing mechanisms of disease is a principal aim of biomedical research. In many cases, the term “disease” applies broadly to a heterogeneous set of observable properties that may arise from multiple molecular processes. Disease is often characterized by its symptoms and its pattern of progression over time. Cancer is a particularly broad area, encompassing a wide range of complex, abnormal phenotypes. Many types of cancer, such as brain cancer, tend to be poorly understood compared with diseases associated with other organs: they are difficult to characterize and have complex genetic components involving multiple genes.
Transcription factors are key regulators of gene expression. Working alone or as part of protein complexes, they act through processes such as the recruitment of transcription initiation factors and conformational change of DNA.
Several computational methods have been developed to link genes to disease. GeneSeeker [1] finds genes within a chromosomal location that are expressed in particular tissues, by looking at human and mouse expression data. Another method for associating disease genes with anatomical locations uses text mining of PubMed abstracts to associate eVOC anatomical ontology terms with gene names [2].
Machine learning approaches can be used when a representative set of disease genes is available as training data. DGP [3] uses decision tree classification to find features common to disease genes, based on a training set of sample disease proteins and control proteins. The features are protein length, the BLASTP ratio (conservation score) between a protein and its highest-scoring homologue within each taxonomic group (representing phylogenetic conservation and extent), and the conservation score with the closest paralogue. The study indicates that, on average, hereditary disease genes (taken from OMIM) are longer, more conserved, more phylogenetically extended, and less likely to have close paralogues than randomly selected genes.
PROSPECTR [4] uses a wider variety of features, including the length of the gene, the length of its coding sequence, the length of its cDNA, the length of the protein, GC content, and percentage protein identity with its nearest homologue in various species (mouse, worm, fly). The investigators trained an alternating decision tree on genes taken from OMIM, compared against genes not found in OMIM. They also generated two independent test sets: one using genes from the Human Gene Mutation Database with randomly selected control genes, and another using a set of 54 genes not in OMIM, again with randomly selected control genes.
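Several of the sequence-derived features listed above (gene, coding sequence and protein lengths, GC content) are simple to compute. The following is a minimal sketch, not the PROSPECTR implementation: the function names and the toy sequences are hypothetical, and homology-based features are omitted because they require external alignment tools.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sequence_features(gene_seq: str, cds_seq: str, protein_seq: str) -> dict:
    """Collect length- and composition-based features for one gene."""
    return {
        "gene_length": len(gene_seq),
        "cds_length": len(cds_seq),
        "protein_length": len(protein_seq),
        "gc_content": gc_content(gene_seq),
    }


# Toy sequences invented purely for illustration (not real genes).
features = sequence_features(
    gene_seq="ATGGCGTTTAAACCCGGGTAA",
    cds_seq="ATGGCGTTTTAA",
    protein_seq="MAF",
)
print(features)
```

A feature dictionary of this kind, computed for disease genes and control genes alike, is the sort of input a decision tree classifier would be trained on.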
POCUS [5] takes another machine learning approach, using a selected training set of genes linked to the target disease. POCUS identifies features common to all the training genes (InterPro domains, GO annotations, similar expression profiles) and assesses the probability that such features would be shared by chance. This method depends on a carefully selected training set and focuses on the likelihood that these genes all share common, disease-related properties, in contrast to methods that focus on overrepresentation of properties among the training genes.
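To illustrate what "shared by chance" can mean in practice, the sketch below computes the probability that every gene in a randomly drawn set carries a given annotation. This is a simple counting argument under the assumption of uniform random sampling without replacement; it is not the scoring scheme published for POCUS, and the function name and example numbers are hypothetical.

```python
from math import comb


def prob_all_share_by_chance(genome_size: int,
                             genes_with_feature: int,
                             sample_size: int) -> float:
    """Probability that all `sample_size` genes drawn at random (without
    replacement) from `genome_size` genes carry a feature that is present
    in `genes_with_feature` of them: C(n, k) / C(N, k)."""
    if not 0 < sample_size <= genome_size:
        raise ValueError("sample_size must be between 1 and genome_size")
    if not 0 <= genes_with_feature <= genome_size:
        raise ValueError("genes_with_feature must be between 0 and genome_size")
    if sample_size > genes_with_feature:
        return 0.0
    return comb(genes_with_feature, sample_size) / comb(genome_size, sample_size)


# Hypothetical numbers: 3 training genes all share a domain that is
# found in 50 of 20000 genes.
p = prob_all_share_by_chance(genome_size=20000, genes_with_feature=50, sample_size=3)
print(p)
```

A small probability suggests that the shared feature is unlikely to be a coincidence of the training set, which is the intuition behind using it to prioritise candidate genes.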
Proposed Method: Most existing methods for the computational prediction of linkages between genes and disease take as input a preliminary list of candidate genes (e.g. genes in a genomic region linked to a disease in a genetic study) and return as output either a reduced or a ranked list. The underlying approaches differ substantially between methods. Characteristics used include numerical features derived from the raw sequence of genes and/or encoded proteins, existing annotations of proteins and genes, and abstracts or articles that refer directly to the gene. The current methods focus on using properties from a representative set of genes to identify similar genes in the candidate set. We propose a method for extracting gene-disease associations that will emphasise verifiable supporting evidence for each predicted association and a quantitative evaluation of its strength. We shall investigate both the associations between genes and disease and the properties of those associations. We shall consider three base entities (Genes, Diseases, Evidence) and the relationships between these entities.
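One possible way to represent the three base entities and to attach a quantitative strength to an association is sketched below. The class names, the per-item `weight` in [0, 1], and the noisy-OR combination rule are all assumptions made for illustration; the proposal does not fix a particular scoring formula, and the gene, disease and source labels are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Gene:
    symbol: str


@dataclass(frozen=True)
class Disease:
    name: str


@dataclass(frozen=True)
class Evidence:
    source: str    # placeholder label for an abstract or database record
    weight: float  # assumed confidence in [0, 1]


@dataclass
class Association:
    gene: Gene
    disease: Disease
    evidence: list = field(default_factory=list)

    def strength(self) -> float:
        """Noisy-OR over evidence weights: 1 - prod(1 - w).
        Treats evidence items as independent (an assumption)."""
        remaining = 1.0
        for item in self.evidence:
            remaining *= 1.0 - item.weight
        return 1.0 - remaining


# Placeholder entities; no real gene-disease link is implied.
assoc = Association(Gene("GENE_A"), Disease("disease_X"))
assoc.evidence.append(Evidence("abstract_1", 0.5))
assoc.evidence.append(Evidence("abstract_2", 0.5))
print(assoc.strength())  # 0.75
```

Keeping each `Evidence` item attached to the association means that every score can be traced back to the records that support it, in line with the emphasis on verifiable evidence.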
Goal of Research: Our goal will be to predict Gene-Disease relationships based on the existence of relationships between other entity pairings. After an initial study of mammalian gene-disease relationships, we will broaden the approach to incorporate entity relationships involving orthologous genes in model organisms or related diseases. These paths of supporting evidence will be quantitatively evaluated, making it possible both to extract strongly supported gene-disease linkages and to rank them. Although the thesis itself will investigate properties of transcription factor genes in cancer, the methods and analysis will be designed for general application. For the initial analysis of the main gene-disease associations.
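The idea of evaluating paths of supporting evidence, including indirect paths through orthologous genes, can be sketched as a search over a weighted graph. Everything here is an assumption for illustration: the graph layout, the edge weights, the rule that a path scores the product of its edge weights, and the noisy-OR combination of paths (which treats paths as independent even when they share edges). The node names are placeholders.

```python
def path_scores(graph, start, goal, max_edges=3):
    """Yield the product of edge weights for each simple path
    from start to goal that uses at most max_edges edges."""
    stack = [(start, (start,), 1.0)]
    while stack:
        node, visited, score = stack.pop()
        if node == goal:
            yield score
            continue
        if len(visited) > max_edges:  # max_edges edges already used
            continue
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                stack.append((neighbour, visited + (neighbour,), score * weight))


def linkage_score(graph, gene, disease, max_edges=3):
    """Combine all path scores with noisy-OR: 1 - prod(1 - s)."""
    remaining = 1.0
    for score in path_scores(graph, gene, disease, max_edges):
        remaining *= 1.0 - score
    return 1.0 - remaining


def rank_linkages(graph, genes, disease, max_edges=3):
    """Return (gene, score) pairs sorted from strongest to weakest."""
    scored = [(g, linkage_score(graph, g, disease, max_edges)) for g in genes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy graph with invented weights: gene_A has direct and orthologue-mediated
# support, gene_B only orthologue-mediated support.
graph = {
    "gene_A": {"disease_X": 0.5, "ortholog_A": 0.8},
    "ortholog_A": {"disease_X": 0.5},
    "gene_B": {"ortholog_B": 0.5},
    "ortholog_B": {"disease_X": 0.4},
}
ranking = rank_linkages(graph, ["gene_B", "gene_A"], "disease_X")
print(ranking)
```

In this toy example gene_A ranks above gene_B because it is supported by two paths, one direct and one through an orthologue, whereas gene_B is supported by a single indirect path.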
References:
1. Van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, et al. (2005) GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Research 33: 758.
2. Tiffin N, Kelso J, Powell A, Pan H, Bajic V, et al. (2005) Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Research 33: 1544-1552.
3. López-Bigas N, Ouzounis C (2004) Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Research 32: 3108.
4. Adie E, Adams R, Evans K, Porteous D, Pickard B (2005) Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 6: 55.
5. Turner F, Clutterbuck D, Semple C (2003) POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology 4: 75.