TITech 13 Nov 2000

Bioinformatics: Converting Data to Knowledge
Gio Wiederhold
Stanford University, Computer Science, E.E. & Medicine
http://www-db.stanford.edu/people/gio.html

[Figure labels: Data; Observations; Filters; Aggregation of instances; Integration of sources; ✖; Knowledge; Analyses]
• The product: Information
10/15/08

Gio Wiederhold - TITech 2000


Bio-Information
• to learn about ourselves,
  – our origins, our place in the world: Primates, Mice, Zebrafish, Fruit Flies, Roundworms, Yeast
  – modesty, seeing how much we share with all organisms
  – not just of philosophical interest, but also
• to help humanity to lead healthy lives
  – to create new scientific methods
  – to create new diagnostics
  – to create new therapeutics

Loops of Data and Knowledge
Information is created at the confluence of data -- the state -- & knowledge -- the ability to select and project the state into the future.
[Figure: a Data Loop and a Knowledge Loop. Labels: Storage, Education, Selection, Recording, Integration, Abstraction, Experience, State changes, Decision-making, Action]

Volume and Variety
Two interacting issues in generating information:
1. The volume is large -- we need automation
2. The data is varied & heterogeneous
  • many autonomous sources
  • many distinct objectives
  ➔ many incompatibilities, errors

Quantities
Nature: 1 human; > 30 000 genes; ~ 10 000 proteins; diseases; 6 000 000 000 humans; < 1000 systems; ~ 2 000 000 molecules
Progress:
• The human genome: ~ 4 000 000 000 base pairs
• Genes, and gene abnormalities
• Everybody’s genes
• Metabolic pathways
• Small organic molecules - affect proteins - suitable for drugs

Diversity ➪ Heterogeneity
• A wide variety of knowledge is needed to interpret the data
• A large variety of experts is developing this knowledge
• The scope of interests differs among those experts
• The knowledge is expressed in diverse ways
• The terms differ in precise meaning: semantics
• A large variety of data types is needed
• A wide variety of representations is used
• The database and file schemas differ
• The openness and accessibility of the information differs

Scope differences
A scope difference exists when terms differ in their mapping to real-world objects.
[Figure labels: all possible employees; employee (payroll); employee (personnel); disabled; contractors]
The local objective determines scope.
Example: “binding site” in PDB database [Waugh & Altman]
[Figure labels: all actual binding sites; binding sites reported for publication; doubtful]
Reporting doubtful results risks rejection of publication.

Heterogeneity inhibits Integration
• An essential feature of science
  – autonomy of fields
  – differing granularity and scope of focus
  – growth of fields requires new terms
• A feature of technological process
  – standards require stability
  – yesterday’s innovations are today’s infrastructure
Must be dealt with explicitly
  – sharing, integration, and aggregation are essential
  – large quantities of data require precision

Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency,
  – Local Needs have Priority,
  – Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✔ ✔
• Data Representation and Access Conventions ✔
• Metadata: Annotations, Naming, and Ontology ✚
  – needed to share data from distinct sources

Required precision = F(volume)
More precision is needed as data volume increases -- a small error rate still leads to too many errors.
• False Positives have to be investigated ( attractive-looking supplier - makes toys; apparent drug-target with poor annotation )
• False Negatives cause lost opportunities, suboptimal to some degree
[Figure: data errors versus information quantity. Labels: human limit; acceptable limit; with tools?; Information Wall. Adapted from Warren Powell, Princeton Un.]
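The arithmetic behind this slide can be sketched in a few lines. This is only an illustration: the error rate, the review capacity, and the function names are assumptions made for the example and do not come from the talk.

```python
def expected_false_positives(volume: int, false_positive_rate: float) -> float:
    """Expected number of false positives when every item is screened once."""
    return volume * false_positive_rate


def required_rate(volume: int, review_capacity: int) -> float:
    """Largest false-positive rate that keeps the expected count within capacity."""
    return review_capacity / volume


# Hypothetical numbers, chosen only for illustration.
rate = 0.001      # a 0.1% false-positive rate
capacity = 100    # false positives that people can investigate by hand

for volume in (10_000, 1_000_000, 100_000_000):
    fp = expected_false_positives(volume, rate)
    needed = required_rate(volume, capacity)
    print(f"volume={volume:>11,}  expected false positives={fp:>9,.0f}  "
          f"rate needed={needed:.0e}")
```

With a fixed error rate the number of results to investigate grows linearly with volume, so the tolerable error rate has to shrink in proportion to the volume if the human workload is to stay constant.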

Inconsistency causes errors, while results need precision
False positives (= poor precision) typically cost more than false negatives (= poor recall).
Example: [ Todd Lowe: tRNA search ]
• Search in Yeast for 55 methylation sites -- required manual elimination of pseudogenes
• Search space in human genome is 215 times larger, not yet done
In drug discovery we now have more targets than pharmaceutical companies can afford.

Broad array of relatable sources
• Genomic
• Bibliographic
• Demographic
• Epidemiological
  – Familial
  – Contacts
• Clinical
  – Drug effectiveness
  – Drug-resistance
  – Co-occurrence
[ Many used in data-mining, as in PRM (Probabilistic Relational Model) research by Lise Getoor @ Stanford ] Requires acyclicity. Use temporal dependencies?

Intersection of a large (irrelevant data) and a small (good data) distribution
Result: the optimal separation creates more false positives (irrelevant results) than false negatives (good results missed).

Quality of data verified through publication
Data characteristics project [Stephen Koslow, Office on Neuroinformatics, NIMH, www.nimh.nih.gov/neuroinformatics/index.cfg]
• The human brain uses 15 Watts; has dozens of cell types, 100 billion (10^11) neural cells, 10^15 connections.
• Neuroscience is a growing field, includes neuroinformatics.
• Initial, broad journals, reductionist journals. Numerical, symbolic, literature and image data.
• Volume of publication only for serotonin, discovered in 1948, now 70 000 papers, is becoming impossible to follow.
• Voluminous 3-D MRI data. UCLA brain mapping. Basis for localization of diagnostic EEG, MEG observations.

Projects requiring manual curation are domain specific
Virtual Cell Project: Dong-Guk Shin, Univ. Connecticut, [email protected]; also available without DB support, www.nrcam.uchc.edu
• NIH supported: physiology modeling. NSF support: computational modeling approach.
• Bottom-up approach to cell modeling. Cross checking of models and HXs.
• Geometry from segmented images; 2-D visualization of specified reactions: channels, pumps, for extra, intra (cytosol), of core cellular compartments.
• Generates equations for simulation.
• Result is a DB publication cycle, supporting model copying and adaptation.
• Access to remote DBs will need more than a browser: also a query system, with join over association.
• DBs need APIs and mediation for scalability and mismatch.

Data integration in Literature [ Jim Garrels, Proteome, Inc., www.proteome.cm, free ]
• BioKnowledge Library, a portal site, with 50 billion bytes of text covering the 5 billion bytes in Genbank.
• Classification, curated by experts.
• Pages { title with brief functional description, family, properties (Mutant phenotype, ), } sequence annotations, related proteins: Orthologs and Interlogs (in different species) [Marc Vidal, MGH]
• Integrated data from cDNA microarrays and chips, systematic 2-hybrids.
• Model organisms: first Yeast, now Worms [Stuart Kim, Stanford]. Several 1000 physical associations and interactions.
• Authors should not publish experimental data directly into a DB and curate their own papers, but submit their results and publish detailed expression studies and update their own results.

Relationships among search parameters
recall r = v.relevant / v.available
precision p = v.relevant / v.retrieved
[Figure: curves for perfect recall, perfect precision, volume retrieved, volume available, and %tage actually relevant, on a 0% to 100% axis, plotted over the space of methods, ranked from best]
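The two ratios on this slide can be computed directly from sets of item identifiers. The sketch below assumes that "v.relevant" means the relevant items actually retrieved and that "v.available" means all relevant items that exist in the sources; the function name and the sample identifiers are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one search result."""
    hits = retrieved & relevant                      # relevant items that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: four relevant items exist in the sources.
relevant = {"g1", "g2", "g3", "g4"}

narrow_search = {"g1", "g2"}                                      # few, clean results
broad_search = {"g1", "g2", "g3", "g4", "x1", "x2", "x3", "x4"}   # everything, plus noise

print("narrow:", precision_recall(narrow_search, relevant))
print("broad: ", precision_recall(broad_search, relevant))
```

The narrow search has perfect precision but misses half of the relevant items, while the broad search has perfect recall at the cost of precision, which is the trade-off the figure plots.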

Means to achieve precision in text
Textual information - knowledge - complements pure data-oriented searches such as BLAST [Liu & Altman]
• Reduce redundancy
  – omit similar results from alternate sources: reports, workshop papers, journals, books
• Reduce false positives
  – recognize contextual domains *
  * the same word refers to different object types: nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
  – linguistic processing based on customer model: medical case studies have similar formats

Integration makes Semantic Mismatches visible
Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items { lorry, truck }
  – same terms for dissimilar items trunk ( luggage, car )
  – differing coverage vehicles ( DMV, police, AIA )
  – differing granularity trucks ( shipper, manuf. )
  – different scope student ( museum fee, Stanford )
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision for interoperation: ok for web browsing, poor for business and science

Shared Knowledge Base
PharmGKB – PharmacoGenetics Knowledge Base, starting 2000
“An Ontology for Genetic Information” [Russ Altman]
Based at Stanford, funded by NIGMS to link existing projects – but open to others.
Phenotype variation --> Genotype variation
• Phase 2 metabolizing enzymes – R. Weinshilboum at Mayo Clinic
• Asthma -- Weiss (was Jeff Raizin) at Harvard Un.
• Anti-cancer agents -- Mark Ratain at Un. of Chicago
• Membrane Transporters -- Kathleen Giacomini, UCSF
• Tamoxifen metabolic activation -- Dave Flockhart at Georgetown Un.
• Minority Populations and Privacy – M. Rothstein at Univ. of Houston
• Depression in Mexican-Americans -- J. Licinio at UCLA
• Database Tools -- Prakash Nadkarni at Yale Un.

Complex Relationships
[Figure: network of pharmacogenetic concepts. Labels: Genomic information; Coding; Alleles; Molecular Variation; Variation in genome; Genetic Makeup; Protein Products; Roles in organism; Molecules; Pharma. activity; Drugs; Drug response systems; Treatment protocols; Isolated functional measures; Integrated functional measures; Molecular & cellular phenotype; Observable phenotypes; Clinical phenotype; Physiology; Individuals; Non-genetic factors; Environment. Courtesy of R. Altman & Teri Klein, PharmGKB]

PharmGKB
• Ontology for pharmacogenetics
  – Represented in Protégé [Musen: smi.stanford.edu/project/protege]
• Service for Universities and Industry
• open access to information and tools, but not a warehouse
  – Industrial affiliates, contributors and consumers at larger scales: geneticXchange, Merck Co, Pharmacia, SmithKline-Beecham ( & Glaxo-Wellcome ), GeneLogic, Guidant, Doubletwist, Incyte, Informax, SGI, Sun
• Collaboration in larger topics:
  – Biotechnology -- Clark Center
  – Education -- NIH sponsored training program, new UG degrees

Consistency: global or partial ?
• Global consistency
  + wonderful for users and their programs
  – too many interacting sources
  – long time to achieve, 2 sources (UAL, LH), 3 (+ trucks), 4, … all ?
  – costly maintenance, since all sources evolve
  – no world-wide authority to dictate conformance
• Domain-specific ontologies (XML DTD assumption)
  + small, focused, cooperating groups
  + high quality, some examples - arthritis, Shakespeare plays
  + allows sharable, formal tools
  + ongoing, local maintenance affecting users - annual updates
  – poor interoperation, users still face inter-domain mismatches
  – periodic source updates need automation in interoperation

Stanford Infolab SKC project ( Scalable Knowledge Composition )
Objective: High precision in semantic interoperation of autonomous sources
• Basic -- pessimistic -- assumption:
  – The ontological mapping of terms ↔ objects differs between autonomous domains.
• But
  – The collections of real-world objects provide a grounding for the definitions, and an opportunity to validate the meaning of the terms being employed.
  – Relationships have semantic and a related structural significance.

Exploit Domain-specific Expertise
Knowledge needed is huge in science and in business
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction
[Figure: three groups, each labeled "Society of specialists"]

SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation
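The definitions above map naturally onto a small data structure. The following is a minimal sketch, not the SKC implementation: the class names mirror the slide's vocabulary, while the fields and the store example are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    label: str              # the reference: a label that names objects
    abstract: bool = False  # True for a concept that refers to other objects


@dataclass(frozen=True)
class Relationship:
    name: str               # relationships are named ...
    kind: str               # ... and typed
    links: frozenset        # set of (source label, target label) pairs


@dataclass
class Ontology:
    domain: str
    terms: set = field(default_factory=set)
    relationships: list = field(default_factory=list)


# Hypothetical store ontology with one real-world term and one abstract term.
store = Ontology("store")
store.terms |= {Term("shoe"), Term("footwear", abstract=True)}
store.relationships.append(
    Relationship("is-a", "subclass-of", frozenset({("shoe", "footwear")}))
)
print(sorted(t.label for t in store.terms))
```

Keeping the label separate from the abstract/real-world flag reflects the slide's distinction between a reference and the object it names.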

Sample Operation: INTERSECTION
Result contains shared terms, useful for purchasing
[Figure: an Articulation links Source Domain 1 (owned and maintained by Store) and Source Domain 2 (owned and maintained by Factory)]

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology, keep sharable entries
• Union: create a joint ontology, merge entries
• Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

INTERSECTION support
[Figure: the Articulation ontology (terms useful for purchasing) sits between the Store Ontology and the Factory Ontology; it holds matching rules that use terms from the 2 source domains]

Other Basic Operations
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
[Figure label: Articulation ontology, typically prior intersections]


Tools to create articulations
Graph matcher for Articulation-creating Expert
[Figure labels: Transport ontology; Vehicle ontology; Suggestions for articulations]

continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity,
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okay’s are converted into articulation rules
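The suggest-and-confirm cycle described above can be sketched for the spelling-similarity case alone. This is an assumption-laden illustration, not the SKC tool: it uses the standard library's `difflib.get_close_matches` as a stand-in for whatever matcher the project used, and the term lists, cutoff, and simulated expert are hypothetical.

```python
import difflib


def suggest_by_spelling(term, candidates, cutoff=0.6, limit=3):
    """Candidate terms from the other ontology that are spelled similarly."""
    return difflib.get_close_matches(term, candidates, n=limit, cutoff=cutoff)


def review(suggestions, expert):
    """Record every expert verdict; keep the "okay" pairs as articulation rules."""
    log, rules = [], set()
    for pair in suggestions:
        verdict = expert(pair)          # "okay", "false", or "irrelevant"
        log.append((pair, verdict))
        if verdict == "okay":
            rules.add(pair)
    return log, rules


# Hypothetical term lists for the two ontologies.
transport_terms = ["truck", "cargo", "freight"]
vehicle_terms = ["trucks", "car", "cargo van", "lorry"]

suggestions = [(t, c) for t in transport_terms
               for c in suggest_by_spelling(t, vehicle_terms)]


def expert(pair):
    # Stand-in for the human expert: accepts one pair and rejects the rest.
    return "okay" if pair == ("truck", "trucks") else "false"


log, rules = review(suggestions, expert)
print("recorded:", log)
print("rules:   ", rules)
```

Rejected suggestions stay in the log, matching the slide's point that all results are recorded and only the accepted ones become rules.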

Candidate Match Nexus
Term linkages automatically extracted from 1912 Webster’s dictionary *
Based on processing headwords ➽ definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
* free; have processed the OED (Oxford English Dictionary) at Stanford for internal use

Using the Match Nexus
Experiment on government structures of NATO countries: the SKEIN system resolved over 70% of unmatched terms


Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused when sources change

Knowledge Composition
[Figure: knowledge resources A, B, C, D, and E are linked by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E). Composed knowledge for applications using A, B, C, E: (A ∩ B) U (B ∩ C) U (C ∩ E). Legend: U : union, ∩ : intersection]
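The composition in the figure, and the reuse of past operations when a source changes, can be illustrated with ordinary Python sets. This sketch assumes that a knowledge resource is just a set of term labels and that articulation is plain set intersection, which is far simpler than rule-based articulation; all names and sample terms are hypothetical.

```python
def compose(resources, pairs, cache):
    """Union of pairwise articulations, reusing cached articulations when present."""
    result = set()
    for pair in pairs:
        if pair not in cache:
            p, q = pair
            cache[pair] = resources[p] & resources[q]   # articulation as intersection
        result |= cache[pair]
    return result


def invalidate(cache, changed):
    """Drop only the cached articulations that involve the changed resource."""
    for pair in [pair for pair in cache if changed in pair]:
        del cache[pair]


# Hypothetical knowledge resources, reduced to sets of terms.
resources = {
    "A": {"gene", "protein", "pathway"},
    "B": {"protein", "drug", "target"},
    "C": {"drug", "dose", "patient"},
    "E": {"patient", "phenotype"},
}
pairs = [("A", "B"), ("B", "C"), ("C", "E")]
cache = {}

composed = compose(resources, pairs, cache)

# Resource E changes: only the (C, E) articulation has to be recomputed.
resources["E"] = {"patient", "phenotype", "dose"}
invalidate(cache, "E")
recomposed = compose(resources, pairs, cache)
print(sorted(composed), sorted(recomposed))
```

Because union and intersection are commutative, the pairs can be evaluated in any order, which is what allows alternate arrangements to be compared and optimized.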

Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software

Summary: Scalable Knowledge Composition (SKC)
Provide for Maintainable Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts

Many Other Tasks at/near Stanford
• Matching cell / protein 3D with chemical’s 3D
• Regulatory Gene motifs:
  – Bioprospector [ Brutlag & Liu <www-cmgm.stanford.edu> ]
• Protein structure generation – moving from small to larger proteins
  1: Powerful parallel processing [IBM BlueGene]
  2: Two-level: use features as an intermediate (alpha-helix, beta-sheets, …)
  3: Protein Folding speedup by delegation [Shirts & Pande: foldingathome.stanford.edu ]
• RNA folding (simpler, larger) [Nakatani & Pande]

Provenance of derived data
Assure having a proper history of derived results
[ Peter Buneman, UPenn, www.humgen.upenn.edu ] K2 integration tool
• Integrated databases often don’t indicate the original sources. E.g., SwissProt does not distinguish inferred from observed data.
[ William Gelbart, Harvard University ] Flybase
• Flybase also collects data as exons and their mutations, transposon insertion sites.
• Moving from being Hunter Gatherers in science to Harvesters, moving to an agronomical society.
• Classical genomics is being superseded by expression and interaction of gene products and gene perturbation.
[ Peter Karp, SRI Int., Bioinformatics Res. Group, www.ai.sri.com/pkarp/ ] EcoCyc
• EcoCyc links proteins to 150 metabolic pathways in E. coli.
• Databases are supplanting journals. They are re-analyzable. Results in journals are not.
Estimate now about 500 public databases for Bioinformatics. Not all of them have APIs or use real DBMSs, and they have differing models and units of measurement, leading to semantic problems.
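One way to keep "a proper history of derived results" is to attach the source and the kind of evidence to every value and to link derived values to their inputs. The sketch below is an assumption for illustration only: it is not the K2, Flybase, or EcoCyc data model, and the record fields and sample statements are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fact:
    statement: str
    source: str                           # where the value came from
    evidence: str                         # "observed" or "inferred"
    derived_from: Tuple["Fact", ...] = ()


def history(fact):
    """Return the fact followed by everything it was derived from, depth first."""
    chain = [fact]
    for parent in fact.derived_from:
        chain.extend(history(parent))
    return chain


# Hypothetical records: one observation, one inference, one derived result.
observed = Fact("exon found in gene X", "lab result (hypothetical)", "observed")
inferred = Fact("protein Y resembles protein Z", "sequence comparison (hypothetical)", "inferred")
derived = Fact("gene X likely encodes a Z-like protein", "integrated DB (hypothetical)",
               "inferred", derived_from=(observed, inferred))

for f in history(derived):
    print(f"{f.evidence:9} {f.source:36} {f.statement}")
```

Storing the evidence type on each record is what lets an integrated database answer whether a value was observed or only inferred.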

The People Problem
The demand for people in bioinformatics is high, at all levels
• Critical is a lack of
  – training opportunities - programs and teachers
  – available trainees
• Being in a multi-disciplinary field is scary
  – tenure for faculty
  – load for students
  – salary and growth differentials in biology and CS
• Some institutions are moving aggressively
  – must compete with World-Wide Web visions

Bioinformatics: Converting Data to Knowledge
• The means: People
• The product: Information

Up-to-dateness
[Figure: %tage up-to-date (0% to 100%) versus frequency of visits (0, 1, ?, as often as possible), with curves by frequency of source change: never, 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second. Labels: = F(user need); effort, methods; F(capability given 2.2M public sites with 288M pages), Feb. 2000]

Privacy requires Ethics
Knowledge carries responsibilities.
How will people feel about your knowledge about them? Their genetic make-up, physical & psychological propensities.
Privacy is hard to formalize, but that does not mean it is not real to people. Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.

Securing Collaboration
[Figure: a Collaborator sends a source query to a Security Filter, which passes a certified query to the Private Patient Data; the unfiltered result returns to the Security Filter, which releases a certified result to the Collaborator and keeps Logs. Gio Wiederhold, TIHI, Oct96]
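The flow in the figure (certify the query, fetch the unfiltered result, certify what is released, log both steps) can be sketched as a small mediator. Everything specific here is an assumption for illustration: the field whitelist, the minimum-row rule, and the class and method names are hypothetical and are not taken from TIHI.

```python
from dataclasses import dataclass, field

# Hypothetical release policy: only these fields may leave the site.
ALLOWED_FIELDS = {"age_band", "diagnosis", "drug"}


@dataclass
class SecurityFilter:
    records: list                      # private patient data; never returned as-is
    min_rows: int = 2                  # withhold results that cover too few patients
    log: list = field(default_factory=list)

    def certify_query(self, requested):
        blocked = sorted(set(requested) - ALLOWED_FIELDS)
        self.log.append(("query", sorted(requested), "rejected" if blocked else "certified"))
        if blocked:
            raise PermissionError(f"fields not releasable: {blocked}")
        return sorted(requested)

    def certify_result(self, rows):
        ok = len(rows) >= self.min_rows
        self.log.append(("result", len(rows), "certified" if ok else "withheld"))
        return rows if ok else []

    def ask(self, requested):
        fields = self.certify_query(requested)
        unfiltered = [{f: r[f] for f in fields} for r in self.records]
        return self.certify_result(unfiltered)


# Hypothetical patient records.
patients = [
    {"name": "P1", "age_band": "40-49", "diagnosis": "asthma", "drug": "drug-A"},
    {"name": "P2", "age_band": "50-59", "diagnosis": "asthma", "drug": "drug-B"},
]
gate = SecurityFilter(patients)
released = gate.ask(["diagnosis", "drug"])
print(released)
print(gate.log)
```

Checking both the query and the result reflects the two certification steps in the figure, and the log gives the data owner a record of what collaborators asked for and what was released.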
