TITech 13 Nov 2000
Bioinformatics: Converting Data to Knowledge Gio Wiederhold Stanford University Computer Science, E.E. & Medicine http://www-db.stanford.edu/people/gio.html
Data
✖ Aggregation of instances Integration of sources
Knowledge Analyses
Observations Filters
• The product: Information 10/15/08
Gio Wiederhold - TITech 2000
Bio-Information
• to learn about ourselves
  – our origins, our place in the world: Primates, Mice, Zebrafish, Fruit Flies, Roundworms, Yeast
  – modesty, seeing how much we share with all organisms
  – not just of philosophical interest, but also
• to help humanity lead healthy lives
  – to create new scientific methods
  – to create new diagnostics
  – to create new therapeutics

Loops of Data and Knowledge
Information is created at the confluence of data (the state) and knowledge (the ability to select and project the state into the future).
[Figure: a Data Loop and a Knowledge Loop meeting at Decision-making; labels: Storage, Education, Selection, Recording, Integration, Abstraction, Experience, State changes, Action.]

Volume and Variety
Two interacting issues in generating information:
1. The volume is large: we need automation
2. The data is varied & heterogeneous
   • many autonomous sources
   • many distinct objectives
   ➔ many incompatibilities, errors

Quantities
[Figure: quantities in nature set against research progress]
• 1 human; the human genome: ~ 4 000 000 000 base pairs
• > 30 000 genes, ~ 10 000 proteins, diseases: genes and gene abnormalities
• 6 000 000 000 humans: everybody’s genes
• < 1000 systems: metabolic pathways
• ~ 2 000 000 molecules: small organic molecules, which affect proteins and are suitable for drugs

Diversity ➪ Heterogeneity
• A wide variety of knowledge is needed to interpret the data
• A large variety of experts is developing this knowledge
• The scope of interests differs among those experts
• The knowledge is expressed in diverse ways
• The terms differ in precise meaning: semantics
• A large variety of data types is needed
• A wide variety of representations is used
• The database and file schemas differ
• The openness and accessibility of the information differs

Scope differences
A scope difference exists when terms differ in their mapping to real-world objects.
[Figure: within all possible employees, employee (payroll) and employee (personnel) cover different subsets; the disabled and contractors mark the differences.]
The local objective determines scope.
Example: “binding site” in the PDB database [Waugh & Altman]
[Figure: binding sites reported for publication lie within all actual binding sites; doubtful sites are marked separately.]
Reporting doubtful results risks rejection of publication.

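A scope difference can be made concrete with sets: each source's term selects a different subset of the real-world objects. The following is a minimal Python sketch; the names, the memberships, and which group falls under which term are all invented for illustration and are not taken from the slide.

```python
# Hypothetical people; the assignment of groups to scopes is an assumption.
employee_payroll = {"ann", "bob", "carla"}    # e.g. carla is on disability but still paid
employee_personnel = {"ann", "bob", "dev"}    # e.g. dev is a contractor, not on payroll


def scope_difference(first, second):
    """Split the objects covered by two terms into shared and one-sided parts."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


diff = scope_difference(employee_payroll, employee_personnel)
print(diff)
```

Any join between the two sources on "employee" silently drops or mislabels the one-sided members, which is why the scope has to be stated per source.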
Heterogeneity inhibits Integration
• An essential feature of science
  – autonomy of fields
  – differing granularity and scope of focus
  – growth of fields requires new terms
• A feature of technological process
  – standards require stability
  – yesterday’s innovations are today’s infrastructure
Must be dealt with explicitly
  – sharing, integration, and aggregation are essential
  – large quantities of data require precision

Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency
  – Local needs have priority
  – Outside uses are a byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✔ ✔
• Data Representation and Access Conventions ✔
• Metadata: Annotations, Naming, and Ontology ✚
  – needed to share data from distinct sources

Required precision = F(volume)
More precision is needed as data volume increases: a small error rate still leads to too many errors.
• False positives have to be investigated (attractive-looking supplier that makes toys; apparent drug target with poor annotation)
• False negatives cause lost opportunities, suboptimal to some degree
[Figure: data errors versus information quantity, with curves labeled human limit, acceptable limit, and limit with tools?, and the Information Wall; adapted from Warren Powell, Princeton Un.]

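The dependence of required precision on volume is simple arithmetic, sketched below in Python. The 0.1% error rate and the budget of 100 manually reviewable errors are assumed values chosen only for illustration.

```python
def expected_errors(volume, error_rate):
    """Expected number of erroneous results at a fixed per-item error rate."""
    return volume * error_rate


def required_error_rate(volume, reviewable):
    """Largest per-item error rate that keeps errors within a manual review budget."""
    return reviewable / volume


# Assumed values: 0.1% error rate, at most 100 errors a person can investigate.
for volume in (10_000, 1_000_000, 1_000_000_000):
    errors = expected_errors(volume, 0.001)
    rate = required_error_rate(volume, 100)
    print(f"{volume:>13,} items: {errors:>11,.0f} errors at 0.1%; need rate <= {rate:.0e}")
```

With the review budget fixed, the tolerable error rate shrinks in direct proportion to the growth in volume.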
Inconsistency causes errors, while results need precision
False positives (poor precision) typically cost more than false negatives (poor recall).
Example: tRNA search [Todd Lowe]
• Search in Yeast for 55 methylation sites required manual elimination of pseudogenes
• Search space in the human genome is 215 times larger; not yet done
In drug discovery we now have more targets than pharmaceutical companies can afford.

Broad array of relatable sources
• Genomic
• Bibliographic
• Demographic
• Epidemiological
  – Familial
  – Contacts
• Clinical
  – Drug effectiveness
  – Drug-resistance
  – Co-occurrence
[Many are used in data mining, as in the PRM (Probabilistic Relational Model) research by Lise Getoor at Stanford. Requires acyclicity. Use temporal dependencies?]

Intersection of a large (irrelevant data) and a small (good data) distribution
Result: the optimal separation creates more false positives (irrelevant results) than false negatives (good results missed).

Quality of data verified through publication
Data characteristics project [Stephen Koslow, Office on Neuroinformatics, NIMH, www.nimh.nih.gov/neuroinformatics/index.cfg]
• The human brain uses 15 Watts; it has dozens of cell types, 100 billion (10^11) neural cells, and 10^15 connections.
• Neuroscience is a growing field that includes neuroinformatics.
• Initial, broad journals; reductionist journals.
• Numerical, symbolic, literature, and image data.
• The volume of publication is becoming impossible to follow: serotonin alone, discovered in 1948, now has 70 000 papers.
• Voluminous 3-D MRI data: UCLA brain mapping. Basis for localization of diagnostic EEG and MEG observations.

Projects requiring manual curation are domain specific
Virtual Cell Project: Dong-Guk Shin, Univ. Connecticut
[email protected]; also available without DB support, www.nrcam.uchc.edu
• NIH support: physiology modeling. NSF support: computational modeling approach.
• Bottom-up approach to cell modeling, with cross-checking of models and HXs.
• Geometry from segmented images; 2-D visualization of specified reactions: channels, pumps, for extra, intra (cytosol), of core cellular compartments.
• Generates equations for simulation.
• Result is a DB publication cycle, supporting model copying and adaptation.
• Access to remote DBs will need more than a browser: a query system, with join over association.
• DBs need APIs and mediation for scalability and mismatch.

Data integration in Literature
[Jim Garrels, Proteome, Inc., www.proteome.com, free]
• BioKnowledge Library, a portal site with 50 billion bytes of text covering the 5 billion bytes in GenBank.
• Classification, curated by experts.
• Pages: {title with brief functional description, family, properties (Mutant phenotype, )}, sequence annotations, related proteins: Orthologs and Interlogs (in different species) [Marc Vidal, MGH].
• Integrated data from cDNA microarrays and chips, systematic 2-hybrids.
• Model organisms: first Yeast, now Worms [Stuart Kim, Stanford].
• Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and curate their own papers, but submit their results and publish detailed expression studies and update their own results.

Relationships among search parameters
recall r = v.relevant / v.available
precision p = v.relevant / v.retrieved
[Figure: percentage actually relevant (0%, 50%, 100%) plotted over the space of methods, ranked from best; curves for perfect recall, perfect precision, volume retrieved, and volume available.]

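The two ratios can be computed directly from sets of result identifiers. This Python sketch reads v.relevant as the relevant items actually retrieved and v.available as all relevant items available; that reading, and the document identifiers, are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against all relevant items."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}                               # hypothetical relevant documents
narrow = {"d1", "d2"}                                             # cautious search
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}          # permissive search

print(precision_recall(narrow, relevant))   # high precision, partial recall
print(precision_recall(broad, relevant))    # full recall, lower precision
```

Widening the search trades precision for recall, which is the tension the figure plots across the ranked methods.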
Means to achieve precision in text
Textual information (knowledge) complements pure data-oriented searches such as BLAST [Liu & Altman]
• Reduce redundancy
  – omit similar results from alternate sources: reports, workshop papers, journals, books
• Reduce false positives
  – recognize contextual domains *
  * the same word refers to different object types: nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
  – linguistic processing based on customer model: medical case studies have similar formats

Integration makes Semantic Mismatches visible
Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items: { lorry, truck }
  – same terms for dissimilar items: trunk ( luggage, car )
  – differing coverage: vehicles ( DMV, police, AIA )
  – differing granularity: trucks ( shipper, manuf. )
  – different scope: student ( museum fee, Stanford )
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision for interoperation
  – ok for web browsing
  – poor for business and science

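Two of the mismatch types listed above can be detected mechanically once each source's terms are grounded in the objects they refer to. The Python sketch below is illustrative only: the source names, object identifiers, and the use of exact set equality as the test for "similar items" are simplifying assumptions.

```python
def find_mismatches(source_a, source_b):
    """Compare two term -> set-of-objects mappings from autonomous sources."""
    synonyms, homonyms = [], []
    for term_a, objs_a in source_a.items():
        for term_b, objs_b in source_b.items():
            if term_a != term_b and objs_a == objs_b:
                synonyms.append((term_a, term_b))   # differing terms, similar items
            elif term_a == term_b and objs_a.isdisjoint(objs_b):
                homonyms.append(term_a)             # same term, dissimilar items
    return synonyms, homonyms


# Hypothetical grounded vocabularies of two sources.
uk_source = {"lorry": {"veh1", "veh2"}, "trunk": {"case7"}}
us_source = {"truck": {"veh1", "veh2"}, "trunk": {"car_boot3"}}

synonyms, homonyms = find_mismatches(uk_source, us_source)
print(synonyms, homonyms)
```

Undetected synonyms produce the missed linkages and undetected homonyms produce the irrelevant linkages named on the slide.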
Shared Knowledge Base
PharmGKB – PharmacoGenetics Knowledge Base, starting 2000
“An Ontology for Genetic Information” [Russ Altman]
Based at Stanford, funded by NIGMS to link existing projects, but open to others.
Phenotype variation --> Genotype variation
• Phase 2 metabolizing enzymes -- R. Weinshilboum at Mayo Clinic
• Asthma -- Weiss (was Jeff Raizin) at Harvard Un.
• Anti-cancer agents -- Mark Ratain at Un. of Chicago
• Membrane Transporters -- Kathleen Giacomini, UCSF
• Tamoxifen metabolic activation -- Dave Flockhart at Georgetown Un.
• Minority Populations and Privacy -- M. Rothstein at Univ of Houston
• Depression in Mexican-Americans -- J. Licinio at UCLA
• Database Tools -- Prakash Nadkarni at Yale Un.

Complex Relationships
[Figure: network relating Genomic information, Molecules, Drug response systems, Molecular & cellular phenotype, Clinical phenotype, Individuals, and Environment; node and link labels include Coding, Alleles, Molecular variation, Variation in genome, Protein products, Role in organisms, Pharma. activity, Drugs, Isolated functional measures, Integrated functional measures, Observable phenotypes, Physiology, Genetic makeup, Treatment protocols, and Non-genetic factors. Courtesy of R. Altman & Teri Klein, PharmGKB.]

PharmGKB
• Ontology for pharmacogenetics
  – Represented in Protégé [Musen: smi.stanford.edu/project/protege]
• Service for Universities and Industry
• Open access to information and tools, but not a warehouse
  – Industrial affiliates, contributors and consumers at larger scales: geneticXchange, Merck Co, Pharmacia, SmithKline-Beecham (& Glaxo-Wellcome), GeneLogic, Guidant, Doubletwist, Incyte, Informax, SGI, Sun
• Collaboration in larger topics:
  – Biotechnology -- Clark Center
  – Education -- NIH sponsored training program, new UG degrees

Consistency: global or partial?
• Global consistency
  + wonderful for users and their programs
  – too many interacting sources
  – long time to achieve: 2 sources (UAL, LH), 3 (+ trucks), 4, … all ?
  – costly maintenance, since all sources evolve
  – no world-wide authority to dictate conformance
• Domain-specific ontologies (XML DTD assumption)
  + small, focused, cooperating groups
  + high quality, some examples: arthritis, Shakespeare plays
  + allows sharable, formal tools
  + ongoing, local maintenance affecting users: annual updates
  – poor interoperation, users still face inter-domain mismatches
  – periodic source updates need automation in interoperation

Stanford Infolab SKC project (Scalable Knowledge Composition)
Objective: high precision in semantic interoperation of autonomous sources
• Basic (pessimistic) assumption:
  – The ontological mapping of terms ↔ objects differs between autonomous domains.
• But
  – The collections of real-world objects provide a grounding for the definitions, and an opportunity to validate the meaning of the terms being employed.
  – Relationships have semantic and related structural significance.

Exploit Domain-specific Expertise
Knowledge needed is huge in science and in business
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction
[Figure: three groups, each labeled “Society of specialists”]

SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation

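These definitions map naturally onto a small data model. The Python sketch below is one possible encoding, not the SKC implementation; all class and field names are assumptions, and the example objects are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RealWorldObject:
    """An entity instance with a physical manifestation."""
    identifier: str


@dataclass(frozen=True)
class AbstractObject:
    """A concept which refers to other objects."""
    name: str
    refers_to: frozenset = frozenset()


@dataclass(frozen=True)
class Term:
    """A reference: a label that names real-world and abstract objects."""
    label: str
    objects: frozenset = frozenset()


@dataclass(frozen=True)
class Relationship:
    """A named and typed set of links between objects."""
    name: str
    kind: str
    links: frozenset = frozenset()   # pairs of (source object, target object)


@dataclass
class Ontology:
    """A set of terms and their relationships."""
    terms: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


# Hypothetical example: one physical truck grounds the term "lorry".
truck_17 = RealWorldObject("truck-17")
vehicle = AbstractObject("vehicle", frozenset({truck_17}))
lorry = Term("lorry", frozenset({truck_17}))
is_a = Relationship("is-a", "taxonomy", frozenset({(truck_17, vehicle)}))
store = Ontology(terms={lorry}, relationships={is_a})
```

Keeping the grounding objects on each term is what allows two ontologies to be compared by what their terms refer to rather than by spelling alone.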
Sample Operation: INTERSECTION
[Figure: Source Domain 1 (owned and maintained by Store) and Source Domain 2 (owned and maintained by Factory), joined by an Articulation; the result contains shared terms, useful for purchasing.]

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology; keep sharable entries
• Union: create a joint ontology; merge entries
• Difference: create a distinct ontology; remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

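The three operations can be sketched over term sets, with the articulation rules given as pairs of linked terms. This Python sketch is a simplification under stated assumptions: ontologies are reduced to sets of term labels, rules are plain pairs, and the store and factory vocabularies are invented.

```python
def intersection(o1, o2, rules):
    """Subset ontology: keep only entries linked by an articulation rule."""
    return {(a, b) for a, b in rules if a in o1 and b in o2}


def difference(o1, o2, rules):
    """Distinct ontology: remove the entries of o1 that are shared with o2."""
    shared = {a for a, _ in intersection(o1, o2, rules)}
    return o1 - shared


def union(o1, o2, rules):
    """Joint ontology: merge entries, representing each linked pair once."""
    linked = intersection(o1, o2, rules)
    left = {a for a, _ in linked}
    right = {b for _, b in linked}
    return (o1 - left) | (o2 - right) | {f"{a}={b}" for a, b in linked}


# Hypothetical vocabularies and matching rules.
store = {"sneaker", "size", "shelf"}
factory = {"athletic shoe", "size", "assembly line"}
rules = {("sneaker", "athletic shoe"), ("size", "size")}

print(intersection(store, factory, rules))
print(difference(store, factory, rules))
print(union(store, factory, rules))
```

Because every operation takes the rules as an argument, the articulation stays a separate, maintainable artifact rather than being baked into either source.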
INTERSECTION support
[Figure: the Articulation ontology sits between the Store Ontology and the Factory Ontology; it holds terms useful for purchasing and matching rules that use terms from the 2 source domains.]

Other Basic Operations
• UNION: merging entire ontologies
• DIFFERENCE: material fully under local control
[Figure: articulation ontology, typically prior intersections]

Tools to create articulations
[Figure: a graph matcher for the articulation-creating expert takes the Transport ontology and the Vehicle ontology and produces suggestions for articulations.]

continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded. Okays are converted into articulation rules.

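The suggest-and-confirm cycle can be sketched for the spelling-similarity case. This Python sketch uses the standard library's `difflib` as a stand-in matcher; the actual matcher of the SKC tool is not described here, and the terms, the cutoff, and the function names are assumptions.

```python
from difflib import get_close_matches


def suggest_by_spelling(term, candidates, cutoff=0.7):
    """Propose terms from the other ontology whose spelling is close to `term`."""
    return get_close_matches(term, candidates, n=3, cutoff=cutoff)


def record_response(log, rules, pair, response):
    """Record every expert response; only 'okay' becomes an articulation rule."""
    log.append((pair, response))
    if response == "okay":
        rules.add(pair)


# Hypothetical terms from a vehicle ontology.
vehicle_terms = ["truck", "trailer", "cargo_carrier", "driver"]
suggestions = suggest_by_spelling("trucks", vehicle_terms)

log, rules = [], set()
record_response(log, rules, ("trucks", "truck"), "okay")
record_response(log, rules, ("trucks", "trailer"), "false")
print(suggestions, rules, len(log))
```

Logging the rejected pairs as well as the accepted ones means the same suggestion need not be put to the expert twice.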
Candidate Match Nexus
Term linkages automatically extracted from the 1912 Webster’s dictionary *
Based on processing headwords ➽ definitions using algebra primitives
Notice the presence of 2 domains: chemistry, transport
* free; have processed the OED (Oxford English Dictionary) at Stanford for internal use

Using the Match Nexus
Experiment on government structures of NATO countries: the SKEIN system resolved over 70% of unmatched terms.

Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused when sources change

Knowledge Composition
[Figure: knowledge resources A, B, C, D, and E are linked pairwise by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E); the composed knowledge for applications using A, B, C, E is (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E).]
Legend: ∪ : union, ∩ : intersection

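The composed expression, and the fact that an algebra allows it to be rearranged, can be checked with plain sets. In this Python sketch the knowledge resources are reduced to invented sets of term labels matched by identical spelling, which is a strong simplification of articulation by rules.

```python
# Hypothetical term sets standing in for knowledge resources A, B, C, and E.
A = {"gene", "protein", "pathway"}
B = {"protein", "pathway", "drug"}
C = {"drug", "pathway", "dose"}
E = {"dose", "drug", "patient"}

# Composition as drawn: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
composed = (A & B) | (B & C) | (C & E)

# Rearrangement by distributivity: (A ∩ B) ∪ (B ∩ C) = B ∩ (A ∪ C)
rearranged = (B & (A | C)) | (C & E)

print(sorted(composed))
print(composed == rearranged)
```

The two arrangements give the same result but touch the sources in a different order, which is the opening for evaluating alternatives and optimizing.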
Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable: Empowerment
* based on experience with software

Summary: Scalable Knowledge Composition (SKC)
Provide for Maintainable Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts

Many Other Tasks at/near Stanford
• Matching cell / protein 3D with chemical’s 3D
• Regulatory gene motifs: Bioprospector [Brutlag & Liu, www-cmgm.stanford.edu]
• Protein structure generation: moving from small to larger proteins
  1. Powerful parallel processing [IBM BlueGene]
  2. Two-level: use features as an intermediate (alpha-helix, beta-sheets, …)
  3. Protein folding speedup by delegation [Shirts & Pande: foldingathome.stanford.edu]
• RNA folding (simpler, larger) [Nakatani & Pande]

Provenance of derived data
Assure having a proper history of derived results
• K2 integration tool [Peter Buneman, UPenn, www.humgen.upenn.edu]: integrated databases often don’t indicate the original sources; e.g., SwissProt does not distinguish inferred from observed data.
• Flybase [William Gelbart, Harvard University]: Flybase also collects data as exons and their mutations, and transposon insertion sites. Science is moving from hunter-gatherers to harvesters, toward an agronomical society. Classical genomics is being superseded by expression and interaction of gene products and gene perturbation.
• EcoCyc [Peter Karp, SRI Int., Bioinformatics Res. Group, www.ai.sri.com/pkarp/]: EcoCyc links proteins to 150 metabolic pathways in E. coli. Databases are supplanting journals: they are re-analyzable; results in journals are not.
An estimated 500 public databases now serve bioinformatics; not all of them have APIs or use real DBMSs, and they have differing models and units of measurement, leading to semantic problems.

The People Problem
The demand for people in bioinformatics is high, at all levels
• Critical is a lack of
  – training opportunities: programs and teachers
  – available trainees
• Being in a multi-disciplinary field is scary
  – tenure for faculty
  – load for students
  – salary and growth differentials in biology and CS
• Some institutions are moving aggressively
  – must compete with World-Wide Web visions

Bioinformatics: Converting Data to Knowledge
• The means: People
• The product: Information

Up-to-dateness
[Figure: percentage up-to-date (0%, 50%, 100%) plotted against frequency of visits (0, 1, ?, as often as possible), with one curve per frequency of source change: never, 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second. Annotations: ∫ = F(user need), effort, methods; F(capability given 2.2M public sites with 288M pages), Feb. 2000.]

Privacy requires Ethics
Knowledge carries responsibilities.
How will people feel about your knowledge about them: their genetic make-up, their physical & psychological propensities?
Privacy is hard to formalize, but that does not mean it is not real to people. Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.

Securing Collaboration
[Figure: a Collaborator sends a source query to the Security Filter, which passes a certified query to the Private Patient Data; the unfiltered result returns to the Security Filter, which releases a certified result to the collaborator and writes Logs. From Gio Wiederhold, TIHI, Oct 96.]

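The flow through the security filter (query check, unfiltered result, result check, logging) can be sketched in a few lines of Python. This is not the TIHI implementation: the record layout, the field-based release rule, and all function names are assumptions made for illustration.

```python
def certify_query(query, allowed_fields):
    """Accept a source query only if it asks for fields cleared for release."""
    return set(query["fields"]) <= allowed_fields


def certify_result(rows, requested_fields, allowed_fields):
    """Release only requested fields that are also cleared, dropping the rest."""
    keep = [f for f in requested_fields if f in allowed_fields]
    return [{f: row[f] for f in keep} for row in rows]


def mediate(query, patient_data, allowed_fields, log):
    """Run one collaborator query through the filter, logging every step."""
    log.append(("query", query["fields"]))
    if not certify_query(query, allowed_fields):
        log.append(("rejected", query["fields"]))
        return []
    unfiltered = [r for r in patient_data if r["diagnosis"] == query["diagnosis"]]
    certified = certify_result(unfiltered, query["fields"], allowed_fields)
    log.append(("released", len(certified)))
    return certified


# Hypothetical private records and release policy.
patients = [
    {"name": "P. One", "diagnosis": "asthma", "age": 34},
    {"name": "P. Two", "diagnosis": "diabetes", "age": 51},
]
allowed = {"diagnosis", "age"}
log = []
ok = mediate({"fields": ["age"], "diagnosis": "asthma"}, patients, allowed, log)
denied = mediate({"fields": ["name"], "diagnosis": "asthma"}, patients, allowed, log)
print(ok, denied, log)
```

Checking both the query and the result, and logging both, keeps the private data owner in control without requiring trust in the collaborator's software.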