Generic Model Management: A Database Infrastructure for Schema Manipulation Philip A. Bernstein Microsoft Research Sept. 15, 2003
© 2003 Microsoft Corporation
1
Meta Data Management
Meta data = structural information
DB schema, interface defn, web site map, form defns, … UML Architecture
Forms ER Diagram Customer
Order
Product
Scheduled Delivery
Salesperson
Business Rules Emp.Sal < Emp.Mgr.Sal
Business Process Update Marketing
Authorize Credit
Order Entry
Bill Customer
Table Defns
Schedule Delivery Inventory
VB interfaces
Sept. 15, 2003
C++ interfaces
© 2003 Microsoft Corporation
2
Meta Data Problems
They all involve schemas and mappings E.g., data translation between data models Hierarchical Schema PO PO # POdate
Relational Schema PurchaseOrder OrdIDOrderDate
POLines Prod# PName Sept. 15, 2003
Items Item#OrdIDI_Name © 2003 Microsoft Corporation
3
Such Problems are Pervasive
Data translation
Schema evolution & data migration
XML message translation for e-commerce
Integrate custom apps with commercial apps
Data warehouse loading (clean & transform)
Design tool support (DB, UML, …)
OO or XML wrapper generation for SQL DB
Semantic web
Sept. 15, 2003
© 2003 Microsoft Corporation
4
Meta Data Solutions
Solutions strongly resemble each other, but usually are problem-specific usually are language-specific SQL, ODMG, UML, XML, RDF, …. usually involve a lot of object-at-a-time programming
Goals
Generic solutions “Set”-at-a-time programming
Sept. 15, 2003
© 2003 Microsoft Corporation
5
Model Management
A generic approach to meta data mgmt
Model Mgmt operators manipulate models and mappings as bulk objects
Their representation is generic
Operators - Match, Merge, Diff, Compose
Avoids problem-specific and languagespecific solutions
Avoids object-at-a-time programming
Sept. 15, 2003
© 2003 Microsoft Corporation
6
Models and Mappings A model is a rooted directed graph, which represents a complex information structure. Relational Emp Schema
map1
E#
Emp E#
Dept#
Dept#
Name
Name
A mapping is a model that represents a transformation Sept. 15, 2003 © 2003 Microsoft Corporation
XSD
First Last 7
Models and Mappings A model is a rooted directed graph, which represents a complex information structure. map1
Relational Emp Schema
Emp E#
E# Dept#
Dept#
Name
Name
Or it could be a binary table (a morphism) Sept. 15, 2003
XSD
© 2003 Microsoft Corporation
First Last 8
Model Mgmt Algebra
map = Match (M1, M2)
<M3, map13, map23> = Merge (M1, M2, map)
map3 = Compose(map1, map2)
<M2, map12> = Diff(M1, map)
<M2, map12> = ModelGen(M1, metamodel2)
Sept. 15, 2003
© 2003 Microsoft M = Copy( M )Corporation
9
Outline Introduction to Model Management Using MM to solve meta data problems Matching anatomy ontologies Model merging Wrap-up
Sept. 15, 2003
© 2003 Microsoft Corporation
10
Categorizing Meta Data Problems
Model mapping
M1
M2
Data translation
XML message translation for e-commerce
Integrate custom apps with commercial apps Data warehouse loading (clean & transform)
map12
Solution is the match “operator”
Sept. 15, 2003
Really a CAD system for mapping generation
© 2003 Microsoft Corporation
11
Categorizing M D Problems (2)
Model integration
map12
M1 m
ap
23
ap
M2
m
13
M3
View integration
Data integration
Solution is the Merge operator
Sept. 15, 2003
© 2003 Microsoft Corporation
12
Categorizing M D Problems (3)
Model and mapping generation
M1
map12
M2
Design tools (ER → SQL)
Wrapper generation (SQL → OO or XML)
Solution is the ModelGen operator
<M2, map12> = ModelGen(M1, metamodel2)
Sept. 15, 2003
© 2003 Microsoft Corporation
13
Categorizing M D Problems (4)
Change propagation
M1
map12
M2
M1′
map12
M2′
Schema evolution
Required maintenance for all meta data problems
Solution requires the rest of MM algebra
Sept. 15, 2003
© 2003 Microsoft Corporation
14
Change Propagation
Given
map1 between xsd1 and SQL schema rdb1 xsd2, a modified version of xsd1
Produce
rdb2 to store instances of xsd2 a mapping between xsd2 and rdb2
1. map2
xsd1
map1 m . 2
rdb1
2. map3 = map2 • map1
3 p a
3. <map4, rdb3 > = Copy(map3)
map4 rdb3 rdb2 xsd2 3.map Sept. 15, 2003
1. map2= Match(xsd1, xsd2)
Now we need to merge Diff(xsd2,map4) into rdb3
© 2003 Microsoft Corporation
15
Change Propagation (cont’d) map1 m . 2
3 p a
xsd2 3. map4 4. map5
rdb1
7. = Merge(rdb3, rdb4, map7) rdb3 7. m 6. map7
1. map2
xsd1
4. <xsd2′, map5> = Diff(xsd2,map4) 5. = ModelGen(xsd2′, SQL) 6. map7 = map4 • map5 • map6
xsd2′ 5. map6 rdb4 7. Sept. 15, 2003
ap
9
rdb2
p8 a m
© 2003 Microsoft Corporation
16
Complete Script in Rondo Operator Definition: PropagateChanges(s1, d1, s1_d1, s2, c, s2_c) 1. s1_s2 = Match(s1, s2); 2. 〈d1′, d1′_d1〉 = Delete(d1, Traverse(All(s1) − Domain(s1_s2), s1_d1)); 3. 〈c′, c′_c〉 = Extract(c, Traverse(All(s2) − Range(s1_s2), s2_c)); 4. c′_d1′ = c′_c ∗ Invert(s2_c) ∗ Invert(s1_s2) ∗ s1_d1 ∗ Invert(d1′_d1); 5. 〈d2, c′_d2, d1′_d2〉 = Merge(c′, d1′, c′_d1′); •
s2_d2 = s2_c ∗ Invert(c′_c) ∗ c′_d2 + Invert(s1_s2) ∗ s1_d1 ∗ Invert(d1′_d1) ∗ d1′_d2;
7. return 〈d2, s2_d2〉; Operator Use: SQLXSD: PropagateChanges(s1, d1, s1_d1, s2, ModelGen(s2, XSD));
Sept. 15, 2003
© 2003 Microsoft Corporation
17
Status Report
Previous scenario is executable in Rondo, the first complete MM prototype
There are many prototypes for Match
[Rahm & Bernstein, VLDB J., Dec. 2001]
Detailed design for Merge
[Melnik et al, SIGMOD 2003]
[Pottinger & Bernstein, VLDB 2003]
There are several efforts on a formal semantics for MM operators
Sept. 15, 2003
© 2003 Microsoft Corporation
18
Outline Introduction to Model Management Using MM to solve meta data problems Matching anatomy ontologies Model merging Wrap-up
Sept. 15, 2003
© 2003 Microsoft Corporation
19
Schema Matching Algorithms
About a dozen published algorithms
Many good ideas, but none are robust
Schema-based vs. content-based Per-element vs. structural Linguistic vs. constraint-based Independently-developed schemas vs. incrementally-modified schemas Hybrid vs. composite Human review and input is essential
User interface is also quite important
Sept. 15, 2003
© 2003 Microsoft Corporation
20
Matching Anatomy Ontologies Match
two human anatomy ontologies
FMA
– Univ. of Washington Galen CRM – Univ. of Manchester (UK) By Peter Mork (Univ. of Washington) Both models are big Ultimate
goal was finding differences Like most match algorithms, ours calculates a similarity score for the m× n pairs of elements Sept. 15, 2003
© 2003 Microsoft Corporation
21
Aligning Representations generic part
FMA: generic Cardiac Heart part valve
Heart
CRM: Heart Heart sensibly hasStructuralComponent ValveInHeart sensibly Sept. 15, 2003
© 2003 Microsoft Corporation
Cardiac valve
h-S-C
Valve In Heart 22
Anatomy Matching Algorithm Lexical Match
1. •
• • •
Normalize string, UMLS dictionary lookup, convert to concept-ID from thesaurus String comparison → 306 matches Adding spaces, ignoring case → 1834 matches Lexical tools → 3503 matches
Sept. 15, 2003
© 2003 Microsoft Corporation
23
Anatomy Matching Example S: similarity score
generic part
S = 2/15
Heart
Cardiac valve
Heart
h-S-C
sensibly
Valve In Heart
S=1
Sept. 15, 2003
© 2003 Microsoft Corporation
S=1
24
Anatomy Matching Algorithm Lexical Match
1. •
Structure Match
2. • •
•
Normalize string, UMLS dictionary lookup, convert to concept-ID from thesaurus Similarity(reified nodes) = Average(neighbors) Back-propagate to neighbors
Adds 64 matches (to previous 3503) • Implies 875 reified relationship matches Sept. 15, 2003
© 2003 Microsoft Corporation
25
Anatomy Matching Example S: similarity score
generic part Cardiac valve
Heart S=1
Sept. 15, 2003
S = 2/15
S = 2/3 Heart
h-S-C
sensibly
Valve In Heart
© 2003 Microsoft Corporation
S=1
26
Anatomy Matching Algorithm Lexical Match
1. •
Structure Match
2. • •
Similarity(reified nodes) = Average(neighbors) Back-propagate to neighbors
Align Super-classes
3. • •
Normalize string, UMLS dictionary lookup, convert to concept-ID from thesaurus
Super-class similarity = average similarity of children, grandchildren, great-grandchildren Adds 213 matches (to 3567)
Sept. 15, 2003
© 2003 Microsoft Corporation
27
Some Lessons
A common encoding of models is hard and involves compromises
Match needs to invent generalizations
Different styles of reifying relationships CRM stores transitive relationships In FMA, arterial supply, venous drainage, nerve supply, lymphatic drainage In CRM, these all map to isServedBy
On big models, Match is expensive
Some steps required days to execute Cross-product filled 80 GB (< 1GB input). Sept. 15, 2003 © 2003 Microsoft Corporation
28
Outline Introduction to Model Management Using MM to solve meta data problems Matching anatomy ontologies Model merging Wrap-up
Sept. 15, 2003
© 2003 Microsoft Corporation
29
Merge(M1, M2, map) Return the union of models M1 and M2
Use map to guide the Merge If elements x = y in map, then collapse them into one element
⇒
Emp
map
Emp
AddrName
=
NamePhone
Sept. 15, 2003
© 2003 Microsoft Corporation
Emp Addr NamePhone
30
Merge(M1, M2, map) [Buneman,
Meta-model has aggregation & generalization only Union, and collapse objects having the same name Fix-up step for inconsistencies created by merging X a Y
Davidson, Kosky, EDBT 92]
a
a
X a
Y
Z
W
X a Z
X
Y
Z
Successive fixups lead to different results Batch them at the end, to get a unique minimal result Now enrich the meta-model (containment, complex mappings, …) & merge semantics (conflicts, deletes) Sept. 15, 2003
© 2003 Microsoft Corporation
31
Resolving Merge Conflicts
Meta Meta Model Conflict Model Conflict
1
EmployeeID
Name
2
FirstName
3
6
Sept. 15, 2003
Employee
Emp#
5
Meta Model Conflict
mapee
Emp
7
4
LastName
Emp Emp# Name FirstNameLastName © 2003 Microsoft Corporation
8 9 10 11 32
Contributions to Merge
[Pottinger & Bernstein, VLDB 03]
Generic correctness criteria for Merge Use of first-class input mapping (not just correspondences) Taxonomy of conflicts & resolution strategies Characterize when Merge can be automatic A merge algorithm for an EER representation Experimental evaluation
Sept. 15, 2003
© 2003 Microsoft Corporation
33
An Approach to ModelGen [Atzeni & Torlone, EDBT '96]
Meta-models are made of patterns Object has sub-object (a)
Aggregation has key (c)
Define pattern transformations as rules
Aggregation has attributes (b)
For XSD→SQL,
+
To translate Ms into meta-modelt (MMt),
Apply rules that replace patterns in MS that are not in MMt by patterns that are in MMt
Sept. 15, 2003
© 2003 Microsoft Corporation
34
ModelGen Research
More complete repertoire of patterns Make patterns more generic Integrate with rules engine (avoid cycles, control search) Implement it
Sept. 15, 2003
© 2003 Microsoft Corporation
35
What Next?
Add semantics to mappings Thorough formal semantics of operators Industrial strength schema matching More and bigger applications More prototypes More operators Better user interfaces
Sept. 15, 2003
© 2003 Microsoft Corporation
36
References
http://www.research.microsoft.com/~philbe Overview
Implementation
Survey: Rahm & Bernstein , VLDB J., Dec. 2001 Prototype: Madhavan, Bernstein, & Rahm, VLDB 2001
Merge Operation
Bernstein & Rahm, ER 2000
Match Operation
Melnik, Rahm, & Bernstein, SIGMOD 2003
Data Warehouse Examples
Bernstein, CIDR 2003 Bernstein, Halevy, & Pottinger, SIGMOD Record, Dec. 2000
Pottinger & Bernstein, VLDB 200337
Theory
Alagić & Bernstein, DBPL 2001 Madhavan et al, AAAI 2002
Sept. 15, 2003
© 2003 Microsoft Corporation
37
Sept. 15, 2003
© 2003 Microsoft Corporation
38