Nested Mappings: Schema Mapping Reloaded P. Papotti Universita’ Roma Tre
M.A. Hernandez - H. Ho - L. Popa A. Fuxman - R.J. Miller IBM Almaden Research Center
University of Toronto
The Problem of Mapping Generation
Schemas can be arbitrarily different
E.g., different normalization & naming, missing/extra elements
Input: correspondences between atomic schema elements
Logical and declarative expressions of relationships between schemas.
(Automatic discovery)
Abstraction for data interoperability tasks Simpler than actual implementations of data exchange (SQL/XQuery/XSLT)
Must generate transformation that:
Preserves data relationships: pname-dname, pname-ename, etc.
Creates new target values (pid)
Produces “correct” groupings
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
2
Outline
Schema mapping generation
[VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)
From basic to nested:
Issues with basic mappings Nested mappings and their advantages
Conclusion
03/22/09
Generation algorithm Performance impact
Related work Future directions Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
3
Schema Mapping Generation Source schema S
Schema Correspondences
Target schema T
Mappings
Source Concepts (relational views)
Step 1. Extraction of “concepts” (in each schema).
Concept = one category of data that can exist in the schema
Step 2. Mapping generation
03/22/09
Target Concepts (relational views)
Enumerate all non-redundant maps between pairs of concepts
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
4
Example
The concept of “project of a department”
m1
m1 maps proj to dept-projects
proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]
m 1: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname m 2: ∀(p0 in proj) ∀(e0 in p0.emps) ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn) w.pid = p.pid expression for ∧ p0.dname = d.dname dept-empsworksOn∧ p0.pname = p.pname projects ∧ e0.ename = e.ename ∧ e0.salary = e.salary
03/22/09
m2
dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]
pid
] projects: projects Set [ pid pname ] ]
m2 maps proj-emps to dept-emps-worksOn-projects The concept of “project of an employee of a department”
Two ‘basic’ mappings (or source-to-target tgds or GLAV formulas) Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
5
Outline
Schema mapping generation
[VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)
From basic to nested:
Issues with basic mappings Nested mappings and their advantages
Conclusion
03/22/09
Generation algorithm Performance impact
Related work Future directions
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
6
Issue 1: Many Small Uncorrelated Formulas m1
proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]
m2
dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]
pid
] projects: projects Set [ pid pname ] ]
m1: “for every proj tuple there must be dept and project tuples such that …“
m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project … “
If we also had dependents under employees, then: “for every dependent of an emp of a proj … “ and so on … There is a lot of common mapping behavior that is repeated
03/22/09
E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”) Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
7
Issue 2: Redundancy in the Generated Data m1
Input:
proj: proj Set [ dname CS pname uSearch emps: emps Set [ { ename Alice John salary 120K, 90K ] } ]
m2
dept: dept Set [ dname CS budget B1 emps: emps Set [ ename {} salary worksOn: worksOn Set [ ]
Possible output:
pid
] projects: projects Set [ pid pname ] ]
03/22/09
CS B3
{ Alice 120K
{ John 90K
} { X1 uSearch }
Required to exist based on m1
CS B2
{ X2 }
{ X2 uSearch }
}
{ X3 }
{ X3 uSearch }
Required to exist based on m2
m2 repeats the mapping behavior of m1:
“duplicate” dept and project tuples
“duplicate” nulls (pid values: X2 and X3, and budget values)
Moreover, this duplication happens for each joining emp tuple in the source
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
8
Issue 3: No Grouping in the Target m1
Input:
proj: proj Set [ dname CS pname uSearch emps: emps Set [ { ename Alice John salary 120K, 90K ] } ]
m2
dept: dept Set [ dname CS budget B1 emps: emps Set [ ename {} salary worksOn: worksOn Set [ ]
Possible output:
pid
] projects: projects Set [ pid pname ] ]
CS B2
03/22/09
Required to exist based on m2
Alice and John are in different singleton sets (E and E’)
{ X2{}X2} { X3 {} X3 } } }
{ X2 { X3 { X3 uSearch } uSearch }uSearch }
Required to exist based on m1
CS B3
{ Alice, {John { Alice John 120K 120K, 90K 90K }
{ X1 uSearch }
CS B2
There can be as many singleton sets as emp tuples in the source nested set
It is desirable to enforce the grouping on the target data
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
9
Summary of issues
Fragmentation of the specification
(Too) many small tgds
Fragmentation of the data
03/22/09
Generate redundant data (which later needs to be removed or fused) No grouping enforced on the target data (need additional phase to enforce any grouping)
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
10
Idea
m1
proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]
03/22/09
m2
dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]
pid
] projects: projects Set [ pid pname ] ]
We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts.
Make m2 assert only the “extra” information Also accumulate the corresponding employees into one set
Idea: Correlate the mapping formulas based on their common part
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
11
Correlating Mapping Formulas m1: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname proj tuples mapped only once
Submapping, correlated to the parent mapping
m2: ∀(p0 in proj) ∀(e0 in p0.emps) ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn) w.pid=p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary Replace with
n: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname ∧ [ ∀(e0 in p0.emps) For every proj tuple, ∃(e in d.emps) ∃(w in e.worksOn) we map all employees, w.pid=p.pid as a group. ∧ e0.ename = e.ename ∧ e0.salary = (Source grouping is e.salary preserved) ]
This is a nested mapping
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
12
Advantages of Nested Mappings
Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema
e.g., proj-emps is a “subconcept” of proj, in the source schema Map higher concept only once; use submappings for subconcepts
proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]
Nested mappings are strictly more expressive: There is no set of source-to-target tgds that is equivalent to n.
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
13
Nesting Algorithm: Sketch
Step 1. Discovery: construct a DAG of basic mapping based on the concepts hierarchy Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step hinted before.
03/22/09
We get a forest of nested mappings
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
14
Nesting Algorithm: Example proj: proj Set of [ dname pname emps: emps Set of [ ename salary ] ]
dept: dept Set of [ dname budget emps: emps Set of [ ename salary worksOn: worksOn Set of [ ]
]
P
X
D
PE
DE
pid
] projects: projects Set of [ pid pname ]
for p in proj exists d’ in dept, p’ in d’.projects where d’.dname=p.dname and p’.pname=p.pname and
DEPW
A DAG of basic mappings
( for e in p.emps exists e’ in d’.emps, w in e’.worksOn where w.pid=p’.pid and e’.ename=e.ename and e’.salary=e.salary ) 03/22/09
DP
PDP
PEDEPW
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
15
Experimental evaluation
Goal: show empirically that nested mappings can dramatically:
reduce the cost of producing a target instance improve the quality of the generated data
DBLP-like schema, on both source and target, with four levels of nesting/grouping:
authors – level 1 conferences – level 2 years – level 3 publications – level 4
Mappings are implemented by generating queries (in XQuery)
Qbasic based on basic mappings
Qnested based on nested mappings
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
16
Example Queries – 2 Levels Only Qbasic
Multiple query terms (one per basic mapping)
let $doc0 := fn:doc("instance.xml") return
{ for $x0 in $doc0/authorDB/author, $x1 in $x0/conf return { $x0/name/text() } { for $x0L1 in $doc0/authorDB/author, $x1L1 in $x0L1/conf where $x0/name/text()=$x0L1/name/text() return { $x1L1/name/text() } } } { for $x0 in $doc0/authorDB/author return { $x0/name/text() } }
Need re-grouping (over entire data) Generate duplicates
03/22/09
Qnested let $doc0 := fn:doc("instance.xml") return
{ for $x0 in $doc0/authorDB/author return { $x0/name/text() } { for $x1 in $x0/conf return { $x1/name/text() } } }
Single pass over the data No duplicates
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
17
Execution time comparison •
Qbasic execution time / Qnested execution time
• Logarithm scale
Nested query execution time
Basic query execution time /
10000
1020 KB 514 KB
1000
312 KB
100
Execution time for basic: 22 minutes Execution time for nested: 1.1 seconds
10
1 1
2
3
4
Ne sti n g Le ve l
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
18
Output file size comparison •
Qbasic output file size / Qnested output file size
• Logarithm scale 100,0 Basic query output size / Nested query output size
2111 KB 514 KB 312 KB 10,0
• Size of generated data for basic (including duplicates): 45MB • Size of generated data for nested: 552KB
1,0 1
2
3
4
Ne sting Le ve l
The nested mapping results in much more efficient execution with less redundant data
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
19
Related work
Both embedded mappings [Melnik et al. SIGMOD’05] and HePTox [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings. Nested mappings are less general than languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program The generation algorithm identifies common expressions within mappings: same spirit of work in query optimization [e.g., Roy et al. SIGMOD’00].
But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (do not preserve query equivalence).
There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX) but most of the mapping generation work is manual.
03/22/09
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
20
Conclusion
Nested tgds: better specification language for transformation
Use correlation (hierarchy) between concepts
Less redundancy in the output, more efficient
Naturally preserve source grouping
For more complex mappings we expose Skolem functions to let users alter the default grouping behavior
Nested tgds are more compact and easier to understand/program
03/22/09
Humans think top-down: map top concepts, then submappings, etc.
Can be generated too !
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
21
Future Directions
Extend existing solutions to use nested mappings
Data integration, mapping analysis and reasoning, schema evolution, etc. Nested tgds are more complex as a logic formalism !
Study the formal foundation of nested mappings More generally, develop methods for deciding when and why is a schema mapping specification “better” than another
Need to look at issues such as:
03/22/09
preservation of the source data (associations, correlations, etc.) minimization of incompleteness
Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti
22