Nested Mappings_schema Mapping Reloaded

  • Uploaded by: vthung
  • 0
  • 0
  • April 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Nested Mappings_schema Mapping Reloaded as PDF for free.

More details

  • Words: 2,174
  • Pages: 22
Nested Mappings: Schema Mapping Reloaded P. Papotti Universita’ Roma Tre

M.A. Hernandez - H. Ho - L. Popa A. Fuxman - R.J. Miller IBM Almaden Research Center

University of Toronto

The Problem of Mapping Generation 

Schemas can be arbitrarily different 

E.g., different normalization & naming, missing/extra elements



Input: correspondences between atomic schema elements 



Logical and declarative expressions of relationships between schemas. 





(Automatic discovery)

Abstraction for data interoperability tasks Simpler than actual implementations of data exchange (SQL/XQuery/XSLT)

Must generate transformation that: 

Preserves data relationships: pname-dname, pname-ename, etc.



Creates new target values (pid)



Produces “correct” groupings

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

2

Outline 

Schema mapping generation 



[VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)

From basic to nested:  

Issues with basic mappings Nested mappings and their advantages  



Conclusion  

03/22/09

Generation algorithm Performance impact

Related work Future directions Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

3

Schema Mapping Generation Source schema S

Schema Correspondences

Target schema T

Mappings

Source Concepts (relational views) 

Step 1. Extraction of “concepts” (in each schema). 



Concept = one category of data that can exist in the schema

Step 2. Mapping generation 

03/22/09

Target Concepts (relational views)

Enumerate all non-redundant maps between pairs of concepts

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

4

Example 

The concept of “project of a department”

m1

m1 maps proj to dept-projects

proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]

m 1: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname m 2: ∀(p0 in proj) ∀(e0 in p0.emps) ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn) w.pid = p.pid expression for ∧ p0.dname = d.dname dept-empsworksOn∧ p0.pname = p.pname projects ∧ e0.ename = e.ename ∧ e0.salary = e.salary 

03/22/09



m2

dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]

pid

] projects: projects Set [ pid pname ] ]

m2 maps proj-emps to dept-emps-worksOn-projects The concept of “project of an employee of a department”

Two ‘basic’ mappings (or source-to-target tgds or GLAV formulas) Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

5

Outline 

Schema mapping generation 



[VLDB’02] Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)

From basic to nested:  

Issues with basic mappings Nested mappings and their advantages  



Conclusion  

03/22/09

Generation algorithm Performance impact

Related work Future directions

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

6

Issue 1: Many Small Uncorrelated Formulas m1

proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]

m2

dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]

pid

] projects: projects Set [ pid pname ] ]



m1: “for every proj tuple there must be dept and project tuples such that …“



m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project … “





If we also had dependents under employees, then: “for every dependent of an emp of a proj … “ and so on … There is a lot of common mapping behavior that is repeated 

03/22/09

E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”) Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

7

Issue 2: Redundancy in the Generated Data m1

Input:

proj: proj Set [ dname CS pname uSearch emps: emps Set [ { ename Alice John salary 120K, 90K ] } ]

m2

dept: dept Set [ dname CS budget B1 emps: emps Set [ ename {} salary worksOn: worksOn Set [ ]

Possible output:

pid

] projects: projects Set [ pid pname ] ]



03/22/09

CS B3

{ Alice 120K

{ John 90K

} { X1 uSearch }

Required to exist based on m1 

CS B2

{ X2 }

{ X2 uSearch }

}

{ X3 }

{ X3 uSearch }

Required to exist based on m2

m2 repeats the mapping behavior of m1: 

“duplicate” dept and project tuples



“duplicate” nulls (pid values: X2 and X3, and budget values)

Moreover, this duplication happens for each joining emp tuple in the source

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

8

Issue 3: No Grouping in the Target m1

Input:

proj: proj Set [ dname CS pname uSearch emps: emps Set [ { ename Alice John salary 120K, 90K ] } ]

m2

dept: dept Set [ dname CS budget B1 emps: emps Set [ ename {} salary worksOn: worksOn Set [ ]

Possible output:

pid

] projects: projects Set [ pid pname ] ]

CS B2

03/22/09

Required to exist based on m2

Alice and John are in different singleton sets (E and E’) 



{ X2{}X2} { X3 {} X3 } } }

{ X2 { X3 { X3 uSearch } uSearch }uSearch }

Required to exist based on m1



CS B3

{ Alice, {John { Alice John 120K 120K, 90K 90K }

{ X1 uSearch }

CS B2

There can be as many singleton sets as emp tuples in the source nested set

It is desirable to enforce the grouping on the target data

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

9

Summary of issues 

Fragmentation of the specification 



(Too) many small tgds

Fragmentation of the data 



03/22/09

Generate redundant data (which later needs to be removed or fused) No grouping enforced on the target data (need additional phase to enforce any grouping)

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

10

Idea

m1

proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]

03/22/09

m2

dept: dept Set [ dname budget emps: emps Set [ ename salary worksOn: worksOn Set [ ]





pid

] projects: projects Set [ pid pname ] ]

We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts.





Make m2 assert only the “extra” information Also accumulate the corresponding employees into one set

Idea: Correlate the mapping formulas based on their common part

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

11

Correlating Mapping Formulas m1: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname proj tuples mapped only once

Submapping, correlated to the parent mapping

m2: ∀(p0 in proj) ∀(e0 in p0.emps) ∃(d in dept) ∃(p in d.projects) ∃(e in d.emps) ∃(w in e.worksOn) w.pid=p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary Replace with

n: ∀(p0 in proj) ∃(d in dept) ∃(p in d.projects) p0.dname = d.dname ∧ p0.pname = p.pname ∧ [ ∀(e0 in p0.emps) For every proj tuple, ∃(e in d.emps) ∃(w in e.worksOn) we map all employees, w.pid=p.pid as a group. ∧ e0.ename = e.ename ∧ e0.salary = (Source grouping is e.salary preserved) ]

This is a nested mapping

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

12

Advantages of Nested Mappings 

Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema 





e.g., proj-emps is a “subconcept” of proj, in the source schema Map higher concept only once; use submappings for subconcepts

proj: proj Set [ dname pname emps: emps Set [ ename salary ] ]

Nested mappings are strictly more expressive: There is no set of source-to-target tgds that is equivalent to n.

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

13

Nesting Algorithm: Sketch 



Step 1. Discovery: construct a DAG of basic mapping based on the concepts hierarchy Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step hinted before. 

03/22/09

We get a forest of nested mappings

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

14

Nesting Algorithm: Example proj: proj Set of [ dname pname emps: emps Set of [ ename salary ] ]

dept: dept Set of [ dname budget emps: emps Set of [ ename salary worksOn: worksOn Set of [ ]

]

P

X

D

PE

DE

pid

] projects: projects Set of [ pid pname ]

for p in proj exists d’ in dept, p’ in d’.projects where d’.dname=p.dname and p’.pname=p.pname and

DEPW

A DAG of basic mappings

( for e in p.emps exists e’ in d’.emps, w in e’.worksOn where w.pid=p’.pid and e’.ename=e.ename and e’.salary=e.salary ) 03/22/09

DP

PDP

PEDEPW

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

15

Experimental evaluation 

Goal: show empirically that nested mappings can dramatically:  



reduce the cost of producing a target instance improve the quality of the generated data

DBLP-like schema, on both source and target, with four levels of nesting/grouping:    



authors – level 1 conferences – level 2 years – level 3 publications – level 4

Mappings are implemented by generating queries (in XQuery) 

Qbasic based on basic mappings



Qnested based on nested mappings

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

16

Example Queries – 2 Levels Only Qbasic

Multiple query terms (one per basic mapping)

 

let $doc0 := fn:doc("instance.xml") return { for $x0 in $doc0/authorDB/author, $x1 in $x0/conf return { $x0/name/text() } { for $x0L1 in $doc0/authorDB/author, $x1L1 in $x0L1/conf where $x0/name/text()=$x0L1/name/text() return { $x1L1/name/text() } } } { for $x0 in $doc0/authorDB/author return { $x0/name/text() } }

Need re-grouping (over entire data) Generate duplicates

03/22/09

Qnested let $doc0 := fn:doc("instance.xml") return { for $x0 in $doc0/authorDB/author return { $x0/name/text() } { for $x1 in $x0/conf return { $x1/name/text() } } }

 

Single pass over the data No duplicates

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

17

Execution time comparison •

Qbasic execution time / Qnested execution time

• Logarithm scale

Nested query execution time

Basic query execution time /

10000

1020 KB 514 KB

1000

312 KB

100

Execution time for basic: 22 minutes Execution time for nested: 1.1 seconds

10

1 1

2

3

4

Ne sti n g Le ve l

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

18

Output file size comparison •

Qbasic output file size / Qnested output file size

• Logarithm scale 100,0 Basic query output size / Nested query output size

2111 KB 514 KB 312 KB 10,0

• Size of generated data for basic (including duplicates): 45MB • Size of generated data for nested: 552KB

1,0 1

2

3

4

Ne sting Le ve l

The nested mapping results in much more efficient execution with less redundant data

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

19

Related work 





Both embedded mappings [Melnik et al. SIGMOD’05] and HePTox [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings. Nested mappings are less general than languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program The generation algorithm identifies common expressions within mappings: same spirit of work in query optimization [e.g., Roy et al. SIGMOD’00]. 



But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (do not preserve query equivalence).

There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX) but most of the mapping generation work is manual.

03/22/09

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

20

Conclusion 

Nested tgds: better specification language for transformation 

Use correlation (hierarchy) between concepts



Less redundancy in the output, more efficient



Naturally preserve source grouping 

For more complex mappings we expose Skolem functions to let users alter the default grouping behavior



Nested tgds are more compact and easier to understand/program 



03/22/09

Humans think top-down: map top concepts, then submappings, etc.

Can be generated too !

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

21

Future Directions 

Extend existing solutions to use nested mappings 



 

Data integration, mapping analysis and reasoning, schema evolution, etc. Nested tgds are more complex as a logic formalism !

Study the formal foundation of nested mappings More generally, develop methods for deciding when and why is a schema mapping specification “better” than another 

Need to look at issues such as:  

03/22/09

preservation of the source data (associations, correlations, etc.) minimization of incompleteness

Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

22

Related Documents


More Documents from "Austrian National Tourism Board"