L05: Distributed Query Processing & Optimization Query
Processing Query Decomposition Data Localization Query Optimization
H.Lu/HKUST
Query Processing
Any high-level query (SQL) on a database must be processed, optimized and executed by the DBMS The high-level query is scanned, and parsed to check for syntactic correctness An internal representation of a query is created, which is either a query tree or a query graph The DBMS then devises an execution strategy for retrieving the result of the query. (An execution strategy is a plan for executing the query by accessing the data, and storing the intermediate results) The process of choosing one out of the many execution strategies is known as query optimization
H.Lu/HKUST
L05: Distributed Query Processing -- 2
Query Processor A query processor is a module in the DBMS that performs the tasks to process, to optimize, and to generate execution strategy for a high-level query For a DDBMS, the QP also does data localization for the query based on the fragmentation scheme and generates the execution strategy that incorporates the communication operations involved in processing the query
H.Lu/HKUST
L05: Distributed Query Processing -- 3
Query Optimizer Queries expressed in SQL can have multiple equivalent relational algebra query expressions The distributed query optimizer must select the ordering of relational algebra operations, sites to process data, and possibly the way data should be transferred. This makes distributed query processing significantly more difficult
H.Lu/HKUST
L05: Distributed Query Processing -- 4
Complexity of Relational Algebra Operations
The relational algebra is used to express the output of the query. The complexity of relational algebra operations play a role in defining some of the principles of query optimization. All complexity measures are based on the cardinality of the relation Operations Complexity Select, Project (w/o duplicate elimination) O(n) Project (with duplicate elimination), Group O(n logn) Join, Semi-join, Division, Set Operators O(n logn) Cartesian Product O(n2 ) This was given in the book (p194). It is over simplified.
H.Lu/HKUST
L05: Distributed Query Processing -- 5
Characteristics of Query Processors
Languages Input language can be relational algebra or calculus; output language is relational algebra (annotated with communication primitives). The query processor must efficiently map input language to output language Types of Optimization The output language specification represents the execution strategy. There can be many such strategies, the best one can be selected through exhaustive search, or by applying heuristic (minimize size of intermediate relations). For distributed databases semi-joins can be applied to reduce data transfer.
H.Lu/HKUST
L05: Distributed Query Processing -- 6
When to Optimize
Static: done before executing the query (at compilation time), cost of optimization amortized over multiple executions, mostly based on exhaustive search. Since sizes of intermediate relations need to be estimated, it can result in sub-optimal strategies. Dynamic: done at run time; every time the query is executed, can make use of exact sizes of intermediate relations, expensive, based on heuristics Hybrid: mixes static and dynamic approaches; the approach is mainly static, but dynamic query optimization may take place when high difference between predicted and actual sizes are detected
H.Lu/HKUST
L05: Distributed Query Processing -- 7
Characteristics of Query Processors
Statistics fragment cardinality and size size and number of distinct values for each attribute. detailed histograms of attribute values for better selectivity estimation. Decision Sites one site or several sites participate in selection of strategy Exploitation of network topology wide area network - communication cost local area network - parallel execution
H.Lu/HKUST
L05: Distributed Query Processing -- 8
Characteristics of Query Processors
Exploitation of replicated fragments larger number of possible strategies Use of Semijoins reduce size of data transfer increase # of messages and local processing good for fast or slow networks?
H.Lu/HKUST
L05: Distributed Query Processing -- 9
Layers of Query Processing Calculus Query on Distributed Relations QUERY DECOMPOSITION
CONTROL SITE
GLOBAL SCHEMA
Algebra Query on Distributed Relations DATA LOCALIZATION
FRAGMENT SCHEMA
Fragment Query GLOBAL OPTIMIZATION
LOCAL SITE H.Lu/HKUST
STATISTICS ON FRAGMENTS
Optimized Fragment Query With Communication Operations LOCAL OPTIMIZATION
LOCAL SCHEMA
Optimized Local Queries L05: Distributed Query Processing
-- 10
L05: Distributed Query Processing & Optimization Query Processing Query Decomposition Data Localization Query Optimization
H.Lu/HKUST
Query Decomposition
Normalization The calculus query is written in a normalized form (CNF or DNF) for subsequent manipulation Analysis The query is analyzed for semantic correctness Simplification Redundant predicates are eliminated to obtain simplified queries Restructuring The calculus query is translated to optimal algebraic query representation
H.Lu/HKUST
L05: Distributed Query Processing -- 12
Query Decomposition: Normalization
Lexical and syntactic analysis check validity check for attributes and relations type checking on the qualification There are two possible forms of representing the predicates in query qualification: Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF) CNF: (p11 ∨p12 ∨... ∨p1n) ∧... ∧(pm1 ∨pm2 ∨... ∨pmn) DNF: (p11 ∧p12 ∧... ∧p1n) ∨... ∨(pm1 ∧pm2 ∧... ∧pmn) OR's mapped into union AND's mapped into join or selection
H.Lu/HKUST
L05: Distributed Query Processing -- 13
Query Decomposition: Analysis
Queries are rejected because the attributes or relations are not defined in the global schema; or operations used in qualifiers are semantically incorrect For only those queries that do not use disjunction or negation semantic correctness can be determined by using query graph One node of the query graph represents result sites, others operand relations, edge between nodes operand nodes represent joins, and edge between operand node and result node represents project
H.Lu/HKUST
L05: Distributed Query Processing -- 14
Query Graph and Join Graph SELECT Ename, Resp FROM E, G, J WHERE E. ENo = G. ENO AND G.JNO = J.JNO AND JNAME = ``CAD'' AND DUR >= 36 AND Title = ``Prog'' JNAME = ``CAD''
PROJ
EMP
PROJ
E. ENo = G. ENO
G.JNO = J.JNO
ASG Result Resp
Ename EMP
Title = ``Prog''
E. ENo = G. ENO G.JNO = J.JNO ASG DUR >= 36 H.Lu/HKUST
L05: Distributed Query Processing -- 15
Disconnected Query Graph
Semantically incorrect conjunctive multivariable query without negation have query graphs which are not connected
SELECT Ename, Resp FROM E, G, J WHERE E. ENo = G. ENO AND JNAME = ``CAD'' AND DUR >= 36 AND Title = ``Prog''
Result
Ename
Resp
EMP
E. ENo = G. ENO ASG
PROJ
DUR >= 36 H.Lu/HKUST
Title = ``Prog''
JNAME = ``CAD''
L05: Distributed Query Processing -- 16
Simplification: Eliminating Redundancy
Elimination of redundant predicates using well known idempotency rules: p ∧p = p; p ∨p = p; p ∨true = true; p ∨false = p; p ∧true = p; p ∨false = false; p1 ∧(p1 ∨p 2 ) = p1; p1 ∨(p1 ∧p 2 ) = p1
Such redundant predicates arise when user query is enriched with several predicates to incorporate view relation correspondence, and ensure semantic integrity and security
H.Lu/HKUST
L05: Distributed Query Processing -- 17
Eliminating Redundancy-- An Example SELECT TITLE FROM E WHERE (NOT (TITLE = ``Programmer'') AND (TITLE = ``Programmer'' OR TITLE = ``Elec.Engr'') AND NOT (TITLE = ``Elec.Engr'')) OR ENAME = ``J.Doe''; SELECT TITLE FROM E WHERE ENAME = ``J.Doe''; H.Lu/HKUST
L05: Distributed Query Processing -- 18
Eliminating Redundancy-- An Example p1 = <TITLE = ``Programmer''> p2 = <TITLE = ``Elec. Engr''> p3 = <ENAME = ``J.Doe''> Let the query qualification is (¬ p1 ∧(p1 ∨p2) ∧¬ p2) ∨p3 The disjunctive normal form of the query is = (¬ p1 ∧p1 ∧¬p2) ∨(¬ p1 ∨p2 ∧¬ p2) ∨p3 = (false ∧¬ p2) ∨(¬ p1 ∧false) Ú p3 = false ∨false ∨p3 = p3 H.Lu/HKUST
L05: Distributed Query Processing -- 19
Query Decomposition: Rewriting
Rewriting calculus query in relational algebra; straightforward transformation from relational calculus to relational algebra, and restructuring relational algebra expression to improve performance
H.Lu/HKUST
L05: Distributed Query Processing -- 20
Rewriting -- Transformation Rules (I)
Commutativity of binary operations: R× S ⇔ S× R R S ⇔S R R∪S ⇔ S∪R Associativity of binary operations: (R × S) × T ⇔ R × ( S × T ) (R S) T ⇔ R (S T) Idempotence of unary operations: grouping of projections and selections Π A’ ( Π A’’ (R )) ⇔ Π A’ (R ) for A’⊆A’’⊆ A σ
H.Lu/HKUST
p1(A1)
(σ
p2(A2)
(R )) ⇔ σ
p1(A1) ∧p2(A2)
(R ) L05: Distributed Query Processing -- 21
Rewriting -- Transformation Rules (II)
Commuting selection with projection Π A1,
…, An
(σ
p(Ap)
(R )) ⇔ Π A1,
(σ
…, An
p(Ap)
( Π A1,
…, An, Ap
(R )))
Commuting selection with binary operations σ
p(Ai)
(R × S) ⇔ (σ
σ
p(Ai)
(R
σ
p(Ai)
(R ∪ S) ⇔ σ
S) ⇔ (σ
p(Ai) p(Ai) p(Ai)
(R)) × S (R)) (R) ∪ σ
S p(Ai)
(S)
Commuting projection with binary operations Π C(R × S) ⇔ Π A(R) × Π B (S) C = A ∪ B Π C(R
S) ⇔ Π C(R)
Π C (S)
Π C (R ∪ S) ⇔ Π C (R) ∪ Π C (S) H.Lu/HKUST
L05: Distributed Query Processing -- 22
An SQL Query and Its Query Tree Π ENAME σ
(ENAME<>“J.DOE”)
SELECT Ename FROM J, G, E WHERE G.Eno=E.ENo AND G.JNo = J.JNo AND ENAME <> `J.Doe' AND JName = `CAD' AND (Dur=12 or Dur=24) H.Lu/HKUST
∧(JNAME=“CAD/CAM”)
∧(Dur=12
∨Dur=24)
JNO
PROJ
ENO
ASG
EMP
L05: Distributed Query Processing -- 23
Query Decomposition: Rewriting Π ENAME JNO
Π σ
Π
JNO
ENO JNAME=“CAD/CAM”
Π σ
PROJ H.Lu/HKUST
JNO,ENAME
Π
JNO,ENO
Dur=12
∨Dur=24
ASG
σ
ENO,ENAME
ENAME<>“J.DOE”
EMP L05: Distributed Query Processing -- 24
L05: Distributed Query Processing & Optimization Query Processing Query Decomposition Data Localization Query Optimization
H.Lu/HKUST
Data Localization Localization program Given an algebraic query on global schema Determine which fragments are involved A naïve way to localize a distributed query Substitute each global relation with its localization program generic query Use reduction techniques for efficiency
H.Lu/HKUST
L05: Distributed Query Processing -- 26
Reduction for HF Remove empty relations generated by contradicting selection on horizontal fragments; Remove useless relations generated by projections on vertical fragments; Distribute joins over unions in order to isolate and remove useless joins
H.Lu/HKUST
L05: Distributed Query Processing -- 27
Data Localization-- An Example EMP is fragmented into EMP1 = σ ENO ≤ “E3” (EMP)
Π ENAME σ σ σ
Dur=12
EMP2 = σ
∨Dur=24
EMP3 = σ
ENAME<>“J.DOE”
“E3”<ENO ENO>“E6”
≤ “E6”
(EMP)
(EMP)
ASG is fragmented into ASG1 = σ ENO ≤ “E3” (ASG)
JNAME=“CAD/CAM”
ASG2 = σ
JNO
ENO>“E3”
(ASG)
ENO
PROJ ASG1 H.Lu/HKUST
∪
∪ ASG2 EMP1
EMP1
EMP1
L05: Distributed Query Processing -- 28
Reduction with Selection EMP is fragmented into EMP1 = σ ENO ≤ “E3” (EMP)
SELECT * FROM EMP WHERE ENO=“E5”
EMP2 = σ EMP3 = σ
≤ “E6”
“E3”<ENO ENO>“E6”
(EMP)
(EMP)
Given Relation R, FR={R1, R2, …, Rn} where Rj =σ pj (R) σ pj (Rj) = ∅ if ∀x ∈ R: ¬(pi(x)∧pj(x)) σ
σ
H.Lu/HKUST
σ
∪
ENO=“E5”
EMP
ENO=“E5”
EMP1
EMP2
EMP3
ENO=“E5”
EMP2
L05: Distributed Query Processing -- 29
Reduction with join SELECT * FROM EMP, ASG WHERE EMP.ENO=ASG.ENO ENO
ASG
∪
∪ ASG1
ASG2 EMP1
EMP1
EMP1
EMP
ASG is fragmented into ASG1 = σ ENO ≤ “E3” (ASG) ASG2 = σ
ENO
ENO>“E3”
H.Lu/HKUST
(ASG)
EMP is fragmented into EMP1 = σ ENO ≤ “E3” (EMP) EMP2 = σ EMP3 = σ
“E3”<ENO ENO>“E6”
≤ “E6”
(EMP)
(EMP)
L05: Distributed Query Processing -- 30
Reduction with Join (I) (R1 ∪ R2) S ⇔ (R1 S) ∪ (R2
ENO
∪
∪ ASG1
ASG2 EMP1
EMP2
S)
EMP3
∪ ENO
ENO
ENO
ENO
ENO
ENO
EMP1 ASG1 EMP1 ASG2 EMP2 ASG1 EMP2 ASG2 EMP3 ASG1 EMP3 ASG2
H.Lu/HKUST
L05: Distributed Query Processing -- 31
Reduction with Join (II) ∪ ENO
EMP1 ASG1
ENO
EMP2 ASG2
ENO
EMP3 ASG2
Given Ri =σ pi (R) and Rj =σ pj (R) Ri
Rj = ∅ if ∀x ∈ Ri , ∀y∈ Rj: ¬(pi(x)∧pj(y)) Reduction with join 1. Distribute join over union 2. Eliminate unnecessary work
H.Lu/HKUST
L05: Distributed Query Processing -- 32
Reduction for VF
Find useless intermediate relations Relation R defined over attributes A = {A1, A2, …, An} vertically fragmented as Ri =Π A’ (R) where A’⊆ A Π K,D (Ri) is useless if the set of projection attributes D is not in A’
EMP1= Π ENO,ENAME EMP2= Π ENO,TITLE SELECT ENAME FROM EMP H.Lu/HKUST
(EMP) (EMP)
Π ENAME
Π ENAME
ENO
EMP1
EMP2
EMP1
L05: Distributed Query Processing -- 33
Reduction for DHF Distribute joins over union Apply the join reduction for horizontal fragmentation EMP1: σ
TITLE=“Programmer”
(EMP)
EMP2: σ
TITLE ≠ “Programmer”
(EMP)
ASG1: ASG ASG2: ASG
ENO
EMP1
ENO
EMP2
SELECT * ASG1 FROM EMP, ASG WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = “Mech. Eng.” H.Lu/HKUST
ENO
σ
TITLE=“MECH.Eng.”
∪
∪ ASG2 EMP1
EMP2
L05: Distributed Query Processing -- 34
Reduction for DHF (II) Selection first
ENO
σ
Joins over union TITLE=“Mech. Eng.”
∪ ASG1
ASG2
∪
EMP2
ENO
σ
ENO
σ
TITLE=“Mech. Eng.”
ASG2 ASG1 H.Lu/HKUST
EMP2
ASG1
ENO
σ
TITLE=“Mech. Eng.”
EMP2
ASG2 ASG1
TITLE=“Mech. Eng.”
EMP2
L05: Distributed Query Processing -- 35
Reduction for HF --An Example EMP1 = σ
ENO ≤ “E4”
(Π ENO,ENAME
(EMP))
EMP2 = σ
ENO>“E4”
(Π ENO,ENAME
(EMP))
EMP3 = Π ENO,TITLE
Π ENAME
(EMP)
σ
Π ENAME
QUERY SELECT ENAME FROM EMP WHERE ENO = “E5”
σ
ENO=“E5”
EMP2 ENO=“E5”
ENO
∪ EMP1 ASG1 H.Lu/HKUST
EMP2
EMP3
L05: Distributed Query Processing -- 36
Summary Query decomposition is done as in a centralized DBMS A lot of logic rules about transformations Data localization happens only in a distributed DBMS Use fragmentation characteristics to simplify the queries on fragments
H.Lu/HKUST
L05: Distributed Query Processing -- 37