Notes of Database Management System
UNIT – III
NORMALISATION Functional Dependencies A relational schema has attributes (A, B, C... Z) and that the whole database is described by a single universal relation called R = (A, B, C, ..., Z). This assumption means that every attribute in the database has a unique name. A functional dependency is a property of the semantics of the attributes in a relation. The semantics indicate how attributes relate to one another, and specify the functional dependencies between attributes. When a functional dependency is present, the dependency is specified as a constraint between the attributes. Consider a relation with attributes A and B, where attribute B is functionally dependent on attribute A. If we know the value of A and we examine the relation that holds this dependency, we will find only one value of B in all of the tuples that have a given value of A, at any moment in time.
In the figure above, A is the determinant of B and B is the consequent of A. The determinant of a functional dependency is the attribute or group of attributes on the left-hand side of the arrow in the functional dependency. The consequent of a fd is the attribute or group of attributes on the right-hand side of the arrow. Consider a relational schema
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 1
Notes of Database Management System
The functional dependency staff# → position clearly holds on this relation instance. However, the reverse functional dependency position → staff# clearly does not hold. The relationship between staff# and position is 1:1 – for each staff member there is only one position. On the other hand, the relationship between position and staff# is 1:M – there are several staff numbers associated with a given position.
For the purposes of normalization we are interested in identifying functional dependencies between attributes of a relation that have a 1:1 relationship. A functional dependency is a property of a relational schema (its intension) and not a property of a particular instance of the schema (extension). Identify the functional dependencies that hold using the relation schema STAFFBRANCH In order to identify the time invariant Fds, we need to clearly understand the semantics of the various attributes in each of the relation schemas in question. For example, if we know that a staff member’s position and the branch at which they are located determines their salary.
Inference Rules for Functional Dependencies The set of all Fds that are implied by a given set S of Fds is called the closure of S, written S+. The very first Armstrong is the person who gives a set of inference rules. The following are the six well-known inference rules that apply to functional dependencies. IR1: reflexive rule – if X ⊇ Y, then X → Y IR2: augmentation rule – if X → Y, then XZ → YZ IR3: transitive rule – if X → Y and Y → Z, then X → Z IR4: projection rule – if X → YZ, then X → Y and X → Z IR5: additive rule – if X → Y and X → Z, then X → YZ IR6: pseudo transitive rule – if X → Y and YZ → W, then XZ → W Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 2
Notes of Database Management System
The first three of these rules (IR1-IR3) are known as Armstrong’s Axioms and constitute a necessary and sufficient set of inference rules for generating the closure of a set of functional dependencies. The other rules derived from this are: Given R = (A,B,C,D,E,F,G,H, I, J) and F = {AB → E, AG → J, BE → I, E → G, GI → H} Does F = { AB → GH} ? Proof 1. AB → E, given in F 2. AB → AB, reflexive rule IR1 3. AB → B, projective rule IR4 from step 2 4. AB → BE, additive rule IR5 from steps 1 and 3 5. BE → I, given in F 6. AB → I, transitive rule IR3 from steps 4 and 5 7. E → G, given in F 8. AB → G, transitive rule IR3 from steps 1 and 7 9. AB → GI, additive rule IR5 from steps 6 and 8 10. GI → H, given in F 11. AB → H, transitive rule IR3 from steps 9 and 10 12. AB → GH, additive rule IR5 from steps 8 and 11 . Hence proven
Irreducible sets of dependencies We define a set of Fds to be irreducible( Usually called minimal in the literature) if and only if it satisfies the following three properties 1. The right hand side (the dependent) of every Fds in S involves just one attribute (that is, it is singleton set) 2. The left hand side (determinant) of every in S is irreducible in turn-meaning that no attribute can be discarded from the determinant without changing the closure S+(that is, with out converting S into some set not equivalent to S). We will say that such an Fd is left irreducible. 3. No Fd in S can be discarded from S without changing the closure S+(That is, without converting s into some set not equivalent to S) Eg. Relation R {A,B,C,D,E,F} satisfies the following Fds AB → C , C → A , BC → D , ACD → B BE → C CE → FA CF → BD D → EF Find an irreducible equivalent for this set of Fds? Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 3
Notes of Database Management System
The solution is simple. 1. AB → C 2. C → A 3. BC → D 4. ACD → B 5. BE → C 6. CE → A 7. CE → F 8. CF → B 9. CF → D 10. D → E 11. D → F Now: • 2 implies CE → AE (By augmentation), CE → A (By decomposition), so we can drop 6. • 8 implies CF → BC (By augmentation), by which 3 implies CF → D (By Transitivity), so we can drop 9. • 8 implies ACF → AB (By augmentation), and 11 implies ACD → ACF (By augmentation), and so ACD → AB (By Transitivity), and so ACD → B (By Decomposition), so we can drop 4. No further reductions are possible, and so we are left with the following irreducible set: 1. AB → C 2. C → A 3. BC → D 5. BE → C 7. CE → F 8. CF → B 10. D → E 11. D → F Eg. Given a relation schema R(A,B,C,D) and a set of FDs A BC B C A B AB C AC D Compute an irreducible set of FDs that is equivalent to this set.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 4
Notes of Database Management System
Sol: 1) The first step is to rewrite the FDs such that each has a singleton right-hand side: AB AC [By Decomposition] BC AB ABC ACD We observe immediately that is FD AB occurs twice, so one occurrence can be eliminated. 2) Next, attribute C can be eliminated from the left-hand side of the FD ACD, because we have AC, so AAC by augmentation, and we are given ACD, so AD by transitivity; thus the C on the left-hand side of ACD is redundant. So the remaining FDs are: AB AC BC ABC AD 3) Next, we observe that the FD ABC can be eliminated, because again we have AC, so ABCB by augmentation, so ABC by decomposition. So the remaining FDs are: AB AC BC AD 4) Finally, the FD AC is implied by the FDs AB and BC, so it can also be eliminated. We are left with AB BC AD
This set is irreducible.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 5
Notes of Database Management System
Eg. Consider the Relation R(A,B,C,D,E,F,G,H,I,J) and a set of FDs ABDE , ABG , BF , CJ ,CJI ,GH Is this an irreducible set? What are the candidate keys? Sol: 1) The set is not irreducible, since CJ and CJI together imply CI. 2) The super key of this relation is (A, B, C, D, G, J) (i.e. the set of attributes mentioned on the left-hand side of the given FDs). We can eliminate J from this set because CJ, and we eliminate G because ABG. Since none of the attributes A, B, C, D appears on the right-hand side of any of the given FDs, it follows that (A,B,C,D) is a candidate key. Full Functional Dependency – A functional dependency XY is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more; i.e. for any attribute A є X, (X – {A}) does not functionally determine Y. E.g. consider the following InvLine relation.
INV_NUM
LINENUM
QTY
INVDATE
Candidate Keys: {InvNum, LineNum}. Qty and InvDate is fully functionally dependent on combination of InvNum and LineNum, because removal of any attribute from candidate key does not determine any entity from the relation. Trivial Dependency – A functional dependency XY is trivial if X כY.(i.e X is a super set of Y or Y is a subset of X). E.g. The FD ABB and ABA is a trivial dependency. Partial dependency A partial dependency exists when an attribute B is functionally dependent on an attribute A, and A is a component of a multipart candidate key. Candidate keys: {InvNum, LineNum} InvDate is partially dependent on {InvNum, LineNum} as InvNum is a determinant of InvDate and InvNum is part of a candidate key.
If there is some attribute that can be removed from A and the dependency still holds.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 6
Notes of Database Management System
PART - II A redundancy in a conceptual schema corresponds to a piece of information that can be derived (that is, obtained through a series of retrieval operations) from other data in the database. The presence of a redundancy in a database may be decided upon the following factores • an advantage: a reduction in the number of accesses necessary to.Obtain the derived information. • a disadvantage: because of larger storage requirements, (but, usually at negligible cost) and the necessity to carry out additional operations in order to keep the derived data consistent. The serious problem with using the relations is the problem of update anomalies. These can be classified in to • Insertion anomalies • Deletion anomalies • Modification anomalies Q. Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate with examples. Ans. EMP_DEPT
Insertion Anomalies An "insertion anomaly" is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place and, human fallibility being what it is, some of the needed additional insertions may be missed. Eg. First Instance: - To insert a new employee tuple in to Emp_Dept table, we must include either the attribute values for the department that the employee works for, or nulls (if the employee does not work for a department as yet). For example to insert a new tuple for an employee who works in department no 5, we must enter the attribute values of department number 5correctly so that they are consistent, with values for the department 5 in other tuples in emp_dept.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 7
Notes of Database Management System
Second Instance: - It is difficult to insert a new department that has no employees as yet in the emp_dept relation. The only way to do this is to place null values in the attributes for the employee this causes a problem because SSN in the primary key of emp_dept table and each tuple is supposed to represent an employee entity- not a department entity. Moreover, when the first employee is assigned to that department, we do not need this tuple with null values anymore. Deletion Anomalies A "deletion anomaly" is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed. Eg. The problem of deletion anomaly is related to the second insertion anomaly situation which we have discussed earlier, if we delete from emp_dept an employee tuple that happens to represent the last employee working for a particular department, the information concerning that department is lost from the database. Modification Anomalies Eg. In Emp_Dept, if we change the value of one of the attribute of a particular department- say, the manager of department 5-we must update the tuples of all employees who work in that department; other wise, the database will become inconsistent. If we fail to update some tuples, the same department will be shown to have 2 different values for manager in different employee tuple which would be wrong. All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalized databases are much less susceptible to corruption than are unnormalized databases. Redundant information not only wastes storage but makes updates more difficult since. Normalization is the process of taking data from a problem and reducing it to a set of relations while ensuring data integrity, eliminating data redundancy and minimizing the insertion, deletion and update anomalies. 1. Data Integrity – All of the data in the database are consistent and satisfy all integrity constraints. 2. Data redundancy – If data in the database can be found in two different locations (direct redundancy) or if data can be calculated from other data items (indirect redundancy) then the data is said to contain redundancy. Data should only be stored once and avoid storing data that can be calculated from other data already held in the database. During the process of normalization redundancy must be removed, but not at the expense of breaking data integrity rules. Basically the normal form of the data indicates how much redundancy is in that data. The normal forms have a strict ordering. a) 1NF b) 2NF c) 3NF d) BCNF e) 4NF f) 5NF Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 8
Notes of Database Management System
A normalized relational database provides several benefits: • Elimination of redundant data storage. • Close modeling of real world entities, processes, and their relationships. • Structuring of data so that the model is flexible.
1st Normal Form (1NF) Def: A table (relation) is in 1NF if 1. There are no duplicated rows in the table. 2. Each cell is single-valued (i.e., there are no repeating groups or arrays). 3. Entries in a column (attribute, field) are of the same kind. Note: The order of the rows is immaterial; the order of the columns is immaterial. The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one
column—even, possibly, of all the columns).
So we come to the conclusion, A relation is in 1NF if and only if all underlying domains contain atomic values only. The first normal form deals only with the basic structure of the relation and does not resolve the problems of redundant information or the anomalies discussed earlier. All relations discussed in these notes are in 1NF.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 9
Notes of Database Management System
Eg. The following is not in 1NF. EmpNum
EmpPhone
EmpDegrees
1234
2345-987
3456
2367-890
BA,BSC
6789
2367-890
BSC,MSC,PhD
EmpDegrees is a multi valued field: Employee 6789 has 3 degrees: BSC,PhD and MSC Employee 3456 has 2 degrees: BA, BSC To obtain 1NF relations we must, without loss of information, replace the above with two relations. EMP_DETAIL EmpNum EmpPhone
1234
2345-987
3456
2367-890
6789
2367-890
EMP_DEG EmpNum
EmpDegrees
1234 3456
BA
3456
BSC
6789
BSC
6789
MSC
6789
PhD
An outer join between Employee and EmployeeDegree will produce the information that the original table hold.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 10
Notes of Database Management System
Eg. No. of duplicate tuples
Eg. EMP_Proj
A relation is in first normal form if and only if, in every legal value of that relation every tuple contains one value for each attribute The above definition merely states that the relations are always in first normal form which is always correct. However the relation that is only in first normal form has a structure those undesirable for a number of reasons. First normal form (1NF) sets the very basic rules for an organized database: • Eliminate duplicative
columns from the same table.
• Create
separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 11
Notes of Database Management System
2nd Normal Form (2NF) Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on the entire key. The second normal form attempts to deal with the problems that are identified with the relation above that is in 1NF. The aim of second normal form is to ensure that all information in one relation is only about one thing. A relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate key of the relation.
Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if it has no partial dependencies." The general requirements of 2NF: • Remove • Create
subsets of data that apply to multiple rows of a table and place them in separate rows.
relationships between these new tables and their predecessors through the use of foreign keys.
These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s) and creating relationships between those tables.
Eg. The concept of 2NF requires that all attributes that are not part of a candidate key be fully dependent on each candidate key. If we consider the relation student (sno, sname, cno, cname) and the functional dependencies sno sname cno cname and assume that (sno, cno) is the only candidate key (and therefore the primary key), the relation is not in 2NF since sname and cname are not fully dependent on the key. The above relation suffers from the same anomalies and repetition of information as discussed above since sname and cname will be repeated. To resolve these difficulties we could remove those attributes from the relation that are not fully dependent on the candidate keys of the relations. Therefore we decompose the relation into the following projections of the original relation: S1 (sno, sname) , S2 (cno, cname) ,
SC (sno, cno)
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 12
Notes of Database Management System
Eg. Online store that maintains customer information database.
At this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. After decomposition:
Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 13
Notes of Database Management System
This minimized the amount of redundant information stored within the database and our structure is in second normal form. Eg. Stu_ID
Class_ID
Class_name
134-56-7890
M148
Math 148
123-45-7894
P113
Physics 113
534-98-9009
H151
History 151
134-56-7890
H151
History 151
AND Transitive Dependency A functional dependency XY in a relation schema R is a transitive dependency if there is a set of attributes Z that is neither a candidate key nor a subset of any key of R, and both X Z and ZY hold. Consider attributes A, B, and C, and where A B and B C. FDs are transitive, which means that we also have the FD, A C. We say that C is transitively dependent on A through B. Eg. EmpNum DeptNum
Emp_number
Emp_email
Dept_number
Dept_name
DeptNum DeptName
Emp_number
Emp_email
Dept_number
Dept_name
DeptName is transitively dependent on EmpNum via DeptNum: EmpNum DeptName
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 14
Notes of Database Management System
3rd Normal Form (3NF) Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies. The basic requirements of 3NF are as follows • Meet the requirements of 1NF and 2NF • Remove columns that are not fully dependent upon the primary key.
Although transforming a relation that is not in 2NF into a number of relations that are in 2NF removes many of the anomalies that appear in the relation that was not in 2NF, not all anomalies are removed and further normalization is sometime needed to ensure further removal of anomalies. These anomalies arise because a 2NF relation may have attributes that are not directly related to the thing that is being described by the candidate keys of the relation. Let us first define the 3NF. For E.g. The relation schema EMP_DEPT is in 2NF, since no partial dependencies on a key exist. However, EMP_DEPT is not in 3NF because of the transitive dependency of DMGRENO (and also DNAME) on ENO via DNUMBER. We can normalize EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2. Intuitively, we see that ED1 and ED2 represent independent entity facts about employees and departments. A NATURAL JOIN operation on ED1 and ED2 will recover the original relation EMP_DEPT without generating spurious tuples.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 15
Notes of Database Management System
Consider the following relation subject (cno, cname, instructor, office) cno cname , cno instructor , instructure office Assume that cname is not unique and therefore cno is the only candidate key. The following functional dependencies exist. We can derive cno office from the above functional dependencies and therefore the above relation is in 2NF. The relation is however not in 3NF since office is not directly dependent on cno. This transitive dependence is an indication that the relation has information about more than one thing (viz. course and instructor) and should therefore be decomposed. The primary difficulty with the above relation is that an instructor might be responsible for several subjects and therefore his office address may need to be repeated many times. This leads to all the problems that we identified at the beginning of this chapter. To overcome these difficulties we need to decompose the above relation in the following two relations: s (cno, cname, instructor) ins (instructor, office) s is now in 3NF and so is ins. Eg. Consider a table of Orders
Test for 1NF, 2NF and 3NF???? Ans. Our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any
duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF. Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order. BUT, The Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 16
Notes of Database Management System
total can be derived by multiplying the unit price by the quantity; therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form:
Boyce-Codd Normal Form (BCNF) A relation R is said to be in BCNF if whenever X A , holds in R, and A is not in X, then X is a candidate key for R. Boyce-Codd normal form (BCNF) was proposed as a simpler form of 3NF, but it was found to be stricter than 3NF. That is, every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.
It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF and this happens only if (a) the candidate keys in the relation are composite keys (that is, they are not single attributes), (b) there is more than one candidate key in the relation, and (c) the keys are not disjoint, that is, some attributes in the keys are common. The BCNF differs from the 3NF only when there are more than one candidate keys and the keys are composite and overlapping. So, a relation is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.
( REFER understandingnormalisation.pdf ) DESIRABLE PROPERTIES OF DECOMPOSITIONS So far our approach has consisted of looking at individual relations and checking if they belong to 2NF, 3NF or BCNF. If a relation was not in the normal form that was being checked for and we wished the relation to be normalized to that normal form so that some of the anomalies can be eliminated, it was necessary to decompose the relation in two or more relations. The process of decomposition of a relation R into a set of relations R1,R2….RN was based on identifying different components and using that as a basis of decomposition. The decomposed relations R1,R2….RN are projections of R and are of course not disjoint otherwise the glue holding the information together would be lost. Decomposing relations in this way based on a recognize and split method Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 17
Notes of Database Management System
is not a particularly sound approach since we do not even have a basis to determine that the original relation can be constructed if necessary from the decomposed relations. Desirable properties of decomposition are: 1. Attribute preservation 2. Lossless-join decomposition 3. Dependency preservation 4. Lack of redundancy
Attribute Preservation This is a simple and an obvious requirement that involves preserving all the attributes that were there in the relation that is being decomposed.
Lossless-Join Decomposition In these notes so far we have normalized a number of relations by decomposing them. We decomposed a relation intuitively. We need a better basis for deciding decompositions since intuition may not always be correct. We illustrate how a careless decomposition may lead to problems including loss of information. Consider the following relation enrol (sno, cno, date-enrolled, room-No., instructor) Suppose we decompose the above relation into two relations enrol1 and enrol2 as follows enrol1 (sno, cno, date-enrolled) ,
enrol2 (date-enrolled, room-No., instructor)
There are problems with this decomposition but we wish to focus on one aspect at the moment. Let an instance of the relation enrol be
and let the decomposed relations enrol1 and enrol2 be:
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 18
Notes of Database Management System
All the information that was in the relation enrol appears to be still available in enrol1 and enrol2 but this is not so. Suppose, we wanted to retrieve the student numbers of all students taking a course from Wilson, we would need to join enrol1 and enrol2. The join would have 11 tuples as follows:
…………. So on. The join contains a number of spurious tuples that were not in the original relation Enrol. Because of these additional tuples, we have lost the information about which students take courses from WILSON. (Yes, we have more tuples but less information because we are unable to say with certainty who is taking courses from WILSON). Such decompositions are called lossy decompositions. A nonloss or lossless decomposition is that which guarantees that the join will result in exactly the same relation as was decomposed. One might think that there might be other ways of recovering the original relation from the decomposed relations but, sadly, no other operators can recover the original relation if the join does not. Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 19
Notes of Database Management System
We need to analyze why some decompositions are lossy. The common attribute in above decompositions was Date-enrolled. The common attribute is the glue that gives us the ability to find the relationships between different relations by joining the relations together. If the common attribute is not unique, the relationship information is not preserved. If each tuple had a unique value of Date-enrolled, the problem of losing information would not have existed. The problem arises because several enrolments may take place on the same date.
Dependency Preservation It is clear that decomposition must be lossless so that we do not lose any information from the relation that is decomposed. Dependency preservation is another important requirement since a dependency is a constraint on the database and if X Yholds than we know that the two (sets) attributes are closely related and it would be useful if both attributes appeared in the same relation so that the dependency can be checked easily. Let us consider a relation R(A, B, C, D) that has the dependencies F that include the following: A B , A C etc…. If we decompose the above relation into R1(A, B) and R2(B, C, D) the dependency A C cannot be checked (or preserved) by looking at only one relation. It is desirable that decompositions be such that each dependency in F may be checked by looking at only one relation and that no joins need be computed for checking dependencies. In some cases, it may not be possible to preserve each and every dependency in F but as long as the dependencies that are preserved are equivalent to F, it should be sufficient.
Lack of Redundancy We have discussed the problems of repetition of information in a database. Such repetition should be avoided as much as possible. Lossless-join, dependency preservation and lack of redundancy not always possible with BCNF. Lossless-join, dependency preservation and lack of redundancy is always possible with 3NF. Q. Define Boyce-Codd normal form. How it is differ from 3NF? Why it is considered a stronger form of 3NF? Sol: BCNF - A relation schema R is in BCNF if whenever a nontrivial functional dependency X A holds in R, then X is a super key of R. 3NF - A relation schema R is in third normal form (3NF) if, whenever a nontrivial functional dependency XA holds in R, either • X is a super key of R, or • A is a prime attribute of R. Difference – The formal definition of BCNF differs slightly from the definition of 3NF. The only difference between the definitions of BCNF and 3NF is the second condition of 3NF (i.e. A is a prime attribute of R) which allows A to be prime, is absent from BCNF. BCNF requires that every determinant is a candidate key and 3NF doesn’t require every determinant to act as candidate key. So every relation which is in BCNF is also in 3NF but every relation which is in 3NF is not necessarily in BCNF. Consider for instance a relation R(A, B, C) with the following dependency diagram.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 20
Notes of Database Management System
R A
B
C
Here the candidate key of the relation is AB and also the super key of this relation is AB. So the relation is in 3NF because we have the FDs ABC and AB is a super key of R and the FD CB and B is a prime attribute of R. The relation is not in BCNF because C is not a super key of the relation and still the FD CB holds. Why it is considered as a stronger normal form of 3NF BCNF is considered as a stronger normal form of 3NF because 3NF relations still have insertion, deletion and update anomalies but BCNF relations doesn’t have any insertion, deletion and update anomalies. In BCNF every dependency is on the candidate key which is our goal of normalization and in 3NF it is not required that every dependency is on candidate key. For e.g., consider a relation Stud_Course with the following dependency diagram with the fact that a particular student takes a course and has one instructor and a particular instructor teaches one course only. Stud_Course
With the dependency diagram we have following FDs. Studno, CoursenoInstrno and InstrnoCourseno Consider the following instance of the above relation:-
From this relation we have InstrnoCourseno i.e. instructor 99 teaches 1803, instructor 77 teaches 1903 and instructor 66 teaches 1803. The super key and candidate key of this relation is Studno, Courseno.The relation is in 3NF because we have FD Studno, CoursenoInstrno and Studno, Courseno is a super key and FD InstrnoCourseno and Courseno is a prime attribute of this relation.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 21
Notes of Database Management System
This relation has the following insertion, deletion and update anomalies. 1. Insertion Anomaly – How do we add the fact that instructor 55 teaches course 2906. 2. Deletion Anomaly – If we delete all rows for course 1803 we will lose the information that instructor 99 teaches student 121 and 66 teaches student 222. 3. Update Anomaly – If we update the Courseno of student 121 to be 1903 then we also have to update his Instrno to be 77. Now the relation is not in BCNF because we have the FD InstrnoCourseno and Instrno is not a super key of this relation. So we decompose this relation into two relations Stud(Studno, Instrno) and Instr(Instrno, Courseno). Stud Studno
Instrno
Instr Instrno
courseno
Stud Studno 121 121 222 222
Instrno 99 77 66 77
Instrno 99 77 66
Courseno 1803 1903 1803
Instr
Now these relations don’t have any insertion, deletion and update anomalies because every dependency is on the super key. The natural join between these two relations produce the original relations.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 22
Notes of Database Management System
Q. Prove that any relation schema with two attributes is in BCNF?
Sol: Consider a relation schema R1 with two attributes A and B and with the following dependency diagram. R1 A B Now with this dependency diagram we either have AB or BA. Now we define BCNF as a relation schema R is in BCNF if whenever a nontrivial functional dependency XA holds in R, then X is a super key of R. From the definition of BCNF the relation schema R1 is necessarily in BCNF because we either have AB with super key A or have BA with super key B. MULTIVALUED DEPENDENCIES A multivalued dependency (MVD) on R, X Y , says that if two tuples of R agree on all the attributes of X, then their components in Y may be swapped, and the result will be two tuples that are also in the relation. i.e., for each value of X, the values of Y are independent of the values of R-X-Y. Every FD is an MVD (promotion ):- If X Y, then swapping Y ’s between two tuples that agree on X doesn’t change the tuples. Therefore, the “new” tuples are surely in the relation, and we know X Y.
Consider the following relation that represents an entity employee that has one mutlivalued attribute proj: emp (e#, dept, salary, proj) We have so far considered normalization based on functional dependencies; dependencies that apply only to single-valued facts. For example, implies only one dept value for each value of e#. Not all information in a database is single-valued, for example, e# proj in an employee relation may be the list of all projects that the employee is currently working on. Although e# determines the list of all projects that an employee is working on, e# dept is not a functional dependency. Eg. programmer (emp_name, qualifications, languages) The above relation includes two multivalued attributes of entity programmer; qualifications and languages. There are no functional dependencies. The attributes qualifications and languages are assumed independent of each other. If we were to consider qualifications and languages separate entities, we would have two relationships (one between employees and qualifications and the other between employees and programming languages). Both the above relationships are many-to-many i.e. one programmer could have several qualifications and may know several programming languages. Also one qualification may be obtained by several programmers and one programming language may be known to many programmers. The above relation is therefore in 3NF (even in BCNF) but it still has some disadvantages. Suppose a programmer has several qualifications (B.Sc, Dip. Comp. Sc, etc) and is proficient in several programming languages; how should this information be represented? There are several possibilities.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 23
Notes of Database Management System
All these variations have some disadvantages. If the information is repeated we face the same problems of repeated information and anomalies as we did when second or third normal form conditions are violated. If there is no repetition, there are still some difficulties with search, insertions and deletions. For example, the role of NULL values in the above relations is confusing.. These problems may be overcome by decomposing a relation like that:-
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 24
Notes of Database Management System
The concept of multivalued dependencies was developed to provide a basis for decomposition of relations like the one above. Therefore if a relation like enrolment(sno, subject#) has a relationship between sno and subject# in which sno uniquely determines the values of subject#, the dependence of subject# on sno is called a trivial MVD since the relation enrolment cannot be decomposed any further. More formally, X Ya MVD is called trivial MVD if either Y is a subset of X or X and Y together form the relation R. The MVD is trivial since it results in no constraints being placed on the relation. A MVD XY in R is called a trivial MVD if a) Y is a subset of X or b) X U Y = R. For e.g., consider the relation EMP_PROJECTS.from table we have a trivial MVD ENAME PNAME.
Therefore a relation having non-trivial MVDs must have at least three attributes; two of them multivalued. Non-trivial MVDs result in the relation having some constraints on it since all possible combinations of the multivalue attributes are then required to be in the relation. Possible: ZX Y but X ZY means X Y , X Z.
Multivalued Normalization --- Fourth Normal Form Def: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.
Consider an example of Programmer(Emp name, qualification, languages) and discussed the problems that may arise if the relation is not normalised further. We also saw how the relation could be decomposed into P1(Emp name, qualifications) and P2(Emp name, languages) to overcome these problems. The decomposed relations are in fourth normal form (4NF) which we shall now define. We are now ready to define 4NF. A relation R is in 4NF if, whenever a multivalued dependency XYholds then either (a)
the dependency is trivial, or (b) X is a candidate key for R.
In fourth normal form, we have a relation that has information about only one entity. If a relation has more than one multivalue attribute, we should decompose it to remove difficulties with multivalued facts. EMP ENAME
PNAME
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
DNAME
Page 25
Notes of Database Management System
EMP ENAME
PNAME
DNAME
Kumar
X
Rakesh
Kumar
Y
Rao
Kumar
X
Rao
Kumar
Y
Rakesh
From the above fig we have ENAMEPNAME and ENAME DNAME (or ENAME PNAME|DNAME) holds in the EMP relation. The employee with ENAME ‘Kumar’ works on projects with PNAME ‘X’ and ‘Y’ and has two dependents with DNAME ‘Rakesh’ and ‘Rao’. This relation is not in 4NF because in the nontrivial MVDs ENAMEPNAME and ENAMEDNAME, ENAME is not a super key of EMP. We decompose EMP into EMP_PROJECTS and EMP_DEPENDENTS shown in fig below.
Both EMP_PROJECTS and EMP_DEPENDENTS are in 4NF, because the MVDs ENAMEPNAME in EMP_PROJECTS and ENAMEDNAME in EMP_DEPENDENTS are trivial MVDs. No other nontrivial MVDs hold in either EMP_PROJECTS or EMP_DEPENDENTS. No FDs hold in these relation schemas either. JOIN DEPENDENCIES Notice that an MVD is a special case of a JD where n = 2. That is, a JD denoted as JD(R1, R2) implies a MVD (R1 R2) (R1 – R2) (or, by symmetry, (R1 R2 (R2 – R1)). A join dependency JD(R1, R2,…,Rn), specified on relation schema R, is a trivial JD if one of the relation schemas Ri in JD(R1, R2,…,Rn) is equal to R. Such a dependency is called trivial because it has the nonadditive join property for any relation state r of R and hence does not specify any constraint on R. For e.g., consider the relation SUPPLY with JD(R1, R2, R3) shown below. From the relations we have R3) = SUPPLY
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 26
Notes of Database Management System
SUPPLY
SNAME Kumar Kumar Prakash Ramesh Prakash R1
PARTNAME Bolt Nut Bolt Nut Nail
PROJNAME projX projY projY projZ projX
R2
R3
Fifth Normal Form (5NF) (Project Join Normal Form) (PJNF) A relation schema R is in fifth normal form (5NF) (or project join normal form [PJNF]) with respect to a set of F of functional, multivalued, and join dependencies if, for every nontrivial join dependency JD(R1, R2,…,Rn) in F+ (that is implied by F), every Ri is a super key of R. OR
Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and if every join dependency in the table is a consequence of the candidate keys of the table. The fifth normal form deals with join-dependencies which is a generalisation of the MVD. The aim of fifth normal form is to have relations that cannot be decomposed further. A relation in 5NF cannot be constructed from several smaller relations. A relation R satisfies join dependency (R1,R2,…..Rn) if and only if R is equal to the join of R1,R2,…..Rn where are R1subsets of the set of attributes of R. A relation R is in 5NF (or project-join normal form, PJNF) if for all join dependencies at least one of the following holds. (a) (R1,R2,…..Rn) is a trivial join-dependency (that is, one of R1 is R) (b) Every R1 is a candidate key for R.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 27
Notes of Database Management System
For e.g., consider the SUPPLY relation shown above. The super keys of this relation are {SNAME, PARTNAME}, {SNAME, PROJNAME} and {PARTNAME, PROJNAME}. The decomposition of R1(SNAME, PARTNAME), R2(SNAME, PROJNAME) and R3(PARTNAME, PROJNAME) is in 5NF because we have a JD(R1, R2, R3) over R and every Ri is a super key of R.
Eg. An example of 5NF can be provided by the example below that deals with departments, subjects and students.
The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and CP3000 which are taken by a variety of students. No student takes all the subjects and no subject has all students enrolled in it and therefore all three fields are needed to represent the information. The above relation does not show MVDs since the attributes subject and student are not independent; they are related to each other and the pairings have significant information in them. The relation can therefore not be decomposed in two relations (dept, subject), and (dept, student) without loosing some important information. The relation can however be decomposed in the following three relations. (dept, subject), and (dept, student) , (subject, student) and now it can be shown that this decomposition is lossless. DOMAIN KEY NF (Free from modification anomalies) A key uniquely identifies each row in a table. A domain is the set of permissible values for an attribute. By enforcing key and domain restrictions, the database is assured of being free from modification anomalies. Thus, it avoids all non-temporal anomalies. It's much easier to build a database in domain/key normal form than it is to convert lesser databases which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database programmers. Thus, while the domain/key normal form eliminates the problems found in most databases, it tends to be the most costly normal form to achieve.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 28
Notes of Database Management System
Eg. A violation of DKNF occurs in the following table: Wealthy Person Wealthy Person
Wealthy Person Type
Net Worth in Dollars
Steve
Eccentric Millionaire
124,543,621
Roderick
Evil Billionaire
6,553,228,893
Katrina
Eccentric Billionaire
8,829,462,998
Gary
Evil Millionaire
495,565,211
(Assume that the domain for Wealthy Person consists of the names of all wealthy people in a pre-defined sample of wealthy people; the domain for Wealthy Person Type consists of the values 'Eccentric Millionaire', 'Eccentric Billionaire', 'Evil Millionaire', and 'Evil Billionaire'; and the domain for Net Worth in Dollars consists of all integers greater than or equal to 1,000,000.) There is a constraint linking Wealthy Person Type to Net Worth in Dollars, even though we cannot deduce one from the other. The constraint dictates that an Eccentric Millionaire or Evil Millionaire will have a net worth of 1,000,000 to 999,999,999 inclusive, while an Eccentric Billionaire or Evil Billionaire will have a net worth of 1,000,000,000 or higher. This constraint is neither a domain constraint nor a key constraint; therefore we cannot rely on domain constraints and key constraints to guarantee that an inconsistent Wealthy Person Type / Net Worth in Dollars combination does not make its way into the database. The DKNF violation could be eliminated by altering the Wealthy Person Type domain to make it consist of just two values, 'Evil' and 'Eccentric' (the wealthy person's status as a millionaire or billionaire is implicit in their Net Worth in Dollars, so no useful information is lost). DKNF is frequently difficult to achieve in practice.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 29
Notes of Database Management System
Denormalization Occasionally database designers choose a schema that has redundant information; that is, it is not normalized. They use the redundancy to improve performance for specific applications. The penalty paid for not using a normalized schema is the extra work (in terms of coding time and execution time) to keep redundant data consistent. For instance, suppose that the name of an account holder has to be displayed along with the account number and balance, every time the account is accessed. In our normalized schema, this requires a join of account with depositor. One alternative to computing the join on the fly is to store a relation containing all the attributes of account and depositor. This makes displaying the account information faster. However, the balance information for an account is repeated for every person who owns the account, and all copies must be updated by the application, whenever the account balance is updated. The process of taking a normalized schema and making it non-normalized is called denormalization, and designers use it to tune performance of systems to support time-critical operations.
Mrs Mousmi Ajay Chaurasia , LECTURER, Dept. of INFORMATION TECHNOLOGY
Page 30