Normalisation
The Normalisation Anomalies
process,
Functional
dependency
and
The normalisation process, as first proposed by Codd (1972), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form. It proceeds in a top-down fashion and evaluates each relation against the criteria for normal forms, decomposing relations as necessary. The first three normal forms are based on the functional dependencies among the attributes of a relation. Attribute B is functionally dependent on attribute A if, at any moment in time, there is no more than one value of attribute B associated with a given value of attribute A. Normalisation of data can be looked upon as relational design by analysis the process of analysing the given relation schemas based on their functional dependencies and primary keys to achieve the desirable properties of: minimising redundancy improving data handling efficiency improving application development capabilities minimising the insert, delete and update anomalies presented by unnormalised relations: Insert anomaly - refers to a situation wherein one cannot insert a new tuple (row) into a relation because of an artificial dependency on another relation. Delete anomaly - the opposite of the Insert anomaly: refers to a situation wherein a deletion of data about one particular entry causes unintended loss of data that characterises another entity. Update anomaly - refers to a situation where an update of a single data value requires multiple tuples (rows) of data to be updated. Data modification becomes problematic. Database normalization refers to a systematic way of ensuring that a database structure is suitable for general-purpose querying and free of certain undesirable characteristics—insertion, update, and deletion anomalies – that could lead to a loss of data integrity. Unsatisfactory relation schemas that do not meet the normal form test conditions are decomposed into smaller relation schemas that meet the tests and hence possess the desirable properties. An idea is to write the SQL that reproduces the original views after completing the normalisation process. Full normalisation might introduce additional complexity in: Table and query design
1
Query optimisation Query processing At times, some degree of de-normalisation is required for performance purposes. Although most successful databases are normalised to some degree, there is one substantial drawback of a normalised database: reduced database performance. A normalised database requires much more CPU, memory, and I/O to process transactions and database queries than does a de-normalised database. A normalised database must locate the requested tables and then join the data from the tables to either get the requested information or to process the desired data. More JOINS are required thus: Increasing costs Decreasing performance There are costs to denormalisation, however. Data redundancy is increased in a denormalised database, which can improve performance but requires more extraneous efforts to keep track of related data. Application coding renders more complications, because the data has been spread across various tables and may be more difficult to locate. In addition, referential integrity is more of a chore; related data has been divided among a number of tables. There is a happy medium in both normalization and denormalisation, but both require a thorough knowledge of the actual data and the specific business requirements of the pertinent company.
Unique identifiers, Candidate keys, Prime attributes and Primary keys The first step in normalisation is to choose a primary key from among the unique identifiers we find in the relation. A unique identifier is an attribute or collection of attributes that uniquely identifies each occurrence of a relation. In a few cases, there is no reasonable set of attributes in a relation that can be used as a unique identifier. When this occurs, we must invent a unique identifier, often with values assigned sequentially or randomly as we add entity occurrences to the database. We call unique identifiers that have real world meaning (e.g. ID number) natural identifiers, and those that do not (invented ones) surrogate or artificial identifiers. Sometimes a relation will have more than one possible unique identifier. When this occurs, we call each possibility a candidate. If a relation schema has more than one key, each key is called a candidate key.
Normalisation
An attribute of relation schema R is called a prime attribute of R if it is a member of some candidate key of R. An attribute is called nonprime if it is not a prime attribute, that is, not a member of any candidate key. Once we have identified all the possible candidates for a relation, we must choose one of them to be the primary key for the relation. Each relation schema must have a primary key. We can use a preceding hash mark (#) to denote the attribute name(s) of the primary key. Choosing a primary key is essential to the normalisation process because all the normalisation rules reference the primary key. The criteria for choosing the primary key from among the candidates is as follows (in order of precedence): If there is only one candidate, choose it. Choose the candidate least likely to have its value change. (Surrogate keys are always less likely to change compared with natural keys.) Choose the simplest candidate - composed of the fewest attributes. Choose the shortest candidate - for efficiency. After arbitrarily designating one of the candidate keys to be the primary key, the others are called secondary keys.
3
1NF: First Normal Form relation Move repeating and multivalued attributes to a new relation. A relation is said to be in first normal form when it contains no multivalued attributes. The first normal form is now considered to be part of the formal definition of a relation in the basic (flat) relational model. It is defined to disallow multivalued attributes, composite attributes and their combinations. It states that: the domain of an attribute must include only atomic (simple, indivisible) values the value of any attribute in a tuple must be a single value from the domain of that attribute. We can denote repeating groups and multivalued attributes by enclosing them in pairs of parentheses. To transform unnormalised relations into first normal form, we must move multivalued attributes and repeating groups to new relations: Create a new relation with a meaningful name. Copy the primary key from the original relation to the new one. The data depended on this primary key in the original relation, so it must still depend on this key in the new relation. This copied primary key now becomes a foreign key to the original relation. Move the repeating group or multivalued attribute to the new relation. Make the primary key copied from the original relation unique by adding attributes from the repeating group to it or replace the primary key with a single surrogate key attribute.
Normalisation
2NF: Second Normal Form relation Move partially dependent attributes to a new relation. The Second Normal Form is based on the concept of full functional dependency. A functional dependency X®Y is a full functional dependency if removal of any attribute A from X means the dependency does not hold any more, that is, for any attribute AÎX, (X - {A}) does not functionally determine Y). A functional dependency X®Y is a partial dependency if removal of any attribute A from X does not determine Y, that is, if some attribute AÎX can be removed from X and the dependency still holds (or, for any attribute AÎX, (X - {A}) ®Y. If the primary key is composite, a nonprime attribute cannot depend on only part of the key for 2NF to be satisfied. A relation schema R is in the second normal form if every nonprime attribute A in R is fully functionally dependent on the primary key of R. (Remember, an attribute of relation schema R is called nonprime if it is not a member of any candidate key of R.) Put another way, a relation is said to be in the second normal form if it meets both the following criteria: The relation is in the first normal form. All nonprime attributes are functionally dependent on the entire primary key. The second normal form only applies to relations where we have concatenated primary keys (composed of multiple attributes). If we have a primary key composed of only a single attribute and the primary key is atomic (has no subparts that make sense by themselves, as all attributes should be), then it is simply not possible for anything to depend on part of the primary key. It follows then, that any first normal form relation that has only a single attribute for its primary key is automatically in second normal form. The test for 2NF involves testing for functional dependencies whose left-hand side attributes are part of the primary key. Once we find a second normal form violation, the solution is to move the attribute(s) that is (are) partially dependent to a new relation where it depends on the entire key instead of part of the key. A more general definition of 2NF will disallow partial dependencies not only on the primary key but on ANY candidate key of a relation. In this case, a general definition of prime attribute will come to mean an attribute that is part of any candidate key. Now, a relational schema R is in second normal
5
form if every nonprime attribute A is not partially dependent on any key of R.
Normalisation
3NF: Third Normal Form relation Move transitively dependent attributes to a new relation. The third normal form is based on the concept of transitive dependency. An attribute that depends on another attribute that is not the primary key of the relation is said to be transitively dependent. A functional dependency X®Y in a relation schema R is a transitive dependency if there is a set of attributes Z that is neither a candidate key nor a subset of any key of R, and both X® and Z®Y hold. A relation is said to be in third normal form if it meets both the following criteria: The relation is in second normal form. There is no transitive dependence - that is, all nonprime attributes depend only on the primary key. (Remember, an attribute of relation schema R is called nonprime if it is not a member of any candidate key of R.) If any attribute depends on a nonprime attribute (which depends on the primary key) then 3NF is not satisfied. To transform a second normal form relation into third normal form, we simply move any transitively dependent attributes to relations where they depend only on the primary key, whilst being careful to leave the attribute on which they depend in the relation as a foreign key. (This will be needed to reconstruct the original user view via a join.) A more general definition of 3NF will disallow transitive dependencies not only on the primary key but on ANY candidate key of a relation. In this case, a general definition of prime attribute will come to mean an attribute that is part of any candidate key. A general definition of 3NF will imply that the relation schema R is in third normal form if, whenever a nontrivial functional dependency X®A holds in R, either: 1. X is superkey of R 2. A is a prime attribute of R A relation schema R violates the general definition of 3NF if a functional dependency X® holds in R that violates both conditions of 3NF. Violating condition 1 means that X is not a superset of any key of R; hence X could be: o nonprime (we have a typical transitive dependency that violates
7
3NF) o a proper subset of a key of R (we have a partial dependency that violates both 2NF and 3NF) Violating condition 2 means that A is a nonprime attribute. A relation schema R is in 3NF if every nonprime attribute of R meets both of the following conditions: It is fully functionally dependent on every key of R. It is nontransitively dependent on every key of R.
Summary First Normal Form (1NF) A relational database is in First Normal Form (1NF) if each attribute is singlevalued with atomic values. Second Normal Form (2NF) To discuss Second Normal Form, we need to understand functional dependency. Attribute (or set of attributes) B is functionally dependent on attribute (or set of attributes) A, written A --> B, if for each value of A, there corresponds exactly one value of B. Attribute (or set of attributes) B is fully functionally dependent on attribute (or set of attributes) A, if B is functionally dependent on A but is not functionally dependent on any proper subset of A. The entirety of A must be specified for B to be determined. A relational database is in Second Normal Form (2NF) if it is in 1NF and each attribute that is not a primary key is fully functionally dependent on the entity's primary key; that is, non-primary key attributes are dependent on the entity's entire primary key. Third Normal Form (3NF) A relational database is in Third Normal Form (3NF) if it is in 2NF and no nonprimary key attribute is functionally dependent on another non-primary key. To normalize, we move both attributes into a new entity. Normal Form
Characteristics
Conversion Process
1NF
Single-valued attributes
Create new entity from multivalued attribute
Normalisation
2NF
3NF
Atomic-valued attributes
Break composite attribute into component attributes
1NF
Convert to 1NF
Non-primary key attributes fully functionally dependent on primary key
Attached non-primary key attribute to correct (perhaps new) entity
2NF
Convert to 2NF
No non-primary key attribute is functionally dependent on another non-primary key
Move both attributes into a new entity
Example The First Normal Form For a table to be in first normal form, data must be broken up into the smallest units possible. For example, the following table is not in first normal form. Name
Address
Phone
Sally Singer
123 Broadway New York, NY, 11234
(111) 2223345
Jason Jumper
456 Jolly Jumper St. Trenton NJ, 11547
(222) 3345566
To conform to first normal form, this table would require additional fields. The name field should be divided into first and last name and the address should be divided by street, city, state, and zip like this. ID
Firs t
Last
Street
City
56 4
Sall y
Singe r
123 Broadway
New York
56 5
Jaso n
Jump er
456 Jolly Jumper St.
Trenton
S tat e NY
Zip
Phone
1123 4
(111) 2223345
NJ
1154 7
(222) 3345566
In addition to breaking data up into the smallest meaningful values, tables in first normal form should not contain repetitions groups of fields such as in the following table.
9
Rep ID
Representat ive
Client 1
Time 1
Client 2
Time 2
Client 3
Time 3
TS89 RK56
Gilroy Gladstone Mary Mayhem
US Corp. Italian a
14 hrs
Taggar ts Linker s
26 hrs 2 hrs
Kilroy Inc.
9 hrs
67 hrs
The problem here is that each representative can have multiple clients not all will have three. Some may have less as is the case in the second record, tying up storage space in your database that is not being used, and some may have more, in which case there are not enough fields. The solution to this is to add a record for each new piece of information. Rep ID
Rep First Name
Rep Last Name
Client
Time With Client
TS-89
Gilroy
Gladstone
US Corp
14 hrs
TS-89
Gilroy
Gladstone
Taggarts
26 hrs
TS-89
Gilroy
Gladstone
Kilroy Inc.
9 hrs
RK-56
Mary
Mayhem
Italiana
67 hrs
RK-56
Mary
Mayhem
Linkers
2 hrs
Notice the splitting of the first and last name fields again. This table is now in first normal form. Note that by avoiding repeating groups of fields, we have created a new problem in that there are identical values in the primary key field, violating the rules of the primary key. In order to remedy this, we need to have some other way of identifying each record. This can be done with the creation of a new key called client ID.
Normalisation
Rep ID*
Rep First Name
Rep Last Name
Client ID*
Client
Time With Client
TS-89
Gilroy
Gladstone
978
US Corp
14 hrs
TS-89
Gilroy
Gladstone
665
26 hrs
TS-89
Gilroy
Gladstone
782
RK-56
Mary
Mayhem
221
Taggart s Kilroy Inc. Italiana
RK-56
Mary
Mayhem
982
Linkers
2 hrs
9 hrs 67 hrs
This new field can now be used in conjunction with the Rep ID field to create a multiple field primary key. This will prevent confusion if ever more than one Representative were to serve a single client.
The Second Normal Form The second normal form states that each field in a multiple field primary key table must be directly related to the entire primary key. Or in other words, each non-key field should be a fact about all the fields in the primary key. Only fields that are absolutely necessary should show up in our table, all other fields should reside in different tables. In our example, we are devoting three fields to the identification of the employee and two to the identification of the client. We could identify them with only one field each - the primary key. We can then take out the extraneous fields and put them in their own table: Rep ID* TS-89
Client ID* 978
Time With Client 14 hrs
Client ID* 978
Client Name
TS-89
665
26 hrs
665
Taggarts
TS-89
782
9 hrs
782
Kilroy Inc.
RK-56
221
67 hrs
221
Italiana
RK-56
982
2 hrs
982
Linkers
RK-56
665
4 hrs
This table contains time card information. Rep ID* TS-89
First Name Gilroy
Last Name
RK-56
Mary
Mayhem
US Corp
This table contains Client information
Gladstone
11
This table contains Rep information. These tables are now in 2NF. By splitting off the unnecessary information and putting it in its own tables, we have eliminated redundancy and put our first table in second normal form.
The Third Normal Form Third normal form is the same as second normal form except that it only refers to tables that have a single field as their primary key. In other words, each non-key field in the table should be a fact about the primary key. Any of the preceding tables acts as an example of third normal form since all the fields in each table are necessary to describe the primary key. Once all the tables in a database have been taken through the third normal form, we can begin to set up relationships.