Overview of Relational Database Technology Hanu
[email protected]
Introductions
2
Agenda • • • • • • • • •
Introduction to files systems Introduction to Access methods--VSAM Introduction Data Base Systems- Architecture Introduction to Hierarchical databases--IMS Introduction to Network Database –IDMS Introduction to Relational Databases Introduction to OLTP New trends in DB technology Basics of Data Warehousing
3
Objectives • • • • •
Evolution of Database Technology Limitations of Legacy Access mechanism Limitations of Hierarchy and Network Databases Emergence of Relational database management system Latest trends in Database technology – OODBMS – OORDBMS – XML Integration
4
What this course does not cover • • • •
Data Modelling OO Design SQL Syntax Commercial database like – – – –
•
DB2 Oracle SQL Server 2000 and Sybase
Design of Data warehousing or Data Mining system
5
File Based Data Management •
Flat file systems are first attempt of computerization of manual book keeping system
•
Retrieval of data was possible only by sequential reading Updating and deleting the existing record was almost impossible
•
The only way to delete Sequential file records is to create a new file which does not contain them.
•
The only way to update records in a Sequential File is to create a new file which contains the updated records
6
Disadvantages of File based system • • • •
Data Redundancy - the same data might be stored in different places Poor Data Control - redundant data might be slightly different example Hanu’s data may be stored in Telephone, Payana and PSWeb Inability to Easily Manipulate Data - it was a tedious and error prone activity to modify files by hand Cryptic Work Flows - accessing the data could take excessive programming effort and was too difficult for real-users
7
VSAM Based Systems • • • • • •
Designed and Developed by IBM in early 60s First time introduced concept of Unique Key To be able to locate a given record, based on its key, and fetch it with minimal I/O (ideally a single read) Can define Secondary keys Made up of Multiple Control Areas (CA) Each control area is made up of Control Intervals and Free space
8
Disadvantages of VSAM • • • • • •
Complex Structure and access mechanism No querying facility Security Issues No concept of Referencing keys Application dependent Redundancy of the data
9
Hierarchical Database - IMS • • • •
Designed in Mid of 60s again by IBM Based on two tier client server architecture VENDOR Looks data only as ‘Hierarchical’ 1 Accessing child element only through parent node ITEM3
VENDOR ITEM LOCATION
ITEM2 ITEM1
L L OC3 L OC2 OC1 10
L OC1
L OC3
Disadvantages of IMS • • • • • • •
Accessing child nodes only through parent node Child record can not be inserted without a parent One child record can have only one parent record No querying facility No referential and constraints concept Redundancy of the data Very cryptic Macros
11
Network Database - IDMS • • • • •
Designed by Conference on Data Systems Languages (CODASYL) in late 60s Introduced to overcome Hierarchical DB limitations Data elements are linked through only pointers Eliminated redundancy completely Super set of Hierarchical Database with child and multi parent relationship
12
Disadvantages of IDMS • • •
Difficult to access the system using cumbersome pointer concept It was useful for Programmers than real users Difficult to represent many to many relationship
13
Relational Data Base Management System (RDBMS) •
• • • • • • • •
The Relational Model developed out of the work done by Dr. E. F. Codd at IBM in the early 70s who was looking for ways to solve the problems with the existing models First time introduced in his famous paper “A Relational Model of Data for Large Shared Databanks “ Based on Mathematical model of Relational Algebra At the core of the relational model is the concept of a table (also called a relation) in which all data is stored Each table is made up of records (horizontal rows also known as tuples) and fields (vertical columns also known as attributes) Data is separated from application – No more application dependent/centric In the relational model, operations that manipulate data do so on the basis of the data values themselves. Extremely easy meta data management Query Language Interface
14
Application Programs using DBMS Services
Application Programs
DBMS
File System Storage 15
Example of RDBMS Table SNO SNAME S1 S2 S3 S4 S5
Smith Jones Blake Clark Adams
STATUS 20 10 30 20 30
CITY London Paris Paris London Athens
16
ER modeling •
ER modeling : A graphical technique for understanding and organizing the data independent of the actual database implementation
•
Entity: Any thing that may have an independent existence and about which we intend to collect data. Also known as Entity type.
•
Entity instance: a particular member of the entity type e.g. a particular student
•
Attributes: Properties/characteristics that describe entities
•
Relationships: Associations between entities
17
Steps in ER Modeling •
Identify the Entities
•
Find relationships
•
Identify the key attributes for every Entity
•
Identify other relevant attributes
•
Draw complete E-R diagram with all attributes including Primary Key
•
Review your results with your Business users
18
Case Study – ER Model For a college DB Assumptions : • • • • • • • • • •
A college contains many departments Each department can offer any number of courses Many instructors can work in a department An instructor can work only in one department For each department there is a Head An instructor can be head of only one department Each instructor can take any number of courses A course can be taken by only one instructor A student can enroll for any number of courses Each course can have any number of students
19
Step 1 : Identify the Entities : • • • •
DEPARTMENT STUDENT COURSE INSTRUCTOR
Step 2 : Find the relationships •
One course is enrolled by multiple students and one student enrolls for multiple courses, hence the cardinality between course and student is Many to Many.
•
The department offers many courses and each course belongs to only one department, hence the cardinality between department and course is One to Many.
•
One department has multiple instructors and one instructor belongs to one and only one department , hence the cardinality between department and instructor is one to Many.
•
Each department there is a “Head of department” and one instructor is “Head of department “,hence the cardinality is one to one .
•
One course is taught by only one instructor, but the instructor teaches many courses, hence the cardinality between course and instructor is many to one.
20
Step 3: Identify the key attributes • • • •
Deptname is the key attribute for the Entity “Department”, as it identifies the Department uniquely. Course# (CourseId) is the key attribute for “Course” Entity. Student# (Student Number) is the key attribute for “Student” Entity. Instructor Name is the key attribute for “Instructor” Entity.
Step 4: Identify other relevant attributes For the department entity, the relevant attribute is location For course entity, course name,duration,prerequisite For instructor entity, room#, telephone# For student entity, student name, date of birth
21
Step 5: Draw complete E-R diagram with all attributes including Primary Key Department Name
Location
Department
Pre Requisite
1
1
1 Headed by
Offers
Has
Course# 1
N Duration
Course Name
N Course
N
Is taught by
1 Instructor
Instructor Name
Enrolled by
Room# Telephone#
M Student Date of Birth
Student#
N
Student Name
22
What is Normalization? •
Database designed based on the E-R model may have some amount of – Inconsistency – Uncertainty – Redundancy
To eliminate these draw backs some refinement has to be done on the database. – Refinement process is called Normalization – Defined as a step-by-step process of decomposing a complex relation into a simple and stable data structure. – The formal process that can be followed to achieve a good database design – Also used to check that an existing design is of good quality – The different stages of normalization are known as “normal forms” – To accomplish normalization we need to understand the concept of Functional Dependencies.
23
Need for Normalization Student_Course_Result Table
Student_Details
Course_Details
Result_Details
101
Davis
11/4/1986
M4
Applied Mathematics
Basic Mathematics
7
11/11/2004
82
A
102
Daniel
11/6/1987
M4
Applied Mathematics
Basic Mathematics
7
11/11/2004
62
C
101
Davis
11/4/1986
H6
American History
4
11/22/2004
79
B
103
Sandra
10/2/1988
C3
Bio Chemistry
11
11/16/2004
65
B
104
Evelyn
2/22/1986
B3
Botany
8
11/26/2004
77
B
102
Daniel
11/6/1987
P3
Nuclear Physics
Basic Physics
13
11/12/2004
68
B
105
Susan
8/31/1985
P3
Nuclear Physics
Basic Physics
13
11/12/2004
89
A
103
Sandra
10/2/1988
B4
Zoology
5
11/27/2004
54
D
105
Susan
8/31/1985
H6
American History
4
11/22/2004
87
A
104
Evelyn
2/22/1986
M4
Applied Mathematics
7
11/11/2004
65
B
Basic Chemistry
Basic Mathematics
Insert Anomaly Delete Anomaly Update Anomaly Data Duplication
24
Functional dependency •
In a given relation R, X and Y are attributes. Attribute Y is functionally dependent on attribute X if each value of X determines EXACTLY ONE value of Y, which is represented as X -> Y (X can be composite in nature).
•
We say here “x determines y” or “y is functionally dependent on x” X→Y does not imply Y→X
•
If the value of an attribute “Marks” is known then the value of an attribute “Grade” is determined since Marks→ Grade
•
Types of functional dependencies: – Full Functional dependency – Partial Functional dependency – Transitive dependency
25
Functional Dependencies Consider the following Relation REPORT (STUDENT#,COURSE#, CourseName, IName, Room#, Marks, Grade) • • • • • • •
STUDENT# - Student Number COURSE# - Course Number CourseName - Course Name IName - Name of the Instructor who delivered the course Room# - Room number which is assigned to respective Instructor Marks - Scored in Course COURSE# by Student STUDENT# Grade - obtained by Student STUDENT# in Course COURSE#
26
Functional Dependencies- From the previous example •
STUDENT# COURSE# Marks
•
COURSE# CourseName,
•
COURSE# IName (Assuming one course is taught by one and only one Instructor)
•
IName Room# (Assuming each Instructor has his/her own and nonshared room)
•
Marks Grade
27
Full dependencies X and Y are attributes. X Functionally determines Y Note: Subset of X should not functionally determine Y
Student#
Marks Course#
28
Partial dependencies X and Y are attributes. Attribute Y is partially dependent on the attribute X only if it is dependent on a sub-set of attribute X.
Student#
Course#
CourseName
IName
Room#
29
Transitive dependencies X Y and Z are three attributes. X -> Y Y-> Z => X -> Z
Course#
IName
Room#
30
First normal form: 1NF •
A relation schema is in 1NF : –
if and only if all the attributes of the relation R are atomic in nature.
–
Atomic: the smallest level to which data may be broken down and remain meaningful
31
Student_Course_Result Table Student_Details 101
Davis
11/4/1986
Course_Details
Results
M4
Applied Mathematics
Basic Mathematics
7
11/11/2004
82
A
Basic Mathematics
7
11/11/2004
62
C
4
11/22/2004
79
B
11
11/16/2004
65
B
8
11/26/2004
77
B
102
Daniel
11/6/1987
M4
Applied Mathematics
101
Davis
11/4/1986
H6
American History
103
Sandra
10/2/1988
C3
Bio Chemistry
104
Evelyn
2/22/1986
B3
Botany
102
Daniel
11/6/1987
P3
Nuclear Physics
Basic Physics
13
11/12/2004
68
B
105
Susan
8/31/1985
P3
Nuclear Physics
Basic Physics
13
11/12/2004
89
A
103
Sandra
10/2/1988
B4
Zoology
5
11/27/2004
54
D
105
Susan
8/31/1985
H6
American History
4
11/22/2004
87
A
104
Evelyn
2/22/1986
M4
Applied Mathematics
7
11/11/2004
65
B
Basic Chemistry
Basic Mathematics
32
Table in 1NF Student#
Student Name
Dateof Birth
Student_Course_Result Table Cour
CourseName
Pre Requisite
s e #
Dura
DateOf Exam
Marks
Grade
InDa
101
Davis
04-Nov-1986
M4
Applied Mathematics
Basic Mathematics
7
11-Nov-2004
82
A
102
Daniel
06-Nov-1986
M4
Applied Mathematics
Basic Mathematics
7
11-Nov-2004
62
C
101
Davis
04-Nov-1986
H6
American History
4
22-Nov-2004
79
B
103
Sandra
02-Oct-1988
C3
Bio Chemistry
11
16-Nov-2004
65
B
104
Evelyn
22-Feb-1986
B3
Botany
8
26-Nov-2004
77
B
102
Daniel
06-Nov-1986
P3
Nuclear Physics
Basic Physics
13
12-Nov-2004
68
B
105
Susan
31-Aug-1985
P3
Nuclear Physics
Basic Physics
13
12-Nov-2004
89
A
103
Sandra
02-Oct-1988
B4
Zoology
5
27-Nov-2004
54
D
105
Susan
31-Aug-1985
H6
American History
4
22-Nov-2004
87
A
104
Evelyn
22-Feb-1986
M4
Applied Mathematics
7
11-Nov-2004
65
B
Basic Chemistry
Basic Mathematics
33
Second normal form: 2NF •
A Relation is said to be in Second Normal Form if and only if : – It is in the First normal form, and – No partial dependency exists between non-key attributes and key attributes.
An attribute of a relation R that belongs to any key of R is said to be a prime attribute and that which doesn’t is a non-prime attribute
34
Second Normal Form • • • •
STUDENT# is key attribute for Student, COURSE# is key attribute for Course STUDENT# COURSE# together form the composite key attributes for Results relationship. Other attributes like StudentName (Student Name), DateofBirth, CourseName, PreRequisite, DurationInDays, DateofExam, Marks and Grade are non-key attributes.
To make this table 2NF compliant, we have to remove all the partial dependencies. Student #, Course# -> Marks, Grade Student# -> StudentName, DOB, Course# -> CourseName, Prerequiste, DurationInDays Course# -> Date of Exam
35
Second Normal Form - Tables in 2 NF STUDENT TABLE Student#
StudentName
DateofBirth
101
Davis
04-Nov-1986
102
Daniel
06-Nov-1987
COURSE TABLE Course#
Course Name
M1
Basic Mathematics
Pre Requisite
Duration InDays
11
103
Sandra
02-Oct-1988
M4
Applied Mathematics
104
Evelyn
22-Feb-1986
H6
American History
4
105
Susan
31-Aug-1985
C1
Basic Chemistry
5
106
Mike
04-Feb-1987
C3
Bio Chemistry
107
Juliet
09-Nov-1986
B3
Botany
8
108
Tom
07-Oct-1986
P1
Basic Physics
8
109
Catherine
06-Jun-1984
P3
Nuclear Physics
B4
Zoology 36
M1
C1
P1
7
11
13 5
Second Normal form – Tables in 2 NF Student#
Course#
Marks
Grade
101 M4
82 A
102 M4
62 C
101 H6
79 B
103 C3
65 B
104 B3
77 B
102 P3
68 B
105 P3
89 A
103 B4
54 D
105 H6
87 A
104 M4
65 B 37
Second Normal form – Tables in 2 NF Exam_Date Table Course#
DateOfExam
M4
11-Nov-04
H6
22-Nov-04
C3
16-Nov-04
B3
26-Nov-04
P3
12-Nov-04
B4
27-Nov-04
38
Third normal form:3 NF A relation R is said to be in the Third Normal Form (3NF) if and only if - It is in 2NF and
- No transitive dependency exists between non-key attributes and key attributes.
• STUDENT# and COURSE# are the key attributes. • All other attributes, except grade are non-partially, non-transitively dependent on key attributes.
•
Student#, Course# - > Marks
• Marks -> Grade
39
3NF Tables Student#
Course#
Marks
101 M4
82
102 M4
62
101 H6
79
103 C3
65
104 B3
77
102 P3
68
105 P3
89
103 B4
54
105 H6
87
104 M4
65 40
Third Normal Form – Tables in 3rd NF MARKSGRADE TABLE UpperBound
LowerBound
Grade
100
95 A+
94
85 A
84
70 B
69
65 B-
64
55 C
54
45 D
44
0 E 41
Boyce-Codd normal form - BCNF A relation is said to be in Boyce Codd Normal Form (BCNF) - if and only if all the determinants are candidate keys. BCNF relation is a strong 3NF, but not every 3NF relation is BCNF.
42
Consider this Result Table Student#
EmailID
Course#
Marks
101
[email protected]
M4
82
102
[email protected]
M4
62
101
[email protected]
H6
79
103
[email protected]
C3
65
104
[email protected]
B3
77
102
[email protected]
P3
68
105
[email protected]
P3
89
103
[email protected]
B4
54
105
[email protected]
H6
87
104
[email protected]
M4
65
43
BCNF S#
C#
Candidate Keys for the relation are - STUDENT# COURSE# and COURSE# EmailID Since Course # is overlapping, it is referred as Overlapping Candidate Key.
Valid Functional Dependendencies are Student# - > EmailID ( Non Key Determinant) EmailID - > Student# ( Non Key Determinant) Student#, Course#
- > Marks
Course# , EmailID - > Student#
44
EmailID
BCNF
STUDENT TABLE Student#
EmailID
101
[email protected] 102
[email protected]
103
[email protected] 104
[email protected] 105
[email protected]
45
BCNF Tables Student#
Course#
Marks
101 M4
82
102 M4
62
101 H6
79
103 C3
65
104 B3
77
102 P3
68
105 P3
89
103 B4
54
105 H6
87
104 M4
65 46
Merits of Normalization •
Normalization is based on a mathematical foundation.
•
Removes the redundancy to a greater extent. After 3NF, data redundancy is minimized to the extent of foreign keys.
•
Removes the anomalies present in INSERTs, UPDATEs and DELETEs.
47
Demerits of Normalization •
Data retrieval or SELECT operation performance will be severely affected.
•
Normalization might not always represent real world scenarios.
48
SQL - Background •
• • • • • •
Conceived in mid-1970’s as a database language for the relational model Developed by IBM First standardized in 1986 by ANSI Enhanced in 1989 Revised again in ‘92 Non Procedural language Number of commercial products
49
SQL Statements •
Data Definition Language (DDL) – CREATE TABLE – ALTER TABLE – DROP TABLE
•
Data Manipulation Language (DML) – – – –
•
SELECT INSERT UPDATE DELETE
Data Control Language (DCL) – GRANT – REVOKE
50
SQL - Some ANSI/ISO Keywords ALL AND AVG BETWEEN CHAR COMMIT COUNT CREATE DECIMAL DELETE
DISTINCT DROP FETCH GRANT GROUP BY HAVING IN INSERT MAX MIN NOT 51
NULL OR PRIVILEGE REFERENCES REVOKE SELECT SUM TABLE UPDATE VIEW WHERE
Commercial Products •
• • • •
• •
In 2004, the RDBMS market grew 10%, rising from just under $7.1 billion to nearly $7.8 billion in new license sales DB2 UDB – Market Leader Oracle – Fastest growing Database on Unix Boxes SQL Server 2000 – Leader in Windows Platform Teradata – Most efficient, Self tuning, Costliest DB for Data Warehousing application Sybase –May be target for acquisition MySQL – Open Source Database
Company
2004 Market Share%
2003-04 Growth%
IBM
34.1
5.8
Oracle
33.7
14.6
Microsoft
20
18
NCR Teradata
2.9
17.2
Sybase
2.3
0.5
52
On Line Transaction Processing (OLTP) System Handle • Several concurrent transactions from – – – –
• •
Spatially Distributed M/cs Execution of Instructions and Queries across LAN/WAN Geographically distributed processors Spatially Distributed Databases
Transaction is defined as logical unit of program execution that takes a system from one consistent state to another consistent state OLTP system should adhere to ACID Properties - Atomicity - Consistency - Isolation - Durability
53
State diagram of a transaction While executing
active l a n fi as e t h nt h ed r t e te Af tem xecu sta en e be
When normal execution can’t proceed
partially completed
failed After rollback and restoration to prev state
A suc fter com cessf ple ul tion
aborted
committed
54
Concurrency Vs Consistency in OLTP • • •
Concurrency and Consistency are inversely proportional to each other Multiple transactions accessing same resource simultaneously Problems associated with OLTP applications are – – – –
Lost update Dirty read Incorrect Summary Phantom records
55
Serialization Techniques • • •
Locking Time stamping Ensures consistency of the database while allowing concurrent access of the resources
56
Locking •
A lock is a variable associated with each data item in a database.
•
When updated by a transaction, DBMS locks the data item
•
serializability could be maintained by this.
•
Lock could be Shared or Exclusive Deadlock is most common problem with locking mechanism
•
57
Timestamping • • • •
Occurs when an older transaction tries to read a value that is written by a younger transaction. Or when an older transaction tries to modify(Write) a value already read or written by a younger transaction Both of these attempts signify that the older transaction was “too late” in performing the required operation Commercially not a viable option because of too much rollback
58
New Trends in DB Technology • • •
OODBMS OORDBMS XML Integration
59
OODBMS • • • •
Any user-defined data structures Any user-defined operations Any user-defined relationships Useful for – – – – –
Manufacturing Telecommunication CAD/CAM Multimedia products Aerospace and Flight simulations
60
Relationship in OODBMS • • • • •
One - Many Many - Many Is A Extends Whole-part
61
Commercial Packages • • • • • •
Objectivity Poet Jasmine Gemstone Itasca ObjectStore
More details log on to http://www.geocities.com/SiliconValley/2139/products.html
62
Limitations of OODBMS • • • •
procedural navigation No querying as it breaks encapsulation No mathematical foundation Not suitable for adhoc reporting system
63
OORDBMS • • • • • •
Marrying Relational and Object Oriented concepts Still data is stored in Relational manner Object wrapper for application Performance is the major concern Still under development stage Commercial Products – – – – –
Informix Universal Server (Illustra) ( Merged with IBM ) Oracle Oracle 10g IBM DB2 UDB UniSQL UniSQL/X Unisys OSMOS
64
XML in DB • •
Data-centric to Document-centric Simpler integration between Database and other tools like – – – –
• •
Middlewares EAI tools ERP tools Other Databases
Introduction of Native XML data type XML Query Language
65
What is OLAP or DW or BI? •
An organization’s success also depends on its ability to analyze data (through views and reports) and make intelligent decisions that potentially affect its future. Systems that facilitate such analyses are called On Line Analytical Processing (OLAP) systems or Data Warehousing System
•
Why not OLTP?
•
–
OLTP databases do not contain historical data
–
OLTP databases contain small subsets of organizational data
–
OLTP databases are heterogeneous in nature and geographically distributed systems
OLTP systems are – Fragmented –
Not integrated.
–
Difficult to access.
–
Disparate sources.
–
Disparate platforms.
–
Poor data quality.
–
Redundant data.
–
Difficult to understand
66
Data warehouse / Business Intelligence •
A Data Warehouse is a copy of the enterprise operational data, suitably modified to support the needs of analytical processes and stored outside the operational database.
•
According to Bill Inmon, known as the father of Data Warehousing, a data warehouse is a –
Subject oriented,
–
Integrated,
–
Time-variant,
–
Nonvolatile
–
Collection of data in support of management decisions.
67
Data warehouse architecture Data Warehouse Server (Tier 1)
OLAP Servers (Tier 2)
Clients (Tier 3)
e.g., MOLAP Semistructured Sources
Data Warehouse
Analysis serve
extract transform load refresh etc.
Query/Reporting
serve e.g., ROLAP
Operational DB’s
serve
Data Marts 68
Data Mining
Components of DW •
Extraction Transformation and Loading (ETL) – – – –
•
Informatica Power Center Data Stage AbInitio WebFOCUS
Data Warehouse – Teradata – DB2 UDB – Oracle 10g OLAP – Business Object – COGNOS – Hyperion – Power Analyzer
•
Data Mining – Intelligent Miner – Darwin – SAS Miner
69
Complementing Technology •
How many Infy shares sold yesterday in NASDAQ? What was the highest and lowest Price? – OLTP System
•
How Infy shares are doing in NASDAQ with respect to NSE India in last 5 Years? What’s the volume? P/E Ratio? Highest and Lowest Price? – DW System
•
What will be the Infy earnings in second quarter of next year? What will be the share price during that period? – Data Mining System
70
References • • • • • • • • • • • •
E&R VSAM Presentation E&R IMS Presentation E&R IDMS Presentation E&R RDBMS Presentation E&R OODBMS Presentation E&R OORDBMS Presentations E&R DW and BI Presentations http://mngktrmerweb/techportals/db/hanu_bank.htm www.oracle.com www.ibm.com www.mssqlserver.com www.sybase.com
71
Thank You
72