MANAGING DATA RESOURCES ORGANIZING DATA IN A TRADITIONAL FILE ENVIRONMENT Information is becoming as important a business resource as money, material, and people. Businesses are realizing the competitive advantage they can gain over their competition through useful information, not just data. An effective information system provides users with timely, accurate, and relevant information. This information is stored in computer files. When the files are properly arranged and maintained, users can easily access and retrieve the information they need. FILE ORGANIZATION TERMS AND CONCEPTS A computer system organizes data in a hierarchy that starts with bits and bytes and progresses to fields, records, files, and databases.
• •
Bit: A bit represents the smallest unit of data a computer can handle. Such as 0 and 1.
•
Field: A field is a unit of data consisting of one or more characters. An example of a field is your name, your address, or your Social Security number.
•
Record: A record is a collection of related fields. An example of a record would be your name and address and Social Security number.
•
File: A file is a collection of related records. An example of a file is collected data on everyone employed in the same department of a company, including all names, addresses, and Social Security numbers.
•
Database: A database is a collection of related files. A company database might include files on all past and current employees in all departments.
Byte (character): A byte is a group of bits. A character may be – but is not necessarily – the same as a byte. A character is a single letter, number, or special character such as ;, $, or %.
Figure: The data hierarchy A computer system organizes data in a hierarchy that starts with the bit, which represent either a 0 or 1. Bits can be grouped to form a byte to represents one character, number, or symbol. Bytes can be grouped form a field, and related fields can be grouped to form a record. Related records can be collected to form a file, and related files can be organized into a database. Entity: A record describes an entity. An entity is person, place, thing, or event on which we maintain information. An order is typical entity in a sales order file, which maintains information on a firm’s sales orders. Attribute: Each characteristic or quality describing a particular entity is called an attribute. For example, order number, order date, order amount, item number, and item quantity would each be an attribute of the entity order. The specific values that these attributes can have can be found in the fields of the record describing the entity order. Key Field: Every record in a file should contain at least one field that uniquely identifies that record so that the record can be retrieved, updated, or sorted. This identifier field is called a key field. An example of a key field is the order number for the order record. E n t i t y = O R D E R A
O r d e r n u m b e r 4 3 4 0 k
e y
O
r d e r d a t e
0 2 / 0 8 / 9 9
t t r i b u t e s
I t e m n u m b e r 1 5 8 3
f i e l d
-1-
Q
u a n t i t y
2
A
m
o u n t
1 7 . 4 0
f i e l d s
ACCESSING RECORDS FROM COMPUTER FILES Computer systems store files on secondary storage devices. Records can be arranged in several ways on storage media, and the arrangement determines the manner in which individual records can be accessed or retrieved. One way to organize records is sequentially. In Sequential file organization, data records must be retrieved in the same physical sequence in which they are stored. In contrast, Direct or random file organization, allows user to access records in any sequence the desire, without regard to their actual physical order on the storage media. When we were describing secondary storage, we talked about magnetic tape and disk storage for computer data. To understand how information is accessed from these mediums, think about the difference between a music cassette tape and a music CD. If you want to get to a particular song on a cassette tape, you must pass by all the other songs sequentially. If you want to get to a song on CD, you can go directly to that song without worrying about any of the others. That is the difference between sequential and direct access organization for database records. The Indexed Sequential Access Method The indexed sequential access method is such a file access method that directly access records organized sequentially using an index of key fields. An index is a table or list that relates record keys to physical locations on direct access files. To explain the indexed sequential access method, let's go back to the example of the cassette tape. A cassette tape label has a printed list of the songs contained on it which gives you a general idea of where to go on the tape to find a particular tune. So too with computer records on a sequential access tape using the key field. It gives the computer a pretty accurate idea of where a particular record is located. That's why it's so a important to have a unique ID as the key field. You and your customer could have a difficult time if the key field is duplicated among several records. Each key field and the ultimate location of that record on the storage device is maintained in the index. C R
y l i n d e
r
I n d e x
e c o r d
2 3 0 K e y f i e l d
| D
C
a t a
T r a c k I n d e x f o r C y l i n d e r
C
y l i n d e rH 1 2 3 . . . .
i g h e 2 3 5
T
1
s 0 9 8
y l i n d e rH 1 2 3 . . . .
t
K 0 2 8 . . . .
e y
C
r a c k
i g h e 2 3 5
I n d e
y l i n d e rH 1 2 3 . . . .
x
s 0 9 8
t
K
f o r
i g h e 2 3 5
s 0 9 8
e y
0 2 8 . . . . C
t
y l i n d e r
K
e y
0 2 8 . . . .
2
C
T r a c k I n d e x f o r C y l i n d e r
y l i n d e rH 1 2 3 . . . .
i g h e 2 3 5
s 0 9 8
3
t
K
e y
0 2 8 . . . .
Direct File Access Method The direct file access method is used with direct file organization. This method employs a key field to locate the physical address of a record. However, the process is accomplished using a mathematical formula called a transform algorithm to translate the key field directly into the record’s physical storage location on disk. The algorithm performs some mathematical computation on the record key, and the result of that calculation is the record’s physical address. An example might be an on-line hotel reservation system.
Figure: The direct file access method
-2-
PROBLEMS WITH THE TRADITIONAL FILE ENVIRONMENT The traditional file environment is a way of collecting and maintaining data in an organization that leads to each functional area or division creating and maintaining its own data files and programs. Many problems, such as data redundancy, program-data dependence, inflexibility, poor data security, and inability to share data among applications, have occurred with traditional file environments. Data Redundancy and Confusion Data redundancy is the presence of duplicate data in multiple data files. Data redundancy occurs when different divisions, functional areas, and groups in an organization independently collect the same piece of information. For instance, within the commercial loans division of a bank, the marketing and credit information functions might collect the same customer information. Because it is collected and maintained in so many different places, the same data item may have different meanings in different parts of the organization. Program-Data Dependence The close relationship between data stored in files and the software programs that update and maintain those files. Any change in data organization or format requires a change in all the programs associated with those files. In many instances, applications work with outdated data simply because of the difficulty of making updates. Lack of Flexibility A traditional file system can deliver routine scheduled reports after extensive programming efforts, but it cannot deliver ad hoc reports or respond to unanticipated information requirements in a timely fashion. The information required by ad hoc requests is “somewhere in the system” but is too expensive to retrieve. Several programmers would have to work for weeks to put together the required data items in a new file. Users – in particular, senior management – begin to wonder at this point why they have computers at all. Poor Security Because there is little control or management of data, access to and dissemination of result are virtually out of control. What limits on access exist tend to be the result of habit and tradition, as well as of the sheer difficulty of finding information. Lack of Data Sharing and Availability The lack of control over access to data in this confused environment does not make it easy for people to obtain information. Because pieces of information in different files and different parts of the organization cannot be related tone another, it is virtually impossible for information to be shared or accessed in a timely manner.
-3-
A MODERN DATABASE ENVIRONMENT The key to establishing an effective, efficient database is to involve the entire organization as much as possible, even if everyone seemingly will not be connected to it or be a user of it. Perhaps they won't be a part of it in the beginning, but they very well could be later on. DATABASE MANAGEMENT SYSTEMS (DBMS) A Database Management System is basically another software program like Word or Excel or Email. This type of software is more complicated: it permits an organization to centralize data, manage them efficiently, and provide access to the stored data by application programs.
H
E
P
U
I N T E G M A N D A T A
m p l o y N a m e A d d r e S o c i a l P o s i t i o M a r i t a a y r o H o u P a y G r o F e d S t a t N e t
e
R R B
e
A T E E S O A S E
D U
R
C
E
S
s :
s s S e c u r i t y n l S t a t u s
l l : r s w o r k r a t e s s P a y t a x e T a x P a y
e
N
u
m
b
e
r
P
e r s o n a l p l i c a t i o n r o g r a m s
P D
e r s o n n e l e p a r t m e n t
P a y r o l l A p p l i c a t i o n y s t e m P r o g r a m s
P D
a y r o l l e p a r t m
e n t
B D
e n e p
e n t
A
p P
d
M
D a t a b a s e a n a g e m e n t
S
B A
p P
B
e L P H R
n
e i f e e n e a e t
f i t s : I n s u r a n c s i o n P l a n l t h C a r e i r e m e n t b
e n e f i t s p l i c a t i o n r o g r a m s
e f i t s a r t m
e s P e
l a n n e f i t
Figure: The contemporary database environment
A single Human Resources Database serves multiple applications and also allows a corporation easily to draw together all of the information on various applications. The database management system acts as the interface between the application programs and the data. A DBMS has 3 components, all of them important for the long-term success of the system.
Data definition language Data manipulation language Data dictionary
Data Definition Language The data definition language is the formal language used by programmers to specify the content and structure of the database. The data definition language defines each data element as it appears in the database before that data element is translated into the forms required by application programs. Data Manipulation Language This is a formal language used by programmers to manipulate the data in the database and make sure they are formulated into useful information. The goal of this language should be to make it easy for users. The basic idea is to establish a single data element that can serve multiple users in different departments depending on the situation. Otherwise, you'll be tying up programmers to get information from the database that users should be able to get on their own. Data Dictionary Each data element or field should be carefully analyzed to determine what it will be used for, who will be the primary user, and how it fits into the overall scheme of things. Then write it all down and make it easily available to all users. This is one of the most important steps in creating a good database.
-4-
Figure: Sample data dictionary report. Figure shows a properly constructed data dictionary report. We can see exactly who owns the data element and all the business functions that use the data element. It also lists the people who have access to the data element. LOGICAL AND PHYSICAL VIEWS OF DATA Perhaps the greatest different between a DBMS and traditional file organization is that the DBMS separates the logical and physical views of the data, relieving the programmer or end user from the task of understanding where and how the data are actually stored. The database concept distinguishes between logical and physical views of data. Logical View: Representation of data as they would appear to an application programmer or end user. Physical View: The representation of data as they would be actually organized on physical storage media. The logical description of the entire database, listing all of the data items and the relationship among them, is termed the schema. The specific set of data from the database that is required by each application program is termed the subschema. For example, for the Human Resources Database illustrated in the previous figure, the payroll application would have a subschema consisting of employee name, address, social security number, and would specify payroll data such as a pay rate and hours worked.
-5-
TYPES OF DATABASE ORGANIZATION There are alternative ways of organizing data and representing relationships among data in a database. Conventional DBMS use one of three principal logical database models for keeping track of entities, attributes, and relationships. The three principal logical database models are hierarchical, network, and relational. Each logical model has certain processing advantages and certain business advantages. HIERARCHICAL DBMS In a hierarchical database, files or records are arranged in related groups resembling a family tree, with “child” records subordinate to “parent” records. A parent may have more than one child, but a child always has only one parent. To find a particular record, you have to start at the top with a parent and trace down the chart to the child. Hierarchical DBMSs work well when the data elements have an intrinsic one-to-many relationship, as might happen with a reservation system. The difficulty, however, is that the structure must be defined in advance and is quite rigid. There may be only one parent per child and no relationships among the child records. One type of logical database model that organizes data in a treelike structure. A record is subdivided into segments that are connected to each other in one-to-many parent-child relationships.
Figure: A hierarchical database for a human resources system Figure shows a hierarchical structure that might be used for a human resources database. The root segment is “Employee,” which contains basic employee information such as name, address, and identification number. Immediately below it are three child segments: Compensation (containing salary and promotion data), Job Assignment (containing data about job positions and departments), and Benefits (containing data about beneficiaries and various benefit options). The Compensation segment has two children below it: Performance Ratings (containing data about employee’s job performance evaluations) and Salary History (containing historical data about employee’s past salaries). Below the Benefits segment are child segments for Pension, Life Insurance, and Health Care, containing data about these various benefit plans. NETWORK DATABASE MODEL A network database is similar to a hierarchical DBMS, but each child record can have more than one parent record. Thus, a child record may be reached through more than one parent. This arrangement is more flexible than the hierarchical one. However, it still requires that the structure be defined in advance. Moreover, there are limits to the number of links that can be made among records. The network data model is a variation of the hierarchical data model. Indeed, databases can be translated from hierarchical to network and vice versa in order to optimize processing speed and convenience. Whereas hierarchical structures depict one-to-many relationships, network structures depict data logically as many-to-many relationships. In other words, parents can have multiple “children” and a child can have more than one parent.
Figure: The network data model This illustration of a network data model showing the relationship the students in a university have to the courses they take represents and example of logical many-to-many relationships.
-6-
RELATIONAL DATA MODEL The relational data model, the most recent of these three database models, overcomes some of the limitations of the other two models. The relational model represents all data in the database as simple two-dimensional tables called relations. The tables appear similar to flat files, but the information in more than one file can be easily extracted and combined. Sometimes the tables are referred to as files. In a relational database, each table contains a primary key, a unique identifier for each record. To make sure the tables relate to each other, the primary key from one table is stored in a related table as a secondary key. For instance, in the Customer table the primary key is the unique Customer ID. That primary key is then stored in the Order Table as the secondary key so that the two tables have a direct relationship. Customer Table
Order Table
Field Name
Description
Order Number
Primary Key
Customer Name
Self Explanatory
Order Item
Self Explanatory
Customer Address Self Explanatory
Number of Items Ordered Self Explanatory
Customer ID
Primary Key-----> Customer ID
Order Number
Secondary Key
Secondary Key
Select: create a subset of records meeting the stated criteria Join: combine related tables to provide more information than individual tables Project: create a new table from subsets of previous tables
The biggest problem with these databases is the misconception that every data element should be stored in the same table. In fact, each data element should be analyzed in relation to other data elements with the goal of making the tables as small in size as possible. The ideal relational database will have many small tables, not one big one. On the surface that may seem like extra work and effort, but by keeping the tables small, they can serve a wider audience because they are more flexible. This setup is especially helpful in reducing redundancy and increasing the usefulness of data. COMPARISON OF DATABASE ALTERNATIVES Hierarchical and network databases can be very efficient as long as we plan ahead. But as we know, needs change, and neither one of these databases offers a lot of flexibility to change with business needs. It's sort of like parents and children; once you establish the tie, it's pretty hard to amend. Relational database management systems are more flexible, especially if you keep the tables small. It is much easier for non-techies to create the query language in a relational system. It's also easier to add new data elements, although if we do, we'll have to go back and fill in the missing information for the old records or just forget them altogether.
Table: Comparison of Database Alternatives Table compares these alternatives on several dimensions to show we the advantages and disadvantages of each. What we should remember is that none of these databases is very good if we don't keep the end user in mind. If we're not careful, we'll wind up with lots of information that no one can use.
-7-
DESIGNING A DATABASE Database design is done to meet both logical and physical organizational needs. Logical design refers to what the database is (what is in it); physical design refers to how the data is organized and stored, and what storage hardware is used. Matching the Design to the Organization Most users working with microcomputer-based DBMS software focus on a very specific set of objectives and information processing needs-their own. For many small applications this is a satisfactory approach. However, the objectives must be broader when working with a large corporation, a large computer system, and more complex and sophisticated DBMS software. And the plans for the use of the database management system must be integrated with the long-range plans for the company’s total information system. The information processing needs of the entire corporation must be considered to build a corporate database that facilitat6es collecting, maintaining, and sharing data among all organizational units. Once the general information needs have been established, the design process can proceed. This process usually comprises two distinct phases of activity – the logical design phase and the physical design phase. Logical Design Logical database design refers to what the database is, not to how it operates; in other words, the logical design is a detailed description of the database model from the users’ perspective rather than the technical perspective. The logical design of a database involves defining user information needs, analyzing data element requirements and logical groupings, considering input and output methods, finalizing the design, and creating the data dictionary. The focus is on identifying every element of data necessary to produce the required information systems reports and on the relationship among the records. Physical Design Once the logical design of the database is ready, the next step is physical design. The physical database design specifies exactly how the data will be arranged and stored on the direct access storage devices allocated for DBMS use. The objective of the physical design is to store data so that it can be updated and retrieved as quickly and efficiently as possible. DBMS users are not involved in the physical design of the DBMS, since that is determined by the type of DBMS software they have purchased for their microcomputer or that is running on a large computer system.
-8-
DATABASE ADMINISTRATION Database Administration The effective use of a database within an organization requires a great deal of cooperation and coordination. User requirements and needs throughout an organization need to be frequently reviewed, and the overall security and integrity of the database must be ensured. Organizations working with DBMS quickly recognized the need for a database administrator (DBA), an individual or a group of individuals to coordinate all related database activities and to control the database. Why Administrator? As we mentioned earlier, developing and implementing a corporate-wide database is a major task that requires management’s complete support, substantial time from designers and users, and often large sums of money. This task needs to be coordinated. In addition, the data in the database often represents the company’s most precious resource: It must be managed well, so that it is not misused or damaged. MANAGEMENT REQUIREMENTS FOR DATABASE SYSTEMS Much more is required for the development of database systems than simply selecting a logical database model. Indeed, this selection may be among the last decisions. The database is an organizational discipline, a method, rather than a tool or technology. It requires organizational and conceptual change. Without management support and understanding, database efforts fail. The critical elements in a database environment are: 1. 2. 3. 4.
Data administration Data planning and modeling methodology Database technology and management, and Users
This environment is depicted in the following figure and will now be described.
Figure: Key organizational elements in the database environment Data Administration A special organizational function for managing the organization’s data resources, concerned with information policy, data planning, maintenance of data dictionaries, and data quality standards. Although data administration is a very important organizational function, it has proven very challenging to implement, as described in the Window on Management. Data Planning and Modeling Methodology Because the organizational interests served by the DBMS are much broader than those in the traditional file environment, the organization requires enterprise-wide planning for data. Enterprise analysis, which addresses the information requirements of the entire organization, is needed to develop databases. The purpose of enterprise analysis is to identify the key entities, attributes, and relationships that constitute the organization’s data.
-9-
Database Technology and Management Databases require new software and a new staff specially trained in DBMS techniques as well as new management structures. Most corporations develop a database design and management group within the corporate information system division that is responsible for the ore technical and operational aspects of managing data. The functions it performs are called database administration. This group does the following:
Define and organize database structure and content Develops security procedures to safeguard the database Develops database documentation Maintains the database management software
In close cooperation with users, the design group establishes the physical database, the logical relations among elements, and the access rules and procedures. Users A database serves a wider community of users than traditional systems. Relational systems with fourth-generation query languages permit employees who are not computer specialists to access large databases. In addition, users include trained computer specialists. In order to optimize access for nonspecialists, more resources must be devoted to training end users. Professional systems workers must be retrained in the DBMS language, DBMS application development procedures, and new software practices.
THE JOB OF THE DATABASE ADMINISTRATOR (DBA) The responsibilities for administering the database activities within an organization are usually assigned to an individual or a small group, depending on the size of the organization and the scope and complexity of the database. The database administrator has six major responsibilities.
1.
Database design: The DBA plays a key role in both the logical and the physical design phases. He or she guides the definition of the database content and the creation of the data dictionary, as well as setting data classification and coding procedures and backup and restart/recovery procedures.
2.
Database implementation and operation: The DBA guides the use of the DBMS on a daily basis. Among other things, this includes adding and deleting data, controlling access to data, detecting and repairing losses, instituting restart/recovery procedures when necessary, and assigning space used on secondary storage devices.
3.
Coordination with users: The DBA receives and reviews user requests for additional DBMS support that have been forwarded by programming analysts. The administrator establishes feasibility, resolves redundant or conflicting request, and assists in the process of establishing priorities for the requests. In addition, the DBA is responsible for establishing and enforcing organization-wide DBMS standards for such things as techniques for accessing data, formats in which data elements will be stored, and data element names.
4.
Backup and recovery: The DBA is responsible for preparing a plan to periodically back up the database(s) and for establishing procedures for recovery fro the failure of the DBMS software or related hardware components.
5.
Performance monitoring: The DBA constantly monitors the performance of the DBMS using specialized software to calculate and record operating statistics. If a problem occurs, such as a slowdown in responsiveness, the DBA must identify the problem and take steps to improve the performance.\
6.
System security: The DBA is responsible for designing and implementing a system that controls users’ access to the database files and determines which DBMS operations can be performed, as well as which applications programs can be accessed. This system often involves the assignment of user identification codes and passwords.
Organizations with a well-organized and well-staffed database administration department can be much more successful with their database management systems than organizations without such a department.
- 10 -
ADVANTAGES AND DISADVANTAGES OF THE DBMS A DBMS can minimize data redundancy, allow easy file updating, maximize data integrity and independence, simplify maintenance, increase user productivity and data security, and standardize data definitions. However, DBMSs require complex planning and expertise to create and maintain. They can also be expensive. Advantages of the DBMS
Minimization of data redundancy: More storage becomes available when maintenance of redundant data elements among traditionally separate application files is minimized.
Easy file updating and maximization of data integrity: In traditional systems in which the same element of data was kept in several different files, ensuring that all copies of the data element were updated when changes were made was a problem. When a data field needed to be changed, it had to he updated in all the files in which it occurred. If some files were missed, data became inconsistent. When data is inconsistent, data integrity is not maintained, and reports will be produced with erroneous information. Data is no longer accurate, reliable, and/or timely. DBMSs make updating files much easier and so improve the consistency of data, thus ensuring data integrity.
Data independence and simplification of program maintenance: In a DBMS the programs are much more independent of the data than in a traditional file processing systems. Previously, programs had to include a substantial amount of information about the format and structure of the fields and records in each file accessed. In a DBMS, this information is contained in the data dictionary.
Increased user productivity: The ability of a DBMS to respond quickly to user requests for additional information without involving the user in technical language manipulation encourages faster and more efficient work. The report generators and query languages associated with database management systems make them easy to use.
Increased security: Control of access to and use of the database is easily established. With traditional file processing systems, the data was too fragmented for effective security to be exercised.
Standardization of data definitions: Before database management systems, each application program could define similar elements of data with different names. However, the use of data dictionaries standardizes the names and descriptions of data elements.
- 11 -
Disadvantages of the DBMS In spite of the advantages of using a DBMS, there are circumstances in which such a system may involve unnecessary overhead costs, as that would not be incurred in traditional file processing. The overhead costs of using a DBMS are due to the following:
High initial investment in hardware, software, and training.
Overhead for providing security, concurrency control, recovery, and integrity functions.
Other problems may arise if the database designers and DBA do not design the database properly or if the database systems applications are not properly implemented. Hence, it may be more favorable to use regular files under the following circumstances.
The database and applications are simple, well defined, and not expected to change.
There are stringent real-time requirements for some programs that may not be met because of DBMS overhead.
It is not required to access data by multiple-users.
WHO OWNS THE DATABASE Here we need to say a few words about who owns a given type of database. Individual Database: The individual database is basically a microcomputer database used by one person. The data is usually stored on a large-capacity hard disk. A sales representative, for example, who is on the road a lot, may build and maintain an individual database of customer and sales information. Company Database: The company database (shared database) is shared by the users of one company in one location. Company databases can be found in local area networks. The company owns the database. The data is usually stored on a microcomputer and managed by a database administrator. Users are linked to the database through terminals or microcomputer workstations. Distributed Database: The distributed database is shared by the users of one company, which owns the database, but the data is stored in several locations linked by a variety of communications networks. Proprietary Database: The proprietary database is a huge database that functions as an information service, such as CompuServe, Prodigy and Dow Jones New/Retrieval. The proprietor owns the database in this case. To access this type of proprietary database, the user needs a modem to hook the computer up to the service via the phone line. Another type of proprietary database can be “licensed” (purchased to sue for a specified period of time) by computer users. For example, ABI’s Business Lists-on-Disc (LCD) contains data on 9.2 million American businesses. It allows users to search for records by company name, type of business, company size, geographic area, and use of Yellow Pages ads in phone books.
REVIEW QUESTIONS 1.
Why is file management important for overall system performance?
2.
Describe how indexes and key fields enable a program to access specific records in a file.
3.
Define and describe the indexed sequential access method and the direct file access method.
4.
List and describe some of the problems of traditional file environment.
5.
Define a database and a database management system.
6.
Name and briefly describe the three components of a DBMS.
7.
What is the difference between a logical and a physical view of data?
8.
List some of the benefits of DBMS.
9.
List some of the limitations of DBMS.
10. Describe the three principal database models and the advantages and disadvantages of each. 11. Describe and briefly comment on the major management challenges in building a database environment.
- 12 -