Charutar Vidya Mandal's SEMCOM
Master of E-Business (Semester-III)
Faculty Name: Miss Nehal P Daulatjada
Unit & Topic: Distributed Databases (Unit-4, Topic-2)

Distributed Databases:
The DDB has emerged as a merger of two technologies: database technology and network/data communication technology. A distributed computing system consists of a number of processing elements, not necessarily homogeneous, that are interconnected by a computer network and that cooperate in performing certain assigned tasks. As a general goal, distributed computing systems partition a big, unmanageable problem into smaller pieces and solve it efficiently in a coordinated manner. The economic viability of this approach stems from two reasons:
1. More computing power can be utilized to solve a complex task.
2. Each autonomous processing element can be managed independently and can develop its own applications.
While early databases moved toward centralization, resulting in monolithic, gigantic databases, the trend has since reversed toward more decentralization and autonomy of processing.
A distributed database (DDB) can be defined as a collection of multiple logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) can be defined as a software system that manages a distributed database while making the distribution transparent to the user.
In a distributed database, the database is physically stored in two or more computer systems. Although geographically dispersed, a DDB system manages and controls the entire database as a single collection of data. If redundant data is stored in separate databases due to performance requirements, updates to one set of data will automatically update the additional sets in a timely manner.
A distributed system has both data and transaction processing divided among one or more computers connected by a network, each computer playing a specific role in the system. DDBs bring together the advantages of distributed computing and database management. DDBs use client/server architecture to process information requests. A DDB system allows applications to access data from local and remote databases.
In a homogeneous DDB system, every database in the system is from the same vendor. In a heterogeneous DDB system, at least one of the databases is from a different vendor. E.g. in an Oracle homogeneous DDB system, all the databases in the system are Oracle databases, whereas in a heterogeneous one there would be Oracle databases alongside databases from other vendors such as Sybase or DB2.
A DDB appears to a user as a single database but is, in fact, a set of databases stored on multiple computers. The data on several computers can be simultaneously accessed and modified over a network. Each database server in the DDB is controlled by its local DBMS, and the servers cooperate to maintain the consistency of the global database.
Advantages of Distributed Databases:
1. Transparency: A DDB should be distribution transparent, meaning that the details of where each file is physically stored are hidden from the user. This is made possible by the DDBMS. The following transparencies are achieved:
A. Distribution or network transparency: This refers to freeing the user from the operational details of the network. Network transparency may be divided into location transparency and naming transparency. Location transparency means that the command used to perform a task is independent of both the location of the data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named object can be accessed unambiguously without additional specification. (A short sketch of these transparencies follows this list.)
B. SQL and COMMIT transparency: A DDBMS provides query, update and transaction transparency. The standard SQL statements SELECT, INSERT, UPDATE and DELETE work just as they do in a non-distributed database environment. The DDBMS should guarantee that all nodes involved in a distributed transaction take the same action: either all commit or all roll back the transaction. If a network or system failure occurs during the commit of a distributed transaction, the transaction should be automatically and transparently resolved.
C. Replication transparency: Replication is the process of storing copies of database objects at multiple sites for better availability, performance and reliability. Replication transparency makes the user unaware of the existence of the copies.
D. Fragmentation transparency: There are two types of fragmentation: horizontal and vertical. Horizontal fragmentation distributes a relation into sets of tuples. Vertical fragmentation distributes a relation into sub-relations, where each sub-relation is defined by a subset of the columns of the original relation. A global query issued by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.
2. Improved reliability and availability: Reliability is defined as the probability that the system is up and running at a certain point in time. Availability is the probability that the system is continuously available during a time interval. When data and DBMS software are distributed over several locations, one location may fail while the other locations continue to operate; only the data and software of the failed location are inaccessible. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site. In a centralized system, by contrast, a failure at the single location makes the whole system unavailable to all users.
3. Performance improvement: A DDBMS fragments the database so as to keep data close to where it is needed most. Data localization reduces contention for CPU and I/O services and reduces network traffic. When a large database is distributed over multiple sites, smaller databases exist at each site. As a result, local queries and transactions accessing data at a single site have better performance because of the smaller local database. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Query performance can be improved further by executing multiple queries at different sites or by breaking up a query into a number of subqueries that execute in parallel.
4. Improved scalability:
In a distributed environment, expansion of the system in terms of adding more data, increasing database sizes or adding more processors is much easier than in a centralized system.
5. Site autonomy: Site autonomy means that each server participating in a DDB is administered independently from all other databases. Although several databases can work together, each database is a separate repository of data that is managed individually. Some of the benefits are:
A. Local administrators control the corresponding local data, so each database administrator's domain of responsibility is smaller and more manageable.
B. Independent failures are less likely to disrupt other nodes of the DDB.
C. Administrators can recover from isolated system failures independently of other nodes in the system.
D. A data dictionary exists for each local database; a global catalog is not necessary to access local data.
E. Nodes can upgrade software independently.
F. Nodes of the system can mirror the logical organization of groups that need to maintain independence.
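As a small sketch of what location and replication transparency can look like in practice, assume Oracle-style syntax and a remote SALES table at a BRANCH site (the same setup used later in these notes); the synonym and copy names here are illustrative, not part of any standard:

    -- Location/naming transparency: hide the remote site behind a synonym,
    -- so users query SALES with no reference to where it actually resides.
    CREATE SYNONYM SALES FOR SALES@BRANCH;
    SELECT * FROM SALES;

    -- Replication transparency: keep a local, periodically refreshed copy
    -- of the remote table; readers need not know it is a replica.
    CREATE MATERIALIZED VIEW SALES_COPY
      REFRESH COMPLETE
      AS SELECT * FROM SALES@BRANCH;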
Functions of a Distributed Database Management System (DDBMS):
Distribution of databases across a network leads to increased complexity in system design and implementation. To achieve the benefits of a distributed database seen earlier, the distributed database management software should be able to perform the following functions in addition to the basic functions performed by a non-distributed DBMS:
1. Distributed query processing: the ability to access remote sites and transmit queries and data among the various sites along the communication network.
2. Data tracking: the ability to keep track of the data distribution, fragmentation and replication by expanding the DDBMS catalog.
3. Distributed transaction management: the ability to devise execution strategies for queries and transactions that access data from more than one site, and to synchronize the access to distributed data and maintain the integrity of the overall database.
4. Replicated data management: the ability to decide which copy of a replicated data item to access and to maintain the consistency of the copies of a replicated data item.
5. Distributed data recovery: the ability to recover from individual site crashes and from failures of communication links.
6. Security: distributed transactions must be executed with proper management of the security of the data and of the authorization and access privileges of users.
7. Distributed catalog management: a catalog or directory contains information about the data in the database, i.e. metadata. The directory may be global for the entire distributed database or local for each location.
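As an illustration of distributed catalog management, a DBMS typically exposes such metadata through queryable dictionary views. Assuming an Oracle-style data dictionary (ALL_DB_LINKS is a standard Oracle view, used here purely for illustration):

    -- List the database links defined at this node, i.e. the remote
    -- databases this site knows how to reach.
    SELECT db_link, host FROM ALL_DB_LINKS;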
Client/Server Database Architecture:
Full-scale DDBMSs supporting all of these functions have not yet been developed. Instead, distributed database applications are being developed in the context of the client-server architecture. It is now also common to use a 3-tier architecture, particularly with web applications.
2-Tier model:
The client-server model is basic to distributed systems. It is a response to the limitations presented by the traditional mainframe client-host model, in which a single mainframe provides shared data access to many dumb terminals. The client-server model is also a response to the LAN model, in which many isolated systems access a file server that provides no processing power. Client-server architecture provides integration of data and services and allows clients to be isolated from inherent complexities such as communication protocols. The simplicity of the client-server architecture allows clients to make requests that are routed to the appropriate server. These requests are made in the form of transactions. The client-server model consists of three parts:
1. Client: The client is the machine running the front-end applications. It interacts with the user through the keyboard, display and a pointing device such as a mouse. "Client" also refers to the client process that runs on the client machine. The client has no direct data access responsibilities; it simply requests processing from the server and displays data managed by the server. The client workstation can therefore be optimized for its job; e.g. it might not need large disk capacity, or it might benefit from graphics capabilities.
2. Server: The server is the machine that runs the DBMS software and handles the functions required for concurrent, shared data access. It is often referred to as the back end. "Server" also refers to the server process that runs on the server machine. The server receives and processes SQL and other query statements originating from client applications. The server can likewise be optimized for its duties; e.g. it can have large disk capacity and fast processors, and it can take on the load of disk I/O, printing, file transfer, and so on.
3. Network: The network enables remote data access through client-to-server and server-to-server communication. The clients and servers may all be located in physical proximity (say, within the same building) and connected via a LAN, or they may be geographically distributed over large distances and connected by a WAN. LANs typically use cables, whereas WANs use telephone lines or satellites. It is also possible to use a combination of the two types of networks. Networks can have different topologies. The topology defines the communication paths among locations, and the topology of the network used may have a significant effect on performance and hence on the strategies for distributed query processing and DDB design.
A database server is the DBMS software managing a database, and a client is an application that requests information from a server. Each computer in a network is a node that can host one or more databases. Each node in a distributed database system can act as a client, a server or both, depending on the situation.
Certain features are always present in the client-server model. There is a server process that can handle requests from one or more client processes concurrently over a network connection. The client machine provides front-end application software for accessing the data on the server. The client initiates transactions; the server processes them. SQL is available to access the data stored on the server side.
E.g. assume that there are two database servers, HQ (holding the EMP table) and BRANCH (holding the SALES table), located at two separate sites connected through a network.
A client can connect directly or indirectly to a database server. A direct connection occurs when a client connects to a server and accesses information from a database contained on that server. E.g. if you connect to the HQ database and access the EMP table on this database, you would write:
SELECT * FROM EMP;
This query is direct because you are not accessing an object on a remote database. In contrast, an indirect connection occurs when a client connects to a server and then accesses information contained in a database on a different server. E.g. if you connect to the HQ database but access the SALES table on the remote BRANCH database, you have to write:
SELECT * FROM SALES@BRANCH;
This query is indirect because the object you are accessing is not on the database to which you are directly connected.
3-Tier model:
In the 3-tier model, the following three layers exist:
1. Presentation layer (client): provides the user interface and interacts with the user.
2. Application layer (business logic): programs the application logic.
3. Database server: handles query and update requests from the application layer, processes the requests and sends back the results.
It is a common approach to include the functionality of a centralized DBMS at the database server level. A number of relational DBMS products have taken this approach, where an SQL server is provided. The application server must then formulate the appropriate SQL queries and connect to the database server when needed. The client provides the processing for user interface interactions. In this architecture, the application server may also refer to a data dictionary that includes information on the distribution of data among the various SQL servers, as well as modules for decomposing a global query into a number of local queries that can be executed at the various sites. Interaction between the application server and a database server might proceed as follows during the processing of an SQL query:
1. The application server formulates a user query based on input from the client layer and decomposes it into a number of independent site queries. Each site query is sent to the appropriate database server site.
2. Each database server processes its local query and sends the results to the application server site.
3. The application server combines the results of the subqueries to produce the result of the originally submitted query, formats it into HTML or some other form accepted by the client, and sends it to the client site for display.
If the DDBMS has the capability to hide the details of data distribution from the application server, it enables the application server to execute global queries and transactions as though the database were centralized, without having to specify the sites at which the data referenced in the query or transaction resides. This property is called distribution transparency.
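The notes above do not show how the BRANCH link itself would be defined, or what a decomposable global query looks like. A minimal sketch in Oracle-style SQL, where the connection details, the network service name 'branch_db' and the SALES columns (eno, amount) are all hypothetical:

    -- On the HQ server: define a database link to the BRANCH server.
    CREATE DATABASE LINK BRANCH
      CONNECT TO sales_user IDENTIFIED BY sales_pwd
      USING 'branch_db';

    -- A global query joining local and remote data. An application
    -- server could issue this as one statement, or decompose it into
    -- one site query per server and combine the results itself.
    SELECT e.ename, s.amount
    FROM   EMP e, SALES@BRANCH s
    WHERE  e.eno = s.eno;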
Additional: Two-Phase Commit Mechanism:
A database must guarantee that all statements in a transaction, distributed or non-distributed, either commit or roll back as a unit. The effects of an ongoing transaction should be invisible to all other transactions at all nodes; this transparency should hold for transactions that include any type of operation, including queries, updates or remote procedure calls. In a DDB, the database management system must coordinate transaction control with these same characteristics over a network and maintain data consistency even if a network or system failure occurs.
The two-phase commit mechanism guarantees that all database servers participating in a distributed transaction either all commit or all roll back the statements in the transaction. A two-phase commit mechanism also protects implicit DML operations performed by integrity constraints, remote procedure calls and triggers. Two-phase commit allows a group of transactions across several nodes to be treated as a unit: either all of the transactions commit, or they all get rolled back. The use of two-phase commit is transparent to users. Two-phase commit consists of two phases, the prepare phase and the execute phase:
1. Prepare phase: An initiating node called the global coordinator notifies all sites involved in the transaction to be ready to either commit or roll back the transaction. The coordinator sends a "prepare for commit" message to each node so that it gets ready for committing the transaction. Each participating database receiving the message force-writes all the transaction details to disk (i.e. writes all pages updated by the transaction immediately to disk) and then sends a "ready to commit" or "OK" signal to the coordinator. If the force-writing to disk fails, or if the local transaction cannot commit for some reason, the participating database sends a "cannot commit" or "not OK" signal to the coordinator. If the coordinator does not receive a reply from a database within a certain timeout interval, it assumes a "not OK" response.
2. Execute phase: If all participating databases reply "OK", the coordinator signals "OK", meaning the transaction is successful. The coordinator sends a "commit" signal for the transaction to all the participating databases, and each participating database completes the transaction by permanently updating the database. On the other hand, if one or more of the databases has given a "not OK" signal, the coordinator sends a message to "rollback" (undo) the local effect of the transaction at each participating database.
Thus, if there is no problem with the prepare phase, all sites commit their transactions; if a network or node failure occurs, all sites roll back their transactions.
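Because two-phase commit is transparent, a user simply writes an ordinary transaction that happens to touch more than one node; the prepare and execute phases take place behind a single COMMIT. A sketch reusing the HQ/BRANCH setup from earlier (the SALES column names are assumed):

    -- Issued while connected to HQ: one transaction updating local
    -- data at HQ and remote data at BRANCH.
    UPDATE EMP SET basic = basic * 1.10 WHERE dno = 2;
    UPDATE SALES@BRANCH SET amount = 0 WHERE eno = 'E2';  -- assumed columns

    -- This single COMMIT triggers two-phase commit: the coordinator
    -- runs the prepare phase at both sites, then signals commit.
    COMMIT;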
Techniques used with Distributed Database Design:
1. Data fragmentation: In a DDB, decisions must be made regarding which site should be used to store which portions of the database. In the simplest case, the logical units of distribution are whole relations. In many cases, however, a relation can be divided into smaller logical units for distribution. E.g. consider a COMPANY database having two relations:
DEPT(dno, dname):
(1, MKT)
(2, FIN)
(3, ADM)
EMP(eno, ename, address, gender, basic, pan, dno):
(E1, Aamir, Mumbai, M, 1000, 1234, 1)
(E2, Abhishek, Mumbai, M, 1100, 2345, 2)
(E3, Amitabh, Mumbai, M, 900, 3456, 3)
(E4, Saif Ali, Mumbai, M, 9500, 4567, 2)
(E5, Shahrukh, Mumbai, M, 9000, 5678, 3)
(E6, Salman, Mumbai, M, 8900, 6789, 3)
Horizontal fragmentation: A horizontal fragment of a relation is a subset of the tuples in that relation. The tuples that belong to the horizontal fragment are specified by a condition on one or more attributes of the relation. If we decide on horizontal fragmentation by department number, then we store the database information relating to each department at the computer site for that department. E.g.:
At the 1st site, for department (dno=1), we store the EMP record (eno=E1).
At the 2nd site, for department (dno=2), we store the EMP records (eno=E2), (eno=E4).
At the 3rd site, for department (dno=3), we store the EMP records (eno=E3), (eno=E5), (eno=E6).
Horizontal fragmentation divides a relation horizontally by grouping rows to create subsets of tuples, where each subset has a certain logical meaning. These fragments can then be assigned to different sites.
Vertical fragmentation: Each site may not need all the attributes of a relation, which indicates the need for a different type of fragmentation. Vertical fragmentation divides a relation vertically, by columns: a vertical fragment of a relation keeps only certain attributes of the relation. E.g. we may fragment the EMP relation vertically by keeping the personal information (ename, address, gender) together and separate from the work-related information (eno, basic, pan, dno). This vertical fragmentation must have a common attribute on which the two fragments can be linked to recover the complete details of an employee. Hence the primary/candidate key of the relation must be included in every fragment, i.e. the personal fragment becomes (eno, ename, address, gender). (An SQL sketch of both kinds of fragment appears at the end of this topic.)
A fragmentation schema of a database is a definition of a set of fragments that includes all attributes and tuples in the database and satisfies the condition that the whole database can be reconstructed from the fragments by applying some UNION operations. In almost every case the fragments are disjoint, except for the key, which is replicated in all fragments. An allocation schema describes the allocation of fragments to the sites of the DDB. It is a mapping that specifies for each fragment the site(s) at which it is stored. If a fragment is stored at more than one site, it is said to be replicated.
2. Data replication and allocation: Replication is useful in improving the availability of data. The most extreme case is replication of the whole database at every site in the distributed system, creating a fully replicated DDB. This can improve availability remarkably, because the system can continue to operate as long as at least one site is up. It also improves the performance of retrieval for global queries, because the result of such a query can be obtained locally from any one site. The disadvantage of full replication is that it can slow down update operations, since a single logical update must be performed on every copy of the database to maintain consistency. The other extreme is to have no replication at all, i.e. each fragment is stored at exactly one site.
Between these two extremes lies a wide spectrum of partial replication, i.e. some fragments of the database may be replicated whereas others are not. The number of copies of each fragment can range from one up to the total number of sites in the distributed system. E.g. salespersons carry partially replicated databases with them on their laptops and personal digital assistants and synchronize them periodically with the server database.
Each fragment, or each copy of a fragment, must be assigned to a particular site in the distributed system. This process is called data allocation or data distribution. To determine which sites hold which information when a query is submitted, the DDBMS refers to the fragmentation, replication and distribution information stored in the DDBMS catalog. For vertical fragmentation, the attribute list of each fragment is kept in the catalog. For horizontal fragmentation, a condition, sometimes called a guard, is kept for each fragment; it is called a guard because only those tuples that satisfy the condition are stored in the fragment.
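A sketch of how the COMPANY fragments above might be defined in SQL (the fragment names are illustrative, and a real DDBMS would record the guard conditions and attribute lists in its catalog rather than require the fragments to be built by hand):

    -- Horizontal fragments of EMP, one per department site.
    -- Each WHERE clause is the guard condition of its fragment.
    CREATE TABLE EMP_D1 AS SELECT * FROM EMP WHERE dno = 1;  -- site 1
    CREATE TABLE EMP_D2 AS SELECT * FROM EMP WHERE dno = 2;  -- site 2
    CREATE TABLE EMP_D3 AS SELECT * FROM EMP WHERE dno = 3;  -- site 3

    -- Vertical fragments of EMP; the key eno is repeated in both
    -- fragments so the original relation can be reconstructed.
    CREATE TABLE EMP_PERSONAL AS SELECT eno, ename, address, gender FROM EMP;
    CREATE TABLE EMP_WORK     AS SELECT eno, basic, pan, dno FROM EMP;

    -- Reconstruction: UNION for horizontal fragments, join for vertical.
    SELECT * FROM EMP_D1 UNION SELECT * FROM EMP_D2 UNION SELECT * FROM EMP_D3;
    SELECT p.eno, p.ename, p.address, p.gender, w.basic, w.pan, w.dno
    FROM   EMP_PERSONAL p, EMP_WORK w
    WHERE  p.eno = w.eno;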