UNIT 4 Q1- What is Distributed Database? Ans- Distributed database (DDB) can be defined as a collection of multiple logically interrelated databases distributed over a computer network. In a distributed database, a database is physically stored in two or more computer systems. Although geographically dispersed, a DDB system manages and controls the entire database as a single collection of data. If redundant data is stored in separate databases due to performance requirements, updates to one set of data will automatically update the additional sets in a timely manner. A distributed system has both data and transaction processing divided between one or more computers connected by network, each computer playing a specific role in the system. DDBs bring together the advantages of distributed computing and database management. DDBs use client/server architecture to process information requests. A DDB system allows applications to access data from local and remote databases. In a Homogeneous DDB system, each database in the system is by the same vendor. In a Heterogeneous DDB system, at least one of the databases will be that of a different vendor. e.g. In a Oracle homogeneous DDB system all the databases in the system will be oracle databases whereas in the heterogeneous one there would be oracle databases and other databases like Sybase, DB2. A DDB appears to a user as a single database but is, in fact, a set of databases stored on multiple computers. The data on several computers can be simultaneously accessed and modified using a network. Each database server in the DDB is controlled by its local DBMS and each cooperates to maintain the consistency of the global database.
Q2- What are the functions of the distributed database management system? Ans- Distribution of databases across a network leads to increased complexity in the system design and implementation. To achieve the benefits of a distributed database that we have seen earlier, the distributed database management software should be able to perform the following functions in addition to the basic functions performed by a non-distributed DBMS: 1. Distributed query processing: Distributed query-processing means the ability to access remote sites and transmit queries and data among the various sites along the communication network. 2. Data tracking: The distributed DBMS should have the ability to keep track of the data distribution, fragmentation and replication by expanding the distributed DBMS catalog. 3. Distributed transaction management: Distributed transaction management is the ability to devise execution strategies for queries and transactions that access data from more than one site and to synchronize the access to distributed data and maintain the integrity of the overall database. 4. Replicated data management: This is the ability of the system to decide which copy of the replicated data item to access and to maintain consistency of the copies of a replicated data item. 5. Distributed data recovery: The distributed DBMS should have the ability to recover from individual site crashes and failures of communication links. 6. Security: Distributed transactions must be executed with the proper management of the security of the data and the authorization and access privileges of users. 7. Distributed catalog management: A catalog or directory contains information about the data inn the database metadata. The directory may be global for the entire distributed database or local for each location.
Q3-What are the advantages of distributed database? Ans-1.Transparency: A DDB should be distribution transparent. This means hiding the details of where each file is physically stored within the system. This is possible through DDBMS. The following transparencies are achieved: A. Distribution or network transparency: This refers to the freedom for the user from the operational details of the network. Network transparency may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task independent of the location of the data and location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification. B. SQL and COMMIT transparency: A DDBMS provides query, update and transaction transparency. The standard SQL statements like SELECT, INSERT, UPDATE and DELETE work just as they do in a non-distributed database environment. The DDBMS should guarantee that all nodes involved in a distributed transaction take the same action; either all commit or all rollback the transaction. If a network or system failure occurs during the commit of a distributed transaction, the transaction should be automatically and transparently resolved. C. Replication transparency: Replication is the process by which we store copies of the database objects at multiple sites for better availability, performance and reliability. Replication transparency makes the user unaware of the existence of the copies. D. Fragmentation transparency: There are two types of fragmentation transparency - Horizontal and Vertical. Horizontal fragmentation distributes a relation into sets of tuples. Vertical fragmentation distributes a relation into sub-relations where each sub-relation is defined by a subset of columns of the original relation. A global query by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments. 2. Improved reliability and availability: Reliability is defined as the probability that the system is up and running at a certain point in time. Availability is the probability that the system is continuously
available during a time interval. When data and DBMS software are distributed over several locations, one location may fail while other locations continue to operate. Only the data and software of the failed location is inaccessible. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site. In the case of centralized systems the failure at one location makes the whole system unavailable to all users. 3. Performance Improvement: A DDBMS fragments the database by keeping the data closer to where it is needed most. Data localization reduces the contention for CPU and I/O services and network traffic. When a large database id distributed over multiple sites, smaller databases exist at each site. As a result, local queries and transactions accessing data at a single site have better performance because of the smaller local database. In addition each site has a smaller number of transactions executing than if all transactions are submitted to a single centralized database. Query performance can be improved by executing multiple queries at different sites or breaking up a query into a number of subqueries that execute in parallel. 4. Improved scalability: In a distributed environment, expansion of the system in terms of adding more data, increasing database sizes or adding more processors is much easier than other systems. 5. Site autonomy: Site autonomy means that each server participating in a DDB is administered independently from all other databases. Although several databases can work together, each database is a separate repository of data that is managed individually. Some of the benefits are: A. Local administrators control corresponding local data, therefore each database administrator's domain of responsibility is smaller and more manageable. B. Independent failures are less likely to disrupt other nodes of the DDB. C. Administrators can recover from isolated system failures independently from other nodes in the system. D. A data dictionary exists for each local database - a global catalog is not necessary to access local data. E. Nodes can upgrade software independently. F. Nodes of the system can mirror the logical organization of groups that need to maintain independence.
Q4- Explain client/server database architecture. Ans - Full scale DDBMSs has not been developed to support all the types of functionalities. Instead, distributed database applications are being developed in the context of the client-server architecture. It is now more common to use a 3-tier architecture, particularly with web applications. 2-Tier model:
The client-server model is basic to distributed systems. It is a response to the limitations presented by the traditional mainframe client-host model, in which a single mainframe provides shared data access to many dumb terminals. The client-server model is also a response to the LAN model, in which many isolated systems access a file server that provides no processing power. Client-server architecture provides integration of data and services and allows clients to be isolated from inherent complexities such as communication protocols. The simplicity of the client-server architecture allows clients to make requests that are routed to the appropriate server. These requests are made in the form of transactions. The client-server model consists of three parts: 1. Client:
The client is the machine running the front-end applications. It interacts with the user through the keyboard, display and pointing device such as a mouse. The client also refers to the client process that runs on the client machine. The client has no direct data access responsibilities. It simply requests processes from the server and displays data managed by the server. Therefore, the client workstation can be optimized for its job. e.g it might not need large disk capacity, or it might benefit from graphic capabilities. 2. Server:
The server is the machine that runs the DBMS software and handles the functions required for concurrent, shared data access. It is often referred to as the backend. Server also refers to the server process that runs on the server machine. The server receives and processes SQL and other query statements originating from client applications. The server can also be optimized for its duties. e.g it can have large disk capacity and fast processors. It can also take the load of disk I/O, printing, file transfer, and so on. 3. Network:
The network enables remote data access through client-server and server-toserver communication.
The clients and servers may all be located in physical proximity-say within the same building- and connected via a LAN or they may be geographically distributed over large distances and connected by WAN. LANs typically use cables whereas WANs use telephone lines or satellites. It is also possible to use a combination of the two types of networks. Networks will have different topologies. The topology defines the communication paths among locations. The type of topology of the network used may have significant effect on performance and hence on the strategies for distributed query processing and DDB design. A database server is the DBMS software managing a database and a client is an application that requests information from a server. Each computer in a network is a node that can host one or more databases. Each node is a distributed database system can act as a client, a server or both, depending on the situation. Certain features are always present in the client-server model. There is a server process that can process requests from one or more client processes concurrently over a network connection. The client machine provides front-end application software for accessing the data on the server. The client initiates transactions; the server processes the transactions. Thee is a SQL that can be used to access data stored on the server side. e.g Assume that there are 2 database servers HQ(EMP table) and BRANCH(SALES table) which are located in two separate locations connected through a network. A client can connect directly or indirectly to a database server. A direct connection occurs when a client connects to a server and accesses information from a database contained on that server. e.g if you connect to the HQ database and access the EMP table on this database, you would write: SELECT * FROM EMP;
This query is direct because you are not accessing an object on aremote database. In contrast, an indirect connection occurs when a client connects to a server and then accesses information contained in a database on a different server. e.g. if you connect to the HQ database but access the SALES table on the remote BRANCH database, you have to write: SELECT * FROM SALES@BRANCH;
This query is indirect because the object you are accessing is not on the database to which you are directly connected. 3-Tier model: In the 3-tier model, the following 3 layers exists: 1. Presentation layer(client): This provides the user interface and interacts with the user. 2. Application layer(Business logic): This layer programs the application logic. 3. Database server: This layer handles query and update requests from the application layer, processes the requests and send the results.
It is a common approach to include the functionality of a centralizzed DBMS at the database server level. A number of relational DBMS products have taken this approach, where an SQL server is provided. The application server must then formulate the appropriate SQL queries and connect to the database server when needed. The client provides the processing for user interface interactions. In this architecture, the application server may also refer to a data dictionary that includes information on the distribution of data among the various SQL servers as well as modules for decomposing a global query into a number of local queries that can be executed at various sites. Interaction between application server and database server might proceed as follows during the processing of an SQL query: 1. The application server formulates a user query based on input from the client layer and decomposes it into a number of independent site queries. Each site query is sent to the appropriate database server site. 2. Each database server processes the local query and sends the results to the application server site. 3. The application server combines the results of the subqueries to produce the result of the originally required query, formats it into HTML or some other form accepted by the client, and sends it to the client site for display. If the DDBMS has the capability to hide the details of data distribution from the application server, then it enables the application server to execute global queries and transactions as though the database were centralized, without having to specify the sites at which the data referenced in the query or transaction resides. This property is called distribution transparency.
UNIT 5 Q1- What is data warehouse? Ans-