2.
Distributed Database System <ul><li>A distributed database system consists of loosely coupled sites that share no physical component </li></ul><ul><li>Database systems that run on each site are independent of each other </li></ul><ul><li>Transactions may access data at one or more sites </li></ul>

3.
Homogeneous Distributed Databases <ul><li>In a homogeneous distributed database </li></ul><ul><ul><li>All sites have identical software </li></ul></ul><ul><ul><li>Are aware of each other and agree to cooperate in processing user requests. </li></ul></ul><ul><ul><li>Each site surrenders part of its autonomy in terms of right to change schemas or software </li></ul></ul><ul><ul><li>Appears to the user as a single system </li></ul></ul><ul><li>In a heterogeneous distributed database </li></ul><ul><ul><li>Different sites may use different schemas and software </li></ul></ul><ul><ul><ul><li>Difference in schema is a major problem for query processing </li></ul></ul></ul><ul><ul><ul><li>Difference in software is a major problem for transaction processing </li></ul></ul></ul><ul><ul><li>Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing </li></ul></ul>

4.
Distributed Data Storage <ul><li>Assume relational data model </li></ul><ul><li>Replication </li></ul><ul><ul><li>System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance. </li></ul></ul><ul><li>Fragmentation </li></ul><ul><ul><li>Relation is partitioned into several fragments stored in distinct sites </li></ul></ul><ul><li>Replication and fragmentation can be combined </li></ul><ul><ul><li>Relation is partitioned into several fragments: system maintains several identical replicas of each such fragment. </li></ul></ul>

5.
Data Replication <ul><li>A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites. </li></ul><ul><li>Full replication of a relation is the case where the relation is stored at all sites. </li></ul><ul><li>Fully redundant databases are those in which every site contains a copy of the entire database. </li></ul>

6.
Data Replication (Cont.) <ul><li>Advantages of Replication </li></ul><ul><ul><li>Availability : failure of a site containing relation r does not result in unavailability of r if replicas exist. </li></ul></ul><ul><ul><li>Parallelism : queries on r may be processed by several nodes in parallel. </li></ul></ul><ul><ul><li>Reduced data transfer : relation r is available locally at each site containing a replica of r . </li></ul></ul><ul><li>Disadvantages of Replication </li></ul><ul><ul><li>Increased cost of updates: each replica of relation r must be updated. </li></ul></ul><ul><ul><li>Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. </li></ul></ul><ul><ul><ul><li>One solution: choose one copy as primary copy and apply concurrency control operations on primary copy </li></ul></ul></ul>

7.
Data Fragmentation <ul><li>Division of relation r into fragments r 1 , r 2 , …, r n which contain sufficient information to reconstruct relation r. </li></ul><ul><li>Horizontal fragmentation : each tuple of r is assigned to one or more fragments </li></ul><ul><li>Vertical fragmentation : the schema for relation r is split into several smaller schemas </li></ul><ul><ul><li>All schemas must contain a common candidate key (or superkey) to ensure lossless join property. </li></ul></ul><ul><ul><li>A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key. </li></ul></ul><ul><li>Example : relation account with following schema </li></ul><ul><li>Account-schema = ( branch-name , account-number, balance ) </li></ul>
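To make the two fragmentation styles concrete, here is a small illustrative sketch (plain Python over an in-memory list of rows, not a DBMS feature); the sample account tuples and fragment names follow the slide's example and are otherwise hypothetical.

```python
# Illustrative sketch only: horizontal and vertical fragmentation of the
# account relation, modelled as a list of dicts with hypothetical sample data.
account = [
    {"branch_name": "Hillside",   "account_number": "A-305", "balance": 500},
    {"branch_name": "Valleyview", "account_number": "A-733", "balance": 600},
]

# Horizontal fragmentation: each tuple is assigned to the fragment whose
# predicate (branch-name = ...) it satisfies; the union of fragments rebuilds r.
account1 = [t for t in account if t["branch_name"] == "Hillside"]
account2 = [t for t in account if t["branch_name"] == "Valleyview"]
assert account1 + account2 == account  # reconstruction by union (order happens to match here)

# Vertical fragmentation: the schema is split, but a common key
# (account_number) is kept in both fragments so the join is lossless.
frag1 = [{"account_number": t["account_number"], "branch_name": t["branch_name"]} for t in account]
frag2 = [{"account_number": t["account_number"], "balance": t["balance"]} for t in account]
rebuilt = [{**a, **b} for a in frag1 for b in frag2
           if a["account_number"] == b["account_number"]]
assert rebuilt == account  # reconstruction by natural join on the shared key
```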

10.
Advantages of Fragmentation <ul><li>Horizontal: </li></ul><ul><ul><li>allows parallel processing on fragments of a relation </li></ul></ul><ul><ul><li>allows a relation to be split so that tuples are located where they are most frequently accessed </li></ul></ul><ul><li>Vertical: </li></ul><ul><ul><li>allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed </li></ul></ul><ul><ul><li>tuple-id attribute allows efficient joining of vertical fragments </li></ul></ul><ul><ul><li>allows parallel processing on a relation </li></ul></ul><ul><li>Vertical and horizontal fragmentation can be mixed. </li></ul><ul><ul><li>Fragments may be successively fragmented to an arbitrary depth. </li></ul></ul>

11.
Data Transparency <ul><li>Data transparency : Degree to which system user may remain unaware of the details of how and where the data items are stored in a distributed system </li></ul><ul><li>Consider transparency issues in relation to: </li></ul><ul><ul><li>Fragmentation transparency </li></ul></ul><ul><ul><li>Replication transparency </li></ul></ul><ul><ul><li>Location transparency </li></ul></ul>

12.
Naming of Data Items - Criteria <ul><li>1. Every data item must have a system-wide unique name. </li></ul><ul><li>2. It should be possible to find the location of data items efficiently. </li></ul><ul><li>3. It should be possible to change the location of data items transparently. </li></ul><ul><li>4. Each site should be able to create new data items autonomously. </li></ul>

14.
Use of Aliases <ul><li>Alternative to centralized scheme: each site prefixes its own site identifier to any name that it generates, e.g., site17.account. </li></ul><ul><ul><li>Fulfills having a unique identifier, and avoids problems associated with central control. </li></ul></ul><ul><ul><li>However, fails to achieve network transparency. </li></ul></ul><ul><li>Solution: Create a set of aliases for data items; store the mapping of aliases to the real names at each site. </li></ul><ul><li>The user can be unaware of the physical location of a data item, and is unaffected if the data item is moved from one site to another. </li></ul>

16.
Distributed Transactions <ul><li>Transaction may access data at several sites. </li></ul><ul><li>Each site has a local transaction manager responsible for: </li></ul><ul><ul><li>Maintaining a log for recovery purposes </li></ul></ul><ul><ul><li>Participating in coordinating the concurrent execution of the transactions executing at that site. </li></ul></ul><ul><li>Each site has a transaction coordinator, which is responsible for: </li></ul><ul><ul><li>Starting the execution of transactions that originate at the site. </li></ul></ul><ul><ul><li>Distributing subtransactions at appropriate sites for execution. </li></ul></ul><ul><ul><li>Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites. </li></ul></ul>

18.
System Failure Modes <ul><li>Failures unique to distributed systems: </li></ul><ul><ul><li>Failure of a site. </li></ul></ul><ul><ul><li>Loss of messages </li></ul></ul><ul><ul><ul><li>Handled by network transmission control protocols such as TCP/IP </li></ul></ul></ul><ul><ul><li>Failure of a communication link </li></ul></ul><ul><ul><ul><li>Handled by network protocols, by routing messages via alternative links </li></ul></ul></ul><ul><ul><li>Network partition </li></ul></ul><ul><ul><ul><li>A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them </li></ul></ul></ul><ul><ul><ul><ul><li>Note: a subsystem may consist of a single node </li></ul></ul></ul></ul><ul><li>Network partitioning and site failures are generally indistinguishable. </li></ul>

19.
Commit Protocols <ul><li>Commit protocols are used to ensure atomicity across sites </li></ul><ul><ul><li>a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites. </li></ul></ul><ul><ul><li>not acceptable to have a transaction committed at one site and aborted at another </li></ul></ul><ul><li>The two-phase commit (2PC) protocol is widely used </li></ul><ul><li>The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of the two-phase commit protocol. </li></ul>

20.
Two Phase Commit Protocol (2PC) <ul><li>Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites. </li></ul><ul><li>Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached. </li></ul><ul><li>The protocol involves all the local sites at which the transaction executed </li></ul><ul><li>Let T be a transaction initiated at site S i , and let the transaction coordinator at S i be C i </li></ul>

21.
Phase 1: Obtaining a Decision <ul><li>Coordinator asks all participants to prepare to commit transaction T i . </li></ul><ul><ul><li>C i adds the record < prepare T > to the log and forces the log to stable storage </li></ul></ul><ul><ul><li>sends prepare T messages to all sites at which T executed </li></ul></ul><ul><li>Upon receiving the message, the transaction manager at the site determines if it can commit the transaction </li></ul><ul><ul><li>if not, add a record < no T > to the log and send an abort T message to C i </li></ul></ul><ul><ul><li>if the transaction can be committed, then: </li></ul></ul><ul><ul><li>add the record < ready T > to the log </li></ul></ul><ul><ul><li>force all records for T to stable storage </li></ul></ul><ul><ul><li>send ready T message to C i </li></ul></ul>

22.
Phase 2: Recording the Decision <ul><li>T can be committed if C i received a ready T message from all the participating sites; otherwise T must be aborted. </li></ul><ul><li>Coordinator adds a decision record, < commit T > or < abort T >, to the log and forces the record onto stable storage. Once the record reaches stable storage the decision is irrevocable (even if failures occur) </li></ul><ul><li>Coordinator sends a message to each participant informing it of the decision (commit or abort) </li></ul><ul><li>Participants take appropriate action locally. </li></ul>
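A compressed sketch of the coordinator's side of both phases, with hypothetical send and log primitives; a real implementation must force each log record to stable storage before sending the corresponding messages and must also handle timeouts.

```python
# Sketch of the 2PC coordinator, with assumed helpers:
#   send(site, msg) -> participant's vote ("ready T" or "abort T") in phase 1
#   log is a list standing in for the coordinator's force-written log
def two_phase_commit(log, participants, send):
    # Phase 1: record <prepare T>, then ask every participant to prepare.
    log.append("<prepare T>")
    votes = [send(site, "prepare T") for site in participants]

    # Phase 2: commit only if every site voted ready; the decision record
    # on stable storage is the point of no return.
    decision = "commit T" if all(v == "ready T" for v in votes) else "abort T"
    log.append("<" + decision + ">")
    for site in participants:
        send(site, decision)
    return decision
```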

24.
Handling of Failures- Coordinator Failure <ul><li>If the coordinator fails while the commit protocol for T is executing then participating sites must decide on T ’s fate: </li></ul><ul><ul><li>If an active site contains a < commit T > record in its log, then T must be committed. </li></ul></ul><ul><ul><li>If an active site contains an < abort T > record in its log, then T must be aborted. </li></ul></ul><ul><ul><li>If some active participating site does not contain a < ready T > record in its log, then the failed coordinator C i cannot have decided to commit T . Can therefore abort T . </li></ul></ul><ul><ul><li>If none of the above cases holds, then all active sites must have a < ready T > record in their logs, but no additional control records (such as < abort T > or < commit T >). In this case active sites must wait for C i to recover, to find the decision. </li></ul></ul><ul><li>Blocking problem : active sites may have to wait for the failed coordinator to recover. </li></ul>

25.
Handling of Failures - Network Partition <ul><li>If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol. </li></ul><ul><li>If the coordinator and its participants belong to several partitions: </li></ul><ul><ul><li>Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator. </li></ul></ul><ul><ul><ul><li>No harm results, but sites may still have to wait for the decision from the coordinator. </li></ul></ul></ul><ul><li>The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partitions have failed, and follow the usual commit protocol. </li></ul><ul><ul><ul><li>Again, no harm results </li></ul></ul></ul>

26.
Recovery and Concurrency Control <ul><li>In-doubt transactions have a < ready T >, but neither a < commit T >, nor an < abort T > log record. </li></ul><ul><li>The recovering site must determine the commit-abort status of such transactions by contacting other sites; this can slow and potentially block recovery. </li></ul><ul><li>Recovery algorithms can note lock information in the log. </li></ul><ul><ul><li>Instead of < ready T >, write out < ready T , L > L = list of locks held by T when the log is written (read locks can be omitted). </li></ul></ul><ul><ul><li>For every in-doubt transaction T , all the locks noted in the < ready T , L > log record are reacquired. </li></ul></ul><ul><li>After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt transactions is performed concurrently with the execution of new transactions. </li></ul>

27.
Three Phase Commit (3PC) <ul><li>Assumptions: </li></ul><ul><ul><li>No network partitioning </li></ul></ul><ul><ul><li>At any point, at least one site must be up. </li></ul></ul><ul><ul><li>At most K sites (participants as well as coordinator) can fail </li></ul></ul><ul><li>Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1. </li></ul><ul><ul><li>Every site is ready to commit if instructed to do so </li></ul></ul><ul><li>Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC </li></ul><ul><ul><li>In phase 2 coordinator makes a decision as in 2PC (called the pre-commit decision ) and records it in multiple (at least K) sites </li></ul></ul><ul><ul><li>In phase 3, coordinator sends commit/abort message to all participating sites, </li></ul></ul><ul><li>Under 3PC, knowledge of pre-commit decision can be used to commit despite coordinator failure </li></ul><ul><ul><li>Avoids blocking problem as long as < K sites fail </li></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><li>higher overheads </li></ul></ul><ul><ul><li>assumptions may not be satisfied in practice </li></ul></ul><ul><li>Won’t study it further </li></ul>

28.
Alternative Models of Transaction Processing <ul><li>Notion of a single transaction spanning multiple sites is inappropriate for many applications </li></ul><ul><ul><li>E.g. transaction crossing an organizational boundary </li></ul></ul><ul><ul><li>No organization would like to permit an externally initiated transaction to block local transactions for an indeterminate period </li></ul></ul><ul><li>Alternative models carry out transactions by sending messages </li></ul><ul><ul><li>Code to handle messages must be carefully designed to ensure atomicity and durability properties for updates </li></ul></ul><ul><ul><ul><li>Isolation cannot be guaranteed, in that intermediate stages are visible, but code must ensure no inconsistent states result due to concurrency </li></ul></ul></ul><ul><ul><li>Persistent messaging systems are systems that provide transactional properties to messages </li></ul></ul><ul><ul><ul><li>Messages are guaranteed to be delivered exactly once </li></ul></ul></ul><ul><ul><ul><li>Will discuss implementation techniques later </li></ul></ul></ul>

29.
Alternative Models (Cont.) <ul><li>Motivating example: funds transfer between two banks </li></ul><ul><ul><li>Two-phase commit would have the potential to block updates on the accounts involved in the funds transfer </li></ul></ul><ul><ul><li>Alternative solution: </li></ul></ul><ul><ul><ul><li>Debit money from the source account and send a message to the other site </li></ul></ul></ul><ul><ul><ul><li>Site receives message and credits the destination account </li></ul></ul></ul><ul><ul><li>Messaging has long been used for distributed transactions (even before computers were invented!) </li></ul></ul><ul><li>Atomicity issue </li></ul><ul><ul><li>once the transaction sending a message commits, the message must be guaranteed to be delivered </li></ul></ul><ul><ul><ul><li>The guarantee holds as long as the destination site is up and reachable; code to handle undeliverable messages must also be available </li></ul></ul></ul><ul><ul><ul><ul><li>e.g. credit money back to the source account. </li></ul></ul></ul></ul><ul><ul><li>If the sending transaction aborts, the message must not be sent </li></ul></ul>

30.
Error Conditions with Persistent Messaging <ul><li>Code to handle messages has to take care of a variety of failure situations (even assuming guaranteed message delivery) </li></ul><ul><ul><li>E.g. if the destination account does not exist, a failure message must be sent back to the source site </li></ul></ul><ul><ul><li>When a failure message is received from the destination site, or the destination site itself does not exist, money must be deposited back in the source account </li></ul></ul><ul><ul><ul><li>Problem if source account has been closed </li></ul></ul></ul><ul><ul><ul><ul><li>get humans to take care of the problem </li></ul></ul></ul></ul><ul><li>User code executing transaction processing using 2PC does not have to deal with such failures </li></ul><ul><li>There are many situations where the extra effort of error handling is worth the benefit of absence of blocking </li></ul><ul><ul><li>E.g. pretty much all transactions across organizations </li></ul></ul>

31.
Persistent Messaging and Workflows <ul><li>Workflows provide a general model of transactional processing involving multiple sites and possibly human processing of certain steps </li></ul><ul><ul><li>E.g. when a bank receives a loan application, it may need to </li></ul></ul><ul><ul><ul><li>Contact external credit-checking agencies </li></ul></ul></ul><ul><ul><ul><li>Get approvals of one or more managers </li></ul></ul></ul><ul><ul><li>and then respond to the loan application </li></ul></ul><ul><ul><li>We study workflows in Chapter 24 (Section 24.2) </li></ul></ul><ul><ul><li>Persistent messaging forms the underlying infrastructure for workflows in a distributed environment </li></ul></ul>

32.
Implementation of Persistent Messaging <ul><li>Sending site protocol </li></ul><ul><ul><li>Sending transaction writes message to a special relation messages-to-send. The message is also given a unique identifier. </li></ul></ul><ul><ul><ul><li>Writing to this relation is treated as any other update, and is undone if the transaction aborts. </li></ul></ul></ul><ul><ul><ul><li>The message remains locked until the sending transaction commits </li></ul></ul></ul><ul><ul><li>A message delivery process monitors the messages-to-send relation </li></ul></ul><ul><ul><ul><li>When a new message is found, the message is sent to its destination </li></ul></ul></ul><ul><ul><ul><li>When an acknowledgment is received from a destination, the message is deleted from messages-to-send </li></ul></ul></ul><ul><ul><ul><li>If no acknowledgment is received after a timeout period, the message is resent </li></ul></ul></ul><ul><ul><ul><ul><li>This is repeated until the message gets deleted on receipt of acknowledgement, or the system decides the message is undeliverable after trying for a very long time </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Repeated sending ensures that the message is delivered </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>(as long as the destination exists and is reachable within a reasonable time) </li></ul></ul></ul></ul></ul>
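A rough sketch of the sender-side delivery process described above; messages_to_send, send_to_destination and ack_received are assumed helpers (not from the slides), and the timeouts are arbitrary.

```python
import time

# Sketch of the sender-side delivery process. Only committed messages are
# assumed to be visible in messages_to_send (a list of dicts with an "id").
def delivery_process(messages_to_send, send_to_destination, ack_received,
                     retry_interval=5.0, give_up_after=3600.0):
    for msg in list(messages_to_send):
        started = time.time()
        while True:
            send_to_destination(msg)                  # (re)send the message
            time.sleep(retry_interval)
            if ack_received(msg["id"]):
                messages_to_send.remove(msg)          # delete once acknowledged
                break
            if time.time() - started > give_up_after:
                break                                 # give up: message declared undeliverable
```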

33.
Implementation of Persistent Messaging <ul><li>Receiving site protocol </li></ul><ul><ul><li>When a message is received </li></ul></ul><ul><ul><ul><li>it is written to a received-messages relation if it is not already present (the message id is used for this check). The transaction performing the write is committed </li></ul></ul></ul><ul><ul><ul><li>An acknowledgement (with message id) is then sent to the sending site. </li></ul></ul></ul><ul><ul><li>There may be very long delays in message delivery coupled with repeated messages </li></ul></ul><ul><ul><ul><ul><li>Could result in processing of duplicate messages if we are not careful! </li></ul></ul></ul></ul><ul><ul><ul><li>Option 1: messages are never deleted from received-messages </li></ul></ul></ul><ul><ul><ul><li>Option 2: messages are given timestamps </li></ul></ul></ul><ul><ul><ul><ul><li>Messages older than some cut-off are deleted from received-messages </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Received messages are rejected if older than the cut-off </li></ul></ul></ul></ul>
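A matching sketch of the receiving-site check; received_messages stands in for the persistent received-messages relation, and process/send_ack are assumed helpers.

```python
# Sketch of duplicate-safe message receipt keyed on the message id.
def receive(msg, received_messages, process, send_ack):
    if msg["id"] not in received_messages:
        received_messages.add(msg["id"])   # in a real system, written and committed with the processing
        process(msg)                       # apply the message's effect exactly once
    send_ack(msg["id"])                    # acknowledge duplicates too, so the sender can delete
```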

35.
Concurrency Control <ul><li>Modify concurrency control schemes for use in distributed environment. </li></ul><ul><li>We assume that each site participates in the execution of a commit protocol to ensure global transaction atomicity. </li></ul><ul><li>We assume all replicas of any item are updated </li></ul><ul><ul><li>Will see how to relax this in case of site failures later </li></ul></ul>

36.
Single-Lock-Manager Approach <ul><li>System maintains a single lock manager that resides in a single chosen site, say S i </li></ul><ul><li>When a transaction needs to lock a data item, it sends a lock request to S i and lock manager determines whether the lock can be granted immediately </li></ul><ul><ul><li>If yes, lock manager sends a message to the site which initiated the request </li></ul></ul><ul><ul><li>If no, request is delayed until it can be granted, at which time a message is sent to the initiating site </li></ul></ul>

37.
Single-Lock-Manager Approach (Cont.) <ul><li>The transaction can read the data item from any one of the sites at which a replica of the data item resides. </li></ul><ul><li>Writes must be performed on all replicas of a data item </li></ul><ul><li>Advantages of scheme: </li></ul><ul><ul><li>Simple implementation </li></ul></ul><ul><ul><li>Simple deadlock handling </li></ul></ul><ul><li>Disadvantages of scheme are: </li></ul><ul><ul><li>Bottleneck: lock manager site becomes a bottleneck </li></ul></ul><ul><ul><li>Vulnerability: system is vulnerable to lock manager site failure. </li></ul></ul>

38.
Distributed Lock Manager <ul><li>In this approach, functionality of locking is implemented by lock managers at each site </li></ul><ul><ul><li>Lock managers control access to local data items </li></ul></ul><ul><ul><ul><li>But special protocols may be used for replicas </li></ul></ul></ul><ul><li>Advantage: work is distributed and can be made robust to failures </li></ul><ul><li>Disadvantage: deadlock detection is more complicated </li></ul><ul><ul><li>Lock managers cooperate for deadlock detection </li></ul></ul><ul><ul><ul><li>More on this later </li></ul></ul></ul><ul><li>Several variants of this approach </li></ul><ul><ul><li>Primary copy </li></ul></ul><ul><ul><li>Majority protocol </li></ul></ul><ul><ul><li>Biased protocol </li></ul></ul><ul><ul><li>Quorum consensus </li></ul></ul>

39.
Primary Copy <ul><li>Choose one replica of data item to be the primary copy . </li></ul><ul><ul><li>Site containing the replica is called the primary site for that data item </li></ul></ul><ul><ul><li>Different data items can have different primary sites </li></ul></ul><ul><li>When a transaction needs to lock a data item Q , it requests a lock at the primary site of Q . </li></ul><ul><ul><li>Implicitly gets lock on all replicas of the data item </li></ul></ul><ul><li>Benefit </li></ul><ul><ul><li>Concurrency control for replicated data handled similarly to unreplicated data - simple implementation. </li></ul></ul><ul><li>Drawback </li></ul><ul><ul><li>If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible. </li></ul></ul>

40.
Majority Protocol <ul><li>Local lock manager at each site administers lock and unlock requests for data items stored at that site. </li></ul><ul><li>When a transaction wishes to lock an unreplicated data item Q residing at site S i , a message is sent to S i ‘s lock manager. </li></ul><ul><ul><li>If Q is locked in an incompatible mode, then the request is delayed until it can be granted. </li></ul></ul><ul><ul><li>When the lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock request has been granted. </li></ul></ul>

41.
Majority Protocol (Cont.) <ul><li>In case of replicated data </li></ul><ul><ul><li>If Q is replicated at n sites, then a lock request message must be sent to more than half of the n sites in which Q is stored. </li></ul></ul><ul><ul><li>The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q . </li></ul></ul><ul><ul><li>When writing the data item, transaction performs writes on all replicas. </li></ul></ul><ul><li>Benefit </li></ul><ul><ul><li>Can be used even when some sites are unavailable </li></ul></ul><ul><ul><ul><li>details on how to handle writes in the presence of site failure later </li></ul></ul></ul><ul><li>Drawback </li></ul><ul><ul><li>Requires 2( n /2 + 1) messages for handling lock requests, and ( n /2 + 1) messages for handling unlock requests. </li></ul></ul><ul><ul><li>Potential for deadlock even with a single item - e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data item. </li></ul></ul>
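A minimal sketch of the replicated-data case, assuming a hypothetical request_lock(site, item, mode) RPC that returns True when that site's local lock manager grants the lock; lock release and deadlock handling are omitted.

```python
# Sketch: a transaction may operate on Q only after locking a majority of its replicas.
def majority_lock(replica_sites, item, mode, request_lock):
    granted = [s for s in replica_sites if request_lock(s, item, mode)]
    return len(granted) > len(replica_sites) // 2   # strictly more than half of the n replicas
```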

42.
Biased Protocol <ul><li>Local lock manager at each site as in majority protocol, however, requests for shared locks are handled differently than requests for exclusive locks. </li></ul><ul><li>Shared locks . When a transaction needs to lock data item Q , it simply requests a lock on Q from the lock manager at one site containing a replica of Q . </li></ul><ul><li>Exclusive locks . When transaction needs to lock data item Q , it requests a lock on Q from the lock manager at all sites containing a replica of Q . </li></ul><ul><li>Advantage - imposes less overhead on read operations. </li></ul><ul><li>Disadvantage - additional overhead on writes </li></ul>

43.
Quorum Consensus Protocol <ul><li>A generalization of both majority and biased protocols </li></ul><ul><li>Each site is assigned a weight. </li></ul><ul><ul><li>Let S be the total of all site weights </li></ul></ul><ul><li>Choose two values read quorum Q r and write quorum Q w </li></ul><ul><ul><li>Such that Q r + Q w > S and 2 * Q w > S </li></ul></ul><ul><ul><li>Quorums can be chosen (and S computed) separately for each item </li></ul></ul><ul><li>Each read must lock enough replicas that the sum of the site weights is >= Q r </li></ul><ul><li>Each write must lock enough replicas that the sum of the site weights is >= Q w </li></ul><ul><li>For now we assume all replicas are written </li></ul><ul><ul><li>Extensions to allow some sites to be unavailable described later </li></ul></ul>
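A sketch of the quorum check under the same assumed request_lock RPC; weights maps each site holding a replica to its weight, and Qr, Qw are chosen per item.

```python
# Sketch: an operation proceeds only if the locked replicas' weights reach its quorum.
def quorum_lock(weights, item, mode, Qr, Qw, request_lock):
    S = sum(weights.values())
    assert Qr + Qw > S and 2 * Qw > S        # any read quorum overlaps any write quorum, and writes overlap writes
    needed = Qr if mode == "read" else Qw
    locked = sum(w for site, w in weights.items() if request_lock(site, item, mode))
    return locked >= needed                  # enough locked weight for this operation
```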

44.
Deadlock Handling <ul><li>Consider the following two transactions and history, with item X and transaction T 1 at site 1, and item Y and transaction T 2 at site 2: </li></ul> T 1 : write (X); write (Y) T 2 : write (Y); write (X) History: at site 1, T 1 acquires an X-lock on X and writes X, after which T 2 waits for the X-lock on X; at site 2, T 2 acquires an X-lock on Y and writes Y, after which T 1 waits for the X-lock on Y. Result: deadlock which cannot be detected locally at either site

45.
Centralized Approach <ul><li>A global wait-for graph is constructed and maintained at a single site: the deadlock-detection coordinator </li></ul><ul><ul><li>Real graph : real, but unknown, state of the system. </li></ul></ul><ul><ul><li>Constructed graph : approximation generated by the controller during the execution of its algorithm. </li></ul></ul><ul><li>The global wait-for graph can be constructed when: </li></ul><ul><ul><li>a new edge is inserted in or removed from one of the local wait-for graphs. </li></ul></ul><ul><ul><li>a number of changes have occurred in a local wait-for graph. </li></ul></ul><ul><ul><li>the coordinator needs to invoke cycle-detection. </li></ul></ul><ul><li>If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the victim transaction. </li></ul>
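The constructed global graph is just the union of the edges reported by the sites; a sketch of the coordinator's cycle check follows (simple DFS over a hypothetical data layout in which each local graph maps a transaction to the set of transactions it waits for).

```python
# Sketch of cycle detection over the constructed global wait-for graph.
def find_cycle(local_graphs):
    edges = {}
    for g in local_graphs:                         # union of the local wait-for graphs
        for t, waits_for in g.items():
            edges.setdefault(t, set()).update(waits_for)

    def dfs(t, path):
        for u in edges.get(t, ()):
            if u in path:
                return path[path.index(u):]        # cycle found: these transactions are deadlocked
            found = dfs(u, path + [u])
            if found:
                return found
        return None

    for t in edges:
        cycle = dfs(t, [t])
        if cycle:
            return cycle                           # coordinator would pick a victim from this cycle
    return None
```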

48.
False Cycles (Cont.) <ul><li>Suppose that starting from the state shown in the figure, </li></ul><ul><li>1. T 2 releases resources at S 1 </li></ul><ul><ul><ul><li>resulting in a remove T 1 → T 2 message from the Transaction Manager at site S 1 to the coordinator </li></ul></ul></ul><ul><li>2. And then T 2 requests a resource held by T 3 at site S 2 </li></ul><ul><ul><ul><li>resulting in an insert T 2 → T 3 message from S 2 to the coordinator </li></ul></ul></ul><ul><li>Suppose further that the insert message reaches the coordinator before the remove message </li></ul><ul><ul><li>this can happen due to network delays </li></ul></ul><ul><li>The coordinator would then find a false cycle </li></ul><ul><li>T 1 → T 2 → T 3 → T 1 </li></ul><ul><li>The false cycle above never existed in reality. </li></ul><ul><li>False cycles cannot occur if two-phase locking is used. </li></ul>

49.
Unnecessary Rollbacks <ul><li>Unnecessary rollbacks may result when deadlock has indeed occurred and a victim has been picked, and meanwhile one of the transactions was aborted for reasons unrelated to the deadlock. </li></ul><ul><li>Unnecessary rollbacks can result from false cycles in the global wait-for graph; however, likelihood of false cycles is low. </li></ul>

50.
Timestamping <ul><li>Timestamp based concurrency-control protocols can be used in distributed systems </li></ul><ul><li>Each transaction must be given a unique timestamp </li></ul><ul><li>Main problem: how to generate a timestamp in a distributed fashion </li></ul><ul><ul><li>Each site generates a unique local timestamp using either a logical counter or the local clock. </li></ul></ul><ul><ul><li>Global unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier. </li></ul></ul>

51.
Timestamping (Cont.) <ul><li>A site with a slow clock will assign smaller timestamps </li></ul><ul><ul><li>Still logically correct: serializability not affected </li></ul></ul><ul><ul><li>But: transactions from that site are “disadvantaged” </li></ul></ul><ul><li>To fix this problem </li></ul><ul><ul><li>Define within each site S i a logical clock ( LC i ), which generates the unique local timestamp </li></ul></ul><ul><ul><li>Require that S i advance its logical clock whenever a request is received from a transaction T i with timestamp < x,y > and x is greater than the current value of LC i . </li></ul></ul><ul><ul><li>In this case, site S i advances its logical clock to the value x + 1. </li></ul></ul>
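A small sketch of both ideas: a globally unique timestamp built from (local logical clock, site id) and compared lexicographically, plus the advance rule that repairs slow clocks. The class and method names are made up for illustration.

```python
# Sketch of distributed timestamp generation with a per-site logical clock.
class SiteClock:
    def __init__(self, site_id):
        self.site_id = site_id
        self.lc = 0                      # logical clock LCi

    def new_timestamp(self):
        self.lc += 1
        return (self.lc, self.site_id)   # global timestamp <x, y>; the site id only breaks ties

    def advance(self, ts):
        x, _ = ts                        # timestamp carried by an incoming request
        if x > self.lc:
            self.lc = x + 1              # slow clock jumps past the larger timestamp it has seen
```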

52.
Replication with Weak Consistency <ul><li>Many commercial databases support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability) </li></ul><ul><li>E.g.: master-slave replication : updates are performed at a single “master” site, and propagated to “slave” sites. </li></ul><ul><ul><li>Propagation is not part of the update transaction: it is decoupled </li></ul></ul><ul><ul><ul><li>May be immediately after transaction commits </li></ul></ul></ul><ul><ul><ul><li>May be periodic </li></ul></ul></ul><ul><ul><li>Data may only be read at slave sites, not updated </li></ul></ul><ul><ul><ul><li>No need to obtain locks at any remote site </li></ul></ul></ul><ul><ul><li>Particularly useful for distributing information </li></ul></ul><ul><ul><ul><li>E.g. from central office to branch-office </li></ul></ul></ul><ul><ul><li>Also useful for running read-only queries offline from the main database </li></ul></ul>

53.
Replication with Weak Consistency (Cont.) <ul><li>Replicas should see a transaction-consistent snapshot of the database </li></ul><ul><ul><li>That is, a state of the database reflecting all effects of all transactions up to some point in the serialization order, and no effects of any later transactions. </li></ul></ul><ul><li>E.g. Oracle provides a create snapshot statement to create a snapshot of a relation or a set of relations at a remote site </li></ul><ul><ul><li>snapshot refresh either by recomputation or by incremental update </li></ul></ul><ul><ul><li>Automatic refresh (continuous or periodic) or manual refresh </li></ul></ul>

54.
Multimaster Replication <ul><li>With multimaster replication (also called update-anywhere replication) updates are permitted at any replica, and are automatically propagated to all replicas </li></ul><ul><ul><li>Basic model in distributed databases, where transactions are unaware of the details of replication, and database system propagates updates as part of the same transaction </li></ul></ul><ul><ul><ul><li>Coupled with 2 phase commit </li></ul></ul></ul><ul><ul><li>Many systems support lazy propagation where updates are transmitted after transaction commits </li></ul></ul><ul><ul><ul><li>Allow updates to occur even if some sites are disconnected from the network, but at the cost of consistency </li></ul></ul></ul>

55.
Lazy Propagation (Cont.) <ul><li>Two approaches to lazy propagation </li></ul><ul><ul><li>Updates at any replica are translated into updates at the primary site, and then propagated back to all replicas </li></ul></ul><ul><ul><ul><li>Updates to an item are ordered serially </li></ul></ul></ul><ul><ul><ul><li>But transactions may read an old value of an item and use it to perform an update, resulting in non-serializability </li></ul></ul></ul><ul><ul><li>Updates are performed at any replica and propagated to all other replicas </li></ul></ul><ul><ul><ul><li>Causes even more serialization problems: </li></ul></ul></ul><ul><ul><ul><ul><li>Same data item may be updated concurrently at multiple sites! </li></ul></ul></ul></ul><ul><li>Conflict detection is a problem </li></ul><ul><ul><li>Some conflicts due to lack of distributed concurrency control can be detected when updates are propagated to other sites (will see later, in Section 23.5.4) </li></ul></ul><ul><li>Conflict resolution is very messy </li></ul><ul><ul><li>Resolution may require committed transactions to be rolled back </li></ul></ul><ul><ul><ul><li>Durability violated </li></ul></ul></ul><ul><ul><li>Automatic resolution may not be possible, and human intervention may be required </li></ul></ul>

57.
Availability <ul><li>High availability: time for which the system is not fully usable should be extremely low (e.g. 99.99% availability) </li></ul><ul><li>Robustness: ability of the system to function in spite of failures of components </li></ul><ul><li>Failures are more likely in large distributed systems </li></ul><ul><li>To be robust, a distributed system must </li></ul><ul><ul><li>Detect failures </li></ul></ul><ul><ul><li>Reconfigure the system so computation may continue </li></ul></ul><ul><ul><li>Recovery/reintegration when a site or link is repaired </li></ul></ul><ul><li>Failure detection: distinguishing link failure from site failure is hard </li></ul><ul><ul><li>(partial) solution: have multiple links; failure of multiple links is likely a site failure </li></ul></ul>

58.
Reconfiguration <ul><li>Reconfiguration: </li></ul><ul><ul><li>Abort all transactions that were active at a failed site </li></ul></ul><ul><ul><ul><li>Making them wait could interfere with other transactions since they may hold locks on other sites </li></ul></ul></ul><ul><ul><ul><li>However, in case only some replicas of a data item failed, it may be possible to continue transactions that had accessed data at a failed site (more on this later) </li></ul></ul></ul><ul><ul><li>If replicated data items were at failed site, update system catalog to remove them from the list of replicas. </li></ul></ul><ul><ul><ul><li>This should be reversed when failed site recovers, but additional care needs to be taken to bring values up to date </li></ul></ul></ul><ul><ul><li>If a failed site was a central server for some subsystem, an election must be held to determine the new server </li></ul></ul><ul><ul><ul><li>E.g. name server, concurrency coordinator, global deadlock detector </li></ul></ul></ul>

59.
Reconfiguration (Cont.) <ul><li>Since network partition may not be distinguishable from site failure, the following situations must be avoided </li></ul><ul><ul><li>Two or more central servers elected in distinct partitions </li></ul></ul><ul><ul><li>More than one partition updates a replicated data item </li></ul></ul><ul><li>Updates must be able to continue even if some sites are down </li></ul><ul><li>Solution: majority based approach </li></ul><ul><ul><li>Alternative of “read one write all available” is tantalizing but causes problems </li></ul></ul>

60.
Majority-Based Approach <ul><li>The majority protocol for distributed concurrency control can be modified to work even if some sites are unavailable </li></ul><ul><ul><li>Each replica of each item has a version number which is updated when the replica is updated, as outlined below </li></ul></ul><ul><ul><li>A lock request is sent to more than half of the sites at which item replicas are stored, and the operation continues only when a lock is obtained on a majority of the sites </li></ul></ul><ul><ul><li>Read operations look at all replicas locked, and read the value from the replica with largest version number </li></ul></ul><ul><ul><ul><li>May write this value and version number back to replicas with lower version numbers (no need to obtain locks on all replicas for this task) </li></ul></ul></ul>

61.
Majority-Based Approach <ul><li>Majority protocol (Cont.) </li></ul><ul><ul><li>Write operations </li></ul></ul><ul><ul><ul><li>find highest version number like reads, and set new version number to old highest version + 1 </li></ul></ul></ul><ul><ul><ul><li>Writes are then performed on all locked replicas and version number on these replicas is set to new version number </li></ul></ul></ul><ul><ul><li>Failures (network and site) cause no problems as long as </li></ul></ul><ul><ul><ul><li>Sites at commit contain a majority of replicas of any updated data items </li></ul></ul></ul><ul><ul><ul><li>During reads a majority of replicas are available to find version numbers </li></ul></ul></ul><ul><ul><ul><li>Subject to above, 2 phase commit can be used to update replicas </li></ul></ul></ul><ul><ul><li>Note: reads are guaranteed to see latest version of data item </li></ul></ul><ul><ul><li>Reintegration is trivial: nothing needs to be done </li></ul></ul><ul><li>Quorum consensus algorithm can be similarly extended </li></ul>
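A sketch of the versioned read/write rules, assuming the majority of replicas has already been locked; each replica is modelled here as a dict with "value" and "version" fields, and writing stale replicas back is left out.

```python
# Sketch of reads and writes over a locked majority of replicas.
def read_majority(locked_replicas):
    newest = max(locked_replicas, key=lambda r: r["version"])
    return newest["value"], newest["version"]          # value with the largest version number

def write_majority(locked_replicas, new_value):
    _, highest = read_majority(locked_replicas)         # find the highest version, as reads do
    for r in locked_replicas:                            # then write every locked replica
        r["value"] = new_value
        r["version"] = highest + 1                       # new version = old highest + 1
```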

62.
Read One Write All (Available) <ul><li>Biased protocol is a special case of quorum consensus </li></ul><ul><ul><li>Allows reads to read any one replica but updates require all replicas to be available at commit time (called read one write all ) </li></ul></ul><ul><li>Read one write all available (ignoring failed sites) is attractive, but incorrect </li></ul><ul><ul><li>A failed link may come back up without a disconnected site ever being aware that it was disconnected </li></ul></ul><ul><ul><li>The site then has old values, and a read from that site would return an incorrect value </li></ul></ul><ul><ul><li>If the site had been aware of the failure, reintegration could have been performed, but there is no way to guarantee this </li></ul></ul><ul><ul><li>With network partitioning, sites in each partition may update the same item concurrently </li></ul></ul><ul><ul><ul><li>believing sites in other partitions have all failed </li></ul></ul></ul>

63.
Site Reintegration <ul><li>When failed site recovers, it must catch up with all updates that it missed while it was down </li></ul><ul><ul><li>Problem: updates may be happening to items whose replica is stored at the site while the site is recovering </li></ul></ul><ul><ul><li>Solution 1: halt all updates on system while reintegrating a site </li></ul></ul><ul><ul><ul><li>Unacceptable disruption </li></ul></ul></ul><ul><ul><li>Solution 2: lock all replicas of all data items at the site, update to latest version, then release locks </li></ul></ul><ul><ul><ul><li>Other solutions with better concurrency also available </li></ul></ul></ul>

64.
Comparison with Remote Backup <ul><li>Remote backup (hot spare) systems (Section 17.10) are also designed to provide high availability </li></ul><ul><li>Remote backup systems are simpler and have lower overhead </li></ul><ul><ul><li>All actions performed at a single site, and only log records shipped </li></ul></ul><ul><ul><li>No need for distributed concurrency control, or 2 phase commit </li></ul></ul><ul><li>Using distributed databases with replicas of data items can provide higher availability by having multiple (> 2) replicas and using the majority protocol </li></ul><ul><ul><li>Also avoid failure detection and switchover time associated with remote backup systems </li></ul></ul>

65.
Coordinator Selection <ul><li>Backup coordinators </li></ul><ul><ul><li>site which maintains enough information locally to assume the role of coordinator if the actual coordinator fails </li></ul></ul><ul><ul><li>executes the same algorithms and maintains the same internal state information as the actual coordinator </li></ul></ul><ul><ul><li>allows fast recovery from coordinator failure but involves overhead during normal processing. </li></ul></ul><ul><li>Election algorithms </li></ul><ul><ul><li>used to elect a new coordinator in case of failures </li></ul></ul><ul><ul><li>Example: Bully Algorithm - applicable to systems where every site can send a message to every other site. </li></ul></ul>

66.
Bully Algorithm <ul><li>If site S i sends a request that is not answered by the coordinator within a time interval T , assume that the coordinator has failed; S i tries to elect itself as the new coordinator. </li></ul><ul><li>S i sends an election message to every site with a higher identification number; S i then waits for any of these processes to answer within T . </li></ul><ul><li>If no response within T , assume that all sites with number greater than i have failed; S i elects itself the new coordinator. </li></ul><ul><li>If an answer is received, S i begins time interval T ’, waiting to receive a message that a site with a higher identification number has been elected. </li></ul>

67.
Bully Algorithm (Cont.) <ul><li>If no message is sent within T ’, assume the site with a higher number has failed; S i restarts the algorithm. </li></ul><ul><li>After a failed site recovers, it immediately begins execution of the same algorithm. </li></ul><ul><li>If there are no active sites with higher numbers, the recovered site forces all processes with lower numbers to let it become the coordinator site, even if there is a currently active coordinator with a lower number. </li></ul>
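A compressed sketch of one round of the bully algorithm at site i; send_election, wait_for_victory and announce are assumed messaging primitives with the timeouts T and T' folded into them.

```python
# Sketch of the bully algorithm from the point of view of site i.
def elect(i, all_sites, send_election, wait_for_victory, announce):
    higher = [s for s in all_sites if s > i]
    answered = any(send_election(i, s) for s in higher)   # True if some higher site replies within T
    if not answered:
        announce(i)                 # no higher-numbered site is alive: i becomes the coordinator
        return i
    winner = wait_for_victory()     # wait up to T' for a higher site to declare itself
    if winner is not None:
        return winner
    return elect(i, all_sites, send_election, wait_for_victory, announce)   # restart the algorithm
```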

69.
Distributed Query Processing <ul><li>For centralized systems, the primary criterion for measuring the cost of a particular strategy is the number of disk accesses. </li></ul><ul><li>In a distributed system, other issues must be taken into account: </li></ul><ul><ul><li>The cost of a data transmission over the network. </li></ul></ul><ul><ul><li>The potential gain in performance from having several sites process parts of the query in parallel. </li></ul></ul>

71.
Example Query (Cont.) <ul><li>Since account 1 has only tuples pertaining to the Hillside branch, we can eliminate the selection operation. </li></ul><ul><li>Apply the definition of account 2 to obtain </li></ul><ul><li> σ branch-name = “Hillside” ( σ branch-name = “Valleyview” ( account )) </li></ul><ul><li>This expression is the empty set regardless of the contents of the account relation. </li></ul><ul><li>Final strategy is for the Hillside site to return account 1 as the result of the query. </li></ul>

72.
Simple Join Processing <ul><li>Consider the following relational algebra expression in which the three relations are neither replicated nor fragmented </li></ul><ul><li>account ⋈ depositor ⋈ branch </li></ul><ul><li>account is stored at site S 1 </li></ul><ul><li>depositor at S 2 </li></ul><ul><li>branch at S 3 </li></ul><ul><li>For a query issued at site S I , the system needs to produce the result at site S I </li></ul>

73.
Possible Query Processing Strategies <ul><li>Ship copies of all three relations to site S I and choose a strategy for processing the entire query locally at site S I . </li></ul><ul><li>Ship a copy of the account relation to site S 2 and compute temp 1 = account ⋈ depositor at S 2 . Ship temp 1 from S 2 to S 3 , and compute temp 2 = temp 1 ⋈ branch at S 3 . Ship the result temp 2 to S I . </li></ul><ul><li>Devise similar strategies, exchanging the roles of S 1 , S 2 , S 3 </li></ul><ul><li>Must consider following factors: </li></ul><ul><ul><li>amount of data being shipped </li></ul></ul><ul><ul><li>cost of transmitting a data block between sites </li></ul></ul><ul><ul><li>relative processing speed at each site </li></ul></ul>

77.
Heterogeneous Distributed Databases <ul><li>Many database applications require data from a variety of preexisting databases located in a heterogeneous collection of hardware and software platforms </li></ul><ul><li>Data models may differ (hierarchical, relational , etc.) </li></ul><ul><li>Transaction commit protocols may be incompatible </li></ul><ul><li>Concurrency control may be based on different techniques (locking, timestamping, etc.) </li></ul><ul><li>System-level details almost certainly are totally incompatible. </li></ul><ul><li>A multidatabase system is a software layer on top of existing database systems, which is designed to manipulate information in heterogeneous databases </li></ul><ul><ul><li>Creates an illusion of logical database integration without any physical database integration </li></ul></ul>

78.
Advantages <ul><li>Preservation of investment in existing </li></ul><ul><ul><li>hardware </li></ul></ul><ul><ul><li>system software </li></ul></ul><ul><ul><li>Applications </li></ul></ul><ul><li>Local autonomy and administrative control </li></ul><ul><li>Allows use of special-purpose DBMSs </li></ul><ul><li>Step towards a unified homogeneous DBMS </li></ul><ul><ul><li>Full integration into a homogeneous DBMS faces </li></ul></ul><ul><ul><ul><li>Technical difficulties and cost of conversion </li></ul></ul></ul><ul><ul><ul><li>Organizational/political difficulties </li></ul></ul></ul><ul><ul><ul><ul><li>Organizations do not want to give up control on their data </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Local databases wish to retain a great deal of autonomy </li></ul></ul></ul></ul>

80.
Query Processing <ul><li>Several issues in query processing in a heterogeneous database </li></ul><ul><li>Schema translation </li></ul><ul><ul><li>Write a wrapper for each data source to translate data to a global schema </li></ul></ul><ul><ul><li>Wrappers must also translate updates on the global schema to updates on the local schema </li></ul></ul><ul><li>Limited query capabilities </li></ul><ul><ul><li>Some data sources allow only restricted forms of selections </li></ul></ul><ul><ul><ul><li>E.g. web forms, flat file data sources </li></ul></ul></ul><ul><ul><li>Queries have to be broken up and processed partly at the source and partly at a different site </li></ul></ul><ul><li>Removal of duplicate information when sites have overlapping information </li></ul><ul><ul><li>Decide at which sites to execute the query </li></ul></ul><ul><li>Global query optimization </li></ul>

81.
Mediator Systems <ul><li>Mediator systems are systems that integrate multiple heterogeneous data sources by providing an integrated global view, and providing query facilities on global view </li></ul><ul><ul><li>Unlike full fledged multidatabase systems, mediators generally do not bother about transaction processing </li></ul></ul><ul><ul><li>But the terms mediator and multidatabase are sometimes used interchangeably </li></ul></ul><ul><ul><li>The term virtual database is also used to refer to mediator/multidatabase systems </li></ul></ul>

86.
LDAP Data Model <ul><li>LDAP directories store entries </li></ul><ul><ul><li>Entries are similar to objects </li></ul></ul><ul><li>Each entry must have a unique distinguished name (DN) </li></ul><ul><li>DN made up of a sequence of relative distinguished names (RDNs) </li></ul><ul><li>E.g. of a DN </li></ul><ul><ul><li>cn=Silberschatz, ou=Bell Labs, o=Lucent, c=USA </li></ul></ul><ul><ul><li>Standard RDNs (can be specified as part of schema) </li></ul></ul><ul><ul><ul><li>cn: common name ou: organizational unit </li></ul></ul></ul><ul><ul><ul><li>o: organization c: country </li></ul></ul></ul><ul><ul><li>Similar to paths in a file system but written in reverse direction </li></ul></ul>

87.
LDAP Data Model (Cont.) <ul><li>Entries can have attributes </li></ul><ul><ul><li>Attributes are multi-valued by default </li></ul></ul><ul><ul><li>LDAP has several built-in types </li></ul></ul><ul><ul><ul><li>Binary, string, time types </li></ul></ul></ul><ul><ul><ul><li>Tel: telephone number PostalAddress: postal address </li></ul></ul></ul><ul><li>LDAP allows definition of object classes </li></ul><ul><ul><li>Object classes specify attribute names and types </li></ul></ul><ul><ul><li>Can use inheritance to define object classes </li></ul></ul><ul><ul><li>Entry can be specified to be of one or more object classes </li></ul></ul><ul><ul><ul><li>No need to have single most-specific type </li></ul></ul></ul>

88.
LDAP Data Model (cont.) <ul><li>Entries organized into a directory information tree according to their DNs </li></ul><ul><ul><li>Leaf-level entries usually represent specific objects </li></ul></ul><ul><ul><li>Internal node entries represent objects such as organizational units, organizations or countries </li></ul></ul><ul><ul><li>Children of a node inherit the DN of the parent, and add on RDNs </li></ul></ul><ul><ul><ul><li>E.g. internal node with DN c=USA </li></ul></ul></ul><ul><ul><ul><ul><li>Children nodes have DN starting with c=USA and further RDNs such as o or ou </li></ul></ul></ul></ul><ul><ul><ul><li>DN of an entry can be generated by traversing path from root </li></ul></ul></ul><ul><ul><li>A leaf-level entry can be an alias pointing to another entry </li></ul></ul><ul><ul><ul><li>Entries can thus have more than one DN </li></ul></ul></ul><ul><ul><ul><ul><li>E.g. person in more than one organizational unit </li></ul></ul></ul></ul>

90.
LDAP Queries <ul><li>LDAP query must specify </li></ul><ul><ul><li>Base: a node in the DIT from where search is to start </li></ul></ul><ul><ul><li>A search condition </li></ul></ul><ul><ul><ul><li>Boolean combination of conditions on attributes of entries </li></ul></ul></ul><ul><ul><ul><ul><li>Equality, wild-cards and approximate equality supported </li></ul></ul></ul></ul><ul><ul><li>A scope </li></ul></ul><ul><ul><ul><li>Just the base, the base and its children, or the entire subtree from the base </li></ul></ul></ul><ul><ul><li>Attributes to be returned </li></ul></ul><ul><ul><li>Limits on number of results and on resource consumption </li></ul></ul><ul><ul><li>May also specify whether to automatically dereference aliases </li></ul></ul><ul><li>LDAP URLs are one way of specifying query </li></ul><ul><li>LDAP API is another alternative </li></ul>
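For illustration, a query sketch using the third-party ldap3 package for Python; the host, base DN, filter and attribute names are hypothetical.

```python
from ldap3 import Server, Connection, SUBTREE

# Sketch: search the subtree under a base DN for entries matching a filter.
server = Server("ldap.example.com")
conn = Connection(server, auto_bind=True)          # anonymous bind for this example
conn.search(search_base="o=Lucent,c=USA",          # base: node in the DIT where the search starts
            search_filter="(cn=Korth)",            # search condition on entry attributes
            search_scope=SUBTREE,                  # scope: the entire subtree below the base
            attributes=["cn", "telephoneNumber"])  # attributes to be returned
for entry in conn.entries:
    print(entry)
```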

94.
LDAP API (Cont.) <ul><li>LDAP API also has functions to create, update and delete entries </li></ul><ul><li>Each function call behaves as a separate transaction </li></ul><ul><ul><li>LDAP does not support atomicity of updates </li></ul></ul>

95.
Distributed Directory Trees <ul><li>Organizational information may be split into multiple directory information trees </li></ul><ul><ul><li>Suffix of a DIT gives the RDN to be tagged onto all entries to get an overall DN </li></ul></ul><ul><ul><ul><li>E.g. two DITs, one with suffix o=Lucent, c=USA and another with suffix o=Lucent, c=India </li></ul></ul></ul><ul><ul><li>Organizations often split up DITs based on geographical location or by organizational structure </li></ul></ul><ul><ul><li>Many LDAP implementations support replication (master-slave or multi-master replication) of DITs (not part of the LDAP 3 standard) </li></ul></ul><ul><li>A node in a DIT may be a referral to a node in another DIT </li></ul><ul><ul><li>E.g. ou=Bell Labs may have a separate DIT, and the DIT for o=Lucent may have a leaf with ou=Bell Labs containing a referral to the Bell Labs DIT </li></ul></ul><ul><ul><li>Referrals are the key to integrating a distributed collection of directories </li></ul></ul><ul><ul><li>When a server gets a query reaching a referral node, it may either </li></ul></ul><ul><ul><ul><li>Forward the query to the referred DIT and return the answer to the client, or </li></ul></ul></ul><ul><ul><ul><li>Give the referral back to the client, which transparently sends the query to the referred DIT (without user intervention) </li></ul></ul></ul>

97.
Three Phase Commit (3PC) <ul><li>Assumptions: </li></ul><ul><ul><li>No network partitioning </li></ul></ul><ul><ul><li>At any point, at least one site must be up. </li></ul></ul><ul><ul><li>At most K sites (participants as well as coordinator) can fail </li></ul></ul><ul><li>Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1. </li></ul><ul><ul><li>Every site is ready to commit if instructed to do so </li></ul></ul><ul><ul><li>Under 2PC each site is obligated to wait for the decision from the coordinator </li></ul></ul><ul><ul><li>Under 3PC, knowledge of the pre-commit decision can be used to commit despite coordinator failure. </li></ul></ul>

99.
Phase 3. Recording Decision in the Database <ul><li>Executed only if the decision in phase 2 was to precommit </li></ul><ul><li>Coordinator collects acknowledgements. It sends a < commit T > message to the participants as soon as it receives K acknowledgements. </li></ul><ul><li>Coordinator adds the record < commit T > to its log and forces the record to stable storage. </li></ul><ul><li>Coordinator sends a < commit T > message to each participant </li></ul><ul><li>Participants take appropriate action locally. </li></ul>

102.
Coordinator – Failure Protocol <ul><li>1. The active participating sites select a new coordinator, C new </li></ul><ul><li>2. C new requests the local status of T from each participating site </li></ul><ul><li>3. Each participating site, including C new , determines the local status of T : </li></ul><ul><ul><li>Committed . The log contains a < commit T > record </li></ul></ul><ul><ul><li>Aborted . The log contains an < abort T > record. </li></ul></ul><ul><ul><li>Ready . The log contains a < ready T > record but no < abort T > or < precommit T > record </li></ul></ul><ul><ul><li>Precommitted . The log contains a < precommit T > record but no < abort T > or < commit T > record. </li></ul></ul><ul><ul><li>Not ready . The log contains neither a < ready T > nor an < abort T > record. </li></ul></ul><ul><li>A site that failed and recovered must ignore any precommit record in its log when determining its status. </li></ul><ul><li>4. Each participating site sends its local status to C new </li></ul>

103.
Coordinator Failure Protocol (Cont.) <ul><li>5. C new decides either to commit or abort T , or to restart the </li></ul><ul><li>three-phase commit protocol: </li></ul><ul><ul><li>Commit state for any one participant → commit </li></ul></ul><ul><ul><li>Abort state for any one participant → abort. </li></ul></ul><ul><ul><li>Precommit state for any one participant and above 2 cases do not hold → </li></ul></ul><ul><ul><li>A precommit message is sent to those participants in the uncertain state. Protocol is resumed from that point. </li></ul></ul><ul><ul><li>Uncertain state at all live participants → abort. Since at least n - k sites are up, the fact that all participants are in an uncertain state means that the coordinator has not sent a < commit T > message, implying that no site has committed T . </li></ul></ul>

105.
Fully Distributed Approach (Cont.) <ul><li>System model: a transaction runs at a single site, and makes requests to other sites for accessing non-local data. </li></ul><ul><li>Each site maintains its own local wait-for graph in the normal fashion: there is an edge T i → T j if T i is waiting on a lock held by T j (note: T i and T j may be non-local). </li></ul><ul><li>Additionally, arc T i → T ex exists in the graph at site S k if </li></ul><ul><ul><li>(a) T i is executing at site S k , and is waiting for a reply to a request made on another site, or </li></ul></ul><ul><ul><li>(b) T i is non-local to site S k , and a lock has been granted to T i at S k . </li></ul></ul><ul><li>Similarly arc T ex → T i exists in the graph at site S k if </li></ul><ul><ul><li>(a) T i is non-local to site S k , and is waiting on a lock for data at site S k , or </li></ul></ul><ul><ul><li>(b) T i is local to site S k , and has accessed data from an external site. </li></ul></ul>

106.
Fully Distributed Approach (Cont.) <ul><li>Centralized Deadlock Detection - all graph edges sent to central deadlock detector </li></ul><ul><li>Distributed Deadlock Detection - “path pushing” algorithm </li></ul><ul><li>Path pushing is initiated when a site detects a local cycle involving T ex , which indicates the possibility of a deadlock. </li></ul><ul><li>Suppose the cycle at site S i is </li></ul><ul><ul><ul><ul><li>T ex → T i → T j → ... → T n → T ex </li></ul></ul></ul></ul><ul><li>and T n is waiting for some transaction at site S j . Then S i passes on information about the cycle to S j . </li></ul><ul><li>Optimization : S i passes on information only if i > n . </li></ul><ul><li>S j updates its graph with the new information and if it finds a cycle it repeats the above process. </li></ul>
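A rough sketch of the path-pushing step at site S i , assuming integer transaction ids, a local find_cycle helper, a map from a waiting transaction to the site it waits on, and a send_path primitive; none of these names come from the slides.

```python
# Sketch of "path pushing": ship a suspected deadlock path to the next site.
EX = "Tex"   # virtual transaction representing waits that leave or enter this site

def push_paths(local_graph, waiting_site_of, find_cycle, send_path):
    cycle = find_cycle(local_graph)          # e.g. [EX, 4, 7, 9], meaning Tex -> T4 -> T7 -> T9 -> Tex
    if not cycle or EX not in cycle:
        return                               # no local cycle through Tex: nothing to push
    k = cycle.index(EX)
    path = cycle[k + 1:] + cycle[:k]         # Ti ... Tn with Tex removed
    Ti, Tn = path[0], path[-1]
    if Ti > Tn:                              # optimization: push only if i > n
        send_path(waiting_site_of[Tn], path) # Sj merges these edges into its graph and repeats the check
```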

112.
Naming of Replicas and Fragments <ul><li>Each replica and each fragment of a data item must have a unique name. </li></ul><ul><ul><li>Use of postscripts to determine those replicas that are replicas of the same data item, and those fragments that are fragments of the same data item. </li></ul></ul><ul><ul><li>fragments of same data item: “. f 1 ”, “. f 2 ”, …, “. fn ” </li></ul></ul><ul><ul><li>replicas of same data item: “. r 1 ”, “. r 2 ”, …, “. rn ” </li></ul></ul><ul><ul><li>site 17. account . f 3 . r 2 </li></ul></ul><ul><li>refers to replica 2 of fragment 3 of account ; this item was generated by site 17. </li></ul>

114.
Example of Name - Translation Scheme <ul><li>A user at the Hillside branch (site S 1 ), uses the alias local-account for the local fragment account.f1 of the account relation. </li></ul><ul><li>When this user references local-account , the query-processing subsystem looks up local-account in the alias table, and replaces local-account with S 1 . account.f 1 . </li></ul><ul><li>If S 1 . account.f 1 is replicated, the system must consult the replica table in order to choose a replica </li></ul><ul><li>If this replica is fragmented, the system must examine the fragmentation table to find out how to reconstruct the relation. </li></ul><ul><li>Usually only need to consult one or two tables, however, the algorithm can deal with any combination of successive replication and fragmentation of relations. </li></ul>

115.
Transparency and Updates <ul><li>Must ensure that all replicas of a data item are updated and that all affected fragments are updated. </li></ul><ul><li>Consider the account relation and the insertion of the tuple: </li></ul><ul><ul><li>(“Valleyview”, A-733, 600) </li></ul></ul><ul><li>Horizontal fragmentation of account </li></ul><ul><li>account 1 = σ branch-name = “Hillside” ( account ) </li></ul><ul><li>account 2 = σ branch-name = “Valleyview” ( account ) </li></ul><ul><ul><li>Predicate P i is associated with the i th fragment </li></ul></ul><ul><ul><li>Predicate P i is applied to the tuple (“Valleyview”, A-733, 600) to test whether that tuple must be inserted in the i th fragment </li></ul></ul><ul><ul><li>Tuple inserted into account 2 </li></ul></ul>

116.
Transparency and Updates (Cont.) <ul><li>Vertical fragmentation of deposit into deposit 1 and deposit 2 </li></ul><ul><li>The tuple (“Valleyview”, A-733, “Jones”, 600) must be split into two fragments: </li></ul><ul><ul><li>one to be inserted into deposit 1 </li></ul></ul><ul><ul><li>one to be inserted into deposit 2 </li></ul></ul><ul><li>If deposit is replicated, the tuple (“Valleyview”, A-733, “Jones”, 600) must be inserted in all replicas </li></ul><ul><li>Problem: If deposit is accessed concurrently it is possible that one replica will be updated earlier than another (see section on Concurrency Control). </li></ul>

120.
Network Topology (Cont.) <ul><li>A partitioned system is split into two (or more) subsystems ( partitions ) that lack any connection. </li></ul><ul><li>Tree-structured: low installation and communication costs; the failure of a single link can partition the network </li></ul><ul><li>Ring: At least two links must fail for partition to occur; communication cost is high. </li></ul><ul><li>Star: </li></ul><ul><ul><li>the failure of a single link results in a network partition, but since one of the partitions has only a single site it can be treated as a single-site failure. </li></ul></ul><ul><ul><li>low communication cost </li></ul></ul><ul><ul><li>failure of the central site results in every site in the system becoming disconnected </li></ul></ul>

121.
Robustness <ul><li>A robust system must: </li></ul><ul><ul><li>Detect site or link failures </li></ul></ul><ul><ul><li>Reconfigure the system so that computation may continue. </li></ul></ul><ul><ul><li>Recover when a processor or link is repaired </li></ul></ul><ul><li>Handling failure types: </li></ul><ul><ul><li>Retransmit lost messages </li></ul></ul><ul><ul><li>Unacknowledged retransmits indicate link failure; find an alternative route for the message. </li></ul></ul><ul><ul><li>Failure to find an alternative route is a symptom of network partition. </li></ul></ul><ul><li>Network link failures and site failures are generally indistinguishable. </li></ul>

122.
Procedure to Reconfigure System <ul><li>If replicated data is stored at the failed site, update the catalog so that queries do not reference the copy at the failed site. </li></ul><ul><li>Transactions active at the failed site should be aborted. </li></ul><ul><li>If the failed site is a central server for some subsystem, an election must be held to determine the new server. </li></ul><ul><li>Reconfiguration scheme must work correctly in case of network partitioning; must avoid: </li></ul><ul><ul><li>Electing two or more central servers in distinct partitions. </li></ul></ul><ul><ul><li>Updating a replicated data item in more than one partition </li></ul></ul><ul><li>Represent recovery tasks as a series of transactions; the concurrency-control subsystem and transaction-management subsystem may then be relied upon for proper reintegration. </li></ul>