2PC and 3PC (Commit Protocols) in DBMS

Both the Two-Phase Commit (2PC) protocol and the Three-Phase Commit (3PC) protocol are popular in distributed DBMS deployments because they enforce an all-or-nothing proposition: every node commits a transaction, or none of them do. Both protocols share a Prepare (Voting) phase and a Commit/Abort phase, but 3PC adds an intermediate Pre-Commit phase in which every participating node must acknowledge readiness before the commit is actually carried out. Compared to 3PC, Two-Phase Commit may be characterized as sending the command and hoping for the best: after voting, participants simply wait for the coordinator's final decision. The return message from each participant determines the global commit or abort status. The extra Pre-Commit step in 3PC is intended to clear up global commit/abort failures and blocking. It polls for readiness before anything is finalized, so the nodes can "act independently in the event of a failure" (Connolly & Begg, 2015). This is an important distinction. In 2PC, a single abort vote or missing acknowledgement undoes the entire process. In 3PC, once the Pre-Commit phase has returned a global commit decision, even a timeout or network partition will not cause a global abort.

Termination, according to Connolly & Begg (2015), is where the differences between these protocols are most critical. In 2PC it is possible to block: after voting, participants wait on a commit or abort message from the coordinator before acting on the global decision. If a partition occurs, they are stuck until the coordinator re-establishes communication. A power failure is more catastrophic, as it may involve multiple participants and the coordinator itself. In both 2PC and 3PC, backup procedures are activated, but 2PC participants remain in a blocked state. There are tradeoffs overall, of course. The major issue with 3PC is its communication overhead, which is to be expected with the extra phase (Kumar, 2016).
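The decision logic above can be sketched in a few lines of Python. This is a minimal illustration, not a real DBMS implementation; the protocol names, participant states, and return strings are assumptions made for the example.

```python
def coordinator_decides(votes):
    """Phase 1 outcome for both protocols: one NO (or missing) vote aborts."""
    return "COMMIT" if votes and all(v == "YES" for v in votes) else "ABORT"

def participant_awaits(protocol, state, coordinator_reachable):
    """What a participant does while waiting on the coordinator's final word.

    state: 'prepared'     -- voted YES, no pre-commit received (2PC or 3PC)
           'precommitted' -- received the 3PC pre-commit, so the global
                             decision to commit is already known
    """
    if coordinator_reachable:
        return "follow coordinator"
    if protocol == "2PC":
        return "blocked"          # outcome unknown; must wait for the coordinator
    if state == "precommitted":
        return "commit"           # 3PC: safe to finish independently
    return "abort"                # 3PC timeout before pre-commit: safe to abort
```

The key line is the `"blocked"` branch: a 2PC participant that has voted YES cannot safely commit or abort on its own, while a 3PC participant in the pre-committed state can.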

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Kumar, M. (2016). Commit protocols in distributed database system: A comparison. International Journal for Innovative Research in Science & Technology, 2(12), 277-281.

Concurrency: Optimistic or Pessimistic?

Optimistic concurrency control is the more complex of the two concurrency control methods. A transaction is timestamped when it begins, the work is performed, and the change is validated at the end of the transaction block. If a conflicting transaction committed after this transaction's start time, this transaction is aborted; the record is no longer current because someone else got to it first and completed their transaction. The risk here is conflict, as it is possible for more than one transaction to read and work on a record at the same time, and the later transaction's work is wasted when validation fails.
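The read-work-validate cycle described above can be sketched with a version counter standing in for the timestamp. This is an illustrative sketch; the class and function names are assumptions, not any particular DBMS's API.

```python
class OptimisticRecord:
    """A record carrying a version counter in place of a lock."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def begin(record):
    """Start of transaction: remember the version (timestamp) we read."""
    return record.version

def commit(record, read_version, new_value):
    """Validation phase: abort if anyone committed since we started."""
    if record.version != read_version:
        return False                  # conflict -- abort; caller may retry
    record.value = new_value
    record.version += 1               # commit and advance the version
    return True
```

Two transactions can freely read the same record; whichever tries to commit second sees a changed version, fails validation, and is aborted.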

Conservative (or pessimistic) concurrency control is akin to checking out a book from the library, and is the simpler of the two methods. Once a transaction begins, the record is locked and no one else can modify it. In the library example, I would go to the library to check out a book (the record) in order to read it (modify it); if it is on the shelf (no one has initiated a change), I may check it out. If the book is not there (someone has locked the record and is modifying it), I cannot. It is a first-come, first-served method that ensures no two users modify a record concurrently.
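The check-out-a-book behavior can be sketched with an ordinary lock. Again a minimal illustration under assumed names, with a non-blocking acquire playing the role of finding the book already checked out.

```python
import threading

class PessimisticRecord:
    """A record guarded by a lock -- checked out like a library book."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def try_update(self, update_fn):
        """Attempt to check the record out; fail fast if it is already taken."""
        if not self._lock.acquire(blocking=False):
            return False              # someone else holds the record
        try:
            self.value = update_fn(self.value)
            return True
        finally:
            self._lock.release()
```

A real DBMS would typically wait (or time out) on the lock rather than fail immediately, but the first-come, first-served property is the same.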

Each has its risks and rewards. Optimistic concurrency control tends to be used in environments without much contention for a single record of truth, and it allows a higher volume of transactions per hour. However, as the name implies, the method essentially hopes for the best and deals with a conflict if and when it arises. Pessimistic concurrency control, on the other hand, virtually guarantees that all transactions execute correctly and that the database remains stable. It is a simpler decision tree: wait or abort if the record is locked, proceed if it is not. The drawbacks of pessimistic concurrency control all lie in timing: fewer transactions per hour, and access to the data limited by the number of users making transactions.

One specific advantage of optimistic locking that is not always thought of immediately is evident when a user cannot maintain a consistent connection to the database. Assume for a moment that a user locks a table in a remote database for updating and the connection is severed (through a server reset, ISP trouble, et cetera). The user reconnects and is back in the database; however, the previous session was never properly closed, so we now have a phantom user holding the record open. Under optimistic control, no lock was ever held, so the dropped connection leaves nothing behind.

Reference

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Database Indexing and Clusters

A useful analogy for database indexing is highlighting an article or book. We highlight or mark passages so they are easier to retrieve later, when we pick the book back up and want to find something quickly. Likewise, an index provides quick access to key information in database tables. It would be silly to highlight an entire book or mark up all of its pages; by the same token, it would be functionally useless to index a large number of columns in a table. There is a point of diminishing returns here.

It is generally recommended to index columns that appear in WHERE or JOIN clauses (Larsen, 2010). These columns are frequently sought out by query operations and are typically as critical to the table as the primary key. It is important to choose wisely here, because every write to the table requires a corresponding index update, and that work multiplies with each additional index placed on the table. Again, we come back to the principle of diminishing returns.
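A small SQLite demonstration makes the advice concrete. The table and index names here are hypothetical; the point is that indexing the column used in the WHERE clause lets the query planner search the index instead of scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Index the column that appears in WHERE/JOIN clauses, per Larsen's advice.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchone()
print(plan)  # the plan names idx_orders_customer rather than a full table scan
```

Every INSERT, UPDATE, or DELETE against `orders` now also maintains `idx_orders_customer`, which is exactly the write-time cost the paragraph above warns about.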

There is also the matter of choosing between clustered and nonclustered indexes. The former typically reads like a telephone directory: in order. Primary keys are typically used in clustered indexes. One drawback here is the need to reorder the index when information is updated, added, or deleted. A nonclustered index, on the other hand, operates much like the index in the back of a textbook, or like a dimension table in a star-schema database. While the latter may seem more advantageous at all times, it usually shines when values are constantly updated (Wagner, 2017). In situations where most of the table's data is returned by a query, or a primary key is the natural identifier, a clustered index is the type of choice.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Larsen, G. A. (2010). The dos and don'ts of database indexing. Retrieved from https://www.databasejournal.com/features/mssql/article.php/3860851/The-Dos-and-Donts-of-Database-Indexing.htm

Wagner, B. (2017). Clustered vs nonclustered: What index is right for my data? Retrieved from https://hackernoon.com/clustered-vs-nonclustered-what-index-is-right-for-my-data-717b329d042c

CAP and ACID Principles in Mobile Database Solutions

The concept of a mobile database borrows characteristics from distributed systems and incorporates ongoing developments in wireless technology and mobile devices. At a practical level, a mobile database solution allows a business user in the field to connect to a corporate central database without a dedicated link to the corporate database server. The method of communication is not unlike asynchronous replication, wherein updates between the mobile node and the corporate node are handled on a schedule as opposed to instantaneously. According to Connolly & Begg (2015), mobile database solutions typically include:

  1. a corporate database server and DBMS that manages and stores the corporate data and provides corporate applications;
  2. a remote database and DBMS that manages and stores the mobile data and provides mobile applications;
  3. a mobile database platform that includes laptop, smartphone, or other Internet access devices;
  4. two-way communication links between the corporate and mobile DBMS.

Beyond the functions of a standard DBMS, mobile database solutions also require the ability to:

  1. communicate with the centralized database server through modes such as wireless or Internet access;
  2. replicate data on the centralized database server and mobile device;
  3. synchronize data on the centralized database server and mobile device;
  4. capture data from various sources such as the Internet;
  5. manage data on the mobile device;
  6. analyze data on a mobile device;
  7. create customized mobile applications (Connolly & Begg, 2015).

Common issues in mobile database solutions include security, network partitioning, cellular handoff, and ACID transaction management (Connolly & Begg, 2015; Ibikunle & Adegbenjo, 2013). Security, partitioning, and handoff are all inherent in the mobile nature of the solution; that is, the idea of mobile nodes roaming around the country, with the connections being handed off between cellular towers as the user traverses a route, obviously carries with it the possibilities of signal loss or physical loss of an unsecured device.

ACID principles, which address the cluster as a whole, must be relaxed and adapted for mobile database solutions (Connolly & Begg, 2015). Assume for a moment that a user makes a large number of modifications on the mobile node. Those transactions must be committed to the central database at the next update, but if the connection is lost or weakened as a cellular signal is handed off, the batch may not complete. In that case, under strict atomicity rules, the entire set of transactions must be rolled back. This is not optimal, and thus a new approach to atomicity must be defined for mobile solutions. Isolation is also an issue during that window, because a resource remains blocked until the transaction completes. This raises questions of consistency and durability as well: what happens when connectivity is lost and the mobile database is inconsistent with the central database? The inconsistency is not even discoverable until the mobile database re-establishes connectivity. And what happens if one or more components of the mobile database solution experience a failure?

Similarly, just as CAP concerns arise when a network partition occurs, we must account for the additional partition likelihoods of mobile handoff and signal loss. The mobile database solution must choose between consistency and availability in the event of a partition. In choosing consistency, none of the nodes is available until all are back online. In choosing availability, the nodes remain available but are not necessarily consistent until connectivity is re-established between all of them.
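That choice can be stated as a tiny decision function. A minimal sketch only: the mode names, the dict-based replica, and the use of `TimeoutError` are illustrative assumptions.

```python
def read_replica(replica, partitioned, mode):
    """What a replica returns while cut off from the rest of the cluster."""
    if not partitioned:
        return replica["value"]       # healthy cluster: no tradeoff needed
    if mode == "consistency":
        # CP choice: refuse service rather than risk returning stale data.
        raise TimeoutError("unavailable until the partition heals")
    # AP choice: stay available, accepting that the value may be stale.
    return replica["value"]
```

For a mobile node that drops its signal between cell towers, the "availability" branch is usually the only workable one, which is exactly why consistency gets relaxed.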

The shortcomings here fall mostly under CAP, which applies to the database solution as a whole. The system overall must be available; consistency, however, can be enforced in a slightly more relaxed way (just as ACID properties tend to be relaxed for mobile database solutions). Ramakrishnan (2012) acknowledges that consistency can exist on a spectrum.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Frank, L., Pedersen, R. U., Frank, C. H., & Larsson, N. J. (2014). The CAP theorem versus databases with relaxed ACID properties. Paper presented at the Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, Siem Reap, Cambodia.

Greiner, R. (2014). CAP theorem: Revisited. Retrieved from http://robertgreiner.com/2014/08/cap-theorem-revisited/

Ibikunle, F. A., & Adegbenjo, A. A. (2013). Management issues and challenges in mobile database system. International Journal of Engineering Sciences & Emerging Technologies, 5(1).

Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-free replicated data types. Paper presented at the Stabilization, Safety, and Security of Distributed Systems, Berlin, Heidelberg.

Ulnes, S. A. (2017). Eventually consistent: A mobile-first distributed system. Retrieved from https://academy.realm.io/posts/eventually-consistent-making-a-mobile-first-distributed-system/

Breaking down Synchronous and Asynchronous Database Replication

Synchronous replication writes the data across source and target(s) simultaneously, as a single all-or-nothing transaction. Asynchronous replication writes to the source first and then propagates the changes to the target(s) at regular intervals; because it runs on a schedule, there is a lag between the local commit and replication to the remote nodes. The asynchronous approach is most commonly used in cloud backup situations.

Breaking it down to the host-storage relationship:

Synchronous

  1. Source Host sends write request to Source Storage
  2. Source Storage writes data and sends to Target Storage
  3. Target Storage writes data and sends acknowledgement to Source Storage
  4. Source Storage sends acknowledgement to Source Host

Asynchronous

  1. Source Host sends write request to Source Storage
  2. Source Storage writes data and sends acknowledgement to Source Host
  3. The update is held in a queue until a specified time, at which point the Source Storage sends the update to Target Storage
  4. Target Storage writes data and sends acknowledgement to Source Storage

The key difference in the chain is where/when the acknowledgement is sent. In Synchronous, the write-to-target action must complete successfully before a Source Host receives acknowledgement. In Asynchronous, the acknowledgement is sent upon writing to source, without confirming an immediate write to target.
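The two acknowledgement chains above can be sketched with plain lists standing in for the source and target storage and a queue standing in for the replication schedule. Function names and the `"ack"` return value are illustrative assumptions.

```python
import queue

def sync_write(source, target, data):
    """Synchronous: the host is acked only after the target has written."""
    source.append(data)
    target.append(data)        # remote write must succeed first...
    return "ack"               # ...then the Source Host gets its acknowledgement

def async_write(source, pending, data):
    """Asynchronous: ack immediately; the remote write happens later."""
    source.append(data)
    pending.put(data)          # queued for the next replication interval
    return "ack"               # acked before the target has the data

def replication_interval(pending, target):
    """On schedule, flush queued updates to the Target Storage."""
    while not pending.empty():
        target.append(pending.get())
```

The window between `async_write` returning and `replication_interval` running is precisely where asynchronous replication can lose data if the source fails.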

The primary concerns in choosing between these two methods are data integrity and performance. Synchronous replication can guarantee zero data loss, but it consumes more bandwidth and costs more than an asynchronous solution. Asynchronous replication tends to be more cost-effective, uses fewer resources, and is more resilient to link failures by design; however, data loss at write time is more likely. If data must match in real time across nodes, synchronous replication is preferable, as even the small delay between local and remote writes in asynchronous replication may not be acceptable.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Vembu. (2016). Synchronous (vs) Asynchronous Replication. Retrieved from https://www.vembu.com/blog/synchronous-vs-asynchronous-replication/