HBase Terminology Translation Legend

    tablet == region
    Accumulo Master == HBase HMaster
    Metadata table == META
    TabletServer (or tserver) == RegionServer

Assignment is driven solely by the Master process. Assignment can be thought of as a state machine driven by the contents of the metadata table. The Master keeps some transient information in memory. ZooKeeper is used only for liveness checks on a TabletServer (ZooKeeper is checked by the Master, but also by TabletServers; details follow later). As such, the metadata table must always be in a consistent state: one the Master understands how to transition from, or (worst-case scenario) one for which a reasonable fix can be made. Consistency of the metadata table (and of the updates written to it before actions are taken) is very important for assignment to work as intended. Lost updates to the metadata table would near-certainly guarantee multiple-assignment and data-loss type bugs.

Some quick definitions regarding the Tablet states in this state machine:

Unassigned: Not online and not scheduled to be assigned anywhere.
Assigned: Not online, but scheduled to be assigned somewhere.
Hosted: Assigned to a server, and that server brought the Tablet online (the desired state).
Assigned to dead server: The metadata table records that a Tablet is hosted, but the Master has noticed that the TabletServer which should be hosting it is dead.

While the metadata table contains other information, for the purposes here let's assume that it only contains information about tablets. Each row in the metadata table defines a tablet. For the purposes of assignment, each row contains columns for the following: current location, future location, and last location.

Last Location: The last location is used for preserving data locality. When assigning a tablet, the Master will observe the last location column and attempt to assign the tablet back to that location. Not much more to say. The TabletServer updates this column after a compaction.

Future Location: The future location marks that the Master wants to assign a given Tablet to a given server. A tablet that is unassigned can first have its future location set, which will later trigger the Master to tell the TabletServer to bring the tablet online. This also helps with fault tolerance in the Master. For example, consider the Master failing during assignment: it calculated where a Tablet should be assigned but was restarted before completing the assignment. It's reasonable that, when the Master comes back, it can still assign the Tablet to that server. In the negative case, where the TabletServer is no longer alive, it's a simple state transition to unset the future location and let the process happen again.

Current Location: The current location stores the location where a tablet is currently assigned. This is updated by the TabletServer during the final phase of assignment, not by the Master; it is the last step before a Tablet is considered hosted. This column is also updated when the Master notices that the server listed in this value for a Tablet is no longer alive: the Master clears the current location as part of that transition.

General Assignment Loop

Let's outline the simplest case for assignment. Consider a single Tablet which is currently offline, say for a table that was just created. Its relevant assignment state would be as follows:

    {current=null, future=null, last=null}

The Master scans the metadata table periodically, looking for Tablets which are not hosted. Because the above Tablet has no value for current, we know that it is not hosted. Because there is no value for last, there is no locality to preserve and any available TabletServer can be chosen. The Master will take the set of active TabletServers in the cluster (based on ZooKeeper), choose a TabletServer, and record that server's information in the value for future:

    {current=null, future=server1:port, last=null}

After setting the future value, the Master will inform server1:port that it should assign the Tablet. This is a one-way, fire-and-forget Thrift call: the remote end of the RPC cannot send a response back to the client. This lets the Master tell a TabletServer to bring a Tablet online without requiring the Master to block on the RPC while waiting for the TabletServer to actually perform the action. The TabletServer will (eventually) see the request from the Master to bring this Tablet online.
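The Master's side of this loop can be sketched as follows. This is a minimal illustration, not Accumulo's actual implementation: the metadata mapping, the live-server set, and the `OneWayRpc` stub are all assumed stand-ins for the real structures.

```python
import random

def assign_unhosted_tablets(metadata, live_servers, rpc):
    """One pass of the Master's periodic assignment scan.

    `metadata` maps tablet id -> {"current": ..., "future": ..., "last": ...},
    `live_servers` is the set of TabletServers holding live ZooKeeper locks,
    and `rpc.load_tablet(server, tablet)` is a fire-and-forget call.
    All three are illustrative stand-ins, not real Accumulo APIs.
    """
    for tablet, state in metadata.items():
        if state["current"] is not None:   # already hosted
            continue
        if state["future"] is not None:    # an assignment is already in flight
            continue
        # Prefer the last location to preserve locality, if that server is alive.
        if state["last"] in live_servers:
            server = state["last"]
        else:
            server = random.choice(sorted(live_servers))
        # Persist the intent first, then tell the server. If the Master dies
        # in between, the recorded future location lets a restarted Master
        # finish the assignment or unset it if the server has since died.
        state["future"] = server
        rpc.load_tablet(server, tablet)

class OneWayRpc:
    """Stand-in for the one-way Thrift client; it records calls and
    expects no reply, mirroring the fire-and-forget message."""
    def __init__(self):
        self.calls = []
    def load_tablet(self, server, tablet):
        self.calls.append((server, tablet))
```

Note the ordering: writing future before sending the RPC is what makes a Master crash mid-assignment recoverable from the metadata table alone.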
After performing some precondition-like checks, the TabletServer will make the necessary updates in its own memory to host this Tablet and then write an update to the metadata table, setting the current column and unsetting the future column. Writes by the TabletServer are only allowed after a cached check of its ZooKeeper lock. This helps ensure that we don't have a zombie server trying to host tablets due to delayed RPCs from the Master, but it doesn't need to be a synced ZooKeeper read. In the worst-case scenario, where a TabletServer loses its lock, it tanks itself quickly and the tablets hosted there move into a state in which they can be reassigned. These updates let the Master know that the Tablet has moved from the assigned state into the hosted state. Hooray.

    {current=server1:port, future=null, last=null}

Later on, say a user writes some data to this Tablet and a compaction runs to flush the data in memory to disk. When updating the Tablet's metadata to record the new file in HDFS, the TabletServer will also set the last location, since there is now locality to consider.

    {current=server1:port, future=null, last=server1:port}

TabletStateStores

A layer of abstraction relevant for assignment is the TabletStateStore. So far, we have only dealt with the assignment of user tables, which ignores the issue of how to bring the metadata table and the root table online. Consider these three levels of Tablets: each level corresponds to a Store of Tablets that needs to be managed, and each Store depends on all Tablets in the Store above it being assigned. Concretely, before user-table tablets can be assigned, the metadata tablets must be assigned; likewise, before metadata-table tablets can be assigned, the root tablet must be assigned. This is not explicitly enforced, because the necessary read operations will simply block while the previous level is unassigned. The same is not true for unassigning tablets: unassigning must be done bottom-up to ensure that the necessary information can be persisted before a Tablet is taken offline. Tablet state for the metadata table and for user tables is stored in Accumulo as normal tables (the root table and the metadata table, respectively), while the information to locate the root table's tablet is stored in ZooKeeper to bootstrap the system.
As such, the same assignment logic can be reused across all three of these Stores simply by changing where the tablet state is read from: the metadata table, the root table, or ZooKeeper (for assigning the root tablet itself).
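The Store abstraction can be sketched as an interface that the assignment logic is written against once. This is an illustrative Python sketch; the class and method names are assumptions modeled on the description above, not Accumulo's actual TabletStateStore API.

```python
from abc import ABC, abstractmethod

class TabletStateStore(ABC):
    """One level of tablet state: root, metadata, or user tablets.

    Assignment logic is written once against this interface; only the
    backing location of the state (ZooKeeper, the root table, or the
    metadata table) differs per level.
    """

    @abstractmethod
    def unhosted(self):
        """Yield (tablet, state) pairs whose current location is unset."""

    @abstractmethod
    def set_future(self, tablet, server):
        """Record the Master's intent to assign `tablet` to `server`."""

class InMemoryStateStore(TabletStateStore):
    """Illustrative concrete store. Real stores would be backed by
    ZooKeeper (root tablet), the root table (metadata tablets), or the
    metadata table (user tablets)."""

    def __init__(self, rows):
        # tablet id -> {"current": ..., "future": ..., "last": ...}
        self.rows = rows

    def unhosted(self):
        return ((t, s) for t, s in self.rows.items() if s["current"] is None)

    def set_future(self, tablet, server):
        self.rows[tablet]["future"] = server
```

With this shape, the bottom-up/top-down ordering between levels lives outside the Stores; each Store only knows how to read and write its own tablet records.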

Automatic Error Handling/Fixing

One other task the Master performs with respect to assignment is sanity checks on the current state of the Tablet entries. I believe many of these error checks accumulated over years of finding a bug, diagnosing how it was caused, and then adding fixes to prevent it from happening again: recognize if this state ever recurs and try to automatically recover from it. Many of these error conditions are recoverable, although some checks are for very serious problems (e.g. multiple assignment) and provide a big warning message. I believe many of these checks and fixes also relate to the splitting and merging of tablets, and to the failure (with pending retry) of those operations. The Master can attempt to determine, based on current state (such as the set of active TabletServers), how to fix an issue like a Tablet having both a future and a current location (the future location should be erased when the current is set), or removing a second current location when it is on a dead TabletServer.

Optimizations and novel details

Server-side filtering when reading the metadata table: As previously stated, the Master regularly scans the metadata table looking for Tablets which are not in the hosted state. On a system with a large number of tablets, this can be a large amount of data to bring back to a single process (the Master). As such, we can push down a custom server-side filter (via an Accumulo iterator) that will only return Tablet records (whole rows) that do not meet the criteria for being hosted. This reduces the amount of computation the Master needs to perform, in addition to parallelizing the work across multiple servers (ordering of Tablets to bring online within a Store is not necessary).

Updates to the metadata table are distributed: The Master doesn't have to coordinate all of the updates to the metadata table; it can leave it to the TabletServer to perform the update.
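The server-side filter amounts to a simple predicate evaluated on each tserver before rows are returned. A sketch follows, in Python rather than as a real Accumulo iterator, with the row layout assumed from the columns described earlier; treating "current set and future still set" as needing attention follows the sanity checks above.

```python
def needs_attention(row):
    """Server-side predicate: keep only rows the Master must act on.

    A row is filtered out (never shipped to the Master) when the tablet
    is cleanly hosted: a current location is set and no assignment is in
    flight. The dict layout is an illustrative stand-in for a metadata row.
    """
    cleanly_hosted = row["current"] is not None and row["future"] is None
    return not cleanly_hosted

def scan_with_filter(rows):
    """What one tserver returns to the Master after applying the filter."""
    return [r for r in rows if needs_attention(r)]
```

Because each tserver hosting a metadata tablet applies the predicate independently, the filtering work parallelizes across the cluster and only the exceptional rows reach the Master.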
Because the metadata table can itself be split (have multiple tablets), both reads and writes can be handled without bottlenecking on a single server. This helps Accumulo scale beyond millions of tablets. Operations such as assignment can become limited by the speed at which an update to the metadata table can be made; however, splitting the metadata table is a worthwhile optimization to pursue as necessary, since it would likely also improve the normal user write path.

Proactive messages from TabletServer to Master: As mentioned earlier, the Master sends one-way (void) messages to TabletServers to avoid blocking RPCs. While the Master will eventually see all changes when it next reads the metadata table, the TabletServer will also send a message back to the Master describing the action it just took. For example, after a Tablet is brought online, the TabletServer will send a message to the Master informing it that this happened. This update can wake the Master from a sleeping state so it responds more quickly to changes in the system, but if these messages are delayed or dropped it is not a concern, since we know we have the durability in the metadata table.
