BookKeeper makes heavy use of the Google protobuf and Guava libraries. If your application might include different versions of protobuf or Guava introduced by other dependencies, you can choose to use the
shaded library, which relocates the protobuf and Guava classes into a different namespace to avoid conflicts.

Provide a host and port for one node in your ZooKeeper cluster, for example zk1:2181. In general, it’s better to provide a full connection string (in case the ZooKeeper node you attempt to connect to is down).

If your ZooKeeper cluster can be discovered via DNS, you can provide the DNS name, for example my-zookeeper-cluster.com.

Creating a new client

In order to create a new BookKeeper client object, you need to pass in a connection string. Here is an example client object using a ZooKeeper connection string:

try {
    String connectionString = "127.0.0.1:2181"; // For a single-node, local ZooKeeper cluster
    BookKeeper bkClient = new BookKeeper(connectionString);
} catch (InterruptedException | IOException | KeeperException e) {
    e.printStackTrace();
}

If you’re running BookKeeper locally, using the localbookie command, use "127.0.0.1:2181" for your connection string, as in the example above.

Creating ledgers

The easiest way to create a ledger using the Java client is via the createLedger method, which creates a new ledger synchronously and returns a LedgerHandle. You must specify at least a DigestType and a password.

Reading entries after the LastAddConfirmed range

readUnconfirmedEntries allows reading beyond the LastAddConfirmed range.
It lets the client read without checking the local value of LastAddConfirmed, so it is possible to read entries for which the writer has not yet received the acknowledgement.
For entries within the range 0..LastAddConfirmed, BookKeeper guarantees that the writer has successfully received the acknowledgement.
For entries outside that range it is possible that the writer never received the acknowledgement, so there is a risk that the reader sees entries before the writer does, which could result in a consistency issue in some cases.
With this method you can read entries both before and after LastAddConfirmed in a single call; the expected consistency is as described above.
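As a sketch (this uses the synchronous readUnconfirmedEntries method on LedgerHandle; the entry IDs and variable names are illustrative), a read past LastAddConfirmed might look like:

```java
// lh is an open LedgerHandle; lastEntryId is an entry ID the reader
// believes may exist, possibly beyond LastAddConfirmed.
long lastEntryId = ...;
long lac = lh.getLastAddConfirmed();

// readUnconfirmedEntries does not check the local LastAddConfirmed value,
// so it may return entries the writer has not yet seen acknowledged.
Enumeration<LedgerEntry> entries = lh.readUnconfirmedEntries(0, lastEntryId);
while (entries.hasMoreElements()) {
    LedgerEntry entry = entries.nextElement();
    if (entry.getEntryId() <= lac) {
        // within 0..LastAddConfirmed: the writer has received the acknowledgement
    } else {
        // beyond LastAddConfirmed: the writer may never have received it
    }
}
```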

// Create a client object for the local ensemble. This
// operation throws multiple exceptions, so make sure to
// use a try/catch block when instantiating client objects.
BookKeeper bkc = new BookKeeper("localhost:2181");

// A password for the new ledger
byte[] ledgerPassword = /* some sequence of bytes, perhaps random */;

// Create a new ledger and fetch its identifier
LedgerHandle lh = bkc.createLedger(BookKeeper.DigestType.MAC, ledgerPassword);
long ledgerId = lh.getId();

// Create a buffer for four-byte entries
ByteBuffer entry = ByteBuffer.allocate(4);

int numberOfEntries = 100;

// Add entries to the ledger, then close it
for (int i = 0; i < numberOfEntries; i++) {
    entry.putInt(i);
    entry.position(0);
    lh.addEntry(entry.array());
}
lh.close();

// Open the ledger for reading
lh = bkc.openLedger(ledgerId, BookKeeper.DigestType.MAC, ledgerPassword);

// Read all available entries
Enumeration<LedgerEntry> entries = lh.readEntries(0, numberOfEntries - 1);

while (entries.hasMoreElements()) {
    ByteBuffer result = ByteBuffer.wrap(entries.nextElement().getEntry());
    Integer retrEntry = result.getInt();

    // Print the integer stored in each entry
    System.out.println(String.format("Result: %s", retrEntry));
}

// Close the ledger and the client
lh.close();
bkc.close();

Running this should return this output:

Result: 0
Result: 1
Result: 2
# etc

Example application

This tutorial walks you through building an example application that uses BookKeeper as the replicated log. The application uses the BookKeeper Java client to interact with BookKeeper.

The code for this tutorial can be found in this GitHub repo. The final code for the Dice class can be found here.

Setup

Before you start, you will need to have a BookKeeper cluster running locally on your machine. For installation instructions, see Installation.

To start up a cluster consisting of six bookies locally:

$ bookkeeper-server/bin/bookkeeper localbookie 6

You can specify a different number of bookies if you’d like.

Goal

The goal of the dice application is to have

multiple instances of this application,

possibly running on different machines,

all of which display the exact same sequence of numbers.

In other words, the log needs to be both durable and consistent, regardless of how many bookies are participating in the BookKeeper ensemble. If one of the bookies crashes or becomes unable to communicate with the other bookies in any way, it should still display the same sequence of numbers as the others. This tutorial will show you how to achieve this.

The base application

The application in this tutorial is a dice application. The Dice class below has a playDice function that generates a random number between 1 and 6 every second, prints the value of the dice roll, and runs indefinitely.
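The full class lives in the linked repo; as a minimal sketch (method and field names assumed, matching the description above), the base application might look like:

```java
import java.util.Random;

public class Dice {

    private final Random r = new Random();

    // One roll of the die: a value between 1 and 6 inclusive
    int rollOnce() {
        return r.nextInt(6) + 1;
    }

    // Print a new roll every second, indefinitely
    void playDice() throws InterruptedException {
        while (true) {
            Thread.sleep(1000);
            System.out.println("Value = " + rollOnce());
        }
    }
}
```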

When you run the main function of this class, a new Dice object will be instantiated and then run indefinitely:

public class Dice {
    // other methods

    public static void main(String[] args) throws InterruptedException {
        Dice d = new Dice();
        d.playDice();
    }
}

Leaders and followers (and a bit of background)

To achieve this common view in multiple instances of the program, we need each instance to agree on what the next number in the sequence will be. For example, the instances must agree that 4 is the first number and 2 is the second number and 5 is the third number and so on. This is a difficult problem, especially in the case that any instance may go away at any time, and messages between the instances can be lost or reordered.

Luckily, there are already algorithms to solve this. Paxos is an abstract algorithm to implement this kind of agreement, while Zab and Raft are more practical protocols. This video gives a good overview of how these algorithms usually look. They all have a similar core.

It would be possible to run Paxos to agree on each number in the sequence. However, running Paxos each time can be expensive. What Zab and Raft do instead is use a Paxos-like algorithm to elect a leader. The leader then decides what the sequence of events should be, putting them in a log, which the other instances can then follow to maintain the same state as the leader.

Bookkeeper provides the functionality for the second part of the protocol, allowing a leader to write events to a log and have multiple followers tailing the log. However, bookkeeper does not do leader election. You will need a zookeeper or raft instance for that purpose.

Why not just use ZooKeeper?

There are a number of reasons:

Zookeeper’s log is only exposed through a tree-like interface. It can be hard to shoehorn your application into this.

A zookeeper ensemble of multiple machines is limited to one log. You may want one log per resource, which will become expensive very quickly.

Adding extra machines to a zookeeper ensemble does not increase capacity or throughput.

Bookkeeper can be seen as a means of exposing ZooKeeper’s replicated log to applications in a scalable fashion. ZooKeeper is still used by BookKeeper, however, to maintain consistency guarantees, though clients don’t need to interact with ZooKeeper directly.

Electing a leader

We’ll use zookeeper to elect a leader. A zookeeper instance will have started locally when you started the localbookie application above. To verify it’s running, run the following command.

To interact with zookeeper, we’ll use the Curator client rather than the stock zookeeper client. Getting things right with the zookeeper client can be tricky, and curator removes a lot of the pointy corners for you. In fact, curator even provides a leader election recipe, so we need to do very little work to get leader election in our application.

In the constructor for Dice, we need to create the curator client. We specify four things when creating the client, the location of the zookeeper service, the session timeout, the connect timeout and the retry policy.

The session timeout is a zookeeper concept. If the zookeeper server doesn’t hear anything from the client for this amount of time, any leases which the client holds will be timed out. This is important in leader election. For leader election, the curator client will take a lease on ELECTION_PATH. The first instance to take the lease will become leader and the rest will become followers. However, their claims on the lease will remain in the queue. If the first instance then goes away, due to a crash etc., its session will time out. Once the session times out, the lease will be released and the next instance in the queue will become the leader.

The call to autoRequeue() will make the client queue itself again if it loses the lease for some other reason, such as if it was still alive but a garbage collection pause caused it to lose its session, and thereby its lease. I’ve set the session timeout quite low so that when we test out leader election, transitions will be quick. The optimal session timeout length depends very much on the use case.

The other parameters are the connection timeout, i.e. the amount of time the client will spend trying to connect to a zookeeper server before giving up, and the retry policy. The retry policy specifies how the client should respond to transient errors, such as connection loss. Operations that fail with transient errors can be retried, and this argument specifies how often the retries should occur.
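The constructor might look something like this sketch (the timeout values, retry policy, and field names are illustrative; the actual tutorial code is in the linked repo):

```java
// A sketch of the Dice constructor, using Curator's CuratorFrameworkFactory
// and the LeaderSelector recipe; paths and timeouts are illustrative.
curator = CuratorFrameworkFactory.newClient(
        "127.0.0.1:2181",       // location of the zookeeper service
        2000,                   // session timeout (low, so leader transitions are quick)
        10000,                  // connect timeout
        new ExponentialBackoffRetry(1000, 3)); // retry policy for transient errors
curator.start();

leaderSelector = new LeaderSelector(curator, ELECTION_PATH, this);
leaderSelector.autoRequeue();   // requeue if we lose the lease for some other reason
leaderSelector.start();
```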

Finally, you’ll have noticed that Dice now extends LeaderSelectorListenerAdapter and implements Closeable. Closeable is there to close the resources we have initialized in the constructor: the curator client and the leaderSelector. LeaderSelectorListenerAdapter is a callback that the leaderSelector uses to notify the instance that it is now the leader. It is passed as the third argument to the LeaderSelector constructor.

takeLeadership() is the callback called by LeaderSelector when the instance is leader. It should only return when the instance wants to give up leadership. In our case, we never do, so we wait on the current object until we’re interrupted. To signal to the rest of the program that we are leader, we set a volatile boolean called leader to true. This is unset after we are interrupted.
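A sketch of that callback (the volatile field is as described above; the exact code is in the linked repo):

```java
volatile boolean leader = false;

@Override
public void takeLeadership(CuratorFramework client) throws Exception {
    synchronized (this) {
        leader = true;          // signal the rest of the program that we are leader
        try {
            while (true) {
                wait();         // hold leadership until we are interrupted
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            leader = false;     // give up leadership after being interrupted
        }
    }
}
```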

Finally, we modify the playDice function to only generate random numbers when it is the leader.
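That change might look like this sketch (assuming the volatile leader flag and a Random instance named r):

```java
void playDice() throws InterruptedException {
    while (true) {
        Thread.sleep(1000);
        if (leader) {
            // only the leader generates and prints numbers
            System.out.println("Value = " + (r.nextInt(6) + 1));
        }
    }
}
```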

Run two instances of the program in two different terminals. You’ll see that one becomes leader and prints numbers and the other just sits there.

Now stop the leader using Control-Z. This will pause the process, but it won’t kill it. You will be dropped back to the shell in that terminal. After a couple of seconds (the session timeout), you will see that the other instance has become the leader. Zookeeper will guarantee that only one instance is selected as leader at any time.

Now go back to the shell that the original leader was on and wake up the process using fg. You’ll see something like the following:

Create ledgers

The easiest way to create a ledger using the Java client is via the createBuilder. You must specify at least
a DigestType and a password.

Here’s an example:

BookKeeper bk = ...;

byte[] password = "some-password".getBytes();

WriteHandle wh = bk.newCreateLedgerOp()
    .withDigestType(DigestType.CRC32)
    .withPassword(password)
    .withEnsembleSize(3)
    .withWriteQuorumSize(3)
    .withAckQuorumSize(2)
    .execute()          // execute the creation op
    .get();             // wait for the execution to complete

A WriteHandle is returned for applications to write and read entries to and from the ledger.

Write flags

You can specify the behaviour of the writer by setting WriteFlags at ledger creation time.
These flags are applied only during write operations and are not recorded in metadata.

Available write flags:

Flag: DEFERRED_SYNC
Explanation: Writes are acknowledged early, without waiting for guarantees of durability.
Notes: Data will only be written to the OS page cache, without forcing an fsync.

BookKeeper bk = ...;

byte[] password = "some-password".getBytes();

WriteHandle wh = bk.newCreateLedgerOp()
    .withDigestType(DigestType.CRC32)
    .withPassword(password)
    .withEnsembleSize(3)
    .withWriteQuorumSize(3)
    .withAckQuorumSize(2)
    .withWriteFlags(DEFERRED_SYNC)
    .execute()          // execute the creation op
    .get();             // wait for the execution to complete

Append entries to ledgers

The WriteHandle can be used for applications to append entries to the ledgers.

WriteHandle wh = ...;

CompletableFuture<Long> addFuture = wh.append("Some entry data".getBytes());

// option 1: you can wait for the add to complete synchronously
try {
    long entryId = FutureUtils.result(addFuture);
} catch (BKException bke) {
    // error handling
}

// option 2: you can process the result and exception asynchronously
addFuture
    .thenApply(entryId -> {
        // process the result
    })
    .exceptionally(cause -> {
        // handle the exception
    });

// option 3: bookkeeper provides a twitter-future-like event listener for
// processing the result and exception asynchronously
addFuture.whenComplete(new FutureEventListener<Long>() {
    @Override
    public void onSuccess(Long entryId) {
        // process the result
    }
    @Override
    public void onFailure(Throwable cause) {
        // handle the exception
    }
});

The append method supports three representations of a byte array: the native Java byte[], Java NIO ByteBuffer, and Netty ByteBuf.
It is recommended to use ByteBuf as it is more GC-friendly.
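For instance, appending a Netty ByteBuf might look like this sketch (assuming Netty’s Unpooled helper for wrapping the bytes):

```java
ByteBuf buf = Unpooled.wrappedBuffer("Some entry data".getBytes(StandardCharsets.UTF_8));
CompletableFuture<Long> addFuture = wh.appendAsync(buf);
```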

Open ledgers

You can open ledgers to read entries. Opening ledgers is done via the openBuilder. You must specify the ledgerId and the password
in order to open a ledger.

Here’s an example:

BookKeeper bk = ...;

long ledgerId = ...;
byte[] password = "some-password".getBytes();

ReadHandle rh = bk.newOpenLedgerOp()
    .withLedgerId(ledgerId)
    .withPassword(password)
    .execute()          // execute the open op
    .get();             // wait for the execution to complete

A ReadHandle is returned for applications to read entries from the ledger.

Recovery vs NoRecovery

By default, the openBuilder opens the ledger in a NoRecovery mode. You can open the ledger in Recovery mode by specifying
withRecovery(true) in the open builder.
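Following the same openBuilder pattern, a recovery open might look like (a sketch; ledgerId and password are assumed to be defined as before):

```java
ReadHandle rh = bk.newOpenLedgerOp()
    .withLedgerId(ledgerId)
    .withPassword(password)
    .withRecovery(true) // fence and seal the ledger before reading
    .execute()
    .get();
```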

If you open a ledger in “Recovery” mode, it will fence and seal the ledger – no more entries are allowed
to be appended to it. A writer that is currently appending entries to the ledger will fail with LedgerFencedException.

In contrast, opening a ledger in “NoRecovery” mode does not fence and seal the ledger. “NoRecovery” mode is usually used by applications to tail-read from a ledger.

Read entries from ledgers

The ReadHandle returned from the open builder can be used for applications to read entries from the ledgers.

ReadHandle rh = ...;

long startEntryId = ...;
long endEntryId = ...;

CompletableFuture<LedgerEntries> readFuture = rh.read(startEntryId, endEntryId);

// option 1: you can wait for the read to complete synchronously
try {
    LedgerEntries entries = FutureUtils.result(readFuture);
} catch (BKException bke) {
    // error handling
}

// option 2: you can process the result and exception asynchronously
readFuture
    .thenApply(entries -> {
        // process the result
    })
    .exceptionally(cause -> {
        // handle the exception
    });

// option 3: bookkeeper provides a twitter-future-like event listener for
// processing the result and exception asynchronously
readFuture.whenComplete(new FutureEventListener<LedgerEntries>() {
    @Override
    public void onSuccess(LedgerEntries entries) {
        // process the result
    }
    @Override
    public void onFailure(Throwable cause) {
        // handle the exception
    }
});

Once you are done with processing the LedgerEntries, you can call #close() on the LedgerEntries instance to
release the buffers held by it.
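Since LedgerEntries is AutoCloseable, one way to ensure the buffers are released is try-with-resources (a sketch; the entry ID variables are illustrative):

```java
try (LedgerEntries entries = rh.read(startEntryId, endEntryId).get()) {
    for (LedgerEntry entry : entries) {
        // process the entry; its buffers are released when entries is closed
    }
}
```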

Tailing Reads

There are two methods for applications to achieve tailing reads: Polling and Long-Polling.

Polling

You can do this in a synchronous way:

ReadHandle rh = ...;

long startEntryId = 0L;
long nextEntryId = startEntryId;
int numEntriesPerBatch = 4;

while (!rh.isClosed() || nextEntryId <= rh.getLastAddConfirmed()) {
    long lac = rh.getLastAddConfirmed();
    if (nextEntryId > lac) {
        // no more entries are added
        Thread.sleep(1000);
        lac = rh.readLastAddConfirmed().get();
        continue;
    }

    long endEntryId = Math.min(lac, nextEntryId + numEntriesPerBatch - 1);
    LedgerEntries entries = rh.read(nextEntryId, endEntryId).get();

    // process the entries

    nextEntryId = endEntryId + 1;
}

Long Polling

ReadHandle rh = ...;

long startEntryId = 0L;
long nextEntryId = startEntryId;
int numEntriesPerBatch = 4;

while (!rh.isClosed() || nextEntryId <= rh.getLastAddConfirmed()) {
    long lac = rh.getLastAddConfirmed();
    if (nextEntryId > lac) {
        // no more entries are added
        try (LastConfirmedAndEntry lacAndEntry =
                 rh.readLastAddConfirmedAndEntry(nextEntryId, 1000, false).get()) {
            if (lacAndEntry.hasEntry()) {
                // process the entry
                ++nextEntryId;
            }
        }
    } else {
        long endEntryId = Math.min(lac, nextEntryId + numEntriesPerBatch - 1);
        LedgerEntries entries = rh.read(nextEntryId, endEntryId).get();

        // process the entries

        nextEntryId = endEntryId + 1;
    }
}

Delete ledgers
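Ledgers are deleted via the deleteBuilder; a sketch (assuming the newDeleteLedgerOp API, following the same builder pattern as create and open):

```java
BookKeeper bk = ...;
long ledgerId = ...;

bk.newDeleteLedgerOp()
    .withLedgerId(ledgerId)
    .execute()          // execute the delete op
    .get();             // wait for the execution to complete
```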

Relaxing Durability

In BookKeeper, by default, each write is acknowledged to the client if and only if it has been persisted durably (fsync called on the file system) by a quorum of bookies.
In this case the LastAddConfirmed pointer is updated on the writer side; this is the guarantee for the writer that data will not be lost and will
always be readable by other clients.

On the client side you can temporarily relax this constraint by using the DEFERRED_SYNC write flag. With this flag, bookies will acknowledge each entry after
writing it to OS buffers, without waiting for an fsync.
In this case the LastAddConfirmed pointer is neither advanced on the writer side nor updated on the reader’s side, because there is some chance of losing the entry.
Such entries are still readable using the readUnconfirmed() API, but they won’t be readable using long poll reads or the regular read() API.

In order to get durability guarantees, the writer must explicitly use the force() API, which returns only after all the bookies in the ensemble acknowledge the call after
performing an fsync to the disk that is storing the journal.
This way the LastAddConfirmed pointer is advanced on the writer side and will eventually be available to readers.

The close() operation on the writer writes the current LastAddConfirmed pointer to the ledger’s metadata; it is up to the application to call force() before issuing the close command.
If you never explicitly call force(), LastAddConfirmed will remain unset (-1) in the ledger metadata and regular readers won’t be able to access the data.

BookKeeper bk = ...;

byte[] password = "some-password".getBytes();

WriteHandle wh = bk.newCreateLedgerOp()
    .withDigestType(DigestType.CRC32)
    .withPassword(password)
    .withEnsembleSize(3)
    .withWriteQuorumSize(3)
    .withAckQuorumSize(2)
    .withWriteFlags(DEFERRED_SYNC)
    .execute()          // execute the creation op
    .get();             // wait for the execution to complete

wh.force().get();       // wait for fsync, make data available to readers and to the replicator
wh.close();             // seal the ledger