Getting Started

In the Quick Start guide, we got a reference implementation of Hollow up and running, with a mock data model that can be easily modified to suit any use case. After reading this section, you'll understand the basic usage patterns for Hollow and how the core pieces fit together.

Core Concepts

Hollow manages datasets which are built by a single producer, and disseminated to one or many consumers for read-only access. A dataset changes over time. The timeline for a changing dataset can be broken down into discrete data states, each of which is a complete snapshot of the data at a particular point in time.

The producer runs a single cycle to produce a data state. Once the cycle completes, you should have a snapshot blob file on your local disk.
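A minimal sketch of such a producer, assuming a simple `Movie` POJO; the field names, the two sample records, and the local publish directory below are all illustrative:

```java
import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ProducerExample {

    // Illustrative data model: a flat record type.
    static class Movie {
        int id;
        String title;
        int releaseYear;

        Movie(int id, String title, int releaseYear) {
            this.id = id;
            this.title = title;
            this.releaseYear = releaseYear;
        }
    }

    public static void main(String[] args) throws Exception {
        Path publishDir = Paths.get("/tmp/hollow-getting-started");  // hypothetical local directory
        Files.createDirectories(publishDir);

        HollowProducer producer = HollowProducer
                .withPublisher(new HollowFilesystemPublisher(publishDir))
                .withAnnouncer(new HollowFilesystemAnnouncer(publishDir))
                .build();

        List<Movie> movies = Arrays.asList(
                new Movie(1, "The Matrix", 1999),
                new Movie(2, "Beasts of No Nation", 2015));

        // One cycle produces one data state; the first cycle publishes a snapshot blob.
        producer.runCycle(state -> {
            for (Movie movie : movies)
                state.add(movie);
        });
    }
}
```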

Publishing Blobs

Note that the example code above is writing data to local disk. This is a great way to start testing. In a production scenario, data can be written to a remote file store such as Amazon S3 for retrieval by consumers. See the reference implementation and the quick start guide for a scalable example using AWS.

Consumer API Generation

Once the data has been populated into a producer, that producer's state engine is aware of the data model, and can be used to automatically produce a client API. We can also initialize the data model from a brand new state engine using our POJOs:
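A sketch of that generation step; the API class name and package below are placeholders chosen for this example:

```java
import com.netflix.hollow.api.codegen.HollowAPIGenerator;
import com.netflix.hollow.core.write.HollowWriteStateEngine;
import com.netflix.hollow.core.write.objectmapper.HollowObjectMapper;

import java.io.File;

public class GenerateApiExample {
    public static void main(String[] args) throws Exception {
        // A brand new state engine: we only need the schemas, not any data.
        HollowWriteStateEngine writeEngine = new HollowWriteStateEngine();
        HollowObjectMapper mapper = new HollowObjectMapper(writeEngine);
        mapper.initializeTypeState(Movie.class);  // derives schemas from the POJO

        HollowAPIGenerator generator = new HollowAPIGenerator.Builder()
                .withAPIClassname("MovieAPI")                      // placeholder API name
                .withPackageName("com.example.hollow.generated")   // placeholder package
                .withDataModel(writeEngine)
                .build();

        generator.generateSourceFiles(new File("/path/to/java/api/files"));
    }
}
```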

After this code executes, a set of Java files will be written to the location /path/to/java/api/files. These Java files form a generated API based on the data model defined by the schemas in our state engine, and provide convenient methods to access that data.

Initializing multiple types

If we have multiple top-level types, we should call initializeTypeState() multiple times, once for each class.
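For instance, assuming a HollowObjectMapper wrapping our write state engine, and a hypothetical second top-level type `Show`:

```java
HollowObjectMapper mapper = new HollowObjectMapper(writeEngine);
mapper.initializeTypeState(Movie.class);
mapper.initializeTypeState(Show.class);  // hypothetical second top-level type
```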

Consuming a Data Snapshot

A data consumer can load a snapshot created by the producer into memory:
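A sketch, assuming the generated API class is named `MovieAPI` and the producer published blobs to a local directory; the accessor names follow the generated API's conventions (String fields are read via a wrapper's `getValue()`):

```java
import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemAnnouncementWatcher;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemBlobRetriever;

import java.nio.file.Path;
import java.nio.file.Paths;

public class ConsumerExample {
    public static void main(String[] args) {
        Path publishDir = Paths.get("/tmp/hollow-getting-started");  // hypothetical: where the producer published

        HollowConsumer consumer = HollowConsumer
                .withBlobRetriever(new HollowFilesystemBlobRetriever(publishDir))
                .withAnnouncementWatcher(new HollowFilesystemAnnouncementWatcher(publishDir))
                .withGeneratedAPIClass(MovieAPI.class)
                .build();

        consumer.triggerRefresh();  // loads the latest announced snapshot into memory

        MovieAPI api = (MovieAPI) consumer.getAPI();
        for (Movie movie : api.getAllMovie())
            System.out.println(movie.getId() + ", " + movie.getTitle().getValue());
    }
}
```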

In order to integrate with your infrastructure, you only need to provide Hollow with four implementations of simple interfaces:

The HollowProducer needs a Publisher and Announcer

The HollowConsumer needs a BlobRetriever and AnnouncementWatcher

Your BlobRetriever and AnnouncementWatcher implementations should mirror your Publisher and Announcer implementations. Here, we're publishing and retrieving from local disk. In production, we'll be publishing to and retrieving from a remote file store. We'll discuss in more detail how to integrate with your specific infrastructure in Infrastructure Integration.

Producing a Delta

Some time has passed and the dataset has evolved. It now contains these records:

```java
List<Movie> movies = Arrays.asList(
        new Movie(1, "The Matrix", 1999),
        new Movie(2, "Beasts of No Nation", 2015),
        new Movie(4, "Goodfellas", 1990),
        new Movie(5, "Inception", 2010));
```

The producer needs to communicate this updated dataset to consumers. We're going to create a brand-new state, and the entirety of the data for the new state must be added to the state engine in a new cycle. When the cycle runs, a new data state will be published, and the new data state's (automatically generated) version identifier will be announced.

Using the same HollowProducer in memory, we can use the following code:

```java
producer.runCycle(state -> {
    for (Movie movie : movies)
        state.add(movie);
});
```

Let's take a closer look at what the above code does. The same HollowProducer which was used to produce the snapshot blob is used -- it already knows everything about the prior state and can be transitioned to the next state. When creating a new state, all of the movies currently in our dataset are re-added. It's not necessary to figure out which records were added, removed, or modified -- that's Hollow's job.

Each time we call runCycle we will be producing a data state. For each state after the first, the HollowProducer will publish three artifacts: a snapshot, a delta, and a reverse delta. Encoded into the delta is a set of instructions to update a consumer’s data store from the previous state to the current state. Inversely, encoded into each reverse delta is a set of instructions to update a consumer in reverse -- from the current state to the previous state. Consumers may use the reverse delta later if we need to pin them back to a prior state.

When consumers initialize, they will use the most recent snapshot to initialize their data store. After initialization, consumers will keep up to date using deltas.

Producer Cycles

We call what the producer does to create a data state a cycle. During each cycle, you’ll want to add every record from your source of truth. Hollow will handle the details of publishing a delta for all of your established consumer instances, and a snapshot to initialize any consumer instances which start up before your next cycle.

Consuming a Delta

No manual intervention is necessary to consume the delta you produced. The HollowConsumer will automatically stay up-to-date.

Announcements keep consumers updated

When the producer runs a cycle, it announces the latest version. The AnnouncementWatcher implementation provided to the HollowConsumer listens for changes to the announced version and, when updates occur, notifies the HollowConsumer by calling triggerAsyncRefresh(). See the source of the HollowFilesystemAnnouncementWatcher, or the two separate examples in the reference implementation.

After this delta has been applied, the consumer is at the new state. If the generated API is used to iterate over the movies again as shown in the prior consumer example, the new output will be:
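Given the updated dataset above, and assuming the consumer prints each movie's id and title as in the prior consumer example, the output would be:

```
1, The Matrix
2, Beasts of No Nation
4, Goodfellas
5, Inception
```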

It is safe to use Hollow to retrieve data while a delta transition is in progress.

Adjacent States

We refer to states which are directly connected via single delta transitions as adjacent states, and a continuous set of adjacent states as a delta chain.

Indexing Data for Retrieval

In prior examples the generated Hollow API was used by the data consumer to iterate over all Movie records in the dataset. Most often, however, it isn’t desirable to iterate over the entire dataset — instead, specific records will be accessed based on some known key. Let’s assume that the Movie’s id is a known key.

After a HollowConsumer has been initialized, any type can be indexed. For example, we can index Movie records by id:
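One way to sketch this is with the general-purpose HollowPrimaryKeyIndex from Hollow's core (the generated API also offers typed index classes); the `api` and `consumer` variables are assumed from the earlier consumer example:

```java
import com.netflix.hollow.core.index.HollowPrimaryKeyIndex;

// Index Movie records by their "id" field.
HollowPrimaryKeyIndex idx = new HollowPrimaryKeyIndex(consumer.getStateEngine(), "Movie", "id");
idx.listenToDataRefresh();  // keep the index current as deltas are applied

int ordinal = idx.getMatchingOrdinal(2);  // look up the Movie with id == 2
if (ordinal != -1) {                      // -1 means no matching record
    Movie movie = api.getMovie(ordinal);
    System.out.println(movie.getTitle().getValue());
}
```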

In our generated API, each type in our data model has a generated index class. We can index by any field, or multiple fields.

Reuse Indexes

Retrieval from an index is extremely cheap, and indexing is (relatively) expensive. You should create your indexes when the HollowConsumer is initialized and share them thereafter. Indexes will automatically stay up-to-date with the HollowConsumer.

Thread Safety

Retrievals from Hollow indexes are thread-safe: they may be used concurrently from multiple threads, and it is safe to query while a delta transition is in progress.

We've just begun to scratch the surface of what indexes can do. See Indexing/Querying for an in-depth exploration of this topic.

Hierarchical Data Models

Our data models can be much richer than in the prior example. Assume an updated Movie class:
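A sketch of what such a class might look like; the field names here are illustrative:

```java
import java.util.List;

public class Movie {
    int id;
    String title;
    int releaseYear;
    List<Actor> actors;   // referenced records are traversed and added automatically
}

public class Actor {
    String actorName;     // illustrative field name
}
```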

When we add these movies to the dataset, Hollow will traverse everything referenced by the provided records and add them to the state as well. Consequently, both a type Movie and a type Actor will exist in the data model after the above code runs.

Deduplication

Laurence Fishburne starred in both of these films. Rather than creating two Actor records for Mr. Fishburne, a single record will be created and assigned to both of our Movie records. This deduplication happens automatically by virtue of having the exact same data contained in both Actor inputs.

Consumers of this dataset may want to also create an index for Actor records. For example:
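A sketch using the general-purpose HollowPrimaryKeyIndex, keyed on the actor's name; the `actorName` field and the `api` and `consumer` variables are assumed from the earlier examples:

```java
import com.netflix.hollow.core.index.HollowPrimaryKeyIndex;

HollowPrimaryKeyIndex actorIdx =
        new HollowPrimaryKeyIndex(consumer.getStateEngine(), "Actor", "actorName");
actorIdx.listenToDataRefresh();

int ordinal = actorIdx.getMatchingOrdinal("Laurence Fishburne");
if (ordinal != -1) {
    Actor actor = api.getActor(ordinal);  // a single deduplicated Actor record
}
```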

Restoring at Startup

From time to time, we need to redeploy our producer. When we first create a HollowProducer and run a cycle it will not be able to produce a delta, because it does not know anything about the prior data state. If no action is taken, a new state with only a snapshot will be produced and announced, and clients will load that data state with an operation called a double snapshot, which has potentially undesirable performance characteristics.

We can remedy this situation by restoring our newly created producer with the last announced data state. For example:
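A sketch, assuming the last announced version can be discovered from your AnnouncementWatcher implementation:

```java
// Initialize the data model with every class we will add during the cycle.
producer.initializeDataModel(Movie.class);

// Hypothetical: however your infrastructure exposes the last announced version.
long latestVersion = announcementWatcher.getLatestVersion();

// Load the prior state via the BlobRetriever and restore the producer from it.
producer.restore(latestVersion, blobRetriever);

// The next cycle can now produce a delta from the restored state.
producer.runCycle(state -> {
    for (Movie movie : movies)
        state.add(movie);
});
```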

In the above code, we first initialize the data model by providing the set of classes we will add during the cycle. After that, we restore by providing our BlobRetriever implementation, along with the version which should be restored. The HollowProducer will use the BlobRetriever to load the desired state, then use it to restore itself. In this way, a delta can be produced at startup, and consumers will not have to load a snapshot to get up-to-date.

Initializing the data model

Before restoring, we must always initialize our data model. When a data model changes between deployments, Hollow will automatically merge records of types which have changed. In order to do this correctly, Hollow needs to know about the current data model before the restore operation begins.