One core advantage of using OSGi in the development process is that OSGi makes it possible to enforce information hiding between modules in a large Java application.

Unfortunately, most open source projects do not provide versions of their libraries that are ready-made for an OSGi environment. There is some great tooling available to create bundles from vanilla Java JARs, my favorite being the Maven Bundle Plugin, which wraps the BND tool.

However, it can still be an annoyance to create your own OSGi-enabled Maven artifacts for every dependency you want to use in your OSGi project.

This is my motivation to publish OSGi wrappers for all the third party dependencies I come across in my projects.

Basically, these wrappers provide nothing more than a simple pom, which is pre-configured to generate valid OSGi declarations for the wrapped dependency.

If the project is easy to build and/or there is no single artifact available that could be wrapped, the wrapper might also contain the source code of the third party dependency. Therefore, please make sure that you use all wrappers in accordance with the source code license of the wrapped artifact.

Currently, the wrapper artifacts provide the following key features:

You can use the artifacts from the provided repository to link the third party projects as OSGi bundles/Maven artifacts (see included libraries).

An important theme in my implementation of onedb was to write Java code that can be used both in a JVM environment and in a GWT/JavaScript environment. Unfortunately, even though the Java Concurrency API and GWT's concurrency features often provide similar functionality, their APIs are incompatible.

This motivated me to write a simple abstract API (oneUtils), into which implementations for either a JVM environment or a GWT environment can be injected. Code that uses this abstract API can thereby be shared between JVM and GWT apps.
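A minimal sketch of this pattern might look as follows (the interface and class names are illustrative; the actual oneUtils API may differ):

public interface Concurrency {
    void runLater(Runnable task);
}

// JVM implementation, injected when the code runs on the JDK
class JreConcurrency implements Concurrency {
    public void runLater(Runnable task) {
        new Thread(task).start();
    }
}

// shared logic depends only on the abstract API and can therefore be
// compiled both for the JVM and with the GWT Java-to-JavaScript compiler
class SharedLogic {
    private final Concurrency concurrency;

    public SharedLogic(Concurrency concurrency) {
        this.concurrency = concurrency;
    }

    public void scheduleWork() {
        concurrency.runLater(new Runnable() {
            public void run() {
                // work that runs unchanged on both platforms
            }
        });
    }
}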

Github is an awesome repository to share your source code. Although there are numerous tutorials discussing how to use git and eclipse, I got stuck again today while trying to upload an existing eclipse project to github. This tutorial thus walks through all the steps from signing up for github to uploading an eclipse project to the site!

Please note that the focus of this tutorial is the mere upload of source code and not any of the more sophisticated features git and github offer.

Step 1: Sign Up for github

Step 2: Installing EGit

You will need to install the git plugin for eclipse, EGit, in order to upload code from eclipse projects.

In eclipse, go to Help / Install New Software

Open the eclipse Indigo repository

Select Collaboration / Eclipse EGit

Click [Next]

JGit should have been selected for you automatically

Click [Next] and confirm the licence agreement

Restart eclipse and the EGit plugin should be installed

Step 3: Create a DSA Key in Eclipse

In order to be able to upload source code to github, you need to define a secure key, which must be known both to your local eclipse installation and to the github service. Luckily, eclipse provides all the tooling necessary to generate the appropriate key.

Open Menu Window / Preferences

In the preferences, go to General / Network Connections / SSH2

On the SSH2 page, open the tab ‘Key Management’

Click on [Generate DSA Key …]

At the bottom of the page, enter a passphrase of your choosing

Click [Save Private Key …] (what’s going on with these three dots in the button captions … strange)

Save the key at a location of your choosing (best in the location specified as your SSH2 home under the tab General)

Step 4: Register DSA Key with github

Open the file you have saved in the previous step with a text editor (e.g. Notepad on Windows)

Step 5: Create a Repository on github

Choose a repository name and description to your liking and click [Create Repository]

Step 6: Import github Repository into eclipse

Before you can link an existing eclipse project to a github repository, you must import the repository you have created on github first. For this:

In eclipse, open Menu / Window / Show View / Other …

Select Git / Git Repositories and click [Ok]

You might see a warning message such as the one shown below (even setting the environment variable did not help me get rid of the message, but everything seems to be working okay) – you can confirm with [Ok]

Under your repository description, you can get the URI for your project. Copy the text starting with ‘git@’ (make sure that SSH is selected)

Go back to eclipse. You can paste the URI you have just copied into the field ‘URI’

Further, select ‘ssh’ as the Protocol

Click [Finish]

If asked to select a branch, select the ‘master’ branch

Git (in contrast to subversion) stores a full-blown repository on your local machine rather than just a local copy of the files. This requires storing all sources you want to synchronize with git at least twice on your local machine: one copy lives in the clone of the remote git repository and another in your eclipse project.

Hence, when you clone the git repository from github, you should define a repository location which lies outside the eclipse project you want to upload:

Select one such location and click [Finish]

Now you should have one ‘Git Repository’

Step 7: Link Eclipse Project with github Repository

After you have created a local clone of the repository from github, you can link the eclipse project you would like to upload to this local repository.

Right click your eclipse project and select Team / Share Project …

Select ‘Git’ as Repository Type

Select under ‘Repository’ the repository you have cloned in the previous step and click [Finish]

Step 8: Uploading Project Sources to github

After you have linked your project with the local clone of the github repository, you can ‘Commit’ all the source files in your existing project to this repository. After you have committed the files to your local repository, you can ‘Push’ them to the github repository.

Right click your project and select Team / Commit … from the popup menu

API Improvements

This release primarily implements feedback from Michael received by email (Thanks!).

Michael has pointed out:

The name of the ShutdownCallback class does not ‘fit in’ with the general naming pattern for callbacks (When*).

This irregular naming pattern is, of course, not the mark of a well-designed API! To remedy this incoherence, the ShutdownCallback has been renamed to WhenShutDown, and the primary callback method has been renamed to the usual thenDo(...).

The select operation returns a list of strings when all children are selected but a list of references in case the children are filtered according to some criteria.

The select operation returned a list of strings primarily for performance reasons. Thanks to some design decisions in engine and API, the list of children returned by the select operation is a direct reference to the internal ArrayList collection used as a local cache by the onedb client (wrapped, of course, in an unmodifiable list). This makes it possible to access the list of child references with very little memory and CPU overhead.

However, premature optimization is the root of all … not good things.
Therefore, the default return value of the select operation has now been changed to a list of references. This is not only aligned with the other select operations but also makes it possible to use the results of the select operation without having to wrap the returned string values in a One.reference(...) call.

However, the option for good performance has not been abandoned. I have added a new operation variant One.select(...).allChildrenFast(), which allows accessing a list of strings in the same way the One.select(..).allChildren() operation did in versions 0.0.1 and 0.0.2.
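A sketch of the two variants (using the selectFrom shape from the tutorial below; the element type of the reference list and the synchronous return are assumptions on my part):

List<OneNode> references = One.selectFrom(node).allChildren();     // default since 0.0.3
List<String> addresses = One.selectFrom(node).allChildrenFast();   // behaviour of 0.0.1/0.0.2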

Documentation

All examples and tutorials have been updated to reflect the API changes in version 0.0.3.

In addition to the two articles mentioned above, this article provides a practical guide to understand and use the various data representations in onedb. In particular, the following four representations are discussed:

1. Nodes

Nodes are the central component of onedb’s data model. Basically, every piece of information, be it a basic data type (String, Integer, etc.), a custom object (new MyPerson()), or a onedb object (One.newNode(..)), is managed as a node by the onedb engine.

However, only those objects which have been appended and/or loaded using the onedb API are nodes. For instance, in the following example text1 is not a node while text2 is:

String text1="I am just another string";
String text2="I will be a node";
One.append(text2).to(root).in(client);

Every node managed by a onedb client has an address and an identity, as described in the following section.

2. Addresses and Identities

Nodes in onedb are identified by a globally unique identifier. These globally unique identifiers are encoded as resolvable Uniform Resource Identifiers (URIs).

Hence, every piece of information in onedb has a unique identity, which can be expressed in the form of a resolvable address.

3. References

Most operations in the One API do not work with addresses directly but use so-called references. A reference is nothing more than a simple wrapper object around an address.

onedb favors addresses ‘wrapped’ in reference objects over addresses supplied as plain String objects: this distinguishes objects which by chance hold a value that ‘looks like’ a URI from objects which are meant to represent links to entities within onedb.
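For instance, continuing the snippet above (the return type of the reference operation is an assumption):

String text2 = "I will be a node";
One.append(text2).to(root).in(client);
// the cast to Object signals that a generic object, and not an address,
// is being passed
OneNode reference = One.reference((Object) text2).in(client);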

Note in the example above that the text2 object has been cast to Object in the invocation of the One.reference(..) operation. This is necessary for the API to know that a generic object, not an address, is being passed to the operation. For other types of objects (e.g. Integer), this explicit cast to Object is not necessary.

References can also be used to obtain the resolved object of a node. For instance, the resolved object of the node text2 in the example above would be “I will be a node”. The resolved object of a node can be obtained from its reference.
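As a purely hypothetical sketch (the operation name dereference is my assumption; the actual API may name this differently):

// resolves the reference back to the object "I will be a node"
Object resolved = One.dereference(reference).in(client);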

4. Value Nodes

While virtually any Java object can become a node (given that the object’s class implements the Serializable interface), there is one special kind of object that can be added to a onedb client: value nodes.

These objects are special in that they ‘know’ their own address. A plain Integer object, for instance, has no way of knowing its own address, since the class Integer is final in Java and no additional methods/attributes can be defined for instances of this class.

The interface OneNode is used to denote classes which have knowledge of their own address. This interface defines a single method getId(), which returns the address of the node/object. As can be seen above, the mentioned reference objects belong to this type of object.

Value nodes are a special type of object within the OneNode category. They are nodes which implement the interface OneValue&lt;Type extends Serializable&gt;. These nodes, apart from knowing their own address, hold a value object of any type implementing the Serializable interface. This object can be accessed using the method getValue().
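In sketch form, the two interfaces look roughly as follows (the return type of getId() and the extends relationship are assumptions):

public interface OneNode {
    Object getId(); // the address of the node
}

public interface OneValue&lt;Type extends Serializable&gt; extends OneNode {
    Type getValue(); // the decorated value object
}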

In the following example, a value node bob is created and its two properties, address and decorated object, are accessed.
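A sketch of this example (the address used and the exact returned types are assumptions):

OneValue&lt;String&gt; bob = One.value("bob").at("https://u1.linnk.it/example/root/bob");
System.out.println(bob.getId());    // the address property
System.out.println(bob.getValue()); // the decorated object: "bob"
One.reference(bob);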

Note that the call to One.reference(..) for the value node is made without the addition .in(client), as has been done above for the String and Integer objects. The client does not need to be specified when determining the reference of an object which itself ‘knows’ its own address.

It is usually good practice to wrap objects into value nodes whenever possible. This allows for better performance, since the onedb engine does not have to derive the address of nodes from their object identity. Since object identities can change in often unexpected ways, it is also usually safer to work with value objects.

The easiest way to define value nodes is generally to specify the .atAddress(...) parameter when appending a new object to an existing node. For instance, in the following example a new value node with the value "alice" is appended to the node root at the address ./alice relative to the root node’s address:

One.append("alice").to(root).atAddress("./alice").in(client);

Given that the root node is defined at the address http://u1.linnk.it/example/root, this append statement results in a value node "alice" with the address http://u1.linnk.it/example/root/alice, connected as a child of the root node.

The new client libraries mainly incorporate advice from Java API expert Lukas Eder (check out jOOQ – a neat way to access SQL databases using a fluent Java API): all callback methods in the core API now have only one parameter, a simple data object, which gives access to the information that was previously handed to the callback method in the form of individual parameters. Compare, for instance, the primary callback method of the create realm operation in version 0.0.1 with the new style.
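The following sketch illustrates the shape of the change; the 0.0.1 parameter list is reconstructed from the accessors of the new result object and is an assumption, not the literal old signature:

// before (0.0.1): all values passed as individual parameters
public void thenDo(OneClient client, OneNode realmRoot, String secret, String partnerSecret) { ... }

// after (0.0.3): a single result object
public void thenDo(WithRealmCreatedResult r) {
    OneClient client = r.client();
    OneNode root = r.root();
    String secret = r.secret();
}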

Since callback methods are defined very often in code using onedb, this reduces source code size: instead of listing all parameters and their types, only the callback result object has to be declared in the method signature.

This reduction in source code, plus the avoidance of (unnecessary) type information, increases the readability of the source code.

The API is also much easier to change without breaking client code, since methods/data can be added to the *Result objects without having to change the callbacks in the client code.

The tutorial has been updated to reflect the changes of the revised API.

onedb Java Client: This client library provides all base features required to connect with the onedb cloud.

onedb Java Test Toolkit: Apart from all features of the onedb Java Client, the test toolkit includes a stripped-down version of the onedb cloud. This makes it possible to start up a local test cloud for unit tests (starting a test cloud should take less than 200 ms).

Neither library has any external dependencies, and both are minified for reduced application size and optimal performance.

For this tutorial, please download the most recent version of the onedb Java Test Toolkit from the downloads page. If you want to deploy an application to your users, please link the onedb Java Client instead.

3. Linking the client library (eclipse or maven)

The onedb client libraries can be linked to any Java application by adding them to the Java application’s classpath. Since this is rarely done manually, I will instead briefly describe how the library can be linked to an eclipse project or added to a Maven project. If you are already familiar with these procedures, feel free to skip to the next section.

Eclipse (Windows)

Link onedb to a new eclipse project by following the steps below:

Download the onedb Java Test Toolkit library as discussed above and store the ‘oneTestJre.min-x.x.x.jar’ file on your local machine.

4. Initializing the onedb engine

After adding the onedb libraries to a Java project, the onedb engine must be initialized. Adding the following statement anywhere in your application will initialize the onedb engine and prepare it to make requests to the onedb cloud:

OneJre.init("[Your API Key]");

Please note that you will need to replace [Your API Key] with the API key you have obtained by email as described above.

If you have linked the onedb Java Test Toolkit, you can initialize the onedb engine as above (OneJre.init(..)), but you also have the option to initialize the engine in test mode. The test toolkit can recreate a local version of the onedb cloud on a per-test-case basis. Just add the following statement, instead of the statement listed above, to any of your JUnit, TestNG, … test cases:

OneTestJre.init();

Note that since this will create a local onedb cloud, you will not need to supply an API key. Also note that the local onedb cloud will not respond to any REST requests for any created resources (for performance and portability reasons).

Part II: Core Operations

onedb uses only a handful of core operations to compose flexible and expressive data structures. The operations described in the following sections form a comprehensive list of the core operations supported by the onedb engine.

1. Create Realm

Realms are essential in working with onedb. Every item of data stored in the onedb cloud must be part of a realm. You can check the article “onedb Architecture and Design” to learn about the design concepts behind realms. This tutorial will walk through the practical steps necessary to create realms using the onedb Java API.

If you have followed the steps above to link the downloaded library in eclipse, you will have an empty eclipse project including a link to the library ‘oneTestJre.min-x.x.x’. I will describe the following steps in detail using eclipse. If you are an eclipse whizz or use another IDE (or no IDE at all), please feel free to skip any of the following steps.

You can add a Java file by right-clicking on the ‘src’ folder in your project and selecting New / Class. Add a new class ExploreOneDb as follows:

Add the statement OneJre.init(..) with your API key.
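At this point, the class should look roughly as follows:

public class ExploreOneDb {

    public static void main(String[] args) {
        OneJre.init("[Your API Key]"); // replace with your own API key
    }

}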

The main entry point to interact with the onedb client library is the class one.common.One. You can see all fundamental operations the onedb client library provides by typing One. and waiting for the context help to appear (below a screenshot for eclipse but the same will work in IntelliJ and NetBeans as well).

Choose the option createRealm(String withTitle):

The onedb client now requires the specification of a title for the realm which is to be created. Realm titles have no special significance beyond helping to build descriptive URIs for the nodes stored in the realm. Choose the title "exploration" and add a . after the closing bracket to see the context options for the createRealm(..) operation:

We can either specify a client session by choosing the .in(OneClient client) option or specify the operation we would like to perform after the realm has been created successfully by choosing the option .and(RealmCreated callback). Since there is no existing client session, we choose the second option.

The parameter type RealmCreated, or more precisely When.RealmCreated, helps to define a so-called callback for the operation. A callback defines a method which will be called once a remote operation is completed. Callbacks are used throughout the onedb APIs for all operations which depend on remote systems.

Realms are managed by the onedb cloud, and the operation of requesting a realm therefore needs to make a call to the remote onedb cloud. The duration of this call can vary significantly depending on your network connection and the way messages are routed through the Internet. In any case, the call to the remote cloud takes, in terms of computer time, very long. Using a callback enables your application to do other useful work while it is waiting for the response from the onedb cloud.

To define the callback for the createRealm(..) operation, choose the option and(..) and type new When. within the brackets. This will show the available callbacks in the onedb API. Select the callback definition which corresponds to the required parameter type indicated on top of the parameter list.

A bit of cleanup needs to be performed before proceeding: a semicolon needs to be added at the end of the (now completed) statement, and the When class containing the callback definitions needs to be added to the imports.
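The completed statement should look roughly like this sketch:

One.createRealm("exploration").and(new When.RealmCreated() {

    @Override
    public void thenDo(WithRealmCreatedResult arg0) {
        // will be called once the realm has been created on the onedb cloud
    }

});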

The method thenDo(..) will be called ‘back’ upon successful creation of the new realm on the onedb cloud. If the parameter name of the thenDo(..) method is something as uninformative as arg0, we can rename the parameter to something more descriptive such as result or r:

@Override
public void thenDo(WithRealmCreatedResult r) {

The result object will carry the following values:

r.client(): The createRealm operation will create a new client session, which can be accessed through the client parameter. This session will also have the realmRoot and its children available for further operations.

r.root(): The realmRoot parameter points to the root node of the newly created realm; new nodes to be stored as part of the realm can be appended to this node.

r.secret(): The secret parameter holds a String which must be supplied when the realmRoot is accessed using the onedb Java Client or the REST API (it’s a kind of access token).

r.partnerSecret(): This parameter holds no value after this invocation of the createRealm operation. For a postbox-type realm, it holds a secret which allows writing (but not reading) nodes.

While the application given above will successfully create a realm, we would not be able to access this realm after the application is closed, since we would know neither the address of the realm nor the secret token to access it.

To save the realm’s address and access secret, we can, for now, print them to the console by adding a few print statements.
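For instance, within the thenDo(..) method (a sketch using the accessors of the result object):

System.out.println("Realm root: " + r.root());
System.out.println("Secret: " + r.secret());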

You will notice that the application starts, prints the output above after a few seconds, but does not stop. In order to terminate our application correctly, we have to shut down every client session we have created. In our case, this is the one client session created by the createRealm operation.

To shut down the client session, we can extend the existing application after the last System.out statement.
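A hypothetical sketch (the shape of the shutdown operation and the name of its result type are assumptions; the WhenShutDown callback is taken from the 0.0.3 release notes above):

One.shutdown(r.client()).and(new When.WhenShutDown() {

    @Override
    public void thenDo(WithShutdownResult sr) {
        System.out.println("Client session has been shut down.");
    }

});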

Save both the URL of the node (“…/explora”) and the printed secret for the next steps of the tutorial in a text or source file. You can also rerun the application at any time to obtain this information again.

Note that although we have not changed any parameters of the createRealm operation, the second invocation resulted in a different realm root and a different secret being reported.

2. Interlude: REST access using web browser

We can access the newly created realm(s) through the REST API of onedb using any web browser: just type the URI of the realm root reported by your application (e.g. https://u1.linnk.it/bgbpce/explora) into the browser’s address bar and hit enter.

If you are greeted by the following friendly message, just select [Proceed anyway] or your browser’s equivalent.

Supply as authentication the username ‘token’ along with the secret access token reported by your application (e.g. gv3etqoingxe0nx) and press login.

Select save password if you are given the option. You should see the sparse contents of your node displayed by the web browser.

You can see different representations of your node by appending a variant (.node. or .value.) and a data format (.html, .xml or .json) to the URI of your node such as:

https://u1.linnk.it/bgbpce/explora.value.xml or
https://u1.linnk.it/bgbpce/explora.node.json

The variant value with the format XML, for instance, should be rendered as follows:

3. Load

While it is convenient to access nodes using the provided REST interface, it is far easier to access the created nodes using the onedb API. As long as we know the access token (secret) and the address of a node, we can access it from any system linked to the onedb library.

For the purposes of this tutorial, we can create another class in our Java application to emulate another app accessing the node. Add a class ‘Load’ to your project and add a main method to this class.

Although both ExploreOneDb and Load are part of the same eclipse project, they form two distinct applications for Java, since both classes have independent main methods. The onedb engine must be initialized once per Java application. Since Load defines a new application, we need to initialize the onedb engine again as follows:

OneJre.init("[your API key here]");

Next, we type One. once again, but this time choose the operation load(Object node):

The load operation now requires the specification of an Object node. This object specifies what is to be loaded. The only information we have available from the invocation of the ExploreOneDb example app is the address of the root node of the realm (e.g. https://u1.linnk.it/bgbpce/explora) plus the access secret you have saved (if not, just rerun the ExploreOneDb application).

onedb distinguishes between resolved nodes with a value and references to nodes. Node references have a unique identity but no value. We can use such a reference to specify which node we want to load.
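For instance (the OneNode variable type is an assumption):

OneNode node = One.reference("https://u1.linnk.it/bgbpce/explora");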

Replace the address in the above statement with the address you saved after running the ExploreOneDb example and provide the reference as the node to be loaded. Again, type a . after the closing bracket to see the further parameters of the load operation. Select the parameter .withSecret(String secret):

Supply the access token secret you saved earlier (e.g. “gv3__0nx”). Finally, you should define the callback by selecting the option .and(...):

Proceed in the same way as when specifying the callback for the createRealm(..) operation, but this time choose the callback new When.Loaded(..).
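Your application should now look roughly as follows (the name of the load result type is an assumption):

public class Load {

    public static void main(String[] args) {
        OneJre.init("[your API key here]");

        One.load(One.reference("https://u1.linnk.it/bgbpce/explora"))
                .withSecret("[your secret]")
                .and(new When.Loaded() {

                    @Override
                    public void thenDo(WithLoadResult lr) {
                        // lr.loadedNode() and lr.client() are available here
                    }

                });
    }

}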

The callback method (thenDo(..)) receives a load result with a reference to the node we have just loaded (lr.loadedNode()). The load result also provides a client session (lr.client()). We can then retrieve the resolved loaded node from this reference.

4. Append

Thus far, this tutorial has walked through the steps of setting up a Java project, creating a realm, and accessing this realm using the REST interface and the onedb API. Although creating a realm inevitably results in the creation of one node (the realm root), we have not yet done much work with nodes, for instance establishing connections between them. In this section, I will explain the most important operation for defining nodes and the connections between them: append.

First create a new class NodeOperations and add a main method which creates a new realm (check the creating a realm section above for details).
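In sketch form:

public class NodeOperations {

    public static void main(String[] args) {
        OneJre.init("[Your API Key]");

        One.createRealm("ops").and(new When.RealmCreated() {

            @Override
            public void thenDo(WithRealmCreatedResult r) {
                // nodes will be appended here in the next steps
            }

        });
    }

}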

Let’s assume the following scenario: we would like to define a customer in the newly created realm with the name “Bob” who lives at “26 Short Av”. To express this information using a proven and tested object-oriented approach should not be too difficult: first, we define a class Customer and subsequently create an instance for Bob.
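For instance (the field names are illustrative):

public class Customer {
    public String name;
    public String address;
}

Customer bob = new Customer();
bob.name = "Bob";
bob.address = "26 Short Av";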

We need to make one small modification before we can upload the Bob object to the onedb cloud: the Customer class needs to implement the Serializable interface in order for onedb to be able to transport objects of this type safely to the onedb cloud. This is easy enough; just change the class definition to:
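// the same Customer class as above, now Serializable (field names illustrative)
public class Customer implements Serializable {
    public String name;
    public String address;
}

Within the realm-created callback, bob can then be appended to the realm root following the append pattern shown earlier: One.append(bob).to(r.root()).in(r.client());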

You can run the application and should receive the login information for the test realm, such as Created One.reference("https://u1.linnk.it/crd87h/ops"):hhz______ni. We can use this reference to load both the realm and the associated bob object using the One API, as described above under the load operation.

We can also access the newly created realm using a web browser as described in the REST section above.

Accessing the realm through the REST interface, however, reveals a problem with the approach taken so far: although the customer has been added to the realm as a SerializedNode (class NodeOperations$Customer), the data associated with this node cannot easily be interpreted through the REST interface. Indeed, other Java applications would also have difficulty ‘deciphering’ the data of this node without the definition of the class Customer in byte code. This stands in contrast to the ideals of a small data system!

A better approach is to decompose the information a Customer instance expresses into various nodes with ‘standard’ data types (such as String, Integer, …). One way to do this is as follows:

The root node of the realm is designated to represent the bob entity. We append a node with the text “Bob” to this realm root node to indicate the name. We also append another node with the text “26 Short Av” to the realm root. To indicate the ‘type’ of the used nodes, we further append two generic nodes, Address and Customer, which do not hold any particular value, to “26 Short Av” and the realm root respectively.

The described arrangement of nodes can be visualized as follows:

We can create a new class NodeOperationsBetter and again specify the logic for creating a new realm. The following code snippet shows how to define a node arrangement as described above using the onedb API.
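A sketch of this logic, inside the realm-created callback (assuming, on my part, that append returns the newly created node and that One.newNode(..) creates a generic, value-less node; its argument here is illustrative):

One.append("Bob").to(r.root()).in(r.client());
OneNode address = One.append("26 Short Av").to(r.root()).in(r.client());
// generic 'type' nodes which do not hold any particular value
One.append(One.newNode("a_Customer")).to(r.root()).in(r.client());
One.append(One.newNode("an_Address")).to(address).in(r.client());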

If we access the realm created with the logic above using the REST interface, the data should be presented in a more accessible manner such as below:

The REST interface allows navigating from one node to another. If we click on ‘26 Short Av’, the node representing the address value will be displayed:

Apart from presenting the information associated with the customer in a more accessible manner, following a connection-oriented approach makes the data semantically richer. For instance, we have implicitly created a globally accessible type ‘address’ (linked to the 26 Short Av node). The type URI for address will look something like the following:

https://u1.linnk.it/l8hpud/ops/26_Short_A2/an_Address0

However, this URI does not appear to be very ‘pretty’ or portable. In particular, the part ‘26_Short_A2’ appears to be in conflict with the intention of defining a reusable identity for the type ‘address’.

It is usually good practice to define ‘type’ nodes in their own independent realm. This way their reusability is increased, and shorter, more succinct URIs can be created. Type nodes can be appended to a realm like any other node. We can write a little application such as the following to define the type nodes required for the example.
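In sketch form (One.newNode(..) again with illustrative arguments):

One.createRealm("types").and(new When.RealmCreated() {

    @Override
    public void thenDo(WithRealmCreatedResult r) {
        One.append(One.newNode("address")).to(r.root()).atAddress("./address").in(r.client());
        One.append(One.newNode("customer")).to(r.root()).atAddress("./customer").in(r.client());
    }

});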

Note here the slightly changed append statements with the added atAddress(..) parameter. Specifying the atAddress parameter allows a precise address to be chosen for a node; if atAddress is not specified, onedb will attempt to generate a suitable address.

Running the above application should result in an output such as the following:

Now, if we were to define a customer Alice, we could reuse the type nodes for ‘address’ and ‘customer’ used for the definition of Bob.

5. Select

Querying data in onedb is done on a recursive from-node-to-node basis, much in the spirit of Linked Data. To aid the navigation from node to node, onedb provides four operations to query the children of a node:

selectFrom(node).allChildren(): Will return a list of the references of all children appended to the node.

selectFrom(node).allChildrenFast(): Will return a list of addresses as Strings of all children appended to the node.

selectFrom(node).theChildren().withType(type): Will return a list of references to all children of the node with the specified (Java) type.

selectFrom(node).theChildren().linkingTo(reference): Will return a list of references to all children of the node, which have the specified reference as one of their children. For instance, in the example node arrangement given below, selecting all children from the node persons linking to Customer will return the nodes bob and alice.

Create a new Java class with main method, initialize the onedb engine, create a realm and define the following nodes for the realm:

Note that the operation to select children by type, in contrast to the previous one (select all children), requires the specification of another callback (When.ChildrenSelected). As a general rule, onedb requires the specification of a callback for all operations which may need to send a remote message to the onedb cloud. The initial loading of the root node of the realm downloads the root node from the onedb cloud, including the references to all its children. A reference, however, is not sufficient to determine the (Java) type of a node. Therefore, the select operation with a type parameter needs to ensure that all child nodes have been downloaded from the onedb cloud.

The same applies when specifying the linkingTo(..) parameter. Since the load operation initially only loads the children of a node but not its children’s children, a remote request might have to be sent to the onedb cloud, requiring the specification of a callback.
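A sketch of such a select (the order of the fluent parameters and the name of the result type are assumptions; customerType stands for a reference to the customer type node):

One.selectFrom(root).theChildren().linkingTo(customerType)
        .in(client)
        .and(new When.ChildrenSelected() {

            @Override
            public void thenDo(WithChildrenSelectedResult csr) {
                System.out.println("Found Customers:");
                // iterate over the selected child references here
            }

        });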

Running your application should result in an output such as shown below.

All Children: [One.reference("https://u1.linnk.it/4hxdr8/query/NxAuth.rea0"), One.reference("https://u1.linnk.it/4hxdr8/query/This_is_a_1"), One.reference("https://u1.linnk.it/4hxdr8/query/bob")]
Found Messages:
This is a test realm
Found Customers:
One.value(bob).at("https://u1.linnk.it/4hxdr8/query/bob")
All queries completed.

The select example makes heavy use of references (One.reference(...)) and value nodes (One.value(..).at(..)). For an in-depth discussion of these different node types and how one node type can be converted into another, please check the article ‘A Practical Guide on Node Types in onedb’.

6. Replace

Apart from creating realms and appending nodes in complex and deep arrangements, onedb supports replacing the value of nodes as well as removing nodes. However, it is often a good idea to avoid these operations whenever possible: onedb at its core is designed to enable distributed systems, and any piece of data or node might be opened by multiple clients on the same or different devices.

As long as we constrain ourselves to the operations createRealm, append, and select, the data stored in onedb is immutable. For instance, when we append a node representing the type customer to a ‘types’ node, it can be guaranteed that this node will be available to any client working with the system.

The remove operation, in particular, can lead to unexpected and undesired situations in a system with many involved clients. However, there are cases where using update and remove is by far the simplest solution. To support these cases, onedb offers operations both to update and to remove nodes from the network.

The update operation replaces a node value with another node value. The connections of a node remain unaffected by the update operation. The value of nodes with an externally managed address can easily be updated.
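A hypothetical sketch (the fluent shape of the replace operation and the return value of append are assumptions):

OneNode node = One.append("phase1").to(r.root()).in(r.client());
One.replace(node).with("phase2").in(r.client());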

The example above first adds a node "phase1" to the onedb cloud and then replaces this node with a node holding the value "phase2". Please note that although the value of the node has been changed, the address of the node stays the same:

Before update:

Value : "phase1"
Address: https://u1.linnk.it/di14a2/update/phase11

After update:

Value : "phase2"
Address: https://u1.linnk.it/di14a2/update/phase11

In general, it is not allowed to change the address of a node using the update operation. This can become tricky when working with nodes with internally managed addresses. For instance, the following operations represent BAD practice.
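In a hypothetical sketch, using the same assumed operation shapes as above:

One.replace(node).with(One.value("phase2").at("./phase2")).in(r.client());
// BAD: the replacement value node 'knows' an address different from the
// node it replaces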

onedb stores a version every time a node is changed (e.g. its value is replaced, a child is appended, a child is removed, …). Therefore, calling One.clearVersions(...) for nodes which are frequently changed can significantly increase the performance of loading and manipulating data. You can find an example on github of how to use the clear versions operation.

7. Remove

As already mentioned in the previous section, replace and remove are operations which should be used with care in environments where one node is synchronized between multiple clients. Remove, in this regard, is more ‘dangerous’ than replace, since replace guarantees that once-defined nodes remain available to other parts of a distributed system. Remove, in contrast, can render nodes unavailable to other components of a system. Therefore, remove should always be used with caution when working with the onedb cloud!

The remove operation can basically be used for two purposes: first, removing one node from another node deletes the connection between these nodes (if a connection has been defined before). However, remove can also be used to delete a node from the onedb cloud entirely; a node is removed from the onedb cloud if the connection to its direct parent is removed.

The following snippet will first append a node "to be removed" to the realm root and define it in the onedb cloud. Then, this node will be removed from the realm root AND the onedb cloud.
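In sketch form (the shape of the remove operation is an assumption):

OneNode node = One.append("to be removed").to(r.root()).in(r.client());
One.remove(node).fromNode(r.root()).in(r.client());
// the realm root was the node's direct parent, so the node is also
// deleted from the onedb cloud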

If, however, a connection between nodes, which are in no direct parent-child relationship is removed, ONLY the connection will be removed and not the connected node. In the following example, the node "to be kept" will still be defined in the onedb cloud even after it has been removed from the node "another node".
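Again as a hypothetical sketch:

OneNode another = One.append("another node").to(r.root()).in(r.client());
OneNode kept = One.append("to be kept").to(r.root()).in(r.client());
One.append(kept).to(another).in(r.client());    // a second, non-parent connection
One.remove(kept).fromNode(another).in(r.client());
// only the connection is removed; "to be kept" keeps its direct parent
// (the realm root) and therefore stays defined in the onedb cloud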

Architecture Overview

A problem in many cloud-based databases is the increased latency when accessing data on a remote server. In onedb, all data manipulation and query operations are executed against a local replication of data from the cloud to reduce latency. Data of this local replication is kept in sync with the cloud through an automated synchronization process.

Working with a local copy of data from the server, rather than sending API calls to the server, reduces the latency of many operations as well as the traffic between client and server. Of course, a similar result can be achieved by using caches; however, implementing and maintaining these is often a labor-intensive matter (for another approach/architecture to synchronization, check Kresten Thorup’s presentation on synchronization using the Riak NoSQL database).

Core Data Model: Nodes and Global Connections

The data model of onedb is based on two components: nodes and global connections.

Nodes

Nodes, in essence, are values expressed as basic data types (text, numbers, dates, …) or user-defined objects such as the following:
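// a sketch: any of these values can become a node; Customer stands for
// any user-defined Serializable class (Date requires java.util.Date)
String name = "Linda";
Integer seven = 7;
Date today = new Date();
Customer customer = new Customer();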

Although it is possible to assign any URI to a newly created node using the newNode operation, there is only a limited range of URIs which will be accepted by the onedb cloud. In particular, it is not allowed to define nodes with URIs which lie outside the URI range of a known realm.

To aid in defining valid URIs for nodes, the append statement allows an .atAddress(..) parameter to be defined. This parameter appends the specified value with the specified URI relative to the node it is appended to. For instance, take the following statement …
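One.append("alice").to(root).atAddress("./alice").in(client);

… which appends the value "alice" with the address ./alice relative to the address of root: given a root node at http://u1.linnk.it/example/root, the new node receives the address http://u1.linnk.it/example/root/alice.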

It is generally a good practice to define nodes with internal addresses, since they allow the onedb engine to identify these nodes with greater speed and accuracy.

External Addresses

One major disadvantage of nodes with an internal address is that they must implement the interface OneNode to provide their address to the onedb engine. This is often not possible for legacy classes, or it conflicts with building ‘pure’ POJO objects. The onedb engine therefore also allows defining nodes in the onedb cloud which do not implement the OneNode interface.

For these nodes, consequently, addresses need to be generated and managed externally when a node is added to the onedb cloud. For instance, appending an object such as the following …

One.append("news").to(realmRoot).in(client);

.. will define a new node “news” in the onedb cloud with an automatically generated address.

Global Connections

Global connections make it possible to connect any node with any other node. Connections between nodes are extrinsic, in that the value of a node and its connections can be changed without affecting each other.

There are two different types of connections: direct connections and indirect connections. A direct connection is a connection between nodes whose URI addresses indicate that the destination node lies within the ‘folder’ of the source node; for instance, a node at …/root/news would be a direct child of the node at …/root.

All connections in onedb are uni-directional and unlabeled (connections do not have properties). Nodes also do not have properties or any meta-data associated to them. Nodes and connections make up the core of onedb’s data model. All other features (e.g. security, properties, etc.) are built upon this core data model.

Authorization: Realms

Realms are a thin logic layer on top of nodes and connections, which provides fine-grained access control for nodes. Every realm is defined by a root node: a node which does not have an incoming direct connection.

Realms can be protected with access tokens. These tokens define an authorization level as well as the secret required to be granted the defined authorization. onedb currently supports several authorization levels; the partnerSecret of a postbox-type realm mentioned in the tutorial above, for instance, grants write-only access.

Core Operations

The append and insert operations make it possible to define new connections between nodes. Insert differs from append in that it allows specifying a particular position within a node’s connections for the new connection. Append and insert also ensure that previously undefined nodes are added to the onedb cloud.

The remove operation removes connections between nodes. It also removes dangling nodes (those without any incoming direct connections) from the onedb cloud.

The update operation allows changing the value of a node.

The select all children operation allows retrieving a list of the outgoing connections of a node.

Client Sessions

All of the operations listed in the previous section require the context of a valid client session. Client sessions contain the local replication of the nodes mentioned in the architecture section above.

A new client session can be established in two ways: through the creation of a new realm or through loading a node which is already defined on the server. These two ways are enabled by the createRealm and load operations, respectively.

The createRealm operation requests a new realm on the onedb cloud. The realm root node of the created realm will be made available in a new client session.

The load operation makes a node from a remote system available in a local client session.

Synchronization

As already mentioned in the section architecture above, a client application interacts with a local replication of remote nodes and their connections. This means that all atomic operations (append, insert, remove, update, and select) are executed against a local database (with in-memory performance).

Synchronization in onedb, however, can also be used to synchronize data between clients. In particular, if a node is loaded in more than one client, the background synchronization process will synchronize the state of this node between the local replications of the node on multiple clients.

The synchronization process automatically tries to merge conflicts. In the example below, a node (1) is loaded by two clients. Client 1 appends a node (2), while client 2 appends a node (3). Regardless of the temporal order of these operations, node (1) will have both node (2) and node (3) appended after synchronization is performed.

If conflicts cannot be merged automatically, onedb attempts to resolve them using a last-writer-wins policy. In the example below, the value of node (1) is changed first on client 1 to “B” (T=2) and then on client 2 to “C” (T=3). Since the change from client 2 occurred later than the change on client 1, the value of node (1) after synchronization will be “C”.

In case a conflict cannot be resolved, onedb issues a descriptive error message and allows for an explicit resolution of the conflict. In the example below, client 1 appends a node (2) to node (1). Client 2, however, deletes node (1), which renders the operation on client 1 invalid. As a result, an error message is reported on client 1.

The synchronization between clients is a very useful feature of onedb for building distributed systems. Indeed, most of the implementation and testing effort was spent on this feature.

Further Information

The increased scale of contemporary applications is often measured in the currencies of terabytes and thousands of transactions per second. An amazing array of tremendous new and old technologies helps in dealing with this increased scale: ranging from clouds in various shapes to scalable NoSQL databases and even emerging asynchronous programming paradigms.

However, while we get better at handling large quantities of data with high reliability and dazzling numbers of concurrent transactions, the new mountains of data bring with them challenges beyond data processing and storage. In particular, big data is not smart data. Indeed, making sense of big data has been identified as one of the most important challenges lying ahead for technologists.

The problem of making sense of big data is not an easy one, especially since big data is, well, big and bulky and, in its entirety, generally difficult to comprehend.

This article introduces onedb. onedb is a free, cloud-based platform for connected small data. The platform currently consists of a free web service for Java Developers to store and connect data from any Java app. If you try out the service, you help me greatly in my studies!

In this article, I will first give a brief introduction explaining the background and motivation for onedb. I will then describe the various design goals of the service as well as how onedb helps to support smart and ‘small’ data.

Background and Motivation

When I joined graduate school at the University of Auckland some four years ago, I started working on the question of how an individual or a small team can organize their knowledge and information.

The theories I encountered in this research point to a solution to the problem described in the introduction: how to make sense of big data. The answer that these theories provide is that if we strive to make big data interconnected, it has the potential to become more useful to us.

There are many technologies to store connected data and information, to query connected data, and even to reason about connected data. However, most of these technologies are very sophisticated and provide many features beyond the simple connection of data, for instance querying, reasoning, and applying various graph algorithms on connected data. These features, though doubtlessly valuable and crucial for many tasks, often come at the price of increased complexity as well as reduced portability and generalizability of libraries, databases, and services. Often, these properties make it difficult to connect data residing on different platforms, servers, and applications in an easy and coherent way.

Thus, the motivation to implement onedb was to design a service focusing exclusively on one key feature: to connect data across applications and systems in the simplest way possible.

onedb Design Goals

While the key requirement for the onedb service is to enable linking data across applications and platforms, a number of secondary design goals have been chosen to differentiate onedb from other solutions. These design goals are: (1) developer productivity, (2) generalizability and simplicity, (3) portability, and (4) testability.

Developer Productivity

There is a great difference between a technology which allows doing something and a technology which enables doing something: most technologies can be bent in some form or another to find a solution for a problem (allowing), but few help to find a truly elegant and effective solution (enabling).

Since the feature at the heart of the onedb solution is apparently simple (connect two pieces of data), a lot of the development effort for onedb was spent on effectively supporting this core feature in an enabling fashion. In general, onedb aims to achieve this by minimizing the steps required to get from idea to connected data. This is supported by three intertwined design features: (1) minimal configuration, (2) fluent and readable API and (3) powerful conventions. I will give an example in the following to illustrate these three design features.

The following are the minimal but sufficient steps to connect two pieces of data in the onedb cloud.
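In sketch form (the class name is illustrative; the API shapes follow the tutorial above):

public class ConnectFooBar {

    public static void main(String[] args) {
        OneJre.init("[Your API Key]");

        One.createRealm("foo").and(new When.RealmCreated() {

            @Override
            public void thenDo(WithRealmCreatedResult r) {
                One.append("bar").to(r.root()).in(r.client());
                System.out.println("Address: " + r.root());
                System.out.println("Secret: " + r.secret());
            }

        });
    }

}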

The code listed above will create two nodes in the onedb cloud. The first node “foo” will be a realm: a kind of mini-database, which can be accessed using the generated secret token and the address of the node foo. This ‘realm’ node will be connected to another node “bar”. Finally, the application will print out the address of the node “foo” along with the access secret necessary to access the node.

It is also possible to access the nodes using a simple REST interface (given that the access secret is supplied using HTTP BASIC authentication). Various representations (e.g. HTML, XML, or JSON) are supported through simple variants of a node’s URL.

The example above describes all the configuration necessary to start storing data using onedb. There is no need to set up a server, to configure databases, buckets or private keys, no definition of tables or keys and no JDBC connection pool. You can further see the fluent API of the onedb client resembling English sentences (“create realm ‘foo’ and when realm created then do …”). But, most importantly, the expressiveness of the code snippet is amplified by a number of powerful conventions embedded in onedb, such as the automatic designation of global identities for all created nodes or the various data representations in formats like JSON or XML, which are available for every node in the onedb cloud.

Generalizability and Simplicity

While C and JavaScript as languages both have their well-discussed shortcomings, they are without doubt extremely popular and widely used. One key ingredient to their success is their generalizability. You can literally implement any non-UI logic for any platform in C, while you can implement UIs for most rich client platforms with JavaScript.

There are two ways to achieve generalizability: through simplicity or through sophistication. Java, for instance, achieves generalizability through sophistication, by providing many advanced and sometimes complex features (e.g. threads, NIO, …) which allow the language to be deployed in a large number of use cases. JavaScript, in contrast, achieves generalizability through simplicity: for one, it is much easier to implement a basic JavaScript interpreter than a JVM+JDK. Moreover, there are only a handful of (useful) language features, which makes the language widely supported and known.

onedb strives for generalizability through simplicity in a number of ways:

onedb’s core engine supports only the most basic operations on graphs of connected nodes: append a node, remove a node, and replace a node.

Apart from supporting these operations, onedb’s core engine supports synchronizing nodes and their connections between multiple locations, but nothing else.

This simplicity at the heart of the onedb engine enables the database to support a whole range of more sophisticated data structures, for instance trees, maps, and even labeled graphs.

onedb further strives to be generalizable in that it minimizes assumptions about your data: for instance, how the data is organized, how it is queried, or which data types are used. As an example of the support for a wide diversity of data types, see the snippet below. All listed operations work in onedb without the need for custom configuration:
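// a sketch following the append shapes above; root and client stem from
// a createRealm or load result, Customer from the tutorial earlier on
One.append("a text").to(root).in(client);       // String
One.append(42).to(root).in(client);             // Integer
One.append(new Date()).to(root).in(client);     // java.util.Date
One.append(new Customer()).to(root).in(client); // any Serializable object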

Portability

Data in any application of non-trivial size ceases to exist and is reborn in sheer endless incarnations. Take a user’s last name, for instance. Initially, the last name is held by a text field as part of the web browser’s DOM. It might then begin its journey as an element of a JSON data structure sent to the application server as part of an HTTP message. Next, the last name is deserialized on the server and lives, temporarily, as part of a Java or C# object. Then, after being included in an SQL statement, the last name might find its final, persisted resting place … until it is requested by another browser session.

The problem here is less that identical data is replicated numerous times (this is unavoidable if multiple devices are involved) but more that the data is represented in various different and incompatible formats for the various involved platforms (Java object, JSON, DOM property, …).

onedb strives to provide one common platform across devices and environments which makes it possible to work with and connect heterogeneous pieces of data. onedb achieves this by offering portable and embeddable client libraries for various platforms, all of which access one integrated data space: the onedb cloud.

The onedb client engine is written in vanilla Java with no external dependencies apart from the core JDK classes (java.*). Moreover, the core engine can be compiled using Google’s awesome Google Web Toolkit (GWT) Java-to-JavaScript compiler. The onedb cloud can therefore be accessed from all Java-compatible environments and most modern web browsers. Please bear with me for the web browser part, though. I don’t believe it really increases developer productivity to require the continuous compilation of a 50,000+ LOC client library with the not exactly lightning-fast GWT compiler. I therefore plan to provide a precompiled JavaScript client library rather than a GWT library (but I am still working on the API for said library).

Testability

Application logic which is tightly coupled to persisted data is notoriously difficult to test using automated unit tests. There are many reasons for this but one key factor is that it is often non-trivial to start up a database with the right configuration and test data for a particular test case.

onedb strives to make code, which relies on data that will be persisted in the production system, both easy and fast to test. For this purpose, an almost fully functional in-memory onedb cloud can be started up for test cases. Starting up the test cloud should take less than 200ms and can therefore be done, if required, for each individual unit test.

I have given an example above of a simple application which connects two nodes in the onedb cloud. We can test this application locally at far superior speed (since no Internet messages are sent) by simply changing the first line of the application code to OneTestJre.init():
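// in sketch form; the rest of the application remains unchanged
public static void main(String[] args) {
    OneTestJre.init(); // starts a local, in-memory test cloud; no API key required
    // ...
}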

Support for Small Data

I have mentioned in the introduction that onedb is a platform to support connected small data. To discuss onedb’s ability to support small data, I will use the following definition from another article:

Small Data is interconnectable and portable data which can be stored, managed, and processed independently on all components of a distributed system.

The essential idea behind this definition is that ‘big’ data can be made ‘small’ if three requirements are satisfied: (1) the components of a distributed application are enabled to manage their data independently from other application components, (2) data is portable in that it can be seamlessly moved from one component to another, and (3) data managed by one component can be ‘connected’ to data managed by other components. In the following, I will briefly describe how onedb fulfills these requirements.

To allow application components to manage data independently from other components (the 1st requirement of small data), the onedb cloud is divided into a large number of sections of different granularity (e.g. one last name or all user data). These sections, called realms, are self-contained and allow application components to manage a set of nodes and their connections. It is very easy to create new realms, so application components can create and manage their own independent realms if required.

To allow for data portability (the 2nd requirement of small data), onedb provides two intertwined features: first, realms or parts of a realm can be shared between various components of an application. Second, data from the onedb cloud is made available locally to system components by means of an automated synchronization process (think of a fine-grained Dropbox for applications). This synchronization process is available on all platforms to which the onedb client engine can be deployed.

Allowing data managed by different components to be connected (the 3rd requirement of small data) is the core feature of onedb, as mentioned above. Any piece of data (or node) in the onedb cloud can be connected to any other piece of data in the onedb cloud.

Why bother?

onedb is a very young technology and as such I expect there to be bugs, downtimes, and, of course, there is the old friend of any new framework: quite sparse documentation.

However, I do believe that there are many exciting use cases for onedb. You can plug it into your own apps in a matter of minutes and use it to store logs, settings, test data, or test parameters. You can also use onedb as a quick way to publish and update a set of web pages or other REST resources, which may be consumed by any REST capable client.

There are currently two ways in which you can use onedb. First, you can go to the onedb webpage and grab an API key for the technology preview server. The technology preview server allows you to store up to 10,000 nodes/objects in the onedb cloud per API key. Second, you can contact me if you would like to install your own onedb server node, and I will be happy to assist you with the installation and configuration procedure.

Limitations & Last thoughts

onedb is a service focused on one particular task: to help you connect and integrate small data across applications and platforms. onedb is built as a lightweight add-on to existing applications and infrastructures, not as a replacement for them. In consequence, onedb does not provide many features commonly found in other databases or cloud platforms. However, since onedb is lightweight and generic, it is very easy to integrate with other technologies, for instance to build an index with Lucene or to run a Hadoop job over data stored in onedb.
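
To sketch the Lucene case (assuming a Lucene 5.x to 8.x classpath; the map of node URIs to values stands in for data actually read from onedb, so as not to presume the onedb client API):

    import java.util.Map;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class NodeIndexer {

        // Builds an in-memory Lucene index over text nodes; the map of
        // node URI -> node value stands in for data read from onedb.
        public static Directory index(Map<String, String> nodes) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
            for (Map.Entry<String, String> node : nodes.entrySet()) {
                Document doc = new Document();
                // Store the node URI so that search hits can be resolved
                // back to the node they came from.
                doc.add(new StringField("uri", node.getKey(), Store.YES));
                doc.add(new TextField("value", node.getValue(), Store.YES));
                writer.addDocument(doc);
            }
            writer.close();
            return dir;
        }
    }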

Generic software problems, such as the one addressed by onedb, can often be solved by an array of related technologies. I have created a preliminary list of interesting related technologies. Please let me know of any technologies I missed there and I will be happy to include them.

onedb is a very important part of my PhD thesis, so you would help me a great deal by sending me some quick feedback (@mxro or feedback@onedb.de). Please also let me know if you find any bugs and I will try to fix them as quickly as possible.

Modern database systems (e.g. RDBMS, object databases, graph databases) are easily among the most complex aspects of application deployments. This, to a certain degree, is unavoidable given the many intrinsic complexities of working with large quantities of data. However, complexity in designing the persistence back-end of an application is seldom desirable. Storing and retrieving data, after all, is rarely at the heart of the unique business value a system provides.

I propose that many current challenges in working with data could be addressed, not necessarily by the next fancy technology, but by changing the way we think about data. In many current approaches, there is a tendency to make data bigger: building larger systems to hold more data in the spirit of economies of scale and data integration. I argue that we should rather strive to make data smaller: building small, tightly interconnected database systems to hold less data but provide more agility.

I first list a number of problems commonly encountered in using contemporary database systems. I then provide a definition for ‘small data’ and discuss a number of advantages small data can bring to the development of distributed and evolving applications. Lastly, I synthesize the discussions in this article into the simple PACT Test, which can be used to assess the ‘small data readiness’ of existing and planned applications.

Contemporary Problems

While there are arguably many important problems in dealing with database systems, I will limit the discussion here to three key problems:

Database systems are usually heavy-weight components of applications and are not easily portable.

Database systems are often difficult to change during development and after deployment.

Database systems tend to be monolithic and do not easily enable the integration of data on third party systems.

I will discuss each of these problems in the following.

Problem 1: Heavy Cargo

Database systems tend to be feature-rich. The range of features commonly supported includes: full support for an SQL standard, data serialization, index creation, data backup and data replication. These features, while without question very useful, also come at a cost.

While databases might offer simple and easy interfaces, these often reduce complexity only on the surface, trading local simplification against increased global complexity. For instance, it is often non-trivial to estimate the performance of SQL queries (which are relatively simple in themselves but embedded in the global complexity of the database solution). This problem is compounded when further abstraction layers, such as an ORM framework, are used.

In summary, database systems are often complex and heavy components of applications. They often make the tasks of testing, deploying, and redeploying applications on different platforms more difficult.

Problem 2: Resistance to Change

Many current database systems use some form of schema to assure the integrity and structure of the data they manage. In the case of relational database systems, for instance, the data stored in any table is accompanied by a schema defining the valid fields and field types. Other flavors of database systems rely on schemas as well: many object databases, for instance, require class definitions for all stored objects.
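
For illustration, such a class-as-schema definition might look as follows (a sketch using standard JPA annotations as one example; the User entity and its fields are chosen purely for illustration):

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;

    // The class itself acts as the schema: every persisted User must
    // conform to exactly these fields and types.
    @Entity
    public class User {

        @Id
        private long id;

        @Column(nullable = false, length = 64)
        private String firstName;

        @Column(nullable = false, length = 64)
        private String lastName;

        // Adding, renaming or retyping a field later typically requires
        // migrating all previously stored records.
        @Column(unique = true)
        private String email;
    }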

The structure imposed by schemas is a key strength of many database systems. Indeed, it is the rigidity of schemas that helps to make data stable, robust and reliable over prolonged periods of time. However, when data needs to be agile and adapt to changing requirements at a rapid pace, rigid schemas are seldom of great help.

Problem 3: The Great Wall

Many database systems tend to enforce logical spaces into which data is grouped. It is generally difficult to integrate data that resides in more than one logical space. In particular, it is difficult to establish logical and/or semantic connections between data from different logical spaces.

Many NoSQL solutions such as key-value stores provide few native mechanisms to establish connections between different pieces of data. Graph databases, in contrast, are pointedly focused on connecting data in the form of nodes in complex networks. However, many graph databases still maintain logical spaces into which nodes are grouped. For instance, nodes are often identified by a proprietary node id that is unique to the server or server cluster of the graph database. This makes it difficult to link nodes stored on one server with nodes stored on another server.

In summary, most contemporary database systems organize data into logical spaces (e.g. database instances), which make it difficult to link data residing within these spaces with data residing in other logical spaces (of the same database system or other database systems).

Proposing Small Data

Although the problems discussed in the previous section describe a number of undesirable consequences of working with database systems today, they are in no way valid as a critique of the merits of current technological solutions. Indeed, I believe we have many awesome database technologies, from the mature and free MySQL to various great graph databases, object databases and key-value stores.

The described problems are less a consequence of insufficient implementations than of how data is usually conceptualized. I would argue that, even before the advent of the term big data, discussions of data more often leaned towards making data ‘bigger’ rather than ‘smaller’.

For instance, the article marking the beginning of the age of relational database systems, “A relational model of data for large shared data banks” by E. F. Codd, has at its heart the idea that users access one centralized system, containing all required data in a cohesive and integrated fashion.

While a centralized and monolithic approach has significant advantages in terms of assuring integrity, manageability and performance, today’s distributed environments have requirements, which are in conflict with this approach:

Data needs to be ‘open’: it must be accessible to third-party systems and may have implicit dependencies on data held by third-party systems.

Data and its structure need to be adaptable to react to rapidly changing requirements.

Data needs to be portable to be available directly on the various devices users use.

I propose that designing data as small data poses various advantages in designing contemporary systems. Below, I first give a brief definition of small data and then describe a number of advantages of small data.

Since there appears to be no general agreement on a definition for small data, I will use the following definition of small data for the purpose of this article:

Small data is interconnectable and portable data which can be stored, managed, and processed independently on all components of a distributed system.

This definition of small data has two main components: (1) the requirement of portability and (2) the requirement of interconnectability. Small data should be portable in that there are no conceptual, logical or physical constraints to storing, representing and manipulating this data on all important components of a distributed system. Moreover, it should be possible to move parts of the data from one application component to another. Small data should be interconnectable in that pieces of data (e.g. a value, a record …) can be connected with other pieces of data, regardless of whether they are stored on the same component/device of the system or on different components/devices.

To give an example: in a traditional system, a central database would usually contain all user details (first name, last name, email …). If a new user registers using a web application, the information containing the user details will usually be submitted to an application server, which will in turn send messages to a database server. The database server will then ultimately add the user details to the central user details database.

A small data system, in contrast, would require every device involved in an application to be able to store, manage and process the involved data independently. The web application (running in a web browser) through which the new user registers should be able to interact with the user details record resulting from the transaction directly.

Adhering to the requirement of interconnectability, the small data system will further make it possible to establish logical and/or semantic connections between the user details held locally by the web browser and other, possibly remote, components of the application. For instance, it would be sensible to connect the newly created user details with a central repository of all registered users.

Moreover, the requirement of portability suggests that any data processed by the application should be portable from one application component to another. The user details created by the web client, for instance, should at some point be transported to a central repository containing all registered users. Since the web client likely requires data from other components of the application as well, data will need to be transported in both directions, most sensibly through automated synchronization.

Advantages of Small Data

While organizing data as small data doubtlessly poses its own engineering challenges, it provides a number of advantages in building distributed applications. I will describe four key advantages in the following: Proximity, Agility, Connectability and Testability.

Proximity

Small data, by definition, requires sufficient mechanisms on every device partaking in the system to manipulate and manage data. In consequence, the data is ‘closer’ to where it is needed and manipulated.

For instance, let’s assume a user registers and would like to view their user details straight after having completed the registration. In a traditional application such as the one described in the previous section, a significant number of calls must be made between the various systems involved (please note that these steps could of course be reduced using some form of intelligent caching).

In a small data application, the user details data will be available locally to the application running in the user’s browser. Consequently, the latency of performing the two operations (registration and viewing the user details) can be dramatically reduced.

Agility

Small data, as a consequence of its definition, is more open to change than traditional big data and monolithic data repositories. Imposing global rules and constraints for data involved in a small data application would run contrary to the requirement that every device should be enabled to manage the data it requires independently of other components.

For instance, let’s assume the user registration system described above had to be changed in two ways: (1) the web application should issue a warning to the user if there has been no user activity for more than 10 minutes, and (2) the application server should keep track of user logins and generate a monthly report of logins per day for the service provider.

In a traditional application, there are essentially two ways to accommodate these changed requirements: First, new data stores could be added to the system as shown below.

Alternatively and more in line with ideals of integrated data, the data in the original data store could be extended to hold the additional required information.

In a small data application, the existing data could be amended at the local data store where it is required (see below). The amended data may or may not be synchronized with the other involved systems. In contrast to the first traditional approach, no new data stores need to be introduced to the application. In contrast to the second traditional approach, no globally visible and potentially risky data changes are made that might affect all involved application components.

Connectability

It is often difficult to establish connections between data residing on different systems, preventing truly ‘smart’ and connected applications. In traditional systems, this is often caused by the physical, conceptual and logical ‘walls’ around logical spaces in current database systems (see Problem 3 above).

Small data by definition is divided into a large number of independent pieces. Each of these individual pieces provides the opportunity to establish connections with other pieces of data. Data in a small data application further needs to be portable, in that parts of data can be transferred from one system to another.

This portability is an important factor in the connectability of data. To give an example, if a third-party system interfaces with the application discussed in the previous sections, it might need a way to identify users of this application. A user id, for instance, might be employed to reference a user record from the original application.

The sharing of a user id, however, does not really establish a connection between the involved database systems. In consequence, plenty of custom logic would need to be implemented to assure the integrity of the loose connection between the data pieces in the original application and the third-party system.

It is usually not an option to give a third-party system access to the database system of a traditional application, as database systems organize their data in ‘big chunks’; they do not, for instance, allow authorization to be granted for just one particular user record.

A small data application, in contrast, should make it easier to grant a third-party application access to its data. In particular, since individual components of a small data application can manage their data independently of other components, it should be easier to add a third-party system as another ‘component’ of the system without compromising its integrity.

Testability

Application logic that depends on data and database features is often difficult to test. Partly, these difficulties arise from the heavy-weight nature of many databases (see Problem 1) and the difficulty of evolving database schemas in quick iterations (see Problem 2).

Small data systems are by their nature easier to test, since the data is required to be locally available and managed by the component that works with it. This also enables testing of the component (including its ‘real’ data) in isolation from other components of the system.

In traditional applications, there are usually two options for testing application logic that depends on (persisted) data. First, an isolated ‘unit’ test can be defined for the specific component under test using some form of mock data; this mock data is usually not provided through the database system used in the application. Second, an integration test can be developed, which tests mainly a feature of one component but requires the availability of a wide range of system components. Integration tests often use a similar (if not the same) database system as the one used in production (for instance an in-memory database).

In a small data application, in contrast, isolated test cases can be written against the same database system used in production (since it is a small database system, which is portable and can be managed independently by the application components).
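
A minimal sketch of what such an isolated test could look like, assuming JUnit 4 (LocalStore and RegistrationService are hypothetical placeholders for the component under test and the small, embeddable store it also uses in production):

    import org.junit.Before;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class RegistrationTest {

        private LocalStore store;              // placeholder: embeddable store
        private RegistrationService service;   // placeholder: component under test

        @Before
        public void setUp() {
            // The same small, portable store engine used in production is
            // booted in memory for the test; no mock data layer is needed.
            store = LocalStore.inMemory();
            service = new RegistrationService(store);
        }

        @Test
        public void registeredUserIsImmediatelyReadable() {
            service.register("Jane", "Doe", "jane@example.com");

            // The component is exercised together with its 'real' data,
            // yet in isolation from all other components of the system.
            assertEquals("Doe", store.findUser("jane@example.com").lastName());
        }
    }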

PACT Test

Although the term small data is not frequently employed when talking about applications, the advantages described in the previous section might well be realized by existing solutions. Based on the four key advantages of small data systems (proximity, agility, connectability and testability), the PACT test can help to determine the ‘small data readiness’ of any existing or planned application or application platform.

P: Proximity

Does the platform enable modules to define, create and manage persisted data independently?

Does the platform enable modules to query and manipulate persisted data independently?

A: Agility

Does the platform allow module-specific data to be changed without directly or indirectly affecting module-external data?

Does the platform allow the structure of module-specific data to be changed without directly or indirectly affecting the structure of module-external data?

C: Connectability

Does the platform enable the establishment of semantic connections between pieces of persisted data in fine granularity?

Does the platform enable the establishment of semantic connections between pieces of persisted data managed by different modules?

T: Testability

Does the platform allow module logic that is interwoven with persisted data to be tested in isolation from other modules?

Limitations & Conclusions

Some questions are not easy to answer, for instance: “Which is better: object-oriented programming languages or functional programming languages?” As interesting as this question might be, it is very difficult to arrive at a conclusive and constructive answer. General approaches and design paradigms are difficult to assess on a one-dimensional scale of better or worse. Is striving towards ‘small data’ better than striving towards ‘big data’? Ultimately, I don’t know.

I have listed a number of potential advantages of striving to make data smaller: increased proximity, agility, connectability and testability (PACT) of the data-dependent components of an application. However, there are also trade-offs in pursuing small data.

I nonetheless believe that there are use cases in which ‘small data’ can be an attractive choice. These are, unsurprisingly, small systems that can live without the heavy baggage of a sophisticated data management system, or quickly evolving systems with many independent or semi-independent parts that would be constrained by having to continuously agree on a common data standard. Also, as a result of my research, I would expect small data systems to be more successful in supporting knowledge-intensive and unstructured work.

Data needs less complexity. We are spending too much time managing our database systems; time that would be better spent delivering new features of business value. Small data, following the definition provided here, is one possible way of simplifying data, with its own advantages and trade-offs. I will be happy to hear your thoughts and opinions.