Tag Archives: JSON

AiMED Stat is a startup working to facilitate better medical information capture, analysis, and reporting through web and mobile technologies. They provide clinicians with easy-to-use tools and provide researchers with direct access to real-time information capture from the front lines of medicine. They recently worked with the audiology clinic at University of Western Ontario (UWO) and used a Riak system to help the University collect and search data related to the research.

In general, innovation in health research databases has stagnated, with many companies simply opting for a legacy relational system like MySQL or PostgreSQL. AiMED Stat, however, recognized the limitations of these systems. With a relational system, researchers had to decide on their schemas at the start of a study, yet a few months in they would often need to update data or collect it in a different way. That meant altering the entire table, which involved a very costly data migration. As AiMED Stat set out to manage and present research data in a better way, it simply wasn’t feasible for their two-person team to manage a costly data migration every time there was a data update. So they began to look at more flexible NoSQL databases as a replacement.

They first looked at MongoDB, but soon learned that MongoDB wouldn’t be able to handle their high write volumes without losing data. In clinical research, data loss is never acceptable as it can skew results. They then looked at Cassandra; however, for a small team, they found Cassandra to be too complex to operate efficiently. Finally, they evaluated Riak. They were immediately drawn to Riak’s flexible data model, schemaless design, and ability to scale out quickly. In 2011, they brought Riak into production as the backend of their research data application.

“We set out to create an application that stores and queries data in a way researchers understand,” said Kartik Thakore, Co-Founder at AiMED Stat. “By using Riak to power our application, it gives us a sizable competitive advantage (relative to other electronic audiograms). Its flexibility allows us to store data exactly as needed, its ease-of-scale eliminates the chunk of our budget previously dedicated to data migration, and its high availability ensures we never have to worry about losing data. Riak is a breath of fresh air – it does exactly what we need it to do.”

Their Riak application enables rich HTML5 forms for data collection, using a method that increases compliance and data integrity at the point of capture. From data collection, demographic identifiers are used as the key in Riak and values are stored as JSON. Riak pre- and post-commit hooks are used to further validate the data. Additionally, Riak Search, Secondary Indexes, and MapReduce are all used to allow researchers to store and search data (via a D3.js-enabled application) using an Audiogram shown below:

This Audiogram allows researchers to easily search within the graph to find and compare patients that match certain audiological profiles. The quicker researchers can find patients for their study, the quicker they can get funding, making this queryability imperative.

AiMED Stat is currently running five nodes in production and looking to scale out as they grow. “For us, the importance is not on big data but on never losing data,” continued Kartik. “With Riak, we can rest assured that all our data is archived and accessible, regardless of scale or write volume.”

Amherst College is a private liberal arts college in Massachusetts that enrolls about 1,800 undergraduates. Their Archives & Special Collections houses rare books, literary manuscripts, and unique and rare materials documenting the College and its history. Its collections include many of Emily Dickinson’s original poems and letters. The Amherst College Library has been working to digitize images, manuscripts, and rare books in the Archives, and improve access to a large collection of digital images used in the teaching of art and architecture. They currently have 140,000 objects in their digital collections and they are adding up to 10,000 new objects each month.

Fedora (the underlying digital asset management system used by many colleges) is used for archiving, storing, and managing these documents. While it has the ability to support the number of objects being stored, Fedora tends to favor object fixity checks (checksums) and XML schema validation over speedy response times. It has worked for Amherst in terms of digital preservation and metadata support, but they have run into problems with its ability to handle high levels of concurrency (such as when Bon Appétit Magazine directed users to an Emily Dickinson manuscript featuring a recipe for doughnuts: acdc.amherst.edu/view/asc:17832). They use Riak as the intermediary layer between Fedora and the web, and as a huge caching mechanism for all of their data.

Previously, they were using a PHP app that directly accessed Fedora. While this solution worked, it was resource intensive and too slow for most purposes. It also wouldn’t allow them to grow their repository at the rate needed. They evaluated a few different systems (including CouchDB and MongoDB), but found Riak’s lack of sharding made it extremely easy to scale and offered better fault tolerance than the others.

Amherst brought Riak into production earlier this year. They are storing around one million objects in Riak across four nodes. Riak unifies all of the XML- and RDF-based metadata about each of their digitized objects (such as structural metadata in RDF and descriptive metadata in MODS) and stores it in a single JSON structure. When querying, they typically utilize the general key/value lookup or run MapReduce jobs. Since moving to Riak, their entire system is now an order of magnitude faster.

“We have been extremely happy with Riak and what it provides,” says Aaron Coburn, Systems Administrator at Amherst College. “While most of the objects stored aren’t publicly available, Riak still allows us to make over 2,000 manuscripts available to the world.”

This post is an example of how you can solve a practical querying problem in Riak with Map-Reduce.

The Problem

This query problem comes via Jakub Stastny, who is building a task/todolist app with Riak as the datastore. The question we want to answer is: for the logged-in user, find all of the tasks and their associated “tags”. The schema looks kind of like this:

Each of our domain concepts has its own bucket – users, tasks and tags. User objects have links to their tasks, tasks link to their tags, which also link back to the tasks. We’ll assume the data inside each object is JSON.

The Solution

We’re going to take advantage of these features of the map-reduce interface to make our query happen:

1. You can use link phases where you just need to follow links on an object.
2. Inputs to map phases can include arbitrary key-specific data.
3. You can have as many map, reduce, and link phases as you want in the same job.
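As a starting point, the job's inputs and its first phase might look like the minimal sketch below. It assumes the logged-in user is stored under a hypothetical key `jakub` in the `users` bucket; the bucket names follow the schema described above, and the link phase follows the user's links into the `tasks` bucket.

```javascript
// Hypothetical key for the logged-in user; not taken from the original post.
var userKey = "jakub";

var job = {
  // Inputs are a list of [bucket, key] pairs; we start from the user object.
  inputs: [["users", userKey]],
  query: [
    // Link phase: follow the user's links that point into the "tasks" bucket.
    { link: { bucket: "tasks" } }
  ]
};

console.log(JSON.stringify(job));
```

The output of this phase is the set of task objects, which is what the following map phase operates on.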

Now that we’ve got all of the user’s tasks, we’ll use this map function to extract the data we need from each task (including the links to its tags) and pass it along to the next map phase as the keydata. It reads the task data as JSON, filters the object’s links down to those in the “tags” bucket, and then combines those links with our custom data to feed the next phase.

Here’s the phase that uses that function:
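A minimal sketch of such a map function and its phase follows. It assumes the object shape that Riak passes to JavaScript map functions (`value.values[0].data` for the stored body, `value.values[0].metadata.Links` as a list of `[bucket, key, tag]` triples). The name `extractTaskAndTagLinks` is made up for illustration, and the function sticks to older JavaScript syntax as a precaution since it would run inside Riak's embedded engine.

```javascript
// Sketch of the first map phase (illustrative name, assumed object shape).
var extractTaskAndTagLinks = function (value, keyData, arg) {
  // Parse the stored task body as JSON.
  var task = JSON.parse(value.values[0].data);
  var links = value.values[0].metadata.Links || [];
  var outputs = [];
  for (var i = 0; i < links.length; i++) {
    // Keep only links that point into the "tags" bucket.
    if (links[i][0] === "tags") {
      // [bucket, key, keydata]: the task data rides along as keydata.
      outputs.push([links[i][0], links[i][1], task]);
    }
  }
  return outputs;
};

// The phase specification carries the function as source text.
var taskPhase = {
  map: { language: "javascript", source: extractTaskAndTagLinks.toString() }
};
```

Each `[bucket, key, keydata]` triple returned here becomes one input to the next map phase, so a task with three tags produces three inputs that all carry the same task data.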

Now in the next map phase (which operates over the associated tags that we discovered in the last phase) we’ll insert the tag object’s parsed JSON contents into the “tags” list of the keydata object that was passed along from the previous phase. That modified object will become the input for our final reduce phase.

Here’s the phase specification for this phase (basically the same as the previous except for the function):
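One way this second map phase could be written is sketched below. It assumes the keydata arrives as the already-parsed task object forwarded by the previous phase; `attachTag` is an illustrative name, not one from the original post.

```javascript
// Sketch of the second map phase: runs once per tag object.
var attachTag = function (value, keyData, arg) {
  // Parse the tag object's stored JSON body.
  var tag = JSON.parse(value.values[0].data);
  // keyData is assumed to be the task object passed along from the prior phase.
  var task = keyData;
  task.tags = task.tags || [];
  task.tags.push(tag);
  // Emit the task, now carrying this one tag, for the reduce phase to collate.
  return [task];
};

var tagPhase = {
  map: { language: "javascript", source: attachTag.toString() }
};
```

Because this function runs once per tag, the same task can appear several times in its output, each copy holding a single tag. Merging those copies is the job of the reduce phase.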

Finally, we have a reduce phase to collate the resulting objects with their included tags into single objects based on the task name.

Our final phase needs to return the results, so we add "keep":true to the phase specification:
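The reduce step and its phase might be sketched as follows. This assumes each task object has a `name` field (the text only says tasks are collated "based on the task name") and a `tags` list as built by the previous phase; `collateByTaskName` is an illustrative name.

```javascript
// Sketch of the reduce phase: merge objects that share a task name.
var collateByTaskName = function (values, arg) {
  var byName = {};
  var names = []; // remembers first-seen order of task names
  for (var i = 0; i < values.length; i++) {
    var task = values[i];
    var tags = task.tags || [];
    if (!Object.prototype.hasOwnProperty.call(byName, task.name)) {
      // First sighting: shallow-copy the task so inputs are not mutated.
      var merged = {};
      for (var field in task) {
        if (Object.prototype.hasOwnProperty.call(task, field)) {
          merged[field] = task[field];
        }
      }
      merged.tags = tags.slice();
      byName[task.name] = merged;
      names.push(task.name);
    } else {
      // Seen before: append this copy's tags to the collected list.
      byName[task.name].tags = byName[task.name].tags.concat(tags);
    }
  }
  var results = [];
  for (var j = 0; j < names.length; j++) {
    results.push(byName[names[j]]);
  }
  return results;
};

// "keep": true asks Riak to return this phase's output to the client.
var collatePhase = {
  reduce: {
    language: "javascript",
    source: collateByTaskName.toString(),
    keep: true
  }
};
```

The function is written so that feeding its own output back in yields the same result, which matters because a reduce function may be applied more than once over partial results.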

Here’s the final format of our Map/Reduce job, with indentation for clarity:
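The overall shape of the job can be sketched as below: one link phase, two map phases, and a final reduce phase that keeps its results. The helper `buildTaskTagJob`, the user key `jakub`, and the placeholder function sources are all hypothetical; in a real job each `source` would hold the full text of the corresponding function.

```javascript
// Assemble the four-phase job from a user key and three function sources.
// sources.task, sources.tag, and sources.collate are JavaScript source strings.
function buildTaskTagJob(userKey, sources) {
  return {
    inputs: [["users", userKey]],
    query: [
      // 1. Follow the user's links to their tasks.
      { link: { bucket: "tasks" } },
      // 2. Extract task data and forward it to each linked tag.
      { map: { language: "javascript", source: sources.task } },
      // 3. Attach each tag's contents to the forwarded task data.
      { map: { language: "javascript", source: sources.tag } },
      // 4. Collate by task name and return the results.
      { reduce: { language: "javascript", source: sources.collate, keep: true } }
    ]
  };
}

// Placeholder sources stand in for the real function bodies.
var exampleJob = buildTaskTagJob("jakub", {
  task: "function(value, keyData, arg) { /* first map function */ }",
  tag: "function(value, keyData, arg) { /* second map function */ }",
  collate: "function(values, arg) { /* reduce function */ }"
});

console.log(JSON.stringify(exampleJob, null, 2));
```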

I input some sample data into my local Riak node, linked it up according to the schema described above and this is what I got:

Conclusion

What I’ve shown you above is just a taste of what you can do with Map/Reduce in Riak. If the above query became common in your application, you would want to store those phase functions we created as built-ins and refer to them by name rather than by their source. Happy querying!

We are happy to announce the release of Riak 0.8, available for download immediately. Riak 0.8 features a number of enhancements to the core map/reduce machinery that will make Riak more accessible to a wider audience. The biggest enhancement is the ability to write map/reduce queries in JavaScript. We’re using our erlang_js project to integrate Mozilla’s SpiderMonkey engine directly into Riak to keep overhead to a minimum.

We’ve also built a spiffy REST API for submitting map/reduce queries. Queries are described in JSON and POST-ed to the Riak server. Results are sent back as JSON for your processing pleasure. And, the REST interface supports streaming results for large result sets, too.
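As a rough illustration of describing a query in JSON and POST-ing it, the sketch below builds such a request. It assumes a node reachable at `http://localhost:8098` with the map/reduce endpoint at `/mapred`, a hypothetical `tasks` bucket and `task1` key, and a runtime that provides a global `fetch`; `buildMapredRequest` and `runJob` are illustrative helper names.

```javascript
// Build the URL and fetch options for a map/reduce request (assumed endpoint).
function buildMapredRequest(baseUrl, job) {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/mapred",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(job)
    }
  };
}

// Send the job and parse the JSON results. Needs a running Riak node,
// so it is defined here but not called.
async function runJob(baseUrl, job) {
  var request = buildMapredRequest(baseUrl, job);
  var response = await fetch(request.url, request.options);
  if (!response.ok) {
    throw new Error("map/reduce request failed: " + response.status);
  }
  return response.json();
}

// A small example job: read one object and return its body parsed as JSON.
var simpleJob = {
  inputs: [["tasks", "task1"]],
  query: [{ map: { language: "javascript", name: "Riak.mapValuesJson" } }]
};

console.log(buildMapredRequest("http://localhost:8098", simpleJob).url);
```

Keeping request construction separate from sending makes it possible to inspect the JSON payload without a server on hand.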

To kick it all off, we’ve put together a short screencast demonstrating how to use Riak’s flashy new features. You can watch it below, or view it on Vimeo. There’s also a slew of bug fixes and optimizations included in Riak 0.8. See the release notes for all the juicy details.