Abstracts

Many NoSQL presentations focus on product categorization, architectures, language bindings, and so on. Similarly, many NoSQL vendor presentations focus on product capabilities and ease of development. These issues are important in choosing the right product for a particular use case. However, very few NoSQL presentations cover the soft issues, such as developer skills and market analysis. From a business perspective, finding the right product is also about understanding the availability of skilled developers and ensuring that a vendor has a good product strategy, a revenue stream, and enough profitability to invest for the future. Since there are thought to be over 150 NoSQL database products, developers would also find such information valuable, so that they can wisely invest their time and effort in learning the right skills. This presentation will focus on these soft issues in greater detail.

We present StratioDeep, an integration layer between the Spark distributed computing framework and Cassandra, a NoSQL distributed database. Cassandra brings together the distributed system technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent and based on a P2P model without a single point of failure. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. For these reasons, C* is one of the most popular NoSQL databases, but one of its handicaps is that the schema has to be modeled on the queries that will be executed, because C* is oriented to search by key. Integrating C* and Spark gives us a system that combines the best of both worlds. Existing integrations between the two systems are not satisfactory: they basically provide an HDFS abstraction layer over C*. We believe this solution is not efficient because it introduces significant overhead between the two systems. The purpose of our work has been to provide a much lower-level integration that not only performs better but also lets Cassandra address a wide range of new use cases thanks to the power of the Spark distributed computing framework. We’ve already deployed this solution in real applications with diverse clients: pattern detection, log mining, fraud detection, sentiment analysis and financial transaction analysis. In addition, this integration is the building block for our challenging and novel Lambda architecture completely based on Cassandra. To complete the integration, we provide a seamless extension to the Cassandra Query Language. CQL is oriented to key-based search and, as such, is not a good choice for queries that move a huge amount of data. We’ve extended CQL in order to provide a user-friendly interface. This is a new approach for batch processing over C*. It consists of an abstraction layer that translates custom CQL queries to Spark jobs and delegates to Spark the complexity of distributing the query over the underlying cluster of commodity machines.

Elasticsearch 1.0 features a completely new way of doing analytics called Aggregations. Aggregations are more powerful than their predecessor, facets, but their concepts take a bit more time to grasp. This talk will introduce you to Aggregations step by step and show, through some use cases, how easy it is to extract useful information from your data.

From financial services to digital advertising, omni-channel marketing and retail, companies are pushing to grow revenue by personalizing the customer experience in real time, based on knowing what customers care about, where they are, and what they are doing now. For growing numbers of these businesses, this means developing applications that combine the historical analysis provided by Hadoop with real-time analysis through Storm and within NoSQL databases themselves. This session will examine the design considerations and development approaches for successfully delivering interactive applications that incorporate real-time and batch analysis using a combination of Hadoop, Storm and NoSQL. Key topics will include:
· A review of the respective roles that Hadoop, Storm and NoSQL databases play.
· Considerations in choosing which technology to use in areas where their capabilities overlap.
· An overview of a typical solution architecture.
· Strategies for addressing the diverse data types required for providing a complete view of the customers.
· Approaches to managing large data types to ensure reliable real-time responses.
Throughout the discussion, concepts will be illustrated by use cases of businesses that have implemented real-time applications using Hadoop, Storm and NoSQL, which are in production today.

Working in automated securities trading back in 2012, we experienced how poorly traditional data models and existing information products met the need for information value in a connected economic world. Now, in 2014, we have not only solved our own issue but also developed an integrated technology readily applicable to a broad range of similar issues in various industries, be it hospital patient case management or POS management in retail. We show how we combine enterprise data services, open data sources, and issue-specifically collected data in a unified information model, and lay out how we build a graph database that suits various applications and grows with user needs. Lastly, we cover our decision for a visual graph explorer front-end as an ‘ice-breaker’ for that part of the corporate world that considers graphs too complex a technology to implement.

Dr. Ernst-Georg Schmid – Single Point of Entry: Integrating relational and semi-structured data with PostgreSQL

Relational and NoSQL databases doubtless have their individual strengths and weaknesses. How about teaming them up? This session shows how to get the best of both worlds by using the PostgreSQL Object-Relational Database Server as a single point of entry to both relational and semi-structured data by means of HSTOREs and Foreign Data Wrappers – and why we just might want to do that at Bayer.

The combination of database systems and cloud computing is extremely attractive: unlimited storage capacities, elastic scalability and as-a-Service models seem to be within reach. This talk will give an in-depth survey of existing solutions for cloud databases that have evolved in recent years and provide a classification and comparison. This includes real-world systems (e.g. Azure Tables, DynamoDB and Parse) as well as research approaches (e.g. RelationalCloud and ElasTras). In practice, however, there are some unsolved problems. Network latency, scalable transactions, SLAs, multi-tenancy, abstract data modelling, elastic scalability and polyglot persistence pose daunting tasks for many scenarios. Therefore, we conclude with "Orestes", a research approach based on well-known techniques such as web caching, Bloom filters and optimistic concurrency control that demonstrates how existing cloud databases can be enhanced to suit specific applications.

Talk objectives: OSv is a new open source operating system written from scratch for the cloud initiated by Cloudius Systems. This talk describes problems encountered while making Apache Cassandra run on OSv as well as current and future OSv operating system and JVM optimizations that make NoSQL databases perform better in cloud deployments. Topics include: Networking performance, mmap I/O performance, ZFS tuning, JVM optimizations. More information on OSv at https://osv.io.

At teowaki we have a system for API usage analytics, with Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in just a few seconds, and we can look for usage patterns that wouldn’t be obvious otherwise. In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.

Early NoSQL databases were designed in the shadow of the CAP theorem. We were told to "pick 2 out of 3 – you can't have it all." Well, maybe you can't have it all, but was throwing out ACID transactions the best way forward? NoSQL innovator Google seems to think not, and claims its new distributed database Spanner is both consistent and highly available. Have they beaten CAP? This talk will:
- Examine the impact of CAP on early NoSQL systems
- Explore the design space of consistent systems
- Answer the question of what price we pay for transactions in a distributed system
- Propose a path towards a better NoSQL

John Sumsion – Consistency without Transactions for Global Family Tree

This talk is about lessons learned while reimplementing FamilySearch's large, interactive Family Tree application on Cassandra. The model we chose retains the consistency that our users demand, and is able to be implemented without requiring ACID transactions. FamilySearch hosts a large, dynamic Family Tree to enable the world to piece together our collective genealogy. Up to now this application has been hosted on an Oracle database, but we started to hit performance problems, and found ourselves close to the edge of the number of concurrent users we could support. We needed more performance and more scalability at the core data layer. The dataset has resisted sharding in the past, so the straightforward NoSQL sharding approaches weren't an option. We engaged in a from-scratch rewrite to see if we could build a comparable application on NoSQL technology. This talk is about lessons learned during the reimplementation. Specifically, the consistency model we chose combined a Convergent and Commutative Replicated Data Type (CvRDT and CmRDT) with Cassandra's atomic batch implementation to form the basis for a consistency model that met the demanding needs of the Family Tree application.

This topic will introduce the Apache Cassandra native protocol, asynchronous native drivers and Cassandra Query Language (CQL). It is important for developers to be aware of this new asynchronous way of integrating with and querying Cassandra – without using Thrift or RPC. There are various ways of tuning that integration and modelling your data – all intended to make it easier and more productive to build against Cassandra with some additional performance benefits.

Konrad Beiske – Maintaining a quorum throughout the lifecycle of your Elasticsearch cluster

Elasticsearch allows for scaling both up and down and at the same time responding to indexing and search requests. Elasticsearch also allows for a configurable resiliency to node failures. This flexibility sometimes comes at the cost of not being all that foolproof. In this talk I will present the measures we have undertaken to minimize the risk of data corruption and maximize the uptime during migration between different cluster configurations.

In this talk we will present real-life use cases of integrating search and NoSQL solutions into a classical information management application. We will focus on the advantages of such a hybrid approach for an otherwise purely relational database use case. We will also present the technical challenges, propose patterns for polyglot persistence, and address issues with synchronization between the relational databases and the NoSQL ecosystem. The presentation will give you a good comparison between relational databases / SQL and Lucene / NoSQL on all levels (functionality, semantics, performance, user experience).

Mahesh Paolini-Subramanya – Drinking from the Firehose – the NoSQL Way

Imagine that you have tens of millions of endpoints, each of which is sending you a constant stream of data – to the tune of petabytes per second. You also have millions of users who want to monitor and administer them. And don't forget all the historical reports that these users insist upon all the time. And imagine that these users and endpoints are distributed all over the world. This pretty much describes our environment – one that we've implemented through the magic of Erlang, as well as not just Riak, but also ElasticSearch, *and* Cassandra (because why go with one, when you can have all three!). Join me as I show you how we successfully drink from this firehose of data without spilling a drop.

The iDating industry cares about interactions and connections. Those two concepts are closely linked. If someone has a connection to another person, through a shared friend or a shared interest, they are much more likely to interact. Graph databases are optimized for querying connections between people, things, interests, or really anything that can be connected. Dating sites and apps worldwide have begun to use graph databases to achieve competitive gain. Neo4j provides thousand-fold performance improvements and massive agility benefits over relational databases, enabling new levels of performance and insight. Join Amanda Laucher to discuss the five graphs of love, and how companies like eHarmony, Hinge and AreYouInterested.com, are now using graph databases to create algorithms to help people find more interactions, connections and hopefully love!

Data security and privacy are critical concerns in today’s connected world. Data collected from new sources such as social media, logs, mobile devices and sensor networks has become as sensitive as traditional transaction data generated by back-office systems. For this reason, big data technologies must evolve to meet the regulatory compliance standards demanded by industry and government. This session provides an overview of MongoDB’s security architecture, including authentication, authorization, auditing and encryption, collectively designed to defend, detect and control access to valuable online big data.

Get a practical demonstration of creating from scratch a working iOS app that uses Couchbase Lite, the native mobile NoSQL database. You'll see how to save JSON directly on the device and then how to effortlessly sync data to and from your main database on a remote Couchbase server.

This talk gives an overview of and a technical insight into the recently added sharding capabilities of the multi-model NoSQL database ArangoDB. We will explain the software architecture behind a cluster of ArangoDB processes, their various roles, as well as the failover methods to increase reliability. We will continue with a live demonstration including the graphical interface for cluster management and finish by sketching the current state of the implementation and the roadmap. This presentation will contain technical details but should also be enjoyable for a wider audience.

Building a "single page application" should be as easy as visiting a SPA. A lot of browser frameworks exist to make it easy for developers to create single page web applications. In a lot of cases only minimal backend support is needed. NoSQL databases are the perfect match for such an application, as they allow you to put your domain model directly into your database. In this talk I create a simple single page application live and on stage. Using Angular.js for the frontend, I explain how to model a backend API and realise it in ArangoDB's Foxx.

Newly released, the latest major 2.0 version of Neo4j makes the graph data model even more accessible. With the data model extended by node labels and new optional schema information, it has become much easier to model and manage your graph data. The new, slick Neo4j-Browser is an interactive workbench for incrementally working out the best way to visualize and execute your use cases with our graph query language Cypher. Cypher has also become more useful, with comprehensive support for labels, merge operations and document-like data structures. Technically, we added a new endpoint to Neo4j Server that allows you to talk more efficiently and transactionally over the wire. This talk will show all the changes in a hands-on presentation and also discuss how they apply to concrete use cases.

With the proliferation of data sources and growing user bases, the amount of data generated requires new ways of storage and processing. Hadoop opened new possibilities, yet it falls short of instant delivery. Adding stream processing with Apache Storm can overcome this delay and bridge the gap to real-time aggregation and reporting. In the Batch layer, all master data is kept and is immutable. Once the base data is stored, a recurring process indexes the data: it reads all master data, parses it, and creates new views from it. The new views replace all previously created views. The Speed layer stores data not yet absorbed into the Batch layer: hours of data instead of years of data. Once the data is indexed in the Batch layer, it can be discarded from the Speed layer. The Query Service merges the data from the Speed and Batch layers. This presentation focuses on the Lambda architecture of Nathan Marz, which combines multiple technologies to process vast amounts of data while still being able to react in a timely manner and report near-real-time statistics, with the added bonus of a high tolerance for human and system failure.
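The interplay of the three layers described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the page-view events, the `build_batch_view`, `SpeedLayer` and `query` names, and the in-memory counters are all hypothetical stand-ins, and neither Hadoop nor Storm is involved.

```python
from collections import Counter


def build_batch_view(master_data):
    """Batch layer: recompute the whole view from the immutable master data."""
    view = Counter()
    for event in master_data:
        view[event["page"]] += 1
    return view


class SpeedLayer:
    """Speed layer: holds only events the batch layer has not absorbed yet."""

    def __init__(self):
        self.view = Counter()

    def record(self, event):
        self.view[event["page"]] += 1

    def discard(self):
        # Called once a batch run has indexed the same events.
        self.view.clear()


def query(batch_view, speed_layer, page):
    """Query service: merge the batch view with the speed view."""
    return batch_view[page] + speed_layer.view[page]


# Master data is append-only (hypothetical page-view events).
master_data = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = build_batch_view(master_data)
speed = SpeedLayer()

# A new event arrives after the last batch run.
new_event = {"page": "/home"}
master_data.append(new_event)
speed.record(new_event)
home_before = query(batch_view, speed, "/home")

# The next batch run replaces the old view; the speed layer is emptied.
batch_view = build_batch_view(master_data)
speed.discard()
home_after = query(batch_view, speed, "/home")
```

The query result is the same before and after the batch run, which is the point of the design: the speed layer only covers the gap until the next full recomputation.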

Nicolas Dalsass and Yann Schwartz – An eventful journey to real-time joins with Storm and Kafka

To join logs, we used to rely on Hadoop to "brute force" the problem: crunch terabytes of data daily with a whole lot of computers, and get the refined logs. But with data pouring in from 6 DCs worldwide, there's only so much you can cram into transcontinental pipes. So we started to do part of the job online, locally, in real-time joins to save on IO, bandwidth and CPU. However, joining up to 1 million events/sec in real time is no picnic, and we happily hit a lot of walls on our way to finding a scalable, fault-tolerant architecture. In this talk, we'll present what worked and what didn't in the building of our backend solution, both from the hardware infrastructure point of view and the software design one. In particular, we'll focus on our Kafka design, the implications of the loose coupling between write and read orders on your Storm topology, and the tradeoffs you're bound to make.

The goal of this talk is to shed light on the use of JavaScript for building applications with NoSQL. Promoted by the so-called "MEAN" application stack, JavaScript is becoming increasingly mainstream for building web applications. I will first explain the background behind "MEAN" (MongoDB/Express/Angular/Node) and the role of JavaScript there. We then look at the ecosystem of Node and how you can manage data with JavaScript and different data stores such as Redis, ArangoDB/Foxx and ElasticSearch.

Pavlo Baron – Data on its way to history, interrupted by analytics and silicon

In this talk I will introduce the current state of work on a system which is intended to run immediate, continuous analytics on never-ending streams of data in the range of multiple millions of events per second, running on as small an amount of hardware as possible. I will explain what we aim for, why we have decided to go with the approach of maximum, inexorable “mechanical sympathy”, why we have decided to go with the JVM after testing different options, what we use from the JVM at all and especially what we don’t use, how we approach the arrival of so many events, how we approach time control to ensure as little drift as possible, how we cut and parallelise computations, how we split data streams, what algorithms and approximations we use on what level of implementation, and what hardware accelerators we are looking into and why. I would also like to show our current challenges and as yet unsolved issues.

An einfache Hadoop Scheduler

Reporting is a must in every analytical platform; it has to be easy and intuitive for everyone to get what they want, when they want it. With this idea in mind, we wanted to keep our systems easy to use, not only for us but also for the end users, who usually do not have much computer experience. In this talk we share our experience crafting a custom scheduling solution for a virtualized Hadoop environment responsible for dealing with big sensor data. Aiming to be a technical overview that includes code details and hands-on experiences, the main motivation is to share and discuss common approaches to that problem.

Written and directed by: Pere Urbon-Bayes & The Team.
Producers: The final Client.
Cast: Ruby as The programming language. JRuby as The virtual machine. Sinatra as The web framework. Hadoop as The data processing framework. PIG as The scripting language.
Design effects: Apache PDFBox as The PDF craftsman. JFreeChart as The Charting director.
Accounting: Neo4j as The graph database. Redis as The CFO.

High Frequency Trading places many technical demands on persisted data. This talk will look at the techniques used in storing, processing and reporting that are common to mainstream NoSQL applications, but also at some of the requirements specific to HFT. One such requirement is persistence with one-microsecond latency and replication with less than 10 microseconds of latency. The talk will include how this open source Java technology can be used with other NoSQL databases.

Driven by developments on the internet, there is a continuously growing demand for analyzing ever-increasing volumes of data. PowerDrill is successfully used to interactively analyze a trillion cells. To reach a new level of usability, we explore how to sacrifice computing accuracy for speed. We present approximation techniques, based on a combination of sampling and data sketches, that we developed and that increased the efficiency of the execution engine by an order of magnitude. We also discuss usability questions related to showing approximate results, and performance observations on a large data set generated from Google web logs.
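To make the idea of a data sketch concrete, here is a minimal example of one well-known sketch, a K-minimum-values (KMV) estimator for the number of distinct items. The abstract does not say which sketches PowerDrill uses, so this choice, along with the names `hash01` and `estimate_distinct` and the default `k=256`, is purely an assumption for illustration.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(items, k=256):
    """Estimate the distinct count from the k smallest hash values seen."""
    kept = set()   # the hash values currently kept
    heap = []      # same values, negated, so heap[0] is the largest kept hash
    for item in items:
        h = hash01(item)
        if h in kept:
            continue  # duplicate of a value we already keep
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# 50,000 rows containing 10,000 distinct values (synthetic data).
rows = [i % 10000 for i in range(50000)]
approx = estimate_distinct(rows)
```

The sketch keeps at most `k` numbers regardless of input size, which is the accuracy-for-speed trade the abstract refers to: memory and work per row stay bounded, and the answer is approximate rather than exact.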

NoSQL databases are often limited in the types of queries they can support due to the distributed nature of the data. In this session we will learn patterns for overcoming this limitation and combining multiple query semantics with NoSQL-based engines. We will demonstrate specifically a combination of key/value, SQL-like, document model and graph-based queries, as well as more advanced topics such as handling partial updates and querying through projection. We will also demonstrate how we can create a mashup between those APIs, i.e. write fast through the key/value API and execute complex queries on that same data through SQL queries.

Riak 2.0 has built-in data types. These data types (called CRDTs) converge automatically. No more siblings. This talk looks at why eventual consistency is often good enough, and how some well-designed data primitives from academia make life easier than ever for developers.
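As an illustration of what "converge automatically" means, the following is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is not Riak's implementation or API; the `GCounter` class and the replica names are hypothetical.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        # A replica only ever increases its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Max is commutative, associative and idempotent, so replicas can
        # exchange state in any order, any number of times, and still agree.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)


# Two replicas accept writes independently, then exchange state.
a = GCounter("a")
b = GCounter("b")
a.increment()
a.increment()
b.increment()
a.merge(b)
b.merge(a)
```

After the exchange both replicas report the same value, and merging again changes nothing, so there are no conflicting siblings for the application to resolve.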

With NoSQL, NewSQL and plain old SQL, there are so many tools around that it’s not always clear which is the right one for the job. This is a look at a series of NoSQL technologies, comparing them against traditional SQL technology. I’ll compare real use cases and show how they are solved with both NoSQL options and traditional SQL servers, and then see who wins. We’ll look at some code and architecture examples that fit a variety of NoSQL techniques, and some where SQL is a better answer. We’ll see some big data problems, little data problems, and a bunch of new and old database technologies to find whatever it takes to solve the problem. By the end you’ll hopefully know more NoSQL, maybe even have a few new tricks with SQL, and, what’s more, know how to choose the right tool for the job.

Stratosphere (getstratosphere.org) is a next-generation Apache-licensed platform for Big Data Analysis. Stratosphere offers an alternative runtime engine to Hadoop MapReduce, but is compatible with HDFS and YARN. The backend follows the principles of MPP databases, but is not restricted to SQL. The streaming runtime operates in memory, gracefully degrading to disk if needed. Stratosphere is programmable via Java or Scala APIs with common operators like map, reduce, join, cogroup, and cross. Stratosphere includes a cost-based optimizer that automatically picks data shipping strategies and reuses prior sorts and partitions. Finally, Stratosphere features first-class support for iterative programs, achieving similar performance to Giraph without being a graph-specific system. Stratosphere is a mature codebase, developed by a growing developer community, and is currently witnessing its first commercial installations.

Recent developments in deep learning make it possible to improve time series databases. I will show how these methods work and how to implement them using Apache Mahout. Systems such as the Open Time Series Database (Open TSDB) make good use of the ability of HBase, MapR tables and related databases to store columns sparsely. This allows a single row to store many time samples and allows raw scans to retrieve a large number of samples very quickly for visualization or analysis. Typically, older data points are batched together and compressed to save space. At high insertion rates, this approach falters, largely because of the limited insert/update rate of HBase. In such situations, it is often better to buffer short segments of data and insert batches that span short time ranges rather than inserting individual data points. When inserting compressed batches in this fashion, there are a number of obvious strategies that can be used. General compression utilities such as gzip do not normally provide particularly high compression rates. Bespoke compression systems may provide point solutions with high compression rates, but they are generally fairly time-intensive to develop. I will describe how deep learning and sparse-coding techniques can be used to build systems that have very high compression levels (50x or more is typical) and which have the very interesting property that the resulting compressed data can often be queried or analyzed directly without ever decompressing the data. Moreover, it is possible to selectively decompress signals only from desired time ranges within a compressed batch. These new techniques for building time series databases enable some exciting capabilities. The benefits include the ability to do query push-down into the time series database from systems like Apache Drill, better visualization systems, and the ability to build an interesting form of anomaly detector on top of the time series database. I will describe how to build these systems using Apache Mahout and illustrate the results with several real examples.
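The batching strategy mentioned above (inserting compressed batches that span short time ranges instead of individual points) can be sketched as follows. This uses `zlib` as a stand-in for the general-purpose compression baseline, not the deep learning approach of the talk; the 60-second window, the record layout and the function names are assumptions, and no actual HBase or Open TSDB call is made.

```python
import struct
import zlib
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical width of one batch


def make_batches(samples, window=WINDOW_SECONDS):
    """Group (timestamp, value) samples by time window and compress each group.

    Returns {window_start: compressed_bytes}; each entry would become one
    insert instead of one insert per data point.
    """
    groups = defaultdict(list)
    for ts, value in samples:
        groups[ts - ts % window].append((ts, value))
    batches = {}
    for start, points in groups.items():
        # Pack each point as a little-endian int64 timestamp and float64 value.
        raw = b"".join(struct.pack("<qd", ts, value) for ts, value in sorted(points))
        batches[start] = zlib.compress(raw)
    return batches


def read_batch(blob):
    """Decompress one batch back into a list of (timestamp, value) tuples."""
    raw = zlib.decompress(blob)
    return [struct.unpack_from("<qd", raw, offset) for offset in range(0, len(raw), 16)]


# Three minutes of one-second samples (synthetic data).
samples = [(t, float(t % 7)) for t in range(180)]
batches = make_batches(samples)
second_minute = read_batch(batches[60])
```

Here 180 individual points turn into 3 writes. Note that a batch compressed this way must be fully decompressed before it can be read, which is exactly the limitation the techniques in the talk aim to remove.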

NoSQL data stores represent a specialized design trade-off in data storage. By giving up full compatibility with the de facto SQL standards in terms of joins, transactions, optimizers and query languages, NoSQL systems have been able to benefit in terms of expressivity with document-oriented databases, in terms of speed with specialized in-memory databases, and in terms of scalability when heavily and automatically sharded. Changing these trade-offs has been very successful for the NoSQL community. The benefits of these trade-offs are now well known, and the once highly controversial statement by Michael Stonebraker, "one size does not fit all", is now completely standard dogma. There is more coming, however. Combining a more flexible data model with other advanced technologies such as deep learning can give extraordinary results in NoSQL architectures. Furthermore, the benefits of cross-fertilization are bigger with NoSQL than with traditional technologies. I will describe examples of these results and outline several directions that NoSQL practitioners can take to discover new approaches.

CouchDB is one of the relatively smaller NoSQL options that are flying around at the moment, but that doesn't mean it doesn't pack a punch when used to solve the right problems. In this talk we'll look at the areas where CouchDB excels, and examine some of the mechanisms it uses to make this possible. In addition, we'll take a quick walk through a real deployment of a CouchDB network, backing a large multi-site private-cloud web service with millions of users, and look at some of the benefits and problems CouchDB can bring in practice, in this scenario and others.

Enterprise developers are used to running unit tests against embedded JDBC databases, either manually or automated within a CI environment. Doing so with NoSQL databases isn't that easy, mostly because they have non-standard APIs and are harder to embed in-process. NoSQLUnit may be a solution to this issue, at least for the most common NoSQL datastores like Cassandra, Redis, MongoDB and some more. NoSQLUnit is a JUnit extension that launches the database process and loads test data with just a few lines of code. I'll also show how to use the PaaS "Travis CI" to run JUnit tests depending on NoSQL datastores.

One major challenge when using NoSQL databases is eventual consistency. Applications have relied on ACID transactions and their guarantees for a long time. But the CAP theorem tells us that we need to relax our consistency requirements in the Cloud era of massively scalable, highly available and partition-tolerant distributed systems. We present some real-life insights into where we faced problems with eventual or weak consistency and the pitfalls we had to overcome to realize the Cloud backend of our mobile app bestellbar. Fortunately, most NoSQL databases offer features for strongly consistent queries and transactions for special cases. We demonstrate our usage of entity groups and ancestor queries with the Google App Engine datastore to obtain strongly consistent results. But as there is no free lunch, we also discuss the drawbacks and limitations of these features and how to apply them reasonably without sacrificing the scaling potential of your Cloud infrastructure.

One of the biggest challenges of most NoSQL databases is the lack of ACID transactions. Latencies, partial failures and all the other imponderabilities of distributed systems make it very likely that you will end up with inconsistent data sooner or later. Is there anything we developers can do about it? Yes, there is! This session will show you two powerful tools that can help you to get rid of inconsistent data for good in your NoSQL database: Quorums and CRDTs (Conflict-free Replicated Data Types). Quorums help you to avoid reading stale data and fine-tune your availability vs. consistency requirements, while CRDTs introduce self-healing power to your data in the face of network partitioning. You will learn about the concepts of those two tools and how to implement them in practice down to the code level – plus tips, tricks, alternatives and limitations. Release the power of your data and do not shed any tears over missing ACID transactions anymore!
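Why quorums avoid stale reads can be checked with a small simulation. It assumes the usual rule of thumb that a read quorum R and a write quorum W over N replicas must overlap (R + W > N); the versioned tuples and the function names are hypothetical and not tied to any particular database.

```python
from itertools import combinations


def write(replicas, targets, version, value):
    """Store a versioned value on the replicas that acknowledged the write."""
    for i in targets:
        replicas[i] = (version, value)


def read(replicas, targets):
    """Ask the given replicas and keep the answer with the highest version."""
    return max(replicas[i] for i in targets)


def always_reads_latest(n, w, r):
    """Try every write set of size w and read set of size r over n replicas."""
    for write_set in combinations(range(n), w):
        replicas = [(1, "old")] * n
        write(replicas, write_set, 2, "new")
        for read_set in combinations(range(n), r):
            if read(replicas, read_set) != (2, "new"):
                return False  # some read missed the latest write
    return True


overlapping = always_reads_latest(n=3, w=2, r=2)   # R + W > N
too_small = always_reads_latest(n=3, w=1, r=1)     # R + W <= N
write_all = always_reads_latest(n=3, w=3, r=1)     # R + W > N
```

With overlapping quorums every read set shares at least one replica with the last write set, so the newest version is always seen. Lowering R or W improves availability and latency at the price of possibly stale reads, which is the tuning knob the abstract mentions.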