Sessions at OSCON Data 2011 with video on Tuesday 26th July

Much has been made of scalability as a driver for choosing a database, but the choice of a database influences much more than the scaling architecture. Different database choices drive different data models, which in turn influence the development process.

We love data, and today we generate data in astronomical amounts. When we hit save on a document, snap a photo, or fill out a form online, we want to know that this data will persist, and we want to know that we can share, access, or reference it in the future. For any meaningful use, we need to know how data relates to other data.

Imagine for a moment doing a JOIN on two HBase tables. Crazy talk, right? Well, now you can, thanks to Hive. True, it is only meant to be used in a batch context, but we have been doing it for a few months now at StumbleUpon, and our analysts and engineers love it. This presentation will cover how the Hive-HBase integration works and how we use it at our company.

The last few years have brought a wealth of new data technologies organized around horizontal scalability. This talk will cover the essential infrastructure areas: real-time stream processing, offline data crunching, large-scale data deployments and live serving. The focus will be on how these ingredients come together to enable innovative data-driven products at LinkedIn.

Learn how to cobble together a PostgreSQL database, a few handy R packages, a pinch of language extensions, and a handful of publicly available datasets into a forest monitoring platform. The platform helps landscape managers make better decisions by using basic design-engineering paradigms to perform quick trade-off analyses.

This is the story of our development team and the lessons we learned in building Open Legislation, an open government platform. It will detail our transition from a MySQL back end to an application fully powered by Lucene, the data quality and efficiency issues that we’ve had to address, and how we’re now trying to rebuild internal trust after our iterative and initially shaky development process.

Synthetic biology is a new field where basic biological components can be engineered to create something new. It often involves DNA synthesizers, ligation, promoters, and polymerase chain reaction -- which may or may not be safe for your in silico environment. However, as the size and complexity of the systems increase, tools become more and more important, and so CAD for biology has emerged.

We'll present the architecture and implementation of a Node.js/DTrace-based distributed platform for analyzing the performance of cloud applications in real-time. We'll do a live demo on a real, internet-facing cloud and discuss some of the interesting performance pathologies we've found and explained using this tool.

Ever wondered what would happen if you could rethink a decade's worth of design changes? Drizzle is a redesign of the MySQL server targeted at web development and cloud infrastructure. Catch up on the latest features and use cases for Drizzle7, and on what is in store for the near future.

Sharing data is critical in a world where a crisis can occur at any moment. Often, valuable data is stored in disparate locations with no information on how to access it. This presentation discusses spatial data discovery and open source tools for implementing a data-sharing catalog. Esri’s Geoportal Server will be used to show sharing and discovery in action. The talk is open to all attendees.

Location-based services are hot, but geographic datasets are complex. That shouldn’t put you off writing awesome location-aware services. This talk will show how to create spatial models and query the OpenStreetMap dataset together with social data using the Neo4j graph database.

Solr, an open source enterprise search server, scales very well within an index (vertical scaling). It is when you have multiple indexes (horizontal scaling) that it starts to get hairy, which happens a lot when you are hosting a cloud-based solution for multiple users. In this session we will discuss these issues in depth, as well as techniques for overcoming them.