SAMOA is an open-source platform for mining big data streams that runs on several distributed stream processing engines (such as S4 and Storm), and includes streaming algorithms for the most common machine learning tasks such as classification and clustering.
More info at http://samoa-project.net

How to extend your toolbox to solve more big data problems with less effort.
AWS provides a set of big data services that are elastic, scalable, and highly available out of the box. Learning best practices for integrating them with each other and with your architecture strengthens your ability to deliver fast and reliable big data solutions.

14:35-15:15 (40m)
Hadoop & Beyond

Building an intelligent big data app in 30 minutes

Claudiu Barbura (Ubix), David Talby (Pacific AI)

Live demo of building an intelligent big data application from a web console. The tools and APIs behind it are built on top of Spark, Shark, Tachyon, Mesos, Aurora, Cassandra, and IPython, and include an ELT pipeline (ingestion and transformation), a data warehouse explorer, export to NoSQL with generated APIs, predictive model building, training, and publishing, a dashboard UI, and monitoring and instrumentation.

16:05-16:45 (40m)
Hadoop & Beyond

Spark Streaming Case Studies

Paco Nathan (derwen.ai)

Apache Spark: Streaming case studies based on interviews with the dev teams, compared and contrasted with alternative open source projects, plus an open source example that demonstrates integration of Spark Streaming, Spark SQL, and Tachyon within a single app.

16:55-17:35 (40m)
Hadoop & Beyond

Spark and Cassandra

Tim Berglund (Confluent)

An exploration of Apache Spark, an in-memory analytics framework that applies functional programming paradigms to provide ad-hoc analysis for distributed databases like Cassandra.

17:45-18:25 (40m)
Hadoop & Beyond

Identifying Outliers at Scale Using Real-time Search Engines

Costin Leau (Elastic)

A practical exploration of anomaly detection (from credit card fraud
to incorrectly tagged movies) through harnessing the power of the
'inverted index' - the foundation of information retrieval systems.
Use Hadoop, Elasticsearch and Spark to gain insights into your big
data and discover 'what stands out' at scale.
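The core idea named in this abstract can be sketched in miniature (this is an illustration of the concept, not Elasticsearch's actual implementation, and the movie-tag data is made up): an inverted index maps each term to the documents containing it, and terms that occur in very few documents are precisely what "stands out".

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def rare_terms(index, max_docs=1):
    """Terms appearing in at most `max_docs` documents 'stand out'."""
    return {term for term, ids in index.items() if len(ids) <= max_docs}

# Hypothetical movie-tag documents: 'zzz' is an incorrectly tagged outlier.
docs = {1: "action thriller", 2: "action drama", 3: "action western zzz"}
index = build_inverted_index(docs)
outliers = rare_terms(index)
```

At scale the same document-frequency statistics come for free from the search engine's index rather than being recomputed per query.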

11:50-12:30 (40m)
Data Science

Relevant and Real-Time: Building a bi-directional recommendation system for a massive online game

Simon Worgan (Jagex Ltd), Samuel Kerrien (RESEREC)

We will detail the development of a bi-directional event stream recommendation system in RuneScape, a massively multiplayer online game. By capturing a feature-rich relationship between player and content, we were able to train different 'flavours' of recommendation. Delivered in real time, these 'flavours' balance engagement, monetisation and enjoyment according to shifting business needs.

13:45-14:25 (40m)
Data Science

Exploratory Data Analysis with Apache Spark

Hossein Falaki (Databricks Inc.)

Apache Spark enables interactive analysis of big data by reducing query latency to the range of human interactions through caching. Additionally, Spark’s unified programming model and diverse programming interfaces enable smooth integration with popular visualization tools, such as ggplot and matplotlib. We can use these to perform visual exploratory big data analysis with Spark.

14:35-15:15 (40m)
Data Science

A Gentle Introduction to Apache Spark and Clustering for Anomaly Detection

Sean Owen (Cloudera)

Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative algorithms relevant to data science like clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
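The session works in Spark MLlib and Scala on the KDD Cup '99 data; as a plain-Python sketch of the underlying technique only (toy data, no Spark), points far from every cluster centroid get a high anomaly score:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: points are tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:
                # recompute centroid as the mean of its members
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest centroid; large means anomalous."""
    return min(math.dist(point, c) for c in centroids)

# Two tight clusters of 'normal' traffic; (50, 50) would stand out.
normal = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.2)]
centroids = kmeans(normal, k=2)
```

In MLlib the same scheme scales out: fit `KMeans` on the full data set, then score each record by its distance to the nearest cluster center.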

16:05-16:45 (40m)
Data Science

Data Science Toolbox and the Importance of Reproducible Research

Jeroen Janssens (Data Science Workshops B.V.)

The Data Science Toolbox is a new, open source virtual environment for data science. Its mission is to: (1) get data scientists started in a matter of minutes, (2) enable teachers and authors to offer a custom virtual environment for their students and readers, and (3) encourage researchers to set up reproducible experiments. We'll discuss its importance, its technology, and its future.

16:55-17:35 (40m)
Data Science

Building a Unified Data Pipeline in Spark

Aaron Davidson (Databricks)

Apache Spark lets users build unified data analytic pipelines that combine diverse processing types. In this talk, we will leverage the versatility of Spark to combine SQL, machine learning, and real-time stream processing to build a complete data pipeline in a single, short program, which we will build up throughout the session.
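The "single short program" shape can be sketched with standard-library stand-ins (this is not Spark; sqlite3 stands in for Spark SQL and a simple threshold rule stands in for MLlib, with made-up data and a made-up threshold):

```python
import sqlite3
from statistics import mean

# Ingest stage: load raw events into a queryable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10), ("a", 12), ("b", 200), ("b", 210)])

# SQL stage: aggregate raw events per user.
rows = conn.execute(
    "SELECT user, AVG(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# 'ML' stage: flag users whose average spend is well above the global mean.
avgs = [avg for _, avg in rows]
threshold = 1.5 * mean(avgs)
flagged = [user for user, avg in rows if avg > threshold]
```

The point of the unified design in Spark is that the SQL, learning, and streaming stages share one engine and one data representation instead of being glued together across systems.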

17:45-18:25 (40m)
Data Science

Search Query Categorization at Scale

Alex Dorman (Magnetic), Michal Laclavik (Magnetic)

The need to categorize short text strings arises in many domains: online advertising, search engines, social networking, etc. In this session, we will share strategies for categorizing large volumes of queries and keywords in the advertising space, our successes with open document collections (Wikipedia, DBPedia, Freebase), and details on our solution using Hadoop and Solr.

11:50-12:30 (40m)
Hadoop Platform

A Survey of HBase Application Archetypes

Lars George (Cloudera), Jonathan Hsieh (Cloudera, Inc)

This talk will show how HBase use cases vary significantly, from write-once, read-many workloads storing events to updatable entity workloads that use HBase as a random read/write backing store. A discussion of how these use cases can be classified, along with examples, concludes the session.

13:45-14:25 (40m)
Hadoop Platform

Petascale Genomics

Uri Laserson (Cloudera)

The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large data sets. Big data tools developed to handle large-scale internet data (like Hadoop) will help scientists effectively manage this new scale of data, and also enable addressing a host of questions that were previously out of reach.

14:35-15:15 (40m)
Hadoop Platform

Moving Towards a Streaming Architecture

Garry Turkington (Improve Digital), Gabriele Modena (Improve Digital)

Improve Digital is an ad tech company with large data volumes. This talk will explore our learnings from enhancing our established batch infrastructure with streaming near-realtime capabilities. In addition to discussing the impact on our architecture we will also describe how the work changed our approach to data lifecycle management.

16:05-16:45 (40m)
Hadoop Platform

Driving Personalization with Real Time Big Data Analytics

Ameya Kantikar (Groupon)

Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure, built using open source technologies such as Kafka, Storm, HBase, and Redis, which handles over 1 million data points per second. The talk covers various solution choices, techniques, strategies, and more.

16:55-17:35 (40m)
Hadoop Platform

From Raw Data to Analytics with No ETL

Marcel Kornacker (Cloudera)

Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.

17:45-18:25 (40m)
Hadoop Platform

Enterprise Hadoop Architecture – Lessons from Cisco’s Hadoop Journey

Floris Grandvarlet (Cisco)

This session presents details of Cisco's enterprise Hadoop architecture, including roadmap details, the centralized funding model that helped it get deployed quickly, and its logical and physical views. Prominent use cases already in production at Cisco will also be covered.

11:50-12:30 (40m)
Design

Making Data Human

Jesús Gorriti (Fjord)

A lot of decisions are made for us based on data – but are we at risk of crossing over into the ‘uncanny valley’ of over-familiar personalisation? Designers need to focus on human elements, rather than allowing tech to lead the way. Jesus Gorriti will discuss SMART, a collaboration with the Harvard Medical School where the pediatric growth chart was reinvented using big data and design thinking.

13:45-14:25 (40m)
Design

Data and Design: We're All Invited on the Data Journey

Juliette Melton (New York Times)

Making meaning and value from data is not only a job for data scientists. Ethnographic researchers, subject matter experts, visual communication designers, and behavioral scientists all play key roles in the data journey. In this talk, we'll explore the data value chain, and share opportunities for how all of us -- whether data scientists or not -- can create and use data for insight and impact.

14:35-15:15 (40m)
Design

The Data Future

Kim Rees (Periscopic)

We have the unfortunate tendency to fit our problems to the technology at hand. We should be looking for ways to bend technology to our problems...our big problems. Kim will take a long look into the future of data covering the controversial and hopeful areas of privacy, open data, hacking, ETL relief, latent machines, M2M, and mass crowdsourcing.

16:05-16:45 (40m)
Design

Challenges in Developing Contextual Applications

Håkan Jonsson (Sony Mobile Communications)

Experiences from developing contextual applications, with a focus on data, design, and privacy issues.

16:55-17:35 (40m)
Data Science, Design

ggvis: Interactive, intuitive graphics in R

Garrett Grolemund (RStudio)

The ggvis package makes it easy to create interactive data graphics with R, with a declarative syntax similar to that of ggplot2. Like ggplot2, ggvis uses concepts from the grammar of graphics, but it also adds the ability to create interactive graphics and deliver them over the web.

17:45-18:25 (40m)
Design

From Confusing to Convincing: A Framework for Using Animation and Storytelling to Bolster the Effectiveness of Interactive Visualizations

Michael Freeman (University of Washington)

Complex relationships in big data require involved graphical displays which can be intimidating to users. This talk uses real world examples to identify confusing elements in online visualizations, and articulates a framework for using animation and story-telling to amplify their impact and usability. Tangible and generalizable techniques applicable across fields will be presented.

11:50-12:30 (40m)
Business & Industry

Why decision automation is key in big data analysis

Uwe Weiss (Blue Yonder)

While many companies are struggling to adopt big data and unlock its potential, facing challenges of visualization and democratization of insight, a number of industry leaders are leapfrogging big data adoption and circumventing the analyst bottleneck by going straight to automation of core business processes. This requires overcoming a set of tough cultural, technical, and scientific challenges.

13:45-14:25 (40m)
Business & Industry

High Level Abstractions Make Big Data Useful for Real People

Melissa Santos (Big Cartel)

By having understandable abstractions for important data objects, Etsy has enabled employees across the whole company to actively take part in the collection and analysis of data. Converting data to objects allows us to more naturally convert analysis questions into code, and enforce business rules and definitions consistently.

14:35-15:15 (40m)
Business & Industry

Implementing a Data Warehouse Front-End in Google Docs

Aaron Frazer (Seeking Alpha)

Demonstrating how to use Google Docs as a flexible, extensible, self-service front-end for your data warehouse: a simple, cheap, stable, flexible, user-friendly alternative to traditional tools.

16:05-16:45 (40m)
Business & Industry

Old Dogs, New Tricks: How Data-driven Intrapreneurs Make Big Companies Innovate

Alistair Croll (Solve For Interesting)

In this session, Alistair Croll, author of the best-selling Lean Analytics and chair of O’Reilly Strata, will share what he’s learned in a year of working with and interviewing intrapreneurs all over the world.

Outbrain serves 150 billion content recommendations to more than 500 million monthly users. Data tells us what's driving the mindset of the crowd. But how do you analyze whether an individual user finds value in recommendations? Why is settling for click-focused metrics dangerous for growth? We outline a 3-layer framework for data scientists to analyze user engagement in the face of such challenges.

17:45-18:25 (40m)
Business & Industry

Understanding your Unicorns: Data Science Team Building in Action

Kim Nilsson (Pivigo)

A data strategy is only as good as its execution. In the world of data science it has become increasingly apparent that business leaders focus on the technical aspects of data projects, when in fact the quality of the data team is key to success. In this talk I will share my experiences training data scientists and give some key insights into how to build a high-performing data science team.

11:50-12:30 (40m)
Sponsored

Analytics 3.0

Rod Smith (IBM Emerging Internet Technologies )

Analytics 3.0 is all about exploiting big data for just-in-time results to impact business outcomes. But what's really changing?

Opportunities and Challenges of Data Processing in the Internet of Things (IoT)

Michael Hausenblas (Red Hat)

We will discuss requirements for IoT data processing platforms, including stream processing, handling raw device data, ensuring business continuity, and enforcing security and privacy. We will dissect a number of IoT applications, such as a manufacturer offering proactive maintenance, optimisation of waste management, and streamlining of a supply chain.

16:05-16:45 (40m)
Sponsored

Splunk at UniCredit: Our Big Data Journey from Daily Troubleshooting to Business Analytics

Marcello Bianchetti (UniCredit SPA)

This session will show the evolution of big data at UniCredit, from troubleshooting and application monitoring to the real-time analytics of ATMs, mobile banking, transactions and card usage. It will go under the hood of the technical decisions involved in setting up a scalable and reliable architecture and dealing with a heterogeneous, geographically distributed, and multi-layered environment.

16:55-17:35 (40m)
Sponsored

Why Enterprise IT Management Tools are Essential for Big Data Success

Joe Goldberg (BMC Software)

Enterprise IT Management tools play a key role in helping IT organizations deliver a high level of service to their customers and manage the ongoing operation of production and mission critical systems according to regulatory requirements and to meet the goals of the business...

17:45-18:25 (40m)

Session

To be confirmed

11:50-12:30 (40m)
Sponsored

The Internet of Trains

Frank Saeuberlich (Teradata)

How can big data make your journey to work better? In this case study we’ll explore how! Trains today are complex systems consisting of many embedded subsystems, which operate together with the overall goal of delivering a high quality transportation service...

Learn how Pentaho's data integration and business analytics platform accelerates value from blended big data.
* Leverage analytics – from data access and integration, through visualisation and predictive analytics – to deliver near real-time business insights.
* Empower users to architect big data blends at the source AND stream for more complete and accurate analytics...

14:35-15:15 (40m)
Sponsored

Configuring a Secure, Multi-Tenant Hadoop Cluster For The Enterprise

James Kinley (Cloudera)

Key takeaways:
* The business drivers and objectives
* Multi-tenancy concepts and architecture
* Multi-tenancy features in EDH
* Multi-tenancy configuration in EDH

16:05-16:45 (40m)
Sponsored

The ART of data Governance and Security

Bob Middleton (Tableau Software)

Understanding the balance between Availability, Risk and Trust when dealing with big data analytics.
As we approach the end of 2014 more people are talking “big data” than ever before, but what we are now calling big data is just a drop in the ocean. The danger we all face is that as we step back to consider just how beautifully BIG our data is getting, we start to lose control.

16:55-19:00 (2h 5m)
Data Science

Developer Certification for Apache Spark

Get certified as a Spark Developer at Strata + Hadoop World in Barcelona.

11:50-12:30 (40m)
Internet of Things

Using Road Sensor Data for Official Statistics: Towards a Big Data Methodology

We show how to use road sensor data to produce reliable statistics about traffic intensities on the 3,000 km of Dutch motorways. To use the data from 20,000 road sensors, dimension reduction is applied to the highly redundant sensor data to compensate for its poor quality.
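The abstract doesn't specify which dimension-reduction method is used; one standard approach for highly redundant sensors is projecting readings onto their first principal component. A pure-Python sketch via power iteration, with made-up sensor data, for illustration only:

```python
import math

def first_principal_component(data, iters=100):
    """Power iteration on the covariance matrix of `data`
    (rows = observations, columns = sensors)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, means

def project(row, v, means):
    """Reduce one multi-sensor reading to a single intensity score."""
    return sum((row[j] - means[j]) * v[j] for j in range(len(v)))

# Three redundant sensors tracking the same rising traffic intensity.
data = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 3.1, 2.9], [4.0, 4.1, 3.9]]
v, means = first_principal_component(data)
scores = [project(row, v, means) for row in data]
```

Because the sensors are nearly perfectly correlated, one component captures almost all of the variation, which is exactly what makes the reduction robust to individual noisy or missing sensors.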

13:45-14:25 (40m)
Internet of Things

Intel's Cloud Wearable & IoT Analytics Platform

Assaf Araki (Intel)

IoT analytics brings engineering and analytic complexity to new market solutions. In this session we will share lessons learned from the development of Intel's Cloud IoT Analytics Platform, which is built on open source software. We will share learnings from the product development and present a use case from Parkinson's disease research that leverages wearable sensors to monitor patients' activities 24/7.

14:35-15:15 (40m)
Internet of Things

Will the Hordes of IoT Data Bring the Post-Hadoop Era and Democratize Data Stores?

Jodok Batlogg (CRATE Technology GmbH)

Creating a backend for data-intensive apps requires gluing several technologies together, which isn't always simple, cheap, or scalable. The world of sensor and IoT data, together with privacy concerns (mostly European) and the need to make contextual sense of it all, presents an opportunity to bring in the post-Hadoop era and democratise data stores.

We’ll explain how we’re automatically deriving a person’s mood and personality from mobile sensor data, and how we map and quantify these so that it becomes possible for technology to understand and work with ‘how we feel’.
We'll cover the technical details of the data gathering setup, our data-mining and machine learning approaches, and the big-data processing architecture developed.

16:55-17:35 (40m)
Government/Open Data

Disrupting the Music Tech Space with Open Data

Robert Kaye (MusicBrainz)

Too many big data sets live in walled gardens and thus limit innovation to a few players. Creating open data sets levels the playing field and allows open source hackers to participate.

17:45-18:25 (40m)
Business & Industry

Case study: The Benefits and Challenges of Running in the Cloud

Marton Trencseni (Prezi)

We recently moved our entire data infrastructure to AWS: we now use Elastic MapReduce, Redshift, and S3 for storage and processing. The talk describes the benefits and challenges of running in the cloud, and how treating storage and processing as a utility allowed our small team to work on tools that democratized access to business analytics across the company and made us happier in general.

Program Chairs Roger Magoulas, Doug Cutting, and Edd Dumbill welcome you to the first day of keynotes.

9:40-9:55 (15m)

Open Data Center of the Future

Mike Olson (Cloudera)

Mike Olson, CSO and Chairman, Cloudera

9:55-10:10 (15m)

Data Driven Design at F1 Speed

Geoff McGrath (McLaren Applied Technologies)

McLaren Applied Technologies capitalises on the convergence of real-time data management, predictive analytics and simulation to produce high-performance design of products and processes. In this talk we will describe how the approach of data-driven design can transform the way we go about creating and using products that are intrinsically intelligent and capable of adaptation.

10:10-10:20 (10m)
Sponsored

Big Data 3.0

Rod Smith (IBM Emerging Internet Technologies )

Big Data & Analytics continues to be a disruptive business force. Are we entering a new phase – Big Data & Analytics 3.0?

10:20-10:35 (15m)

Hiding Information Inside Big Data, and the Hypocrisy of Privacy

Alicia Asin (Libelium)

“Welcome to the era of big, bad, open information.”
Analysts have predicted huge numbers of Internet-connected devices in our future for years now. We may dispute the number, but it is clear that the Internet of Things (IoT) will produce a colossal amount of data.

10:35-10:50 (15m)

Data and Product and Tech, Oh My!

Camille Fournier (Independent)

Camille Fournier, Head of Engineering, Rent the Runway

10:50-10:55 (5m)
Sponsored

Mission Critical Big Data

David Richards (WANdisco, Inc.)

WANdisco CEO and Co-Founder David Richards will explore ‘mission critical’ applications of Big Data across industry sectors, and highlight the importance of continuous availability, performance, and scalability in its application.

10:55-11:15 (20m)

#IoTH: The Internet of Things and Humans

Tim O'Reilly (O'Reilly Media, Inc.)

The network, new data capabilities, and mobile devices rich in sensors have created fresh and unconventional possibilities to rethink workflows and processes in the real world. To succeed in creating totally new services and rethinking old ones, we must first adopt fresh thinking about the design process, and how sensors and algorithms are driving significant changes in what is possible.

11:20-11:50 (30m)

Break: Morning Break sponsored by WANdisco

12:30-13:45 (1h 15m)

Lunch / Thursday Birds of a Feather

Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.

Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.