Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town the idea was born to meet some day, work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their ok for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees - arriving at varying times: I guess 11a.m. simply is way too early to get up for your average software developer on a Saturday. I got a few people surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on: Some needed help to use Mahout - topics included:

How to use Apache Mahout collaborative filtering with complex models.

How to use Apache Mahout via a web application?

How to use classification (mostly focussed on using Naive Bayes from within web applications).

Is HBase a solution for scalable graph mining algorithms?

Is there a frequent itemset algorithm that respects temporal changes in patterns?

Those more into Mahout development proposed a slightly different set of topics:

Come up with a more Java API friendly configuration scheme for Mahout clusterings.

Complete the distributed SVD recommender.

Quickly teams of two to three (and more) people formed. First several user side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics also more basic questions of getting involved even by simply contributing to the online documentation, answering questions on the mailing lists or just providing structured access to existing material that users generally have trouble finding.

Another topic that is being overlooked all too when asking users to contribute to the project is the process of creating, submitting, applying and reviewing patches itself: Being deeply involved with free software projects dealing with patches, integration of issue tracker and svn with the project mailing lists all seems very obvious. However even this seemingly basic setup sometimes looks confusing and complex to regular users - that is very common but not limited to people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking more sophisticated tasks - working on the first project patches. On Sunday only the really hard core developers remained - leading to a rather focussed work on Mahout improvements which in the end led to first patches sent in from the Mahout Hackathon.

Last week I was honoured to be invited as one of the two speakers on Apache Mahout at the Mahout meetup in Amsterdam at JTeams offices. After free beer, cola and pizza Frank Scholten gave an overview of Mahout's clustering capabilities. After a brief introduction to Mahout itself he went into a little more detail on how clustering works in general. After that with a selection of Seinfeld scripts he used a fun data set to guide the audience through the process of choosing the right data preparation steps, coming up with good training parameters and finally evaluating clustering quality.

After that I gave a brief introduction to classification with Mahout - going into a little more detail when it comes to data preparation and quality evaluation. The audience seemed most interested in learning more on how data preparation works - after all that step cannot really be covered by Mahout itself (though we do have some support) but instead needs a lot of domain knowledge from the user side.

Judging from the brief round of self introductions the meetup was well visited by an intesting mixture of people coming from JTeam, Hippo, the dutch police working on data analytics, developers working at RIPE and many more.

If you are interested in more data analysis, search and data storage - do not miss registration for Berlin Buzzwords on June 6/7th 2011.

This is to announce the Berlin Buzzwords 2011. The second edition of the successful conference on scalable and open search, data processing and data storage in Germany,taking place in Berlin.

Call for Presentations Berlin Buzzwords

http://berlinbuzzwords.de

Berlin Buzzwords 2011 - Search, Store, Scale

6/7 June 2011

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

IR / Search - Lucene, Solr, katta or comparable solutions

NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others

Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2)

Submission deadline: March 1st 2011, 23:59 MEZ

Notification of accepted speakers: March 22th, 2011, MEZ.

Publication of final schedule: April 5th, 2011.

Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk, a brief abstract in English language. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted for experienced users.) If you'd like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.

Follow @berlinbuzzwords on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on http://berlinbuzzwords.de Please re-distribute this CfP to people who might be interested.

Title: O'Reilly Strata ConferenceLocation: Santa ClaraLink out: Click hereDescription: Early next February O'Reilly is planning to put on a very interesting conference on the topic of data analysis and the business of generating value from raw digital data.

I'm really glad to have received the acceptance notification for my presentation and travel sponsorship from the DICODE project. So see you in Santa Clara.Start Date: 2011-02-01End Date: 2011-02-03

If you are still unsure whether you should attend or not: Strata kindly handed out discount codes to speakers to share with their followers and readers. It saves you 25% of the registration cost - just use str11fsd during registration.

As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be a lot time to discuss. We head over to a bar after the event for some beer and something to eat.

Talks scheduled so far:

Simon Willnauer: "Lucene 4 - Revisiting problems for speed"

Abstract: This talk presents a brief case study of long standing problems in Lucene and how they have been approached to gain sizable performance improvements. Each of the presented problems will have brief introduction, implemented solution and resulting performance improvements. This talk might be interesting even for non-lucene folks.

Josh Devins: "Title: Hadoop at Nokia"Abstract: In this talk, Josh will outline some of the ways in which Nokia is using Hadoop. We will start by having a quick look at the practical side of getting started with Hadoop and outline cluster hardware and configuration and management with tools like Puppet. Next we'll dive head first into how Hadoop and its' ecosystem are being utilized on a daily basis to perform business analytics, drive machine learning and help build data-driven products. We will also touch on how we go about collecting metrics from dozens of applications distributed in multiple data centers around the world. An open Q&A session will follow.

Paolo Negri: "The order of magnitude challenge: from 100K daily users to 1M "Abstract: "Social games backends share many aspects of normal web applications, but exasperate scaling problems, follow this talk to see how we evolved and brought a plain ruby on rails app to sustain 5000 reqs/sec, moved part of our data from sql to nosql to reach 5 millions queries per minute and see what we learned from this experience."

Please do indicate on Upcoming or Xing if you are coming so we can more safely plan capacities.

A big Thank You goes to zanox for providing the venue for free for our event as well as to Cloudera for supporting videos being taped of the presentations.

Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please contact isabel@apache.org if you are planning to attend this event or register with the xing event so we can plan for enough space for everyone. If you have not registered for the event there is now guarantee you will be admitted.

If you'd like to support the event: We are still looking for sponsors for drinks and pizza.

Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please contact isabel@apache.org if you are planning to attend this event or register with the xing event so we can plan for enough space for everyone. If you have not registered for the event there is now guarantee you will be admitted.

If you'd like to support the event: We are still looking for sponsors for drinks and pizza.