Join us for the 15th incarnation of the Recommender Stammtisch hosted by plista. The event will take place on Nov 13th, starting at 7pm. Please register here.

We will have two talks:

Torben Brodt: “Latest in large scale recommendation engines and machine learning” – Torben will present some of the latest developments in large scale recommendation engines and machine learning from the last RecSys conference.

Sebastian Schelter: “Factorbird – a Parameter Server Approach to Distributed Matrix Factorization” – We present ‘Factorbird’, a prototype of a parameter server approach for factorizing large matrices with Stochastic Gradient Descent-based algorithms. We designed Factorbird to meet the following desiderata: (a) scalability to tall and wide matrices with dozens of billions of non-zeros, (b) extensibility to different kinds of models and loss functions as long as they can be optimized using Stochastic Gradient Descent (SGD), and (c) adaptability to both batch and streaming scenarios. Factorbird uses a parameter server in order to scale to models that exceed the memory of an individual machine, and employs lock-free Hogwild!-style learning with a special partitioning scheme to drastically reduce conflicting updates. We also discuss other aspects of the design of our system such as how to efficiently grid search for hyperparameters at scale. We present experiments of Factorbird on a matrix built from a subset of Twitter’s interaction graph, consisting of more than 38 billion non-zeros and about 200 million rows and columns, which is to the best of our knowledge the largest matrix on which factorization results have been reported in the literature.
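To make the core idea concrete, here is a minimal single-machine sketch of matrix factorization with SGD, assuming a squared-error loss with L2 regularization. It is not Factorbird's implementation: there is no parameter server, no Hogwild!-style parallelism and no partitioning scheme, and the names `factorize` and `squared_error` as well as all hyperparameter values are hypothetical.

```python
import random


def factorize(entries, num_rows, num_cols, rank=2, lr=0.02, reg=0.01,
              epochs=500, seed=0):
    """Factorize a sparse matrix given as (row, col, value) triples with plain SGD.

    Hypothetical helper for illustration only; all defaults are assumptions.
    """
    rng = random.Random(seed)
    # Small random initialization of the row factors U and column factors V.
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_cols)]
    entries = list(entries)
    for _ in range(epochs):
        rng.shuffle(entries)  # visit the non-zeros in random order
        for i, j, value in entries:
            prediction = sum(U[i][k] * V[j][k] for k in range(rank))
            error = value - prediction
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on the regularized squared error of one entry.
                U[i][k] += lr * (error * v - reg * u)
                V[j][k] += lr * (error * u - reg * v)
    return U, V


def squared_error(entries, U, V):
    """Sum of squared errors over the observed entries."""
    return sum(
        (value - sum(a * b for a, b in zip(U[i], V[j]))) ** 2
        for i, j, value in entries
    )


if __name__ == "__main__":
    observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
                (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
    U, V = factorize(observed, num_rows=3, num_cols=3)
    print(squared_error(observed, U, V))
```

Each update touches only one row of `U` and one row of `V`, which is the property that lock-free schemes and careful partitioning exploit to keep conflicting updates rare.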

Mikio Braun (streamdrill): What is scalable Machine Learning? – Scalability is one of the key concepts in Big Data, although historically speaking, it has meant quite a different thing in Machine Learning. In this talk, I’ll discuss different aspects of large scale learning and how they relate to MapReduce, stochastic gradient descent, next generation big data frameworks like Spark and Stratosphere/Flink, as well as algorithmic approaches to scalability like approximation and stream mining.

This talk will give a preview of the latest developments in Apache Mahout. Mahout features a new Scala DSL for linear algebraic computations. Programs written in this DSL will be automatically parallelized and executed on Apache Spark. I will give an introduction to the DSL and show how Mahout uses it to implement a cooccurrence-based recommender.
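As a rough illustration of what a cooccurrence-based recommender computes, here is a small sketch in plain Python rather than in Mahout's Scala DSL. It uses raw cooccurrence counts with no significance weighting, which is a simplifying assumption; the function names and the toy data are hypothetical and do not reflect Mahout's API.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(histories):
    """Count how often each ordered pair of items appears in the same user history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, counts, top_n=3):
    """Score unseen items by summing their cooccurrences with the user's items."""
    seen = set(histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    histories = {
        "alice": ["a", "b"],
        "bob": ["a", "b", "c"],
        "carol": ["b", "c", "d"],
    }
    counts = cooccurrence_counts(histories)
    print(recommend("alice", histories, counts))
```

In linear algebra terms, the counts correspond to the off-diagonal entries of the product of the transposed user-item matrix with itself, which is the kind of computation a distributed linear algebra DSL can parallelize.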

Sebastian is a PhD student at the Database Systems and Information Management Group of TU Berlin as well as a member of the Apache Software Foundation, where he works on Mahout and Giraph.

Did you ever come across terms like BM25 similarity model, KL divergence model, the language model, and the translation model in ElasticSearch or Solr documentation? Did you wonder what these models do and whether they can improve your product?
In this talk we revisit the classical information retrieval approaches and explain the thinking and intuition behind the models so that you can decide whether they fit your use case. We also show some useful extensions handling common cases such as matching the query across multiple fields, handling spelling errors, synonyms and personalized context.
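For readers who have not seen BM25 written out, the following is a minimal sketch of one common variant of the scoring function, assuming whitespace tokenization and the frequently used parameter values `k1 = 1.2` and `b = 0.75`. The function name `bm25_scores` is hypothetical, and the exact formula used by a given search engine may differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / num_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "quick quick fox jumps"]
    print(bm25_scores("quick fox", docs))
```

The intuition is that repeated occurrences of a query term help, but with diminishing returns, and that long documents are penalized so they do not win purely by containing more words.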

Stefan works as a Senior Software Engineer at ResearchGate, where he focuses on developing their recommendation system. Previously, Stefan worked as a Software Engineer for relevance at Microsoft. He holds a PhD in Information Retrieval from Northeastern University, Boston, and an MSc degree in Natural Language Processing from RWTH Aachen. Stefan loves to discuss topics related to algorithms, search engine implementation, functional programming languages and machine learning.

We’re back at SoundCloud. SoundCloud will host the next Recommender Stammtisch on Thursday, February 27th.

We’re planning to have an evening of mirth, pizza, RecSys talks, and more mirth. Doors open at 7:00 pm at SoundCloud’s offices at Greifswalder Str 212, 5th floor, Berlin.

We will have talks from:

Özgür Demir – SoundCloud – Recommendations at SoundCloud

Since its foundation, SoundCloud has become one of the major platforms for user-generated audio content. The uniqueness of the uploaded content, together with its sheer mass, makes it very difficult for the end user to find relevant content. Hence, fully automated recommendations become a crucial part of an outstanding user experience. Various projects at SoundCloud currently deal with personalized as well as non-personalized recommendations and related topics, e.g. content classification. This talk will give a brief overview of those projects and the technologies used.

Most users only pay attention to the top 5 to 10 recommendations (on mobile even fewer), so it is very important to get these recommendations right. Ranking algorithms can help achieve this by using most of the modelling power to get the most relevant items to the top of the recommendation list.

I will give a short overview of the ranking techniques that we developed over the last couple of years and the main ideas behind them. Recommendations should also be interesting and allow users to discover new content and perhaps even expand their preferences.

In the second part of the talk, I will focus on diversifying recommendations, the challenges involved and the ways we tackle them. I would also like to introduce a new Open Source project for Machine Learning and Recommendations in Giraph/Hadoop called Okapi. Okapi provides a range of methods for Collaborative Filtering and Social Network Analysis and is released under the Apache license.

Christoph Lingg – komoot – Recommender Use Case at komoot

komoot is your personal guide for cycling and hiking tours. Christoph Lingg will give a short introduction to recommender use cases at komoot and their current recommendation techniques.

ResearchGate will host the next Recommender Stammtisch on Thursday, November 14th.
We are looking forward to an exciting talk by Andreas Lommatzsch on online real-time recommendations (details below), followed by a happy hour.

Right after the presentation we will walk together to the Prater Garden to get some more drinks.

Abstract:
WTF (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.
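To give a feel for the random walk side of such graph recommendation algorithms, here is a toy sketch of a Monte Carlo random walk with restart on a small in-memory follow graph. It does not implement SALSA and is not Cassovary's API; the function name `who_to_follow`, the restart probability and the example graph are all assumptions made for illustration.

```python
import random
from collections import Counter


def who_to_follow(graph, user, walks=2000, restart=0.3, top_n=2, seed=0):
    """Rank accounts by how often short random walks starting at `user` visit them.

    `graph` maps each account to the list of accounts it follows.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        node = user
        while True:
            neighbors = graph.get(node, [])
            # Stop at dead ends, or with probability `restart` begin a new walk.
            if not neighbors or rng.random() < restart:
                break
            node = rng.choice(neighbors)
            visits[node] += 1
    # Do not recommend the user or accounts they already follow.
    excluded = set(graph.get(user, [])) | {user}
    candidates = [(count, node) for node, count in visits.items()
                  if node not in excluded]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node for _, node in candidates[:top_n]]


if __name__ == "__main__":
    follows = {
        "ann": ["bob", "cat"],
        "bob": ["dan"],
        "cat": ["dan", "eve"],
        "dan": [],
        "eve": [],
    }
    print(who_to_follow(follows, "ann"))
```

Because the whole graph is a plain in-memory dictionary, each step of a walk is a single lookup, which hints at why keeping the graph in memory on one machine can simplify this kind of service.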

About Jimmy (Web):
Jimmy Lin is an associate professor in the iSchool at the University of Maryland, affiliated with the Department of Computer Science and the Institute for Advanced Computer Studies. He graduated with a Ph.D. in computer science from MIT in 2004. Lin’s research lies at the intersection of information retrieval and natural language processing, and he has done work in a variety of areas, including question answering, medical informatics, and bioinformatics. Lin’s current research focuses on massively-distributed data analytics in cluster-based environments.
Lin recently completed an extended sabbatical at Twitter, where from 2010 to 2012 he worked on services designed to surface relevant content for users and on the distributed infrastructure that supports mining relevance signals from massive amounts of data.