Sunday, May 15, 2011

NSF Workshop: 8F (Algorithms in the field)

The Algorithms in the field NSF workshop is starting tomorrow (hashtag is #8f on Twitter). Suresh invites us to "...make your opinions known. But even if you're not, you can voice them right here ! I'll incorporate comments posted here into the discussions (and sign your name if you'd like credit)." Muthu also mentions it here. Here is the current list of talks:

Recommender Systems: The Art and Science of Matching Items to Users

Speaker: Deepak Agarwal

Yahoo! Research

Algorithmically matching items to users in a given context is essential for the success and profitability of large scale recommender systems like content optimization, computational advertising, search, shopping, movie recommendation, and many more. The objective is to maximize some utility (e.g. total revenue, total engagement) of interest over a long time horizon, hence this is a bandit problem since there is positive utility in displaying items that may have low mean but high variance. A key challenge in such bandit problems is the curse of dimensionality --- data is obtained via interactions among several heavy-tailed categorical variables and the goal is to estimate response rates of some rare event like click, buy, and so on. Other than statistical challenges due to high dimensionality and data sparsity, bandit problems are difficult to work with for response that are observed with considerable delay (e.g. return visits, confirmation of a buy). To mitigate this issue, one approach is to optimize multiple competing objectives in the short-term to achieve the best long-term performance. For instance, in serving content to users on a website, one may want to optimize some combination of clicks and downstream advertising revenue in the short-term to maximize revenue in the long-run. This combination of bandits, high dimensional modeling, and constrained optimization in an environment where one has access to a persistent stream of data arriving at high frequency, poses unique algorithmic challenges. In this talk, I will attempt to capture some of the technical issues by focusing on a concrete application - content optimization on the Yahoo! front page. I will briefly discuss our own research, mention some specific open problems, and outline some broad technical challenges in this area.

About the speaker: Deepak Agarwal is a statistician at Yahoo! who is interested in developing statistical and machine learning methods to enhance the performance of large scale recommender systems. Deepak and his collaborators significantly improved article recommendation on several Yahoo! websites, most notably on the Yahoo! front page. He also works closely with teams in computational advertising to deploy elaborate statistical models on the RightMedia Ad Exchange, yet another large scale recommender system. He currently serves as associate editor for the Journal of American Statistical Association.

Sampling-Based Motion Planning: From Intelligent CAD to Group Behaviors to Protein Folding

Nancy Amato

Department of Computer Science and Engineering

Texas A&M University

Motion planning arises in many application domains such as computer animation (digital actors), mixed reality systems and intelligent CAD (virtual prototyping and training), and even computational biology and chemistry (protein folding and drug design). Surprisingly, a single class of planners, called sampling-based planners, have proven effective on problems from all these domains. In addition to versatility, these methods are simple and efficient, even in high-dimensional configuration spaces.

In this talk, we describe two major classes of sampling-based planners, the graph-based probabilistic roadmap method (PRM) and the tree-based rapidly-exploring random tree (RRT). We give an overview of some PRM variants developed in our group and present work related to virtual prototyping, multi-agent systems, and protein folding. For virtual prototyping, we show that in some cases a hybrid system incorporating both an automatic planner and haptic user input leads to superior results. For multi-agent systems, we describe PRM-based techniques for herding, pursuit/evasion and evacuation planning. Finally, we describe our application of PRMs to simulate molecular motions, such as protein and RNA folding. More information regarding our work, including movies, can be found at http://parasol.tamu.edu/~amato/

Bio: Nancy M. Amato is a professor of computer science and engineering at Texas A&M University where she co-directs the Parasol Lab, is Deputy Director of the Institute for Applied Math and Computional Science, is past chair of the Council of Principal Investigators and chair of the Alliance for Bioinformatics, Computational Biology, and Systems Biology. She received undergraduate degrees in Mathematical Sciences and Economics from Stanford University, and masters and Ph.D. degrees in Computer Science from UC Berkeley and UIUC. She received an NSF CAREER Award, is an IEEE Fellow, is an ACM Distinguished Speaker, and was a Distinguished Lecturer for the IEEE Robotics and Automation Society. She was an Associate Editor of the IEEE Transactions on Robotics and Automation and the IEEE Transactions on Parallel and Distributed Systems. She is a member of the Computing Research Association's Committee on the Status of Women in Computing Research (CRA-W) and of the Coalition to Diversity Computing (CDC); she co-directs the CDC/CRA-W Distributed Research Experiences for Undergraduates (DREU) program and the CDC/CRA-W Distinguished Lecture Series. She is Co-Chair of the NCWIT Academic Alliance. Her main areas of research focus are motion planning and robotics, computational biology and geometry, and parallel and distributed computing

Optimization Challenges in Machine Learning and Data Mining

Kristin P. Bennett

Rensselaer Polytechnic Institute

The areas of machine learning and mathematical programming are increasingly intertwined. Optimization problems lie at the heart of most machine learning approaches. Machine learning researchers have embraced the advances in convex mathematical programming contributing to rapid advances in machine learning models and algorithms. In turn, the particular nature of machine learning models has stimulated new advances in convex optimization. Recently, nonsmooth optimization methods have led to massively scalable algorithms for solution of convex machine learning models such as Support Vector Machines. However convex machine learning models are not enough. Using problems in drug discovery, we will illustrate how nonconvex models can elegant capture learning tasks and how novel nonsmooth nonconvex optimization methods can be analogously used to create scalable algorithms for solving these nonconvex machine learning models. Advances in nonsmooth and nonconvex optimization theory and algorithms to meet the challenges of machine learning and data mining could contribute to another surge of synergistic advances in both learning and mathematical programming with far reaching impact in many domains.

Scalable Topic Modeling

David Blei

Princeton Univ

Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems.

In this talk, I will review the state-of-the-art in probabilistic topic models. I will describe the ideas behind latent Dirichlet allocation and its extensions, and I'll point to some of the software

available for topic modeling.

I will then describe our online strategy for fitting topic models, which lets us analyze massive document collections and document collections arriving in a stream. Our algorithm can fit models to millions of articles in a matter of hours, and I will present a study of 3.3M articles from Wikipedia. These results show that the online approach finds topic models that are as good as or better than those found with traditional inference algorithms.

(Joint work with Matthew Hoffman, Francis Bach, and Chong Wang)

Some Sketchy Results

Graham Cormode

At&T Labs - Research

'Sketch' data structures are simple, efficient, compact summaries of large data sets. Sketches enable approximate computations like similarity comparison and summarization of distributions. There have been a number of successes where use of sketches have been applied in the field, most notably when dealing with very large data sets. In this talk, I'll present some key sketches and their properties, and describe some successes and some non-successes in their application. I'll conclude by outlining emerging applications for sketches, and their future potential.

Scaling data mining techniques

Anirban Dasgupta

Yahoo! Research

In this talk, we will discuss problems that arise from scaling data mining and machine learning techniques for large, sparse, high dimensional datasets. We will discuss this in the context of two theoretical constructs -- random projection and locality sensitive hashing. Random projection has been much studied as a primitive in various algorithms to control dependence on input dimension. We will show a machine learning application that motivates, in addition, the preservation of sparsity of input vectors, and show a corresponding new construction. We will also discuss improvements to locality sensitive hashing techniques that enable a theoretically justifiable, and empirically observed lift in performance.

Minimizing Communication in Linear Algebra

Jim Demmel

UC, Berkeley

Algorithms have two kinds of costs: arithmetic and communication, by which we mean moving data either between levels of a memory hierarchy (in the sequential case) or between processors over a network (in the parallel case). Communication costs can already exceed arithmetic costs by orders of magnitude, and the gap is growing exponentially over time following technological trends, so our goal is to design linear algebra algorithms that minimize communication. First, we show how to extend known communication lower bounds for O(n^3) dense matrix multiplication to all direct linear algebra, i.e. for solving linear systems, least squares problems, eigenproblems and the SVD, for dense or sparse matrices, and for sequential or parallel machines. We also describe new algorithms that attain these lower bounds; some implementations attain large speed ups over conventional algorithms. Second, we show how to minimize communication in Krylov-subspace methods for solving sparse linear system and eigenproblems, and again demonstrate new algorithms with significant speedups.

Universal laws and architectures

John Doyle

Caltech

Essential dimensions of complex systems research are the hard limits on what is achievable (laws), the organizing principles that succeed or fail in achieving them (architectures and protocols), the resulting behavior observed in real systems (behavior, data), and the processes by which systems evolve (variation, selection, design). Hard limits on measurement, prediction, communication, computation, decision, and control, as well as the underlying physical energy and material conversion mechanism necessary to implement these abstract processes are at the heart of modern mathematical theories of systems in engineering and science. Unfortunately, these subjects remain largely fragmented and incompatible, even as the tradeoffs between these limits are of growing. The principle aim of this talk is to motivate an integrated theory based on optimization and robustness.

Insights can be drawn from three converging research themes. First, detailed description of components and a growing attention to systems biology and neuroscience, the organizational principles of organisms and evolution are becoming increasingly apparent. While the components differ and the system processes are far less integrated, advanced technology’s complexity is now approaching biology’s and there are striking convergences at the level of organization and architecture. Determining what is essential about this convergence and what is merely historical accident requires a deeper understanding of architecture — the most universal, high-level, persistent elements of organization — and protocols. Protocols define how diverse modules interact, and architecture defines how sets of protocols are organized. Finally, new mathematical frameworks for the study of complex networks suggests that this apparent network-level evolutionary convergence within/between biology/technology is not accidental, but follows necessarily from their universal system requirements to be fast, efficient, adaptive, evolvable, and most critically, robust to perturbations in their environment and component parts.

We will sketch the underlying mathematical ideas and illustrate the key results with case studies drawn from statistical mechanics, turbulence, cell biology, human physiology and medicine, wildfire ecology, earthquakes, the Internet, market volatility, toys, fashion, and neuroscience. If there is time we will also review (i.e. poke fun at) a remarkably popular and persistent source of error and confusion, namely the many past and ongoing attempts to find “new sciences” of complex systems and networks. This highlights another apparent tradeoff we must overcome in complex systems, between the popularity and accessibility of ideas versus their mathematical rigor and practical relevance.

From FOCS to Sand Hill Road

Martin Farach-Colton

Rutgers Univ

In this talk, I'll be presenting the steps of a multi-phase transformation:

1) My theory work on cache-oblivious algorithms

2) Collaborative systems/algorithmic work on high-performance file systems and how that changed the theory I do

3) The breakthrough that lead to my company, Tokutek

4) What I've learned about how people use software and how that changed the theory I do

5) What it means to do algorithms work in a wide context.

Statistical Methods and Algorithms for the Analysis of Social Network Data

There are three broad classes of statistical models used for the analysis of (social) network data: exponential random graph models, latent space models, and mixed membership stochastic blockmodels. They have been used in both static (cross-sectional) and dynamic (longitudinal) network situations. “Exact” computation for such models often relies on iterative or simulation methods, such as Markov chain Monte Carlo methods, which do not scale especially well to large sparse network settings. For some models good approximations are available. We give an overview of the models and computational approaches for estimating their parameters. We also discuss model degeneracies that pose special problems of both computational and inferential sorts. Fast and scalable computations of network motifs have been of special interest in the machine learning literature but they do not appear to help enough for problems of model estimation and assessing model goodness-of-fit.

Reflections on Algorithm Engineering in Computer Graphics

Eugene Fiume

University of Toronto

Computer graphics is both a mathematical science and an engineering discipline. Mathematical modelling of biological, geometric and physical phenomena are a common first step; these models then require careful transformation into simulation and control algorithms that are aware of the constraints imposed by numerical precision, large heterogeneous data requirements, and idiosyncrasies of different computing platforms, run-time systems, operating systems and languages. In this, computer graphics is not much different from other numerically rich application areas. The differentiation is that error and complexity measures are at best a starting point for determining a "good" implementation, for the human visual system is the final arbiter, both of image quality and of responsiveness. I will survey some of these issues and propose that we enhance the education of our undergraduate and particularly graduate students in applied computation.

Principles of Fact and Ontology Mining from the Web

Alon Halevy

Google

The World-Wide Web can be viewed as the biggest repository of factual information available to mankind. Facts on the Web are typically embedded in text and sometimes presented in more structured form such as tables or lists. As a result, since the WWW emerged, it has been mined for knowledge bases that can be used for query answering and higher-level inference. I will describe some of the challenges we face when trying to extract knowledge bases from the Web. In particular, I will focus on two issues that emerged in our work on the WebTables Project, a collection of over 100 million tables extracted from the Web. The first issue is the need for a principled theory for assigning confidence for facts mined from the web that considers the uneven popularity of different facts on the Web (e.g., "capital of France" vs. "capital of Malawi") and limitations of the extractors we employ. The second issue is the need to develop more flexible knowledge representation mechanisms (schema languages) that can deal with the inherent uncertainty and lack of central data modeling on the Web.

Online Matching of Ads with Advertisers

Kamal Jain

MSR

Online matching was introduced by Karp-Vazirani-Vazirani in 1990's. In the last decade the setup was used to study the problem of allocating ad slots to advertisers. Online algorithms were proposed with worst case ratio of 63% and works only for hard budgets.

One criticism of such algorithms and analysis for practical use is their pessimistic viewpoint. We suggest a flexible online algorithm. Our algorithm is flexible in many ways.

1. It allows soft budgets.

2. It allows the optimal performance to be computed on a computer.

3. It allows accommodating the known stochastic information.

4. It allows tuning the algorithm during the run time, such as making it more aggressive or conservative, depending on whether one is going luckier or unluckier than usual.

The analysis of the algorithm shows that a tight-example and an optimal algorithm are primal-dual pairs or in other words satisfy min-max relationship. In doing so we introduced a new kind of duality for convex programs and adapt the primal-dual algorithmic technique to convex programs. This technique usually result in fast combinatorial algorithms.

Cheap Wins: Applications from Elementary Algorithms

David Karger

MIT

While theoretical computer science often develops deep and rich techniques by abstracting important problems from specific application domains, there are many cases where the standard algorithmic toolbox already contains the answers to important problems. In such cases the key contribution of the theoretician may be simply to recognize and apply the appropriate solutions. I'll run through some examples from natural language processing, peer to peer systems, networking, coding, and computational biology. And I'll explore the place of this kind of algorithms "service" work on our theoretical field.

Algorithms and Social Networks: How Close are our Ties?

David Kempe

USC

Social Networks have been one of the application areas that have inspired quite a lot of algorithms work lately. This is only natural: on the one hand, they are a very tangible concept that laymen can relate to (friendships, Facebook); on the other hand, they are naturally modeled as graphs (one of our favorite classes of objects), and give rise to novel algorithmic questions on them.

A large fraction of the work has focused on what we can do with information on social networks. Algorithmic tasks include identifying social communities (densely connected groups of individuals), tracking their evolution, locating influential individuals, or devising viral marketing strategies. A complementary approach views the networks themselves as executing algorithms, such as routing messages, spreading information, facilitating trades, or allocating social stature or power.

Several of these directions have shown appeal to multiple communities with different tool sets and approaches, including theoretical computer science, physics, machine learning, and data mining. Often, the strucutral results and algorithmic techniques are intriguing, and long sequences of papers explore ever-refined versions of the models or definitions.

At the same time, our point of departure - treating social networks simply as graphs, or perhaps graph augmented with a simplistic dynamic of our own design - frequently creates a disconnect to the actual application area (social networks) we would like to have an impact on. I would like to argue that as a community, we should perhaps try to understand more closely what questions matter to sociologists or researchers who really deal with social networks. More importantly, we should gain more understanding of actual human behavior. This may help push our contribution from intriguing papers to impact in the field.

Enriching Education Through Data Mining

Krishnaram Kenthapadi

MSR

Education is acknowledged to be the primary vehicle for improving the economic well-being of people. Textbooks have a direct bearing on the quality of education imparted to the students as they are the primary conduits for delivering content knowledge. They are also indispensable for fostering teacher learning and constitute a key component of the ongoing professional development of the teachers.

Many textbooks, particularly from emerging countries, lack clear and adequate coverage of important concepts. In this talk, we present our early explorations into developing a data mining based approach for enhancing the quality of textbooks. We discuss techniques for algorithmically augmenting different sections of a book with links to selective content mined from the Web. For finding authoritative articles, we first identify the set of key concept phrases contained in a section. Using these phrases, we find web (Wikipedia) articles that represent the central concepts presented in the section and augment the section with links to them. We also describe a framework for finding images that are most relevant to a section of the textbook, while respecting global relevancy to the entire chapter to which the section belongs. We pose this problem of matching images to sections in a textbook chapter as an optimization problem and present an efficient algorithm for solving it.

We also present a diagnostic tool for identifying those sections of a book that are not well-written and hence should be candidates for enrichment. We propose a probabilistic decision model for this purpose, which is based on syntactic complexity of the writing and the newly introduced notion of the dispersion of key concepts mentioned in the section. The model is learned using a tune set which is automatically generated in a novel way. This procedure maps sampled text book sections to the closest versions of Wikipedia articles having similar content and uses the maturity of those versions to assign need-for-enrichment labels. The maturity of a version is computed by considering the revision history of the corresponding Wikipedia article and convolving the changes in size with a smoothing filter.

We also provide the results of applying the proposed techniques to a corpus of widely-used, high school textbooks published by the National Council of Educational Research and Training (NCERT), India. We consider books from grades IXÐXII, covering four broad subject areas, namely, Sciences, Social Sciences, Commerce, and Mathematics. The preliminary results are encouraging and indicate that developing technological approaches to enhancing the quality of textbooks could be a promising direction for research for our field.

Algorithm Design using Spectral Graph Theory

Gary L Miller

Computer Science Department, Carnegie Mellon

Spectral Graph Theory is the interplay between linear algebra and combinatorial graph theory. One application of this interplay is a nearly linear time solvers Symmetric Diagonally Dominate system (SDD). This seemingly restrictive class of systems has received substantial interest and in the last 15 years both algorithm design theory and practical implementations have made substantial progress. There is also a growing number of problems that can be efficiently solved using SDD solvers including: image segmentation, image denoising, finding solutions to elliptic equations, computing maximum flow in a graph, graph sparsification, and graphics.

Theoretically, we have recently shown that if $A$ is an $nxn$ SDD matrix with $m$ entries the linear system $Ax =b$ can be solve to constant precision in $O(m \log n)$ time ignoring lower order factors. On the implementation side, the latest code runs in linear time experimentally for large sparse systems. In fact the problem sizes are now large enough that on modern multicore workstations the memory bandwidth is a limiting factor. Compression techniques are now used to substantially speedup the solver.

Real-time systems have timeliness properties that must be satisfied. Because the real-time behavior of programs depends on implementation details, the traditional software design approach that abstracts away implementation details and focuses on functional correctness does not work well for real-time system design. To address this issue, we describe an annotation scheme that is used to superimpose timeliness assertions on top of the functional specification of programs. The first part of this talk will discuss some formal aspects of the annotation scheme, including the logical basis, the algorithmic analysis and synthesis problems and the related issues such as implementability and robustness in real-time scheduler design.

The second part of this talk will address the design problems of real-time systems in an open system environment. In the open system environment, a set of resources is shared by multiple applications each of which has its own resource scheduling policy, and the resource scheduling policy of each individual application is not accessible by other applications. Because of the resource scheduler isolation, meeting the timeliness requirements requires new resource sharing models and analysis techniques. We shall discuss the concept of real-time resource virtualization and the related resource allocation problems. If time permits, we shall touch on real-time database issues and speculate about real-time intelligent agent design.

It pays to do the right thing: Incentive mechanisms for Societal Networks

Balaji Prabhakar

Stanford University

Why did kamikaze pilots wear helmets?

Why does glue not stick to the inside of the bottle?

Why is lemonade made with artificial flavor but dishwashing liquid made with real lemons?

How can I avoid traffic jams and be paid for it?

While the first three are some of life's enduring questions, the fourth is the subject of decongestion research projects at Stanford University and in Singapore. In this talk, I will briefly describe these projects and, more generally, discuss incentive mechanisms for Societal Networks; for example, transportation, energy, healthcare and waste management networks. I will talk about incentive mechanisms and experiments for reducing road congestion, pollution and energy use, and for improving "wellness" and good driving habits. Some salient themes are: using low-cost sensing technology to make Societal Networks much more efficient, using price as a signal to co-ordinate individual behavior, and intelligently "throwing money at problems".

Insights into Routing in Real-world Networks

D. Sivakumar

Yahoo! Research

Sociologist Stanley Milgram's experiments on the "small-world" phenomenon have led to a variety of fascinating questions at the intersection of sociology and computer science. In this work, we seek to understand what makes routing algorithms similar to those employed by Milgram work as effectively as they do. To this end, we develop a network models and tools, and provide theoretical and empirical answers this question. While affirming the importance of the role of weak ties (as pioneered by Granovetter in sociology and modeled crisply by Kleinberg in CS), our work highlights the role of two phenomena -- the existence of a dual space of affiliations of the members of the network, and the emergence of central nodes in the network.

Dealing with Massive Data: Parallelism to the Rescue

Sergei Vassilvitskii

Yahoo! Research

The present availability of data has lead to data analysis problems whose size would be unimaginable just 10 years ago. With datasets routinely measured in gigabytes and terabytes, one challenge that has emerged is scaling algorithms to handle this deluge of information. With chip designers sustaining Moore’s law by turning to parallelism, so have algorithmicists turned to parallelism to speed up algorithms and improve performance. In this talk, we will focus on the MapReduce model of parallel computing, highlight some of the recent algorithmic advances and present current challenges.