Past Seminars

Summer 2016

Ecological sciences benefit from the huge diversity of plant species, which play an important role in large-scale ecological processes such as global warming, land cover change, CO2 emission, invasive species, and fire hazards. State-of-the-art species classification techniques utilize remote sensing data such as hyperspectral imagery and LiDAR; however, this task involves extensive field data collection, which is time consuming, costly, and can only be accomplished by ecological experts. Among the thousands of commonly found plant species there are strong similarities from a remote sensing point of view, which makes species classification very daunting; an entire body of literature is dedicated to this issue, yet it remains far from real-world scenarios with thousands of possible species. While this indicates the importance and complexity of the problem, little has been done to tackle it from a computational point of view by harnessing the power of "big data". Periodic airborne campaigns can generate terabytes of data over vast swaths of land. To tackle these problems we propose to use probabilistic knowledge bases, which work best when there is abundant uncertain data. A probabilistic knowledge base captures ecological expert knowledge as probabilistic rules, which are mapped to remote sensing data and used to infer new facts, thereby enhancing species classification accuracy.

Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we are able to develop a first rule learning system that scales to Freebase--the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

Convex functions are well-studied and useful because they exhibit certain properties that allow for efficient optimization. Submodular functions are a discrete analog for combinatorial problems that exploit a notion of "diminishing returns" to provide performance guarantees on a large class of problems. In this talk, I provide an introduction to submodular functions, discuss their utility within the machine learning community, explore a number of applications that utilize submodular functions, and consider how they might be applied to the problem of crowdsourcing knowledge base inference.
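
For reference, the diminishing-returns property the talk builds on can be written as follows, where A and B are any nested subsets of the ground set V and x is any element outside B:

```latex
% Submodularity as diminishing returns:
% for all A \subseteq B \subseteq V and x \in V \setminus B,
f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
```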

Precision study of ecology plays a crucial role in our understanding of the environment with regard to issues such as climate change, carbon emissions, invasive species, disease outbreaks, and potential fires. Traditional ecological approaches merely look at one or at most a few ecological sites, and researchers usually perform independent analyses that do not necessarily align well with each other. However, these systems are not closed domains, and a global study of our ecosystem is needed to fully understand their dynamics. In this talk we look at big-data analytics solutions that we have studied and propose for further investigation as part of the NIST DSE challenge using the NEON data collection initiative.

We consider the problem of semi-supervised learning to extract textual categories (e.g. cities, clothing, and plants) from an unstructured text corpus. Starting from a handful of seed instances, a typical system iteratively learns useful syntactic patterns like "cities such as ___" or "___ is a city", and then applies these learned patterns to extract new instances. However, information extracted in this manner is usually of low quality because most of the syntactic patterns are not very reliable. To address this problem, in this paper we present a novel information extraction approach based on multimodal learning that combines information from multiple modalities (e.g. text and vision) to enhance the reliability of information extractors. Our approach is primarily motivated by the observation that multimodal information usually complements each other, which leads to a potentially more robust learning method: multimodal learning.
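
As a rough, hypothetical illustration of the pattern-based bootstrapping loop described above (the corpus, seed set, and patterns below are invented for this sketch, not the system's actual data):

```python
import re

# Hypothetical seed instances and corpus for the category "city".
seeds = {"Paris", "Tokyo"}
corpus = [
    "cities such as Paris and Berlin attract tourists",
    "Tokyo is a city with excellent transit",
    "Berlin is a city in Germany",
]

# Learned syntactic patterns with a capture group for the candidate instance.
patterns = [r"cities such as (\w+)", r"(\w+) is a city"]

def extract(corpus, patterns):
    """Apply the learned patterns to pull out new candidate instances."""
    candidates = set()
    for sentence in corpus:
        for pat in patterns:
            candidates.update(re.findall(pat, sentence))
    return candidates

# One bootstrapping iteration: extract candidates, then fold them back into the seeds.
new_instances = extract(corpus, patterns) - seeds
seeds |= new_instances
print(new_instances)  # e.g. {'Berlin'}
```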

The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) we present a formal definition and a novel relational model for probabilistic knowledge bases, which allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) we implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) we combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
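
A minimal sketch of the batch rule-application idea, assuming a toy fact set and a made-up rule; ProbKB itself expresses this as SQL joins over relational fact tables rather than Python sets:

```python
from collections import defaultdict

# Hypothetical facts as (subject, predicate, object) triples.
facts = {
    ("Ann", "livesIn", "Gainesville"),
    ("Gainesville", "locatedIn", "Florida"),
    ("Florida", "locatedIn", "USA"),
}

def apply_rule(facts, p1, p2, head):
    """Apply the rule p1(x, y) AND p2(y, z) => head(x, z) to all facts in one batch,
    analogous to a single self-join over a facts table."""
    by_subject = defaultdict(list)      # index p2 facts by subject (the join key y)
    for s, p, o in facts:
        if p == p2:
            by_subject[s].append(o)
    inferred = set()
    for s, p, o in facts:
        if p == p1:
            for z in by_subject.get(o, []):
                inferred.add((s, head, z))
    return inferred - facts             # only the newly derived facts

print(apply_rule(facts, "livesIn", "locatedIn", "livesIn"))
# {('Ann', 'livesIn', 'Florida')}
```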

The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop a first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.
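
To give a flavor of what mining a single candidate rule involves, here is a heavily simplified, hypothetical scoring of one rule by support and confidence over a toy fact set; the actual Ontological Pathfinding algorithm parallelizes, partitions, and prunes this computation over Spark:

```python
# Toy facts; the candidate rule is p1(x, y) AND p2(y, z) => head(x, z).
facts = {
    ("Ann", "bornIn", "Paris"),
    ("Paris", "locatedIn", "France"),
    ("Ann", "nationality", "France"),
    ("Bob", "bornIn", "Lyon"),
    ("Lyon", "locatedIn", "France"),
}

def score_rule(facts, p1, p2, head):
    """Support = number of (x, z) pairs derivable from the rule body;
    confidence = fraction of those pairs already present as head facts."""
    body_pairs = {
        (x, z)
        for (x, p, y) in facts if p == p1
        for (y2, q, z) in facts if q == p2 and y2 == y
    }
    support = len(body_pairs)
    correct = sum((x, head, z) in facts for (x, z) in body_pairs)
    confidence = correct / support if support else 0.0
    return support, confidence

print(score_rule(facts, "bornIn", "locatedIn", "nationality"))  # (2, 0.5)
```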

Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge through scalable learning and inference. We propose to expand Freebase by applying the 36,625 first-order inference rules and to evaluate our approach using performance measures and cross-validation.

In this talk we are going to talk about two exciting projects in the field of computational law and legal analytics.

In the first, we are designing an intelligent system that predicts the outcome of a case brought to court. We are currently experimenting with data collected from the United States International Trade Commission's Electronic Document Information System (USITC EDIS). EDIS is a repository for documents filed in Title VII, Section 337 (Unfair Import Investigations Information System), and other investigations before the Commission.

Our second project is semantic edge labeling in legal citation graphs. We tackle the problem of automatically determining the type of citation where a certain statute is cited in another statute. This project involves analysis of the content and structure of the United States Code (USC). Our efforts involve defining, annotating, and assigning each citation edge a specific semantic label.

Mar 10

Babak Alipour

Contributing to Apache MADlib Analytics Library

Apache MADlib (incubating) is a framework for distributed/parallel in-database machine learning over data processing engines such as Greenplum and HAWQ. The MADlib project was initiated in 2011 and is a collaboration among developers and researchers from Greenplum/Pivotal/EMC; the University of California, Berkeley; the University of Wisconsin; and the University of Florida. It has recently moved to Apache to encourage extension, support, and innovation by open-source developers.

In this talk, we will discuss our individual experiences developing on MADlib and provide an overview of this group's efforts to contribute to the MADlib community through issue reporting, bug fixes, and the development of new modules and algorithm implementations. We will also discuss challenges and the next steps to resolve them.

Feb 26

Mebin Jacob

Scalable SPARQL Querying of Large RDF Graphs

The generation of RDF data has accelerated to the point where many datasets need to be partitioned across multiple machines in order to achieve reasonable performance. We look into one partitioning scheme, how it can be implemented, and how it performs on real-world queries such as those over DBpedia.

Feb 19

Jayson Mclaughlin-Salkey

On Constructing Biomedical Knowledge Bases

Knowledge bases are finding increasing use in biomedical applications. In this talk, we will discuss two state-of-the-art approaches to automatically generating these KBs. I will also discuss my current work towards creating a biomedical KB and its possible application to the Open Science Prize challenge.

In this talk, I discuss novel system components and algorithms that we are designing and building at UF to enable a probabilistic master Knowledge Base (KB) system. In the context of the Archimedes project, I will discuss a spectrum of research directions we are exploring at the UF Data Science Research (DSR) Lab, in particular probabilistic modeling and scalable inference over a probabilistic knowledge base that can integrate information extracted from diverse and multimedia data sources and systems. This line of research, supporting analytics over automatically extracted knowledge bases, has high impact for many applications, from QA systems and situational awareness to biomedical informatics. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and the Google Knowledge Vault.

We entered the NIST Pre-Pilot Evaluation for two tasks: data cleaning and event prediction. More specifically, we submitted two runs for the data cleaning task and seven runs for event prediction. In this talk, we focus mainly on the feedback that we have received from the NIST group. We present the algorithm details of each run and their corresponding scores. Our goal is to learn which algorithms work and which don't, as well as how the data and scoring of the NIST evaluation behave, so that we can further improve our systems for future tasks. In addition, we will also discuss our observations, suggestions, and feedback to NIST for the upcoming 2016/2017 DS open-track evaluations in terms of dataset, tasks, evaluation metrics, timeline, and organization.

MADlib is an open source library of scalable in-RDBMS analytics. MADlib is now moving to Apache, gaining impact in both academia and industry. MADlib supports PostgreSQL, Greenplum DB, and HAWQ. This talk is a brief introduction to MADlib and how to contribute to it.

In this talk, we discuss probabilistic, deterministic, and algebraic approaches for exploratory data analysis. We consider a corpus created from the English Wikipedia and look at various document modeling schemes to model the underlying structure of the corpus. We discuss two matrix factorization based approaches, Latent Semantic Analysis (LSA) and Principal Component Analysis (PCA), and a popular deterministic clustering scheme, the k-means algorithm. We also discuss the most popular probabilistic topic modeling scheme, Latent Dirichlet Allocation (LDA), and how it differs from the other three schemes. In the end, we discuss applications of document modeling in the e-discovery project and give a short summary of experimental results on a few e-discovery datasets.
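
A compact sketch of how these families of models can be tried on a small corpus, using scikit-learn as a stand-in for whatever implementations the talk uses; the documents below are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets reacted",
    "investors sold shares amid market fears",
]

# LSA: truncated SVD over a TF-IDF matrix (algebraic).
tfidf = TfidfVectorizer().fit_transform(docs)
lsa_topics = TruncatedSVD(n_components=2).fit_transform(tfidf)

# k-means: hard clustering of the same TF-IDF vectors (deterministic).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# LDA: topic model over raw term counts (probabilistic).
counts = CountVectorizer().fit_transform(docs)
lda_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

print(clusters)            # e.g. [0 0 1 1]
print(lda_topics.round(2)) # per-document topic proportions
```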

Fall 2015

With the advent of abundant multimedia data on the Internet, there have been research efforts on multimodal machine learning to utilize data from different modalities. Current approaches mostly focus on developing models that fuse low-level features from multiple modalities and learn a unified representation across modalities. But most related work fails to justify why we should use multimodal data and multimodal fusion, and few works leverage the complementary relation among different modalities. In this paper, we first identify the correlative and complementary relations among multiple modalities. Then we propose a probabilistic ensemble fusion model to capture the complementary relation between two modalities (images and text). Experimental results on the UIUC-ISD dataset show our ensemble approach outperforms approaches using only a single modality. Word sense disambiguation (WSD) is the use case we studied to demonstrate the effectiveness of our probabilistic ensemble fusion model.

University of Florida DSR Lab System for KBP Slot Filler Validation 2015

Abstract
We present a Slot Filler Validation (SFV) system that uses a semi-supervised ensemble learning approach to aggregate the results of multiple slot fillers from the Cold Start track. We apply Bipartite Graph-based Consensus Maximization (BGCM) to combine the output of supervised stacked ensemble methods with the output of slot filling runs that cannot be trained. By using BGCM we are also able to leverage a small set of assessed fillers to increase the performance of the system. The ensemble results outperformed the best Cold Start run, the best filtered runs, and other ensemble systems.

Abstract
Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we are able to develop a first rule learning system that scales to Freebase--the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

Abstract
We participated in the Pre-Pilot data science evaluation by NIST, which focuses on traffic data processing, including data cleaning and prediction. The traffic data contains measurements (e.g. flow, speed, and occupancy) from traffic sensors and event reports distributed over the DC-Baltimore area. In the data cleaning task, our goal is to correct possibly erroneous flow values in the measurements. We propose to solve this problem by verifying data integrity using various constraints, such as a smoothness constraint and a measurement constraint. For the prediction task, we are asked to predict the number of traffic events in a given geographical area within a time interval of one month. We designed a regression model followed by an ensemble for the prediction task. The major motivations are 1) to use regression models to predict the number of events based on road features that have significant impact on event occurrence; and 2) to use an ensemble method to combine the outputs of multiple regression models for enhanced prediction performance.
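
A toy sketch of the regression-plus-ensemble idea for the event prediction task; the features and data below are synthetic placeholders rather than the NIST traffic data, and the specific regressors are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic "road features" (e.g. lanes, average speed, sensor density) and event counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit several regression models on a training split.
models = [LinearRegression(),
          RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
for m in models:
    m.fit(X[:150], y[:150])

# Simple ensemble: average the individual regressors' predictions on held-out data.
preds = np.mean([m.predict(X[150:]) for m in models], axis=0)
print(preds[:5].round(2))
```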

Abstract
Knowledge Base Population (KBP) is the task of extracting triples in the form of (subject, relation, object) to populate a knowledge base. The English Slot Filling (ESF) and Cold Start (CS) tasks are part of the KBP effort conducted by NIST. Following the ESF task, the Slot Filler Validation (SFV) task was created in order to use the outputs of a number of individual systems attempting the ESF task and improve upon their accuracy in the aggregate. Various approaches, both supervised and unsupervised, have been applied to improve slot filler systems, including entailment, truth finding, constraint optimization, majority voting, and stacked ensembles. Although these methods refine the output of individual systems, they can be computationally expensive, unsuitable for ESF's list-valued results, or require substantial data for training. We propose to apply Bipartite Graph-based Consensus Maximization (BGCM), an ensemble learning approach that combines the outputs of supervised and unsupervised models in a semi-supervised fashion.
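
A heavily simplified, hypothetical sketch of the bipartite consensus idea: object nodes (candidate fillers) and group nodes (one per system prediction group) repeatedly exchange soft label estimates, with a few assessed objects acting as semi-supervision. The real BGCM objective and update rules are more involved; this only illustrates the alternating averaging flavor:

```python
import numpy as np

# Toy setup: 4 candidate objects, 2 classes (correct / incorrect filler),
# and 3 groups, each collecting the objects one system labeled the same way.
# A[i, j] = 1 if object i belongs to group j.
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
# Initial group label estimates (e.g. each system's predicted class distribution).
G = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])
# A few assessed objects provide semi-supervision (all-zero row = unlabeled).
Y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
labeled = Y.sum(axis=1) > 0

U = np.full((4, 2), 0.5)   # object label estimates
alpha = 2.0                # weight given to the assessed labels
for _ in range(50):
    # Objects: weighted average of their groups' estimates and their own assessed labels.
    U = (A @ G + alpha * Y) / (A.sum(axis=1, keepdims=True) + alpha * labeled[:, None])
    # Groups: average of their member objects' estimates.
    G = (A.T @ U) / A.sum(axis=0)[:, None]

print(U.round(2))   # per-object class probabilities after consensus
```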

Abstract
We consider the problem of semi-supervised learning to extract text categories (e.g. persons, cities) and image object bounding boxes from web pages. Starting with a handful of handcrafted seed examples for text categories and hundreds of seed images (collected from ImageNet), our system can automatically extract useful knowledge from the meta web. This talk pursues the thesis that, by extracting text and images jointly, extraction accuracy can be noticeably improved. To enable this multimodal extraction scheme, we propose a graphical fusion model, which combines complementary multimodal information into a unified framework. Evaluation experiments show a noticeable improvement of the proposed multimodal extraction over its single-modal versions.

Abstract
The BigDAWG polystore system is designed to handle large-scale analytics, real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all", it builds on top of a variety of storage engines, each designed for a specialized use case. The system provides a new view of federated databases to address the growing need for managing information that spans multiple data models.

Dr. Andrew Moore will discuss some of the big developments in computer science from the perspective of someone crossing over from industry to academia. He will talk about roadmaps for AI-based consumer and advice products in the commercial world and contrast with some of the potentially viable roadmaps in healthcare. Dr. Moore will also touch on entity stores (aka knowledge graphs), question answering and ultra-large data center architectures. Please visit the event page at https://datascience.nih.gov/community/datascience-at-nih/frontiers for more information.

Abstract
Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. This paper presents the Mesa system and reports the performance and scale that it achieves.

Abstract
Dr. Christof Koch, President and Chief Scientific Officer of the Allen Institute for Brain Science, and Dr. Emery Brown, Professor of Computational Neuroscience and Health Sciences and Technology, Department of Brain and Cognitive Sciences, MIT-Harvard Division of Health Sciences and Technology, will describe the computational or experimental challenges associated with Big Data in their respective domains of neuroscience. From the basic to applied realms, science is being transformed by the collection of data on increasingly finer resolutions, both spatially and temporally. Storing, accessing, and analyzing these data create numerous challenges as well as opportunities. Please visit the event page at https://datascience.nih.gov/events/BRAIN-BD2K for more information.

Michael Stonebraker has made fundamental contributions to database systems, which are one of the critical applications of computers today and contain much of the world's important data. He is the inventor of many concepts that were crucial to making databases a reality and that are used in almost all modern database systems. His work on Ingres introduced the notion of query modification, used for integrity constraints and views. His later work on Postgres introduced the object-relational model, effectively merging databases with abstract data types while keeping the database separate from the programming language. Stonebraker's implementations of Ingres and Postgres demonstrated how to engineer database systems that support these concepts; he released these systems as open software, which allowed their widespread adoption and their code bases have been incorporated into many modern database systems. Since the pathbreaking work on Ingres and Postgres, Stonebraker has continued to be a thought leader in the database community and has had a number of other influential ideas including implementation techniques for column stores and scientific databases and for supporting on-line transaction processing and stream processing.

Summer 2015

Knowledge-Base Population using Ensemble Learning of Supervised and Unsupervised Models

Abstract
A wide variety of techniques have been implemented to participate in the English Slot Filling (ESF) task, part of the Knowledge Base Population (KBP) effort from NIST. The Slot Filler Validation (SFV) task was created in order to use the outputs of multiple ESF systems to improve the accuracy of individual systems. Different supervised and unsupervised approaches have been used to improve slot filler systems, including entailment, constraint optimization, majority voting, and stacked ensembles. We propose the use of Consensus Maximization, an ensemble learning approach that combines the outputs of supervised and unsupervised models.

Reasoning Marginal Inference Probability on Dynamic ProbKB

Abstract
Knowledge bases are growing rapidly; assimilating new facts leads to incremental changes to the Probabilistic Knowledge Base (ProbKB), which invalidate the inferred marginal probabilities for nodes in the factor graph. Building on Kun's previous work on query-time k-hop approximate inference, we investigate how incremental information influences marginal inference probabilities on the NELL-sport dataset.

Abstract
One of the major tasks in knowledge base construction (KBC) is to populate category instances (e.g. "is_a") over a predefined ontology. While the state-of-the-art KBC systems (e.g. NELL, NEIL) are all based on information extraction technologies limited to a single modality, we propose to extract information in a multimodal manner. Our system adopts a never-ending learning model similar to NELL's, which repeatedly extracts new instances from a large collection of web pages and then refines and updates the extractors using the newly extracted instances. The major contributions of our project are: 1) showing that information extracted using a multimodal fusion model has higher precision than its respective unimodal versions; and 2) showing that by combining multimodal constraints, we are able to mitigate the "semantic drift" issues of never-ending learning models.

Abstract
With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is also increasing; no longer can one assume data will be structured numbers and names. Traditionally, to perform analytics, a data scientist extracts parts of large data sources to local machines and performs analytics using R, Python, or SAS. Extracting this information is becoming a pain point. Additionally, many algorithms perform extra work over entire datasets when the data scientist may only be interested in a particular portion of the data.

In this dissertation, I introduce query-driven text analytics: the use of declarative semantics (a query) to direct, restrict, and alter computation in analytic systems without a major sacrifice in accuracy. I demonstrate this principle in three ways. First, I add text analytics inside a relational database, where the user can use SQL to bound the scope of their algorithm; this way, computation is in the same location as storage and the user can take advantage of the query processing provided by the database. Second, I alter an entity resolution algorithm so it uses example queries to drive computation, demonstrating a method of making a non-trivial algorithm aware of the query. Finally, I describe a method for inferring information from knowledge bases: new techniques to perform inference over knowledge bases that model uncertainty for a real scenario, and their application within question answering.

Abstract
In today's seminar, we will have a tutorial on Docker setup, server guidelines, and the steps to host a live demo on a web server via Docker.

Expanding SigmaKB with GDELT data

Abstract
GDELT is a project that aims to create a global dataset of events, locations, and tone by collecting news media articles from around the world. This dataset puts together the spatial and temporal dimensions of world events, adding context such as tone, i.e., the kind of language the media uses to cover an event. This kind of dataset can be used to expand factual knowledge bases such as SigmaKB/YAGO that already include spatio-temporal dimensions for entities and facts. In this talk I will discuss the nature of both datasets, possible ways to integrate them, and the advantages this can bring.

Abstract
In this talk, I will review and motivate marginal inference as the prevailing task in treating knowledge bases as probabilistic graphical models. To that end, there are a number of inference algorithms that balance certain tradeoffs, including level of approximation vs. scalability and feature specificity vs. expressivity. Along with Markov Logic and Path Ranking, I will discuss a number of modifications and how the tradeoffs are affected. Additionally, whether such models treat rule features as producers of knowledge or as constraints on knowledge has far-reaching effects on our intuitive understanding of the inference results.

Abstract
The goal of Knowledge Base Population (KBP) at TAC is to promote research on, and evaluate the ability of, automated systems to discover information about named entities and incorporate this information into a knowledge source. Specifically, given a reference knowledge base, a set of attributes (slots), and a set of entities from the reference KB, the Slot Filling (SF) task consists of mining information about (entity, slot) pairs from text to complete missing slots in the reference KB. Since 2013, a new task, Slot Filler Validation, has been proposed to focus on refining the output of SF systems by applying more intensive linguistic processing or by combining information from multiple systems. In this talk, the datasets used in the 2014 SFV track will be discussed, and a pipeline for a stacking ensemble system that aggregates multiple SF system outputs will be presented, along with possible ways to improve it for the 2015 SFV task.

Abstract
This talk will serve as a brief introduction to modeling knowledge using either a probabilistic graphical model or first-order logic. After reviewing basic concepts and motivating shortcomings of both, Markov Logic will be presented as one solution to combat the complex specification of Markov random fields and the determinism of first-order logic. The material presented is a precursor to next week's talk on problems and possible solutions inherent in Markov Logic.

Abstract
Recent years have seen a drastic rise in the construction of web-scale knowledge bases (e.g., Freebase, YAGO, DBPedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, current knowledge bases are still far from complete. In this paper, we study the problem of mining first-order inference rules to facilitate knowledge expansion. We propose the Ontological Pathfinding algorithm (OP) that scales to web-scale knowledge bases via a series of parallelization and optimization techniques, including a new parallel rule mining algorithm implemented on Spark, a novel partitioning algorithm to break the learning tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we are able to develop a first rule learning system that scales to Freebase–the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

Applying Big Data Technology to Remote Sensing for Species Identification

Abstract
Species identification through remote sensing provides the means to monitor biodiversity and its interrelated dynamics at large ecological scales. With the advent of NEON, a standardized protocol for data collection in a wide range of domains will be used to collect data across the continental US for over 30 years. We use big data technologies such as probabilistic knowledge bases and deep learning to incorporate expert knowledge and feature learning to enhance species identification from remote sensing data.

Abstract
Our motivation is to utilize multimodal data to achieve better performance compared to a single modality. We will first introduce two applications of multimodal data fusion: multimodal information retrieval and multimodal word sense disambiguation. The methods for combining images and text will be explained, as well as the experimental results showing that the multimodal approaches outperform single-modality approaches. We will discuss a few different models for combining modalities and propose a promising model.

Apr 10

Kushal Arora

Neural Nets and Knowledge Bases

Abstract
In this talk we will discuss neural network architectures applied to multi-relational data and how they are used to solve problems like inference, expansion, and reasoning over KBs. We will touch on the basic architectures used, various objective functions, and how they are applied in the context of the problems stated above.
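
One widely cited formulation in this family, a TransE-style translation embedding, scores a triple by how well the relation vector translates the head entity to the tail; this is offered only as an example architecture, not necessarily the one the talk covers, and the entities below are placeholders:

```python
import numpy as np

# Randomly initialized (i.e. untrained) embeddings for a few hypothetical entities/relations.
rng = np.random.default_rng(0)
dim = 50
entity = {name: rng.normal(size=dim) for name in ["Paris", "France", "Tokyo", "Japan"]}
relation = {"capitalOf": rng.normal(size=dim)}

def score(h, r, t):
    """TransE-style score: negative distance between (head + relation) and tail.
    Higher (less negative) means the triple is considered more plausible."""
    return -np.linalg.norm(entity[h] + relation[r] - entity[t])

# Training (omitted) would adjust the vectors so true triples outscore corrupted ones,
# e.g. score("Paris", "capitalOf", "France") > score("Paris", "capitalOf", "Japan").
print(score("Paris", "capitalOf", "France"))
```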

Abstract
Probabilistic knowledge bases are constantly incorporating new knowledge learned from the web. With incremental changes to the KB, a naive approach to answering a marginal query is to re-run the inference algorithm, e.g., Gibbs sampling or MC-SAT, which is time consuming. We present an approach to approximate marginal inference and show that we can answer queries an order of magnitude faster with negligible error.

Abstract
In this talk, I will be discussing a recent paper from the Knowledge Vault team at Google. In this new paper, the researchers investigate the use of facts extracted from the web as a signal for search and result ranking. They build upon the previously published Knowledge Vault system to collectively model factual errors in the corpora as well as extraction errors. I will discuss their techniques and present their findings. This paper, presumably, has been submitted but not yet accepted for publication.

Abstract
In this talk we will cover the basics of neural networks, starting with logistic regression and the multilayer perceptron, then moving from autoencoders to deep architectures like stacked autoencoders. In addition, we will discuss the basics of the Theano framework and implementations of the discussed architectures.
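
As a warm-up for the material above, here is a minimal logistic regression trained by gradient descent, written in plain NumPy rather than Theano so the update rule is explicit; the data is synthetic:

```python
import numpy as np

# Tiny synthetic binary classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid activation
    grad_w = X.T @ (p - y) / len(y)           # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)                   # gradient w.r.t. the bias
    w -= lr * grad_w
    b -= lr * grad_b

print(((p > 0.5) == y).mean())   # training accuracy
```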

Abstract
Question answering systems allow humans to ask questions in natural language, and the system responds with an answer in a human recognizable way. There has been a recent renewed interest in developing QA systems using Knowledge Graphs. In this talk, I will discuss the development of an in-house system over probabilistic knowledge bases. A probabilistic KB aims to provide an additional trustworthiness score to traditional QA systems. I will address both the motivation and progress of this work.

Abstract
First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. This talk explores the divide between these two approaches to the problem of fact inference and what the space of approximations somewhere in the middle may be.

Abstract
First-order logical rules are an expressive and powerful way to infer new facts from existing evidence. Markov Logic applies all rules at once to reason jointly over the entire possible world of knowledge, but exponential growth makes application to large-scale knowledge bases intractable. Approximations such as Association Rule Mining instead perform inference on a fact-by-fact basis, ignoring higher-order correlations. In order to scale to web-scale knowledge bases, we describe a new algorithm that scales association rule mining to today's KBs with billions of facts.

Abstract
Probabilistic knowledge bases are constantly incorporating new knowledge learned from the web. With incremental changes to the KB, the current approach to answering a marginal probability query is to re-run the inference algorithm, e.g., Gibbs sampling, which is time consuming. We present an approach to approximate marginal inference and show that we can answer queries an order of magnitude faster with negligible error.

Fall 2014

Abstract
In this talk, I discuss system and algorithmic components that we are designing and building at UF to enable a master Knowledge Base (KB). I will also discuss many research directions we are exploring at the UF Data Science Research (DSR) group, including: query-driven inference and sampling, probabilistic knowledge bases, state-parallel and data-parallel data analytics frameworks, multimodal (e.g., text, image) information extraction, and KB schema enrichment. This line of research is of high impact and has received funding from industry as well as the federal government, including DARPA, EMC, Amazon, and Google. Other related projects include DeepDive from Stanford, YAGO from the Max Planck Institute, NELL from CMU, as well as WikiData/Freebase and Google Knowledge Vault.

Abstract
This talk introduces current research activities in the Interaction Design Lab of Shanghai Jiao Tong University and portfolios created under the principle of "Form Follows Emotion".

Bio
Kevin Dong is an assistant professor of interaction design at Shanghai Jiao Tong University, China. He received his doctoral degree from the College of Computer Science at Zhejiang University. He is the principal investigator of several government-funded projects, including: Universal Interaction Design of Digital-TV under Aging Society; and Relationship between Customers' Participation and User Experience of Customized Products. Aside from government-funded projects, Kevin also leads enterprise projects, including: A New Automobile Navigation Interface Design Based on Touch Panel & Knob. Currently, Kevin is a visiting scholar in the HCI group at the University of Florida, and his research focuses on user-centered design and emotional design, which studies users' perception of, response to, and feelings about products and interfaces.

Abstract
With the precipitous increase in data, performing text analytics using traditional methods has become increasingly difficult. From now until 2020, the world's data is predicted to double every year. Techniques to store and process these large data stores are quickly growing out of date. The increase in data size, with improper methods, could mean a large increase in retrieval and processing time. In short, the former techniques do not scale. The complexity of data formats is increasing; no longer can one assume data will be structured numbers and names. Databases now store more of a mix of structured and unstructured data. To support text analytics, queries over disparate data types cannot be an oversight.

In this proposal I introduce query-driven text analytics: the use of declarative semantics (a query) to decrease the amount of processing in analytic systems without a major sacrifice in accuracy. I demonstrate this in three ways. First, I add text analytics inside a parallel relational DBMS, where the user can use SQL and UDFs to choose the scope of their algorithm. Second, I alter a data mining algorithm so it uses an example query to drive computation. Finally, I propose an integrated question answering system over the different parts of the web.

Abstract
Text-based document classification and image retrieval are two of the most fundamental problems in data science. In the era of big data, how to efficiently search text documents and images while guaranteeing good accuracy has been one of the most interesting topics. In our project, we propose to combine these two topics for enhanced performance and possibly new applications. Currently, we have built an online search engine, which has demonstrated state-of-the-art accuracy on the Oxford Buildings dataset. In our presentation, we will focus on technical details from several aspects, including data collection, system implementation, and search algorithms. We will also introduce our software package for highly scalable approximate k-means clustering with OpenMPI support.
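
One common way to approximate k-means at scale is to use mini-batch updates; the sketch below uses scikit-learn on placeholder descriptors and is not our OpenMPI package, just an illustration of the approximation idea:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder "image descriptors"; in practice these would be SIFT-like features.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(10000, 128))

# Mini-batch k-means trades exactness for speed by updating centers on small batches.
kmeans = MiniBatchKMeans(n_clusters=256, batch_size=1024, random_state=0)
kmeans.fit(descriptors)
print(kmeans.cluster_centers_.shape)   # (256, 128): the learned visual vocabulary
```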

Abstract
While we train our computer vision systems with a series of images and labels, it is clear that children do not learn language this way. They are faced with a large variety of objects and behaviors visible at once, and must pull references from a jumble of words as they are still learning grammar. And yet, with a number of sometimes unintuitive learning strategies, they seem to be able to learn language grounded in their experiences faster than our top-of-the-line object recognition systems.

With the field of symbol grounding becoming more popular, work at the intersection of computer vision, natural language understanding, and cognitive science is poised to discover more complete and efficient ways of learning grounded language in AI systems. Advances in grounded language learning can be applied to scene description, dialogue systems, knowledge representation, and other fields. In this talk, I will cover our work so far on SALL-E, a system that uses child language learning strategies and pragmatic inference to perceptually ground language from video demonstrations. I will also cover the challenges we faced along the way and the precautions one must take to truly create a grounded language system.

Hyperspectral Classification of Savannah Tree Species Using k-fold Cross-Validated Non-linear SVM and MESMA

Abstract
Identifying savannah species at ecological scale is a major milestone for measuring biomass and carbon reserves and for predicting drought and the spread of invasive species. In this talk we perform classification and geo-mapping of tree species from hyperspectral imagery collected using the AVIRIS airborne sensor. We provide a thorough comparison of the effects of ATCOR and FLAASH atmospheric corrections on prediction accuracy. This study classifies common savannah tree species in the Ordway-Swisher Biological Station in north-central Florida, USA. Species classification was performed using a variety of Support Vector Machine kernels, at both the pixel level and the canopy level, where the polynomial kernel outperformed the others. We also apply MESMA (Multiple Endmember Spectral Mixture Analysis), build a spectral library, and examine the results. In addition, we look into LiDAR (Light Detection and Ranging) airborne data and find interesting patterns in species heights. All this information, along with expert knowledge available online such as the USDA PLANTS database and other resources, can lead to a much more informed classification of species.
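
A hedged sketch of the classification setup described above, using scikit-learn with a polynomial-kernel SVM and k-fold cross-validation on placeholder data rather than the actual AVIRIS pixels:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder hyperspectral pixels: 500 samples x 224 bands, 5 species labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 224))
y = rng.integers(0, 5, size=500)

# Non-linear SVM with a polynomial kernel, evaluated by k-fold cross-validation.
clf = SVC(kernel="poly", degree=3, C=1.0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())   # mean accuracy across folds (meaningless on random data)
```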

Oct 10

Ishan Patwa

Word Sense Disambiguation through Images

Abstract
The automatic disambiguation of word senses is of growing interest in the natural language processing community. The use of images to disambiguate short text with limited context is an important intermediary step in many natural language processing tasks. We are going to review our proposed method of solving the WSD problem and possible improvements on our preliminary results.

Abstract
Large-scale image retrieval is a big challenge because of the rapid increase of images on the web today. In this presentation, we will first give a brief introduction to image retrieval systems. Then we will show our own pipeline design for handling large-scale image retrieval using advanced parallel data processing systems, including Hadoop and Mahout. We will also talk about the severe challenges in scaling the system up and how to solve them. Finally, we will discuss our results and next steps.

Oct 2

Pawel Terlecki (Tableau Software)

An analytic data engine for visualization in Tableau

Abstract
The talk covers the history, architecture, and capabilities of the Tableau Data Engine. It is an in-house columnar database based on the MonetDB design and developed specifically to support users with mid-size datasets and no efficient analytic back-ends. We cover important components and design decisions, and give an overview of how industrial projects of this size start and evolve.

Short Bio
Pawel leads the query team at Tableau. His responsibilities include the vision, design, and implementation of various query processing elements of the Tableau visualization platform. One can find his contributions in the Tableau Data Engine, caching infrastructure, and data extraction. Prior to Tableau he worked on business applications, web frameworks, database servers (in particular MS SQL Server), and data mining projects. He holds a PhD in Computer Science from Warsaw University of Technology, with a specialization in information systems and knowledge discovery, and a BS in Economics from Warsaw University. He has published several works on databases and data mining and is a frequent attendee of major conferences in these fields. Performance and building reliable solutions are his passions.

Abstract
In social networks, there are several challenges for word sense disambiguation, including short context and little annotation/knowledge. While there is only limited textual information, we can use multi-modal data, including images, to help disambiguate word senses. We are going to review related work and propose new methods that use multi-modal data to solve WSD problems.

Abstract
Markov Logic Networks (MLNs) combine the domains of first-order logic and statistical probability by attaching weights to first-order formulas or rules. This talk will serve as an introduction to intuitively understanding MLNs, particularly how they perform inference and learn weights and structure. MLN structure learning is equivalent to weighted inference rule learning, and comparisons will be drawn with association rule mining metrics.
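
For reference, the distribution an MLN defines over possible worlds x, where w_i is the weight attached to formula i and n_i(x) counts its true groundings in x:

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_i w_i \, n_i(x') \Big)
```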

Abstract
Recent years have seen a tremendous research interest in knowledge base construction. These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge and extraction algorithms, all existing knowledge bases are incomplete. As one potential solution to knowledge expansion, we study the problem of rule mining in such knowledge bases. In this talk, I will survey the state-of-the-art rule mining algorithms and report potential research directions, our progress, and our contributions toward the rule-based solution of the knowledge expansion problem.

Abstract
Many current large-scale knowledge bases (KBs) are highly incomplete either due to errors in the construction process or because the knowledge is implicit as opposed to explicit. For example, 93.8% of people in Freebase have no birthplace and 78.5% have no nationality. The construction of inference rules from mining repeatable patterns in the KB has the potential to contribute additional knowledge to the KB. In this talk I will outline the most recent attempts at mining both structured and unstructured data for inference rules and elucidate similarities in methodologies and algorithms. Finally, I will present some ideas for future contributions to this nascent field.

Applications Quest: A Nominal Population Metric Approach to Diversity in Admissions

Abstract
In 2003, two landmark cases challenged the University of Michigan admissions policies, one focused on Law School admissions and the other on undergraduate admissions. In Grutter v. Bollinger, which focused on the Law School, the U.S. Supreme Court ruled 5-4 in favor of the Law School. However, in Gratz v. Bollinger, by a vote of 6-3, the Court reversed, in part, the University's undergraduate admissions policy of providing points for race/ethnicity. Therefore, the Court decided that race could be considered in admissions decisions, but could not be the deciding factor. Later, Michigan residents voted to adopt a ban on racial and gender preferences through Proposal 2. In 2007, the U.S. Supreme Court heard two cases on race-conscious school placement policies in Louisville and Seattle and struck down both programs. In 2013, the U.S. Supreme Court heard another case on this very topic, Fisher v. Texas, and sent the case back to the Fifth Circuit, citing that the case had not passed strict scrutiny. Applications Quest is a data mining tool that provides preference-free, holistic diversity using a patented nominal population metric. In this talk, Dr. Gilbert will discuss the legal implications of Applications Quest and the nominal population metric.

Short Bio
Dr. Juan E. Gilbert is the Andrew Banks Family Preeminence Endowed Chair and the Associate Chair of Research in the Computer & Information Science & Engineering Department at the University of Florida, where he leads the Human Experience Research Lab. He is also a Fellow of the American Association for the Advancement of Science, a National Associate of the National Research Council of the National Academies, an ACM Distinguished Scientist, and a Senior Member of the IEEE. Dr. Gilbert was recently named one of the 50 most important African-Americans in Technology.

Long Bio
Dr. Juan E. Gilbert is the Andrew Banks Family Preeminence Endowed Chair and the Associate Chair of Research in the Computer & Information Science & Engineering Department at the University of Florida, where he leads the Human Experience Research Lab. Dr. Gilbert has research projects in spoken language systems, advanced learning technologies, usability and accessibility, Ethnocomputing (Culturally Relevant Computing), and databases/data mining. He has published more than 140 articles, given more than 200 talks, and obtained more than $24 million in research funding. He is a Fellow of the American Association for the Advancement of Science. In 2012, Dr. Gilbert received the Presidential Award for Excellence in Science, Mathematics, and Engineering Mentoring from President Barack Obama. He was recently named one of the 50 most important African-Americans in Technology. He was also named a Speech Technology Luminary by Speech Technology Magazine and a national role model by Minority Access Inc. Dr. Gilbert is also a National Associate of the National Research Council of the National Academies, an ACM Distinguished Scientist, and a Senior Member of the IEEE. Recently, Dr. Gilbert was named a Master of Innovation by Black Enterprise Magazine, a Modern-Day Technology Leader by the Black Engineer of the Year Award Conference, and the Pioneer of the Year by the National Society of Black Engineers, and he received the Black Data Processing Associates (BDPA) Epsilon Award for Outstanding Technical Contribution. In 2002, Dr. Gilbert was named one of the nation's top African-American Scholars by Diverse Issues in Higher Education. In 2013, the Black Graduate and Professional Student Association at Auburn University named their Distinguished Lecture Series in honor of Dr. Gilbert. Dr. Gilbert testified before Congress on the Bipartisan Electronic Voting Reform Act of 2008 for his innovative work in electronic voting. In 2006, Dr. Gilbert was honored with a mural painting in New York City by City Year New York, a non-profit organization that unites a diverse group of 17- to 24-year-old young people for a year of full-time, rigorous community service, leadership development, and civic engagement.

Summer 2014

We had a discussion on different systems for large-scale graph processing and the pros and cons of each. The systems discussed include GraphLab, distributed GraphLab, GraphChi, PowerGraph, GraphX, and GIST.

Abstract
Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches; 2) We implement ProbKB on massive parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

Hyper-spectral Classification of Savannah Tree Species Using k-fold Cross-Validated Non-linear Support Vector Machines
In this paper we classify savannah tree species using AVIRIS hyper-spectral images; the pre-processing performed dramatically increased classification accuracy.

Abstract
Organizations such as companies, government agencies, and hospitals rely heavily on relational database management systems (RDBMSs) to store large amounts of structured and unstructured data. A deep analysis of the data stored in a database helps discover useful information, suggest conclusions, and support decision making. It helps companies make the next best decision, enables doctors to better assess their patients, and alleviates lawyers' document review processes. However, a deep and comprehensive understanding of data requires various machine learning algorithms and statistical methods, and several challenges exist in using state-of-the-art systems to perform analysis on data that resides in an RDBMS. First, an expensive big data transfer cost must be paid up front to move data between databases and external analytics systems. Second, many popular statistical packages do not scale up to production-sized datasets. Thus, enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers' pressing demands, researchers and database vendors have been pushing advanced analytics techniques into databases. This thesis makes two major contributions to the in-database analytics community. First, it contributes an in-RDBMS statistical text analysis package to the community and introduces GPText, the Greenplum parallel statistical text analysis framework, which seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib. Second, it presents a GIST operator for large-scale statistical inference to address the limitations of current RDBMSs. The two contributions are summarized in the following two paragraphs.

MADlib Text Analytics and GPText
Text analytics has gained much attention in the big data research community due to the large amounts of text data generated every day in organizations such as companies, government agencies, and hospitals in the form of emails, electronic notes, and internal documents. Many companies store this text data in relational databases because they rely on databases for their daily business needs. We bring statistical text analysis power into MADlib, a state-of-the-art in-database analytics package that can be installed on PostgreSQL and Greenplum. We developed and contributed a linear-chain conditional random field (CRF) module to MADlib to enable information extraction tasks such as part-of-speech tagging and named entity recognition, and we show an elegant in-RDBMS parallel implementation of CRF that achieves sub-linear scalability. We introduce GPText, the Greenplum parallel statistical text analysis framework, which seamlessly integrates the Solr search engine and applies statistical algorithms such as k-means and LDA using MADlib. We describe an eDiscovery application built on the GPText framework.

GIST: An Operator for Large Scale Statistical Inference
Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This paper presents the General Iterative State Transition (GIST) operator, an RDBMS operator for large-scale inference. GIST receives a state, which is generated by a UDA, and then performs rounds of transitions on the state until it has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem and produces a solution that is an order of magnitude faster than the state of the art for the second problem.
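
A rough, hypothetical sketch of the UDA-plus-GIST contract described above: an aggregate folds the input into an initial state, and the GIST operator then runs transition rounds on that state until convergence. The real operator lives inside the RDBMS and runs its rounds in parallel; the toy "inference" here is just an iterative mean estimate:

```python
def uda_build_state(rows):
    """User-Defined Aggregate stand-in: fold the input rows into an initial state."""
    return {"values": list(rows), "estimate": 0.0}

def gist_transition(state):
    """One GIST round: move the estimate toward the mean of the values."""
    target = sum(state["values"]) / len(state["values"])
    state["estimate"] += 0.5 * (target - state["estimate"])
    return abs(target - state["estimate"])   # remaining change, used as a convergence test

def gist(state, tol=1e-6, max_rounds=1000):
    """Iterate transitions on the state until it converges or the round budget runs out."""
    for _ in range(max_rounds):
        if gist_transition(state) < tol:
            break
    return state

print(gist(uda_build_state([1.0, 2.0, 3.0]))["estimate"])   # converges to ~2.0
```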

Spring 2014

A Short Introduction to SciDB
The presentation first briefly introduces SciDB, with its architecture and array processing, and then focuses on the work on NEON image import/export in SciDB.

Knowledge Feedback on Prediction of Post-operative Outcomes
A study was conducted in collaboration with UF Health to establish the requirements of an algorithm for predicting post-operative outcomes. We will discuss the methodology used in this study along with a demo of the software used. The presentation will focus on the experimental data collected in the most recent version of this study and the analysis/results derived from them.

EDiscovery
This presentation will provide a brief description of the project "SMARTeR", being developed for document retrieval in collaboration with UF Law. We will focus on an overview of the algorithm developed and a demo of the software that will be provided to the law school. A comparison will be presented detailing the advantages of the algorithm over present techniques of document retrieval.

Abstract
Enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing (OLAP) from a database. To meet customers' pressing demands, database vendors have been pushing advanced analytics techniques into databases. Every major RDBMS offers a User-Defined Aggregate (UDA) facility to implement many analytical techniques in parallel. However, for inference algorithms like Markov chain Monte Carlo, where some setup is done for the problem and then most of the work is performed by iterating over a large state, the UDA model is not a natural fit. This talk presents the General Iterative State Transition (GIST) operator, an RDBMS operator for large-scale inference. GIST receives a state, which is generated by a UDA, and then performs rounds of transitions on the state until it has converged to the desired result. We argue that the combination of UDA and GIST can express the majority of learning algorithms, thus significantly extending the analytical capabilities of RDBMSs. We exemplify the use of GIST through two high-profile applications: cross-document coreference and loopy belief propagation. We show that the database-GIST combination allows us to tackle a task 27 times larger than the state of the art for the first problem and produces a solution that is an order of magnitude faster than the state of the art for the second problem.

Abstract
Knowledge graphs are becoming the next big goal for the web, and researchers have developed various ways to construct them. However, user interfaces for knowledge bases remain limited. In this talk, we present VisKB, a visual search engine that allows users to interactively query and explore web-scale knowledge graphs. VisKB visualizes only the part of the knowledge graph relevant to a user's query and lets users interact with the visualization to issue further queries that expand it. In this way, VisKB avoids visualizing the entire graph without losing information. Using DBPedia as the data source, we show that VisKB helps users discover interesting properties and relationships of the entities they are interested in.
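As a rough illustration of the kind of neighborhood query such a front end might issue against DBpedia (a generic sketch, not VisKB's actual code), the snippet below uses the SPARQLWrapper library to fetch the immediate properties of one entity; each result row is an edge the visualization could render and expand.

    # Generic DBpedia neighborhood query; the entity and endpoint are
    # illustrative, and this is not VisKB's implementation.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?property ?value WHERE {
            <http://dbpedia.org/resource/University_of_Florida> ?property ?value .
        } LIMIT 25
    """)
    sparql.setReturnFormat(JSON)

    # Each binding is one edge of the subgraph a user could interactively expand.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["property"]["value"], "->", row["value"]["value"])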

Understanding Climate Change: Opportunities and Challenges for Data Driven Research

Abstract
Climate change is the defining environmental challenge facing our planet, yet there is considerable uncertainty regarding the social and environmental impact due to the limited capabilities of existing physics-based models of the Earth system. This talk will present an overview of research being done in a large interdisciplinary project on the development of novel data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results.

Speaker Bio
Vipin Kumar is currently William Norris Professor and Head of Computer Science and Engineering at the University of Minnesota. His research interests include high-performance computing and data mining, and he is currently leading an NSF Expedition project on understanding climate change using data-driven approaches. He has authored over 250 research articles and co-edited or coauthored 10 books, including the widely used textbooks "Introduction to Parallel Computing" and "Introduction to Data Mining", both published by Addison-Wesley. Kumar co-founded the SIAM International Conference on Data Mining and served as a founding co-editor-in-chief of the Journal of Statistical Analysis and Data Mining (an official journal of the American Statistical Association). Kumar is a Fellow of the ACM, IEEE, and AAAS. He received the Distinguished Alumnus Award from the Indian Institute of Technology (IIT) Roorkee (2013), the Distinguished Alumnus Award from the Computer Science Department, University of Maryland, College Park (2009), and the IEEE Computer Society Technical Achievement Award (2005). Kumar's foundational research in data mining and its applications to scientific data was honored by the ACM SIGKDD 2012 Innovation Award, the highest award for technical excellence in the field of Knowledge Discovery and Data Mining (KDD).

Building Data Storage, Retrieval and Analysis Platform for Ecological Research at Continental Scale

We will specifically talk about applying state-of-the-art machine learning techniques to remote sensing data, where the goal is species classification of plants. We also discuss existing platforms that allow scientists to easily share and query data. Our goal is to build a platform for data analysis over massive amounts of ecological data, centered around remote sensing data such as hyperspectral and LiDAR imagery, for ecological research and applications such as climate change and invasive species identification at continental scale.
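As a minimal sketch of the kind of per-pixel classification pipeline this involves (with synthetic arrays standing in for real hyperspectral bands and field-collected species labels, and scikit-learn chosen only for illustration):

    # Synthetic stand-ins for hyperspectral reflectance bands and species labels;
    # not the actual data or model used in our work.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_pixels, n_bands, n_species = 1000, 50, 5
    X = rng.random((n_pixels, n_bands))        # per-pixel spectra (placeholder)
    y = rng.integers(0, n_species, n_pixels)   # species labels (placeholder)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))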

Feb 28

Kushal Arora

Universal Knowledge Base

This talk covers ontology alignment of multiple knowledge bases to create a universal knowledge base with integrated schema and entities. The work is based on the PIDGIN paper from CMU -- ontology alignment using web text as interlingua.

Abstract
Cluster analysis is a common task in exploratory data mining that involves combining entities with similar properties into groups. However, most clustering techniques face one key challenge when used in real-world applications: the algorithms expect a quantitative, deterministic distance function to quantify the similarity between two entities, whereas in most real-world problems such similarity measurements usually require subjective domain knowledge that can be hard for users to explain.

In this talk, we present MindMiner, a mixed-initiative interface and visualization system for capturing subjective similarity measurements via a combination of new interaction techniques and machine learning algorithms. MindMiner collects qualitative, hard-to-express similarity measurements from users via active polling with uncertainty and example-based visual constraint creation. MindMiner also formulates human prior knowledge into a set of inequalities and learns a quantitative similarity distance metric via convex optimization. In a 12-subject peer-review understanding task, we found MindMiner was easy to learn and use, and could capture users' implicit knowledge about writing performance and cluster target entities into groups that match subjects' mental models. We also found that MindMiner's constraint suggestions and uncertainty polling functions could improve both the efficiency and the quality of clustering.
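The "inequalities plus convex optimization" step can be illustrated with a small, hypothetical sketch: learning a diagonal Mahalanobis metric that keeps user-marked similar pairs close while pushing dissimilar pairs past a unit margin. This is a generic formulation of constraint-based metric learning, not MindMiner's exact objective.

    # Generic constraint-based metric learning on toy data; the objective and
    # margin are illustrative assumptions, not MindMiner's formulation.
    import numpy as np
    import cvxpy as cp

    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9]])
    similar = [(0, 1), (2, 3)]      # pairs the user grouped together
    dissimilar = [(0, 2), (1, 3)]   # pairs the user kept apart

    # Learn feature weights w for d(x, y)^2 = sum_k w_k * (x_k - y_k)^2.
    w = cp.Variable(X.shape[1], nonneg=True)

    def sqdist(i, j):
        d = X[i] - X[j]
        return cp.sum(cp.multiply(w, d * d))

    # Keep similar pairs close; push dissimilar pairs beyond a unit margin.
    constraints = [sqdist(i, j) >= 1 for i, j in dissimilar]
    objective = cp.Minimize(sum(sqdist(i, j) for i, j in similar))
    cp.Problem(objective, constraints).solve()
    print("learned feature weights:", w.value)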

Speaker Bio
Dr. Jingtao Wang is an Assistant Professor in Computer Science and the Learning Research and Development Center (LRDC) at the University of Pittsburgh. His primary research direction is Human-Computer Interaction (HCI). Jingtao's current research interests include mobile interfaces, education/learning technology, end-user programming, and machine learning and its applications in HCI. He received his Ph.D. degree in computer science from the University of California, Berkeley. Before that, Jingtao was a researcher and team lead at the IBM China Research Lab, working on large-vocabulary online handwriting recognition technologies for Asian languages. He received his master's and bachelor's degrees from Xi'an Jiaotong University, China.

Abstract
Information extraction and human collaboration techniques are widely applied in the construction of web-scale knowledge bases. However, these knowledge bases are often incomplete or uncertain. In this paper, we present ProbKB, a probabilistic knowledge base designed to infer hidden facts in a scalable, probabilistic, and principled manner using a relational DBMS. The novel contributions we make to achieve scalability and high quality are: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows efficient SQL-based inference algorithms for knowledge expansion that apply inference rules in batches; 2) We implement ProbKB on massively parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.
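To illustrate the batch, set-oriented flavor of SQL-based rule application (a toy sketch with an assumed triple-store schema and one hand-written rule, not ProbKB's actual schema, rule set, or probabilistic machinery):

    # Toy sketch: apply the rule livesIn(x, y) AND locatedIn(y, z) -> livesIn(x, z)
    # to all matching fact pairs in a single set-oriented SQL statement.
    # The schema and rule are illustrative assumptions, not ProbKB's.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE facts (subj TEXT, pred TEXT, obj TEXT)")
    con.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
        ("Alice", "livesIn", "Gainesville"),
        ("Gainesville", "locatedIn", "Florida"),
    ])

    # One INSERT ... SELECT applies the rule to every matching pair of facts.
    con.execute("""
        INSERT INTO facts
        SELECT a.subj, 'livesIn', b.obj
        FROM facts a JOIN facts b ON a.obj = b.subj
        WHERE a.pred = 'livesIn' AND b.pred = 'locatedIn'
    """)
    print(con.execute("SELECT * FROM facts WHERE pred = 'livesIn'").fetchall())
    # -> [('Alice', 'livesIn', 'Gainesville'), ('Alice', 'livesIn', 'Florida')]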

Abstract
The spectacular failure of the Affordable Care Act website ("Obamacare") has focused public attention on software engineering. Yet experienced practitioners mostly sighed and shrugged, because the historical record shows that only 10% of large (>$10M) software projects using conventional methodologies such as Waterfall are successful. In contrast, Amazon and others successfully build comparably large and complex sites with hundreds of integrated subsystems by using modern agile methods and service-oriented architecture.

This contrast is one reason Industry has complained that academia ignores vital software topics, leaving students unprepared upon graduation. In too many courses, well-meaning instructors teach traditional approaches to software development that are neither supported by tools that students can readily use, nor appropriate for projects whose scope matches a college course. Students respond by continuing to build software more or less the way they always have, which is boring for students, frustrating for instructors, and disappointing for industry.

This talk explains how the confluence of cloud computing and Massive Open Online Courses (MOOCs) has allowed us to greatly improve both the effectiveness and the reach of UC Berkeley's undergraduate software engineering course. The shift toward Software as a Service has not only revolutionized the future of software, but changed it in a way that makes it easier and more rewarding to teach. UC Berkeley's revised Software Engineering course leverages this productivity to allow students both to enhance a legacy application and to develop a new app that matches the requirements of non-technical customers. By experiencing the whole software life cycle repeatedly within a single college course, and by using the same tools and techniques that professionals use, students actually use and learn to appreciate the skills that industry has long encouraged. The course is now popular with students, rewarding for faculty, and praised by industry.

The technology developed for the course has also been used to offer a subset of the material as a MOOC to hundreds of thousands of students, and through an arrangement with edX, is available to classroom instructors interested in trying this approach as a SPOC (Small Private Online Course) offering instructor support far beyond what is usually available for traditional textbooks. Indeed, our experience has been that despite recent hand-wringing about MOOCs destroying higher education, appropriate use of MOOC technology can improve on-campus pedagogy, increase student throughput while raising course quality, and even reinvigorate faculty teaching.

Speaker Bio
Armando Fox (fox@cs.berkeley.edu) is a Professor in Berkeley's Electrical Engineering & Computer Science Department as well as the Faculty Advisor to the UC Berkeley MOOCLab. He co-designed and co-taught Berkeley's first Massive Open Online Course on Engineering Software as a Service, currently offered through edX, through which over 10,000 students worldwide have earned certificates of mastery. He also serves on edX's Technical Advisory Committee, helping to set the technical direction of their open MOOC platform. With colleagues in Computer Science and in the School of Information, he is doing research in online education including automatic grading of students' computer programs and improving student engagement and learning outcomes in MOOCs. His other computer science research in the Berkeley ASPIRE project focuses on highly productive parallel programming.

While at Stanford he received teaching and mentoring awards from the Associated Students of Stanford University, the Society of Women Engineers, and Tau Beta Pi Engineering Honor Society. He has been a "Scientific American Top 50" researcher, an NSF CAREER award recipient, a Gilbreth Lecturer at the National Academy of Engineering, a keynote speaker at the Richard Tapia Celebration of Diversity in Computing, and an ACM Distinguished Scientist. In previous lives he helped design the Intel Pentium Pro microprocessor and founded a successful startup to commercialize his UC Berkeley Ph.D. research on mobile computing. He received his other degrees in electrical engineering and computer science from MIT and the University of Illinois. He is also a classically-trained musician and performer, an avid musical theater fan and freelance Music Director, and bilingual/bicultural (Cuban-American) New Yorker living in San Francisco.

Abstract
Today's scientific processes heavily depend on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by sophisticated simulations. As data management software proves inefficient, inadequate, or insufficient to meet the needs of scientific applications, the scientific community typically uses special-purpose legacy software. With the exponential growth of dataset size and complexity, however, application-specific systems no longer scale to efficiently analyse the relevant parts of their data, thereby slowing down the cycle of analysing, understanding, and preparing new experiments. I will illustrate the problem with a challenging application on brain simulation data and will show how the problems from neuroscience translate into challenges for the data management community. I will show how novel data management technology can enable today's neuroscientists to simulate and discover a meaningful percentage of the human brain at unprecedented levels of detail. Finally I will describe the challenges of integrating simulation and medical neuroscience data to advance our understanding of the functionality of the brain.

Speaker Bio
Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland. Her research interests are in database systems and applications, and in particular (a) in strengthening the interaction between the database software and emerging hardware and I/O devices, and (b) in automating database management to support computationally demanding, data-intensive scientific applications. She has received an ERC Consolidator Award (2013), a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), eight best-paper awards at top conferences (2001-2011), and an NSF CAREER award (2002). She earned her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is a senior member of the IEEE and a member of the ACM, serves as the ACM SIGMOD vice chair, and has also been a CRA-W mentor.