Research Notes

bigdata@csail


The MIT Big Data Challenge

The MIT Big Data Initiative at CSAIL is organizing competitions designed to spur innovation in how we think about and use data to address major societal issues.

MapD (Massively Parallel Database) is an analytics database being built by Todd Mostak and Prof. Sam Madden at MIT that allows interactive querying of big datasets. It takes advantage of the immense computational power and memory bandwidth available in commodity-level, off-the-shelf multicore...

The goal of the MIT Big Data Initiative, a multi-year effort launched in May 2012, is to identify and develop new technologies needed to solve next generation data challenges that will require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide. We want to enable people to leverage Big Data by developing systems and platforms that are reusable and scalable across multiple application domains.

Our approach includes two important aspects. First, we will work closely with key industry and government stakeholders to provide real-world applications and drive impact. Promoting in-depth interactions between academic researchers, industry and government is a key goal. Second, we believe the solution to Big Data is fundamentally multi-disciplinary. The team includes faculty and researchers hailing from diverse research backgrounds, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization, as well as domain experts in finance, industrial, medical, smart infrastructure, education and science.

The Big Data Problem

We define big data as data that is too big, too fast, or too hard for existing tools to process. Here, “too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other places. “Too fast” means that not only is data big, but it needs to be processed quickly – for example, to perform fraud detection at a point of sale, or determine what ad to show to a user on a web page. “Too hard” is a catchall for data that doesn’t fit neatly into an existing processing tool, i.e., data that needs more complex analysis than existing tools can readily provide. Examples of the big data problem abound.

Web Analytics

On the Internet, many websites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (tens of terabytes per year) of accumulating user and log data, even for medium sized websites. Increasingly, companies want to be able to mine this data to understand limitations of their site, improve response time, offer more targeted ads, and so on. Doing this requires tools that can perform complicated analytics on data that far exceeds the memory of a single machine or even a cluster of machines.

Finance

As another example, consider the big data problem as it applies to banks and other financial organizations. These organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35B transactions per year; if they record 1 KB of data per transaction, this represents 35 terabytes of data per year. Visa, and the large banks that issue Visa cards, would like to use this data in a number of ways: to predict customers at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.
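As a sanity check, the storage figure is back-of-envelope arithmetic (the transaction volume is quoted above; the 1 KB record size is the illustrative assumption the text makes):

```python
# Annual transaction data volume: ~35B transactions/year at ~1 KB each.
transactions_per_year = 35_000_000_000
bytes_per_transaction = 1_000  # 1 KB (decimal), an illustrative figure

total_bytes = transactions_per_year * bytes_per_transaction
total_terabytes = total_bytes / 10**12
print(f"{total_terabytes:.0f} TB per year")  # 35 TB
```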

Medical

Consider the impact of new sensors on our ability to continuously monitor a patient's health. Recent advances in wireless networking, miniaturization of sensors via MEMS processes, and incredible advances in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals on patients, even outside of the doctor's office. These signals measure the functioning of the heart, brain, circulatory system, etc. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to provide outpatient care, by understanding how patients are progressing outside of the doctor's office, and when they need to be seen urgently. Additionally, by correlating signals from thousands of different patients, it becomes possible to develop a new understanding of what is normal or abnormal, and of what kinds of signal features are indicative of potentially serious problems.

Computational Platforms

We are building several parallel data processing platforms, including SciDB, BlinkDB, and several cloud-based deployment platforms, including FOS and Relational Cloud. The goal of these platforms is to make it easy for developers of big data applications to write programs much as they would on a single-node computational environment, and to be able to rapidly deploy those applications on tens or hundreds of nodes. Additionally, as the computation and storage requirements of applications change, these platforms should be able to dynamically and elastically adapt to those changes.

Scalable Algorithms

We are developing a range of algorithms designed to deal with very large volumes of data, and to process that data in parallel. These include parallel implementations of a range of known algorithms, including matrix computations, as well as statistical operations like regression, optimization methods like gradient descent, and machine learning algorithms like clustering and classification. In addition, we are developing fundamental new types of algorithms designed to handle the challenges of Big Data. For example, we are working on sublinear algorithms that can compute a range of statistics, such as estimates of the number of distinct items in a set, using space that is exponentially smaller than the input. Additionally, we are developing new algorithms for encoding, comparing, and searching massive data sets; specific examples include hash-based similarity search on massive scale data, and algorithms for compressed sensing that provide a new way to encode sparse matrices that arise in a number of scientific applications.
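To illustrate the sublinear, small-space flavor of such algorithms, here is a minimal KMV (k-minimum-values) sketch for estimating the number of distinct items in a stream: it keeps only the k smallest hash values regardless of how long the stream is. This is a textbook construction shown for illustration, not the group's specific algorithm.

```python
import hashlib

def _hash01(x):
    # Map an item to a pseudo-uniform value in [0, 1).
    h = hashlib.sha1(str(x).encode()).hexdigest()
    return int(h, 16) / 16**40

def kmv_estimate(items, k=256):
    """Estimate the number of distinct items from the k smallest hash
    values seen; space is O(k), independent of the stream length."""
    smallest = set()
    for x in items:
        v = _hash01(x)
        if len(smallest) < k:
            smallest.add(v)
        elif v < max(smallest) and v not in smallest:
            smallest.discard(max(smallest))
            smallest.add(v)
    if len(smallest) < k:
        return len(smallest)  # saw fewer than k distinct values: exact
    # The k-th smallest of n uniform draws is ~ k/n, so n ~ (k-1)/kth_min.
    return int((k - 1) / max(smallest))

# 10,000 stream items, but only 1,000 distinct values.
print(kmv_estimate(i % 1000 for i in range(10_000)))  # close to 1000
```

With k values kept, the relative error is roughly 1/sqrt(k), so larger sketches buy accuracy at a small, fixed space cost.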

Machine Learning and Understanding

On top of these algorithms, we are deploying a number of novel machine learning applications focused on machine understanding in specific domains. For example, in work on scene understanding in images, we are building tools that automatically label parts of an image, or that classify an image as belonging to one or more categories based on the objects that appear in it. As a second example, we are using natural language processing to convert massive quantities of tweets and text reviews on the web into structured information about products, restaurants, and services: the type of content in some text (e.g., a food review, a rating), an assessment of the sentiment of the text, and so on.

Privacy and Security

Finally, because much of the mining and analysis in a big data context involves sensitive, private information, we are developing technologies and policies for protecting and anonymizing data, and for allowing people to retain control over their data. As an example, in the CryptDB project, we are building a database system that stores data in the cloud in an encrypted format, in such a way that a curious database or system administrator cannot decrypt it. Users retain the encryption keys to their data, but can execute queries over that encrypted data on the database server, enabling much better performance than simply sending the data back and decrypting it on the client's machine.
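To make the idea concrete, here is a toy sketch of one layer of this approach: if values are encrypted deterministically on the client, the server can evaluate equality predicates over ciphertexts without ever holding the key. HMAC stands in for a deterministic cipher here and all names are invented for illustration; real CryptDB layers several encryption schemes to support more query types.

```python
import hmac
import hashlib

KEY = b"client-secret-key"  # never leaves the client

def det_encrypt(value: str) -> str:
    # Deterministic: equal plaintexts yield equal ciphertexts, so the
    # server can match rows on equality without seeing plaintext.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# Client side: encrypt sensitive columns before upload.
server_rows = [{"name": det_encrypt(n), "visits": v}
               for n, v in [("alice", 3), ("bob", 5), ("alice", 2)]]

# Server side: operates on ciphertexts only.
def server_select(rows, enc_name):
    return [r["visits"] for r in rows if r["name"] == enc_name]

# The client issues a query by encrypting the constant, not the data.
result = server_select(server_rows, det_encrypt("alice"))
print(result)  # [3, 2]
```

The trade-off is that deterministic encryption leaks equality of values to the server, which is exactly why real systems reserve it for columns that are actually queried this way.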

Work in these four areas is coupled with application experts in finance (Professor Andrew Lo), medicine (Professor John Guttag), science (Professor Michael Stonebraker), education (through a relationship with the MITx initiative), and transportation (Professor Balakrishnan and Professor Madden).

We provide a survey of 31 quantitative measures of systemic risk in the economics and finance literature, chosen to span key themes and issues in systemic risk measurement and management. We motivate these measures from the supervisory, research, and data perspectives in the main text, and present concise definitions of each risk measure—including required inputs, expected outputs, and data requirements—in an extensive appendix.

We have deployed a data sensing and data sharing architecture in the city of Trento in order to 'mashup' government, company, and individual mobile data. The goal is to validate the value, monetization, and privacy/ownership issues of Big Data in running a 'smart city.' Joint with Telefonica, Telecom Italia, the Government of Trento, and the European Inst. of Technology.

BlinkDB is a database system that runs on top of Hadoop (MapReduce), accepting SQL queries and translating them into MapReduce jobs. The key idea is that rather than running queries over the entire data set, it runs them on a random (precomputed) sample of the data, and uses sampling theory to estimate the true query answer.
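The mechanism can be sketched in a few lines: compute the aggregate on a small random sample and use standard sampling theory to attach a confidence interval. The data, sample size, and column here are made up for illustration; BlinkDB's actual sample construction is stratified and more sophisticated.

```python
import random
import statistics

random.seed(42)
# A stand-in for a big table column, e.g. order totals.
population = [random.expovariate(1 / 50) for _ in range(100_000)]

sample = random.sample(population, 5_000)          # precomputed 5% sample
est_mean = statistics.fmean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5
lo, hi = est_mean - 1.96 * stderr, est_mean + 1.96 * stderr  # ~95% CI

true_mean = statistics.fmean(population)
print(f"estimate {est_mean:.2f} +/- {1.96 * stderr:.2f} (true {true_mean:.2f})")
```

Because the standard error shrinks as 1/sqrt(sample size), a 5% sample already answers this aggregate to within a couple of percent while scanning 20x less data.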

The next generation of search engines should not simply retrieve URLs, but should aim at retrieving information. We designed a system that leads into this next generation, leveraging information from across the Internet to grow an authoritative list on almost any topic. Our method starts from a small seed of examples, and intelligently grows a list of items relevant to the seed.

Scalable machine learning systems will underpin the complex data analytics afforded by Big Data. The goal of the FlexGP project is machine learning that scales by exploiting cloud-scale parallelism and resource elasticity. FlexGP launches diverse learning engines onto the cloud that differ in their training-data partitions and explanatory dimensions, as well as in their model expressions and objective functions. Among other related questions, the Evolutionary Design and Optimization Group is working to learn how much data is enough given Big Data's high dimensionality and volume, and how the diverse outcomes of factored learners can be efficiently fused.

We apply machine-learning techniques to construct nonlinear nonparametric forecasting models of consumer credit risk. By combining customer transactions and credit bureau data from January 2005 to April 2009 for a sample of a major commercial bank’s customers, we are able to construct out-of-sample forecasts that significantly improve the classification rates of credit-card-holder delinquencies and defaults, with linear regression R2’s of forecasted/realized delinquencies of 85%.

Many use cases for business-oriented databases involve the creation of tailor-made summaries known as "reports". Report development is tedious because multiple SQL queries may be required to generate a single report, because queries may include complex combinations of formulas and aggregate functions (e.g. averages of totals), and because the visual output layout of non-tabular results must be manually defined through the use of templating languages or a graphical form editor. This project aims to reduce the user input required for report development by combining a visual query system generating hierarchical data with a system for automatic creation of report page layouts.

Modern database management systems (DBMS) have been designed to efficiently store, manage and perform computations on massive amounts of data. In contrast, many existing visualization systems do not scale seamlessly from small data sets to enormous ones. We have designed a three-tiered visualization system called ScalaR to deal with this issue. ScalaR dynamically performs resolution reduction when the expected result of a DBMS query is too large to be effectively rendered on existing screen real estate. Instead of running the original query, ScalaR inserts aggregation, sampling or filtering operations to reduce the size of the result. This paper presents the design and implementation of ScalaR, and shows results for an example application, displaying satellite imagery data stored in SciDB as the back-end DBMS.
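The core rewrite step can be sketched as follows. When the DBMS estimates that a query returns more rows than the screen can usefully show, the query is wrapped in a sampling or limiting operation before being sent to the back end. The function names, threshold, and SQL dialect below are illustrative, not ScalaR's actual interface.

```python
def reduce_resolution(query: str, estimated_rows: int,
                      max_rows: int = 10_000, mode: str = "sample") -> str:
    """Rewrite `query` so its result fits the available screen resolution."""
    if estimated_rows <= max_rows:
        return query  # small enough to render directly
    if mode == "sample":
        fraction = max_rows / estimated_rows
        # Uniform row sampling via a random predicate (PostgreSQL-style).
        return (f"SELECT * FROM ({query}) AS q "
                f"WHERE random() < {fraction:.6f}")
    if mode == "limit":
        return f"SELECT * FROM ({query}) AS q LIMIT {max_rows}"
    raise ValueError(f"unknown mode: {mode}")

q = "SELECT x, y FROM satellite_pixels"
print(reduce_resolution(q, estimated_rows=5_000_000))
```

An aggregation mode (e.g. binning coordinates and averaging values per bin) slots into the same dispatch; the key design point is that reduction happens inside the DBMS, so the full result never crosses the wire.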

The new field of energy-efficient algorithms aims to develop new techniques for solving computational problems with vastly reduced energy consumption—for some problems, by several orders of magnitude—in exchange for a small increase in time and memory requirements. Specifically, we explore how to algorithmically exploit reversible computation, an idea that has been around since the 1970s and has just started to become a practical reality in the latest AMD chips, but for which we have only just begun to understand how to design efficient algorithms. Our preliminary investigations indicate that some basic problems (but not others) can have their energy cost reduced substantially, to far less than that spent by a traditional algorithm on traditional hardware. This theoretical groundwork will become especially important over the next two decades, when Landauer's Principle predicts that the energy efficiency of computers (kilowatt-hours spent per nonreversible computation) must stop improving. Improving the energy efficiency of computation will reduce the roughly $15 billion spent annually to power today's data centers, not to mention the many other computers in the world processing big data. Furthermore, reducing the energy consumed by a chip reduces its heat generation, which should allow that chip to run proportionally faster with the same cooling system, or to run proportionally longer powered by the same battery. Thus our work will enable computation to become cheaper, faster, and longer lasting, and determine the theoretical limits thereof.

In this project, the goal is to identify boundaries between different types of underground rocks using seismic sensors. Such boundaries are of interest in hydrocarbon exploration as they are places where oil is often present. These sensors produce massive streams of data that need to be mined to understand the location of boundaries.

Modern data services manage data by storing it on distributed servers. It is a common requirement of such systems that they ensure that the data is available to the users even though the system components can be unreliable, for instance, the servers can crash. It is also important that the data should appear to the users as if stored on a single centralized system, even though the data is stored in distributed servers. For instance, when the data is being constantly updated, a user that reads from the system should obtain the latest version. In this talk, we present a new approach for the design of storage systems satisfying these requirements. Our approach merges ideas from two distinct fields, coding theory and distributed computing theory, to create a solution that incurs a significantly smaller footprint in terms of the amount of communication and storage as compared with current state of the art.
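The storage savings from coding over plain replication show up even in the simplest possible code: a single XOR parity block lets any one lost block be rebuilt from the survivors at (k+1)/k storage overhead, versus 2x for mirroring or 3x for triplication. Production systems use Reed-Solomon-style codes that tolerate multiple simultaneous failures; this sketch is just the single-failure special case.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = b"distributed storage with codes"   # 30 bytes, split across k servers
k = 3
size = len(data) // k
blocks = [data[i * size:(i + 1) * size] for i in range(k)]

# One extra server holds the XOR of all data blocks.
parity = xor_blocks(xor_blocks(blocks[0], blocks[1]), blocks[2])

# Suppose the server holding blocks[1] crashes: rebuild it from survivors.
rebuilt = xor_blocks(xor_blocks(blocks[0], blocks[2]), parity)
print(rebuilt == blocks[1], "overhead:", (k + 1) / k)
```

The research challenge the paragraph describes is doing this while data is also being *updated* concurrently and consistently, which is where the coding-theory and distributed-computing ideas have to be merged.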

We are using the ALFA group's MOOCdb project as a framework to explore specific technical aspects related to global research collaboration around online education data. We are exploring software sharing, common data organization, data privacy and data access. The piloting effort uses data from several of MITx courses. Our activities recognize the importance of a transparent, fair access policy on online education data which respects privacy, confidentiality and the legal obligations of data controllers. We recognize that computer technology will play a key role in supporting general access so that respect for privacy and confidentiality is ideally balanced with the public good of investigating the digital evidence gathered from online education offerings. We are interested in joining forces with all who want to participate in these activities. Please contact us for more information.

Big Data needs Big Processors, and Big Processors need Big Caches. Increasingly, however, power and thermal considerations dictate that many small processors and many small caches supplant the paradigm of few big processors and caches. The Execution Migration Machine (EM²) project aims to find the best way of using these resources.

The goal of Linked Data is to replace traditional app-data silos with a universal integration platform to provide globally contextualized information using global identifiers, authentication, authorization, storage, and privacy. The architecture separates application from data giving users control of the data and where it’s stored, independent of the choice of application.

The goal of this project is to study computer and human vision when large amounts of visual data become available. We are developing the Scene UNderstanding (SUN) database, a large database of images found on the web, organized by scene type, that are being fully segmented and annotated. With this large database we are developing computer vision algorithms for scene understanding that combine large training sets with non-parametric (memory-based) methods. In parallel, we are also studying how humans memorize large amounts of visual information. Through this we try to understand which representations might be useful for developing new, efficient computer vision algorithms, and also how computational models of human memory can be used to predict which images will be remembered.

Ease of information flow is both the boon and the bane of large-scale, decentralized systems like the World Wide Web. For all the benefits and opportunities brought by the information revolution, with that same revolution have come the challenges of inappropriate use. Such excesses and abuses in the use of information are most commonly viewed through the lens of information security. This paper argues that debates over online privacy, copyright, and information policy questions have been overly dominated by the access restriction perspective. Our alternative is to design systems that are oriented toward information accountability and appropriate use, rather than information security and access restriction. Our goal is to extend the Web architecture to support transparency and accountability.

We are designing machine learning algorithms whose results are meaningful and intuitive to human experts, yet have predictive accuracy on par with the state-of-the-art machine learning algorithms. These models are parsimonious (sparse), as humans can handle only a handful of cognitive entities at one time, and are in the forms of a decision list or linear scoring system. Our algorithms have been used to design predictive models for medical condition prediction, stroke prediction in medical patients, prediction of violent crime in youth raised in out-of-home care, and for other applications.
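A linear scoring system of the kind described above can be written in a handful of lines: a few integer-weighted binary features whose sum maps to a risk band. The features and point values below are invented for illustration, not taken from the group's published models.

```python
# Hypothetical scorecard: (feature name, points awarded if true).
SCORECARD = [
    ("age_over_60", 2),
    ("prior_event", 3),
    ("abnormal_ecg", 2),
    ("smoker", 1),
]

def risk_score(patient: dict) -> int:
    """Sum the points for every feature that is present and true."""
    return sum(pts for feat, pts in SCORECARD if patient.get(feat))

def risk_band(score: int) -> str:
    return "high" if score >= 5 else "medium" if score >= 3 else "low"

p = {"age_over_60": True, "prior_event": True, "smoker": False}
s = risk_score(p)
print(s, risk_band(s))  # 5 high
```

The appeal is exactly the parsimony the paragraph mentions: a clinician can audit every point in the model, yet learned scorecards of this form can match black-box accuracy on many tasks.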

Modern use of data relies heavily on predictive modeling. Machine learning methods are needed to distill large, heterogeneous, and fragmented data sources into useful pieces of information such as answers to search queries, purchasing patterns of customers, or likely actions of mobile users. This research focuses on predicting the behavior of mobile users -- actions they are likely to take in any particular context -- based on a collection of intermittent sensors such as GPS, wifi, accelerometer, and others. Our goal is to develop methods that will be useful more broadly.

Many crimes can happen every day in a major city, and figuring out which ones are committed by the same individual or group is an important and difficult data mining challenge. To do this, we propose a pattern detection method called Series Finder. Series Finder incorporates both the common characteristics of all patterns and the unique aspects of each specific pattern. This is joint work between MIT and the Cambridge Police Department.

Modern data services manage data by storing it on distributed servers. The data should appear to its users to be consistent, as if it were maintained on a single centralized server, even though it is actually distributed and replicated. It should be efficient to access, and available in the face of unpredictable failures and other network changes. We design and analyze algorithms to implement such data services.

The National Alliance for Medical Image Computing (NA-MIC) is a multi-institutional, interdisciplinary team of computer scientists, software engineers, and medical investigators who develop computational tools for the analysis and visualization of medical image data.

As we develop storage and compute platforms for scaling to big data and practical algorithms for efficiently processing it, we will need to create new ways to access and interact with massive scale data. A comprehensive solution to the problem of dealing with large amounts of Web and sensor data involves not only analysis strategies, but also access strategies. It is entirely possible that for a given, large dataset, there will be hundreds if not thousands of distinct types of queries that may be applied to the data.

In collaboration with New York City's Power company, Con Edison, we aim to identify components of New York's underground electrical grid that are the most vulnerable to failure. This information can be used to assist with Con Edison's pre-emptive maintenance and repair programs. We focus in particular on prediction of manhole events (fires, explosions, smoking manholes) on the low-voltage network. These events can be difficult to predict, and some of the data used for the project are over a century old. This project is the winner of the 2013 INFORMS Innovative Applications in Analytics Award.

The vast majority of machine learning, statistical, and scientific operations can be expressed via a small number of linear algebra operations. SciDB is a database system designed to support scalable linear algebra over massive arrays stored on the disks of a large cluster of machines. It is much faster than relational databases on these types of workloads, and scales to much larger datasets than main-memory matrix-oriented systems like Matlab and R.
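As an example of an ML operation reducing to linear algebra, ordinary least squares is nothing but matrix-transpose products and a small solve. The sketch below writes out the 2x2 normal equations for a line fit in plain Python; in a real workload these sums are exactly the array operations an engine like SciDB would push down to on-disk data.

```python
def fit_line(xs, ys):
    """Fit y ~ a + b*x by solving the 2x2 normal equations (X^T X) beta = X^T y."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)                 # entry of X^T X
    sxy = sum(x * y for x, y in zip(xs, ys))     # entry of X^T y
    det = n * sxx - sx * sx
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies on y = 1 + 2x
print(a, b)  # 1.0 2.0
```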

The goal of CarTel (“car telecommunications”) is to investigate how sensor equipped cars and smartphones can be used to capture information about the transportation network and urban environment in general. Example results include an interactive map of the biggest potholes in Cambridge and Boston, collected using car-mounted accelerometers, and traffic aware routing, where real-time traffic delays from cars are used to find the fastest driving routes.

Condensr is a review summarization system that processes Yelp restaurant reviews and categorizes them, breaking down reviews into comments about food, ambiance, service and value, as well as giving an overall summary of reviewer sentiment. The goal is to go beyond a simple star rating to give the overall consensus of diners about various aspects of a restaurant experience.

The goal of this project is to learn how people inside of large organizations influence each other, and to track the flow of influence throughout an organization. Relationships can be modeled as graphs, with edges indicating the degree of influence. Weights are learned from a variety of data sources, including personal communication and data gathered from sensors about face-to-face interaction.

TwitInfo extracts a series of tweets matching a keyword from Twitter and arranges them on a timeline, providing a quick summary of a collection of tweets on a topic in a simple visualization. The key idea is to identify “peaks” in the frequency of tweets that represent interesting occurrences in time (e.g., points scored in a sporting event, or a major speech by a politician).
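A minimal version of that peak-detection idea: flag time bins whose tweet count jumps well above a trailing mean. The window size and threshold factor are illustrative choices, not TwitInfo's published parameters.

```python
def find_peaks(counts, factor=2.0, window=5):
    """Return indices of bins whose count exceeds `factor` times the
    mean of the preceding `window` bins."""
    peaks = []
    for i in range(window, len(counts)):
        recent = counts[i - window:i]
        mean = sum(recent) / window
        if counts[i] > factor * max(mean, 1):
            peaks.append(i)
    return peaks

# Tweets per minute during a match; a goal is scored at minute 8.
counts = [4, 5, 3, 6, 4, 5, 4, 6, 40, 12, 6, 5]
print(find_peaks(counts))  # [8]
```

Each detected peak can then be summarized by the most frequent terms in its tweets, which is how a timeline of raw counts turns into a readable event summary.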

We are designing a prediction and decision system for real-time use during a professional car race. This project exemplifies some of the main challenges in Big Data, where careful knowledge of the domain (racing) needs to be infused within statistical modeling techniques in order to build a real-time decision system.

The goal of this project is to develop powerful algorithmic sampling techniques which allow one to estimate parameters of the data by viewing only a minuscule portion of it. Such parameters may be combinatorial, such as whether a large network has the "six degrees of separation property", algebraic, such as whether the data is well-approximated by a linear function, or even distributional, such as whether the data comes from a distribution over a large number of distinct elements.

Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks etc. Although the algorithm is very fast, one can envision further improvements in its efficiency by adapting it to specific data sets. The goal of this project is to develop tools and techniques for performing such tuning.
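One classic LSH family, sketched below for intuition: hash a vector to one sign bit per random hyperplane. Vectors at a small angle agree on most bits, so similar items can be found by bucketing on (prefixes of) the hash instead of comparing all pairs. This is the standard random-hyperplane construction for cosine similarity, shown as a toy; the project's tuning work concerns choosing parameters like the number of bits and tables per data set.

```python
import random

random.seed(0)
DIM, BITS = 8, 16
# Each random hyperplane (Gaussian normal vector) contributes one hash bit.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh(vec):
    """Sign of the dot product with each hyperplane -> a BITS-bit signature."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

base = [random.gauss(0, 1) for _ in range(DIM)]
near = [v + random.gauss(0, 0.01) for v in base]   # tiny perturbation
far = [random.gauss(0, 1) for _ in range(DIM)]     # unrelated vector

print(hamming(lsh(base), lsh(near)), hamming(lsh(base), lsh(far)))
```

The probability that two vectors disagree on a bit is proportional to the angle between them, which is what makes the signature distance a usable proxy for cosine similarity.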

The day-to-day practice of medicine is based largely on a combination of the personal experience of those making the decisions and non-patient-specific information derived by applying conventional statistical methods to large clinical trials. With the boom in the collection of clinical information in computationally accessible formats, it is now possible to use advanced machine learning and data mining techniques to put clinical decision making on a sounder, more patient-specific basis. That is the mission of CSAIL's Data-driven Medical Research Group. Current projects include risk stratification after acute coronary syndrome, prediction of impending heart failure, diagnosis of mental disease from EEG data, and understanding ways to reduce the prevalence of healthcare-associated infections.

We want to teach machines to see. If machines could recognize objects and locations, this would have a large impact on robotics, assistive care, and public safety, to name just a few areas. Presently, machine vision systems can recognize a small number of object categories, and can localize objects within an image only moderately well. The next big task in computer vision is to scale up object recognition: to reliably detect thousands of object classes. An important component of any long-term solution to the vision problem is online, unsupervised training on a massive dataset of images.

Talks will feature distinguished speakers from academia, industry, and government, including pre-eminent researchers from all the subfields of computer science that have something to say about data, data processing, and analytics, as well as people from organizations in industry and government that are consumers of Big Data.

We partner with different organizations including industry, government, non-profits and other universities.

Members provide the Initiative with invaluable funding support. As a member, organizations have unique access to Big Data research at MIT. It is also an opportunity for corporations to share and discuss their real world challenges and concerns.

Data Partners make data sets available for research at MIT -- for exploring new ideas; for testing out theories and new algorithms, systems and tools; for student projects and data challenges; and ultimately, for demonstrating the impact of Big Data with real world data.


Data Resources: MGH and The Laboratory for Quantitative Medicine. MGH/Harvard Medical School have built a number of very large databases on patients. Last June, we were tasked by the MGH Cancer Center to build a database on all of the 173,301 Massachusetts General Hospital cancer patients, 167,814 of whom were diagnosed between 1968 and 2010. The database contains 559,921 pathology reports, 575,204 discharge reports, 10,938,444 encounter notes, 304,211 operative reports, 22,009,527 procedure notes, 9,159,232 radiology reports, ~1,700,000 aggregated medical bills, and ~250,000 images. The database contains all-cause survival information from the Social Security Administration Death Master File (which provides information on all deaths of persons issued Social Security numbers since 1937), and cause-of-death information from the Massachusetts Death Certificate Database (which contains International Classification of Diseases cause-of-death information on 1,984,790 people who died in the state of Massachusetts between 1970 and 2008). The database is linked to the MGH SNAPSHOT gene sequence dataset, thus providing a great wealth of genetic data on a large number of patients.

As far as we are aware, in terms of the total mass of data, this database is the largest source of clinical information on cancer in the world.

Big Data promises a better world. A world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources. These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us -- but how will that data get used and for what purpose? Who owns the data? How do we assure accountability for misuse?

Just as Big Data lays out many promises, it lays out many questions and challenges when it comes to privacy. We must think carefully about the role of technology and how we design and engineer next generation systems to appropriately protect and manage privacy, in particular within the context of how policy and laws are developed to protect personal privacy. Decisions about how to address privacy in big data systems will impact almost everyone as we push to make more data open and available inside organizations and publicly. Governments around the world are pushing themselves and private companies to make data transparent and accessible. Some of this will be personal data. We will need new tools and technologies for analysis, for anonymizing data, for running queries over encrypted data, for auditing and tracking information, and for managing and sharing our own personal data in the future. Because issues of data privacy will be relevant across so many aspects of our life, including banking, insurance, medical, public health, government, etc, we believe it is important to collectively address major challenges managing data privacy in a big data world.

Workshop: Big Data Privacy

Exploring the Future Role of Technology in Protecting Privacy

The goal of this workshop [held in June 2013] is to bring together a select group of thought leaders, from academia, industry and government, to focus on the future of Big Data and some of the unique issues and challenges around data privacy. Our aim is to think longer term (5 years +) and better understand and help define the role of technology in protecting and managing privacy, particularly when large and diverse data sets are collected and combined. We will use the workshop to collectively articulate major challenges and begin to lay out a roadmap for future research and technology needs.

This workshop was supported by the MIT Big Data Initiative at CSAIL and by a Grant from The Alfred P. Sloan Foundation.

DEFINING “PRIVACY” IN A BIG DATA WORLD

DAVID VLADECK - GEORGETOWN UNIVERSITY LAW CENTER

Professor Vladeck is a former Director of the Bureau of Consumer Protection at the US Federal Trade Commission

How can we make sure that we harness the power of big data effectively without sacrificing personal privacy completely?

BIG DATA, SYSTEMIC RISK, AND PRIVACY-PRESERVING RISK MEASUREMENT

ANDREW LO - MIT SLOAN SCHOOL OF MANAGEMENT

In the financial industry, privacy is a tremendously hotly contested issue. The problem is that in the financial system we don't use patents to protect our intellectual property; we use trade secrecy, so we equate data privacy with profitability. This is "big data versus big dollars".

BIG DATA: NEW OIL OF THE INTERNET

ALEX (SANDY) PENTLAND - MIT MEDIA LAB

In a report, "Personal Data: The Emergence of a New Asset Class," prepared for the World Economic Forum, we proposed a "New Deal on Data" framework with a vision of ownership rights, personal data stores, and peer-to-peer contract law. The report's findings helped to shape the EU Human Rights on Data document and the US Consumer Privacy Bill of Rights. Among the proposals in the report was the notion that a combination of informed consent and contract law could allow for auditing of data about oneself. The personal data would be accompanied by metadata showing provenance, permissions, context, and ownership. At MIT, an open source version of such a scheme has been created in the form of the Open Personal Data Store (openPDS) [ref: http://openpds.media.mit.edu/].
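The idea of personal data that carries its own provenance, permissions, context, and ownership can be sketched as a simple data structure. This is a hypothetical illustration, not openPDS's actual API; the `PersonalDataRecord` class and all of its fields are invented for this sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PersonalDataRecord:
    """A personal-data value whose metadata travels with it."""
    value: object                 # the personal data itself
    owner: str                    # the individual the data is about
    provenance: str               # where the data came from
    context: str                  # purpose under which it was collected
    permissions: set = field(default_factory=set)  # consented uses
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def allows(self, use: str) -> bool:
        """Check whether a proposed use was consented to."""
        return use in self.permissions

record = PersonalDataRecord(
    value={"steps": 8200},
    owner="alice",
    provenance="phone accelerometer",
    context="fitness tracking",
    permissions={"aggregate-research"},
)
print(record.allows("aggregate-research"))  # True
print(record.allows("advertising"))         # False
```

An auditing layer could then log every call to `allows`, connecting consent metadata to the accountability ideas discussed elsewhere in these notes.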

DIFFERENTIAL PRIVACY

KOBBI NISSIM - BEN-GURION UNIVERSITY

I think there is great promise in the marriage of Big Data and differential privacy. Big data brings with it a promise for research and society, but the data often contains detailed sensitive information about individuals, making privacy a real issue. Heuristic privacy protection techniques were designed for an information regime very different from today's, and many failures of these techniques have been demonstrated over the last decade. Differential privacy provides provable guarantees for individuals while providing good utility on larger datasets.
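As a concrete illustration of the guarantee, here is a minimal sketch of the Laplace mechanism applied to a count query. The dataset and the epsilon value are made up for the example; this is textbook differential privacy, not any particular production system:

```python
import random

def laplace_noise(scale: float) -> float:
    # Laplace(0, scale) sampled as the difference of two exponentials
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(records, predicate, epsilon):
    """Release a count with Laplace(1/epsilon) noise added.
    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so this satisfies epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [34, 29, 41, 58, 23, 37, 45, 62, 31, 50]
noisy = dp_count(ages, lambda a: a > 40, epsilon=0.5)
print(round(noisy, 2))  # the true count is 5; the release is 5 plus noise
```

Smaller epsilon means stronger privacy but more noise; on large datasets the relative error of such noisy counts becomes negligible, which is why differential privacy pairs well with big data.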

NO FREE LUNCH

ASHWIN MACHANAVAJJHALA - DUKE UNIVERSITY

The issues relating to data privacy in the real world exhibit some similarities and some differences across domains. In the medical environment, there is medical data, genomic data or other research data based on private information about patients and subjects. Functional uses of the private information include finding correlations between a disease and a geographic region, or between a genome and a disease. In an advertising context, social media firms focus on the clicks and browsing habits of their users, assessing trends by region, age, gender, or other distinguishing features of the user population. Private information includes an individual's personal profile and "friends". Functional uses of the data could entail recommending certain things to certain groups of users or producing ads targeted to users based on their social networks. Different applications will have different requirements for the level of privacy needed. The big challenge in any of these applications is: "How do you really trade off privacy for utility?"

DEVELOPING ACCOUNTABLE SYSTEMS

LALANA KAGAL - MIT CSAIL

Much of the research in the area of data privacy focuses on controlling access to the data. But, as we have seen, it is possible to break these kinds of systems. Private information can in fact be inferred from anonymized data sets; examples include the re-identification of medical records, the exposure of sexual orientation on Facebook, and the breaking of the anonymity of the Netflix Prize dataset. What we are proposing is an accountability approach to privacy, for cases where security approaches alone are insufficient. The accountability approach is a supplement to, not a replacement for, upfront prevention.
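One concrete ingredient of accountability is a tamper-evident record of who accessed what, and why. The sketch below is a hypothetical illustration (not a specific CSAIL system) of hash-chaining audit entries so that after-the-fact edits to the log are detectable:

```python
import hashlib
import json

class AuditLog:
    """Minimal hash-chained audit log: each entry commits to the
    previous one, so tampering with any past record breaks the chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._last_hash = self.GENESIS

    def record(self, who: str, what: str, purpose: str) -> None:
        entry = {"who": who, "what": what, "purpose": purpose,
                 "prev": self._last_hash}
        digest = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self.entries.append(entry)
        self._last_hash = digest

    def verify(self) -> bool:
        prev = self.GENESIS
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or digest != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.record("analyst7", "patient_records:row 1042", "cohort study")
log.record("analyst7", "patient_records:row 1043", "cohort study")
print(log.verify())  # True
log.entries[0]["purpose"] = "marketing"  # tamper with a past entry
print(log.verify())  # False
```

The point of the sketch is that accountability does not prevent access; it makes misuse auditable after the fact, complementing upfront access control.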

ENCRYPTED QUERY PROCESSING

RALUCA ADA POPA - MIT CSAIL

In 2012 hackers extracted 6.5 million hashed passwords from LinkedIn's database and were able to reverse most of them. This is a problem we are all familiar with: confidentiality leaks. There are many reasons why data leaks, and for the purpose of this talk I'm going to group them into two threats. First, consider the layout of an application that has data stored in a database. The first threat is attacks on the database server: attacks in which an adversary gets full access to the database server but does not modify the data, only reads it. The second threat is more general and includes any attack, passive or active, on any part of the servers. For example, hackers today can infiltrate application systems and even obtain root access. How do we protect data confidentiality in the face of these threats?
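To make the first threat model concrete, here is a toy sketch of one ingredient used by encrypted query processing systems: deterministic keyed tokens that let the server answer equality queries without ever seeing plaintext. The key, table, and HMAC construction here are illustrative assumptions for the sketch, not the actual design of systems like CryptDB, which layer several encryption schemes:

```python
import hashlib
import hmac

KEY = b"client-side secret key"  # held by the application, never the database

def eq_token(value: str) -> str:
    """Deterministic keyed digest: equal plaintexts map to equal tokens,
    so the server can match rows without decrypting anything.
    (Deterministic schemes deliberately leak equality patterns.)"""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The "server" stores only tokens, never plaintext.
server_table = [eq_token(city) for city in ["boston", "cambridge", "boston"]]

# The client translates its query into token space before sending it.
query = eq_token("boston")
matches = [i for i, t in enumerate(server_table) if t == query]
print(matches)  # [0, 2]
```

Under the first threat (a read-only adversary at the database server), the attacker sees only tokens; recovering plaintext requires the client-side key, which never leaves the application.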

The MIT Big Data Challenge

OVERVIEW

The MIT Big Data Initiative at CSAIL is organizing competitions designed to spur innovation in how we think about and use data to address major societal issues. Big Data promises a better world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources. These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us.

Working with our partners, the MIT Big Data Challenge will define real-world challenges in different areas such as transportation, health, finance and education, and make available data sets for the competition. Our goal is to provide the MIT community with new and unique opportunities to show how data can make a difference.

The first MIT Big Data Challenge, launched November 12, 2013 in partnership with the City of Boston and co-sponsored by Transportation@MIT, focuses on transportation in downtown Boston. The challenge will make available multiple data sets, including transportation data from more than 2.3 million taxi rides, local events, social media and weather records, with the goal of predicting demand for taxis in downtown Boston and creating visualizations that provide new ways to understand public transportation patterns in the city.

The City of Boston is interested in gaining new insights into how people use all modes of transportation to travel in and around the downtown Boston area. A critical imperative of Boston's Complete Streets Policy is to move all modes of transportation more efficiently and to use real-time data to facilitate better trip planning across modes of transportation. With urban congestion on the rise, city planners are looking for ways to improve transportation, such as providing people with more options to get from one place to another (walking, biking, driving, or using public transit) and reducing and more efficiently routing vehicles in the city.

This Data Challenge provides a unique opportunity to analyze City of Boston taxi data and combine multiple data sets, including social media, transit ridership, events data and weather data, to effectively predict demand and better understand patterns in taxi ridership. We hope this will result in new insights for the City of Boston and the public that will improve transportation in the city (and your ability to get a cab when you need one)!

We now have the ability to collect and acquire digital information at an unprecedented rate across practically all aspects of our life including healthcare, financial transactions, social interactions, education, energy usage, transportation, environmental monitoring and so on. "Big Data" is about harnessing all of this digital information by combining and analyzing it in completely new ways to make better predictions and ultimately, better decisions. Over the next decade Big Data has the potential to profoundly change the way we live, work and play. Big Data also introduces unique challenges when it comes to managing and protecting personal privacy. Big Data privacy issues are complex, introducing a host of ethical, legal, policy and technical questions. How do we build on Big Data’s potential for good, while maintaining essential privacy protections? And, how do we design future technologies, policies, and practices to get that balance right for society?

Our goal with this project is to demonstrate how organizations can leverage data in the future, including how we collect, manage, and use personal information, from setting appropriate policies to demonstrating systems that can implement them in practice. In terms of integrating, analyzing, and sharing data, MIT faces challenges similar to those of many organizations across different sectors, whether in industry or government. A Big Data testbed at MIT will allow us to demonstrate how data can be used to better understand and improve our community; collectively explore ways to address technical and privacy challenges; and demonstrate new approaches and solutions emerging from the research community.

A Living Lab

What is a "Living Lab"? It is an open innovation ecosystem, operating in a specific context, integrating research and innovation processes within a public-private partnership.* In short, a testbed for research and experimentation.

We propose creating a Big Data Living Lab at MIT to allow the community to access, share, and use data about life on campus. Why? To explore issues around large-scale data integration, privacy, visualization, and performance, as well as social implications of big data.

Extracting useful information from very large data sets is challenging. In this workshop, we will focus on the challenges of applying machine learning, data mining, and statistics to massive-scale data sets.

“In the past, we've focused on scale, but over the last few years, the big new problems have been about variety,” says Sam Madden, professor of electrical engineering at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). “Big Data is no longer just about processing a huge number of bytes, but doing things with data that you couldn’t do previously. Increasingly, data is coming at you really fast, and it’s much more complex. It’s not just tabular data you can easily stick into a spreadsheet or a database.”

"Taxi data show Bostonians don't fear the cold" An MIT contest reveals a portrait of the city through cab rides

"FOR YEARS, Boston’s Department of Transportation has collected GPS data on every taxi pickup and drop-off throughout the city. It is an astonishing accumulation of raw numbers on how Bostonians get around, ripe with opportunity for analysis.

"On Monday, MIT hosted a daylong workshop on big data and privacy, co-sponsored by the White House as part of a 90-day review of data privacy policy that President Barack Obama announced in a Jan. 17 speech on U.S. intelligence gathering.

As part of President Barack Obama’s call for a review of privacy issues in the context of increased digital information, MIT and the White House Office of Science and Technology Policy will co-host a daylong workshop on “big data” and privacy March 3 at MIT.

Cambridge, Mass. – MIT will offer its first online professional course, Tackling the Challenges of Big Data, to a global audience beginning March 4, 2014. The four-week online course, aimed at technical professionals and executives, will tackle state-of-the-art topics in big data ranging from data collection, storage and processing to analytics and visualization, as well as address a range of real world applications.

Announcing a call for papers and participants for the Seventh Annual New England Database Summit (NEDB Summit '14). This will be an all-day conference-style event where participants from the research community in the New England area can come together to present ideas and discuss their research. Registration for the event is free for students and a nominal charge ($25) for non-students, and anyone is welcome to attend. Lunch, drinks and appetizers will be provided. The event will also feature keynotes from James Mickens (Microsoft

The Big Data Initiative at the MIT Computer Science and Artificial Intelligence Lab (CSAIL) today announced two new activities aimed at improving the use and management of Big Data. The first is a series of data challenges designed to spur innovation in how people use data to solve problems and make decisions. The second is a new Big Data and Privacy Working Group that will bring together leaders from industry, government and academia to address the role of technology in protecting and managing privacy.

Help us design a cool graphic for our new Big Data T-Shirt! It's your chance to participate in designing the official t-shirt for BIG DATA @ CSAIL. Submit your t-shirt design by September 30th for a chance to win the competition and have your design become part of BIG DATA!

BlinkDB is a new query processing system that allows one to run interactive SQL queries over tens of terabytes of data with response times approaching "blink-time". That was the goal when CSAIL and the UC Berkeley AMPLab joined together to collaborate on BlinkDB.
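The core idea behind such systems, answering an aggregate query from a random sample together with an error bound, can be sketched in a few lines. This is an illustrative sketch of sampling-based approximation with made-up data, not BlinkDB's actual implementation (which maintains stratified samples built offline):

```python
import math
import random

def approx_avg(population, sample_frac=0.001, z=1.96):
    """Estimate AVG over a uniform random sample and report a
    95% confidence half-width, in the spirit of sampling-based
    approximate query engines."""
    n = max(2, int(len(population) * sample_frac))
    sample = random.sample(population, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    stderr = math.sqrt(var / n)
    return mean, z * stderr

random.seed(0)
# Synthetic "taxi fare" population, invented for the example
fares = [random.lognormvariate(2.5, 0.5) for _ in range(100_000)]
est, err = approx_avg(fares, sample_frac=0.001)
print(f"avg fare ~ {est:.2f} +/- {err:.2f}")
```

Scanning 0.1% of the data gives an answer in roughly a thousandth of the time, and the reported error bound tells the analyst whether that answer is precise enough or a larger sample is needed.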

Big Data and the Law: Researchers at MIT and Harvard have applied Natural Language Processing to Supreme Court decisions to determine authorship in cases where the justices have decided not to sign the opinions.

"A U.S. Supreme Court mystery drew them together: the Harvard 3L, the engineer, the Jenner & Block associate, the Massachusetts Institute of Technology professor and a team of MIT doctoral students. The mystery: Who actually wrote the joint dissent in last year's health care blockbuster?

Recently the news has been full of stories about the potential for "big data" about customer behavior to revolutionize business, and personal data has been called the "new oil of the internet". But what about big data for the average person? For the poor?

Big Data integration -- combining multiple diverse data sets together -- is one of the key problems organizations face when working with Big Data. The challenges of integrating data are myriad: different data sets may come from different sources (internal and external); may have been created by different people; be structured in different ways; and contain different data types (e.g., text, images, maps, database tables, etc.) In addition, data sets may represent different levels of temporal or spatial granularity (e.g., real time per-transaction data vs.
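The granularity mismatch just described is often handled by rolling the finer-grained data up to the coarser level before joining. A minimal sketch, with made-up per-transaction records, of aggregating to hourly totals so the result can be joined against an hourly data set such as weather observations:

```python
from collections import Counter
from datetime import datetime

# Toy per-transaction records (timestamp, amount), invented for the example
transactions = [
    ("2013-11-12T09:05:31", 4.50),
    ("2013-11-12T09:41:02", 12.00),
    ("2013-11-12T10:15:47", 7.25),
]

# Roll the per-transaction feed up to hourly granularity
hourly = Counter()
for ts, amount in transactions:
    hour = datetime.fromisoformat(ts).replace(minute=0, second=0).isoformat()
    hourly[hour] += amount

print(dict(hourly))
# {'2013-11-12T09:00:00': 16.5, '2013-11-12T10:00:00': 7.25}
```

Choosing the join granularity is itself a design decision: aggregating discards detail, while keeping the fine-grained data forces every coarser data set to be duplicated across its rows.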

Coming up on February 1, 2013 is the Sixth Annual New England Database Summit (NEDB Summit), to be held at the MIT Stata Center. This will be an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their latest research and experiences on databases.

In May 2012, CSAIL announced a major new initiative to tackle the challenges of the burgeoning field known as “big data” -- data collections that are too big, growing too fast, or are too complex for existing information technology systems to handle. The announcement was made at an MIT event attended by Massachusetts Governor Deval Patrick, who simultaneously announced a new statewide initiative to establish Massachusetts as a hub of big data research.