This class will survey techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, Hadoop, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets. The goal of the class is to gain working experience along with in-depth discussions of the topics covered. Students should have a background in database or distributed systems (6.814/6.830 or 6.824 or permission of instructor). There will be a semester-long project and paper, and hands-on labs using real systems. There will be no exams, and grading will be based largely on class participation and completion of assignments.

Grading:

Grading will be based on class participation, successful completion of labs, and a final project.

For homeworks, you are allowed 5 penalty-free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late. You do not need to explicitly declare the use of late days; we will assign them to you in a way that is optimal for your grade when different assignments are worth different numbers of points. Late days may not be used for the final project.
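For concreteness, here is a small sketch of how the penalty arithmetic works; the helper function and its inputs are illustrative, not part of any official grading script:

```python
# Illustrative sketch of the late policy: free late days absorb lateness
# first, then each remaining late day costs 20% of the assignment's points.
def late_score(raw_points, days_late, free_days_remaining):
    """Score after free late days, then a 20% penalty per extra late day."""
    penalized_days = max(0, days_late - free_days_remaining)
    return raw_points * max(0.0, 1 - 0.20 * penalized_days)

print(late_score(100, 2, 5))  # covered by free late days -> 100.0
print(late_score(100, 7, 5))  # 2 penalized days -> 60.0
```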

If you are not familiar with MapReduce you should read:
the original MapReduce paper
(We will not cover this paper in detail so you do not need to read it again if you understand the basic MapReduce model.)
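If you just need a refresher, the basic model can be sketched in a few lines of plain Python (no Hadoop involved; the function names below are illustrative, not any real framework's API):

```python
# Word count in the basic MapReduce model: map emits (key, value) pairs,
# a shuffle groups pairs by key, and reduce combines each key's values.
from itertools import groupby
from operator import itemgetter

def map_fn(doc):                      # map: emit (word, 1) per word
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):          # reduce: sum the counts for one word
    return (word, sum(counts))

def map_reduce(docs, map_fn, reduce_fn):
    pairs = sorted(kv for doc in docs for kv in map_fn(doc))   # map + shuffle
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["big data", "big systems"], map_fn, reduce_fn))
# [('big', 2), ('data', 1), ('systems', 1)]
```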

Frequently, your dataset is so large that asking a question of it in real-time isn’t possible. Cubes and materialized views allow you to precompute the answers to questions so that you have them when you need them later.
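As a toy illustration of the idea (not any particular system), you can precompute an aggregate into a summary table with SQLite from the Python standard library; the table and column names here are made up:

```python
# Precompute one "cube cell" (totals by region) into a summary table so a
# later question is a cheap lookup rather than a scan over the raw data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", 2013, 10), ("east", 2013, 5), ("west", 2014, 7)])

# Precompute once (a hand-maintained materialized view)...
conn.execute("""CREATE TABLE sales_by_region AS
                SELECT region, SUM(amount) AS total
                FROM sales GROUP BY region""")

# ...then answer later questions directly from the precomputed table.
print(conn.execute(
    "SELECT total FROM sales_by_region WHERE region='east'").fetchone())
# (15,)
```

A real system must also keep the precomputed answers up to date as the base data changes, which is where most of the interesting work lies.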

What are the challenges of running ML algorithms on Big Data? What frameworks have been proposed, and how are they better than existing systems (e.g., relational DBs + UDFs/UDAs)?
In particular, iterative algorithms require several passes over your data. Tools like MapReduce are less useful here because neither their programming model nor their execution model is designed around iteration.
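To make the point concrete, here is a sketch of 1-D k-means in plain Python: every refinement round is a full pass over the data, so a framework that rereads its input from scratch on each pass pays the full scan cost every iteration.

```python
# 1-D k-means: each round reassigns every point to its nearest center and
# recomputes the centers, i.e., one complete pass over the data per round.
def kmeans_1d(points, centers, rounds=10):
    for _ in range(rounds):                      # each round = one full pass
        clusters = {c: [] for c in centers}
        for p in points:                         # assign to nearest center
            clusters[min(centers, key=lambda c: abs(c - p))].append(p)
        centers = [sum(ps) / len(ps) if ps else c    # move centers to means
                   for c, ps in clusters.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0]))
```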

Database techniques may be able to further speed up machine learning by embedding the algorithms within the DBMS, declaratively specifying the algorithms and their parameters and letting the DBMS optimize execution, or implementing the algorithms on map-reduce systems.
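SQLite's create_aggregate hook (a real API, available through Python's standard sqlite3 module) gives a miniature version of the UDA approach: below we register a tiny user-defined aggregate (a mean, standing in for a fuller learning algorithm) and invoke it from plain SQL, so the data never leaves the database.

```python
# A user-defined aggregate (UDA) in SQLite: the engine calls step() once
# per row and finalize() at the end, keeping the computation inside the DB.
import sqlite3

class Mean:
    def __init__(self):
        self.total, self.n = 0.0, 0
    def step(self, value):          # called by SQLite for each input row
        self.total += value
        self.n += 1
    def finalize(self):             # called once all rows are consumed
        return self.total / self.n if self.n else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("mymean", 1, Mean)
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(2.0,), (4.0,), (6.0,)])
print(conn.execute("SELECT mymean(x) FROM t").fetchone()[0])  # 4.0
```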

Visualizations map summary statistics of your data into the visual domain. We will have a crash course on typical summary statistics and a comparison of declarative vs. painting-based models of visualization.
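As a taste of that crash course, a few of the summary statistics a visualization might encode, computed with Python's standard statistics module:

```python
# Typical summary statistics a chart might map to visual marks.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(data))     # center of the data
print(statistics.median(data))   # robust center, less sensitive to outliers
print(statistics.stdev(data))    # sample spread around the mean
```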

More and more sensitive data is being collected and processed by analysts. How do we protect user data from privacy breaches? We'll study several adversarial models and talk about a few solutions explored by the research community.
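One widely studied solution from that literature is the Laplace mechanism of differential privacy. A minimal sketch, assuming a count query with sensitivity 1 and using only the standard library:

```python
# Laplace mechanism: add noise with scale sensitivity/epsilon to a query
# answer, so any one user's presence barely shifts the output distribution.
import random

def noisy_count(true_count, epsilon, sensitivity=1.0):
    scale = sensitivity / epsilon
    # Laplace(0, scale) noise as the difference of two exponential draws
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

random.seed(0)
print(noisy_count(1000, epsilon=0.1))   # roughly 1000, off by tens
```

Smaller epsilon means stronger privacy but noisier answers, a trade-off we'll return to in the readings.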

For the next three weeks, class will be used to work on final projects. Course staff will be on hand to meet and discuss with groups. Brief progress reports are due each Tuesday. Progress reports should consist of two bulleted lists, "Progress so Far" and "Goals For Next Week"; they don't need to be long or detailed, and should be submitted via Stellar.