Monday, 7 October 2013

Paper summary: NADEEF

Extracting valuable information from data depends on the quality of that data. However, there is no commodity platform, analogous to a DBMS, for fixing application-specific data quality problems with respect to a set of heterogeneous and ad-hoc quality constraints. The authors present NADEEF, an extensible, end-to-end commodity data cleaning system consisting of a programming interface (which lets users define rules, i.e., what is wrong via the vio() method and, optionally, how to fix it via the fix() method) and a core (which applies the defined rules to detect and repair data errors).
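To make the interface concrete, here is a minimal sketch of what such a rule might look like. The vio()/fix() method names come from the summary above; everything else (the Rule base class, the tuple representation, and the example zip-determines-city rule) is my own illustrative assumption, not NADEEF's actual API.

```python
# Hypothetical sketch of a NADEEF-style rule interface.
# Only the vio()/fix() names come from the paper summary; the rest is assumed.
from abc import ABC, abstractmethod
from collections import Counter

class Rule(ABC):
    """A data quality rule: vio() says what is wrong, fix() how to repair it."""

    @abstractmethod
    def vio(self, tuples):
        """Return groups of tuples that jointly violate the rule."""

    @abstractmethod
    def fix(self, violation):
        """Return candidate cell repairs (tuple, attribute, new value) for one violation."""

class ZipCityFD(Rule):
    """Illustrative functional dependency: zip -> city."""

    def vio(self, tuples):
        violations = []
        by_zip = {}
        for t in tuples:
            by_zip.setdefault(t["zip"], []).append(t)
        for group in by_zip.values():
            if len({t["city"] for t in group}) > 1:  # same zip, different cities
                violations.append(group)
        return violations

    def fix(self, violation):
        # Suggest repairing minority cities to the group's majority value.
        majority = Counter(t["city"] for t in violation).most_common(1)[0][0]
        return [(t, "city", majority) for t in violation if t["city"] != majority]
```

A user would only write the few lines inside vio() and fix(); the (assumed) core would handle scheduling detection and applying the suggested repairs.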

NADEEF’s architecture consists of (i) the rule collector, (ii) the core component and (iii) the metadata management and data quality dashboard. The rule collector gathers user-defined data quality rules. The core component consists of the rule compiler module (which compiles and manages the user-defined rules), the violation detection module (which computes the set of data errors, optimized via partitioning and compression) and the data repairing module (which takes a set of data errors and computes a set of data repairs; two algorithms were explored, one based on a variable-weighted MAX-SAT solver and one based on equivalence classes). The metadata manager tracks data changes and supports efficient metadata operations, while the data quality dashboard provides an interface for user feedback and system information (not discussed in detail).
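The equivalence-class idea can be sketched briefly: cells that the rules require to be equal are merged into one class, and each class is then assigned a single value. This is my own minimal sketch under the assumption that the lowest-cost assignment is the class's most frequent value; the paper's actual cost model and data structures may differ.

```python
# Minimal sketch of equivalence-class-based repair (assumed cost model:
# pick each class's most frequent value, repairing all minority cells).
from collections import Counter

class EquivalenceClasses:
    """Union-find over cell ids."""

    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def repair(cells, must_equal):
    """cells: {cell_id: value}; must_equal: pairs of cell ids the rules force equal.

    Returns {cell_id: new_value} for every cell that must change."""
    ec = EquivalenceClasses()
    for a, b in must_equal:
        ec.union(a, b)
    groups = {}
    for cell in cells:
        groups.setdefault(ec.find(cell), []).append(cell)
    repairs = {}
    for members in groups.values():
        target = Counter(cells[c] for c in members).most_common(1)[0][0]
        for c in members:
            if cells[c] != target:
                repairs[c] = target  # change minority cells to the class value
    return repairs
```

The appeal of this scheme is that repairs from many rules compose: each rule only contributes "these cells must be equal" constraints, and one final pass resolves them all.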

NADEEF was comprehensively evaluated on real-life datasets along four dimensions: generality, extensibility, effectiveness and efficiency. The authors found the rule-definition interface general enough that users only had to write a few lines of relevant code, rather than an entire system, to express their rules. The two data repair algorithms demonstrate extensibility, since users can also supply their own implementations of the GetFix method. The equivalence-class-based repair algorithm was found to be sensitive to certain types of noise, and executing a sequence of rules in a different order can change the results. The system was also found to be efficient for data of reasonable size.
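The order sensitivity is easy to see with a toy example of my own (not from the paper): two simple repair functions standing in for rules produce different final values depending on which runs first, because the second rule only matches what the first one left behind.

```python
# Toy illustration (not from the paper) of why rule execution order matters.

def normalize_case(value):
    """Rule 1: canonicalize casing."""
    return value.upper()

def expand_abbrev(value):
    """Rule 2: expand known abbreviations (matches uppercase keys only)."""
    return {"NYC": "NEW YORK CITY"}.get(value, value)

value = "nyc"
order1 = expand_abbrev(normalize_case(value))  # normalize first: "nyc" -> "NYC" -> "NEW YORK CITY"
order2 = normalize_case(expand_abbrev(value))  # expand first: "nyc" unmatched -> "NYC"
```

Here order1 ends as "NEW YORK CITY" while order2 ends as "NYC": the same rules, applied to the same input, disagree purely because of ordering.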