Data Warehouse

Mapping Individual database to PRML schema

IUSE project aims at unifying peer review data across different systems to enable data analysis across wider users and different set of peer review approaches. In the consortium, we have 4 systems, whose data is to be unified (Expertiza, Crowd Grader, Mobius SLIP, CritViz). In addition, data from 2 other bigger systems, Perceptive and CPR, where the the project advisors should also be integrated to some extent.

We choose to use an existing open source ETL (Extract, Transform, Load) tool, called Pentaho, for extracting data from the database of individual system, transform this data according to PRML schema and store the result to a centralized PRML DB. Figure 1 shows the architecture how we plan to extract data from expertiza database, then we transform the data according to the PRML schema. However, since PRML is normalized in 3NF, it might not be easy to extract some information for data analysis purposes. Thus, we plan to denormalize the PRML data and store them in a data warehouse.

Figure 1. Architecture of the ETL transformation.

Initial Step

Figure 2. ETL illustration

The Pentaho Kettle tool extracts data from the expertise schema, performs transformation on it and load the data in the PRML format. The tables are mapped as shown in the schema map above. The operations performed by the ETL tool:

Extraction The data is extracted from the expertise database and brought into the staging area by this operation of the ETL tool. It is the staging area in which all the transformation are done.

Transformation Many operations are performed on the data in the staging area so that the expertise schema can be converted to the PRML schema. This is called transformation of the data. An example of transformation is while mapping of participants table in the expertiza database to actors table in the PRML database. Each value of the id field is incremented by 1000 while inserting in the participants table.

Loading:The transformed data when moved from the staging area to the PRML database is called as loading. There are three types of loading implemented on the tables depending on the size of the table – initial load,partial loading and full loading. Partial Loading: Only the new or updated entries in the expertiza table are transformed and loaded in the corresponding PRML table. This is generally done for table with very large number of entries. Eg: Reviews

Partial loading is implemented with the help of timestamps and checksums.

Timestamp: One way to perform partial loading is to have a ‘modified’ time stamp in the source table. The destination table should mirror this ‘modified’ time stamp. Then while loading the destination table we extract only those records from the source table which has a ‘modified’ time stamp value greater than the latest ‘modified’ time stamp value in the destination table. The advantage is that we need not fetch all records from the source table. Only those records which needs to be updated in the destination table will be retrieved. (refer load_prml_ModifiedTimeStampTable transformation)

Checksum: Another way to perform partial loading is using checksum. A checksum is computed for each record, on selected field values, of the source table and added to the destination table while initial loading. Loading which follow would again compute the checksum for source table records and if there is no matching checksum in the destination table an update or an insert would be performed. (refer load_prml_checksumtable transformation)

Full Loading: All the entries from the expertiza table are transformed and loaded in the corresponding PRML table. This is generally done for tables with less number of entries. Eg: Courses

Pentaho

Figure 3. Pentaho Spoon to extract data from expertiza and store them into PRML DB.
Expertiza tables which have been implemented

Pentaho provides a GUI to define the ETL process. as shown in figure 3, first we define the DB connections to Expertiza database and PRML database, then we use a table input component to query the necessary data for each PRML table. Figure 3 also shows an example of the SQL query that extract course data from expertiza and map each column to the corresponding PRML column. we then use Insert/Update component to insert the queried data into the corresponding PRML table. since the mapping has done in the initial query, we only need to map the ID of the queried record. A similar step is defined for each PRML table.

Data Warehouse Schema

Figure 4. PeerLogic Data Warehouse Schema

PRML was used to define a data warehouse model that can be used to share data from different peer-review systems. We designed the schema based on dimensional modeling approach[] Dimensional modeling requires the data that contains measurements, metrics, or facts of the business process to be stored as Fact tables. The Fact tables also contains foreign keys to the dimension tables that can be used to group the facts into multidimensional arrays of data, known as OLAP cube or hypercube. Dimensional modeling encourages data warehouse schema to follow a star topology, in which fact tables are placed in the center.

Following this approach, our schema is centered around the Critique table. It contains reviewer’s qualitative and quantitative feedback. The quantitative feedback can be expressed in rating, ranking or the combination of both. As depicted in Figure 4, the Critiques can be sliced based on different dimensions including the Criterion, Eval_Mode, Task, Actor, and Course_Setting. The criterion table contains criteria questions, the scale used to rank or rate the work, and the weighting that is used to calculate the final score. The eval_mode determines whether ranking, rating or both are used to evaluate the artifact. The Task table contains information such as when the task starts and ends, the CIP (Classification of Instructional Programs) codes, whether it is an assignment, reviewing, or meta-reviewing task. The actor table contains the actors involved in the assignment and their roles, whether it is student, instructor, or administrator. The actor table is linked to the participant table in the actor_participant table to maintain the group memberships of each participant.

The Artifact table contains information about the student’s work in response to the assignments, which is usually a url to uploaded files or web pages. Course_setting table contains meta information about how the peer review was conducted. For instance, anonymity could be none, single blind, double blind, partial (critiques and artifacts are public after the review process is done). The workflow, could be single loop, double loop, or n rounds. The rubric mode, could be holistic or specific, and assignment style could be assignment based, where a set of activities are fixed or task based where activities may varies from assignment to assignment. The Course_setting table is used to compare the effect of different features adopted by peer review systems to the learning gains as well as the quality of the peer review process itself.

Accessing the data

If you are familiar with SQL, you can query the data on our MySQL database directly. You’ll need a mysql client such as MySQL Workbench (http://dev.mysql.com/downloads/workbench/). You can reach our server using the following read-only credential:

Alternatively, if you are not familiar with SQL, you can also use REST API to query the data. please be aware that our REST API currently is only able to retrieve a limited set of data from the data warehouse, as documented here. But we’d love to hear your feedback, and if you need to retrieve some data that is currently not supported, please do let us know. We’ll try to have it implemented within 1-2 days.