Penn's Data Warehouse

The University's Data Warehouse makes Penn's institutional data available to decision makers for query, analysis, and reporting.The Warehouse stores selected data from Penn's business systems, and is organized in subject areas listed below. The data is filtered, made consistent, and aggregated in various ways. Access to the Warehouse does not require formal programming skills; desktop tools such as BusinessObjects support analysis and reporting for authorized University users.

The Warehouse is organized by subject areas, into "data collections"; you can access information about a collection and general information about Penn's Data Warehouse from each data collection's home page. Data collection information includes the collection's refresh schedule, data diagrams, table and data element documentation, data training, and data security.

The Warehouse contains data from many source systems, organized by logical groupings, all in one Oracle database. Information from a source system (often a transactional system, but could also be another set of tables) is referred to as a data collection. For example, information from Pennant and SRS (Student Records System) is referred to as the Student Data Collection, while account balance information from the BEN Financials General Ledger is referred to as the General Ledger Data Collection.

Combined Universes

The co-location of data from these disparate sources allows for querying across collections, joining on common elements to retrieve results that one couldn't otherwise achieve from a single, isolated collection. Users writing their own SQL may join any tables to which they have access, after consulting the relevant collection documentation. Business Objects users can take advantage of universes that combine collections.

Additional details about the Warehouse and how it works:

The University's Data Warehouse is an Oracle 12.1.0.2.5 (12c) relational database management system (RDBMS). The database is over 2 TB in size and resides on a Dell PowerEdge R530 server with 64 GB of memory and 32 CPUs, running under Red Hat Linux 6.7 (64-bit).

Transaction systems handle the day-to-day functions of a business--recording, processing, and storing operational data. Examples of transaction systems at Penn include the Pennant, BEN Financials General Ledger, PennWorks, PennERA. These systems handle operations such as enrolling a student in a course, changing a student's major, creating journal entries, managing sponsored programs. Transaction systems, however were not designed to provide managers with the information they need to conduct trend analyses or to undertake longitudinal studies that require use of historical University data. Data Warehouses, unlike transaction systems, are specifically designed to handle these types of management queries and analyses. Data in the Warehouse is for queries only. That is, you cannot add, change, or delete data in the Warehouse. These activities are done only in the source transaction systems.

To build the Warehouse, data from the transaction systems was analyzed (each piece of data evaluated for inclusion into the Warehouse), "cleaned" (old codes converted into new codes, existing data redefined), and restructured (entities broken apart and new entities created). The data was then summarized and arranged in a format to support analysis and reporting. In some instances, additional data elements were calculated to support a particular analysis.

The Warehouse is refreshed periodically, using the transaction systems as its source of data. That is, programs extract data from a transaction system and translate it into Warehouse format. The formatted files are than transferred and loaded into the Warehouse. The data is than available for analyses and reports.

Note that each data collection has its own refresh cycle governing when its data is periodically updated in the Warehouse. For example, one data collection may be refreshed daily, while others may be refreshed hourly or once a term. Thus, data in the Warehouse reflects the most recent refresh cycles for the data collections. Recipients of reports from the Data Warehouse need to be aware of this "delay" in Warehouse data. Refresh schedule information is noted in the documentation for each collection.

The Warehouse contains data from transaction systems. The information from each transaction system is referred to as a data collection. For example, information from Pennant and SRS (Student Records System) is referred to as the Student Data Collection, while account balance information from the BEN Financials General Ledger is referred to as the Genera Ledger Data Collection.

Note that depending on your authorization level, you may or may not have access to a specific data collection. If you have questions about a specific data collection or want access to a collection, contact Enterprise Information & Analytics.

Tables

Each data collection in the Warehouse is organized in a set of related tables. Each table consists of data elements that describe or qualify an item of business significance. For example, the Student Data Collection has an ADDRESS table with data elements such as Penn ID, street address, etc. The collection also has a PERSON table with data elements such as Penn ID, name, birth date, etc. The ADDRESS table is related to the PERSON table through the Penn ID. Because the data is stored in tables, it is easy to access just the data you need, rather than having to plow through all the data in the data collection. A table may be a part of more than one data collection.

Note that some data elements in tables are indexed (indexed columns are noted in the documentation for each table). Indexing enables the system to execute queries faster. A query with a record selection condition using an indexed data element tells the system to go directly to the rows in the table that contain the value indicated and to stop retrieving data when the value is no longer found. If a query does not select records based on an indexed data element in its record selection condition, the system starts searching at the first row in the table and works through every row until it reaches the last row in the table. Tables can contain hundreds of thousands or even millions of rows (for example, one table contains 93,000 students, about 250,000 addresses, and approximately 700,000 enrollments). Thus, queries that do not use indexed data elements for record selection will run slowly.

Table Help. Help documentation is available for each table in the Warehouse, accessible via a hyperlink from the table name. Help describes the basic contents of the table. If applicable, it also gives the following information:

Explanation. Describes the physical makeup or content of the table.

Common uses. Describes some queries that would make use of the table.

Primary key. Lists the data elements that are the primary keys in the table.

Indexed data elements. Lists the data elements that are indexed in the table. Since tables can consist of many rows, queries that include record selection conditions based on indexed data elements provide faster results.

Related tables. Identifies other tables that may be meaningful to your query. That is, tables that are good candidates for containing information that you may want to include in your results. For example, if you are using the Enrollment table to list students in a specific course section, you may want to use the Person table to get the students' names.

Cautions. Provides additional guidance, help, or explanation about a table. It can also include recommendations that must be followed to prevent poor query results.

Data Elements

The smallest unit of data that you can work with is called a data element. A data element cannot be logically divided any further without losing its meaning or context. Zip code, last name, and SSN are examples of data elements that cannot be logically divided any further without becoming meaningless. In contrast, student and address are not data elements because they can be logically divided into more units of data.

Data Element Help

Help documentation is available for each data element in the Warehouse. Help describes the data element and includes its indexed, format, and not null values. If applicable, it also provides a list of valid values for the data element. Values for data elements can be listed in alphabetical order or in the order most frequently used.

The primary datatypes used in the Date Warehouse are CHAR (character), DATE and NUMBER. Element formats are indicated by the datatype, length and, for NUMBER types, precision and scale. The format of Name column in the EMPLOYEE_GENERAL table, for example, is listed as CHAR(30), meaning that the column is of character datatype, and holds a maximum of 30 characters. Numeric datatypes (such as Payment_Amount in the EMPLOYEE_PAYMENT table) have a specified precision and scale. Precision is the total maximum length of the column, while scale represents the number of places to the right of the decimal. For example, the format for EMPLOYEE_PAYMENT.Payment_Amount is represented by NUMBER(9,2), meaning that the column is of numeric datatype, with a total of 9 characters of which 2 are to the right of the decimal point; thus, the maximum value is 9999999.99.

Summary Data

One major advantage of the Data Warehouse is that it contains data at different levels of summarization. For example, you could retrieve data as individual transactions or as summaries by week, by month, or by year. Note that additional levels of summarization can be added to the Warehouse as needed and as resources allow without impacting existing data.

Summary data are different for every data collection in the Warehouse. For example, the Student Data Collection includes a student detail level which consists of the basic SRS tables, and the student census level which is a snapshot of student term activity taken at the census date (one week after the end of the drop period). The student detail level changes daily. The student census level, once loaded for a given term, remains static and never changes for that term. The advantage of the student census level is that you can run hundreds of queries on a hundred different days, and run them ten years from now and still have the same numbers for comparison. The census level is the data Penn uses for official enrollment statistics, for example, for providing data to the state and federal governments. In the General Ledger Data Collection, the SUMMARY_BALANCES table contains budget, encumbrance, and actual balances for summary-level Accounting Flexfields by accounting period. Balances are available for the month, the fiscal year-to-date, and the project year-to-date.

If you need to know what summary data are available for a specific data collection, refer to the documentation for that data collection.