Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Dr. Hasso Plattner. If you are interested in our work or want to join our team, please contact Dr. Matthias Uflacker.

Our team offers a series of lectures and seminars with a focus on enterprise systems design and in-memory data management. Strong links to industry ensure a close connection between theory and its implementation in the real world.

Our research focuses on the principles of in-memory data management on modern hardware and the integration of different hard- and software systems to meet business requirements. This involves studying the conceptual and technological aspects of modern enterprise applications as well as tools and methods for enterprise systems design.

We continually strive to translate our research into practical outputs that improve the quality of enterprise applications. A close link to industry partners ensures the relevance and impact of our work. An overview of our current and previous projects can be found here.

Innovations of In-Memory Data Management

Object Data Guides

The in-memory database improves the retrieval performance of a business object by adding some redundancy to the physical data model. This redundancy represents a join index for querying sparse tree-shaped data, called an object data guide, and includes two aspects: First, in addition to the parent instance, every node instance can contain a link to the corresponding root instance. Using this additional attribute, it is possible to retrieve all nodes in parallel instead of waiting for the information from the parent level. Second, each node type in a business object can be numbered. Then, for every root instance, a bit vector (the Object Data Guide) is stored, whose bit at position i indicates whether an instance of node type i exists for this root instance. Using this bit vector, a table only needs to be checked if the corresponding bit is set, reducing the complexity of queries to a minimum. Furthermore, the amount of data retrieved and transmitted is minimized as well. Please also see our podcast on this technology concept.
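The bit-vector idea can be sketched in a few lines. This is a minimal illustration of the concept, not HANA's implementation; the node type names are invented for the example.

```python
# Sketch of the Object Data Guide idea: node types of a business object are
# numbered, and each root instance keeps a bit vector whose bit i is set
# exactly when an instance of node type i exists for that root instance.

NODE_TYPES = ["header", "item", "schedule_line", "attachment"]  # numbered 0..3

def build_guide(existing_node_types):
    """Build the bit vector for one root instance."""
    guide = 0
    for t in existing_node_types:
        guide |= 1 << NODE_TYPES.index(t)
    return guide

def tables_to_check(guide):
    """Only node tables whose bit is set need to be queried at all."""
    return [t for i, t in enumerate(NODE_TYPES) if guide & (1 << i)]

order_guide = build_guide(["header", "item"])
```

A query for a root instance whose guide has only two bits set can skip the other node tables entirely, which is where the reduction in query complexity and transferred data comes from.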

Bulk Load

Besides transactional inserts, HANA also supports a bulk load mode. This mode is designed to insert large sets of data without the transactional overhead and thus enables significant speed-ups when setting up systems or restoring previously collected data. Furthermore, bulk loading has been applied to scenarios with extremely high insert loads, such as RFID event handling and smart grid meter reading collection, by buffering events and then bulk-inserting them in chunks. While increasing the overall insertion rate, this buffered insertion comes at the cost of a small delay in data availability for analytics, depending on the defined buffering period, which is often acceptable in business scenarios. Please also see our podcast on this technology concept.
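The buffering scheme described above can be sketched as follows; the class and its interface are illustrative assumptions, with a plain list standing in for the target table.

```python
# Sketch of buffered bulk insertion: events are collected in a buffer and
# flushed to the store in chunks, trading a small availability delay for a
# higher sustained insert rate.

class BufferedInserter:
    def __init__(self, store, chunk_size=1000):
        self.store = store            # target list standing in for a table
        self.chunk_size = chunk_size
        self.buffer = []

    def insert(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        # One bulk append instead of per-event transactional inserts.
        self.store.extend(self.buffer)
        self.buffer.clear()

table = []
inserter = BufferedInserter(table, chunk_size=3)
for event in range(7):
    inserter.insert(event)
inserter.flush()                      # make the tail of the buffer visible
```

Events become visible to analytics only after a flush, which models the availability delay mentioned above; the chunk size (or, in practice, a timer) controls that delay.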

Group-Key

A common access pattern of enterprise applications is to select a small group of records from a larger relation, e.g. all line items belonging to an order. The standard execution of such an operation scans the complete table and evaluates the selection condition for every record. Applications executing such operations frequently may suffer from degraded performance, since the complete table is scanned often although only a small group of records matches the selection condition. To speed up such queries, group-key indexes can be defined that build on the compressed dictionary. A group-key index maps a dictionary-encoded value of a column to a list of positions where this value can be found in the relation. Please also see our podcast on this technology concept.
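A minimal sketch of a group-key index on top of a dictionary-encoded column might look as follows; the data and helper names are invented for the example.

```python
# Illustrative group-key index: maps each value id of the column dictionary
# to the list of row positions holding that value, so a selection no longer
# needs to scan the whole column.

column = ["order1", "order2", "order1", "order3", "order1"]

# Dictionary encoding: value id = position of the value in the dictionary.
dictionary = sorted(set(column))
encoded = [dictionary.index(v) for v in column]

# Group-key index: value id -> list of row positions.
group_key = {}
for pos, vid in enumerate(encoded):
    group_key.setdefault(vid, []).append(pos)

def positions_of(value):
    """Return all row positions for a value via one index lookup."""
    return group_key.get(dictionary.index(value), [])
```

Selecting all line items of one order then becomes a single dictionary lookup plus a position-list fetch instead of a full column scan.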

MapReduce

MapReduce is a programming model to parallelize the processing of large amounts of data. MapReduce took the data analysis world by storm, because it dramatically reduces the development overhead of parallelizing such tasks. With MapReduce, the developer only needs to implement a map and a reduce function, while the execution engine transparently parallelizes the processing of these functions among available resources. HANA emulates the MapReduce programming model and allows the developer to define map functions as user-defined procedures. Support for the MapReduce programming model enables developers to implement specific analysis algorithms on HANA faster, without worrying about parallelization and efficient execution by HANA’s calculation engine. Please also see our podcast on this technology concept.
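The division of labor described above can be illustrated with a classic word count; the tiny driver below stands in for the execution engine and runs sequentially, whereas a real engine would parallelize the map and reduce phases.

```python
# MapReduce-style word count: the developer supplies only map and reduce
# functions; the driver handles grouping (the "shuffle" phase).

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = {}
    for record in records:                    # map phase
        for k, v in map_fn(record):
            groups.setdefault(k, []).append(v)
    # reduce phase, one call per distinct key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
```

Because the per-key reduce calls are independent, the engine is free to distribute them over available cores without any change to the user's two functions.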

Text Retrieval and Exploration

Elements of search in unstructured data, such as linguistic or fuzzy search, are finding their way into the domain of structured data, changing system interaction, e.g., by enabling the specification of analytical queries in natural language. Furthermore, for business environments, added value lies in combining search in unstructured data with analytics on structured data. Cancer databases in healthcare are an example where the combination of structured and unstructured data creates new value: (structured) patient data in the hospital database can be mapped onto (unstructured) reports from screening, surgery, and pathology via a common characteristic, e.g., cancer size and type, to learn from treatments in similar cases. Please also see our podcast on this technology concept.

Combined Row and Column Store

To support analytical and transactional workloads, two different types of database systems have evolved. On the one hand, database systems for transactional workloads store and process everyday business data in rows, i.e. the attributes of a tuple are stored side by side. On the other hand, analytical database systems aim to analyze selected attributes of huge data sets in a very short time. If the complete data of a single row needs to be accessed, storing data in a row format is advantageous. For example, when comparing the details of two customers, all database attributes of these customers, such as the inquirer's name, time, and content, need to be loaded. In contrast, columnar databases benefit from their storage format when a subset of attributes needs to be processed for all or a huge number of database entries. For example, summing up the total number of products that passed a certain reader gate involves the attributes date and business location while ignoring the product id and the business step. Using a row store for this purpose would result in processing all attributes of the event list, although only two attributes are required. A columnar store therefore benefits from accessing only relevant data and requires fewer search and skip operations. Please also see our podcast on this technology concept.
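The reader-gate example can be made concrete with a toy comparison of the two layouts; the event data and field names are invented for illustration.

```python
# Toy row vs. column layout: counting events at a gate touches every field of
# every tuple in the row layout, but only the one relevant array in the
# columnar layout.

events_rows = [  # row store: one tuple per event
    {"date": "2024-01-01", "location": "gate1", "product_id": 7, "step": "ship"},
    {"date": "2024-01-01", "location": "gate1", "product_id": 8, "step": "ship"},
    {"date": "2024-01-02", "location": "gate2", "product_id": 7, "step": "recv"},
]

# Column store: one array per attribute, derived from the same data.
events_cols = {k: [r[k] for r in events_rows] for k in events_rows[0]}

def count_at_gate_row(rows, gate):
    # Iterates over whole tuples even though only one attribute is needed.
    return sum(1 for r in rows if r["location"] == gate)

def count_at_gate_col(cols, gate):
    # Touches only the contiguous "location" array.
    return sum(1 for loc in cols["location"] if loc == gate)
```

Both functions return the same answer; the difference lies in how much data each must touch, which is exactly the advantage of the columnar layout for this access pattern.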

Minimal Projections

Typically, transactional enterprise applications follow a very simple access pattern: a lookup for a given predicate is followed by reading all satisfying tuples. Interestingly, for traditional disk-based row databases it is very easy and fast to read all attributes of the table because they are physically co-located. Since the overall processing time is quite high due to the I/O overhead, it does not matter how many attributes are projected. However, the situation changes for in-memory column store databases. Here, for each selected tuple, access to each of the projected attributes touches a different memory location, incurring a small penalty. Thus, to increase overall performance, only the minimal set of attributes needed by each query should be projected. This has two important advantages: First, it dramatically reduces the amount of accessed data that is transferred between client and server. Second, it reduces the number of accesses to random memory locations and thus increases overall performance. Please also see our podcast on this technology concept.
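A minimal sketch of such a projection over a columnar layout, with invented column names, shows why a narrow attribute list touches fewer memory regions:

```python
# In a column store each attribute is its own array, so materializing a tuple
# means one access per projected attribute. Projecting only what the query
# needs keeps the number of touched arrays minimal.

columns = {
    "name":    ["Alice", "Bob", "Carol"],
    "city":    ["Berlin", "Potsdam", "Berlin"],
    "balance": [100, 250, 75],
}

def project(columns, attributes, positions):
    """Materialize only the requested attributes for the selected rows."""
    return [{a: columns[a][p] for a in attributes} for p in positions]

# Minimal projection: one column array touched instead of all three.
narrow = project(columns, ["name"], [0, 2])
```

A `SELECT *` here would touch all three arrays per selected row; the minimal projection touches only one, which is the effect the paragraph describes.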

Any Attribute as an Index

Traditional row-oriented databases store tables as collections of tuples. To improve access to specific values within columns and to avoid scanning the entire table, that is, all columns and rows, indexes are typically created for these columns. In contrast to traditional row-oriented tables, the columnar storage of tuples in HANA allows scanning any columns corresponding to the attributes of the selection criteria to determine the matching tuples. The offsets of the matching values are used as an index to retrieve the values of the remaining attributes, avoiding the need to read data that is not required for the result set. Consequently, complex objects can be filtered and retrieved via any of their attributes. Please also see our podcast on this technology concept.
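The scan-then-fetch pattern described above can be sketched as follows; the table contents and helper name are invented for the example.

```python
# "Any attribute as an index": scan the predicate column to obtain the
# offsets of matching tuples, then use those offsets to fetch only the
# remaining attributes needed for the result set.

columns = {
    "product": ["soap", "milk", "soap", "bread"],
    "price":   [2, 1, 3, 2],
}

def select(columns, predicate_col, value, project_cols):
    # Step 1: scan one column; the matching offsets act as the index.
    offsets = [i for i, v in enumerate(columns[predicate_col]) if v == value]
    # Step 2: positional access into the other column arrays.
    return [{c: columns[c][i] for c in project_cols} for i in offsets]

result = select(columns, "product", "soap", ["price"])
```

Because any column can play the role of the predicate column, no dedicated index structure is needed to filter objects by an arbitrary attribute.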

Insert-only

Insert-only or append-only describes how data is managed when inserting new data. The principal idea of insert-only is that changes to existing data are handled by appending new tuples to the data storage. In other words, the database does not allow applications to perform updates or deletions on physically stored tuples of data. This design approach allows the introduction of a specific write-optimized data store for fast writing and a read-optimized data store for fast reading. Traditional database systems support four operations for data manipulation, i.e. inserting new data, selecting data, deleting data, and updating data. The latter two are considered destructive since the original data is no longer available after their execution. In other words, it is neither possible to detect nor to reconstruct all values for a certain attribute; only the latest value is available. Insert-only enables storing the complete history of value changes as well as the latest value for a certain attribute. This is, for instance, also a foundation of all bookkeeping systems, where it guarantees transparency. For history-based access control, insert-only builds the basis for storing the entire history of queries for access decisions. In addition, insert-only enables tracing of access decisions, which can be used to perform incident analysis. Please also see our podcast on this technology concept.
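The contrast with destructive updates can be sketched with a tiny append-only store; the class and its log format are illustrative assumptions, not the database's physical layout.

```python
# Append-only sketch: an update appends a new version instead of overwriting,
# and a delete appends a tombstone, so the full value history stays
# reconstructable from the log.

class InsertOnlyStore:
    def __init__(self):
        self.log = []  # (key, value, is_tombstone) tuples, appended in order

    def put(self, key, value):
        self.log.append((key, value, False))

    def delete(self, key):
        self.log.append((key, None, True))   # logical delete, nothing removed

    def latest(self, key):
        # The most recent log entry for the key determines its current state.
        for k, v, tombstone in reversed(self.log):
            if k == key:
                return None if tombstone else v
        return None

    def history(self, key):
        return [v for k, v, t in self.log if k == key and not t]

store = InsertOnlyStore()
store.put("price", 10)
store.put("price", 12)   # an "update" is just another append
```

Both the latest value and every prior value remain queryable, which is the property bookkeeping and history-based access control rely on.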

Multi-core and Parallelization

In contrast to hardware development until the early 2000s, today's processing power no longer scales in terms of processing speed, but in the degree of parallelism. Modern system architectures provide server boards with up to 8 separate CPUs, where each CPU has up to 12 separate cores. This tremendous amount of processing power should be exploited as much as possible to achieve the highest possible throughput for transactional and analytical applications. For modern enterprise applications, it becomes imperative to reduce the amount of sequential work and to develop the application in a way that can be easily parallelized. Parallelization can be achieved at a number of levels in the application stack of enterprise systems – from within the application running on an application server to query execution in the database system. Processing multiple queries can be handled by multi-threaded applications, i.e. the application does not stall when dealing with more than one query. Threads are a software abstraction that needs to be mapped to physically available hardware resources. A CPU core can be considered a single worker on a construction site. If it is possible to map each query to a single core, the system's response time is optimal. Query processing also involves data processing, i.e. the database needs to be queried in parallel, too. It is optimal if the database is able to distribute the workload across multiple cores of a single system. If the workload exceeds the physical capacities of a single system, multiple servers or blades need to be involved in work distribution to achieve optimal processing behavior. From the database perspective, partitioning data sets enables parallelization, since multiple cores across servers can be involved in data processing. Please also see our podcast on this technology concept.
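The pattern of fanning a data-processing task out over a pool of workers can be sketched as follows. Python threads merely illustrate the structure here; an in-memory database would schedule native threads onto physical cores.

```python
# Sketch of a partitioned scan fanned out over a worker pool, one job per
# partition, with the partial results merged at the end.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))
num_workers = 4
chunk = len(data) // num_workers
partitions = [data[i * chunk:(i + 1) * chunk] for i in range(num_workers)]

def scan(partition):
    # Per-partition work: count tuples satisfying a predicate.
    return sum(1 for x in partition if x % 7 == 0)

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    total = sum(pool.map(scan, partitions))   # merge partial counts
```

The sequential part is reduced to splitting the input and merging the partial results; everything in between runs concurrently, which is the structure the paragraph argues for.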

Active and Passive Data Store

By default, all data is stored in-memory to achieve high-speed data access. However, not all data is accessed or updated frequently and needs to reside in-memory; keeping it there increases the required amount of main memory unnecessarily. This so-called historic or passive data can be stored in a specific passive data store based on less expensive storage media, such as SSDs or hard disks, which still provides sufficient performance for occasional accesses at lower cost. The dynamic transition from active to passive data is supported by the database, based on custom rules defined per customer needs. We distinguish two categories of data: active and passive. We refer to data as active when it is accessed frequently and updates are expected (e.g., access rules). In contrast, we refer to data as passive when it is neither updated nor read frequently. Passive data is used purely for analytical and statistical purposes, or in exceptional situations where specific investigations require it. For example, tracking events of a pharmaceutical product that was sold five years ago can be considered passive data. Why is this feasible? Firstly, from the business perspective, the pharmaceutical carries a best-before date of two years after its manufacturing date, i.e. even if the product turns up now, it is no longer allowed to be sold. Secondly, the product was sold to a customer four years ago, i.e. it left the supply chain and has typically already been used within this timespan. Therefore, the probability that details about this particular pharmaceutical are queried is very low. Nonetheless, the tracking history needs to be preserved due to legal regulations, for example, to prove the path taken through the supply chain or to analyze sales numbers for building a new long-term forecast based on historical data.
Furthermore, introducing the concept of passive data reduces the amount of data that needs to be accessed in real-time and enables archiving. As a result, when data is moved to a passive data store, it no longer consumes fast main memory and frees hardware resources. Dealing with passive data stores involves the need for a memory hierarchy ranging from fast but expensive to slow but cheap. A possible storage hierarchy is given by memory registers, cache memory, main memory, flash storage, solid state disks, SAS hard disk drives, SATA hard disk drives, tapes, etc. As a result, rules for migrating data from one store to another need to be defined; we refer to these as an aging strategy or aging rules. The process of aging data, i.e. migrating it from a faster store to a slower one, is considered a background task that occurs on a regular basis, e.g. weekly or daily. Since this process involves a reorganization of the entire data set, it should be carried out during times with low data access, e.g. during nights or weekends. Please also see our podcast on this technology concept.
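A rule-based aging run as described above can be sketched in a few lines; the rule, the field names, and the two in-process lists standing in for the stores are all invented for the example.

```python
# Sketch of an aging run: tuples matching the aging rule are moved from the
# in-memory active store to the passive store in one background pass.

active = [
    {"id": 1, "sold_year": 2024},
    {"id": 2, "sold_year": 2019},
    {"id": 3, "sold_year": 2018},
]
passive = []

def age(active, passive, rule):
    """Move every tuple satisfying the aging rule to the passive store."""
    still_active = []
    for row in active:
        (passive if rule(row) else still_active).append(row)
    active[:] = still_active   # the active store shrinks in place

# Example aging rule: anything sold before 2020 is considered passive.
age(active, passive, lambda row: row["sold_year"] < 2020)
```

The rule is the customer-defined part; the surrounding mechanism is the same for any aging strategy, which is why it can run as a scheduled background task.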

Partitioning

We distinguish between two partitioning approaches, vertical and horizontal partitioning, though a combination of both is also possible. Vertical partitioning refers to rearranging individual database columns. It is achieved by splitting the columns of a database table into two or more column sets. Each of the column sets can be distributed on individual database servers. This can also be used to build up database columns with different orderings to achieve better search performance while guaranteeing high availability of data. Key to the success of vertical partitioning is a thorough understanding of the application's data access patterns. Attributes that are accessed in the same query should reside in the same partition, since locating and joining additional partitions may degrade overall performance. In contrast, horizontal partitioning addresses large database tables and how to divide them into smaller pieces of data. As a result, each piece of the database table contains a subset of the complete data within the table. Splitting data into equally sized horizontal partitions is used to support search operations and better scalability. For example, a scan of the request history results in a full table scan. Without any partitioning, a single thread needs to access all individual history entries and check the selection predicate. When using a naïve round-robin horizontal partitioning across 10 partitions, the full table scan can be performed in parallel by 10 simultaneously processing threads, reducing the response time to approximately one tenth of the single-threaded full table scan. Please also see our podcast on this technology concept.
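The round-robin example can be sketched directly; the history data is invented, and the partition scans run sequentially here, whereas the text assumes one thread per partition.

```python
# Round-robin horizontal partitioning: rows are dealt out over n partitions,
# each of which can then be scanned independently (ideally in parallel).

def round_robin_partition(rows, n):
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions

history = list(range(100))               # stand-in for the request history
parts = round_robin_partition(history, 10)

# Each partition could be scanned by its own thread; here we just merge the
# per-partition predicate counts to show the result is unchanged.
matches = sum(sum(1 for r in p if r % 25 == 0) for p in parts)
```

Because round-robin distribution keeps the partitions equally sized, the per-thread work is balanced, which is what makes the near-linear speed-up in the example plausible.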

Lightweight Compression

Compression is the process of reducing the amount of storage needed to represent a certain set of information. Typically, a compression algorithm exploits redundancy in the available information to increase the efficiency of memory consumption. Compression algorithms differ in the amount of time required to compress and decompress data and in the achieved compression rate, defined as the reduction in memory usage. Complex compression algorithms typically sort and perform complex analyses of the input data to achieve the highest possible compression rate, at the cost of increased run-time. For in-memory databases, compression is applied to reduce the amount of data transferred between main memory and CPU, as well as to reduce overall main memory consumption. However, the more complex the compression algorithm, the more CPU cycles it takes to decompress the data during query execution. As a result, in-memory databases choose a trade-off between compression ratio and performance using so-called lightweight compression algorithms. An example of a lightweight compression algorithm is dictionary compression. With dictionary compression, all value occurrences are replaced by a fixed-length encoded value. This algorithm has two major advantages for in-memory databases: First, it reduces the amount of required storage, and second, it allows predicate evaluation to be performed directly on the compressed data, thereby reducing the amount of data transferred from memory to the CPU. As a result, query execution becomes even faster with in-memory databases. Please also see our podcast on this technology concept.
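Dictionary compression and predicate evaluation on the compressed values can be sketched as follows; the column contents are invented for the example.

```python
# Dictionary compression sketch: values are replaced by compact integer ids,
# and a predicate is evaluated directly on the ids. Only one dictionary
# lookup per query is needed instead of one string comparison per row.

column = ["red", "blue", "red", "green", "blue", "red"]

dictionary = sorted(set(column))                   # ["blue", "green", "red"]
encoded = [dictionary.index(v) for v in column]    # the compressed column

def count_equal(encoded, dictionary, value):
    vid = dictionary.index(value)                  # encode the predicate once
    return sum(1 for e in encoded if e == vid)     # compare integers, not strings
```

The integer comparisons in the scan loop are exactly the "predicate evaluation directly on the compressed data" mentioned above; the original strings are never touched.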

Dynamic Multithreading within Nodes

Parallel execution is key to achieving sub-second response times for queries processing large sets of data. The independence of tuples within columns enables easy partitioning and therefore supports parallel processing. We leverage this fact by partitioning database tasks on large data sets into as many jobs as there are threads available on a given node. This way, maximal utilization of any supported hardware can be achieved. Please also see our podcast on this technology concept.

Analytics on Historical Data

All enterprise data has a lifespan: depending on the application, a datum might be expected to be changed or updated in the future in many cases, in few cases, or never. In financial accounting, for example, all data that is not from the current year plus all open items from the previous year can be considered 'historic data', since they may no longer be changed. In HANA, historical data is instantly available for analytical processing from solid state disk (SSD) drives. Only active data is required to reside in-memory permanently. Please also see our podcast on this technology concept.

SQL Interface on Columns and Rows

Business operations have very diverse access patterns. They include read-mostly queries of analytical applications and write-intensive transactions of daily business. Further, all variants of data selects are present, including point selects (e.g., details of a specific product) and range selects retrieving sets of data (e.g., from a specified period, like a sales overview per region for a specific product over the last month). Column- and row-oriented storage in HANA provides the foundation to store data according to its frequent usage patterns in a column- or row-oriented manner to achieve optimal performance. Through the use of SQL, which supports column- as well as row-oriented storage, the applications on top remain oblivious to the choice of storage layout. Please also see our podcast on this technology concept.

No Aggregate Tables

A very important part of the HANA philosophy is that all data should be stored at the highest possible level of granularity (i.e., the level of greatest detail). This is in contrast to the prevailing philosophy in most enterprise data centers, which says that data should be stored at whatever level of granularity the application requires to ensure maximum performance. Unfortunately, multiple applications use the same information and require different levels of detail, which results in high redundancy and software complexity around managing the consistency between multiple aggregate tables and the source data. Given the incredible aggregation speed provided by HANA, all aggregates required by any application can now be computed from the source data on the fly, providing the same or better performance than before and dramatically decreasing code complexity, which makes system maintenance a lot easier. Please also see our podcast on this technology concept.
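The contrast with a materialized aggregate table can be sketched as follows; the sales records and helper name are invented for the example.

```python
# Sketch of on-the-fly aggregation: instead of maintaining a materialized
# totals table, each query aggregates the base records at request time.

sales = [
    {"region": "EMEA", "amount": 100},
    {"region": "APJ",  "amount": 40},
    {"region": "EMEA", "amount": 60},
]

def totals_by(records, key, measure):
    """Compute an aggregate directly from the finest-grained records."""
    out = {}
    for r in records:
        out[r[key]] = out.get(r[key], 0) + r[measure]
    return out

# No stale aggregate table to maintain: a new insert is simply visible to
# the next aggregation query.
sales.append({"region": "APJ", "amount": 10})
totals = totals_by(sales, "region", "amount")
```

All the consistency logic that would normally synchronize an aggregate table with its source rows simply disappears; the only copy of the data is the detailed one.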

Single and Multi-Tenancy

To achieve the highest level of operational efficiency, the data of multiple customers can be consolidated onto a single HANA server. Such consolidation is key when HANA is provisioned in an on-demand setting, a service which SAP plans to provide in the future. As a benefit of this consolidation, multi-tenancy makes HANA accessible to smaller customers at lower cost. Already today, HANA is equipped with the technology to enable such consolidation while ensuring that customers sharing a server do not contend for critical resources, and that their data is stored reliably and with high availability at the hosting site.

Reduction of Layers

In application development, layers refer to levels of abstraction. Each application layer encapsulates specific logic and offers certain functionality. Although abstraction helps to reduce complexity, it also introduces obstacles. The latter result from various aspects, e.g. a) functionality is hidden within a layer and b) each layer offers a variety of functionality while only a small subset is in use. From the data's perspective, materialized layers are problematic since data is marshaled and unmarshaled for transformation into the layer-specific format. As a result, identical data is kept redundantly in various layers. To avoid redundant data, logical layers describing the transformations are executed at runtime, thus increasing the efficient use of hardware resources by removing all materialized data maintenance tasks. Moving formalizable application logic to the data it operates on results in a smaller application stack and increases maintainability through code reduction. Furthermore, removing redundant data storage increases the auditability of data access. Please also see our podcast on this technology concept.

On-the-fly Extensibility

The possibility of adding new columns to existing database tables dramatically simplifies a wide range of customization projects that customers of enterprise software are often required to do. When physically storing consecutive tuples in row-store pages, all pages belonging to a database table must be re-organized when adding a new column to the table. In a column store database, such as HANA, all columns are stored in physical separation from one another. This allows for a simple implementation of column extensibility, which does not need to update any other existing columns of the table. This reduces a schema change to a pure metadata operation, allowing for flexible and real-time schema extensions. Please also see our podcast on this technology concept.
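The difference can be sketched with a toy columnar table; the class and its fields are illustrative assumptions, not HANA's catalog structures.

```python
# Sketch: in a columnar layout, adding a column means appending one new
# array plus a catalog entry; the existing column arrays are untouched.

class ColumnTable:
    def __init__(self, columns):
        self.columns = columns                        # name -> value array
        self.num_rows = len(next(iter(columns.values())))

    def add_column(self, name, default=None):
        # No rewrite of existing columns, unlike a row-store page layout
        # where every page of the table would need reorganization.
        self.columns[name] = [default] * self.num_rows

t = ColumnTable({"id": [1, 2, 3], "name": ["a", "b", "c"]})
t.add_column("discount", default=0)
```

In a real column store the new column can even stay entirely virtual (a default value recorded in metadata) until the first non-default value is written, making the schema change a pure metadata operation.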

No Disk

For a long time, the amount of main memory available on large server systems was not sufficient to hold the complete transactional data set of large enterprise applications. Today, the situation has changed. Modern servers provide up to multiple terabytes of main memory and allow keeping the complete transactional data in memory. This eliminates multiple I/O layers and simplifies database design, allowing for high throughput of transactional and analytical queries. Please also see our podcast on this technology concept.

We are proud to announce "A Course in In-Memory Data Management" by Prof. Dr. h.c. Hasso Plattner. This book is the culmination of six years of in-memory research. As such, it provides the technical foundation for combined transactional and analytical workloads inside one single database, as well as examples of new applications that are now possible given the availability of this new technology. The book is available from Springer.