Caché’s Multidimensional Data Server

Caché’s high-performance database uses a multidimensional data engine that allows efficient and compact storage of data in a rich data structure. Objects and SQL are implemented by specifying a unified data dictionary that defines the classes and tables and provides a mapping to the multidimensional structures – a mapping that can be automatically generated. Caché also allows direct multidimensional data access.

Integrated Database Access

Caché gives programmers the freedom to store and access data through objects, SQL, or direct access to multidimensional structures. Regardless of the access method, all data in Caché’s database is stored in Caché’s multidimensional arrays.

Once the data is stored, all three access methods can be simultaneously used on the same data with full concurrency.

Multidimensional Access

A unique feature of Caché is its Unified Data Architecture. Whenever a database object class is defined, Caché automatically generates a SQL-ready relational description of that data. Similarly, if a DDL description of a relational database is imported into the Data Dictionary, Caché automatically generates both a relational and an object description of the data, enabling immediate access as objects. Caché keeps these descriptions coordinated; there is only one data definition to edit. The programmer can edit and view the dictionary both from an object and a relational table perspective.

Caché automatically creates a mapping for how the objects and tables are stored in the multidimensional structures, or the programmer can explicitly control the mapping.
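The idea of a generated mapping can be sketched in Python, modeling the multidimensional storage as a dictionary keyed by subscript tuples. This is an illustrative analogy only (the class name "Person", the "PersonD" node name, and the field layout are assumptions, not Caché's actual default storage format):

```python
# Illustrative sketch (not Caché code): a "Person" row/object mapped onto
# a multidimensional structure, modeled here as a dict keyed by subscript
# tuples. The node name "PersonD" and field order are assumed for the demo.
globals_db = {}

def store_person(person_id, name, dob, city):
    # One data node per object, subscripted by ID.
    globals_db[("PersonD", person_id)] = [name, dob, city]

def load_person(person_id):
    # The same node serves object access and a relational row view.
    name, dob, city = globals_db[("PersonD", person_id)]
    return {"Name": name, "DOB": dob, "City": city}

store_person(7, "Alice", "1985-03-14", "Boston")
```

Because object, SQL, and direct access all resolve to the same underlying nodes, no transformation layer is needed between the three views.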

The Caché Advantage

Flexibility: Caché’s data access modes – Object, SQL, and multidimensional – can be used concurrently on the same data. This flexibility gives programmers the freedom to think about data in the way that makes the most sense and to use the access method that best fits each program’s needs.

Less Work: Caché’s Unified Data Architecture automatically describes data as both objects and tables with a single definition. There is no need to code transformations, so applications can be developed and maintained more easily.

Multidimensional Data Model

At its core, the Caché database is powered by an extremely efficient multidimensional data engine. The built-in Caché scripting languages support direct access to the multidimensional structures – providing the highest performance and greatest range of storage possibilities – and many applications are implemented entirely using this data engine directly. Direct “global access” is particularly common when there are unusual or very specialized structures and no need to provide object or SQL access to them, or where the highest possible performance is required.

There is no data dictionary, and thus no data definitions, for the multidimensional data engine.

Caché’s multidimensional arrays are called “globals”. Data can be stored in a global with any number of subscripts. What’s more, subscripts are typeless and can hold any sort of data. One subscript might be an integer, such as 34, while another could be a meaningful name, like “LineItems” – even at the same subscript level.

For example, a stock inventory application that provides information about item, size, color, and pattern might have a structure like this:

^Stock(item,size,color,pattern) = quantity

Here’s some sample data:

^Stock("slip dress",4,"blue","floral") = 3

With this structure, it is very easy to determine if there are any size 4 blue slip dresses with a floral pattern – simply by accessing that data node. If a customer wants a size 4 slip dress and is uncertain about the color and pattern, it is easy to display a list of all of them by cycling through all the data nodes below:

^Stock(“slip dress”,4)

In this example, all of the data nodes were of a similar nature (they stored a quantity), and they were all stored at the same subscript level (four subscripts) with similar subscripts (the third subscript was always text representing a color). However, they don’t have to be. Data nodes may have a different number or type of subscripts, and they may contain different types of data.

Here’s an example of a more complex global with invoice data that has different types of data stored at different subscript levels:

^Invoice(invoice #,"Customer") = customer information
^Invoice(invoice #,"Date") = invoice date
^Invoice(invoice #,"LineItems") = number of line items
^Invoice(invoice #,"LineItems",item #,"Part") = part number
^Invoice(invoice #,"LineItems",item #,"Quantity") = quantity
^Invoice(invoice #,"LineItems",item #,"Price") = price

Often only a single data element is stored in a data node, such as a date or quantity, but sometimes it is useful to store multiple data elements together in a single data node. This is particularly useful when there is a set of related data that is often accessed together. It can also improve performance by requiring fewer accesses of the database.

For example, in the above invoice, each item included a part number, quantity, and price all stored as separate nodes, but they could be stored as a list of elements in a single node:

^Invoice(invoice #,"LineItems",item #)

To make this simple, Caché supports a function called $list(), which can assemble multiple data elements into a length-delimited byte string and later disassemble them. Elements can in turn contain sub-elements, and so on.
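A toy version of length-delimited packing can be written in Python. This only illustrates the idea of $list(); the real on-disk format and the function's actual behavior differ:

```python
# Toy length-delimited encoding in the spirit of Caché's $list().
# The real format differs; this just shows how multiple elements can
# live in one node value and be taken apart again.
def pack(elements):
    out = bytearray()
    for e in elements:
        data = e.encode("utf-8")       # toy limit: elements under 256 bytes
        out.append(len(data))          # 1-byte length prefix
        out.extend(data)
    return bytes(out)

def unpack(blob):
    elements, i = [], 0
    while i < len(blob):
        n = blob[i]
        elements.append(blob[i + 1:i + 1 + n].decode("utf-8"))
        i += 1 + n
    return elements

# A line item's part number, quantity, and price packed into one node value
# (the values themselves are invented for the demo):
node = pack(["P-1041", "2", "59.99"])
```

Storing the three related elements in a single node means one database access retrieves all of them.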

In systems with thousands of users, reducing conflicts between competing processes is critical to providing high throughput. One of the biggest conflicts is between transactions wishing to access the same data.

Caché processes don’t lock entire pages of data while performing updates. Instead, because transactions require frequent access or changes to small quantities of data, database locking in Caché is done at a logical level. Database conflicts are further reduced by using atomic addition and subtraction operations, which don’t require locking. (These operations are particularly useful in incrementing counters used to allocate ID numbers and for modifying statistics counters.)
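The counter-allocation pattern mentioned above can be sketched in Python. Python has no user-level atomic integers, so this demo leans on CPython's thread-safe built-in iterator; Caché's atomic increment needs no lock at all:

```python
# Sketch of lock-free-style ID allocation via an atomic increment.
# itertools.count's next() is effectively atomic in CPython, standing in
# for Caché's genuinely lock-free increment operation.
import itertools
import threading

_counter = itertools.count(1)

def next_invoice_id():
    # Every caller gets a unique ID with no explicit database lock.
    return next(_counter)

ids = []
threads = [threading.Thread(target=lambda: ids.append(next_invoice_id()))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

One hundred concurrent allocations yield one hundred distinct IDs without any transaction waiting on another.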

With Caché, individual transactions run faster, and more transactions can run concurrently.

Because Caché data is inherently variable length and is stored in sparse arrays, Caché often requires less than half of the space needed by a relational database. In addition to reducing disk requirements, compact data storage enhances performance because more data can be read or written with a single I/O operation and data can be cached more efficiently.

Caché multidimensional arrays are inherently typeless, both in their data and subscripts. No declarations, definitions, or allocations of storage are required. Global data simply pops into existence as data is inserted.

In Caché, data and code are stored in disk files with the name CACHE.DAT (only one per directory). Each such file contains numerous globals (multidimensional arrays). Within a file, each global name must be unique, but different files may contain the same global name. These files may be loosely thought of as databases.

Rather than specifying which database file to use, each Caché process uses a “namespace” to access data. A namespace is a logical map that maps the names of multidimensional global arrays and code to databases. If a database is moved from one disk drive or computer to another, only the namespace map needs to be updated. The application itself is unchanged.

Usually, other than some system information, all data for a namespace is stored in a single database. However, namespaces provide a flexible structure that allows arbitrary mapping, and it is not unusual for a namespace to map the contents of several databases, including some on other computers.
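The namespace idea can be sketched as a simple lookup table in Python. The paths and global names below are invented for illustration; a real namespace map is richer than a flat dictionary:

```python
# Sketch (assumed names and paths): a namespace as a logical map from
# global names to database files, so moving a database only changes
# the map, never the application.
namespace_map = {
    "Invoice": "/data/sales/CACHE.DAT",
    "Stock":   "/data/inventory/CACHE.DAT",
    "*":       "/data/main/CACHE.DAT",      # default database
}

def resolve(global_name):
    # Application code names the global; the namespace picks the database.
    return namespace_map.get(global_name, namespace_map["*"])

# Relocating the inventory database to another server: only the map
# entry is updated, the application itself is unchanged.
namespace_map["Stock"] = "//server2/data/inventory/CACHE.DAT"
```

A namespace can thus stitch together several databases, including databases on other computers, behind one logical view.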

The Caché Advantage

Performance: Because Caché uses an efficient multidimensional data model with sparse storage techniques instead of a cumbersome maze of two-dimensional tables, data access and updates are accomplished with less disk I/O. Reduced I/O means that applications will run faster.

Scalability: The transactional multidimensional data model allows Caché-based applications to be scaled to many thousands of clients without sacrificing high performance. That’s because data access in a multidimensional model is not significantly affected by the size or complexity of the database in comparison to relational models. Transactions can access the data they need without performing complicated joins or bouncing from table to table.

Caché’s use of logical locking for updates instead of locking physical pages is another important contributor to concurrency, as is its sophisticated data caching across networks.

Rapid Development: With Caché, development occurs much faster because the data structure provides natural, easily understood storage of complex data and doesn’t require extensive or complicated declarations and definitions. Direct access to globals is very simple, allowing the same language syntax as accessing local arrays.

Cost-effectiveness: Compared to similarly sized relational applications, Caché-based applications require significantly less hardware and no database administrators. System management and operations are simple.

SQL Access

SQL is the query language for Caché, and it is supported by a full set of relational database capabilities – including DDL, transactions, referential integrity, triggers, stored procedures, and more. Caché supports access through ODBC and JDBC (using a pure Java-based driver). SQL commands and queries can also be embedded in Caché ObjectScript and within object methods.

SQL accesses data viewed as tables with rows and columns. Because Caché data is actually stored in efficient multidimensional structures, applications that use SQL achieve better performance with Caché than with traditional relational databases.

In addition to standard SQL syntax, Caché supports many of the extensions commonly used in other databases, so many SQL-based applications can run on Caché without change – especially those written with database-independent tools. However, vendor-specific stored procedures will require some work, and InterSystems has translators to help with that work.

Caché SQL includes object enhancements that make SQL code simpler and more intuitive to read and write.

The Caché Relational Gateway enables a SQL request that originates in Caché to be sent to other (relational) databases for processing. Using the Gateway, a Caché application can retrieve and update data stored in most relational databases.

Optionally, the Gateway allows Caché database classes to transparently use relational databases. However, applications will run faster and be more scalable if they access Caché’s post-relational database.

Caché Objects

Caché’s object model is based upon the ODMG standard. Caché supports a full array of object programming concepts, including encapsulation, embedded objects, multiple inheritance, polymorphism, and collections.

The built-in Caché scripting languages directly manipulate these objects, and Caché also exposes Caché classes as Java, EJB, COM, .NET, and C++ classes. Caché classes can also be automatically enabled for XML and SOAP support by simply clicking a button in the Studio IDE. As a result, Caché objects are readily available to every commonly used object technology.

There are several ways for a program outside of the Caché Application Server to access Caché classes:

Any Caché class can be projected as a class in the native language. When a Java, C++, C#, or other program accesses a Caché object, it calls a template of the class in the native language. That template class (which is automatically generated by Caché) communicates with the Caché Application Server to invoke methods on the Caché server and to access or modify properties. State for the Caché objects is maintained in the Caché Application Server. To speed execution and reduce messaging, Caché caches a copy of the object’s data on the client and piggybacks updates with other messages when possible.

Caché eXTreme technology can be used for database classes in which the native language template class directly accesses the database – bypassing the Application Server. The object’s state is not kept on the Application Server; the in-memory properties are only maintained in the client. This approach provides significantly higher throughput but less functionality, since server-side instance methods of the class (i.e., methods that need access to the in-memory properties) cannot be invoked. Caché eXTreme is available for C++ and Java.

InterSystems Jalapeño™ technology allows Java developers to first create Java database classes just like any other POJO (plain old Java object) class in their IDE of choice and then have Caché automatically generate a database schema and corresponding Caché class. Using this approach, the Java class is unchanged, and the application continues to access its properties and methods. Caché provides a library class (“Object Manager”) with an API that is used to store and retrieve database objects and issue queries.

With each of these three approaches, the object appears to be local to the user program. Caché transparently handles all communications, using either call-in or TCP.

Caché includes a number of unique advanced object technologies – one of which is method generators. A method generator is a method that executes at compile time, generating code that can run when the program is executed. A method generator has access to class definitions, including property and method definitions and parameters, to allow it to generate a method that is customized for the class. Method generators are particularly powerful in combination with multiple inheritance – functionality can be defined in a multiply inherited class that customizes itself to the subclass.
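The method-generator idea has a rough Python analogue: code that runs at class-definition time, inspects the class's declared properties, and emits a method customized for that class. This is only an analogy in Python's object model, not how Caché implements generators:

```python
# Sketch of the method-generator idea: at class-definition time (the
# closest Python analogue to compile time), a base class inspects the
# subclass's declared properties and generates a customized method.
class Describable:
    properties = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        props = cls.properties           # the "generator" sees the definition

        def describe(self):              # generated once per subclass
            return ", ".join(f"{p}={getattr(self, p)}" for p in props)

        cls.describe = describe

class Person(Describable):
    properties = ("name", "city")        # assumed demo properties

    def __init__(self, name, city):
        self.name, self.city = name, city
```

As in Caché, inheriting from the generator-bearing class is enough: each subclass gets a method tailored to its own definition.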

The Caché Advantage

Caché is fully object-enabled, providing all the power of object technology to developers of high-performance transaction processing applications.

Rapid Application Development: Object technology is a powerful tool for increasing programmer productivity. Developers can think about and use objects – even extremely complex objects – in simple and realistic ways, thus speeding the application development process. Also, the innate modularity and interoperability of objects simplifies application maintenance and lets programmers leverage their work over many projects.

Natural Development: Database objects appear as objects native to the language being used by the developer. There is no need to write tedious code to decompose objects into rows and columns and later re-assemble them.

Transactional Bit-map Indexing

Database performance is critically dependent on having indexes on properties that are frequently used in searching the database. Most databases use indexes that, for each possible value of the column or property, maintain a list of the IDs for the rows/objects that have that value.

A bit-map index is another type of index. Bit-map indexes contain a separate bit map for each possible value of a column/property, with one bit for each row/object that is stored. A 1 bit means that the row/object has that value for the column/property.

The advantage of bit-map indexes is that complex queries can be processed by performing Boolean operations (AND, OR) on the indexes – efficiently determining exactly which instances (rows) fit the query conditions, without searching through the entire database. Bit-map indexes can often boost response times for queries that search large volumes of data by a factor of 100 or more.
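The Boolean-operation technique described above can be sketched in Python, using arbitrary-precision integers as bit maps (one bit per row ID; the column names and sample rows are invented for the demo):

```python
# Sketch: one bit map per distinct value of a column, one bit per row ID.
# Python ints serve as arbitrarily long bit strings, so a complex query
# becomes a couple of Boolean operations instead of a table scan.
from collections import defaultdict

index = {"color": defaultdict(int), "size": defaultdict(int)}

def add_row(row_id, color, size):
    index["color"][color] |= 1 << row_id
    index["size"][size]   |= 1 << row_id

for rid, (color, size) in enumerate(
        [("blue", 4), ("red", 4), ("blue", 6), ("blue", 4)]):
    add_row(rid, color, size)

# "color = blue AND size = 4" without touching the data rows:
hits = index["color"]["blue"] & index["size"][4]
matching_rows = [i for i in range(hits.bit_length()) if hits >> i & 1]
```

Only the rows whose bits survive the AND need to be fetched, which is why such queries avoid scanning the database.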

Bit-maps traditionally suffer from two problems: a) they can be painfully slow to update in relational databases, and b) they can take up far too much storage. Thus, with relational databases, they are rarely used for transaction processing applications.

Caché has introduced a new technology – Transactional Bit-map Indexing – that leverages multidimensional data structures to eliminate these two problems. Updating these bit-maps is often faster than updating traditional indexes, and they utilize sophisticated compression techniques to radically reduce storage. Caché also supports advanced “bit-slicing” techniques. The result is ultra-fast bit-maps that can often be used to search millions of records in a fraction of a second on an online transaction-processing database. Business intelligence and data warehousing applications can work with “live” data.

Caché offers both traditional and transactional bit-map indexes. Caché also supports multi-column indexes. For example, an index on State and Car Model can quickly identify everyone who has a car of a particular type that is registered in a particular state.

The Caché Advantage

Radically Faster Queries: By using transactional bit-map techniques, users can get blazing fast searches of large databases – often millions of records can be searched in a fraction of a second – on a system that is primarily used for transaction processing.

Update at “Transaction Speed”: Caché’s bit-maps can be updated as quickly as traditional indexes, making them suitable for use with transaction-processing applications.

Analytics on the Production Server: There is no need for a second computer dedicated to data warehousing and decision support. Nor is there any need for daily operations to transfer data to such a second system, or for database administrators to support it.

Scalability: The speed of transactional bit-maps enhances the ability to build systems with enormous amounts of data that needs to be searched.

Word-aware Text Searching

Through Word-aware indexing, Caché supports free text searching in which queries can search for text containing words of interest, even though the actual words in the text may be variants of the search words. Word-aware algorithms are specific to the natural language being used. Word-aware searching is available for a wide range of natural languages, including English, French, German, Italian, Portuguese, and Spanish. Others are being added.
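The variant-matching behavior can be illustrated with a deliberately naive English suffix-stripping stemmer in Python. Real word-aware algorithms are language-specific and far more sophisticated; this sketch only shows why a query for one word form can find its variants:

```python
# Toy sketch of word-aware matching: strip a few common English suffixes
# so a query word matches its variants. (Caché's actual word-aware
# algorithms are per-language and much more capable than this.)
def stem(word):
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

def search(text, query_word):
    # Match on stems, not exact spellings.
    target = stem(query_word)
    return [w for w in text.split() if stem(w) == target]

found = search("the patient walked and was walking daily", "walks")
```

Here a query for “walks” finds both “walked” and “walking”, even though the exact query word never appears in the text.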

The Caché Advantage

Powerful Unstructured Text Searches: Unstructured text, such as physician’s notes or documents, can be easily searched for keywords and related words.

Extremely Rapid Searches: Coupling Word-aware with Caché bit-map technology, searching of massive quantities of text can be performed in a fraction of a second.

InterSystems iKnow Technology

InterSystems iKnow technology allows you to analyze and index text and other unstructured data types, identifying important knowledge, concepts, and relationships. Unlike most other semantic and search technologies, iKnow will automatically indicate the most interesting meaningful elements in the data without needing any input from the user, not even a search term.

Smart indexing analyzes and transforms unstructured text into an understandable network of relationships and concepts without any need for pre-defined dictionaries, taxonomies, or ontologies. Smart indexing provides insight into what’s relevant, what’s related, and what’s representative within large volumes of unstructured text, without needing input of a search term.

Smart indexing works with a number of different languages. It can also identify concepts (recurring patterns) within unstructured data that is not traditional text.

Smart matching links the results of smart indexing to existing knowledge specific to a domain, organization, or industry. Matching is based on meaningful concepts and their combinations, not just words, and includes exact, partial, and “scattered” matches.

Smart interpretation applies analytics and/or business rules to the results derived from smart indexing and smart matching. (May require use of InterSystems Ensemble or InterSystems DeepSee.)

The Caché Advantage

Built-in Capabilities for Analyzing and Indexing Unstructured Data: InterSystems iKnow technology is built into Caché. You can access all your data – structured and unstructured – without resorting to third-party solutions or tools.

Minimal Up-front Work: Unlike other semantic search tools, InterSystems iKnow technology can find concepts and relationships without having them pre-defined. Users don’t necessarily need to know what they are looking for before they look for it.

Multilingual: InterSystems iKnow technology works with multiple languages, even if languages are mixed within a document.

Multipurpose: InterSystems iKnow technology can be used to solve different kinds of problems: identifying the most interesting elements in a huge pile of documents, automatically routing information based on its similarity to other information, condensing large amounts of text into concise summaries, etc.

Enterprise Cache Protocol For Distributed Systems

InterSystems’ Enterprise Cache Protocol (ECP) is an extremely high-performance and scalable technology that enables computers in a distributed system to use each other’s databases. The use of ECP requires no application changes – applications simply treat the database as if it were local.

Here’s how ECP works: Each Caché application server includes its own Caché Data Server, which can operate on data that resides in its own disk systems or on blocks that were transferred to it from another Caché Data Server by ECP. When a client makes a request for information that is maintained on a remote Data Server, the application server will attempt to satisfy the request from its local cache. If it cannot, it will request the necessary data from the remote Data Server. The reply includes the database block(s) where that data was stored. These blocks are cached on the application server, where they are available to all applications running on that server. ECP automatically takes care of managing cache consistency across the network and propagating changes back to Data Servers.
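The read path described above can be sketched in Python. The block IDs, contents, and structure below are assumptions made for the demo; real ECP also handles writes, cache invalidation, and consistency:

```python
# Sketch (assumed structure): an application server satisfies reads from
# a local block cache and fetches whole blocks from the remote data
# server only on a miss, as ECP does.
remote_blocks = {"block-17": {"^Stock": "quantity=3"}}   # data server's disk
fetches = []                                             # network round trips

local_cache = {}

def read_block(block_id):
    if block_id not in local_cache:          # cache miss: one network request
        fetches.append(block_id)
        local_cache[block_id] = remote_blocks[block_id]
    return local_cache[block_id]             # later reads are purely local

read_block("block-17")
read_block("block-17")
read_block("block-17")
```

Three reads cost one network round trip: once a block is cached on the application server, every process on that server can use it.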

The performance and scalability benefits of ECP are dramatic. Clients enjoy fast responses because they frequently use locally cached data. And caching greatly reduces network traffic between the database and application servers.

The use of ECP is transparent to applications. Applications written to run on a single server run in a multi-server environment without change. To use ECP, the system manager simply identifies one or more Data Servers to an application server and then uses Namespace Mapping to indicate that references to some or all global structures (or portions of global structures) refer to that remote Data Server.

Every Caché system can function both as an application server and as a Data Server for other systems. ECP supports any combination of application servers and Data Servers and any point-to-point topology of up to 255 systems.

The Caché Advantage

Massive Scalability: Caché’s Enterprise Cache Protocol allows the addition of application servers as usage grows, each of which uses the database as if it were a local database. If disk throughput becomes a bottleneck, more Data Servers can be added, and the database becomes logically partitioned.

Higher Availability: Because users are spread across multiple computers, the failure of an application server affects a smaller population. Should a Data Server “crash” and be rebooted, or a temporary network outage occur, the application servers can continue processing with no observable effects other than a slight pause. Configuring Data Servers as a failover hardware cluster with backup Data Servers can significantly enhance availability.

Lower Costs: Large numbers of low cost computers can be combined into an extremely powerful system supporting massive processing.

Transparent Usage: Applications don’t need to be written specifically for ECP – Caché applications can automatically take advantage of ECP without change.

Disaster Recovery and High Availability

Even in the most rigorous environments unexpected events can occur – hardware failure, power loss, or something as severe as a flood or other natural disaster – yet hospitals, telecommunications, and other critical operations cannot afford to be “down”. To meet such exacting standards, Caché is designed to recover gracefully from outages and offers a variety of failover and other options to reduce or eliminate the impact on users.

Caché Write-Image Journaling and other integrity features ensure database integrity for most types of hardware failures – including power outages – allowing rapid recovery while minimizing the impact on users.

Mirroring is a high-availability configuration that replicates data on a separate disk in real time, with automatic failover.

A database mirror is a logical grouping of two Caché systems. Upon startup, the mirror automatically designates one of these two physically independent systems as the primary system; the other one automatically becomes the backup system. Mirrored databases are synchronized from the primary to the backup failover member in real time through a TCP channel. The backup system returns acknowledgments about receipt of mirrored data over a dedicated channel. The acknowledgment indicates, among other things, how up to date the backup failover member is.
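The ship-and-acknowledge flow can be sketched in Python. The message shapes and journal records below are invented for the demo; real mirroring runs over TCP with a dedicated acknowledgment channel:

```python
# Sketch (assumed message shapes): the primary ships journal records to
# the backup, and the backup acknowledges the latest record received,
# telling the primary how up to date the backup failover member is.
primary_journal = []
backup_journal = []
acked_through = 0        # highest sequence number the backup has confirmed

def primary_write(seq, record):
    primary_journal.append((seq, record))
    backup_receive(seq, record)          # real mirroring sends this over TCP

def backup_receive(seq, record):
    backup_journal.append((seq, record))
    ack(seq)                             # dedicated acknowledgment channel

def ack(seq):
    global acked_through
    acked_through = seq

primary_write(1, "SET ^Stock(1)=3")      # demo records, not real journal syntax
primary_write(2, "SET ^Invoice(1)=100")
```

After the second write, the primary knows from the acknowledgment stream that the backup is current through sequence 2.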

External clients (language bindings, ODBC/JDBC/SQL clients, direct-connect users, etc.) connect to the mirror through the Mirror Virtual IP (VIP), which is specified during mirroring configuration. The Mirror VIP is automatically bound to an interface on the primary system of the Mirror.

The configuration of a Mirror VIP is optional; if not specified, all external clients must connect directly to the running primary, and must have knowledge of both the failover members and their current role within the Mirror. If the primary system fails, the backup system automatically takes over. When the Mirror is configured so that clients are logged on through the VIP, they do not know which of the mirror members is serving them. The failover is entirely transparent to users.

Caché can also update multiple (usually geographically separate) ASYNC mirror members, allowing you to build redundancy into your system and to aid in disaster recovery. Failover to an ASYNC member is not automatic.

The Caché Advantage

Bullet-Proof Database: Caché Write-Image Journaling and other integrity features ensure database integrity for most types of hardware failures, including power outages.

High-Availability Configurations: The use of Database Mirrors, ECP, and/or Fail-over Clusters allows rapid recovery from outages while minimizing, or in some cases eliminating, their impact on users.

High-Availability at Lower Cost: Caché Database Mirrors do not require large investments in hardware, support, operating system licenses, or storage. In addition, Caché Database Mirrors are easy to set up and maintain, so administration costs are minimized.

When used with a Database Mirror, Enterprise Cache Protocol (ECP) application servers have built-in knowledge of the members of the mirror, including the current primary. The application servers, therefore, do not rely on the Mirror VIP, but rather connect directly to the elected primary system.

If the primary member of the Mirror fails, the ECP application servers will view this as a server restart condition. The servers will simply reestablish their connections to the new primary failover member and continue processing their in-progress workload. Users will, at most, experience only a slight pause.

Using fail-over clustered hardware, Data Servers share access to the same disks, but only one is actively running Caché at a time. If the active server fails, Caché is automatically started on another server that takes over the processing responsibilities. The users can then sign back on to the new server.

An ECP data server can be configured as a failover cluster. If the primary data server crashes, the backup data server takes over.

Security Model

Caché is certified for Common Criteria EAL 3. Caché security is based on four elements: authentication, authorization, auditing, and database encryption. Caché provides these security capabilities while minimizing the burden on application performance.

Authentication is how users prove to Caché that they are who they say they are. (A “user” is not necessarily a human being. It could, for example, be a measurement device generating data or an application running on another system that is connected to Caché.) Caché has a number of available authentication mechanisms:

Kerberos: The most secure means of authentication. The Kerberos Authentication System provides mathematically proven strong authentication over a network.

LDAP: Caché supports authentication through the Lightweight Directory Access Protocol (LDAP). In this case, Caché contacts an LDAP server to authenticate users, relying on its database of users and their associated information to perform authentication. The LDAP server also controls all aspects of password management, password policies, etc.

Passwords: Caché prompts the user for a password and compares a hash of the provided password against the hash value it has stored.

Operating-system–based: OS-based authentication trusts that the OS has verified the identity of each user, and uses that same identification for Caché purposes.

You can also allow all users to connect to Caché without performing any authentication.
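The password mechanism described above (store only a hash, compare hashes at login) can be sketched in Python using the standard library. Caché's exact hashing scheme is not documented here; the salted PBKDF2 construction below is a common illustrative choice, not a claim about Caché's internals:

```python
# Sketch of the password mechanism: store a salt plus a hash, never the
# password itself, and compare hashes in constant time at login.
# (PBKDF2-SHA256 is an assumed choice for the demo, not Caché's scheme.)
import hashlib
import hmac
import os

def make_record(password):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify(password, record):
    salt, digest = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)   # constant-time compare

record = make_record("s3cret")
```

Because only the salted hash is stored, a stolen record does not directly reveal the password.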

Caché provides built-in support for two-factor authentication, which requires users to verify their identity via something they know and something they have. For example, when a user provides a password (something they know), the application may send a text message to their cell phone (something they have). The text message includes an additional security code that must be entered before access to the application is granted.

Caché supports single sign-on by enabling participation in an OpenAM configuration.

Once a user is authenticated, the next security-related operation is to determine what that user is allowed to use, view, or alter. This determination and control of access is known as authorization. The assignment and management of privileges are normally performed through the Caché Management Portal.

The primary goal of security is the protection of resources – information or capabilities in one form or another. With Caché, resources can be databases, services, applications, tools, and even administrative actions. The system administrator grants access to these by assigning permissions, such as READ, WRITE, or USE. Together, a resource and an associated, assigned permission are known as a privilege. In addition to the system-defined resources, the security administrator can create application-specific resources and use the same mechanisms for granting and checking permissions.

For simplicity, users are usually assigned one or more “roles” (e.g., “LabTech” or “Payroll”), and the Security Administrator then grants privileges for a particular resource to those roles rather than to individual users. The user inherits all of the privileges granted to the roles they are assigned.
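The resource/permission/role model above can be sketched in Python. The role names come from the text; the resources, users, and permission names are invented for the demo:

```python
# Sketch (assumed resource and user names): a privilege is a resource
# plus a permission, granted to roles; users inherit every privilege of
# every role they hold.
role_privileges = {
    "LabTech": {("LabDB", "READ"), ("LabDB", "WRITE")},
    "Payroll": {("PayrollDB", "READ")},
}
user_roles = {"chris": {"LabTech"}, "pat": {"LabTech", "Payroll"}}

def is_authorized(user, resource, permission):
    # Check every role the user holds for the requested privilege.
    return any((resource, permission) in role_privileges.get(role, set())
               for role in user_roles.get(user, set()))

assert is_authorized("pat", "PayrollDB", "READ")
```

Granting to roles rather than users means adding a new lab technician is one assignment, not a long list of individual privileges.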

It is often useful for a user to temporarily gain additional privileges rather than have them permanently assigned. For example, rather than the security administrator granting a broad set of privileges to a user (such as the ability to access and modify the payroll database), the user can instead be given just the privilege to access the payroll application, and that application can then elevate the user’s privileges while that application is being used.

To accomplish this elevation, roles can be assigned to applications. When that application is accessed, the user temporarily acquires additional roles. The additional roles may be simply a list that everyone authorized to use the application acquires, or the additional roles may be more customized, based on the roles the user already has.

This feature is particularly useful for browser-based applications using CSP (Caché Server Pages). With CSP, a portion of every URL specifies an application name. Following authentication and a determination that the user is authorized to use that CSP application, the user temporarily gains the additional roles assigned to that application for the duration of that page request.

The security administrator can also designate specific routines as capable of performing role elevation to gain the additional roles of specified applications, after passing user-specified security tests. This facility is tightly controlled, and it is the mechanism by which non-CSP applications perform role elevation.

Many applications, especially those that must comply with government regulations like HIPAA or Sarbanes-Oxley, need to provide secure auditing. In Caché, all system and application events are recorded in an append-only log, which is compatible with any query or reporting tool that uses SQL.

The Security Administrator can designate one or more CACHE.DAT files (databases) to be encrypted on disk. Everything in those files is then encrypted, including any indexes.

Developers can use system functions to encrypt/decrypt data, which then may be stored in the database or transmitted. This feature can be used to encrypt sensitive data to protect it from other users who have read access to the database but not the key.

By default, Caché encrypts data with an implementation of the Advanced Encryption Standard (AES), a symmetric algorithm that supports keys of 128, 192, or 256 bits. Encryption keys are stored in a protected memory location. Caché provides full capabilities for key management.