.NET and data persistence

This article was originally published in VSJ, which is now part of Developer Fusion.

The .NET framework has become the platform of choice for many enterprise applications. Not surprisingly, many .NET concepts parallel those in CORBA and J2EE. The benefit of hindsight gives .NET architects and developers the opportunity to imitate the success while avoiding the pitfalls inherent in complex distributed development. This is especially true for data persistence, which often gets less attention than it deserves.

Issues with data persistence are often not discovered until later stages of development and testing, when the options for addressing them can be limited and quite expensive. A well-designed data persistence layer can drastically improve the performance of the entire application, if it is addressed throughout the development process.

This article outlines best practices for developing advanced data persistence in a .NET application. It first discusses the development and runtime requirements of an effective data layer. It then shows the benefits of an object-oriented approach and intelligent persistent data caching and compares these with the data persistence facilities available in the .NET framework.

Requirements for a Efficient Data Persistence Layer

A data persistence solution must address both development and runtime challenges. It must be able to support the complex data requirements of an enterprise application – without reducing developer productivity or compromising performance.

Enterprise applications have similar requirements for data persistence regardless of the programming languages with which they were developed and the platforms on which they are deployed. One lesson learned early on in the Java community was that the representation of relational data as objects yields huge returns. The Java world continues to focus on this object-oriented approach with EJB, JDO and even more recently, Hibernate. Successful .NET architects can benefit from the lessons of the Java world.

Nobody wants to spend a lot of time writing, testing, and debugging data persistence code. For the most part, this is plumbing code that “should just work” without a lot of extra effort. Developers want to focus their efforts on the custom business logic of their application. However, because data persistence is an important element of the overall application, it cannot be ignored completely. Ideally, a solid data persistence layer requires minimal developer intervention. Choosing good tools that assist in object-relational mapping and code generation along with careful design of an efficient data persistence model helps meet both of these requirements.

However, the ability to quickly design and build an application does not guarantee success at deployment time. Performance and scalability are possibly the two most popular terms in all of enterprise development. A poorly designed data access layer can have a significant impact on application performance, and ultimately availability and scalability. While some approach this problem by throwing more hardware at it, caching is the most common and effective solution for performance problems.

While web page caching is a common solution for improving response times, it does not address data bottlenecks for enterprise applications. Applications with complex data requirements quickly become unmanageable when data access is coded directly in web components. Instead, business logic and data access typically reside in the server tier between the database and web components, as shown in Figure_1.

Figure 1: Enterprise application architecture

Scalability and performance problems occur when the volume of requests from the server tier overwhelm the limited database resources. Server tier caching of persistent data increases the application’s capability for handling many concurrent requests.

So, important success factors for a data persistence layer include:

Developer productivity and code maintainability

Built-in runtime performance and scalability

High availability features for deployment

The remainder of this article describes the approaches that have been proven in many successful enterprise Java applications. The availability of new, third-party products for .NET development that embrace some or all of these approaches indicates that they are already gaining acceptance in the .NET community.

Advantages of an Object-Oriented Approach for Developer Productivity

For the most part, enterprise applications work with complex data requirements. For example, a banking application may perform thousands of transactions per hour. Each transaction may consist of different bits of data such as customers, accounts, regulatory information, etc. This data is most likely stored in multiple tables within multiple different database systems. By programming to a data abstraction layer, developers are isolated from the underlying details of the data persistence layer. Instead, they are able to treat the persistent objects the same as they would any other type of object. Data abstraction enables developers to easily manage complex data requirements. By hiding the details of the underlying data sources (through design and/or runtime activities) the data access APIs become much simpler to work with.

When Microsoft first introduced the .NET framework, they also introduced a data layer API called ADO.NET. Derived from ADO, an earlier form of data access, ADO.NET’s goal is to work (through 3rd party adapters) with various data sources throughout the internet. Its status as a “de-facto” standard, coupled with its tight integration to Visual Studio.NET, quickly made ADO.NET useful to developers who were working with relational data.

Unfortunately, ADO.NET is not an object-oriented approach. Although it facilitates access to various data sources, ADO.NET remains a relatively low-level API whose code can be generated either through Visual Studio or by 3rd party tools. Regardless of what tool is used, complex systems usually need some additional data persistence code, which must be manually written. In addition, ADO.NET cannot directly satisfy the requirement of transparent object persistence. This is because the API is geared towards the processing of tabular result sets (the DataSet object) built from SQL queries, rather than towards the business objects used by the application.

In contrast, the O-R (Object-Relational) mapping approach takes all that is good from object-oriented programming and applies that to the data layer. Concepts such as encapsulation, abstraction, and reusability make the data layer easier to manipulate and use, meaning faster, more efficient development.

The Data Persistence Object Model

At the heart of the object-oriented approach lies the data persistence object model. This model is similar in look and feel to any other object model in your application (such as business objects). The only difference is that unlike business objects, the objects in the data persistence object model actually represent data stored in persistent storage devices (such as a relational database). The underlying mapping details are dealt with as part of a design and/or runtime activity, meaning the application code does not need to contain any database-specific code.

For example a relational table A_Table may be modeled and mapped to the C# class A.cs. Column names become property names, and the CRUD operations are replaced with methods.

Figure 2: Relational data tables

Figure 2 shows two tables (Department and Employee) that have a one-to-many relationship. The foreign key (DeptID) is on the Employee table.

Suppose we want to get all Employees associated with the “sales” Department. In ADO.NET the code may look like Sample 1.

Notice how the O-R mapped solution uses no SQL or database-specific code. In fact, the relational data is treated the same as any ordinary .NET object. A closer look at Sample 2 shows how relationships are represented and used in the O-R approach (dept.employees). Obtaining information in related tables becomes as simple as accessing an object attribute.

Advanced Features of an Object-Oriented Approach

The previous sample showed how the O-R mapping approach improves the usability and maintainability of the data persistence code for simple tables. But what about more complex data models, those often found in enterprise applications? For example, what if Employee was simply a base type in a model that allowed for various types of Employees (see Figure 3).

Figure 3: Inheritance model for Employee

Inheritance is a powerful feature of object-oriented programming. Therefore applying this capability to the object model further enhances the usability of the data persistence code. Now the developer can work with and understand the data persistence objects in a more natural way.

Products that offer this level of O-R mapping functionality usually provide various tools to define the mapping and generate code for persistent objects. Choosing a solution with a complete toolset will greatly improve productivity. For example, some products offer a tool that will read and generate object models directly from existing database schemas. This means that with very little time and effort, and object model can be created, code can be generated, and the developers can enjoy the benefits the object-oriented approach brings to the data persistence layer.

Integrating Caching for Performance and Scalability

As was mentioned earlier, server-tier caching is the most common and effective solution for performance problems. Caching works by moving data closer to the business processes and reducing the number of database calls.

ADO.NET can provide a form of lightweight caching through the use of disconnected datasets. In using disconnected datasets, DataSet information is serialized to XML files, which can then be modified “offline” in local processes. Some folks in the Microsoft world feel that disconnected datasets are the .NET way to implement data caching. However, caching was not the initial intent behind disconnected datasets; designers were simply looking for a way to reduce the number of open database connections.

By tearing down the connection after data has been transferred to or from the database, disconnected datasets improve performance on both the database and application side (clients are no longer blocked).

Various other types of caching solutions are available. However, it is important to realize that caching strategies are not all equal. Depending on the complexity of your application, certain caching solutions may provide a significant differences in performance and scalability. Some also require much more development effort to integrate and manage the cache.

Questions you need to ask when looking for a caching solution include:

What kind of caching is supported (static or dynamic)? How much of my data can be cached?

Does it leverage the object model (caching relationships)

How useable is this solution? How much developer intervention is required? What kinds of management capabilities are available?

Will the cached data remain consistent and accurate if it is distributed across multiple applications?

Static Caching

Let’s start the discussion with the most basic type of caching: static caching. A static cache (sometimes also called a read-only cache) contains data that never changes. This is usually reference data that is potentially used a number of times. Therefore instead of retrieving this data from the database every time it is needed, an application can simply query the cache, thus reducing the overhead of a database call.

Although a static cache can be a huge improvement over querying the database, it only solves performance problems for read-only data – which may be only a portion of the entire data model. For most enterprise systems, many elements in the data model are very dynamic (change frequently) and also need to be accessed at near real-time speeds. The old 80-20 rule applies: 80% of an application’s transactions will be performed on 20% of the data. Caching that 20% can significantly improve performance, even for transactional data.

Dynamic Caching

It takes a dynamic or “intelligent” cache to fulfill the 80-20 requirement. A dynamic cache allows for data to be both read and updated in the cache. Although updates may ultimately be propagated to the database, the ability to update objects directly in the cache can provide increased performance benefits as it immediately impacts users reading the updated data.

When looking at caching systems, make sure to understand the locking behavior for updates. One approach is to lock the data that is getting updated (pessimistic locking). For example if an Employee instance is getting updated, then all other threads sharing this cache will be blocked from reading (and possibly updating) this Employee until the update is complete. Although this method works, for transactions on highly requested objects, pessimistic locking tends to reduce performance and negate the benefits of the cache.

More sophisticated caching solutions offer optimistic locking. By assigning version numbers to different revisions of the data, the cache will know if a thread is attempting to update stale data. By using a versioning mechanism, data in the cache never needs to be locked, thus increasing performance.

Relationship Caching

If a cache treats a data object as a black box, it cannot understand and manage relationships between objects. An intelligent cache knows the both the application object model and the database schema. It is aware, not just of object attribute values, but also the relationship between objects. By caching relationships, an intelligent cache can dramatically reduce the number of expensive join queries performed by the database, resulting in a corresponding improvement in performance. Enterprise applications, with their complex data models, can enjoy a surprisingly dramatic performance boost.

Usability

Another consideration is how much programming is required to implement the caching. If object-relational mapping classes are tightly integrated with the runtime cache, objects can be automatically instantiated in the cache whenever they are accessed by the application. Developers do not have to manually track or code which objects must be cached, when they should be cached, or when they should be cleared from the cache. Simple configuration of caching policies can optimize such intelligent caches. Other alternatives require significant manual coding, and in the worst cases, application rearchitecture to insert a cache.

Distributed Caching to Increase Scalability and High Availability

In order to address the scalability and high availability concerns of enterprise applications, many architects look to using distributed caches. Distributed caches are simply a cluster of dynamic caches. When one cache in the cluster updates some data, that update will be sent to all participating caches in the cluster. Upon receiving the update, the participating caches can then apply the update directly to their own cache (see Figure_4).

Figure 4: Distributed caches

Just as with caching, clustering technologies vary widely. Most simply invalidate objects when a change occurs to the underlying data. This means that the next request for any of the changed objects requires an expensive database call. True cache synchronization provides the updated information to all of the distributed caches, delivering the best performance.

Not only can a distributed caching system provide performance and scalability requirements of an enterprise application, but it provides a safe foundation for advanced enterprise deployments such as those providing failover and load balancing.

When all application caches are in sync (as in Figure_4) then if one server were to go down, the other servers would easily be able to satisfy the additional failover requests with little to no performance impact on the end user. Ideally, cache synchronization complements server clustering mechanisms to provide better uptime in the case of database or network outages, as well as allowing for server-side failover.

Conclusion

Although data persistence is a critical component of enterprise applications, its importance is often overlooked. Architects and developers often settle for de facto data persistence solutions, only to experience problems later in development or at deployment. The .NET community can learn from the successes Java has had with enterprise applications, and take advantage of third-party development products that implement these techniques.

This article identified developer productivity, performance, scalability and high availability as important requirements for the data persistence layer of enterprise .NET applications. Productivity gains can be realized by adopting an object-oriented approach and choosing code generation tools with flexible features. Choosing a good caching and distribution technology can increase performance, scalability, and availability with minimal code changes. The combination of O-R mapping, caching, and distribution for advanced data persistence can play a significant role in reducing risks and contributing to success for enterprise applications.

Greg Aloi is a product consultant for the Progress Real Time Division. He has eight years of experience designing, developing, and teaching distributed systems, with a focus on the distributed data layer.

Tobias Grasl has been developing software for the past seven years, with a specific focus on metadata-driven infrastructure libraries and applications. His current interests include MDA, agile methodologies, and the impact of both on a team’s ability to deliver quality software.

The Progress Real Time Division (formerly ObjectStore) is an operating company of Progress Software Corporation, and provides data management, access and synchronization products for the real-time enterprise. These products are used by thousands of companies around the world to help manage streaming event data for applications such as algorithmic trading and RFID data management. The products also accelerate the performance of existing relational databases and support occasionally connected users requiring real-time access to enterprise applications.

Windows Communication Foundation has become an integral part of many .NET based solutions, enabling highly customizable messaging across distributed environments. In Expert WCF 4, you will cover scenarios that include designing, implementing, consumi...

“It is practically impossible to teach good programming style to students that have had prior exposure to BASIC. As potential programmers, they are mentally mutilated beyond hope of regeneration.” - E. W. Dijkstra