This blog tracks development of the open source accounting and ERP software LedgerSMB. I also offer some perspectives on PostgreSQL including new features which we may find useful. Brought to you by Metatron Technology Consulting.

Tuesday, July 17, 2012

Why LedgerSMB uses Moose. An intro for PostgreSQL folks.

In LedgerSMB 1.4 we are moving to using a Perl object system called Moose for new code. This post will discuss why we are doing so, what we get out of it, etc. For those new to Moose this may serve as a starting point to decide whether to look further into this object framework.

To start with, however, we will need to address our overall strategy regarding data consistency and accuracy.

Why we Use Database Constraints Aggressively

Before getting into Moose specifically, it's important to reiterate why we use database constraints aggressively in LedgerSMB.

In general you can divide virtually all application bugs into a few categories which fall within two major classifications:

State Errors

Execution Errors

State errors involve not only transient state problems but also stored information. In other words if the application processes meaningless information, the results will be similarly meaningless. Summarized: garbage in, garbage out. Detecting garbage, and preventing garbage from being persistently stored is thus important. Typically a specific misbehavior can be a cascading failure from an undetected bug regarding storage of information. This is particularly important to avoid in an accounting application.

Database constraints allow us to declare mathematically the constraints of meaningful data and thus drastically reduce the chance of state errors occurring in the application. Foreign keys protect against orphaned records, type constraints protect against invalid data, and check constraints can be used to ensure data falls within meaningful parameters. NOT NULL constraints can protect against necessary but missing information.

The other type of error is an execution error. These can be divided into two further categories: misdirected execution (confused deputy problems and the like) and bad execution (cases where the application takes the right information and does the wrong things with it).

Any reduction in state errors brings a number of benefits to the application. Troubleshooting is simplified, because the chance of a cascade from previously stored bad data is reduced. This leads to faster bug fixes and a more robust and secure application generally. Not only are the remaining problems simpler, there are also fewer of them. We see this as a very good thing. While it is not necessarily a question of alternatives, testing cannot compare to proof.

One really nice thing about PostgreSQL is its very rich type system and the ability to add very rich constraints against those types. The two together make it an ideal database for this sort of work.

Enter Moose

Recently, Perl programmers have been adopting a system of object-oriented programming based on metaclasses and other concepts borrowed from the LISP world. The leading object system that has resulted is called Moose, and it bills itself as a post-modern object system for a post-modern language (Perl). Moose offers a large number of features, of which we consider the following to be the most important:

A rich type system for declaring object properties and constraints

Transparent internal structures

Automatic creation of constructors and accessors

Method modifiers.

These four benefits bring to the Perl level the same sort of proof of state that the database level has traditionally had. We can be sure that a given attribute falls within a meaningful range of values, and we can be sure everything else just works.

Brief Overview of Useful Type Features

In plain old Perl 5 we would normally build accessors and constructors to check specific types for sanity. This is the imperative approach to programming. With Moose we do so declaratively and everything just works.

For example we can do something like this, if we want to make sure the value is an integer:

has location_class => (is => 'rw', isa => 'Int');

This would be similar to having a part of a table definition in SQL being:

location_class int,

This tells Moose to create a read/write accessor for the attribute, and to check that any value supplied through the constructor or the accessor is a valid Int.
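As a minimal sketch of that declaration in use (the Location package name is an assumption made for illustration, and Moose must be installed from CPAN), the following shows the generated accessor accepting an integer and rejecting anything else:

```perl
package Location;    # hypothetical class name, for illustration only
use Moose;

# Moose generates a read/write accessor and validates every value against Int.
has location_class => (is => 'rw', isa => 'Int');

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $loc = Location->new(location_class => 1);
$loc->location_class(2);    # passes the Int check

# A non-integer value raises an exception instead of being stored.
my $ok = eval { $loc->location_class('north'); 1 };
print "rejected: 'north' is not an Int\n" unless $ok;
```

The failed assignment leaves the previous value in place, much as a rejected UPDATE leaves a row untouched.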

We can also specify that this is a positive integer:

subtype 'PosInt', as 'Int', where { $_ > 0 };

has location_class => (is => 'rw', isa => 'PosInt');

This would be similar to using a domain:

CREATE DOMAIN posint AS int CHECK (VALUE > 0);

Then the table definition fragment might be:

location_class posint,

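But there is a significant difference. In PostgreSQL the domain check constraint is applied when a value is cast to the domain or stored in a column of that type, but not to the results of expressions over domain values, which revert to the base type. If I do:

select -1::posint;

I will get -1 returned, not an error, because the cast binds more tightly than the unary minus: 1 is cast to posint and the result is then negated as a plain int. (Writing select (-1)::posint; does raise an error.) Thus domains are mainly useful for defining data storage rules. The Moose subtype, however, is checked every time the attribute is set, whether through the constructor or through the accessor. It defines the values allowed at runtime, so any attempt to assign a negative number to a PosInt attribute generates an error. This closes a significant gap in the use of SQL domains and allows us to tighten up constraints even further.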
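A short sketch of the Moose side of this comparison follows. The Location package is again a hypothetical name; the subtype and attribute are the ones declared above.

```perl
package Location;    # hypothetical class name, for illustration only
use Moose;
use Moose::Util::TypeConstraints;

# Same declarations as in the text: an Int restricted to positive values.
subtype 'PosInt', as 'Int', where { $_ > 0 };

has location_class => (is => 'rw', isa => 'PosInt');

no Moose::Util::TypeConstraints;
__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $loc = Location->new(location_class => 3);

# Both paths that set the attribute run the PosInt check.
my $set_ok = eval { $loc->location_class(-1); 1 };
my $new_ok = eval { Location->new(location_class => -1); 1 };

print "setter rejected -1\n"      unless $set_ok;
print "constructor rejected -1\n" unless $new_ok;
```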

So here we can simply specify the rules which define the data and the rest works. Typical object oriented systems in most languages do not include such a rich declarative system for dealing with state constraints.

The type definition system is much richer than the above examples show. We can require that attributes belong to specific classes, specify default values, control when a default is built, and many other things. Moose is a rich system in this regard.
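To give a flavour of those options, here is a small sketch. The Country and Location classes and their attributes are invented for illustration and are not LedgerSMB code.

```perl
package Country;    # hypothetical class
use Moose;

has name => (is => 'ro', isa => 'Str', required => 1);

__PACKAGE__->meta->make_immutable;

package Location;    # hypothetical class
use Moose;

# The attribute must hold an instance of the Country class.
has country => (is => 'ro', isa => 'Country', required => 1);

# A plain default is applied at construction time.
has location_class => (is => 'rw', isa => 'Int', default => 1);

# A lazy default is only computed the first time the attribute is read.
has label => (
    is      => 'ro',
    isa     => 'Str',
    lazy    => 1,
    default => sub {
        my ($self) = @_;
        return $self->country->name . ' #' . $self->location_class;
    },
);

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $loc = Location->new(country => Country->new(name => 'Canada'));
print $loc->label, "\n";    # prints "Canada #1"

# A plain string is not a Country object, so construction fails.
my $ok = eval { Location->new(country => 'Canada'); 1 };
print "rejected: country must be a Country object\n" unless $ok;
```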

Transparent Internal Structures

Moose treats objects as blessed references to hash tables (in Perl we'd call this a hashref), where every attribute is simply stored under its own name. This ensures that when we copy and sanitize data for display, for example, the output is exactly as expected. Consequently, if I have a class with a handful of declared attributes, the underlying hash holds one entry per attribute.

In that case, when we go to look at the hash, we can print values from it in our templates. Our templating code can copy the hash value for value, escape the copy for the output format as needed, and then pass the copy on to the templating engine. This provides cross-format safety that accessor-based reading of attributes would not provide.

In other words we copy these as hashrefs, pass them to the templates and these just work.
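The copy-and-escape step might look roughly like the sketch below. This is not LedgerSMB's actual template code: the Contact class, the escaped_copy function, and the choice of HTML escaping are all assumptions made for the example.

```perl
package Contact;    # hypothetical class
use Moose;

has name  => (is => 'ro', isa => 'Str');
has notes => (is => 'ro', isa => 'ArrayRef[Str]', default => sub { [] });

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;
use Scalar::Util qw(reftype);

my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Recursively copy a structure, escaping every scalar leaf for HTML.
# reftype() sees through the blessing, so an object is copied like a hash.
sub escaped_copy {
    my ($data) = @_;
    my $type = reftype($data) // '';
    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = escaped_copy($data->{$_}) for keys %$data;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { escaped_copy($_) } @$data ];
    }
    return $data unless defined $data;
    (my $text = $data) =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

my $contact  = Contact->new(name => 'Smith & Sons <Ltd>', notes => ['a < b']);
my $for_html = escaped_copy($contact);    # a plain, unblessed hashref

print $for_html->{name}, "\n";    # prints "Smith &amp; Sons &lt;Ltd&gt;"
```

Supporting another output format would mean swapping the escaping rule, while the object itself is left untouched.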

Accessors, Constructors, and Methods

Most of the important features here follow from the above quite directly. The declarative approach is used to create constructors and accessors, and this provides a very different way to think about your code.

Moose also has some interesting features regarding methods which are useful in integrating LedgerSMB with other programs. These include method modifiers, which allow you to specify code to run before, around (that is, in place of), or after a method. These can take the place of database triggers on the Perl level.
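A minimal sketch of the three modifier types, using a hypothetical Invoice class with a post method. None of these names come from LedgerSMB; the events attribute exists only to record the order in which the pieces run.

```perl
package Invoice;    # hypothetical class
use Moose;

has total  => (is => 'rw', isa => 'Num', default => 0);
has events => (is => 'ro', isa => 'ArrayRef[Str]', default => sub { [] });

sub post {
    my ($self) = @_;
    push @{ $self->events }, 'post';
    return $self->total;
}

# Runs ahead of post(), roughly like a BEFORE trigger.
before post => sub {
    my ($self) = @_;
    push @{ $self->events }, 'before';
};

# Runs once post() has returned, roughly like an AFTER trigger.
after post => sub {
    my ($self) = @_;
    push @{ $self->events }, 'after';
};

# Wraps post(); it decides whether and how the original is called.
around post => sub {
    my ($orig, $self, @args) = @_;
    push @{ $self->events }, 'around:start';
    my $result = $self->$orig(@args);
    push @{ $self->events }, 'around:end';
    return $result;
};

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $invoice = Invoice->new(total => 100);
my $total   = $invoice->post;
print join(' > ', @{ $invoice->events }), "\n";
```

Because the modifiers are attached from outside the method body, integration code can add behaviour without editing post itself.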

Pitfalls

Aside from the fact that Moose does have some overhead, there are two reasons folks may choose not to use it.

The first is complexity. Moose has a lot of features and learning how to program effectively in Moose takes some time, though this is reduced by approaching it from a database perspective.

The second is that sometimes it is tempting to look into the raw data structure directly and fiddle with it, and if you do this then the proofs of accuracy are invalid. This is mitigated by the use of correct controls on the database level.
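The following sketch illustrates that second pitfall. The Location class is a hypothetical example; the point is only that writing to the underlying hash skips the type check that the accessor would have applied.

```perl
package Location;    # hypothetical class name, for illustration only
use Moose;

has location_class => (is => 'rw', isa => 'Int');

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $loc = Location->new(location_class => 1);

# Writing through the accessor is checked and fails.
my $checked = eval { $loc->location_class('garbage'); 1 };

# Writing to the hash directly is not checked at all.
$loc->{location_class} = 'garbage';

# Reading does not re-validate, so the bad value comes straight back out.
print $loc->location_class, "\n";    # prints "garbage"
```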

Features we are Still Getting Used To

This isn't to say we are Moose experts already. Moose has a number of concepts we haven't started using yet, like roles, which add cross-class functionality orthogonal to inheritance and which are highly recommended by other Moose programmers.
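For readers who have not met roles, here is a minimal sketch of the idea. The Auditable role and the Invoice and Customer classes are invented for the example and do not reflect LedgerSMB code.

```perl
package Auditable;    # hypothetical role
use Moose::Role;

# Any class consuming this role must provide an id method.
requires 'id';

sub audit_line {
    my ($self) = @_;
    return ref($self) . ' #' . $self->id;
}

package Invoice;    # hypothetical class
use Moose;

has id => (is => 'ro', isa => 'Int', required => 1);
with 'Auditable';    # applied after "has" so the id accessor already exists

__PACKAGE__->meta->make_immutable;

package Customer;    # hypothetical class, unrelated to Invoice by inheritance
use Moose;

has id => (is => 'ro', isa => 'Int', required => 1);
with 'Auditable';

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

print Invoice->new(id => 7)->audit_line,  "\n";    # prints "Invoice #7"
print Customer->new(id => 3)->audit_line, "\n";    # prints "Customer #3"
```

The two classes share behaviour without sharing a parent class, which is what makes roles orthogonal to inheritance.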

There are probably many other features too that will eventually come to be indispensable but that we aren't yet excited about.

Future Steps

On the LedgerSMB side, long-run, I would like to move to creating our classes directly using code generators from database classes. We could query system catalogs for methods as well. This is an advanced area we probably won't see much of for a while.

It's also likely that roles will start to be used, and perhaps DBObject will become a role instead of an inheritance base.

Update:

Matt Trout has pointed out that setting required => 0 is far preferable to Maybe[] and helped me figure out why that wasn't working before. I agree. Removing the Maybe[] from the type definitions would be a great thing. Thanks Matt!
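The difference between the two styles can be sketched as follows. The Country class and its attributes are hypothetical; the predicate option is standard Moose and is used here to ask whether an optional attribute was supplied.

```perl
package Country;    # hypothetical class, for illustration only
use Moose;

# Optional attribute: not required, never undef, and a predicate reports
# whether it was supplied at all.
has name => (is => 'ro', isa => 'Str', required => 0, predicate => 'has_name');

# The Maybe[] style instead admits an explicit undef as a valid value.
has code => (is => 'ro', isa => 'Maybe[Str]', predicate => 'has_code');

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $empty = Country->new;
print "name was not set\n" unless $empty->has_name;

# undef is not a Str, so it is rejected for name but accepted for code.
my $undef_name_ok = eval { Country->new(name => undef); 1 };
my $undef_code_ok = eval { Country->new(code => undef); 1 };

print "undef rejected for name\n" unless $undef_name_ok;
print "undef accepted for code\n" if $undef_code_ok;
```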

10 comments:

One thing that jumps out at me in these examples is that all your attributes are read-write.

I strongly advocate defaulting to read-only. I also suggest that if your attr is read-write, that the setter be a private method.

Basically, the more ways that the state of an object can change, the more complex it is, and the more potential for error you have.

For example, in your object with a country_id and country_name, what happens if I set the name but not the id?

Actually, that's a particularly problematic example, since making both of those settable via the constructor is still problematic. You should probably make one unsettable (init_arg => undef) and generate the value based on the other.

That's good feedback. Also Matt Trout suggested Lazy as a good thing in some of these cases.

As for country_name, it's only set when reading. If it is not set, chances are ID is not either and we don't care. Once you write it, it is set. In other words id and country_id are not set before writing the object to the db. But it's a bit of a contrived example. It might be easier to have sub-types, or delegated methods, or lazy builders, but not sure.

For example, in your object with a country_id and country_name, what happens if I set the name but not the id? ...

If you make them both read-only, and assuming they do not have to be required, in BUILD you should do a state check with a predicate, which is essentially: if ( $self->has_country_name || $self->has_country_id ) { croak 'you fail' unless $self->has_country_name && $self->has_country_id }. This is of course assuming a lazy attribute doesn't work better. But I have some similar checks in Business::CyberSource where I require A or B but not A and B.
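The check described in the comment above can be sketched roughly as follows. The Country package name and the error message are assumptions for illustration; the attribute and predicate names follow the comment.

```perl
package Country;    # hypothetical class name
use Moose;
use Carp qw(croak);

has country_id   => (is => 'ro', isa => 'Int', predicate => 'has_country_id');
has country_name => (is => 'ro', isa => 'Str', predicate => 'has_country_name');

# BUILD runs after the constructor has set the attributes, so it can
# validate combinations that a single attribute's type cannot express.
sub BUILD {
    my ($self) = @_;
    if ($self->has_country_id() xor $self->has_country_name()) {
        croak 'country_id and country_name must be set together';
    }
}

__PACKAGE__->meta->make_immutable;

package main;
use strict;
use warnings;

my $both_ok    = eval { Country->new(country_id => 1, country_name => 'Canada'); 1 };
my $neither_ok = eval { Country->new; 1 };
my $only_name  = eval { Country->new(country_name => 'Canada'); 1 };

print "name without id was rejected\n" unless $only_name;
```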

You may also want to look at Mouse. http://search.cpan.org/~gfuji/Mouse-0.99/lib/Mouse.pm It is a Perl object system with the goal of "use the exact same syntax as Moose but run faster". It doesn't cover everything that Moose does, but it does cover the 95% that gets used the most, so you probably only need to install Mouse and then do s/Moose/Mouse/g across all your code in order to see a speedup. (Oh, and comment out any __PACKAGE__->meta->make_immutable lines)

I'm a huge fan of Moose, but I've been writing some speed-and-RAM sensitive stuff lately, so Mouse has been a real savior.

I'm not sure you should be accessing the object hash directly in your code. That breaks encapsulation. Maybe check out my MooseX::RemoteHelper, which will add some stuff to your code but allow a very flexible creation of a hash that does not break encapsulation.

The reason why the object hash access is good has nothing to do with how we typically access it.

Basically, when we generate a document (could be LaTeX, ODS, CSV, HTML, or more) we do a deep copy and escape on the data structures selected. In some cases we encapsulate this data but *not* for escaping purposes. We want to be able to add new formats easily, orthogonal to the rest of the app, and the fact that we can do a deep copy and escape and end up with something which is conceptually similar to the object model we started with is a huge bonus.