This is why algorithms rule

For the people who know me a little it's no surprise, but in case you didn't know: I love algorithms. I think they're the cornerstone of good software and they should be your first source of wisdom for every piece of software you're creating. This post will show an example of what I mean by that and how easy it is if you have a set of algorithms at your disposal which are solid, proven and correct.

The nice thing about algorithms is that you can reason over them without writing a single line of code. Another nice thing about algorithms is that many smart people out there have already documented and proven even more algorithms. This is important, because a proven algorithm has a key feature: it's correct within the boundaries stated by the algorithm itself (e.g. 'correct for all positive numbers'). A correct algorithm is great for writing software because you don't have to worry if the algorithm is buggy, it's not. All you have to do is implement it correctly. That can be done by taking baby-steps in projecting the algorithm to code, reviewing the code and if you don't trust your own skills, writing tests for the boundaries stated for the algorithm. Compare that to code where the algorithm also has to be tested and you'll quickly understand how important good solid algorithms are: they allow you to make you write software which is flawless without a lot of effort.

With LLBLGen Pro v3.0, which will hit its first beta in January 2010 (fingers crossed! ), we'll also ship the source-code of our algorithm library Algorithmia (and every v3 licensee is allowed to re-use that code in whatever form they see fit, within the boundaries of the flexible license of course). Algorithmia contains algorithms for graphs, queues, heaps, commands etc. etc. and offers a ready-to-rock set of classes and methods to build software with without worrying if the algorithm is even working in all cases. Algorithmia is written with .NET 3.5 and contains only classes which are not already found in .NET 3.5's BCL. This means that we only implemented functionality not found in .NET itself. For example, there's a KeyedCommandifiedList<T>. This List<T> lookalike class is command aware which means it's fully aware of undo/redo: every element addition/removal action can be undone and re-done. It's also keyed and can update it's own index based on changes of itself or the elements within itself without any help from the outside. Still it's a list and an enumerable. Its FindByKey() method is roughly an O(1) operation (amortized), no matter how big the set of elements in the list is.

Another example is the set of graph classes (directed and non-directed) and accompanying classes and algorithms. Not only are the graph classes usable in a commandified environment (and can be made undo/redo aware in full), they also can be used with SubGraphView classes, which are views on a graph and which contain a sub-set of the elements (vertices) and edges of the graph they're defined on. These views manage themselves, so if an element is removed from the graph, they detect that and also remove the element from themselves. You can subclass the views and override methods to intercept additions to the main graph to maintain views automatically as well. You can see this in action in a video I recently posted about LLBLGen Pro v3.0's QuickModel feature: the view you're looking at on the main graph updates itself when a new element is added to the main graph and it matches a set of rules.

But enough about that, let's look at another practical example of these classes in action: let's look at validation and examining data in data-structures with algorithms. The case in point is to see if fields in an entity B which is a split-off entity of entity A and which aren't mapped in A are optional (nullable). If they're not, we've to reflect that to the user with an error. LLBLGen Pro v3.0 has a deep validation system to make sure the model is valid before further action is taken (e.g. code is generated, it's updated from refreshes etc.). This validation system is of course extensible so you can add your own validation as well and has per-target framework validation so it validates in the scope of the target framework chosen (e.g. LLBLGen Pro runtime framework, Linq to sql etc.)

What exactly is a split-off entity? Say you have the Employees table in Northwind. This Employees table has two big fields: Photo, which is an Image (BLOB for oracle fans) typed field and Notes which is an NText (CLOB for oracle fans) typed field. LLBLGen Pro today, in v2.6, has the feature to exclude them from a fetch and load them at a later point. However, v3.0 will also support other frameworks and for example in the Entity Framework, it's solved in a different way: it supports multiple entities mapped onto the same table with a 1:1 PK-PK relationship between them. This means in practice that the entities which are 'split off' have to be merged into the same record in the table as the root of the entities when they're persisted to the database.

In the example of Employees, let's map two entities onto this table, EmployeeNoBlob and EmployeeBlob. EmployeeNoBlob has fields mapped to all Employees fields except Photo and Notes, and EmployeeBlob has besides the primary key EmployeeId only the fields Photo and Notes. EmployeeNoBlob is the 'root' of the two: if a new entity instance of that type is saved, it's inserted into the table Employees. However if an instance of EmployeeBlob is saved it's always resulting in an UPDATE statement. This because it's actually an entity which follows the root and this is the result of its primary key field which is depending on the primary key field of EmployeeNoBlob.

To validate the fields of EmployeeNoBlob and EmployeeBlob, we first have to find all entities which form a split group, thus which are all mapped onto the same target and one of them is the root and the rest is split off this root. We have to make sure we don't take into account entities which are also mapped onto this target but which aren't part of the group, otherwise we'll get false positives.

So how do you find these groups and how do you make sure you know what the root of such a group is?

All entities and their relationships are stored in a non-directed graph in the project: the EntityModel graph. So we first have to find all 1:1 relationships which are between two primary keys. Then we have to make sure both sides are mapped onto the same target.

Once we have that set of relatonships, we can create a new graph from those relationships (which are the edges in the graph) and tell a fancy algorithm in Algorithmia, the DisconnectedGraphFinder, to find all sub-graphs in this new graph which are disconnected from each-other. Each sub-graph is then returned as a SubGraphView and we can process each view further as each view is a split-group.

To find the root of a split-group, we can use another algorithm, the TopologicalSorter. We create from every relationship in a SubGraphView a directed edge in a directed graph, and sort that graph topological. Topological sorting is a well-known algorithm which tells you in what order a directed graph is ordered. In other words: it finds dependencies and orders the elements accordingly.

Once we have the root per group, we're practically done because we can then verify which fields are not mapped in the entities in a group which aren't the root and verify whether they're optional or not.

So how does this look in code? Let's look at the method which produces the list of split-off entities first. This method isn't inside the EntityModel as entity splitting depends on the mapped target and therefore it's located in the DatabaseMappingStore class.

/// <summary>
/// Gets all split entities, per target. A split entity is really a group of entities which are mapped onto
/// the same target and which all have 1:1 pk-pk relationships with one entity in their group. If A, B, C and D are
/// mapped onto the same target T and B and D have a 1:1 pk-pk relationship towards A (pkside), and C does not,
/// it means that B and D are 'split off' of A. C is not part of the split and is ignored. Returned is then A with
/// its split off companions B and D. A is then returned as key with values B and D.</summary>
/// <param name="containingProject">The containing project.</param>
/// <returns>Multivalue dictionary with as key the root of the group and as values all split off entities.</returns>
/// <remarks>Split-off entities are used for validations. Not all frameworks support split-off entities.</remarks>
public MultiValueDictionary<EntityDefinition, EntityDefinition> GetAllSplitEntities(Project containingProject)
{
var allOneToOnePkPkRelationships = containingProject.EntityModel.Edges
.Where(e => e.RelationshipType == EntityRelationshipType.OneToOne)
.Cast<NormalRelationshipEdge>().Where(e => e.FkFieldsFormPkOfFkSide);
var allOneToOnePkPkRelationshipsWithBothSideOnSameTarget = allOneToOnePkPkRelationships
.Where(e=>_entityMappings.FindFirstByKey(e.EntityPkSide).MappedTarget==
_entityMappings.FindFirstByKey(e.EntityFkSide).MappedTarget).ToHashSet();
// now add all found edges in a graph and determine all disconnected subgraphs. These are our groups. We do that
// by traversing the graph we create with a disconnected graphs finder, which is a DFS based algorithm for graphs.
var groupFinderSourceGraph = new NonDirectedGraph<EntityDefinition, NonDirectedEdge<EntityDefinition>>();
foreach(var foundRelationship in allOneToOnePkPkRelationshipsWithBothSideOnSameTarget)
{
groupFinderSourceGraph.Add(new NonDirectedEdge<EntityDefinition>(
foundRelationship.EntityFkSide, foundRelationship.EntityPkSide));
}
var groupFinder = new DisconnectedGraphsFinder<EntityDefinition, NonDirectedEdge<EntityDefinition>>(
() => new SubGraphView<EntityDefinition,
NonDirectedEdge<EntityDefinition>>(groupFinderSourceGraph),
groupFinderSourceGraph);
groupFinder.FindDisconnectedGraphs();
var toReturn = new MultiValueDictionary<EntityDefinition, EntityDefinition>();
foreach(var subgraphView in groupFinder.FoundDisconnectedGraphs)
{
// create a new directed graph which is topological sorted, based on the edges in the subgraphview. The
// ordering will give us the root and the rest. fk side is start vertex, pk side is end vertex in the non-directed
// edges in the view (which is a view on groupFinderSourceGraph).
var sortedSourceGraph = new DirectedGraph<EntityDefinition, DirectedEdge<EntityDefinition>>();
foreach(var edge in subgraphView.Edges)
{
sortedSourceGraph.Add(new DirectedEdge<EntityDefinition>(edge.StartVertex, edge.EndVertex));
}
var sorter = new TopologicalSorter<EntityDefinition, DirectedEdge<EntityDefinition>>(sortedSourceGraph);
sorter.Sort();
// views are always filled with at least 1 edge, which means at least 2 entity definitions, as edges are
// relationships between pk's, so sides are always different. The first entity is the root, the rest is the
// set of split off entities.
toReturn.Add(sorter.SortResults[0], sorter.SortResults.Skip(1).ToHashSet());
}
return toReturn;
}

You might see some extension methods or classes you don't know, like the MultiValueDictionary, they're all in Algorithmia. Looks pretty straightforward and actually pretty easy, simply because it's based on building blocks which are already working and solid. I don't have to worry if the Topological sorter really finds the right order, it does, because the algorithm is correct and the implementation works.

Once we have these entities, we can simply traverse them and match the fields:

In the code above you'll also see a glimpse of the designer's internal message system. This message system is central hub which receives messages from all kinds of classes and which dispatches them in a central queue. This queue is then visualized in the UI by a form. The advantage of this system is that everywhere in the core system (so that's not the UI, it has no awareness of the UI), you can dispatch an error to the UI without knowing there is a UI. Furthermore, you can attach corrections. These are displayed below the error / warning and the user can click a link to activate the correction, for example it can open an editor for you to go to the location where the error was, but it can also offer you a correction immediately, for example by removing an element directly. It gets you a high de-coupling between elements which otherwise wouldn't have a relationship anyway and also gets you an easy way to 'break out' a deep object hierarchy so you don't have to pass back elements up the call chain.

It's things like this which will give you high productivity without a lot of effort. Oh, and Linq + lambda's of course. But you already noticed that from the code snippets I guess.

@Thomas: hehe :D
really, linq to objects and lambdas are a time saver you won't believe. Lambda's make it easy to create true generic code (because you can pass in the action to perform with a func :)), and linq to objects makes it very easy to do trivial tasks, like consuming an ordered list. It might take some time to get used to the things but it quickly pays off :)

Once I started thinking a bit like a functional programmer, my way of thinking about code changed as well. Linq is the first step! Everyone should learn a language like scheme of OCaml just to start thinking that way.