I was recently asked about the possibility to store large amounts of linked data efficiently in DXL.

So in this post we want to take a look at two questions:
- how can I store data in DXL efficiently without producing a lot of allocations, which will slow down all scripts?
- how can I store complex data in DXL so, I do not need to cope with skip lists and key names / types all the time?

Motivation

A very simple example would be to simply store the contents (or the links) of a module from DXL. Most people would probably use a "Nested Skip" approach. The code could look like this:

The Problem of allocations and complexity of nested structures

While simple, this approach has two drawbacks:
- it allocates one skip list per object
- the developers need to remember the skip list layout (string key? what value kind?)

One can imagine that this becomes very complex, once more complex data is being stored, e.g. store all links of an object. The same is true for storing "class instances", i.e. object like data in DXL. Whenever a large amount of instances will be created, Skip lists are a bad choice for storage.

While for more recent DOORS versions the allocations and deallocations are much quicker, the effect of having a large number of allocations is still relevant. You can read about allocations here:

or run the attached example script (allocation_problem.dxl), to see the effect.

Example Problem

So imagine you want to store the structure of a complete DOORS project, with all the links inside, so you can make your awesome traceability analysis, without constantly closing and opening modules. So normally you will come up with a class structure like this (classes prefixed with S to avoid name clashes with DOORS types):

As you can see, with this structure there a two problems:
- The objects have a lot of cross references
- There is a nesting hierarchy of 4 (module->object->link->attribute value)

Implementing this with a Nested Skip approach can turn out to be a nightmare. From the allocations perspective you can say:
- each module in the project will allocate 2 skips (top level objects, links)
- each object in the project will allocate 3 skips (childs, attributes, links)
- each link in the project will allocate 1 Skip (attributes)

That being said, imagining 100000 objects inside the project in 100 modules, containing 20000 links you will get an insane number of allocations (320200). Testing this with the below script (allocation_problem) you get a runtime penalty for allocating and deleting 10000 Buffers:
- without allocations: 30ms
- with 320200 allocations: 30779ms

which might give you a factor 1000(!!!) slowdown of your script after some time. From a code perspective, managing a nesting level of 4 with 4 different skip types will give you spaghetti code. So lets take a look on what we can do to change this.

Note: We recently found, that DxlObjects do not suffer from the allocation problem, since (at least newer versions of) DOORS will NOT store DxlObjects inside the allocations list. The good news about this is, you will not suffer from any slowdown inside your scripts! The bad news is you will suffer from bad permanent memory leaks, since the memory for the DXLObjects is never freed automatically!

Usually these example use DxlObjects, which are nothing else but Skip lists. And while IBM tried to remove alowdown (in my opinion with a stupid approach) by removing DxlObjects from the allocation list, probably as a esult of some PR, there is a better way to store objects without producing the risk for permanent leaks.

The idea is, instead of using a skip list to store each object instance, to simply use an Array(). Each column inside the array is a property value, each row is an object. This way:
- All class instances are stored inside the same array instance
- Instead of a handle to the DxlObject/Skip we simply store the array index of the instance

- Instead of allocating one DxlObject per instance, we only allocate 1 (!) array for all instances!
- Since the array will automatically resize, we do not really need to worry about the initial size of the array
- Since each array element will allocate 8 Bytes on DOORS 64 you can easily set the initial array size to 1 Million if you care to. This will allocate 8 MB x number of propreties which will not kill your memory, but 100k is a good starting size.
- Since we do not need to allocate memory for new items (just increase a counter), we also do not need to free memory!

BUT: you will probably say, what if I need to free and reallocate items a lot? Memory will not be freed and pile up!
Right. But this can be easily fixed by introducing a stack of freed indexes to be reused, whenever a new object is allocated. This slighly complicates the class boilerplate, but the code will be the same for ALL classes.

So the following template can be used for ALL classes, you just need to replace <<<TYPE NAME>>> by the name of the class:

- When a node is deleted and then later reallocated it will still have the values of the previous allocation, so make sure you properly initialize all values in the real constructur.
- Since our class instances are merely integer indexes to a large array, the value "null" refers to the first entry inside the array. Therfore if you do instance = null; instance is still a valid class instance and all methods will work and you will get no null ptr errors! If you care about this behaviour you need to check for null on each property access!

While this template solves the problem of not creating allocations for class instances we still need to look on how to create properties for the class and how to solve the nesting problem (i.e. storing arrays inside a class instance without creating allocations).

Class Properties

Properties can be added to an array managed class in an easy way (as already shown above):

where
- <<<CLASSNAME>>> needs to be replaced by the name of the class, e.g. "MyClass".
- <<<PROPERTY TYPE>>> needs to be replaced by the type of the property, e.g. "string" or "MyClass".
- <<<PROPERTY NAME>>> needs to be replaced by the name of the property, e.g. "MyProperty"
- <<<INDEX>>> needs to be replaced by the index of the property in the array

Making a "virtual function" template is left as an exercise for the reader (See github tutorial for virtual functions).

Solving the problem of array properties

Ok, now lets take a look on how we can implement those array properties efficiently. As already mentioned using a Skip property is not indicated since it will again create one allocation per instance. Therefore what we will do is replace the Skip by our own implementation called "List" which has the same efficient memory management as shown above.

Note that we have an "underscore" type parameter in the list, so our list can use any data type. For these kinds of properties the property definition is a bit different (see the attached List.dxl file) We added a currentNode property to our list so we can iterate it later from DXL with an easy syntax.

Unfortunately in DXL we cannot implement a custom for loop, for iterating over the list. To implement an iteration protocol for DXL with a nice syntax I propose:

Feel free to adapt the example list implementation in the attached file as you which (fast append to the end of the list, etc.). The important stuff is, that now having a List which items do not produce any allocation we are free to implement our classes that will not use any allocation.

Applying the approach to the Example Problem

So assuming that we have made our classes SObject, SModule, SLink and SAttrValue according to the above description we can now easily code our complete project cache script:

Imagine now the difference in the code if we had used Skips or DXLObjects to store all values. Look how nice and easy the code looks, using our classes. We have complete type safety, so that we will never have a wrong type inside a skip list and get crashes. This is a huge benefit!

Conclusion

As we saw, with the above approach we can

- store arbitrary nested data type safe with an easy syntax
- we can store millions of data records without producing ANY allocation, i.e. without any slowdown

For all that made it to the end, thanks for reading. Like the post if you read it, so I know there is still DXL programmers interesting in tutorials.

As always, have phun, regards, Mathias

P.S: For all of those who need to store tree data and are looking for an optimized Tree implementation look here:

I was recently asked about the possibility to store large amounts of linked data efficiently in DXL.

So in this post we want to take a look at two questions:
- how can I store data in DXL efficiently without producing a lot of allocations, which will slow down all scripts?
- how can I store complex data in DXL so, I do not need to cope with skip lists and key names / types all the time?

Motivation

A very simple example would be to simply store the contents (or the links) of a module from DXL. Most people would probably use a "Nested Skip" approach. The code could look like this:

The Problem of allocations and complexity of nested structures

While simple, this approach has two drawbacks:
- it allocates one skip list per object
- the developers need to remember the skip list layout (string key? what value kind?)

One can imagine that this becomes very complex, once more complex data is being stored, e.g. store all links of an object. The same is true for storing "class instances", i.e. object like data in DXL. Whenever a large amount of instances will be created, Skip lists are a bad choice for storage.

While for more recent DOORS versions the allocations and deallocations are much quicker, the effect of having a large number of allocations is still relevant. You can read about allocations here:

or run the attached example script (allocation_problem.dxl), to see the effect.

Example Problem

So imagine you want to store the structure of a complete DOORS project, with all the links inside, so you can make your awesome traceability analysis, without constantly closing and opening modules. So normally you will come up with a class structure like this (classes prefixed with S to avoid name clashes with DOORS types):

As you can see, with this structure there a two problems:
- The objects have a lot of cross references
- There is a nesting hierarchy of 4 (module->object->link->attribute value)

Implementing this with a Nested Skip approach can turn out to be a nightmare. From the allocations perspective you can say:
- each module in the project will allocate 2 skips (top level objects, links)
- each object in the project will allocate 3 skips (childs, attributes, links)
- each link in the project will allocate 1 Skip (attributes)

That being said, imagining 100000 objects inside the project in 100 modules, containing 20000 links you will get an insane number of allocations (320200). Testing this with the below script (allocation_problem) you get a runtime penalty for allocating and deleting 10000 Buffers:
- without allocations: 30ms
- with 320200 allocations: 30779ms

which might give you a factor 1000(!!!) slowdown of your script after some time. From a code perspective, managing a nesting level of 4 with 4 different skip types will give you spaghetti code. So lets take a look on what we can do to change this.

Note: We recently found, that DxlObjects do not suffer from the allocation problem, since (at least newer versions of) DOORS will NOT store DxlObjects inside the allocations list. The good news about this is, you will not suffer from any slowdown inside your scripts! The bad news is you will suffer from bad permanent memory leaks, since the memory for the DXLObjects is never freed automatically!

Usually these example use DxlObjects, which are nothing else but Skip lists. And while IBM tried to remove alowdown (in my opinion with a stupid approach) by removing DxlObjects from the allocation list, probably as a esult of some PR, there is a better way to store objects without producing the risk for permanent leaks.

The idea is, instead of using a skip list to store each object instance, to simply use an Array(). Each column inside the array is a property value, each row is an object. This way:
- All class instances are stored inside the same array instance
- Instead of a handle to the DxlObject/Skip we simply store the array index of the instance

- Instead of allocating one DxlObject per instance, we only allocate 1 (!) array for all instances!
- Since the array will automatically resize, we do not really need to worry about the initial size of the array
- Since each array element will allocate 8 Bytes on DOORS 64 you can easily set the initial array size to 1 Million if you care to. This will allocate 8 MB x number of propreties which will not kill your memory, but 100k is a good starting size.
- Since we do not need to allocate memory for new items (just increase a counter), we also do not need to free memory!

BUT: you will probably say, what if I need to free and reallocate items a lot? Memory will not be freed and pile up!
Right. But this can be easily fixed by introducing a stack of freed indexes to be reused, whenever a new object is allocated. This slighly complicates the class boilerplate, but the code will be the same for ALL classes.

So the following template can be used for ALL classes, you just need to replace <<<TYPE NAME>>> by the name of the class:

- When a node is deleted and then later reallocated it will still have the values of the previous allocation, so make sure you properly initialize all values in the real constructur.
- Since our class instances are merely integer indexes to a large array, the value "null" refers to the first entry inside the array. Therfore if you do instance = null; instance is still a valid class instance and all methods will work and you will get no null ptr errors! If you care about this behaviour you need to check for null on each property access!

While this template solves the problem of not creating allocations for class instances we still need to look on how to create properties for the class and how to solve the nesting problem (i.e. storing arrays inside a class instance without creating allocations).

Class Properties

Properties can be added to an array managed class in an easy way (as already shown above):

where
- <<<CLASSNAME>>> needs to be replaced by the name of the class, e.g. "MyClass".
- <<<PROPERTY TYPE>>> needs to be replaced by the type of the property, e.g. "string" or "MyClass".
- <<<PROPERTY NAME>>> needs to be replaced by the name of the property, e.g. "MyProperty"
- <<<INDEX>>> needs to be replaced by the index of the property in the array

Making a "virtual function" template is left as an exercise for the reader (See github tutorial for virtual functions).

Solving the problem of array properties

Ok, now lets take a look on how we can implement those array properties efficiently. As already mentioned using a Skip property is not indicated since it will again create one allocation per instance. Therefore what we will do is replace the Skip by our own implementation called "List" which has the same efficient memory management as shown above.

Note that we have an "underscore" type parameter in the list, so our list can use any data type. For these kinds of properties the property definition is a bit different (see the attached List.dxl file) We added a currentNode property to our list so we can iterate it later from DXL with an easy syntax.

Unfortunately in DXL we cannot implement a custom for loop, for iterating over the list. To implement an iteration protocol for DXL with a nice syntax I propose:

Feel free to adapt the example list implementation in the attached file as you which (fast append to the end of the list, etc.). The important stuff is, that now having a List which items do not produce any allocation we are free to implement our classes that will not use any allocation.

Applying the approach to the Example Problem

So assuming that we have made our classes SObject, SModule, SLink and SAttrValue according to the above description we can now easily code our complete project cache script:

Imagine now the difference in the code if we had used Skips or DXLObjects to store all values. Look how nice and easy the code looks, using our classes. We have complete type safety, so that we will never have a wrong type inside a skip list and get crashes. This is a huge benefit!

Conclusion

As we saw, with the above approach we can

- store arbitrary nested data type safe with an easy syntax
- we can store millions of data records without producing ANY allocation, i.e. without any slowdown

For all that made it to the end, thanks for reading. Like the post if you read it, so I know there is still DXL programmers interesting in tutorials.

As always, have phun, regards, Mathias

P.S: For all of those who need to store tree data and are looking for an optimized Tree implementation look here: