I'm developing a program that has a SQL database as a backing store. As a very broad description, the program itself allows a user to generate records in any number of user-defined tables and make connections between them. As for specs:

Any record generated must be able to be connected to any other record in any other user table (excluding itself...the record, not the table). Yes, this creates an arbitrary user-generated relational map.

These "connections" are directional: incoming, outgoing, or bi-directional.

The list of connections a record has is user-ordered.

A record must "know" of connections made from it to others as well as connections made to it from others.

A record's field can also include aggregate information from its connections (like average, sum, etc.) that must be updated when a record it's connected to changes.

To conserve memory, only relevant information must be loaded at any one time (I can't load the entire database into memory at launch and go from there).

I cannot assume the backing store is local. Right now it is, but eventually this program will include syncing to a remote db.

I cannot depend on a specific implementation of a backing store. This means I cannot be certain the backing store is, in fact, relational. Thus all relational data is stored explicitly.

The connections are kind of the point of this program, so there is a strong possibility that the number of connections made will be very high, especially if the user is using the software as intended. None of the user tables, connections, or records are known at design time, as they are user-generated.
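The aggregate-update requirement above can be sketched as a simple observer pattern. This is purely illustrative; all class and method names here are my own invention, not the program's actual design:

```python
class Record:
    def __init__(self, rid, value=0):
        self.rid = rid
        self.value = value
        self.sources = []    # records this record aggregates over
        self.observers = []  # records whose aggregate fields depend on this one
        self.aggregate = 0

    def connect(self, source):
        """Register an incoming connection and subscribe to its changes."""
        self.sources.append(source)
        source.observers.append(self)
        self.recompute()

    def set_value(self, value):
        self.value = value
        for obs in self.observers:   # push the change to dependents
            obs.recompute()

    def recompute(self):
        # The aggregate field: here, the sum of connected records' values.
        # A real implementation would also guard against update cycles.
        self.aggregate = sum(s.value for s in self.sources)

a = Record("a", 10)
b = Record("b", 20)
total = Record("total")
total.connect(a)
total.connect(b)
a.set_value(15)         # change propagates to the aggregate
print(total.aggregate)  # 35
```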

I've spent a lot of time trying to figure out how to design the backing store and the object model to best fit these specs. In my first design attempt, I had one object managing all of a table's records and connections. I tried this first because it kept the memory footprint smaller (records and connections were simple dicts), but maintaining aggregate and link information between tables became....onerous (ie...a huge spaghettified mess). Tracing dependencies using this method became almost impossible.

Instead, I've settled on a distributed smart graph where each record and connection is 'aware' of what's around it by managing its own data and connections to other records. Doing this increases my memory footprint, but it also let me create a faulting system so connections/records aren't loaded into memory until they're needed. It's also much easier to code: tracing dependencies, eliminating cyclic recursive updates, etc.

My biggest problem is storing/loading the connections. I'm not happy with any of my current solutions/ideas, so I wanted to ask and see if anybody else has ideas on how this should be structured. Connections are fairly simple. They contain: fromRecordID, fromTableID, fromRecordOrder, toRecordID, toTableID, toRecordOrder. Here's what I've come up with so far:
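For reference, the six connection fields just listed could be modeled as an immutable value object; a minimal sketch (the class name is my own, not from the program):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Connection:
    """The six fields described in the post; frozen so a connection
    can serve as a dict key or set member."""
    fromRecordID: int
    fromTableID: int
    fromRecordOrder: int
    toRecordID: int
    toTableID: int
    toRecordOrder: int

c = Connection(1, 10, 0, 2, 20, 0)
print(c.toTableID)  # 20
```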

Store all the connections in one big table. If I do this, either I load all connections at once (one big db call) or make a call every time a user table is loaded. The big issue here: the size of the connections table has the potential to be huge, and I'm afraid it would slow things down.
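For what it's worth, one big table is often workable if the per-table lookups are indexed, so loading a user table costs one indexed query rather than a full scan. A hypothetical sketch with SQLite (schema and index names are assumptions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE connections (
    fromRecordID INTEGER, fromTableID INTEGER, fromRecordOrder INTEGER,
    toRecordID INTEGER, toTableID INTEGER, toRecordOrder INTEGER)""")
# Indexes on both sides so incoming and outgoing lookups avoid a scan.
db.execute("CREATE INDEX idx_from ON connections(fromTableID)")
db.execute("CREATE INDEX idx_to ON connections(toTableID)")

db.executemany("INSERT INTO connections VALUES (?,?,?,?,?,?)",
               [(1, 10, 0, 2, 20, 0),    # table 10 -> table 20
                (3, 20, 0, 1, 10, 1)])   # table 20 -> table 10

# One call fetches all connections touching user table 10, both directions:
rows = db.execute(
    "SELECT * FROM connections WHERE fromTableID = ? OR toTableID = ?",
    (10, 10)).fetchall()
print(len(rows))  # 2
```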

Store in separate tables all the outgoing connections for each user table. This is probably the worst idea I've had. Now my connections are 'spread out' over multiple tables (one for each user table), which means I have to load each table just to find all the incoming connections for a particular user table. I've avoided making "one big ass table", but I'm not sure the cost is worth it.

Store in separate tables all outgoing AND incoming connections for each user table (using a flag to distinguish incoming from outgoing). This is the idea I'm leaning towards, but it essentially doubles the total storage for the connections (each connection is stored in two tables). It also means I have to keep the connection information in sync in both places (each 'from' connection has a corresponding 'to' connection in the appropriate table). That's obviously not ideal, but it does mean that when I load a user table, I only need to load one 'connection' table to have all the information I need. This also presents a separate problem: connection object creation. Since each user table has a list of all its connections, there are two opportunities for a connection object to be made. However, connection objects (designed to facilitate communication between records) should only be created once. This means I'll have to devise a common caching/factory object to make sure only one connection object is made per connection.
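The caching/factory object described above amounts to an identity map keyed on the connection's identifying fields. A minimal sketch, with names of my own choosing:

```python
class Connection:
    def __init__(self, from_rec, from_tab, to_rec, to_tab):
        self.from_rec, self.from_tab = from_rec, from_tab
        self.to_rec, self.to_tab = to_rec, to_tab

class ConnectionFactory:
    """Guarantees one live object per logical connection, even when the
    same connection is loaded from both the 'from' and 'to' tables."""
    def __init__(self):
        self._cache = {}

    def get(self, from_rec, from_tab, to_rec, to_tab):
        key = (from_rec, from_tab, to_rec, to_tab)
        if key not in self._cache:
            self._cache[key] = Connection(*key)
        return self._cache[key]

factory = ConnectionFactory()
c1 = factory.get(1, 10, 2, 20)  # loaded via table 10's outgoing list
c2 = factory.get(1, 10, 2, 20)  # loaded again via table 20's incoming list
print(c1 is c2)  # True
```

A production version would likely use weak references so cached connection objects can still be faulted out of memory.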

Does anybody have any ideas of a better way to do this? Once I've committed to a particular design pattern I'm pretty much stuck with it, so I want to make sure I've come up with the best one possible.

I'll also note that my current backing store is SQLite. I've written the program to not depend on a specific backing store, though, by introducing an adapter layer to dissociate the data retrieval from the rest of the program. So the backing store could be anything, even a simple text file. I use the backing store exclusively for storage and data retrieval, nothing else. I don't use any specific database/SQL relational features and I don't use the database to perform calculations (aggregate or otherwise). All of that is handled in program logic.
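For illustration, an adapter layer like the one described might expose an interface along these lines (the method names are assumptions, not the actual program's API):

```python
from abc import ABC, abstractmethod

class BackingStore(ABC):
    """The program talks only to this interface, so the concrete store
    (SQLite, a text file, a remote DB) can be swapped out."""
    @abstractmethod
    def fetch_records(self, table_id, record_ids): ...
    @abstractmethod
    def fetch_connections(self, table_id): ...
    @abstractmethod
    def save_record(self, table_id, record): ...

class InMemoryStore(BackingStore):
    """Trivial stand-in showing the adapter need not be relational."""
    def __init__(self):
        self.records, self.connections = {}, {}
    def fetch_records(self, table_id, record_ids):
        return [self.records[(table_id, r)] for r in record_ids]
    def fetch_connections(self, table_id):
        return self.connections.get(table_id, [])
    def save_record(self, table_id, record):
        self.records[(table_id, record["id"])] = record

store = InMemoryStore()
store.save_record(10, {"id": 1, "name": "example"})
print(store.fetch_records(10, [1])[0]["name"])  # example
```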

An example of how those user tables might look would clarify a lot.
– Doc Brown Jun 30 '12 at 14:33

I'm not sure that's really possible, though I'll try. The user tables are user-defined, so they can have any number of columns containing all 'kinds' of data....although I limit it to text and numbers; blobs and large texts are stored externally, with a reference to the 'file' (it doesn't have to be an actual file) in the DB. A user table can have as few as 4 columns and as many as several hundred. Regardless, each row is a record, and each record can be connected to any other record.
– Aaron Hayman Jun 30 '12 at 15:01

That sounds as if your users have the capability of creating an arbitrary relational model for themselves - why not use one database with a user-defined schema for each user? I mean, you could utilize something like SQLite for that task and, if needed, store the database files of each user in a bigger system.
– Doc Brown Jun 30 '12 at 15:14

Yes, the ability of the user to easily create an arbitrary relational model is a defining aspect of this program. And I do see what you're saying, but the program itself isn't divided out into 'users'. Basically, one user per program, so storing a separate db schema isn't necessary. Eventually, when I introduce my own 'cloud' syncing, something like this will be implemented on the server side, but for the moment I'm concerned with the client side.
– Aaron Hayman Jun 30 '12 at 15:29

You bring up an interesting point though. The ability to make connections between databases isn't something I've considered and is worth thinking about.
– Aaron Hayman Jun 30 '12 at 15:40

1 Answer

All your requirements, save for #8 (not being able to depend on a relational backing store), are implemented by relational databases. Consider INFORMATION_SCHEMA, a standard relational database feature: it contains views that describe the tables in the database, which columns they have, and so on. Your app would work a lot with the INFORMATION_SCHEMA data.
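As an aside, SQLite (the asker's current store) does not implement INFORMATION_SCHEMA, but it exposes the equivalent metadata through its `sqlite_master` table and `PRAGMA table_info`:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# List the tables in the database (SQLite's analogue of
# INFORMATION_SCHEMA.TABLES):
tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['notes']

# List a table's columns; table_info yields
# (cid, name, type, notnull, dflt_value, pk) per column:
columns = [r[1] for r in db.execute("PRAGMA table_info(notes)")]
print(columns)  # ['id', 'body']
```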

Thanks to #8, you have basically cornered yourself into writing your own relational database engine.

It sounds like you are implementing Access, which... implements its own relational database (Jet); it also sucks because implementing a relational engine is quite hard (more so if you are subject to the constraints that Access has), and in the end, Access doesn't really make hard things easy.

Implementing a relational database engine on top of another datastore might simplify things, of course (e.g. Berkeley DB is useful for implementing an RDBMS, and Berkeley DB is everywhere; see a related answer I gave recently), but I would think hard about requirement #8 and the viability of your project.

I'm using SQLite on iOS. But I don't think my problem is just #8. The problem is that my users can create any number of tables with many-to-many relationships between them (think: mind map). So far as I can tell, the only way to do this in SQL is with a third (junction) table for each many-to-many relationship. That could be a lot of tables, more than I want (on the order of n², one junction table per pair of related user tables, where n = # of user tables). Instead, by creating my own relationship table I have one extra table per user table.
– Aaron Hayman Jul 3 '12 at 13:07

Will they really create ~n² tables? If that's a realistic case, yes, you have a problem. But I'm guessing they're not going to...
– alex Jul 3 '12 at 15:06

Well, pretty much, yes. I suspect it won't actually be n² tables, especially as the number of user tables increases, but part of the point of this program is the connections made between the table records, so users are very much encouraged to make them. All in all, I'm not overly concerned about implementing my own 'relational engine'. I've already implemented most of my required DB functions (like aggregation, sorting, and filtering) using my own algorithms on an in-memory cache. But my current method of storing the relations is duplicative, and I want to make sure I haven't missed an idea.
– Aaron Hayman Jul 3 '12 at 16:25

I will also point out that I do not store large files in the database. Even rich text files are stored as references to html files. Essentially, I'm using the database as index storage, and for very simple data storage (small text & numbers). I've done this explicitly so that I can pull the entirety of those indexes into memory on devices that are memory constrained. It's vastly increased device performance by reducing DB calls, albeit at the expense of code complexity. But I'm fine with that. It's a fun project. :)
– Aaron Hayman Jul 3 '12 at 16:35