
Can either of you point me to a doc explaining the proper way to do this, if not directly embed it in the Table Schema? It seems like a useful feature and I'm not sure of the best way to accomplish something like this.

Our use case is a bit different since we're using REST web services, so we're considering either embedding small factors directly into the Schema (à la #29) or providing a reference to the REST call which would provide such a mapping ({"id":"color", "rest":"/colors/"}). I realize that particular attribute would be outside of any spec, but if there were an encouraged way to reference another Schema, perhaps we could use that more heavily in the R package and then customize from there.


I like it. The more I think about it, the REST-based approach I mentioned would actually fit within your proposed solution. If I were to host a datapackage.json file at the root of my webapp, all of my data/file references could just be relative URLs defining the GET/list functions for each type of object I'm interested in. Assuming my REST service produces strictly JSON Table Schema, we could end up with a pretty clean solution.

What's the best next step for this? We're trying to move pretty quickly on our client application, so I'd be happy to write up the docs and submit them in a pull request as we continue our development, if we're in agreement that this should happen within the scope of this spec. If not, I'll just document it internally and save myself some time. Let me know...


foreignkey: {
  // points to the datapackage.json of the relevant data package
  url: ...
  // id / name of the dataset
  resource: ...
  // id of the field in the referenced table
  field: ...
}

Note that we're now using resource instead of file. Since IDs are not guaranteed to be provided for a given resource, we use the following logic to map a resource attribute to another resource in the specified DataPackage:

1. If the resources in this DataPackage are a named hash, check those names for the resource id.
2. Check the id field of each resource for a match.
3. Check the name field of each resource for a match.
4. Check the path and url fields of each resource for a match.

As soon as a single match is found at any of these steps, no more searching is done. (As you can see, I think it would be easier for clients to implement this feature once this is pinned down in the spec).
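A minimal sketch of that lookup order in JavaScript (the function name and object shapes are illustrative, not from any spec; it assumes a parsed datapackage.json object):

```javascript
// Resolve a foreignkey "resource" reference against a data package,
// following the four steps above; returns null if nothing matches.
function resolveResource(datapackage, id) {
  const resources = datapackage.resources;
  // 1. If the resources are a named hash, check those names first.
  if (!Array.isArray(resources) && id in resources) {
    return resources[id];
  }
  const list = Array.isArray(resources) ? resources : Object.values(resources);
  // 2. Check the id field of each resource, then 3. the name field.
  for (const key of ["id", "name"]) {
    const match = list.find((r) => r[key] === id);
    if (match) return match;
  }
  // 4. Finally, check the path and url fields.
  return list.find((r) => r.path === id || r.url === id) || null;
}
```

A match at an earlier step short-circuits the later ones, as the spec text requires.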

In R, there's no obvious way to preserve the underlying ID of an object while still mapping it to another abstract object (a "hash" in JS or a "list" in R) within the context of a data.frame, so we support this fairly naively at this point: we only support mapping integer IDs to a single field (which gets cast to a string, if it isn't one already). The details (excerpted from https://github.com/QBRC/RODProt/blob/master/R/incorporate-foreign-keys.R) are copied below:

The behavior is as follows: if there is only one column in the referenced resource and that column
matches the \code{ID} given for the foreignkey, that column will be used as both the IDs and
the levels of the factor -- meaning that any values defined in this external table will map
to themselves, and any other values not in the table will map to \code{NA}. If the table has two
columns, one of which matches the \code{ID} value of the foreignkey, then the \code{ID} column
will be used for the IDs, and the second column, regardless of name, will be used as the levels
for the factor. If there are more than two columns, then the column to be used as the name must
be specified using the \code{name.column} parameter.

I created an issue to support more robust mappings of IDs to complex objects (QBRC/RODProt#12), but it will require some engineering to get that functionality in place.


I'm not sure which way you intend "multiple." To me, multiple foreign keys (single-columned FKs on multiple columns) could be handled well in this approach (indeed, we're using that pretty heavily in our application). To extend the previous example:
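Something along these lines, reusing the foreignkey hash from above (the URLs, field names, and resource names are invented for illustration):

```json
{
  "fields": [
    {
      "id": "color",
      "foreignkey": {
        "url": "http://example.com/datapackage.json",
        "resource": "colors",
        "field": "id"
      }
    },
    {
      "id": "shape",
      "foreignkey": {
        "url": "http://example.com/datapackage.json",
        "resource": "shapes",
        "field": "id"
      }
    }
  ]
}
```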

(pardon any syntax errors). To be honest, though, I don't think I've ever seen that use case -- maybe it's more common in other fields.

Finally, if you meant composite foreign keys, then I agree that would be a limitation of this style. In my experience, the trend (for better or worse) has been to move away from composite keys where possible, though. I think at least part of this is motivated by the rise of web applications and the need to more easily de/serialize data. Most of my ORM experience has been with Hibernate, so I can only speak to that library, but I know its support for composite keys is pretty sloppy. Essentially, you typically need to embed your composite key as a single object so that Hibernate can treat it as a single reference to another table.

I'm not sure how well that would work or if it would be worth the complexity, but it would be one approach.

I may just be biased by our application, but if I'm already serializing the data to JSON, I would have long given up the luxury of "fancy" relational mappings involving composite keys. Perhaps I'm in the minority there, though. I'm not sure about other common client languages like Python, but I know that in R, support for the concept of a foreign key is so simplistic that it would require a pretty foundational rewrite of some of the base data structures in the language to even get something like a composite foreign key into the language.


datapackage: [optional] a URL or a single name reference. If a single name, it will be up to the client to resolve that name to a data package (we will not allow relative path references). If not supplied, we resolve to a resource within this datapackage.json.


@jpmckinney all good points. For foreign keys, which are already quite complex, I guess this makes sense. I suppose for other items, e.g. even unique or primary keys, it makes things more of a hassle to write.

Overall: I get the feeling that a good set of people want constraints outside of the fields, in a separate section à la SQL.

If you see a "primary_key" property on the first field in a datapackage.json file, you might assume that field is the primary key, when in fact it forms a composite key with a field that appears later in the definition.

The data referred to by the foreignkeys is not available in dpkg1 (the resource data only contains x, not date and seq). To get date and seq, one needs to go into dpkg2 and extract them from there (note that this is recursive, as dpkg2 can also have foreignkeys...).
In this scenario it is convenient to know the order of the fields and to have the foreignkeys right in schema.fields.

So to me, foreignkeys make it possible to recreate a composite JSON Table Schema from fields originating in different data packages.

Generalizing outside JSON Table Schema, it would be good to have a discussion on how to use the top-level dependencies or dataDependencies property of a data package.
Maybe we should think of a more generic solution for "importing" resources (like from x import y as z in Python, or var y = require('x') in Node.js).

If we stick with the current foreignkey hash and generalize it for non-SDF resources, we just need to omit the field property.
For instance, we could have a resource y in dpkg1 coming from dpkg2:
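A sketch of what that might look like (the URLs and names are invented; the field property is omitted as described):

```json
{
  "name": "dpkg1",
  "resources": [
    {
      "name": "y",
      "foreignkey": {
        "url": "http://example.com/dpkg2/datapackage.json",
        "resource": "y"
      }
    }
  ]
}
```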

Here "foreignkey" is an extra possible property in addition to "data", "path" or "url".

If that is the intended use, maybe "foreignkey" is not the best possible word.
What I really like about this usage of "foreignkey" is that it makes it possible to repackage resources coming from different data packages into one data package (tailored to someone's specific data needs), with full versioning support and without any data duplication.


@sballesteros That's not how foreign keys work... the data for myresource would still have values for date and seq. A foreign key is just a constraint that can be enforced to ensure that the values in date and seq exist in the referenced columns of a resource in dpkg2. It's a way to check data consistency.

What you're describing is an entirely different feature, which may be worthwhile, but is not what a foreign key is. What you're describing is closer to a materialized view, that joins multiple tables together.


@jpmckinney thanks for clarifying. So what I suggest is more related to the way dependencies will be handled. Got it now... A foreign key is a constraint ensuring that the values taken by the foreignkey field are elements of the resource field it points to.


I think constraints is more appropriate. A primary key constrains that field to be unique, as does a unique key. Indexes are for avoiding table scans, and such optimizations are usually use case-specific. For example, you might index on a person's name, but not have any uniqueness or other constraints on that name. It just so happens that most DBs automatically create indexes for most constraints, hence the potential confusion.

I prefer constraints to be a simple array, where each object has a type key equal to "foreign key", "primary key" or "unique key". Unless there is a good conceptual reason for adding structure like nesting, in general I prefer a flatter structure (a conservative, "build-it-when-you-need-it" approach). In either case, though, implementations would just loop through the constraints once and read in whatever constraints they support.
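The single loop in that flat-array proposal might look like this in JavaScript (the property names beyond type, such as fields and reference, are assumptions for illustration):

```javascript
// A flat constraints array, each entry carrying a "type" discriminator.
// "fields" and "reference" are illustrative property names, not spec.
const constraints = [
  { type: "primary key", fields: ["id"] },
  { type: "unique key", fields: ["email"] },
  { type: "foreign key", fields: ["color"],
    reference: { resource: "colors", field: "id" } },
];

// A consumer loops once and keeps whatever constraint types it supports;
// here "unique key" is simply skipped.
const supported = new Set(["primary key", "foreign key"]);
const recognized = constraints.filter((c) => supported.has(c.type));
```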

Re: using the string "foreign key" as a value, etc.: If people prefer using underscores instead of spaces, I'm fine with that. I don't think we should remove the space - it's easier to read words when they are properly separated, and no one is crying over the extra 1 byte used for each space/underscore.

If using foreign key as a key instead of a value, as in the parallel example, then I think we should simply conform to whatever the rest of datapackage.json does. datapackage_version and last_modified use underscore style. If we want to switch to camel-case to be more like commonjs, then we should open a new ticket and do it for all properties at once.


+1 for constraints.
I like the single-array idea, but I'm not a huge fan of spaces in type values.
In an implementation, if you create a hash to handle all the types, in some languages (e.g. JS) having a space in the keys is annoying (you can't use hash.key and have to use hash["key"]).
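Concretely, in JavaScript (a toy dispatch hash; the handler bodies are placeholders):

```javascript
// With a space in the key, dot access is unavailable.
const handlers = {
  "foreign key": () => "fk",  // space: bracket access only
  primary_key: () => "pk",    // underscore: dot access works
};

const a = handlers.primary_key();     // fine
const b = handlers["foreign key"]();  // works, but clumsier
// handlers.foreign key()  would be a syntax error
```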

More generally for the data package keys, I think switching to camelCase to be more like CommonJS would be nice, especially if datapackage.json is renamed package.json.


@sballesteros If we put the types as keys, then yes, using underscores or camelcase is better. But if we're putting the types as values, then it makes no difference. Edit: Nevermind, I misunderstood what you meant by creating a hash.


Going to repost this here since apparently #21 isn't the right place anymore.

Howdy guys I just wanted to add my recommendation here based on a few things.

Keys are intrinsic descriptions of a dataset and not a field

I don't think adding any kind of key description at the field level is correct. I don't think it's a matter of shorthand either, as structurally all keys are properties of a dataset and not of a field.

Keys aren't indices and aren't necessarily constraints either

Indices are a database specific feature used for optimizing common operations. The terms are often used interchangeably in those systems because database vendors often default to (or require) building an index onto the primary key columns for their own internal operations or for the convenience of the user. Keys don't necessarily imply a constraint either although generally implementations constrain primary keys to be unique. One vendor specific example is that the InnoDB engine allows both non-unique indices and foreign keys to non-unique indices.

IMO, if the specification's goal is to be producer- and consumer-agnostic, then it wouldn't include references to the concept of an index, and would leave that up to the system consuming the data package. I think there's probably some room here to include a unique flag, as it might be seen as a description of the key's relationship to the dataset and not a constraint that the package is defining.

As far as how to implement a specification for keys, I think the most generic form would try to include only things that are descriptive of the data itself. A key should signify that fields within or between datasets are linked to one another in some way, or have some kind of special meaning in the dataset itself (such as uniqueness). Something like the following (which even I consider somewhat verbose) might be a good first shot:
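A sketch of what such a descriptive keys section might look like (every property name here is an assumption for illustration, not part of any spec):

```json
{
  "schema": {
    "fields": [
      { "id": "id" },
      { "id": "color" }
    ],
    "keys": [
      { "type": "primary", "fields": ["id"], "unique": true },
      { "type": "foreign", "fields": ["color"],
        "reference": { "resource": "colors", "fields": ["id"] } }
    ]
  }
}
```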


My one objection to the array approach for keys/constraints is that it makes it more painful for consumers - rather than just doing:

resources[0].schema.constraints.primarykey.fields

I have to iterate through the list of constraints.

@besquared re: constraints vs. keys in naming, I feel a little ambivalent. I think it would be nice to include unique requirements in there at some point, and they definitely aren't keys... (however, this isn't a biggie)

I do think generally that keys/constraints should live inside the schema attribute and be part of JSON Table Schema (@besquared your example suggested otherwise, but I wasn't sure if that was intentional...)


@rgrp For all except the primary key, you'll still need to iterate, as only the primary key will have a single key object as its value. The others will have arrays. If your app loads all keys, you'll have a double-nested iteration if the constraints are an object, instead of a simple iteration if the constraints are an array. It depends on what use cases you want to prioritize.

Also, can we have separation between words, e.g. primary_key?

Also, unique keys are keys (it's in the name). How are they "definitely not keys"? @besquared is correct in saying that constraints in the language of DBs are something different, e.g. you can put a constraint at the field-level on a price field to say that it must have a value greater than zero. NOT NULL is maybe the most common constraint. "Keys" is the more universal DB word for what we're talking about.


+1 for everything @jpmckinney is saying. Also @rgrp, I originally did think keys belonged to the resource and not the schema, but I take it back; they obviously belong in the schema. To draw some inspiration from Codd's relational model, something like the format I proposed above allows us to correctly express the three types of relational constraints.
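For instance, keys could be an object keyed by kind rather than an array (a sketch; all property names are invented for illustration):

```json
{
  "schema": {
    "keys": {
      "primary": { "fields": ["id"], "unique": true },
      "foreign": [
        { "fields": ["color"],
          "reference": { "resource": "colors", "fields": ["id"] } }
      ]
    }
  }
}
```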

This seems fine too: it allows the kind of object access one would expect (resource.schema.keys.primary) while still allowing multiple foreign keys, and it leaves the keys section pretty robust to extension later.


This might also be a good time to talk about foreign keys vs. references. In the relational model foreign keys imply a constraint in which the foreign relation must have a corresponding value for the keyed fields in order for a new relation to be added. This allows referential integrity between relations to be maintained.

If what the specification is trying to achieve is a way to link data together (but not necessarily imply a constraint) then we might want to have a section called "references" instead of a set of foreign keys. Consumers who would like to resolve reference packages and apply foreign key constraints in their data management system may choose to do so but wouldn't be required to.


I prefer foreign keys to "relations", etc. because it is a label that precisely refers to the existing concept.

datapackage_version and last_modified use underscores. Do any current properties use camelcase? The case should be consistent across all properties, which for now would mean sticking to underscores. I don't mind switching everything to camelcase, but then we'd have to change at least datapackage_version and last_modified.


Given that it's likely that primaryKey will become a thing, it seems reasonable that foreignKeys should become the other thing. I think it's important to tell people that this might not have the same meaning that it does in their RDBMS. This seems ok.


One idea I have, which would be a nice-to-have, is to create some sort of fallback sequence. It might make everything more complicated, but what I'm thinking is that resource could be an array of resources (most significant resource first), and when looking up the key it would just go through the resources in that order until it finds the key being searched for.

This means I can have multiple resources for datasets published in different phases. I'm thinking about a case for budget data where countries have different budget phases. The problematic one is when we have an approved budget which can be adjusted at a later point in time. Adjustments are not necessarily a completely new dataset but changes to different rows of the approved budget.

Let's say we represent this with two resources (marking them with dates for clarity): approved-2013-12-25.csv and adjusted-2014-01-22.csv. This problem could be solved by just making adjusted-2014-01-22.csv contain the whole budget with changes (so it also contains rows which haven't changed). For this the foreign key resource would just be the most recent budget resource. This means that if there are a few adjustments made we'll be copying a lot of the same rows between resources. To compare changes you would have to do a diff.

Another option would be to make adjusted-2014-01-22.csv contain only the lines that have changed, with the same schema (fields, primary keys, and foreign keys), and define the resources for the foreign key with an array. Then, when trying to reference an item in the budget, it would first try to find it in adjusted-2014-01-22.csv and, if not found, go through approved-2013-12-25.csv. This reduces the amount of redundancy and makes it clearer what the changes are, but it makes the implementation of foreign keys more difficult.
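Concretely, the array form might look like this (the field names are invented; the resource file names are the ones from the budget example above):

```json
{
  "id": "budget_line",
  "foreignkey": {
    "resource": ["adjusted-2014-01-22.csv", "approved-2013-12-25.csv"],
    "field": "budget_line_id"
  }
}
```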

Just an idea. I'm glad to drop it and use up more disk space if that means we can get foreign keys in as soon as possible.


Another use case I came up with just now: this would enable you to break resources up into more manageable sizes. For example, with one resource per year, a foreign key in another resource could include the year and fall back through the resources broken down by year.