23 March 2009

I probably have hundreds of online accounts: email, discussion forums, social networking, online shopping, server hosting, issue reporting, etc. Trying to remember all the passwords is a pain. Often, when going to a site I haven't been to in a while, I just reset it, or have it sent by email, or have a new one sent by email, or however the site in question works.

I might want to try something different for Names on Nodes. As hinted at in earlier posts, users will be considered "authorities" in Names on Nodes, along with publications, bioinformatics files, specimen repositories, nomenclatural codes, etc. All authorities are associated with one or more unique URIs, such as website addresses, ISBN numbers, DOIs, LSIDs, etc. For users, the primary URI will be an email address, in the form <mailto:myname@somedomain.tld>.

Why have an account? Well, because then, as an authority, you get to "authorize" your own datasets and taxon identifiers (and, by proxy, taxon definitions). Datasets and taxon identifiers are "qualified" objects, meaning that they each refer to an authority, and they each have a "local name" unique under that authority. A qualified name is formed by joining the URI of the authority and the local name. So, for example, if you wanted to create a new phylogenetic hypothesis about mushrooms, it might have the qualified name <mailto:me@myemailprovider.com::dataset:basidiomycota+phylogeny>. If you wanted to provide your own definition for the name "Eumetazoa", it would be attached to a taxon identifier with the qualified name <mailto:me@myemailprovider.com::Eumetazoa>. And so on.
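A minimal Java sketch of how such a qualified name could be assembled and taken apart. The class name `QualifiedName`, the `toLocalName` helper, and the rule that spaces become "+" are assumptions read off the examples above, not the actual Names on Nodes code.

```java
import java.util.Objects;

/** Sketch: a qualified name is an authority URI, "::", and a local name. */
final class QualifiedName {
    // Separator as it appears in the examples, e.g. <mailto:me@myemailprovider.com::Eumetazoa>.
    static final String SEPARATOR = "::";

    final String authorityUri;
    final String localName;

    QualifiedName(String authorityUri, String localName) {
        this.authorityUri = Objects.requireNonNull(authorityUri);
        this.localName = Objects.requireNonNull(localName);
    }

    /** Assumption: spaces in a readable name are written as '+', as in "basidiomycota+phylogeny". */
    static String toLocalName(String text) {
        return text.trim().replace(' ', '+');
    }

    /** Splits at the last "::"; assumes local names never contain "::" themselves. */
    static QualifiedName parse(String qualified) {
        int i = qualified.lastIndexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("No '::' separator in: " + qualified);
        }
        return new QualifiedName(qualified.substring(0, i),
                                 qualified.substring(i + SEPARATOR.length()));
    }

    @Override
    public String toString() {
        return authorityUri + SEPARATOR + localName;
    }

    public static void main(String[] args) {
        QualifiedName name = new QualifiedName(
                "mailto:me@myemailprovider.com",
                "dataset:" + toLocalName("basidiomycota phylogeny"));
        // prints mailto:me@myemailprovider.com::dataset:basidiomycota+phylogeny
        System.out.println(name);
    }
}
```

Splitting at the last "::" is a judgment call: it keeps URIs that contain single colons intact, at the cost of forbidding "::" inside local names.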

How do you log in without a password? I'm thinking of a system involving IP addresses, the numerical codes that identify your computer's connection. For most environments, these are relatively stable, although if you use, e.g., a DSL modem, yours may reset once in a while. Here are some potential use cases:

Initial Login

Preconditions.—User has never logged in. User's email is unregistered. User is 13 years of age or older.

Trigger.—User tries to do something that requires login.

Course of events.

User is prompted for their email address. They are also prompted on whether they want to stay logged in across sessions.

User is prompted for their birthdate, full name, and family name.

User gets a notice telling them that they have been sent a confirmation via email. The notice includes an input field for a "key".

User checks their email, and sees an email message with a link. There is also a "key", a string of letters and numbers that they can copy and paste.

User clicks on the link.

Names on Nodes reopens, with the user logged in.

Alternate course of events.

User copies and pastes the "key" into the input field.

User is now logged in.

Postcondition.—User's email and current IP address are registered. User can perform the action that triggered this use case.
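The confirmation step in this use case could be sketched in Java roughly as follows. Everything here is hypothetical: the class `ConfirmationSketch`, the 12-character key, and the choice to tie a key to the IP address it was requested from are assumptions, and a real system would send the key by email rather than return it.

```java
import java.security.SecureRandom;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: issue a one-time key, then register email + IP when the key comes back. */
final class ConfirmationSketch {
    // Letters and numbers only, leaving out look-alikes such as O/0 and I/1.
    private static final String ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
    private static final int KEY_LENGTH = 12; // assumed length

    private final SecureRandom random = new SecureRandom();
    // Pending confirmations: key -> {email, ip} the key was issued for.
    private final Map<String, String[]> pending = new HashMap<>();
    // Registered IP addresses for each email.
    private final Map<String, Set<String>> registeredIps = new HashMap<>();

    /** Creates the key that would be emailed, both inside a link and as plain text. */
    String requestConfirmation(String email, String ip) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < KEY_LENGTH; i++) {
            key.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        pending.put(key.toString(), new String[] {email, ip});
        return key.toString();
    }

    /** Called when the link is clicked or the key is pasted; registers on success. */
    boolean confirm(String key, String ip) {
        String[] request = pending.get(key);
        if (request == null || !request[1].equals(ip)) {
            return false; // unknown key, or key used from a different address
        }
        pending.remove(key); // keys are single-use
        registeredIps.computeIfAbsent(request[0], e -> new HashSet<>()).add(ip);
        return true;
    }

    boolean isRegistered(String email, String ip) {
        return registeredIps.getOrDefault(email, Collections.emptySet()).contains(ip);
    }

    public static void main(String[] args) {
        ConfirmationSketch sketch = new ConfirmationSketch();
        String key = sketch.requestConfirmation("me@example.org", "203.0.113.7");
        System.out.println(sketch.confirm(key, "203.0.113.7"));               // true
        System.out.println(sketch.isRegistered("me@example.org", "203.0.113.7")); // true
    }
}
```

Both the main course (clicking the link) and the alternate course (pasting the key) would end in the same `confirm` call; only the way the key reaches the server differs.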

Subsequent Login, Registered IP Address

Preconditions.—User's email is registered. User is not logged in, having logged out or having declined to stay logged in across sessions.

Trigger.—User tries to do something that requires login.

Course of events.

User is prompted for their email address. They are also prompted on whether they want to stay logged in across sessions.

User is now logged in.

Postcondition.—User can perform the action that triggered this use case.

Automatic Subsequent Login, Registered IP Address

Preconditions.—User's email is registered. User indicated that they wanted to stay logged in across sessions the last time they logged in.

Trigger.—User visits website.

Course of events.

User is automatically logged in, and their name is shown in a "Welcome" message.

Postcondition.—User can perform any action that requires being logged in.

Subsequent Login, Unregistered IP Address

Preconditions.—User's email is registered. User has never logged in from their current IP address.

Trigger.—User tries to do something that requires login.

Course of events.

User is prompted for their email address. They are also prompted on whether they want to stay logged in across sessions.

User enters their email address.

User gets a notice telling them that they have been sent a confirmation via email. The notice includes an input field for a "key".

User checks their email, and sees an email message with a link. There is also a "key", a string of letters and numbers that they can copy and paste.

User clicks on the link.

Names on Nodes reopens, with the user logged in.

Alternate course of events.

User copies and pastes the "key" into the input field.

User is now logged in.

Postcondition.—User's current IP address is registered. User can perform the action that triggered this use case.

Unregistering IP Addresses

Preconditions.—User's email is registered, and user is logged in.

Trigger.—User decides to invalidate other IP addresses, perhaps fearing someone else may log in as them from another computer.

Course of events.

User selects the "Block Other Locations" option.

User is prompted to confirm this request.

User confirms the request.

User receives notification that other locations have been unregistered.

Postcondition.—User's current IP address remains registered, but all others are not. User must now register other addresses again if they try to log in from a previously-used address.
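Taken together, the use cases come down to one question at login (is this IP address registered for this email?) plus one operation for clearing out other addresses. A hedged Java sketch of that core, with the hypothetical names `IpLoginSketch`, `Outcome`, and `blockOtherLocations`, and an in-memory map standing in for whatever storage the server would really use:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: decide between direct login and email confirmation, based on known IPs. */
final class IpLoginSketch {
    enum Outcome { LOGGED_IN, CONFIRMATION_REQUIRED }

    // Registered IP addresses for each email.
    private final Map<String, Set<String>> registeredIps = new HashMap<>();

    /** Records an address, e.g. after an emailed key has been confirmed. */
    void register(String email, String ip) {
        registeredIps.computeIfAbsent(email, e -> new HashSet<>()).add(ip);
    }

    /** Registered email + registered IP logs in directly; anything else needs confirmation. */
    Outcome login(String email, String ip) {
        Set<String> ips = registeredIps.getOrDefault(email, Collections.emptySet());
        return ips.contains(ip) ? Outcome.LOGGED_IN : Outcome.CONFIRMATION_REQUIRED;
    }

    /** "Block Other Locations": keep the current address, unregister all others. */
    void blockOtherLocations(String email, String currentIp) {
        Set<String> ips = registeredIps.get(email);
        if (ips != null) {
            ips.retainAll(Collections.singleton(currentIp));
        }
    }

    public static void main(String[] args) {
        IpLoginSketch sketch = new IpLoginSketch();
        sketch.register("me@example.org", "203.0.113.7");   // home
        sketch.register("me@example.org", "198.51.100.20"); // office
        sketch.blockOtherLocations("me@example.org", "203.0.113.7");
        System.out.println(sketch.login("me@example.org", "203.0.113.7"));   // LOGGED_IN
        System.out.println(sketch.login("me@example.org", "198.51.100.20")); // CONFIRMATION_REQUIRED
    }
}
```

The "stay logged in across sessions" choice is left out here; it would presumably be a separate flag stored alongside each registered address.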

I'm not sure if this would be too convoluted in practice, but somehow I doubt it. If anything, it seems no worse than the usual type of system, except possibly for people who use laptops and are constantly on the move.

17 March 2009

As I discussed previously, the Names on Nodes project had reached a point where the schema just wasn't working out. I went through a list of what was wrong with it: confusing nomenclature, various unnecessary classes, unnecessary references, and major practical problems with looking up contextual relations.

Another big problem was the home-brewed keyword search system I had going. Synchronizing the keyword lists was becoming problematic, and I realized there are already perfectly good (better, even) tools out there such as Hibernate Search. That's a chief rule of programming: don't reinvent something that people smarter than you, with more time on their hands, have already invented.

After a clear, honest look at the contextual relations, I came to a realization: they should be in the client, not the back end. No need to bog down the server with computing definition applications when it can be done in the client. That simplified things a great deal.

Another thing I didn't really need was categories. They were basically an ad hoc form of class inheritance, e.g., a species name is a nomen, a nomenclatural code is a publication, etc. For a little while I considered implementing this as a class hierarchy, as I had in earlier versions. But, really, this is irrelevant data—Names on Nodes doesn't really need to know what category an identifier falls in.

Finally, I had another problem in the way datasets and taxon identifiers (=signifiers) used qualified names. Each one was supposed to have a unique qualified name. While I was able to guarantee uniqueness within datasets and within taxon identifiers, I wasn't able to guarantee that qualified names would be unique between datasets and taxon identifiers.

So, here's the new version:

[UML class diagram of the revised schema]

Again, white arrows indicate "is-a" relationships ("inheritance"), so a PhyloDefinition is a type of Definition, a Dataset is a type of Qualified object, etc. And black diamonds indicate "has-a" relationships ("composition"), so a TaxonIdentifier has one (and only one) Taxon, an Equation has at least two TaxonIdentifier objects, etc. (I've left out a few non-core classes, like BioFile and UserAccount.)

Brief discussions of each class:

Authority.—An authority can be a publication, a person, a bioinformatics file, a database, a specimen catalogue, etc. Each authority has a canonical name (e.g., "Yale Peabody Museum: Vertebrate Paleontology Collection") and an optional abbreviation (e.g., "YPM-VP").

AuthorityIdentifier.—One or more identifiers may be used to indicate an authority, each one associated with a unique URI. Examples:

Qualified.—This new abstract class makes it possible for qualified names to be unique across all classes that use them. Each refers to an authority identifier and contains a local name, which is unique to that identifier. When combined, the identifier's URI and the local name form a qualified name, e.g., <urn:isbn:0853010064::Homo+sapiens> or <http://peabody.yale.edu/collections/vp::1450>.
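To illustrate why a shared abstract class helps, here is a rough Java sketch in which datasets and taxon identifiers go through a single uniqueness check. The class names carry a `Sketch` suffix because they are stand-ins, and the in-memory map in `QualifiedRegistry` is an assumption; in the real back end the same guarantee would more likely come from a database constraint.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: one registry for every qualified object, whatever its concrete class. */
final class QualifiedRegistry {
    private final Map<String, QualifiedSketch> byQualifiedName = new HashMap<>();

    /** Rejects a qualified name that is already taken by any kind of qualified object. */
    void add(QualifiedSketch object) {
        QualifiedSketch existing = byQualifiedName.putIfAbsent(object.qualifiedName(), object);
        if (existing != null) {
            throw new IllegalStateException("Qualified name already in use: " + object.qualifiedName());
        }
    }

    int size() {
        return byQualifiedName.size();
    }

    public static void main(String[] args) {
        QualifiedRegistry registry = new QualifiedRegistry();
        registry.add(new TaxonIdentifierSketch("urn:isbn:0853010064", "Homo+sapiens"));
        try {
            // Same qualified name, different class: still a clash.
            registry.add(new DatasetSketch("urn:isbn:0853010064", "Homo+sapiens"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}

/** Anything with an authority identifier URI and a local name. */
abstract class QualifiedSketch {
    final String authorityUri;
    final String localName;

    QualifiedSketch(String authorityUri, String localName) {
        this.authorityUri = authorityUri;
        this.localName = localName;
    }

    String qualifiedName() {
        return authorityUri + "::" + localName;
    }
}

final class DatasetSketch extends QualifiedSketch {
    DatasetSketch(String authorityUri, String localName) {
        super(authorityUri, localName);
    }
}

final class TaxonIdentifierSketch extends QualifiedSketch {
    TaxonIdentifierSketch(String authorityUri, String localName) {
        super(authorityUri, localName);
    }
}
```

With separate, unrelated classes, each could only police its own names; with a common parent, the check sits in one place.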

TaxonIdentifier & Taxon.—Formerly called "signifiers", taxon identifiers are qualified objects that each refer to a taxon. Taxon identifiers may be scientific names, vernacular names, specimen identifiers, character state descriptions, etc. As with authorities, each taxon may have more than one identifier referring to it. For example, the following qualified names all refer to the same species: <urn:isbn:0853010064::Abeillia+abeillei>, <http://iucnredlist.org::species:142883>, and <http://iucnredlist.org::common_name:Eng:Emerald-chinned+Hummingbird>.

Label.—Authorities, datasets, and taxon identifiers are all labelled entities, possessing one label object. Each label has a name, an optional abbreviation, and a flag telling whether it should be italicized. Labels are merely cosmetic, and need not be unique. They are used as the targets of searches, using Hibernate Search.

Definition.—Each definition has one taxon identifier, and only one definition pertains to that taxon identifier. How do I accommodate differing definitions, then? I use a concept from the PhyloCode: conversion. Consider the name "Aves". Under the ICZN, it refers to a suprafamilial ranked taxon with no type. According to Sereno's TaxonSearch, it refers to a node-based clade including Archaeopteryx. According to Gauthier and de Queiroz (2001), it refers to a crown group. But instead of having multiple definitions for the same identifier, I consider each definition to define a different identifier, each indicating a (potentially) different taxon: <urn:isbn:0853010064::Aves>, <http://www.taxonsearch.org/Archive/stem-archosauria-1.0.php::Aves>, and <urn:bici:0912532572(200112)%3C7:FDFDCD%3E2.0.TX;2-H::Aves>, respectively. In cases of conversion, the definition also indicates the original identifier.
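A small Java sketch of the one-definition-per-identifier rule with conversion. The names `DefinitionSketch`, `DefinitionTable`, and `convertedFrom` are hypothetical, and treating the ICZN identifier as the "original" of the TaxonSearch one is purely an illustrative assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: at most one definition per qualified taxon identifier. */
final class DefinitionTable {
    private final Map<String, DefinitionSketch> byIdentifier = new HashMap<>();

    void define(DefinitionSketch definition) {
        if (byIdentifier.putIfAbsent(definition.identifier, definition) != null) {
            throw new IllegalStateException("Already defined: " + definition.identifier);
        }
    }

    Optional<DefinitionSketch> lookup(String identifier) {
        return Optional.ofNullable(byIdentifier.get(identifier));
    }

    public static void main(String[] args) {
        DefinitionTable table = new DefinitionTable();
        String iczn = "urn:isbn:0853010064::Aves";
        String taxonSearch = "http://www.taxonsearch.org/Archive/stem-archosauria-1.0.php::Aves";
        table.define(new DefinitionSketch(iczn, null));
        // Same spelling, different authority, so a different identifier and definition.
        table.define(new DefinitionSketch(taxonSearch, iczn));
        System.out.println(table.lookup(taxonSearch).get().convertedFrom().isPresent()); // true
    }
}

final class DefinitionSketch {
    final String identifier;            // qualified name of the identifier being defined
    private final String convertedFrom; // original identifier in a conversion, else null

    DefinitionSketch(String identifier, String convertedFrom) {
        this.identifier = identifier;
        this.convertedFrom = convertedFrom;
    }

    Optional<String> convertedFrom() {
        return Optional.ofNullable(convertedFrom);
    }
}
```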

PhyloDefinition & RankDefinition.—These have not changed much, except that they now refer directly to their specifiers and types, respectively. No more useless "Anchor" class.

Dataset.—Instead of storing a bunch of relations of unspecified type, each type of relation falls within its own set. I've also added optional ratios for converting weights in phylogenetic networks to generations and/or years.

Equation.—I almost called this "Synonymy". This is a new type of relation, which asserts that two or more identifiers refer to the same taxon.
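One way a client might resolve a pile of such equations is a union-find structure, where each equation merges the groups its identifiers belong to. This is only a sketch of that general technique in Java; the class `EquationSketch` and the placeholder identifiers are hypothetical, and nothing here is taken from the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: union-find over identifiers, merged by "these refer to the same taxon" assertions. */
final class EquationSketch {
    // Each identifier points towards the representative of its group.
    private final Map<String, String> parent = new HashMap<>();

    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point everything on the walked path straight at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Asserts that two or more identifiers refer to the same taxon. */
    void equate(String... identifiers) {
        if (identifiers.length < 2) {
            throw new IllegalArgumentException("An equation needs at least two identifiers");
        }
        String root = find(identifiers[0]);
        for (int i = 1; i < identifiers.length; i++) {
            parent.put(find(identifiers[i]), root);
        }
    }

    boolean sameTaxon(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        EquationSketch equations = new EquationSketch();
        equations.equate("urn:example:a::X", "urn:example:b::Y");
        equations.equate("urn:example:b::Y", "urn:example:c::Z");
        System.out.println(equations.sameTaxon("urn:example:a::X", "urn:example:c::Z")); // true
        System.out.println(equations.sameTaxon("urn:example:a::X", "urn:example:d::W")); // false
    }
}
```

Because the operands are unordered and there can be more than two, an equation does not fit the old two-operand relation shape, which is part of why it gets its own class.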

Heredity & Inclusion.—Heredity was previously called "Parentage". The new nomenclature better reflects its real meaning, since it models ancestor-descendant relationships, not necessarily parent-child. These two classes are little changed, except that now they don't both descend from a useless Relation class, so their nomenclature can be clearer (predecessor and superset used to be "a"; successor and subset used to be "b").

This schema is much cleaner, and will make for a more efficient server-side. I've already implemented the entities, removed deprecated code, and updated the relevant code. After some hiccups with a Hibernate upgrade, unit tests are working again. The back-end should be complete fairly soon (pending some ideas about user accounts), and then it will be time to look at some massive refactorings for the front end!

03 March 2009

(Warning: If you are not me, this post may not make much sense. Same could be said for many recent posts. Sorry for all the self-indulgence here, lately, but I'm trying to work through a lot of thorny issues.)

Last year I wrote a post about some revisions to the entity schema of Names on Nodes, my longstanding project to automate the application of phylogenetic nomenclature. The revisions were pretty hefty, and necessitated a rewrite of much of the project. I got pretty far without making any further major modifications to the schema. But, after a few months of work, some flaws are beginning to show.

Once again, here is the UML diagram:

[UML class diagram of the earlier schema]

And, once again: the white arrows indicate inheritance, i.e., "is-a" relationships. For example, a PhyloDefinition is a type of Definition. The black diamonds indicate composition, i.e., "has" relationships. For example, a Definition has any number of Anchor entities, each of which has exactly one Signifier entity.

So, the problems...

The nomenclature is confusing.

Not all of it, but some. What I was calling a SignifierIdentity is, in fact, a taxon (in a somewhat loose sense, i.e., any set of organisms, or subset of life, or whatever; more here), and a Signifier is just a taxon identifier. What I was calling an Authority is actually an authority identifier, and what I was calling an AuthorityIdentity ... is really an authority!

Anchors are insufficient.

The idea of the Anchor class was to allow every definition, be it rank-based or phylogenetic, to be connected with any number of taxa, namely, those taxa required by the definition. Each Anchor object specifies a taxon (through an identifier/signifier) and tells whether it is internal or external. I had hoped that this would work equally well for both rank-based and phylogenetic definitions, modeling biological types for the former and the specifiers for the latter. But there are some crucial differences between types and specifiers:

A rank-based definition may not have a type; but a phylogenetic definition must have at least one specifier (usually two or more, but in theory you could get by with one, e.g., "Homo erectus (Dubois 1892) and all of its descendants," not that I'd recommend it in most cases).

A specifier can be a character state description, but a type cannot. (Both can be taxonomic names or specimen identifiers).

Types are always internal, so it's pointless to have to mark them as such.

A type is always included in the taxon. A specifier, even an internal one, may not be, since phylogenetically-defined taxa are potentially empty.

Relations are insufficient.

Why do Parentage and Inclusion both extend Relation? Because they can. They both require two ordered operands (parent and child for the former, superset and subset for the latter). There really is no other reason; modeling them this way doesn't make calculations faster (in fact, it slows them down), and gives no benefit otherwise. Furthermore, the Relation class is incapable of modeling other types of relations, like equation (i.e., subjective and heterodefinitional synonymy), which has two or more unordered operands. (Note: objective/homodefinitional synonymy is already well-handled by the relation of identifiers/signifiers to taxa/identities.)

Relators are insufficient.

Why do Definition, DefinitionApplication, and Dataset all extend Relator? Good question. The idea was that all of them indicate relations of some kind. But this resemblance only goes so far.

Rank-based definitions do indicate that the types are included by the defined identifier/signifier, but phylogenetic definitions don't really indicate anything, since they potentially yield empty results.

The inclusions indicated by rank-based definitions are redundant with the information about their types. I had to implement an awkward system to synchronize this.

Datasets are the only relators that can indicate parentage; the other two can only indicate inclusion.

Only datasets can indicate subjective synonymy, and only definition applications can indicate heterodefinitional synonymy. Those relations aren't currently modeled at all, but should be.

Definitions do not need to reference an authority.

For a while, I had been considering taxonomic names as defined by different authorities to share the same identity. This proved unworkable. Instead, whenever an authority defines a name, it is either coining that name anew, or converting it into a new name (that happens to have the same spelling, but a different authority). For example, Aves under Linnaeus 1758 and Aves under the ICZN are the same thing, but Aves sensu Gauthier & de Queiroz 2001 and Aves sensu Sereno 2005 are different entities.

For this reason, a definition can be considered to have the same authority as the name it defines. Keeping an extra reference to an authority is redundant. Under this system, every name gets only one definition (if that).

Looking up contextual relations is awkward.

Those other problems are pretty minor compared to this one. One of the core ideas of Names on Nodes is that you are free to create a phylogenetic context. A context is basically a way of saying which datasets you want to use (and which you want to ignore). Every definition is true for all contexts, but the application of each definition may differ.

Thus, when looking up things like whether taxon A is ancestral to taxon B (something you have to do a lot when applying phylogenetic definitions), the algorithm has to look at every single relation and decide whether it belongs or not. Does it belong to a definition? Then it belongs. Does it belong to a definition application? Then it belongs if that application is under the specified context. Does it belong to a dataset? Then it belongs if that dataset is included in the context. I optimized this a lot, but, at the end of the day, I was making it do something it did not really need to do. Which brings me to my last point.
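To make that cost concrete, here is a naive Java sketch of the lookup as described: filter every relation against the context, then search what is left. The names `ContextLookupSketch`, `SourceKind`, `Relation`, `from`, and `to` are hypothetical, the single `source` field is a simplification, and this is the unoptimized shape of the problem rather than the code I actually had.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ContextLookupSketch {
    enum SourceKind { DEFINITION, DEFINITION_APPLICATION, DATASET }

    /** One ordered relation, tagged with where it came from. */
    static final class Relation {
        final SourceKind kind;
        final String source; // dataset ID, or context ID for an application; unused for definitions
        final String from;
        final String to;

        Relation(SourceKind kind, String source, String from, String to) {
            this.kind = kind;
            this.source = source;
            this.from = from;
            this.to = to;
        }
    }

    /** The three membership questions, asked of every single relation. */
    static boolean belongs(Relation r, String contextId, Set<String> contextDatasets) {
        switch (r.kind) {
            case DEFINITION: return true;
            case DEFINITION_APPLICATION: return contextId.equals(r.source);
            case DATASET: return contextDatasets.contains(r.source);
            default: return false;
        }
    }

    /** Breadth-first search over only those relations that belong to the context. */
    static boolean isAncestor(List<Relation> all, String contextId, Set<String> contextDatasets,
                              String ancestor, String descendant) {
        Map<String, List<String>> next = new HashMap<>();
        for (Relation r : all) {
            if (belongs(r, contextId, contextDatasets)) {
                next.computeIfAbsent(r.from, k -> new ArrayList<>()).add(r.to);
            }
        }
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(ancestor);
        seen.add(ancestor);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String successor : next.getOrDefault(current, Collections.emptyList())) {
                if (successor.equals(descendant)) {
                    return true;
                }
                if (seen.add(successor)) {
                    queue.add(successor);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Relation> all = Arrays.asList(
                new Relation(SourceKind.DATASET, "d1", "A", "B"),
                new Relation(SourceKind.DATASET, "d2", "B", "C"));
        Set<String> onlyD1 = Collections.singleton("d1");
        Set<String> both = new HashSet<>(Arrays.asList("d1", "d2"));
        System.out.println(isAncestor(all, "ctx", onlyD1, "A", "C")); // false: d2 is ignored
        System.out.println(isAncestor(all, "ctx", both, "A", "C"));   // true
    }
}
```

Every query pays for a pass over all relations before the search even starts, which is the overhead in question.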

Looking up contextual relations is not easily optimized.

The Context class is pretty bare bones, and that's not a good thing. I've been looking into implementing some of the optimizations present in Bender & al. 2005, but it's not possible with the current schema.

So, some revisions are needed. Not nearly as major as last time, but fairly significant. More in Part II, coming some day....