ralphm's blog

Wednesday, 30 July 2008

How to organize the nodes that can be subscribed to and what
identifiers to use for them.

In part 1 I wrote about what you can
subscribe to and how a social network service will send out
notifications. I often used node for the thing
you subscribe to, a term that comes directly from the XMPP Publish
Subscribe specification. In other publish-subscribe
implementations this is often referred to as a
topic. Nodes are kept by a publish-subscribe
service, and, among other things, this service is responsible for
keeping the list of subscribers and sending out notifications.

Publish-subscribe services currently come in two forms: dedicated
publish-subscribe services with their own domain (e.g.
pubsub.ik.nu) and publish-subscribe services tied
to a user account (often mentioned in combination with the Personal Eventing
Protocol, also known as PEP). In the latter case, nodes are
kept at the bare JID of a user's account (e.g.
ralphm@ik.nu). Personal pubsub-nodes have nice
properties, like the ability to directly associate a particular node
with a person, and the possibility of doing access control on the
user's contact list (roster).
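
As a minimal sketch of the difference, the subscription request is the same in both forms; only the address it is sent to changes. The stanza layout follows the XMPP Publish Subscribe specification, but the helper name, the subscriber JID and the stanza id are assumptions made up for illustration, and the node name updates is just a placeholder.

```python
import xml.etree.ElementTree as ET

NS_PUBSUB = "http://jabber.org/protocol/pubsub"


def subscribe_request(service_jid, node, subscriber_jid, stanza_id="sub1"):
    """Build a pubsub subscribe request (hypothetical helper, not a library API)."""
    iq = ET.Element("iq", {"type": "set", "to": service_jid, "id": stanza_id})
    pubsub = ET.SubElement(iq, "{%s}pubsub" % NS_PUBSUB)
    ET.SubElement(
        pubsub,
        "{%s}subscribe" % NS_PUBSUB,
        {"node": node, "jid": subscriber_jid},
    )
    return iq


# Dedicated service: the request goes to the service's own domain.
to_dedicated = subscribe_request("pubsub.ik.nu", "updates", "someone@example.org")

# Personal (PEP-style) service: the request goes to the user's bare JID.
to_personal = subscribe_request("ralphm@ik.nu", "updates", "someone@example.org")

print(ET.tostring(to_dedicated, encoding="unicode"))
print(ET.tostring(to_personal, encoding="unicode"))
```

In both cases a node is identified by the pair of service address and node identifier, which is why the same node name can exist at both addresses.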

Node organization

In the context of federating social networks, a service needs to
decide where to put the nodes it wants to allow other entities to
subscribe to and send out notifications from. In some cases it makes
sense to keep nodes at user accounts, though in some other cases it
is better to provide the nodes at the domain of the service itself.
This depends on the nature of the social objects and the
subscribable unit you provide. Let's explore some use cases.

Jaiku

In Jaiku, social
objects (microblog posts and aggregated items like photos,
bookmarks, etc.) are organized in streams. Streams are tied to
either a user or a channel, and don't change ownership. The
social objects themselves are static: once created, they cannot be
edited. They can have comments associated with them, but those
also cannot be edited. The only thing that can happen to streams,
stream items, and comments is deletion.

Here, it makes sense to have a node for each stream, and
possibly a node for the comments to each stream item. Those can
be tied to the owner's JID (e.g.
ralphm@jaiku.com or
#jabber@jaiku.com). Another possible node could
be: all comments by a person. Another node an entity might want to
subscribe to is: all public microblog posts. Such a node would be
associated with the domain of the service rather than any
particular user's JID.

anyMeta

The company I work for, Mediamatic Lab, has a
(proprietary) CMS called anyMeta. Instead of
'content', the C in CMS here stands for Community, to highlight
the social network properties it provides. anyMeta is a highly
semantic system that deals in things (a
person, an article, an event, a blog), and
edges (the relations between things, each
with a predicate like friend-of, author-of, etc). I mainly work on
federating instances of anyMeta.

Things in anyMeta are usually editable, so it makes sense to
want to keep informed about changes. For example, an article can
have a large number of edits, and a person might move, change
employers or have other changes to his profile. Thus, we chose to
at least provide each thing as a subscribable unit. Upon creating a
thing, a new node is created, and a representation of the thing is
published to the node. Editing a thing results in subsequent
publishes. Subscribers will receive notifications as the node gets
published to.
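
The lifecycle described above can be sketched with a toy in-memory service. This is only an illustration under stated assumptions: the class and function names are hypothetical, the node name is invented, and a real deployment would talk to an XMPP publish-subscribe service instead.

```python
from typing import Callable, Dict, List

Callback = Callable[[str, dict], None]


class InMemoryPubSub:
    """Toy stand-in for a publish-subscribe service (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = {}

    def create_node(self, node: str) -> None:
        if node in self._subscribers:
            raise ValueError(f"node {node!r} already exists")
        self._subscribers[node] = []

    def subscribe(self, node: str, callback: Callback) -> None:
        self._subscribers[node].append(callback)

    def publish(self, node: str, item: dict) -> None:
        # Every current subscriber is notified of the published item.
        for callback in self._subscribers[node]:
            callback(node, item)


def create_thing(service: InMemoryPubSub, node: str, representation: dict) -> None:
    """Creating a thing creates its node and publishes a first representation."""
    service.create_node(node)
    service.publish(node, representation)


def edit_thing(service: InMemoryPubSub, node: str, representation: dict) -> None:
    """Each edit is simply another publish to the same node."""
    service.publish(node, representation)


service = InMemoryPubSub()
received = []

create_thing(service, "thing-node", {"title": "Draft"})
service.subscribe("thing-node", lambda node, item: received.append(item))
edit_thing(service, "thing-node", {"title": "First edit"})
edit_thing(service, "thing-node", {"title": "Second edit"})
```

The subscriber joined after the thing was created, so it only sees the two edits. Whether a late subscriber also gets the last published item is a property of the real service and is not modeled here.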

We organized the nodes in a flat namespace, tied to a domain,
rather than a user. One reason is that the owner of any particular
thing might change. Tying a node to the first owner, and then
needing to move it when the owner changes, is cumbersome.

Node naming

Each node has an identifier that is unique within the
publish-subscribe service holding it. So you could have two nodes
named updates tied to two different users. Node
identifiers are opaque; one should not derive meaning from how the
node identifier looks. Embedded slashes might suggest some
hierarchy, for example, but an application should not assume that
such a hierarchy actually exists.
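
A small sketch of both properties, assuming a hypothetical registry that keys nodes on the pair of service JID and node identifier. The second user JID is made up for the example.

```python
class NodeRegistry:
    """Tracks nodes by (service JID, node identifier); hypothetical helper."""

    def __init__(self) -> None:
        self._nodes = set()

    def create(self, service_jid: str, node: str) -> None:
        key = (service_jid, node)
        if key in self._nodes:
            raise ValueError(f"{node!r} already exists at {service_jid}")
        self._nodes.add(key)

    def exists(self, service_jid: str, node: str) -> bool:
        # Identifiers are compared as whole strings and never split on '/'.
        return (service_jid, node) in self._nodes


registry = NodeRegistry()

# The same identifier at two different services is not a conflict.
registry.create("ralphm@jaiku.com", "updates")
registry.create("someone@jaiku.com", "updates")

# A slash in the identifier does not imply that a parent node exists.
registry.create("ralphm@jaiku.com", "presence/123456")
parent_exists = registry.exists("ralphm@jaiku.com", "presence")
```

Here parent_exists is False: the registry treats presence/123456 as a single opaque string, which is also how a consuming application should treat it.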

That said, it makes perfect sense to use logical, human readable
identifiers for nodes. They might even be very similar to the URI
layout of the service's web site. Let's check what one could do for
the examples given above.

Jaiku

It makes sense to have the node identifier for the regular
posts (called presence) be presence and the
nodes for the individual posts (with comments)
presence/123456, where the number is the same
as used in the web page for that post. Those two examples could be
tied to a JID representing me at Jaiku:
ralphm@jaiku.com.

The node for all public posts could be called
explore and located at the JID of the whole
service: jaiku.com. This would be similar to
the web site, where all public posts can be viewed at http://jaiku.com/explore.

It might also make sense to have a dedicated node for a user's
profile information, that can be retrieved and presented at a
service or application that consumes the social object updates. At
least a (full) name and some icon or headshot would be nice to
have there. Obviously, subscribing to such a node would mean that
future profile changes will also propagate to the consuming
entities. An example identifier would be
profile, to be kept at the user's JID.
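
Putting the Jaiku suggestions together, a naming scheme along these lines could be sketched as follows. The function names and dictionary labels are assumptions for illustration only; the node identifiers and JIDs are the ones proposed above, not something Jaiku is known to provide.

```python
from typing import Dict, Iterable, Tuple

SERVICE_DOMAIN = "jaiku.com"

# A node address is the pair (JID holding the node, node identifier).
NodeAddress = Tuple[str, str]


def user_nodes(nick: str, post_ids: Iterable[int]) -> Dict[str, NodeAddress]:
    """Nodes kept at a user's JID (hypothetical layout)."""
    jid = f"{nick}@{SERVICE_DOMAIN}"
    nodes = {
        "stream": (jid, "presence"),
        "profile": (jid, "profile"),
    }
    for post_id in post_ids:
        # Same number as used in the web page for that post.
        nodes[f"post {post_id}"] = (jid, f"presence/{post_id}")
    return nodes


def service_nodes() -> Dict[str, NodeAddress]:
    """Nodes kept at the JID of the whole service (hypothetical layout)."""
    return {"all public posts": (SERVICE_DOMAIN, "explore")}


ralphm = user_nodes("ralphm", [123456])
public = service_nodes()
```

A consumer would still treat each identifier as opaque; the readable layout is a convenience for humans, not a contract.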

anyMeta

In anyMeta, each thing has an identifier that could be used
for the node identifier as well. However, in the current
implementation, all nodes are held by a loosely coupled, generic
publish-subscribe service that caters to multiple anyMeta instances.
We chose to use unique identifiers as generated by the
publish-subscribe service, which don't have any relation to the
thing identifier.
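
One way to picture this is a lookup table on the consuming side that remembers which generated node belongs to which thing. The sketch below is an assumption, not the actual anyMeta or Idavoll code: the class name is hypothetical, and the UUID-based factory only imitates a service that hands out identifiers on its own.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple


class ThingNodeMap:
    """Remembers the service-generated node for each thing (hypothetical)."""

    def __init__(self, id_factory: Optional[Callable[[], str]] = None) -> None:
        # Stand-in for the publish-subscribe service generating identifiers.
        self._id_factory = id_factory or (lambda: f"generic/{uuid.uuid4()}")
        self._node_for_thing: Dict[Tuple[str, int], str] = {}

    def node_for(self, instance: str, thing_id: int) -> str:
        """Return the node for a thing, asking for a new one on first use."""
        key = (instance, thing_id)
        if key not in self._node_for_thing:
            self._node_for_thing[key] = self._id_factory()
        return self._node_for_thing[key]


nodes = ThingNodeMap()

# Two instances may both have a thing 42; the instance names are invented.
first = nodes.node_for("site-a.example", 42)
second = nodes.node_for("site-b.example", 42)
again = nodes.node_for("site-a.example", 42)
```

Because the node identifier carries no thing identifier, two instances sharing the service cannot collide, and the mapping has to be stored explicitly.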

As you might have guessed, some of the stuff being discussed
here has already been implemented in anyMeta. The
publish-subscribe service used is Idavoll. It has grown an HTTP
interface that is used (internally) to create new nodes,
publish items that represent things, and subscribe to, and receive
notifications from, remote publish-subscribe nodes. The thing that
holds my Mediamatic profile is represented by the node
generic/4efe2253-2242-4e01-bfdf-957cc2a9481d at
pubsub.mediamatic.nl. All things in this site,
but also the PICNIC site, have nodes like this. In a future post
I will explore what we do with these nodes.

In this part, we explored how one could organize the nodes that
entities can subscribe to to get updates. Some might be tied to the
(virtual) JID of the user's account, or associated with the JID of the
service itself. Node identifiers might be guessable by humans and
resemble the web URIs, or could be seemingly random opaque strings.
Implementations that subscribe to, and consume notifications
from, the nodes at social networking services should not assume
anything about the organization and naming of the providing service.
This presents a challenge for the next episode: how does one know
which nodes are there and what they are called? So, up next:
discovery. Homework assignment: look carefully at the HTML of my
Mediamatic profile page.

Saturday, 26 July 2008

The use of XMPP publish-subscribe in federation and third-party
applications deviates a bit from the standard use-case. Usually
publishing, subscribing and receiving notifications happen through
the same protocol on specific (leaf) nodes. Entities subscribe to a
node that represents a particular thing they are interested in
getting updates for, and when an item is published to that node,
these subscribers will receive a notification for that item.

For federating social networks, the focus is on the exchange of
updates on social objects or comments between services. For
third-party applications, the most important thing is getting
updates, preferably as soon as possible. So, for both of those use
cases, receiving notifications through XMPP has an edge over
HTTP: no polling, lower latency, fewer connections.

How these items are published does not really matter that
much. What you will typically see is that services somehow have a
new item available (submission via the web, SMS, e-mail or a
web-based API) and want to expose that through XMPP. Posting a new
update through XMPP from a third-party client usually does not
provide an advantage over existing web-based APIs.

For a service like Jaiku, Twitter or Identi.ca to provide XMPP
publish-subscribe support, it is important to define the
subscribable unit and provide that as a
node. Such a node will usually not be published to directly, but is
more of an aggregate node. Examples would be: all updates by a
particular user, all updates in a particular channel, all updates by
a user and his contacts, all public updates. Another example could
be: all comments on a particular social object.

Conceptually, all such aggregate nodes are internally
subscribed to a particular subset of new and updated social objects
and comments. You might even implement it exactly like that. Think
of a prospective search that is captured by a node: every time a
new item comes into the service, it is determined which of the
provided nodes would be a match for this item, based on author,
contact lists and permissions. Subsequently, for all of those
nodes, a notification will be sent out to its subscribers. Telling
items apart in this scenario is then likely not done using the
service JID, node identifier or item identifier, but using some
identifier in the payload, like Atom's id
element, although those other identifiers might provide a
context.
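
The prospective-search idea can be sketched as a set of predicates, one per aggregate node, evaluated against every incoming item. Everything named here is an assumption for illustration: the node labels, the contact list, the example author and the public flag that stands in for permissions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Item:
    atom_id: str  # identifier in the payload, like Atom's id element
    author: str
    public: bool  # simplified stand-in for permissions


Predicate = Callable[[Item], bool]

# Hypothetical contact list of the user ralphm.
CONTACTS: Set[str] = {"alice"}

AGGREGATE_NODES: Dict[str, Predicate] = {
    "updates by ralphm": lambda item: item.author == "ralphm",
    "updates by ralphm and contacts": (
        lambda item: item.author == "ralphm" or item.author in CONTACTS
    ),
    "public updates": lambda item: item.public,
}


def matching_nodes(item: Item) -> List[str]:
    """Return every aggregate node whose 'search' matches the new item."""
    return [name for name, matches in AGGREGATE_NODES.items() if matches(item)]


class Consumer:
    """Subscriber to several nodes that tells items apart by payload id."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.items: List[Item] = []

    def notify(self, node: str, item: Item) -> None:
        if item.atom_id in self._seen:
            return  # same item already arrived through another node
        self._seen.add(item.atom_id)
        self.items.append(item)


item = Item(atom_id="tag:example.org,2008:item-1", author="alice", public=True)
matched = matching_nodes(item)

consumer = Consumer()
for node in matched:
    consumer.notify(node, item)
```

The item matches two nodes, so a consumer subscribed to both receives two notifications but keeps one item, because it compares the identifier in the payload rather than the node it arrived through.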

For those familiar with the concept of XMPP publish-subscribe
collection nodes: those would be a special form of aggregate nodes
that make it explicit what their relationship to the nodes they
aggregate items for is.

The major topics on the 5th XMPP Summit were Jingle, and XMPP
as a complementary protocol next to HTTP for building social
networking services, as stpeter briefly mentioned. While I think
that the consensus on OAuth over XMPP was very important, I think
we also settled on a good set of best practices for federating
social networks using XMPP Publish Subscribe.

This particular topic has had my full attention over the last
year or so, and it is about time that I start writing about it,
explaining the aforementioned best practices in their context. As
this covers a lot of ground, I'd like to make a series out of it,
each detailing a particular aspect.

Topics that will come by include: the subscribable
unit and how notifications are generated, payload
formats, discovery, local representation and implementation
strategies.