Date: Fri, 27 Apr 2007 09:37:23 +0200
From: "David Nuescheler"
To: users@jackrabbit.apache.org
Subject: Re: importing jackrabbit into jackrabbit
Hi Alessandro,
thanks a lot for your thoughtful mail.
I think you hit the nail right on the head.
> I think that the main problem is not really about the specific case,
> but in general that when people design relational databases, they
> always use references (or more properly, joins) to define data that
> belongs logically to many entities, but should not duplicated.
I completely agree with your statement.
And I think this is one of the biggest challenges that we are
going to face.
People are thinking within the facilities provided by a relational
database and within the data modeling practices that they have
been using for decades now. Which is very understandable.
A content repository offers much richer facilities for content modelling,
primarily through features like a hierarchy, multi-value properties or
even sorted children, which in an RDBMS world have
to be modeled by the application developer.
> Imagine that you have a company tree, with "positions",
> "departments", "employees", "health plans" etc.
> An employee could belong to a department, have a position and an
> health plan, but typically you would not make all those nodes child
> nodes of the employee: you would instead define references to the
> proper node in the "position" and "health plan" subtrees.
I think one-to-many relationships should be modeled as a hierarchy.
So my initial gut feeling would be a datamodel like this:
/bigco
/bigco/marketingdept
/bigco/marketingdept/joeshmoe
and "joeshmoe" would be of nodetype
[bigco:employee]
- position
- healthplan
Now "position" and "healthplan" are many-to-many relationships.
I think that those can either be modeled as references, paths,
names or strings.
People that come from a "hard structured" RDBMS background
very often think that a reference is the only option.
For example "position" might very well be a "string" or a "name"
if the application can deal with the fact that information is "dangling".
If we continue to model the above tree with...
/bigco/positions/
/bigco/positions/secretary
/bigco/positions/svp
... I think I would personally choose to store a human-readable
"string" property that is actually the name of the target node in
/bigco/positions.
So I would store "svp" or "secretary" in the position property.
Since I would not use namespaces for the names of the children
in "positions", I would not need the overhead of a true name property in
my employee node.
While this probably rubs a lot of "structure first" people the wrong
way, I prefer this model since the information carried in the
string "secretary" is still valuable even if it is "dangling"
(as opposed to some UUID).
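To make the "dangling string" idea concrete, here is a minimal sketch in plain Java. It does not use the JCR API: the class PositionLookupSketch and its in-memory map are hypothetical stand-ins for a repository, and only the paths and the "position" property name come from the example above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for a repository; this is NOT the JCR API.
class PositionLookupSketch {
    // Maps an absolute node path to that node's string properties.
    private final Map<String, Map<String, String>> nodes = new HashMap<>();

    void addNode(String path, Map<String, String> properties) {
        nodes.put(path, new HashMap<>(properties));
    }

    void removeNode(String path) {
        nodes.remove(path);
    }

    // Reads the plain string stored in the employee's "position" property.
    Optional<String> positionName(String employeePath) {
        Map<String, String> props = nodes.get(employeePath);
        return props == null
                ? Optional.empty()
                : Optional.ofNullable(props.get("position"));
    }

    // Resolves the stored name against /bigco/positions.
    // Returns empty when the value is dangling (no such child node).
    Optional<String> positionPath(String employeePath) {
        return positionName(employeePath)
                .map(name -> "/bigco/positions/" + name)
                .filter(nodes::containsKey);
    }

    public static void main(String[] args) {
        PositionLookupSketch repo = new PositionLookupSketch();
        repo.addNode("/bigco/positions/secretary", Map.of());
        repo.addNode("/bigco/marketingdept/joeshmoe", Map.of("position", "secretary"));

        repo.removeNode("/bigco/positions/secretary");
        // The target is gone, but the stored string is still meaningful.
        System.out.println(repo.positionPath("/bigco/marketingdept/joeshmoe").isPresent());
        System.out.println(repo.positionName("/bigco/marketingdept/joeshmoe").orElse("?"));
    }
}
```

The point of the sketch is that the application, not the repository, decides what to do when the lookup comes back empty.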
There certainly are use cases where referential integrity is very
important, but it is important to understand that it comes at a price,
both in performance and, even more importantly, in the flexibility of
your applications from a "data-first" perspective.
> What could be the right way to model things? Maybe using a "path"
> property to point to the node instead? Of course, it would not be as
> easy to use as a reference, and it would be requiring global updates
> if the pointed node ever change position, but I can't see other options.
If you would like to protect against "move" operations but want to avoid
the overhead of referential integrity, you can store the UUID of the target
in a string property. In JSR-283 we are looking at a "weak reference" to
express, in a more formal way, a reference that can dangle.
> It's easy to see how, in a large company, there could be thousands of
> employee holding the same position and health plan, and those
> specific nodes ("Secretary" and "Plan A") would have thousand of
> references pointing to them.
> So, given the issue as explained by Marcel that "whenever a
> reference is added that points to a node N the complete set of
> references pointing to N is re-written to the persistence manager",
> it seems that using references to a node that is very "popular" is
> really going to be creating problems in the long term.
Agreed. And I think we will not be able to re-educate everybody with
an RDBMS background before they use Jackrabbit, so I think Jackrabbit has
to be able to deal with very large quantities of references in a very
efficient way.
So I would recommend fixing that as noted by Tom in the last sentence of:
http://issues.apache.org/jira/browse/JCR-657
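To give a feeling for the cost Marcel describes, here is a back-of-the-envelope sketch in plain Java. It only counts reference entries written, under the assumption that each added reference rewrites the complete set for the target node. The appendOnly method is a hypothetical baseline for comparison, not the change proposed in the issue.

```java
// Back-of-the-envelope cost model; it does not measure Jackrabbit itself.
class ReferenceRewriteCost {

    // Total entries written when adding references one by one and each
    // addition rewrites the complete set pointing to the target node.
    static long fullRewrite(int references) {
        long written = 0;
        for (int size = 1; size <= references; size++) {
            written += size; // the whole set, now `size` entries, is written again
        }
        return written;
    }

    // Hypothetical baseline: every addition writes only the new entry.
    static long appendOnly(int references) {
        return references;
    }

    public static void main(String[] args) {
        int employees = 1000; // e.g. a thousand employees pointing to "Secretary"
        System.out.println(fullRewrite(employees)); // 500500
        System.out.println(appendOnly(employees));  // 1000
    }
}
```

Under that assumption the total work grows quadratically with the number of references to a single popular node, which is why the problem only shows up in the long term.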
regards,
david