Comments

To me, the hard problem here has always been how to share and transfer the metadata between users and machines. For instance, suppose I am using more than one machine: would the Nepomuk information created on one machine even make sense on the other? Would it be possible to sync the databases between machines? Is it possible to distill a meaningful, privacy-filtered subset of the data and send that to a friend? What happens when I reorganise all of my data using non-Nepomuk-aware tools?

I see in the article that there are some plans to address these issues, but it seems they have not been solved so far. Any thoughts?

I hadn't thought about it, but the problem you bring up deserves some thought. I don't know for sure, but I assume that a non-Nepomuk-enabled desktop won't support tags made with a Nepomuk desktop. Perhaps if Nepomuk is brought to all of the free desktops, then at least we can be assured that it would work on all the free operating systems.

In a way, KDE is the first large-scale test. If it works out, perhaps it'll be adopted elsewhere (GNOME, Xfce... perhaps even a proprietary OS like OS X?).

Actually, as far as this goes, I was fairly certain that OS X already had a framework in place for supporting arbitrary metadata. I'm not sure of the details, but a friend of mine was talking about it at length one night.

RDF/XML is one format for it. Internally, it's probably going to look more like Notation3 (the N3 format): just a list of "triples", lines like "uri1 relationship uri2". For instance, you might declare relationships like "http://x/y photographer_of http://a/z", "googleearth://postcode location_of ipinfo://yourserver", or "http://hongkonggenerics.com manufacturer_of companyservers://missioncriticalserver1".
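A minimal sketch of the idea in code, reusing the example URIs and relationship names from above (none of this is real Nepomuk API, just the bare data model):

```python
# A triple store is conceptually just a set of (subject, relation, object) rows.
triples = {
    ("http://x/y", "photographer_of", "http://a/z"),
    ("googleearth://postcode", "location_of", "ipinfo://yourserver"),
    ("http://hongkonggenerics.com", "manufacturer_of",
     "companyservers://missioncriticalserver1"),
}

# N3-flavoured serialisation: one "subject relation object" line per triple.
for s, r, o in sorted(triples):
    print(s, r, o)
```

The point is that the on-disk or on-the-wire format can stay this simple: three URIs per line, nothing more.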

As an earlier post said, this is pretty much perfect for exporting/importing/otherwise sharing info. You can easily create queries based on this data, like "? photographer_of ?" to get a list of all photographers, or "? photographer_of http://companyserver/publicphotos/*" to get a list of all photos published by your company. Then you just need to provide that list to others in some way. Depending on how it's implemented, it might also be possible to mark certain namespaces as private but make the rest available, so that anything referring to objects such as "myborrowedmp3collection://*" or "topsecretprojects://*" or just "smb://" gets filtered, but everything else is made available. Likewise, and probably more safely, the opposite could be true, with only explicitly public namespaces made available.
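Here's a rough sketch of both ideas, wildcard queries and namespace filtering, over plain Python tuples. The triples and the private-prefix list are made up for illustration; a real store would of course index rather than scan:

```python
# Hypothetical triples; subjects, objects, and prefixes are invented examples.
triples = [
    ("http://alice.example/me", "photographer_of",
     "http://companyserver/publicphotos/1.jpg"),
    ("http://bob.example/me", "photographer_of", "smb://internal/secret.jpg"),
    ("myborrowedmp3collection://track7", "performed_by", "http://artists/someone"),
]

PRIVATE_PREFIXES = ("smb://", "myborrowedmp3collection://", "topsecretprojects://")

def match(triples, subject=None, relation=None, obj_prefix=None):
    """'?' queries: any argument left as None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj_prefix is None or t[2].startswith(obj_prefix))]

def public_only(triples):
    """Drop any triple that mentions a private namespace on either side."""
    return [t for t in triples
            if not any(part.startswith(PRIVATE_PREFIXES) for part in (t[0], t[2]))]

# "? photographer_of ?" -> all photographer triples (two here)
photographers = match(triples, relation="photographer_of")
# Privacy-filtered export: only the public photo triple survives.
exportable = public_only(photographers)
```

The filtering step is exactly the "distill a privacy-filtered subset" operation the first comment asked about: decide on a prefix policy, then export only what passes it.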

Interestingly, let's say you have a KDE IO plugin that understands URIs with unique hashes, and dereferences those to the appropriate files: something like "md5://number". By publishing this on some shared site (say nepomuk_repository.kde.org), every KDE user with that file could automatically gain all the (non-filtered, public) tags of information that any other participating KDE user contributes. So, some KDE user in Taiwan might set a song attribute such as "amarok://performed_by amarok://artist/Sarah McLachlan", and everyone else's desktop would suddenly know this.
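The key property is that the URI is derived from the file's content alone, so two users who hold the same bytes derive the same URI independently. A toy sketch (the "md5://" scheme and the repository mapping are hypothetical, not anything Nepomuk actually ships):

```python
import hashlib

def content_uri(data: bytes) -> str:
    # Hypothetical "md5://" scheme: the URI depends only on the bytes,
    # so any user with the same file computes the same URI.
    return "md5://" + hashlib.md5(data).hexdigest()

song = b"...ogg vorbis bytes..."
uri = content_uri(song)

# A shared repository is then just a mapping from content URI to public triples.
repository = {
    uri: [(uri, "performed_by", "amarok://artist/Sarah McLachlan")],
}

# Anyone else holding the same file derives the same URI and finds the tags.
tags = repository[content_uri(song)]
```

No coordination is needed to agree on identifiers: the content is the identifier, which is what makes a global opt-in tag repository even conceivable.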

For general queries, let's assume Wikipedia takes up the (already very functional) Semantic MediaWiki extension at some point. Then it'll be possible for your desktop to ask Wikipedia for all sorts of complicated information, like "countries with a population of more than 1,000,000 but fewer than three internet providers", or, for a more basic Unix utility, "languages that include the characters X, Y, Z, but not A". Or a person in need of medical help might consult a national medical database, along with a blog site, asking for "doctors within coordinates A,B and C,D who specialise in earache and whom no one has called a sadist". Within an organisation, lots of useful queries, like "people working on project X who work over lunch", would be possible.

No one's saying the file format (be it XML/RDF, N3, CSV, or something else) is revolutionary (although, in the relative simplicity of N3/RDF, they do make some advances, I suppose). The trick is in taking all these information sources, combining them into a huge database of triples that performs well, and designing the right queries, the right interfaces, the right amount of sharing, and the right security features, so that your desktop "knows" more than it used to, and can work with other systems that know more than they used to, without being bogged down by the terabytes of new data we're soon going to be using for this.

Of course, this all depends on your own/others' ability to organise information, but it's all coming together, from other projects online. This WILL take off, and it will almost certainly be the REAL Web 2.0, that people actually notice, like they noticed Web 1.0. KDE *must* be part of that, and I'm very glad to see it's going to be there.

I DO hope KDE's/NEPOMUK's not going to be limited to simple things like tagging and searching files though, much as I want to see KDE have those features. At the very least, I'm hoping to see what GNOME's (now abandoned, for some insane reason) hint-based system did: let applications actually share knowledge in real time, like "user is working with a document that has subject X" and "Oh, I have files related to subject X". It's unclear whether NEPOMUK will actually allow the kind of things described above. The technology certainly does, though, and Nepomuk is claiming to advance it, as I understand things.

Metadata you can't search is completely useless... that's the point of Nepomuk, isn't it? Finding and connecting stuff through metadata. Therefore you need a central index; an index scattered through the whole filesystem is useless. That's why everyone is working on something like Strigi...

>That only works if you want to find the metadata of a file. What about the other way around? I want to find every file I received by email.

That's why you have the same data stored both in the file and in the database. Having the metadata in every file means that all applications automatically keep it intact without modification; having it in a database allows fast searching.
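The dual-storage idea can be sketched like this. The per-file dict stands in for metadata embedded in the file itself, and the index is the rebuildable search structure; all the names here are invented for illustration:

```python
# path -> (content, metadata embedded inside the file itself)
files = {
    "/home/me/photo.jpg": (b"...jpeg bytes...", {"tag": "holiday"}),
}

# Central index (tag -> set of paths) built from the embedded metadata.
def build_index(files):
    index = {}
    for path, (_content, meta) in files.items():
        index.setdefault(meta["tag"], set()).add(path)
    return index

index = build_index(files)

# A non-metadata-aware tool copying the file carries the embedded tag along...
files["/mnt/usb/photo.jpg"] = files["/home/me/photo.jpg"]

# ...so the index can always be rebuilt from the files themselves.
rebuilt = build_index(files)
```

This is the whole argument in miniature: the embedded copy survives dumb tools, and the database copy is disposable because it can be regenerated by a scan.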

>It's quite simple: if you move the file you break the index. The index needs to be updated every time a file is moved.

But the database breaks every time you move a file anyway, even if no metadata is stored in the file.

The database has to include the location of every file: when you search for files based on metadata, you presumably want Strigi to tell you where to find them. That means the location of each file has to be in the database and updated every time a file moves.

Why have "files" in the first place? Why "copy/move them around"? They are just sequences of bytes. Why should I have a file manager? Isn't that what we want to replace? The only reason is different physical computers on a network, but we could imagine even that becoming irrelevant at some not-so-distant point in the future.

But this doesn't solve the whole problem. If you move a file you still have to change the index; otherwise you could only find the old location of the file. So you don't gain much.

So if you have to change the index every time a file moves anyway, there is no real gain from storing anything with the file.

Also, you don't need a filename or an ID to track files. Look at modern version control systems like Monotone or Git: the identity of a file isn't an ID or a name, it's the content. So use a hash; that would automatically solve all the copy problems.

The only remaining problem would be tools that alter the file somehow. That should be solved by Nepomuk integration into all applications. For legacy apps you could store the location of a file too: if you overwrite a file, the index should automatically "transfer" the metadata, unless told otherwise through the Nepomuk API.
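A toy index along these lines, keyed by content hash with the path kept as a fallback for legacy apps (the function names and tag strings are hypothetical):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

metadata = {}   # content hash -> set of tags
locations = {}  # path -> content hash (fallback for non-Nepomuk apps)

def tag(path, data, tags):
    metadata.setdefault(h(data), set()).update(tags)
    locations[path] = h(data)

def legacy_overwrite(path, new_data):
    """A legacy app rewrote the file in place: since the path is unchanged,
    transfer the old content hash's tags to the new content hash."""
    old_hash = locations.get(path)
    new_hash = h(new_data)
    if old_hash is not None and old_hash != new_hash:
        metadata.setdefault(new_hash, set()).update(metadata.pop(old_hash, set()))
    locations[path] = new_hash

tag("/music/song.ogg", b"v1 bytes", {"genre:ambient"})
# A copy or move changes nothing: same bytes, same hash, same tags.
legacy_overwrite("/music/song.ogg", b"v2 bytes")
```

Note how copies and moves cost nothing (the hash doesn't change), while the one fragile case is exactly the in-place rewrite, handled here by the path fallback.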

So with this in place, the only scenario that could break the data-metadata relationship would be legacy applications (apps without Nepomuk support) which create "copies" of files with new content (like converting images). But that's a case you can't do anything about.

The filesystem _is_ a database. A metadata-supporting filesystem can maintain its own indexes. Why put the file-metadata relationship at such a high level if you don't need to? There might be considerable space overhead, but with 1 TB hard disks soon becoming mainstream, this should not be a big problem.

If you put this indexing responsibility at the filesystem level, you get automatic, default Nepomuk support for low-level commands like cp and mv.

If you want this information to 'cross over' to non-metadata filesystems, you can use higher-level tools. I could see a project like BasKet fitting such a role, for example.

Regarding hashes to bind relationships: I think this is not so useful at the filesystem level. Reading the full content of a couple of ISO files or a large MP3 collection just to get the hashes seems a little inefficient to me. And a hash still isn't as uniquely identifying as a URI.

I'll be curious to see how it turns out. WinFS was supposed to have a metadata based filesystem, but WinFS is vaporware at this point. If KDE beats Microsoft to the relational/metadata/integrated-search desktop, I think a lot of businesses might suddenly become interested. I haven't gotten a chance to try Nepomuk, but I really like Strigi - it's freaking fast (compared to Beagle, which I tried previously) and it doesn't have security holes like Google Desktop Search (which I haven't tried, because of the constant "A new zero-day hole has been found in Google Desktop!" stories on Slashdot).

The filesystem argument is a good one: isn't it all about filesystems? I think distributors need to think more about filesystems. For our home partitions, a crypto filesystem should be standard. I don't know whether user-space solutions make much sense.