I feel like I've been asked this question enough times that I should have an exciting answer by now. But well, I wrote JuK and TagLib as well as a couple of other small applications in KDE CVS and do some work on a handful of things in kdelibs and elsewhere across KDE.

What kind of search capabilities do you think a modern desktop should have?

Well, I think I'd like to step back a bit first and look a little at the problem — and the problem isn't a lack of a search tool, the problem is that it's hard to find things. Search tool or no, all of the ideas flow from the idea of solving the problem rather than just creating a new tool. So, in a sense, I don't think a modern desktop should have a search tool; I think a modern desktop should make it easy to find stuff — we're then left with how to get there.

And I suppose with all of the buzz around search tools these days people have a much more concrete idea in mind when they hear about searching on the desktop. But such wasn't the case when I started kicking these ideas around a while back. Spotlight was announced a few days after I'd submitted my abstract for the KDE Developer's conference, Beagle was relatively low profile, Google for the Desktop and its successors hadn't entered the scene yet, etc.

So, I think — fundamentally "what sort of search should the desktop have" is almost the wrong question. "How should we make it easier to work with the data we accumulate on the desktop?" is closer to the right question. I think search is just part of the answer.

Where did the idea of integrating a search capability throughout KDE come from?

Well, a few things actually. It mostly came from not being able to find things and asking some fundamental questions about how we organize and access information on the desktop. The first step — and this is tied up with the first part of the name of both this talk (which is related to the one that I gave at Linux Bangalore) and the one at the KDE conference this summer — is that hierarchical interfaces simply don't make sense in a lot of cases.

When I started looking around for examples of how this had played out in other domains of information, the most obvious example was the World Wide Web, where we've already moved from hierarchical interfaces to search based interfaces. It seemed logical that we could learn from that metaphor.

On the technical side of things I'd just written the listview search line class (used in JuK) that's now fairly prevalent in KDE that makes filtering of information in lists much easier, so that played into things too.

What do you think of other search tools such as GNOME's Beagle and Google's Desktop Search?

Well, they're fundamentally different in scope. Again, right now the term "desktop search" actually means something; that wasn't really true when I started working on these ideas this summer. So while there are some things in common, they're really pretty different approaches.

Beagle, Spotlight, Google for the Desktop, and their relatives are more interested in static indexing and search through that information. That's kind of where I was at conceptually early this summer when I coded the first mock-up. Since then however the ideas have moved on quite a bit and I think we've actually got something rather more interesting up our proverbial sleeves. (I should note however that I think the Beagle group is doing fine work, but it's something pretty different from what I'm interested in.)

The first difference is that this is a framework, not a tool. Beagle has some elements of this, but it's still not integrated into the core of the desktop. Google for the Desktop is mostly just a standalone tool from what I know of it. Honestly I think it's really below the level of innovation that I tend to expect from Google.

What we're now looking for in the KDE 4 infrastructure is a general way of linking information and storing contextual information — that information can come from meta-data, usage patterns, explicit relationships and a host of other places.
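To make that concrete, here is one rough way a store of typed links between pieces of information might look. The schema and every name in it are invented for this sketch (this is not the actual KDE 4 design), and an in-memory SQLite database is used purely for illustration:

```python
import sqlite3

# Invented schema: everything is a node; context is a typed link
# between two nodes. SQLite in-memory is used only for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, kind TEXT, value TEXT);
    CREATE TABLE links (source INTEGER, target INTEGER, relation TEXT);
""")

# A file and a note about it become two nodes joined by a typed link.
db.execute("INSERT INTO nodes VALUES (1, 'file', '/home/user/report.odt')")
db.execute("INSERT INTO nodes VALUES (2, 'note', 'draft for Monday meeting')")
db.execute("INSERT INTO links VALUES (1, 2, 'annotated-by')")

# "What context hangs off this file?" is then a simple join.
rows = db.execute("""
    SELECT n.kind, n.value
    FROM links l JOIN nodes n ON n.id = l.target
    WHERE l.source = 1
""").fetchall()
print(rows)  # [('note', 'draft for Monday meeting')]
```

The point of the node/link split is that metadata, usage patterns, and explicit relationships can all land in the same graph without a fixed schema per data source.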

There won't be a single interface to this set of contextual information; we'll provide some basic APIs for accessing the components in KDE applications, but we're quite interested in seeing what application authors will think to do with it. Really I think they'll surprise us.

We're looking at everything from reorganizing KControl to make search and related items and usage patterns more prevalent to annotating mails or documents with notes to reworking file dialogs. Really the scope is pretty broad.

Do you think Free Software solutions from KDE and GNOME can compete with the likes of Google and Microsoft?

Sure. I mean — I don't think the ability to compete with commercial players is significantly different with desktop search than it is with other components of the desktop. And honestly I think we've kind of got a head start here.

Has there been any progress on planning or coding search into KDE yet? Is anyone helping you? What problems are you facing?

There have been a number of cycles through some API and database design sketches. But right now we tend to write code, and as soon as it's done we realize the flaws in it and start rewriting. This will probably continue for a while, but I think we'll be able to have something pretty useful in KDE 4.

There are a number of folks involved in discussion of these issues from various sub-projects inside of KDE. Thus far it's been mostly myself and Aaron Seigo banging on the API, but others have contributed to the discussions.

I think the biggest problem that we're dealing with is moving from the abstract set of ideas that we're working with into real APIs — trying to keep things general enough to stay as extensible as we'd like them to be, but not so lofty that they're convoluted and useless.

What technologies do you plan on using, e.g. which database?

Well, we've gravitated towards Postgres, but mostly because of licensing. Other than that, well, uhm, we're using Qt. The Qt 4 SQL API seems much improved, so I've kind of been mentally stalling on really finishing up the current code until I can just work with that since otherwise everything would just have to be rewritten in a few weeks.

Is the KDE search tool likely to be cross-desktop compatible so we could have a common base with GNOME?

Well, again, this really isn't about a "KDE search tool" -- and the chances of it being GNOME compatible out of the box aren't particularly high. That said, as the data store will just be a Postgres database and ideally we won't have to use too many complex serialized types, there wouldn't be a reason that a GNOME frontend couldn't be written. But generally speaking I'd like to get the technology laid down and then see if we can convince others to adopt it rather than the other way around.

What does the project need most now?

Time. And I mean that in a few ways — we need time to finish fleshing out the ideas, time to let the stuff mature inside of KDE and well, the couple of us working on it could use more time for such. But really as most of the framework for things like metadata collection and whatnot are already inside of KDE this won't be a huge project from the framework side. What will take a good while will be porting over applications to use it where appropriate.

Has anyone looked at Derby? http://incubator.apache.org/derby/
It is licensed under the Apache license, supports ANSI SQL, is portable (pure Java), has a small footprint, and is supposedly very easy to use. So it seems to fit the bill perfectly.

Definitely agreed. Also note that Beagle uses a C# port of Lucene. Lucene's index format is well documented, and there are also ports to languages like Python (http://pylucene.osafoundation.org). It supports a flexible query syntax and many natural languages. I think the best choice KDE could make is to use Lucene as the index format.

Lucene is a fine tool -- and CLucene as well, which is something that I've looked at -- but part of what I was trying to indicate in the interview is that we're not working on a "search tool" -- search is just one of the things that we'll be using it for. Lucene is well set up for static document indexing, but isn't particularly useful for a graph-based contextual web.

This is kind of a problem right now that we didn't have when these ideas were hatched -- people have ideas in mind about what "desktop search" is, and that's not really what we're working on. The questions even kind of indicated that -- if I'd done an interview on this stuff last June or so the questions would have been very different, because that was before all of this stuff burst into the mainstream.

KDE already has a plugin based metadata layer, extending that to where it can extract the needed information is likely the direction that things will move.

I think it's only going to be an API available to every app. So if you're writing a dashboard app you could use the API in it, or if you're writing a media player you could use the API for your playlist, etc. It's neither a tool nor an application, just an API usable by any KDE app.

Do you do the discussion on some particular mailing list or do you have some place where you show the current drafts (wiki?) or is a proof of concept already available in one of the numerous kdenonbeta modules?

I'd think catching all the metadata that kfile has already been reading out for ages would be an excellent start. ;)

Does this mean we'll all need a full installation of PostgreSQL? Isn't that a bit heavy? Why not SQLite or MySQL? I know they're GPL (not sure about SQLite), but they're faster and lighter than Postgres, which is great but maybe a bit too much. Just because MS is going to use some kind of reworked SQL Server with WinFS on Longhorn doesn't mean we should do the same with Postgres :)

Well, Postgres is the third database that I've tried. I did the original mockup prior to the KDE conference this summer with SQLite and the performance for the type of queries that we're doing was so bad that it simply isn't an option. Also SQLite is really only designed to be used from a single process, so we'd have to implement locking and multi-user access in a daemon on top of it, which, well, at that point you're just implementing a database server, really, but without the performance of more robust databases.

So I then ported that mockup to MySQL, which performs fine, but is GPL'ed rather than LGPL'ed. As performance is similar for MySQL and Postgres, but Postgres has more flexible licensing (i.e. suitable for use in things linked to kdelibs) Postgres wins there.

WinFS is something completely different. It's what's called an Attribute Value System or object database -- and it's only being used for "My Documents", not the complete FS. What we're working on isn't something that's going to replace parts of the FS, it will supplement it with contextual information.

Have you ever tried FirebirdSQL? It's a true database server, ACID compliant, multiplatform; its license is a modified version of the Mozilla Public License (so unfortunately not pure GPL), it uses really few resources, and it is developed by an active community: http://www.firebirdsql.org/
"Firebird is a relational database offering many ANSI SQL-99 features that runs on Linux, Windows, and a variety of Unix platforms. Firebird offers excellent concurrency, high performance, and powerful language support for stored procedures and triggers"

MPL isn't GPL compatible; there's not even a reason to look further. We're going to have to use the client libraries for these databases.

I don't see any reason not to use Postgres; there are loads of ways that all of this could be done, but the datastore is the boring part and Postgres fits our requirements. I don't really see a reason to look further at the moment.

I've never heard of problems with FirebirdSQL and the GPL, since you have to connect to the database, not include its code. Maybe their "modified" MPL is modified enough to be able to interface with GPL programs ;) But I'm not a lawyer.
I was replying to your message about PostgreSQL vs. MySQL, and I'm sure you would be surprised by Firebird's performance, low footprint, high stability, low maintenance needs, etc.
So mine was just a suggestion: if you "don't see any reason not to use Postgres", I don't see any reason to use Postgres instead of Firebird ;)
So it would be great if you could have a look at Firebird too :)
thanks

So KDE is going to choose an LGPL database over a straight GPL'ed database because of licensing restrictions, when we use a GPL'ed toolkit (Qt) for EVERYTHING? I hope this is not really the case, because it would seem to lend credibility to the GNOME-huggers who say that KDE is worthless as a general purpose business desktop because Qt is GPL'ed and not LGPL'ed. If we can use Qt in KDE, why not MySQL? Don't get me wrong, I don't have a problem with Postgres; I just hate to see KDE use the same argument against something that has been leveled against us.

as an embedded database it looks good. but we don't need an embedded database, we need a database that can be accessed simultaneously and, preferably, over a network. the TODO list for this project is already big enough without adding "write a scalable RDBMS" ;)

once the first edition is out using an external RDBMS then perhaps all the data storage fans can swoop in with their super dooper file systems and coolio ultra-tiny database-like engines and experiment/optimize that area of the software.

maybe you could make the data storage part "pluggable" so that developers could easily implement different db backends à la kexi, or even like that damn amarok, while you could focus on the postgresql part :)

we (amarok) have dealt with just about every sqlite issue you could imagine ;-)
so, let me assure you that i still believe it could be used, but some simple things have to be done:

a) you can set a threadsafe variable in the Makefile. this will help improve the situation!

b) why not use a singleton interface for searching? this way you can take care of the locking easily. if you need code: look at collectiondb.cpp/.h in amarok/src. there is even a connection pool and everything. that is also useful for other db interfaces, even though i don't know what qsql offers wrt this issue.

c) most sqlite issues can be easily solved by setting proper indices. when you do that, sqlite is barely slower than mysql. although, one must admit that there are more situations where sqlite is not able to use an index at all.

anyways, mysql or postgresql is overkill imho. reconsider sqlite! it's worth the little hassle.
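Point (c) above is easy to demonstrate with Python's built-in sqlite3 module: the same lookup switches from a full table scan to an index search once the index exists. The table here is just a made-up stand-in for a playlist:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tracks (id INTEGER, artist TEXT, title TEXT)")
db.executemany("INSERT INTO tracks VALUES (?, ?, ?)",
               [(i, "artist-%d" % (i % 100), "title-%d" % i)
                for i in range(10000)])

# Without an index this query typically reports a full scan of "tracks".
before = db.execute("EXPLAIN QUERY PLAN "
                    "SELECT title FROM tracks WHERE artist = 'artist-7'"
                    ).fetchall()

db.execute("CREATE INDEX idx_artist ON tracks (artist)")

# With the index in place the plan should mention idx_artist instead.
after = db.execute("EXPLAIN QUERY PLAN "
                   "SELECT title FROM tracks WHERE artist = 'artist-7'"
                   ).fetchall()
print(after)
```

This shows the single-table case where an index helps; the reply below is about the cross-table, graph-like queries where it helps much less.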

> we (amarok) have dealt with just about every sqlite issue you could imagine

Honestly I doubt it. amaroK is a very simple application (in database terms) relative to what we're working on. A music player with a couple of MB of data is a very different beast from a tool trying to store and query graphs in a database that may easily grow to a few hundred MB.

> why not use a singleton interface for searching?

Because that's only useful for one process.

> most sqlite issues can be easily solved by setting proper indices

Not for heavy use of cross table selects on interrelated values. I literally had several queries that took 15 minutes on SQLite that were done in less than 1-2 seconds on PGSQL or MySQL. And that was just on a 10 MB test database. I'm not saying that the limitations of SQLite couldn't possibly be worked around, I'm just saying that there's no compelling reason to work around them when there are better databases available that already solve these problems.

SQLite also locks the entire database on write, which just isn't acceptable in a tool used frequently by multiple processes.
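That whole-database write lock is easy to observe from Python's sqlite3 module. This sketch reflects SQLite's default rollback-journal mode as described above: while one connection holds a write transaction, a second writer is refused immediately.

```python
import os, sqlite3, tempfile

# isolation_level=None puts the connections in autocommit mode, so the
# explicit BEGIN below is the only transaction control in play.
path = os.path.join(tempfile.mkdtemp(), "shared.db")
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
other = sqlite3.connect(path, timeout=0, isolation_level=None)

writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")           # take the database-wide write lock
writer.execute("INSERT INTO t VALUES (1)")

try:
    other.execute("BEGIN IMMEDIATE")        # second writer is refused...
    blocked = False
except sqlite3.OperationalError:            # ..."database is locked"
    blocked = True
print(blocked)  # True
```

With timeout=0 the second connection fails instead of busy-waiting, which makes the contention visible; a multi-process desktop service would hit exactly this.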

Basically, as Aaron already said -- it might be possible to work around all of the issues with SQLite. But in the end we'd just be implementing the features that other databases already provide and we'd need a daemon process anyway to handle communication with it, which, well, makes no sense.

I'd rather like to see Amarok integrate with such a new framework, instead of every application having its own db -- e.g. to access covers and the corresponding metadata with digiKam, without having to add a new album in digiKam and store the metadata twice.

If it were possible to specify something like "apply on amarok:covers which are smaller than (x,y), jpegs and greenish, pipe result to [...]", without having Joe User think about what the hell he is doing... ;)

Would it be possible to specify a list of directories to be indexed/monitored, with the option to recurse through the directory structure and make them available for searches by other users?

E.g.
I would have /home/bryan, and mark it recursive and hidden
As root I would add /usr/share/music (all my oggs etc., shared between users), and mark it as recursive and visible to all.

Other users do the same. The list of public search folders is stored in /etc/indexeddirs. Each person's private list is stored in ~/.indexeddirs. For each directory have a sqlite DB in the root of that directory called .dirindex.sqlite or something.

Then when someone does a search, open and concatenate the private list (~/.indexeddirs) and the public list (/etc/indexeddirs), open all the directories, and search through each?
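A naive sketch of that scheme in Python. The list locations /etc/indexeddirs and ~/.indexeddirs come from the proposal above and are not anything real, and a plain directory walk stands in for the per-directory SQLite index:

```python
import os

def read_list(path):
    """Return the directory list stored one-per-line at path, or []."""
    try:
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except OSError:
        return []

def search(name, public_list="/etc/indexeddirs",
           private_list=os.path.expanduser("~/.indexeddirs")):
    """Walk every listed directory and collect files matching name."""
    hits = []
    for root_dir in read_list(private_list) + read_list(public_list):
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            hits.extend(os.path.join(dirpath, f)
                        for f in filenames if name in f)
    return hits
```

A real implementation would consult the stored index instead of walking the tree on every query; the sketch only shows how the two lists would be merged.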

That might be a bit much though; I don't know the specifics of search. Is this being developed using Qt 4 or KDE/Qt? It'd be cool if the backend were developed using Qt 4, so it would be a nice small dependency that other projects could use.

yes, it is small. pgsql isn't exactly huge either, though: the rpms on SUSE are ~10MB (including the docs, stored procedure languages and whatnot) and its memory usage is also pretty good. we're not talking 100s or even dozens of MB of ram.

> fast,

not for the types of queries that are required.

and, as Scott mentioned above, this needs to support multi-process access which means locking and the whole bit. sqlite is great for the purposes it was intended for; this isn't one of them. =)

This is a very valid point... kdelibs requiring a database (which it sounds like it will be doing) will make things a lot easier for a variety of KDE programs that currently have to come packaged with sqlite.

While my first reaction was, "*Groan*, WhyTF do I need to run a DB server on my PC just to have a good desktop experience", I think this is a good decision. Modern systems will not be loaded excessively by a Postgres server running in the background, and the payoff would be *much* more than worth it.

But I do wish it could all be independent of the actual database used. Or at least give a choice between 2-3 popular databases. Maybe eventually, closer to productization, when such pragmatic issues are more important.

I know that using reiser4 should probably not be a requirement for using the "search tool", but would it make sense to create a reiser4 plugin to be used with your ideas? Maybe you could store some information not in the database, but right with the files. One could argue that this is where the information belongs: in the filesystem.

IIRC, Hans Reiser said that whenever you use a database, it's because of the shortcomings of your filesystem, and the now-released reiser4 is supposed to fix that.

i think the kernel devs want to implement some of reiser4's unique functionality in the kernel vfs so that other filesystems can use it and people won't be forced to use reiser4, but that ain't gonna happen anytime soon, so I guess we have to wait. I wonder if we'll get it before winfs :) (the real winfs, not the one that will come next year with longhorn).

> but would it make sense to create a reiser4 plugin to be used with your ideas

seeing as our time is not exactly limitless, i don't think doing two implementations with different storage designs is realistic, especially when very, very few people have the target data storage mechanism (reiser 4 file system) available to them. this is meant to be a practical project rather than a research topic.

> One could argue that this is where the information belongs: in the filesystem

yes, that's a viable argument given that the information you are indexing/linking exists in the filesystem in the first place. this assumption isn't universally true, however. not only is a lot of our data dynamically generated on demand these days, there's also a lot of data that is implied by our usage forming context that isn't a "document" or even necessarily very suitable for storage as a "file" on disk.

trivial example: how would you store a personal annotation of a web page in a filesystem based approach?

if we wish to do more than just index and search local data, it becomes apparent that the file system is not the catch-all locus of our information anymore.
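The web page annotation example makes the point: the annotated "thing" is a URL, so there is no local file for a filesystem attribute to attach to, while for a database it is just another key. A minimal, invented sketch:

```python
import sqlite3

# Invented sketch: a note attached to a URL, which has no on-disk file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotations (url TEXT, note TEXT)")
db.execute("INSERT INTO annotations VALUES "
           "('http://dot.kde.org/', 'good interview about KDE 4 search')")

note, = db.execute("SELECT note FROM annotations "
                   "WHERE url = 'http://dot.kde.org/'").fetchone()
print(note)  # good interview about KDE 4 search
```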

> trivial example: how would you store a personal annotation of a web
> page in a filesystem based approach?

The direction that Reiser is heading in is towards a general filesystem/retrieval system. Kind of a mix of a traditional filesystem plus a database, while being very flexible and 'plastic' (i.e. you wouldn't have to define a fixed schema before you could use it). Searching using partial chunks of info is a big part of it. Basically you could build and search almost arbitrary data structures (on disk). A traditional Unix-style filesystem is just one thing that you could make using this kind of system, but much more would be possible.

> The direction that Reiser is heading in is towards a general
> filesystem/retrieval system

yes, it's a very interesting and ambitious goal, one Oracle failed at in the 90s, though for market reasons rather than technical ones.

> Basically you could build and search almost arbitrary
> data-structures (on disk)

of course, and we're using an RDBMS to do exactly that at the moment. the reason that a file system doesn't offer anything (featurewise) above what the RDBMS does to make it more attractive is that not everything in the necessary "arbitrary data structures" refers/links to things that are local or even storable (e.g. time).

it's much more practical and reasonable to require people to install 10MB of software that provides a database engine than it is to require them to reformat their disks and migrate all their data over. Reiser isn't even available for all the platforms KDE runs on. this removes it as a potential target for a practical tool, though it would make a really cool target for a research project.

i do think that applications will drive the success of Reiser4, however. and once we have tools such as what we are building, i wouldn't be surprised if someone reworked the storage layer to optionally use Reiser4 to produce something smaller and more performant. at that point there's a real motivator to use these kinds of file systems that goes beyond the theoretical, at which point they become interesting from the point of view of "off the shelf" software users and manufacturers.

Frankly, I believe that SQL databases in general may not be the best solution for this problem.

Some years ago, I was one of the investigators into a potential application in which we benchmarked various database solutions. The "post-relational" databases of which the old Pick system was progenitor had performance many times that of the relational ones.

In case you are unaware of them, these databases comply with all but one of Codd and Date's rules of normalization -- the first, that every item must be "unique and atomic." In other words, a single record can store an entire table.

Consider the common introduction to relational databases in which a video rental store is used. One table contains the individual customer records, with a unique customer number. Another table contains the rental records, with one column containing the customer ID numbers. To get a look at rental history for a given customer, then, a join must be performed in which the rental table is searched with each rental by the given customer's ID being extracted to build the list.

In a Pick-style database, the individual rental info can be stored directly in the customer table. Extracting rental info merely means looking up that customer record and viewing the ever-expanding rental list. No joins, far less memory, and much more speed!
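The contrast can be sketched with plain data structures (a toy illustration of the two layouts, not a real Pick system):

```python
# Relational layout: rentals live in their own "table" and are matched
# to a customer by ID, i.e. a join.
customers = {42: {"name": "Alice"}}
rentals = [{"customer": 42, "film": "Metropolis"},
           {"customer": 42, "film": "Nosferatu"},
           {"customer": 7,  "film": "M"}]
alice_rentals = [r["film"] for r in rentals if r["customer"] == 42]

# Pick-style layout: the rental list is embedded in the customer record,
# so the history is read with a single record lookup and no join at all.
pick_customers = {42: {"name": "Alice",
                       "rentals": ["Metropolis", "Nosferatu"]}}

print(alice_rentals == pick_customers[42]["rentals"])  # True
```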

Another possibility might be a datastore similar to the "Titanium" database of Micro Data Base Systems. In that one, many-to-many relations can be directly mapped as selected by the programmer. This results in a well-written program having even more performance yet.

In our tests, the Pick style database ran about 25 times faster than the then-current SQL databases; the Titanium engine about 40 times faster and with much lower system overhead.

In short, the only advantage I see for an SQL system is the large number of tools available for them.

:: One could argue that this is where the information belongs: in the filesystem.

The problem is, KDE targets the nebulous "Unix" as a goal. POSIX does not define much (are ACLs even defined as a standard, or are they an optional part of the standard?).

:: IIRC, Hans Reiser said that whenever you use a database, its because of the shortcomings of your filesystem, and the now-released reiser4 is supposed to fix that.

I agree so wholeheartedly it is difficult to express with words. I think a rich filesystem that is used by apps is as revolutionary as the desktop metaphor. But I also think that KDE is right not to make such a fundamental requirement. As an *optional* way of storing data, on the other hand...

Not to mention that a rich filesystem plus an advanced code rights system (a la jail) will result in a very secure, powerful and stable system - far more abuse tolerant and flexible than has ever been commonly available ("Open all the attachments you want - they run fine, anything they change rolls back, and they can't send Spam").

you hit the nail on the head: it has to be used by applications. if no applications use it, it's a theoretically cool idea with no real world benefits and becomes an unimportant interesting footnote in computing history.

this is why instead of targeting a unique storage system or creating one application/tool, we're creating a system that will allow any (and hopefully as close to "all" as possible) application to easily take advantage of these paradigms.

the location of the storage, filesystem or database or clay tablet, is an implementation detail with implications for performance and ease of implementation only. it's the application APIs that matter, and which are also missing. to analogize, Reiser and RDBMSes are like X Window: low level technologies that provide a means to accomplish the task (with varying degrees of success); things like what this interview is about are like Qt: a layer that makes application development leveraging the possibilities of the platform possible.

innovation is not just the creation of a new idea, it's the implementation of that new idea in the marketplace.

Reiser4 as it is won't fix a lot, except maybe more efficient storage of small files. If you want to add a plugin for reiser4 you have to recompile your kernel. If you want other people to use it, you have to distribute a kernel patch. If you want to reach a broad audience in finite time, you'd better put another layer on top of the FS to get things done.
I started playing with reiser4 a while ago to write a plugin to enumerate changed files fast and reliably. But even if I find the time to get it done some day, it would still be hard to make people use it, at least with reiser4's notion of a plugin. If you want to introduce features which completely redefine a filesystem... what can I say... good luck?