lucene-java-user mailing list archives

Hi Gregor,
Gregor Heinrich wrote:
>Hi Peter.
>
>Docco is a great tool which I have been using since you posted your first
>announcement (version 1.0, that is). Beside the things you mention in you
>
No, it must be 0.1 :-) It wasn't too bad for a first hack, though *g*
>mail I also generally think it's a great idea to using formal concept
>analysis with Lucene. I would be interested to explore the idea also for
>more structured data (maybe include fields and even hierarchies).
>
Sounds interesting. We will be busy with other things for a month or
two, afterwards we will probably do some experiments with extending
Docco towards LSA and similar technologies (main idea is guessing good
keywords to refine the query -- which is a bit different if you are
happy with overspecified queries). If you have some particular ideas we
can discuss them on tockit-general@lists.sf.net. Combining the current
Docco approach with conceptual scaling could be interesting -- if that
is what you mean. It would be a bit off-topic here, so let's put this on
our list.
The question for this list was more if someone is interested in turning
the indexing and index managements bits into a separate project. I'd
love to see it used by a broader audience and I think it would make
getting started with Lucene a bit easier for some people -- at least if
we add a little JavaDoc in form of package.htmls and class descriptions.
And there are not that many classes, so that shouldn't be too much pain.
>
>Apart from this, if I had an idea of the time commitments connected, I would
>definitely consider to join.
>
Hey -- it is Open Source and voluntary. You commit as much time as you
want ;-) Not that that couldn't be a problem -- at least I tend to try
too many things at once.
If you are interested in more Docco-specific things, just post your
ideas on our mailing list. That's usually a good start to get involved.
And I am academic enough to always find the time to discuss things --
doing them is the hard bit :-) Once I understand what exactly you want
to do I can probably give you a reasonable estimate of the effort involved.
Cheers,
Peter
>Best,
>
>Gregor
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, September 02, 2003 1:52 PM
>To: Lucene Users List
>Subject: ANN: Docco 0.2 / contribution offer
>
>
>Hi all,
>
>we finally finished the 0.2 release of our little personal document
>management tool based on Lucene:
>
> http://tockit.sourceforge.net/docco/index.html
>
>This might be interesting for some readers of this list since its source
>contains some infrastructure for document handlers and index management.
>The document handlers are written with a very simple API, which just
>asks the implementation to fill a structure with the information
>retrieved from a URL. It is similar to the Ant task in the Lucene
>sandbox, but it separates the information collection and the actual
>indexing, i.e. all the decisions what should be stored and what shouldn't.
>
>The program comes with implementations for plain text, HTML (based on
>Swing), XML (based on JAXP) and Open Office (using ZipStreams/SAX). We
>wrote plugins for POI, PDFbox and Multivalent. The latter is
>unfortunately a wild hack since Multivalent is the worst Java code I've
>seen. Literally. Bad C written in Java. The tool would be nice to use,
>but catching exceptions in little helper classes to do a System.exit is
>just insane. And that is just one of the problems -- we had to do some
>bad hacks to fix these issues. The other implementations should be fine,
>although they need some more testing.
>
>The source (including all required libs) of the program is available via
>Sourceforge's CVS:
>
> http://sourceforge.net/cvs/?group_id=37081
>
>The module in question is called "docco". A current snapshot of only the
>source is here:
>
> http://tockit.sourceforge.net/docco/source20030902.zip (~100kb)
>
>
>The relevant packages are:
>
> org.tockit.docco.documenthandler: the documenthandler interface and
>implementations
> org.tockit.docco.filefilter: some code to pick document handlers via
>file extensions or regexps
> org.tockit.docco.index: the model/static bits of the index management
> org.tockit.docco.indexer: the dynamic aspects of the index management:
>runnable, framework for handlers
>
>The index management is probably not optimal, I strongly suspect that an
>expert could tweak it. But the structure should be ok.
>
>We would be happy to contribute this code to the Lucene sandbox if there
>is interest. Or to turn it into a project of its own, we don't think it
>should be hidden in our more specific program. It should be easy to
>merge it with the Ant task and we are happy to give a hand if wanted.
>Adding some documentation would be easy, too -- at the moment the code
>is still more for ourself, but it should be very readable by itself. We
>require JDK 1.4, but this can be reduced by moving some more document
>handlers into plugins.
>
>Anyone interested in joining into maintaining this code? Any feedback is
>welcome.
>
>Cheers,
> Peter
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>