.NET and Oracle BLOB

I have a database that is going to contain a lot of documents in
.DOC, .TXT, .PPT, .PDF and similar formats. I want to index the documents so
that I can run a free-text search on the database table. I also want to
insert and retrieve the documents using .NET (C# or VB.NET).

Can anyone out there give me some tips, links or other helpful hints?

While storing documents in a database has often "seemed" like a good idea,
the truth is that it is not. In short, a database is for storing data. A
file system is for storing files. Sure, one can store binary data in a
database, and maybe (just maybe) this is OK in a case or two, but the best
place for documents seems to be the file system. That's the OS's job and it
does it VERY well. One can get a DB to do it, but it is clunky at best.

A good way to manage such documents, if you must have a database "handle"
on them, is to store the filename and perhaps the location in a database, as
a "pointer" to the file itself. However, if you do this, there is an
argument that there are plenty of built-in .NET classes for getting to and
from the file system (which is a good argument), so the database is
redundant anyway. Still, having the filenames collected neatly may be a
good idea now and again.
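
The "pointer" idea can be sketched roughly as follows. This is only an illustration under assumptions: the names DocumentCatalog, Register and Load are made up for the example, and an in-memory DataTable with FileName and Location columns stands in for the real database table, so nothing here talks to Oracle.

```csharp
using System;
using System.Data;
using System.IO;

public static class DocumentCatalog
{
    // In-memory stand-in for a database table with two columns (assumed schema).
    public static DataTable CreateTable()
    {
        var table = new DataTable("Documents");
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("Location", typeof(string));
        return table;
    }

    // Copies the file into the storage folder and records where it went.
    public static void Register(DataTable table, string sourcePath, string storageRoot)
    {
        Directory.CreateDirectory(storageRoot);
        string fileName = Path.GetFileName(sourcePath);
        File.Copy(sourcePath, Path.Combine(storageRoot, fileName), true);
        table.Rows.Add(fileName, storageRoot);
    }

    // Follows the pointer: finds the row, then reads the bytes from disk.
    public static byte[] Load(DataTable table, string fileName)
    {
        foreach (DataRow row in table.Rows)
        {
            if ((string)row["FileName"] == fileName)
            {
                string path = Path.Combine((string)row["Location"], fileName);
                return File.ReadAllBytes(path);
            }
        }
        throw new FileNotFoundException("No catalog entry for " + fileName);
    }
}
```

With a real database, Register would presumably become an INSERT of the same two values and Load a SELECT on the filename; the file bytes themselves never enter the database.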

With files in a file system, one can hook into the file system's
functionality for searching, or use some kind of indexing system, and so
on. Usually, to build a searchable index, one gets a product or uses the
OS's functionality. It is an involved task to write this sort of code; but,
of course, it CAN be done.
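
To give a feel for what "a searchable index" means at its very simplest, here is a toy word-to-files index. It is a sketch under assumptions, not a recommendation: TextIndex, Build and Search are made-up names, and it only reads plain .txt files. Formats such as .DOC, .PPT and .PDF would first need something to extract their text, which is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TextIndex
{
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Maps each lower-cased word to the set of file paths that contain it.
    public static Dictionary<string, HashSet<string>> Build(string folder)
    {
        var index = new Dictionary<string, HashSet<string>>();
        foreach (string path in Directory.GetFiles(folder, "*.txt"))
        {
            string text = File.ReadAllText(path).ToLowerInvariant();
            foreach (string word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                HashSet<string> files;
                if (!index.TryGetValue(word, out files))
                {
                    files = new HashSet<string>();
                    index[word] = files;
                }
                files.Add(path);
            }
        }
        return index;
    }

    // Returns the files containing the word, or an empty set if there are none.
    public static HashSet<string> Search(Dictionary<string, HashSet<string>> index, string word)
    {
        HashSet<string> files;
        if (index.TryGetValue(word.ToLowerInvariant(), out files))
        {
            return files;
        }
        return new HashSet<string>();
    }
}
```

Even this toy version has to be rebuilt whenever files change, which is part of why an existing product or the OS's own indexing is usually the easier route.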

Now, if one simply MUST store files in a database, then it is going to be
tricky building a dynamic index on documents of type PPT and the like. I
expect it can be done, but I should want to avoid it. But, I am a shirker
looking for the easiest way. Furthermore, building and keeping this "search
index" fresh is going to take time, especially if there are "a lot of
documents", as you have mentioned. Then again, some analysis of the
requirements is needed here: for example, if the system is not in use 24
hours a day, and if one does not need an up-to-the-minute index, then
rebuilding the index once a day would be an option. And so on.

Now, another way that I have addressed this issue is to truly separate
content from format. I have designed a newsgroup system that stores each
post's text in the database, as plain text. The formatting is handled by
CSS and/or XSLT. This way, the database just handles plain text and it is
easy to search. Furthermore, this is a relatively low-traffic newsgroup.
Finally, there is a limit to the size of each post (which I control), so
the database is not storing large pieces of text. All of this, however,
makes for a much different problem set when compared to the one you
describe; but, it may help you to think about the issues involved.

This is a BIG topic, so I'll stop here while I'm behind.
There will be many arguments for and against what I have said, some good on
both sides. Please just take this as food for thought. I doubt that I have
clarified anything at all here; but, I hope that I have at least muddied
the waters.

HTH.

--Mark.

"Robert Vabo" <> wrote in message
news:%23%...
I have a database that is going to contain a lot of documents in
.DOC, .TXT, .PPT, .PDF and similar formats. I want to index the documents so
that I can run a free-text search on the database table. I also want to
insert and retrieve the documents using .NET (C# or VB.NET).

Can anyone out there give me some tips, links or other helpful hints?
