Topics (SiSU Metadata Harvest). The organization is not finished, but this classification structure, built from information in individual document headers, could be used with the "thesaurus" discussed in this seminal paper: "A Uniform International Sales Law Terminology", Vikki M. Rogers and Albert H. Kritzer, 2003, CISGw3 Database, Pace University Institute of International Commercial Law. [Reproduced from Ingeborg Schwenzer / Günter Hager ed., Festschrift für Peter Schlechtriem zum 70. Geburtstag, Mohr Siebeck (2003) 223-253.] Custom search engines could be built to take advantage of such classification terminology as well.

With minimal preparation of a plain-text (UTF-8) file, using sisu markup syntax
in your text editor of choice, SiSU can generate various document formats, most
of which share a common object numbering system for locating content, including
plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX, PDF
files, and populate an SQL database with objects (roughly paragraph-sized
chunks) so searches may be performed and matches returned with that degree of
granularity. Think of being able to finely match text in documents, using
common object numbers, across different output formats and across languages if
you have translations of the same document. For search, the result tells you
that your criteria are met by these documents at these locations within each
document (equally relevant across different output formats and languages). To
be clear (if obvious), page
numbers provide none of this functionality. Object numbering is particularly
suitable for "published" works (finalized texts as opposed to works that are
frequently changed or updated) for which it provides a fixed means of reference
of content. Document outputs can also share provided semantic meta-data.

SiSU also provides concordance files, document content certificates and
manifests of generated output and the means to make book indexes that make use
of its object numbering.

Syntax highlighting and folding (outlining) files are provided for the Vim and
Emacs editors.

For basic text and HTML operations SiSU has no necessary dependencies other
than Ruby, the language in which it is written. For EPUB and OpenDocument text
you need the zip program. For search you need either SQLite or PostgreSQL. For
PDF you need a fair amount of TeX Live, as LaTeX is used. SiSU source code is
provided in a git repository
http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=summary

On Debian, dependencies for the various features are taken care of by the
sisu-related packages.
The package sisu-complete installs the whole of SiSU.

Additional document markup samples are provided in a git repository
http://git.sisudoc.org/gitweb/?p=doc/sisu-markup-samples.git;a=summary
or in the package sisu-markup-samples which is found in the Debian non-free
archive. The license for the substantive content of each marked-up document is
that provided by its author or original publisher.

take two

SiSU may be regarded as an open access document publishing platform, applicable
to a modest but substantial domain of documents (typically law and literature,
but also some forms of technical writing), that is tasked to address certain
challenges I identified as being of interest to me over the years in open
publishing.

The idea and implementation may be of interest because some of the issues
encountered, which SiSU seeks to address, are known and common to such
endeavors. Amongst them:

* how do you ensure what you do now can be read in decades?

* how do you keep up with new and changing technologies?

* do you select a canonical format to represent your documents, if so
what?

* how do you reliably cite (locate) material in different document
representations?

* how do you deal with multilingual texts?

* what of search?

* how are documents contributed to the collection?

(these questions are selected to help describe the direction of efforts with
regard to sisu).

My Dabblings in the Domain of Open Publishing
---------------------------------------------

The system is called SiSU. It is an offshoot of my early efforts at finding out
what to make of the web, which started at the University of Tromsø in 1993 (an
early law website Ananse / International Trade Law Project / Lex Mercatoria). I
have worked on SiSU continually since 1997 and it has been open source since
2005 (under a license called GPL3+), though I remain its developer.

In working in this field I have had to address some of the common issues.

So how do you ensure what you do now can be read in decades to come? There are
alternative solutions: (i) stick with a widely used, not overly complicated,
well-documented open standard, for which the likes of ODF is an excellent
choice; (ii) alternatively, go for the most basic representation of a document
that meets your needs, in my case UTF-8 text with some markup tags, which is
fairly easily parsable by the human eye. As long as UTF-8 is in use it will
always be possible to extract the information.

How do you keep up with new and changing technologies? Here my solution has
been to generate new versions of the substantive content so as to always have
the latest document representations available e.g. HTML has changed a lot over
the years, different specifications come out for various formats including ODF,
electronic readers have become an important viewing alternative, introducing
the open reader format EPUB. Output representations are generated from source
documents. Different open document file formats can be produced and databases
and search engines populated. (The source documents and interpreter are all
that are required to re-create site content. Source documents can be made
public or retained privately). The strict separation of a simple source
document from the output produced means that with updates to SiSU (the
interpreter/processor/generator), outputs can be updated technically as
necessary, and new output formats added when needed. Amongst the output formats
currently supported are HTML, LaTeX-generated PDFs (A4, letter, other;
landscape, portrait), EPUB and OpenDocument Format text. Returning to HTML as an
example, it has changed a lot over the years I have worked with it; this way of
working has meant it is possible to keep producing current versions of HTML
while retaining the original substantive document, and new formats have been
added as desired. There is no attempt to make output in different document
formats/representations look alike, let alone identical. Rather the attempt is
to optimize output for the particular document filetype (there is no reason
why an EPUB document should look or behave like an OpenDocument text, or why a
PDF should look like HTML output; PDF is optimized for paper viewing, HTML for
screen, etc.). Wherever possible features associated with the
particular output type are taken advantage of. This freedom is made possible to
a large extent by the answer to the question that follows.

How do you reliably cite (locate) material in different document
representations? The traditional answer has been to have a canonical
publication, and resulting fixed page numbers. This was not a viable solution
for HTML (which changes from one viewer to another and with selectable font
faces & size etc.); nor is it otherwise ideal in an electronic age with the
possibility of presenting/interacting with material/documents in so many
different ways. Why be so restricted? Here my solution has been "object
citation numbering". What the various generated document formats have in
common is a shared object numbering system that identifies the location of text
and that is available for citation purposes. Object numbers are sequential
numbers assigned to each identified object in a document. Objects are logical
units of text (or equivalent parts of a document), usually paragraphs, but also
document headings, tables, images, a verse in a poem, etc. [In an electronic
publishing age are page numbers the best we can come up with? Change font
type, font size, page orientation, paper size (sometimes even the viewer) and
where are you with them? And paper, though a favorite medium of mine, is no
longer the sole (or sometimes primary) means of interacting with documents/text
or of sharing knowledge.]
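
To make the idea concrete, here is a minimal sketch in Python of object
numbering shared across two outputs. It is an illustration only, not SiSU's
Ruby implementation: it assumes objects are simply blank-line-separated blocks
(SiSU's own markup decides what counts as an object), and the function names
and the "o" prefix on HTML ids are hypothetical.

```python
import html
import re


def number_objects(source):
    """Split text on blank lines and number the resulting blocks from 1."""
    blocks = [b for b in re.split(r"\n\s*\n", source.strip()) if b.strip()]
    # Collapse internal line breaks so each object is one logical unit of text.
    return [(n, " ".join(b.split())) for n, b in enumerate(blocks, start=1)]


def render_plain(objects):
    """Plain-text output: the object number trails each block."""
    return "\n\n".join(f"{text} [{n}]" for n, text in objects)


def render_html(objects):
    """HTML output: the same number becomes an element id (hypothetical 'o' prefix)."""
    return "\n".join(
        f'<p id="o{n}">{html.escape(text)}</p>' for n, text in objects
    )


SAMPLE = """A Heading

First paragraph of the
document body.

Second paragraph."""

objects = number_objects(SAMPLE)
print(render_plain(objects))
print(render_html(objects))
```

Because both renderers consume the same numbered list, a citation to object 2
points at the same paragraph in either output, whatever the font, page size or
viewer. For brevity the sketch renders every object, headings included, as a
paragraph.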

What object numbers mean (unlike page numbers) is, for example:

* if you cite text in any format, the cited text can be reliably located
in any other document format type. Cite HTML and the reader can choose to
view in EPUB or PDF (the PDFs being an independent output, generated by
book publishing software XeTeX/LaTeX).

* if you do a search, you can be given a result "index" indicating that your
search criteria are met by these documents, and at these specific locations
within each document, and the "index" is relevant not only for content
within the database, but for all document formats.

* if you have a translated text prepared for sisu, then your citations are
relevant across languages e.g. you can specify exactly where in a Chinese
document text is to be found.

What of search? For search, see the implications of object numbers for search
mentioned above. The system currently loads an SQL server (PostgreSQL) with
object-sized text chunks. It could just as well populate an analytical engine
(such as the currently popular Elasticsearch) with larger sections or chapters
of text for analytical purposes, whilst also availing itself of the concept of
objects and object numbers in search results.
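
A minimal sketch of object-level search follows, using Python's standard
sqlite3 module with an in-memory database. The table layout, column names and
sample rows are assumptions made for illustration and are not SiSU's actual
database schema; a simple LIKE match stands in for whatever matching a real
installation uses.

```python
import sqlite3


def build_index(rows):
    """Load (document, object number, text) rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        "doc TEXT, ocn INTEGER, body TEXT, PRIMARY KEY (doc, ocn))"
    )
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    return conn


def search(conn, term):
    """Return (document, object number) pairs whose text contains the term."""
    cur = conn.execute(
        "SELECT doc, ocn FROM objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return cur.fetchall()


# Invented sample data: one row per object.
ROWS = [
    ("sales_law", 1, "Uniform terminology for international sales"),
    ("sales_law", 2, "Remedies available to the buyer"),
    ("free_culture", 1, "Culture and the commons"),
    ("free_culture", 7, "The buyer of a licence acquires limited rights"),
]

conn = build_index(ROWS)
hits = search(conn, "buyer")
print(hits)
```

The result is the "index" described above: a list of documents and the object
numbers within each where the term occurs, which can then be used to locate
the text in any of the generated formats.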

How do you deal with multilingual texts? If you have translated text prepared
for sisu, then your citations are relevant across languages. Object numbers
also provide an easy way to compare and discuss text (translations) across
languages. Text found/cited in one language has the same object number in its
translations; to find a given paragraph in another language, just change the
language code. (Documents are prepared in UTF-8; current language restrictions
arise from the LaTeX tools used, Polyglossia and CJK (Chinese, Japanese and
Korean), and from the fact that sisu parses left to right.)
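
The point about changing only the language code can be sketched as follows.
This is a hypothetical illustration in Python: the citation tuple
(document, language code, object number), the function names and the sample
sentences are all invented, and nothing here reflects how SiSU stores or
addresses translations.

```python
# Invented sample data: language code -> {object number: text}.
TRANSLATIONS = {
    "en": {1: "Obligations of the seller", 2: "The seller must deliver the goods."},
    "fr": {1: "Obligations du vendeur", 2: "Le vendeur doit livrer les marchandises."},
}


def switch_language(citation, lang):
    """Re-target a (document, language, object number) citation to another language."""
    doc, _old_lang, ocn = citation
    return (doc, lang, ocn)


def lookup(citation):
    """Return the text that a citation points at."""
    _doc, lang, ocn = citation
    return TRANSLATIONS[lang][ocn]


def side_by_side(ocn, langs=("en", "fr")):
    """Collect one object in several languages for comparison."""
    return {lang: TRANSLATIONS[lang][ocn] for lang in langs}


cite_en = ("sales_convention", "en", 2)
cite_fr = switch_language(cite_en, "fr")
print(lookup(cite_en))
print(lookup(cite_fr))
```

Only the language code differs between the two citations; the object number
stays fixed, which is what makes side-by-side comparison of translations
straightforward.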

How are materials prepared for contribution to the collection? (a) The easiest
solution, if the system allows, is submission in the format in which work is
authored, usually a word processor, for which ODF may be a decent selection.
(b) I have stuck with enhanced plaintext, UTF-8 with minimal markup. Source
documents are prepared in UTF-8 text, with a minimalist native markup to
indicate the document structure (headings and their relative levels),
footnotes, and other document "features". This markup is easily parsable by the
human eye, and plays well with version control systems. Documents are prepared
in a text editor. Front ends, such as markup assistants in a word processor
that can save to sisu text format, or other tools, whilst possible, do not
exist. [(c) Yet another form of submission for collaborative work is wikis,
which have shown their strength in efforts such as Wikipedia.]

The system has proven to be a good testing ground for ideas and is flexible and
extensible. (Things that could usefully be done, apart from a front end for
simpler user interaction: feed text to an analytical search engine, like
Elasticsearch/Lucene; add a bibliography parser, which is still needed
(auto-generation of a bibliography from footnotes); and allow rough
auto-translation of documents on the fly by passing text through a translator
(such as Google Translate).)

In any event, my resulting technical opinions (in my modest domain of
action) may be regarded as encapsulated within SiSU
[http://www.sisudoc.org/]

http://www.sisudoc.org/
http://www.jus.uio.no/sisu/

git clone git://git.sisudoc.org/git/code/sisu.git
http://git.sisudoc.org/gitweb/?p=code/sisu.git;a=summary
(there are additional commits in the upstream branch)
git clone git://git.sisudoc.org/git/doc/sisu-markup-samples.git

Development work is done on Linux, and the easiest way to install SiSU is
through the Debian package, as this takes care of optional external
dependencies such as XeTeX for PDF output and PostgreSQL or SQLite for search.