As it happens, we have never worked with jQuery, although we were aware of
it.

In early 2000 we developed a web application that contained rich JavaScript
APIs, including UI components. Later we worked actively with ASP.NET, and
after that with JSF.

At present, looking at jQuery more closely, we regret that we did not
start using it earlier.

Separation of business logic and presentation is remarkable when one uses JSON
web services. In fact, the server part can be seen as a set of web services
representing the business logic, plus a set of resources: html, styles,
scripts, and others. Neither ASP.NET nor JSF approaches such a consistent
separation.

The only trouble, in our opinion, is that jQuery has no standard data binding:
a way to bind JSON data to (and from) html controls. The technique that will
probably be standardized is called jQuery Templates or JsViews.

Unfortunately, after reading about this binding API, and being in love with
XSLT and XQuery, we just want to cry. We don't know what the best solution for
the task would be, but what we see looks uncomfortable to us.

I would like the WG to consider adding a function that turns a sequence
into an enumeration of values.

Consider a function like this:
fn:enumerator($items as item()*) as function() as item()?;

Alternatively, the signature could be:

fn:enumerator($items as function() as item()*) as function() as item()?;

This function receives a sequence and returns a function item which, upon its
Nth call, returns the Nth element of the original sequence. This way, a
sequence of items is turned into a function providing an enumeration of the
items of the sequence.
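
To make the intended behaviour concrete, here is a minimal sketch in Python
rather than XQuery. The name enumerator and the use of None to stand for the
empty sequence are our assumptions for illustration only; this is not a
proposed implementation.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def enumerator(items: Iterable[T]) -> Callable[[], Optional[T]]:
    """Return a zero-argument function; its Nth call returns the Nth item.

    Once the items are exhausted, every further call returns None
    (our stand-in for the empty sequence in item()?).
    """
    iterator = iter(items)

    def next_item() -> Optional[T]:
        return next(iterator, None)

    return next_item


next_value = enumerator(["a", "b", "c"])
first = next_value()   # first item of the sequence
second = next_value()  # second item of the sequence
```

Note that the returned function hides a cursor over the sequence, which is
exactly the state that makes it nondeterministic from the caller's side.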

As an example consider two functions:

a) t:rand($seed as xs:double) as xs:double* - a function producing a random
number sequence;
b) t:work($input as element()) as element() - a function that generates output
from its input, and that needs random numbers in the course of execution.

Enumerators will help to compose algorithms where one algorithm communicates
with other independent algorithms, thus making code simpler. The most obvious
class of enumerators are generators: ordered numbers, unique identifiers,
random numbers.
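
The composition of the two functions can be sketched as follows, again in
Python. Here rand and work are hypothetical stand-ins for t:rand and t:work:
the linear congruential formula and the string output are our assumptions,
chosen only to keep the sketch short and deterministic.

```python
from typing import Callable, Iterator, List


def rand(seed: int) -> Iterator[float]:
    """Stand-in for t:rand: an endless sequence fully determined by the seed."""
    state = seed
    while True:
        # A simple linear congruential step; any pure recurrence would do.
        state = (1103515245 * state + 12345) % 2**31
        yield state / 2**31


def work(words: List[str], next_random: Callable[[], float]) -> List[str]:
    """Stand-in for t:work: pulls one random number per processed input item."""
    return [f"{word}:{next_random():.3f}" for word in words]


source = rand(42)
# The enumerator is just "give me the next item of that sequence".
result = work(["x", "y"], lambda: next(source))
```

The producer stays a pure function of its seed, and the consumer does not
need to know how many numbers it will ask for in advance.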

Technically, the function returned from fn:enumerator() is nondeterministic,
but its "side effect" is similar to the "side effect" of calling generate-id()
on a newly created node (see bug #13747, and bug #13494).

The idea is inspired by a generator function, which returns a new value upon each
call.

Such a function can be seen as a stateful object. But our goal is to look at
it in a more functional way. So we look at the algorithm as a function that
produces a sequence of output, which is purely functional, and at an
enumerator that allows iterating over the algorithm's output.

This way, the function that implements an algorithm and the function that
uses it can be seen as two threads of functional programs that use messaging
to communicate with each other.

Honestly, we doubt that the WG will accept it, but it's interesting to watch
the discussion.

The essence of the problem is that we have constructed an argumentless
function that returns a unique identifier each time it is called. To achieve
the effect we created a temporary node and returned its generate-id() value.

Such a function is nondeterministic, as we cannot state that its result
depends on arguments only. This means that the engine's optimizer is not free
to reorder calls to such a function. Yet that is what happens in Saxon 9.2 and
Saxon 9.3, where the engine hoists the function call out of a loop, thus
producing invalid results.
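
The effect of such hoisting can be illustrated with a small Python sketch.
This is an analogy and not Saxon's actual optimizer logic; unique_id is a
hypothetical counter-based stand-in for our generate-id() based function.

```python
from itertools import count

_ids = count(1)


def unique_id() -> str:
    """Argumentless and nondeterministic: every call returns a new value."""
    return f"id{next(_ids)}"


items = ["a", "b", "c"]

# Intended evaluation: the function is called once per iteration.
per_iteration = [(item, unique_id()) for item in items]

# What hoisting effectively does: call once, then reuse the same result.
hoisted_value = unique_id()
hoisted = [(item, hoisted_value) for item in items]
```

In the first list every item gets its own identifier; in the second all items
share one, which is the invalid result we observed.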

Michael Kay, the author of the Saxon engine, argued that this is "a gray area of
the xslt spec":

If the spec were stricter about defining exactly when you can rely on identity-dependent
operations then I would be obliged to follow it, but I think it's probably deliberate
that it currently allows implementations some latitude, effectively signalling to
users that they should avoid depending on this aspect of the behaviour.

He advised raising a bug in the W3C Bugzilla to resolve the issue. In the end
two related bugs were raised:

The Working Group agreed that default behavior should continue to require these
nodes to be constructed with unique IDs.
We believe that this is the kind of thing implementations can do with
annotations or declaration options, and it would be best to get implementation
experience with this before standardizing.

This means that the technique we used to generate unique identifiers is correct
and the behaviour is well defined.

The only thing left is to wait until Saxon fixes its behaviour accordingly.

We're not big fans of Entity Framework, as we don't expose the database
structure to the client program directly, but rather through stored
procedures and functions. So, for us EF is a tool to expose those stored
procedures as .NET wrappers. Even this limited use of EF greatly automates the
data access code.

But what we have lately found is that EF has a problem with char parameters.
Namely, if you import a procedure, say MyProc, that accepts char(1), and then
call it through the generated wrapper, you will see in SQL Profiler that the
char(1) parameter is passed with many trailing spaces, as if it were
char(8000). There is no need to prove that this is highly inefficient.

We can see that the problem happens in the VS 2010 designer rather than in the
EF runtime, as the SP's parameters are not attributed with length; see the
model xml (*.edmx):

Incidentally, we had never noticed the problem earlier. Along with this issue
we found that the Eclipse compiler has changed in Indigo in such a way that we
had to recompile the source. Well, that's the price you have to pay when you
access an internal API.

No "select System.ItemUrl from SystemIndex where contains('...')"
has ever returned a row.

We thought that the problem was in our protocol handler and tried to localize
it, but finally discovered that Windows Search was not able to find anything
within text files.

A registry comparison showed that the *.txt extension was indexed by the
IFilter defined in query.dll, while on the other computers, where everything
worked, the implementation was in tquery.dll.

Both libraries were present on the Windows 2003 server, so we corrected the
registry and everything started to work.

As far as we understand, query.dll is part of the legacy Indexing Service, and
tquery.dll is the up-to-date implementation.

2. Search index size

We have to index a considerable amount of data. But before we can do it we
have to estimate the size of the index.

In the past, it seems, we saw a statement somewhere that a search index needs
storage of about 10% of the original data. Unfortunately we cannot find this
estimate at present, nor can we find any other estimate. This complicates our
planning.

To get an empirical estimate we indexed several thousand *.xml-gz files, which
are big gzipped xmls. The total size of these files is about 4.5GB. The total
uncompressed size of the xmls is ~50GB. The xmls contained about 10 million
pages of data.

According to the 10% criterion we should have arrived at a ~5GB search index.

But what we have discovered is that the index has grown to more than 50GB.
That's very disappointing. We cannot afford such expense, as we ran the test
on a tiny part of the data, which grows over time.
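
The arithmetic behind this disappointment, as a short Python sketch. The sizes
are the rounded figures quoted above, and 50GB is used as a lower bound for
the observed index, so the ratios are rough lower bounds as well.

```python
uncompressed_gb = 50.0    # total size of the unpacked xmls
compressed_gb = 4.5       # total size of the *.xml-gz files on disk
observed_index_gb = 50.0  # lower bound: the index grew to "more than 50GB"

# What the remembered 10% rule of thumb would predict.
expected_index_gb = 0.10 * uncompressed_gb

# How far off the prediction is, and how the index compares to the data.
overshoot = observed_index_gb / expected_index_gb
ratio_to_uncompressed = observed_index_gb / uncompressed_gb
ratio_to_compressed = observed_index_gb / compressed_gb
```

So the index is at least ten times larger than expected, at least as large as
the unpacked data, and more than eleven times the size of the files we
actually store.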

So, the solution is to find out what's wrong and how it can be cured, or to
full-text index only the most recent subset of data.

P.S. We tried to mark the folder with the search index as compressed, but it
did not work.

Yesterday (2011-07-31) we finished the project (development and support) of
the modernization of a Cool:GEN code base to Java for the
Chicago Mercantile Exchange.

It wasn't the first such project, but definitely the most interesting. We
migrated and tested about 300 MB of source code. In the process of translation
we identified many bugs that were present in the original code. Thanks to
languages-xom, that task turned out to be pure xslt.

We hope that CME's developers are pleased with the results.

If you by chance are looking for Cool:GEN conversion to Java, C#, or even
COBOL (we don't understand why people are still asking for COBOL), then you
can start at the bphx site.

This code performs some transformation and assigns unique values to
name-ref attributes. Values generated with the t:generate-id() function are
guaranteed to be unique, as the spec claims that every node has its own unique
generate-id() value.

Imagine our surprise when we found that the generated elements all had the
same name-ref's. We studied the code all over and found no holes in our
reasoning or implementation, so our conclusion was: it's a Saxon bug!

It's interesting that if we rewrite the code a little (see the commented
part), it starts to work properly; thus we suspect Saxon's optimizer.

Well, in the course of development we have found and reported many Saxon bugs,
but how come this little beetle was hiding for so long?

After feeding the start tag <data> and flushing the xml writer, we observe
that only "<data" has been written to the stream. Well, Flush() has never
promised anything particular about the content of the stream, so we cannot
claim any violation; however, we expected to see the whole start tag.

Inspection of the implementation of the xml writer reveals laziness in writing
data to the stream. In particular, the start tag is closed only when one
starts the content. This is probably to implement empty tags: <data/>.
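
The laziness can be sketched with a toy writer in Python. This is not the
.NET implementation: the class LazyXmlWriter and its method names are
hypothetical, and escaping and attributes are left out. It only shows why a
start tag stays unclosed until the writer knows what follows.

```python
import io


class LazyXmlWriter:
    """Toy writer that leaves a start tag open until it knows what follows."""

    def __init__(self, out):
        self.out = out
        self.stack = []        # names of the currently open elements
        self.tag_open = False  # True while "<name" still lacks ">" or "/>"

    def write_start_element(self, name):
        self._close_start_tag()
        self.out.write(f"<{name}")
        self.stack.append(name)
        self.tag_open = True

    def write_string(self, text):
        # Any content, even an empty string, forces the pending ">" out.
        self._close_start_tag()
        self.out.write(text)

    def write_end_element(self):
        name = self.stack.pop()
        if self.tag_open:
            self.out.write("/>")  # nothing was written inside: empty tag
            self.tag_open = False
        else:
            self.out.write(f"</{name}>")

    def _close_start_tag(self):
        if self.tag_open:
            self.out.write(">")
            self.tag_open = False


out = io.StringIO()
writer = LazyXmlWriter(out)
writer.write_start_element("data")
after_start = out.getvalue()          # the tag is still open here
writer.write_string("")
after_empty_content = out.getvalue()  # empty content has closed the tag
```

With such a design, issuing empty content is the natural way to force the
closing '>' without committing to any real content.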

To do the trick we had to issue empty content and, moreover, to call a
particular method of the xml writer with particular parameters. So the code
after the fix looks like this:

Update: further analysis shows that this is the only possible behaviour:
after the call to write the start element you can write attributes, content,
or the end of the element, so the writer may next write a space, '>' or '/>'.
The only question is why it takes WriteChars(empty, 0, 0) into account but not
WriteValue("").

to find all .xml-gz sources. This is not reliable, as your protocol handler can
be (and is) called before the file is indexed.

So, the only reliable way to index your data is to (re-)add the indexing rule
for the protocol handler, which in most cases reindexes everything.

The only bearable solution we found is to define the indexing rule in the form:
.xml-gz://file:d:/data/... and to use
IShellFolder(2)
interfaces to discover sub items and their modification times. This technique
allows a minimal data scan when you (re-)add the indexing rule.

In most cases this query returns nothing and runs very long. It's interesting to note that it may start returning data if the "top" clause is missing or uses a bigger number, but in these cases the query is even slower.

At some point we started to question the utility of Windows Search if it's so slow, but then we found that there is a property System.ItemNameDisplay, which in our case coincides with the value of the System.ItemName property, so we tried the query:

We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.

You can think of it as a virtual folder: a Protocol Handler allows enumerating its files, file properties, and contents.

The goal of our Protocol Handler is to represent some data structure as a set of xml files. We expected that if we found some data within a folder with these files, then a search within the Protocol Handler's scope would bring the same (or almost the same) results.

Reality is different.

For some reason the .xml IFilter (a component that extracts text data to index) works differently with the file system and with our storage. We cannot state that it does not work, but many words that Windows Search finds within a file are never found within the Protocol Handler's scope.

We have observed that if, for the purpose of indexing, we represent xml content items as .txt files, then search works as expected. So, our workaround was to present only the xml's text data for indexing, and to use the .txt IFilter (this is in fact roughly what the .xml IFilter does by itself).

Is there a conclusion?

Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.