The processing instruction target matching “[xX][mM][lL]” is not allowed

This exception reminds me of the phrase "much ado about nothing".

I experienced this exception while parsing an XML document using SAXParser. I searched the web and found many questions about it; several others had experienced the same issue. So I thought of putting a solution here, in case it helps someone who faces something similar.

Well, the solution is pretty simple. The problem lies somewhere in the first line of your XML, i.e.

<?xml version="1.0" encoding="utf-8"?>

In my case there was whitespace before the start tag <. I trimmed the XML string and that was all!
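The same failure is easy to reproduce with Python's built-in SAX parser (the error text differs from the Java SAXParser message, and the tag names here are purely illustrative):

```python
import xml.sax

# XML with stray whitespace before the declaration -- the parser rejects it
raw = '   <?xml version="1.0" encoding="utf-8"?><root><item>hello</item></root>'

class TagCollector(xml.sax.ContentHandler):
    """Collects the names of the elements seen during parsing."""
    def __init__(self):
        self.tags = []
    def startElement(self, name, attrs):
        self.tags.append(name)

handler = TagCollector()
try:
    xml.sax.parseString(raw.encode("utf-8"), handler)
except xml.sax.SAXParseException as exc:
    print("parse failed:", exc.getMessage())

# Trimming the leading whitespace is all it takes
xml.sax.parseString(raw.strip().encode("utf-8"), handler)
print(handler.tags)  # ['root', 'item']
```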

One of the most common questions I have been asked recently is: what are the new features in Solr 4.0? Well, there are many posts on the web that provide this information, and the Solr wiki also explains it quite well.

I will try to throw some light on the features that are really cool, their practical usages, and which of them we are trying to leverage in our current projects. I will also define each feature briefly to set the ground.

Pseudo-fields: These provide the ability to alias a field, or to add metadata along with the returned documents. We are using them to return the confidence score of the matched document.
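For instance, the relevance score and a field alias can both be requested through the fl parameter (the host, core, and field names below are illustrative):

```shell
# Return the relevance score alongside stored fields, and alias "id" as "docid"
curl "http://localhost:8983/solr/select?q=title:lucene&fl=docid:id,title,score"
```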

New spell checker implementation: This does not require a separate index to be created and works directly on the main index. Hence, no extra index needs to be maintained for the spell checker to work.
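A sketch of the corresponding solrconfig.xml entry, along the lines of the stock 4.0 example configuration (the field name title is an assumption; pick a field suited to suggestions):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>
    <!-- DirectSolrSpellChecker reads the main index; no side index to rebuild -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
  </lst>
</searchComponent>
```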

Enhancements have been made to function queries, and conditional function queries are now allowed. We had a scenario where we were boosting documents on the basis of download count. There were some documents for which the download count was not available, and they were badly affected. Now conditional boosting can be done, so that only documents with more than a specified download count are boosted.
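As a sketch, the exists() and if() functions in 4.0 let you guard such a boost (the host, core, and downloadCount field are hypothetical):

```shell
# Boost by log(downloadCount) only when the field has a value; stay neutral otherwise
curl "http://localhost:8983/solr/select?q=solr&defType=edismax&boost=if(exists(downloadCount),log(downloadCount),1)"
```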

Atomic updates: These provide the flexibility to update only the fields of a document that have been modified. Prior versions required us to send the complete document even if a single field had been modified. Note: internally, this is still implemented as a delete followed by an add, and is not a DB-like in-place update.

New relevance ranking models like BM25, language models, etc. have been introduced. Analysis needs to be done to check whether some other model works better than the current VSM.

Indexed terms are no longer UTF-16 char sequences; instead, terms can be any binary value encoded as byte arrays. By default, text terms are now encoded as UTF-8 bytes.

A transaction log ensures that even uncommitted documents are never lost

FuzzyQueries are 100x faster

SolrCloud: I will refrain from using this until 4.1 releases.

NoSQL features: As of now, I prefer to use Solr for search. For NoSQL, I will stick with the NoSQL DB of my choice.

Precision and recall are the first things that come to mind when we talk of information retrieval.

Whenever we develop an IR engine or tune an existing one, we are interested in knowing how good our search results are, or how much they have improved. This is where precision and recall come into play.

Whenever we query the IR system, we generally retrieve x results out of the y relevant results among the total of z documents in the corpus. Out of these x retrieved documents, some number a will be relevant.

Precision can then be defined as a/x, and recall as a/y.

Hence, we can define precision as the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved.

For example, suppose the index has 20 documents about music and 10 about movies. A music query returns 10 documents, of which 5 are music and 5 are movies. Hence, the precision is 5/10 = 1/2, i.e. 50%, and the recall is 5/20 = 1/4, i.e. 25%, for the query.
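The arithmetic above can be checked with a small sketch; the counts mirror the music/movies example:

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Precision = a/x and recall = a/y, in the notation used above."""
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    return precision, recall

# 10 documents returned, 5 of them music; 20 music documents exist in the index
p, r = precision_recall(retrieved_relevant=5, retrieved_total=10, relevant_total=20)
print(p, r)  # 0.5 0.25
```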

In a nutshell, we can say that precision is a measure of quality while recall is a measure of quantity. So, high recall means that an algorithm returned most of the relevant results, and high precision means that it returned more relevant results than irrelevant ones.

One of the promising features of Solr 4.0 is atomic updates. With previous releases of Solr, to update a document you had to send all the fields, even those that had not changed. If you provided only the fields that had changed, the values of the other fields were lost. Why does it behave so? Because Lucene deletes the document and then adds it back.

Many a time, you form a Lucene index by reading data from different sources or from different tables in a DB. Forming a complete Solr document is a costly operation in many cases, say when you are building a Solr document from several graph DBs. Solr 4.0 sets you free! Just send the field to be updated (along with a few additional parameters) together with the unique key field, and you are done. Internally, Solr queries the document by its unique id, modifies it, and adds it back to the index, but it frees your client application from doing the same.

To update a field, add an update attribute to the field tag, with set as the value. Here is an example:
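A minimal sketch of such an update document, assuming id is the uniqueKey and the field names are hypothetical:

```xml
<add>
  <doc>
    <!-- uniqueKey identifies the document to patch -->
    <field name="id">book1</field>
    <!-- update="set" replaces only this field; all other fields are preserved -->
    <field name="title" update="set">Solr in Action</field>
  </doc>
</add>
```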

I happened to find a workaround to achieve a proper add of a value to a multi-valued field. The trick is to update any other field with the same value (one you are sure of), or to have a dummy field and update it with a null value. Also pass the values for the multi-valued fields the way you do when adding a new document. Here is an example:
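A sketch of the workaround, assuming id is the uniqueKey and dummyField/tags are hypothetical field names (this relies on the 4.0 behavior described above, so treat it as a workaround rather than a supported API):

```xml
<add>
  <doc>
    <field name="id">book1</field>
    <!-- null-set a throwaway field so the request is treated as an atomic update -->
    <field name="dummyField" update="set" null="true"/>
    <!-- multi-valued values passed plainly, as when adding a new document -->
    <field name="tags">search</field>
    <field name="tags">lucene</field>
  </doc>
</add>
```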

The two primary operations in Solr are indexing and searching. When it comes to indexing, documents can be indexed from different sources like a DB, XML files, CSV, etc. In this blog, we are going to focus on indexing XML files. XML can be indexed into Solr as follows:

Over HTTP: To index a document into Solr, the XML should be created in the following format:

<add>
  <doc>
    <field name="field1">value to be indexed</field>
    <field name="field2">value to be indexed</field>
  </doc>
  <doc>
    <field name="field1">value to be indexed</field>
    <field name="field2">value to be indexed</field>
  </doc>
</add>

Note: field1 & field2 should correspond to field names in schema.xml. Ensure that values for all required fields are present in the XML.

The documents can be indexed using the GET or POST method. Use GET only when adding a few small documents.

If the documents are being added using the GET method, index them as follows:
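A sketch of both styles against the standard /update handler (host, port, and the example document are assumptions; the XML in the GET URL must be URL-encoded):

```shell
# GET: pass the add XML via stream.body -- only for a few small documents
curl "http://localhost:8983/solr/update?stream.body=<add><doc><field%20name=%22id%22>1</field></doc></add>&commit=true"

# POST: send the XML file as the request body -- preferred for anything non-trivial
curl "http://localhost:8983/solr/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary @docs.xml
```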