2012-03-20 13:16:06,405 INFO [solr.core.SolrCore] - [http-80-18] - : [core_a] webapp=/solr path=/admin/ping params={} status=500 QTime=7
2012-03-20 13:16:06,406 ERROR [solr.servlet.SolrDispatchFilter] - [http-80-18] - : null:org.apache.solr.common.SolrException: Ping query caused exception: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:77)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:68)
... 16 more
Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:278)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:158)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:109)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:468)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:251)
... 10 more

Dawid Weiss
added a comment - 20/Mar/12 13:29 I once tried to debug it but couldn't reproduce. It does happen from time to time on my build machine though.
'60' is ASCII for '<' so I guess it's something weird emitted. Can you apply the attached patch and try to cause this, Markus?

I've redeployed with your patch. This is very peculiar indeed! The stack trace shows a ping failing and at the bottom some that work well. I've also noticed the /select handler not being there so i've did a manual request on /select?q=: and i sometimes get the same error. Some work, some don't.

And here comes the moment where my knowledge of Solr ends I'd say there is definitely a bug in improper handling of HTTP response status (and this should be fixed), unless there is a filter somewhere that emits this HTML and fakes HTTP 200... But as for the cause of why this happens in general – no idea.

Dawid Weiss
added a comment - 20/Mar/12 13:52 And here comes the moment where my knowledge of Solr ends I'd say there is definitely a bug in improper handling of HTTP response status (and this should be fixed), unless there is a filter somewhere that emits this HTML and fakes HTTP 200... But as for the cause of why this happens in general – no idea.

With this issue indexing fails, although some documents seem to be added. Custom request handlers still work but the default /select handler gives the trouble, which is used by our ping handler. Manual requests to /select?distrib=false do work without trouble.

I also know that this happens with an empty index.

I'd love to provide more details but i haven't. For now the issue is here but it just might disappear as suddenly as it appeared.

Markus Jelsma
added a comment - 20/Mar/12 14:00 Nasty!
With this issue indexing fails, although some documents seem to be added. Custom request handlers still work but the default /select handler gives the trouble, which is used by our ping handler. Manual requests to /select?distrib=false do work without trouble.
I also know that this happens with an empty index.
I'd love to provide more details but i haven't. For now the issue is here but it just might disappear as suddenly as it appeared.

I suspected Solr's distributed capabilities because of the error occuring with distrib=true. So i stopped Zookeeper, removed the data directory and restarted Zookeeper and the Solr nodes. I attached a zookeeper dump i took just moments before removing the data directory.

Markus Jelsma
added a comment - 20/Mar/12 14:07 I suspected Solr's distributed capabilities because of the error occuring with distrib=true. So i stopped Zookeeper, removed the data directory and restarted Zookeeper and the Solr nodes. I attached a zookeeper dump i took just moments before removing the data directory.

Yonik Seeley
added a comment - 20/Mar/12 14:11 I suspected Solr's distributed capabilities because of the error occuring with distrib=true.
I was going to ask... do you mean for the ping query to be distributed?

It seems i found the problem. The solr.xml file on one of the nodes received a typo. Instead shard="shard1" i had shard_a="shard1". It's pretty hard to reproduce but after removing the ZK data directories you can start the nodes with one core having a bad shard parameter. Originally only one node had a corrupt solr.xml file but i could only reproduce by corrupting the file on both nodes and starting Solr.

Markus Jelsma
added a comment - 20/Mar/12 14:43 It seems i found the problem. The solr.xml file on one of the nodes received a typo. Instead shard="shard1" i had shard_a="shard1". It's pretty hard to reproduce but after removing the ZK data directories you can start the nodes with one core having a bad shard parameter. Originally only one node had a corrupt solr.xml file but i could only reproduce by corrupting the file on both nodes and starting Solr.

If i'm understanding correctly, it sounds like the root issue was invalid configuration on one server, and the error message was due to that javabin response codec trying to parse an plain text error response page from that server.

Hoss Man
added a comment - 07/Jun/12 21:42 If i'm understanding correctly, it sounds like the root issue was invalid configuration on one server, and the error message was due to that javabin response codec trying to parse an plain text error response page from that server.
So it doesn't seem like there is any bug to fix here?

Dawid Weiss
added a comment - 07/Jun/12 21:48 I don't know if it's invalid configuration or something else but the error message could be less cryptic I mean – some kind of sanity check during header parsing would be nice (if possible).

Hoss Man
added a comment - 09/Jun/12 00:33 some kind of sanity check during header parsing would be nice (if possible).
yeha ... but that's kind of orthoginal to the initial bug report (that ping wasn't working in distrib mode)
I opened SOLR-3530 to look into improving things, but there may be diminishing returns here.