Tika network server

Details

Description

It would be cool to be able to run Tika as a network service that accepts a binary document as input and produces the extracted content (as XHTML, text, or just metadata) as output. A bit like TIKA-169, but without the dependency on a servlet container.

I'd like to be able to set up and run such a server like this:

$ java -jar tika-app.jar --port 1234

We should also add a NetworkParser class that acts as a local client for such a service. This way a lightweight client could use the full set of Tika parsing functionality even with just the tika-core jar on its classpath.
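
As a rough illustration of the client side of this idea, a thin client might do little more than stream the document to the server over HTTP and read back the extracted text. The sketch below is an assumption, not the proposed NetworkParser: the class name NetworkParserSketch, the use of PUT, and the Content-Type and Accept header values are hypothetical, and it uses only JDK classes (Java 9+) rather than the tika-core Parser interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical thin client; the name and the wire protocol are assumptions. */
final class NetworkParserSketch {
    private final URL endpoint;

    NetworkParserSketch(URL endpoint) {
        this.endpoint = endpoint;
    }

    /** Sends the raw document bytes with PUT and returns the response body as text. */
    String extractText(InputStream document) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        // Generic type: leaves type detection to the server (assumed behaviour).
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        // State explicitly what we expect back instead of relying on a client default.
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            document.transferTo(out);
        }
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Server returned HTTP " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would point this at whatever address the server listens on, for example http://localhost:1234/ given the --port 1234 invocation above; the path under that address is likewise an assumption.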

Maxim Valyanskiy
added a comment - 08/Feb/11 08:50 I made an HTTP server with Jersey (JAX-RS) and embedded Glassfish (or Grizzly?) for text, metadata and binary attachment extraction. It has a very simple REST-style interface.
I think we can contribute it to the Tika project. I can also try to replace Glassfish and Jersey with Tomcat and Apache Wink if that is required.
What do you think?

Otis Gospodnetic
added a comment - 10/Feb/11 02:33 Out of curiosity, why not just have a simple webapp (war) that uses Tika that reads the InputStream and spits back the data in whatever format is needed/specified? Sure, it requires a servlet container, but is that really a big deal? Just asking because it seems a tiny bit simpler than using Netty or Mina or HttpComponents or embedded Jetty or Grizzly.

Jukka Zitting
added a comment - 09/Mar/11 18:43 Reopening until we figure out what to do with the references to the dev.java.net repositories.
Earlier we had problems with such references to non-standard Maven repositories and I
wouldn't like to have this issue block another release.
In revision 1079922 I removed the tika-server component from the default build, which
should allow us to release Tika even with the dev.java.net dependencies in place (we just
can't deploy tika-server to Maven central then).
There were also some test failures due apparently to some dependency version mismatch.
See https://builds.apache.org/hudson/job/Tika-trunk/483/org.apache.tika$tika-server/ for details.

Maxim Valyanskiy
added a comment - 22/Mar/12 13:09 I found that Jersey dependencies are on Maven Central now ( https://blogs.oracle.com/theaquarium/entry/jersey_moving_forward_contributions_maven ).
I'm going to synchronize tika-server with our production code and enable it in the default build after the Tika 1.1 release if there are no objections.

Maxim Valyanskiy
added a comment - 23/Mar/12 07:58 - edited
testExeDOCX(org.apache.tika.server.UnpackerResourceTest): PUT http://localhost:9998/unpacker returned a response status of 204 No Content
I could not reproduce this problem on the current trunk version.

Chris A. Mattmann
added a comment - 26/Mar/12 04:40 Hey Max, I don't have objections to moving forward re-enabling the module. How about we use CXF like I suggested though? I will try a commit to the POM shortly that will add in the CXF JaxRS dependencies. Let's try that.

Chris A. Mattmann
added a comment - 26/Mar/12 04:47
<dependency>
  <groupId>org.apache.cxf</groupId>
  <artifactId>cxf-rt-frontend-jaxrs</artifactId>
  <version>2.3.1</version>
</dependency>
Max, see above. That will take care of the transitive dependencies for JAX-RS, including the API, etc.
I'm not sure of a replacement for the test portions of the Jersey code. If you are +1 with the above, I'd like
to commit it to the tika-server/pom.xml file.

Chris A. Mattmann
added a comment - 27/Mar/12 06:46
Max FYI my current progress. I'm trying to get the unit tests rewritten but they are failing right now. Check out MetadataResource to see. The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food. I will go to the CXF lists tomorrow with my question about the failing unit tests.

Chris A. Mattmann
added a comment - 27/Mar/12 16:36
a lot closer. Unpacker tests are failing. Max, how did Jersey deal with the Map<String,byte[]> that you are returning in UnpackerResource? I don't see any @Providers in Jersey that natively know how to deal with this data structure, nor do I see any @Provider classes that you have written to take care of it. How was Jersey dealing with this?

Maxim Valyanskiy
added a comment - 27/Mar/12 18:39 Chris, there are two providers in my code that process these Maps, ZipWriter and TarWriter:
https://github.com/apache/tika/blob/trunk/tika-server/src/main/java/org/apache/tika/server/ZipWriter.java
https://github.com/apache/tika/blob/trunk/tika-server/src/main/java/org/apache/tika/server/TarWriter.java
I now think that it was not a good idea to use the Map class directly; it would be better to introduce a more specific interface.

Chris A. Mattmann
added a comment - 27/Mar/12 19:25 Hey Max, in r1305940, I committed the latest patch with those 3 tests disabled in UnpackerResource for now. We can fix them and wrap this up, and until I do so, I'll leave the issue open. Help is welcomed!

Chris A. Mattmann
added a comment - 28/Mar/12 06:13 OK, I give up for now. I disabled the 415 test that isn't passing. After researching this for hours, and working with Paul Ramirez (thanks for the help Paul), we basically found the following things to be true:
Jersey automatically sets Accept to something like */*, which IMHO is more sensible than CXF, which sets it to an XML accept type (which causes the resource to not even find the path in test415).
For whatever reason, if you set accept to "xxx/xxx", instead of checking up front like it seems Jersey did, CXF will let the call get all the way to the UnpackerResource#unpack method and then cause the Tika AutoDetectParser to fail. Jersey seemed to have caught this. I have no clue why. We mucked around with different accept and type calls and got it to send 200 OK back and parse fine (e.g., if you set the accept to */* and type to APPLICATION_MSWORD), but that defeats the purpose of the test. If you send in xxx/xxx, it seems like the JAX-RS service should send back a 415.
I need some massive help from anyone that knows CXF to figure this out. I have to step away from this for now. For now all tests pass, they are cleaned up using CXF client (with HttpClient removed), and I disabled test415. Any help to get 415 working with CXF is welcomed, even if we have to modify UnpackerResource to do the check. I know that Sergey is watching this one (from CXF ville), so would love some help here!

Sergey Beryozkin
added a comment - 28/Mar/12 14:29 - edited Hi, here is the thread on the CXF list to do with handling 415:
http://cxf.547215.n5.nabble.com/TIKA-593-odd-behavior-related-to-CXF-JAX-RS-services-and-415-Http-response-codes-tt5600131.html .
Re Accept: I think that the client code needs to have an idea about the format of the data it expects back, thus the CXF WebClient will try to set some specific default. FYI, proxy-based clients will analyze @Produces/@Consumes. Also, the idea of the client having to know what it expects back is finding its way into the JAX-RS 2.0 client API too.
Update: WebClient (trunk/2.5.3-SNAPSHOT) will only default Accept to application/xml if a specific custom class is expected to be populated on return; if a Response is expected back, no action is taken.

Maxim Valyanskiy
added a comment - 29/Mar/12 14:46 I do not completely understand your discussion about 415, but the test failed because TikaExceptionMapper was not added to the providers list (by the way, maybe CXF supports classpath scanning like Jersey?).

Maxim Valyanskiy
added a comment - 29/Mar/12 14:52
> The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food.
The CXF implementation looks much heavier than Jersey; look at "mvn dependency:tree".

Sergey Beryozkin
added a comment - 29/Mar/12 15:13 > I do not completely understand your discussion about 415, but the test failed because TikaExceptionMapper was not added to providers list.
What do you not understand?
FYI I do not understand how having TikaExceptionMapper registered can result in 415 being returned. I'm looking at it and seeing no traces of 415; can you clarify please?
> (by the way, maybe CXF supports classpath scanning like Jersey?)
No, it does not yet. It was a very specific decision: IMHO random class scanning is impractical in many cases and causes more trouble than it's worth, and is of little use when the custom providers have to be configured in a per-endpoint-specific way, as is the case with most interesting applications. However, I do accept that for simple mappers it can make sense, though I'm not sure what is simpler: restricting the packages to scan, or just registering the required providers. I prefer the latter, but please see
https://issues.apache.org/jira/browse/CXF-4199
In CXF 2.6.0 FIQL search extensions got moved to the new module, so it is time to optionally support it

Chris A. Mattmann
added a comment - 29/Mar/12 15:23 Hey Max:
> The cool part is that we reduce a bunch of the Maven dependencies with CXF and we are eating our own dog food.
> CXF implementation looks much heavier than Jersey, look at "mvn dependency:tree"
I guess here I was talking more about only having to rely on one Maven dependency tag in the tika-server pom.xml for cxf-rt-frontend-jaxrs, rather than jersey server + core and the other dependencies we used to have. If you look at the pom.xml, the deps are now reduced. That's what I was thinking (maybe a side effect?)

Maxim Valyanskiy
added a comment - 29/Mar/12 15:29 - edited > FYI I do not understand how having TikaExceptionMapper registered can result in 415 being returned, I'm looking at it and seeing no traces of 415, can you clarify please ?
I'll try to explain. Tika server's resources can handle any input mime-type. When we do not specify a mime-type in our PUT request (or specify something generic like 'application/octet-stream'), Tika uses its own mime-type detector to detect the type and choose a parser.
When we specify a mime-type, it skips the detection stage and chooses the parser that handles the specified document type.
When we can't handle the specified mime-type, when we can't detect it, or when we do not have a parser for that type, we throw WebApplicationException(Response.Status.UNSUPPORTED_MEDIA_TYPE), i.e. a 415 code.
The Tika parser framework wraps that exception into a TikaException.
TikaExceptionMapper unwraps it:
if (e.getCause() != null && e.getCause() instanceof WebApplicationException) {
    return ((WebApplicationException) e.getCause()).getResponse();
}
That exception mapper was lost after the transition from Jersey to CXF, so we had a 500 error instead of 415.
PS: maybe we can speak Russian on jabber?
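
The flow described in this comment can be sketched without any JAX-RS or Tika classes. Everything below is a stand-in introduced for illustration: StatusException plays the role of WebApplicationException, WrappedParseException plays the role of TikaException, and the SUPPORTED set and method names are hypothetical rather than taken from tika-server.

```java
import java.util.Set;

final class UnsupportedTypeFlowSketch {
    /** Stand-in for WebApplicationException: carries an HTTP status. */
    static final class StatusException extends RuntimeException {
        final int status;

        StatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Stand-in for TikaException: the wrapper added by the parser framework. */
    static final class WrappedParseException extends Exception {
        WrappedParseException(Throwable cause) {
            super(cause);
        }
    }

    /** Hypothetical set of types for which a parser exists. */
    static final Set<String> SUPPORTED = Set.of("application/msword", "application/pdf");

    /** Uses the declared type if it is specific, otherwise the detected one; 415 if unusable. */
    static String chooseType(String declared, String detected) {
        boolean generic = declared == null || declared.equals("application/octet-stream");
        String type = generic ? detected : declared;
        if (type == null || !SUPPORTED.contains(type)) {
            throw new StatusException(415);
        }
        return type;
    }

    /** Mimics the framework wrapping any failure raised while parsing. */
    static String parse(String declared, String detected) throws WrappedParseException {
        try {
            return chooseType(declared, detected);
        } catch (RuntimeException e) {
            throw new WrappedParseException(e);
        }
    }

    /** Mimics the mapper: unwrap a status-carrying cause, otherwise report 500. */
    static int toStatus(WrappedParseException e) {
        if (e.getCause() instanceof StatusException) {
            return ((StatusException) e.getCause()).status;
        }
        return 500;
    }

    /** End-to-end: 200 on success, otherwise whatever the mapper decides. */
    static int handle(String declared, String detected) {
        try {
            parse(declared, detected);
            return 200;
        } catch (WrappedParseException e) {
            return toStatus(e);
        }
    }
}
```

In this sketch the unwrapping step is what preserves the 415. Per the comment above, when the mapper is not registered with the JAX-RS runtime the wrapped exception surfaces as a generic 500 instead.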

Chris A. Mattmann
added a comment - 29/Mar/12 15:38 > but the test failed because TikaExceptionMapper was not added to providers list
Max, you're totally right! In r1306883, I committed some cleanup, removing the FIXME and uncommenting @Test, and all tests pass. I'm going to mark this issue as resolved now!
We can track further progress and updates in other issues. Thanks for the help here!

Maxim Valyanskiy
added a comment - 29/Mar/12 15:38 We have another problem with Tika server. We combine all dependency jars into one big jar that can be run via 'java -jar tika-server.jar'. It includes Tika with all parsers, the web server, etc.
When I try to run it, I get the following exception:
SEVERE: Can't start
org.apache.cxf.service.factory.ServiceConstructionException
at org.apache.cxf.jaxrs.JAXRSServerFactoryBean.create(JAXRSServerFactoryBean.java:190)
at org.apache.tika.server.TikaServerCli.main(TikaServerCli.java:92)
Caused by: org.apache.cxf.BusException: No DestinationFactory was found for the namespace http://cxf.apache.org/transports/http.
at org.apache.cxf.transport.DestinationFactoryManagerImpl.getDestinationFactory(DestinationFactoryManagerImpl.java:126)
at org.apache.cxf.endpoint.ServerImpl.initDestination(ServerImpl.java:88)
at org.apache.cxf.endpoint.ServerImpl.<init>(ServerImpl.java:72)
at org.apache.cxf.jaxrs.JAXRSServerFactoryBean.create(JAXRSServerFactoryBean.java:151)
... 1 more
I think that something is wrong in bundle-plugin configuration

Chris A. Mattmann
added a comment - 29/Mar/12 15:40
for the intent of this issue, I think that the functionality is complete. We can open up new issues to track further improvements and bugs. Thanks Max, Sergey, Ingo, and others who have contributed, and to Jukka for the original idea and spec!

Chris A. Mattmann
added a comment - 29/Mar/12 15:41
crap, just saw Max's comment. I'm going to let this sit for a while and make sure Max and I can fully run this, before closing the issue. We're close though!

Sergey Beryozkin
added a comment - 29/Mar/12 16:34 - edited Max,
> ... That exception mapper was lost after transition from Jersey to CXF, so we had 500-error instead of 415.
Right. The good thing is that we know the cause, and as I indicated, we will get to the optional class scanning support in CXF.
The funny side to it is that we spent a lot of time with Chris thinking how Jersey magically turns away "xxx/xxx" with 415, we thought initially Jersey blocked it even before dispatching, but as it happens it was also passing it through
Update: Max, sure, we can talk on Jabber, please share your id with me on #cxf or post here

Chris A. Mattmann
added a comment - 29/Mar/12 16:49 > The funny side to it is that we spent a lot of time with Chris thinking how Jersey magically turns away "xxx/xxx" with 415, we thought initially Jersey blocked it even before dispatching, but as it happens it was also passing it through
I know! I got down the wrong rabbit hole, thanks for the help, both of you heh...

Sergey Beryozkin
added a comment - 29/Mar/12 16:58 > I think that something is wrong in bundle-plugin configuration
The packaged jar contains duplicate entries for different packages in /org/apache/cxf/, and probably for others. Maybe you should use the Maven Shade plugin; here is an example from CXF:
http://svn.apache.org/repos/asf/cxf/trunk/osgi/bundle/all/pom.xml