Re: [VuFind-Tech] Problems in importing fulltext from pdfs

Is it possible the solution is this simple?
http://www.exampledepot.com/egs/java.io/ReadFromUTF8.html
- Demian
________________________________
From: Ronan McHugh [rmchugh@...]
Sent: Friday, September 07, 2012 11:55 AM
To: Demian Katz; guenter.hipler@...; vufind-tech@...
Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
Hi,
Tika takes an encoding argument. We set this to UTF-8 and read the output from stdout into a StringBuilder. If we run Tika directly from the command line and redirect its output to a file (using the > operator), the text is encoded correctly. I therefore tried to replicate this in the bsh script. Since Tika doesn’t take an argument for saving output to a file, I used the > operator in the script as well. However, for whatever reason, this didn’t work and nothing was saved to file, so I had to read directly from stdout into the String.
So, our guess is that something is going wrong between Tika’s output at stdout and the BufferedReader reading it in. However, we don’t know how to fix it at this stage. I attempted to use a BufferedWriter to write to file in UTF-8, but this only made matters worse.
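One possible cause, offered as a guess rather than a diagnosis: an InputStreamReader created without a charset decodes with the platform default, which may not be UTF-8. A minimal sketch of reading the child process's stdout with an explicit charset follows. The class and method names are hypothetical, and the Tika invocation (jar path, `--text`, `--encoding=UTF-8`) is an assumption about how the script calls tika-app, not a copy of the script on JIRA.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ProcessOutput {
    // Hypothetical helper: decode a byte stream as UTF-8 instead of
    // relying on the platform default charset.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Hypothetical wrapper: the jar path and command-line flags are assumptions.
    static String extractText(String tikaJar, String file)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-jar", tikaJar, "--text", "--encoding=UTF-8", file);
        // Send the child's stderr to our own stderr so it cannot fill up and block.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text = readUtf8(p.getInputStream());
        p.waitFor();
        return text;
    }
}
```

As for the > operator: redirection is a shell feature, so when a command is launched directly from Java without a shell, the > is passed to the program as an ordinary argument. If writing to a file is preferred, `ProcessBuilder.redirectOutput(File)` does that without a shell.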
Best,
Ronan
________________________________
From: Demian Katz [mailto:demian.katz@...]
Sent: 07 September 2012 16:33
To: Ronan McHugh; guenter.hipler@...; vufind-tech@...
Subject: RE: [VuFind-Tech] Problems in importing fulltext from pdfs
Is Tika incorrectly interpreting characters, or is the problem simply that your documents are in a variety of encodings and need to be normalized? Does Tika include a parameter for normalizing character sets? If not, it might be possible to add an intermediate step to detect character sets and normalize everything to UTF-8.
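A rough sketch of what such an intermediate step could look like, using only the JDK: try a strict UTF-8 decode first and fall back to a single assumed legacy charset if the bytes are not valid UTF-8. This is not real charset detection (a library would be needed for that), and the class name, method names, and the choice of fallback are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetNormalizer {
    // Decode as UTF-8 if the bytes are valid UTF-8; otherwise assume the
    // caller-supplied fallback charset (an assumption, not a detection).
    static String decode(byte[] raw, Charset fallback) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    // Re-encode the decoded text so everything downstream sees UTF-8.
    static byte[] toUtf8(byte[] raw, Charset fallback) {
        return decode(raw, fallback).getBytes(StandardCharsets.UTF_8);
    }
}
```

Note that this only helps if the extracted text really arrives in mixed encodings; it does nothing if the characters are already being misinterpreted further upstream.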
- Demian
________________________________
From: Ronan McHugh [rmchugh@...]
Sent: Friday, September 07, 2012 11:28 AM
To: guenter.hipler@...; vufind-tech@...
Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
Hi,
I’ve created a new bsh script which uses Tika to parse input files and uploaded it to JIRA. We have noticed that there are problems with the encoding of some characters, such as curly apostrophes, long hyphens and diacritics. However, we can’t see any obvious way to fix this. If anyone has any ideas, it would be much appreciated.
Thanks,
Ronan
________________________________
From: guenter.hipler@... [mailto:guenter.hipler@...]
Sent: 07 September 2012 15:07
To: Ronan McHugh
Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
Done,
Günter
On 09/07/2012 03:28 PM, Ronan McHugh wrote:
Yes, please send it on or post it to JIRA, might be helpful.
Thanks,
Ronan
________________________________
From: guenter.hipler@... [mailto:guenter.hipler@...]
Sent: 07 September 2012 13:15
To: vufind-tech@...
Subject: Re: [VuFind-Tech] Problems in importing fulltext from pdfs
Hi Ronan,
We have been using Tika for some time; it's mature and fast. However, we are not using the VuFind document processing channel (at least so far), although it's quite similar, so it is a little difficult to provide specialized support in this area.
But let me know if you run into difficulties. In particular, the PDF parser (PDFBox) that Tika wraps is a little tricky when the PDF documents you are going to parse contain malformed characters, so good exception handling is necessary.
I can publish the code (Java, not a lot of lines) we are using for it - let me know.
Günter
On 09/07/2012 01:09 PM, Ronan McHugh wrote:
Grand, I’ll chance my arm without it and see how I get on.
________________________________
From: Demian Katz [mailto:demian.katz@...]
Sent: 07 September 2012 12:03
To: Ronan McHugh; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
The only reason I mentioned Leechcrawler is that the existing Aperture team recommended it as a successor to Aperture... but it's entirely possible that Leechcrawler replicates Aperture functionality that VuFind doesn't need. If the stand-alone Tika app is good enough, I see no pressing reason to complicate things with extra tools.
- Demian
________________________________
From: Ronan McHugh [rmchugh@...]
Sent: Friday, September 07, 2012 6:57 AM
To: Demian Katz; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
We were just discussing that – is there any specific reason for using Leechcrawler as a wrapper? Tika can be called directly from the command line similar to the way in which Aperture is called, so it doesn’t seem necessary in this case.
________________________________
From: Demian Katz [mailto:demian.katz@...]
Sent: 07 September 2012 11:56
To: Ronan McHugh; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
Excellent, thanks! I'm also very glad to hear you're experimenting with Tika -- if that all works out it will be a valuable addition to VuFind. Have you looked at the Leechcrawler tool mentioned on http://vufind.org/jira/browse/VUFIND-600? (Of course, if that's not needed to achieve the necessary functionality, using Tika alone would probably be even better).
- Demian
________________________________
From: Ronan McHugh [rmchugh@...]
Sent: Friday, September 07, 2012 6:53 AM
To: Demian Katz; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
No problem at all. I’ll stick it on the JIRA ticket. We’re experimenting with using Tika instead at the moment which looks promising. Specifically, I’m trying to update getFullText to use either Aperture or Tika. I’ll let you know how it goes.
Ronan
________________________________
From: Demian Katz [mailto:demian.katz@...]
Sent: 07 September 2012 11:50
To: Ronan McHugh; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
Makes sense. Still in the midst of the last push to finish up 2.0beta, but I'll try to take care of this when I get back to JIRA catchup.
If there's a way to also post a sample PDF that causes the problem, that would be very helpful too for testing purposes... though I imagine there's a good chance that copyright issues make that impossible.
- Demian
________________________________
From: Ronan McHugh [rmchugh@...]
Sent: Friday, September 07, 2012 4:22 AM
To: Demian Katz; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
Sure, I have created http://vufind.org/jira/browse/VUFIND-644 with the patch attached. I reckon the try/catch bit at least should probably be committed when you have time since it will prevent the indexer from crashing out on encountering illegal characters.
Best,
Ronan
________________________________
From: Demian Katz [mailto:demian.katz@...]
Sent: 06 September 2012 20:53
To: Ronan McHugh; vufind-tech@...
Subject: RE: Problems in importing fulltext from pdfs
I worked with Nathan a bit more on the HTML problem he had encountered, and we reached a similar solution – strip out invalid characters from the stream prior to parsing. It’s too bad there isn’t a mode to simply ignore these weird control characters… but I think this is the best we can do in the meantime! It will be interesting to see if Tika is more robust if we eventually adopt that in place of Aperture (though that project is not high on my priority list right now, I have to confess). In any case, it might be worth posting this patch on JIRA for future reference.
- Demian
From: Ronan McHugh [mailto:rmchugh@...]
Sent: Thursday, September 06, 2012 12:00 PM
To: vufind-tech@...
Subject: [VuFind-Tech] Problems in importing fulltext from pdfs
Hi all,
We are experimenting at present with importing large PDF documents into our index and have run into a problem with bad character encoding. It seems that Aperture sometimes outputs an invalid character, 0xc, which then causes the parser to break, in turn crashing the indexer and preventing the record from being indexed at all. This error was noticed previously by Nathan in relation to an HTML document (http://vufind-tech.2307425.n4.nabble.com/Re-VuFind-General-Aperture-Issue-td4411904.html) but it seems the problem wasn’t solved back then.
To solve the problem we have modified the getFullText bsh script as follows:
1) We have inserted an additional method to clean this specific character out of the file before it is sent to be parsed. There are potentially other characters which should be removed, but we decided to err on the side of caution, as using a more zealous replaceAll tended to break the document in a myriad of other ways. Anyone who has other problems should simply expand the string being replaced at this point.
2) The parse method has been encased in a try / catch block meaning that any exceptions caused by the parsing will not prevent the rest of the record from being indexed.
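For readers without the attachment, the two changes described above might look roughly like the sketch below. This is not the patch itself: the class, interface, and method names are hypothetical, and the `Parser` interface merely stands in for whatever parses the extractor's output in the real script.

```java
class FullTextGuard {
    // Hypothetical stand-in for the step that parses the extractor's output.
    interface Parser {
        String parse(String input) throws Exception;
    }

    // Step 1: remove only the form-feed character (0xC, U+000C).
    // Other control characters are deliberately left alone.
    static String stripFormFeed(String text) {
        return text.replace("\f", "");
    }

    // Step 2: wrap parsing in try/catch so a failure yields an empty
    // full-text value instead of aborting the whole record.
    static String parseOrEmpty(String raw, Parser parser) {
        try {
            return parser.parse(stripFormFeed(raw));
        } catch (Exception e) {
            System.err.println("Full-text parse failed, skipping full text: " + e);
            return "";
        }
    }
}
```

Returning an empty string on failure is one choice among several; the point is only that the rest of the record can still be indexed.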
Attached is our patch; any questions or comments are more than welcome.
Best,
Ronan
Visit our free exhibitions <http://www.nli.ie/en/udlist/current-exhibitions.aspx>
___________________________________________
Tabhair cuairt ar ár dtaispeántais saor in aisce <http://www.nli.ie/ga/udlist/current-exhibitions.aspx>
The contents of this e-mail (including attachments) are private and confidential and may also be subject to legal privilege. It is intended only for the use of the addressee. If you are not the addressee, or the person responsible for delivering it to the addressee, you may not copy or deliver this e-mail or any attachments to anyone else or make any use of its contents; you should not read any part of this e-mail or any attachments. Unauthorised disclosure or communication or other use of the contents of this e-mail or any part thereof may be prohibited by law and may constitute a criminal offence. If you receive this e-mail by mistake please notify the system manager @ 6030219.
Tá an ríomhphost seo (agus aon iatán a ghabhann leis) príobháideach agus rúnda agus d’fhéadfadh go mbeadh eolas inti atá faoi phribhléid dhlíthiúil. Ní ceadmhach úsáid an ríomhphoist seo d’éinne ach don té ar seoladh chuige é. Mura duitse an ríomhphost seo nó an té atá freagrach as é a sheoladh, tá cosc ar chóipeáil agus ar sheachadadh an ríomhphoist seo agus aon iatán a ghabhann leis chuig éinne nó úsáid a bhaint as a bhfuil ann; ní ceart an ríomhphost seo nó aon iatán a léamh. D’fhéadfadh go mbeadh cosc iomlán dlíthiúil ar sceitheadh nó comhfhreagras nó aon úsáid eile gan chead ar a bhfuil sa ríomhphost seo agus d’fhéadfadh sé a bheith ina chion coiriúil. Má fuair tú an ríomhphost seo trí earráid, déan teagmháil le bainisteoir an chórais @6030219
--
Universität Basel
Universitätsbibliothek
Günter Hipler
Projekt SwissBib
Schoenbeinstrasse 18-20
4056 Basel, Schweiz
Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
E-Mail: guenter.hipler@...
URL: www.swissbib.org / http://www.ub.unibas.ch/