Can you give me some more information as to what you're trying to do? It will help me give a better answer.

If you are parsing content from a URL, jsoup will automatically recognise the charset from the response headers or the HTML meta http-equiv tag, so you don't need to worry about it. Once the HTML is parsed and you are accessing it (via .html(), .text(), etc.) you are working with Java Unicode Strings.
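For example, a URL fetch needs no charset handling at all. A minimal sketch (the URL here is just a placeholder, and jsoup is assumed to be on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class UrlParseExample {
    public static void main(String[] args) throws IOException {
        // jsoup reads the charset from the Content-Type response header,
        // or falls back to the page's meta http-equiv declaration.
        Document doc = Jsoup.connect("https://example.com/").get();

        // By this point everything is already a Java Unicode String.
        System.out.println(doc.title());
    }
}
```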

If you are parsing from a file on disk, you may need to tell jsoup what the charset is, because there is no HTTP header to help it. It comes down to how the file was saved in the first place. If it was saved as (e.g.) UTF-8, then you need to specify that when you load it. (If you pass a null charset to the file parse, jsoup will try the meta http-equiv tag, but that is unreliable.)
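For instance, for a file you know (or assume) was saved as UTF-8, a rough sketch looks like this (the file path is hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class FileParseExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical path: wherever your crawler writes its output.
        File input = new File("pages/example.html");

        // If you know the file was saved as UTF-8, say so explicitly:
        Document doc = Jsoup.parse(input, "UTF-8");

        // If you genuinely don't know, pass null and jsoup will look for a
        // meta http-equiv declaration inside the file (unreliable, as noted):
        Document guessed = Jsoup.parse(input, null);

        System.out.println(doc.title());
    }
}
```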

Anyway, let me know what you're doing and hopefully I can give a clearer example.

We have a crawler program that downloads a large number of web pages. Unfortunately I don't know how the crawler saves the HTML files after it crawls, but I notice that some have a UTF-8 charset, while others have different charsets, including foreign-language ones. When I look at the parsed text from some web pages, some of the text is unreadable, which is caused by the charset. So we want to know if we can use jsoup's charset-related functions to parse the contents correctly.

It seems that there are three possibilities here, depending on how your crawler behaves:

1: Crawler is aware of the input charset, decodes correctly, and saves everything as UTF-8. E.g. the input may be ASCII, GB2312, UTF-8, etc. If that were the case, you could just use "UTF-8" as the charset when parsing from a file in jsoup. However, I don't think this is the case, per your description.

2: Crawler is aware of the input charset, and saves the output using the input charset. E.g. input = ASCII, output = ASCII; input = GB2312, output = GB2312. I think this is the most likely case, because it's the easiest to implement and the most likely to always work. If so, you need the crawler to tell you what the output charset is so that you can parse the file correctly. Does the crawler save the response HTTP headers? Most I've seen do. If it does, you need to parse the headers, get the charset, and then use jsoup's file parse with that as the input charset.
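Pulling the charset out of a saved Content-Type header is plain string work. A minimal sketch, assuming the standard header form (e.g. `text/html; charset=GB2312`):

```java
import java.util.Locale;

public class CharsetFromHeader {
    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=GB2312" -> "GB2312".
    // Returns null if no charset parameter is present.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the "charset=" key and any surrounding quotes.
                return trimmed.substring("charset=".length()).replace("\"", "");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String cs = charsetFromContentType("text/html; charset=GB2312");
        System.out.println(cs); // GB2312
        // cs can then be passed to jsoup's file parse as the input charset.
    }
}
```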

3: Crawler is not aware of the input charset, assumes it's UTF-8, and outputs as UTF-8. If that's the case, you're hosed, because that's a destructive operation, and you'd need to get the crawler modified to work like case 1 or 2. But that seems unlikely.

So, I think the best approach is to find out exactly how your crawler works with different input charsets, how it saves them out, and how it records what the output charset was; and then parse from there. One experiment would be to fetch some of the pages the crawler has saved yourself, identify what charset the server declares, and then compare that with the crawler's output.

Hope these thoughts help; any other suggestions from the group are welcome.

I just spoke to my colleague about this charset issue. He said the crawler outputs all the page sources of the HTML pages as byte[], and we can probably manage to get the HTTP header information. Given that we can extract the charset from the headers, what will be the next steps? Kindly note that we are dealing with byte[] variables, not file variables.
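On byte[] input: once you have the charset from the headers, you can decode the bytes to a String yourself and hand that to jsoup, so no file is involved at all. A minimal sketch (method and variable names are just illustrative, and jsoup is assumed to be on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.nio.charset.Charset;

public class BytesParseExample {
    // pageBytes comes from your crawler; charsetName comes from the saved
    // HTTP headers; baseUri is the page's original URL (used to resolve
    // relative links).
    static Document parsePage(byte[] pageBytes, String charsetName, String baseUri) {
        // Decode the raw bytes using the charset the server declared,
        // producing a normal Java Unicode String.
        String html = new String(pageBytes, Charset.forName(charsetName));
        return Jsoup.parse(html, baseUri);
    }
}
```

An alternative is to wrap the array in a ByteArrayInputStream and use jsoup's stream parse, which takes a charset name directly; either way, the key step is decoding with the charset the headers declared rather than the platform default.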