As I understand it, HTTP headers are only set if the web server is set up to do so, and may default to a particular encoding even if the developers didn't intend it. Meta tags are only set if the developer added them to the code... though some development frameworks set them automatically (which is problematic if the developer didn't consider this).

I've found that when these are set at all, they often conflict with each other, e.g. the HTTP header says the page is ISO-8859-1 while the meta tag specifies Windows-1252. I could assume one supersedes the other (per the spec, the HTTP header should win), but that seems fairly unreliable in practice. It also seems like very few developers consider this when dealing with their data, so dynamically generated sites are often mixing encodings, or unintentionally serving whatever encoding comes out of their database.

My conclusion has been to do the following:

Check the encoding of every page using mb_detect_encoding().

If that fails, I use the meta encoding (http-equiv="Content-Type"...).

If there is no meta content-type, I use the HTTP Content-Type header.

If there is no HTTP Content-Type either, I assume UTF-8.

Finally, I convert the document using mb_convert_encoding(), then scrape it for content; a sketch of the whole chain follows below. (I've purposely left out the encoding to convert to, to avoid that discussion here.)
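Roughly, the chain looks something like this (detectCharset is just an illustrative helper name, the candidate list is an example, the header array shape is assumed, and 'UTF-8' as the conversion target is only a placeholder):

    <?php
    function detectCharset(string $html, array $httpHeaders): string
    {
        // 1. Byte-level detection first. An explicit candidate list plus
        //    strict mode makes mb_detect_encoding noticeably more reliable
        //    than calling it with defaults.
        $detected = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
        if ($detected !== false) {
            return $detected;
        }

        // 2. Fall back to the meta tag; this pattern catches both the
        //    http-equiv="Content-Type" form and HTML5 <meta charset=...>.
        if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
            return $m[1];
        }

        // 3. Fall back to the HTTP Content-Type header (assumed here to be
        //    a name => value array with normalized keys, built when fetching).
        if (isset($httpHeaders['Content-Type'])
            && preg_match('/charset=([\w-]+)/i', $httpHeaders['Content-Type'], $m)) {
            return $m[1];
        }

        // 4. Last resort: assume UTF-8.
        return 'UTF-8';
    }

    // Usage, given a fetched $html body and a $httpHeaders name => value array:
    // $charset = detectCharset($html, $httpHeaders);
    // $html    = mb_convert_encoding($html, 'UTF-8', $charset); // placeholder target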

I'm attempting to get as much accurate content as possible, rather than ignoring web pages just because their developers didn't set the headers properly.

What problems do you see with this approach?

Am I going to run into problems using the mb_detect_encoding() and mb_convert_encoding() functions?

I suggest you apply all available methods, make use of external libraries as well (like this one: http://mikolajj.republika.pl/), and go with the most probable encoding.

Another approach to make it more precise is to build a country-specific list of possible character sets and pass only those to mb_convert_encoding (its from-encoding parameter accepts a list of candidates). In Hungary, for example, ISO-8859-2 or UTF-8 are the most probable, and others are not really worth considering. The country can be guessed from a combination of the TLD, the Content-Language HTTP header, and IP address location. Although this requires some research work and extra development, it might be worth the effort.
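Something like this, keyed off the TLD alone (only the Hungarian entry comes from the example above; the other table entries are illustrative guesses, and the TLD extraction is deliberately naive):

    <?php
    // Map TLDs to the encodings worth considering for that country.
    $candidatesByTld = [
        'hu' => ['UTF-8', 'ISO-8859-2'],             // per the Hungarian example
        'ru' => ['UTF-8', 'Windows-1251', 'KOI8-R'], // illustrative guess
        'jp' => ['UTF-8', 'SJIS', 'EUC-JP'],         // illustrative guess
    ];

    $url = 'http://example.hu/';
    $tld = substr(strrchr(parse_url($url, PHP_URL_HOST), '.'), 1);
    $candidates = $candidatesByTld[$tld] ?? ['UTF-8', 'ISO-8859-1'];

    // mb_convert_encoding accepts a list of source encodings and
    // auto-detects among them ('UTF-8' as target is just a placeholder).
    $html = "Árvíztűrő tükörfúrógép"; // stand-in for the fetched page body
    $html = mb_convert_encoding($html, 'UTF-8', $candidates);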

Some comments on the documentation page for mb_convert_encoding report that iconv works better for Japanese character sets; a sketch of that alternative is below.
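For instance (a sketch; the //TRANSLIT//IGNORE suffix asks iconv to approximate or drop unmappable characters instead of failing on the first problem byte, though the exact behaviour depends on the underlying iconv implementation):

    <?php
    // Sample Shift-JIS bytes, produced here just so the snippet runs.
    $sjisBytes = mb_convert_encoding('日本語のテキスト', 'SJIS', 'UTF-8');

    $converted = iconv('SJIS', 'UTF-8//TRANSLIT//IGNORE', $sjisBytes);
    if ($converted === false) {
        // Fall back to mbstring if iconv cannot handle the input.
        $converted = mb_convert_encoding($sjisBytes, 'UTF-8', 'SJIS');
    }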