DOMDocument::loadHTML

Description

The function parses the HTML contained in the string source.
Unlike loading XML, HTML does not have to be well-formed to load. In
PHP versions before 8.0 this function could also be called statically
to load and create a DOMDocument object, which was useful when no
DOMDocument properties needed to be set prior to loading. Static
invocation was deprecated and throws an Error as of PHP 8.0.

DOMDocument is very good at dealing with imperfect markup, but it emits a warning for nearly every problem it finds along the way.

This isn't well documented here. The solution is to set up a separate apparatus for handling just these errors.

Call libxml_use_internal_errors(true) before calling loadHTML. This prevents the errors from bubbling up to your default error handler, and you can then retrieve them (if you wish) with the other libxml error functions, such as libxml_get_errors().
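A minimal sketch of that approach, assuming a made-up malformed snippet in $html. Whether a given construct is reported at all depends on the libxml2 version, so the list of errors may be empty.

```php
<?php
// Hypothetical input: an unclosed <p> and an unknown tag.
$html = '<div><p>Unclosed paragraph<foo>bar</foo></div>';

// Buffer libxml errors instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the earlier error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the change local, which matters when other code in the same request relies on the default behaviour.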

Pay attention when loading HTML that uses a charset other than ISO-8859-1. This method does not actively try to work out how the HTML you are loading is encoded (as most browsers do), so you have to declare the encoding in the HTML head. If, for instance, your HTML is in UTF-8, make sure the head section contains a meta tag such as <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
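A minimal sketch of loading a page that carries the HTML 4 style http-equiv declaration. The sample page is made up, and the script itself is assumed to be saved as UTF-8.

```php
<?php
// The meta tag declares the encoding, so the parser does not
// fall back to treating the bytes as ISO-8859-1.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><p>Cạnh tranh</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// DOM strings are UTF-8, so the text comes back unchanged.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```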

When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, instead of "Cạnh tranh" you get "Cáº¡nh tranh". I suggest running the page through mb_convert_encoding before loading it:
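<?php
// $htmlUTF8Page is assumed to hold the UTF-8 encoded page source.
$pageDom = new DOMDocument();
// Turn non-ASCII characters into entities so the parser cannot misread them.
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', 'UTF-8');
@$pageDom->loadHTML($searchPage); // @ hides the parser's markup warnings
?>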

Be aware that this function doesn't actually understand HTML -- it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.

For example, with input like this where the first element isn't closed:
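One way to try this out is sketched below with a made-up snippet in which a <span> is never closed and is followed by a <div>. The input is an assumption chosen for illustration, and the exact repaired markup can differ between libxml2 versions, so no output is shown.

```php
<?php
// Hypothetical tag soup: the <span> is never closed.
$html = '<span>hello <div>world</div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Serialize only the <body> that the parser wrapped around the fragment.
$body = $dom->getElementsByTagName('body')->item(0);
$repaired = $dom->saveHTML($body);
echo $repaired, "\n";
```

The serialized result has matching open and close tags throughout, but nothing checks whether a block element is allowed inside an inline one.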

Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to HTML 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overridden.

I assumed it was possible to set it and then load HTML with the doctype I had defined (in order to decide the doctype at runtime), and I ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
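The effect can be reproduced with a short sketch. The HTML 4.01 Strict identifiers and the sample markup are assumptions picked for the demonstration.

```php
<?php
// Build a document with an explicit HTML 4.01 Strict doctype.
$impl = new DOMImplementation();
$doctype = $impl->createDocumentType(
    'html',
    '-//W3C//DTD HTML 4.01//EN',
    'http://www.w3.org/TR/html4/strict.dtd'
);
$dom = $impl->createDocument(null, 'html', $doctype);
$before = $dom->doctype->publicId;

// Loading markup replaces the document contents, doctype included.
// The input has no doctype of its own, so the parser supplies its default.
$dom->loadHTML('<html><body><p>Hello</p></body></html>');
$after = $dom->doctype ? $dom->doctype->publicId : null;

echo "before: $before\n";
echo "after:  ", var_export($after, true), "\n";
```

If the doctype has to be chosen at runtime, one option is to prepend the desired <!DOCTYPE ...> declaration to the HTML string before loading it, since the parser keeps a doctype that is present in the input.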

Here is a function I wrote to capitalize on the previous remarks about charset problems (UTF-8...) when using loadHTML and then the DOM functions. It adds the charset meta tag just after <head> to improve automatic encoding detection and converts any special character to an HTML entity, so the PHP DOM functions and attributes return correct values.
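The sketch below is not the original author's function; it is a hypothetical helper (the name loadUtf8Html is made up) that covers only the meta-tag step. It skips the entity conversion, since the 'HTML-ENTITIES' encoding of mb_convert_encoding is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: declares UTF-8 right after the opening <head> tag
 * (or at the very start when there is no <head>) and loads the markup.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // Insert the declaration after the first opening <head ...> tag, if any.
    $patched = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count === 0) {
        // No <head> found: put the declaration in front of the markup.
        $patched = $meta . $html;
    }

    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $dom = new DOMDocument();
    $dom->loadHTML($patched);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Usage with a made-up UTF-8 page (the script must be saved as UTF-8).
$dom = loadUtf8Html(
    '<html><head><title>Tin tức</title></head><body><p>Cạnh tranh</p></body></html>'
);
$title = $dom->getElementsByTagName('title')->item(0)->textContent;
$text  = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $title, ' / ', $text, "\n";
```

Placing the declaration first inside <head> means the parser sees it before any non-ASCII text, such as the title.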

your HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities. However, the HTML4-like version, <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, will work (as was pointed out 10 years ago by "bigtree at 29a").