Parsing HTML in PHP using Native Classes

If you have ever dealt with DOM (Document Object Model) manipulation on the front end, you know the basics of parsing HTML using JavaScript. All of this usually happens on the client side. But what if we want to process HTML data on the server? The use cases are broader than just DOM manipulation. In this post, let us look at some of the useful PHP classes that enable us to process HTML on a server.

What is Parsing & What are its Uses?

Parsing (in this case) is the process of extracting or modifying useful information from an HTML or XML string. A parser gives us easy ways to query raw data instead of using regex.

Suppose you want to get all the links on a web page. PHP DOM parsing classes can help you.

The Table of Contents you see above is another simple application of PHP DOM parsing classes. The plugin behind it extracts all the headings from the page, sorts them, creates a new element, and inserts it back into the page content.

Important DOM classes in PHP

DOMDocument, Nodes & Elements

DOMDocument is the main class: it takes in HTML and gives us an object to interact with. It can load HTML or XML from a string or a file. The class defines several methods, such as getElementById(), which resemble their JavaScript counterparts.
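As a minimal sketch of loading markup into a DOMDocument, the snippet below parses a small string and looks up one element. The markup, the `title` ID, and the commented-out file path are assumptions made up for illustration.

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$html = '<html><body><h1 id="title">Hello</h1><p>First paragraph</p></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);               // parse from a string
// $dom->loadHTMLFile('page.html');  // or parse from a file (hypothetical path)

libxml_clear_errors();

// Look up an element by its id attribute, much like in the browser.
$title = $dom->getElementById('title');
echo $title->textContent, PHP_EOL;
```

Calling libxml_use_internal_errors(true) is a design choice rather than a requirement: it keeps warnings about imperfect markup out of the output.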

The DOM made from HTML is a tree-like structure made up of individual nodes. These nodes can be of any type, say an element, text, comment, attribute etc. DOMNode is the base class from which all types of node classes inherit.

Elements

The DOMElement class extends DOMNode and represents the elements in your HTML markup. A DOMElement object can be any element, such as an image, div, span, or table.
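To make the node hierarchy a little more concrete, here is a small sketch that walks the children of one element and reports what kind of node each child is. The markup and the `box` ID are hypothetical.

```php
<?php
// Hypothetical fragment containing a text node, a comment and an element.
$html = '<div id="box">Some text<!-- a comment --><span>inner</span></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$box = $dom->getElementById('box');

// Every child is a DOMNode; the concrete subclass tells us its type.
$kinds = [];
foreach ($box->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $kinds[] = 'element:' . $node->tagName;
    } elseif ($node instanceof DOMText) {
        $kinds[] = 'text';
    } elseif ($node instanceof DOMComment) {
        $kinds[] = 'comment';
    }
}

print_r($kinds);
```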

Practical Examples

Without going deeper into theory, let us dive into some practical examples. First of all, we need some HTML data. For that, let us use one of the posts on this blog about image optimization.

Selecting by ID

If you look at our sample page, you can see that it contains two tables. Suppose I want to find the number of rows in the first table. Using Chrome DevTools, I found that the required table has the ID tablepress-3 (#tablepress-3 as a CSS selector).
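A sketch of the row count could look like this. The inline markup is a hypothetical stand-in for the sample page (only the ID tablepress-3 comes from it); in practice you would load the real page instead.

```php
<?php
// Hypothetical stand-in for the sample page: two tables, the first has three rows.
$html = <<<HTML
<table id="tablepress-3">
  <tr><th>Format</th><th>Size</th></tr>
  <tr><td>JPEG</td><td>120 KB</td></tr>
  <tr><td>WebP</td><td>80 KB</td></tr>
</table>
<table id="tablepress-4">
  <tr><td>Other</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementById() returns null when no element has the given ID.
$table = $dom->getElementById('tablepress-3');

// Count only the rows inside the selected table, not the whole document.
$rowCount = $table === null ? 0 : $table->getElementsByTagName('tr')->length;

echo "Rows: $rowCount", PHP_EOL;
```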

Selecting a Tag by Its Name

Both the DOMDocument and DOMElement classes have the method getElementsByTagName() which allows us to select elements using the name of the tag. For example, if we have to get all the h2 headings from a page, we can use this function.
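For instance, collecting the text of every h2 heading might be sketched as follows, using made-up markup in place of a real page.

```php
<?php
// Hypothetical article markup.
$html = '<h1>Post title</h1>
<h2>Why optimize images?</h2><p>Some text</p>
<h2> Choosing a format </h2><p>More text</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns a DOMNodeList that can be iterated with foreach.
$headings = [];
foreach ($dom->getElementsByTagName('h2') as $h2) {
    $headings[] = trim($h2->textContent);
}

print_r($headings);
```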

Find elements with a particular class

In JavaScript, the querySelectorAll() method makes it easy to select elements using a CSS selector. In PHP, it is not that straightforward. Instead, the DOMXPath class helps us query and traverse the DOM tree.

Just like getElementsByTagName(), the query() method of DOMXPath returns a DOMNodeList. It takes an XPath expression as its argument. XPath expressions are versatile enough to express almost any kind of query.
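As a rough sketch, selecting elements by class with DOMXPath could look like this. The class name `note` and the markup are assumptions. The concat/normalize-space idiom is a common way to match a whole class token, so that `footnote` does not match `note`.

```php
<?php
// Hypothetical markup: two elements carry the class "note", one only looks similar.
$html = '<p class="note">A</p>
<p class="note important">B</p>
<p class="footnote">C</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the class attribute with spaces so we match " note " as a whole word.
$expression = "//*[contains(concat(' ', normalize-space(@class), ' '), ' note ')]";
$nodes = $xpath->query($expression);

$texts = [];
foreach ($nodes as $node) {
    $texts[] = $node->textContent;
}

print_r($texts);
```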

If you are new to XPath, this cheatsheet from Devhints.io contains a long list of CSS and JS selectors and their corresponding XPath expressions. It will help you find the right expression for the query you want to perform.

Extract links from a page

Parsing opens up a number of opportunities. Extracting the links from a web page is one such use. That is how crawlers crawl the World Wide Web.

Suppose I want to find all the links on a web page that point to a particular external website. In our sample page, I want to find all the outbound links from the blog post to wordpress.org.
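One way to sketch this is to loop over every anchor and compare the host of its href with the target domain. The markup below is hypothetical, and treating subdomains of wordpress.org as matches is an assumption you may want to change.

```php
<?php
// Hypothetical post content with a mix of internal and external links.
$html = '<p>
<a href="https://wordpress.org/plugins/">Plugins</a>
<a href="/about/">About</a>
<a href="https://developer.wordpress.org/">Docs</a>
<a href="https://example.com/?ref=wordpress.org">Elsewhere</a>
</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');

    // Relative URLs have no host, so parse_url() does not return a string for them.
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        continue;
    }

    // Match wordpress.org itself and any subdomain (assumption: subdomains count).
    if (preg_match('/(^|\.)wordpress\.org$/i', $host) === 1) {
        $links[] = $href;
    }
}

print_r($links);
```

Comparing the parsed host, rather than searching the whole URL for the domain name, avoids false positives such as the last link, where wordpress.org only appears in the query string.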

Here we have performed an XPath query to find all the elements with the class del. Then we remove each node from the document by iterating over the DOMNodeList object using a foreach loop.
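The removal described above might look like the following sketch. The markup is a hypothetical stand-in chosen to match the output shown next. Because the list returned by an XPath query is a static snapshot, removing nodes while iterating over it is safe here.

```php
<?php
// Hypothetical markup; paragraphs marked with the class "del" should be removed.
$html = '<div>
<p>This is our first paragraph</p>
<p class="del">Delete this</p>
<p>This is our second paragraph</p>
<p>This is our third paragraph</p>
<p class="del">Delete this too</p>
</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$toDelete = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' del ')]");

// A node is removed through its parent.
foreach ($toDelete as $node) {
    $node->parentNode->removeChild($node);
}

// Collect what is left to confirm the result.
$remaining = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $remaining[] = $p->textContent;
}

print_r($remaining);
```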

This is our first paragraph
Delete this
This is our second paragraph
This is our third paragraph
Delete this too
-------after deletion--------
This is our first paragraph
This is our second paragraph
This is our third paragraph

Manipulating Attributes

Classes and IDs are not the only attributes we can access in the PHP DOM. The DOMElement class has several methods to get, set, or remove attributes on an element. They look similar to their JavaScript counterparts, so you will find them easy to understand.

getAttribute($attribute_name) – get the value of an attribute

setAttribute($attribute_name, $attribute_value) – set the value of an attribute

hasAttribute($attribute_name) – checks whether an element has a certain attribute and returns true or false
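A short sketch of these methods in use, together with removeAttribute(), which DOMElement also provides. The link markup and attribute values are made up for illustration.

```php
<?php
// Hypothetical link element to experiment with.
$html = '<a id="link" href="http://example.com" target="_blank">Example</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$link = $dom->getElementById('link');

// Read the current value.
$oldHref = $link->getAttribute('href');

// Overwrite an existing attribute.
$link->setAttribute('href', 'https://example.com');

// Add an attribute only if it is not there yet.
if (!$link->hasAttribute('rel')) {
    $link->setAttribute('rel', 'noopener');
}

// Remove an attribute entirely.
$link->removeAttribute('target');

// Serialize just this element to inspect the result.
echo $dom->saveHTML($link), PHP_EOL;
```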

Conclusion

So far, we have looked at some of the important DOM APIs in PHP. I hope this helps you get started with parsing HTML and XML data. If anything is unclear, feel free to ask in the comments.