
Parsing Web Document Nodes with the Tidy Library in PHP 5

Writing well-formatted (X)HTML code to include in the presentation layers of certain PHP applications can be an annoying and time-consuming process for many web developers. However, the Tidy extension that comes integrated with PHP 5 can turn this ugly task into a pleasant experience. Keep reading to learn how.

Introduction

Welcome to the second tutorial of the series that began with "Working with the Tidy Library in PHP 5." Made up of three instructive articles, this series steps you through using the most important functions bundled with this powerful library, and complements the corresponding theory with illustrative hands-on examples.

If you've already read the first installment of the series, then the Tidy extension is probably familiar to you, since its remarkable capacity for parsing and formatting (X)HTML markup comes with a very gentle learning curve. True to form, Tidy is equipped with a decent arsenal of functions (or methods and properties, if you're using the object-based syntax) that allow you to correct the format of any web document in a few simple steps.

And speaking of performing simple tasks, certainly you’ll recall that in the first article of the series I discussed how to parse and format several basic (X)HTML documents, by using some straightforward functions bundled with this library, such as "tidy_parse_file()," "tidy_repair_file()" and "tidy_parse_string()."

As you learned in that tutorial, repairing badly-formatted web documents is actually an effortless process with the assistance of the Tidy extension. Thus, based upon the fact that Tidy has much more to offer when it comes to parsing and fixing (X)HTML code, in this second article of the series I’m going to discuss how to extract different sections of a specific (X)HTML document (called file nodes) by using the capabilities provided by some additional functions included with this library.

At the end of this tutorial you’ll be equipped with the required background to dissect the principal nodes of a concrete (X)HTML file with the help of some easy-to-follow Tidy functions.

So, are you ready to explore some more useful features integrated with the Tidy extension? Okay, let’s begin this journey now!

Before I continue discussing the other functions included with the Tidy extension, I’d like to review some important topics, such as the implementation of the functions covered in the preceding article of the series. Doing so should give you a much better idea of how these previous functions can be linked with the new ones that I plan to explain in a few moments.

Having said that, I've included below some illustrative examples of the hopefully familiar "tidy_parse_string()," "tidy_repair_string()" and "tidy_parse_file()" functions, in that order. All of them were explained in detail in the first tutorial of this series.

Here are the corresponding code samples. Take a look at them please:

// example of 'tidy_parse_string()' function

<?php
// capture the badly-formatted markup below in an output buffer
ob_start();
?>
<html>
 <head>
  <title>This file will be parsed by Tidy</title>
 </head>
 <body>
  <p>This is an erroneous line
  <p>This is another erroneous line</i>
 </body>
</html>
<?php
$fileContents = ob_get_clean();
// indent the output, emit XHTML and wrap lines at 200 characters
$params = array('indent' => TRUE, 'output-xhtml' => TRUE, 'wrap' => 200);
$tidy = tidy_parse_string($fileContents, $params, 'utf8');
$tidy->cleanRepair();
echo $tidy;

// example of 'tidy_repair_string()' function

<?php
ob_start();
?>
<html>
 <head>
  <title>This file will be parsed by Tidy</title>
 </head>
 <body>
  <p>This is an erroneous line
  <p>This is another erroneous line</i>
 </body>
</html>
<?php
$fileContents = ob_get_clean();
// repair the markup in one step and get it back as a string
$tidy = tidy_repair_string($fileContents);
echo $tidy;

// example of 'tidy_parse_file()' function

// contents of 'target_file.html'
<html>
 <head>
  <title>This file will be parsed by Tidy</title>
 </head>
 <body>
  <p>This is an erroneous line
  <p>This is another erroneous line</i>
 </body>
</html>

<?php
$tidy = tidy_parse_file('target_file.html');
$tidy->cleanRepair();
// report any warnings or errors that Tidy collected while parsing
if (!empty($tidy->errorBuffer)) {
    trigger_error('Some errors occurred when parsing target file: ' . $tidy->errorBuffer, E_USER_ERROR);
}

After analyzing the above hands-on examples, you'll recall how the "tidy_parse_string()," "tidy_repair_string()" and "tidy_parse_file()" functions can be used to fix and correctly format a web document. Of course, as I stated in the first tutorial of the series, these are simple demonstrations of how to use these functions, but I'm sure they'll give you a good starting point for developing more complex (X)HTML parsing applications.

All right, at this stage you have hopefully recalled how to implement, at least at a basic level, the three previous functions bundled with the Tidy extension. So, what's the next step? Well, as outlined in the introduction of this article, I plan to show you how to use a few additional functions integrated with Tidy, whose functionality is aimed mainly at extracting different nodes of a given (X)HTML string.

To see how these brand new functions can be put to work in a useful way, please keep reading.

Admittedly, breaking an (X)HTML string into different parts for further processing isn't a task that a web developer has to tackle very often. Regardless, the Tidy library has a respectable number of functions targeted precisely at dissecting a specific (X)HTML string into its main sections.

More specifically, Tidy offers two functions, called "tidy_get_html()" and "tidy_get_head()" respectively, which return the <html> and <head> portions of a parsed (X)HTML string as separate nodes.

But, let me get rid of these boring explanations and show you a couple of illustrative examples of how to use these new Tidy functions. Here are the corresponding code samples:
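
What follows is a minimal sketch of how these two calls might look, assuming the Tidy extension is enabled and reusing the malformed sample markup from the earlier listings. The variable names ($htmlNode, $headNode) and the configuration values are illustrative choices, not requirements of the library:

```php
<?php
// Assumption: the Tidy extension is enabled in this PHP build.
if (!extension_loaded('tidy')) {
    exit("The Tidy extension is not available.\n");
}

// Sample markup with deliberate errors (an unclosed <p> and a stray </i>).
$html = '<html><head><title>This file will be parsed by Tidy</title></head>'
      . '<body><p>This is an erroneous line<p>This is another erroneous line</i>'
      . '</body></html>';

$tidy = tidy_parse_string($html, array('output-xhtml' => true, 'wrap' => 200), 'utf8');
$tidy->cleanRepair();

// example of 'tidy_get_html()' function
$htmlNode = tidy_get_html($tidy);   // node that starts at the <html> tag
echo $htmlNode->value;

// example of 'tidy_get_head()' function
$headNode = tidy_get_head($tidy);   // node that starts at the <head> tag
echo $headNode->value;
```

Both functions return a node object rather than a plain string, which is why the markup is read through the "value" property before being echoed.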

True to form, that’s all the source code required to test the previous "tidy_get_html()" and "tidy_get_head()" functions. As you can see, the functions in question are indeed very easy to follow, since they demonstrate in a simple fashion how the different sections of a specific (X)HTML string can be extracted separately.

Of course, as you might have guessed, the implementation of the first hands-on example is rather useless, simply because the "tidy_get_html()" function returns the whole (X)HTML string as a new node, which is directly displayed on the browser via its "value" property. However, it’s worthwhile to mention that the second case is slightly more useful, since it first extracts the <head> part of a sample (X)HTML string, and then displays its contents by utilizing the aforementioned "value" property.
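
Besides "value," a node returned by Tidy exposes a few other members that help when dissecting a document. The sketch below assumes the Tidy extension is enabled and uses the "name" and "child" properties and the "hasChildren()" method of the tidyNode class, none of which are covered elsewhere in this article; the variable names are purely illustrative:

```php
<?php
// Assumption: the Tidy extension is enabled in this PHP build.
if (!extension_loaded('tidy')) {
    exit("The Tidy extension is not available.\n");
}

$html = '<html><head><title>This file will be parsed by Tidy</title></head>'
      . '<body><p>This is an erroneous line<p>This is another erroneous line</i>'
      . '</body></html>';

$tidy = tidy_parse_string($html, array('output-xhtml' => true), 'utf8');
$tidy->cleanRepair();

$head = tidy_get_head($tidy);
echo 'Node name: ' . $head->name . "\n";

// Collect the element names of the direct children of <head>.
$childNames = array();
if ($head->hasChildren()) {
    foreach ($head->child as $child) {
        $childNames[] = $child->name;
        echo ' child: ' . $child->name . "\n";
    }
}
```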

So far, so good, right? At this point I'm pretty certain that you've grasped the logic behind dissecting an (X)HTML string into different parts for further processing. As you learned from the pair of practical examples discussed above, the process comes down to calling the appropriate Tidy function, extracting the selected part of a given (X)HTML string, and finally displaying its contents in the browser.

However, the Tidy extension still has a couple of extra functions that can be useful when it comes to breaking an (X)HTML string into several sections. Since these functions might be interesting to you, in the following section I'm going to show you how to use them to extract the <body> part of a given (X)HTML string, and to retrieve the whole parsed and repaired string in one piece.

To learn how these tasks can be performed with the Tidy library, please jump ahead and read the following lines. I’ll be there, waiting for you.

In keeping with the concepts covered in the section that you just read, the last two functions included with the Tidy library that I plan to teach you in this tutorial are "tidy_get_body()" and "tidy_get_output()." As you may guess, the first function comes in handy for extracting the <body> section of an (X)HTML string as a node, while the second one simply returns the whole parsed document as a single string.

Now that I have explained how these brand new Tidy functions work, please take a look at the following code samples, which demonstrate their rather limited functionality:
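
For the first of these functions, a minimal sketch could look like this, assuming the Tidy extension is enabled and using the same sample markup as the "tidy_get_output()" listing; the variable name $body is just an illustrative choice:

```php
<?php
// Assumption: the Tidy extension is enabled in this PHP build.
if (!extension_loaded('tidy')) {
    exit("The Tidy extension is not available.\n");
}

// example of 'tidy_get_body()' function
$html = '<html><head><title>This file will be parsed by Tidy</title></head>'
      . '<body><p>This is an erroneous line</i>This is another erroneous line</i>'
      . '</body></html>';

$tidy = tidy_parse_string($html);
$tidy->cleanRepair();

$body = tidy_get_body($tidy);   // node that starts at the <body> tag
echo $body->value;
```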

// example of 'tidy_get_output()' function

<?php
$html = '<html><head><title>This file will be parsed by Tidy</title></head><body><p>This is an erroneous line</i>This is another erroneous line</i></body></html>';
$tidy = tidy_parse_string($html);
$tidy->cleanRepair();
// retrieve the repaired markup as a string and display it
echo tidy_get_output($tidy);

As you can see, the source code corresponding to the above examples is very easy to follow. In the first case, the "tidy_get_body()" function is used to retrieve the <body> part of a sample (X)HTML string, a procedure that doesn't need much discussion.

With reference to the second code listing, it simply demonstrates how to correct the format of the sample string via the already familiar "cleanRepair()" Tidy method, and then display the respective contents on the browser, in this case using the "tidy_get_output()" function. Quite simple, right?
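
Because "tidy_get_output()" hands back an ordinary string, the result can be processed with any regular PHP string function. As a small sketch, assuming the Tidy extension is enabled, the repaired markup is escaped below with "htmlspecialchars()" so that the browser shows the source instead of rendering it; the <pre> wrapper and the variable names are illustrative choices:

```php
<?php
// Assumption: the Tidy extension is enabled in this PHP build.
if (!extension_loaded('tidy')) {
    exit("The Tidy extension is not available.\n");
}

$html = '<html><head><title>This file will be parsed by Tidy</title></head>'
      . '<body><p>This is an erroneous line</i>This is another erroneous line</i>'
      . '</body></html>';

$tidy = tidy_parse_string($html, array('wrap' => 200), 'utf8');
$tidy->cleanRepair();

// The repaired document as a plain PHP string.
$output = tidy_get_output($tidy);

// Escape the markup so the browser displays it as text.
$escaped = htmlspecialchars($output);
echo '<pre>' . $escaped . '</pre>';
```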

Finally, as usual with many of my articles on PHP-based web development, feel free to modify the source code of all the examples shown here, if you want to continue exploring how to handle these useful Tidy functions.

Final thoughts

This second article of the series was entirely aimed at demonstrating how to use some simple functions bundled with the Tidy library to extract the different parts of a specified (X)HTML string.

Nevertheless, this story is not yet finished, since in the last tutorial of the series I'm going to show you how to use Tidy's remarkable capabilities to keep track of any errors that occur when parsing a web document. You won't want to miss it!