I want to be able to query the voting status, who is ranked what in each category and so on. But I don't want to type in votes in a table, especially not when they change all the time. Instead I'd like to query the live data in the HTML webpages. Can that be done? Yes it can...

But unfortunately there is a slight caveat. It appears that when I make these HTTP calls from the database to the voting webpages on http://community.oracle.com , something often makes the URLs return "An unexpected error occurred" rather than the HTML I want. I suspect (but am not sure) that it may be because these HTTP calls do not log in, or because they fail to accept a cookie-consent popup, or because they look like a robot crawler and get blocked.

Whatever the reason, the code I show in this blog post works only sporadically on the Oracle Community website. But I will show it anyway, as the technique can be very useful for other websites and URLs too. And then in a Part 2 blog post I will show a workaround that makes this work anyway, with just a little bit of manual work involved.

So here's a recipe for querying the HTML of a live webpage. Just beware that the particular website I use as an example is not the best of examples ;-)

In schema SCOTT I then do my actual work. First I create an object type, a collection type and a table. The object and collection types enable me to create a table function that I will use later; the table can then be used to cache the results of the table function to avoid excessive HTTP calls.
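A minimal sketch of those three objects could look like this. All identifiers here are hypothetical (the actual post's names and column list may differ), and I am assuming the 4 pieces of information per nominee are a name plus 3 vote counts:

```sql
-- Hypothetical names and columns; adapt to the data you actually scrape.
create type vote_rec_t as object (
   category  varchar2(30),
   nominee   varchar2(100),
   score     number,
   upvotes   number,
   downvotes number
);
/
create type vote_tab_t as table of vote_rec_t;
/
-- Cache table to avoid repeating the HTTP call for every query.
create table vote_cache (
   category  varchar2(30),
   nominee   varchar2(100),
   score     number,
   upvotes   number,
   downvotes number,
   cached_at date default sysdate
);
```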

Then comes a helper function to execute the HTTP call that retrieves the HTML from the 5 different category webpages. As I need HTTPS (SSL) support, I can't use my favorite method, HttpUriType. Instead I could use UTL_HTTP, but in this case I chose APEX_WEB_SERVICE.MAKE_REST_REQUEST for simplicity. Even though it is meant for REST requests, it can retrieve any webpage and simply return it as a CLOB. Had I used UTL_HTTP, I would have had to read the HTML in a loop and build the CLOB myself.
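A sketch of such a helper could be as simple as this (the function name is my own; note that HTTPS also requires an Oracle wallet with the site's certificate to be configured):

```sql
create or replace function get_html (p_url in varchar2)
   return clob
is
begin
   -- APEX_WEB_SERVICE performs the HTTPS call and returns the entire
   -- response body as one CLOB - no manual read loop as with UTL_HTTP.
   return apex_web_service.make_rest_request(
             p_url         => p_url
           , p_http_method => 'GET'
          );
end get_html;
/
```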

The second helper function transforms the HTML into XML. A really well-formed HTML document is actually very close to being XML, but there are a few things I have to do here to make it work as XML.

First I cut out the "middle" (and relevant) part of the original HTML document; that gets rid of a lot of Javascript and other things that would make it invalid XML.

Second, as the cut-out part then contains a <div> element for each nominee, I need to wrap it in a root <doc> element (I could name it anything, but doc seems appropriate.)

Third, the HTML contains a lot of <br> tags, which traditionally in HTML is a standalone line break, but in XML would be an opening <br> tag missing its closing tag. Newer HTML standards also allow <br/>, which works in XML too, so I do a replace here.

And last, there are some HTML entities that the database XML parser does not recognize as valid entities; as they are not part of the text I am interested in, I simply replace them with spaces to get rid of them.

Note also that if the HTML parameter does not contain the specific <div> and <span> tags I am looking for, I know my XML parsing will fail, so I raise an explicit error saying that the HTML is not in the expected format.
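Putting those four steps together, the transformation could be sketched like this. The start/end markers are placeholders (the real code looks for the specific <div> and <span> of the voting page), and the entity list is just an example:

```sql
create or replace function html_to_xml (p_html in clob)
   return xmltype
is
   l_from pls_integer;
   l_to   pls_integer;
   l_clob clob;
begin
   -- Step 1: locate the relevant "middle" part of the page.
   -- These marker strings are hypothetical placeholders.
   l_from := dbms_lob.instr(p_html, '<div class="nominee-list">');
   l_to   := dbms_lob.instr(p_html, '<div class="page-footer">');
   if l_from = 0 or l_to = 0 then
      -- The page is not in the expected format - fail explicitly.
      raise_application_error(-20001, 'Not expected HTML content');
   end if;
   l_clob := substr(p_html, l_from, l_to - l_from);
   -- Step 3: self-close <br> so it becomes valid XML.
   l_clob := replace(l_clob, '<br>', '<br/>');
   -- Step 4: blank out entities the XML parser does not know (example).
   l_clob := replace(l_clob, '&nbsp;', ' ');
   -- Step 2: wrap everything in a single root element.
   return xmltype('<doc>' || l_clob || '</doc>');
end html_to_xml;
/
```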

My table function starts by retrieving the HTML and transforming it to XML via the two helper functions, and then it extracts the information I want using the XMLTABLE function. I use '/doc/*' to retrieve all the <div> elements that are nominees. For each <div> I specify an XPath expression for the "position" in the XML of each of the 4 pieces of information I desire. But those 4 pieces contain superfluous text and whitespace, so I trim the name, and for the 3 numeric expressions I use regexp_replace to remove anything that is not a numeric digit. Each nominee retrieved from the XML is then piped out, so the results can be selected using the table function.
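As a sketch, the pipelined table function could look like this. The URL fragment, the span positions and all names are hypothetical; it builds on the helper functions and types sketched above:

```sql
create or replace function get_nominees (p_category in varchar2)
   return vote_tab_t pipelined
is
   l_xml xmltype;
begin
   -- Retrieve and transform via the two helpers (URL is a placeholder).
   l_xml := html_to_xml(get_html(
               'https://community.oracle.com/voting/' || p_category));
   for r in (
      select trim(x.nominee) as nominee
           , to_number(regexp_replace(x.score    , '[^0-9]', '')) as score
           , to_number(regexp_replace(x.upvotes  , '[^0-9]', '')) as upvotes
           , to_number(regexp_replace(x.downvotes, '[^0-9]', '')) as downvotes
        from xmltable(
                '/doc/*' passing l_xml     -- one row per nominee <div>
                columns
                   nominee   varchar2(100) path 'span[1]'  -- hypothetical positions
                 , score     varchar2(20)  path 'span[2]'
                 , upvotes   varchar2(20)  path 'span[3]'
                 , downvotes varchar2(20)  path 'span[4]'
             ) x
   ) loop
      pipe row (vote_rec_t(p_category, r.nominee,
                           r.score, r.upvotes, r.downvotes));
   end loop;
   return;
end get_nominees;
/
```

It can then be queried like any table, for example: select * from table(get_nominees('sql')).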

Calling the table function above involves an HTTP call. To avoid doing that over and over, I finally have this small procedure for populating my cache table. It simply clears the cache for the invoked category and inserts rows from the table function into the cache.
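A sketch of that refresh procedure, using the hypothetical names from above (transaction control is left to the caller):

```sql
create or replace procedure refresh_vote_cache (p_category in varchar2)
is
begin
   -- Clear the cache for this category only.
   delete from vote_cache
    where category = p_category;
   -- Repopulate it with one fresh HTTP call via the table function.
   insert into vote_cache (category, nominee, score, upvotes, downvotes)
      select category, nominee, score, upvotes, downvotes
        from table(get_nominees(p_category));
end refresh_vote_cache;
/
```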

So, this is a nice way to query live webpages, or to perform "scraping" as it is sometimes called.

Unfortunately, some webpages are not well suited to being called this way. The page might include something interactive like popups that need to be accepted, or it might be dynamically built with Ajax so the first call only retrieves an empty "shell" of HTML plus some Javascript that populates the shell. In such cases it can be hard to write scraping code like this. And the Oracle Community webpages actually do something I haven't yet identified, so they often return "An unexpected error occurred" and my code raises the 'Not expected HTML content' exception.

So what to do in those cases where a webpage is hard to get without actually using a browser? Well, one option is to use a browser manually, save the output as raw HTML from the browser, and then let the "scraping" code read the HTML from file instead of using HTTP. That is the subject of Part 2.

Was this useful to you? If so, can I ask you to consider going and voting for me in the Oracle Database Developer Choice Awards SQL category? I would be eternally grateful :-)

Interestingly, sometimes the trailing slash doesn't work. Strangely though, removing it then works for a while, and when that fails, putting the slash back on fixes it again. Maybe the problem is some kind of caching, and changing the URL, even trivially, resets it.