Parsing a page (HTML) using PHP, HOW?

Take a look at this page; it's a clothing shop for kids. This is one of their items, and I want to point out the size section. What we need to do here is get all the sizes for this item and check whether each size is available or not. Right now all the sizes for this item are:

What you're describing is called "scraping" a web site. Each page you need to scrape is typically a custom application, and it's a rather brittle technology, since a change in the foreign web site can cause your application to break without notice.

A better approach is to ask the owner of the web site for a REST API. You send the item identifier to the API; the API responds with an XML string giving the item identifier and the availability.
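As a rough illustration of what consuming such an API might look like, here is a minimal sketch. The XML layout (the `item` and `size` elements and their attributes) and the function name `parseAvailability` are invented for the example. A real API would define its own format, and the response would come from an HTTP request rather than a literal string.

```php
<?php
// Hypothetical response body; element and attribute names are assumptions.
$xml = <<<XML
<item id="803846">
  <size name="3-5 Years" available="true"/>
  <size name="6-8 Years" available="false"/>
</item>
XML;

// Turn the hypothetical XML into an array of size name => availability (bool)
function parseAvailability($xml) {
    $item = simplexml_load_string($xml);
    if ($item === false) {
        throw new Exception("Response is not well-formed XML");
    }
    $sizes = array();
    foreach ($item->size as $size) {
        // Attributes come back as SimpleXMLElement objects, so cast to string
        $sizes[(string)$size['name']] = ((string)$size['available'] === 'true');
    }
    return $sizes;
}

$availability = parseAvailability($xml);
print_r($availability);
```

The point of the sketch is that the parsing code depends only on a documented response format, not on the layout of a page that can change without notice.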

If the owner of the web site wants to give you automated access to this information, the API is a very easy solution. If you can make the case that they will increase sales even a little bit, it will make a good business case for you.

As you can see, this is the size selection DIV. The information that populates this DIV apparently comes from an AJAX interaction with a backend script. The words "Not Available" are not in the HTML, but they are clearly rendered on the screen when you open the SELECT form control, so they are put into the DOM some other way. It's things like this that argue for an API instead of a screen scraper. As more AJAX interactivity is put into web sites, scrapers become less useful. One could even imagine a vendor who wanted to prevent scraping deliberately designing an AJAX application to make it harder to leech the content from the site.

Perhaps one of the other Experts can help you parse the DOM, but it might be faster, easier and more dependable to ask for the API.


This code should do the job. The product variations are in the HTML, but they are declared in a JavaScript variable near the top of the page and then pulled in via JavaScript on page load. This script extracts the declarations and converts them to a PHP array containing an associative array for each variation, including whether it is available as a boolean. There are a few columns in the data whose purpose I have not been able to work out; these are labeled unknown_col[1-6].

That said, I fully agree with the previous comments: this is not an efficient way of working, and an API would be MUCH better.

<?php
function getProductVariations($url) {
    //Use CURL to get the raw HTML for the page
    $ch = curl_init();
    curl_setopt_array($ch,
        array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_URL => $url
        )
    );
    $raw_html = curl_exec($ch);
    //If we get an invalid response back from the server, fail
    //(grab the error text first so the handle is always closed)
    if ($raw_html === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception($error);
    }
    curl_close($ch);
    //Find the variation JS declarations and extract them
    $raw_variations = preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/', $raw_html, $raw_matches);
    //We are done with the raw HTML now
    unset($raw_html);
    //Check that we got some results back
    if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1]) == $raw_variations && $raw_variations > 0) {
        //This is where the matches will go
        $matches = array();
        //Go through the results of the bracketed expression and convert them to a PHP assoc array
        foreach ($raw_matches[1] as $match) {
            //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
            $proc = json_decode("[$match]");
            //json_decode returns null if the declaration is not valid JSON (e.g. single-quoted strings)
            if (!is_array($proc) || count($proc) < 12) {
                throw new Exception("Unable to decode product variation: $match");
            }
            //Label the fields as best we can
            $proc2 = array(
                "variation_id" => $proc[0],
                "size_desc" => $proc[1],
                "colour_desc" => $proc[2],
                //Accept either a real boolean or the string "True"/"true"
                "available" => ($proc[3] === true || strtolower(trim((string)$proc[3])) == "true"),
                "unknown_col1" => $proc[4],
                "price" => $proc[5],
                "unknown_col2" => $proc[6], /*Always seems to be zero*/
                "currency" => $proc[7],
                "unknown_col3" => $proc[8],
                "unknown_col4" => $proc[9], /*Negative price*/
                "unknown_col5" => $proc[10], /*Always seems to be zero*/
                "unknown_col6" => $proc[11] /*Always seems to be zero*/
            );
            //Push the processed variation onto the results array, keyed by variation id
            $matches[$proc[0]] = $proc2;
        }
        //Return the matches we have found
        return $matches;
    } else {
        throw new Exception("Unable to find any product variations");
    }
}
//EXAMPLE USAGE
try {
    $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
    //Do something more useful here
    print_r($variations);
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>

What is the problem with that one? It gives you all the variations for both colours (see the colour_desc value in the array). In that example you get some entries for Steel68Years and some for Steel35Years, referring to the Steel colour for 6-8Yrs and 3-5Yrs respectively.
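To get from that array to "which sizes are available", something like the following could work. It is only a sketch: the helper name `availableSizesByColour` is made up here, and the sample data is invented to mirror the shape of the array returned by `getProductVariations()` (only the keys the helper uses are included).

```php
<?php
// Hypothetical helper: list the in-stock sizes, grouped by colour_desc.
function availableSizesByColour(array $variations) {
    $result = array();
    foreach ($variations as $variation) {
        // Skip anything not flagged as available
        if (!empty($variation['available'])) {
            $result[$variation['colour_desc']][] = $variation['size_desc'];
        }
    }
    return $result;
}

// Invented sample data in the same shape as the getProductVariations() output
$sample = array(
    101 => array('variation_id' => 101, 'size_desc' => '3 Years', 'colour_desc' => 'Steel35Years', 'available' => true),
    102 => array('variation_id' => 102, 'size_desc' => '4 Years', 'colour_desc' => 'Steel35Years', 'available' => false),
    103 => array('variation_id' => 103, 'size_desc' => '5 Years', 'colour_desc' => 'Steel35Years', 'available' => true),
    104 => array('variation_id' => 104, 'size_desc' => '6 Years', 'colour_desc' => 'Steel68Years', 'available' => true),
);

$in_stock = availableSizesByColour($sample);
print_r($in_stock);
```

A colour with no sizes in stock simply does not appear in the result, so the caller can treat a missing key as "nothing available".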

I have attached a revised version which will give you the Colour code & description rather than just the code.
