I tried to retrieve the contents of a div from an external site with PHP and XPath. Below is the story, and, after the PHP explanations, my very first steps toward a Perl approach to this problem.

What happened: as it is sometimes a bit tricky, I made several attempts and used various approaches in PHP; now I want to try out Perl.

Here is the PHP story, as it went:

This is an excerpt from the page, showing the relevant code. Note: I tried to add everything, including @ on the class and an a at the end of my query. After that, I use saveHTML() to get the result. See my test:

Goal: I need the following data:

Version:
Last updated:
Active installations:
Tested up to:

See, for example, the page source: view-source:https://wordpress.or...wp-job-manager/

I want to have a little database that runs locally with that data for my favorite plugins, so I want to fetch the data automatically with a cron job. Well, after the PHP trials, I want to find out how to do this in Perl instead; I want to try it out in Perl.

By the way, this is my XPath: //*[@id="post-15991"]/div[4]/div[1] and this is the URL: https://wordpress.org/plugins/wp-job-manager/
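A minimal sketch of the Perl approach, using LWP::UserAgent to fetch and HTML::TreeBuilder::XPath to query. The id "post-15991" is taken from the question; such ids are page-specific and may change, so expect to adjust the query.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

my $url = 'https://wordpress.org/plugins/wp-job-manager/';
my $ua  = LWP::UserAgent->new( timeout => 15 );
my $res = $ua->get($url);
die 'Fetch failed: ' . $res->status_line unless $res->is_success;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse( $res->decoded_content );
$tree->eof;

# The id-based XPath from above; ids like post-15991 differ per
# plugin page, so this query must be adapted for each plugin.
my $meta = $tree->findvalue('//*[@id="post-15991"]/div[4]/div[1]');
print $meta, "\n";

$tree->delete;
```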

Code

PHP Warning: include(simple_html_dom): failed to open stream: No such file or directory in /home/martin/dev/php/p100.php on line 4
PHP Warning: include(): Failed opening 'simple_html_dom' for inclusion (include_path='.:/usr/share/php5:/usr/share/php5/PEAR') in /home/martin/dev/php/p100.php on line 4
PHP Fatal error: Call to undefined function file_get_html() in /home/martin/dev/php/p100.php on line 6
martin@linux-3645:~/dev/php>


Re: [dilbert] since php-parser attempts failed i need to get a perl-approach

Hi Dilbert,

You didn't keep us updated on the progress you had made on your Europa task; you were last attempting to implement paging. The code I provided, with a change to the conf, would apply here too. I have since written a comprehensive scraper module with iteration capabilities to supersede Web::Scraper, though it needs a bit of a rework.

I haven't written PHP in years now, so I couldn't tell you what's wrong with your code off the top of my head, but PHP is perfectly capable of this task.

I have never used XML::LibXML directly to parse HTML; I typically recommend HTML::TreeBuilder::XPath. It simplifies executing XPath queries, and it inherits from HTML::TreeBuilder, a feature-rich HTML parser, which in turn inherits from HTML::Element, a feature-rich HTML extractor/modifier. Together they form a very powerful HTML processing package that covers all you'd need and more.
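A small illustration of that layering, using made-up markup: the XPath query comes from HTML::TreeBuilder::XPath, while attr and as_text on the returned node come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up HTML, loosely modelled on a plugin-meta widget.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<ul><li class="version">Version: <strong>1.31.3</strong></li></ul>'
);

# XPath query from HTML::TreeBuilder::XPath ...
my ($li) = $tree->findnodes('//li[@class="version"]');

# ... HTML::Element accessors on the node it returns.
print $li->attr('class'), "\n";   # version
print $li->as_text, "\n";         # Version: 1.31.3
```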

If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different; if you don't understand the process behind one, you will have difficulty writing another.

If we have an array from which we load the URLs that need to be visited, we would cover all the pages. Note: we have more than 6000 results, and each page holds 21 small entries, each representing one record, so we have approx. 305 pages to visit.

Regarding the loop-process:

Well, Chris, after parsing each page we have to check for the existence of the next › link at the bottom of the page.

The procedure: when we arrive on page 292 and there seem to be no more pages left, we are done counting upward and can exit the loop with e.g. last.
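That exit condition can be sketched without the network like this; the page strings stand in for fetched HTML, and the a[@class="next"] selector is an assumption about the real markup.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Each string stands in for the HTML of one fetched results page.
my @pages = (
    '<div><a class="next" href="?page=1">next &rsaquo;</a></div>',
    '<div><a class="next" href="?page=2">next &rsaquo;</a></div>',
    '<div>last page, no link</div>',
);

my $visited = 0;
for my $html (@pages) {
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    $visited++;

    # ... extract the 21 records of this page here ...

    my @next = $tree->findnodes('//a[@class="next"]');
    $tree->delete;
    last unless @next;    # no "next >" link: we are done
}
print "visited $visited pages\n";   # visited 3 pages
```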

Dear Chris, I am trying to make further progress on the Europa task. It would be great if we can go ahead with the great parser approach that you have revealed. Many, many thanks for it; I am very glad.

I will keep you updated on the progress I make on the Europa task, where I am attempting to implement paging. The code you provided, with a change to the conf, would apply here too. Of course!

Dear Chris, the following sounds very, very good:

In Reply To

I have since written a comprehensive scraper module to supersede Web::Scraper with iteration capabilities, though it needs a bit of a rework.

I would be more than glad to have some insights into this project; it sounds very promising. You know that I am learning, and with day-to-day projects like parsers and scrapers I can learn quite a lot.

In Reply To

I typically recommend HTML::TreeBuilder::XPath. It simplifies executing XPath queries, and it inherits from HTML::TreeBuilder, a feature-rich HTML parser, which in turn inherits from HTML::Element, a feature-rich HTML extractor/modifier. Together they form a very powerful HTML processing package that covers all you'd need and more.

Well, parsing with HTML::TreeBuilder::XPath is a great idea; I thought I should do some preliminary tests first. And yes, I am pretty sure that this is a great chance to learn.

My way to get the XPath is to use Google Chrome. I have a webpage I want to get some data from: see https://wordpress.org/plugins/wp-job-manager/

Goal: I need the following data:

Code

Version:
Last updated:
Active installations:
Tested up to:

Well, I think I have to use the findvalue function:

The findvalue function in HTML::TreeBuilder::XPath returns a concatenation of any values found by the XPath query. Why does it do this, and how could a concatenation of the values be useful to anyone?

Why does it do this?

When we call findvalue, we're requesting a single scalar value. If there are multiple matches, they have to be combined into a single value somehow.

From the documentation for HTML::TreeBuilder::XPath:

findvalue ($path)

...If the path returns a NodeSet, $nodeset->xpath_to_literal is called automatically for us (and thus a Tree::XPathEngine::Literal is returned).

And from the documentation for Tree::XPathEngine::NodeSet:

xpath_to_literal()

Returns the concatenation of all the string-values of all the nodes in the list.

An alternative would be to return the Tree::XPathEngine::NodeSet object so the user could iterate through the results himself, but the findvalues method already returns a list.

How could a concatenation of the values be useful to anyone?
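A tiny demonstration of the difference, on made-up HTML: findvalue collapses every match into one concatenated string, while findvalues returns one string-value per matched node.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<ul><li>Version: 1.31.3</li><li>Tested up to: 5.0.3</li></ul>'
);

# findvalue: both <li> string-values joined with no separator.
my $joined = $tree->findvalue('//li');
print "$joined\n";    # Version: 1.31.3Tested up to: 5.0.3

# findvalues: one entry per matched node.
my @values = $tree->findvalues('//li');
print "$_\n" for @values;
```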

I want to have a little database that runs locally, with that data for my favorite plugins. Finally, I want to keep this data chart updated by fetching the data automatically with a cron job.

In Reply To

If you are having trouble getting to grips with the basics of web scraping in Perl, I'd be happy to go through it with you. Every scrape is different; if you don't understand the process behind one, you will have difficulty writing another.

Well, dear Chris, I understand: I can learn with each step and with each new scraping/parsing task. This is a great new chance to dive into Perl.

Re: [dilbert] since php-parser attempts failed i need to get a perl-approach

Hi Dilbert,

Quote

After parsing each page, check for the existence of the next › link at the bottom

That is an excellent idea, and likely your best option. In a rough script I tested, I fetched the total results using //span[@class="ey_badge"], then computed the max page with:

Code

my $page_max = $results / 21;
$page_max = int($page_max) == $page_max ? $page_max - 1 : int($page_max);

But stick to your plan.
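The arithmetic can be checked in isolation. Wrapping it in a helper (page_max is my name, not from the script above), and assuming pages are numbered from 0, which is why an exact division is reduced by one:

```perl
use strict;
use warnings;

# Hypothetical helper around the page-count arithmetic: 21 records
# per page, 0-based page numbers.
sub page_max {
    my ($results) = @_;
    my $pages = $results / 21;
    return int($pages) == $pages ? $pages - 1 : int($pages);
}

print page_max(6000), "\n";   # 285 (6000 / 21 = 285.7..., floored)
print page_max(6300), "\n";   # 299 (6300 / 21 is exactly 300)
```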

Quote

If we have an array from which we load the URLs that need to be visited, we would cover all the pages.

Yes, this would be fine. Preferably, use a URI object to update the URL's page param on each iteration of a loop until the page max is reached.
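The URI-object approach can be sketched like this; the base URL and the name of the page parameter are assumptions about the Europa site's query string.

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;   # adds query_param() to URI objects

# Assumed base URL and paging parameter; adjust to the real site.
my $uri = URI->new('https://europa.eu/youth/volunteering/evs-organisation?page=0');

for my $page ( 0 .. 2 ) {
    $uri->query_param( page => $page );   # update only the page param
    print "$uri\n";
}
```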

Quote

i would be more than glad to have some insights into this project - it sounds very promising

It's a work in progress, a first draft. I need to modularise more components, as the interface from a user's perspective isn't too clean or intuitive. It also needs to cover various other scenarios to ensure it's capable of every possible scrape. The code won't make much sense on its own, but I plugged in Europa and here is a snippet:

In general, the map represents the resultant data structure. The iterator's purpose is quite straightforward: it should return a node each time it is called, or undef to finish. The organizations paging iterator shifts off each node from an array; once the array is empty, it calls a url iterator which increments the page, until there are no pages left.
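The design described can be sketched as a closure (my reconstruction, not the actual module code): an iterator that shifts nodes off the current batch, pulls in the next page's batch when it runs dry, and returns undef once everything is exhausted.

```perl
use strict;
use warnings;

# make_iterator is a hypothetical name; each batch stands in for the
# nodes scraped from one page of results.
sub make_iterator {
    my @batches = @_;
    my @current;
    return sub {
        if ( !@current ) {
            my $batch = shift @batches or return undef;
            @current = @$batch;    # "fetch" the next page of nodes
        }
        return shift @current;
    };
}

my $next_org = make_iterator( [ 'org1', 'org2' ], ['org3'] );
while ( defined( my $org = $next_org->() ) ) {
    print "$org\n";
}
```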

It's too complex to go into much detail right now, but hopefully in the near future.

Quote

i want to have a little database that runs locally - with those data of my favorite-plugins.

Keep working at it; if you bump into a specific issue or there's an aspect you don't understand, feel absolutely free to share it with us and we will do our best to help you move forward. Try to produce a working script: start by looping over each plugin from a hardcoded array, fetching the relevant page's content and putting it through an XPath module of your choice.
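A starter sketch along those lines. The plugin slugs are examples, and the plugin-meta class in the XPath is an assumption about wordpress.org's markup that should be checked against the live page source.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;

# Hardcoded list of favourite plugins to begin with.
my @plugins = qw( wp-job-manager woocommerce );
my $ua = LWP::UserAgent->new( timeout => 15 );

for my $slug (@plugins) {
    my $res = $ua->get("https://wordpress.org/plugins/$slug/");
    if ( !$res->is_success ) {
        warn "$slug: " . $res->status_line . "\n";
        next;
    }
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->decoded_content );

    # Class name is an assumption; inspect the page source to confirm.
    my @meta = $tree->findvalues('//div[contains(@class,"plugin-meta")]//li');
    print "$slug:\n";
    print "  $_\n" for @meta;
    $tree->delete;
}
```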

Re: [dilbert] since php-parser attempts failed i need to get a perl-approach

Hello dear Chris, hello dear all. I get some errors while I try to debug the following code:

Quote

martin@linux-3645:~/dev/perl> perl eu.pl
syntax error at eu.pl line 81, near "our "
Global symbol "$iterator_organizations" requires explicit package name at eu.pl line 81.
Can't use global @_ in "my" at eu.pl line 84, near "= @_"
Missing right curly or square bracket at eu.pl line 197, at end of line
Execution of eu.pl aborted due to compilation errors.
martin@linux-3645:~/dev/perl> ^C
martin@linux-3645:~/dev/perl>

It fetches the data of approx. 6000 records from http://europa.eu/youth/volunteering/evs-organisation#open

See the code:

Code

use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

Re: [dilbert] since php-parser attempts failed i need to get a perl-approach

Hi,

It appears you've merged two different variations of the code; they are not compatible with each other. The second was an excerpt from a more comprehensive module set which I wrote merely to test various designs. I had shared one of these designs so that you could use some of its concepts, particularly surrounding pagination. The "crawler" module set has changed radically since then; it is much better organized and much simpler than before, but incomplete.

I don't even have a copy of the original script I posted an excerpt of above. I have attached the latest code I have with regard to Europa; it supports pagination of organizations and compiles just fine. But note, it was thrown together and should be used cautiously. Also note, at line 186 there is a condition limiting it to iterating just two pages; the 1 needs to be replaced with $page_max to iterate them all. In other words, it probably needs work.