Re: [dilbert] a little script that makes use of LWP::Simple

Hi,

Quote

the first step: first i do a view on the page source to find HTML elements?

view-source: is a browser command: it tells the browser to output the raw response as plain text rather than render it according to its actual content type, HTML in this case. You should not need to include view-source: in the URL you give your script.

I have written a little script that extracts the data out of each block and cleans it up a little. The browse function is generic: it takes an input ref which contains the url and the xpaths of the parent and children, and uses them to construct the output ref.

It is just to give you an idea of an approach I might take; it does not yet navigate across the pages, but you may want to use it as a basis.

Code

use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
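For reference, the input ref described above might be shaped roughly like this. The key names and xpaths here are illustrative guesses based on the project listing page mentioned later in the thread, not the ones from the actual script:

```perl
# Hypothetical shape of the $conf ref: a url, a parent xpath selecting
# one node per result block, and child xpaths relative to each parent.
my $conf = {
    url      => 'https://europa.eu/youth/volunteering/project_en',
    parent   => '//div[contains(@class, "views-row")]',
    children => {
        title   => './/h4//text()',
        country => './/span[@class="country"]//text()',
    },
};
```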

Re: [dilbert] a little script that makes use of LWP::Simple

Hi,

Hardcoding the total number of pages isn't practical as it could vary. You could:

- extract the number of results from the first page, divide that by the results per page (21) and round it down.
- extract the url from the "last" link at the bottom of the page, create a URI object and read the page number from the query string.

Note that I say round down above, because the query page number begins at 0, not 1.
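In code, the first option might look something like the following sketch. last_page_index is a made-up helper name, and 21 per page is taken from above; subtracting 1 before dividing keeps totals that are exact multiples of the page size from overshooting by one page:

```perl
use strict;
use warnings;

# Zero-based index of the last page, given the total result count
# scraped from the first page.
sub last_page_index {
    my ( $total_results, $per_page ) = @_;
    return int( ( $total_results - 1 ) / $per_page );
}

# 506 results at 21 per page => pages 0 .. 24.
print last_page_index( 506, 21 ), "\n";    # prints 24
```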

Otherwise, I personally would probably try to incorporate paging into the $conf, perhaps as an iterator which upon each call fetches the next node; behind the scenes it automatically increments the page when there are no nodes left, until there are no pages left. But this is probably beyond the scope of what you need, and a basic looping mechanism should be sufficient.
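To sketch that iterator idea in Perl: the names and the stubbed page fetcher below are illustrative only; a real version would do the HTTP fetch and xpath extraction inside the coderef.

```perl
use strict;
use warnings;

# Build an iterator over paged results. $fetch_page is a coderef that
# takes a zero-based page number and returns an arrayref of nodes;
# an empty arrayref signals that there are no pages left.
sub make_node_iterator {
    my ($fetch_page) = @_;
    my $page = 0;
    my @buffer;
    my $done;
    return sub {
        until ( @buffer or $done ) {
            my $nodes = $fetch_page->( $page++ );
            @$nodes ? push( @buffer, @$nodes ) : ( $done = 1 );
        }
        return shift @buffer;    # undef once everything is consumed
    };
}

# Stub standing in for the real fetch-and-parse step.
my @fake_pages = ( [ 'a', 'b' ], [ 'c', 'd' ], ['e'] );
my $next_node  = make_node_iterator( sub { $fake_pages[ $_[0] ] // [] } );

while ( defined( my $node = $next_node->() ) ) {
    print "$node\n";    # a, b, c, d, e - one per line
}
```

The caller just pulls nodes until the iterator returns undef; the page bookkeeping stays hidden inside the closure.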

I checked Web::Scraper in case it had features to handle paging, which it unfortunately doesn't. It is, however, a much more full-featured replacement for my solution above, and could be used in its place if you preferred.

Finally, if you eventually need to distribute the scrape process across multiple processes, there are various ways this could be incorporated, but you should consider doing it asynchronously via HTTP::Async.

Re: [Zhris] a little script that makes use of LWP::Simple

Hello dear Chris, hello dear all. ;)

Note: first of all, dear Chris, this is just a quick posting (at the moment I am not at home), but I wanted to share some ideas with you and to say many thanks for the continued help!!

To begin at the beginning: the scraper runs very well, I have tried it out. This is just overwhelming.

Besides the results that this script brings, it is a great chance to see how (!!) Perl works, and a great chance to learn and to dig deeper into Perl.

BTW: this applies not only to the above-mentioned page of volunteering organisations (altogether more than 6000 records), but also to the volunteering projects: see https://europa.eu/youth/volunteering_en and especially https://europa.eu/youth/volunteering/project_en?field_eyp_country_value=All&country=&type_1=All&topic=&date_start=&date_end= (search results: 506 projects found).

The data structure is (almost) the same, so the scraper can be applied here too: wonderful! As mentioned above, I want to learn more and more; Perl is pretty difficult, and for me it seems to be harder to dive into Perl than to learn PHP or Python.

But anyway, Perl is very, very powerful, and as we see, we can do a lot of things with it. I will try to get more XPath knowledge, and I will try to apply the solution to more than only one target; perhaps there is a kind of Swiss-army-knife solution, a robust scraper/parser that works for such (pretty difficult) sites with cells like we have in this European target.

Conclusion: after the parsing of the data, I will try to store all the data in a MySQL DB. This is the next step in the process of this little project.

As for the next step, the looping over the different pages: I have done a quick search in order to find some solutions that may give some hints on how we can do this "pagination thing":

First of all, I like the idea of incrementing the page, as you say: "behind the scenes it automatically increments the page when there are no nodes left until there are no pages left." Chris, this is a great idea.

Here are the solutions I found that could be applied, more or less:

A Bash solution: https://stackoverflow.com/questions/35423019/iterate-through-urls

If I wanted to [..] download a bunch of .ts files from a website, and the URL format is http://example.com/video-1080Pxxxxx.ts

Question: where the xxxxx is a number from 00000 to 99999 (zero padding required), how would I iterate through that in bash so that it tries every integer starting at 00000, 00001, 00002, etc.?

Answer: loop over the integer values from 0 to 99999, and use printf to pad to 5 digits.
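The same trick translates directly to Perl, which may be more useful here: sprintf pads just like printf does in the shell. The example.com URL is the placeholder from the question, and the range is shortened to three values for the demo (the full loop would be 0 .. 99999):

```perl
use strict;
use warnings;

# Zero-pad each counter value to five digits and build the URL,
# exactly as the bash answer suggests.
for my $i ( 0 .. 2 ) {
    my $padded = sprintf '%05d', $i;
    print "http://example.com/video-1080P${padded}.ts\n";
}
```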

A Perl solution with HTTP::Tiny: the code is quite straightforward. We have a list of URLs in the @urls array. An HTTP::Tiny object is created and assigned to the $ht variable. Then, in a for-loop, we go over each url and fetch it; in order to save space, only the size of each page is printed. So finally we arrive at an example of downloading many pages using HTTP::Tiny.
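A reconstruction of what that description suggests might look like the following; the URLs are placeholders, and HTTP::Tiny has shipped with core Perl since 5.14:

```perl
use strict;
use warnings;
use HTTP::Tiny;

# A list of URLs to fetch (placeholders, not the real target pages).
my @urls = (
    'http://example.com/page1',
    'http://example.com/page2',
);

my $ht = HTTP::Tiny->new;

for my $url (@urls) {
    my $response = $ht->get($url);
    if ( $response->{success} ) {
        # Print only the size of each page, as in the description above.
        printf "%s: %d bytes\n", $url, length $response->{content};
    }
    else {
        warn "$url failed: $response->{status} $response->{reason}\n";
    }
}
```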

BeautifulSoup looping through urls: https://stackoverflow.com/questions/27752860/beautifulsoup-looping-through-urls. The idea: follow the pagination by making an endless loop and following the "Next" link until it is not found.

Dear Chris, dear all: as said above, this was just a quick posting, but I wanted to share these ideas with you and to say many thanks again for the continued help!!

In the next few days I will try to figure out how to apply a solution for iterating over all the pages and, last but not least, try to store the whole dataset in a MySQL DB.

But that can wait for the moment...

I keep coming back to this wonderful thread on a regular basis, and yes: I'll keep you informed about how it is going on here.

Regards, dilbert

- I love this place!!! ;) Keep it up, this great place for idea exchange and knowledge transfer; it is a great place for discussing ideas and, yes, for learning!!!! ;)