Okay, so my company thinks that because I took a 7-week class on Perl (completed it, got a decent grade) I'm a genius now (sarcasm). They've asked me for some assistance on a project. We are moving servers and networks, and it's a crazy mess. My boss wants me to pull the URLs from some of our intranet sites by viewing the source code, so we can see how we might want to configure the new SharePoint and intranet sites (we have so much fluff in our site locations that some files haven't been updated since about 2009, while people have been creating new ones elsewhere). So it's basically web crawling across our network. We had a huge layoff and it's been nuts. What I thought about doing is using Perl to pull the URLs from a saved .txt or .html file and add them to an array (something I used to hate in Java, but find nicer in Perl). Everything will print out, and I can copy it into a spreadsheet/Word doc and start my Visio workflow diagram. I'd ask my scripting guru at work, but he's out with a new kid. So now you see my dilemma. Wanna assist? Thanks for reading.

Being an intern, I can't VPN in to access the intranet sites, but it's a project he'd like me to help with. So I've just been using source views from Firefox, or making my own "one or two" entries to test my code.

Now, this pulls my fake test HTML (if it's in the same directory) and lets me print from the array. My plan was to use a foreach loop with $each_line to go through each line of text, and then a regular expression to verify the information before printing it out so I can copy it to Notepad/Word/etc.

Below is where I get lost. I know the idea is sound and should be rather easy to accomplish. Thanks again for any assistance.

foreach my $each_line (@htmlarray) {
    # line does not start with 'http' but does start with '/'
    if ( !($each_line =~ /^http/) && ($each_line =~ m{^/}) ) {
        print $each_line;
    }
}

Scheme

open file

add file data to array

find the lines containing URLs

print them out so I can copy them to a text document (probably between 1,000-1,500 URLs, or A LOT more)

Perl Newbie - 7 months of PERL basics.
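A minimal sketch of that scheme, assuming the page source has been saved locally as saved_page.html (the filename and the loose href check are just placeholders for testing):

#!/usr/bin/perl
use strict;
use warnings;

# 'saved_page.html' is a placeholder name for the saved source view
open my $fh, '<', 'saved_page.html' or die "Cannot open saved_page.html: $!";
my @htmlarray = <$fh>;    # read every line of the file into the array
close $fh;

# print only the lines that appear to contain a link
foreach my $each_line (@htmlarray) {
    print $each_line if $each_line =~ /href\s*=/i;
}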

I don't know which criterion you want to use to recognize URLs in your files. According to your code, you are searching for lines which start with "http" and start with "/" at the same time, so it is impossible for any line to match.

A more reasonable approach would be to grep all strings starting with "http://" or "https://", up to the first white space, but this might also return strings embedded in a comment, for instance.
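A rough sketch of that approach, reusing the @htmlarray from the original post (note it will happily pick up URLs inside comments too):

my @urls;
foreach my $each_line (@htmlarray) {
    # grab every http:// or https:// string, up to the first whitespace character
    while ( $each_line =~ m{(https?://\S+)}g ) {
        push @urls, $1;
    }
}
print "$_\n" for @urls;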

The condition in the posted code is actually looking for lines not starting with "http" (because of the '!') and starting with '/', which is not contradictory but redundant, since lines starting with '/' are bound not to start with 'http'. But that is probably not the OP's intention.

@wyndcrosser: please explain what the lines you are looking for look like and what you want to extract from them.

I've got the code to return any line starting with a <href="http://...

My questions:

- How do I add those lines to an array? (This would be the best method, right? My boss said he'd like it that way if possible.) I've been able to do a foreach loop and add the data, but I'd like to hear your opinion.

- After the regular expression finds the lines with URLs, how do you get it to print out just http://www.whatever.co.uk.index.html, etc., without all the added details? My issue is that I want to be able to copy it into a Word/Excel doc and divide them up for my Visio layout.
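One way to do both at once, as a rough sketch (assuming the lines are already in @htmlarray as above; the href-capturing regex is a simplification and will miss links that span lines or use single quotes):

my @found_urls;
foreach my $each_line (@htmlarray) {
    # capture only the address inside href="...", not the rest of the tag
    if ( $each_line =~ /href\s*=\s*"(http[^"]+)"/i ) {
        push @found_urls, $1;    # keep just the URL, nothing else
    }
}
print "$_\n" for @found_urls;    # one URL per line, ready to paste into Excel/Word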

Thanks guys

This is just a very basic attempt. I'm still working on it. I've got work around the house to complete as well.

To parse the lines for URLs within quote marks, you could use something like this:

print $1 if /"([^"]*)"/;

This will capture things between quote marks. But if the quoted string is spread over more than one line, this will not work, and there are probably other cases it misses as well. At that point you really need a real parser, i.e. most probably a CPAN module.
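For that, one option on CPAN is HTML::LinkExtor (from the HTML-Parser distribution). A minimal sketch, again assuming a locally saved file named saved_page.html:

use strict;
use warnings;
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new;        # with no callback, links are collected internally
$p->parse_file('saved_page.html');   # placeholder filename for the saved page

for my $link ( $p->links ) {
    my ( $tag, %attrs ) = @$link;    # e.g. ( 'a', href => 'http://...' )
    print "$attrs{href}\n" if defined $attrs{href};
}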