I’m working on a project to read stock quotes from the internet. The code at the bottom of this post is my first step. It reads the URL, gets rid of all of the web spacing, and then does a custom conversion of the page into a text stream: no fonts, no styles, etc.

I want to call the new sub-function with two parameters: the URL, and an optional disk file to hold the results.

I’m currently creating the stripped down stream into a file on disk. I’d like to create it as an in memory stream, or whatever they call it in Perl. Save that stream to disk if the optional path is given, and then return the stream as a single string to the calling process.

In the case of getting a stock quote, I could ask for the page that contains multiple quotes, split the result on the work “table>”, search for the split that contains column heading for Symbol and Time & Price, and finally search that table for the lines that start off with each requested symbol. But, before I do that I have to build the returned string in pieces and return it as a single string.

my $inBody = 0; # are we currently in the body of the web page my $inStyle = 0; # are we currently defining a style in the body of the web page my $inScript = 0; # are we currently defining a script in the body of the web page

Hi, from what you said, I understand your code is doing what you want, you are looking for best practices. I do not see any bad practices in your code, except that it is not very "perlish", in the sense that it is using systematically a C style syntax, rather than using the more idiomatic Perl shortcuts.

This is a first suggestion to reduce your code from 153 lines to about 103 lines using simple Perl constructs:

my $inBody = 0; # are we currently in the body of the web page my $inStyle = 0; # are we currently defining a style in the body of the web page my $inScript = 0; # are we currently defining a script in the body of the web page

I have tried to get the code to do exactly the same thing, but, of course, since I can't test, I might have made an error here or there. I guess that not everybody will agree with my changes, my logics is that, the shorter the code (so long as it does not get cryptic), the easier it is to have it bug-free. In particular, IMHO, it is much easier to avoid bugs when you see mode conde on your screen than when you have to scoll all the time. Therefore, I would suggest a second shorter version, still trying to keep the same algorithm, with a few other improvement (such as the way to open the file):

my $inBody = 0; # are we currently in the body of the web page my $inStyle = 0; # are we currently defining a style in the body of the web page my $inScript = 0; # are we currently defining a script in the body of the web page

I am now down at 75 lines, less than half the original code count. Again, I haven't tested the changes, there may be an error here or there; I could probably cut it down further, but it might start to become a bit cryptic, and I don't want that. In my view, the code above is just at least as clear (possibly clearer) as your original code, and the fact that it is less than half the size of the original program makes it easier to develop and debug. I would probably go even further if I knew the context better.

Hi PapaGeek, Welcome to Programming in Perl once again. But before I say anything else please let me say, Please and Please NEVER parse an html file using REGEX like you are doing except it is a single line of HTML. Rather make good use of the module you have listed in your script.

That been said, if I could point to some good practice like you requested for: 1. NEVER parse html files with REGEX use modules, 2. Always check the return value of function "open" like this

Code

open my $fh, '>', $filename or die "can't open file: $!";

or you

Code

use autodie;

3. The same principle applies to function "close" 4. use tested module to also parse your URLs, instead of using REGEXs.

Below is a script that does what you wanted but left out the part of your workings using a subroutine which I believe you can figure out. . Because all that remains is just putting a sub. and do a little re-arrangement.

Code

use warnings; use strict; use LWP::UserAgent; use HTML::TreeBuilder; use URI;

Thank you for the come backs so far, and yes I do know that it does not look “perlish”. Like I said I’m a newbie and that should come with time. But, I will look at every line change suggested here, they are all appreciated.

My original question was how do I build this page as a string? The current code says:

Build the reply string in memory, not as a disk file. Create a disk file only if requested return the reply as a single string to the caller.

I then want to use a regular expression something like: <tr>|{(.*)}||Aug 2||(.*)|| To parse out the symbols and prices from the returned file.

I am familiar with the HTML::TreeBuilder process, but it did not give me a lot of control over which commands to pass on and which ones to hide. That is why I’m writing my own HTML custom parser, but I want it to return a single string, not a disk file!

Bill, Thank you for an excellent reply. I will definitely change my code to use this style (Perlish!)

The code was written because HTML::TreeBuilder trims down the web page the way someone else decided it should be done. My code, and especially with your change, gives me full control over what the page looks like.

But, I will ask my original question again!

I wanted to create the resulting page in memory. To that end I have modified the code as:

Code

my $pageStr = "";

if ( $commandOut ) { $pageStr .= $commandOut}

$pageStr .= $parts[1];

The page returned from Yahoo Finance was 62,162 characters.

My process created the pseudo text only file of 9,889 characters by performing 1,841 of the .= appends to the in memory string of the file.

Is 1,841 appends efficient? Or is there a more “Perlish” way to create the output string in memory from 1,841 pieces?

As I understand your original question, you already write your results to a disk file, but you want them in a string.

Of course, you could "slurp" (read with $/=undef) the file back into a string.

Another possibility is to open the file as a memory file rather than a disk file. (Refer perldoc -f open)

Quote

Since v5.8.0, Perl has built using PerlIO by default. Unless you've changed this (such as building Perl with "Configure -Uuseperlio"), you can open filehandles directly to Perl scalars via:

open($fh, ">", \$variable) || ..

I do not know enough about your application to comment on efficiency except to note that it is seldom important if your program runs in available memory and in an acceptable time. It is far more important that your program be clear to people (including yourself). Perl provides the tools to do this. Good Luck, Bill

The parenthesis are needed in the regular expressions to contain the alternations. The ?: changes them to "non-capturing" parenthesis. Their result is not stored in $1, $2...etc. I have developed the habit of using plain ones only when the result is needed. This makes the processing slightly faster, and more important the expression is easier to modify because changes to the non-capturing ones do not renumber the results. Good Luck, Bill