On Fri, Sep 21, 2012 at 08:56:34AM +0100, Simon Wistow wrote:
> On Thu, Sep 20, 2012 at 12:35:18PM +0100, Nicholas Clark said:
> > Lots of "one trick pony" type benchmarks exist, but very few that actually
> > try to look like they are doing typical things typical programs do, at the
> > typical scales real programs work out, so
>
> As a search engineer (recovering) I'm inclined to say - get a corpus of
> docs, build an inverted index out of it and then do some searches. This
> will test
>
> 1) File/IO Performance (Reading in the corpus)
> 2) Text manipulation (Tokenizing, Stop word removal, Stemming)
> 3) Data structure performance (Building the index)
> 4) Maths Calculation (performing TF/IDF searches)
>
> All in pretty good, discrete steps. Plus by tweaking the size of the
> corpus you can stress memory as well.
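
For concreteness, the four quoted steps could be sketched roughly as below.
This is a toy illustration under stated assumptions, not production-quality
search code: the stop list, the "strip a trailing s" stand-in for stemming,
the one-file-per-document *.txt layout, and all function names are
hypothetical, and it uses only core Perl (no CPAN, no external C libraries).

```perl
use strict;
use warnings;

# Hypothetical toy stop list; a real one would be far longer.
my %STOP = map { $_ => 1 } qw(a an and the of to in is on);

# Step 1: read a corpus, assuming one document per *.txt file in $dir.
sub read_corpus {
    my ($dir) = @_;
    my %docs;
    for my $path (glob "$dir/*.txt") {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        local $/;                       # slurp the whole file
        $docs{$path} = <$fh>;
    }
    return \%docs;
}

# Step 2: tokenize, drop stop words, and apply a crude stand-in for
# stemming (strips one trailing "s"; NOT a real stemmer).
sub tokenize {
    my ($text) = @_;
    my @terms;
    for my $word (lc($text) =~ /[a-z0-9]+/g) {
        next if $STOP{$word};
        my $term = $word;
        $term =~ s/s\z// if length($term) > 3;
        push @terms, $term;
    }
    return @terms;
}

# Step 3: build the inverted index, term => { doc id => term frequency }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $id (keys %$docs) {
        $index{$_}{$id}++ for tokenize($docs->{$id});
    }
    return \%index;
}

# Step 4: rank documents by summed TF * IDF over the query terms.
sub search {
    my ($index, $n_docs, $query) = @_;
    my %score;
    for my $term (tokenize($query)) {
        my $postings = $index->{$term} or next;
        my $idf = log($n_docs / scalar(keys %$postings));
        $score{$_} += $postings->{$_} * $idf for keys %$postings;
    }
    # Highest score first; ties broken by document id.
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

# Usage with an in-memory corpus instead of read_corpus().
my $docs = {
    d1 => 'The cat sat on the mat',
    d2 => 'Dogs and cats',
    d3 => 'The dog barks',
};
my $index = build_index($docs);
my @hits  = search($index, scalar(keys %$docs), 'cats');
print "@hits\n";    # prints "d1 d2"
```

Growing the corpus passed to build_index() is what would stress memory, as
the quoted text suggests; the hash-of-hashes index is simply the most direct
core-Perl representation, not a claim about what a tuned engine would use.
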

Thanks, this is a useful suggestion, but...

I'm not a search engineer (recovering or otherwise), so this represents
rather more work than I wanted to do: I first have to learn enough
of how to *be* a search engineer to figure out how to write the above code
to do something useful, *then* how to bring such code up to a reasonably
performant production standard, and then how to turn working code into
something sufficiently standalone to be a benchmark.

I don't want to be spending my time figuring out the right way to do all the
above algorithms in Perl. I want to get as fast as possible to the point of
figuring out how the perl interpreter (mis)behaves when presented with
extant decent code to do the above.

Unless there's a CPAN-in-a-box for doing most of the four steps
(which doesn't depend on external C libraries. That was one of my
"preferably" criteria).

So, next question - if I wanted to be as lazy as possible and write a search
engine (as described above) using as much of CPAN as possible, which modules
are recommended? :-)

Nicholas Clark