bfdi533 has asked for the
wisdom of the Perl Monks concerning the following question:

I have log files that I am processing every hour and they grow from the beginning of the day to the end of the day and typically are in the millions of lines per log by the end of the day. I have previously been doing a cmp to get just the new lines and keeping the previous file to check against. Still, I am running into duplicates and need to prevent that. As I am already keeping the start and end dates from the log in a DB, I am trying to switch to checking the dates in the log against what has already been loaded.

To do this, I am building a hash from the a DB query that stores the ID, start and end dates. The dates are stored in the hash by converting them to seconds since 1970 to make comparisons faster. However, I think the sub that checks for a date, though straight-forward, is way too slow to check millions of log lines against and run in a rapid manner. The sub in question is check_date_in_range. I thought that only calculating the array of sorted keys would speed things up so I am "caching" it but that does not appear to be the slowdown ...

Obviously, the DB code is left out and the building of the ranges_dt hash is somewhat separate and the logic to process the log file is rather complicated. But this is a working code set uses the sub so you can see what I am trying to do.

Another option would be not to store the timestamps in a hash and sort them every time a new one arrives, but use a data structure that can add a new element and quickly search for the closest members (binary search tree, for example).

Alternatively, why not just store the line number of the last line processed between runs? It will be much faster to just discard the first N lines on the next run than to do date-based processing on them.

The reason is that the logs are uploaded to a central server for processing and are contained in a .tgz file. Once opened up, there is no guarantee that each hour being processed by the loader script is in sequential order and the processing of the log files is done in parallel in a queue with work handed out to child processing scripts. So, the DB is the only point of reference. Also, storing a byte count would not work when the files are rotated (supposedly once a day) and the file starts over at 0 (but on some systems the logrotate does not always happen so those files just keep growing).

It could be that hour 3, 5 and 6 are processed and THEN hour 4 gets processed so again the byte count from hour 6 will not be applicable to hour 4. But with hour 4 having been included in the hour 5 and hour 6 upload, there is no need to actually pull out and load the data but without checking the dates for inclusion in the range, there is no way to guarantee this.

Given all of this, dates are the only reliable means of knowing data has been processed before.

But, even with being able to process only dates after a byte skip, I would still need to process the dates in the range. And there are a lot of them each hour on some of the busier systems.

Hence, the original question, how to make the lookup faster/more efficient?

I'm not sure if this applies in your specific case, but when dealing with ranges, I found that using a hash is usually not very efficient because you usually end up making sequential search through the hash keys, thereby annihilating the advantage of fast hash lookup. Using an array of sorted ranges and a binary search is often much faster in such cases.

Yes, there is a lot going on. Details on the problem and discussion on solutions are in an earlier post from me:

http://www.perlmonks.org/?node_id=1207429

But, introducing the has range lookup was Terrible for the performance as it meant I had to call a sub, search through about 500 hash items, return if found; AND it had to do this for between 250,000 and 500,000 items each run. If you look at my stats output, you will see this was taking about (edited numbers to reflect actual data) an average of 120 seconds per 1000.

Changing to the array lookup was a HUGE increase in speed knocking the per 1000 time to 5-7 seconds.

With all of the other stuff going on, it is a performance nightmare ... Every little bit counts.

I’m sorry that I don’t right now have time to flesh this brainstorm out with a full code example, but it should be obvious ...

A very good approach would use a data structure such as Tree::RB to store entries, keyed by start-time. Each node is a 2-tuple hash which contains the start and the end time. (As you initially populate the tree from the database, you specifically look for overlaps and remove them, replacing them with the longer ranges so that the tree finally contains only the longest possible ranges.)

(Use any sort of convenient “orderable-number representation of a time/date” that is helpfully provided by your Perl time-piece of choice.)

I fingered this particular CPAN module because its lookup() method contains a mode parameter which supports inexact lookups. The specific mode that I am looking for is called LULTEQ:

Returns the node exactly matching the specified key. If this is not found then the next node that is less than the specified key is returned.

Retrieve this key, whatever it is, and then check to be sure that the time-range that you are looking at falls entirely within – or, does not intersect? – the time-range represented by that node.

As a further simple optimization, then stash that object-reference that you last retrieved ... whether or not it was successful ... and check it first. This will avoid tree-lookups in a majority of cases because log-file entries are consecutive.

While considering this suggestion further, I would point out that the process for collapsing nodes so that the tree includes only the longest possible ranges needs to carefully consider both sides. Specifically:

First, search for the node which contains the largest low-value that is less than or equal to the specified key – using the inexact-search as noted previously.

If the current date-range overlaps this one, add this key to a list of nodes that will momentarily need to be removed, and note what the new date-range will be ... so far.

Now, use iter(high_value) to create an iterator that will allow you to walk forward from that high-value key. Check each node to see if it also overlaps the new range. If so, expand the range and (to avoid disturbing the iterator) push these keys onto the bucket-list as well.

Once the iterator no longer produces overlapping ranges, run the bucket-list to remove all of the now-obsolete keys (a step that we deferred to avoid impact to the iterator), and insert one node containing the expanded range. The red/black tree algorithm will efficiently keep the tree balanced. (Notice that you cannot update the nodes in-place.)

Now, all of this is simply describing an implementation of a “range-set” type. I surfed CPAN for that alternative first but did not readily find one. It would, however, be an excellent candidate for CPAN inclusion. Did anyone find the package that I was looking for?

A second alternative worth considering in cases like this is MRU – maintain a simple array of ranges, search it sequentially, and then remove the range that you found from the list and shift it onto the front of the list so that the most-recently found entries will be found first. But the OP indicated that his situation was extremely performance-critical, hence the suggestion of a slightly more exotic data structure to drive the search in O(log2) time. The actual frequency-distribution of the date values in the log-data would drive this decision: is the right-answer likely to be, simply, most-recently found?