eversuhoshin has asked for the
wisdom of the Perl Monks concerning the following question:

Dear Monks
I have been struggling for a while with a negative lookahead regex. I am trying to extract certain excerpts from financial statements. Specifically, I would like to extract "Item 1. Business."
Here is the file link
http://sec.gov/Archives/edgar/data/931683/000115752309002434/a5927574.txt
I extract the Item 1 section by using boundaries that start at "Item 1" and end at "Item 1A", "Item 2", or "Item 3". Unfortunately, the regex does not extract the whole of Item 1, because it matches an "Item 3" that is mentioned inside the excerpt. Specifically, it stops at "discussed more thoroughly in Item 3".
I have tried using a negative lookahead to make it match all the way to the end, but I can't get my code to work.
Here is my code.

Hey PrakashK, thanks for the quick reply. The issue is that I still don't know whether I matched the real end of Item 1. Also, I have a bunch of other SEC files that mention Item 3 inside of Item 1. This is the hardest part, because I am matching almost pure text. The ones with HTML I was able to match more easily. Anyway, thank you for your reply!

It seems likely, looking at his regex, that he's insecure about the notion of "ITEM" being capitalized uniformly over his full data set. The sample he provided us is uniform, but why else would he go to all the trouble of creating alternations like (?:Item|ITEM) several times? If he's unable to depend on an all-caps "ITEM" as a delimiter, your solution won't be any more robust than his current one.
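For what it's worth, here is a minimal sketch of one way to sidestep both the capitalization worry and the "Item 3" mentioned mid-sentence. It assumes (and this is only an assumption, not something verified against the full data set) that real headings start at the beginning of a line, while passing references such as "discussed more thoroughly in Item 3" do not. The name extract_item1 and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the Item 1 text, or undef if no heading is found.
# ASSUMPTION: real section headings begin a line; in-text mentions do not.
sub extract_item1 {
    my ($text) = @_;
    # /i matches Item or ITEM, /m lets ^ match at every line start,
    # /s lets . cross newlines. The lookahead ends the match only at a
    # later heading that begins a line, or at the end of the text.
    if ($text =~ /^[ \t]*(Item\s+1\b.*?)(?=^[ \t]*Item\s+(?:1A|2|3)\b|\z)/ims) {
        return $1;
    }
    return undef;
}

# Made-up sample text, not taken from the real filing.
my $sample = join "\n",
    'ITEM 1: BUSINESS',
    'Litigation is discussed more thoroughly in Item 3.',
    'We operate charitable bingo halls.',
    'ITEM 2: PROPERTIES',
    'We lease our offices.',
    '';

my $item1 = extract_item1($sample);
print $item1 if defined $item1;
```

The /i flag replaces the repeated (?:Item|ITEM) alternations. The sketch still breaks if a wrapped line of body text happens to begin with "Item 3", so it is only as good as the line-start assumption.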

ITEM I: BUSINESS
----------------
Littlefield Corporation develops, owns and operates charitable bingo halls, and
owns and operates an event rental company. In our Entertainment division, we
operate 37 charitable bingo halls in Texas, Alabama, Florida and South Carolina.
...
are with Littlefield Hospitality and twelve (12) are at corporate headquarters
in Austin, Texas. Littlefield Entertainment consists of sixteen (16) full time
employees and nineteen (19) part time employees. Littlefield Hospitality
consists of thirty-two (32) full time employees and one part time employee.

How stable is the format of this file? Have you done any statistical analysis to test your assumptions? For instance, are section headings always left aligned? Always in caps as in the sample file? There is variability in the dividers between item number and section title (sometimes a colon and sometimes a hyphen). Is this the only variability?

You mention that sometimes section 3 is found within section 1. Do you mean that section 1 is interrupted by section 3 and then resumes? Or that section 3 immediately follows section 1? If section 1 resumes how do you know as a human reader that you have transitioned from the end of section 3 and back to the remainder of section 1?

In general using regexes in natural language documents to identify the boundaries of semantic chunks is not very reliable. Regexes are the textual equivalent of hearing sentences in a language you don't know. As a listener you can identify that certain sound sequences occur but if you hear them in two places you have no way of knowing if both are part of a noun or one is part of a verb and another is part of a noun. And even if it turns out both are part of a noun, you don't know whether they mean the same thing because nouns can sometimes have two meanings.

Using regexes sometimes works if you have a rigid document format and no possibility that markers of section boundaries can occur elsewhere in the document with different meanings and uses. For example, suppose the SEC will only accept documents where (a) section titles are always marked by the word "ITEM" (all caps) followed by the section title, (b) section titles never cross line boundaries and are limited to a specific set of values, (c) the next line is always a series of hyphens, and (d) the number of hyphens equals the number of characters in item + title. It would be highly unlikely that such a sequence would appear naturally as part of the regular text of a section. You could then use such a structure to chunk the text.
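As a rough sketch of what chunking on such a structure could look like, assuming the hypothetical rules (a), (c) and (d) above really held: a line counts as a heading only if it starts with all-caps "ITEM" and the following line is a run of hyphens of exactly the same length. The name split_sections and the sample text are made up for illustration, and rule (b)'s fixed set of titles is not checked here.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into sections keyed by heading line.
# ASSUMPTION: a heading is an all-caps "ITEM ..." line whose next line
# is hyphens of exactly the same length as the heading.
sub split_sections {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    my (@order, %body, $current);
    my $skip_underline = 0;
    for my $i (0 .. $#lines) {
        if ($skip_underline) {          # drop the hyphen line under a heading
            $skip_underline = 0;
            next;
        }
        my $line = $lines[$i];
        my $next = $i < $#lines ? $lines[$i + 1] : '';
        if (   $line =~ /^ITEM\s+\S/
            && $next =~ /^-+$/
            && length($next) == length($line))
        {
            $current = $line;
            push @order, $current;
            $body{$current} = '';
            $skip_underline = 1;
            next;
        }
        $body{$current} .= "$line\n" if defined $current;
    }
    return (\@order, \%body);
}

# Made-up sample; underlines are generated so the lengths always agree.
my @heads  = ('ITEM 1: BUSINESS', 'ITEM 2: PROPERTIES');
my $sample = join "\n",
    $heads[0], '-' x length($heads[0]),
    'Litigation is discussed more thoroughly in Item 3.',
    'ITEM 3 is mentioned here too, but with no underline.',
    $heads[1], '-' x length($heads[1]),
    'We lease our offices.',
    '';

my ($order, $body) = split_sections($sample);
print "$_\n$body->{$_}\n" for @$order;
```

The body line that begins with "ITEM 3" is not treated as a heading because no matching hyphen line follows it, which is the whole point of relying on the two-line structure rather than on the word alone.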

On the other hand, if "item" can be lower or upper case and there is no SEC-mandated format for titles, then you indeed have a problem, because there are many uses of the word "item" even in your sample text. Even if it were true that titles are always left aligned, it wouldn't be enough to pick out the section headings. Since section content text is also left aligned, there is a significant possibility that "item" as part of the content text will be left aligned in at least some of the SEC files. You'd have to do statistical analysis on the rate of false matches, i.e. comparing your algorithm's extraction to a human reader's extraction. Then you would have to check with your client about its acceptability. If your client thinks there are too many false matches, you'll need some mechanism to disambiguate between the different contextual uses of "item", and you may need to look into setting up some sort of Bayesian filter and training corpus.
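To make the false-match comparison concrete, here is a minimal sketch. It assumes you record, for each filing, the line number where a human reader says Item 1 ends and the line number where your regex says it ends. The function name, the file names and all the numbers below are invented placeholders, not measurements.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of files where the regex's end-of-section
# line disagrees with the human-labelled one (a missing result counts as wrong).
sub false_match_rate {
    my ($human, $regex) = @_;
    my @files = sort keys %$human;
    return 0 unless @files;
    my $wrong = grep {
        !defined $regex->{$_} || $regex->{$_} != $human->{$_}
    } @files;
    return $wrong / @files;
}

# Invented placeholder data: file name => line where Item 1 ends.
my %human = ('a.txt' => 120, 'b.txt' => 88, 'c.txt' => 301, 'd.txt' => 45);
my %regex = ('a.txt' => 120, 'b.txt' => 31, 'c.txt' => 301);

printf "false match rate: %.2f\n", false_match_rate(\%human, \%regex);
```

A number like this, measured on a reasonably sized hand-checked sample, is what you would take to the client when asking whether the error rate is acceptable.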
