Thursday, March 1, 2012

I had a bunch of filenames. I wanted to know which two-word phrases appeared most frequently. The purpose of this effort was to put those filenames into categories. I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process. This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size. I figured they would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog"). I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left. It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames. I would have gotten the sentences into individual rows by using Microsoft Word to replace sentence-ending punctuation (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel. It might have taken a while to resolve sentence ends, given problems with abbreviations like "Mr." I did not have more than 65,000 filenames, so Excel 2003 could handle it. Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).

Consider this sentence or filename: "I saw a dog - a big dog." Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog." This approach would produce some junk, but I hoped it would be easily filtered. Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a"). It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.
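For anyone who would rather script this step than spreadsheet it, the same split-on-spaces idea can be sketched in a few lines of Python (a sketch of the logic, not part of my actual process); it produces exactly the seven pairs listed above:

```python
def two_word_phrases(text):
    """Split on single spaces and pair up adjacent words.
    Punctuation is deliberately left attached, as described above."""
    words = text.split(" ")
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

pairs = two_word_phrases("I saw a dog - a big dog.")
print(pairs)  # ['I saw', 'saw a', 'a dog', 'dog -', '- a', 'a big', 'big dog.']
```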

That sentence contains seven spaces. I could count the number of spaces manually. With many sentences, it would make more sense to use a formula. One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces. TRIM would remove excess spaces.
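The SUBSTITUTE/LEN trick is just a length difference: the length of the sentence with its spaces, minus its length with the spaces removed. The same arithmetic in Python (my illustration, not the spreadsheet itself):

```python
def count_spaces(text):
    # Same idea as =LEN(TRIM(A3))-LEN(SUBSTITUTE(TRIM(A3)," ","")):
    # length with spaces minus length without spaces.
    text = " ".join(text.split())  # like TRIM: drop leading/trailing/excess spaces
    return len(text) - len(text.replace(" ", ""))

print(count_spaces("I saw a dog - a big dog."))  # 7
```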

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words. Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases. The approach I took was to set up the spreadsheet like this:

The top row indicated the number of the two-word phrase. The first one was "I saw," in group 1. The second row indicated the number of the space found. For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw"). This image shows only the first three groups; I would need four more. As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known. I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula. It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying. For instance, the formula in cell F2 was =F1, and in F1 it was =C1. That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

To prepare the following list of formulas for the cells on row 3, I used a macro. Those formulas were as follows:
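The formula list itself appeared as a screenshot. The underlying logic was to find each successive space with FIND (cell C3, for instance, held =FIND(" ",$A3)) and then cut the phrase out of the filename. Here is that find-the-next-space-and-slice logic as a Python sketch -- a reconstruction of the idea, not the original formulas:

```python
def phrases_by_space_positions(text):
    """Mimic the spreadsheet: record each space position (1-based, like
    Excel's FIND), then slice out one two-word phrase per space."""
    positions = [i + 1 for i, ch in enumerate(text) if ch == " "]
    phrases = []
    for n, pos in enumerate(positions):
        start = 1 if n == 0 else positions[n - 1] + 1          # char after prior space
        end = positions[n + 1] - 1 if n + 1 < len(positions) else len(text)
        phrases.append(text[start - 1:end])                    # Excel is 1-based
    return phrases

print(phrases_by_space_positions("I saw a dog - a big dog."))
```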

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine. I made a backup and proceeded to convert the spreadsheet's formulas to values. With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown. There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere. An easy way to do it was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then do Edit > Copy, Edit > Paste Special > Values > Enter > Enter -- in Excel 2003, the keystrokes Alt-E, C, Alt-E, S, V, Enter, Enter. I had usually found that the same keystrokes worked in later versions of Excel.

However achieved, the point is that cell C3 no longer contained the formula =FIND(" ",$A3). Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3. With that done, I could delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...). I sped up that process with a column sort (select all with Ctrl-A, then Data > Sort > Options > Sort Left to Right > Row 2). I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:

That is, the ends of the rows did not end at the same place. In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16. I wanted them all in a single column. I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously. That post still provided more details than I have written here. But in some regards, I now saw a way to simplify that process a bit. I deleted row 1, containing the column headings, as well as column 1, containing the original filenames. They seemed unnecessary and potentially confusing at this point. Then I created another worksheet (Insert > Worksheet). Its default name was Sheet2. In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located. The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1: [xA8.xls]Sheet1!$A$1. (xA8.xls was the name of my spreadsheet file.) I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began. In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this: #$A$1. Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1. In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998. So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998. A fast way to do that was to hit F5 and type Q14998 and hit Enter. That took me to Q14998. I put an X there. Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter. Boom! Done. I went back to Q14998 to double-check. Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above): [xA8.xls]Sheet1!$A$1. I hit Ctrl-A > Ctrl-C to copy it all. I opened Notepad and hit Ctrl-V to paste it. (If there had been too much data for Notepad, I think something like Notepad++ would have worked.) The purpose of putting it into Notepad was to remove the column formatting. (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word. Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks. That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p). Word reported that it had made about 170,000 replacements. This gave me the desired list, starting like this:

$A$1

$B$1

$C$1

$D$1

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines. I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more. The file was long -- about 3,700 pages.
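That whole replace-and-clean sequence -- swap each occurrence of the marker for a newline, then strip the trailing tabs and blank lines -- could equally be scripted. A Python sketch, using the marker string from my file (the sample data here are made up):

```python
def addresses_per_line(dump, marker="[xA8.xls]Sheet1!"):
    """Replace each full-address marker with a newline, then drop
    trailing tabs and blank lines (the ^p^p / ^t^p cleanup)."""
    lines = dump.replace(marker, "\n").split("\n")
    return [line.strip("\t ") for line in lines if line.strip("\t ")]

dump = "[xA8.xls]Sheet1!$A$1\t[xA8.xls]Sheet1!$B$1\t[xA8.xls]Sheet1!$C$1"
print(addresses_per_line(dump))  # ['$A$1', '$B$1', '$C$1']
```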

I went back to Excel and deleted Sheet2: I didn't need it anymore, and all its calculations would slow things down. I created a new Sheet2 and pasted the results from Word into it. I wasn't sure whether Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way back into Excel, just to be sure.

But now, oops, new problem. My old Excel 2003 could not accommodate 170,000 rows. It gave me an error: "File not loaded completely." I took a look. Row 65536 contained a reference to $F$5911. I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that seemed premature. I had kept Notepad open, so I searched for that item, $F$5911, in the data still sitting there in Notepad. I deleted everything down to that point, copied what remained, and pasted it into an adjacent column; I had to do this a couple of times. So now I had a spreadsheet that was three columns wide and 65,536 rows long. Warning: do not print your results.

I inserted blank columns between these columns of output. Those blank columns would contain the text corresponding to those cell references. For instance, in Sheet2 cell A1, I now had $A$1. So in the blank cell B1, I put this formula: =INDIRECT("Sheet1!"&A1). I copied that formula to all the other cells in those adjacent columns. That gave me three sets of two columns each: the first provided the reference, and the second contained my actual two-word phrase.
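INDIRECT simply turns the address text back into the value sitting at that address. The same dereferencing in Python, with a small dictionary standing in for Sheet1 (the addresses and phrases here are hypothetical):

```python
# Hypothetical stand-in for Sheet1: cell address -> two-word phrase
sheet1 = {"$A$1": "I saw", "$B$1": "saw a", "$C$1": "a dog"}

def deref(address):
    """Like =INDIRECT("Sheet1!"&A1): look the address up, return its value."""
    return sheet1[address]

print([deref(a) for a in ["$A$1", "$B$1", "$C$1"]])  # ['I saw', 'saw a', 'a dog']
```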

At this point, I froze all those values, as above, and deleted the reference columns. So now I just had three columns containing two-word phrases. In short, I had managed to isolate the phrases. Now it was time to start slicing and dicing them.

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones. For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.

To do that filtering, I tried LibreOffice Calc, where I could get all that material into column A. But Calc churned for about 15 hours before producing output. (Excel also hung at a number of points in this whole process.) I did not test Calc's output thoroughly, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list. I went back to Excel, filtering each of the three columns and then combining the short output into one column and filtering that. This gave me 7,539 unique two-word phrases. That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases. The formulas doing those counts really slowed things down, so I deleted them for the moment.
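The unique-plus-count goal -- one "Email from" entry with a companion cell saying how many times it occurred -- is exactly what a frequency table gives. In Python it takes a few lines (my sketch, with made-up phrases):

```python
from collections import Counter

phrases = ["Email from", "big dog", "Email from", "a dog", "Email from"]
counts = Counter(phrases)  # phrase -> number of occurrences
print(counts["Email from"])  # 3

# The sanity check described above: the counts attributed to the
# unique phrases should account for every original phrase.
print(sum(counts.values()) == len(phrases))  # True
```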

I thought that it might help to clean up unnecessary characters. I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols). I also did some global (Ctrl-H) searches in Excel to the same effect. In addition, I deleted a few meaningless two-word pairs. These efforts revealed that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P"). It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations ("Roger P" and "P Smith" are less recognizable than "Roger Smith"). Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium"). That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences. Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately the occurrences attributed to my unique phrases came within 1% of the actual number of phrases. In other words, my filtering for unique items substantially accounted for all items. Specifically, I had 163,391 two-word phrases, and in some apparent overcounting, my list of unique phrases accounted for 163,772 of them.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it. To help with this, I divided the two-word phrases into individual words, and then prepared to test each word against a list of relatively insignificant words that I had cooked up for the occasion. Before doing that, I had to screen again for stray punctuation (a comma had now emerged in, e.g., "soon, however," so that "soon," would escape an elimination entry for "soon") -- making it increasingly appealing to find some way of doing that earlier in the process. Another step that would have been better taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else to decide that none were significant. It seemed, generally, that renaming the files to standardize various names and terms (or at least devising aliases for them) should have occurred earlier in the process. Such changes seemed likely to shape this sentence-parsing process. For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.
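The elimination step -- split each phrase into words, strip stray punctuation so the match is not defeated, and drop the insignificant words -- can be sketched like this. The stopword list here is a short stand-in for the one I cooked up, and lowercasing is my simplification:

```python
STOPWORDS = {"a", "an", "the", "if", "of", "from", "to"}  # stand-in list

def significant_words(phrase):
    """Split a two-word phrase into words, strip punctuation first
    (so "soon," still matches "soon"), then drop stopwords."""
    cleaned = [w.strip(".,;:!?()[]'\"") for w in phrase.lower().split()]
    return [w for w in cleaned if w and w not in STOPWORDS]

print(significant_words("soon, however"))  # ['soon', 'however']
print(significant_words("a Pentium"))      # ['pentium']
```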

Project Abortion

At this point, I had an unpleasant realization. The need to pare down my spreadsheet, so as not to wait forever for calculations, plus the loss of concentration caused by long waits and the passage of time, had produced a situation I failed to notice: over the past several hours, my work had ceased to be directly linkable back to the original list of files. There was no index. Worse, the global changes made to remove troublesome punctuation further reduced direct linkability. I could continue with this effort, using backups to reconstruct the links. But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort. One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front. That question led me to a second try.