I encountered a situation tonight where I wanted to parse a text file. I had a very, very long word list that contained English words delimited by lines. I wanted to get rid of every word (or line) that was longer than 7 characters. This would be simple in Linux but I can't seem to find a simple solution in Windows XP. I tried using Notepad++ regular expression search, but that was a huge failure. I tried using the expression .{6,} without finding any matches. I'm really at a loss because I thought this sort of thing would be extremely easy and there would be tons of tools to accomplish a task like this. It seems like Notepad++ supports every other feature in the world except the very basic ones that seem the most obvious.

Another one of my goals was to put some code before and after the word on each line.

12 Answers

To filter out lines in a text file longer than 7 characters, you could use another command line tool, findstr:

findstr /v /r "^.........*$" words.txt > shorter-words.txt

The /r option specifies that you want to use regex matching, and the /v option tells it to print lines that do not match. (Since it appears that findstr doesn't allow you to specify a character count range, I faked it with the "8 or more" pattern and the "do not match" option.)
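
For anyone who wants to see the same "match eight or more, then keep what does not match" trick outside of findstr, here is a minimal Python sketch. The names `TOO_LONG` and `keep_short` and the sample words are made up for illustration; the only thing taken from the answer above is the inverted-match idea.

```python
import re

# Same idea as the findstr pattern: a line of eight or more characters.
TOO_LONG = re.compile(r"^.{8,}$")


def keep_short(lines):
    """Return only the lines that do NOT match the 'eight or more' pattern."""
    return [line for line in lines if not TOO_LONG.match(line)]


# Hypothetical sample data, not taken from the original word list.
words = ["cat", "letters", "elephant", "encyclopedia"]
print(keep_short(words))
```

Unlike findstr, Python's `re` does support a count range, so the pattern could also be written positively as `^.{0,7}$` and used without the inversion.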

Perl, for sure. Simply paste this script into a file and run it in the same directory as the word list. Either rename your word list to words.txt or change the name in the script. You can redirect the output to a new file like so:

words.pl > list.txt

Without further ado (I whipped it together quickly, so it could be trimmed down a fair bit):

gVim is a worthy editing tool that has its origins in the venerable vi used on Unix systems. You will want to use the substitute command to do global search/replacements for each word.
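
To make the search/replace idea concrete, here is a rough sketch of a regex substitution that puts text before and after the word on every line. It is written in Python rather than as a gVim command, purely as an illustration; the function name `wrap_lines` and the `<li>` markers are assumptions, since the question never says what code should surround each word.

```python
import re


def wrap_lines(text, before, after):
    """Put `before` and `after` around every non-empty line of `text`."""
    # With re.MULTILINE, ^ and $ anchor at each line, so "^(.+)$"
    # matches one whole non-empty line at a time.
    # A function is used as the replacement so that backslashes in
    # `before` or `after` are inserted literally.
    return re.sub(
        r"^(.+)$",
        lambda m: before + m.group(1) + after,
        text,
        flags=re.MULTILINE,
    )


# Hypothetical example input and markers.
print(wrap_lines("cat\ndog", "<li>", "</li>"))
```

An editor's substitute command does the same job interactively: match the whole line, then replace it with the prefix, the matched text, and the suffix.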

AWK and Perl are very powerful tools, but overkill for what you need. You'll enjoy gVim since it is an editor first and foremost. The thing that rocks with gVim is that you are only one keystroke away from giving it a search/substitute/replace command which can be specified with the robust regular expression format.
Good luck.

Maybe this is better suited for StackOverflow, because the best advice I can give you is to learn one of the scripting languages to make such tasks easier. It's much better to know one powerful tool than dozens of little ones, IMHO, and it's an investment that pays off.

Downloading Python and going through the tutorial will take a few hours, but afterwards such tasks will seem very easy to you. Better yet, you will learn to recognize tasks "looking for some programming" in other fields as well, and it will increase your productivity tenfold.
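
As a rough idea of how small such a script can be, here is a sketch that covers both goals from the question: dropping words longer than 7 characters and adding text before and after each remaining word. The function name `filter_and_wrap`, its parameters, the UTF-8 encoding, and the choice to skip blank lines are all assumptions made for this example.

```python
def filter_and_wrap(in_path, out_path, max_len=7, before="", after=""):
    """Copy words of at most `max_len` characters from in_path to out_path.

    Each kept word is written on its own line as before + word + after.
    Blank lines are skipped (an assumption; the question does not say).
    Returns the number of words written.
    """
    kept = 0
    # UTF-8 is assumed here; adjust the encoding to match the word list.
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            word = line.rstrip("\r\n")  # drop the line ending only
            if word and len(word) <= max_len:
                dst.write(f"{before}{word}{after}\n")
                kept += 1
    return kept


# Example call (file names are placeholders):
# filter_and_wrap("words.txt", "shorter-words.txt", before="<li>", after="</li>")
```

Because the length limit and the surrounding text are parameters, the same script can be kept around and reused for similar jobs later.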

I know plenty of scripting/programming but I don't really think it's necessary. This is one of those times I'm trying to get used to something that isn't a programming solution.
– Joe Philllips, Aug 22 '09 at 5:46

why? wouldn't it be easier to just program it? you also get to keep a script that can be just reused later
– Eli Bendersky, Aug 22 '09 at 6:34


This is somewhat of a theoretical question for future reference. I'd much rather have the option of programming OR using a tool
– Joe Philllips, Aug 22 '09 at 15:51

Believe it or not, Microsoft Word has a form of regular expressions too: Ctrl+H > More > Use wildcards.
The search expression will probably be something like ?{8,} (in Word's wildcard syntax, ? matches any single character and {8,} means eight or more occurrences). Press F1 while the Search/Replace dialog is shown to see a description of Word's wildcards.

You can solve that without downloading any extra tool by using a little VBScript or an Excel VBA macro.
This is indeed more a question for stackoverflow.com.
The code for that script would run in Excel VBA as well with nearly no change.

A sample VBA macro (not tested) could be:

Sub filterRows()
    Dim InputData As String
    Open "c:\test.txt" For Input As #1   ' Open the word list for reading.
    Open "c:\out.txt" For Output As #2   ' Open the file that receives the short words.
    Do While Not EOF(1)                  ' Loop until the end of the input file.
        Line Input #1, InputData         ' Read one line (one word).
        If Len(InputData) <= 7 Then      ' Keep words of 7 characters or fewer.
            Print #2, InputData
        End If
    Loop
    Close #1                             ' Close both files.
    Close #2
End Sub