20 February 2009

Rearrange text using multiline regular expressions in Emacs

My previous posts on Notepad++ and regular expressions have become very popular. In particular, readers have posed many questions along the lines of "how do I write a regular expression that will..." After having read this post you should be an intermediate regular expression creator. Let's get started.

While I am a fan of Notepad++, it is not powerful enough to perform the regular expressions that I will be going through in this post. I strongly recommend that you download and use the excellent (and free) XEmacs. Installing XEmacs is not as simple as it could be. To simplify the process, I have made my Config files available for download here. Instructions are included.

Regular expression you say, what exactly is a regular expression? Think of a regular expression as a fancy Find+Replace command that can span across multiple lines and can rearrange bits of text. Using regular expressions involves two steps:

1. Create a search term that finds and selects only the text that you want to modify/move/delete.
2. Create a replace term that outputs the desired text in the correct way.

This is a bit vague, isn't it. Let's look at our first example.

Example 1. Manipulating a simple list of names using Find+Replace

I got married on 11 October (a few months ago) and the photographer gave us a DVD containing all of the photos, approximately 1350 JPEG files. My wife and I were told to choose 120 of these photos to be printed for our wedding album. The DVD contained two folders named 'A' and 'B'. Folder A contained files named DSC_0001 through to DSC_1000, and folder B contained files named DSC_0001 through to DSC_0350. The reason for the existence of two folders was that they came from two memory cards. A thousand photos were taken on card A, which got full, and then a further 350 photos were taken on card B. Now, the problem is that we have 350 overlapping filenames. So we need to change the filenames in some way to make them unique and identifiable. This is relatively easy. Here is what we did.

First, we chose 120 photos. 105 came from folder A. I copied the filenames of the chosen photos from folder A and pasted them into a new text file. It looked like this:

DSC_0004.jpeg
DSC_0007.jpeg
DSC_0015.jpeg
(and so on....)
DSC_0997.jpeg

Before moving on to adding filenames from folder B, I performed a simple find and replace.

Find: DSC_
Replace with: DSCa_

This made the text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg

We then pasted the filenames from folder B into the text file, below the renamed ones:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSC_0002.jpeg
DSC_0004.jpeg
DSC_0011.jpeg
DSC_0015.jpeg
(etc.)

I then changed the newly pasted names using a simple Find and Replace term:

Find: DSC_
Replace with: DSCb_

That made our text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSCb_0002.jpeg
DSCb_0004.jpeg
DSCb_0011.jpeg
DSCb_0015.jpeg
(etc.)

This made each file unique and identifiable. I also removed the .jpeg extension to minimise the possibility of confusion.

DSCa_0004
DSCa_0007
DSCa_0015
(and so on....)
DSCa_0997
DSCb_0002
DSCb_0004
DSCb_0011
DSCb_0015
(etc.)

Of course, I could have renamed the files in many different ways. For example, I could have changed DSC_0004.jpeg to A\DSC_004, or A_004, or FolderA_004, and so on. The name is not important. What is important is that the names are uniquely identifiable, meaning that no two names are alike. Now our photographer will not be confused about which DSC_0004.jpeg we want (the one from folder A or the one from folder B). This is the end of our simple Find+Replace exercise. Let's have a go at a regular expression.
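For readers who would rather script this step than do it by hand, here is a minimal Python sketch of the same renaming idea. The two short lists are sample names only, and the helper tag() is a name I made up for illustration.

```python
# A few sample names only; the real lists were much longer.
folder_a = ["DSC_0004.jpeg", "DSC_0007.jpeg", "DSC_0997.jpeg"]
folder_b = ["DSC_0002.jpeg", "DSC_0004.jpeg", "DSC_0015.jpeg"]


def tag(names, letter):
    """Insert the folder letter after 'DSC' and drop the .jpeg extension."""
    return [n.replace("DSC_", "DSC" + letter + "_").replace(".jpeg", "") for n in names]


combined = tag(folder_a, "a") + tag(folder_b, "b")

# Every name is now unique, even though DSC_0004 existed in both folders.
assert len(set(combined)) == len(combined)
print("\n".join(combined))
```

This is plain string replacement, exactly like the Find+Replace above; no regular expression is involved yet.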

Example 2. A simple Regular Expression: Removing newlines and replacing them with commas

Let's cover some Regular Expression basics. A Regular Expression is used to search for text with a similar pattern, that is, text that matches the search criteria. What makes Regular Expressions so powerful is that there isn't a one-to-one mapping between the regular expression and the text (as there would be in a simple Find+Replace). For example, a period or full stop character . matches any character. So searching for DSC. (note: that is DSC followed by a full stop) would select both DSCa and DSCb, as the full stop could be any character.

Find+Replace is a useful tool. However, there are certain tasks that go beyond the capabilities of simple Find+Replace. Let's assume that we have the list of photos from above.

I don't want to leave this list as it is (one filename on each line). I want to place the files all on the same line, separated by a comma. This would be the regexp:

Find: newline
Replace with: ,

Note: You do not actually type in the word newline. Every time the Return or Enter key is pressed, a newline character is placed on the page (even if you cannot see it on the screen). Given that our filenames above occur on separate lines, there is a newline character after the final digit on each line. To further complicate matters, there are different types of newline characters (line feed, carriage return, or both together, depending on the operating system). Most text editors have a hard time dealing with newline characters (it is precisely for this reason that I have switched from Notepad++ to XEmacs). XEmacs handles them with minimum fuss. To enter a newline character into your Regular Expression, type Control+Q Control+J, written in XEmacs notation as C-q C-j.
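If you want to try the same idea outside an editor, here is a small Python sketch. It is not the XEmacs command, just an equivalent under my own assumptions: the sample text deliberately mixes Unix (\n) and Windows (\r\n) line endings to show that the pattern copes with both.

```python
import re

# Sample text with mixed line endings: \n (Unix) and \r\n (Windows).
text = "DSCa_0004\nDSCa_0007\r\nDSCa_0015\nDSCb_0002"

# \r?\n matches a line feed with an optional carriage return in front of it.
joined = re.sub(r"\r?\n", ",", text)
print(joined)
```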

Example 3. A more complex example: Rearranging groups of text onto the same line

Let's assume that my wife went through the list of photos and wrote instructions to the photographer under each name, making the list look like this:

DSCa_0004
Print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Print this on canvas
DSCa_0997
Print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

I think that the best way to make this clear for the photographer would be to rearrange the text so that it looks like this:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
(and so on....)

In order to make the correct regular expression, we must first identify the correct pattern in the text. We cannot simply replace all the newlines as we did in Example 2, because then every photo and comment would be on one very long line. We must place some unique characteristic into our regular expression that will discriminate between which two lines belong together and which do not. In this case, one possibility is by taking advantage of the fact that all photo filenames begin with D, the comments for that photo are on the line immediately below, and no comments begin with D. The regexp would look like this:

Find: \(D.*\) C-q C-j
Replace with: \1 = 

Let's examine this regexp. D searches for the letter D. The full stop "." matches any single character except a newline. The asterisk after the full stop means "repeat the preceding item zero or more times", so D.* matches a D followed by everything up to the end of the line. The backslash-parenthesis pairs \( and \) around the D.* form a group, which saves the matched text so that we can reuse it in the replace term as \1. The line break, entered with the Ctrl+Q Ctrl+J keyboard command immediately after the closing \) (there is no space in the actual search term), is outside the group, so it is matched but not saved, and is therefore discarded.

The replace term takes the text matched by D.* (effectively the filename, which is enclosed in backslash+parentheses in the search term), then adds a space, an equals sign, and another space. This produces the following output:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Print this on canvas
DSCa_0997 = Print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4
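Here is a rough Python translation of the Example 3 regexp, in case you want to experiment without an editor. Two things differ from the XEmacs version and are worth flagging: Python writes groups as ( ) rather than \( \), and I use \n where XEmacs needs C-q C-j. The text is a shortened sample of the list above.

```python
import re

text = (
    "DSCa_0004\nPrint this one 6*4\n"
    "DSCa_0007\nPrint this one 5*7\n"
    "DSCa_0015\nPrint this on canvas\n"
)

# (D.*) saves the filename as group 1; the \n after it is matched but not saved.
joined = re.sub(r"(D.*)\n", r"\1 = ", text)
print(joined)
```

As in XEmacs, the full stop does not match a newline by default, so D.* stops at the end of the filename line.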

Example 4. An even trickier example: Rearranging groups of text onto the same line 2

Let's assume that my wife wrote instructions to the photographer under each name, but this time some comments began with a D. The list looks like this:

DSCa_0004
Do you think you could print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Do this on canvas
DSCa_0997
Dave, print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

This time, we cannot use the previous regexp as it does not identify only the photo filenames. It will also select the comments beginning with D. Using the regexp from Example 3 in this case will result in this:

DSCa_0004 = Do you think you could print this one 6*4 = DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas = DSCa_0997 = Dave, print this 6*4 = DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

It is a mess. We need to find a pattern in the text that uniquely identifies the filenames only and not the comments. Look at the filenames. All begin with DSC. So, we could change the regexp to find DSC or even DS. No comments begin with DS. Let's give it a shot.

Find: \(DS.*\) C-q C-j
Replace with: \1 = 

And here's the output:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

Easy. We could have used \(DSC.*\) as the search term and the result would have been the same.
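To see the difference between the two search terms side by side, here is a small Python sketch on a shortened sample of the list. The first two substitutions mirror the D and DS regexps from this post. The third, anchored one is my own suggestion rather than part of the original method: it assumes every filename sits at the start of a line and looks like DSCa_ or DSCb_ followed by four digits.

```python
import re

text = (
    "DSCa_0004\nDo you think you could print this one 6*4\n"
    "DSCa_0007\nPrint this one 5*7\n"
    "DSCa_0015\nDo this on canvas\n"
)

# Example 3 pattern: also catches the comments that begin with D.
too_loose = re.sub(r"(D.*)\n", r"\1 = ", text)

# Example 4 pattern: only the filenames contain "DS".
fixed = re.sub(r"(DS.*)\n", r"\1 = ", text)

# Assumed stricter variant: ^ with re.MULTILINE only matches at a line start.
anchored = re.sub(r"^(DSC[ab]_\d{4})\n", r"\1 = ", text, flags=re.MULTILINE)

print(too_loose)
print(fixed)
```

The more specific the pattern, the less likely it is to pick up a stray comment, at the cost of being tied more closely to this particular naming scheme.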

Example 5. Getting rid of unwanted information.

Let's assume that the photographer receives our list of photos. He wants to see which photos we want to print in what sizes, but he also wants to leave the descriptions for files where no size has been specified. Using regexp, we can remove all info other than the sizes. Here's the file:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

The unique thing that identifies the comments that contain sizes is the asterisk character in the dimensions. We want to keep the asterisk as well as the characters immediately before and after the asterisk (the numbers). And here is the regexp:

Find: \(= \).*\(.\*.*\)
Replace with: \1\2

This is the output:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

As you can see, only lines containing sizes have had their comments removed. Let's go through the search term. The \(= \) searches for the equals sign followed by a space (which denotes where comments begin) and stores it in \1. All text thereafter, matched by .*, is not kept, until we come across the pattern \(.\*.*\), which is stored in \2. Let's unpack this final part of the search. The first full stop matches any single character (here, the digit before the asterisk). The \* matches a literal asterisk (note: the backslash before the asterisk is essential, as it specifies that we are searching for an asterisk * character and not using * as the repetition operator, as in .*). The last full stop and asterisk match whatever characters follow the asterisk, up to the end of the line.
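A Python sketch of the same search and replace is below, again with ( ) in place of the XEmacs \( \). It also shows one caveat that the sample data happens to avoid: because the first .* grabs as much as it can, only a single character is kept in front of the asterisk, so a size such as 10*8 would be clipped. The file name DSCc_0001 and the stricter digit-based pattern are my own illustrations, not part of the original example.

```python
import re

lines = [
    "DSCa_0004 = Do you think you could print this one 6*4",
    "DSCa_0015 = Do this on canvas",
    "DSCb_0002 = Print this one 8*10",
]

# XEmacs: \(= \).*\(.\*.*\)   ->   Python: (= ).*(.\*.*)
cleaned = [re.sub(r"(= ).*(.\*.*)", r"\1\2", line) for line in lines]
print("\n".join(cleaned))

# Caveat: the greedy .* leaves only ONE character before the asterisk.
clipped = re.sub(r"(= ).*(.\*.*)", r"\1\2", "DSCc_0001 = Print this one 10*8")

# Assumed stricter variant: require digits on both sides of the asterisk.
strict = re.sub(r"(= ).*?(\d+\*\d+)", r"\1\2", "DSCc_0001 = Print this one 10*8")
print(clipped, "|", strict)
```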

Example 6: Rearranging information.

Our photographer has decided that he would like to display the sizes of our photos on the left and the filenames on the right of the equals sign. Photos containing comments only should remain as they are. Here is the text:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

Again the asterisk is our unique identifier of the sizes. Here is the regexp:

Find: \(.*\)\(= \)\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015
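The three-group swap can be tried in Python as well. This is only a sketch of the same idea: the helper name swap() is made up, and the rstrip() call is my addition, needed because group 1 includes the space in front of the equals sign and so the swapped line ends with a space.

```python
import re

lines = ["DSCa_0004 = 6*4", "DSCa_0015 = Do this on canvas", "DSCb_0002 = 8*10"]


def swap(line):
    # Group 1 = filename plus trailing space, group 2 = "= ", group 3 = size.
    # Lines without an asterisk do not match and are returned unchanged.
    return re.sub(r"(.*)(= )(.\*.*)", r"\3 \2\1", line).rstrip()


swapped = [swap(line) for line in lines]
print("\n".join(swapped))
```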

Our photographer could have done this at the beginning of Example 5 (if he wanted to). Here is the text from Example 5:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

And here is the regular expression that will (a) get rid of all info apart from sizes, and (b) rearrange the order of the text to size = filename:

Find: \(.*\)\(= \).*\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015

As you can see, the two outputs are identical.
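If you would like to check that claim for yourself, the following Python sketch runs both routes (Example 5 followed by Example 6, and the combined regexp) over a shortened sample and compares the results. The function names are made up for this illustration, and the patterns use Python's ( ) grouping instead of \( \).

```python
import re

lines = [
    "DSCa_0004 = Do you think you could print this one 6*4",
    "DSCa_0015 = Do this on canvas",
    "DSCb_0002 = Print this one 8*10",
]


def two_steps(line):
    trimmed = re.sub(r"(= ).*(.\*.*)", r"\1\2", line)       # Example 5
    return re.sub(r"(.*)(= )(.\*.*)", r"\3 \2\1", trimmed)  # Example 6


def one_step(line):
    return re.sub(r"(.*)(= ).*(.\*.*)", r"\3 \2\1", line)   # combined


results = [one_step(line) for line in lines]
assert results == [two_steps(line) for line in lines]
print("\n".join(results))
```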

These are only a few examples of how regular expressions can be used. I hope that these examples will be useful for you and will provide some guidance and stimulation about what is possible with regexp. If you have any questions, please post them in the comments below.

Step 2: Add incremented numbers to the front of each line.

Click Edit | Column Editor (shortcut Alt+C), click "Number to insert", set "Initial number" to "1" and "Increase by" to "1", and then click OK.

So, now we have this:

Step 3: Use a regular expression to add R and : and correct the spacing.

Search (Regular expression) for: (.)IF
Replace with: R\1: IF

And you end up with your desired outcome (except that the empty lines are gone):

R1: IF xx is e and....
R2: IF zx is w and....
R3: IF xt is f and....
R4: IF nx is b and....
R5: IF xt is f and....
R6: IF nx is b and....
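For comparison, here is how the numbering could be done in one pass with a short Python script. The input list is an assumption on my part (the original question is not shown here), including the empty lines between rules. One thing to note about the Notepad++ search term above: (.) captures a single character, so it would need adjusting once the numbers reach two digits.

```python
# Hypothetical input: the original list is not shown, so these lines are assumed.
raw = ["IF xx is e and....", "", "IF zx is w and....", "", "IF xt is f and...."]

# Drop the empty lines, then number what is left starting from 1.
rules = [line for line in raw if line.strip()]
numbered = ["R%d: %s" % (i, line) for i, line in enumerate(rules, start=1)]

print("\n".join(numbered))
```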

With Emacs, I could have given you a single step solution, but that's the price you pay for using Notepad++ ;)

Could you hint on why "Most text editors have a hard time dealing with newline characters"? I find this deficiency very surprising. A text document is, from the viewpoint of the editor, just a string of characters, and newline is just another character (or two, depending on the platform). Why would this be so troublesome? I don't get it.

@Kusin Knase

This is going a little bit beyond my expertise, but I believe that part of the problem is that over time and across operating systems, a variety of characters have been used for newline (or line break, or line return, etc.). For a discussion, see http://en.wikipedia.org/wiki/Newline, particularly under the "Unicode" section of that Wikipedia entry.

So, I guess that a text editor such as Emacs is better at handling newlines because it recognises more of the newline characters than Notepad++ does. This still does not account for why Notepad++ requires an extended search mode - my guess is that to do away with extended search mode would require a significant rewrite of the source code, and the programmers do not consider it to be a priority.

what I need (1st) is that the line starting with 4 has to follow the line starting with 1, like this:

10600100... 48001571000180662

but I need an identifier before the 4 to know where it starts.

Then I need (2nd) to add something (e.g. a comma) at specific positions (this is census data, so the first position means level1; positions 2-3 the region; positions 4-5-6 the city; etc.) so as to be able to separate the variables into columns when importing them into Excel.

I have figured out how to do the 1st part as it is similar to one of your examples, but I cannot find how to add commas at specific positions.

Note: I use Notepad++ and tried UltraEdit (the free version for 30 days) as I couldn't manage to run your examples in Emacs (it just doesn't work or I don't know how to make it work... the program starts OK but when I type your regex expressions with your data it says "couldn't find ..."). For the 1st part I have "translated" your regex expressions to UltraEdit ones and it worked (although after a lot of trial and error).

Ok, so it seems that you want to do quite a few manipulations. I am not sure if I have understood them all, but will do my best. First thing is to get the number beginning with 1 and the number beginning with 4 on the same line. I have chosen to use a comma as the character between the two large numbers. I did this in Emacs using the Replace Regexp command, although you could probably do it in UltraEdit if you find it easier. Note that the newline in the search term is entered by pressing Control+Q Control+J.

Re: adding commas at specific positions, you have not provided enough information for me to provide you with a foolproof solution. I have assumed that the commas are to be inserted in the first number, which begins with a 1. If the number of characters is always the same, you could just specify the number of characters using the period regexp "." which matches any character. So if digit 1 is the level, you could use one period to match this; if digits 2-3 are the region, you could use two periods; and if digits 4-6 are the city, you could use three periods. And if this pattern is identical for all lines, you could use something like the following regexp:

Search for: \(.\)\(..\)\(...\)\(.*\)
Replace with: \1,\2,\3,\4

which will give you this. Note that the commas have been inserted at the start of each line.
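The same fixed-position idea looks like this in Python. The row is a made-up sample, since the real data is only partly quoted above, and the field widths (1, 2 and 3 characters, then the rest) are the ones described in the question.

```python
import re

row = "1050010000010801"  # made-up sample row

# One full stop per character: level (1), region (2), city (3), then the rest.
with_commas = re.sub(r"^(.)(..)(...)(.*)$", r"\1,\2,\3,\4", row)
print(with_commas)
```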

so that it can be easily imported to Excel, each comma used to separate columns.

The expression for that would be the same as you wrote but adding up to 32 \(\); i.e. \(.\)\(..\)\(...\)\(......\)\(..\)\(..\)\(.\)\(.\)\(.\)\(.\)\(.\)\(.\)\(.......\)\(..\)\(.......\)\(..\)\(.\)\(.\)\(.\)\(.\)\(......\)\(......\)\(......\)\(......\)\(......\)\(....\)\(....\)\(....\)\(....\)\(........\)\(....\)\(.........\)\(.*\)

And then the replace:

\1,\2,\3,\4,\5,\6,\7,\8,\9,\10,\11,\12,\13,\14,\15,\16,\17,\18,\19,\20,\21,\22,\23,\24,\25,\26,\27,\28,\29,\30,\31,\32

Is that correct? Can Emacs handle so many \(\)?

The reason for using UltraEdit instead of Emacs is just that I didn't manage to find out how to do that in Emacs. When I write down the expressions (at the bottom of the Emacs window) it keeps saying that nothing is found... I am probably doing something wrong.

Did you finally post the 'Guide to using regular expressions with Emacs' in your blog? I was unable to find it. I wish I could use Emacs, as the regular expressions are more straightforward than the UE ones, plus the UE help says: "A regular expression may have up to 9 tagged expressions", so I cannot perform the 32 tagged expressions mentioned before.

Just in case, do you know of a website where one can learn how to use Emacs? Maybe you know of an alternate software (more user friendly than Emacs, like Notepad++) which can use the Emacs regular expressions.

I guess it's a stupid question but, can I perform what I need in Notepad++? (from what I have read in your blog, the response is no, or at least not in one unique expression).

This *is* my post on how to use Emacs! Haha. I suggest starting with simple expressions and building your way up. You could even split up your 42 (not 32) element expression in Notepad++ if you wanted to. You could break this down into multiple steps where you do buffers 1 to 9 in one regexp, then 10 to 18 in another and so on. You could insert a strange character at the end of each group of 9 buffers, such as X, and use that to pick up where you left off, i.e., (.*)X(.whatever...
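An alternative to chaining several nine-group regexps is to skip groups altogether and cut each row by position. As far as I know, back-references in an Emacs replace term are single digits (\1 to \9), so something like \10 would not refer to a tenth group anyway. The Python sketch below is only an illustration: split_fixed() is a made-up helper, the row is a made-up sample, and the widths are just the first six from the comment above.

```python
def split_fixed(row, widths):
    """Cut row into fields of the given widths; any remainder becomes a last field."""
    fields, pos = [], 0
    for width in widths:
        fields.append(row[pos:pos + width])
        pos += width
    if pos < len(row):
        fields.append(row[pos:])  # plays the role of the trailing \(.*\)
    return fields


widths = [1, 2, 3, 6, 2, 2]             # assumed: first six field widths only
row = "105001000001080113061400235001"  # made-up sample row

csv_row = ",".join(split_fixed(row, widths))
print(csv_row)
```

Because the widths live in an ordinary list, going from 6 fields to 32 or more is just a matter of extending the list.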

I have this:

105001000001080113061400235001
27810000000100
27830000000100
105001000002080111161500001200
27810000000100
27830000000100
105001000003080111161500008600
27810000000100
27830000000100
105001000004080111161500009600
27810000000100
105001000005080111161500015001
27810000000100
27830000000100
105001000006080111161500024201
27510000000100
27630000000100
27830000000100
27850000000100
105001000007080111161400022001
27810000000100
27830000000100
105001000008080111161400020301
27810000000100
27830000000100
27850000000100
105001000009080111161500008700
27810000000100
27830000000100
105001000010080111161700001300
105001000011080111161500060001
105001000011080111161500060001
27530000000100
27830000000100
27850000000100
27900000000100
etc

This is census data. Row 1 (starting with 1) has the locality ID, and the rows starting with 2 contain agricultural data. To know which locality the agricultural data comes from, I need to have the agricultural rows following the locality ID. As you can see, the problem is that there can be anywhere from none up to 7 (not shown in this example) rows starting with 2.

I have tried many times but don't get how to do that with Notepad++ regex.

All that you want to find is whenever a 2 occurs after a newline, and then replace that newline with an x. In Notepad++, you search for newlines in Extended Search mode using \r\n. So, in order to find all of the 2s that occur after a newline, you would
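As a rough illustration of that idea (replace every newline that is followed by a 2 with an x), here is a Python sketch. It is not the Notepad++ procedure: it relies on a lookahead, (?=2), which checks for the 2 without consuming it, and the sample rows are a small subset of the data quoted in the question.

```python
import re

rows = [
    "105001000004080111161500009600",
    "27810000000100",
    "105001000010080111161700001300",
    "105001000011080111161500060001",
    "27530000000100",
    "27830000000100",
]
data = "\n".join(rows)

# Replace a newline (with optional carriage return) only when a 2 follows it.
merged = re.sub(r"\r?\n(?=2)", "x", data)
print(merged)
```

A locality row with no agricultural rows after it is simply left on its own line.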

I want to replace all the text between two words with a blank. Can you please help me with it? The contents between the two words aren't always the same.

For example, A and B are the two words between which I want to replace all characters.

Line 1: "A abc hijk ... B" should become: AB
Line 2: "A pqr fcsdh .. B" should become: AB
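One possible way to do this, sketched in Python rather than in an editor: match the first word, then as little text as possible, then the second word, and put back only the two words. The function name blank_between() is made up, and the sketch assumes both words appear on the same line.

```python
import re


def blank_between(text, first, last):
    """Remove everything between `first` and `last`, keeping the two words."""
    # .*? is the non-greedy form of .* and stops at the nearest `last`.
    pattern = re.escape(first) + r".*?" + re.escape(last)
    # A function is used as the replacement so the words are inserted literally.
    return re.sub(pattern, lambda match: first + last, text)


print(blank_between("A abc hijk ... B", "A", "B"))
print(blank_between("A pqr fcsdh .. B", "A", "B"))
```

With real words instead of single letters, you may want to add \b word boundaries around them so that they are not matched inside longer words.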