29 June 2008

Notepad++: A guide to using regular expressions and extended search mode

The information in this post details how to clean up DMDX .zil files, allowing for easy importing into Excel. However, the explanations following each Find/Replace term will benefit anyone looking to understand how to use Notepad++ extended search mode and regular expressions.

If you are specifically looking for multiline regular expressions, look at this post.

You may already know that I am a big fan of Notepad++. Apparently, a lot of other people are interested in Notepad++ too. My introductory post on Notepad++ is the most popular post on my speechblog. I have a feeling that that is about to change.

Since the release of version 4.9, the Notepad++ Find and Replace commands have been updated. There is now a new Extended search mode that allows you to search for tabs(\t), newline(\r\n), and a character by its value (\o, \x, \b, \d, \t, \n, \r and \\). Unfortunately, the Notepad++ documentation is lacking in its description of these new capabilities. I found Anjesh Tuladhar's excellent slides on regular expressions in Notepad++ useful. After six hours of trial and error, I managed to bend Notepad++ to my will. And so I decided to post what I think is the most detailed step-by-step guide to Search and Replace in Notepad++, and certainly the most detailed guide to cleaning up DMDX .zil output files on the internet.

What's so good about Extended search mode?

One of the major disadvantages of using regular expressions in Notepad++ was that it did not handle the newline character well—especially in Replace. Now, we can use Extended search mode to make up for this shortcoming. Together, Extended and Regular Expression search modes give you the power to search, replace and reorder your text in ways that were not previously possible in Notepad++.

Search modes in the Find/Replace interface

In the Find (Ctrl+F) and Replace (Ctrl+H) dialogs, the three available search modes are specified in the bottom right corner. To use a search mode, click on the radio button before clicking the Find Next or Replace buttons.

Cleaning up a DMDX .zil file

DMDX allows you to run experiments where the user responds by using the mouse or some other input device. Depending on the number of choices/responses (and of course the kind of task), DMDX will output a .zil file containing the results (instead of the traditional .azk file). This is specified in the header along with the various response options available to the participant. For some reason, DMDX outputs the reaction time twice—and on separate lines—in .zil files. Here's a guide for cleaning up these messy .zil files with Notepad++. Explanations of the Notepad++ search terms are provided in bullet points at the end of each step.

Step 1: Back up your original result file (e.g. yourexperiment.zil) and create a copy of that file (yourexperiment_copy.zil) that we will edit and clean up.

Step 2: Open yourexperiment_copy.zil in Notepad++ (version 4.9 or later).

\r\n\r\n finds two newline characters (what you get from pressing Enter twice).

Step 5: Put each Item (DMDXspeak for trial) on a new line.

Switch to Regular Expression search mode.

Find what: (\+.*)(Item)

Replace with: \1\r\n\2

Press Replace All. "Item"s have been placed on new lines.

\+ finds the + character.

.* selects the text after the + up until the word "Item".

Item finds the string "Item".

() allow us to access whatever is inside the parentheses. The first set of parentheses may be accessed with \1 and the second set with \2.

\1\r\n\2 will take + and whatever text comes after it, will then add a new line, and place the string "Item" on the new line.
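
The same find/replace can be tried outside Notepad++. Below is a minimal Python sketch of the Step 5 expression, assuming Python's re module treats this particular pattern the same way; the sample line and the function name are made up for illustration and are not real DMDX output.

```python
import re

# Same pattern as Step 5: group 1 is "+" plus the text after it,
# group 2 is the literal word "Item".
ITEM_PATTERN = re.compile(r"(\+.*)(Item)")

def put_items_on_new_lines(text):
    # \1\r\n\2 re-emits group 1, then a Windows newline, then group 2.
    return ITEM_PATTERN.sub(r"\1\r\n\2", text)

# Hypothetical fragment; real .zil content will differ.
sample = "+ 512.30 Item 7"
result = put_items_on_new_lines(sample)
print(repr(result))
```

In this sketch .* is greedy, so a line that holds several "Item"s is split only before the last one on each pass.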

So far so good. Our aim now is to delete duplicate or redundant information (reaction time data).

Step 6: Remove all newline characters using Extended search mode, replacing them with a unique string of text that we will use as a signpost for redundant data later in RegEx. Choose a string of text that does not appear in your .zil file; I have chosen mork.

Switch to Extended search mode in the Replace dialog.

Find what: \r\n

Replace with: mork

Press Replace All. All the newline characters are gone. Your entire DMDX .zil file is now one very long line of (in my case word-wrapped) text.
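
The placeholder trick in Step 6 can be sketched in Python as follows. This is only an illustration: the function names and the sample text are assumptions, and the check that the placeholder is unused is an extra safeguard that the Notepad++ dialog does not perform for you.

```python
PLACEHOLDER = "mork"  # any string that never occurs in the file

def flatten(text, placeholder=PLACEHOLDER):
    # Refuse to continue if the placeholder is already present,
    # otherwise restoring the newlines later would corrupt the data.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    return text.replace("\r\n", placeholder)

def restore(text, placeholder=PLACEHOLDER):
    # Inverse of flatten(): put the Windows newlines back.
    return text.replace(placeholder, "\r\n")

# Hypothetical fragment; real .zil content will differ.
sample = "Item 1\r\n+ 512.30\r\nItem 2\r\n"
flat = flatten(sample)
print(flat)
```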

However, the reason why I arrived on your blog still remains unanswered:

How to replace a multiple line regexp by a simple value (in my case: nothing).

Here is the case: In Symfony YAML generated files, I have the created_at and updated_at fields dumped, which I don't want. I need to replace something like this:

/ *created_at:.*\n *updated_at:.*\n/

by

//

The way to do it is important because I want the blank lines to disappear as well.

Of course I know it is possible to do it in two or three steps, but I'd like to find how to achieve it in one only, I'm a regexp maniac ;)

Maybe you or someone else has a solution... I couldn't manage to get one through either the CTRL-H or the CTRL-R dialog.

ninj, currently you cannot do this in Notepad++. This is because replacing newlines is possible in Extended search mode, and regular expressions are available in Regexp search mode. You are trying to combine the two search modes, and in the current version of Notepad++ you cannot.
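
For comparison, an engine that lets \n appear inside a regular expression handles ninj's pattern in a single pass. Here is a minimal Python sketch, assuming the two fields always sit on consecutive lines; the YAML fragment is invented for illustration.

```python
import re

# ninj's pattern, anchored at the start of a line. "." does not match
# a newline here, so each ".*" stays on its own line.
TIMESTAMPS = re.compile(r"^ *created_at:.*\n *updated_at:.*\n", re.MULTILINE)

# Hypothetical Symfony-style dump.
sample_yaml = (
    "Article:\n"
    "  a1:\n"
    "    title: Hello\n"
    "    created_at: 2008-06-29 10:00:00\n"
    "    updated_at: 2008-06-29 10:00:00\n"
    "  a2:\n"
    "    title: World\n"
)

# Replacing with the empty string removes both lines and their newlines,
# so no blank lines are left behind.
cleaned = TIMESTAMPS.sub("", sample_yaml)
print(cleaned)
```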

Since I wrote this post, I too have caught regexp mania. If you are serious about using regular expressions for more advanced search and replace (as you are) then you need to use a more powerful text editor. I recommend XEmacs—I've been using it for about a month, and it is very powerful. I'm working on a post for XEmacs right now.

As for your specific problem, it is possible to get rid of the created_at and updated_at information. I would need to see the text file (feel free to send a sample to me as an email attachment). I have made a few assumptions: 1. that created_at and updated_at always occur on consecutive lines, 2. that there is information above and below these lines that is useful. The XEmacs regular expression would be this:

Thank you for the guide! I have to admit it's a little advanced for me, and I've only just found out about REGEX expressions, but am still very excited nonetheless!

I'm a little confused by what to do in my situation. I have a mySQL file that I'd like to run, and the first part of each line is something like this:

INSERT INTO my_table (id,uid,my_msg,my_date,the_ip) VALUES ('2',

I would very much like to be able to change the '2' part to just NULL and REGEX seems to be the way forward. However, I think I'd have to use ( as a unique identifier, and given that REGEx uses brackets as the separators, I'm now a little stuck. Apologies in advance for this simple question, but my brain is really not working today.

Hi Flick, thanks for your comments. I do have a regex solution for you that is very easy and quick. Note that this regex syntax is specific to Notepad++.

First, let me answer your question re: the curved bracket (or parenthesis) character: in order to search for and find the open parenthesis character, place the parenthesis within square brackets like this: [(]

However, you do not need to use the parentheses or square brackets at all to achieve what you want to (if I have understood you correctly).

Search for: '.*',

Replace with: NULL

If you do not want to get rid of the comma, then delete it from the search term. If this then stuffs up your search and finds incorrect portions of text, you could insert a comma after null in the replace with expression: NULL,
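
One caveat: .* is greedy, so if later values on the same line are also quoted, '.*', can run all the way to the last quote-comma pair. A more tightly anchored variant is sketched below in Python. The values after '2' in the sample row are invented, since only the start of the line was given.

```python
import re

# Anchor on "VALUES (" and allow only non-quote characters inside the
# first quoted value, so the match cannot spill into later columns.
FIRST_VALUE = re.compile(r"VALUES \('[^']*',")

def null_first_value(line):
    return FIRST_VALUE.sub("VALUES (NULL,", line, count=1)

# Hypothetical row; only the part up to '2', comes from the question.
row = ("INSERT INTO my_table (id,uid,my_msg,my_date,the_ip) "
       "VALUES ('2','7','hello','2008-06-29','127.0.0.1');")
fixed = null_first_value(row)
print(fixed)
```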

Mark,

Do you have some advice for the following. I have a set of text lines... and I want to delete duplicate lines. But the redundant information will occur only at the beginning of the line, the end of those lines differ in their information. I'm just starting to use notepad++ RegExp utilities, but I'm no whiz yet with the format.

Thanks

ok... I've made the text file simpler so that the duplicates I want to delete all have the same information.

[19-766]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-767]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-773]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-1581]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER

the phrases in brackets, on separate lines, are ignored by the final use of the text file. They can remain, but I do want to delete the duplicates of the ??? lines. I'll have other cities with similar format.

As further background, you are looking at the content of the 1930 census districts laundered into the 1940 census districts. I have transcribed a cross table between 1930 and 1940, and we seeded the 1940 EDs with the 1930 information. Those 1930 ED numbers are in brackets, and point to the next text line (where that information came from). Since census districts change boundaries between federal censuses, especially in large cities, you will see multiple 1940 entries from different 1930 EDs that are partially contained within the 1940 ED. I don't think there would be any more than 10 such contribution EDs. For rural areas the data from 1930 to 1940 is accurate, for urban areas we have transcribed street indexes for over 200 large cities, thus instead of repeating their 1940 ED streets (I have scanned 28 rolls of 1940 ED descriptions), I just direct them to the other utility. For smaller areas of 25,000 or more, I intend to get street indexes for them, and have replaced their descriptions with "TO BE DONE BY BOUNDARY OR STREET INDEX".

When there are multiple ED entries for a single 1940 ED # (which is a two part number), they will occur together as a block with no blank line between the various lines. If a 1940 ED has only a single 1930 entry, it should have a blank line above the brackets, and one below the text line.

I fooled with TextFX but it moves the brackets from the text lines, doesn't show a numerical sort of numbers (thus one sees 2, 20, 21, ...) and for some didn't get me to a unique line.

I need the entire line. So for the first example I want:

[19-766]???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER[19-767][19-773][19-1581]

but I'm willing to give up the brackets lines, but I do want a blank line between the statements.

I've done 2 states, and with California decided to do some more automation. To see Alabama and Arkansas... go to http://www.stevemorse.org/ed/ed.php and choose 1940 and one of those two states.

Thanks... I'll ask Steve Morse to acknowledge you on the One Step site if you can pull this off.

Glad to hear that your problem got solved. Apologies for not responding as quickly as I usually would, but you caught me at a bad time (wedding and honeymoon). My wife doesn't let me post about regular expressions while on honeymoon!

Basically, the problems with your data are twofold:

a) There is no unique identifier in the first occurrence of a 'new' number; and

b) The number of repetitions varies.

You cannot use regexp to compare two strings of text and decide if a change has occurred (i.e. a new number/city, whatever). In summary, getting a parser/utility written was a smart move.

I am writing up a guide about how to use regular expressions, going from basics to more advanced stuff. Stay tuned.

I have an output file from a program which contains "\n" characters instead of line breaks, e.g.: "Text\nNew line\nAnother line"

Similar to your "mork" solution I do a consecutive replace, first in "normal mode" replacing "\n" characters with something unique like "ZZZ", then in "extended mode" replacing "ZZZ" with "\n" so I finally have the line breaks.

There should be a way to do this in one step, or to automate the two steps, either in notepad++ or with some other tool - has anyone got an idea?

fresh332, yes there is a way to do this in one step; and no, you cannot do this with Notepad++.
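
As one illustration of a single-step approach outside Notepad++, here is a small Python sketch. It assumes the file contains a literal backslash followed by the letter n, as in fresh332's example; the function name is made up.

```python
def unescape_newlines(text):
    # "\\n" is a backslash followed by "n"; "\n" is a real line break.
    return text.replace("\\n", "\n")

# The example string from the question, written as a raw string so the
# backslashes stay literal.
sample = r"Text\nNew line\nAnother line"
result = unescape_newlines(sample)
print(result)
```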

I now use a very powerful text editor called XEmacs. It really leaves Notepad++ for dead when it comes to regexp. It's so good that I'm working on a more detailed guide to regexp using XEmacs right now.

FYI: in XEmacs, you specify a newline character by first pressing Ctrl+Q and then Ctrl+J. This creates a newline character that takes care of \n and other "newline" characters.

I didn't see a mention of Notepad++'s other Find/Replace facility: The TextFX plugin. I did not look to see if any of the "unsolvable" problems would be solved by TextFX, but in the case that they might be, it's worth looking at the TextFX Find/Replace facility (CTRL+R or via the menus) because of the way it can handle newlines and tabs.

That being said, connecting Find/Replace (any flavor) with the macro recording facility of Notepad++ would elevate this software to "perfect" in my eyes...it's the one thing remaining that really aggravates me on a semi-regular basis. Other than that, I LOVE this editor.

Good question, Ninad. I'm not sure if regexp can change upper to lower case for you. And I'm not sure how complicated your text file is. However, if you simply want to change text to lower case or vice versa, you can do this without using regexp.

In Notepad++, select the text that you would like to change, then click on the TextFX menu, then TextFX Characters, and then select lower case.

i just search the net for multiline regex replacements and i bumped into this post.

im experiencing the same problems on n++. poor thing, n++ can't handle multiline regex. :( oh well, im looking forward to seeing the XEmacs guide to regex. hope multiline regex replacement will be included in it. tnx.

i'm somewhat into coding that feature in java to fully customize regex commands into my needs (specially the multiline replacements). :) if anyone did that, please share. many thanks.. :)

Can I do a logical-OR regular expression search in Notepad++? In TextPad I used "^Alert|^Error|^Warning" to find all lines in a system log that started with any of the three words. The "|" operator does not seem to work in Notepad++. Of course, I could do three separate searches, but it would be nice if Notepad++ did this for me by interpreting an OR operator, e.g. "|".
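
For readers who want to see what the alternation does, here is a minimal Python sketch of the same search. Python's re module supports the | operator; the log lines are invented for illustration.

```python
import re

# Match whole lines that start with Alert, Error or Warning.
# re.MULTILINE makes ^ and $ work at every line boundary.
LEVELS = re.compile(r"^(?:Alert|Error|Warning).*$", re.MULTILINE)

# Hypothetical system log.
log = (
    "Alert: disk nearly full\n"
    "Info: backup finished\n"
    "Error: backup failed\n"
    "Warning: retrying\n"
    "No Error here\n"
)

matches = LEVELS.findall(log)
for line in matches:
    print(line)
```

The last sample line contains "Error" but does not start with it, so it is not reported.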

End state: I need an excel file with 3 columns: string1, url1, and string2

Any ideas? I am completely new to regex and using notepad++ for now. If someone who is really good at this replies quickly, then there also could be some work that we could pay them to do in the future as we get a lot of projects like this.

That's pretty easy to fix. I wouldn't use Notepad++ for this. Instead use the excellent and free XEmacs. In XEmacs, the correct regex search term would be (newline character at end of each line is made by Ctrl+Q, Ctrl+J):

"\(.*\)".*.*"\(.*\)".*.*.*"\(.*\)".*.*.*

and the correct replace term would be:

\1,\2,\3

This would create the following output:

STRING1,URL1,STRING2

which you could then open in Excel as a comma delimited file, which would place each string/url in a separate column.

Found your blog and hoping you can help me. I have a batch file that I receive daily. I need some help trying to modify it.

I need to insert a page break before it says PAGENO throughout the whole document. I tried to do Find and Replace with PAGENO & \fPAGENO, but it didn't work. It puts FF in black box in front of PAGENO, but doesn't create a page break when I print. What did I do incorrectly and is this the way to do a page break with regexp?

Also, is there a way to automate this process with Notepad++ or any other app?

Hey Vladimir,

This was a tough one! Let me begin by saying that I have an answer for you... kind of.

First of all, as far as I am aware, you cannot have page breaks in a text document.

Ok, now that we've got that out of the way, what are we going to do to help you? I would say that inserting a page break requires a rich text editor. So, Notepad++ is not going to cut it.

I have achieved what you requested in one easy step using Microsoft Word. Open your file in Word and select Replace (Ctrl+H), and enter the following search term:

Find what: \fPAGENO

Replace with: ^12

and then hit Replace All.

All of the \fPAGENO are now page breaks. Easy.

If you wish to remove PAGE from the top of each page, you could replace it with nothing. Be sure to match the case when searching so that you do not remove any legitimate occurrences of the word "page" that are in the content of your file (if there are any).

As for automating this, it can be done (although I am not hugely experienced in task automation in Word). Take a look at this URL: http://www.microsoft.com/technet/scriptcenter/resources/qanda/jul07/hey0710.mspx

Output: when i place the cursor on any of the open braces and press ctrl-B in a LISP file (got by using alt-l-l enter) i can see the open bracket and the closed bracket highlighted. Now i need a command to delete the text in between the brackets.

for ex: In the above input if I select { "pt_on_cv::evaluate" } then it should get deleted upon using a shortcut.

Some more information would be helpful. As your search involves multiple lines, I would strongly recommend using a more powerful text editor than Notepad++. I use XEmacs on Windows and Aquamacs on OSX. The solutions below will work in any text editor that supports multiline regular expressions (not Notepad++).

If you simply want to remove all instances of curly brackets, and everything that is in between them, you would search for: {.*}

Note that in Emacs, the way to insert a newline into your search query is to press Ctrl+Q then Ctrl+J. In the above example, you would insert the newline after the asterisk * and before the close curly braces }

and replace this with nothing.

However, I am assuming that you want to keep some of the information in the curly brackets. From your question, I cannot tell if it is every second instance, or curly brackets that contain "cv". Some more information would allow me to give you a more tailored answer. For the time being, I will assume that you want to remove curly brackets containing "cv", but want to leave those containing "sf" (or anything else) unaffected. To accomplish this, you would search for: {.*cv.*}

Thanks. Unfortunately, no text in the passage is the same. The only pattern is "Post" followed by exactly six random digits. There can be "Post" followed by 8 or 9 random digits, but they are of no interest to us. Example:

If you are working on something Post123456 cool, let #delete thisPost123456789 him know.#dont deletePost234567 They select a #deletePost1 forum member#dont deletePost23 each month for a#dont deletegrant of up to $100 in hardware or software or other products. (Products do not have to be available on the mp3Car Store.)

Ok, so I didn't understand your previous message properly, then. It still looks to me that there is a pattern there though.

Search for: Post......

Replace with: nothing

The problem is that if you search for "Post......" it will replace longer strings too, such as "Post12345678" will become "78", and this is not good. So, in order to make it unique, you might include a space after the final period in your search expression.

I will put the search term in quotes to illustrate that there is a space on the end. Do not use the quotes in your text editor.

Search for: "Post...... "

If you are working on somethingcool, let #delete thisPost123456789 him know.#dont deleteThey select a #deletePost1 forum member#dont deletePost23 each month for a#dont deletegrant of up to $100 in hardware or software or other products. (Products do not have to be available on the mp3Car Store.)

Thanks. This is exactly what I did. However, regexp has a more elegant solution. You can specify exactly how many characters you are searching for. What if the number of digits was 60 instead of 6? You can write .{60} instead of typing 60 dots. I was wondering if Notepad++ has this feature implemented.

And also, we need to search only for digits.. so we will have to type [0-9] sixty times. (otherwise, posting123 will be selected)
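
In engines that support counted repetition, both wishes (an exact count, and digits only) fit in one short expression. The Python sketch below is an illustration only: it uses [0-9]{6} for "exactly six digits" and a negative lookahead so that longer runs such as Post123456789 are left alone. The sample sentence is a shortened, assumed version of the example above.

```python
import re

# "Post", then exactly six digits, and no further digit after them.
SIX_DIGIT_POST = re.compile(r"Post[0-9]{6}(?![0-9])")

sample = ("working on something Post123456 cool, let Post123456789 "
          "him know. Post1 forum Post23 each")
cleaned = SIX_DIGIT_POST.sub("", sample)
print(cleaned)
```

Changing the 6 to 60 is the only edit needed for the sixty-digit case.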

Ok. If that is all that your file contains, then you could simply search for:

..(.)

and replace with:

\1

Easy.

Note, I don't use Notepad++ any more, since I have moved on to Emacs. In Emacs the search term would be:

..\(.\)

but the concept is exactly the same: Discard the first two occurrences and keep the third.

Hi Mark is it possible to make something like this, im not a programmer so ill try to explain it easy

find any content between two specific custom tags and replace it with the same tags and a new content between them like

find [customtag]*[customtag]

replace [customtag] This is new content replacing whatever was between custom tags.[customtag]

im using * like a wildcard to explain that should select every single character between tags

and more specific what i want is

find *

replace some html marked text like \\Let change some hmtl paragraphs\\(ive put slashes mixed with html tags because blogger does not allow me to post those tags)

ive read you cannot use regular with multiline so i ask myself if this is possible in notepad++ in some extent and in multiple opened files simultaneously, preferable as i do all my work with this program, and only xemacs as a last option, or alternative if you want to show next to notepad++ that it is easier to accomplish this in xemacs. But i ask myself if xemacs is not for non programmer ppl like (i know html css and more or less can read php and python with a very rough idea of whats going on, sometimes)

thanks again for this super post the best in internet explaining regular expressions for notepad++ and introducing xemacs for the same.

I used NP++'s regular expressions for find and replace for the first time - successfully, before this I depended on MS SQL Server's Management studio for this, as it has very cool easy to use find/replace features (using regular expressions).

There's a very simple workaround for searching multiple lines. Replace \r\n with something that is never present naturally. I like the ANSI character 167, but Notepad doesn't have a facility for inserting ANSI characters easily.

Anyway then you run your search specifying the character or string as your endline equivalent, go to town and replace the puppies with \r\n.

Clever workaround. I like it. However, this doesn't address the main reason that forced me to move from Notepad++ to Emacs:By using a more powerful text editor, workarounds are not required. New line characters can be searched for and/or replaced at will. This simplifies the search and replace expressions and saves me time.

Thank you for the guide! I'm a little confused by what to do in my situation. I have a file with such a structure:

BEGIN:VCARD
VERSION:2.1
N:Doe;John;;;
FN:John Doe
TEL;CELL;PREF:+41800800800
EMAIL;PREF;WORK:test@blabla.com
ORG:Test
END:VCARD

I want the "FN:" section to be changed in that way: FN: Doe, John (and no more FN: John Doe). Is that possible?
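
One possible approach to this request, sketched in Python rather than Notepad++: capture the two words on the FN line and swap them. This is a sketch under a strong assumption, namely that every FN value is exactly "First Last" with a single space; names with more parts would need a different rule. The shortened vCard below is for illustration.

```python
import re

# Group 1 = first name, group 2 = last name, group 3 = an optional
# carriage return so that files with Windows line endings still match.
FN_LINE = re.compile(r"^FN:(\S+) (\S+)(\r?)$", re.MULTILINE)

vcard = (
    "BEGIN:VCARD\n"
    "VERSION:2.1\n"
    "N:Doe;John;;;\n"
    "FN:John Doe\n"
    "TEL;CELL;PREF:+41800800800\n"
    "END:VCARD\n"
)

# Re-emit the two captured words in swapped order.
swapped = FN_LINE.sub(r"FN: \2, \1\3", vcard)
print(swapped)
```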

Apologies for taking so long to respond. You all caught me in the midst of a trans-continental move. Now, to your questions:

@Edward: To my knowledge, no.

@Pushkar: Do you literally mean replacing @#$% with <@#$%>? This can be achieved using a simple Find + Replace:

Find: @#$%

Replace with: <@#$%>

If you are talking about some sort of larger-scale find and replace based on some criterion, you need to give me more information, and preferably a snippet of text showing what the text looks like before and what you would like it to look like after.

To answer the question, "is there anything it can't do": well, lookahead and lookbehind in regexp fail, and newlines (pretty much anything supported in extended) aren't supported in regexp. And in case anyone is wondering, yes, vim supports this just fine. But I'm still in love with notepad++ because it's just so much more simple to use, but learning vim is still well worth the effort (in my 1st week now and starting to get some real work done with it xD)

but who knows, maybe these issues will get addressed in the next version of notepad++

anyway nice article it did help a little even for an issue that couldn't be fixed in notepad++ xD

Yes, e22, that is what I did in the original post above, though I used a nonsense word "mork" rather than !NEWLINE!

Still though, it is quite unacceptable to me that three steps are required rather than one. And once you start using very complex regular expressions in text files that are hundreds of thousands of lines long, it becomes very tedious to have to worry about whether you missed any of your newly inserted !NEWLINE!s, or if any subsequent expressions modified something in your nonsense word (e.g., if I then got rid of all exclamation marks, it would be hard to go back). My point is that regular expressions are meant to save you time...

This is quite a straightforward example, Nico. Haven't had one of these in a while ;)

So we start off with this:

Minradio#23-567

In Notepad++ regular expression search mode,

Search for: .*#(.*)-(.*)

Replace with: \1\2

What you end up with is this:

23567

It might seem a little tricky, but the concept is simple: What information do you want to keep? And how does the other unimportant information border it? In the regexp above, I used the hash (#) and hyphen (-) as anchors. This means that:

a) the text before the hash is free to vary

b) the number of digits between the hash and hyphen is free to vary

c) the number of digits after the hyphen is free to vary.

The limitation is that if some of your lines of text do not contain # or - then it will break my regexp.

You want to keep the numbers, and get rid of whatever is before the numbers as well as the hyphen. So, in Notepad++ regular expression search mode,

Search for: .*#(.*)-(.*)

Let me break down this search term. The first three characters .*# will search for anything until a hash # is found (Minradio# in the above example). We don't put parentheses around this because we don't want to use it in our Replace term; we simply discard it. The next five characters (.*)- will search for anything until a hyphen - is found. The parentheses around the period and asterisk mean that that text (which is in this instance the text immediately after the hash #, that is, the number 23) can be recalled in our Replace term. The way to recall the contents of this first set of parentheses is by typing \1. The hyphen is not enclosed within the parentheses and therefore cannot be recalled in the Replace term; it is simply discarded. Finally, the last four characters (.*) select the remaining text (in this example 567) and the parentheses mean that it can be recalled in the Replace term, this time by \2, because it is the second set of parentheses. So, the Replace term looks like this:

Replace with: \1\2

What you end up with is this:

23567
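
The same search and replace can be checked quickly in Python, whose re module accepts this pattern unchanged. The function name is made up, and the second call shows the limitation mentioned earlier: a line without # and - is returned untouched.

```python
import re

def keep_numbers(line):
    # .*#   : everything up to and including the hash (discarded)
    # (.*)- : group 1, the text before the hyphen; the hyphen is discarded
    # (.*)  : group 2, the rest of the line
    return re.sub(r".*#(.*)-(.*)", r"\1\2", line)

print(keep_numbers("Minradio#23-567"))
print(keep_numbers("Minradio 23567"))
```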

So, why are you ending up with 23-567? There are a few possibilities:

1. The original text had two hyphens:

Minradio#23--567

If that is the case change your search term to this:

.*#(.*)--(.*)

2. You are including the hyphen within one of the sets of parentheses:

.*#(.*-)(.*)

or

.*#(.*)(-.*)

The hyphen therefore will not be discarded. It will be recalled when you use \1 (top) or \2 (bottom).

3. You are reinserting the hyphen in your Replace term:

Replace with: \1-\2

Mauri, it turns out that this is not as trivial as it first appears. Handling email addresses is quite a controversial issue in the regexp world. See http://www.regular-expressions.info/email.html for a discussion of the various issues and disagreements. Your sample text has two unique characteristics that allow us to sidestep the messy world of identifying 'what is an email address?', so I have taken advantage of these two conditions:

1. Each email is separated by a comma followed by a space ", "

2. Some of the email addresses are missing a "@"

I have written the solution below for Notepad++. It involves several steps, but as long as conditions 1 and 2 from above are satisfied, it will always work.

If you are trying to perform this search in Notepad++, it's not going to happen.

Having said that, if you insist on using Notepad++, you are going to need to get creative and will need to break the search down into steps because \r\n cannot be used in Regular Expression mode - you need to use Extended Search mode for \r\n. So, how many steps do you need? I'm not sure, because it depends on your text, but my guess is at least three:

1. Turn the newline into something unique.

2. Run the regexp.

3. Put the newlines back or do something else with them (not sure what, because you didn't specify).

Well :), if the regular expression implementation in Notepad++ would implement the multi line pattern that could solve it. I am not familiar with how Notepad++ is implemented but Java would allow multi line patterns. I bet .NET would do the same.

The solution you suggested would work nicely but what I was trying to do was to search thru a large set of java files for a certain multi line pattern. So I can't have the option to replace the \r\n with a special token since that will alter the code base.

You can see that in the search term, I am using the forward slash and underscore as signposts, and am keeping everything before and after (enclosed in parentheses), but am discarding everything in between (not enclosed in parentheses).

Yes it is. But it will require a very long and convoluted process and several search and replace steps (similar to the blog post above). The problem is the "increment by one" part.

In Notepad++, you can insert incremented numbers from the Edit | Column Editor menu command. This places numbers at the front of each line.

You could possibly position each ? so that it occurs at the end of each line, then replace it with ${abc\1}$, where \1 represents the number at the beginning of the line.

Not sure if you want to go ahead with this, but if you do, here are the steps:

1. Get rid of all line breaks, replacing them with some unique string that does not occur in your original text file, such as "thereisnoothertextlikethis".

2. Search for ? and replace with a ? followed by a linebreak.

3. Add numbers to the beginning of each line using the Edit | Column Editor menu command.

4. Use a regular expression to search for the number at the beginning of each line and move it to ${abc\1}$

5. Remove all linebreaks.

6. Replace all instances of thereisnoothertextlikethis to restore your original linebreak structure.

If you want to go ahead with this, paste a larger portion of your text file (10-20 lines) and I'll show you how to do it in more detail.
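
For comparison, a scripting language can do the "increment by one" part directly, because the replacement can be computed per match. The Python sketch below assumes the goal is to turn the first ? into ${abc1}$, the second into ${abc2}$ and so on; that reading of the request, the function name and the sample text are all assumptions.

```python
import itertools
import re

def number_placeholders(text):
    # A fresh counter for each call, starting at 1.
    counter = itertools.count(1)
    # The replacement is a function, so each ? gets the next number.
    return re.sub(r"\?", lambda match: "${abc%d}$" % next(counter), text)

# Hypothetical input.
sample = "a = ? and b = ? and c = ?"
numbered = number_placeholders(sample)
print(numbered)
```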

Thanks for all the great info on regular expressions, although I have a problem I can't seem to find the solution for.

I have a data file from which I would like to strip out some sections, as they are useless. First I replaced all the \r\n with @NEWLINE@ so I could get the whole file in one line; now I'm trying to replace anything between and with

e.g.

**Data I want to keep is here 1**messagecalled today but nobody was home/message**Data I want to keep is here 2**messagecalled today but nobody answered/message

the words message have < and > around them but the site wont let me post them.

As I said I removed all the line breaks from this and tried to run this regular expression.

Find: (messages.*)(/messages)

Replace: deleted

I couldn't work out how to find the < or > symbols.

I hoped this would delete all the messages and replace them with the word deleted. What it does, though, is find the 1st occurrence of the word messages, then find the last occurrence, and replace everything in between with the word deleted. In my example above it's deleting **Data I want to keep is here 2**

Thanks for the quick response Mark, what I want to replace is the < message > and < /message > and everything in between them. I can get it to work if there is only one set of these tags in the file (unfortunately there are thousands), if there are more than 1 set it goes wrong and deletes everything between the 1st < message > and the last < /message >.

Since the < message > and < /message > are on different lines in the file and the content between them can also vary on how many lines it's over, I removed all the line breaks to make it a bit easier to do the search and replace.

Notepad++ has a hard time handling multiline regular expressions. One option is to use a different text editor with more powerful regexp capabilities (ahem, Emacs). The other option is to use Notepad++ and break this down into a few steps (3 to be precise).

Step 1: Remove the newlines

Search for (extended mode): \r\n
Replace with: nothing

This will give you this:
**Data I want to keep is here 1**< message >called today but nobody was home< /message >**Data I want to keep is here 2**< message >called today but nobody answered< /message >

Step 2: Make all instances of < /message > occur at the end of a line. We do this because we want to discard everything before < /message >, apart from that bit at the front that we want to keep.

Search for (extended mode): < /message >
Replace with: \r\n

So, your text will now look like this:
**Data I want to keep is here 1**< message >called today but nobody was home
**Data I want to keep is here 2**< message >called today but nobody answered

We are nearly there, but we still want to discard everything after (and including) the < message > tag.
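For anyone who would rather script this clean-up, here is a minimal Python sketch of the same three steps. It assumes the tags really are written <message> and </message> (without the spaces this site forces me to add), and strip_messages is just a name I made up for illustration.

```python
import re

def strip_messages(text: str) -> str:
    # Step 1: remove the newlines so everything sits on one line.
    one_line = text.replace("\r\n", "")
    # Step 2: turn every closing tag into a line break.
    lines = one_line.replace("</message>", "\r\n").split("\r\n")
    # Step 3: on each line, discard the opening tag and everything after it.
    kept = [re.sub(r"<message>.*", "", line) for line in lines]
    return "\r\n".join(kept)

sample = (
    "**Data I want to keep is here 1**\r\n"
    "<message>called today but nobody was home\r\n</message>\r\n"
    "**Data I want to keep is here 2**\r\n"
    "<message>called today but nobody answered</message>\r\n"
)
print(strip_messages(sample))
```

Because each line holds at most one opening tag by the time step 3 runs, the greedy .* can no longer swallow the data between two separate messages.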

This is a little tricky, because there are multiple parentheses on the same line, which can muck up your search term. First things first: the way to search for a parenthesis is with a preceding backslash, like this \( for open and this \) for closed.

One solution for your problem is to take a different approach: rather than trying to take care of all parentheses at once, you could take care of parentheses that contain the same number of digits.

Search for (regular expression mode): \(.\)
Replace with: ,

Search for (regular expression mode): \(..\)
Replace with: ,

Search for (regular expression mode): \(...\)
Replace with: ,

Which will give you this:
apple,orange,banana,tulip,

This, of course, becomes impractical if the numbers within the parentheses range from 1 to 100 digits long. But, as a quick fix, it should be fine for your problem.
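In a regexp engine that supports the + quantifier, the three separate searches collapse into one. The Python sketch below is only an illustration: the input string is my guess at what the original text looked like, and parens_to_commas is a made-up name.

```python
import re

def parens_to_commas(text: str) -> str:
    # \( and \) match literal parentheses; [0-9]+ matches one or more digits,
    # so it does not matter how many digits sit inside each pair.
    return re.sub(r"\([0-9]+\)", ",", text)

# Hypothetical input, assumed for the sake of the example.
print(parens_to_commas("apple(1)orange(22)banana(333)tulip(4)"))
# prints: apple,orange,banana,tulip,
```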

Hi, your guide is amazing. I was hoping you could help me with a problem. Here it is: I have one big long line of full names and phone numbers, e.g. john cruz 00374653 kelly brunz 95847364 alan whirtz 9898372 jane doerl and so on. I'm trying to get it like this:
John cruz 00374653
kelly brunz 95847364
alan whirtz 9898372

P.S. I've tried to look up how to replace every 3rd space with a line break, or something along those lines.

Glad you found the post helpful, Adrian. I really like your example, because it seems to be very difficult to find a pattern in this seemingly unpredictable series of names and numbers. Some people might have three names (or in the case of Madonna, Pele and so forth, just one), so looking for the third space is not a foolproof solution. In addition, you could perhaps use the number of digits in a phone number, but this isn't foolproof either, as area codes vary in length, as do country codes and so on. You need to think outside the box in order to solve this particular problem. The solution is actually so straightforward that you will probably kick yourself when you see it.

In order to solve this particular problem, it is necessary to take a step back and look at the structure of your text in an abstract way. We need to find not only a pattern in the data that repeats (such as the number of spaces), but one that will allow us to insert a line break so that each name and its corresponding number will occur on the same line. Ideally, we would like to say "wherever there is a string of numbers, turn the next space into a linebreak". It is not possible to do this in one step in Notepad++ (although you could do it in a more powerful text editor, like Emacs). We are going to need 2 steps.

In order to find where each phone number ends, all we have to do is find the number that has a space after it.
Search for (regexp mode): ([0-9]) - note that there is a space after the closed parenthesis
Replace with: \1,
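If you happen to have Python handy, the same idea (a digit followed by a space marks the end of a record) can be written as a single substitution, because the replacement term is allowed to contain a line break. This is just a sketch, and one_contact_per_line is a name I invented for it.

```python
import re

def one_contact_per_line(text: str) -> str:
    # ([0-9]) followed by a space: keep the digit (\1), swap the space for a newline.
    return re.sub(r"([0-9]) ", r"\1\n", text)

line = "john cruz 00374653 kelly brunz 95847364 alan whirtz 9898372"
print(one_contact_per_line(line))
# john cruz 00374653
# kelly brunz 95847364
# alan whirtz 9898372
```

Note that this relies on names never ending in a digit, which is the same assumption the Notepad++ search term makes.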

Thank you Mark, it works perfectly. The reason I was asking you about how to make a line break after, let's say, 12 spaces/commas is that I have a lot of files that I want to break up into lines of 3, 6 and 9. I will try to give you a good example of what I'm looking to do, e.g.:
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me

So you start off with this
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me

and you want to put 3 words on each line. Notepad++ makes this a bit harder than it should be. We need 2 steps. First, add a comma to the end of the line so that it looks like this
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me,
If you have hundreds or thousands of lines, you could use a regular expression to do this. Anyway, back to the task at hand.

Search for (regexp mode): ([a-z]*),([a-z]*),([a-z]*),
Replace with: \1 \2 \3QQQ
john likes thisQQQjane loves gamesQQQpeter saved meQQQgeorge fell todayQQQgreg,pushed,me
Note that the QQQ is just a random string that I came up with which (a) will never occur in your list of words, and (b) is easily searchable, which is useful for the next step below.

Ok, so let's say you start off with this:
john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,
and let's assume that you want to group them so that there are 5 words on each line.
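The QQQ placeholder is only needed because the Notepad++ regexp replace term cannot contain a line break. As a rough comparison, here is the same search term in Python, where the newline goes straight into the replacement. The function name is made up, and the input assumes the trailing comma has already been added.

```python
import re

def three_per_line(text: str) -> str:
    # Three comma-terminated words become one space-separated line.
    return re.sub(r"([a-z]*),([a-z]*),([a-z]*),", r"\1 \2 \3\n", text)

words = "john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me,"
print(three_per_line(words))
```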

Glad you found the blog helpful, Adrian. In order to change the number of words that will end up on each line, simply change the number of ([a-z0-9]*), in the search term, and make sure you have the same number of items in the replacement term.

The obvious limitation of using this sort of brute force approach is that it becomes impractical if you wanted say 1000 words on each line (that would be a lot of copy+pasting!). But, we are trying to work around the limitations of Notepad++, so we have to (sometimes) use inelegant solutions.

As for donations, I gratefully and humbly accept whatever you can spare. My email address for Paypal is markbfm@yahoo.com

Ah, yes, you are right. Notepad++ will not let you have more than 9 bins. Sorry, I was not working in Notepad++ when I posted my previous reply. This is yet another reason to use a more powerful text editor for this sort of advanced regexp. Enough of my ranting.

So, let's say that you want to have more than 9 words per line. It's just a matter of making our bins bigger. Rather than putting one word in each bin, we could put 20 in each bin (or however many you like).

Ok, so we start off with these 40 words:
john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,peter,saved,me,george,fell,left3,greg,pushed55,me,too,

Search for (regexp mode): ([a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,)
Note that there is only one open parenthesis at the start and one closed parenthesis at the end of the search term.
Replace with: \1QQQ
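In an engine that supports counted repetition, all that copy+pasting can be replaced by a {n} quantifier. The Python sketch below shows the idea under that assumption; words_per_line and its n parameter are names I made up, and the text is expected to end with a comma, as above.

```python
import re

def words_per_line(text: str, n: int) -> str:
    # (?:[a-z0-9]*,){n} repeats "one word plus its comma" exactly n times;
    # the outer parentheses form the single bin that \1 refers to.
    pattern = r"((?:[a-z0-9]*,){%d})" % n
    return re.sub(pattern, r"\1\n", text)

# Small illustrative input; any comma-terminated word list works the same way.
print(words_per_line("john25,likes,this,jane11,loves,games,", 3))
```

Changing the number of words per line is then a matter of changing n rather than editing the search term by hand.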

Mark, since you seem like the regex master, maybe you can point me in the right direction: I have a CSV file that has text enclosed in "" but the problem is that in the REMARK/detail field there can be inches, which are also written with ". How can I find these lines with the extra quotation marks?

Thanks for your question, Organix. I normally get asked about changing a text file by restructuring data, but finding text in a particular format can be useful, too. You are interested in an expression that will find text that contains a third " which indicates that the comment includes the inches of some object or action, such as dropping a ball. To find this use the search term below

Hey Vin, getting rid of the information within the parentheses is pretty easy, although getting Notepad++ to recognise that you are looking for a parenthesis as part of your search term requires you to precede it with a backslash \( or \)

Search for (regexp mode): (.*)\(.*\)(.*)
Replace with: \1\2

So, you will end up with this.
7-Jul-09;6-4-12,JLN 4/125 ;VANTAGE POINT
3-Sep-09;8-8-7,JLN 4/125;VANTAGE POINT
1-Oct-09;6-10-07,JLN 4/125 ;VANTAGE POINT
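Here is a small Python sketch of the same clean-up, in case you ever need to script it. The input line is hypothetical (I am guessing at what sat inside the parentheses), and drop_parenthesised is a made-up name.

```python
import re

def drop_parenthesised(line: str) -> str:
    # \( ... \) are literal parentheses; [^)]* stops at the first closing
    # parenthesis, so several pairs on one line are each removed.
    return re.sub(r"\([^)]*\)", "", line)

# Hypothetical input line, assumed for illustration only.
print(drop_parenthesised("7-Jul-09;6-4-12,JLN 4/125 (abc);VANTAGE POINT"))
# prints: 7-Jul-09;6-4-12,JLN 4/125 ;VANTAGE POINT
```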

I do not understand what you mean by "sort out the date accordingly". Throw me a bone here...

Oh, I get it now. You are not going to be able to do that in a text editor. Perhaps import the text file into Excel, use the text-to-columns feature and specify the comma as your delimiter. Column A will contain all of the dates. Select Column A, set the format of the cells to 'date'. Select the whole data range and sort ascending by column A. That will do it.

Thanks man. It was really helpful. I wanted to remove the "," that comes in a string from a flat file of 200000+ records. The comma was messing up my delimiter. BTW I used [a-zA-Z0-9]+,[a-zA-Z0-9]+ as my search string. Again, thanks a ton, man.

Greg, does N++ regex support doing an arithmetic calculation in the replacement?

No. However, depending on the arithmetic, there may be a way to "fake it", and bend Notepad++ to your will. One reader asked me if it is possible to increment numbers in the replace term. It isn't. But if you insert a number on each line using the Notepad++ column editor, then use regexp to restructure the data, the result is identical.
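For comparison, some scripting languages do let the replacement be computed. In Python, for example, re.sub accepts a function in place of the replacement string, so arithmetic on the matched number is possible. A minimal sketch, with increment_numbers and step as made-up names:

```python
import re

def increment_numbers(text: str, step: int = 1) -> str:
    # The lambda receives each match object and returns the replacement text.
    return re.sub(r"[0-9]+", lambda m: str(int(m.group(0)) + step), text)

print(increment_numbers("item 9, item 41"))
# prints: item 10, item 42
```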

So, where does that leave us then? I am not 100% clear on what your text looks like or what you want it to look like. Like anything, there is a way to do it, but the question is how messy will it get, and is it the most efficient way of getting the job done. It all depends on how repetitious your replace term will be. My gut feeling is that you should probably take a look at either Perl or Awk for your particular case.

Your work is amazingly good. Recently a hacker injected iframes into all the PHP files of my web page. I'm trying to remove these iframes with Notepad++. What I want to remove is something like:
IFRAME Bla Bla Bla /IFRAME
So, I know the beginning and the end of the string, but the problem is that the contents are not the same all the time. One thing that I haven't confirmed yet is whether each iframe is located on a separate line; if so, all I need is to delete the whole line where I locate the iframe term. Please give your suggestions. Many thanks.

Thanks for your kind words, Mamoun. If all you need to do is remove whatever is contained within the Iframe tags, this can be achieved easily by

Search for (regexp mode): IFRAME.*/IFRAME
Replace with: nothing

If there are multiple instances of IFRAME on the same line, or if an individual IFRAME spans multiple lines, then things become a little more complicated, especially if you're using Notepad++. In this instance, you could change all instances of /IFRAME to something unique, such as ENDOFHACK or whatever. Then you could remove all newlines and replace them with something else unique, such as PUTBACKLATER or whatever. Then you would

Search for (regexp mode): IFRAME.*ENDOFHACK
Replace with: nothing

Then reinsert all newlines back where they were by replacing PUTBACKLATER with \r\n in extended search mode.
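If you have many PHP files to clean, a script may be less painful than the placeholder dance. The Python sketch below is one possible approach, not a tested recipe for your site: it assumes the injected code uses ordinary <iframe> ... </iframe> tags, and remove_iframes is a name I made up. The non-greedy .*? stops at the nearest closing tag, and the DOTALL flag lets the match run across line breaks.

```python
import re

# IGNORECASE covers both <IFRAME> and <iframe>; DOTALL lets "." match newlines.
IFRAME = re.compile(r"<iframe\b.*?</iframe>", re.IGNORECASE | re.DOTALL)

def remove_iframes(html: str) -> str:
    return IFRAME.sub("", html)

page = '<p>a</p><IFRAME src="x">bad\r\nstuff</IFRAME><p>b</p><iframe>z</iframe>end'
print(remove_iframes(page))
# prints: <p>a</p><p>b</p>end
```

With a greedy .* the match would instead run from the first opening tag to the last closing tag and take the legitimate content in between with it.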

Hello Mark, I am trying to replace spaces in string with a comma and figured I need to use a regular expression. I found your post and although it answers a lot, it doesn't help me achieve what I need. Hope you can help!

I have an export file in html from the delicious bookmark site. A part of the html looks like this:

Hi Mark, thanks for your quick reply! Here are a few lines. It's just a part of the full string because I can't paste the full html here. I don't see your email address, but if you have mine you can email me and I will reply with an actual example file. Thanks!

Ok, thanks for providing the extra info, JP. So, we start off with what you have above. What makes things difficult is the fact that each bookmark has a different number of tags. So in order to get around this, first we will move everything after > to a new line.

And as you see, we have a trick: there is a fixed text (iamhere) two lines after each line that we need. In fact the question is easy: could we select/mark the lines that come two lines before a fixed text? I looked at TextFX but I couldn't solve the problem. Thanks for your help.

Glad you found it helpful, RussiAmore and Randy. And thanks for the kind words.

Enes, your problem is a simple one (in theory), but is made complicated by Notepad++. As you point out, there is a (somewhat) recurring pattern in the text. You want to keep the line that is two lines above "iamhere", and discard the line above "iamhere" as well as the "iamhere" itself. I am not even going to bother doing this in Notepad++ because the solution would be very, very, very long. We need a text editor that will allow us to include newlines as part of our regexp search term. I recommend Emacs, available here: http://ftp.gnu.org/gnu/emacs/windows/

Note: insert newline characters into a search term in Emacs by pressing Ctrl+Q Ctrl+J

Replace with: \1

This will give you this:

Jessie 213
Jack 232
blablabla
sometext
againsometext
Mark 30

Ok, so now you can see why I said that there is a *somewhat* recurring pattern. The number of lines between each occurrence of "iamhere" varies, so we want to get rid of "blablabla", "sometext" and "againsometext". In this example, we can use the fact that the unwanted text does not end with a number to our advantage, like this

Hi Mark: I really appreciate your blog! Question: I'm working in XML and I want to find all contents between these two tags: caution tags. (Imagine a left and right caret on each caution tag, with verbiage between them. For some reason this blog won't allow caret tags.) I can find the tags; now how do I copy that content into a separate file? I know I have about 200 cautions and I want to extract only that content to a file. Make sense?

I would appreciate any assistance you can offer, oh Notepad++ guru, you!