Regex help please to remove Special characters inside xml tags

Hi… I work in a “locked down” environment working with XML files and only really have the vanilla version of Notepad++ to help, i.e. no plugins :-(

I’m after some help please with a regex to edit/correct XML files where a specific tag needs to be updated to remove “special characters”, i.e. ?, spaces, *, ^ and perhaps most importantly /, whilst maintaining the XML tags and what’s left of the data after the special chars have gone…

For example
<Ref>1234/567890</Ref> or
<Ref>123 TEST 23/*</Ref>

We need to remove the offending characters but keep the rest of the text/tags to leave
<Ref>1234567890</Ref> or
<Ref>123TEST23</Ref>

Our XSD validation only allows 0-9 and A-Z between the <REF></REF> tags…so anything else is “special”

We’ve tried various one-liners without much success and are after a bit of advice/pointers please, especially as we’re now looking at 6 million row files daily…

@PaulSc
First off I’m wondering if Notepad++ can deal with the VERY large files you mention (6 million row files). Regardless of that, I think the following regex will suit your needs. As it stands it needs to be run multiple times, as each pass only removes 1 ‘unwanted’ character from each <Ref></Ref> tag. So you keep running it until the result is “0 occurrences were replaced”. There may well be a better method, and if anyone can identify it, @guy038 can.
So using the Replace function we have
Find What: (?i)(<Ref>.*?)[^a-z0-9](?=.*</Ref>)
Replace With: \1

Use Replace All with wrap around ticked and the search mode set to ‘regular expression’. Run it until you get the above response (0 occurrences).
The (?i) makes the match case-insensitive, so [^a-z0-9] treats both upper and lower case letters as wanted characters. Each match has to begin at the opening <Ref> tag and, because of the lookahead (?=.*</Ref>), has to be followed by a closing tag on the same line, which prevents removing unwanted characters that are not inside the tag.

I have not catered for the tag being split across lines; if this is so you will need to provide a fuller description of the data.
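For anyone able to run a script outside Notepad++, the repeat-until-zero procedure can be sketched in Python. This is only an illustration, assuming Python's standard re module treats this particular pattern the same way Notepad++ does; the name strip_one_per_pass is made up for the example.

```python
import re

# Same pattern as the Find What field above: it removes at most
# one unwanted character per <Ref> tag on each pass.
PATTERN = re.compile(r"(?i)(<Ref>.*?)[^a-z0-9](?=.*</Ref>)")


def strip_one_per_pass(text):
    """Re-run the replacement until a pass makes 0 replacements (hypothetical helper)."""
    while True:
        text, count = PATTERN.subn(r"\1", text)
        if count == 0:
            return text


print(strip_one_per_pass("<Ref>123 TEST 23/*</Ref>"))
```

Each pass through the loop corresponds to one click on Replace All; the loop ends on the pass that reports nothing left to replace.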

After some tests I realized that, instead of the negative look-ahead syntax (?!<|>). we could simply use the negated character class [^<>].

Indeed, the interest of the negative look-ahead is obvious when you need to avoid certain strings. For instance, the regex (?i)123((?!A simple test|abc|OK).)*789 assures you that the range of characters between the numbers 123 and 789 never contains any of the strings A simple test, abc or OK, whatever their case, in order to get a match !

But, when you simply need to avoid some single characters, the negated character class syntax is easier to understand, to my mind ! For instance, the regex 123[^#@+]*789 assures you that the range of characters between the numbers 123 and 789 never contains any of the characters #, @ or +, in order to get a match !
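The two example regexes can be tried out with a few lines of Python. This is a small sketch using the standard re module, which accepts both patterns unchanged; the helper name matches and the test strings are invented for the illustration.

```python
import re

# Look-ahead version: no position between 123 and 789 may start
# one of the forbidden strings (case-insensitive because of (?i)).
NO_STRINGS = re.compile(r"(?i)123((?!A simple test|abc|OK).)*789")

# Negated class version: no single character between 123 and 789
# may be #, @ or +.
NO_CHARS = re.compile(r"123[^#@+]*789")


def matches(pattern, text):
    """Return True when the pattern is found somewhere in text."""
    return pattern.search(text) is not None


print(matches(NO_STRINGS, "123 fine 789"))      # True
print(matches(NO_STRINGS, "123 it is ok 789"))  # False
print(matches(NO_CHARS, "123 a-b 789"))         # True
print(matches(NO_CHARS, "123 a+b 789"))         # False
```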

Just notice that, inside a character class, most symbols do not need to be escaped in order to be considered as literals ! The only characters which need to be preceded with the \ escape symbol are :

The - dash, which defines a range of characters allowed or forbidden

The two square brackets [ and ], which are the boundaries of the character class

The ^ caret symbol, which defines a negated character class

The \ escaping char, itself

So, for instance :

The regex [\^&\\~\]=] would look for any char ^, &, \, ~, ] or =, whereas

The regex [^(\[)\-%] would look for any char, different from (, [, ), - and %

Of course, depending on where the symbol sits inside the character class, the \ escape is not always mandatory, but be aware that escaping is the safe method, in all cases !
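The two character classes above behave the same way in Python's standard re module, so they can be checked quickly with the sketch below. The sample strings are made up for the illustration.

```python
import re

ESCAPED_SET = re.compile(r"[\^&\\~\]=]")  # any one of ^ & \ ~ ] =
NEGATED_SET = re.compile(r"[^(\[)\-%]")   # any char except ( [ ) - %

# Raw string, so the backslash before d is a literal backslash in the sample.
found = ESCAPED_SET.findall(r"a^b&c\d~e]f=g")
kept = NEGATED_SET.findall("x([)-%y")

print(found)
print(kept)
```

The first list holds the six symbols of the class, in the order they occur in the sample; the second holds only the x and the y, since every other character of the sample is excluded by the negated class.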

But, let’s get back to our problem ! So, we can modify the above regex as :

(\G|<Ref>)[^<>]*?\K[^\r\n\w<>]+

Notes :

I suppressed the (?s) modifier as the regex does not contain any dot char, anymore !

The negated character class [^\r\n\w<>], at the end of the regex, lists all the characters that should not be deleted, i.e. :

The line-break characters have to be kept, of course !

Any word character \w, which is an equivalent of the regex [A-Za-z0-9_], as well as all accented characters

The angle brackets < and >, surrounding starting and ending tags

Note that, if you would like to also keep, between the two tags <Ref> and </Ref>, let's say, the colon punctuation sign, the dollar currency sign and the space character, simply change the regex as below :

(\G|<Ref>)[^<>]*?\K[^\r\n\w<>:$ ]+

So, this regex, after matching the starting tag <Ref> or starting from the location of the end of the previous match, \G, tries to match, first, the shortest range of characters, even null, all different from the angle brackets, [^<>]*?

Then, due to the \K syntax, the regex engine forgets all that was matched, so far and just considers the regex part [^\r\n\w<>]+, which represents the greatest range of consecutive symbols which have to be deleted
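Outside Notepad++, the same clean-up can be approximated in Python. Note that this is a swapped-in technique and not the regex itself: Python's standard re module has neither \G nor \K, so the sketch grabs everything after <Ref> up to the next angle bracket and strips the unwanted characters in a callback. The names TAG_RUN, UNWANTED and clean_refs are hypothetical, and the sketch ignores the fact that \G is also true at the very start of a search.

```python
import re

# Opening tag, then everything up to the next < or > (line-breaks included).
TAG_RUN = re.compile(r"(<Ref>)([^<>]*)")
# Same class as at the end of the regex above: the characters to delete.
UNWANTED = re.compile(r"[^\r\n\w<>]+")


def clean_refs(text):
    """Delete unwanted characters found after each <Ref>, up to the next angle bracket."""
    return TAG_RUN.sub(
        lambda m: m.group(1) + UNWANTED.sub("", m.group(2)),
        text,
    )


print(clean_refs("a / b <Ref>123 TEST 23/*</Ref> c * d"))
```

Text outside the tags is left alone because only the part captured right after <Ref> is ever passed to the inner substitution.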

In summary, assuming the sample text below, containing <Ref>..........</Ref> regions, with two ranges in a first line, the next one, split in several lines, and a single region, at the end, after some line-breaks :

@guy038
I REALLY like your thinking. There’s only 1 concern I have and luckily you actually represented it in your example. On the first line in the 2nd tag you have the “_” (underscore) character.
Now @PaulSc said:

Our XSD validation only allows 0-9 and A-Z between the <REF></REF> tags…so anything else is “special”

so I’m wondering if the use of the “\w” is perhaps a bit too encompassing. I’ve been doing some testing using your idea (forgive me) and to my mind the power of your regex really comes from using the “\G”, which allows it to “stay put” after a find, rather than mine which continually had to reset after each find starting with looking for the next “<Ref>” sequence.

Looking forward to whether the “\w” can be constrained in an elegant way so it more exactly fits the OP’s request.
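To see how much the “\w” lets through, the class can be compared with a stricter one that spells out A-Z, a-z and 0-9. The stricter class is only an assumption about how it might be narrowed, not something confirmed in this thread, and the Python below (standard re module, made-up sample string) is just a quick way to test the difference.

```python
import re

# Class from the regex under discussion: keeps every word character.
BROAD = re.compile(r"[^\r\n\w<>]+")
# Assumed stricter variant: keeps only ASCII letters and digits.
STRICT = re.compile(r"[^\r\nA-Za-z0-9<>]+")

sample = "AB_12 \u00e9/Z"  # \u00e9 is an e with an acute accent

broad_result = BROAD.sub("", sample)
strict_result = STRICT.sub("", sample)

print(broad_result)
print(strict_result)
```

With the broad class the underscore and the accented letter survive; with the strict class only AB12Z is left.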

Roughly, the regex (\G|<Ref>)[^<>]*?\K[^\r\n\w<>]+, right after the string <Ref> or after the end of the previous match, looks for the shortest range of chars, all different from < and >, up to a run of unwanted chars (anything other than a line-break, a word char or an angle bracket), which will be deleted during the replacement phase.

The start location is before the upper-case letter B. Obviously, the match cannot begin at this position ( \G ) as, in order to reach the first unwanted char /, the range would cross the starting tag <Ref>, which contains < and >, of course !

So, the regex engine is forced to look for the second alternative, i.e. the string <Ref>. Below, I marked, with some bullet chars •, the range of characters, between <Ref> and the unwanted char /

You agree that current location of the regex engine is, now, right after the ==== string. Well, what are the next unwanted char(s) ? Obviously, the +++ string. But, in that case, the regex engine would cross the </Ref> string, which contains the forbidden chars < and >, as we defined for any range.

Then, necessarily, the start of the next match will be located some chars after current location, so that the \G assertion is not true, anymore ! Thus, the regex engine starts looking, again, for a starting tag <Ref>, followed by some chars till the unwanted chars @@, giving :
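The stepping behaviour described in this post can be imitated in Python, where calling match() at an explicit position plays the role of \G. This is a rough sketch under stated assumptions: it uses the standard re module, the names START, STEP and unwanted_runs are invented, the sample string is made up, and the case where \G is true at the very start of the search is left out.

```python
import re

START = re.compile(r"<Ref>")
# Lazily skip allowed chars (never < or >), then capture one run of unwanted chars.
STEP = re.compile(r"[^<>]*?([^\r\n\w<>]+)")


def unwanted_runs(text):
    """List the runs that would be deleted, in the order they are found."""
    runs = []
    pos = 0
    while True:
        tag = START.search(text, pos)  # second alternative: find a <Ref>
        if tag is None:
            return runs
        pos = tag.end()
        while True:
            # First alternative: continue exactly where the last match ended.
            step = STEP.match(text, pos)
            if step is None:
                break  # next unwanted char lies beyond a < or >, chain is broken
            runs.append(step.group(1))
            pos = step.end()


print(unwanted_runs("<Ref>12/34==</Ref> ++ <Ref>a@@b</Ref>"))
```

The ++ between the two regions is never reported, because reaching it from the end of == would mean crossing </Ref>.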

@guy038
GREAT explanation of the use of the “\G”. I knew the “descriptor” for it but was a bit unsure how it might be used in a regex.

I’d happily suggest that explanation (in a more generic form) would make a great FAQ, if only we had one for the individual metacharacters used in regex. Maybe we DO need a “FAQ for Metacharacters”, as you (and others) have provided many examples in the past. I try to refer to them when unsure, but they can be hard to find amongst ALL the posts.

You mention again the use of accented characters. Obviously in the English speaking world we have little to do with them (in normal life). I would also suggest that the data the OP is using would likely not have them either, so you providing both alternatives gives him maximum options to satisfy his needs.