For the changes to take effect, please switch to any other language and back to English.

I’m always nervous doing this. What if I can’t read the new language enough to know how to switch back to English? Haha, just kidding. :)

But I had a real reason for replying: there is no need to actually switch to a different language. Just click the dropdown box (so that it drops down) and re-select the current language choice (which should be the top selection available and should be highlighted, in blue for me). This is enough to put changed customizations into use. One step instead of two.

In the Find what box, put (?-s)^(.+\R)\1+
In the Replace with box, put \1
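For illustration, here is roughly the same replacement expressed in plain Python; this is just a sketch, since Python's re engine has no \R, so \r?\n stands in for it (and . already stops at newlines by default, which is what (?-s) enforces in the Boost engine):

import re

sample = "keep\ndup\ndup\ndup\nalso keep\n"
# ^(.+\r?\n) captures one whole line including its line ending;
# \1+ matches any immediate repeats of it; replacing with \1 keeps one copy.
result = re.sub(r'(?m)^(.+\r?\n)\1+', r'\1', sample)
print(result)   # keep / dup / also keep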

It works… BUT… it removes ALL records after the last duplicate it finds in our files. I know it's because of the size of my files, because I tested it with shorter files and had no issue. Our files are 5.3 million lines long, and we process 3-6 of these per day. We cannot split the files, as they're used for manufacturing and we don't want to do multi-step copy/pastes.

For reference, the built-in “Remove Consecutive Duplicates” command does exactly the same thing, removing ALL the records after the last duplicate…

Our file generators push out a file with 5.3 million records, and usually we only have 3-5 duplicates, so when we run this command, it may run into the last duplicate on line 500,000 and then delete everything afterwards.

Is there a way to allow larger file sizes to process successfully? TextFX does it perfectly, and I can use that with a 32 bit Notepad++, but I’d like to keep the 64 bit if possible.

TextFX does it perfectly, and I can use that with a 32 bit Notepad++, but I’d like to keep the 64 bit if possible.

The two are not mutually exclusive. You could leave 64-bit as your installed Notepad++, but download a portable (zip edition) of 32-bit Notepad++, unzipped into some other directory (not in the Program Files (x86) hierarchy; I take inspiration from the Linux world and put my outside-of-Program-Files programs in c:\usr\local\apps\____). You could then use the 64-bit for normal, everyday usage, but when you want to remove duplicates, just launch your 32-bit instance instead.

Sadly, there are some limitations where the regular expression engine is concerned…but you’ve already discovered this so I’m adding nothing new…

the built in “Remove Consecutive Duplicates” does exactly the same thing

This built-in command uses a regular-expression replacement operation as well (albeit one coded in C++, not user-supplied), so the same outcome makes sense.

Is there a way to allow larger file sizes to process successfully?

If I were doing it, I’d turn to an external tool. Since no existing tool that does exactly this comes to mind, I’d likely roll my own. I’d probably try Python first, but if that wasn’t fast enough I’d turn to C. Maybe in your case, sticking with TextFX is the best option.
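For instance, a roll-your-own Python tool could stream the file line by line, so a 5.3-million-line file never has to sit in memory all at once. A minimal sketch, run outside Notepad++ (the script name and the consecutive-only behavior are my assumptions):

import sys

def remove_consecutive_duplicates(src_path, dst_path):
    # copy src to dst, writing each line only if it differs from the previous one
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        previous = None
        for line in src:
            if line != previous:
                dst.write(line)
            previous = line

if __name__ == '__main__':
    # usage: python dedup.py input.txt output.txt
    remove_consecutive_duplicates(sys.argv[1], sys.argv[2])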

I did a quick test. Creating 6_000_100 lines takes much longer than removing the duplicates.

def remove_duplicates():
    unique_lines = set()    # every distinct line seen so far
    duplicates = []         # line numbers of repeated lines
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line not in unique_lines:
            unique_lines.add(line)
        else:
            duplicates.append(line_num)
    # delete from the bottom up so the remaining line numbers stay valid
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)

Which took 5.8 seconds in my environment. :-)
Note, this script removes ANY duplicate, not only the consecutive ones.
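If consecutive-only behavior is what you need, a minimal variant of the same idea could compare each line just to the one directly before it (the function name below is mine, not part of the original script, and it must be called the same way):

def remove_consecutive_duplicates_only():
    duplicates = []
    previous = None
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line == previous:    # only flag a line that repeats its predecessor
            duplicates.append(line_num)
        previous = line
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)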

Sorry, yes, I only posted the function itself - it must be called, of course :-)

def remove_duplicates():
    unique_lines = set()
    duplicates = []
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line not in unique_lines:
            unique_lines.add(line)
        else:
            duplicates.append(line_num)
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)

remove_duplicates()

Now, as the native Remove Consecutive Duplicate Lines N++ option does not take any selection into account, @ekopalypse, would it be easy enough to just consider the current main selection? If so, it could be an interesting enhancement of this native N++ command ;-))

def remove_duplicates():
    unselect_after_removal = False
    unique_lines = set()
    duplicates = []
    if editor.getSelectionEmpty():
        # no selection: scan the whole document
        for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    else:
        # a selection exists: scan only the selected line range
        unselect_after_removal = True
        start, end = editor.getUserLineSelection()
        for line_num in range(start, end + 1):
            line = editor.getLine(line_num)
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)
    if unselect_after_removal:
        editor.clearSelections()

This time, I was warned ;-)) So, adding the part below to your script allowed me to appreciate your last version:

editor.beginUndoAction()    # group the whole removal into a single undo step
remove_duplicates()
editor.endUndoAction()

If no main selection is present, all of the file contents are processed; otherwise, only the selected range is. Nice, indeed ;-))

I built a sample file containing roughly 497,000 lines, all different, and I added a block of 15 lines, 128 times, each block separated from the next by between 800 and 7,500 lines, which finally gave me a file of almost 500,000 lines. On my outdated laptop (Win XP, 1 GB of RAM!), no problem: it took about 31 s to process!

BR

guy038

P.S. :

Yes, I know! Why can’t he buy a recent laptop, with a 250 GB SSD for Windows 10, 8 GB of SDRAM, a 2 TB SATA HD and a 2 GB NVIDIA GeForce, like everybody else? Well, I think I’m about to reach the tipping point ;-))

Note that I did not emphasize this laptop’s characteristics, as I’m not quite certain they are all accurate!!

hehe :-)
31 s is a long time - it would be interesting to see your results using this little test.
I assume the RAM might be the bottleneck; would you mind running the test
with 50,000 instead of 500,000 lines as well?
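The test script itself did not survive the quoting here; what follows is a minimal sketch of what such a PythonScript timing test might look like (the prompt text, default line counts, and the create_test_data helper are my assumptions):

import time
import random

def create_test_data(num_lines=500000, num_duplicates=10):
    # fill the current document with unique lines, then sprinkle in a few duplicates
    lines = ['line {}'.format(i) for i in range(num_lines)]
    for _ in range(num_duplicates):
        lines.insert(random.randrange(num_lines), lines[random.randrange(num_lines)])
    editor.setText('\r\n'.join(lines))

while True:
    choice = notepad.prompt('1 = create test data, 2 = remove duplicates, 3 = quit',
                            'duplicate removal timing', '')
    if choice is None or choice == '3':    # Cancel or action 3 ends the loop
        break
    start = time.time()
    if choice == '1':
        create_test_data()
    elif choice == '2':
        remove_duplicates()    # the function defined earlier in the thread
    print('took {:.2f} seconds'.format(time.time() - start))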

The last action, number 3, or pressing the Cancel button breaks the loop.
Btw. I get 0.33 seconds removing 10 duplicates from 500,000 lines on my machine.
First-gen i5 2500, but with plenty of RAM - 16 GB :-)