A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (the duplicates are useless in my case, as the file is a CSV-like data table).

What I need is to remove all the repetitions while (preferably, though this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result, each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won't be neighbours), only one of them is to be left.

I have written a program in Scala (consider it Java if you don't know Scala) to implement this. But maybe there are native C-written tools able to do this faster?

This will destroy your ordering, but have you tried sort -u? I have no idea how, or whether, it can run on such a massive file.
– msys, Jan 27 '12 at 15:57

C is often not significantly faster than Java, and if you're running your program (in-order) now, there's a fair chance it will finish before you get an answer here, implement it, and that implementation finishes running; out of order, sort -u will probably be faster.
– Kevin, Jan 27 '12 at 15:59

6 Answers

This is great! It worked flawlessly on my Solaris box. I only tested it on a small file, though.
– rahmu, Jan 27 '12 at 16:50

Just tried this on a 2G file and it took three minutes on my notebook. Not bad. I also tried uniq filename | awk '!seen[$0]++', but it wasn't any faster.
– mgjk, Jan 27 '12 at 19:27

This command not only seems to work but is more than 100 times faster than my program. Thanks.
– Ivan, Jan 27 '12 at 21:40

This is surprisingly faster than a more verbose awk version using two array lookups (shown as an expanded explanation in Gilles's answer): 0m36.132s vs 0m49.958s for 50 million lines. I thought the bottleneck would be the I/O, but the extra array lookup is what costs: 1 million elements in the array seems to make a rather significant dent.
– Peter.O, Jan 28 '12 at 5:53

By the way, @larsmans helped me find out what was wrong with that program of mine, so don't think Scala/Java is that slow. The discussion is here: stackoverflow.com/q/9045042/274627
– Ivan, Jan 28 '12 at 18:18

There's a simple (which is not to say obvious) method using standard utilities. It doesn't require much memory except to run sort, which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.

If all lines begin with a non-whitespace character, the default delimiters are enough (nl emits a tab by default, and cut splits on tabs by default), so you can dispense with explicit separator options:

<input nl | sort -k 2 -u | sort -k 1n | cut -f 2- >output
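
Stage by stage, the same pipeline reads (comments added for clarity):

<input nl |            # prefix each line with its line number and a tab
  sort -k 2 -u |       # sort on the content (field 2 to end of line), keeping one copy of each
  sort -k 1n |         # sort numerically on the line number to restore the original order
  cut -f 2- >output    # strip the line numbers (cut splits on tabs by default)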

For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there's a very concise awk script for that (already posted by enzotib):

<input awk '!seen[$0]++'

Less concisely: !seen[$0] {print} {seen[$0] += 1}, i.e. print the current line if it hasn't been seen yet, then increment the seen counter for this line (uninitialized variables or array elements have the numerical value 0).
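
A quick demonstration of the first-occurrence behaviour:

$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c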

For long lines, you can save memory by keeping only a non-spoofable checksum (e.g. a cryptographic digest) of each line. For example, using SHA-1, you only need 20 bytes plus a constant overhead per line. But computing digests is rather slow; this method will only win if you have a fast CPU (especially one with a hardware accelerator for computing digests), not a lot of memory relative to the size of the file, and sufficiently long lines. No basic utility lets you compute a checksum for each line; you'd have to bear the interpretation overhead of Perl/Python/Ruby/… or write a dedicated compiled program.
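
As a sketch of the digest approach (not in the original answer), assuming the Digest::SHA module that ships with modern Perl, this one-liner keeps only the 20-byte binary SHA-1 of each line as a hash key:

$ perl -MDigest::SHA=sha1 -ne 'print unless $seen{sha1($_)}++' input_file > output_file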

Assuming you can afford to keep the equivalent of the de-duplicated file in memory (if your data is indeed duplicated by a factor of 100, that should be about 20 MiB plus overhead), you can do this very easily with Perl.

$ perl -ne 'print unless $dup{$_}++;' input_file > output_file

This preserves the order too.

You could extract the number of occurrences of each line from the %dup hash if you so wished, as an added free bonus.
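
For example (a sketch along the same lines, not part of the original answer), the counts could be sent to stderr at the end of the run so they don't mix with the de-duplicated output:

$ perl -ne 'print unless $dup{$_}++; END { print STDERR "$dup{$_}\t$_" for sort keys %dup }' input_file > output_file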

If you prefer awk, this should do it too (same logic as the perl version, same ordering, same data gathered in the dup variable):
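
$ awk '!dup[$0]++' input_file > output_file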

Not as fast as the awk command in other answers, but conceptually simple!
– Johann, Mar 31 at 23:11

@Johann I am doing this pretty often on files with hundreds of thousands (even millions) of short newline-terminated strings. I get the results pretty quickly for the experiments I am doing. It matters more when used in scripts which are run again and again; the savings in time can be considerable.
– xeon, Mar 31 at 23:13

This causes the entire file to be slurped into memory and may not be a good fit for the OP's problem. It's also not guaranteed to retain order.
– 1_CR, Sep 15 '13 at 14:50

Thanks for the suggestion. I've just been learning Python and tried this for learning purposes. :)
– Rahul Patil, Sep 15 '13 at 19:52

Here's a Python 2.7 version that is not a one-liner but (succinctly) returns unique lines, preserving order, without either loading the entire file into memory or creating a single gigantic string to feed to print.
– 1_CR, Sep 16 '13 at 16:37