
Firstfire's GNU awk script (use gawk) has the benefit that GNU awk supports sorting: with just two added statements (an asorti() call to sort the array keys into a new array, and picking each key from that array into a temporary variable in the loop) you can have the output sorted.
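For illustration, here is a minimal sketch of what that sorted END rule could look like, assuming the counts are kept in an array named T as described below; asorti() is a GNU awk extension, so this needs gawk:

    END {
        n = asorti(T, keys)        # sorted copy of T's indices into keys[1..n]
        for (i = 1; i <= n; i++) {
            k = keys[i]            # temporary variable holding the current key
            print k, T[k]
        }
    }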

First, we set the semicolon ';' as the field separator (-F';'). Now each line consists of three fields, of which we are interested in the first and third (e.g. $1="maaasw1", $3=1 for the first line). Using gensub() we strip the extra characters from $1 to obtain the key (e.g. "aaa"), which is then used as an index into the associative array T (that is, an array indexed by strings). With a statement like T["aaa"] += 1 we sum up the values corresponding to the key "aaa". Finally, in the END {} rule we print out the array T.
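Putting that together, a hedged reconstruction of the whole script might look like this; the exact gensub() pattern is my own assumption, chosen only so that "maaasw1" becomes "aaa" as in the example above:

    BEGIN { FS = ";" }                          # same effect as -F';'
    {
        # Illustrative pattern only: strip a leading "m" and a
        # trailing "sw<digits>", e.g. "maaasw1" -> "aaa".
        key = gensub(/^m|sw[0-9]+$/, "", "g", $1)
        T[key] += $3                            # accumulate field 3 per key
    }
    END {
        for (k in T)
            print k, T[k]                       # unsorted; see asorti() above
    }

Run it as gawk -f script.awk data.txt. Note that gensub() is another GNU awk extension; mawk or POSIX awk would need sub() and substr() instead.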

with the red bits varied. I generated an input file of roughly one and a half megabytes, over 10000 lines, with a varying number of tokens per line and varying line lengths, using only linefeeds as newlines. Here are the timing results (all in real time):

As you can see, mawk-1.3.3 is by far the fastest, but only when using a simple record separator. GNU gawk-3.1.8 is much more sensitive to the locale than to the record separator; the overhead is about 0.013 seconds per megabyte of input on my machine. You cannot really compare the relative changes in run time, since the work the script does will drastically affect them, and this one does no real work.
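If you want to see the locale effect on your own machine, a quick check (big.txt is just a placeholder name) is to time the same trivial script with and without forcing the C locale:

    time gawk -F';' '{ n++ } END { print n }' big.txt
    time LC_ALL=C gawk -F';' '{ n++ } END { print n }' big.txt

With LC_ALL=C, gawk can treat the input as single-byte data instead of doing multibyte character processing, which is where much of the locale overhead comes from.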

Simply picking the best awk variant for the task will yield a much bigger difference in run time.

The variance in mawk run times was quite a surprise to me. I'm running 64-bit binaries on x86-64, so the results might be different on 32-bit x86. Feel free to run your own benchmarks and post them here. I would be surprised if we saw any clear trends at all; small changes in versions, scripts, or environment variables will change the run time significantly, more or less chaotically.

If you need predictable, efficient run times, you need to make sure you use the proper algorithms. Python is not really suitable for this, because its I/O is slow. I personally avoid the C standard library too; it is quite slow in the cases where I/O throughput does matter, although much faster than Python. Perl is pretty fast, but I don't like the syntax, and compiled languages with efficient libraries should prove at least a little bit faster.

In my case, I use awk for these kinds of situations, because the scripts are easy to write and maintain, and I can make them robust enough that they won't choke on strange input.

In this thread, I suspect the original data comes from FileMaker or a similar application -- the data is not CSV, it's semicolon-separated values -- and such applications tend to use whatever newline convention they feel like. I think I've seen all four (LF, CRLF, CR, LFCR) in real-world files. A stray newline of a different convention in the middle of the output is not rare at all.
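One way to make a script tolerant of all of those at once: GNU awk allows the record separator RS to be a regular expression, so a small sketch would be:

    # gawk extension: RS as a regular expression. The two-character
    # sequences are listed explicitly so a CRLF (or LFCR) pair is
    # consumed as one separator, not two.
    BEGIN { RS = "\r\n|\n\r|\r|\n" }
    { print NR ": " $0 }               # each logical line is now one record

With that in place the rest of the script never sees the newline convention at all, which is exactly the kind of robustness I meant above.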