Unix File Splitting

Hi Everyone,
My file is a fixed-delimited file. I am trying to split 11 lakh records into 4 files.
I have tried split -l and split -b.
The issue I am facing is performance: when I used the split -l command, it took 7 minutes to split all the records.
When I tried split -b it finished in 15 seconds, but it did not split on proper line boundaries.

Can something be done so that this process completes in 15-30 seconds?

I'm not sure whether awk could handle a long line with something like 297 fields. Maybe GNU awk could, but that might not be part of every system. I recall a limitation of around 180 fields with HP-UX's awk version, but I'm sure Paul will give some more detail.

I don't think there is a limit on RedHat GNU awk. I have a production one that splits 407 fields from an .xlsx, but I think I tested it and found at least 200 was good.

If you run into a limit, there is a simple work-around anyway. Use an awk array.

Splitting is fast where FS is a simple character. It can get much slower where you also try to remove blanks from the fields, e.g. FS = "[ ]*\t[ ]*"; it is actually faster to use FS = "\t" and then sub() the blanks off each field separately.

{ split ($0, V); }

Then $1 is in V[1], and $7777 is in V[7777]. As far as I know, there is no limit on this method.
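
A minimal sketch of that work-around, combined with the FS = "\t" / sub() trimming mentioned above (MyInput and the tab-separated layout are my assumptions, not from the original post):

awk 'BEGIN { FS = "\t" }
{
    n = split($0, V)                 # V[1]..V[n]; no fixed limit on field count
    for (i = 1; i <= n; i++) {
        sub(/^ */, "", V[i])         # strip leading blanks from each field
        sub(/ *$/, "", V[i])         # strip trailing blanks from each field
    }
    print n, V[1], V[n]              # e.g. field count, first and last field
}' MyInput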

Solaris awk seemed to choke on input lines > 6K (actually 6044 bytes). GNU seems good with several million bytes on one line.

Didn't read the earlier posts first. awk does not care how many fields a line has, until you reference a field beyond where it runs out of indexes. I think, even for non-GNU awk:

(a) It only sets up a field index for a line if and when you first reference a field on that line.

(b) It indexes the fields up to its limit correctly (if that version has a limit) and only objects when you go past the limit.

First test for the poster: time wc -l filename.txt.

split can hardly be faster than wc, so that gives you a time to compare with split.

I cannot believe that any combination of tail and head will be any faster than split.

Are you using a proper split? If you are in BusyBox or a similar utility, it will do no optimisation at all. In that case awk is probably faster than wc or split.

I would advise against running wc on the file first - it takes a lot of time. It is quicker to pick a maximum line count per output file and not bother about an uneven split. Something like:

awk ' { print > ("MyOutput." (1 + int(NR / 200000)) ".txt") } ' MyInput

should split 11 lakh records into 6 files named like MyOutput.3.txt in maybe 15 seconds.

This is not good for more than about 12 output files, because it does not close files as it goes. For more files, you would need to close each file after it had enough lines. But that is a slower test and stops it being a simple one-liner.
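
For the record, a rough sketch of the close-as-you-go variant just mentioned (same file naming as above; this is the slower, non-one-liner version):

awk '{
    file = "MyOutput." (1 + int(NR / 200000)) ".txt"
    if (file != last) {               # moved on to the next chunk
        if (last != "") close(last)   # release the previous file handle
        last = file
    }
    print > file
}' MyInput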

That contradicts my 407 fields known to work. In fact, I believe I tested GNU awk on 2,000 fields, and I am quite happy to believe it is limited only by available memory. Even an average programmer would declare something like:

typedef struct { int f_sz; int f_offset; } Field_ref;

and then realloc a bigger array of Field_ref each time a line showed up that had too many fields for the current malloc. Awk was definitely written by first-class gurus.

Same for line lengths: I believe I tested 6 MB lines in GNU awk - no point in knowing more.

If you run out of line length in non-GNU awks, the trick is to define a new record separator. For example, suppose you have 50MB of XML, which is valid without any newlines at all. All XML tags end with ">", so they are very frequent in XML. Just start your awk with:

BEGIN { RS = ">"; }

and every line comes in as (optionally) one text field and one tag, with the ">" removed (because it is no longer part of the line; it is a Record Separator). That makes the text really easy to process: it is all short lines, one XML token per line.
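
A tiny illustration of the RS trick (big.xml is a made-up name; this just prints one token per output line, with the ">" put back for readability):

awk 'BEGIN { RS = ">" }
NF { gsub(/\n/, " "); print $0 ">" }' big.xml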

Has anybody got the time to compare the shell while read ... done loop, the head/tail, the split, and the awk?

In particular, cat the 4 parts and diff with the whole file. I think the while read version trashes all whitespace. And I would expect it to be between 100 and 1000 times slower than awk. For starters, the >> opens, appends to, and closes a file over a million times.

The (NR % 4) is very tidy (@Tiger being a great lateral thinker) and it balances the file sizes, but it does put each consecutive set of 4 lines into 4 distinct files. I believe these will be buffered in awk so that each file is blocked optimally and the disk does not thrash on output. However, it seems likely that when a file is later read, it will have its blocks separated on the disk more (by the blocks belonging to the other 3 files) and may not read as fast as the NR / 200000 version.

We seem to have got you roughly a 3x speed-up, from 7 minutes to 2.5 minutes. I don't know how much faster you might expect this to go.

I would be interested in 3 timed runs to see what your system throughput can achieve.

#.. How fast can we read and do minimal processing.
time wc -l < InFile
#.. How fast can we copy a file physically.
time dd < InFile > OutFile bs=262144
#.. How fast can we copy a file logically.
time cat < InFile > OutFile

If those are taking a few seconds, then you are short of CPU. If they take a minute or more, you are short of I/O bandwidth. You can't expect to do a lot of additional work on the data and get it to run in comparable times to these examples.

It would be useful to know your CPU MHz, chipset, OS and so on. Check "top" and see if you are using high CPU, have a lot of I/O waits, use a lot of swap, and so on.

I don't see a big problem in having a run taking a few minutes. What's the deadline? I have bunches of stuff that runs between 20 mins and 20 hours, and some in the pipeline that will run for 40 days. You have to define performance as one aspect of your solution, and you have to manage end-user expectation, and you have to be able to schedule work in cron or at so you can background a workstream to run unattended. But you can't work miracles just because the user does not have a grasp on reality.

I would expect Informatica to take an order of magnitude more to insert these files with indexing and conversions to binary types anyway. There is no point super-tuning the split operation to run in 20 seconds if the data load still takes 15 minutes - your total saving is still then only 20%.

Why can't you modify the Extract part of the system to create smaller multiple files in the first place?

OK. If you want to balance the files as closely as possible, then you need to wc -l the file.

You say around 11 lakh; as I understand it, 1 lakh = 100,000, so 1.1 million records.

Suppose the true count is 11,276,153 lines. Divide by 4 and round up slightly, maybe to 2,819,100, and use (NR / 2819100). File 1 gets lines 1 to 2,819,099. Then the integer division flips over to 1 and file 2 starts with record 2,819,100. File 4 gets 2,818,854 lines - no big difference in file size.
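
A hedged sketch of that balanced split, putting the wc count and the divisor together (InFile and the Part.N.txt names are illustrative, not from the thread):

#.. One pass of wc to get the count, one pass of awk to fan out the lines.
LINES=$( wc -l < InFile )
PART=$(( (LINES + 3) / 4 ))          # a quarter of the file, rounded up
awk -v part="$PART" '{ print > ("Part." (1 + int((NR - 1) / part)) ".txt") }' InFile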

From my tests it follows that:
- split -l needs about the same time as split -b
- standard awk performs at about the same level, but with a limit on the total number of fields per row; on HP-UX the maximum is 199
- GNU awk is about twice as fast, with no problems with the number of fields
So my only recommendation is GNU awk. You can try the script below, where you can change the MAX variable to the number of output files you want.
In my test (a 360 MB input file with 450,000 records of 261 fields), the wc command consumed almost one third of the total time needed!

That's probably a slow solution. sed has to read the whole input file 4 times. OK, you ran it in 4 background processes, but you probably have only two CPUs (dual core), one memory bus, one SCSI driver, and one set of disk heads. So it is not going to run 4 concurrent scans. More likely, it will be jumping all over the disk and run slower than 4 distinct sequential processes.

The arithmetic in things like 3L/4 is not valid shell. I believe the poster intended you to do the arithmetic on a hand-held calculator and plug the numbers, hard-coded, into the script. That means every time you run this you will have to redo the wc and then edit the script. That's not quality software, however fast it runs.

There is a shell syntax to do the arithmetic and inject the results into the parameters, but it needs to be enclosed in double quotes for the substitutions to happen. Like:

sed -n "1, $(( 3 * L / 4 ))" ...

and it is way too easy to get an off-by-one error in the rounding and drop a few lines.
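
Putting that together, this is roughly what the four-pass sed version looks like with the shell doing the arithmetic (still the approach criticised above, since it reads the file four times; InFile and Part1..Part4 are made-up names):

L=$( wc -l < InFile )
Q=$(( (L + 3) / 4 ))                       # quarter of the line count, rounded up
sed -n "1,${Q}p"                        InFile > Part1
sed -n "$(( Q + 1 )),$(( 2 * Q ))p"     InFile > Part2
sed -n "$(( 2 * Q + 1 )),$(( 3 * Q ))p" InFile > Part3
sed -n "$(( 3 * Q + 1 )),\$p"           InFile > Part4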

The advantage of my earlier awk script is that every line goes to some file. Nothing can ever get dropped. And it only reads the input once. I believe you won't find a quicker or more robust solution, unless you write it in C yourself. Even then, I think awk does its own optimised I/O and might just still outperform any plain C you write using getchar or fgets.

You have been hacking this thing for most of a week now. There is no benefit in achieving ultimate performance if you are going to slip your project a week in doing so. You could have run this hundreds of times by now.

I think I am quoting C.A.R. (Tony) Hoare, who said "95% of optimisation is premature and therefore wasted." He should know - he invented Quicksort.

It must be precisely as Paul showed.
I bet those 15 seconds are just the time for the wc command.
One command alone,

head -275000 InputFile

takes almost the same time (275,000 is one quarter of 11 lakh).
Can you verify again that the created files are not empty? Also check their creation times. It would be good to see the output of ls -l for the input file and all 4 parts.

OK, if that is your script and your results, you have what you want. I think a minute to process 1 GB is acceptable.

I am astonished to find you are on Solaris and ksh. I might have expected some smart Red Hat bash to accept expr $foo/4 all in one word, or x=x+1 without any $, but Solaris? It's stuck in the 1980s.

In particular, what if I actually wanted to assign the text "here+1" to a variable? I really don't expect an assignment to attempt arithmetic evaluation unless I tell it to with $(( )) or let ...
I think that might break some old scripts, maybe ones that put variable text like "-mtime +7" into a find command later.

Yes, your last result looks OK. It shows about 53 seconds for the whole task, which is appreciably more than the 15-20 seconds you reported before.
For comparison, I ran your last script on my machine and got about 90 seconds. That could be lowered by some 30-50% by using one of the gawk programs above, which could get you to 30-40 seconds on your Sun.

If you have a solution that works reliably, I would stick with it and not try to save a few seconds by further modifications.

You are obviously eager to learn, so here are some possible alternatives to your style that I might consider. Not that you are wrong, but maybe these will give you more options next time.

1. With: wc -l $wholefile | read lines file

First, wc does not print a filename when it reads standard input, so you do not have to read and discard the filename:

wc -l < $wholefile | read lines

Second, this does not work in bash: the pipeline gets run in a sub-process and the variable "lines" is then in a local scope and is inaccessible to the main script.

I would probably use: LINES=$( wc -l < $wholefile )

Because all shell and (most) external commands are lowercase, I use uppercase for my own variables, and MixedCase for my own shell functions.

2. With: part3="`expr 2*$part2`"

expr is an external process, is quite picky about its arguments, and has poor diagnostic messages (as written, 2*$part2 is passed as a single word with no spaces, so expr will normally just echo it back rather than multiply). Ksh and bash have a built-in calculator which is faster, more helpful, and has nicer syntax.

I also never use back-ticks. They are hard to see in the code, they do not nest like brackets do, and they can run extra sub-processes. The modern $( ... ) syntax is better.

I would use: part3="$(( 2 * part2 ))"

3. With: part3=part3+1

I would probably do the increment in the substitution: that is, omit part3=part3+1 and write the tail as tail +$(( part3 + 1 )).

The reason for this is that the +1 offset is a characteristic of the tail command option, and not really of the line numbering itself.

4. I'm not clear why you use head | tail for the second file. I would use tail | head in the second part just like the third part.

Firstly, just to be consistent. It saves having to think about the same issue in two different ways.

Second, I think you are taking a performance hit with that tail -n option. Tail cannot know how much of the file is left until it hits the end, so it has to store it all somewhere and count backwards. But tail +n and head -n both work from the top of the file, and do not need to buffer anything.
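
For consistency, a rough sketch of what the all-tail|head version might look like, assuming part1/part2/part3 hold the cumulative line counts at the quarter marks (names borrowed from the script under discussion; Out1..Out4 and InFile are made up). Old Solaris tail spells the option tail +N rather than tail -n +N.

head -n "$part1" InFile > Out1
tail -n +"$(( part1 + 1 ))" InFile | head -n "$(( part2 - part1 ))" > Out2
tail -n +"$(( part2 + 1 ))" InFile | head -n "$(( part3 - part2 ))" > Out3
tail -n +"$(( part3 + 1 ))" InFile > Out4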

The whole point of the awk version I originally posted is that it can write multiple output files in one pass. So it only ever has to read the input file once.

What you have there is an awk wrapped inside a wc and a 4-way loop, forcing it to read the whole file five times, evaluate a complex range expression for every line, and output just one file in each pass over the input. That just cripples awk for no reason.

@TigerPeng posted an amazing awk with (NR % 4) that saves you even knowing how many lines are in the file, so no need for a fifth pass of the file with wc -l. His awk spreads the lines evenly in a round-robin. So it reads the whole input once instead of 5 times.

OK, I made a blunder. But I thought you were sufficiently advanced to compare this with your previous examples and see what was wrong.

You have to work through these posts and understand the purpose of each post and any dangers they represent. You can't really take anything on a forum as being completely tested or reliable or applicable to your machine. But yes, I messed this up royally.

The issue is that none of the lines that are meant to read the file actually do so. There is no filename, so each pipeline reads stdin from the terminal and waits for you to type in the data.

It should look more like this. But I notice my post did the wc on $wholefile, and you chose to change this to the actual filename in full.

Paul,
Please could you explain more about what is wrong with the gawk script? I think the input file is read by awk only once!
The idea of the script was to save time by excluding calculations from the big loop inside awk. That's why all the positions needed for splitting the file are calculated before awk is run.
In my test the script runs slightly better than your one-liner (with minor corrections to get comparable results):

I believe that reads the input exactly twice (once to get size, once to fan out data).

@Tiger put this up 4 days ago.

If the order of lines is not important, change (NR / 200000) to (NR % 4).

That is smart because it does not need to know the total number of lines in advance. It automatically evens out the lines, provided:

(a) There is no logical connection between separate lines.

(b) There is no performance impact from the lines no longer being in (possibly) ascending key order (I find this matters for bulk loading in Ingres).
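
For reference, a minimal sketch of that round-robin version as I read it (file names follow the earlier one-liner; output goes to MyOutput.0.txt through MyOutput.3.txt):

awk '{ print > ("MyOutput." (NR % 4) ".txt") }' MyInput

Only four output files are ever open, so no close() juggling is needed and no line count is required in advance.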

OK, I'm an idiot. I did not read your complex gawk script carefully enough. It constructs a 4-line awk script and does indeed only read the input once in wc and once in gawk. Very subtle, and knocked me off my soapbox nicely.

I do have reservations about awk ranges, either numbered or patterned. It seemed to me that each range operates independently - if you have distinct test types some lines can get output multiple times, once per independent condition.

That can't happen here, because the tests are mutually exclusive. But it suggests to me that each range test is a permanent object that marks True when the first test is satisfied and False after the second test is satisfied. And that might indicate that range tests could be relatively expensive.

I would probably try the effect of a series of simpler tests like:
NR <= $HI { print > "File.$NO"; next }
with the last rule being an unconditional print.
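
A hedged sketch of how that generated-test version might be wired up (H1..H3 and the Part.N names are hypothetical; the boundaries would come from the wc -l count as discussed above):

# H1, H2, H3 are the cumulative quarter boundaries computed in the shell.
awk "
    NR <= $H1 { print > \"Part.1\"; next }
    NR <= $H2 { print > \"Part.2\"; next }
    NR <= $H3 { print > \"Part.3\"; next }
              { print > \"Part.4\" }
" InFile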

I still believe a bullet-proof one-liner is worth more than rubies. If it is too slow to watch, go get coffee while it runs.

So my script worked, and in a short period of time. I am glad to hear that, although I wish it could have been in the 15-30 second timeframe! I knew it would generate the files properly, as I tested it myself before posting.

Paul, in regard to your post: it doesn't matter that "wc -l < $wholefile | read lines" doesn't work in bash if you are using #!/usr/bin/ksh at the top of the script. Isn't ksh available on every platform? Also, if you loop through a file line by line you increase the run time by a huge amount, in my experience.

I like your comment about part3="$(( 2 * part2 ))" and tail +$(( part3 + 1 ))... I do like the backticks but you have a good point about the subprocess thing, I suppose. I can be a bit of a hack but my scripts work even if they are a bit wordy. I tried to put the increment in the tail but was having a heck of a time and I didn't have a lot of time to play with it so, hence, the increment kludge.

In 4, "I'm not clear why you use head | tail for the second file." I used head | tail because I am interested in working on smaller parts of the file and the second part is in the first half (instead of working with the last three quarters of it). I figured that would be quicker or at least more efficient.

In conclusion, I am generally impressed with those who write awk scripts, as I still have not learned this tool (I have wanted to learn it for decades), but I can always get a script that works the old-fashioned way! And my shell scripting can be relatively involved, with functions/procedures and structure. But I too like seeing a good awk command that does the work in a tenth of the lines.

Sorry, I forgot to acknowledge this was your original posting. With over 60 posts, I have been losing my place in this topic.

Agreed the shebang takes care of the bash/ksh issues perfectly. The treatment of variables and subshells with pipes is (I feel) the most annoying demerit in Bash. I know people copy snippets of code from these forums, and the wail of "I did command | while read do ... done, and my variable is not getting set" comes up often enough to need a reminder of the issue.

Agreed, shell reading of whole large files (or pipelines) line by line is a terrible performance hit. In this case, "large" means more than about 60 lines. And shell read also mangles whitespace unless you are very careful with IFS and quoting.

The way I would look at the head/tail versus tail/head goes like this. Suppose we have a 100-line file and we want the second quadrant (lines 26-50):

Because of variable-length lines, no Unix utility can seek directly to a specific line number. So some process must read from line 1 of the file up to the last line you want from that pass. That goes into a pipe, and a second command clips out the unwanted part.

With head | tail: the head reads 50 lines and sends 50 lines plus EOF down the pipe. The tail is reading an unknown number of lines and has been told to emit the last 25. So it has to hold on to line 1 in case there are only 25 or fewer lines coming its way. It can only discard line 1 when it has stored line 26. It must always hold the last 25 lines (at least) until it sees EOF. I don't know if it does this in memory or on file or a mix (it could even gzip the lines in memory), but it has to do it somehow and it is expensive in memory or disk or performance.

With tail | head: the tail can just discard the first 25 lines because they are at the start of the file. It starts sending lines at line 26. (That's a performance gain: the first part of the data does not go through the pipe, and the head does not even get scheduled because it has nothing to read yet.)

When tail has sent lines 26-50 (and maybe a few more in the buffering) down the pipe, head decides it has done its job with -n 25. It closes the output file containing lines 26-50, and exits.

The next time tail writes to the pipe, it gets sent a SIGPIPE because there is no longer a reader for it. It can either trap the signal (if it has clean-up to do), or not trap it and get killed. Either way, it does not read the rest of the input file.

In my book that is a clear win. The first process reads only what is necessary (in either case). But less goes through the pipe, less is read by the second process, and there is no need for the second process to count lines backwards or to cache an unknown quantity of data.
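
To make that concrete, the two pipelines for the second quadrant of the 100-line example would be (a sketch; InFile is illustrative):

head -n 50 InFile | tail -n 25      # tail must buffer the last 25 lines until it sees EOF
tail -n +26 InFile | head -n 25     # nothing buffered; head exits after 25 lines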

I pretty much live my working life in Awk. I am at least 10 times more productive in awk than C. And all the things awk does so well (field management, hash tables, patterns) are just the things that are mind-numbingly tedious and error-prone otherwise. I will use awk to calculate 2 + 2, and I will use awk to run a 16-file consistency check on our real-time database, and I will be happy both ways.

I have not got around to using the bit operators in awk yet. I'd be interested to see what gawk --version tells us.

It's conceivable that "and (NR, 3)" is valid syntax in an old awk. The and would then just be an undefined variable (valid, assumed null or zero), and (NR, 3) might be treated as in C (both parts evaluated, second one returns result), or as a 2-D array index.

However, the diagnostic "It's not working" is disappointing. Does it output anything to stderr or stdout? Does it create any file? Does it run instantly, or long enough to read the file but ignore the data? No Unix command should just do nothing unless you feed it /dev/null or a dumb pattern.

GNU awk (gawk) has a bunch of extensions - extra statements and functions that are not in the standard awk on AIX or Solaris.

The "logical" operators - and, or, xor and a few others - are only in GNU awk. So Solaris will bomb on them. It would help if you could post things like syntax errors just as they are output.

Solaris persists in shipping a very old version of awk as /bin/awk. On a Solaris box, if you want a proper standard awk, you have to call nawk (new awk). It is better, but it is still not gawk (for example, no in-memory sorting).

However, on Solaris, I would expect nawk to run faster than awk and to have better features. If you have something that works in awk on Solaris, you might try it with nawk and see whether it is faster.

I would expect awk/nawk/gawk to be reasonably consistent in performance across all systems, except for reasonable scaling by CPU power and I/O hardware. That's because it is a very stable source - vendors do not mess with it much and it uses few of their libraries. For example, it generally uses its own pattern library, not the local regexp. And it optimises its own memory and I/O buffering.

Other commands written in C often use standard vendor libraries which are optimised for application development. They probably don't have all the polishing that old-style C Unix gurus had the time to do.

I once found a wc -l that was so slow, it could be beaten by awk discarding the whole file:
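
Presumably the comparison was along these lines (my guess at the shape of it, not the original commands):

time wc -l < InFile
time awk 'END { print NR }' < InFile     # awk reads and discards every line, printing only the count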