In other words, the [ character indicates that a new file should begin. The entry headers (`[ entry* ]`, where * is an integer) always appear in numerical order, as consecutive integers from 1 to N (in my actual input file, N = 200001).

Is there any way to accomplish this automatic text-file splitting in bash? My actual input file, entry.txt, contains 200,001 entries.

11 Answers

This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].

EXPLANATION:

awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling gawk to match anything within square brackets, and save its match into the array k.

So, every time that regular expression matches, that is, for every header in your file, k[1] will hold the matched part of the line: "entry1", "entry2", …, "entryN".

Finally, we print each line into a file named after the current value of k[1] plus .txt, i.e. entry1.txt, entry2.txt, …, entryN.txt.

+1 nice. You do not need to match the entry: /^\[/ { name=$2 } should be enough.
– Thor, Aug 26 '12 at 1:51

Thanks @Thor. Your suggestion is correct for the described case, but it assumes there is never a space in the entry's name. That is why I used the example [ blahblah blah blah ] in my answer.
– terdon♦, Aug 26 '12 at 13:20

Ah I missed the bit about space separated entries. You could also accommodate those with FS, e.g. -F '\\[ | \\]'.
– Thor, Aug 26 '12 at 13:50
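For reference, a sketch of that FS variant (file name assumed): with "[ " and " ]" as field separators, $2 holds the whole entry name even when it contains spaces.

```shell
# Sketch: split fields on "[ " or " ]" so $2 is the full entry name.
# Assumes the file starts with a header line.
awk -F '\\[ | \\]' '/^\[/ { name = $2 } { print > (name ".txt") }' entry.txt
```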

@terdon I really like these short solutions; unfortunately, I usually fail to generalize them to my needs. Could you give me a hand? My file has lines starting with #S x, where x is a 1-, 2-, or 3-digit number. Just saving them into x.dat would suffice. I tried gawk '/^#S/{match($0, / [0-9]* /, k)} {print >k[1]".dat" }' myFile.txt and some variations of that.
– mikuszefski, Dec 15 '14 at 16:45

You'll end up with an extra empty file entry0.txt (containing the part before the first header).

Standard csplit lacks the {*} indefinite repeater and the -b option to specify the suffix format, so on other systems you'll have to count the number of sections first and rename the output files afterwards.
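The command isn't shown above; with GNU csplit it would look something like this (the file name and naming options are assumptions):

```shell
# GNU csplit sketch: split entry.txt at every "[ " header line.
# '{*}' (repeat indefinitely) and -b (suffix format) are GNU extensions;
# entry0.txt will hold whatever precedes the first header (possibly nothing).
csplit -f entry -b '%d.txt' entry.txt '/^\[ /' '{*}'
```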

This is a somewhat crude, but easily understood way to do it:
use grep -n '^\[ entry' FILENAME to get the line numbers where the [ entry* ] headers occur.
Use a combination of head and tail to extract the pieces in between.
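A sketch of that approach as a script (the file and header names are taken from the question; the loop logic is my own):

```shell
#!/bin/sh
# Crude but easy to follow: locate the header line numbers, then slice with head/tail.
starts=$(grep -n '^\[ entry' entry.txt | cut -d: -f1)
total=$(wc -l < entry.txt)
i=0
prev=''
for n in $starts $((total + 1)); do
    if [ -n "$prev" ]; then
        i=$((i + 1))
        # lines prev .. n-1 form one entry
        head -n $((n - 1)) entry.txt | tail -n $((n - prev)) > "entry$i.txt"
    fi
    prev=$n
done
```

Note that this reads the whole input once per entry, so it is far slower than the awk solutions on 200,001 entries.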

What about using awk with [ as the record separator and space as the field separator? This easily gives us the data to be written as $0, where we have to put back the stripped leading [, and the file name as $1. We then only have to handle the special case of the first record, which is empty. This gives us:
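A sketch of that record-separator version (file name assumed; each output file is closed to stay under the open-file limit):

```shell
# Sketch: "[" starts each record; $1 is the entry name and $0 is the rest of
# the section with the leading "[" stripped, so printf puts it back.
# NR > 1 skips the empty record before the first "[".
awk -v RS='[' 'NR > 1 { f = $1 ".txt"; printf "[%s", $0 > f; close(f) }' entry.txt
```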

terdon's answer works for me, but I needed to use gawk, not awk. The gawk manual (search for 'match(') explains that the array argument to match() is a gawk extension. It may depend on your distribution and your awk/nawk/gawk versions, but on my Ubuntu machine only gawk ran terdon's excellent answer.

Here's a Perl solution. This script detects the [ entryN ] lines and switches the output file accordingly; it doesn't validate, parse, or process the data in each section, it just prints each input line to the current output file.