I have a large number of files, some of which are very long. I would like to truncate them to a certain size if they are larger, by removing the end of the file, but I only want to remove whole lines. How can I do this? It feels like the kind of thing the Linux toolchain would handle, but I don't know the right command.

For example, say I have a 120,000 byte file with 300-byte lines and I'm trying to truncate it to 10,000 bytes. The first 33 lines should stay (9900 bytes) and the remainder should be cut. I don't want to cut at 10,000 bytes exactly, since that would leave a partial line.

Of course the files are of differing lengths and the lines are not all the same length.

Ideally the resulting files would be made slightly shorter rather than slightly longer (if the breakpoint is on a long line), but that's not too important; they could be a little longer if that's easier. I would like the changes to be made directly to the files (well, possibly the new file is copied elsewhere, the original deleted, and the new file moved, but that's the same from the user's POV). A solution that redirects data to a bunch of places and then back invites the possibility of corrupting the file, and I'd like to avoid that...

The sed approach is fine, but looping over all lines is not. If you know how many lines you want to keep (as an example, I use 99 here), you can do it like this:

sed -i '100,$ d' myfile.txt

Explanation: sed is a stream editor. With the -i option it edits the file directly ("in place") instead of just reading it and writing the result to standard output. 100,$ means "from line 100 to the end of the file", and the command d that follows stands, as you probably guessed, for "delete". So in short, the command means: "Delete all lines from line 100 to the end of myfile.txt". 100 is the first line to be deleted because you want to keep 99 lines.

Edit: If, on the other hand, these are log files where you want to keep e.g. the last 100 lines:

[ $(wc -l < myfile.txt) -gt 100 ]: do the following only if the file has more than 100 lines (reading from a redirect makes wc print only the count, without the file name)

$(( $(wc -l < myfile.txt) - 100 )): calculate the number of lines to delete (i.e. all lines of the file except the last 100 to keep)

1, $((..)) d: remove all lines from the first to the calculated line
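Put together, the pieces above might look like the following. This is only a sketch, assuming GNU sed (for -i); the demo file and the variable names keep and lines are made up for illustration and stand in for myfile.txt and the literal 100.

```shell
# Hypothetical demo file: 250 numbered lines (stand-in for myfile.txt).
file=$(mktemp)
seq 1 250 > "$file"

keep=100
lines=$(wc -l < "$file")   # redirect so wc prints only the count

# Only act if the file has more than $keep lines.
if [ "$lines" -gt "$keep" ]; then
    # Delete from line 1 through the last line that is not among the final $keep.
    sed -i "1,$((lines - keep)) d" "$file"
fi

head -n 1 "$file"   # first remaining line
```

Counting the lines once and reusing the value avoids running wc twice and keeps the test and the deletion consistent with each other.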

EDIT: as the question was just edited to give more details, I will include this additional information with my answer as well. Added facts are:

a specific size shall remain with the file (10,000 bytes)

each line has a specific size in bytes (300 bytes in the example)

From these data it is possible to calculate the number of lines to remain as "size_to_remain / linesize", which in the example would mean 33 lines. The shell term for the calculation: $((size_to_remain / linesize)) (at least on Linux using Bash, the result is an integer). The adjusted command now would read:
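A minimal sketch of such a command, assuming GNU sed and that every line really is linesize bytes long (newline included). The variable lines_to_keep and the generated demo file are hypothetical and only there to make the example runnable.

```shell
# Hypothetical demo file: 400 lines of exactly 300 bytes each
# (299 characters plus the newline), 120,000 bytes in total.
file=$(mktemp)
line=$(printf '%299s' '' | tr ' ' 'x')
yes "$line" | head -n 400 > "$file"

size_to_remain=10000
linesize=300

# Integer division: 10000 / 300 = 33 whole lines fit.
lines_to_keep=$((size_to_remain / linesize))

# Delete everything after the last line to keep (\$ is sed's "last line").
sed -i "$((lines_to_keep + 1)),\$ d" "$file"

wc -c < "$file"   # size after truncation, in bytes
```

Because the division rounds down, the result never exceeds size_to_remain as long as the assumed line size is correct.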

The OP wants to cut the file based on a certain byte size, not just length in terms of lines. I deleted my answer involving head -n.
– slhck♦ Jul 24 '12 at 19:55

@slhck Thank you for the notification. Yes, the OP just edited his question to make the intention clearer. As he has the means to calculate how many bytes each line has, my answer remains valid in principle -- he can calculate the number of lines to remain, and then use my approach to handle the files. Maybe I'll add a short remark on that to my answer.
– Izzy Jul 24 '12 at 22:08

No -- the sizes are not known in advance. That was an example. Each file will have a different size and lines are of irregular length. Some files don't need to be truncated at all.
– Charles Jul 25 '12 at 0:45

Oh, again... Well, some things are hard to explain clearly (too many facets). As for the files which need no truncation, that's probably based on file size? That can be covered. But if not even an average line size is known, this part gets hard -- I cannot think of an easy solution (without too much overhead) at the moment.
– Izzy Jul 25 '12 at 6:23

All I can come up with currently would involve e.g. taking the first n lines, calculating an average length from them, and using this value. Would that help you?
– Izzy Jul 25 '12 at 6:30
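
For lines of irregular length, as described in the comments above, averaging may not be necessary: one option is to count how many complete lines fit into the byte budget and cut the file right after them. The following is only a sketch, assuming GNU coreutils (head -c and truncate); the function name truncate_at_line and the demo file are hypothetical.

```shell
# truncate_at_line FILE MAX_BYTES   (hypothetical helper name)
# Shrinks FILE in place to at most MAX_BYTES, cutting only at a line boundary.
truncate_at_line() {
    file=$1
    max=$2
    size=$(wc -c < "$file")
    # Files already within the limit are left untouched.
    [ "$size" -le "$max" ] && return 0
    # Newlines within the first MAX_BYTES bytes = complete lines that fit.
    keep=$(head -c "$max" "$file" | wc -l)
    # Total byte length of those complete lines.
    bytes=$(head -n "$keep" "$file" | wc -c)
    # Cut the file at that offset; no temporary copy is written.
    truncate -s "$bytes" "$file"
}

# Demo on a hypothetical file with lines of different lengths.
demo=$(mktemp)
printf 'aaaa\nbb\ncccccc\ndd\n' > "$demo"   # 5 + 3 + 7 + 3 = 18 bytes
truncate_at_line "$demo" 10                 # only "aaaa" and "bb" fit (8 bytes)
cat "$demo"
```

This errs on the side of a slightly shorter file, and if even the first line is longer than the limit, the file ends up empty. To apply it to many files, a loop such as for f in *.log; do truncate_at_line "$f" 10000; done would do, with the glob and the limit adjusted to the files in question.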