I have a 150GB XML file that I would like to shorten (i.e., truncate) to about 1GB. Is there a simple (bash or similar) command I can use, or do I have to go the programmatic route? (Editing it in vi or emacs is a nightmare even on big iron systems.)

(I am not particularly concerned about the loss of information. I want a shorter file so I can test a piece of software on it without waiting many hours for the answer.)

Since this is an XML file, which I assume contains a sequence with a great number of elements, you could also use an XML transformation language such as XQuery to filter out a certain number of these elements, which would have the advantage of producing valid XML.
– Aaron Jan 5 '18 at 17:31

Does the file still need to be valid XML when done?
– Joe Jan 5 '18 at 21:22

That will likely not result in a readable XML file when done.
– Joe Jan 5 '18 at 21:21

@Joe - OP did not request a readable file (nor did they say it could be unreadable). They did say that they did not care about loss of information. I would expect a new question from OP about how to fix said file.
– KevinDTimm Jan 5 '18 at 22:17

I know enough XML to fix it; I wrote the DTD for the format!
– adrianmcmenamin Jan 5 '18 at 23:12

Where possible, I'd use the truncate command as in John1024's answer. It's not a standard unix command, though, so you might some day find yourself unable to use it. In that case, dd can do an in-place truncation too.

dd's default behavior is to truncate the output file at the point where the copying ends, so you just give it a 0-length input file and tell it to start writing at the desired truncation point:

dd if=/dev/null of=filename bs=1048576 seek=1024

(This is not the same as the copy-and-truncate dd in multithr3at3d's answer.)

Note that I used 1048576 and 1024 because 1048576*1024 is the desired size. I avoided bs=1m because this is a "portability" answer, and classic dd only knows suffixes k, b, and w.
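To see the in-place truncation at a harmless scale, here is a minimal sketch that builds a 4 MiB scratch file and cuts it down to 1 MiB. The scratch file from `mktemp` and the variable names are assumptions for illustration only; on the real file you would use the command above with your own filename.

```shell
# Scratch demo: create a 4 MiB file, then truncate it in place to 1 MiB.
tmpfile=$(mktemp)
dd if=/dev/zero of="$tmpfile" bs=1048576 count=4 2>/dev/null
size_before=$(( $(wc -c < "$tmpfile") ))

# Empty input plus seek: dd copies nothing, then truncates the
# output file at bs * seek bytes (here 1048576 * 1).
dd if=/dev/null of="$tmpfile" bs=1048576 seek=1 2>/dev/null
size_after=$(( $(wc -c < "$tmpfile") ))

echo "before: $size_before bytes, after: $size_after bytes"
rm -f "$tmpfile"
```

Because the truncation overwrites the original, it is probably worth trying this on a copy or a scratch file first.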

For the general solution, you should probably note that the bs number multiplied by the seek number is the number of bytes to keep. Any two numbers that satisfy that constraint should work; e.g., bs=1073741824 seek=1 or bs=1 seek=1073741824. Or, since bs defaults to 512, seek=2097152 alone should also work. And with GNU dd you can use notation like 1K, 1M, 1G and 2M.
– G-Man Jan 5 '18 at 20:10
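The bs × seek arithmetic in the comment above can be checked with plain shell arithmetic. This is only a sanity check on the numbers, not a dd invocation; the variable names are made up for the sketch.

```shell
# Target size: 1 GiB in bytes.
target=$((1024 * 1024 * 1024))
all_match=1

# Each pair is "bs seek"; the last one uses dd's default bs of 512.
for pair in "1048576 1024" "1073741824 1" "1 1073741824" "512 2097152"; do
    set -- $pair                 # split the pair into $1 (bs) and $2 (seek)
    bytes=$(( $1 * $2 ))
    echo "bs=$1 seek=$2 keeps $bytes bytes"
    [ "$bytes" -eq "$target" ] || all_match=0
done
```

All four pairs describe the same cut-off point, so the choice between them is mostly about readability and portability.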

Putting aside whether this answers the question or not, this will scan the entire file, I believe, so it's much more efficient to use sed 1000000q (and a bit more compact, visually speaking).
– B Layer Feb 9 '18 at 14:48
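A small sketch of the `sed Nq` idea from the comment above, using a generated scratch file rather than the real data. Note that this approach cuts by line count rather than by bytes and writes the kept lines to standard output instead of truncating in place; the file and variable names are assumptions for the demo.

```shell
# Build a scratch file with 100000 numbered lines.
tmpfile=$(mktemp)
seq 1 100000 > "$tmpfile"

# "1000q" tells sed to quit after line 1000, so the rest of the
# input is never processed; the first 1000 lines go to stdout.
kept_lines=$(( $(sed 1000q "$tmpfile" | wc -l) ))
last_line=$(sed 1000q "$tmpfile" | tail -n 1)

echo "kept $kept_lines lines, last one is: $last_line"
rm -f "$tmpfile"
```

For the 150GB file you would redirect the output to a new, smaller file, choosing the line count so that the result lands near the size you want.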