
Tuesday, March 15, 2016

Recently, there was an HN thread about the implementation (not just use) of text editors. Someone mentioned that some editors, including vim, have problems opening large files. Various people suggested workarounds or solutions, some involving vim itself and some using other tools.

I commented that you can use the Unix command bfs (for big file scanner), if you have it on your system, to open the file read-only and then move around in it, like you can in an editor.

I also said that the Unix commands split and csplit can be used to split a large file into smaller chunks, edit the chunks as needed, and then combine the chunks back into a single file using the cat command.

This made me think of writing, just for fun, a simple version [1] of the split command in Python. So I did that, and then tested it some [2]. Seems to be working okay so far.
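The split.py program itself isn't shown in this post, but a minimal sketch of the idea (my own hypothetical version, not the actual code; output files named xaa, xab, ..., following POSIX split's default naming) might look like this:

```python
import sys

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def split_file(filename, lines_per_file):
    """Split a text file into chunks of lines_per_file lines each.

    Output files are named xaa, xab, ..., like POSIX split's defaults.
    (This is a sketch, not the actual split.py from the post.)
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be a positive integer")
    # Two-letter suffixes, aa through zz, generated lazily.
    suffixes = (a + b for a in ALPHABET for b in ALPHABET)
    out_fil = None
    with open(filename) as in_fil:
        for count, line in enumerate(in_fil):
            # Start a new output file every lines_per_file lines.
            if count % lines_per_file == 0:
                if out_fil:
                    out_fil.close()
                out_fil = open("x" + next(suffixes), "w")
            out_fil.write(line)
    if out_fil:
        out_fil.close()

if __name__ == "__main__" and len(sys.argv) >= 3:
    split_file(sys.argv[1], int(sys.argv[2]))
```

Like the real split.py described below, this sketch handles only text files and rejects non-positive values for lines_per_file.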

[1] I have not implemented the full functionality of the POSIX split command, only a subset, for now. May enhance it with a few command-line options, or more functionality, later, e.g. with the ability to split binary files. I've also not implemented the default size of 1000 lines, or the ability to take input from standard input if no filename is specified. (Both are easy.)

However, I am not sure whether the binary file splitting feature should be a part of split, or should be a separate command, considering the Unix philosophy of doing one thing and doing it well. Binary file splitting seems like it should be a separate task from text file splitting. Maybe it is a matter of opinion.

[2] I tested split.py with various valid and invalid values for the lines_per_file argument (such as -3, -2, -1, 0, 1, 2, 3, 10, 50, 100) on each of these input files:

No, I haven't tested it in that use case. My post does indicate (though perhaps implicitly, rather than stating it outright) that binary files are not supported, at least in this initial version. I know that the original Unix split (which I've linked to) supports both text and binary files in the same program. But I do not plan to do that; I plan to make that a separate program. See the point about the Unix philosophy in the post.

I will also make it explicit, via another comment, that this version does not support binary files.

I'll comment again in a short while with some more of my thoughts on this matter.

@Jay Wren: Here's the follow-up comment I said I would write (a bit long, sorry):

I reviewed my post just now, the part about the features that this split command supports, and what I said there about handling text vs. binary files. Also thought about that a bit more. Here's what I think:

Of course, it's possible to handle splitting of both text files and binary files in the same program. But I prefer not to do that. From my early days of doing database CRUD programming, where some colleagues used to have C functions with names like priamd (standing for PRInt, Add, Modify, Delete - I know :), I've not liked that style of intertwining the code for different operations in the same function, which some people did just to save a few lines of code, or to avoid writing separate functions for each operation. IMO that is a false economy; code is cleaner and more maintainable if one writes separate functions for each logically distinct operation. I think the same applies in this case. I can easily write another program called, say, bsplit (for binary split).
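A bsplit of that sort (hypothetical - it hasn't been written yet, as noted above) could be as simple as reading fixed-size byte chunks instead of lines:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def bsplit(filename, bytes_per_file):
    """Split a binary file into chunks of at most bytes_per_file bytes each.

    A sketch of the proposed (not yet written) bsplit program; output
    files named xaa, xab, ..., as with split.
    """
    if bytes_per_file < 1:
        raise ValueError("bytes_per_file must be a positive integer")
    suffixes = (a + b for a in ALPHABET for b in ALPHABET)
    with open(filename, "rb") as in_fil:
        while True:
            # read(n) returns at most n bytes; the last chunk may be shorter.
            chunk = in_fil.read(bytes_per_file)
            if not chunk:
                break
            with open("x" + next(suffixes), "wb") as out_fil:
                out_fil.write(chunk)
```

Because it splits on byte counts rather than newlines, this works the same on any file, with no risk from long lines.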

However, that still leaves us with the point you raised:

>It looks like it will crash or perform very poorly on a large binary file with no newline characters in it.

An interesting point. I do know that the built-in Python file object's readline() (which my for loop is calling under the hood) pulls an entire line into memory at once, however long it is, unless you read the file a character/byte at a time until EOF. Or rather, you can attempt to read such a line, but as you said, the program may either crash (by running out of memory) or perform very slowly because of allocating a lot of virtual memory (on disk). I haven't really experienced such a situation myself, so I'll experiment, try to simulate it, and see what happens to the current split program. But leaving that aside, there are a couple of things that can be done:

- inform users (via the program's documentation) to use it only on text files, and only ones in which no line is extremely long.

- it may be possible to write some heuristic code in the program, near the start, which detects very long lines and aborts with a warning, so the program does not hang or crash later. Something like what the Unix "file" command does - it reads the first few hundred bytes of any file given as argument, to try to detect what type of file it is, based on known file headers, typical content of C programs vs. shell scripts vs. other types of files, etc. This heuristic might have to read a few thousand bytes before deciding.

However, even that heuristic may not work if the file has many lines that are not very long at the start, and then somewhere in the middle or near the end there is a very long line or a huge chunk of binary data with no newline in it.
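Setting that caveat aside, a rough sketch of such a heuristic might look like this (the function name and the sample-size/line-length thresholds are my own arbitrary choices):

```python
def looks_safe_to_split(filename, sample_size=8192, max_line_len=4096):
    """Heuristic check, in the spirit of the Unix 'file' command: read only
    the first sample_size bytes, and refuse files whose sample contains a
    NUL byte (a common sign of binary data) or a run longer than
    max_line_len bytes with no newline in it.

    Being a heuristic based on a prefix of the file, it can be fooled by a
    long line or binary data later in the file.
    """
    with open(filename, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    longest = max(len(chunk) for chunk in sample.split(b"\n"))
    return longest <= max_line_len
```

split could call this once at startup and abort with a warning if it returns False, before attempting any line-by-line reading.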

So I guess the ultimate answer is that no program can handle every situation that can be thrown at it, and the user has to take some care too.

One final point - I think (though I'd need to verify) that it may be possible to handle the issue you mention (huge binary data with no newline) by reading the file character by character using in_fil.read(1), looking for newlines in the characters read, and handling the data accordingly. My selpg Linux utility does something like that when invoked with the option to handle pages demarcated by form feeds. But then how do you decide where to split the data into lines (since no newlines exist)?

Or use the in_fil.readline(size) method with some large positive value for size. This would still need some extra code to handle splitting the output into files of lines_per_file lines each, because a newline may only occur after many readline(size) calls - or not at all.
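A rough sketch of that readline(size) idea (the function name and the cap/max_line values are my own, chosen for illustration): readline(size) returns at most size bytes, so an extremely long line comes back in pieces, which can be accumulated until a newline or EOF appears - and the accumulation can be capped so the program bails out instead of eating all memory on binary data with no newlines.

```python
def read_capped_lines(filename, cap=65536, max_line=1 << 20):
    """Generate logical lines from a file, reading at most cap bytes per
    call via readline(cap). If an unbroken run exceeds max_line bytes
    with no newline, give up - the file is probably binary.
    """
    with open(filename, "rb") as in_fil:
        pieces, total = [], 0
        while True:
            piece = in_fil.readline(cap)  # at most cap bytes per read
            if not piece:
                break  # EOF
            pieces.append(piece)
            total += len(piece)
            if piece.endswith(b"\n"):
                yield b"".join(pieces)  # a complete logical line
                pieces, total = [], 0
            elif total > max_line:
                raise ValueError(
                    "line longer than %d bytes; probably a binary file"
                    % max_line)
        if pieces:
            yield b"".join(pieces)  # last line, with no trailing newline
```

split's main loop could iterate over this generator instead of over the file object directly, catching the ValueError to abort cleanly.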

I just realized another point, regarding selpg - when used in the line mode (not the form feed mode), it can act like a rudimentary version of the split command. But you would have to invoke it many times, once for each output file you want to split the input file into; i.e. if you have a file of 100 lines, call selpg once to extract lines 1 to 10, again to extract lines 11 to 20, etc. It would be inefficient to use it that way compared to split, which creates all the output files in one run.

But of course that is not selpg's intended use. It is only meant to extract one range of pages from the input, for purposes such as printing, etc.

A similar approach could be taken in Python using in_fil.read(1), but as I said, that does not solve the problem of where you split the file into lines, if it has no newlines. In fact, such a file is then a binary file, so I would prefer to just treat it separately, using the bsplit program I mentioned. I think it would rarely be the case that you do not know whether the file you want to process is a text file or a binary file. If that happens, there may be a problem upstream, which should be fixed first.