Sunday, May 19, 2013

Arguably, the most serious incompatibility between LB Booster and 'genuine' Liberty BASIC is the lack of support for huge strings. The maximum string length in LB4 isn't quoted anywhere, as far as I know, but is probably limited only by the memory available to BASIC, which is usually stated as being in the region of 70 Mbytes (although presumably one shouldn't expect it to work properly if a single string uses a significant proportion of that memory). In any case it is a lot bigger than BBC BASIC's 64 Kbytes (65535 characters) limit.

The ability to have multi-megabyte strings isn't necessarily desirable, because it can encourage the use of inappropriate algorithms. For example suppose one wants to count the number of lines in a text file. The traditional way, and the 'natural' way in a language which doesn't support huge strings, is to read the file sequentially, incrementing a counter for each line read. Only one line needs to be buffered in memory at a time, and the process is quite efficient.

In a language like Liberty BASIC, however, there is a temptation to read the entire file into a string - because you can - and then to scan the string for line delimiters (for example using the INSTR function) in order to count the lines. That will work, but it is highly inefficient of memory usage and CPU time. Firstly it involves holding the entire file in memory, which may be many megabytes, and secondly the INSTR function may need to scan the string from the beginning every time (or at least make a copy of the string first).

In a simple benchmark using LB 4.04, counting the lines in a 1 Megabyte text file with the first (sequential read) approach took about one second and with the second (whole file loaded into a string) approach took about 130 seconds. Repeating the test using LBB (for how that was possible, read on!) gave timings of 0.2 seconds and 20 seconds respectively. This illustrates the kind of speed improvement possible by using LBB, but in both cases loading the entire file into a string was about one hundred times slower than the sequential approach!

However, whilst it is possible to argue that supporting huge strings isn't always desirable, a significant proportion of LB programs do rely on that capability. To improve compatibility there is no alternative but for LB Booster to do the same. In investigating how that might be possible I first looked at the practicality of adapting the HUGESLIB library, which provides limited support for huge strings in BBC BASIC for Windows. Although in principle having the required functionality, it would involve identifying which string variables need to be able to hold huge strings (in itself non-trivial) and then substituting calls to library functions whenever those strings are manipulated.

I eventually decided that such an approach would be too complicated and too likely to suffer from obscure bugs. The only other possibility was to adapt BB4W itself to handle huge strings natively. Whilst clearly far from trivial, in fact such a modification wasn't desparately difficult for a couple of reasons. Firstly, I had already tackled increasing the maximum string length once before, when it was increased from 255 bytes (the limit in Acorn versions of BBC BASIC) to 65535 bytes, and the techniques used on that occasion are, in principle at least, fairly readily extended to support even longer strings. Secondly, because the 64K limit implied the string length being stored as a 16-bit quantity, the fact that the CPU's native registers are in fact 32-bits made it quite easy to adapt existing code (typically where the string length had previously been stored in the 16-bit CX register it could instead be stored in the 32-bit ECX register).

So it was that I modified the BB4W interpreter to support huge strings, specifically for the benefit of LBB. It's quite likely that the resulting performance is not as good as it might have been had huge strings been anticipated from the start, since in some cases the algorithms don't scale too well, but assuming that most strings will still be shorter than 64 Kbytes a typical program shouldn't experience much impairment. The version of LB Booster with this modification is, at the time of writing, in a beta test phase and is what was used to run the benchmark program.

Of course this raises a particular difficulty in respect of BB4W - should I make the 'huge string' version available to BBC BASIC programmers as well as to Liberty BASIC programmers? That is a much more tricky proposition, because unlike LBB - where the huge strings improve compatibility - the change from 16-bit to 32-bit lengths will inevitably impair compatibility and break some existing BBC BASIC programs. For example programs which assume the 'string descriptor' is 6 bytes long won't work when it increases to 8 bytes, and programs which include assembler code to manipulate strings almost certainly would need to be changed.

So this is quite a dilemma for which I haven't found a satisfactory resolution. I could simply not make the huge-string version available to BBC BASIC programmers at all, but that would seem rather unfair. Or I could upgrade BB4W and document the program modifications necessary (it's happened before, after all, when the strings were changed from 255 to 65535 characters). Or I could somehow make both versions available, but that greatly complicates version control, maintenance, automatic upgrades etc. There's no perfect answer!

5 Comments:

Also, with other BASICs, the situation of loading an entire text file into memory and then parsing it in code vs. LINE INPUT is a fairly well-known thing. In most MS BASIC-alikes, it's quite a bit faster to load the whole thing than it is to read the file line-by-line.

As an example, I did a 1,000,000-byte text file in FreeBASIC, PowerBASIC, and Visual Basic 6, and with all three, it's as much as 5 to 2 in favor of reading the entire file at once (meaning, if the LINE INPUT version takes 5 seconds, the "entire file" version takes 2 seconds).