Re: WANTED: Volunteer to Scan Old Programs - CP/M

Re: WANTED: Volunteer to Scan Old Programs

I and we all have tolerated a lot from you but when you start spamming
the group with binary and HTML junk you have passed the limit. This is
just not on.

Axel

Re: WANTED: Volunteer to Scan Old Programs

Hello, Axel!
> I and we all have tolerated a lot from you but when you start spamming
> the group with binary and HTML junk you have passed the limit. This is
> just not on.

All I want are:

1) to be able to use the "Box drawing" characters that are present on
all my IBM PCs and PC printers.

2) to be able to use a "fixed-width" font, since all my WS4 files are
78-columns wide.

By the way, Axel, when searching how to display the "extended"
characters correctly on the comp.os.cpm Newsgroup, I found the answer
in a file explaining how to display all the 256 chars of the Atari
ST... So, just changing the values in my BASIC program would enable
you to display correctly the 256 characters of your Atari ST on any
computer or program using the "UTF-8" encoding. (It has nothing to do
with HTML, it is a way to encode "extended" characters, said to be
compatible with ASCII -- it was done by an American.)

(I am rewriting from scratch the program, now that I have finally
understood how Google displays characters. After a few more tests, I
will publish a message explaining how it works. Meanwhile, if someone
knows more about UTF-8 and HTML and CSS than me, I still have a few
questions.)

Yours Sincerely,
Mr Emmanuel Roche

Re: WANTED: Volunteer to Scan Old Programs

"Mr Emmanuel Roche, France" writes:
>Hello, Axel!
>> I and we all have tolerated a lot from you but when you start spamming
>> the group with binary and HTML junk you have passed the limit. This is
>> just not on.
>All I want are:
>1) to be able to use the "Box drawing" characters that are present on
>all my IBM PCs and PC printers.

You can - but *not* in a newsgroup..!
>2) to be able to use a "fixed-width" font, since all my WS4 files are
>78-columns wide.

All of us do.
But no other (regular) writer uses the Web interface
of 'google' to post...
>By the way, Axel, when searching how to display the "extended"
>characters correctly on the comp.os.cpm Newsgroup

You can't. So: Stop trying!
>(I am rewriting from scratch the program, now that I have finally
>understood how Google displays characters.

Just use a 'real' Newsreader.

> After a few more tests, I

*If* you have to 'test' please use a group dedicated to "test".

>will publish a message explaining how it works. Meanwhile, if someone
>knows more about UTF-8 and HTML and CSS than me, I still have a few
>questions.)

So, the English, French, and German MS-DOS computers used "Code Page
850" but, when Micro**** introduced WinDoze, it changed to "Code Page
1252". Hence the problems when printing MS-DOS / CP/M-86 files on this
cybercafe computer running under WinDoze.

What my program does is convert those "Code Page 850" characters into
something called "UTF-8" which seems to be widely used, now, following
the internationalization of the Internet (web browsers are now obliged
to display correctly characters from a lot of countries).
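(A modern cross-check, not part of the original BASIC program: Python's standard library happens to ship a "cp850" codec, so the same Code Page 850 to UTF-8 conversion can be sketched in a few lines.)

```python
# Convert a few Code Page 850 bytes to UTF-8, as the BASIC program
# does with its lookup table. Python's bundled "cp850" codec plays
# the role of that table here.
cp850_bytes = bytes([0x82, 0xB0, 0xC4])  # e-acute, light shade, box-drawing line
text = cp850_bytes.decode("cp850")       # the corresponding UNICODE characters
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                        # b'\xc3\xa9\xe2\x96\x91\xe2\x94\x80'
```

Note how the single box-drawing byte C4h becomes the 3-byte sequence E2.94.80 on the way out.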

That's all.

(If you are a Chinese / Japanese / Korean still using WordMaster under
CP/M 2.2, simply changing the values of a table in my program will
enable you to convert your WordMaster files into UTF(-8) files.)

Yours Sincerely,
Mr Emmanuel Roche

Re: WANTED: Volunteer to Scan Old Programs

WS4UTF.WS4
----------

A WS4-to-UTF(-8) File Converter

Before computers, telecommunication people used Baudot code.
Circa 1963, the (American) Teletype company created its "Model
33" (as you can see, there had been 32 models before it...),
using a code that was soon after normalised as the US-ASCII
(with 2 modifications). It is this famous ASR-33 Teletype that
was the standard "terminal" when the first microcomputers were
created.

Now, it is essential to understand that the ASR-33 Teletype was
the son of a long line of teletypes: that's why it was not
totally compatible with the ASCII code.

As can be seen, there were 4 "control codes" located at the end
of the allowable characters, where, in the ASCII code, only
"DEL" remains. (I hope that you are able to understand that
"UNASSIGNED" means that there were no lowercase characters: the
ASR-33 TTY was uppercase only. That's why Altair BASIC -- until
Version 4 -- (also known as "MITS 4K BASIC") was uppercase
only.) Please note, especially, that "ESC" and "ACK" were
located there, not in the first 32 "control codes". Of course,
since this ASR-33 Teletype was pre-ASCII, the names of the
"control codes" that it used were also different. Also, note the
famous "ALT MODE", which was also a relic of previous codes.
Before, when telecommunication devices used 6 bits, "ALT MODE"
was used, for example, to switch from characters to numbers, or
from black ink to red ink (that's why Altair BASIC tests it),
depending upon the teletypewriter used (there were several
teletypes makers, besides Teletype).

Finally, note the mentions of a "up arrow" and a "left arrow",
where the ASCII code uses "^" and "_" (caret and underline).
Some programming languages used "left arrow" as their symbols
for "assignment" (i.e., the equivalent of "LET" in BASIC).

Also, note that those 2 arrows were pointing left and up, that
is to say: the right and down arrows were missing... (Several
ASR-33 users complained about this!)

Re-finally, also note that the ASR-33 Teletype was uppercase
only: there were no lowercase characters, nor the { | } ~
characters used (among others) by the C language.

This is a 7-bit code (previous codes were often 6 bits, 5 bits,
etc.).

WordStar, the word processor that I use, being made by an
American, is also 7-bit. Later versions of WordStar, like
Version 4.0, enable one to use 8-bit codes.

I wanted to show you, here, the characters from 80h to 0ffh but,
for portability reasons, finally decided not to. You will have to
bring back your old copy of the "IBM PC Technical Reference
Manual (1981)", and open it to Section 7: "Characters,
Keystrokes, and Colors". That's the reference that I used.

As could be seen, this WS4 file could contain all the characters
used by my IBM PC (from 20h to 0ffh: it is also possible to
display the values corresponding to the "control chars", but I
decided not to implement it). I have 2 copies of the WordStar 4
manuals: one "CP/M edition" (which lists only the ASCII set) and
one "Professional" (which lists the "ASCII character codes and
extended character set"). I checked those characters, one by
one: the shapes of B0, B1, and B2 are horizontal, rather than
vertical; the shapes of DB, DC, DD, DE, and DF seem to be
compressed vertically; in my opinion, the Greek char that they
used for E8 is the uppercase version, but I could be wrong,
since I have not read Greek since University...

Those were the characters used by the IBM PC in 1981. Those are
the characters still displayed on the screen of my IBM PC in
2008, and printed on my (PC) printer. They were so widely used
that they became a "de facto" standard, normalized as "ISO 646"
in 1988, 7 years after the creation of the IBM PC.

Yet, when I go to a cybercafe and publish a text on the
Internet, the "extended characters" are different, preventing me
from using the "Box drawing" characters... What went wrong with
the "IBM Clown"?

According to my various docs, when Micro**** introduced WinDoze
(thus dropping MS-DOS), he switched to something called "Windows
Latin 1" (also named "ISO Latin 1", or ISO-8859-1). This
character set is well-known in France, since 2 common French
characters are missing from it, preventing you from writing
words like "œil" (eye) and "bœuf" (beef) with their proper
French ligature! ("This is not a bug, this is a feature!") The
"extended chars" of this set are almost only combinations of
various European characters (which were previously available by
printing over a previous character: each combination is now a
single character. This is reminiscent of the Ethiopian alphabet,
which used to have separate vowels. Then, during the Middle
Ages, people started writing those vowels at the bottom of the
consonants -- as if you wrote a, e, o, i, u under the previous
consonant -- until they were familiar with this "ba", "ca", "da"
system. As a result, the Ethiopian alphabet now has 114
characters, rather than the original 26-or-so characters.) For
some unknown reason, there are no "Box drawing" characters under
WinDoze. You need a PAO (desktop publishing) program to make a
table.

In reaction, a consortium made up of Adobe, Apple, IBM, Sun, and
Xerox, called "UNICODE", had the ISO adopt, one year later, the
so-called "UNICODE" standard, which is 16-bit, hence able to
contain 64K different characters.

Thinking that 65,536 characters were not enough, ISO then made
its own standard, "ISO 10646" (note the 646) which is 32-bits,
hence able to define 4 294 967 296 characters... (All that
because a people in Lebanon used 24 characters (inspired by the
alphabet used in Iraq) 2,300 years ago! The Greeks (who were in
commercial relations with them) adopted those characters... The
rest is history!)

Now, since 65536 characters do not fit in 7-bits ASCII, how is
it possible that UNICODE became a world-wide standard for
Internet applications? The main reason is that an American,
Kenneth Thompson, invented a way (compatible with US-ASCII) to
code those 65536 characters, named UTF-8.

US-ASCII is 7 bits, so the MSBit of each char is 0: 0xxx$xxxxb.
(I am using the CP/M ASM "blank char" $ to separate the
nibbles.) Their values range from 00h to 7fh.

Now, one of the difficulties of UTF-8 is that it jumps from one
size to another at unusual values. Explanation: when you count
7, 8, 9, then reach 10, you now have 2 digits. For 25+ years, I
was used to a byte value "jumping" from one byte at 0ffh to two
bytes at 01$00h (here, I use the $ to separate the bytes).

With UTF-8, it is different, because the 2nd and 3rd bytes start
with 10b. But, since it is a value greater than 7fh, the first
byte of a two-byte value must also start with a 1... It is
here that things are a little bit tricky, so I suggest examining
the following lines:

0xxx$xxxx                            (1 byte:  0000-007f)
110x$xxxx 10xx$xxxx                  (2 bytes: 0080-07ff)
1110$xxxx 10xx$xxxx 10xx$xxxx        (3 bytes: 0800-ffff)

(The UTF-8 standard also allows a 4-byte value but, since my
WS4-to-UTF(-8) File Converter doesn't use 4-byte values, I will
limit my explanation to the 1- to 3-byte values, which are also
the most often used.)

If you look at the above 3 lines, something should be clear: the
xxxx bits are always preceded by a 0.

In the case of a 1-byte value, this 0 is the MSBit.

In the case of a 2-byte value, this 0 is preceded by 11 in the
MSByte, and the LSByte starts with 10.

In the case of a 3-byte value, this 0 is preceded by 111 in the
MSByte, and the MidByte and LSByte both start with 10 (this is
also the case for 4-byte values).

The trick is that 2-byte values start with a 11: that is to say:
"two ones" means that this is a 2 bytes value.

3-byte values start with a 111: that is to say: "three ones"
means that this is a 3 bytes value.

(4-byte values start with... Guess what? a 1111.)

So, if you see a byte starting with a 0, you know that it is a
byte (a 1-byte value).

If you see a byte starting with a 110, you know that this is a
word (a 2-byte value).

If you see a byte starting with a 1110, you know that this is a
3-byte value.

(If you see a byte starting with a 10, you know that it is a
"following byte". You loop until you encounter another byte
starting with 0, 110, 1110, etc.)

Let us now examine an example, so you see the point.

When creating my WS4-to-UTF(-8) File Converter, I had some
difficulty with the "currency symbol" for "Pesetas" (the Spanish
coins). Opening the UNICODE book, I found that the only Peseta
symbol was 20A7 (hex).

Now, this would be too simple: you don't simply DOKE the value
20A7 into your WordStar file! First, it has to be "surrounded"
by 2 characters used internally by WordStar to know when it is
dealing with "Extended characters".

Second, and more important, the value has to be encoded in
UTF-8, since UTF-8 allows us to display the character set of the
IBM PC ("Code Page 850", but this could be another one)
correctly to people using Internet programs which, being
world-wide, need to be able to display correctly a lot of
characters from foreign countries.

UTF-8, being compatible with US-ASCII for byte values, is widely
used. In practice, 65536 characters is enough to deal with a lot
of foreign countries.

So, back to our example: how to convert the value 20A7, defined
in the UNICODE standard as the symbol for the currency
"Peseta"(s), using the UTF-8 standard?

20A7 = 0010$0000$1010$0111b

20A7 is bigger than 7f. It is also bigger than the range 0080-
07ff used by 2-byte values. So, it must be a 3-byte value (0800-
ffff)? Yes.

So, the first byte will start with a 1110, saying that this is
the start of a 3-byte value, and will be followed by 2 bytes
starting with 10, saying that they are the following bytes.

That is to say, during the UTF-8 encoding, our 20A7 value has
become a E2.82.A7 triplet...

(Note that the first number, 2, and the last number, 7, remain
the same. In practice, they are the only numbers that don't
change. A quick look at the above drawing will explain why:
Nib4 and Nib1 are copied vertically without change, while Nib3
and Nib2 ("NB" and "NA") are split and preceded by the
"following byte" markers (10).)
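For readers who prefer to check the bit shuffling themselves, here is the same 20A7 computation written out (a sketch in Python; the original program does this in BASIC):

```python
# Encode UNICODE value 20A7h (the Peseta sign) into UTF-8 by hand,
# following the 3-byte pattern 1110$xxxx 10xx$xxxx 10xx$xxxx.
code = 0x20A7
b1 = 0b1110_0000 | (code >> 12)          # 1110 + top 4 bits    -> 0xE2
b2 = 0b1000_0000 | ((code >> 6) & 0x3F)  # 10 + middle 6 bits   -> 0x82
b3 = 0b1000_0000 | (code & 0x3F)         # 10 + low 6 bits      -> 0xA7
print(bytes([b1, b2, b3]))               # the E2.82.A7 triplet
```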

This is how UTF-8 works.

Now, I did not understand it instantly...

You may remember that I published an HTML file in the
comp.os.cpm Newsgroup (for which I was flamed...). I was
expecting Google Groups (the newsreader that I use) to recognize
the HTML commands and display correctly the "extended
characters" that I had coded using the "&#x1234;" scheme.

Instead, the full HTML source code appeared on the screen... It
was only when I "clicked" on the screen that a new "window"
opened, containing all the characters in their grid. What had
happened?

I was puzzled, to say the least.

As usual, it was impossible to find anybody with the answer.

Then, one day, when I decided to dump the display at home, on
my computer, a breakthrough happened.

So, I "selected" the contents of the window, then opened my
floppy disk. When prompted for a file name, I chose
ASCIGRID.TXT. Surprise! WinDoze opened a warning window, telling
me that the file contained UTF-8 characters, and that I would
lose them if I did not use the "File Format" pull-down menu.

So, I followed its suggestion (it must be the first time that I
experienced a WinDoze computer being useful!), and saved the
mysterious file on my floppy.

Back home, I dumped and printed the contents of the file.
Opening WordStar, I could see the grid containing all the
characters on the screen. All I had to do was to mark, on the
printout, what was in hex for each character. It is then that I
understood that some characters were coded with one byte, some
with two bytes, and all the "Box drawing" characters with three
bytes.

Since this file contained absolutely no HTML commands, I wrote
the WS4UTF.BAS program that follows. Every day, I went to the
cybercafe, printed the UTF file that it had created, then went
back home to examine, character by character, the 224 characters
printed.

When there was a problem, I also searched many, many hours in
the UNICODE web site, for the missing characters.

Since I see no more problems, I am finally releasing it. Of
course, if you see any problem, warn me.

The only known problems are:

1) The char for DEL used to be a "triangle" on my CP/M
computers that were able to display it, and on the IBM PC.
However, all my IBM Clowns now display it as something called a
"house symbol". Since this is also the graphics printed on all
my printers, this is the char that I used.

2) The symbol for "Peseta" (9E) was "Pt". But the only Peseta
symbol I was able to find on the UNICODE web site is "Pts"...

3) Chars A6 and A7 (superscript lowercase "a" and "o") are
underlined in all the references that I have, all the printers
that I have, all the computers that I have... except when I
print them under WinDoze at the cybercafe! In addition, the
UNICODE web site displays them (in a PDF file) as underlined!!!
So, could someone tell Micro**** that they are, indeed,
underlined, and have been so since (at least) 1981? (Else,
WinDoze would not be IBM PC-compatible...)

4) I had quite a lot of trouble finding a corresponding
character for A9, but managed, eventually.

5) I am not much impressed by the Greek chars displayed by
WinDoze. In particular, I think that E2 is too similar to a "Box
drawing" character.

6) Finally, I had much difficulty with EE. I interpret it as the
math symbol "belongs to". I found 2 such symbols in the UNICODE
web site (a small and a big) but they never displayed or printed
on the WinDoze computer of the cybercafe. As a desperate
solution, I am using "epsilon", instead.

7) I have trouble seeing the difference between F9 and FA, which
were, as far as I can see, different in the "IBM PC Technical
Reference Manual".

Conclusion: Despite all those little problems, this is (as far
as I know) the first time that a "Code Page 850" to UTF-8 file
converter is offered (I was totally unable to find any table
explaining this conversion, despite long searches with Google).
Even better: this version is able to convert the files produced
by the WordStar word-processor that I have been using for the
last 20 years. (According to my computers, I have 800+ WS4
files. Now, every time I check a WS4 file and notice that it
contains "ASCII graphics" (usually tables), I will convert it
to those "Extended characters" that were used, in the good old
days, to produce those tables with phototypesetters... At the
cybercafe, the laser printer prints at 600 DPI: I am unable, at
my age, to see any difference between its output and a book.
That's enough for me.)

(The only thing not explained is the "BOM" bytes that I
discovered in my dump. It was only a few days later, when
reading stuff about UTF-8, that I saw a passing reference to a
"Byte Order Mark"... That's all I needed to know. As long as
this program translates my WS4 files into files displayed and
printed correctly under Internet programs, I am happy.)
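For the record, the "BOM" bytes are EF.BB.BF: simply the character FEFF (the Byte Order Mark) passed through the 3-byte UTF-8 encoding. A quick check, again in modern Python rather than the original BASIC:

```python
import codecs

# The UTF-8 signature is the 3-byte sequence EF BB BF, i.e. the
# Byte Order Mark U+FEFF encoded with the usual 3-byte pattern.
print(codecs.BOM_UTF8)                  # b'\xef\xbb\xbf'
print(b"\xef\xbb\xbf".decode("utf-8"))  # '\ufeff'
```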

Yours Sincerely,
Mr Emmanuel Roche

EOF

Re: WANTED: Volunteer to Scan Old Programs

said, page 19: "... before 1956, the concentration was on
meeting user needs, the concentration was not on the languages:
it was on building the tools which later began to support the
languages: the compilers, the generators, the supporting
elements which now back up our definitions of languages.
Languages came into use. People began to use them. There was
another stage that had to occur. I think that, to some extent,
we had paid little attention to it in the early days. And that
was that the implementers interpreted things differently. This
particularly occurred in the case of COBOL. The Navy became very
much interested in trying to make all the COBOL compilers give
the same answers, even though they were on different computers.
And it was for that reason that I was called back, in the late
1960s, to the Navy Department. A set of programs was built,
which would validate a COBOL compiler. They would compare the
execution of the compiler against the standard, and monitor the
behavior of the actions of the compiler. It was the first set of
programs that was built to try to use software to check
software.

I think this is an important element we have omitted. If we are
going to have a language, it certainly should have the same
answers on all the different computers. The set of COBOL
validation routines was the first such set of routines to prove
whether or not a compiler did in fact correctly implement the
standard. I have the papers that were published on the "Federal
COBOL Compiler Testing System". Recently, they have also
produced a set of routines for testing FORTRAN.

I think this is something we overlooked in our development of
our languages. We overlooked the fact that we must see to it
that those compilers properly implemented the language; that
they did give common answers on different computers. A language
is not very useful if you get different answers on different
computers. At least it isn't to an organization like the Navy
which, at any given moment, has at least one of every kind of
computer, and we *would* like to get the same answers from the
same program on the different computers."