

Multiplatform encoding solutions
I hope this mail does not get filed under some other topic.
To show that I can write more than nag messages here, below are some
results from last weekend's encoding research.
Originally I had prepared a rather long list of what works and what
does not, in which language and encoding, under which operating
system, until I realized that I can also put it in a single short
sentence:
CLISP produces exactly the same quirks as the GNU 'iconv' program
(the gettext encoding converter).
This proves to me that the problem has nothing to do with CLISP
itself; it is simply the usual multiplatform encoding quirks.
So I decided to take another direction.
Below is a very simple binary pattern matcher that reads the first
(up to) 16 KB of a text file and searches for typical umlaut byte
patterns. This is how web browsers like Firefox try to determine the
encoding when the corresponding declaration in the HTML header is
missing.
For international use this program is of course too simple, but German
has only 7 umlaut characters, and the sets of byte values they take
in the different encodings do not even intersect.
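The disjointness claim is easy to check mechanically. Here is a quick
demonstration (in Python only for the sake of a short, self-contained
check; the byte values are the same ones the Lisp code below uses):

```python
# The seven German special characters and their byte values in both
# encodings. In ISO-8859-1 each is a single byte; in UTF-8 each is a
# two-byte sequence with lead byte 0xC3.
umlauts = "äöüÄÖÜß"

latin1 = {ch: ch.encode("latin-1") for ch in umlauts}
utf8 = {ch: ch.encode("utf-8") for ch in umlauts}

for ch in umlauts:
    # e.g. ä -> e4 (Latin-1) vs c3a4 (UTF-8)
    print(ch, latin1[ch].hex(), utf8[ch].hex())

# The two byte sets never overlap, so a single byte can be attributed
# to exactly one of the two encodings.
latin1_set = {b for v in latin1.values() for b in v}
utf8_set = {b for v in utf8.values() for b in v}
print(latin1_set & utf8_set)  # -> set()
```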
With the code below I can read ISO-8859-1 and UTF-8 encoded Lisp code
on Windows and Linux with no scrambled umlauts or any other CLISP
conversion errors.
I would be interested to hear whether there have been similar attempts
in the past.
Here's the code:
;; ------------------------------ encoding -----------------------------

(defun get-file-encoding (file &optional verbose)
  (let ((encoding custom:*default-file-encoding*)
        (iso-umlauts (list #xe4 #xf6 #xfc #xc4 #xd6 #xdc #xdf))
        (utf-umlauts (list #xa4 #xb6 #xbc #x84 #x96 #x9c #x9f))
        (utf-marker #xc3) ; first byte of a two-byte UTF-8 umlaut sequence
        (iso 0) (utf 0) (ascii 0) (unknown 0) ; character counters
        (bytes 0) (current-byte 0) (previous-byte 0))
    ;; binary pattern matcher
    (with-open-file (input-stream file :direction :input
                                       :element-type '(unsigned-byte 8))
      (setq bytes (min (file-length input-stream) 16384))
      (dotimes (i bytes)
        (setq current-byte (read-byte input-stream))
        (cond ((member current-byte iso-umlauts)     ; iso umlaut
               (incf iso))
              ((and (eql previous-byte utf-marker)   ; utf umlaut
                    (member current-byte utf-umlauts))
               (incf utf))
              ((< current-byte 128) (incf ascii))    ; 7-bit ascii
              ((not (eql current-byte utf-marker))   ; unknown byte
               (incf unknown)))
        (setq previous-byte current-byte)))
    ;; print the counters
    (when verbose
      (format t "~&;; iso:~a utf:~a ascii:~a unknown:~a bytes:~a~%"
              iso utf ascii unknown bytes))
    ;; the highest match count determines the charset
    #+unicode
    (let ((charset (cond ((and (eql iso 0) (eql utf 0)) charset:ascii)
                         ((> iso utf) charset:iso-8859-1)
                         ((> utf iso) charset:utf-8)
                         (t :default))))
      (setq encoding (ext:make-encoding :charset charset)))
    #-unicode
    (format t "~&;; get-file-encoding: no :UNICODE support found.~%")
    ;; print the chosen charset
    (when verbose
      (let ((charset #+unicode (ext:encoding-charset encoding)
                     #-unicode "ISO-8859-1"))
        (format t "~&;; charset -> ~a~%" charset)))
    ;; return the encoding
    encoding))

;; examples:

(defmacro xload (file &rest args)
  "Load a Lisp file in its native encoding."
  `(let ((encoding (get-file-encoding ,file)))
     (load ,file :external-format encoding ,@args)))

(defun cat (file)
  "Read a text file in its native encoding and print it to the screen."
  (let ((encoding (get-file-encoding file)))
    (with-open-file (in-stream file :if-does-not-exist nil)
      ;; (stream-external-format ...) must be set to the encoding
      ;; because READ-LINE has no :external-format keyword
      (setf (stream-external-format in-stream) encoding)
      (loop for line = (read-line in-stream nil)
            while line
            do (format t "~&~a~%" line)))))

;; ---------------------------- end of code ----------------------------
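For experimenting with the heuristic outside of CLISP, the same
byte-counting logic can be re-expressed roughly as follows (a sketch
in Python, using the same byte tables and the same tie-breaking as the
Lisp code above; `guess_encoding` is a name chosen here, not part of
the original code):

```python
# Byte values of the German umlauts in ISO-8859-1, and the second bytes
# of their two-byte UTF-8 sequences (lead byte 0xC3), as in the Lisp code.
ISO_UMLAUTS = {0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF}
UTF_UMLAUTS = {0xA4, 0xB6, 0xBC, 0x84, 0x96, 0x9C, 0x9F}
UTF_MARKER = 0xC3

def guess_encoding(data: bytes, limit: int = 16384) -> str:
    """Count umlaut byte patterns in the first `limit` bytes and return
    'ascii', 'iso-8859-1', 'utf-8', or 'default' (on a tie)."""
    iso = utf = 0
    prev = 0
    for b in data[:limit]:
        if b in ISO_UMLAUTS:            # iso umlaut
            iso += 1
        elif prev == UTF_MARKER and b in UTF_UMLAUTS:  # utf umlaut
            utf += 1
        prev = b
    if iso == 0 and utf == 0:
        return "ascii"
    if iso > utf:
        return "iso-8859-1"
    if utf > iso:
        return "utf-8"
    return "default"
```

For example, `guess_encoding("Grüße".encode("latin-1"))` returns
`'iso-8859-1'`, while the UTF-8 encoding of the same string is
classified as `'utf-8'`.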
Herewith this code is declared as public domain.
Have fun,
- edgar

edgar wrote:
>
> Originally I had prepared a rather long list of what works and what
> does not, in which language and encoding, under which operating
> system, until I realized that I can also put it in a single short
> sentence:
>
> CLISP produces exactly the same quirks as the GNU 'iconv' program
> (the gettext encoding converter).
>
> This proves to me that the problem has nothing to do with CLISP
> itself; it is simply the usual multiplatform encoding quirks.
No, this just stems from the fact that iconv and CLISP were written by
the same person, Bruno Haible, who now maintains gettext.
I will leave it to him to discuss the i18n issues...
Thanks for your input!
Sam.

2010/4/27 Sam Steingold <sds@...>:
> edgar wrote:
>>
>> Originally I had prepared a rather long list of what works and what
>> does not, in which language and encoding, under which operating
>> system, until I realized that I can also put it in a single short
>> sentence:
>>
>> CLISP produces exactly the same quirks as the GNU 'iconv' program
>> (the gettext encoding converter).
>>
>> This proves to me that the problem has nothing to do with CLISP
>> itself; it is simply the usual multiplatform encoding quirks.
These are not quirks; the conversion is exact. Guessing the encoding
from some input is always inexact, but sometimes nice to have.
> No, this just stems from the fact that iconv and CLISP were written by
> the same person, Bruno Haible, who now maintains gettext.
Your approach to encoding guessing is too simple: it can only separate
German Latin-1 from UTF-8.
I usually use libtextcat for language and encoding guessing, which
checks for frequently occurring character triples (trigrams).
Maybe an -E guess option is an idea.
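The trigram idea mentioned above can be sketched as follows. This is a
toy illustration of the general n-gram profile technique (ranked
trigram frequencies compared with an out-of-place penalty), not
libtextcat's actual code or API, and the function names are invented
for this example:

```python
from collections import Counter

def trigram_profile(data: bytes, top: int = 300) -> dict:
    """Map each of the `top` most frequent byte trigrams to its rank."""
    counts = Counter(data[i:i + 3] for i in range(len(data) - 2))
    return {tri: rank for rank, (tri, _) in enumerate(counts.most_common(top))}

def distance(profile: dict, sample: dict, out_of_place: int = 400) -> int:
    """Sum of rank differences; trigrams missing from the reference
    profile cost a fixed out-of-place penalty."""
    return sum(abs(rank - profile.get(tri, out_of_place))
               for tri, rank in sample.items())

def classify(data: bytes, profiles: dict) -> str:
    """Return the name of the reference profile closest to `data`."""
    sample = trigram_profile(data)
    return min(profiles, key=lambda name: distance(profiles[name], sample))
```

Trained on the same German text in two encodings, such profiles
separate Latin-1 from UTF-8 because the trigrams that span umlaut
bytes never appear in the other encoding's profile.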
--
Reini Urban