From dev-return-6175-apmail-couchdb-dev-archive=couchdb.apache.org@couchdb.apache.org Sat Aug 29 05:56:58 2009
Message-ID: <956176654.1251525392746.JavaMail.jira@brutus>
Date: Fri, 28 Aug 2009 22:56:32 -0700 (PDT)
From: "Curt Arnold (JIRA)"
To: dev@couchdb.apache.org
Subject: [jira] Commented: (COUCHDB-345) "High ASCII" can be inserted into
db but not retrieved
In-Reply-To: <944783679.1241648130398.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
[ https://issues.apache.org/jira/browse/COUCHDB-345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749083#action_12749083 ]
Curt Arnold commented on COUCHDB-345:
-------------------------------------
ISO-8859-1, Cp1252 and Latin-1 are near synonyms for encoding the first 256 code points in Unicode as single-byte values; they are incapable of representing any other character without some escape mechanism. Any arbitrary sequence of bytes is a valid ISO-8859-1 sequence and can be decoded into a sequence of Unicode characters.
UTF-8 is a variable-length encoding of the full Unicode character repertoire. Code points from \u0000 to \u007F are represented as a single byte, while other characters require 2 to 4 bytes to encode (the original design allowed up to 6). Unlike ISO-8859-1, not every sequence of bytes is valid and can be converted back to Unicode code points. If I remember correctly, a byte with the high bit set is only valid as part of a well-formed multi-byte sequence, that is, a lead byte followed by the required number of continuation bytes; a lone high-bit byte followed by ASCII is invalid. The test data in the last two cases are valid ISO-8859-1 sequences, but they cannot be interpreted as UTF-8 since they contain byte sequences that cannot be converted back into Unicode code points.
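To make the difference concrete, here is a minimal Python sketch. The sample bytes are made up for illustration (they are not the contents of the attached test files), and `is_valid_utf8` is a hypothetical helper name; the only thing taken from the issue is the use of the byte D8.

```python
# Hypothetical sample: byte 0xD8 (O-with-stroke in ISO-8859-1) followed by ASCII.
raw = b"\xd8 followed by plain ASCII"

# ISO-8859-1 maps every one of the 256 byte values to a code point,
# so this decode cannot fail, whatever the input bytes are.
latin1_text = raw.decode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xD8 is a UTF-8 lead byte that must be followed by a continuation byte;
# here it is followed by a space, so the sequence is rejected.
print(is_valid_utf8(raw))

# The same character properly encoded as UTF-8 is the two bytes C3 98.
print(is_valid_utf8("\u00d8".encode("utf-8")))
```

A check of this kind at write time is one way a server could refuse input that the rest of the stack cannot read back.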
If it were just an encoding mismatch and the data were being misinterpreted, you would lay the blame on the client. However, in this case, data can go into the database that the rest of the stack can't process since it contains invalid sequences.
The RFC mentions the two variants of UTF-16 and UCS-4; however, the ISO-8859-1 sequences could not be interpreted using any of those encodings, since the first two characters must be ASCII. Only certain sequences of bytes could appear for JSON encoded in any of those encodings, and the byte sequences sent in the last two cases don't match any of those patterns. Sniffing the encoding would work in a similar manner to XML, which is described in http://www.w3.org/TR/REC-xml/#sec-guessing.
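A rough sketch of what such sniffing might look like in Python, assuming the RFC in question is RFC 4627, whose section 3 identifies the encoding from the pattern of null octets in the first four bytes. The function name `sniff_json_encoding` and the final strict-decode step are my own assumptions, not anything CouchDB does.

```python
from typing import Optional

# Null-octet patterns for the first four bytes of a JSON text
# (True marks a 0x00 byte), as tabulated in RFC 4627 section 3.
_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (True, False, True, False): "utf-16-be",
    (False, True, True, True): "utf-32-le",
    (False, True, False, True): "utf-16-le",
    (False, False, False, False): "utf-8",
}


def sniff_json_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding of a JSON byte string, or return None.

    None means the bytes match no pattern, or match one but do not
    decode cleanly in that encoding.
    """
    head = data[:4]
    if len(head) < 4:
        # Too short for the four-octet pattern; only UTF-8 is plausible.
        candidate = "utf-8" if 0 not in head else None
    else:
        candidate = _PATTERNS.get(tuple(b == 0 for b in head))
    if candidate is None:
        return None
    try:
        data.decode(candidate)  # strict decode rejects malformed input
    except UnicodeDecodeError:
        return None
    return candidate
```

With this sketch, a body such as `b'{"a": "\xd8 x"}'` has no nulls in its first four bytes, so UTF-8 is the only candidate, and the strict decode then fails on the stray D8 byte, which gives the server a reason to reject the request.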
> "High ASCII" can be inserted into db but not retrieved
> ------------------------------------------------------
>
> Key: COUCHDB-345
> URL: https://issues.apache.org/jira/browse/COUCHDB-345
> Project: CouchDB
> Issue Type: Bug
> Affects Versions: 0.9
> Environment: OSX 10.5.6
> Reporter: Joan Touzet
> Attachments: badtext.tar.gz, enctest.zip
>
>
> It is possible to PUT/POST a document into CouchDB with a "high ASCII" value that cannot be retrieved. This results from not escaping a non-ASCII value into \u#### when PUT/POSTing the document.
> The attached sample code will recreate the problem using the hex value D8 (Ø) in a possibly unsavoury test string.
> Sample output against 0.9.0 is as follows:
> ================================================
> {
>   "ok": true
> }
> {
>   "id": "fail",
>   "ok": true,
>   "rev": "1-76726372"
> }
> {
>   "error": "ucs",
>   "reason": "{bad_utf8_character_code}"
> }
> ================================================
> Please note this defect turned up another problem, namely that the bad_utf8_character_code exception thrown by a design document attempting to map() the bad document caused Futon to fail silently in building the view, with no indication (except via debug log) that there was a failure. The log indicated two attempts to build the view, both failing, followed by an uncaught exception error for Futon.
> Based on this, there are likely other areas in the codebase that do not handle the bad_utf8_character_code exception correctly.
> My belief is that CouchDB shouldn't accept this input and should have rejected the PUT/POST, or should have escaped the input itself before the insertion.