Details

Community version (still on FBSD only), so dxflib. I suspect the culprit is dxflib.

I have created layers and blocks and whatnot with (Hungarian) accented characters in their names. Apparently (according to the Internet, as evidenced probably most glaringly by usa.autodesk.com/adsk/servlet/ps/dl/item?siteID=123112&id=7586582&linkID=9240617) the R15 DXF version assumes single-byte character sets are being used. A quick grep of the DXF2000 reference turns up “String (255-character maximum; less for Unicode strings)” (under Group Code Value Types), so this may be a false track...

Anyway, the DXF file written does have strings converted to a single-byte encoding, but it seems it’s always ANSI-1252. When the output encoder encounters a character that is not representable in that code page, it writes a literal question mark.

Actual case, I have a block with the name

106 egypólusú váltókapcsoló jelzőfénnyel

Of this, “ő” (U+0151) is not representable in ANSI-1252, so what gets written to the dxf is (non-ASCII shown in hex)

106 egyp<f3>lus<fa> v<e1>lt<f3>kapcsol<f3> jelz?f<e9>nnyel

Note the literal question mark.

The problem is that this is an irreversible operation, yet the result is perfectly valid ANSI-1252, so upon opening the file again I get a block named

106 egypólusú váltókapcsoló jelz?fénnyel
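
The round trip above can be reproduced outside QCAD. This is only a sketch using Python's built-in cp1252 codec with the "replace" error handler as a stand-in for whatever the output encoder does internally; it is not taken from the QCAD or dxflib sources.

```python
# Reproduce the lossy round trip with Python's cp1252 codec.
name = "106 egypólusú váltókapcsoló jelzőfénnyel"

# errors="replace" substitutes '?' for anything cp1252 cannot represent,
# which mirrors the behaviour observed in the written DXF.
written = name.encode("cp1252", errors="replace")

# Reading the bytes back succeeds without complaint: '?' is valid cp1252.
reopened = written.decode("cp1252")

print(written)
print(reopened)
print(reopened == name)
```

Nothing in the written bytes tells the reader that a substitution happened, which is why the damage only shows up as a changed name.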

IMHO the ideal resolution is to

1. Have a preference for the export code page (and use it, too, circumstances permitting).

2. If this is not set (or set to a default “Use system locale to determine” or something), use a look-up table to take a good guess (like old QCAD2 qcadlib/src/engine/rs_system.cpp:QCString RS_System::localeToISO()).

3. If the output encoder encounters a character that is not representable in the target code page, raise an error and let the user choose: ignore it (and keep using question marks, but then the user has acknowledged it, so it is no longer a silent problem), pick a new output code page, or whatever else makes sense.
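
To make the look-up table and the "fail loudly" idea concrete, here is a rough Python sketch. The table contents and the names `LOCALE_TO_CODEPAGE`, `guess_codepage` and `encode_or_report` are made up for illustration; they are not what QCAD2's `localeToISO()` actually contains.

```python
# Hypothetical locale-to-code-page table; the entries are illustrative only.
LOCALE_TO_CODEPAGE = {
    "hu": "cp1250",  # Central European
    "pl": "cp1250",
    "ru": "cp1251",  # Cyrillic
    "el": "cp1253",  # Greek
}
DEFAULT_CODEPAGE = "cp1252"


def guess_codepage(locale_name):
    """Guess a single-byte code page from a locale name such as 'hu_HU'."""
    language = locale_name.split("_")[0].lower()
    return LOCALE_TO_CODEPAGE.get(language, DEFAULT_CODEPAGE)


def _fits(ch, codepage):
    """Return True if a single character is representable in the code page."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def encode_or_report(text, codepage):
    """Encode strictly; return (bytes, []) or (None, unrepresentable chars)."""
    try:
        return text.encode(codepage), []
    except UnicodeEncodeError:
        bad = sorted({ch for ch in text if not _fits(ch, codepage)})
        return None, bad
```

With this, `encode_or_report("jelzőfénnyel", "cp1252")` reports "ő" as the offender instead of silently writing a question mark, and the caller can then ask the user what to do.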

This all assumes that R15 doesn’t actually depend strictly on ANSI-1252 and ANSI-1252 only. If it does, option #3 would still be nice.

Most Western European languages (and English) are not affected by this as ANSI-1252 has most of them covered, but a little to the east, a little to the south, a little to the north, and it does make a bit of a difference :)

Hm, upon closer look, the DXF reference (both for R14 and for R19, have not looked at others) says this about $DWGCODEPAGE:

Drawing code page; Set to the system code page when a new drawing
is created, but not otherwise maintained by AutoCAD

I couldn't quickly find any more elaborate reference to strings. To me, this suggests that AutoCAD (and, perhaps a too quick conclusion, consequently other CAD software) doesn't care about strings too much, apart from displaying them.

(This also suggests that in the end this isn't going to be a dxflib problem but a QCAD problem, as for dxflib strings are just an arbitrary stream of arbitrary bits, and it's the application that needs to make some sort of sense of these bits.)
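
For what it's worth, an application could at least honour $DWGCODEPAGE when reading. The sketch below assumes the variable is stored in the HEADER section as a group code 9 line with the name, followed by a group code 3 line with the value, and that values follow the `ANSI_<number>` pattern; `read_dwgcodepage` and `to_python_codec` are hypothetical helper names, not dxflib or QCAD functions.

```python
import re


def read_dwgcodepage(dxf_text):
    """Return the $DWGCODEPAGE value from DXF text, or None if absent."""
    lines = [line.strip() for line in dxf_text.splitlines()]
    for i in range(len(lines) - 3):
        # A header variable is a group code 9 line followed by its name;
        # the value comes in the next group code / value pair.
        if lines[i] == "9" and lines[i + 1] == "$DWGCODEPAGE":
            return lines[i + 3]
    return None


def to_python_codec(dwgcodepage):
    """Map 'ANSI_1250' style names to Python codec names (assumed convention)."""
    match = re.fullmatch(r"ANSI_(\d+)", dwgcodepage, flags=re.IGNORECASE)
    return f"cp{match.group(1)}" if match else None


sample = "  0\nSECTION\n  2\nHEADER\n  9\n$DWGCODEPAGE\n  3\nANSI_1250\n  0\nENDSEC\n"
print(read_dwgcodepage(sample))
print(to_python_codec(read_dwgcodepage(sample)))
```

Since the header itself is plain ASCII, the file can be decoded with any ASCII-compatible single-byte codec just to find this value, and then decoded again properly.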

I see three possible courses of action, from which the user should be able to choose (per-drawing or application-wide I am not sure):

1. As it seems customary to just use ANSI-1252 in DXFs, use ANSI-1252 invariably for maximum compatibility. Convert the user-supplied strings to this code page, let the user know when conversion fails, and let her pick a sensible action.

2. For maybe a bit less compatibility, convert the user-supplied strings to the one-byte encoding of a suitable code page (user-specified or derived from system settings, environment, etc.). Let the user know when conversion fails, and let her pick a sensible action.

3. With reference to the Unicode remark above, it could very well be that simply using Unicode is quite valid (even if under-used) in DXFs. This is very likely to be the least interoperable option with other implementations; on the other hand, it puts no restriction on what the user enters in strings. If one doesn't need interop, this could be a good compromise. The encoding used (UTF-8, UTF-16, UCS-something?) would have to be decided upon, possibly by cross-checking how other implementations react. (Considering the Windows heritage, the correct answer is probably either UTF-16 or UCS-2.)

The QCAD Community Edition does not support use of non-ASCII characters for layer and block names at this point. I've changed this into a feature request since this is a known limitation of the community edition.

While these things are typically not documented by Autodesk, it seems that code page 'ANSI_1252' means 'Latin1' for DXF R15. As newer DXF versions were released, its meaning changed at one point to 'UTF-8'.

Since the QCAD Community Edition has no support for newer DXF format versions, this means that all non-ASCII text has to be escaped (\U+xxxx).
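
As a rough illustration of that escaping, the Python sketch below replaces every non-ASCII character with a `\U+xxxx` sequence and converts it back. The function names are hypothetical, and the handling of code points above U+FFFF is an assumption (it simply refuses them), since four hex digits cannot hold them.

```python
import re


def escape_non_ascii(text):
    """Replace each non-ASCII character with a \\U+XXXX escape."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code <= 0xFFFF:
            out.append(f"\\U+{code:04X}")
        else:
            # Four hex digits cannot hold code points beyond the BMP; how
            # DXF consumers treat those is not covered here.
            raise ValueError(f"cannot escape U+{code:X} with four hex digits")
    return "".join(out)


_ESCAPE = re.compile(r"\\U\+([0-9A-Fa-f]{4})")


def unescape(text):
    """Turn \\U+XXXX escapes back into characters."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


escaped = escape_non_ascii("jelzőfénnyel")
print(escaped)
print(unescape(escaped))
```

The escaped form is pure ASCII, so it survives any single-byte code page, including ANSI-1252.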