Abstract

The Private Use Areas in the Unicode encoding are a limited resource which needs to be managed carefully. This document describes a management scheme which aims to allow maximum flexibility while still allowing data to be transferred between entities.

Introduction

When Unicode was first envisaged, it was heralded as having space for everyone and everything. Surely 65000 codepoints should be ample space for the world’s computing needs? With this mentality, even such extremes as giving every Hangul syllable its own codepoint were allowed. But those were halcyon days. Since then there has been a realization that 65000 codepoints is not really all that many, and more care is being taken over codepoint allocation. Add in the merging of Unicode and ISO 10646, and adding a new script to the standard Unicode encoding becomes a long and political process.

To allow for codes which are not in Unicode, Unicode includes an area called the Private Use Area (PUA). This area is a set of codes which are guaranteed never to be assigned to anything by the standard. It differs from other unallocated codes, which are reserved for future allocation and should not be used for private assignments. It is hard enough getting software to support codes assigned in the PUA; it is unjustifiable to demand that software support codes assigned at arbitrary reserved locations elsewhere.

In order to get around the problem of needing more than 65000 codes (for example, for such dead scripts as Egyptian hieroglyphics, Linear B, etc. or Klingon) a system was devised to combine two 16 bit Unicode codes to make one, longer (20 bit) code. These double codes are called surrogate pairs and using them allows software to access another 16 pages of 64K codes. Maybe this will be sufficient for all our needs! Of these extra pages, 2 are allocated for private use. The non-surrogate, standard Unicode, plane is called the Basic Multilingual Plane (BMP), while the surrogate planes are numbered from 1 to 16 inclusive. See Figure 1.

Figure 1: Unicode BMP Surrogates and Private Use areas

Use of the PUA

The Private Use Area extends from U+E000 to U+F8FF, and can be organized however someone wants. But it would be helpful if there were some corporate guidelines to facilitate communication, planning and the transfer of data. The structure presented here aims to provide the maximum flexibility to all concerned while limiting the danger of stepping on each other’s toes.

The first thing to note is that Windows symbol fonts use the region U+F000 to U+F0FF as their symbol encoding region. Therefore, it would seem wise to avoid using this area.

This splits the PUA into two blocks. The first block (U+E000 to U+EFFF) is allocated for entity use. This means that it is up to each entity (or group of entities, if there is agreement between them) how the 4096 codes should be allocated. Each entity controls its own area.

The second block (U+F100 to U+F8FF, 2048 codes) is for corporate allocation. Thus any codes needed to implement IPA could be placed there, along with Hebrew and Greek extensions, etc. The primary concern for allocation in this block is that the codes are of universal concern within the organization.
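The division described above can be written down as a small lookup. The following is only a sketch of the proposed layout; the names `ENTITY_BLOCK`, `SYMBOL_REGION`, `CORPORATE_BLOCK` and `classify_pua` are illustrative and not part of any existing tool.

```python
# Proposed division of the BMP Private Use Area (U+E000..U+F8FF).
# All names here are hypothetical, chosen for illustration only.
ENTITY_BLOCK = range(0xE000, 0xF000)     # U+E000..U+EFFF, entity controlled
SYMBOL_REGION = range(0xF000, 0xF100)    # U+F000..U+F0FF, Windows symbol fonts
CORPORATE_BLOCK = range(0xF100, 0xF900)  # U+F100..U+F8FF, corporate allocation


def classify_pua(codepoint: int) -> str:
    """Return which part of the proposed scheme a codepoint falls in."""
    if codepoint in ENTITY_BLOCK:
        return "entity"
    if codepoint in SYMBOL_REGION:
        return "symbol"
    if codepoint in CORPORATE_BLOCK:
        return "corporate"
    return "not-bmp-pua"
```

The block sizes quoted in the text follow directly from these ranges: `len(ENTITY_BLOCK)` is 4096 and `len(CORPORATE_BLOCK)` is 2048.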

Rendering

The rendering technology we currently have is really only capable of rendering one glyph per codepoint. But this is liable to change in the future. Therefore, it would make sense to separate codes which are allocated merely to get around the one glyph per codepoint constraint from truly new codes which we might want to see added to Unicode.

To facilitate this, it would seem natural to allocate new codes which we might want to see added to Unicode from the bottom of the block (U+E000) upwards, and the temporary glyph based codes from the top of the block (U+EFFF) downwards. Then, one day, when everyone has stopped using the temporary codes, they can possibly be reallocated.
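A two-ended allocation of this kind might be sketched as follows. This is a minimal illustration for the entity block only; the class name `EntityBlockAllocator` and its methods are assumptions made for the example, not an existing tool.

```python
class EntityBlockAllocator:
    """Hypothetical two-ended allocator for the entity block.

    Permanent codes (candidates for Unicode) grow upwards from the bottom;
    temporary glyph based codes grow downwards from the top.
    """

    def __init__(self, low: int = 0xE000, high: int = 0xEFFF) -> None:
        self.next_permanent = low
        self.next_temporary = high

    def allocate_permanent(self) -> int:
        # The block is full once the two ends have crossed.
        if self.next_permanent > self.next_temporary:
            raise RuntimeError("entity block is full")
        code = self.next_permanent
        self.next_permanent += 1
        return code

    def allocate_temporary(self) -> int:
        if self.next_temporary < self.next_permanent:
            raise RuntimeError("entity block is full")
        code = self.next_temporary
        self.next_temporary -= 1
        return code
```

Because the two kinds of code never interleave, the temporary codes end up in one contiguous run at the top of the block, which is what makes later reallocation plausible.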

It is expected that entities would need to add very few permanent codes, if any.

Figure 2 shows a summary of this discussion.

Figure 2: Proposed use of Private Use Area

Surrogate Pairs

So far we have presented an allocation scheme which works well within any particular entity. The entity controls part of the PUA for its needs and has some codes thrust upon it from a central allocation. But what happens if someone wishes to share data containing characters from one entity’s PUA allocation with someone in another entity?

One approach is for the recipient to remember that this new data has a different encoding from the rest of their data. But this soon becomes cumbersome, especially for an international consultant. The alternative is to provide a centrally defined encoding and the means to convert, cleanly, between the local, entity defined, encoding and the centrally defined encoding. Then if someone wants to transfer data, they can convert it to the universal encoding, and the recipient can either work with it in that encoding or convert it to their own local encoding.

The central encoding would not use the entity PUA block (U+E000 to U+EFFF) at all but would allocate codes in one of the Private Use Area surrogate blocks.

How it works

A surrogate pair consists of two codes. The first code is from the range U+D800 to U+DBFF (1024 codes) and the second from the range U+DC00 to U+DFFF (1024 codes). Each code is reduced to a number in the range 0..1023 by subtracting the start of its range. The first number is then multiplied by 1024 and added to the second to give a 20 bit number. Adding 0x10000 to this number gives the codepoint, which lies in one of 16 pages of 64K codes.
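The arithmetic can be sketched in a few lines. The function names `to_surrogates` and `from_surrogates` are chosen for this example only; the calculation itself is the standard UTF-16 one.

```python
def to_surrogates(codepoint: int) -> tuple[int, int]:
    """Split a codepoint above the BMP into its (first, second) surrogate codes."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint is not in a surrogate page")
    offset = codepoint - 0x10000          # 20 bit number
    first = 0xD800 + (offset >> 10)       # top 10 bits
    second = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return first, second


def from_surrogates(first: int, second: int) -> int:
    """Combine a surrogate pair back into a single codepoint."""
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
```

For example, the first code of page 15, U+F0000, is represented by the pair U+DB80, U+DC00.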

Planes 15 and 16 are reserved for private use. Therefore, we can take, say, page 15 and use it as a centrally organized page based on the needs of all the entities. Data would then be converted from the PUA encoding to use the same characters but using codes in surrogate page 15.

In order for this to work, an entity would need to inform a central agency of the allocations it is making in its PUA block. The agency could then allocate codes in page 15 for them and tell the entity what they are.
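As a rough sketch of what such a central agency might keep, the following registry hands out page 15 codes and converts entity data to the central encoding. Everything here is an assumption made for illustration: the class `CentralRegistry`, the idea of identifying a character by an agreed name, and the entity labels used in the test are not taken from any existing system.

```python
PLANE_15_FIRST = 0xF0000
PLANE_15_LAST = 0xFFFFD  # U+FFFFE and U+FFFFF are noncharacters


class CentralRegistry:
    """Hypothetical central agency records for page 15 allocations."""

    def __init__(self) -> None:
        self._next_code = PLANE_15_FIRST
        self._central_by_name: dict[str, int] = {}
        self._central_by_local: dict[tuple[str, int], int] = {}

    def register(self, entity: str, local_code: int, name: str) -> int:
        """Record an entity's PUA allocation and return its central code.

        Two entities registering the same character name share one code.
        """
        if name not in self._central_by_name:
            if self._next_code > PLANE_15_LAST:
                raise RuntimeError("page 15 is full")
            self._central_by_name[name] = self._next_code
            self._next_code += 1
        central = self._central_by_name[name]
        self._central_by_local[(entity, local_code)] = central
        return central

    def to_central(self, entity: str, text: str) -> str:
        """Convert an entity's data to the central encoding."""
        return "".join(
            chr(self._central_by_local.get((entity, ord(ch)), ord(ch)))
            for ch in text
        )
```

Keying the central code on a shared character name, rather than on the entity, is one way to make sure that two entities which independently allocate the same character end up with the same central code.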

Implementation

Even a scheme as seemingly simple as this is not trivial to implement, especially with the technological constraints we have. This section looks at some of the implementation issues.

PUA only

It is not possible to directly render surrogate based data without smart font rendering technology. Whilst such technology is in the pipeline, it is not available at the moment. Thus all data must be converted to a local PUA based encoding before being processed. With the arrival of smart font rendering, the need for any PUA codes may disappear for many, since correct use of Unicode will meet most people’s needs. Only those with obscure scripting issues will even need to consider the PUA.

Incompatibilities

One problem with the local PUA approach is that it may not be possible to map data from one local PUA scheme to another. This happens when the target scheme has not allocated codes for characters which are used in the source scheme. There is no real way around this except for a strongly centrally controlled encoding design, or for a dynamic allocation scheme.
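The failure case can be made concrete with a small sketch. It assumes each entity's mapping is available as a plain dictionary; the function name `convert_between_entities` and the choice to leave unmappable characters in their central form are assumptions of this example, not part of the proposal.

```python
def convert_between_entities(
    text: str,
    source_to_central: dict[int, int],
    central_to_target: dict[int, int],
) -> tuple[str, set[int]]:
    """Convert text from one local PUA scheme to another via the central codes.

    Returns the converted text and the set of source codes for which the
    target scheme has no allocation. Such characters are left as their
    central code in the output (an arbitrary choice for this sketch).
    """
    converted = []
    unmapped = set()
    for ch in text:
        code = ord(ch)
        if code not in source_to_central:
            converted.append(ch)  # ordinary character, passes through
            continue
        central = source_to_central[code]
        if central in central_to_target:
            converted.append(chr(central_to_target[central]))
        else:
            unmapped.add(code)
            converted.append(chr(central))
    return "".join(converted), unmapped
```

A non-empty second result is exactly the incompatibility described above: the source scheme uses a character for which the target scheme has allocated no code.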

The former approach is the same as saying that everyone should use the central encoding and that the NRSI (or whoever) should control the whole PUA for everyone. This would cause a bottleneck in the allocation process which may be unacceptable to many.

The latter approach is risky since it may result in each individual having their own encoding and getting in a real mess.

But the likelihood is that users will be transferring document type data and that this will have fonts associated with it. Since a font is tied to a particular encoding, it is likely that users will not bother to change the encoding but will just ensure they have the appropriate fonts with the data.

For other types of data, the issue is a question of composition with the surrogate form being the decomposed form and the PUA form being the composed form. Getting this right will also involve care on the part of software developers working with such data.

Conclusion

By giving control of the allocation process to the entity but holding information centrally, we get the best of both worlds with regard to unity and diversity. There is still work to be done in implementing the support tools, but I hope this approach would provide the following benefits:

Works with current technology and will improve with future technology.

Entity controlled and centrally served.

Conformant with the Unicode standard.

Figure 3 gives a summary diagram of the strategy.

Figure 3: SIL use of Surrogates and Private Use areas

Questions

In any discussion there are questions that may help:

The above proposal presents a strong compromise between centralization and decentralization regarding encodings. Would an alternative position be better?

Would it be better to give the central encoding more space in the PUA than the entity and let the central agency try to solve more of the problems for the entity? (i.e. swap blocks)

Who in your entity or entity group would be best suited to taking on the entity side of PUA care?

What other alternative schemes are there? What are their strengths and weaknesses?

Just to confuse matters, the corporate block (U+F100 to U+F8FF) would work the other way around from the entity block, with long term codes being added from U+F8FF downwards and temporary glyph type codes from U+F100 upwards. This is because of the confusion over U+F000 to U+F0FF.