All the integer types (except long) are stored in 32 bits regardless of the variable type. To minimise the footprint, pack your data so it uses all 32 bits and use a static initialiser. I put mine in the class data and mark it private, e.g. private int[] data = {0x12345678, 0xFEDCBA98, 0x55AA55AA};
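A quick sketch of that idea (the names here are illustrative, not from the original post): pack four data bytes into each int and unpack them with shifts and masks.

```java
public class PackedData {
    // Four data bytes packed per int; a static initialiser fills the array.
    private static final int[] DATA = {0x12345678, 0xFEDCBA98, 0x55AA55AA};

    // Unpack byte i (0-based, big-endian within each int).
    static int byteAt(int i) {
        return (DATA[i >> 2] >>> (24 - ((i & 3) << 3))) & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(byteAt(0))); // prints "12"
        System.out.println(Integer.toHexString(byteAt(5))); // prints "dc"
    }
}
```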

If you look at the bytecode for using these array structures, you'll notice that each number in the array is added in the static initializer block. That, and each integer not in the byte range is added to the constant pool, means that the structure:

static final int[] x = { 1024 };

adds 5 bytes to the constant pool for the 1024 value, about 3 bytes for pushing the 1024 value into the array, plus the overhead of creating the array. (I don't have the classfile spec in front of me, so I don't know if the cost of pushing the value into the array is exactly right.)

Alternate options are to insert data into the classfile as an attribute block, or to load the data from a file in the deployed zip. I have a tool that I worked on which adds a binary data blob to a class file, and updates the class file to reference the correct position and size of the added data.

EDIT: Just realized this is in the 4k section, so would most certainly not be a viable solution. It might be useful for > 4k projects though.

You could store the data as a Base64 encoded string. The array:

[1,2,3,4,5,6,7,8,9,10]

would be represented as:

AQIDBAUGBwgJCg==

You would need to edit the array outside the source and encode it before inserting it into the source. You then need to implement a decode function in your code, so there is some performance and size overhead.
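As a sketch of how cheap the decode can be, here is an illustrative hand-rolled decoder (not from the thread; note that java.util.Base64 only became public API in Java 8):

```java
import java.util.Arrays;

public class B64 {
    private static final String ALPHA =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Decode a padded Base64 string into raw bytes.
    static byte[] decode(String s) {
        int pad = s.endsWith("==") ? 2 : s.endsWith("=") ? 1 : 0;
        byte[] out = new byte[s.length() / 4 * 3 - pad];
        int o = 0;
        for (int i = 0; i < s.length(); i += 4) {
            // Collect 24 bits from 4 characters, treating '=' as zero bits.
            int v = 0;
            for (int j = 0; j < 4; j++) {
                char c = s.charAt(i + j);
                v = (v << 6) | (c == '=' ? 0 : ALPHA.indexOf(c));
            }
            // Emit up to 3 bytes, stopping at the real output length.
            for (int j = 0; j < 3 && o < out.length; j++)
                out[o++] = (byte) (v >>> (16 - 8 * j));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode("AQIDBAUGBwgJCg==")));
        // prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    }
}
```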

At the moment I'm tending to use strings but only store one byte per character. It means that for true binary data (not that I really have any) your UTF-8 encoding overhead is larger than if you use the full range of char, but it avoids the problems with invalid characters.

(Someone would need to find out if the encoding is required here; I believe it is.) However, any byte value > 127 will cause the UTF-8 string to grow by more than storing 1 byte for that character (give or take).

To reduce this extra data usage, you could encode the data in what is essentially Base127. The decoding looks something like this:
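The snippet referred to here didn't survive; the following is a hedged reconstruction of what such a decoder might look like (7 payload bits per char, so every char stays below 0x80), together with the tool-side encoder:

```java
import java.util.Arrays;

public class B127 {
    // Tool side: pack 8-bit bytes into 7-bit chars, most significant bits first.
    static String encode(byte[] in) {
        StringBuilder sb = new StringBuilder();
        int acc = 0, bits = 0;
        for (byte b : in) {
            acc = (acc << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 7) {
                bits -= 7;
                sb.append((char) ((acc >>> bits) & 0x7F));
            }
        }
        if (bits > 0) sb.append((char) ((acc << (7 - bits)) & 0x7F));
        return sb.toString();
    }

    // Runtime side: repack the 7-bit chars into the original bytes.
    // (Caveat: a char of 0 still costs two bytes in the classfile's modified UTF-8.)
    static byte[] decode(String s, int outLen) {
        byte[] out = new byte[outLen];
        int acc = 0, bits = 0, o = 0;
        for (int i = 0; i < s.length() && o < outLen; i++) {
            acc = (acc << 7) | s.charAt(i);
            bits += 7;
            if (bits >= 8) {
                bits -= 8;
                out[o++] = (byte) (acc >>> bits);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d = {0x12, 0x34, (byte) 0xFF, 0x00};
        System.out.println(Arrays.equals(decode(encode(d), d.length), d)); // prints "true"
    }
}
```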

Unless you have a really big data blob with lots of expanded UTF-8 characters, I'm imagining that straight-up String embedding would be smaller than including a decoding routine like this (which has the added negative of pulling in a new method call, charAt, as well).

If you go the approach for reading data that was added to the bytecode itself, there's extra code overhead for retrieving that data as well.

YMMV, but it looks to me like String-encoded data may be the smallest approach.

Actually, why not use chars and UTF-16-encode the data? Java doesn't seem to care about invalid surrogate pairs; the only special case is for quote marks, and that doesn't have any effect on decoding. I think this is much more efficient than UTF-8.


You can put whatever String literal you like in the source code, but when it's written to the constant pool in the binary .class file it will be stored using modified UTF-8. Consequently you'll end up with it using 1, 2 or 3 bytes per character (or more, if you hit a surrogate pair!) - and given your example of using both the upper & lower 8 bits of the char, most will require 3+ bytes per input char.
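For reference, the per-char byte cost of a classfile's CONSTANT_Utf8 entry can be computed like this (per the JVM spec's modified UTF-8: 0x0001-0x007F take one byte; 0x0000 and 0x0080-0x07FF take two; everything else three):

```java
public class Utf8Cost {
    // Bytes a single char costs in a classfile's modified UTF-8 CP entry.
    static int cost(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1; // plain ASCII, minus NUL
        if (c <= 0x07FF) return 2;                // includes the 0x0000 special case
        return 3;                                 // everything else, incl. surrogates
    }

    static int cost(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) n += cost(s.charAt(i));
        return n;
    }

    public static void main(String[] args) {
        // A char using both the upper & lower 8 bits costs 3 bytes.
        System.out.println(cost((char) 0x1234)); // prints "3"
    }
}
```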

I think the best* suggestion so far is to Base64 encode it:
1. It'll expand your data by a factor of 4/3, BUT it'll compress very well.
2. Decoding it is very cheap (in terms of code size).


Is there a Base64 implementation in the public Java 1.5 API? I know that Sun has at least 2 implementations in their distribution, but they aren't public.

On another note, perhaps a simple 16-bit character encoding scheme, such as 4 bits + 8 bits per character? I'd need to go over the UTF-8 encoding scheme, but it would get you, in the worst case, the same ratio as MIME encoding (3/4), with a smaller decoding overhead cost, which is my biggest concern.

A classfile is composed of (roughly) three parts: header, constant pool and attributes, and the verifier ensures that a given classfile is valid at load time. Since you can't push any data into the header, this leaves the CP and the attributes. For the CP, the only choice is strings, which are always encoded in UTF8 (as Abuse stated). The only way I can think of to shove data into an attribute is via an annotation. Shoving a byte array into an annotation is bloated, so I don't think this is an option.

The piece of code that I posted above is intended as a build tool which generates a classfile (of a class named "D" with a single public String named "D"). This is to separate binary data from source during the development cycle.

All it does is shove the raw data provided into the CP entry of String "D", so it must be validly encoded. It's up to the user to provide a valid encoding, of which direct UTF8 is one option.

Directly encoding 8-bit data as UTF8 requires (on average) 1.5 bytes per byte, but as I stated above, the two byte encoding has exactly 4 10-bit prefixes and should compress well. So, for the UTF8 example, the decode source is:

byte[] data = D.D.getBytes("UTF8");
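A sketch of the round trip under that constraint (the embedded bytes must themselves form a valid UTF8 sequence, otherwise the CP entry won't verify; the class names here are illustrative):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Tool side: the raw data must already be valid UTF-8 for this to work.
        byte[] raw = {0x48, 0x69, 0x21};             // "Hi!" - valid UTF-8
        String embedded = new String(raw, "UTF8");   // what the tool writes into the CP

        // Runtime side: recover the bytes exactly as in the post above.
        byte[] back = embedded.getBytes("UTF8");
        System.out.println(Arrays.equals(raw, back)); // prints "true"
    }
}
```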

WRT Pack200: Pack200 is a front-end transform of one or more classes. By this I mean that it reorganizes the raw data into a form which is likely to compress better with a standard entropy compressor. The transforms applied to UTF8 entries mainly target member signatures and fully-qualified names. The pseudo-code directly from the spec (asserts deleted for length):

I've recoded Falcon4k to convert 5 chars into one 4-byte integer by coding each character as 32 + 0..94. This is slightly more efficient than using Base64 and has saved 224 bytes from my original jar, which used an integer array initialisation approach. However, for the pack.gz version the conversion only saved 61 bytes.
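A hedged sketch of such a scheme (illustrative, not the actual Falcon4k code): five chars in the printable range 32..126 act as base-95 digits, and since 95^5 > 2^32 they can represent any 4-byte integer.

```java
public class Base95 {
    // Decode 5 printable chars (each 32 + digit, digit in 0..94) into one int.
    static int decode(String s, int off) {
        long v = 0;
        for (int i = 0; i < 5; i++)
            v = v * 95 + (s.charAt(off + i) - 32);
        return (int) v; // the top bit lives in the long until the final cast
    }

    // Tool-side encoder, most significant digit first.
    static String encode(int x) {
        long v = x & 0xFFFFFFFFL;
        char[] c = new char[5];
        for (int i = 4; i >= 0; i--) {
            c[i] = (char) (32 + (v % 95));
            v /= 95;
        }
        return new String(c);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(0x12345678), 0) == 0x12345678); // prints "true"
    }
}
```

That's 5 classfile bytes per 4 data bytes (ratio 1.25) versus Base64's 4 per 3 (about 1.33), which matches the "slightly more efficient" claim above.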

Didn't JBanes have some tools to handle embedding in magical ways at one point?

Kev

Not sure if you meant the tool I made to embed binary data into a user-defined class attribute. Such an option is moot with Pack200, as it does not deal with unknown class attributes, which means you are better off just leaving the data as another file in the Pack200-compressed jar.

Yes, you are correct: you are able to define how to handle your attribute. However, I really doubt that there will be a net gain (i.e. reduced size) when compared to using a separate file in a jar that is then compressed into the Pack200 format, because you will need to transfer the definition of the attribute as well.

Presumably like usual: use Class#getResourceAsStream(classFileName), then search through it for your unique identifier indicating the start of the attribute.
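A sketch of that retrieval (the MAGIC marker and payload layout are assumptions for illustration; a real tool would also store the payload length rather than reading to end-of-file):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class AttrReader {
    // Hypothetical marker bytes; a real tool would pick a sequence that
    // cannot occur earlier in the classfile.
    static final byte[] MAGIC = {(byte) 0xCA, (byte) 0xFE, 0x42, 0x42};

    // Naive search for pat inside a; returns the offset or -1.
    static int indexOf(byte[] a, byte[] pat) {
        outer:
        for (int i = 0; i <= a.length - pat.length; i++) {
            for (int j = 0; j < pat.length; j++)
                if (a[i + j] != pat[j]) continue outer;
            return i;
        }
        return -1;
    }

    // Read this class's own .class file and return everything after MAGIC.
    static byte[] readPayload() throws IOException {
        InputStream in = AttrReader.class.getResourceAsStream("AttrReader.class");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int b; (b = in.read()) != -1; ) buf.write(b);
        byte[] all = buf.toByteArray();
        int at = indexOf(all, MAGIC);
        if (at < 0) return null; // class not patched
        return Arrays.copyOfRange(all, at + MAGIC.length, all.length);
    }
}
```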

There's a trade-off here between 100% packing by using binary attribute data vs. a 4-to-5 (or what have you) encoding in Strings, based on the constant code size for retrieving the data.

I was planning at some point to sit down and actually try to compute the different cut-off values. Unfortunately, it won't fully capture the possible additional optimizations that could be done on the data extraction process.

OK, so as per the API this may not work, but it works fine on everything I tested it on.

So the problem with storing binary data in a string seems to be encoding to and from characters. What about "padding" a string to the correct length in the .java file and putting the binary data directly into the class file? I have hacked "enter product key" things this way, using only the strings program and vim!

I have no special talents. I am only passionately curious.--Albert Einstein

Unfortunately, it won't fully capture the possible additional optimizations that could be done on the data extraction process.

It's worth noting here that pack200 supports a number of string encodings, and if you give it flags to make the best effort it seems to try different ones to produce optimal output for the statistics of the string you give it. In particular, if your data roughly increases, I think it will do difference encoding for you, so you needn't bother and can save yourself the code to undo it.

OK, so as per the API this may not work, but it works fine on everything I tested it on.

That (usually) works on the same computer, but try shipping those bytes to another computer. The conversion is done with a locale-dependent character encoding, so it might be MacRoman on OS X (certainly used to be); UTF-8 on my Linux box; ISO-8859-1 on my old Linux box; etc.
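The pitfall in a nutshell (the no-argument getBytes() uses the platform default charset, so only the explicit-charset form is portable):

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class CharsetPitfall {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "caf\u00e9";

        // Depends on the platform default charset - 4 bytes on ISO-8859-1,
        // 5 on UTF-8, something else again on other locales.
        byte[] unportable = s.getBytes();

        // Always the same bytes on every machine.
        byte[] portable = s.getBytes("UTF-8");
        System.out.println(Arrays.toString(portable));
        // prints [99, 97, 102, -61, -87]
    }
}
```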

What about "padding" a string to the correct length in the .java file and putting the binary data directly into the class file? I have hacked "enter product key" things this way, using only the strings program and vim!

The tool code I posted above basically shoves whatever data you want into the CP entry. The verifier will reject the class if the string is not a valid UTF8 encoding, so a raw 0x00 byte will cause the class to be rejected, since it is not valid modified UTF8.

That (usually) works on the same computer, but try shipping those bytes to another computer. The conversion is done with a locale-dependent character encoding, so it might be MacRoman on OS X (certainly used to be); UTF-8 on my Linux box; ISO-8859-1 on my old Linux box; etc.

I don't think the "encoding", outside of byte packing, changes the bytes. So UTF-8 does the 1, 2 or 3 byte thing, with restrictions on what can be encoded in each set, as stated above. The encoding is more about which char gets decoded to which "letter", as I thought. For one, I have never heard of MacRoman; that sounds like a char-to-font thing.

Also, I have done this over the network to other machines and had no problems. But some machines may default to UTF-16 or something, so I should use the versions that specify the encoding.

