Integers in PHP, running with scissors, and portability

Until recently I thought that the currently popular scripting languages, which mostly evolved over the last 10 years or so, must allow for easier portability across different platforms compared to ye good olde C/C++.

After all, their development started a few decades after C, so its notorious caveats are all well-known and should be easy to avoid when designing a new language, right?

However, PHP just brought me a new definition of “portable” – and that was when working with… integers.

PHP is not able to handle unsigned integers, and converts values over 2^31 to signed. So if your IDs go slightly over 2 billion, and PHP decides to treat them as integers, you’re in trouble.

Oh wait, no – that’s on 32-bit platforms only! PHP int size is platform-dependent, and it seems to be 8 bytes on our 64-bit boxes. Yes, the very same ones where C/C++ int is 4 bytes, you know.
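A quick way to see the platform dependence for yourself is to step one past the largest signed 32-bit value and look at the resulting type. This is only a minimal sketch; the variable names are made up for illustration.

```php
<?php
// 2147483647 is the largest value a signed 32-bit integer can hold.
$max32 = 2147483647;
$next  = $max32 + 1;

// With a 4-byte int the addition overflows and PHP silently promotes
// the result to float; with an 8-byte int it simply stays an int.
$kind = is_int($next) ? "int" : "float";

echo "2147483647 + 1 is stored as ", $kind, "\n";
```

The numeric value is the same either way; what changes is the type, and with it the behavior of casts, shifts and formatting further down the line.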

That was the easy part. It was mostly documented.

Now, there’s a function called unpack() which essentially lets you convert different types of data from binary strings to PHP variables. What if you try to unpack an unsigned 32-bit big-endian integer (format code “N”)? Let’s check the doc:

Having read the doc I personally blatantly relied upon it and expected that large unsigned 32-bit numbers would be converted to float, or string, or something, but handled properly. However, a couple of weeks ago or so the following notice suddenly appeared:

How sweet. No, it just could not behave as documented and convert a 32-bit unsigned value to float on x32 or keep it an integer on x64 – you now suddenly have to care about value size yourself. Ah, and by the way, there’s no official way to know what the int size is.

To make things even better, 5.2.1 introduced a nice bug in unpack(), which f..ed unpacking less-than-16-bit values on x64. (I assume you understand that “f..ed” means “fixed”). It took some time and several tries to convince the PHP team that x64 has enough bits to hold a 16-bit unpacked value, but thankfully it’s now acknowledged and assigned.

To summarize, if you need to unpack an unsigned 32bit int from binary stream, you have to:

convert it to float or string manually,

do that depending on int size on current platform,

which can not be done using anything documented,

and specifically avoid PHP 5.2.1 on x64.

Most people could probably learn all that, and then use sprintf("%u", $id), work with string IDs everywhere, avoid 5.2.1 and be happy.
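The string-ID approach can be sketched roughly as follows. The helper name unpack_u32_as_string() is hypothetical and not part of any library; it only wraps the unpack() and sprintf() calls mentioned above.

```php
<?php
// Hypothetical helper (not from the original post): read a big-endian
// unsigned 32-bit value and return it as a decimal string.
function unpack_u32_as_string($bytes)
{
    // unpack() returns an array indexed from 1, hence the empty first slot.
    list(, $value) = unpack("N", $bytes);

    // With a 4-byte int, values of 2^31 and above come back negative;
    // "%u" prints the same bits as an unsigned decimal number.
    return sprintf("%u", $value);
}

echo unpack_u32_as_string("\xff\xff\xff\xff"), "\n";
```

Keeping the ID as a string from this point on avoids any further dependence on the int size, at the cost of not being able to do native arithmetic on it.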

Unfortunately, my final goal was to have support for 64-bit document IDs…

Let’s do a small time travel. Integer types in C/C++ have always been a pain, but back in 1999 the ISO committee ratified the ISO/IEC 9899:1999 standard, also known as ISO C99, which guarantees that the “long long int” integer type must be at least 64 bits in size. By now, most compilers support that part perfectly.

However, the designers of the PHP 5 type system (PHP 5 was released in 2004) were either not aware of this change, or decided not to rely on a standard which had been out for “only” 5 years by then, or just thought that 31 (no typo) bits and 640K should be enough for everybody.

Long story short, it’s 2007 now but there’s no native 64-bit integer type in PHP. Let me remind you that the built-in “int” might be 64-bit, but then again it might not be, and there’s no official way to tell.

This time, there are a number of routes one could take – either use ints (and pray that the app is never run on x32, and that the “platform dependent” size does not change to 4 in the next version); or use the GMP or bcmath extensions if they are available.

Fine, so 99.999% of the world would hit that, compile in bcmath, and be happy again.

Unfortunately, I needed to develop a library which could be deployed in any environment – and still work, and produce reasonable results. The worst case is x32, and neither GMP nor bcmath available.

And this is how the following code was born.

    /// portably build 64bit id from 32bit hi and lo parts
    function _Make64($hi, $lo)
    {
        // on x64, we can just use int
        if (((int)4294967296) != 0)
            return (((int)$hi) << 32) + ((int)$lo);

        // workaround signed/unsigned braindamage on x32
        $hi = sprintf("%u", $hi);
        $lo = sprintf("%u", $lo);

        // use GMP or bcmath if possible
        if (function_exists("gmp_mul"))
            return gmp_strval(gmp_add(gmp_mul($hi, "4294967296"), $lo));

        if (function_exists("bcmul"))
            return bcadd(bcmul($hi, "4294967296"), $lo);

        // compute everything manually
        $a = substr($hi, 0, -5);
        $b = substr($hi, -5);
        $ac = $a * 42949; // hope that float precision is enough
        $bd = $b * 67296;
        $adbc = $a * 67296 + $b * 42949;
        $r4 = substr($bd, -5) + substr($lo, -5);
        $r3 = substr($bd, 0, -5) + substr($adbc, -5) + substr($lo, 0, -5);
        $r2 = substr($adbc, 0, -5) + substr($ac, -5);
        $r1 = substr($ac, 0, -5);

        // propagate carries; the test must be >=, because a group
        // equal to exactly 100000 has to carry as well
        while ($r4 >= 100000) { $r4 -= 100000; $r3++; }
        while ($r3 >= 100000) { $r3 -= 100000; $r2++; }
        while ($r2 >= 100000) { $r2 -= 100000; $r1++; }

        // glue the groups together and strip leading zeroes
        $r = sprintf("%d%05d%05d%05d", $r1, $r2, $r3, $r4);
        $l = strlen($r);
        $i = 0;
        while ($r[$i] == "0" && $i < $l - 1)
            $i++;
        return substr($r, $i);
    }

    list(, $a) = unpack("N", "\xff\xff\xff\xff");
    list(, $b) = unpack("N", "\xff\xff\xff\xff");
    $q = _Make64($a, $b);
    var_dump($q);
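The manual route is schoolbook multiplication in groups of five decimal digits: 2^32 = 4294967296 splits into 42949 and 67296, the high part splits into $a and $b, and the product is reassembled from the partial products. The sketch below checks that identity on a deliberately small example value (an assumption chosen so that every intermediate result stays exact even when PHP falls back to floats); it is an illustration of the arithmetic, not a replacement for the function.

```php
<?php
// 2^32 = 4294967296 = 42949 * 10^5 + 67296
$c = 42949;
$d = 67296;

// Example hi value, small enough that hi * 2^32 stays below 2^53
// and is therefore exact even as a float on a 4-byte-int platform.
$hi = 123456;
$a  = (int) floor($hi / 100000); // leading digits
$b  = $hi % 100000;              // last five digits

// Partial products, named as in the manual route.
$ac   = $a * $c;
$adbc = $a * $d + $b * $c;
$bd   = $b * $d;

// hi * 2^32 = ac * 10^10 + (ad + bc) * 10^5 + bd
$recombined = $ac * 10000000000 + $adbc * 100000 + $bd;
$direct     = $hi * 4294967296;

var_dump($recombined == $direct);
```

In the real function the partial products are never summed into one number; they are cut into five-digit groups with substr() and carried by hand, precisely because the full result does not fit any native type on the worst-case platform.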

For reference, this is what the equivalent C/C++ snippet would look like:

35 Comments

Thank you Andrew for posting your findings. I know you spent quite a while implementing it in a portable way for the Sphinx API.

I had similar surprise moving to 64bit PHP with crc32() function which magically started to return different values.

I think MySQL had much better approach in this regard. MySQL internal integer math was always 64bit even on 32bit platforms. This was a bit of performance penalty but not too much in reality, but at least you have good portability.

I’ve just done quick-n-dirty speed testing and the slowest “manual” route is 5.5 times slower than bcmath. bcmath yields ~128K calls/sec, and manual ~23K calls/sec. This is on AXP-3200+ under WinXP and PHP 4.4.1.

Normally there would be only a few records processed (say, 20-100) so both speeds can be tolerated. So I’m much more surprised by the number of issues and workarounds required to perform a very simple 64-bit operation which has been in the C standard for ages now.

I stand corrected. There is an official way since 4.4.0, called PHP_INT_SIZE.

However, this bit is hidden pretty well IMO. PHP_INT_SIZE is neither mentioned in the section on integer types, nor can be easily found in the documentation (results from http://www.php.net/results.php?q=int+size&p=manual&l=en are just irrelevant).

Another thing: PHP_INT_SIZE is a rather recent addition, so relying on it would make sphinx_api incompatible with a large number of older PHP versions that it could otherwise support.
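One way to get the best of both worlds might be to use the constant when it exists and fall back to a probe otherwise. The helper name int_size_bytes() is hypothetical, and the fallback assumes that only 4-byte and 8-byte ints occur in practice.

```php
<?php
// Hypothetical helper: report the size of PHP's int type in bytes.
function int_size_bytes()
{
    // Official route, where the constant is available.
    if (defined("PHP_INT_SIZE")) {
        return PHP_INT_SIZE;
    }

    // Fallback for versions without the constant. The literal below does
    // not fit into 32 bits, so a 4-byte-int build parses it as a float.
    // Assumption: ints are either 4 or 8 bytes wide.
    return is_int(4294967296) ? 8 : 4;
}

echo "int is ", int_size_bytes(), " bytes here\n";
```

The probe relies only on how integer literals are parsed, so it does not depend on the result of an out-of-range cast.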

I guess it was added as these problems started to pop up a lot relatively recently. 3 years ago I guess vast majority of PHP users were running 32bit systems or at least there was no massive migration 32bit->64bit.

Type size depending on architecture is definitely a big problem in PHP if you try to have your code working on both 32 and 64 bit machines in parallel.

In PHP, the variable size of the integer type is a “feature” (probably the developers wanted to make use of the native C types for efficiency).
It’s also documented on http://www.php.net/manual/en/language.types.integer.php:
“The size of an integer is platform-dependent, although a maximum value of about two billion is the usual value (that’s 32 bits signed). PHP does not support unsigned integers.”

You will also run into problems when doing bit shifting operations in your code or check if a certain bit is set or not – the result may vary depending on the machine you run the code on!

There are also documented issues with the pack and unpack functions (http://www.php.net/manual/en/function.pack.php) as they allow you to use machine-dependent formats:
i signed integer (machine dependent size and byte order)
I unsigned integer (machine dependent size and byte order)
f float (machine dependent size and representation)
d double (machine dependent size and representation)

If possible, try to avoid using these formats as they may give you different results depending on your architecture.
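For comparison, the fixed-size formats with an explicit byte order give the same bytes everywhere. A small sketch using "N" (32-bit big-endian) and "V" (32-bit little-endian), with an arbitrary example value:

```php
<?php
$n = 258; // 0x00000102, an arbitrary example value

// Explicit size and byte order: identical output on every platform.
$big    = pack("N", $n); // 32-bit, big-endian
$little = pack("V", $n); // 32-bit, little-endian

// Machine-dependent: length and byte order follow the host.
$native = pack("I", $n);

echo bin2hex($big), " ", bin2hex($little), "\n";

// Reading the value back with the matching format.
list(, $roundtrip) = unpack("N", $big);
```

The same applies to the 16-bit codes "n" and "v"; only the machine-dependent codes listed above change with the architecture.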

You may well have a problem if you rely on third party code that does not care about these issues. We are using some external code to read in binary files produced by Excel and that code happily used pack, unpack and bit shift operations all over the place without caring about int size, e.g.:

The problem is that you have to adjust the application to be aware of the architecture and put in nasty and slow workarounds.

There are also well known issues with the crc32 function which may give you different results on 32 and 64 bit machines if you do not reformat its results with dechex or sprintf accordingly.
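A sketch of the reformatting idea, with hypothetical wrapper names: convert the checksum to a hex or unsigned decimal string right away, and only ever compare or store the string.

```php
<?php
// Hypothetical wrappers around crc32() that return strings, so the
// result looks the same whether the raw int came back negative or not.
function crc32_hex($data)
{
    return sprintf("%08x", crc32($data));
}

function crc32_unsigned($data)
{
    return sprintf("%u", crc32($data));
}

// "123456789" is the usual CRC-32 check input.
echo crc32_hex("123456789"), " ", crc32_unsigned("123456789"), "\n";
```

Comparing the raw return values of crc32() across machines is what breaks; the formatted strings are stable.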

Finally, you may also run into trouble with serialize and unserialize. For example, try on
php -r "var_dump(unserialize('i:234444444444444344;'));"
It will return either
int(234444444444444344)
or
int(-424884552)
depending on architecture.
So you cannot properly unserialize big int values serialized on 64 bit systems on 32 bit systems. Imagine using a mixed environment with 32 and 64 bit servers that need to share their data.

Some PHP extensions or the underlying library also had issues on 64bit systems in the past, e.g. cracklib did not work on 64 bit some time ago because it used machine dependent ints in structs that read in some file header information.

There are other nasty portability issues with PHP which are even worse if you ask me.
For instance, fgetcsv is locale-dependent (documented in the manual):
“Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.”
So you have to either switch the system locale by your application or write your own fgetcsv substitution.

Probably the PHP documentation team can add a page to the PHP manual listing the obvious portability issues?

Ah, man, this is so the “PHP way”, it’s ridiculous. How can they do that ? I ask myself this every day as I discover yet another pile of shit hidden under the carpet of this joke of a language. Like their object model which doesn’t support overloading static methods. DUMB DUMB DUMB ! And they won’t fix it !

Just a quick note to say thank you. I’m running into exactly the same issues and have had to battle with them while I was writing a bytecode assembler in PHP previously also. This is a really great reference and it’s good to know that I’m not going crazy! (Well, maybe a little!) 🙂

I’m sure that most of you think that when I rant and rave about the horrors of PHP, I’m just being an elitist old bitch, with my closures and namespaces and garbage collection which works properly, and what-not. PHP is…

I am trying to graph some InnoDB stats in Cacti, and must extract them from SHOW INNODB STATUS, which prints them as a hi/lo 64-bit unsigned number, of course. I have to do some math on them to subtract one from the other. I ended up just sending them back to MySQL and doing a SELECT, casting them to string at the same time:

[…] script that gathers the data is totally rewritten from scratch, and much improved. For example, the math works on 32-bit systems. It has caching built-in so each poll cycle results in just one request to the server, instead of […]

One more thing. After testing with a considerable number of different 64-bit values, I found that some were combined incorrectly via the manual computation route. The bug is in the three “while ( $r4>100000 )” ($r3, $r2) loops, which should test with “>=” rather than “>”.
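To illustrate why the comparison matters, here is a minimal sketch of the carry step on its own. The helper carry_limbs() is hypothetical; it works on an array of base-100000 groups, most significant first, and mirrors what the three loops do.

```php
<?php
// Hypothetical helper: propagate carries through base-100000 groups,
// most significant group first.
function carry_limbs(array $limbs)
{
    for ($i = count($limbs) - 1; $i > 0; $i--) {
        // ">=" is essential: with ">" a group equal to exactly 100000
        // would be left alone and later printed as six digits by "%05d".
        while ($limbs[$i] >= 100000) {
            $limbs[$i] -= 100000;
            $limbs[$i - 1]++;
        }
    }
    return $limbs;
}

$fixed = carry_limbs(array(0, 0, 0, 100000));
print_r($fixed);
```

With the original ">" test the last group would stay at 100000, and sprintf("%05d") would emit "100000", shifting every digit to its left by one place.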