Somehow while creating the various anagram implementations for the previous entries in this series, I neglected to create one for PHP. This was primarily because I hadn’t really even thought about it, in much the same way as I didn’t create a JavaScript version of the script. Nonetheless, it seems like a worthy endeavour; PHP is a very popular and well-used language, and there is no reason not to see how it stacks up against other languages in this sort of situation.

One of the main tasks in creating a PHP implementation is getting an environment set up. If I were to implement the script and run it on my webhost, the results would no longer correspond properly; the PHP has to be interpreted on my machine for the results to be something we can compare. PHP is designed more as a web-based language than as a local scripting environment. By this I mean it’s prevalent- and almost universally used- on the server side, whereas other languages like Python, Perl, etc. find a lot of use client-side as well. Because I hadn’t ever actually run PHP client-side I had to do a little searching, but the search was relatively short. Since I plan to run these tests on my Windows machine, I needed the PHP interpreter for Windows. What I didn’t need, however, was Apache; and while I have IIS installed and could probably run PHP that way, I was concerned that the web server components might add some overhead to the running of the script, and generally make the timing “off” in comparison to the other languages I used. As I write this I am installing the file from here, which ought to give me the tools to write a PHP equivalent of the Anagrams Algorithm. This entry (evidently) describes the PHP entrant into that discussion.

A naive anagram search program will typically have dismal performance, because the “obvious” brute-force approach of comparing every word to every other word does quadratic work. The more accepted algorithm is to use a Dictionary or Hashmap, and map lists of words to their sorted versions, so ‘mead’ and ‘made’ both sort to ‘adem’ and will be present in a list in the Hashmap with that key. A pseudocode representation:

Loop through all words. For each word:

create a new word by sorting the letters of the word.

use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.

When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.
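The steps above can be sketched in a few lines of Python (for comparison with the PHP to come), assuming a plain list of words as input:

```python
from collections import defaultdict

def find_anagrams(words):
    groups = defaultdict(list)
    for word in words:
        # 'mead' and 'made' both sort to 'adem', so they share a key.
        key = "".join(sorted(word))
        groups[key].append(word)
    # Lists with more than one element are groups of anagrams.
    return [g for g in groups.values() if len(g) > 1]
```

For example, `find_anagrams(["mead", "made", "dog", "god", "cat"])` yields `[["mead", "made"], ["dog", "god"]]`.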

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.

PHP is one of those languages that has a lot of foibles and historical issues, maintained in an attempt to keep things compatible. In some ways, it reminds me of the story of Visual Basic 6: after several language iterations, it still holds to its roots. So on the one hand, you have very old code that still works in the latest version as it used to, and on the other you have some tiny incompatibilities that might require changes to certain kinds of code. One of the most popular PHP-detracting articles on the web today is PHP: A Fractal of Bad Design, which makes some good points, but also ignores the fact that PHP was really never designed- it ‘evolved’. It was never really designed as a programming language, but rather as a way to get what needed to be done done in the fastest possible way and with the fewest headaches. The inevitable result is what we have now, which has some crusty deposits, but those crusty deposits are what make PHP so useful.

As far as the anagrams concept goes, I figure the best way to accomplish the goal is to use associative arrays. PHP has these built in, and this is arguably an extremely awesome feature that is frequently overlooked. This is the resulting implementation:


<?php
error_reporting(-1);
//timing start...
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$starttime=$mtime;
$anagrams=array();
$processedcount=0;
//open dictionary file...
$file_handle=fopen("D:\\dict.txt","r");
//iterate through each line.
while(!feof($file_handle)){
    $processedcount++;
    $word=fgets($file_handle);
    //split the word into each character.
    $letters=str_split($word);
    //sort the letters...
    sort($letters);
    //implode to a new string, and use that as an index into the associative array.
    $sorted=implode('',$letters);
    //determine if this sorted item already exists.
    if($anagrams[$sorted]==null){
        $anagrams[$sorted]=array($word);
    }else{
        $anagrams[$sorted]=array_merge($anagrams[$sorted],array($word));
    }
}
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$endtime=$mtime;
$totaltime=($endtime-$starttime);
echo $processedcount." processed in ".$totaltime." seconds\n";
?>

A run of this took 116 seconds. Now, PHP being slower than many other languages is not, by itself, surprising. However, this is slower than any other implementation by a factor of ten, so I think it is only fair to PHP to see if perhaps I did something wrong in the above implementation.

I am fairly certain that the issue is with my use of array_merge, since it rather re-creates the array every time, which can be costly given the dictionary has over 200K words. I changed it to use array_push:


<?php
error_reporting(-1);
//timing start...
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$starttime=$mtime;
$anagrams=array();
$processedcount=0;
//open dictionary file...
$file_handle=fopen("D:\\dict.txt","r");
//iterate through each line.
while(!feof($file_handle)){
    $processedcount++;
    $word=fgets($file_handle);
    //split the word into each character.
    $letters=str_split($word);
    //sort the letters...
    sort($letters);
    //implode to a new string, and use that as an index into the associative array.
    $sorted=implode('',$letters);
    //determine if this sorted item already exists.
    if($anagrams[$sorted]==null){
        $anagrams[$sorted]=array($word);
    }else{
        array_push($anagrams[$sorted],$word);
    }
}
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$endtime=$mtime;
$totaltime=($endtime-$starttime);
echo $processedcount." processed in ".$totaltime." seconds\n";
//sanity check: print one of the anagram lists.
$values=array_values($anagrams);
print_r($values[50]);
?>
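The reason array_merge can hurt is that it builds a new array rather than appending in place. The same distinction, sketched in Python terms (an analogy for illustration, not a claim about PHP internals):

```python
def merge_style(n):
    # Analogous to array_merge: build a new list each iteration,
    # copying everything accumulated so far (O(n) per step).
    acc = []
    for i in range(n):
        acc = acc + [i]
    return acc

def push_style(n):
    # Analogous to array_push: append in place (amortized O(1) per step).
    acc = []
    for i in range(n):
        acc.append(i)
    return acc
```

Both produce identical results, but the first does quadratic work overall, which adds up over a long enough loop.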

I also “fixed” a minor issue from before. We are basically working with a top-level associative array which uses the sorted words as keys, each of which stores another array of the words that sort to that sorted word- words which would, naturally, be anagrams of one another.
I also added a check at the bottom to make sure that I was getting some sort of result. This one clocked in at 114 seconds, which is still rather dismal. At this point I’m still convinced I’m just not using it “properly” but it is also possible that PHP really has this issue regarding speed. That said, I could probably find a Dictionary/HashTable class implementation on-line that would work better than a nested associative array. Another thing might be that I am simply using a slower PHP implementation. I’m using the installation from “php-5.3.20-nts-Win32-VC9-x86.msi” which is accessible from here. I went with the “Non thread-safe” version since that is usually faster. It’s also possible that running this on a Linux machine will yield faster results, since that is PHP’s ‘native’ environment. I considered the stumbling blocks with other languages, particularly Java and C#, with which I only got maximum performance when disabling debug mode. With that in mind, I sought to make tweaks to the php.ini file that came with the PHP interpreter, in the interest of increasing speed. I wasn’t able to make any significant changes to increase speed, though I did get it down to 109 seconds.

I think a fair suspect here is the File I/O being done only a tiny bit at a time. This was one of the issues I had with some of my other implementations as well. In this case it may also create and allocate more memory (possibly, depending on how the specific PHP implementation adds elements). So it makes sense to try to read the file in one fell swoop, and then iterate over those elements. With that in mind, I made some changes:


<?php
error_reporting(-1);
//timing start...
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$starttime=$mtime;
$anagrams=array();
$processedcount=0;
//read the entire dictionary file at once...
$lines=file("D:\\dict.txt",FILE_IGNORE_NEW_LINES);
//iterate through each line.
foreach($lines as $line){
    $processedcount++;
    $word=strtolower($line);
    //split the word into each character.
    $letters=str_split($word);
    //sort the letters...
    sort($letters);
    //implode to a new string, and use that as an index into the associative array.
    $sorted=implode('',$letters);
    //determine if this sorted item already exists.
    if($anagrams[$sorted]==null){
        $anagrams[$sorted]=array($word);
    }else{
        array_push($anagrams[$sorted],$word);
    }
}
$mtime=microtime();
$mtime=explode(" ",$mtime);
$mtime=$mtime[1]+$mtime[0];
$endtime=$mtime;
$totaltime=($endtime-$starttime);
echo $processedcount." processed in ".$totaltime." seconds\n";
//sanity check: print one of the anagram lists.
$values=array_values($anagrams);
print_r($values[50]);
?>

Which still didn’t fix the speed problem (113 seconds this time). I did notice that when I had error display on, I got a lot of notices- as the loop ran- about undefined indices when setting $anagrams[$sorted]. So if I had to hazard a guess, I’d say the speed issue is a combination of a very long loop (200K words) and the code being entirely interpreted.

The language itself is a bit messy, though it’s certainly not as bad as Perl as far as special punctuation marks go. Of course, with the loss of that special punctuation ($ for scalars, @ for arrays, and % for hashmaps in Perl, whereas PHP only keeps $ and lacks the multiple contexts) we get a simplified language that is a bit easier to work with cognitively but lacks some power and capability. Nonetheless, it is still a very workable language, and most of its capability can be leveraged by using existing frameworks and code; PEAR libraries can be found for pretty much every purpose. I hazard to wonder how it can be considered scalable, given the speed issues I’ve encountered, but I’m still quite certain I’m simply using it incorrectly.

I have to say I was surprised that I remembered the foreach syntax- foreach($array as $value)- off the top of my head. Not because it’s complex or esoteric, but because I have been messing with C#, Java, and other languages a lot in the meantime, many of which use very slight variations. Java uses its for loop with a special syntax, for(value:array); C# uses a foreach loop, foreach(value in array); Scala uses for(value <- array); and Python doesn’t technically have a standard for loop- everything is really a foreach, for value in array:. F# uses a similar syntax, replacing the colon with its own equivalent keyword: for value in array do. I was convinced when I used as that I was remembering some other language’s implementation and would have to look it up, but to my surprise I remembered it correctly.
Probably just luck, though, to be fair. Even though C# is my “main” language for desktop development, I do in fact find myself using the Java for syntax for foreach loops, too (or even trying to use for(value in array) and wondering why the IDE yells at me on that line until I facepalm that I used the wrong keyword).

One might think this particular example proves PHP is “slow”, but I disagree. I think it just shows that PHP is not well suited to iterating over hundreds of thousands of elements. But given its particular domain- web pages- you shouldn’t be presenting that much information to the user anyway, so that sort of task falls outside the language’s intended use cases.

In the works: Part XI, Haskell. I had a nice long one almost finished but I appear to have either forgotten to save it when I restarted my PC or something along those lines, so I’ve had to rewrite it. Thankfully I still have the actual Haskell code, which is arguably the most intensive part of each entry in the series.

I executed this within the desired directory (my BASeBlock source folder, if you must know) and the result was a file filled with numbers and filenames; I wrote a quick Python script to parse that and add up the numbers at the start of each line, but then I figured: why not just write the whole thing in Python and forget about the rest of it? So I did.


#!/usr/bin/python
import sys
import fnmatch
import os

def searchfolder(folder,filemask,excludemask):
    filelist=[]
    for root,subFolders,files in os.walk(folder):
        for file in files:
            buildpath=os.path.join(root,file)
            if os.path.isfile(buildpath):
                if fnmatch.fnmatch(buildpath,filemask):
                    if not fnmatch.fnmatch(buildpath,excludemask):
                        filelist.append(buildpath)
    return filelist

def Countlines(filename):
    #counts the lines in filename.
    countit=open(filename)
    runner=0
    for line in countit:
        runner=runner+1
    return runner

searchfor=sys.argv[1]
searchin=sys.argv[2]
excludethese=sys.argv[3]
print "examining files matching "+searchfor+" in directory "+searchin \
    +" excluding "+excludethese
totalcount=0
for countthese in searchfolder(searchin,searchfor,excludethese):
    #print Countlines(countthese)
    totalcount=totalcount+Countlines(countthese)
print "total line count:"+str(totalcount)

It’s a rather basic script, and I don’t even comment it as much as I ought to. I just wanted a quick tool to count the lines of code in a given directory for a given source file type. Ideally, I’d allow for multiple types, but I didn’t want to complicate the argument parsing code too much. The counting method is pretty barren; it just loops over every line and increments a counter, and it seems to work relatively fast. It quickly gave me the result I wanted, which was that BASeBlock’s .cs files comprise about 53K lines of code, excluding the .designer.cs files (thus the third argument). And now I have a nice reusable script to figure this out in a jiffy, without too much thinking about shell syntax or what arguments I need to pass to wc and whatnot. Plonked in a location on my Windows machine (with PATHEXT set to allow execution of .py files directly using the ActivePython interpreter), and put on my Linux machine with a symlink in /usr/bin, it’s available to me on both machines.

So, as mentioned in the previous post, I added a “sort” of scriptability to BASeBlock.

I made some tweaks, and refactored the code so it was a bit more abstracted; the original implementation was directly in the MultiTypeManager, but that didn’t really have anything to do with it, so I tweaked some of the parameters to a few methods there, added a new class with the static routines required for the needed functionality, etc. I also made it so that a “BASeBlock Script Group file” (.bbsg extension) could be used to both compile a set of files into an assembly, as well as include various other assemblies as required. Future additions will probably include the ability for each assembly to define a sort of “main” method, which can be called when the assembly is initialized.

However, once again, Serialization was the constant thorn in my side. I was able to mess about with a custom block written in a .cs script, and it even saved properly.

But the game encountered an exception when I tried to open that LevelSet; I forget the specifics- something to the tune of a “failed to find assembly” type of error. What could I do?

SerializationBinder

What this basically meant was I was going to have to learn even more about the Serialization structure of .NET- specifically, SerializationBinders. The concept was actually quite simple: you derive a type from SerializationBinder and use that as the .Binder property on an IFormatter; overriding one method seems to be enough for the most part:


Type BindToType(string assemblyName, string typeName)

Simple! It gives you an assembly Name and a Type Name, and you simply return the appropriate type. The reason the default implementation wasn’t working was certainly a result of the assembly being loaded dynamically: since it wasn’t really being referenced by the BASeBlock assembly, the default implementation didn’t find the “plugin” class assembly or the appropriate type, and so threw an exception.

The trick here is not to enumerate the referenced assemblies, but rather to use all loaded assemblies in the current AppDomain. The general consensus with regards to using CodeDOM and compiling things like this is to compile them to their own AppDomain; however, since the assemblies were being kept “alive” for the duration of the application, that wasn’t necessary, and in this case would have complicated things. Well, it would have complicated things more than they already were.
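For what it’s worth, Python’s pickle module exposes the same hook: Unpickler.find_class can be overridden to resolve a type from the modules already loaded in the process, rather than from a fixed reference- roughly the same idea as the Binder, sketched here as an analogy:

```python
import collections
import io
import pickle
import sys

class BindingUnpickler(pickle.Unpickler):
    # Roughly analogous to overriding SerializationBinder.BindToType:
    # given a (module, name) pair, locate the type ourselves, preferring
    # modules already loaded in this process.
    def find_class(self, module, name):
        mod = sys.modules.get(module)
        if mod is not None and hasattr(mod, name):
            return getattr(mod, name)
        # fall back to the default lookup otherwise
        return super().find_class(module, name)

data = pickle.dumps(collections.OrderedDict(a=1))
restored = BindingUnpickler(io.BytesIO(data)).load()
```

Here find_class is invoked with ("collections", "OrderedDict") during the load, and resolves it against the already-imported module.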

The “AssemblyName” parameter, however, was more akin to the FullName property of the System.Reflection.Assembly; for example, BASeBlock’s assembly would (for the current version) be passed in as “BASeBlock, Version=1.4.0.0, Culture=neutral, PublicKeyToken=null”. Since we are only interested in the actual name of the assembly, we can simply grab everything up to the first comma.

Armed with the Assembly’s base name, we can start enumerating all the loaded Assemblies and looking for a match:

Then, we compare Assembly.FullName.Split(‘,’)[0] (the text before the first comma) to BaseAssemblyName, which was modified in the same manner. I decided to go case-insensitive; mostly because the assembly names for the scripts are formed from the filenames, and I wouldn’t want filename capitalization to prevent a script from serializing/deserializing properly. If we find a matching assembly, we return the result of a GetType() call on that assembly with the same typename passed in as a parameter to the method. The Formatter will then attempt to deserialize the data it has to that type as needed.
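The matching logic itself is simple enough to sketch in Python (the actual code is C#; the function names here are illustrative only):

```python
def base_name(full_name):
    # "BASeBlock, Version=1.4.0.0, Culture=neutral, PublicKeyToken=null"
    # -> "baseblock": everything before the first comma, case-folded.
    return full_name.split(",")[0].strip().lower()

def find_matching(loaded_full_names, wanted_full_name):
    # Compare the base name of each loaded assembly against the one
    # the formatter asked for, case-insensitively.
    want = base_name(wanted_full_name)
    for candidate in loaded_full_names:
        if base_name(candidate) == want:
            return candidate
    return None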

There are a few issues with this- for one thing, it doesn’t work with Generic types. At least, I assume it doesn’t; I assume I would need some special code to get the appropriate Type for a generic type given type arguments. I’ll cross that bridge when I come to it.

Right now, I’ve not actually tested this extensively. My main concern would be to test that it can serialize in one session, and deserialize in another. This concern is based on the fact that the two assemblies would in that case be literally distinct- in that it would have been compiled on two occasions. Assuming the Binder is enough to convince the serializer it can deserialize something, there shouldn’t be any issues. Once I decide to figure out how to add Generics support to this Binder, I’ll definitely write about it here.