ReactJS in PHP: Writing Compilers Is Easy and Fun!


I used to use an extension called XHP. It enables HTML-in-PHP syntax for generating front-end markup. I reached for it recently, and was surprised to find that it was no longer officially supported for modern PHP versions.

So, I decided to implement a user-land version of it, using a basic state-machine compiler. It seemed like it would be a fun project to do with you!

Creating Compilers

Many developers avoid writing their own compilers or interpreters, thinking that the topic is too complex or difficult to explore properly. I used to feel like that too. Compilers can be difficult to make well, and the theory behind them runs deep. But that doesn’t mean you can’t make a compiler.

Making a compiler is like making a sandwich. Anyone can get the ingredients and put them together. You can make a sandwich. You can also go to chef school and learn how to make the best damn sandwich the world has ever seen. You can study the art of sandwich making for years, and people can talk about your sandwiches in other lands. You’re not going to let the breadth and complexity of sandwich-making prevent you from making your first sandwich, are you?

Compilers (and interpreters) begin with humble string manipulation and temporary variables. When they’re sufficiently popular (or sufficiently slow), the experts can step in to replace the string manipulation and temporary variables with unicorn tears and cynicism.

At a fundamental level, compilers take a string of code and run it through a couple of steps:

The code is split into tokens – meaningful characters and sub-strings – which the compiler will use to derive meaning. The statement if (isEmergency) alert("there is an emergency") could be considered to contain tokens like if, isEmergency, alert, and "there is an emergency"; and these all mean something to the compiler.

The first step is to split the entire source code up into these meaningful bits, so that the compiler can start to organize them in a logical hierarchy, so it knows what to do with the code.

The tokens are arranged into the logical hierarchy (sometimes called an Abstract Syntax Tree) which represents what needs to be done in the program. The previous statement could be understood as “Work out if the condition (isEmergency) evaluates to true. If it does, run the function (alert) with the parameter ("there is an emergency")”.

Using this hierarchy, the code can be immediately executed (in the case of an interpreter or virtual machine) or translated into other languages (in the case of languages like CoffeeScript and TypeScript, which are both compile-to-JavaScript languages).
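To make the first two steps a little more concrete, here is a minimal sketch for the emergency statement above. The sketchTokens function and the shape of the $tree array are assumptions made up for illustration; real tokenizers also record positions and token types, and real syntax trees are richer than this.

```php
<?php

// Hypothetical helper: split a tiny statement into tokens with a single
// regular expression. Whitespace is simply skipped.
function sketchTokens(string $code): array
{
    // Match, in order: double-quoted strings, identifiers, and parentheses
    preg_match_all('/"[^"]*"|[A-Za-z_]\w*|[()]/', $code, $matches);

    return $matches[0];
}

$tokens = sketchTokens('if (isEmergency) alert("there is an emergency")');
// $tokens now holds: if, (, isEmergency, ), alert, (, "there is an emergency", )

// One possible hand-written hierarchy for those tokens
// (an assumed shape, not a standard AST format)
$tree = [
    "type" => "if",
    "condition" => ["type" => "identifier", "name" => "isEmergency"],
    "body" => [
        "type" => "call",
        "function" => "alert",
        "parameters" => ['"there is an emergency"'],
    ],
];
```

An interpreter would walk $tree and act on each node; a compile-to-PHP tool would walk it and print new code instead.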

In our case, we want to maintain most of the PHP syntax, but we also want to add our own little bit of syntax on top. We could create a whole new interpreter…or we could preprocess the new syntax, compiling it to syntactically valid PHP code.

I’ve written about preprocessing PHP before, and it’s my favorite approach to adding new syntax. In this case, we need to write a more complex script; so we’re going to deviate from how we’ve previously added new syntax.

Generating Tokens

Let’s create a function to split code into tokens. It begins like this:

We’re off to a good start. By stepping through the code, we can check to see what each character is (and identify the ones that matter to us). We’re seeing, for instance, that the first element opens when we encounter a < character, at index 5. The first element closes at index 210.
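The stepping-through idea can be sketched roughly like this. It is not the original listing; sketchScan and the sample input are assumptions, so the positions it reports differ from the ones quoted above.

```php
<?php

// Hypothetical first pass: walk the code one character at a time and
// record where something that looks like an element opens or closes.
function sketchScan(string $code): array
{
    $found = [];
    $length = strlen($code);

    for ($cursor = 0; $cursor < $length; $cursor++) {
        if ($code[$cursor] === "<") {
            $found[] = ["opens", $cursor];
        }

        if ($code[$cursor] === ">") {
            $found[] = ["closes", $cursor];
        }
    }

    return $found;
}

$found = sketchScan('<?php return <div>hello</div>;');
// The very first entry is ["opens", 0], which belongs to the PHP tag
// at the start of the file rather than to an element
```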

Unfortunately, that first opening is being incorrectly matched to <?php. That’s not an element in our new syntax, so we have to stop the code from picking it out:
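One way to do that, as a sketch under assumptions (the function name is hypothetical, and the original code may have solved this differently), is to peek at the character after each < and skip the match when it is a question mark:

```php
<?php

// Hypothetical refinement of a character scan: a "<" only counts when it
// is not the start of a PHP tag such as "<?php" or "<?=".
function sketchElementOpenings(string $code): array
{
    $openings = [];
    $length = strlen($code);

    for ($cursor = 0; $cursor < $length; $cursor++) {
        // Peek one character ahead; "<" followed by "?" is a PHP tag
        if ($code[$cursor] === "<" && substr($code, $cursor + 1, 1) !== "?") {
            $openings[] = $cursor;
        }
    }

    return $openings;
}

$openings = sketchElementOpenings('<?php return <div>hello</div>;');
// Only the "<" of the opening tag and the "<" of the closing tag remain
```

This is still naive: a less-than comparison in ordinary PHP code would be picked up too, so a fuller version needs more context than one character of lookahead.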

As with JSX, it would be good for attributes to allow dynamic values (even if those values are nested JSX elements). There are a few ways we could do this, but the one I prefer is to treat all attributes as text, and tokenize them recursively. To do this, we need to have a kind of state machine which tracks how many levels deep we are in an element and attribute. If we’re inside an element tag, we should trap the top level {…} as a string attribute value, and ignore subsequent braces. Similarly, if we’re inside an attribute, we should ignore nested element opening and closing sequences:

We’ve added new $attributeLevel, $attributeStarted, and $attributeEnded variables to track how deep we are in the nesting of attributes, and where the top level starts and ends. Specifically, if we’re at the top level when an attribute’s value starts or ends, we capture the current cursor position. Later, we’ll use this to extract the string attribute value and replace it with a placeholder.
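Here is a reduced sketch of that attribute tracking, using the same variable names. It is an illustration under assumptions, not the full tokenizer: it ignores element tracking entirely and would be confused by braces inside string literals.

```php
<?php

// Hypothetical, reduced state machine: find where each top-level {…}
// value starts and ends, ignoring any braces nested inside it.
function sketchAttributeRanges(string $code): array
{
    $ranges = [];
    $attributeLevel = 0;
    $attributeStarted = null;
    $length = strlen($code);

    for ($cursor = 0; $cursor < $length; $cursor++) {
        if ($code[$cursor] === "{") {
            if ($attributeLevel === 0) {
                // Only the outermost brace marks the start of a value
                $attributeStarted = $cursor;
            }

            $attributeLevel++;
        }

        if ($code[$cursor] === "}" && $attributeLevel > 0) {
            $attributeLevel--;

            if ($attributeLevel === 0) {
                // Back at the top level, so the value ends here
                $attributeEnded = $cursor;
                $ranges[] = [$attributeStarted, $attributeEnded];
            }
        }
    }

    return $ranges;
}

$code = '<div onClick={function () { return 1; }}>';
$ranges = sketchAttributeRanges($code);
[$started, $ended] = $ranges[0];
$value = substr($code, $started, $ended - $started + 1);
```

The inner braces of the closure raise and lower $attributeLevel without producing a range of their own, which is the "ignore subsequent braces" behavior described above.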

We’re also starting to capture $elementStarted and $elementEnded (with $elementLevel fulfilling a similar role to $attributeLevel) so that we can capture a full element opening or closing tag. In this case, $elementEnded doesn’t refer to the closing tag but rather the closing sequence of characters of the opening tag. Closing tags are treated as entirely separate tokens…

After extracting a small substring after the current cursor position, we can see elements and attributes starting and ending exactly where we expect. The nested control structures and elements are captured as strings, leaving only the top-level elements, non-attribute nested elements, and attribute values.

Let’s package these tokens up, associating attributes with the tags in which they are defined:

There’s a lot going on here, but it’s all just a natural progression from the previous version. We use the captured attribute start and end positions to extract the entire attribute value as one big string. We then replace each captured attribute with a numeric placeholder and reset the code string and cursor positions.
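The placeholder step can be sketched on its own. The function name, the inclusive [start, end] pairs, and the __attribute_N__ placeholder format are all assumptions for illustration; the original code reset the cursor as it went, whereas this sketch works backwards through ranges that are already known.

```php
<?php

// Hypothetical helper: swap each captured {…} range for a numeric
// placeholder. $ranges holds inclusive [start, end] pairs, sorted by start.
function sketchPlaceholders(string $code, array $ranges): array
{
    $attributes = [];

    // Work backwards so earlier positions stay valid while the string changes
    foreach (array_reverse($ranges, true) as $index => [$start, $end]) {
        // Keep the value without its surrounding braces
        $attributes[$index] = substr($code, $start + 1, $end - $start - 1);

        $code = substr_replace(
            $code,
            "__attribute_{$index}__",
            $start,
            $end - $start + 1
        );
    }

    ksort($attributes);

    return [$code, $attributes];
}

[$replaced, $attributes] = sketchPlaceholders(
    '<a b={x} c={y}>',
    [[5, 7], [11, 13]]
);
```

After this runs, the tag no longer contains any braces, so it can be treated as plain text, and each extracted value can be tokenized separately.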

As each element closes, we associate all the attributes since the element was opened, and create a separate array token from the tag (with its placeholders), attributes and starting position. The result may be a little harder to read, but it is spot on in terms of capturing the intent of the code.

Before we associate the attributes, we loop through them and tokenize their values with a recursive function call. We also need to append any remaining text (not inside an attribute or element tag) to the tokens array or it will be ignored.

The result is a list of tokens which can have nested lists of tokens. It’s almost an AST already.

Organizing Tokens

Let’s transform this list of tokens into something more like an AST. The first step is to exclude closing tags that match opening tags. We need to identify which tokens are tags:

I’ve extracted a list of tokens from the last token script, so that I don’t have to run and debug that function anymore. Inside a loop, similar to the one we used during tokenization, we print just the non-attribute element tags. Let’s figure out if they’re opening or closing tags, and also whether the closing tags match the opening ones:

Take some time to study what’s going on here. We create a $nodes array, in which to store the new, organized node structures. We also have a $current variable, to which we assign each opening tag node by reference. This way, we can step down into each element (opening tag, closing tag, and the tokens in between), and step back up when we encounter a closing tag.

The references are the trickiest part of this, but they’re essential to keeping the code relatively simple. I mean, it’s not that simple; but it is much simpler than a non-reference version.
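The reference technique is easier to see on a reduced example. In this sketch the token shape ("type", "name", "value") and the $parents stack are assumptions, and it does not check that a closing tag matches the tag it closes.

```php
<?php

// Hypothetical tree builder: $current always points (by reference) at the
// list of children we are currently appending to.
function sketchNodes(array $tokens): array
{
    $nodes = [];
    $current = &$nodes;
    $parents = [];

    foreach ($tokens as $token) {
        if ($token["type"] === "open") {
            $current[] = ["tag" => $token["name"], "children" => []];

            // Remember where we were, then step down into the new node
            $parents[] = &$current;
            $index = count($current) - 1;
            $current = &$current[$index]["children"];
        } elseif ($token["type"] === "close") {
            if ($parents) {
                // Step back up to the list that contains this element
                $current = &$parents[count($parents) - 1];
                array_pop($parents);
            }
        } else {
            $current[] = ["text" => $token["value"]];
        }
    }

    unset($current);

    return $nodes;
}

// Tokens for: <div>hello<span>world</span></div>
$nodes = sketchNodes([
    ["type" => "open", "name" => "div"],
    ["type" => "text", "value" => "hello"],
    ["type" => "open", "name" => "span"],
    ["type" => "text", "value" => "world"],
    ["type" => "close", "name" => "span"],
    ["type" => "close", "name" => "div"],
]);
```

Because $current is a reference, appending to it writes straight into the right place in $nodes, however deep the nesting goes.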

We don’t have the cleanest function in terms of how it works recursively. So, when we pass the attributes through the nodes function, we sometimes get empty “token” attributes alongside nested tag attributes. Because of this, we need to filter the attributes to first try and return a nested tag before returning a non-empty token attribute value. This could be cleaned up quite a bit…

Rewriting Code

Now that the code is neatly arranged in a hierarchy or AST, we can rewrite it into valid PHP code. Let’s begin by writing just the string tokens (which aren’t nested inside elements), and formatting the resulting code:

When we find a tag node, we loop through the attributes and build a new attributes array that is either just text from token nodes or parsed tags from tag nodes. This bit of recursion deals with the possibility of attributes that are nested elements. Our regular expression only handles attributes quoted with single quotes (for the sake of simplicity). Feel free to make a more comprehensive expression, to handle more complex attribute syntax and values.

I went ahead and installed pre/short-closures, so that the arrow function would be expanded to a regular function:

composer require pre/short-closures

There’s also a handy PSR-2 formatting function in there, so our code is formatted according to the standard.

We parse each tag child, and directly quote each token child (adding slashes to account for nested quotes). Then, when we’re building the parameter array, we loop over the children and add each to the string of code our parse function ultimately returns.
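A stripped-down sketch of that code generation step might look like the following. The node shape and the exact form of the generated call (a single array with a "children" key) are assumptions, and attributes are left out entirely.

```php
<?php

// Hypothetical code generator: tag nodes become pre_* function calls and
// text nodes become quoted strings.
function sketchParse(array $nodes): string
{
    $code = [];

    foreach ($nodes as $node) {
        if (isset($node["tag"])) {
            // Recurse so nested elements become nested function calls
            $children = sketchParse($node["children"]);
            $code[] = "pre_" . $node["tag"]
                . "([\"children\" => [" . $children . "]])";
        } else {
            // addslashes escapes quotes nested inside the text
            $code[] = '"' . addslashes($node["text"]) . '"';
        }
    }

    return implode(", ", $code);
}

$generated = sketchParse([
    ["tag" => "div", "children" => [
        ["text" => 'say "hi"'],
        ["tag" => "span", "children" => [["text" => "there"]]],
    ]],
]);
```

Note that addslashes leaves dollar signs alone, so a variable name inside a text node is still interpolated when the generated double-quoted string is executed.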

Each tag is converted to an equivalent pre_div or pre_span function. This is a placeholder mechanism for a larger, underlying primitive element system. We can demonstrate this by stubbing those functions:
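Naive stand-ins could be as small as this. The $props array with a "children" key and the shared sketchElement helper are assumptions for illustration; attributes are ignored.

```php
<?php

// Hypothetical shared helper: wrap the children in the matching HTML tag
// and return the markup as a string.
function sketchElement(string $name, array $props): string
{
    $children = $props["children"] ?? [];

    return "<{$name}>" . implode("", $children) . "</{$name}>";
}

function pre_div(array $props): string
{
    return sketchElement("div", $props);
}

function pre_span(array $props): string
{
    return sketchElement("span", $props);
}

$html = pre_div([
    "children" => ["hello ", pre_span(["children" => ["world"]])],
]);
```

Since inner calls are evaluated first, nested elements arrive as ready-made strings and the outer element only has to join them.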

I’ve modified the input nodes, so that $thing will be printed. If we implement a naive version of pre_div and pre_span then this code executes successfully. It’s actually hard to believe, given how little code we’ve actually written…

Integrating with Pre

The question is: what do we do with this?

It’s an interesting experiment, but it’s not very usable. What would be better is to have a way to drop this into an existing project, and experiment with component-based design in the real world. To this end, I extended Pre to allow for custom compilers (along with the custom macro definitions it already allows).

Then, I packaged the tokens, nodes, and parse functions into a re-usable library. It took quite a while to do this and, between the time I first created the functions and built an example application using them, I improved them quite a bit. Some improvements were small (like creating a set of HTML component primitives), and some were big (like refactoring expressions and allowing custom component classes).

I’m not going to go over all these changes, but I’d like to show you what that example application looks like. It begins with a server script:

I haven’t yet tried running this through a web server, like Apache or Nginx. I believe it would run in much the same way.

The server script begins with me setting up the Silex server. I define a few routes, the first of which fetches an array of tasks from the current session. If that array hasn’t been defined, I default it to an empty array.

I pass these directly, as children of the TaskList component. I’ve wrapped this, and the AddTask component, inside a Page component. The Page component looks like this:

This component isn’t strictly necessary, but I want to declare the doctype and make space for future header things (like stylesheets and meta tags). I destructure the $props associative array (using some pre/collections syntax) and pass this into the <body> element.

Elements can have dynamic attributes. In fact, this library doesn’t support literal (quoted) attribute values at all; they’re complicated to support alongside dynamic attribute values. I’m defining the className attribute, which supports a few different formats:

This is similar to the className attribute in ReactJS. The keyed or object form uses the truthiness of values to determine whether the keys are appended to the element’s class attribute.
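To illustrate how such formats could be resolved, here is a hedged sketch. The sketchClassName function is hypothetical, and the string and list forms are assumptions; only the keyed form and its truthiness rule come from the description above. The library’s real implementation may differ.

```php
<?php

// Hypothetical resolver: turn a className value into the final string
// for the element's class attribute.
function sketchClassName($value): string
{
    if (is_string($value)) {
        // String form: use it as it is
        return $value;
    }

    $names = [];

    foreach ($value as $key => $item) {
        if (is_int($key)) {
            // List form: every item is a class name
            $names[] = $item;
        } elseif ($item) {
            // Keyed form: keep the key when its value is truthy
            $names[] = $key;
        }
    }

    return implode(" ", $names);
}

$fromString = sketchClassName("task done");
$fromList = sketchClassName(["task", "done"]);
$fromKeys = sketchClassName(["task" => true, "done" => false, "urgent" => 1]);
```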

All the default elements support non-deprecated and non-experimental attributes defined in the Mozilla Developer Network documentation. All elements support an associative array for their style attribute, which uses the kebab-case form of CSS style keys.

Finally, all elements support data- and aria- attributes, and all attribute values may be functions which return their true values (as a form of lazy loading).
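The style array and the lazy attribute values can be sketched together. Both helper names and the exact output format are assumptions, not the library’s API.

```php
<?php

// Hypothetical helper: if an attribute value is a closure, call it to get
// the real value. Only closures are treated as lazy, so plain strings
// such as "date" are never mistaken for function names.
function sketchResolve($value)
{
    return $value instanceof Closure ? $value() : $value;
}

// Hypothetical helper: turn an associative style array, keyed by
// kebab-case CSS property names, into a style attribute string.
function sketchStyle(array $style): string
{
    $rules = [];

    foreach ($style as $key => $value) {
        $rules[] = $key . ": " . sketchResolve($value);
    }

    return implode("; ", $rules);
}

$style = sketchStyle([
    "background-color" => "red",
    "font-size" => function () {
        return "12px";
    },
]);
```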

Each task expects an id (which server.pre defines) and some children. The children are used for the textual representation of a task, and are defined where the tasks are created, in the TaskList component.

We’re not storing anything in a database, but we could. These components and scripts are all that there is to the example application. It’s not a huge example, but it does demonstrate various important things, like component nesting and iterative component rendering.

It’s also a good example of how some of the different Pre macros work well together, particularly short closures, collections, and in certain cases async/await.

Here’s a gif of it in action.

Phack

While I was working on this project, I rediscovered a project called Phack, by Sara Golemon. It’s a similar sort of project to Pre, which seeks to transpile a PHP superset language (in this case, Hack) into regular PHP.

The readme lists the Hack features that Phack aims to support, and their status. One of those features is XHP. If you’ve always wanted to write Hack code but still use standard PHP tools, I recommend checking it out. I’m a huge fan of Sara and her work, so I’ll definitely be keeping an eye on Phack.

Summary

This has been a whirlwind tour of simple compiler creation. We learned how to build a basic state-machine compiler, and how to get it to support HTML-like syntax inside regular PHP syntax. We also looked at how that might work in an example application.

I’d like to encourage you to try this out. Perhaps you’d like to add your own syntax to PHP – which you could do with Pre. Perhaps you’d like to change PHP radically. I hope this tutorial has demonstrated one way to do that, well enough that you feel up to the challenge. Remember: creating compilers doesn’t take a huge amount of knowledge or training. Just simple string manipulation, and some trial and error.