Pages

Friday, June 19, 2009

Finishing my short trilogy of L1-related posts today, I am going to talk about the actual implementation of this phantasm. This is an area where chromatic and I disagree pretty strongly, so I will just say that this is not canonical and there are plenty of other differing opinions about all this stuff. PerlJam said pretty eloquently on IRC today:

PerlJam> whoever implements something first is likely to have a strong influence on what the final implementation looks like

With that in mind, I've been reticent to publish this blog post for fear any lousy ideas and misconceptions I have will negatively taint the entire design process. I was even more hesitant when I saw three times today alone my previous L1 blog posts being used to explain the idea to other parroteers. It turns out that what I've written so far is the only concise written account of this little language, besides the disjoined conversations that have occurred randomly on IRC and email.

My conception of L1 is a low-level set of operations that represents a nearly direct way to interface with the Parrot API. I'm envisioning a one-to-one relationship between special API functions (in C), including VTABLEs, and L1 operations. We will have L1 ops for most of the VTABLE functions (with the implicit understanding that the number of VTABLE functions is going to shrink dramatically). I've mentioned the analogy of microcode before, and I think it's a fitting one. Microcode, as traditionally used, is a set of lowest-level machine instructions that usually map pretty closely to the underlying circuitry. Complicated and multi-stage operations can be implemented as a sequence of multiple atomic micro-ops. Also, there is the benefit that the machine code format becomes divorced from the underlying circuitry, so that one can be more easily modified without having to modify the other.

In my mind L1 should satisfy these conditions:

Should map closely to the underlying "circuitry". That is, the operations that Parrot understands (API Functions) should map directly to individual L1 ops. Notice that the capabilities of our "machine" (virtual machine) is more expansive then the capabilities of an ordinary hardware machine.

We need to be able to separate PIR (which should stay relatively stable for our end users) from the underlying engine that runs it. This way we can make major improvements to one without having to redo the other.

The operations should be simple and atomic enough to be executed quickly, JITted without much effort, and easily analyzed for optimization. This means they should be very simple logically, and as much as possible approach a common implementation pattern.

Provide all the functionality needed to implement all flow control devices currently implemented in PIR and C. Right now we switch back and forth between PIR C, and that's costly for a number of reasons. In an L1-enabled Parrot, control flow should never leave L1, and should be able to do everything both PIR and C do (including calling into arbitrary external library functions and arbitrary PIR functions).

All that said, here's a basic list of operators that I think will be good L1 ops:

This is a pretty limited list of ops, and I admit that I'm probably missing a few and have probably added a few that are extraneous. Keep in mind when reading this though that L1 is designed for speed and ease of execution, not for programmer convenience. L1 ops need to be trivially simple: a single arithmetic operation, a single VTABLE call, a single API function call, etc. Painfully, obnoxiously simple. PASM and PIR are better for the average programmer to use.

L1 is going to be register based, it doesn't make sense to do it any other way. I don't know how exactly we are going to handle registers, since each L1 op is going to need to be able to take in registers for its destination and source arguments, and also needs access to temporaries so that the underlying L1 op stream isn't overwriting data in PIR-level registers. I draw strong parallels to the MIPS architecture which reserves certain hardware registers to be used for implementing instruction macros without having to overwrite user-accessable registers.

Enough with the explanations, here's a quick idea of some ways that PIR ops could be written in L1:

op add(out PMC, in PMC, in PMC) {vtable[add] $1, $2, $3}

This snippet shows a few things. First off, I have no idea how to intelligently provide access to several hundred VTABLEs. Do we have one L1 op per vtable, or do we provide a lookup mechanism? I don't think we want anything like this (in pseudocode):

func = find_vtable obj, "add"result = call func, args

VTABLE calls are very common operations and it makes good sense that it would be a simple, atomic operation to invoke one. Looking it up by string is definitely not a simple or reasonable way to do it. Using symbolic integer constants to look them up instead of using strings is slightly better, but you still have the problem of having to find the vtable first and invoke it second, two steps for such a basic common operation seems lousy to me. The alternative is to use one L1 op per vtable:

vtable_add result, src1, src2

But again, we would end up with a huge number of such ops to call a huge number of VTABLES. This might provide impetus for us to decrease the total number of them in existence, or it might just lead to unhelpful explanations to "do it all in PIR instead" when L1 becomes so ugly and unmanagable that people sweep it under the rug and choose to ignore it completely.

In the example above I used the same argument placeholders $1, $2, etc that are currently used in .ops files, and I think that's fitting. These op definitions are little more then macros and the values for these arguments can be substituted in very simply when a program is converted from PBC to L1BC. If we're using that notation for arguments (which represent the register set that is available for use from PIR), then we would have to use a different notation for representing the set of temporary registers which are reserved for use internally by L1. I'll kick out a few ideas, but I don't have any strong preference how these register sets are delimited. Consider for instance the case of the swap opcode:

Here, the [] brackets denote a temporary L1-only register. Again, just one out of a hundred equally good (or bad) ideas for this. We do need a way to specify whether the register is a P, S, I or N still, and to address them from each family, so in some respects we're stuck with the "P0" base string. I won't really discuss this any further because it's so open and completely fruitless.

Implementing PIR in terms of L1 ops is only part of the problem, the rest is trying to implement PMCs in it too. For PMCs, we need these kinds of operations, in addition to the list I mentioned above:

allocate, access, and free memory buffers and custom storage

Access and manipulate attributes

freeze and thaw, including attributes and children

mark attributes and children for GC

access and manipulate PMC flags, pmc_ext, pmc_sync, and other fields

Again, I'm sure this list isn't complete, but it should give a good idea for the kinds of things we need L1 to be capable of.

The swap example kind of hits right at what I was talking about on the last post.

Some CPUs have a swap opcode. Were L1 not to have a swap opcode, you'd have to do like above (or use the XOR trick to avoid using a temporary variable, assuming there is an L1 XOR op.)

In any case you are talking about invoking three things to do what can be done in one (often even atomic) operation on many CPUs.

I'll give you that it was probably a bad pick for an example, as there probably would be an L1 swap opcode. Or one could argue that JIT would notice and reduce the code -- but breaking even with JIT is basically a game of hoping enough logic gets strung end-to-end that the cost of doing the JIT doesn't exceed the cost of operations it saves.

As to the number of opcodes needed, it is worth noting that the way PIR/PBC has managed to exercise control over the need to manage a large mapping of binary codes to operations is by overloading single opcodes on the type of the operands.

I don't think the swap example was a bad one, and I don't think we need to worry about optimizing it down. Premature optimization is a bad thing, and it's worth realizing that not all platforms where Parrot is supported or will be supported have it. Remember, L1 ops need to be trivial to JIT, which means that they have to be very small and atomic. A swap operation, if a hardware swap isn't provided, is composed of at least two separate sub-operations. So we have some cases where the swap L1 opcode would be far less trivial to JIT then other cases.

We can leave it to the JIT engine to detect patterns like this and optimize them for each particular platform. Every JIT engine I've ever seen has options to specify the level of desired optimization, which is always going to be a trade-off between initial compilation effort versus execution efficiency. In many cases we won't want to take the performance hit of lengthy optimizations on the front-end when such a process would take longer then execution of the non-optimized program! It is well beyond the scope of parrot to implement per-platform machine code optimizations while trying to also maintain a programming environment that is supposed to be independent of platform differences. A script written in PIR should execute equivalently on all platforms where Parrot is compiled and built.

PIR doesn't really use opcode overloading, it's an illusion. Each operation has a unique long name, and the PIR compiler allows the user to use a non-unique "short name" if enough information is available at runtime to derive the long name from it. The number of PIR ops is damn near 1300 right now, although there are some efforts to eliminate a number of these and move several sets of them to non-core dynamic libraries.

"Breaking even with JIT" is a tradeoff between the initial compilation overhead and the amount of time saved throughout execution. For small one-off programs in any language, JIT is rarely as quick as direct execution. For longer-lived programs or programs that execute long loops JIT is better because the per iteration efficiency savings outweigh the initial compilation overhead over time.

Well, first, as to the philosphical stuff: Obstructing optimization is about as "bad" as prematurely optimizing, and the reasons why are the same. One must understand precisely what these reasons are.

"Prematurely optimized" code requires the people working nearby it to pay mind to the optimization, and slows development due to attempts to remain compatible with the extra demands of the optimization while modifying behavior. This is just about the ONLY real reason it is bad.

If there was a function that nobody ever needed to maintain, because it's part of a very stable codebase, and someone came along and optimized it, then that's not "premature optimization" in the sense meant by the oft-quoted princep, even if that optimized function really isn't run enough to be in the 10% of the 90/10.

If they did that when there was better things to be doing, that's a different matter, but we have to keep in mind that volunteer coder skill sets vary widely... not all developers are capable of grasping a project the size of Parrot and must restrict their attention to subsystems -- either that, or do no coding and just a lot of talking about coding.

On the other hand, erecting barriers to optimization has a price, too, and not just the frustration of developers waiting for their test suite to complete.

Code maintainability can suffer if there are walls beyond which only automated utilities dare tread. The custom automations become the thing catered to and worked around, instead of the optimizations. They become jargon within the codebase; New developers face a steeper learning curve, and the ability to work on parts of the system without having to develop an accurate "big picture" is diminished. When core developers finally look up from their voodoo, they are speaking an entirely different language than anyone who's attention they might be able to get.

I'm not against L1, and I thank you greatly for trying to shed some light on it. I'm just not sold on it yet, so do pardon me if I poke it rudely with a stick. My attitude is that if it's another jargon, it had better damn well be worth the negative impact on the accessibility by new blood, because I'm new blood, and I certainly don't feel like this project is something I could make a positive impact on without devoting way more spare time than most people have.

Now onto a few specific points:

The number 1300 doesn't scare me, as long as there are really that many unique operations to perform. Modern CPUs are within an order of magnitude of that (even RISC CPUs, which sometimes even exceed CISC CPUs) and when we talk about internal microcode, much of that is really just eliminating redundant logic used by multiple instructions. There is still a lot of logic unique to many instructions, just you don't need the extra in-silicon copy of "and then byteswap your result" or whatnot.

I kind of EXPECT a higher level VM to have more instructions (or MMD overloading acheiving the same thing) because I expect to be able to do more with it.

While there are plenty of real world success cases for JIT, just using it doesn't mean it will necessarily work. I can understand how L1 is trying to address that need by reigning in the code to behave better on the back end.

But there at least four other areas I view as much more important: Concurrency and AIO, at least soft realtime, stable APIs for direct memory access (I guess this is called "c pointer support" in parrotspeak), and lest we forget, much better documentation and updated code comments that do not lead you on goose chases. You know, all that stuff core developers avoid doing by embarking on the next big adventure :-)

So maybe we could live without JIT? I mean I know it's the "thing to have" because JVM has it, but somehow despite that java_vm is the slowest pig of a process on my desktop... so...

Please do poke this idea with a stick, and poke it rudely. I would be far less happy if my posting here didn't spark any questions, discussion, or argument.

I called the swap thing a premature optimization because it's not really necessary at this stage. In the broader sense, whether we implement swap as one machine instruction or three (two using the XOR trick is bad for reasons I could explain separately) is absolutely unimportant. swap isn't an instruction that is used too often so optimizing it on some platforms is very low priority.

You are right about making sure new users are able to get involved, and I think Parrot does a good job of it. What you talk about is also a big part of why I have this blog in the first place: Keeping a written record of things, not only post facto but also throughout the planning and implementation stages. I like to write about things I am planning so that people (myself included) can get insight about what I have in mind for things.

Think about L1 as being a tool. It does add a layer of algorithmic complexity, but actually provides a large amount of simplicity too. It's going to simplify the way the whole VM operates, simplify the way all the components of it are developed and maintained. simplify the way individual components are updated, simplify the logic in our GC, simplify control flow, simplify our exceptions system, simplify our threading system, and the list goes on. We add a whole new layer to our little programming flowchart, and yet the whole thing gets less complicated, not more.

L1 is a long way away (at least in my estimation, it is notoriously difficult to estimate progress on volunteer-driven projects), and I will be doing a lot more blogging in the interim to try and express my vision for L1, and to try and make it more clear for other people to understand. If through all that discussion we decide that it's actually a bad idea, we won't implement it at all.

About half a year ago I took a read through the Parrot source. Some of it I understand, some of it not so much. I don't program for a living, usually. My degree is in Electrical and Computer Systems Engineering, not CS. I never went into the CSE field professionally, opportunity led me into WAN/LAN. But still, programmatically I think down to the wire. When I see code I see the bits going from here to there, gears meshing, valves valvifying, whatnot.

When I hear "multiple inheritance with parametric roles" my eyes glaze over. I realize to some developers such features are incredibly important. I'll use those features from time to time when it helps, and study up on them when I need to mod something that uses them, but they aren't critical to me. For sheer lack of time, I'll probably never write something so complicated that they'll be critical to code management.

I'd place good odds that a majority of potential Parrot contributers who are not contributing are in a similar situation -- they really want Parrot and/or Rakudo for one specific set of features. They could leave or take 90% of the rest, and they don't have three solid weeks to ferret an understanding of the entire system and the development disposition from disparate sources.

I write two types of things on the job -- scripts in high level languages, and very low level API bit shuffling usually involving network packets.

Data structures, that I can understand. So I read into hash.c and the string class. I see in there that a lot of streamlining could happen. I nopasted some observations to IRC, and nobody tells me I'm delusional, so I figure, hey, maybe I'll try to hand code up some functions to see if we can calc hashes at memory line speed when doing a long compare. And if that works out, and someone likes the sample and maybe it even gets in, I could do some other architectures.

Now, I've had about 6 false starts trying to be productive on rakudo and 2 on Parrot before this. I'm thinking to myself, this just might be low level enough a job that I could tuck myself in a corner, not bother anyone, and just crank out a bunch of optimized functions while watching TV every afternoon after work.

Perfect. Just the sort of thing I actually enjoy coding. Finally, even if it's a modest performance gain, no more dealing with a shifting landscape, here's something I can zoom in on.

But then, just when I'm starting to talk myself into it, there's now this mystery thing on the horizon, called L1, that looks like it has the potential to render that work useless. And I'm trying to determine A) if the operations I was thinking of working on are low level enough that they would still be in C, and B) if I'm going to be able to use the vector processor, or is it going to be reserved across the entire codebase because JIT might need it (even though JIT's only going to kick in in the middle of tight loops.)

And that's why understanding the full implications of L1 is important to me.

Thanks for the explanation, that makes a lot more sense to me where you are coming from now. I actually did EE in school, but did coding in my spare time at home in front of the TV, so I know exactly where you are from!

Parrot is at a very interesting development stage right now. First, the majority of it's major systems are at least prototyped, if not at some level of stability and maturity. However, there are several systems still in need of major cleanup, improvement, and basic implementation.

Because a lot of the systems are basically implemented now as initial prototypes, there is massive room for optimization at the function level and at the architectural level. Some systems, while functional, are in such a deplorable state that the only path forward is complete reimplementation. JIT is one of those systems. GC is another. Both of these are on-the-metal systems where optimizations would have a major effect on Parrot performance.

L1, while a big project, actually is going to occupy a relatively small niche in Parrot land. I'll try to put together a new blog post tonight to really explain what it will do from a perspective that I think you will want to see. What L1 is mostl going to effect is the front-end compiler (which will be modified to output L1 streams instead of PBC streams) and the central execution core (which is about a 5 line function). All the core systems are basically going to be unaffected by the transition, so work on them should be fine.

(BTW, the XOR trick takes 3 ops, and, unless your writes are guaranteed atomic and SMP caches coherent, and things do not blow up if both locations have the same value, you still have to worry about concurrency when using the temp variable approach. If you can guarantee the temp is a register and doesn't cause a spill, using a temp is the better approach.)