Posted by Hemos on Wednesday November 29, 2000 @01:28AM from the morphing-for-fun-and-profit dept.

jjr writes: "It seems that IBM has an open source project called DAISY that does a lot of what Transmeta does. Their code-morphing technology supports PowerPC, x86, and S/390, as well as the Java Virtual Machine. They morph the [code] into VLIW just like Transmeta, but they still have some issues to work out. Other issues dealt with in the report include self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory-mapped I/O."

If so, and if those patents are in IBM's way, IBM might have to challenge them; nice.

They (IBM) might be successful since they're such a good customer of the patent office. Surely they will get some favourable treatment to overthrow those patents.

Btw, in this case I'd be on IBM's side, since here it is in-the-open research against patents. Aside from what one thinks about IP patents, research should never be limited by futile things like patents. But it would be ironic, since IBM, in other issues such a big user (profiteer) of the patent system, would kind of have to act against itself (i.e. the patent office) then.

Not really. Amiga's just implementing a thin virtual machine layer, providing an "ideal assembly language" that provides more control than, say, C, but still provides sufficient abstraction that the code can be targeted to a wide range of CPUs easily. (This is in stark contrast to, say, the Java VM, which is comparatively quite heavy.) You can think of Amiga's virtual assembly as a "medium level language", if such a term exists.

DAISY translates other ISAs into its own native Tree-VLIW ISA. Rather than providing an abstract assembly language that gets targeted to a wide variety of CPUs, DAISY is doing the reverse: take a wide variety of ISAs, and target them to this specialized CPU. Transmeta is similar, although they've chosen to focus primarily on x86 to get the biggest bang for their limited bucks.

BTW, Transmeta has been working on their stuff since 1995, so the technology mentioned in the 1997 paper doesn't strictly predate it.

I read about Daisy a few years back when I was studying VLIW scheduling techniques and whatnot. The DAISY VLIW is quite different than most VLIWs around. Their instruction word is built upon the ability to execute large numbers of "branches" in parallel every cycle. (As best as I can tell, these "branches" are actually closer to being composite predication conditions in many cases, which is why I put "branches" in quotes.) Their experimental physical implementation could execute something like 8 branches every cycle.
Downright weird.

A more traditional VLIW uses predication [google.com] to convert short branches into a simple "if (cond)" prefix on individual instructions. (This technique is known as if conversion.) Also, traditional VLIW instruction words are flat -- all N instructions in a VLIW bundle execute together in parallel, with no tree structure implicit in the encoding.
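To make the if-conversion idea concrete, here's a toy sketch (not any real VLIW's encoding) that models a flat bundle of predicated instructions: every instruction in the bundle is fetched, but only the ones whose predicate holds actually commit their result, so the short branch disappears entirely.

```python
# Toy illustration of if-conversion for a predicated VLIW.
# The short branch
#     if (a > 0) x = a + b; else x = a - b;
# becomes a flat, branch-free sequence:
#     p      = (a > 0)
#     (p)  x = a + b
#     (!p) x = a - b

def run_predicated(bundle, regs):
    """Execute a flat bundle of (predicate_fn, dest, op_fn) entries.
    Every instruction is fetched; only those whose predicate holds commit."""
    for pred, dest, op in bundle:
        if pred(regs):
            regs[dest] = op(regs)
    return regs

bundle = [
    (lambda r: True,       "p", lambda r: r["a"] > 0),       # p = cond
    (lambda r: r["p"],     "x", lambda r: r["a"] + r["b"]),  # (p)  x = a+b
    (lambda r: not r["p"], "x", lambda r: r["a"] - r["b"]),  # (!p) x = a-b
]

print(run_predicated(bundle, {"a": 3, "b": 2})["x"])   # 5
print(run_predicated(bundle, {"a": -3, "b": 2})["x"])  # -5
```

Both arms execute "in parallel" in the bundle; the predicate picks which write survives, which is exactly why short branches are cheap on predicated machines.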

All that aside, the DAISY scheduling techniques sound pretty similar to trace scheduling [google.com], which was used on the old Multiflow VLIW machines [google.com]. The actual process of converting PowerPC instructions to individual DAISY operations is mostly search and replace, and preserving program order is a matter of constructing proper dependences between the instructions.
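The "search and replace plus dependences" process above can be sketched roughly like this. The opcode names and templates are entirely made up for illustration; the point is just that each source instruction expands via a template, and a read-after-write edge is recorded whenever an op reads a register an earlier op wrote:

```python
# Hypothetical sketch of template-based binary translation with dependence
# construction. Opcode names here are invented, not DAISY's real ops.

TEMPLATES = {
    "addi": ["vliw_add_imm"],               # 1:1 replacement
    "lwzu": ["vliw_load", "vliw_add_imm"],  # load-with-update splits in two
}

def translate(insns):
    """insns: list of (opcode, dest, sources). Returns (ops, dependence edges)."""
    ops, edges, last_writer = [], [], {}
    for opcode, dest, srcs in insns:
        for native in TEMPLATES[opcode]:
            i = len(ops)
            ops.append((native, dest, srcs))
            for r in srcs:                   # read-after-write dependences
                if r in last_writer:
                    edges.append((last_writer[r], i))
            last_writer[dest] = i
    return ops, edges

ops, edges = translate([
    ("addi", "r3", ["r1"]),   # r3 = r1 + imm
    ("addi", "r4", ["r3"]),   # r4 = r3 + imm  -> depends on op 0
])
print(edges)  # [(0, 1)]
```

A scheduler can then reorder any ops not connected by an edge, which is what preserves program semantics while still allowing aggressive parallelization.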

Feel free to ask me questions if you're curious about this kind of stuff. It's my day job.

You're quite right. Car manufacturers have basically given the chemical-battery pure electric car the flick after decades of research because they simply couldn't make a chemical battery efficient enough to give the desired range and performance.

Personally I think it's a shame that while we all wait for these technologies to get economically viable the suburbs of the US, Canada and Australia are being filled with fuel-guzzling gasoline-powered four wheel drives, despite the fact their owners never take them off road:(

So if you had a Sparc -> Sparc binary translator you could make the thing run faster.

On the Amiga, with Motorola's 68000, 68020, 68030, 68040 and a few 68060s, someone actually released a binary patcher that attempted to patch binaries compiled for the older processors to make them faster (use new instructions, avoid ones emulated on the newer chips and thus slower, etc.) It also attempted to patch some sub-optimal cases often produced by the main C compilers in the market.
Tended to work pretty well...

OK, completely irrelevant, but I thought you might be interested anyway:)

That's not what I said; I'm not saying we should do an "exhaustive run" of the program.

I'm talking about a program which takes a binary compiled for one processor as its input, and gives a binary native to another processor as its output (and then runs it). This way, you only translate once, rather than each time thru the loop.

AFAIK, the 68k emulator that was used on the initial PowerPC Macs was in fact emulating a 68LC040 (which *doesn't* have an FPU). That's why when the PowerMacs first came out there were 2 or 3 commercial apps that emulated the 68040's FPU and basically gave you a full 68040. I even think there was a shareware one that ran fine on a real 68LC040, but in order to run on a PowerPC it needed registration.
As for that just-in-time compile stuff, damn, I never knew that... learn something new every day :)

I can't say anything about how it works, but Apple has basically two kinds of 68000 emulators that run on the PowerPC Macintoshes.

The first emulator, I understand, was basically an interpreter, sort of like the Java virtual machine but where the "bytecodes" are 68000 instructions (I'm not sure which actual microprocessor was emulated, maybe it was the '020). Not real fast, because you have to decode each instruction every time you hit it, but it was well-written and reliable.

Then there was the dynamic recompilation emulator which I believe first appeared in the first PCI Macs (like the 8500/120) and System 7.5.3 (not exactly sure if that's right but thereabouts).

This was like the JIT - "Just in Time" compilers for Java, it would compile 68000 code to PowerPC code and then execute the PowerPC code natively.
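The translate-once-then-run-natively loop can be sketched in a few lines. This is a toy model, not Apple's implementation: guest "blocks" are translated into callables on first touch and cached, so ten trips through the loop cost only one translation.

```python
# Toy dynamic-recompilation loop: translate a block of "guest" code the first
# time it's reached, cache the result, and reuse it on every later execution.

translated = {}   # guest address -> native callable
translations = 0  # count how often we actually translate

def translate_block(guest_block):
    """Compile a list of simple (op, operand) guest instructions into a closure."""
    global translations
    translations += 1
    def native(acc):
        for op, n in guest_block:
            acc = acc + n if op == "add" else acc * n
        return acc
    return native

def execute(addr, program, acc):
    if addr not in translated:            # translate once...
        translated[addr] = translate_block(program[addr])
    return translated[addr](acc)          # ...run "natively" thereafter

program = {0x1000: [("add", 2), ("mul", 3)]}
for _ in range(10):                       # ten trips through the "loop"
    acc = execute(0x1000, program, 1)
print(acc, translations)  # 9 1 -- ten runs, one translation
```

An interpreter, by contrast, would pay the decode cost all ten times, which is the difference the parent post is describing.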

This was a shipping product I believe in late '95 and I'm pretty sure Apple was not the first to do such a thing.

Note that on the Mac they were unable to rewrite much of the low-level OS code from 68000 to PowerPC, at least not initially, and so a lot of system software remained emulated and probably still does. Also, it is very common for Mac applications to install interrupt-time tasks, and many of those are legacy 68k apps, and it would be inefficient to switch instruction set architectures all the time.

I seem to recall it takes something like 200 PPC instructions to switch from one architecture to the other so if you're already in 68k code and you're about to run a small routine it's best to remain emulated.

It is possible to write "fat" code that provides both options and the machine will use whichever one it's currently running - this is common for "Extensions" which make "fat patches" to OS calls, and many OS calls are "fat traps".

For this reason, the Classic System 7 MacOS (of which Mac OS 8 and OS 9 are examples, but Mac OS X is a whole different thing) handles hardware interrupts in emulated 68000 code.

Interrupt handlers and device drivers may be written in 68k code or PowerPC code as you like and run on a PowerPC machine.

The dynamic recompilation emulator I think emulates an '040 with its instruction cache issues, and it correctly handles hardware interrupts that happen in the middle of running a chunk of recompiled code.

Early Mac apps very commonly used self-modifying code. For example, if a "code resource" was expected to be loaded into memory and used by the system, many applications would load a small stub that jumped to an offset that was a placeholder. Then they would write an address in the running program code into the placeholder after it was loaded. This kind of thing screwed up on the '040 because you were writing to code using data instructions, but there were lots of workarounds, such as the painful decision to flush the data cache after calling BlockMove - and the addition of the BlockMoveData call, which wouldn't flush the cache.

Also note that an application (or any code) can install callbacks that are written in 68k, PPC code or fat, and this code will be correctly called from the OS or toolbox, whether it started in 68k or PPC. This works because of something called a "routine descriptor" that is a compact description of a function API - it handles Pascal vs. C calling conventions, instruction set architectures, and the possibility of providing alternative entry points for each architecture.

On 68k there is a "trap" - a defined illegal instruction, that causes a jump to an exception handler. The exception handler reads in the routine descriptor and does the right thing. On PPC, you pass the routine descriptor to the CallRoutineDescriptor function (or something like that).

68k code is legacy and knows nothing about routine descriptors, but the emulated processor handles traps correctly. Since PowerPC was released after the routine descriptor architecture was all implemented, developers can easily put it directly in their code. There are headers with macros that make most of this transparent, so you can compile both kinds from one set of sources.

Compaq has also been doing a load of nice work on value profiling. There have been a few publications [compaq.com] lately describing their DCPI [compaq.com] (Digital Continuous Profiling Infrastructure).

It's kind of cool; they actually sample executing code (including the kernel) at regular intervals by interpreting some instructions from the instruction stream instead of just recording the instruction pointer. This enables them to gather statistics about the outcome of instructions, physical location of load/store instructions, whether the instruction hit in the cache, how long it took to execute the instruction, and so on.

There is supposedly a downloadable evaluation version of the software at their website (the problem, of course, is that it only works on Alphas running Tru64 Unix or Windows NT).

A color LCD of usable brightness (another huge drain on battery life) is going to output a certain amount of energy

True for LCD, but why limit yourself to one technology? There's no reason a screen has to emit light at all. After looking at several flavors of "electronic paper", it doesn't seem particularly fanciful to imagine a display which consumes zero power if the image isn't changing and which is readable under the same wide variety of conditions as regular paper. It may well be that such displays will always lag behind more conventional technologies in areas such as transition time or color depth, but for a very wide variety of devices and applications that would still be a big win.

Even within the realm of light-emitting display technology, there's plenty of room to reduce power consumption. For example, the Light Emitting Polymer work at CDT could lead to displays that consume a lot less power than CRT or LCD displays, in addition to being extremely thin, light and flexible.

I'm not trying to argue with you here. I completely agree with your main point that power consumption needs to be addressed beyond the CPU. Displays and rotating media in particular are at least as deserving of attention. This is all just FYI.

The assumption that Transmeta chips are quite good quality probably rubs off from the fact that Linus works for Transmeta: since Linux is good, the Crusoe must be good. It's not linear, or even proper, logic, though. It's a nice thought, but Torvalds isn't the only hacker there, either. They're sure to have hardware people who do all the hardware, while Torvalds makes sure the firmware is OK. Highly likely the CPUs are quite stable, though. :)

Congratulations dieman!!! Your post has now become the technical bulk of another DAISY/Crusoe article [zdnet.com] from ZDNet [zdnet.com].

Apparently, no one at IBM or Transmeta would call ZDNet back, so the writers came to /. I commend them for knowing where to get quality information; but, considering that they quoted dieman word-for-word, I wonder if they even read the white paper themselves. I'm betting that they couldn't understand it and decided that obviously dieman could - just look at all those big words - so he seemed like a good source. I haven't gotten around to reading the white paper yet myself, so I hope you got all of it right, dieman.

One other thing: the ZDNet article mentioned something about Crusoe doing parallel processing, and I believe they mean internally, not just multiple processors on a board. I haven't seen anything anywhere indicating Crusoe or DAISY is capable of true parallel processing. Has anyone seen anything about this, or are the ZDNet writers drinking their glow-stick juice again?

Just looking through the documentation [ibm.com] available for this software, it seems like it's some very cool tech indeed. A truly open-source version of code morphing is definitely very cool, as is getting away from the x86's semi-dedicated register set, which is, as others have said, a bitch.

ST: Anyone else immediately think of 2001 when they saw the project was named Daisy?

IMO, the best approach would be a hybrid, where the code morphing could use the intermediate representation as a binary form to generate machine language, and then optimize it using runtime profiling and scheduling based on techniques already used by current compilers on IR trees.

Hey cool, if you come even close with that you'll have no problem getting some killer jobs.

It occurred to me that profiling a kernel like I suggested is a problem because the kernel can disable interrupts (as when handling an interrupt) and so even though you might be able to sample to some extent it may be hard to get good results. Also you crash the machine, etc.

But I recall reading recently here that someone had the Linux kernel running as a user space program. So you boot a real linux kernel, then run a fake kernel inside of some kind of hardware emulator or something. It was suggested to use this for kernel development - you could quit the kernel and restart it much quicker than rebooting and there's less danger of corrupting your machine, if your test machine is also your user machine, as is all too often the case.

But with this you could easily profile a userspace kernel and interrupt it from the outside without the test kernel being aware it's being interrupted, as those interrupts are not handled by the test kernel, but by external code.

Of course, you'd want this to work for ordinary programs first. Let the kernel be your fourth year project!

A color LCD of usable brightness (another huge drain on battery life) is going to output a certain amount of energy

I agree, but I for one (and I'm sure there are others out there) would be happy to get a greyscale screen if I could get an increase in battery life for it. Are there any decent laptops out there with black and white screens?

Something just occurred to me for the first time. There's two ways that emulation is presently done: either by running a virtual processor completely in software, or by doing it in hardware as Transmeta does.

It occurs to me that there's a third possible way: rather than doing the emulation step by step as the program runs, step thru the whole compiled program and convert it to native code just once, and then run it natively from then on, rather than re-emulate it each time thru the loop.

Well - I get your point - but I also believe there's a class of genuinely non-obvious, innovative ideas for which patents are appropriate (not the bulk of the crap being patented at the moment) - IMHO this is one.

I suggest the following test to anyone considering patenting something: "Would you feel proud explaining your idea to Mr Edison? Or embarrassed?"

Distinguishing code and data. How do you know some bits in the .data section aren't instructions that are going to be executed at run-time? We won't even get into the trick of overlapping data values and instructions in the same memory locations.

Addressing. It is difficult to know where basic blocks begin and end. Because the instruction sets won't match one-to-one, you have to go patch branch addresses. This becomes very difficult when you deal with computed gotos and function pointers.

Register Allocation. Even if you could translate all the instructions correctly, you'd still want to do a good job. Unfortunately, with an architecture like x86, many variables are not enregistered, even if it is valid to do so. The translator can't know statically that it is safe to put a memory location into a register. Transmeta gets around this with some hardware to detect invalid register allocations at run-time.

FX!32 does something like what you're talking about, except it uses the initial, emulated run of the program to find out what parts are actual code. On the next run, if untranslated code is touched, an exception handler emulates it and marks it for translation after program execution.
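That scheme can be sketched as a toy model (loosely inspired by the FX!32 description above, with all names invented): blocks run emulated until they've been observed, translation happens between runs, and subsequent runs hit native code.

```python
# Rough sketch of lazy, profile-driven binary translation: emulate and mark
# on the first run, translate touched blocks between runs, go native after.

class LazyTranslator:
    def __init__(self, program):
        self.program = program   # addr -> guest block
        self.native = {}         # addr -> translated block
        self.touched = set()     # blocks seen during emulated execution

    def run(self, addrs):
        log = []
        for a in addrs:
            if a in self.native:
                log.append(("native", a))
            else:                # "exception handler" path: emulate and mark
                self.touched.add(a)
                log.append(("emulated", a))
        return log

    def retranslate(self):       # between runs: translate what was touched
        for a in self.touched:
            self.native[a] = self.program[a]
        self.touched.clear()

t = LazyTranslator({1: "blockA", 2: "blockB"})
first = t.run([1, 2, 1])
t.retranslate()
second = t.run([1, 2, 1])
print(first[0][0], second[0][0])  # emulated native
```

Note how this dodges the code-vs-data problem from the earlier post: only bytes that actually executed ever get translated.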

So, does it mean that with this technology I can run Microsoft Windows ME and Microsoft Windows NT 4.0 and Microsoft Windows 2000 Server on PowerPC (including Macintosh PPC) and IBM S/390?
No, I'm not trolling. I'm just curious since there has been this sort of "hardware emulation" trend going on recently.

Yup, they have a cool patent on their writeback buffer - basically, it stalls to clean points where exceptions are resolved. That way they don't have to worry about having 'clean' exceptions - just toss the memory/register changes and drop back to interpreting the code instruction by instruction.

Code morphing is a great way to transition to VLIW, but dynamic translation and parallelization will always be slower than native processes. Are there any other ways that we as a community can start moving away from the old x86? I am sick of only having 4 registers when assembly programming!

Why doesn't the industry start to concentrate on making energy-efficient devices besides the processor? It would also help out so that we aren't pushing battery technology, because that field seems to be lagging behind badly.

>Code morphing is a great way to transition to
>VLIW, but dynamic translation and
>parallelization will always be slower than
>native processes.

No, you're actually wrong (though it is counter-intuitive). Dynamic translation lets you make optimizations at runtime about the behavior of the code that can't be done statically at compile time (or even as well in the CPU using branch prediction, etc.). E.g., check out the 'Dynamo' project at HP - emulate the PA-RISC processor on top of itself in software, and get substantial speed improvements...
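The core idea behind hot-trace systems like Dynamo can be sketched in a toy model (threshold and "optimization" invented for illustration): interpret by default, count how often each target is reached, and once a target gets hot, emit an optimized trace that later executions use instead.

```python
# Sketch of hot-trace detection: interpret cold code, count executions,
# and "compile" a trace once a target crosses a hotness threshold.

HOT_THRESHOLD = 3  # made-up number; real systems tune this empirically

class TraceRuntime:
    def __init__(self):
        self.counts = {}
        self.traces = {}   # target address -> optimized trace

    def execute(self, target):
        if target in self.traces:
            return "optimized"
        self.counts[target] = self.counts.get(target, 0) + 1
        if self.counts[target] >= HOT_THRESHOLD:
            self.traces[target] = f"trace@{target}"  # "compile" the hot path
        return "interpreted"

rt = TraceRuntime()
results = [rt.execute(0x40) for _ in range(5)]
print(results)
# ['interpreted', 'interpreted', 'interpreted', 'optimized', 'optimized']
```

The runtime counts are exactly the information a static compiler never has, which is why emulating a chip on top of itself can come out ahead.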

What makes Transmeta special is that they have put a dynamic binary translator in a chip and have developed silicon to make it faster.

No, actually you have it backwards. Intel (and later NexGen, AMD and I believe Cyrix) put a dynamic ISA translator *on their chips* starting with the P6--they decode (i.e. translate) x86 instructions into internal "u-op" instructions (AMD calls them "macro-ops", same idea) which are used by the rest of the silicon. (This is necessary because x86 instructions are too heterogeneous in length and complexity to work well in a deeply-pipelined out-of-order core.)

What Transmeta did was essentially move this translator *off* the chip, into software. The advantage of this is simpler silicon, and therefore lower power consumption. (Also, all things being equal, higher maximum clock speeds; all things are clearly not equal.) A secondary advantage is that far more resources (16MB IIRC) can be devoted to buffering, tracing, analyzing and optimizing the instructions than on a chip, where the physical chip-size keeps buffers small and optimizations simple. The disadvantage is that all this needs to be run on general-purpose (i.e. slower) silicon--and worse, competes for CPU-time with the very programs it is trying to optimize. (Not to mention takes up 16MB of system resources.)
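The decode step described above can be illustrated with a toy example. The u-ops real chips emit are internal and undocumented, so these op names are made up; the point is just that a memory-destination x86 instruction expands into separate load, ALU, and store operations while register forms map one-to-one:

```python
# Illustrative decode of x86-style instructions into RISC-like micro-ops,
# in the spirit of the P6 scheme. The micro-op names here are invented.

def decode(insn):
    op, dst, src = insn
    if dst.startswith("["):          # read-modify-write memory destination
        addr = dst.strip("[]")
        return [
            ("load",  "tmp", addr),  # tmp <- mem[addr]
            (op,      "tmp", src),   # tmp <- tmp OP src
            ("store", addr,  "tmp"), # mem[addr] <- tmp
        ]
    return [(op, dst, src)]          # register forms map 1:1

print(decode(("add", "[ebx]", "eax")))
# [('load', 'tmp', 'ebx'), ('add', 'tmp', 'eax'), ('store', 'ebx', 'tmp')]
print(decode(("add", "ecx", "eax")))
# [('add', 'ecx', 'eax')]
```

Once everything is uniform micro-ops, the out-of-order core never has to care about variable-length x86 encodings; Transmeta's move was to do this expansion in software instead of decoder silicon.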

So far the tradeoff has been (IMO) a big loser except in special circumstances--where you need long battery life, x86-compatibility (otherwise there are faster, smaller, more efficient chips out there, like anything in the ARM family), little weight (otherwise just use a bigger battery), and have efficient enough components for the rest of the system to actually make a difference (this is the gotcha with traditional laptops). Whether this particular set of circumstances will turn out to be a small or huge market niche, it is certainly a small problem space. Of course, much of the blame is due to TM's implementation rather than the (basically sound) idea; apparently their architecture is not up to Intel's standards (their process technology is IBM, so that's not the problem). Of course, mistakes are very common in the first iteration of a wildly new idea--witness Itanium (harnessing VLIW for very different ends--and arguably with less success) for proof of that.

Why doesn't the industry start to concentrate on making energy-efficient devices besides the processor? It would also help out so that we aren't pushing battery technology, because that field seems to be lagging behind badly.

There is certainly research and development on low-energy components besides the CPU; check out the energy usage of the mobile Radeon, for one thing. However, there are limits on how much you can possibly squeeze out of some components. Hard drives (which probably eat the most energy in a portable system) need to spin, and there's a certain amount of mass being kept moving at a certain velocity, along with a certain amount of energy required to read/write data. That puts a limit on how much energy you can save there. CD-ROM drives have similar limitations.

A color LCD of usable brightness (another huge drain on battery life) is going to output a certain amount of energy; you could make the screens dimmer, but then they are harder to see. Wireless connections are going to require a certain amount of power for broadcast; the further the connection, the more juice. Sound output requires a certain amount of power, and so on.

What you're seeing are the design decisions which made the original Palm Pilot: no movable parts for storage, a B&W passive-matrix screen, no wireless. And it could run for two months on two AAAs. Adding on just a color screen drops that down significantly and requires rechargeable batteries for a reasonable experience. Ditto for wireless. I just don't think there's going to be much of a way around it until we figure out how to store more energy in a light, safe way.

I think what is needed out of the MS-DoJ thing is some sort of portable code emulator, a binary equivalent of C. Software makers would write most of their code to this, and it would run under different OSes, be they free or commercial.

You will still have a need for OS specific apps, but so much of the customer cost in OS replacement is in replacing the apps for the OS.

Different emulator codes could be optimised for different classes of program: for example, games and productivity suites have different requirements, and thus could use different emulators. You could can the games emulator to stop people running games at work:)

There are so many possibilities that we missed because of the `ILOVEYOU' affair with Windows :(.

This comment is a slap in the face of all the electrical engineers who have been working their ass off to bring you low powered portable devices. (Don't worry, they're used to it.)
Part of the problem is that every time the engineers perfect something, users come back with more demands. "I want stereo sound!" "I want 3d graphics" "I want a huge color display!" "Can it play DVDs?"
So what is your standard for "low power" and how many weeks ago did you decide on it? Give the guys a few more months.

Also, battery technology has been slow. (Don't think people aren't trying to fix this.) Getting energy out of something stable like a battery isn't easy.

I guess the point is: people are making low power devices, you just can't be pleased.

I read an IBM paper when I was an OS engineer at A Big Fruit Company [apple.com] which discussed the use of instruction-pointer sampling profilers to optimize compiled PowerPC code (I think maybe actually POWER code, similar but not the same) by rearranging blocks of the machine code in the executable file.

This was in either late '95 or early '96 - but the IBM work on this had been around for a while by the time I read the paper.

This technology is widely available now - read all the way to the end to see how you can try it out.

If you have a jump to a certain offset in a routine, you can move the code where you jump to elsewhere in the file and change the offset you give in the jump. Complicated, because you need to parse RISC machine code, but doable.

It's made a little easier by PowerPC instructions always being fixed at 32 bits with no extension words (a side effect of that is that there's no way to load a 32-bit constant into a register with a single instruction, which makes it hard to scan machine code by eye for constants in an assembly debugger.)

This has the effect of speeding up the overall program execution because you group frequently used code blocks together in the executable file, and also in memory once it's loaded. You may find less-commonly used branches of an if-statement put miles away at the end of the file, so that you jump a long ways away and then back in sometimes, but this isn't a big deal because all the frequent cases flow straight along.

The reason this is a big win is twofold. First, you reduce virtual memory paging and the code resident in physical memory because less commonly used code is all grouped together and just sits idly paged out on disk; that which is taking up valuable physical RAM is of a minimum size and being used actively.

Also (and more importantly in small programs, and in CPU-bound cases), you make more effective use of your processor's code cache.

This is because jumping over an uncommonly used branch may load a few unused instructions into the cache at the beginning and end of the branch that's not taken - cache lines (blocks) are of a fixed size and are always aligned by the cache block size, so if you have 32 byte cache lines then the start of any cached code falls at a physical address that is divisible by 32.

If you run even one instruction into that address range, you load 32 whole bytes of code into the cache, deleting 32 bytes of code that might be useful later; then, if your code is not optimized this way, you'll just end up jumping over most of it.
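The arithmetic behind this is easy to check. Assuming 32-byte lines as in the example above (the addresses below are made up), any access pulls in the whole aligned block around it:

```python
# Quick arithmetic for the cache-line point: with 32-byte lines, an access
# at any address fetches the entire aligned 32-byte block containing it.

LINE = 32

def lines_touched(start, nbytes):
    """Set of aligned cache-line base addresses covered by [start, start+nbytes)."""
    first = (start // LINE) * LINE             # round down to a line boundary
    last = ((start + nbytes - 1) // LINE) * LINE
    return set(range(first, last + LINE, LINE))

# Executing a single 4-byte instruction at 0x104C still fetches a full line:
print(sorted(lines_touched(0x104C, 4)))       # [4160], i.e. [0x1040]
# A 12-byte run straddling a boundary costs two full lines -- 64 bytes:
print(len(lines_touched(0x105C, 12)) * LINE)  # 64
```

So jumping into the middle of a rarely-taken branch can evict 32 useful bytes just to execute a handful of instructions, which is exactly what block reordering avoids.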

Many people who are trying to make their programs run faster would benefit from knowing more about how the cache works. Gary Kacmarcik's Optimizing PowerPC Code [fatbrain.com] has a good discussion of this that will benefit anyone who programs on modern microprocessors - not just PowerPCs. And while Kacmarcik emphasizes PowerPC assembly, you can get most of the benefit of improving cache use from C, C++ or another higher-level language.

The way the profiler works is that an interrupt-driven task is used to check the instruction counter at frequent but random intervals. The samples are saved to a file for later analysis, then a postprocessor makes a histogram which gives the number of samples per basic block of instructions.

(A basic block, essentially, is any code that falls between a pair of curly braces if it came from original C source code. It's more complicated than that in practice, but basically it's a chunk of machine code that has one entry point and one exit. It's possible to analyze machine code with a program and divvy it up into basic blocks.)

Then basically what you do is sort the machine code, with the most frequently used basic blocks coming earlier in the file.
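The sorting step is simple enough to sketch directly. Block names and sample counts below are invented; the idea is just that hot blocks move to the front of the file while ties keep their original program order:

```python
# Minimal sketch of profile-driven block reordering: lay out the hottest
# basic blocks first so they pack tightly in the file (and in the cache).

def reorder(blocks, samples):
    """blocks: list of block names in program order; samples: name -> hits.
    Hot blocks come first; ties keep original program order."""
    order = {name: i for i, name in enumerate(blocks)}
    return sorted(blocks, key=lambda b: (-samples.get(b, 0), order[b]))

blocks  = ["prologue", "rare_error_path", "main_loop", "epilogue"]
samples = {"main_loop": 9000, "prologue": 40, "epilogue": 35}
print(reorder(blocks, samples))
# ['main_loop', 'prologue', 'epilogue', 'rare_error_path']
```

A real tool also has to patch every branch offset after moving the blocks, which is where parsing the RISC machine code (as described earlier) comes in.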

Note that the profiling process depends necessarily on the use to which the program is put during the sampling. For best results, you might actually want to prepare several separate binaries of the same program, each optimized for a different purpose. Or you might want to construct test data or a test script that gives you a good overall average performance.

Now, how do you get this tool? It's more than just theory. It's available for IBM RS/6000s, although I don't remember what they call it.

I believe a variant of this is available in the Metrowerks Codewarrior [metrowerks.com] development environment for PowerPC (CodeWarrior also supports Windows, Linux via GCC and lots of embedded systems but I believe the code reordering is only available for PowerPC).

CodeWarrior provides both an IDE (on Windows there's a choice of MDI user interface or Mac style with a global menu bar and free windows, which makes me much happier when I program on Windows) and it also provides command line tools, including the entirety of MPW with mwcc preinstalled so you can do "make" style builds on the MacOS (but with a weird makefile syntax).
I don't seem to find any mention of this on Metrowerks' website. I'll ask their friendly support guy if I'm correct about this.

Perhaps you're lusting over using this for Linux. It would certainly be interesting to try using this on the kernel - build the kernel, boot the machine off it, run it for a while under a normal load while you run the instruction pointer sampler, then reorder the instructions in the kernel and boot off the new kernel and you run faster!

This would probably be easiest to do on PowerPC Linux given the availability of published information from IBM and Apple about it, but I don't see why you couldn't do it for any instruction set. Some would just be harder to parse or rearrange correctly than others.

It looks like Daisy has been around for a while. Papers date back to '96. Looking at Transmeta, I see that they were founded in '95. I'm not sure when Transmeta got their patents, but it could very well be after IBM did much of their early research. IBM probably has a good fighting chance against any legal actions that might be taken against them.

It was never dropped. Microsoft stopped supporting it, but the install folder for PowerPC is still on any NT 4.0 Install CD.

On a related note, does anyone know whether or not this is true: I heard that because the Macintosh Network Server models that apple put out a while ago were based on a chrp motherboard, you could install NT on them. If it is true, does anyone have one of these models they would be willing to sell?

I just don't think there's going to be much of a way around it until we figure out how to store more energy in a light, safe way.

That the automotive industry, amongst others, has been throwing money at battery research for decades and hasn't made any order-of-magnitude breakthroughs suggests that making more efficient batteries is extremely difficult.

Hotspot works by JITting only parts of the code - the parts that get run a lot (hotspots) - and tuning & optimizing these for speed. For the rest of the code, it's an interpreter. Because of this, startup time is low (no need to JIT the entire Swing toolkit, XML parser, CORBA, etc. etc. when you start up a complex app that uses all of these) and execution speed is typically at least as good as for a JIT.

It's virtually impossible to do an exhaustive run of a program unless it is just a straight computation. How would you "pre-run", say, a Netscape browser and know all the possible buttons to be pushed? Also, would you want to sit around and "pre-run" all your applications? This would take forever to start an app.

It's not that battery technology is lacking for R&D efforts, it's that batteries are what you might consider a "mature" technology. For the most part the basic design of batteries has barely changed in close to 100 years. Expecting improvements anywhere close to what you see in the semiconductor industry is rather unreasonable. It would be like asking car manufacturers to double the mileage of internal combustion engines every 2 years.

You need energy to brake a spinning disk, not to keep it spinning. If you could reduce the friction to zero (think of magnetic bearings with room-temperature superconducting magnets in an evacuated case), the hard drive would spin forever. Of course, no room-temperature superconducting materials are known yet (and strong magnetic fields may cause problems), but the point is that current devices are far from the limits set by the laws of physics.
Also, if you use a head-worn display instead of a huge screen, the size and power draw of the LCD drop substantially.

It occurs to me that there's a third possible way: rather than doing the emulation step by step as the program runs, step through the whole compiled program and convert it to native code just once, and then run it natively from then on, rather than re-emulating it each time through the loop.
How come nobody is doing it that way?
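The "translate once" idea can be sketched like this, using an invented two-register toy machine (the opcodes and register names are made up for illustration):

```python
# Toy "translate once, run natively thereafter": the guest program is
# converted to host closures a single time; subsequent runs execute the
# translated code directly, with no per-instruction decode loop.

def translate(guest_program):
    """One-time pass: turn each guest instruction into a host closure."""
    host_code = []
    for op, *args in guest_program:
        if op == "load":                  # load IMM into register REG
            reg, imm = args
            host_code.append(lambda regs, r=reg, v=imm: regs.__setitem__(r, v))
        elif op == "add":                 # add SRC into DST
            dst, src = args
            host_code.append(
                lambda regs, d=dst, s=src: regs.__setitem__(d, regs[d] + regs[s]))

    def run():
        regs = {"r0": 0, "r1": 0}
        for step in host_code:            # no decoding here: just host code
            step(regs)
        return regs

    return run

native = translate([("load", "r0", 40), ("load", "r1", 2), ("add", "r0", "r1")])
print(native()["r0"])  # 42
```

As the follow-up posts point out, the catch in real systems is self-modifying code and code discovered only at run time, which is why production translators fall back to dynamic translation.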

They are, and have been for the past few years. Just because it doesn't happen in linux doesn't mean it doesn't happen.

Their patents would have to be extremely specific... They can't have simply patented running another processor's instruction set, because there's a wide variety of prior art for that. And anything to do with abstraction is probably also covered under Java... And given the machines their code is running on, it's safe to say they're not using code morphing as a means of prolonging battery life in portables, which is probably where most of Transmeta's patents lie...

And let's not forget... Transmeta initially chose IBM as a foundry specifically because they have a license from Intel to manufacture x86-compatible chips... IBM could have extracted a cross-license agreement to cover whatever technologies they needed covered when they were negotiating with Transmeta.

Don't you wish you could talk directly to those managing Transmeta? I'd love to point at articles like this and say, "I told you so." They are a good year ahead of any competition, but unfortunately their products are still too pricey and too slow. Since Transmeta refuses to open-source their code-morphing capability, I'll put my money and support behind IBM or whoever writes software to give me the functionality Transmeta doesn't even want to give its customers.
I want a system that can change its instruction set on the fly, or at least in PROM or BIOS. I want a system on which I can run Solaris, OS X, Linux, IRIX and wintendoze natively at near-hardware speeds. It would also be nice if this could be a portable system, but that's not nearly as much of a requirement. Transmeta refuses to write additional code-morphing software for the UltraSPARC, MIPS, PPC, etc. instruction sets. So as far as I'm concerned they can be consumed by AOL or the next big monopoly. I won't shed a tear.

Dynamic in this case means that some code is emulated on the fly, and some is translated. This approach was pioneered for bytecode systems in Smalltalk implementations in the 80's, and of course is now used in Sun's HotSpot and other dynamic adaptive JVMs.

Static binary translators have been around for even longer, and were used (among other things) for running VAX programs on Alpha.
A useful overview [digital.com] of this sort of technology appeared in the Digital Technical Journal 4:4 (1992) [digital.com]. HP also performed binary translation between the HP3000 and the Precision architecture, but I can't find on-line info on that, just a citation [nec.com] to a paper article (1987).
There is also a useful survey article [nec.com] on static and dynamic binary translation.

What is presumably novel in Transmeta's approach is that their instruction set architecture (ISA) is tuned specifically for dynamic translation (see page 12ff of Transmeta's paper The Technology Behind Crusoe Processors [transmeta.com]). Some microcode architectures have been designed specifically for general emulation (most have been tuned for a particular macroinstruction ISA), e.g. the early Lisp Machines [uni-hamburg.de] (1976-81).

Kernel profiling, instrumentation, and dynamic rewriting has been done in kernel space (on stock, unmodified, closed-source kernels, no less), by the KernInst group [wisc.edu] at Wisconsin. One of my officemates is currently working on this project, and it is an "offspring project" to the research group that I work for. They managed to improve squid's performance by altering the behavior of (IIRC) open() with O_TRUNC.

The original (sparc solaris) port has been going on for over four years now, with several graduate students working on it. There is an i386 linux port in progress, but I don't know if it's generally available yet. I'd suggest reading the papers (the above link), as there are a lot of fascinating "gotchas" and the ways that these were dealt with are quite clever. (For example: how do you atomically insert a sequence of two instructions into a process that you can't stop?)

In any event, the papers are good reading, and will be very useful to your research. (BTW, the student who started this project is finishing his PhD this winter and has a "killer job" waiting for him.:-) )

~wog
My views are my own and do not reflect those of my university or research group.

This is pretty neat, how it converts PPC and x86 code into VLIW. It is a good way to see how efficient VLIW's unique tree-instruction approach would be on currently compiled code. However, there is a bit of latency the first time each block of code runs, and it is difficult to tell how much this will slow things down on the first pass through each block.

This could mean that upgrading architectures would be possible while still retaining backwards compatibility. Isn't it about time Microsoft left the x86 instruction world and embraced the newest technology available? This would be like Apple's transition to PPC, although unlike Apple, they wouldn't need to write a software emulator for older software; they could simply use DAISY to morph the code.

Does anyone know how DAISY compares with software emulation in terms of speed? I'm guessing it is a great deal faster.

Yes... that is because DAISY is a DYNAMIC BINARY TRANSLATOR... say the words with me. What makes Transmeta special is that they have put a dynamic binary translator on a chip and have developed silicon to make it faster. At this very moment I am doing maintenance work on a Pentium -> SPARC dynamic binary translator. Getting x86 floating-point instructions to work is a bitch, but for some reason the compress95 benchmark needs floating point to generate data in the test harness, even though it's an integer benchmark.

According to their white paper, Transmeta uses dynamic binary translation to convert x86 code into code for Transmeta's internal architecture. This is similar in concept to the current version of DAISY which converts PowerPC code into code for an underlying DAISY VLIW machine. DAISY was developed at IBM independently of Transmeta. The DAISY research project focuses less on low power and more on achieving instruction level parallelism in a server environment and on convergence of different architectures on a common microprocessor core. A more detailed comparison of the DAISY and Transmeta approaches will be possible after Transmeta publishes their techniques in more detail.

Isn't it about time Microsoft left the x86 instruction world and embraced the newest technology available?

What technology are you talking about? IA-64? FYI, they are. Besides that, I would say that x86 is pretty much the best tech out there, as far as Microsoft is concerned.

Besides, the point of this technology is that the software vendors don't have to do anything (or not much). Their code just runs where you want it. Microsoft is not stopping you from emulating an x86 on this new technology you're talking about.

Does this mean that (if all goes well) the technology could be used to blend the stability of Linux with the availability of MS apps? With Big Blue's half-hearted attempts at putting out Linux desktops/laptops, this might be a sign of good things coming down the road?

Dynamic translation lets you make optimizations at runtime, based on the observed behavior of the code, that can't be made statically at compile time (or even as well in the CPU using branch prediction, etc.).

Well, yes... and no. Yes, it can let you make runtime optimizations by aggressively profiling the code as it runs (something a compiler can't do), but you have to remember that translating from one instruction set to another isn't the same as going from an intermediate form to machine language. If you translate from one machine language to another, you have to deal with the fact that the code has already been compiled once, and has been scheduled and optimized (perhaps poorly) by a previous compiler. You're stuck with an instruction stream, and extracting the meaning of that stream and generating an equivalent, more optimized set of instructions in another machine language is extremely difficult, much more difficult than it is for a compiler with access to an intermediate representation. Register allocation, instruction scheduling, prefetching, and instruction selection have already been done for one specific architecture. That's a main disadvantage of code morphing: you can't really hope to correct the mistakes of a previous compiler (which didn't know any better, because it was doing the best it could for its target architecture).

IMO, the best approach would be a hybrid, where the code morphing could use the intermediate representation as a binary form to generate machine language, and then optimize it using runtime profiling and scheduling, based on techniques current compilers already apply to IR trees.
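A toy illustration of the "stuck with an instruction stream" point above (both ISAs and the register map are invented): a binary-to-binary translator can remap registers one-for-one, but the source compiler's allocation and scheduling decisions are already baked into the stream it consumes.

```python
# Toy binary-to-binary translation: map each already-scheduled source
# instruction to a target instruction one-for-one. The translator can
# rename registers, but the ordering, allocation, and any spills chosen
# by the source compiler are fixed; an IR would have preserved the
# intent needed to redo those choices.

REG_MAP = {"ax": "r0", "bx": "r1", "cx": "r2"}   # source reg -> target reg

def translate_insn(insn):
    op, dst, src = insn
    return (op, REG_MAP[dst], REG_MAP[src])

source_stream = [
    ("mov", "ax", "bx"),    # decisions the source compiler already made
    ("add", "ax", "cx"),    # are carried over verbatim into the target
]
target_stream = [translate_insn(i) for i in source_stream]
print(target_stream)  # [('mov', 'r0', 'r1'), ('add', 'r0', 'r2')]
```

Anything beyond this (rescheduling, undoing a poor allocation) requires recovering semantics from the stream, which is exactly the hard part the post describes.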

Very similar, but it leaves you with the restrictions of the JVM, which I'm told (and I don't know enough about the JVM or architectural design to verify the comment) is not particularly well designed for fast execution.

The biggest drags on performance of (say) x86 systems are not the lack of runtime profile data for the compiler, but the hard-to-parallelize restrictions placed on operations by both the language and the host ISA. Dynamic translation allows you to optimize for the 99% case, and patch up the other 1% with an exception handler. Register allocation may be a little hard to deal with, but scheduling, prefetching and instruction selection are exactly what dynamic translation should fix up.
It might be faster to use an IR or a VLIW->VLIW translator rather than using a "real" ISA as your source, but I think that is the wrong comparison. The advantage of dynamic translation is that it can potentially be used to execute x86 (or whatever) code on a VLIW machine much faster than on a native machine. BTW, one of their source ISAs is the JVM, which really is an IR.
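The "optimize for the 99% case, patch up the 1%" pattern can be sketched as follows (a toy, with the 32-bit assumption invented for illustration; real translators use hardware traps rather than an explicit check):

```python
# Toy speculative fast path: the translated code assumes results fit in
# 32 bits (the common case) and falls back to a slow, exact handler when
# that assumption is violated.

def slow_add(a, b):
    return a + b                          # exact semantics, any width

def translated_add(a, b):
    fast = (a + b) & 0xFFFFFFFF           # fast path: wrap to 32 bits
    if fast != a + b:                     # "exception": assumption failed
        return slow_add(a, b)             # patch up via the slow handler
    return fast

print(translated_add(1, 2))              # 3 (fast path)
print(translated_add(2**31, 2**31))      # 4294967296 (slow path)
```

The win comes from the fast path being taken almost always; the handler only has to be correct, not quick.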

RE: prior art, please read Transmeta's patents before flying off the handle here. To the best of my knowledge, TM didn't patent dynamic translation, but they have several patents on optimizations for dynamic translation. Most of them are particularly suited to the situation where the hardware can be specifically designed to help out the translator, like the shadow registers used to ensure precise exception behavior.

Newer GCCs have something like this; look up -fprofile-arcs and -ftest-coverage in the GCC/gprof documentation.
I haven't looked super closely at how well it works, but the documentation seems to hint that it's doing similar types of optimizations. Basically, it takes the profile information for each arc in the control-flow graph and uses that to decide how to lay out the basic blocks when it generates the code.

In my limited experimentation, I didn't see much of a difference (too small to measure) using these tricks, so either my code wasn't helped by it (too small?), or GCC was just going through the motions. YMMV.

Some of the automotive problems are a little different. The fuel cells need to be extremely potent and light, yet strong enough to survive a crash, and stable enough that if a train hits one, Akron doesn't need to be bulldozed into a big pit lined with clay. It's a tricky situation. That's why we're more likely to see hybrid gas/electric cars (so a very efficient turbine can be used). Fuel cells, at least the reactions I had studied circa 1996, were all fairly complicated to get going, let alone in a very reliable fashion, and you did use safety equipment. They will eventually make it into automobiles, but there are a lot of hurdles, and those take time. It's not as if everyone has been throwing buckets of money at the problem like it was cancer for the past half century. The methanol fuel cells another person mentioned earlier had shown some promise, but I bet those will be a little later coming to America. A child might fall down a well and try to survive on the smelly water in daddy's cell phone battery.

That works as long as you can identify all of the code cleanly. Particularly outside the UNIX world (think DOS / Win9x), it's commonplace to treat data as code and code as data. (Most UNIX programs just rely on the ELF/COFF file format and don't muck with code vs. data attributes, unless they're doing something icky like GNU C's trampolines, which put executable code on the stack.... *blech*)

It's hard enough to write a reliable disassembler that doesn't fall over when it hits a jump table, callback, overlay, dynamic library or other indirect method for loading / invoking code. (At least, one that's reliable in the absence of a symbol table that highlights all of the valid entry points.)
What makes you think you can reliably re-assemble a binary for a different target in such a setting?

That's cool! There was a program that did the same thing on SPARC; it just looked for the calls to .mul and .div, etc., and replaced them with single instructions. Then it evolved into doing better register allocation and soon became a static optimiser. Someone turned it into a dynamic optimiser. This is all research stuff; Sun doesn't actually sell this program (or give it away).
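The .mul-replacement trick can be sketched like this (a toy: the instruction tuples and the assumption that the helpers take operands in %o0/%o1 are simplifications; a real tool must also handle the call's delay slot):

```python
# Toy static patcher: scan an instruction list and rewrite calls to
# software-arithmetic helpers as single native instructions, for
# hardware that has them (e.g. SPARC V8's smul/sdiv replacing the
# .mul/.div library routines older SPARCs needed).

HELPER_TO_NATIVE = {".mul": "smul", ".div": "sdiv"}

def patch(stream):
    out = []
    for insn in stream:
        if insn[0] == "call" and insn[1] in HELPER_TO_NATIVE:
            # helper convention: operands in o0/o1, result back in o0
            out.append((HELPER_TO_NATIVE[insn[1]], "o0", "o1", "o0"))
        else:
            out.append(insn)
    return out

prog = [("call", ".mul"), ("nop",), ("call", ".div")]
print(patch(prog))
```

Each replaced call saves the branch, the helper's loop, and the return, which is why the speedup was worth building a tool around.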

Reading the related paper [ibm.com] about the use of code rearranging (let's not call it code morphing, lest we get a patent-infringement notice from Transmeta) for Java optimisation shows some of it being used starting in 1997 (isn't this pre-Transmeta? Prior art?).
The PDF covers converting Java bytecode into RISC (PowerPC, it is IBM =) ) code that is then scheduled in a magic way to give a degree of parallelisation on the right hardware. Hmm, this does smell like Transmeta.
One of the guys working on that Java paper has a few patents [delphion.com] in his name for optimisation.

What about fuel cells for power? I vaguely remember reading somewhere that they can theoretically be made small enough to fit into mobile phones, palmtops, etc. Think of topping up your cellphone with a thimbleful of alcohol and running it for two months.

From what I understand of it, this is essentially what Java does: compile code for a virtual machine, and then emulate it on different OSes and processors. Of course, there are some obvious limitations to Java, most notably the extra memory it uses and the decreased speed of Java applications, but that's the sort of thing that could be expected in any project like this.

Actually, thinking about the speed limitations of Java... do any of the JIT runtimes optimize code when they translate it? If it works as well as it's supposed to for Transmeta, I'd think the same principle could be applied in Java. Anyone out there know if it's being done, or why it wouldn't work?

Actually, we've heard some horror stories about Sun's SPARC machines. Because there are so many different versions of the processor, people often distribute just one binary: the lowest common denominator. This makes customer support a hell of a lot easier when someone says "it crashed at 0x...", since they don't have to go around asking what version of the binary the person is running, etc. Apparently people do this a LOT, and your shiny new V9 is no faster than your neighbour's V7 (which can't even do a multiplication in a single instruction!). So if you had a SPARC -> SPARC binary translator, you could make the thing run faster.