Might be relevant if Lynn Wheeler could expand on the unreleased VAMPS microcode to speed up 370 SMP, and also provided logical processors with similarities to those on current zSeries LPARs, although that may just have dropped parts of 370 sequential code down into microcode.

so presumably this recent post vis-a-vis vamps and the later i432
http://www.garlic.com/~lynn/2006n.html#42 Why is zSeries so CPU poor?

early microcode effort was "VMA", originally for 370/158, that helped virtual machine performance. for a subset of "supervisor" state instructions, microcode was added to execute the instruction using "virtual machine" rules (to avoid interrupting into the virtual machine hypervisor where the instruction was simulated).

concurrent with VAMPS effort was "ECPS" for 370 138&148. ECPS did some more stuff like VMA on the 158 (direct supervisor state instruction execution) ... but it also identified parts of the hypervisor kernel and moved that kernel code into microcode. the issue on 138&148 machines was that there was an avg. of 10:1 microcode instructions executed for every 370 instruction. Much of the kernel code moved to microcode on straight 1:1 basis resulting in ten times performance speed up. old posting identifying specific kernel code segments for migrating into microcode.
http://www.garlic.com/~lynn/94.html#21 370 ECPS VM microcode assist
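A minimal sketch of the arithmetic behind the "ten times" figure, assuming each 370 instruction costs about ten microcode instructions and that the moved code translates 1:1. The function names are made up for illustration, and the fraction of kernel time covered by the moved code is a hypothetical parameter, not a number from the post; the whole-kernel estimate is just Amdahl's law.

```python
def path_cost_in_microinstructions(n_370_instructions, micro_per_370=10.0, moved=False):
    """Microinstructions needed to run a kernel path.

    As 370 code, every instruction is interpreted at ~micro_per_370 each.
    Moved 1:1 into microcode, each instruction becomes one microinstruction.
    """
    return n_370_instructions * (1.0 if moved else micro_per_370)


def path_speedup(n_370_instructions, micro_per_370=10.0):
    """Speed-up of the moved path itself (the 10:1 ratio in the post)."""
    before = path_cost_in_microinstructions(n_370_instructions, micro_per_370)
    after = path_cost_in_microinstructions(n_370_instructions, micro_per_370, moved=True)
    return before / after


def kernel_speedup(fraction_of_time_moved, speedup_of_moved_path):
    """Amdahl's law: overall speed-up when only part of the time is accelerated.

    fraction_of_time_moved is a hypothetical input (0.0 .. 1.0).
    """
    remaining = 1.0 - fraction_of_time_moved
    return 1.0 / (remaining + fraction_of_time_moved / speedup_of_moved_path)
```

The moved path itself speeds up by the full ratio; the kernel as a whole only approaches that as the moved segments account for more of the kernel execution time.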

the VMA-related efforts eventually evolved into SIE ... where nearly all supervisor state instructions had microcode enhancement for directly executing with regard to virtual machine rules (avoiding a lot of interruption into virtual machine hypervisor to simulate supervisor state instructions). SIE was a state change instruction that gathered up all the fields needed by various supervisor state instructions to execute according to "virtual machine" rules. post of old SIE discussion about implementation issue differences between 3081 and "trout" (3090)
http://www.garlic.com/~lynn/2006j.html#27 virtual memory

there were still things like page faults for the virtual machine that resulted in interruptions into the hypervisor kernel for handling. a special case was defined involving things like dedicated real storage for a virtual machine ... eliminating need to interrupt into the hypervisor kernel. This resulted in being able to operate a virtual machine subset directly supported by hardware ... w/o the need for a virtual machine kernel. This was called "PR/SM" ... and PR/SM capability eventually evolved into the current LPARs (logical partitions). a reference discussing some current LPAR and PR/SM
http://researchweb.watson.ibm.com/journal/rd/483/siegel.html

current machines can have a configurable limited number of LPARs ... and it is possible to run a virtual machine hypervisor in an LPAR, which in turn supports a much larger number of virtual machines. There has been an evolution of the SIE support. Initially, SIE was not virtualized but LPARs make use of SIE for support. That meant that a virtual machine hypervisor running in an LPAR wouldn't have performance assist of SIE for running its virtual machines (all virtual machine supervisor instructions would interrupt into the hypervisor for simulation). Enhancements were required to virtualize SIE for at least one level (so it could be used both by LPAR function and also by hypervisor running in an LPAR).

Since I was doing both VAMPS and ECPS ... I borrowed a lot of stuff done for ECPS for doing VAMPS. However, for VAMPS, I wanted it extended in a much more architected way ... rather than simply doing a 1-for-1 movement of existing kernel 370 code into microcode. VAMPS was to have up to five processors ... and I defined a microcode hardware queued work interface where the hypervisor put units of work on the queued work interface (and the microcode took the queued work and executed it on whatever processors were available). The hardware microcode also placed queued work for the hypervisor to handle ... like things that were i/o interrupts in traditional 370 or page fault interrupts (from executing virtual machines), etc.
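A toy model of the two-way queued work interface described above, assuming nothing about the actual VAMPS design beyond what the post says. The class, method, and event names are hypothetical, and the "processors" are just loop indices rather than real parallel engines.

```python
from collections import deque


class QueuedWorkInterface:
    """Hypothetical model: one queue in each direction between hypervisor and microcode."""

    def __init__(self, processors=5):
        self.processors = processors        # VAMPS was to have up to five
        self.dispatch_queue = deque()       # hypervisor -> microcode
        self.hypervisor_queue = deque()     # microcode -> hypervisor

    def hypervisor_queue_work(self, vm, stops_on=None):
        """Hypervisor side: queue a unit of work (run this virtual machine).

        stops_on is an illustrative label such as "page_fault" for the event
        that will need hypervisor attention, or None if there is none.
        """
        self.dispatch_queue.append((vm, stops_on))

    def microcode_cycle(self):
        """Microcode side: give at most one queued unit to each processor.

        Returns the (processor, vm) assignments made in this cycle. Anything
        that needs the hypervisor is queued back instead of interrupting it.
        """
        assignments = []
        for cpu in range(self.processors):
            if not self.dispatch_queue:
                break
            vm, stops_on = self.dispatch_queue.popleft()
            assignments.append((cpu, vm))
            if stops_on is not None:
                self.hypervisor_queue.append((vm, stops_on))
        return assignments
```

The point of the shape is that neither side interrupts the other directly: the hypervisor only ever adds to one queue and takes from the other, and it never has to know which processor ran a given unit.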

The VAMPS abstraction of queued work for multiprocessor environment was somewhat akin to the definition found later in i432. Some of the VAMPS abstraction for i/o work queueing was somewhat akin to what showed up later for 370-xa i/o operations.

After VAMPS was killed, I adapted the multiprocessing microcode queued processing to a software implementation. A lot of the SMP kernel implementations used a single, global kernel SPIN lock to serialize all kernel execution. This drastically minimized the amount of code changes to adapt a single-processor operating system to support multiprocessor operation.

In adapting the VAMPS multiprocessing microcode support to software, I took the equivalent kernel software functions (that had been moved to microcode in VAMPS) and made them multiprocessing parallelized with fine-grain locking. This amounted to the majority of the software kernel execution time ... but a relatively small amount of the total kernel instructions. The majority of the kernel instructions relied on a somewhat traditional global kernel lock. However, whenever the "parallelized" kernel code needed to transition into the "sequential" kernel code ... rather than "spinning" on the global kernel lock ... it "bounced". If it obtained the global kernel lock, then it proceeded as normal. If it couldn't obtain the global kernel lock, it would queue a super lightweight work request ... and go off and look for other "parallelized" work.
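A minimal sketch of the "bounce" idea, assuming a try-lock plus a queue of deferred requests. This is not the VM370 code: the class and method names are hypothetical, a Python `threading.Lock` stands in for the global kernel lock, and the sketch always queues the request first and then tries the lock, which is a simplification chosen so that a request queued just as the holder is leaving still gets picked up.

```python
import threading
from collections import deque


class BounceLock:
    """Global lock that callers never spin on; losers leave a work request behind."""

    def __init__(self):
        self._global = threading.Lock()   # stand-in for the global kernel lock
        self._pending = deque()           # lightweight deferred work requests

    def enter_sequential(self, work):
        """Ask for `work` (a callable) to run under the global lock.

        Returns immediately either way, so the caller can go look for other
        parallelized work if the lock was busy.
        """
        self._pending.append(work)
        self._drain()

    def _drain(self):
        # Re-check the queue after every release: a request may have been
        # queued while the previous holder was on its way out.
        while self._pending:
            if not self._global.acquire(blocking=False):
                return  # "bounce": the current holder will run the request
            try:
                while True:
                    try:
                        item = self._pending.popleft()
                    except IndexError:
                        break
                    item()
            finally:
                self._global.release()
```

Whoever holds the lock runs the queued requests on behalf of the processors that bounced, so the sequential code still executes on one processor at a time while no processor waits idle for the lock.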

This approach obtained almost all the thruput benefit of having a kernel fine-grain locking implementation, avoided the degradation of single kernel spin-lock implementation ... but the kernel code changes were not significantly more than required for a single kernel spin-lock implementation. This implementation shipped in VM370 release four.

Post by Anne & Lynn Wheeler
concurrent with VAMPS effort was "ECPS" for 370 138&148. ECPS did some more stuff like VMA on the 158 (direct supervisor state instruction execution) ... but it also identified parts of the hypervisor kernel and moved that kernel code into microcode. the issue on 138&148 machines was that there was an avg. of 10:1 microcode instructions executed for every 370 instruction. Much of the kernel code moved to microcode on straight 1:1 basis resulting in ten times performance speed up. old posting identifying specific kernel code segments for migrating into microcode.
http://www.garlic.com/~lynn/94.html#21 370 ECPS VM microcode assist

-- snip snip

Lynn:

Can you point us to any information on 138/148 microprogramming and microarchitecture? Examples of the 10:1 microcode to 370 instruction expansion would be fascinating.

Post by John Ahlstrom
Can you point us to any information on 138/148 microprogramming and microarchitecture? Examples of the 10:1 microcode to 370 instruction expansion would be fascinating.

re:
http://www.garlic.com/~lynn/2006n.html#44 Any resources on VLIW?

i don't have any left ... and am not aware of any online resources. possibly somebody has some old field engineering manuals with instruction description.

the high-end 370s had horizontal microcode ... more akin to VLIW.

the low and mid-range 370s tended to be relatively straightforward processor engines ... and the 370 "microcode" was relatively straightforward sequential instruction sequences (i.e. "vertical" microcode) ... and the avg. of 10:1 microcode instructions per 370 instruction was relatively the same across variety of engines (i.e. the microprocessor MIP rate had to be on the order of ten times that of whatever 370 model it was being used in).

the large variety of these different (microcode) processing engines gave rise to the "fort knox" effort circa 1980 ... to replace most of the internal microcode processing engines with 801s (aka risc).
http://www.garlic.com/~lynn/subtopic.html#801

where the standard 801/risc instruction set was extended with some instructions that aided in instruction simulation.

the follow-on to the 138/148 was the 4331/4341. the follow-ons to the 4331/4341 (4361/4381) were going to have 801 risc processors as the microcode engine. i helped author a white paper that killed that effort. the issue was that technology was advancing to the point where it was possible to implement nearly the whole 370 directly in silicon ... avoiding much of the instruction emulation overhead altogether (i.e. 4381 was much more of a direct silicon implementation).

some number of the 370 instructions required a lot more than 10 microprocessor instructions and for those there wouldn't be a direct simple microcode instruction ... however, the typical high usage 370 kernel instructions tended to be a lot of testing bits/state and branching ... for which there typically was an exact correspondence in the microcode instruction set (i.e. eliminate microcode decode of the 370 instruction, manipulate the 370 registers, etc)

re:
http://www.garlic.com/~lynn/2006n.html#44 Any resources on VLIW?
http://www.garlic.com/~lynn/2006n.html#47 Any resources on VLIW?

as an aside ... some number of the relatively recent 370 emulators written for intel platforms have quoted avg. instruction ratio numbers around 10:1 also (have to play some real tricks to get it much below 10:1).