
THE FIRST session here at the Intel Developer Forum kicked off with Intel fellow Steve Pawlowski philosophising about multicore chips.
He said there was a raging debate going on within Intel about what multicores are - that is to say, what place they occupy in the greater scheme of things.

Intel has decided to back off from pushing raw power, voltage and chip frequency, and instead to focus on features and software that can take advantage of the different cores.

He didn't say whether by 2010 Intel chips would have the equivalent brain power of a bumblebee. Another Intel fellow said five years ago that Intel would be able to produce the brains of a computer matching that of a bumblebee by 2010, assuming systems by then have the equivalent of one or two teraflops. What we know is that a bumblebee's brain is 500,000 neurons strong. And we know that in its brief three to four weeks of life, it can produce millions of terror flaps.

It will be possible for Intel to provide multicores ranging from the tens to the hundreds, and even thousands if necessary. The idea is that by 2016 there will be 128 billion transistors on an 11 nanometre process, and that by 2018 the process tech will be eight nanometres, allowing 256 billion transistors.
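Those two data points sit neatly on the classic Moore's Law curve of a doubling every two years, as a quick back-of-envelope check shows (the baseline and doubling period here are assumptions chosen to match the roadmap figures, not Intel's own model):

```python
# Quick Moore's Law check against the roadmap figures:
# a doubling every two years, anchored at 128 billion in 2016.

def transistors(year, base_year=2016, base_count=128e9, doubling_years=2):
    """Extrapolated transistor count, assuming a fixed doubling cadence."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

print(f"2016: {transistors(2016) / 1e9:.0f} billion")
print(f"2018: {transistors(2018) / 1e9:.0f} billion")
```

One doubling takes the 2016 figure of 128 billion to the 2018 figure of 256 billion, so the roadmap is plain Moore's Law with no bend in the curve.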

Intel will use these additional transistors to let software applications take advantage of the extra cores. He said: "We won't get the same performance or power benefits, but there will be new types of applications." Meanwhile Intel will bring supercomputing to the mainstream within the next 10 years and offer teraflops of performance.

San Francisco (CA) - If no surprises emerge from tomorrow's first day of the Fall 2005 Intel Developer Forum, Intel may still command the headlines with its anticipated announcement, from newly anointed CEO Paul Otellini, of the next-generation Pentium architecture.

But the story of the week may very well become the triumph of Intel's Israel Design Center (IDC), whose more moderate approach to processor architecture has won that team several architectural victories of late--not only over arch-rival AMD, but also over Intel's own NetBurst architecture, which may very well follow the path Itanium has carved toward Intel's back burner.

"The rule of thumb in 'NetBurst land,'" Nathan Brookwood, principal analyst with the Insight64 consultancy, told Tom's Hardware Guide this afternoon, "was just throw clock frequency at the problem, and you'll get more performance almost without thinking. And it turns out we've run into the end of that era. The Israelis saw that coming."

With all the recent innovations in multicore CPU packaging, microarchitecture--the design of processor engine components--has recently assumed a secondary role in public conversation. Lately, the talk has been about what Brookwood characterizes as, "How many cores can you fit on the head of a pin?" As a result, what's happening inside each individual core hasn't been a front-burner topic. So if you were to judge tomorrow's likely IDF news from a multicore vantage point alone, you might overlook an upheaval going on beneath the core level: the so-called NetBurst architecture, which was the key feature of Pentium 4 when it was introduced in 2000, is being phased out.

NetBurst originally introduced Intel's first 20-stage execution pipeline, and proceeded to grow the pipeline from there, having shipped a P4 with a 31-stage pipeline, according to Brookwood, and having cancelled a product that would have included a 40-stage pipeline (edit: this was Tejas). Longer pipelines were originally introduced, according to Intel, to enable greater pre-assessment and optimization of machine code prior to execution.

"A very long pipeline turns out to be extremely inefficient," said Brookwood.
"Therefore, although you felt good because you had a 3 GHz processor, in reality,
it wasn't delivering any more performance than a 2 GHz processor with shorter
pipelines. But it used a lot more power and generated a lot more heat."
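Brookwood's point can be sketched with a toy throughput model: assume every mispredicted branch flushes the whole pipeline, so the stall penalty grows with pipeline depth. All the rates and counts below are illustrative assumptions, not Intel figures:

```python
# Toy model: effective throughput of a deep vs. a shallow pipeline.
# Assumption: a mispredicted branch flushes the pipeline, costing one
# cycle per stage. The branch and misprediction rates are invented.

def effective_mips(freq_ghz, pipeline_stages, base_ipc=1.0,
                   branch_rate=0.2, mispredict_rate=0.15):
    """Millions of instructions per second after misprediction stalls."""
    # Extra cycles per instruction lost to pipeline flushes:
    stall_cpi = branch_rate * mispredict_rate * pipeline_stages
    cpi = 1.0 / base_ipc + stall_cpi
    return freq_ghz * 1000.0 / cpi

deep = effective_mips(3.0, 31)     # "NetBurst-like": 3 GHz, 31 stages
shallow = effective_mips(2.0, 10)  # "Pentium M-like": 2 GHz, 10 stages

print(f"3 GHz, 31-stage pipeline: {deep:.0f} MIPS")
print(f"2 GHz, 10-stage pipeline: {shallow:.0f} MIPS")
```

Under these assumed rates the two come out within a couple of percent of each other, despite the 50 per cent clock advantage of the deeper design, which is exactly the effect Brookwood describes: the extra gigahertz get eaten by the longer flush penalty, while the power and heat bills still scale with frequency.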

The Israeli team's alternative was Pentium M, introduced in March 2003. As
Brookwood confirmed, Intel conducted some convincing tests indicating Pentium M
performance on a par with Pentium 4 in everyday, general-purpose applications--
even though the P4 was expected to yield as much as four times the performance,
and even though Pentium M units feature pipelines as short as 10 stages.

"So in moving from the NetBurst core to a new core based on the Israeli
techniques," added Brookwood, "I think Intel will end up with a core that scales
better with frequency." With lower power consumption, you can put two or four of
the new cores on a single chip, and still preserve what he called "reasonable
thermal characteristics."

The new architecture will also mark the first time that desktop and server CPU
architectures have been derived from a mobile platform. As Brookwood reminded us, the
Pentium III architecture was modified once to create the "Mobile" edition, and
then modified a second time to create the first Xeon processors. But the Israeli
design team was first commissioned five years ago to develop a mobile processor
architecture that could meet what were then considered the extreme thermal
conditions of notebook and laptop systems. The solution to the mobile thermal
problem became the solution to the desktop and server thermal problem a few years
later. "This represents the triumph of the power-efficient design methodologies
that came out of Israel," said Brookwood, "moving into Intel's mainstream
desktop, and server lines, as well as next-generation mobile processors."

Tomorrow's announcements are expected to indicate that the so-called Merom
processor architecture--first code-named in 2004--will serve as the basis for the
Conroe desktop CPU architecture and the Woodcrest server CPU
architecture.

Oftentimes, smart companies publish bad news on the heels of an otherwise good-news day. So if rumors put forth in the Inquirer this afternoon are correct that HP plans to cancel its planned orders for Itanium-based systems--in the wake of HP's already having cancelled its collaboration with Intel on Itanium's design--then this news could conceivably come during IDF.

While unable to confirm such rumors himself, Insight64's Nathan Brookwood
speculated, "If HP were to turn down Montecito...that would, I think, cause a
great deal of reassessment in almost all parts of the industry that touch
Itanium."

Other announcements expected no later than Wednesday include whether Intel has stepped up its plans to proceed toward 45 nm lithography--thus bending the curve of Moore's Law up just slightly; the possibility of a new, lower-wattage dual-core Xeon processor; and a possible hardware deal--mentioned on CNBC late this afternoon--between Intel and Blackberry producer Research In Motion, Ltd.

THIS IS THE LATEST article in a series about hardware virtualisation. The first set is on Vanderpool, Intel's version of the concept. If you are unfamiliar with the concept, please read Vanderpool Parts 1, 2, 3 and 4, and AMD's Pacifica parts 1, 2 and 3.
Before you can even get Intel chips with VT in them, Intel is touting VT2 and VT3. Don't think of this as a reason not to buy chips with vanilla VT1 in them - you will have a long wait if you do. Think of it more as a statement of future direction. VT-enabled chips will be on the market in a few months, but the war of words has already begun.

Intel announced VT over a year ago and released the specs about 6 months ago. Not to be outdone, AMD announced Pacifica, which should be out early next year. Now Intel is talking about the next generation shortly before the first one is out, which means AMD will follow with Pacifica2 before Pacifica the elder hits the market. Before you conclude that this is all a marketing game, just remember, there is a substantial amount of good tech here, and VT2 will do a lot of good for VMM developers.

Think of this as where Intel will go once the above listed stuff is purchasable silicon. This time, it is not just limited to the CPU, it pulls in the Northbridge, memory controllers, buses and peripherals. It is a lot more of an attempt to virtualise the system rather than only the processor.

The first and probably the most important addition is memory virtualisation - not memory controller virtualisation. While this may seem like a rather odd distinction, it actually makes a lot of sense, and with the Intel implementation there may never be a need to virtualise the memory controller. This is called Extended Page Tables, or EPT in Intel parlance.

Finding the right address is a time consuming and recursive process which can be three or four levels deep with a lot to keep track of. As with AMD's Nested Page Tables, you could add another level of recursion, or as Intel does, add an offset to the PTW, the Page Table Walker. What this in effect does is figure out the offset to where the VM thinks a page is located and adds this to the PTW calculations.

The PTW hardware is designed to figure out what an address should be after a TLB miss. To add virtualisation you need to know the offset between the real memory address and what the guest thinks it is.

When the calculations are done by the PTW, the result is then passed to the memory controller in a 'pre-virtualised' manner - it is already correct. The memory controller does not need to figure anything out, nor does it need to be aware that the OS calling for data is virtualised. In effect, you are adding the intelligence before the memory controller rather than on it.
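In effect there are two translations chained together: the guest's own page tables map guest-virtual addresses to what the guest thinks are physical addresses, and the hypervisor-owned extended tables map those to real host-physical addresses. A minimal sketch of that two-step walk, with dicts standing in for the page-table hardware and every address invented for illustration:

```python
# Two-level translation in the spirit of EPT / nested paging.
# Dicts stand in for the guest page tables and the extended
# (hypervisor-owned) tables; all addresses are made up.

guest_page_table = {        # guest-virtual page -> guest-physical page
    0x1000: 0x4000,
    0x2000: 0x7000,
}
extended_page_table = {     # guest-physical page -> host-physical page
    0x4000: 0x9C000,
    0x7000: 0xA3000,
}

def walk(guest_virtual):
    """Resolve a guest-virtual page to a host-physical page.

    The result is what the page table walker hands to the memory
    controller: already correct, with no exit to the VMM needed."""
    guest_physical = guest_page_table[guest_virtual]
    return extended_page_table[guest_physical]

print(hex(walk(0x1000)))    # host page backing guest page 0x1000
```

The memory controller only ever sees the final result of `walk()`, which is why, as described above, it needs no changes to support virtualised guests.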

One nice thing EPT does is make any memory controller able to support hardware virtualisation with little or no change. When CPUs that support it come out, motherboard support should be there as well - there's no need to design a new northbridge for virtualisation.

EPT will mean no more dropping in and out of the VM every time there is a certain class of memory accesses, and a whole lot fewer interrupts to trap. Since this is one of the largest costs in virtualisation it should speed things up dramatically. On the surface, it appears to be quite a different method of achieving the same goal that Pacifica gets to by virtualising the memory controller. It will be interesting to see which method provides the lowest overhead, but both will be vastly better than the current software method. EPT will catch Intel up to Pacifica.

Once you've caught up, the next thing to do is move ahead, and that's what DMA Remapping does. If you recall, Pacifica can block DMA at the HT to ccHT border, providing a yes/no ability for the VM to see a particular piece of hardware. This is not exactly a hugely granular solution, but it does the intended job fairly well.

DMA Remapping remaps the DMA request to the correct guest OS, so in cases where Pacifica might deny something, DMA Remapping points it to the correct spot. This is the first step to virtualising peripherals, but more on that later.

DMA Remapping in software is hugely expensive - it can bring a fast machine to its knees if done improperly. Remapping in hardware is orders of magnitude faster, but still has enough overhead that it isn't something you would want to do on every interrupt. That is where Intel adds the IOTLB. Like the TTLB from Pacifica, or the plain old TLB in non-virtualised chips, it caches the DMA remappings so they only have to be calculated once.
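The IOTLB's role boils down to a cache sitting in front of an expensive remapping step, so the full walk happens once per device address rather than on every transfer. A sketch of that idea, where the remap rule, the costs and the names are all invented for illustration:

```python
# Sketch of an IOTLB: cache DMA remappings so the expensive
# translation walk runs only once per (device, address) pair.
# The remap rule and counters are invented for illustration.

remap_count = 0
iotlb = {}   # (device, dma_address) -> remapped host address

def slow_remap(device, dma_address):
    """Stand-in for the expensive hardware remapping walk."""
    global remap_count
    remap_count += 1
    return dma_address + 0x10000 * device   # made-up remap rule

def dma_translate(device, dma_address):
    key = (device, dma_address)
    if key not in iotlb:                 # miss: do the full walk once
        iotlb[key] = slow_remap(device, dma_address)
    return iotlb[key]                    # hit: cached, effectively free

for _ in range(1000):                    # 1000 DMA requests to one page
    dma_translate(1, 0x5000)
print(f"remappings computed: {remap_count}")
```

A thousand requests to the same page trigger exactly one full remapping; every subsequent lookup is a cache hit, which is the whole point of putting a TLB in front of the remapper.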

EPT and DMA Remapping work in tandem - they virtualise two of the biggest holes that VT left open. With these two things running in VT2/3, or whatever marketing name is thought up by then, there is very little left to do on the CPU to provide a completely virtualised environment.

The catch here is 'on the CPU.' The next phase of VT is to pull the platform and peripherals into the picture. This involves just about every vendor out there, and Intel has to ride them while cracking the whip. They all have to dance to the same standard or you end up with an ugly mess instead of a virtualised machine.

To do this, you start with the PCI-SIG because all of these things that you want dancing to the same beat are all plugged into PCIe. Work here is well under way. There are virtualisation working groups for PCIe 2.0 that are in the early stages of bitter argument. It may be a while, but with any luck, PCIe 2.0 will come out with some form of virtualisation support built in.

This will allow the peripheral vendors to make individual devices virtualisable, or at least be VM aware enough to dodge the uglier bullets. With a full set of virtualisable hardware on a PCIe 2.0 bus, and a VTx CPU running it all, you have pretty much a completely virtualised system. That is the goal, and it looks like the roadmap is being made known.

A cute trick that this, once fully implemented, will allow is for each VM to run its own driver set in guest OS space. No more massive jumping back and forth to the VMM and trapping every call under the sun. For the user, it means you can play a lot of tricks, and also run games and other demanding apps in a VM. For devs, it could allow driver debugging on a whole new level. Have five revisions of a driver, and want to test them all on the same box? Not a problem. Things like this make devs smile.

Another little trick they've added is a preemption timer. This doesn't really virtualise anything specifically, but it allows for different ways to pop in and out of the VMM. It is a timer that says run VM 1 for X milliseconds, then drop out and run VM 2 for Y. Preemption timers have some very interesting implications for the embedded world and other similar applications, but for desktops they are not all that useful.

It can help a lot when you need to switch tasks now, or you must allocate a certain amount of CPU power to a task. For telecom and networking applications, it makes virtualisation a useful tool and possibly a must have feature. On the other end of the spectrum, it can help for media applications like media PCs and Tivo-type devices. For the business world, it doesn't buy you all that much.
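The preemption timer amounts to a weighted round-robin driven by the hardware rather than the VMM: run VM 1 for X milliseconds, drop out, run VM 2 for Y, repeat. A sketch of the resulting schedule (the VM names and time slices here are invented, with the larger slice standing in for a media workload):

```python
# Sketch of a preemption-timer schedule: run VM1 for X ms, drop to
# the VMM, run VM2 for Y ms, repeat. Slices and names are invented.

slices_ms = {"VM1": 8, "VM2": 2}   # e.g. media decode vs. background task

def schedule(total_ms):
    """Return (vm, start, end) slots filling total_ms of wall time."""
    slots, t, i = [], 0, 0
    vms = list(slices_ms)
    while t < total_ms:
        vm = vms[i % len(vms)]
        quantum = min(slices_ms[vm], total_ms - t)   # timer fires here
        slots.append((vm, t, t + quantum))
        t += quantum
        i += 1
    return slots

for vm, start, end in schedule(20):
    print(f"{start:2d}-{end:2d} ms: {vm}")
```

Because the timer guarantees each guest its slice regardless of what the others are doing, this is exactly the sort of hard CPU allocation that telecom and embedded workloads want.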

So, with all this tech, what is the nutshell story, and more importantly, when? VT is launching in a few months, definitely in 2005 for desktops. For servers and mobile, Intel is only saying 'after 2005', but you can narrow that down with a little guesswork. I would look for the memory virtualisation on the Merom cores, along with DMA remapping. If Intel follows its normal way of doing things, both techs will probably not make the first spin of the cores, but will follow in the next revision.

The platform level work is a bigger open question. The PCIe 2.0 virtualisation should proceed in the same orderly cat-herding fashion that any standards-setting body goes through. It is needed, and I think everyone agrees that it should be done, but how, when, and most importantly whose methodology will be contentious issues. It will be worked out, and you will see virtualised PCIe eventually.

Then comes a task that makes the previous cat-herding look easy. Imagine cat herding with a firehose and firecrackers. That is notably easier than getting all the peripheral makers to play along. This part will also come in time, starting with the enterprise level hardware, and moving down the food chain to the more reputable peripheral makers, and then eventually to everyone. Think hardware compatibility lists, and lots of them.

Once VT2 and VT3 are out there won't be much left to do. The entire computer will be virtualisable with very little overhead, and the dreaded software faking of any part of the system should be banished to memories of the bad old days. That is the point of all of this, and now we have a rough roadmap on how to get there. µ