Ask Ars: what is a CPU thread?

You've heard the term "simultaneous multithreading" applied to CPUs, but maybe …

In 1998, Ask Ars was an early feature of the newly-launched Ars Technica. Now, as then, it's all about your questions and our community's answers. Each week, we'll dig into our question bag, provide our own take, then tap the wisdom of our readers. To submit your own question, see our helpful tips page.

Question: What is a CPU thread (as in "multithreaded CPU," "simultaneous multithreading," etc.)?

Tech pundits, analysts, and reviewers often speak of "multithreaded" programs, or even "multithreaded processors," without ever defining what, exactly, a "thread" is. Truth be told, some of those using the term probably don't really know what it means, but the concept isn't really very hard to grasp. At least, it isn't hard when you look at it from the point of view of the CPU (the operating system definition of a "thread" is another matter).

From the CPU's perspective, a thread (short for "thread of execution") is merely an ordered sequence of instructions that tells the computer what to do. In most of my articles on Ars and in my book, I prefer to speak of "instruction streams" instead of "threads," because the thread is a more complicated and OS-centric concept. As far as most CPUs are concerned, they merely execute whatever instruction streams come into their front end, and they don't care if that instruction stream is from a process or a thread. There may be some special-purpose register values that differ between the two, but the basic functioning of the processor doesn't change.

So when someone talks about a "multithreaded processor," they're talking about a processor that can execute multiple instruction streams simultaneously. There are two ways that a processor can perform this feat: simultaneous multithreading and multiple cores. The two methods aren't mutually exclusive, and they're often used together.

Simultaneous multithreading (SMT) is a trick that lets the processor work more than one thread at a time. The front end of the processor alternates among the different threads in a form of time-sharing, fetching batches of instructions from one thread and then the other. The actual execution core of most multithreaded processors typically doesn't know or care which instruction stream a particular instruction comes from—the parts of the machine that do track which instruction goes with which thread will handle the chore of retiring the right instructions with the right stream.
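To make the time-sharing idea concrete, here's a toy sketch in Python (the instruction streams and batch size are invented for illustration; a real front end works on cache lines, not strings): the front end round-robins between two threads and hands the core one merged stream, with each instruction tagged by the thread it belongs to.

```python
from itertools import cycle

# Two invented instruction streams ("threads"); a real front end fetches
# from the L1 instruction cache, not from Python lists.
streams = {
    "A": ["load r1", "add r2, r1", "store r2", "branch loop"],
    "B": ["mul r3, r4", "sub r5, r3", "load r6", "cmp r6, r5"],
}

BATCH = 2  # instructions fetched per thread per turn (made up)

def fetch_interleaved(streams, batch):
    """Round-robin the front end across threads, tagging each fetched
    instruction with its thread so it can later be retired with the
    right stream."""
    queues = {tid: list(instrs) for tid, instrs in streams.items()}
    merged = []
    for tid in cycle(queues):
        if not any(queues.values()):
            break
        taken, queues[tid] = queues[tid][:batch], queues[tid][batch:]
        merged.extend((tid, ins) for ins in taken)
    return merged

for tid, ins in fetch_interleaved(streams, BATCH):
    print(f"thread {tid}: {ins}")
```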

The other way to make a multithreaded processor is to put more than one processor core on the same die. Each actively executing instruction stream is assigned to a single core, so a four-core processor can support four threads at once, or an eight-core processor can do eight threads at once, and so on.

The life of a thread

An instruction stream enters the CPU by being fetched into the processor's front end. The first time a particular stream of instructions is fetched, like when a new program is loaded, the instructions move from main memory into the processor's L1 cache. The front end then fetches instructions in batches from the L1 cache and decodes them into the processor's internal instruction format.

Once the instructions are decoded, they're ready to be dispatched to the chip's execution hardware, where the actual number-crunching happens. The execution units carry out the arithmetic and memory operations specified by the instructions, and write the results to the processor's registers.

In an out-of-order processor, where instructions are reordered to execute in the fastest possible sequence, there's an additional step after execution. The instructions must be put back into program order and retired, so that their results become part of the program's visible state in the original sequence.
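A rough sketch of that reordering step (the instructions and completion cycles are invented): results may be computed out of order, but retirement only happens from the head of the buffer, in program order.

```python
# Invented reorder-buffer entries: (program-order index, instruction,
# cycle on which it finished executing). Note that #2 finishes before #1.
finished = [
    (0, "load r1",    3),
    (1, "add r2, r1", 5),
    (2, "mul r3, r4", 2),
    (3, "store r2",   6),
]

def retire_in_order(entries):
    """Only retire from the head of the buffer: an instruction cannot
    retire until everything ahead of it in program order has retired."""
    done_at = {idx: cyc for idx, _, cyc in entries}
    clock = 0
    for idx, instr, _ in sorted(entries):      # walk in program order
        clock = max(clock, done_at[idx])       # wait for it to finish executing
        print(f"cycle {clock}: retire #{idx} {instr}")

retire_in_order(finished)
```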

When a new thread is loaded into the processor, the original thread's state is saved out to main memory and all of the original thread's instructions are removed from the pipeline. The new thread then begins at the fetch stage, and is decoded, dispatched, and retired as described above.
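A minimal sketch of that bookkeeping, with an invented register set rather than any real ISA: the outgoing thread's registers and program counter are saved out, its in-flight instructions are flushed, and the incoming thread's saved state is loaded before fetching resumes.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """Illustrative architectural state; a real context also includes
    FP/SIMD registers, flags, thread-local storage pointers, and so on."""
    pc: int = 0
    registers: dict = field(default_factory=dict)

class ToyCPU:
    def __init__(self):
        self.pc = 0
        self.registers = {}
        self.pipeline = []                      # in-flight instructions

    def context_switch(self, saved, old_tid, new_tid):
        # 1. Save the outgoing thread's state out "to memory" (a dict here).
        saved[old_tid] = ThreadContext(self.pc, dict(self.registers))
        # 2. Remove the outgoing thread's instructions from the pipeline.
        self.pipeline.clear()
        # 3. Load the incoming thread's saved state; fetch resumes at its PC.
        ctx = saved.get(new_tid, ThreadContext())
        self.pc, self.registers = ctx.pc, dict(ctx.registers)

cpu, saved = ToyCPU(), {}
cpu.pc, cpu.registers = 0x400, {"r1": 42}
cpu.context_switch(saved, old_tid="A", new_tid="B")
print(saved["A"], cpu.pc)   # A's state is preserved; the CPU now runs B from pc 0
```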

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

Jon, would you be open to a limited reprinting of your book, on an on-demand basis? I read half of it before I had to give my copy back to the library and then I discovered it wasn't in stock anywhere any more.

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

It is still the case, for three reasons:

1) Not all computing problems can easily be multithreaded. Some problems are strictly linear (step n+1 requires data from step n; see the sketch at the end of this post), and some are difficult to balance and optimize between threads.

2) Writing multithreaded code is still much, much harder than writing single-threaded code. Design is difficult, coding is tricky, and debugging is a nightmare.

3) Most programs are just "fast enough" and don't need any serious multithreading, notably programs that are mostly a GUI with no heavy processing.

Multithreading is making progress, however. Some APIs, frameworks, and middleware make better use of it. It can also be achieved through multiple processes (à la Chrome). And some programming techniques now make it easier to implement (Java 5 concurrency, Apple's GCD).

But multithreading is, and remains, a very hard theoretical problem, and the average programmer will never be able to use it fully. Thus, the expected 32-core processors will only be useful and quick for a handful of problems (mostly graphics-related) and will be a total waste for day-to-day computing.
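A minimal illustration of point 1 above (Python used only for illustration; both loops are invented): the first loop is trivially parallel because its iterations are independent, while the second can't be split across threads because step n+1 needs step n's result.

```python
# Embarrassingly parallel: each output depends only on its own input,
# so the iterations could be farmed out to different cores in any order.
def brighten(pixels, amount):
    return [min(p + amount, 255) for p in pixels]

# Strictly linear: each iteration needs the value produced by the one
# before it, so extra cores can't shorten the dependency chain.
def compound(balance, rate, years):
    for _ in range(years):
        balance = balance * (1 + rate)   # depends on the previous step
    return balance

print(brighten([10, 200, 130], 40))          # [50, 240, 170]
print(round(compound(100.0, 0.05, 10), 2))   # 162.89
```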

What? No nerd outrage over inaccuracies? I must be early. Back in a few...

Um, I'm not feeling any outrage exactly, but the article does seem a little wooly on a few points; in particular, it was kinda unclear whether the article was talking about CPU 'threads' or OS threads. I'd have thought it was entirely about CPU threads, but it says that thread-state is stored in main RAM, and makes no mention of the CPU maintaining an execution-context (register-file etc) per thread, which is required to make the whole thing work.

I'm sure the article's 100% correct (this being Jon Stokes), just not quite as clearly expressed as it perhaps could be.

These two techniques (SMT and cores) aren't mutually exclusive, so you might change this line:

Quote:

Simultaneous multithreading (SMT) is a trick that lets the processor work more than one thread at a time.

To be:

Quote:

Simultaneous multithreading (SMT) is a trick that allows a processing core to work on more than one thread at a time.

Also, it might be worth noting the advantage of SMT. Switching between two threads but executing one at a time isn't helpful. It's when one of the threads has a long-latency cache miss and the other can keep executing that it's helpful. Similarly, it might be worth noting that SMT doesn't require saving state to main memory to switch between the threads, so it can do it much faster.
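Here's a toy cycle-by-cycle sketch of that advantage (the instruction traces and the 8-cycle miss penalty are invented): when thread A stalls on its cache miss, thread B's instructions fill cycles that would otherwise be wasted, and no state has to be saved to memory to switch between them.

```python
# Invented traces: (instruction, extra stall cycles it causes).
thread_a = [("load r1 (cache miss)", 8), ("add r2, r1", 0), ("store r2", 0)]
thread_b = [("mul r3, r4", 0), ("sub r5, r3", 0), ("add r6, r5", 0)]

def run_smt(a, b):
    """Issue one instruction per cycle; a stalled thread just waits while
    the other thread's ready instructions keep the core busy."""
    queues = {"A": list(a), "B": list(b)}
    ready_at = {"A": 0, "B": 0}
    clock = 0
    while any(queues.values()):
        for tid in ("A", "B"):
            if queues[tid] and ready_at[tid] <= clock:
                instr, stall = queues[tid].pop(0)
                ready_at[tid] = clock + 1 + stall   # busy until the stall resolves
                print(f"cycle {clock:2}: {tid} issues {instr}")
                break
        else:
            print(f"cycle {clock:2}: core idle (all threads stalled)")
        clock += 1

run_smt(thread_a, thread_b)   # B fills cycles 1-3 while A waits on memory
```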

This was really informative. Granted, I don't understand it all.... I'm mostly a software guy, and I'm a little bit of a jerk sometimes so I just assumed that when people were talking about multi-threaded processors, they were clueless as to what they were actually talking about. Oh well, wouldn't be the first time I was proven wrong.

Now, my thread of execution is telling me it's time to get off my bum and go to class.

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

I suspect that outside of games and media work, most users will mostly notice a MT benefit from the "security" programs not slowing down the computer as much as they used to. That is, one can have more programs doing their thing at the same time without the UI going rigor mortis.

Thus, the expected 32-core processors will only be useful and quick for a handful of problems (mostly graphics-related) and will be a total waste for day-to-day computing.

If we are talking about single-user systems, then yes, it's tricky to speed up performance using multiple threads.

Servers, on the other hand, are multiuser machines - each user can be served using his own thread (or even process) - so the more execution engines, the merrier.

Functional programming offers a way to easily program parallel tasks. The only requirement on a called method is that it doesn't change anything outside its scope (e.g., the file system). Called with the same parameters, the functional method should always return the same result. If the runtime engine figures out (via profiling) that it pays off to split a loop into X parts across X threads, then it can do so.
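A minimal sketch of that idea, assuming the function really is pure (the names and data are invented, and Python's standard process pool stands in for the hypothetical profiling runtime): because each call depends only on its arguments, the loop can be split across workers without changing the result.

```python
from concurrent.futures import ProcessPoolExecutor

def score(record):
    """Pure: no I/O, no shared state; the same input always yields the same output."""
    return sum(x * x for x in record) ** 0.5

records = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Splitting the loop across processes can't change the answer,
        # only (potentially) the wall-clock time.
        print(list(pool.map(score, records)))
```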

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

I suspect that outside of games and media work, most users will mostly notice a MT benefit from the "security" programs not slowing down the computer as much as they used to. That is, one can have more programs doing their thing at the same time without the UI going rigor mortis.

No. That's likely to be achieved through old school big iron multi-tasking involving multiple processes running on separate CPUs. That's SMP, not MT.

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

I suspect that outside of games and media work, most users will mostly notice a MT benefit from the "security" programs not slowing down the computer as much as they used to. That is, one can have more programs doing their thing at the same time without the UI going rigor mortis.

No. That's likely to be achieved through old school big iron multi-tasking involving multiple processes running on separate CPUs. That's SMP, not MT.

Except that at least on Intel hardware those two are basically interchangeable. Hell, any CPU usage graph shows each core on a recent Intel product as two logical cores, thanks to its ability to run two threads per core.

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

It is still the case, for three reasons:

1) Not all computing problems can easily be multithreaded. Some problems are strictly linear (step n+1 requires data from step n), and some are difficult to balance and optimize between threads.

2) Writing multithreaded code is still much, much harder than writing single-threaded code. Design is difficult, coding is tricky, and debugging is a nightmare.

3) Most programs are just "fast enough" and don't need any serious multithreading, notably programs that are mostly a GUI with no heavy processing.

Multithreading is making progress, however. Some APIs, frameworks, and middleware make better use of it. It can also be achieved through multiple processes (à la Chrome). And some programming techniques now make it easier to implement (Java 5 concurrency, Apple's GCD).

But multithreading is, and remains, a very hard theoretical problem, and the average programmer will never be able to use it fully. Thus, the expected 32-core processors will only be useful and quick for a handful of problems (mostly graphics-related) and will be a total waste for day-to-day computing.

By the way, the following is a bit dumbed down for simplicity and may not be entirely accurate because of it, but it builds to a point, so I'll get on with it.

If we take your very good description a step further: from the CPU's perspective, multithreading is pretty straightforward. All the hard stuff is figuring out what can and cannot be executed "in parallel" without causing issues (and finding the bugs when there are issues). However, that is only the beginning, as it assumes a single program is in control of what and how to multithread.

In reality, the OS itself has multiple threads, individual applications have multiple threads, and an OS is typically running dozens of individual applications at a time (in fact, the OS itself is often treated as a collection of applications). You have user foreground apps, background apps, OS functions, drivers, and more, all vying for resources. A CPU core's resources are far from infinite (two hardware threads per core on most modern CPUs), so the OS has its own threading services to decide who can send what to a CPU and when, and to unload and save the state of active threads as part of "preemptive" multitasking. This is critical in a multitasking world: deciding which CPU gets a thread from which app at any given time, and managing something called "thread contention" when too many apps request too many threads at once.

Why I bring this up is that it pertains most directly to Android right now. Apple implemented GCD in iOS back in iOS 4, and now that there are dual-core ARM CPUs for iOS (the A5), programmers can use simple GCD calls to multithread portions of their code and easily balance larger loads across two cores. GCD in iOS is very similar to GCD in Mac OS X, so devs will adapt easily, and multithreading a Mac app is a LOT easier than Windows or Linux because of GCD. In Android (3.0 specifically), however, the kernel itself lacks true understanding of both cores. It can see them, and code can be launched to either available CPU thread, but it is as yet not managing those threads, and the queue manager launches ALL applications to core 1 regardless of its load. It manages threads on core 1 just fine, but it does not manage core 2 at all. The only programs that can access core 2 at this time are those using direct kernel calls to it, which is not only more difficult to do than is common, but there are also no developer feedback mechanisms from the CPU or queue manager to balance load. Someone else's app could just as easily interfere with their thread, or, because all apps sit on CPU 1 regardless of load, could limit available resources to the app while CPU 2 lies barren of execution, causing unexpected program behavior as a result. Not until 3.2 will Android make both the queue manager (in 3.1) and the OS UI aware of the second core. This means non-multithreaded code has to fight over the limited resources of a 1.0GHz single core, which carries the additional overhead of being a dual-core chip.

For the short term, this not only complicates development greatly for those working on multithreaded code (no OS APIs to help make it easier) and corrals all apps onto a single core slower than the one they were used to in Snapdragon chips, but as time goes forward and Google augments the intelligence behind its multithreading, it will very likely BREAK apps coded for multithreading on 3.0. Since we have no data from Google on how this is being implemented, devs hesitate (both due to the cost/complexity and the risk of more changes later). With more apps and services allowed to run in the background, and with a bloated OS UI unable to use the underutilized cycles on core 2, Android users may see their existing apps run even slower on new dual-core hardware. Google should not have released support for dual-core chips until the OS itself supported them natively. They have done a great disservice to their development community, and by extension their users.

I suspect that outside of games and media work, most users will mostly notice a MT benefit from the "security" programs not slowing down the computer as much as they used to. That is, one can have more programs doing their thing at the same time without the UI going rigor mortis.

No. That's likely to be achieved through old school big iron multi-tasking involving multiple processes running on separate CPUs. That's SMP, not MT.

I don't agree with either of the two comments above.

The difference: SMP is good for splitting heavy workloads; SMT ("HyperThreading") is good for running non-cache-optimized workloads.

The idea behind SMT is that waiting for a cache line to be fetched is a waste of good CPU resources. So the ALU units get shared by more execution engines; this is good for HTTP or SQL servers (long cache waits, lots of random memory access). Read more about Sun's Niagara.

On single-core systems, SMT added responsiveness to the UI; see the Pentium 4 with HyperThreading. But in the age of dual-core-plus systems and smart schedulers (Windows 7), virtual processors can only help.

> I suspect that outside of games and media work

I don't believe that games and media encoders are not cache optimized. SMT doesn't work well for cache-optimized, hand-coded assembly.

Does each thread have its own instruction pointer? Does each CPU core have its own instruction pointer?

Yes and Yes.

The trick is that the context switch causes the CPU (or more commonly the OS) to store away the contents of the IP and bring in the value for the thread being switched in.

Precisely where they store this (and other thread-specific) data is really where SMT starts to kick in. The idea is that if you can store it someplace close (as opposed to main memory), you can do something with the extra resources available on a superscalar CPU core while the other thread isn't using them (either because it didn't need, say, the barrel shifter at the point when the other thread did, or because it couldn't do anything at all while it waited for data from main memory or higher cache levels).

Simultaneous multi-threading is not distinct from "multi-core" multi-threading at all, simply because each core is executing a CPU thread "simultaneously". In fact, although SMT may be standard nomenclature for multiple threads executing in an interleaved (read: pipelined) fashion on a core, it is anything but simultaneous in the semantics of the word.

Furthermore, cores with multiple ALUs (multiple addition/multiplication blocks) can direct threads to utilize those blocks simultaneously. Two threads passing an addition instruction to the CPU can both be executed at the same time if two addition blocks are available and the instructions don't incur a read penalty (i.e., both additions operate on CPU registers).

Neat article. Back when all this multithreading began, I heard that software developers were not taking advantage of the additional cores/multithreading. Is that still the case, or do most programs now use however many cores you have to speed up processing?

I was disappointed this wasn't addressed in the Ars article. Not because I don't know the answer (I'm a software guy), but because so many people ask this question.

and multithreading a Mac app is a LOT easier than Windows or Linux because of GCD.

No. Multithreading in Windows is trivial these days. Not only do you have ThreadPools (available since .NET 1.0, and basically what GCD is, only GCD is a little smarter in that it watches the entire system for load), you have the entire parallel API in .NET 4. Also, actual lambdas in .NET make kicking off work items *really* easy (blocks are a poor man's lambda).

Does each thread have its own instruction pointer? Does each CPU core have its own instruction pointer?

Yes and yes. In many modern CPU designs, the "instruction pointer", or program counter in common lingo, is actually a value that travels down the pipeline along with the instruction. This makes it easy to do rollbacks and flushes as well as design for multiple threads.

Each CPU core is completely independent in a multi-core configuration (with the exception of AMD's Bulldozer and Bobcat designs).

I don't believe that games and media encoders are not cache optimized. SMT doesn't work well for cache-optimized, hand-coded assembly.

Most codec work saw an increase in performance from SMT in just about every x86 variant that has offered it. Keep in mind that even if you are able to saturate the front end with its needed data and instructions, most processors have a much wider back end than front end. Couple that with the fact that you have access to twice as many registers (since each thread has its own), and you can see how two independent sets of instructions can be executed faster even on a single core.

Every CPU core has at least a single "thread" (instruction stream) that it executes. OSes and some other stuff can 'fake out' the CPU by time-slicing that single instruction stream to make it look like the core is running numerous programs "at the same time." Think of the thread context as a set of data registers, an instruction pointer, various status registers, etc. An HT/SMT core has multiple contexts, so a two-way SMT core has two contexts. All of that sits above the "core" itself (all the instruction execution resources: ALU, FPU, etc.).

Instructions from those contexts are interleaved. Basically, from each stream, a collection of all the instructions that can be scheduled this clock cycle is available to the scheduler (ignoring which context the instructions come from... the collection is filled with all the 'ready' instructions from all contexts). The scheduler tries to pick a "good fit" from that collection to schedule this clock cycle and dispatches them. The next clock cycle, the above logic has refilled the collection of instructions from all the contexts that can be scheduled this (new) clock cycle and the process starts over.
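A rough sketch of that per-cycle loop (the issue width, the "two ready per context" rule, and the instruction names are all made up): the scheduler pools ready instructions from every context, dispatches as many as it has slots for without caring which thread they came from, and repeats the next cycle.

```python
def schedule(contexts, issue_width=4, cycles=3):
    """contexts: {thread_id: [instructions...]}. Each instruction stays
    tagged with its thread so retirement can sort it back into the
    right stream later."""
    for clock in range(cycles):
        # Pretend the first two instructions of each context are "ready".
        ready = [(tid, ins) for tid, q in contexts.items() for ins in q[:2]]
        picked = ready[:issue_width]   # a real scheduler picks a "good fit"
        for tid, ins in picked:
            contexts[tid].remove(ins)
        print(f"cycle {clock}: dispatch {picked}")

schedule({
    "T0": ["add", "load", "mul", "store"],
    "T1": ["sub", "cmp", "branch"],
})
```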

If you think of each context (each SMT thread) sitting above the core, think of a big vat of water (the available instruction collection mentioned above) that's being filled from multiple water hoses (with some way of tagging which hose each water molecule came from as it falls into the vat). Each water hose is an instruction stream, or a "thread" (HT/SMT thread). At the bottom of this big vat is a single outflow valve where water pours through a water wheel (the core's execution resources, the core itself).

The goal is to keep that water wheel spinning at maximum RPM all the time. The instruction streams (water hoses) aren't perfect; they cough and sputter from time to time with no water coming out. With one instruction stream (one water hose), these "bubbles" in the stream (they really are called bubbles in the pipeline) can keep the water wheel from spinning efficiently: when a bubble comes through and the vat empties, no water goes down the outflow pipe, there's no force on the water wheel, and the wheel slows down until water can come down the pipe again. When you have many water hoses filling the vat (all sputtering and coughing at different times), you decrease the chance that the vat ever runs out of water to send down the pipe and turn the wheel.

The OS can then multiplex instruction streams across contexts. Whether the CPU has one context or multiple, it's kind of all the same to the OS. Smart OSes know which contexts share the same core and 'spread the love' across the cores first for compute-intensive tasks, then go back and fill the other contexts on each core.

The main difference between OS-level context switching and what goes on in HT/SMT is that OSes generally switch contexts at millisecond frequency (an eternity when clock cycles are in the nanosecond range or less). This is great for masking latency from hard drives, networks, and the like: things in the many-thousands-of-clock-cycles range (it takes a bit of time to do an OS-style context switch, on the order of hundreds of clock cycles). HT/SMT can deal with latencies at the level of the L1/L2 cache (such as on a cache miss): nanosecond scope, one-or-two-clock-cycle time frames. So the OS can help hide I/O latency to devices by context switching "its way," and HT/SMT can hide latency at granularities that are far too fine for an OS to deal with efficiently (or at all, really).

As an old OS/2 fan, I've been quite familiar with the concept for a long time, and greatly disappointed that programmers still don't program for it more. Now it's a bit clearer why that is, even in this era of more cores than needed. My Mac Pro has eight cores and a ton of RAM. It takes a whale of a lot of stuff to slow it down, but it can and does happen when a poorly-written app elbows its way through the computer's "tubes."

and multithreading a Mac app is a LOT easier than Windows or Linux because of GCD.

No. Multithreading in Windows is trivial these days. Not only do you have ThreadPools (which are basically what GCD is, only GCD is a little smarter in that it watches the entire system for load), you have the entire parallel API in .NET 4. Also, actual lambdas in .NET make kicking off work items *really* easy (blocks are a poor man's lambda).

Agreed, and don't forget UMS in Win7. Contrary to Apple's PR, GCD isn't that advanced, really (basically little more than a work-item library). In fact, it doesn't really monitor the system load, only the GCD load (i.e., it's an OS-global work-item manager, which has both pros and cons). The problem is that all these APIs miss the point entirely; after all, you can roll your own without too much work with I/O completion ports in Windows (how thread pools are implemented under the hood), or something not too far off with condition variables (Unix, Linux, Mac OS, Vista/Win7).

No, the real problem is designing the algorithm to scale. While these kinds of APIs are good at training programmers to think of their code in terms of discrete atomic tasks, they don't help at all in locating the tasks that are (a) independent and (b) able to scale efficiently to N CPUs. In fact, by abstracting away the subtle problems involved, they can actually make things worse. Synchronisation of data between CPUs and their caches isn't free (it's very expensive), so any algorithm that needs frequent state updates between threads will scale poorly, as will anything that buffers up too much data in batches, since the subsequent serialised buffer copies can dominate run time. So it is hardly surprising to find many examples of multithreaded code that runs SLOWER on a multi-core CPU than on a single core! The sad fact is that most of the threads you can see in your everyday applications are there as abstractions to make interactive, latency-bound applications easier to write, and they add absolutely nothing in terms of performance. There are just no easy answers, and often no real answer at all, to both multithreading and GPGPU (which comes with an even more extensive list of gotchas).
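A small demonstration of the "frequent state updates" point (Python processes are used here to sidestep the GIL, and the workload is invented): when every iteration takes a shared lock to bump a shared counter, synchronisation dominates the run time; giving each worker private state and merging once at the end is dramatically faster.

```python
from multiprocessing import Pool, Value
import time

N = 100_000            # iterations per worker (arbitrary)
shared = None

def init(counter):
    global shared
    shared = counter

def chatty(_):
    # Every iteration synchronises with the other workers.
    for _ in range(N):
        with shared.get_lock():
            shared.value += 1

def quiet(_):
    # Private accumulation; one merge at the end.
    total = 0
    for _ in range(N):
        total += 1
    return total

if __name__ == "__main__":
    counter = Value("i", 0)
    for name, fn in [("frequent shared updates", chatty), ("private, merge at end", quiet)]:
        start = time.perf_counter()
        with Pool(4, initializer=init, initargs=(counter,)) as pool:
            pool.map(fn, range(4))
        print(f"{name}: {time.perf_counter() - start:.2f}s")
```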

The difference: SMP is good for splitting heavy workloads; SMT ("HyperThreading") is good for running non-cache-optimized workloads.

The idea behind SMT is that waiting for a cache line to be fetched is a waste of good CPU resources. So the ALU units get shared by more execution engines; this is good for HTTP or SQL servers (long cache waits, lots of random memory access). Read more about Sun's Niagara.

On single-core systems, SMT added responsiveness to the UI; see the Pentium 4 with HyperThreading. But in the age of dual-core-plus systems and smart schedulers (Windows 7), virtual processors can only help.

> I suspect that outside of games and media work

I don't believe that games and media encoders are not cache optimized. SMT doesn't work well for cache-optimized, hand-coded assembly.

Not true. While your description fits well with Sun's Niagara design, it's a little off base for Intel CPUs. Since Intel CPUs decode instructions from each thread on alternate clock cycles, many other situations can lead to performance gains. The key benefits are the effective reduction of the pipeline depth for each thread and the introduction of independent latency chains. In optimal cases, Core i7-class hardware can decode and retire 3-4 instructions per clock, but average IPC is still less than 2, and very long dependency chains or poorly predicted branchy code often have an average IPC of less than 1. These kinds of applications are fairly common (database servers, SAP, web servers, compilers, JITs, interactive applications, etc.), and since Intel's SMT is not tied into the cache-miss logic (unlike Niagara), it doesn't do nearly as well in the situations you mention, due to cache thrashing and the like. If, as you claim, most stuff is cache-line optimised, it would suggest Niagara is pointless, and clearly it's not; but cache optimisation is an easier problem than the issues Intel's SMT tries to address (which also hides memory latency somewhat).

My question is how AMD's Bulldozer design will compare to Intel's HT. It should be interesting.

My guess is that HT is more efficient with resources, as HT only adds about 10% more transistors, while BD's dual cores with shared SIMD could have drastically higher threaded throughput.

The one thing HT does help with is single-thread performance. Win7 recognizes HT threads and will schedule onto only one thread per core until it has to wake up an HT thread. The benefit is that when only one thread is active and the other is put into deep sleep, that one thread gets *full* access to all the resources of the core. Normally the resources are split 50/50, but with one asleep, the other gets a relatively large L1 cache/execution window and lots of extra execution units. The out-of-order (OOO) scheduler can really make use of the extra resources and help speed up single-thread performance.

AMD's Bulldozer design looks to have very strong threaded peak output, as there is little contention between the cores. Personally, I think single-threaded performance is almost useless: many new games will make use of threads, and most applications that are single-threaded don't have a performance issue anyway.

But only time will tell and Intel does have a HUGE fab lead on the industry as a whole.

At least by the time Ivy Bridge and Bulldozer are out, Battlefield 3 and its deferred shaders will be out as well. It will make a great testbed for threaded game performance.

I will admit to not reading the article before making this post (it almost feels like a duty here, lol), but isn't this story under most or all of our heads?

Or am I overestimating Ars's current userbase?

You're overestimating about half of Ars' user base, many of whom are degreed IT/IS professionals, or are in enough of an academic/science field that works with computers to grasp the concepts being discussed. This article is over the heads of another quarter, but they're geek-savvy enough to take an interest, and it could kick off a day's worth of Googling and info-sponging. The other quarter... probably skipped this once real technical terms were thrown around and they realized it wasn't just a Windows/Apple/Linux thread.

I will admit to not reading the article before making this post (it almost feels like a duty here, lol), but isn't this story under most or all of our heads?

Or am I overestimating Ars's current userbase?

Sometimes it is helpful to get a simplified explanation that we can give to others who are not at the same level of understanding as the rest of us. Then there are those whose experience is in different areas of computing and who have never really had to think about the hows of multiprocessing and multithreading.

One of Intel's HT performance guides specifically said that one of the benefits of HT is to make use of otherwise wasted time caused by cache misses.

Is this one of the older P4-style HT issues?

Yes, it will hide memory latency, since when one thread stalls the other gets 100% of the execution core to itself. But Niagara only switches thread context on a cache miss (i.e., the decoder is linked to the cache fetch unit in some way). Intel's SMT has no such link; it merely reacts to the stalled load instruction as it would to any other long-latency operation. Personally, I prefer Intel's approach, as it solves a larger array of problems, many of which are far harder to deal with than memory latency. It can also benefit edge cases such as an FPU-bound thread running alongside an ALU-bound thread. The only problem is that the level 1 cache can be too small for two cache-heavy threads to run without severely impacting each other.

Bengie25 wrote:

AMD's Bulldozer design looks to have very strong threaded peak output, as there is little contention between the cores.

Yes, for ALU (integer) work. The shared FPU, though, can only issue 2 integer SSE and 2 floating-point ops per clock per pair of cores, so for multithreaded pure-FPU or integer-SSE stuff the throughput will average out to 1 op/clock/core! The single-thread performance looks to be pretty anaemic as well, unless they make good on their promise and really ramp clock speeds.