Future Microarchitectures

Haven't been around this forum for a while; I really miss discussing the nuances of computer architecture in the good old days when K8 and CMA reigned supreme. I just got done with my busy spell in late summer and decided to sit down and take a little time to draw up a rudimentary schematic of a prototype microarchitectural design that might lead to some good conversations in this area. Although it would be most beneficial and interesting to those who actively study and do research in the area (like myself) or make a living in this field, I'm sure many self-trained computer architecture enthusiasts have plenty to contribute to a meaningful discussion as well.

This is a conceptual design that incorporates a number of likely trends in commercial microarchitectures over the next several years. It is roughly based on a microarchitecture that might come to market in the near to medium term (depending on what you consider "near"), but I have altered, omitted, and replaced a number of architectural elements and mechanisms so that nothing useful about the original design or its specifications can be deciphered, while still ending up with a functional conceptual design. The really important things are the useful concepts and design trends anyway.

Or here's a link to a larger version:

This design is emblematic of some classes of new features that might be heavily incorporated for increased ILP and lower power dissipation per unit of instruction throughput: design concepts that take into account the RF latency vs. size vs. port-count trade-off, long-latency instructions and their dependents, dependence steering, increased use of clustered FUs, companion techniques to clustering such as banked way prediction and independent LSQs, and a number of other trends.

I hope this may lead to some useful discussion about future microarchitectures. The K8L/K10 threads eventually devolved into a shouting match and mud-slinging, but I hope we can have a substantive and cordial discussion here about the technical aspects of future designs, and learn some useful information from each other in the meantime.

Hey, did you work for Cyrix at all or is Intel reintroducing Netburst?

Hat,

I'm still in grad school; I definitely haven't been around long enough to have worked at Cyrix.

Good observations, although I made darned sure that things like backend issue width and specific FU counts were among the first things I changed, so they're not a reliable indicator of those aspects of the original. The number one priority was to erase any indication that might allow even a ballpark estimate of performance numbers for the original design.

Though I see why you would think that: the backend of the processor is not as impoverished as it appears; there are some techniques and tricks at work in the design represented by the schematic. There probably are still some mistakes in the graphic, since I didn't have time to double- and triple-check everything; feel free to point them out (I already see a couple of small ones).

This design could theoretically peak-issue (on very, very rare and fortuitous cycles) 11 64-bit microinstructions in a single cycle, including four ALU ops, four AGU ops, and five in the FP pipeline. Although each ALU cluster RF has only two write ports, it can actually issue and complete two address operations at the same time as two regular integer ops. Some AG ops are supplemented with certain bits, such as dependency on the immediately preceding instruction(s), and can bypass instruction control in the small cluster scheduler and be among the first instructions to issue in the slice, so that any long-latency misses are known as early as possible in slice execution.

The two 128-bit SIMD units in the FP pipeline are actually pairs of 64-bit sub-units that can be issue-locked to execute 128-bit SIMD with a single set of control lines; they can also execute 64-bit instructions independently with two sets of control inputs to the FU. Additional work needs to be done in decode and rename to achieve a certain temporal and spatial arrangement of the two instructions in the FP scheduler, and to constrain the reg-file configuration so that the four inputs can be read on the appropriate RF ports. Ideally, the FP FUs can execute four 64-bit ops plus one 64-bit MV/ST, or 2+1 128-bit instructions.
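
To make the sub-unit pairing concrete, here is a minimal Python sketch; the function and its return values are purely illustrative assumptions of mine, not anything from the actual control logic:

```python
# Hypothetical sketch: a pair of 64-bit FP sub-units can be issue-locked for a
# 128-bit SIMD op (one set of control lines), or driven independently for
# 64-bit ops (each half gets its own control inputs).

def allocate_fp_subunits(op_width_bits, subunits_free):
    """Return (subunits_used, control_sets_used) or None if the op must wait."""
    if op_width_bits == 128 and subunits_free >= 2:
        return (2, 1)   # both 64-bit halves locked together under one set of controls
    if op_width_bits == 64 and subunits_free >= 1:
        return (1, 1)   # one half runs independently with its own control inputs
    return None

print(allocate_fp_subunits(128, 2))  # (2, 1)
print(allocate_fp_subunits(64, 1))   # (1, 1)
```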

It's hard to see all of the details with the small graphs, terse descriptions, and profuse use of acronyms unless one has been working in this field for a while, but given a little time they can be worked out. Feel free to ask questions or offer criticisms; I'll do the best I can to answer (within ethical limits).

This concept is suspiciously similar to one that is widely discussed on different boards: AMD's future Bulldozer architecture, to be exact. Here is a diagram (based on patents from AMD; read more on this blog) of it:

The similarities are easy to spot: two integer clusters, the FPU arrangement, the way the front end is laid out, the LSU and cache arrangement, etc. You said you made alterations to distinguish it from another microarchitecture, which would account for the differences.

Now the question is: do you have access to in-house AMD information, and if so, how did you manage to publish this without breaking an NDA? Maybe you put it together through patent-based research like Dresdenboy did, but the amount of differences, and of information that according to Dresdenboy is not from patents, is too large for your concept to be derived from the same sources.

I would be extremely interested in hearing your explanation.

Greetings, Triskaine

Originally posted by morfinx:

I'd be interested in knowing as well.

Thank you guys again for posting here as well as at Anand, but I'll just say pretty much the same thing I did in the corresponding thread at Anandtech:

I admire your enthusiasm for studying the microarchitecture of future x86 designs from different design houses. But that is not my intent for this thread at all; the intent is rather to explore general trends in future general-purpose MPUs, possible techniques that may be used to raise ILP utilization and lower power per unit of throughput, and techniques relevant to future microarchitectural trends such as TLS (thread-level speculation).

In response to your question about the upcoming BD microarchitecture: I can't really comment on whether this particular design has anything to do with any microarchitecture of any unreleased and undisclosed design from a specific vendor. This design may be based on something, but it has been reworked thoroughly to be general in nature, to capture a number of general trends that will surface in the next few years, and to show the types of architectural elements that have a good chance of being put to use in the near- to medium-term future.

As for any information contained in the design that has to do with a future design from a specific vendor: it has either already been (1a) removed, (1b) substantially altered so that no vendor-specific information remains in the design, or (1c) replaced by design elements with similar function but using different techniques; or, alternatively, the information is both (2a) readily available in published technical literature from an industry or academic source AND (2b) already known in full detail to all competing vendors in the industry (so if company X is using this technique, competing companies Y and Z are already fully aware of it).

I hope that I've made this very clear. Feel free to ask me about specific elements of this conceptual design and I will answer what I can. I look forward to a productive conversation about non-vendor-specific designs that will be coming in the not-too-distant future.

You may think you've made it clear, but those two posters made some interesting points. It's preferable that your design be totally *yours* because, if not, any suggestions will be ignored by the teams in the middle of finalizing a commercial product. And if the suggestions *are* noted, it may rankle a few people that their pro bono efforts have gone to a commercial entity.

If you want any actual discussion, you're going to need to break that diagram apart on the block level, giving a high-level picture of how the large blocks connect and then the details in the lower-level diagrams, with notes about any specific details you want discussed.

As it is, that picture is entirely too busy and I certainly can't figure out what the hell is actually going on. If you showed up to a meeting here with that monstrosity, I guarantee the only response you'd get is for that poster to be rolled up and used to beat you over the head.

Originally posted by kcisobderf:You may think you've made it clear, but those two posters made some interesting points. It's preferable that your design be totally *yours* because, if not, any suggestions will be ignored by the teams in the middle of finalizing a commercial product. And if the suggestions *are* noted, it may rankle a few people that their pro bono efforts have gone to a commercial entity.

Yeah, it was mostly for convenience; and any microarchitecture that encompasses a generalization of practical future designs is bound to look in some respects like something vendor-specific that I have seen, since there are really only a limited number of designs mature enough to have a detailed organization.

I didn't really consider the specific scenario that you described; I suppose it would probably be marginally better to completely distance any new design from existing ones that I have seen, although that would be like treading through a minefield.

It still seems unlikely, though; for the immediately succeeding generation of designs to hit roadmap targets, they would have to be in OEMs' hands in quantity within 16 months, and in end users' hands in 17-18. They had better be long done with any architectural simulation by now, deep into RTL verification, and perhaps tweaking a bit of low-level logic from the libraries; ideally a ground-up new design would already be making its first pass through the fab. I don't really see them adding significant new architectural features at this point; otherwise there would be considerable problems meeting the roadmap.

Originally posted by TurinTurambar:If you want any actual discussion, you're going to need to break that diagram apart on the block level, giving a high-level picture of how the large blocks connect and then the details in the lower-level diagrams, with notes about any specific details you want discussed.

As it is, that picture is entirely too busy and I certainly can't figure out what the hell is actually going on. If you showed up to a meeting here with that monstrosity, I guarantee the only response you'd get is for that poster to be rolled up and used to beat you over the head.

You are not the first to suggest that; another poster at Anand said something very similar to what you have (although in a completely different tone). It would take another order of magnitude more time for me to tease apart each of the major blocks in the diagram and describe the function and dataflow in detail; I'll have to see whether I have time for that later this semester, apart from my research and TA duties.

Although I have to say, from the tone of your post, I honestly can't tell whether you sincerely want to discuss further or are just jesting and throwing out some flamebait.

I believe you misunderstood his request. When looking at an initial design, the down-and-dirty details of the internals of each macro are not interesting; you need a 10,000 ft view that can be easily understood.

For example, instead of drawing 4 decode units interconnected with 4 sets of uop buffers and optimizations, you could draw a simple decode box titled "Decode (4x)" with a big arrow pointing to a second big box containing all the "uop magic." It would also simplify the diagram if you left out items such as "local/sys interface"; the arrow to the "remote microcode ROM" already provides the needed detail in that location.

Now, you would need to actually tell us what's different about the design instead of trying to show us, but telling us would be much more efficient. It would save you at least a few hours (it's much quicker to write a sentence than to draw several block diagrams) and we would be much less likely to lose details in the noise.

Another important item, before you simply ask "what do you think of my design," is to clearly define the target audience/market/product. Are you targeting x86 servers? Desktops? Laptops? Notebooks? Embedded? (It seems to be playing in the high-end desktop and/or server space, but you can't expect people to sit around counting wires on a flow chart to figure it out.)

In short, it needs an executive summary.

Edit: If you do take the time to do this, please make a point of detailing the items that you believe fall under "likely trends in future commercial microarchitectures." IMO, that has the potential to be the most interesting part of this discussion.

This diagram looks suspiciously similar to the Bulldozer microarchitecture. I wonder if you're the source that Theo Valich mentioned in his Bulldozer article. IF you're an AMD employee, you are a disgrace to the company and should be prosecuted IMMEDIATELY for corporate espionage.

That's all I have to say in this thread.

No, I don't work for AMD in any way, shape, or form. I don't know how many times I have to say the same thing over and over again; I said it as the very first thing in this thread. Which part of "nothing useful in terms of the original design or specifications could be deciphered" do you have trouble with?

Sure, I have former colleagues and classmates now in the industry with whom I discuss technical aspects of the microarchitectures of some specific designs from time to time, and I have learned vendor-specific information about certain designs over the years. But nothing, I mean nothing at all, of that kind of information is in the design presented in this thread. I have gone to great lengths to make sure that nothing specific that I might know gets in here at all, and that's the first and last thing I check when making the schematic.

This thread discusses general trends and the types of features and techniques that are likely to be included in future designs; that and nothing more. I didn't mention anything about AMD, Bulldozer, or any related topic, except in response to your friends (or perhaps someone who made the same assumptions as you), when I said that I don't comment on such things.

How you get from a general discussion of the technical aspects of future microarchitectures, to AMD Bulldozer, to me being the source for some Inq reporter, to me working for AMD, to involvement in some grand industrial-espionage case, is just beyond me.

I think part of the trouble has been caused by my mentioning this discussion on my blog. While I try to use any information in an academic way for analysis, discussion, and derived predictions, others may be a bit too eager and look at such a publication with the eye of a critic or lawyer.

However, I find Hard Ball's architecture rather inspiring. There are several recently researched elements included, like all the thread-level speculation stuff (e.g. those parts mentioning epochs). Slicing is also an interesting technique. Well, we will see what we get to see in the future.

I also think that those single-time posters are not from AMD, since AMD's John Fruehe also had a look at what is going on here and talked to engineers about the arch. As it seems, there is no problem with what has been published, and there is also no NDA. So since AMD people don't have a problem with what Hard Ball is doing, there is no reason for anyone from AMD to post in such a way.

Instead, I would just be interested to know whether there was some inspiration from my arch. My blog had a single visitor from the University of Illinois just two weeks before Hard Ball published her architecture.

You probably shouldn't care about the trolls and their FUD. I thought AMD had better things to do than employ someone to search around the internet and try to intimidate people, but obviously they follow in the RIAA's footsteps. Good for them; I hope they choke on their secrets.

Anyway, I'm also interested in a more compact version of the design; this one is rather confusing. But it seems kinda Netbursty to me, like Hat Monster said. However, Netburst had a lot less parallelism, IIRC.

What's the pipeline length? Also, what do you mean by "FP scoreboard"? Is it classic scoreboarding (the old kind), and shouldn't that be obsolete with speculation and renaming? I can't tell from the image whether it's using any speculation at all.

So, while we're at it (ISA discussion, that is): do you arstechnicars know anything about the new Cortex? They say it's "out of order", but do they mean dynamic scheduling or speculation? And is dynamic scheduling without speculation any good?

EDIT: Or if you don't want questions about ARM in a discussion about your architecture, I can move it elsewhere.

I know there are fan-out and fan-in issues, as well as issues with extracting ILP from program streams, but I've wondered about a really wide core with a bunch of contexts sitting on top of it... like how we have a quad-core part with HT on it now (8 contexts), having all those resources in one core with 8 contexts on top of it... so when some app *can* make use of that many resources, it might actually get them. Also, adding more contexts on top of it would be sorta easy... so you could have the same core with 8-way or 16-way logicals on top of it and the amount of silicon wouldn't be too much more.

Or... alternatively, having basically a 'shared' set of resources and some dedicated to some contexts... like having four cores worth of resources and having 1/2 of those in a shared pool, and the other half divided amongst (dedicated to) four other 'half-cores', for example compared to a quad core today. That way, there's a guaranteed 'thin' core for each context/context-set and then a large shared pool that any can borrow from when they need to get 'wide'. The big pool of shared resources can be powered down when not needed and such.

Originally posted by c2418:So, while we're at it (ISA discussion, that is): do you arstechnicars know anything about the new Cortex? They say it's "out of order", but do they mean dynamic scheduling or speculation? And is dynamic scheduling without speculation any good?

Speculative execution simply means that the CPU uses branch prediction: for a conditional jump, the CPU guesses what the condition will evaluate to before it has actually been evaluated.

Dynamic scheduling and out-of-order execution are, at this level, virtually synonymous. If I have the following instruction stream:
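
(The original listing appears to have been lost; what follows is a minimal Python rendering of the kind of stream being described, with made-up register contents:)

```python
t1, t2, t4, t6, t7 = 7, 5, 3, 9, 4   # arbitrary register contents

t0 = t1 + t2    # ADDU: produces t0
t3 = t0 // t4   # DIVU: consumes t0, so it cannot be scheduled until the ADDU completes
t5 = t6 - t7    # SUBU: independent of both, can issue in parallel with the ADDU
```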

The DIVU instruction cannot be executed until the ADDU has completed, so it cannot be scheduled yet. However, the SUBU instruction has no dependencies further up the chain and can be immediately scheduled in parallel with the ADDU. Dependency-related pipeline delays are very severe on deeply pipelined CPUs.

This out of order execution enables CPUs to extract instruction level parallelism.

Some designs, such as the VLIW-derived EPIC (IA64), are purely in-order and rely on the compiler to reschedule instructions so the CPU doesn't have to. The compiler has more resources and more knowledge of the program at its disposal, so it should be able to do a better job. The "should" has yet to evaluate to true in all cases.

Sorry, things have gotten busy once again and I couldn't post here very often. I'll try to address everything that I've neglected so far.

@Dresdenboy:

Much thanks indeed; I certainly would not blame anything on you, although there are some who are very quick to jump to conclusions while making multiple highly improbable logical leaps.

There are some aspects and directions of the design that are drawn from what I know to be very likely general trends in the industry going forward. I've had some connections with the industry over time and have learned various things over the last couple of years (since late 2007) about a number of upcoming designs, but only the general directions that I can infer from more specific information are here. I have been aware of these trends for quite some time; I guess that's what mostly "inspired" me, that and the fact that we are on the brink of seeing first silicon from these designs.

From a first look at your vision of BD, it seems that the steering mechanism works at instruction-level granularity. It would be interesting to hear what kind of steering mechanism/heuristic you would propose as appropriate, and what the division of labor would be between the "dispatch" block and the individual cluster schedulers, i.e. where the specific steps of dynamic scheduling would lie.

The design in this thread actually uses slice-based steering, which is coarser grained. It is predicated on forming slices around a few classes of key instructions that are likely to be on the critical path of the dynamic execution stream. The actual rescheduling within the instruction window (which is much larger than in a traditional OoO superscalar) is then done at two different levels of granularity, corresponding to slices and to individual instructions within slices; the idea is that these two levels of management correlate well with the two levels of latency in the potential sources of stalls (roughly the difference in latency between an L1 miss and an L2 miss). I will elaborate on the details as time goes on.

Anyway, keep up the good work; we can certainly use more people like you in IT journalism: very knowledgeable, very inquisitive, and yet very conscientious.

Take care

@c2418:

I really appreciate your sentiment. I never thought that these posters were actually from AMD or anything, and certainly not anyone who knows the details of the actual BD design (they would certainly know better than to try to equate it with my design here; and they are probably having a pretty good belly laugh if they see someone actually trying to equate the two).

I think you and Hat are onto something; it is Netburst-like in one single respect. Netburst, for better or worse, did implement a very primitive form of slicing mechanism, but one that relies on a fixed time expectation covering the typical L1-miss case. I had brief discussions on the forum last year, here: http://episteme.arstechnica.co...9259831#328009259831 and also here: http://episteme.arstechnica.co...1359831#366001359831 The original idea is not bad (IMO), but the way it was implemented, and the compromises that had to be made due to a number of other features that either competed for on-die resources or conflicted with it, doomed it to failure from the beginning.

The mechanism here, although similar to Netburst's in the broadest sense, is much more elaborate and is designed to deal with several types of latency expectations, and with numerous contingencies on top of that. I will get into much more detail when I have time for a serious write-up for this forum.

The FP scoreboard there simply tracks completed instructions' renamed registers and the consequent RAW hazards associated with those physical destination registers. Since, as you can see, the pipeline's scheduling mechanisms are split at a very early stage in the core, the FP pipe never sees the direct execution of branch instructions; it only receives cues about the legality of retiring certain sets of FP instructions in certain cycles, determined by what happens in the integer pipeline. This is done through a series of PC limits set for the FP pipeline by the integer ICU, shown in the schematic as the "ready horizon", which determines the time horizon within which FP instructions that affect consistent memory state (such as FSTORE) are permitted to retire from the F-ROB.
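
As a rough illustration of the ready-horizon idea, here is a Python sketch; the naming and structure are my own simplifications, not the actual retirement logic:

```python
from collections import deque

class FRob:
    """Toy FP reorder buffer gated by an integer-ICU-supplied PC limit."""
    def __init__(self):
        self.entries = deque()      # (pc, op, completed) in program order
        self.ready_horizon = -1     # last PC the integer pipeline has cleared

    def set_ready_horizon(self, pc):
        # Called by the integer ICU once branches up to `pc` have resolved.
        self.ready_horizon = max(self.ready_horizon, pc)

    def retire(self):
        retired = []
        while self.entries:
            pc, op, completed = self.entries[0]
            if not completed:
                break
            # Ops that touch architectural memory state must stay inside the horizon.
            if op == "FSTORE" and pc > self.ready_horizon:
                break
            retired.append(self.entries.popleft())
        return retired

rob = FRob()
rob.entries.append((100, "FADD", True))
rob.entries.append((104, "FSTORE", True))
print([e[1] for e in rob.retire()])   # ['FADD']; the FSTORE waits
rob.set_ready_horizon(104)
print([e[1] for e in rob.retire()])   # ['FSTORE']
```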

I think Hat gave you very good answers for the rest of your questions already.

Forgot to respond to this point. This is not really a traditional linear pipeline where instructions flow from decode to schedule to issue to WB and the cycle time of each stage affects the clock of the entire pipeline. There are multiple paths for each group of instructions to go through, and for certain parts of some paths, the stages do not affect the lower bound on cycle time of the "primary" part of the pipeline.

And sometimes the same group of instructions (a slice) is actually present simultaneously in two branches of the integer pipeline: one copy being optimistically executed while another copy is being speculatively rescheduled to re-enter the primary pipeline at a certain point if certain latency requirements on the primary execution of the slice are not met. So in some sense there are potentially hundreds of stages in the longest path through the pipeline, but on that particular path most of the stages are not cycle-critical, and the stage count is not very meaningful at all. I will offer more details later on; it will take a very long write-up to explain all of this properly.

By optimistic execution here, I mean execution that is predicated on a certain latency expectation, just as speculative execution is predicated on a certain expectation about the outcome of a particular branch's resolution.

@fitten:

I see what you are saying, and there have been proposals to that effect. In that case, though, there are going to be a lot of sacrifices in terms of cycle time, as well as difficulties in many areas of logic design. Just for example, the graph-partitioning algorithm run for the min-cut of the large logic blocks involved in floorplanning would clearly become much more difficult and take much longer to run.

What that in effect gives you is hardware-managed distribution of hardware contexts for maximum utilization of the available hardware resources. You would have to weigh that against the alternative of building that capability into the OS, allowing some asymmetry in the chip design (asymmetry in terms of execution resources, not ISA) among different pipelines, making the OS aware of the asymmetry, and letting it do the management and load balancing. This would give you the vast majority of the benefits of the HW solution while being much less costly in terms of the compromises you would otherwise have to make in the chip design.

@Hat Monster:

Could not have explained it better. And it's a perfect lead-in to my explanation of scoreboarding and speculation in the FP pipeline in my previous post.

I will try to come up with a detailed example soon, with some microassembly in non-destructive syntax to illustrate some of my earlier points, along with a more detailed write-up.

OK, let's provide some more concrete explanation of the schematic; this will have to be done over a number of installments, and I'll put one up whenever I have some spare time. Since I just answered some questions about potential similarities between this and Netburst, particularly in terms of the replay mechanism, perhaps we can start there.

The target of the Netburst replay mechanism is predictable (or so it was thought) L1 misses, which baked in an assumption about the predominant latency of that type of miss. This resulted in a rigid replay system that relied on a highly predictable latency for the average case (L1 miss but L2 hit) in what was assumed, at the time, to be the vast majority of workloads; that turned out to be one of the really flawed assumptions that ultimately sank the Netburst line. The mechanism here is theoretically similar, but with a nearly diametrically opposite set of assumptions and implementation choices. Here the long-latency mechanism is designed to deal primarily with long-latency L2 misses, and is implemented with a much more flexible mechanism that can handle the large variations in latency that may be experienced when accessing a shared cache/SRAM.

The design has two levels of dependency structure mapped onto the instruction stream: one at the instruction level, as in any Tomasulo-style speculative OoO architecture, and a coarser-grained level of dependency among slices. Two of the big topics, slice initiation (choosing heads of dependency trees within the register dependency graph of the instruction stream) and slice elongation/growth (separating the instruction stream into slices based on dependencies relative to the seed instructions), I will deal with at a later time. For now, it suffices to say that the slices are formed such that the large majority of register-to-register dependencies are captured within individual slices; in other words, the number of true dependencies that flow from an architectural live-out register of a logically earlier slice to an architectural live-in of a logically later slice is minimized. The long-latency reordering mechanism takes advantage of this coarse-grained organization of the instruction stream, which provides natural places for restarting the parts of the execution stream that must wait on some long-latency (primarily memory) operation.
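
Here is a toy Python sketch of the slice-elongation idea; the heuristic and data layout are my own simplifications, not the actual mechanism:

```python
# Starting from a seed instruction, pull later instructions into the same slice
# whenever they depend on a register already produced inside the slice, so most
# register-to-register dependencies stay slice-local.

def form_slice(instrs, seed_idx, max_len=8):
    """instrs: list of (dests, srcs) register-name tuples in program order."""
    slice_idxs = [seed_idx]
    produced = set(instrs[seed_idx][0])        # regs written inside the slice
    for i in range(seed_idx + 1, len(instrs)):
        if len(slice_idxs) >= max_len:
            break
        dests, srcs = instrs[i]
        if produced & set(srcs):               # depends on something slice-local
            slice_idxs.append(i)
            produced |= set(dests)
    return slice_idxs

# Example: a load feeding an add feeding a store, with one unrelated sub in between.
stream = [
    (("r1",), ("r9",)),        # 0: load  r1 <- [r9]   (seed)
    (("r5",), ("r6", "r7")),   # 1: sub   r5 <- r6, r7 (independent)
    (("r2",), ("r1", "r3")),   # 2: add   r2 <- r1, r3 (depends on 0)
    ((), ("r2", "r9")),        # 3: store [r9] <- r2   (depends on 2)
]
print(form_slice(stream, 0))   # -> [0, 2, 3]; the sub would land in another slice
```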

The upper-right (top-level) block of the schematic is the main mechanism for dealing with long-latency operations (I will call this the LL-block as shorthand from this point on). In the steering mechanism of the pipeline, each cluster has two associated steering buffers; one of them is alternately filled by the steering mechanism with a subset of the instruction stream while the other is drained for execution. As instruction slices drain from one of the steering buffers into the integer clusters (one instruction at a time), each slice is simultaneously copied to one of the IQs in the LL mechanism (which we will call the reschedule IQ from this point on). There is one such reschedule IQ for each cluster, so in a sense, while a slice is being sent to the cluster scheduler for execution it is already being speculatively rescheduled for another round of execution in case the re-execution trigger event occurs. The trigger event is described as an L2 miss (it can also include any other event with LL potential, but I will focus on L2 misses here), and this event is guaranteed to be detected as early as possible during slice execution in each of the integer clusters (which handle all loads, including FP and SIMD loads), partially with the aid of fast-pathed loads, which are loads whose memory address does not depend on the computation of any preceding instruction in the slice. This fast-path mechanism is another extensive discussion for a later time.

If no L2 miss is detected during slice execution, the corresponding entries in the appropriate reschedule IQ are marked done and never enter the LL-block's long-term buffer. If an L2 miss is detected during slice execution, the corresponding entries in the reschedule IQ are set to enter the long-term buffer in the LL-block. When the instructions enter the long-term buffer, they are marked with the PC of the seed instruction of the logically preceding slice that corresponds to the architectural register of one of the current instruction's source operands; in other words, these are potentially values passed from previous epochs (an epoch is the logical time period corresponding to each record of the architectural register state), should the current instruction turn out to be an exposed read. The other source operand is necessarily dependent on the seed instruction of the current slice, which will become apparent once slice elongation is thoroughly explained. Each slice in the buffer is also marked with the LL operation that caused its residence in the buffer, and will be cleared to return to the main pipeline once the L2 miss returns the requested data.
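
A bare-bones Python sketch of that reschedule path, with structures and names that are mine rather than the real design's:

```python
# Each slice is mirrored into a reschedule IQ while it executes; if an L2 miss
# shows up, the mirrored copy moves into a long-term buffer tagged with the miss
# that parked it, and is released for re-execution only when that miss completes.

class RescheduleIQ:
    def __init__(self):
        self.pending = {}        # slice_id -> list of uops (mirrored copies)
        self.long_term = {}      # slice_id -> (uops, blocking_miss_id)

    def mirror(self, slice_id, uops):
        """Called as the slice drains from a steering buffer into the cluster."""
        self.pending[slice_id] = list(uops)

    def slice_done(self, slice_id, l2_miss_id=None):
        """Called at the end of slice execution; l2_miss_id is set if a miss was seen."""
        uops = self.pending.pop(slice_id)
        if l2_miss_id is None:
            return                                   # executed cleanly, drop the copy
        self.long_term[slice_id] = (uops, l2_miss_id)

    def miss_returned(self, miss_id):
        """Release every parked slice that was waiting on this miss."""
        ready = [sid for sid, (_, m) in self.long_term.items() if m == miss_id]
        return [(sid, self.long_term.pop(sid)[0]) for sid in ready]

iq = RescheduleIQ()
iq.mirror("slice7", ["ld r1,[r9]", "add r2,r1,r3"])
iq.slice_done("slice7", l2_miss_id=42)     # the load missed in L2
print(iq.miss_returned(42))                # slice7 comes back for re-execution
```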

Once the slice is cleared for refilling one of the cluster schedulers, it first traverses a previous-epoch architectural live-out filter, which is basically a partial-lookup CAM structure along with some logic that determines which register reads need to be replaced by values already held in epoch ISA registers. So at the end of this live-out filter stage, any instruction that contains an exposed read has its register payload replaced with the seed PC of the logically preceding slice where that architectural register was logically last written, and a status bit on the instruction is set to mark it as an exposed read.

Now we have to mention another essential structure tied to this process: the result shift registers (RSRs) for the ISA, with one shift register for each architecturally defined register. Each RSR contains the architectural state of its register at the end of each epoch; in other words, each epoch entry of an RSR contains either the value shifted from the previous entry (corresponding to the execution of a single slice) or an updated value from the destination physical register of some instruction that is mapped to that architectural register according to the cluster RAT. Since the updates of architectural state are done as the slice executes, the updated value in the RSR is also logically the last write to the architectural register for the corresponding epoch.
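
A sketch of the RSR behavior in Python, with an assumed (made-up) epoch depth:

```python
N_EPOCHS = 4   # assumed depth of the epoch history; the real depth is not specified

class RSR:
    """One result shift register per architectural register."""
    def __init__(self):
        self.history = [None] * N_EPOCHS   # history[0] = most recent epoch

    def end_of_epoch(self, value_written=None):
        prev = self.history[0]
        self.history = [None] + self.history[:-1]   # shift older epochs down
        # Either carry the old architectural value forward, or record the
        # logically-last write this epoch produced for this register.
        self.history[0] = prev if value_written is None else value_written

rsr_rax = RSR()
rsr_rax.end_of_epoch(value_written=42)   # epoch 0: some slice wrote this register
rsr_rax.end_of_epoch()                   # epoch 1: no write, value carries over
print(rsr_rax.history)                   # [42, 42, None, None]
```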

So once the slice has passed through the filter with the relevant register payloads replaced, the architectural RSRs of the relevant epochs are read, and the instructions that have an exposed read of a previous epoch's architectural live-out are remapped to instructions containing the value from the relevant RSR entry. In essence, each exposed live-in instruction has a register operand replaced with an immediate operand. The original renamed RAT of the slice is updated so that the corresponding physical registers of the architectural live-ins are freed up and can be reused as renamed registers for future epochs. The remapped slice then enters the refill instruction buffer to wait for cluster resources to become available to restart execution. This process is controlled by the integer ICU and depends on the steering mechanism's resource monitor.
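
And a toy sketch of the remap step itself, again with hypothetical structures (the "exposed" set here stands in for whatever the live-out filter flags):

```python
# An instruction whose source register was found to be an exposed read of a
# previous epoch's live-out has that register operand replaced by an immediate
# taken from the relevant RSR entry, so the reissued slice no longer needs the
# old physical register.

def remap_exposed_reads(slice_uops, exposed, rsr_value):
    """
    slice_uops: list of dicts like {"op": "add", "srcs": ["r1", "r3"], "dest": "r2"}
    exposed:    set of (uop_index, src_position) pairs flagged by the live-out filter
    rsr_value:  callable giving the architectural value of a register at the
                relevant earlier epoch (read out of that register's RSR)
    """
    for idx, pos in exposed:
        uop = slice_uops[idx]
        reg = uop["srcs"][pos]
        uop["srcs"][pos] = ("imm", rsr_value(reg))   # register operand becomes an immediate
    return slice_uops

uops = [{"op": "add", "srcs": ["r1", "r3"], "dest": "r2"}]
print(remap_exposed_reads(uops, {(0, 0)}, lambda reg: 42))
# [{'op': 'add', 'srcs': [('imm', 42), 'r3'], 'dest': 'r2'}]
```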

So essentially, if this works ideally in concert with the slicing and steering mechanisms, it eliminates much of the stall time associated with L2 misses, which usually cannot be hidden through reordering in a conventional OoO core. It's very late; I'm sure I've left something out, and I'm sure some things are still unclear, so please feel free to ask, and I'll provide whatever answers I can.

Originally posted by Hard Ball:What that in effect gives you is hardware-managed distribution of hardware contexts for maximum utilization of the available hardware resources. You would have to weigh that against the alternative of building that capability into the OS, allowing some asymmetry in the chip design (asymmetry in terms of execution resources, not ISA) among different pipelines, making the OS aware of the asymmetry, and letting it do the management and load balancing. This would give you the vast majority of the benefits of the HW solution while being much less costly in terms of the compromises you would otherwise have to make in the chip design.

I would imagine that the hardware in general has a much better idea than the OS does of the instruction mix of a bunch of threads, and of the execution resources available, at any moment in time. The instruction mix is highly dynamic (dependent on which threads are scheduled, which operands are available from memory, etc.). Taking a whole bunch of in-flight instructions from several thread contexts and divvying them up amongst the available execution resources is already something that processors do.

I mean, think about it. If there were no silicon constraints, we wouldn't have multicore processors at all; just extreme hyperthreading (in the fine-grained, Intel style). Why would you want a dual-core, one-thread-per-core processor if you could have a single-core, two-thread processor with the same execution resources available to it? The worst case would be dividing the resources up equally between the two threads, but if one thread were relatively idle, more of the resources might be usable by the other. The more thread contexts available, the higher the proportion of the total execution resources that can be used at any one time. Which is surely what we want.

What you say is basically correct; there are things to be said on both sides in terms of the benefits and drawbacks of SMT on one hand and CMP on the other. That has been debated in the academic community over a very long period of time. Representative arguments for each side can be found in, for example: SMT: "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Tullsen et al., ISCA 1995; CMP: "The Case for a Single-Chip Multiprocessor," Olukotun et al., ACM 1996.

This is a debate about whether to dedicate more transistor budget to an increased number of reg-file ports and a more complex bypass network in an SMT design, or to dedicate the same budget to a larger number of individual pipelines on the chip. On one hand, for the same amount of execution (FU) resources, SMT performs much better because there is less vertical and horizontal waste in a superscalar pipeline. On the other hand, the complexity of the interconnect network needed to support such scheduling scales quadratically with issue width, which obviously leads to a larger transistor budget as well as a higher potential for delays. I think the recent trends tell us that although both still have merits and some future in commercial products, the CMP approach is much more widely adopted, has much deeper market penetration, and will be the major thrust in hardware vendors' efforts to support greater software TLP.

But that is actually not the only thing the debate turns on; FU resource waste vs. interconnect network complexity is a matter of fine-grained resource management. HW sharing of execution resources among different pipelines vs. software management of asymmetrically resourced cores also involves much coarser-grained resource management. The number one benefit of FUs shared among multiple pipelines is the ability to allocate resources asymmetrically, not necessarily the ability to change the allocation dynamically. This point is debatable, but I think that is the crux here; I'm sure we can talk much further about this.

I have made a smaller graph to better illustrate the workings of the part of the design that handles long-latency operations and reduces LL-induced stall time. See my explanation of that part of the design in my post yesterday.

Originally posted by DrPizza:I mean, think about it. If there were no silicon constraints, we wouldn't have multicore processors at all; just extreme hyperthreading (in the fine-grained, Intel style). Why would you want a dual-core, one-thread-per-core processor if you could have a single-core, two-thread processor with the same execution resources available to it? The worst case would be dividing the resources up equally between the two threads, but if one thread were relatively idle, more of the resources might be usable by the other. The more thread contexts available, the higher the proportion of the total execution resources that can be used at any one time. Which is surely what we want.

An ideal CPU would indeed be thread-decoupled, where the front end takes on as many threads as it damn well feels like and the back end simply chews through ops.

At the scales where we see this kind of thing, wide cores such as K7/K8/K8L or Core 2/i7 are already pushing their caches. With so many threads on the go, we'd need incredibly large amounts of extremely fast local store: a video card as a CPU, basically.

I don't see such a machine working on contemporary code without 64 MB of L2 cache (8-cycle latency or so) and 512 MB of L3 (15-20 cycles). It would just grind to a horrible bandwidth-starved stop at current silicon densities.