Larrabee: Samples in Late 08, Products in 2H09/1H10

Wednesday 16th January 2008, 12:42:00 PM, written by Arun

When Doug Freedman asked Paul Otellini about Larrabee during yesterday's conference call, we didn't think much would come out of it. But boy were we wrong: Otellini gave an incredibly to-the-point update on the project's timeframe. So rather than try to summarize, we'll just quote what Otellini had to say here.

Larrabee first silicon should be late this year in terms of samples and we’ll start playing with it and sampling it to developers and I still think we are on track for a product in late ’09, 2010 timeframe.

And yes, the seekingalpha.com transcript says 'Laramie' and we have no idea how anyone could spell it that way given the pronunciation, but whatever. The first interesting point is that Otellini said 'first silicon', as if it wasn't their intention to ship it in non-negligible quantities.

So it'd be little more than a prototype, which makes sense: it'll likely take quite some time for both game and GPGPU developers to get used to the architecture and programming model. If we're interpreting that right, it at least suggests Intel isn't being wildly overconfident about software adoption, and that they're willing to make this a long-term investment.

On the other hand, if their first real product is expected to come out in 'late 2009', an important point becomes what process Intel will manufacture it on. If it's 45nm, they'll actually be at a density and power disadvantage against GPUs produced at TSMC, based on our understanding of both companies' processes and roadmaps.

But if Intel is really aggressive and that chip is actually on 32nm, then they would be at a real process advantage. That seems less likely to us, since it'd imply the chip taping out and releasing at about the same time as Intel's first CPUs on a new process. Either way, it is a very important question.

The next point to consider is that in the 2H09/1H10 timeframe, Larrabee will compete against NVIDIA and AMD's all-new DX11 architectures. This makes architectural and programming flexibility questions especially hard to answer at this point. It should be obvious that NVIDIA and AMD must want to improve those aspects to fight against Larrabee though, so it could be a very interesting fight in the GPGPU market and beyond.


Tagging

intel: larrabee, dx11


Latest Thread Comments (643 total)
Posted by Jawed on Saturday, 22-Mar-08 19:50:06 UTC
Quoting armchair_architect
It's quite possible I simply misunderstood you. The way I read it, there's essentially two register files, a big&slow one and a small&fast one. Each clause does all its work in the fast registers: at the beginning of the clause all values it needs to read are copied into the fast registers, and at the end of the clause any live outputs (i.e. values needed by subsequent clauses) are copied back to the slow registers. Is that accurate?
Yep, that's a good summary.
Quote
If so, if clause A computes a value used by clauses B and C, then that value will be copied out from A and copied in to B and C. Compared to a design with a flat register file, that's three extra data movements.
Ah, OK, I understand what you're saying. Agreed.
Quote
This idea of big&slow vs. small&fast registers is just a two-level memory hierarchy (ignoring the cache and framebuffer levels beyond them). Which is fine by itself. But having these atomic clauses essentially means that every N operations you're forced to flush/invalidate the lowest level of the hierarchy, and reload it next time you need the data.
Often you'd be switching clauses anyway because of texture operations (or to evaluate the branching predicate for subsequent instruction issue). Regardless, yes, the extra data being moved is costly.

I've been fiddling some more with GPUShaderAnalyzer and what I'm finding is that there seems to be a low ceiling on the count of clause temporaries. R122 seems to be the lowest I've gotten so far (i.e. 6 temporaries, R122-R127). 6 is a funny number. Anything more complex seems to start assigning shader registers, e.g. r1, r2 etc. This makes me suspect that the first model is being used, where the ALU pipeline operates on *both* the main register file and the subsidiary one. That obviously makes the operand addressing more fiddly. Of course, what I've been referring to as a subsidiary register file could be just a scratch area within the main register file :???:

Coming slightly back on topic, I'm also wondering how Larrabee will map the voluminous register file specification of SM4 onto its caches/register file/ALUs... Software threads versus hardware threads.

Jawed
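A minimal sketch, in C, of the clause model under discussion: every value a clause reads is copied from the big/slow register file into the small/fast one at clause start, and every live output is copied back at clause end. The live-in/live-out counts below are the hypothetical A/B/C example from armchair_architect's post, not measured hardware behaviour.

Code:
---------
#include <stdio.h>

/* Per-clause traffic in the two-level register model: live-ins are
 * copied slow -> fast at clause start, live-outs fast -> slow at
 * clause end. */
struct clause {
    int live_ins;
    int live_outs;
};

int main(void)
{
    /* Clause A produces one value; clauses B and C each consume it. */
    struct clause prog[] = {
        { 0, 1 },   /* A: copy the result out to the slow file */
        { 1, 0 },   /* B: copy it back in                      */
        { 1, 0 },   /* C: copy it back in again                */
    };

    int moves = 0;
    for (int i = 0; i < 3; i++)
        moves += prog[i].live_ins + prog[i].live_outs;

    /* Prints 3: the three extra data movements a flat register
     * file would avoid entirely. */
    printf("extra register-file moves: %d\n", moves);
    return 0;
}
---------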

Posted by Jawed on Sunday, 23-Mar-08 03:51:33 UTC
Quoting armchair_architect
This idea of big&slow vs. small&fast registers is just a two-level memory hierarchy (ignoring the cache and framebuffer levels beyond them).
Hmm, isn't this actually how G80 works? It does "wide but shallow" fetches from the register file into a block of memory that's used to enable the correct ordering of registers for the ALUs? This is how it "simulates" a multi-ported register file, using a single very fat port into a small memory that easily supports the "random" fetches the ALUs require. As far as G80 ALUs are concerned addressing is from the fast register file, which presumably only requires a few bits of address per operand ("thread ID" takes care of the rest). I presume that the fast memory is a gather window for all types of operands: registers, parallel data cache and constants. Jawed
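For what it's worth, a toy C sketch of that "wide but shallow" arrangement, with made-up sizes: one fat read stages a whole register row into a small gather window, and the ALUs then do their "random" operand fetches from the window rather than from the single-ported file.

Code:
---------
#include <stdio.h>

#define LANES 8    /* hypothetical ALU width       */
#define SLOTS 4    /* rows the gather window holds */

static float reg_file[1024][LANES];  /* big, slow, one very fat port     */
static float window[SLOTS][LANES];   /* small, fast, easily multi-ported */

/* The only way the slow file is ever read: a whole row at a time. */
static void wide_fetch(int row, int slot)
{
    for (int lane = 0; lane < LANES; lane++)
        window[slot][lane] = reg_file[row][lane];
}

int main(void)
{
    wide_fetch(100, 0);   /* stage two register rows... */
    wide_fetch(101, 1);

    /* ...then the ALUs address operands with just a slot index,
     * doing their "random" fetches from the fast window. */
    float acc = 0.0f;
    for (int lane = 0; lane < LANES; lane++)
        acc += window[0][lane] * window[1][lane];
    printf("acc = %f\n", acc);
    return 0;
}
---------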

Posted by Ilfirin on Wednesday, 26-Mar-08 08:49:09 UTC
I just got some performance figures back from a project I completed recently and found them rather interesting in relation to this discussion.

Some background: the project in question was for the display section of a relatively advanced VNC system, where the worst case involved analyzing two screens for differences, extracting the bounding rectangles of those differences, and moving the changed areas from one screen to the next while performing some dynamic digital grading, followed by an RGB->YUY2 colorspace conversion at the end. This is essentially a 3-step process: 1. Find differences and update the target screen 2. Perform digital grading 3. RGB->YUY2 (sound a lot like a pixel shader running on a CPU? ;) )

I wrote all this in highly optimized MMX assembly code and got some rather surprising results. At 1680x1050x32 on a P4 1.3GHz system the RGB->YUY2 transformation took about 9ms, as did the update stage... and the digital grading stage. The process as a whole ended up being around 30ms. When all 3 stages were combined into 1, the process as a whole went back down to around 9ms. I added some 'wavy screen effects' just to test the performance of doing so... still 9ms. The arithmetic instructions now outnumber the memory operations by a massive amount.

On a system clocked 500MHz faster but with the same cache size it was still at 9ms, but when it was run on a similarly clocked processor with twice the cache, the performance nearly doubled. On a dual-processor system with a combined total cache equal to the last setup (running the multithreaded version of the algorithm) performance was the same. C2Qs with doubled-up caches performed twice as fast as C2Ds, regardless of differences in clock rate.

To make a long story short, what ended up happening was that across all the systems tested, performance scaled almost perfectly linearly with cache size and nothing else seemed to have any effect on it. Tripling the ALU ops didn't affect performance, nor did going multi-core without increasing the overall cache amount. Bus speed differences had a marginal effect, but certainly nothing to write home about.

The moral of all this is that, with the clock frequencies being quoted for Larrabee, the most limiting factor isn't going to be so much how many cores it has as how much cache and bandwidth is associated with each core and how it's organized. Being able to process a bunch of different sections of the screen simultaneously doesn't really help you much if a single core can already process all the data from the last read before the next set arrives. I'm very interested in seeing exactly how they address this... along with seeing what it's going to be like programming on it after they're done.

Looking over the posts here it seems a bunch of people are already saying this, but it was neat to have some "real" data on the matter.
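A scalar C sketch (nothing like Ilfirin's MMX assembly, and with stand-in pixel math) of why fusing the three passes collapses 30ms back to 9ms: the separate loops stream the whole frame through the cache three times, while the fused loop pays the memory traffic once, so the extra ALU work rides along for free.

Code:
---------
#include <stddef.h>
#include <stdint.h>

#define W 1680
#define H 1050
#define N ((size_t)W * H)

static uint32_t src[N], dst[N];

static uint32_t grade(uint32_t p)   { return p ^ 0x010101u; } /* stand-in */
static uint32_t to_yuy2(uint32_t p) { return p >> 1; }        /* stand-in */

/* Three separate passes: the frame makes three round trips through
 * the cache, so each stage costs a full pass of memory traffic. */
static void three_passes(void)
{
    for (size_t i = 0; i < N; i++)
        if (dst[i] != src[i]) dst[i] = src[i];   /* update  */
    for (size_t i = 0; i < N; i++)
        dst[i] = grade(dst[i]);                  /* grading */
    for (size_t i = 0; i < N; i++)
        dst[i] = to_yuy2(dst[i]);                /* convert */
}

/* Fused: one round trip per pixel; the grading and conversion ALU ops
 * hide behind memory traffic that had to happen anyway. */
static void fused_pass(void)
{
    for (size_t i = 0; i < N; i++)
        dst[i] = to_yuy2(grade(src[i]));
}

int main(void)
{
    three_passes();
    fused_pass();
    return 0;
}
---------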

Posted by Demirug on Wednesday, 26-Mar-08 11:09:23 UTC
Yes, cache misses are expensive on all current CPUs. This is even more of a problem when it comes to a pure in-order CPU. I can't remember who said it, but a GPU maker claimed that today's GPUs are designed to have very high texture cache hit rates, as this is one of the key factors in being fast.

Posted by nAo on Wednesday, 26-Mar-08 14:22:57 UTC
Quoting Ilfirin
Looking over the posts here it seems a bunch of people are already saying this, but it was neat to have some "real" data on the matter.
Forgive me for this silly question but given that you probably had a pretty regular and predictable memory access pattern were you prefetching your data at all?
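For reference, the kind of software prefetch nAo is asking about, sketched in C for a P4-era x86 using SSE's _mm_prefetch; the distance constant is a made-up tuning parameter and the per-pixel work is a stand-in.

Code:
---------
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

#define PREFETCH_DIST 256   /* bytes ahead; a made-up tuning constant */

/* Streaming loop with an explicit software prefetch issued a fixed
 * distance ahead of the read pointer. */
static void stream_convert(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)src + i * 4 + PREFETCH_DIST,
                     _MM_HINT_T0);
        dst[i] = src[i] >> 1;   /* stand-in for the real pixel work */
    }
}

int main(void)
{
    static uint32_t a[1 << 16], b[1 << 16];
    stream_convert(a, b, 1 << 16);
    return 0;
}
---------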

Posted by Jawed on Wednesday, 26-Mar-08 15:24:43 UTC
Quoting Jawed
This makes me suspect that the first model is being used, where the ALU pipeline operates on *both* the main register file and the subsidiary one. That obviously makes the operand addressing more fiddly. Of course what I've been referring to as a subsidiary register file could be just a scratch area within the main register file :???:
Yay, I've just been reading the R600 ISA document and this is how it works, using a variably sized scratch area within the register file for clause temporaries. It turns out only a maximum of 4 clause temporaries is supported in R600. In some cases I've seen GPUSA report fewer registers required for RV670. I wonder if this is because RV670 can support more clause temporaries? Jawed

Posted by 3dilettante on Wednesday, 26-Mar-08 16:42:02 UTC
It appears R600 does try to reduce the amount of register data moves, even if it should complicate addressing.

Is the R600 ISA document part of the SDK, or can it be found separately?

Posted by Jawed on Wednesday, 26-Mar-08 16:57:11 UTC
Quoting 3dilettante
It appears R600 does try to reduce the amount of register data moves, even if it should complicate addressing.
I think it's more relevant to see the clause temporaries as enabling significant savings in per-thread register allocations, increasing the number of in-flight threads.
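Toy arithmetic for that point, in C, with a hypothetical register-file size (not R600's real figure): shaving clause temporaries off the per-thread allocation directly raises the number of threads the register file can hold in flight.

Code:
---------
#include <stdio.h>

int main(void)
{
    int regfile  = 4096;  /* hypothetical vec4 registers per SIMD       */
    int flat     = 16;    /* shader needing 12 GPRs + 4 temporaries     */
    int with_cts = 12;    /* same shader with the 4 temps in CT scratch */

    printf("flat allocation:   %d threads in flight\n", regfile / flat);
    printf("with clause temps: %d threads in flight\n", regfile / with_cts);
    return 0;
}
---------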
Quote
Is the R600 ISA document part of the SDK, or can it be found separately?
As far as I can tell it is only available as part of the SDK, which is annoying and short-sighted. I was able to install the SDK even though my driver is out of date (7.7 or 7.8 I think). Dunno what would happen if you don't have an ATI GPU. Jawed

Posted by 3dilettante on Wednesday, 26-Mar-08 17:25:10 UTC
How are the clause temporaries addressed compared to standard registers?

Hard-wiring the highest 4 vec4 register addresses to be temporaries would be the simplest way to encode things, as in it changes nothing, but that would leave it up to the compiler/coder to make sure that switching out the clause doesn't wreck everything.
Leaving correctness up to the precise timing of the code sequence would be very old-school VLIW, if that were the case.

On the other hand, that would limit future expandability, as any further increase of scratch space capacity would eat into the normal register space.

Interestingly, 4 vec4 registers would translate to 16 32-bit registers (like another ISA has for general-purpose registers). If the temp references are encoded differently and ALUs can reference the temp and permanent sections, it would almost be like an x86 mem-reg operation, though hopefully entirely on chip and deterministic at the time of reference.

Posted by Jawed on Wednesday, 26-Mar-08 18:31:16 UTC
Quoting 3dilettante
How are the clause temporaries addressed compared to standard registers?
I haven't read enough about addressing to give you an answer yet. Quoting table 2-6 on page 15:

GPRs:
* number per thread: 127 minus 2 times Clause-Temporary GPRs
* Each thread has access to up to 127 GPRs, minus two times the number of Clause-Temporary GPRs. Four GPRs are reserved as Clause-Temporary GPRs that persist only for one ALU clause (and therefore are not accessible to fetch and export units). GPRs may hold data in one of several formats: the ALU can work with 32-bit IEEE floats (S23E8 format with special values), 32-bit unsigned integers, and 32-bit signed integers.

Clause-Temporary GPRs:
* number per thread: 4
* GPRs containing clause-temporary variables. The number of clause-temporary GPRs used by each thread reduces the total number of GPRs available to the thread, as described immediately above.
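A worked instance of that allocation rule as a trivial C loop, assuming the table means 127 minus 2 per clause temporary, per thread:

Code:
---------
#include <stdio.h>

/* GPRs available per thread = 127 - 2 * clause-temporary count,
 * straight from the table quoted above. */
int main(void)
{
    for (int ct = 0; ct <= 4; ct++)
        printf("CTs: %d -> GPRs per thread: %d\n", ct, 127 - 2 * ct);
    /* With the full 4 clause temporaries: 127 - 8 = 119 GPRs. */
    return 0;
}
---------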
Quote
Hard-wiring the highest 4 vec4 register addresses to be temporaries would be the simplest way to encode things, as in it changes nothing, but that would leave it up to the compiler/coder to make sure that switching out the clause doesn't wreck everything.Leaving correctness up to the precise timing of the code sequence would be very old-school VLIW, if that were the case.
The compiler, as far as I can tell, explicitly encodes for regular versus CT registers.
Quote
On the other hand, that would limit future expandability, as any further increase of scratch space capacity would eat into the normal register space.
The rate at which CTs consume register file space is very low, since it's only the CT count per thread times the number of threads active in the ALU pipeline - in R600 this is 64*2*CT-count, i.e. 512 registers for the full 4 CTs. As I commented earlier, I suspect RV670 has more capacity for CTs, but that's only based on GPUSA output I don't fully understand.
Quote
Interestingly, 4 vec4 registers would translate to 16 32-bit registers (like another ISA has for general-purpose registers). If the temp references are encoded differently and ALUs can reference the temp and permanent sections, it would almost be like an x86 mem-reg operation, though hopefully entirely on chip and deterministic at the time of reference.
If you look at GPUSA output you'll see quite clearly that GPRs and CTs are mixed "freely". I've discovered that there *are* restrictions on the sequence of operands issued in a 5-op ALU Instruction Group, but I don't understand them yet... Code:
---------
212  x: ADD R0.x, PV(211).z, C2.z
     y: MAX R123.y, PV(211).y, 0.0f
     z: ADD R5.z, PV(211).x, C2.z
     w: ADD R123.w, R2.z, -PV(211).w
     t: MUL R122.z, R124.x, C2.w
213  x: MUL R124.x, R127.w, C2.w
     y: ADD R14.y, PS(212).x, R3.y
     z: ADD R11.z, PV(212).w, C2.z
     w: MUL R123.w, PV(212).y, PV(212).y
     t: ADD R14.x, R126.w, R3.w
---------

GPRs:
* R0
* R2
* R3
* R5
* R11
* R14

CTs:
* R122
* R123
* R124
* R126

"Previous" registers (what I've called pipeline registers in the past); note the index refers to the instruction number that produced the result, and the index is *always* the prior instruction:
* PV(211) - "V" refers to the vector of four lane resultants X, Y, Z, W
* PV(212)
* PS(212) - "S" refers to the scalar T resultant (confusingly always referred to as "X")

The CT assignment is suspect, by the way, because the entire clause also refers to R125 and R127 (6 CTs in total), which is more than the supposed limit of 4. Sigh. Jawed


