Larrabee: 16 Cores, 2GHz, 150W, and more...

Friday 1st June 2007, 06:08:00 PM, written by Arun

It is amazing how much information is out there in the wild when you know where to look. TG Daily has just published an article partially based on a presentation they were tipped off about, which was uploaded on the 26th of April. It reveals a substantial amount of new information, which we will not analyse in depth right now, so we do encourage you to read it for yourself.

Page 1 discusses the possibility that Larrabee is a joint effort between NVIDIA and Intel, which we find unlikely; it is possibly just a misinterpretation of the recently announced patent licensing agreement between the two companies. Page 2 is much more interesting, however, as it links to the presentation above and also uncovers the hidden Larrabee PCB diagram on slide 16.

We would tend not to agree with most of the analysis and speculation provided by TG Daily, but it's still worth a good read along with the presentation, which we are very glad they uncovered. Especially interesting are slides 16, 17, 19, 24 and 31. That last one includes some very interesting and previously unknown information on Intel's upcoming Gesher CPU architecture (aka Sandy Bridge), which is aimed at the 32nm node in the 2010 timeframe. Larrabee, on the other hand, will presumably be manufactured on Intel's 45nm process but sport a larger die size.

Latest Thread Comments (95 total)
Posted by 3dilettante on Friday, 22-Jun-07 23:03:32 UTC
Quoting Demirug
Isn't using x86 for a GPU already adventurous enough? Tim Sweeney will probably love it.
Adventurous from the point of view of fiddling with the x86 ISA.
These extensions may be a straightforward widening of the SSE registers and some additional mask and compare instructions, or they could try to do something more drastic.
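As a rough illustration of that compare-and-mask idiom with today's SSE intrinsics (what Larrabee's actual extensions look like is not public; this just shows the existing pattern that would presumably be widened):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Branchless per-lane select: out[i] = (a[i] > b[i]) ? a[i] : b[i].
   The compare yields an all-ones or all-zeros mask in each lane, and the
   mask then steers the blend, so no branch (and no predictor) is involved. */
static __m128 select_max(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmpgt_ps(a, b);          /* lane-wise a > b         */
    return _mm_or_ps(_mm_and_ps(mask, a),      /* keep a where mask set   */
                     _mm_andnot_ps(mask, b));  /* keep b where mask clear */
}
```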

Quote
An interesting side effect could be that this is the first GPU that would be upgraded to a new DX version with only a driver.
Not any more than how the shader cores in G80 and R600 could be.

The fixed-function block that Intel has left pretty much blank would not upgrade to match the DX version.

Posted by ebola on Saturday, 23-Jun-07 03:17:02 UTC
Quoting Gubbi
However, Motorola completely b0rked the ISA with the 68020
Heh, I guess there must have been many good reasons why they were so keen to move on to PPC.
Rose-tinted spectacles; I think I'm just permanently psychologically scarred by the moment when I realised that, to continue graphics coding, I would have to lose 8 registers :)

I suppose it's also ironic that I've just named a processor that pretty much resurrects the concept of "memory segmentation" as my "favourite piece of silicon" :) I was commenting to one of my colleagues a while back that we needed near & far keywords in the compiler.

Posted by MfA on Saturday, 23-Jun-07 12:59:34 UTC
Quoting 3dilettante
It supposedly goes as far as implementing specialized control and branch instructions.
Without sophisticated branch prediction you pretty much need those. Maybe finally after decades we will be able to use loops on x86 without knowing for a fact the predictor will get every branch wrong at least once ... goodbye loop unrolling?

Posted by ADEX on Sunday, 24-Jun-07 00:32:08 UTC
Quoting 3dilettante
Unrolling is a little more involved than just register renaming.
I don't see how the concepts are related at all: loop unrolling is a software technique, OOO is a hardware technique. That said, OOO can be used to implement loop unrolling, and that requires rename registers, but it's not the same thing.

Loop unrolling is a technique which has a number of benefits. It reduces the number of instructions used, reduces the number of branches and makes better use of the CPU's pipeline. It can boost performance by a hefty amount - the first time I ever used the technique it boosted performance by 5X. You can also use the technique on things like memory moves and the performance boost can also be big.
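For a concrete (if simplified) picture of what ADEX describes, here is a minimal sketch in C; the exact payoff will of course depend on the loop and the hardware:

```c
#include <stddef.h>

/* Straightforward sum: one add plus one increment, compare, and branch
   per element. */
float sum_simple(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Unrolled by four, with independent accumulators: the loop overhead
   (increment, compare, branch) is paid once per four elements, and the
   four partial sums break the add-to-add dependency chain.  Note the
   cost: the body is four times larger and needs a cleanup loop. */
float sum_unrolled(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)  /* cleanup for the leftover 0..3 elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```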

Posted by 3dilettante on Monday, 25-Jun-07 12:45:32 UTC
Quoting ADEX
Loop unrolling is a technique which has a number of benefits. It reduces the number of instructions used, reduces the number of branches and makes better use of the CPU's pipeline. It can boost performance by a hefty amount - the first time I ever used the technique it boosted performance by 5X. You can also use the technique on things like memory moves and the performance boost can also be big.
Loop unrolling does not reduce the number of instructions used.

The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.

Posted by Frank on Monday, 25-Jun-07 22:16:10 UTC
Quoting 3dilettante
The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.
And memory bandwidth. Then again, it doesn't mispredict. And, as most loops are simple counters, it would be best and simplest to use a form of branch prediction that specifies the iteration count.

Posted by Gubbi on Tuesday, 26-Jun-07 07:48:22 UTC
Quoting 3dilettante
Loop unrolling does not reduce the number of instructions used.
I think ADEX meant that the number of instructions scheduled is reduced, because of the reduced loop overhead.

Quoting 3dilettante
The increase in code size is one of the big drawbacks to loop unrolling, since it takes up instruction cache space.
Not only does the unrolled loop take up n times more instructions, but you usually have a preamble to preload the first data needed, and an epilogue to finalize the loop (storing results without prefetching more data). The bloat can be substantial.

One of the few redeeming features of IPF is that you can normally collapse the preamble, unrolled loop and epilogue into a compact loop body by using predicated instructions and the rotating integer register stack.

Loop unrolling really is a clumsy way of avoiding data-dependency stalls.
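To put the preamble/epilogue point in concrete terms, here is a software-pipelined loop sketched in C for illustration; on IPF, predication and the rotating registers would let these three phases collapse back into one compact body:

```c
/* Copy-and-scale with the load for iteration i+1 issued while iteration
   i's multiply and store complete, hiding the load latency.  Note the
   bloat: a preamble that primes the pipeline and an epilogue that drains
   it, on top of the steady-state body. */
void scale(float *dst, const float *src, int n, float k)
{
    if (n <= 0)
        return;
    float cur = src[0];               /* preamble: preload first element */
    for (int i = 0; i < n - 1; ++i) {
        float next = src[i + 1];      /* fetch next iteration's input    */
        dst[i] = cur * k;             /* work on the current element     */
        cur = next;
    }
    dst[n - 1] = cur * k;             /* epilogue: store the last result */
}
```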

Cheers

Posted by 3dilettante on Tuesday, 26-Jun-07 21:08:34 UTC
Quoting MfA
Without sophisticated branch prediction you pretty much need those. Maybe finally after decades we will be able to use loops on x86 without knowing for a fact the predictor will get every branch wrong at least once ... goodbye loop unrolling?
Core2 and Pentium M already have forms of loop detection.
Conroe goes as far as having a loop cache that is some number of instructions in size (64 I think).

It was rumored at one point Barcelona would have something similar, but it's not in the documentation.

Posted by Techno+ on Thursday, 28-Jun-07 18:04:47 UTC
Just a question: do you think that one of the Larrabee cores would be enough to run Windows XP?

Posted by 3dilettante on Thursday, 28-Jun-07 18:32:47 UTC
If it were limited to one core, it seems it could run, though I don't think it would be an option anyone would like.

The top clock speed in the presentation was 2.5 GHz.

There are a few unknowns.

The threading used wasn't listed. If SMT, then it's possible for one thread to use the core most of the time.
If fine-grained, then a 4-threaded core at 2.5 GHz is going to look like it's 625 MHz to a single thread.
OSs are more threaded than most desktop applications, but it would have an impact.

The integer width was not disclosed. It was a ??? in the slide.
If the minimum FP width is 2 ops, then it may be that the minimum integer width is two ops.
It may look like a Pentium (pre-Pro).

The next question is how much branch prediction there would be.
That wasn't disclosed, and it is unlikely it would be anywhere near the huge predictors for P4 and Core2.
Having a branch predictor as limited as the pre-Pro Pentium's might be possible, and Intel's more generalized vision for Larrabee may mean it will have some prediction, even though such speculation is usually a waste for a GPU.

Best case, we see it perform like a 2.5 GHz Pentium MMX + SSE.
Not so good, maybe something like half that.
Worse, we see it perform like a ~800 MHz Pentium MMX with SSE.
Worst (unlikely?) case, we might see a single core perform like a ~800 MHz Pentium MMX with no branch prediction, which would cut performance significantly.

I don't think it would be acceptable for most users, but it could chug along fine with no other programs running.

The spooky part about this is that a number of plausible high-level design decisions would make it look a lot like a Pentium with MMX and SSE multiplied 16 to 32 times.

edit: And there's even a thread in the Hardware forum about that, which would have been nice had I seen it before all this writing.

