Larrabee's Rasterisation Focus Confirmed

Wednesday 23rd April 2008, 08:33:00 PM, written by TeamB3D

For many months, researchers and marketing fanatics at Intel have been heralding the upcoming 'raytracing revolution', claiming rasterisation has run out of steam. So it is refreshing to hear someone actually working on Larrabee flatly denying that raytracing will be the chip's main focus.

Tom Forsyth is currently a software engineer working for Intel on Larrabee. He previously worked at RAD Game Tools on Pixomatic (a software rasterizer) and Granny3D, as well as Microprose, 3Dlabs, and most notably Muckyfoot Productions (RIP). He is well respected throughout the industry for the high-quality insights into graphics programming techniques he posts on his blog. Last Friday, though, his post's subject was quite different:

"I've been trying to keep quiet, but I need to get one thing very clear. Larrabee is going to render DirectX and OpenGL games through rasterisation, not through raytracing.

I'm not sure how the message got so muddled. I think in our quest to just keep our heads down and get on with it, we've possibly been a bit too quiet. So some comments about exciting new rendering tech got misinterpreted as our one and only plan. [...]
That has been the goal for the Larrabee team from day one, and it continues to be the primary focus of the hardware and software teams. [...]

There's no doubt Larrabee is going to be the world's most awesome raytracer. It's going to be the world's most awesome chip at a lot of heavy computing tasks - that's the joy of total programmability combined with serious number-crunching power. But that is cool stuff for those that want to play with wacky tech. We're not assuming everybody in the world will do this, we're not forcing anyone to do so, and we certainly can't just do it behind their backs and expect things to work - that would be absurd."

So, what does this actually mean for Larrabee, both technically and strategically? Look at it this way: Larrabee is a DX11 GPU with a design team that took both raytracing and GPGPU into consideration from the very start, while not forgetting that performance in DX10+-class games, which assume a rasteriser, would be the most important factor determining the architecture's mainstream success or failure.

There's a reason for our choice of phrasing: the exact same sentence would be just as accurate for NVIDIA and AMD's architectures. Case in point: NVIDIA's Analyst Day 2008 had a huge amount of the time dedicated to GPGPU, and they clearly indicated their dedication to non-rasterised rendering in the 2009-2010 timeframe. We suspect the same is true for AMD.

The frequent implicit assumption that DX11 GPUs will basically be DX10 GPUs with a couple of quick changes and exposed tessellation is weak. Even if the programming model itself weren't changing significantly (it is, with the IHVs providing significant input into its direction), all current indications are that the architectures themselves will differ significantly from current offerings regardless, as the IHVs tackle the problem in front of them in the best way they know how, as they've always done.

The industry gains new ideas and thinking, and algorithms and innovation on the software side mean target workloads change; there's nothing magical about reinventing yourself every couple of years. That's the way the industry has always worked, and those which have failed to do so are long gone.

Intel is certainly coming up with an unusual architecture with Larrabee by exploiting the x86 instruction set for MIMD processing on the same core as the SIMD vector unit. And trying to achieve leading performance with barely any fixed-function unit is certainly ambitious.

But fundamentally, the design principles and goals really aren't that different from those of the chips it will be competing with. It will likely be slightly more flexible than the NVIDIA and AMD alternatives, not least by making approaches such as logarithmic rasterisation acceleration possible, but it should be clearly understood that the differences may in fact not be quite as substantial as many are currently predicting.

The point is that it's not about rasterisation versus raytracing, or even x86 versus proprietary ISAs.  It never was in the first place.  The raytracing focus of early messaging was merely a distraction for the curious, so Intel could make some noise.  Direct3D is the juggernaut, not the hardware.

"First, graphics that we have all come to know and love today, I have news for you. It's coming to an end. Our multi-decade old 3D graphics rendering architecture that's based on a rasterization approach is no longer scalable and suitable for the demands of the future."  That's why the message got so muddled, Tom.  And no offence, Pat, but history will prove you quite wrong.




Latest Thread Comments (244 total)
Posted by Jawed on Wednesday, 07-May-08 02:44:04 UTC
Quoting 3dilettante
To rephrase: one vector instruction can be issued to a vector unit. Since it seems some Intel figures have stated publicly that an FMAC instruction is available to Larrabee, we can map that to the 8-16 DP figure given in the earlier Larrabee slide, which--barring some flaky issue restrictions--indicates each core has one vector unit.
Yeah, in single-precision terms: ((x,y,z,w),(x,y,z,w),(x,y,z,w),(x,y,z,w))
Quote
Could be. None of the slides go into that. There are a number of ways that can go. It could be fully separate, or complex ops can share hardware with the FMAC unit.
I wonder how important double precision is. Generally Intel seems to be keeping a keen eye on it yet it makes transcendentals hairier, biasing implementation away from the pipelined designs we see in GPUs. LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks... Might as well link this as I just ran into it: http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf Interestingly only atan has degree more than 16 for double-precision: 22 :razz:
Quote
8 vector registers? By soft-context-swap, you mean have each master thread emulate a context switch with successive writes?
Yeah, 8x 512-bit registers populated to form a hardware context from one of many virtualised states. D3D10 requires support for 4096 128-bit registers per *object*. Since Intel has to implement virtualised shader state then it might go one further and virtualise threads by creating a pool of software contexts. Hmm, as far as a software-GPU is concerned this should be entirely up for grabs - what does Swiftshader do? Presumably Intel is retaining SSE functionality so it's really a matter of the most advantageous way to use soft contexts (if it makes any sense for Larrabee-as-GPU).
Quote
The effectiveness of such a solution could depend on a lot of things, such as the physical port count. Unless there is a form of bulk write that can write multiple registers to memory, we're talking about a soft-context switch that will take up 8 port cycles and 8 instructions out of the core's issue bandwidth. Possibly, if there is a large load/store buffer, the successive writes can wait to commit to memory and take advantage of a wider cache port.
Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache) so that average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth. Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee? All SIMD instructions, if they run solely from/to register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations :oops: ).
Quote
If we assume two fully active threads, one thread could do a switch while the other continued working, assuming an internal width of at least 2 threads-worth of resources and two data cache ports. That would require that the other active thread keep active for 8 cycles to hide this switch, or only 4 if it doesn't use any cache bandwidth at all and sticks to within the reg file. If we crudely link the vector registers' capacity to an equivalent number of 4-channel FP32 elements, that's between 32 and 16 elements, if we go by what you posited by having only 2 threads fully running.
Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units. With the gather and scatter units being the interface to the real world for all data, the SIMD can cosy up to a very small register file - presumably much like SSE's SIMD does. Double-threading the SIMD is obviously going to complicate things but by keeping the count of in-flight registers tiny (unlike a GPU) Intel avoids the register file explosion that we see in R600, where the register file amounts to 1MB effective (and it has at least 3 read ports it seems though I still haven't untangled whether that's 3 physical ports or 3 emulated ports). So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain. Jawed

Posted by Barbarian on Wednesday, 07-May-08 07:00:22 UTC
Quoting Jawed
So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.
That would be the most logical thing to do. Current GPUs struggle with constant register indexing, something that would be trivial with virtualized register file.
If the rumors of 1-cycle L1 reads plus reg-mem vector instructions are true, that would effectively give a 32KB register file per core.
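To put the capacities mentioned in this thread side by side, here is a rough sketch of the arithmetic. The 512-bit vector width, the 8 registers per hardware context and the 32KB L1 are all figures speculated in the posts above, not confirmed Larrabee specifications, and the helper name is made up for illustration.

```python
VECTOR_BITS = 512        # vector register width discussed in the thread (assumed)
HW_CONTEXT_REGS = 8      # speculated vector registers per hardware context
L1_BYTES = 32 * 1024     # L1 size implied by the "32KB register file" remark

def vec4_fp32_slots(total_bits):
    """Number of 4-channel FP32 (128-bit) values that fit in total_bits."""
    return total_bits // 128

# Capacity of one hardware context, counted in D3D-style 128-bit registers.
context_slots = vec4_fp32_slots(HW_CONTEXT_REGS * VECTOR_BITS)

# Capacity of the L1 if it is treated as a virtualised register file.
l1_vector_lines = (L1_BYTES * 8) // VECTOR_BITS
l1_slots = vec4_fp32_slots(L1_BYTES * 8)

print(context_slots, l1_vector_lines, l1_slots)  # 32 512 2048
```

Under those assumptions the L1 holds 64 times as many vector-sized entries as a single 8-register hardware context, which is the appeal of letting reg-mem instructions read operands straight from cache.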

Posted by MfA on Wednesday, 07-May-08 16:01:26 UTC
I kinda doubt it can sustain 1 cycle reads for vectorized reads.

Posted by 3dilettante on Wednesday, 07-May-08 16:26:39 UTC
Quoting Jawed
I wonder how important double precision is. Generally Intel seems to be keeping a keen eye on it yet it makes transcendentals hairier, biasing implementation away from the pipelined designs we see in GPUs.

Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.

Quote
LOL, with 16 lanes in a SIMD you could do a funky set of parallel terms (one per lane) for a polynomial to produce one transcendental every few clocks...

Might as well link this as I just ran into it:

http://developer.intel.com/technology/itj/q41999/pdf/transendental.pdf

Interestingly only atan has degree more than 16 for double-precision: 22 :razz:

Perhaps a microcode instruction could spit out the necessary operations.
Just for funzies, I tried by hand to fold that table used for the optimal scheduling of that polynomial evaluation in table 2 across multiple SIMD lanes.
I haven't really gone too in depth, but it seems that the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets, the last FMAC would only use one lane.
So long as each operation is pipelined, other work could be overlaid in the latency periods between each op.
I'd hope the FMAC is pipelined, but would the permutes?
The latency would be the sum of the FMAC and permute latencies.
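As a rough illustration of folding a polynomial across lanes, here is a sketch using Estrin's scheme. This is a substitute technique chosen for illustration, not necessarily the schedule in the paper's table 2, and the function name is hypothetical. Each loop iteration models one FMAC issue, and the list slicing stands in for the cross-lane permute between steps.

```python
def estrin_levels(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by pairwise folding (Estrin's scheme).

    Returns (value, lanes_used_per_step). coeffs are ordered from the
    constant term upwards.
    """
    terms = list(coeffs)
    power = x
    lanes_used = []
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        evens, odds = terms[0::2], terms[1::2]  # models the permute
        # One FMAC per active lane: even + odd * power.
        terms = [e + o * power for e, o in zip(evens, odds)]
        lanes_used.append(len(terms))
        power *= power  # x, x^2, x^4, ... (an extra multiply per step)
    return terms[0], lanes_used

value, lanes = estrin_levels([1.0] * 16, 2.0)
print(value, lanes)  # 65535.0 [8, 4, 2, 1]
```

With 16 coefficients this sketch issues three multi-lane steps and a final single-lane one, and the active lane count halves each time, which is the utilization drop described above. The squaring of the power term is extra work that a real schedule would also have to place somewhere.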

The other option is to use precisely one lane for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements.
Once a value is no longer needed, its register could be reused, though the register footprint would still be wider.
It does avoid the permute stuff, though.
The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
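A minimal sketch of the one-lane-per-result option, using plain Horner evaluation rather than the paper's schedule (so the operation count here will not match the 8 FMACs and 3 MULs quoted above). The 16-lane width is the thread's assumption, and the Taylor coefficients for exp(x) are only an illustrative stand-in for a real approximation polynomial.

```python
import math

LANES = 16  # assumed SIMD width in single-precision lanes

def horner_lanes(coeffs, xs):
    """Evaluate one polynomial at every lane's x using Horner's rule.

    coeffs are ordered from highest degree to lowest. Each loop iteration
    stands in for one vector FMAC (acc = acc * x + c) applied to all lanes.
    """
    acc = [coeffs[0]] * len(xs)
    for c in coeffs[1:]:
        acc = [a * x + c for a, x in zip(acc, xs)]
    return acc

# Illustrative coefficients: Taylor series of exp(x) up to degree 12.
coeffs = [1.0 / math.factorial(k) for k in range(12, -1, -1)]
xs = [i / LANES for i in range(LANES)]  # one independent input per lane
results = horner_lanes(coeffs, xs)
```

Every lane stays busy for every operation and no permutes are needed, at the cost of a serial chain whose length is the polynomial degree. The 16 results arrive together, so throughput is the chain latency divided by 16, as noted above.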

Quote
Perhaps this is all centred on the gather and scatter units? By their nature they have to do wide operations against memory (cache) so that average bandwidth of operands gathered/scattered is in the ballpark of ALU operand bandwidth.

Whereas G80 uses an operand window (gather window) between register file and SIMD, perhaps placing the operand window between cache and register file is the solution for Larrabee?
Some buffering is already done in the load/store units of x86s, though even one or two vector registers would be more than enough to exceed their capacity.
One of the various speculative future directions the CPU manufacturers have bandied about is an L0 operand cache.

As such hardware is on a critical signal path, port width and buffering are used carefully.

Quote
All SIMD instructions, if they run solely from/to register file, have guaranteed operand bandwidth and no gather/scatter headaches. I'm guessing this is more like how SSE uses its register file (I really don't have a good understanding of SSE implementations :oops: ).
Since SSE can have one memory operand, the hardware can draw operands from memory, register file, or the bypass network.
SSE currently has no scatter/gather headaches because it can't do scatter/gather.
Just load multiple values and shift them around to gather, or do the reverse for scatter.

Quote
Yeah that's the kind of thing I was thinking. The register file is two-way, 8x 512-bits per hardware context, with the fetches and stores running on the "idle" hardware context. These fetch/store SSE instructions would actually be executed by the gather/scatter units.
I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
If we assume the registers are sequential, the entire save/restore wouldn't even require much in the way of scatter/gather than a simple add to a base address.
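A small sketch of what "a simple add to a base address" would look like for a sequential save area. It assumes 8 registers of 512 bits (64 bytes) each and a 64-byte cache line, so that each register occupies exactly one aligned line. The function name and the save-area layout are hypothetical.

```python
LINE_BYTES = 64  # assumed cache line width, equal to one 512-bit register

def spill_addresses(base, num_regs=8, reg_bytes=64):
    """Addresses for saving num_regs sequential vector registers.

    The base must be line-aligned so that every register lands on its
    own cache line and no access straddles two lines.
    """
    if base % LINE_BYTES != 0:
        raise ValueError("save area must be cache-line aligned")
    return [base + i * reg_bytes for i in range(num_regs)]

addrs = spill_addresses(0x1000)
print([hex(a) for a in addrs])
```

With this layout no gather or scatter is needed for the switch itself: the whole context is eight consecutive aligned lines.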

Quote
Double-threading the SIMD is obviously going to complicate things
Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit.
Threads will just alternate on the issue port.

Quote
but by keeping the count of in-flight registers tiny (unlike a GPU) Intel avoids the register file explosion that we see in R600, where the register file amounts to 1MB effective (and it has at least 3 read ports it seems though I still haven't untangled whether that's 3 physical ports or 3 emulated ports).
The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time.
R600's register shenanigans frequently happen in parallel with the activity of other memory clients.
Depending on port count, the same cannot always be said for Larrabee.
If Intel goes this route, I'm wondering if I shouldn't also count R600's register ports in a count as well.

I also forgot in my previous post that writing out a thread implies writing another one in.
As such a soft switch would involve 8*512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles.
To switch back in with a thread in the L1 with a latency of 1 cycle, reading in will take 9 cycles.
Larrabee must occupy its vector unit for 17 cycles with other work.
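The cycle count above can be reproduced with a back-of-the-envelope model. All parameters are this post's assumptions (8 registers of 512 bits, one 512-bit port, 1-cycle L1 latency), not confirmed Larrabee specifications, and the function name is made up for illustration.

```python
def soft_switch_cycles(num_regs=8, reg_bits=512, port_bits=512, l1_latency=1):
    """Return (write_cycles, read_cycles, total) for one soft context switch."""
    # Port cycles needed to move one vector register through the cache port.
    cycles_per_reg = -(-reg_bits // port_bits)  # ceiling division
    write_cycles = num_regs * cycles_per_reg               # save outgoing thread
    read_cycles = num_regs * cycles_per_reg + l1_latency   # restore incoming thread
    return write_cycles, read_cycles, write_cycles + read_cycles

print(soft_switch_cycles())               # (8, 9, 17)
print(soft_switch_cycles(port_bits=256))  # (16, 17, 33)
```

The second call shows how sensitive the estimate is to port width: halving the port roughly doubles the number of cycles the vector unit has to be kept busy with other work.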

Quote
So, since D3D10 enforces virtualised state, perhaps Larrabee will dump pretty much the entire state into memory and let the caches and gather/scatter units take the strain.
The success of such a strategy depends on which is cheaper: ALUs or hardware, or memory clients.
It also depends on just where that gather/scatter hardware is, and how it is implemented.


edit:
Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will to spare the main register files.

Posted by TimothyFarrar on Wednesday, 07-May-08 16:30:06 UTC
Quoting Barbarian
How so? Why would you have a 32bit RGBA texture that is not aligned on 4bytes? Actually a lot of recent hardware aligns textures on 4Kb boundaries. That's plenty alignment.
Un-aligned loads not in terms of PC alignment, but in terms of main memory granularity, texture cache line granularity, vector granularity, and the fact that compressed textures don't technically have pixels aligned.

So if you have a vector unit which can only do SIMD aligned loads (like say the cell), texture fetch obviously needs to break vector alignment and do general non-vector aligned gather to fetch texture samples.

Interesting side question, not sure if compressed textures get kept in the texture cache compressed or uncompressed?

Posted by Jawed on Wednesday, 07-May-08 22:44:55 UTC
Quoting 3dilettante
Hopefully they stay pipelined, otherwise one thread's transcendental function is going to completely monopolize the one vector unit for quite some time.
Sorry I meant pipelined in the sense of being a single instruction rather than being calculated by a macro.
Quote
Perhaps a microcode instruction could spit out the necessary operations. Just for funzies, I tried by hand to fold that table used for the optimal scheduling of that polynomial evaluation in table 2 across multiple SIMD lanes. I haven't really gone too in depth, but it seems that the scheduling could be done with 3 vector FMACs and one potentially scalar FMAC.
Blimey! Afterwards I realised that a polynomial's terms are heavily serially dependent just because of the successive powers so it prolly doesn't split across many lanes too well. Also with double-precision computation there's a halving in effective lane count and for single-precision the computation is prolly so quick (very few terms) that it's prolly not worth the effort.
Quote
The downside is that it would require some hefty permutes between each operation, and each successive operation uses fewer and fewer lanes, so utilization plummets, the last FMAC would only use one lane. So long as each operation is pipelined, other work could be overlaid in the latency periods between each op. I'd hope the FMAC is pipelined, but would the permutes? The latency would be the sum of the FMAC and permute latencies.
By permute I presume you mean swizzle, though I suppose what you're getting at is swizzling across the entire 16 lanes, not just within groups of 4 as GPUs do. There might be an opportunity to use the 16 lanes to produce 4 transcendental results in fewer clocks than if they were produced separately in parallel on the four (x,y,z,w) sets of lanes. Apart from the approximation step, the reduction and reconstruction steps provide further opportunities to enhance utilisation on a wide SIMD. In effect using the 16-lane width of the SIMD to overlap computations for a set of 4 results.
Quote
The other option is to use precisely one lane for N different transcendentals, though this would involve burning up a vector register for each term across 16 elements. Once a value is no longer needed, its register could be reused, though the register footprint would still be wider. It does avoid the permute stuff, though. The latency then is that of 8 FMACs and 3 MULs, though the throughput would be that latency divided by 16.
So you're saying generate 16 transcendentals in parallel? I dare say its simplicity is compelling and it could well coincide with the number of objects in a batch. The issue with running so many in parallel is the 16-way duplication of the lookup tables - though I guess they're all quite small.
Quote
Some buffering is already done in the load/store units of x86s, though even one or two vector registers would be more than enough to exceed their capacity. One of the various speculative future directions the CPU manufacturers have bandied about is an L0 operand cache.
I've not heard of L0 before...
Quote
I'd almost expect the register save/restore to be aligned, since the vectors match the cache line width.
I presume you're referring to the cache line width of current SSE implementations. We don't know this for Larrabee do we?
Quote
If we assume the registers are sequential, the entire save/restore wouldn't even require much in the way of scatter/gather than a simple add to a base address.
I dare say sequential registers would only apply for simpler programs.
Quote
Not really. The SIMD can't be double-threaded, as the core can only issue one instruction to the unit. Threads will just alternate on the issue port.
I'm thinking that it increases the porting complexity because both the SIMD and the gather/scatter units are fetching and storing concurrently - although in an alternating pattern. I suppose doubling the banking would be the easiest solution.
Quote
The downside to doing this is that Intel's emulated expanded register space means even virtual shuffling of register state involves monopolizing a memory client for some time. R600's register shenanigans frequently happen in parallel with the activity of other memory clients. Depending on port count, the same cannot always be said for Larrabee. If Intel goes this route, I'm wondering if I shouldn't also count R600's register ports in a count as well.
So we get back to the question of port widths...
Quote
I also forgot in my previous post that writing out a thread implies writing another one in. As such a soft switch would involve 8*512 bits worth of writing. If I assume a physical port width of 512 bits, a single port will take 8 cycles. To switch back in with a thread in the L1 with a latency of 1 cycle, reading in will take 9 cycles. Larrabee must occupy its vector unit for 17 cycles with other work.
With the SIMD being pipelined and with any one instruction only able to consume, at most, 3 operands, thread B instructions can start issuing before thread B's register set has been fully populated. Meanwhile thread A's register set can start being written out before A has finished. OK, I know, it's hairy :razz:
Quote
The success of such a strategy depends on which is cheaper: ALUs or hardware, or memory clients. It also depends on just where that gather/scatter hardware is, and how it is implemented.
:lol: Thinking that we could be waiting 2 years to find out is, ahem, maybe not so funny.
Quote
edit: Just one comment on the transcendental thing: a lot of the register footprint would probably stick around in hidden scratch registers, or hopefully will to spare the main register files.
This is bloody tantalising: http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/4343798/4343799/04343860.pdf?tp=&isnumber=&arnumber=4343860 But it's hidden and I can't find anything else on the topic :sad: Jawed

Posted by 3dilettante on Wednesday, 07-May-08 23:26:09 UTC
Quoting Jawed
By permute I presume you mean swizzle, though I suppose what you're getting at is swizzling across the entire 16 lanes, not just within groups of 4 as GPUs do.
I went back to check what I wrote, and it's more like the AltiVec permute instructions, which would allow the unit to pick out the elements from the previous iteration's source and result registers and build the needed operands for the next step.
The combination would have to span quad lanes and also be gathered from different registers.
SSE in other x86s isn't quite as flexible in this regard, but with extra steps the same end result can be created.
Either that, or a specialized transcendental unit can quickly pick out the needed elements, since where a given value is resulted and where it must be copied would be static and could be hardwired.

Quote
So you're saying generate 16 transcendentals in parallel? I dare say its simplicity is compelling and it could well coincide with the number of objects in a batch. The issue with running so many in parallel is the 16-way duplication of the lookup tables - though I guess they're all quite small.
The entries are pretty small, and a lookup table would probably be a pretty large part of the hardware in a transcendental unit.
If a storage location the size of the L1 is 1 cycle in latency, it is possible that readying the lookup wouldn't be worse.
It might potentially add issue latency for transcendental instructions, but how many back-to-back issues would be needed?

Quote
I've not heard of L0 before...
It's been mentioned before, though it seems perpetually "in the future".
I think some fanciful accounts of what would have come after AMD's K8 included mention of it.

Quote
I presume you're referring to the cache line width of current SSE implementations. We don't know this for Larrabee do we?
The leaked Larrabee slide said the line width was 64B, just like Gesher/Sandy Bridge.

Quote
:lol: Thinking that we could be waiting 2 years to find out is, ahem, maybe not so funny.
We'd have enough time to speculatively design Larrabee several times over.


Quote
This is bloody tantalising:

http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/4343798/4343799/04343860.pdf?tp=&isnumber=&arnumber=4343860

But it's hidden and I can't find anything else on the topic :sad:

Jawed
Hmm, the architecture is outlined elsewhere as a 48-ALU design divided into 8 clusters.
Each cluster contains 3 adders and two multipliers.
That's 16 add+mul pairs, equivalent to 16 FMAC lanes.

Posted by MfA on Thursday, 08-May-08 00:48:27 UTC
The LSU is essentially already a L0 cache in present CPUs.

Posted by Scali on Thursday, 08-May-08 09:14:31 UTC
We shouldn't have to wait *that* long... Intel plans to have engineering samples out in the second half of 2008. I hope they'll have some working software on it as well, and perhaps be willing to release more info on the architecture and how they employ it for rendering.

Posted by 3dilettante on Thursday, 08-May-08 13:32:17 UTC
Quoting MfA
The LSU is essentially already a L0 cache in present CPUs.
It's not, because the entries only persist until the memory operations waiting in the buffers are retired.

If AMD thought the same way, there would have been no point in mentioning the L0 because they've had LSUs for over a decade.

