Larrabee's Rasterisation Focus Confirmed
Wednesday 23rd April 2008, 08:33:00 PM, written by TeamB3D
Tom Forsyth is currently a software engineer working for Intel on Larrabee. He previously worked at RAD Game Tools on Pixomatic (a software rasterizer) and Granny3D, as well as Microprose, 3Dlabs, and most notably Muckyfoot Productions (RIP). He is well respected throughout the industry for the high-quality insight on graphics programming techniques he posts on his blog. Last Friday, though, his post's subject was quite different:
"I've been trying to keep quiet, but I need to get one thing very clear. Larrabee is going to render DirectX and OpenGL games through rasterisation, not through raytracing.
I'm not sure how the message got so muddled. I think in our quest to just keep our heads down and get on with it, we've possibly been a bit too quiet. So some comments about exciting new rendering tech got misinterpreted as our one and only plan. [...] That has been the goal for the Larrabee team from day one, and it continues to be the primary focus of the hardware and software teams. [...]
There's no doubt Larrabee is going to be the world's most awesome raytracer. It's going to be the world's most awesome chip at a lot of heavy computing tasks - that's the joy of total programmability combined with serious number-crunching power. But that is cool stuff for those that want to play with wacky tech. We're not assuming everybody in the world will do this, we're not forcing anyone to do so, and we certainly can't just do it behind their backs and expect things to work - that would be absurd."
So, what does this actually mean for Larrabee, both technically and strategically? Look at it this way: Larrabee is a DX11 GPU whose design team took both raytracing and GPGPU into consideration from the very start, without forgetting that performance in DX10+-class games, which assume a rasteriser, would be the most important factor determining the architecture's mainstream success or failure.
There's a reason for our choice of phrasing: the exact same sentence would be just as accurate for NVIDIA's and AMD's architectures. Case in point: NVIDIA's Analyst Day 2008 dedicated a huge amount of time to GPGPU, and the company clearly indicated its dedication to non-rasterised rendering in the 2009-2010 timeframe. We suspect the same is true for AMD.
The frequent implicit assumption that DX11 GPUs will basically be DX10 GPUs with a couple of quick changes and exposed tessellation is weak. Even if the programming model itself weren't changing significantly (it is, with the IHVs providing significant input into its direction), all current indications are that the architectures themselves will differ significantly from current offerings regardless, as the IHVs tackle the problem in front of them in the best way they know how, as they've always done.
Intel is certainly coming up with an unusual architecture with Larrabee by exploiting the x86 instruction set for MIMD processing on the same core as the SIMD vector unit. And trying to achieve leading performance with barely any fixed-function units is certainly ambitious.
But fundamentally, the design principles and goals really aren't that different from those of the chips it will be competing with. It will likely be slightly more flexible than the NVIDIA and AMD alternatives, not least by making approaches such as logarithmic rasterisation acceleration possible, but it should be clearly understood that the differences may in fact not be quite as substantial as many are currently predicting.
The point is that it's not about rasterisation versus raytracing, or even x86 versus proprietary ISAs. It never was in the first place. The raytracing focus of early messaging was merely a distraction for the curious, so Intel could make some noise. Direct3D is the juggernaut, not the hardware.
"First, graphics that we have all come to know and love today, I have news for you. It's coming to an end. Our multi-decade old 3D graphics rendering architecture that's based on a rasterization approach is no longer scalable and suitable for the demands of the future." That's why the message got so muddled, Tom. And no offence, Pat, but history will prove you quite wrong.


To rephrase: one vector instruction can be issued to a vector unit. Since it seems some Intel figures have stated publicly that an FMAC instruction is available to Larrabee, we can map that to the 8-16 DP figure given in the earlier Larrabee slide, which, barring some flaky issue restrictions, indicates each core has one vector unit.
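The mapping from the FMAC instruction to the 8-16 DP figure can be sketched as simple arithmetic. This is only an illustration under stated assumptions: it reads the "8-16 DP" figure as double-precision FLOPs per clock per core, counts one FMAC as two FLOPs per lane, and the function name `dp_lanes_from_flops` is hypothetical. None of these details are confirmed by the slide.

```python
def dp_lanes_from_flops(dp_flops_per_clock: int, flops_per_fmac: int = 2) -> int:
    """Return the number of DP lanes one vector unit would need.

    Assumption: each lane retires one FMAC per clock, and an FMAC
    (multiply + add) is counted as `flops_per_fmac` FLOPs.
    """
    return dp_flops_per_clock // flops_per_fmac


if __name__ == "__main__":
    # Both ends of the assumed "8-16 DP FLOPs per clock per core" range.
    for flops in (8, 16):
        lanes = dp_lanes_from_flops(flops)
        # 64 bits per double-precision lane gives the implied vector width.
        print(f"{flops} DP FLOPs/clock -> {lanes} DP lanes ({lanes * 64}-bit vector)")
```

Under these assumptions, a single vector unit of 4 to 8 DP lanes accounts for the whole figure, which is consistent with one vector unit per core.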
Could be. None of the slides go into that. There are a number of ways that can go. It could be fully separate, or complex ops can share hardware with the FMAC unit.
8 vector registers? By soft-context-swap, you mean have each master thread emulate a context switch with successive writes?
The effectiveness of such a solution could depend on a lot of things, such as the physical port count. Unless there is a form of bulk write that can write multiple registers to memory, we're talking about a soft-context switch that will take up 8 port cycles and 8 instructions out of the core's issue bandwidth. Possibly, if there is a large load/store buffer, the successive writes can wait to commit to memory and take advantage of a wider cache port.
If we assume two fully active threads, one thread could do a switch while the other continued working, assuming an internal width of at least 2 threads-worth of resources and two data cache ports. That would require that the other active thread stay active for 8 cycles to hide this switch, or only 4 if it doesn't use any cache bandwidth at all and sticks to within the reg file. If we crudely link the vector registers' capacity to an equivalent number of 4-channel FP32 elements, that's between 16 and 32 elements, if we go by what you posited by having only 2 threads fully running.
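A rough sketch of the arithmetic behind those numbers, under stated assumptions: one store instruction per vector register with no bulk write, one register written per cache port per cycle, and a vector width of either 8 or 16 FP32 lanes. The lane counts are not given in the discussion; they are chosen here because they reproduce the 16 and 32 element figures. The function names are hypothetical.

```python
import math


def switch_cycles(num_vector_regs: int, ports_for_switch: int) -> int:
    """Cycles needed to spill every vector register to memory.

    Assumption: each data cache port accepts one full register per cycle.
    """
    return math.ceil(num_vector_regs / ports_for_switch)


def vec4_capacity(num_vector_regs: int, fp32_lanes: int) -> int:
    """Register file capacity expressed in 4-channel FP32 elements."""
    return num_vector_regs * fp32_lanes // 4


if __name__ == "__main__":
    regs = 8  # vector register count posited in the discussion
    print("one port available:", switch_cycles(regs, 1), "cycles")
    print("both ports available:", switch_cycles(regs, 2), "cycles")
    for lanes in (8, 16):  # assumed FP32 lanes per register
        print(f"{lanes} lanes:", vec4_capacity(regs, lanes), "vec4 elements")
```

With one port the switch takes 8 cycles; if the other thread leaves both ports free it drops to 4, matching the figures above.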