Intel on data-parallel languages and raytracing

Monday 28th May 2007, 09:09:00 PM, written by Tim

Intel is currently researching data-parallel languages, reports EETimes. These languages would most likely target massively parallel architectures, such as Larrabee and Terascale, putting Intel in direct competition with NVIDIA and AMD in the GPGPU market. In addition, Intel is developing parallel languages for specific areas, such as TCP/IP processing.

Another intriguing tidbit reveals that Intel is developing "applications in areas known for being highly parallel such as recognition and game graphics." Presumably, this is for Larrabee, although that chip's specific function is still unknown. Jon Stokes from Ars Technica has claimed that Larrabee will use a combination of ray tracing and rasterization, presumably using rasterization for primary rays and ray tracing for secondary rays (as put forth in a recent paper from Stanford). EETimes' comments do suggest that Larrabee will target the GPU market as well as the GPGPU market; of course, this contradicts earlier reports that the first iteration of Larrabee will only be targeted at the GPGPU market. However, the timeframe for Larrabee remains unknown, and it remains to be seen whether these research projects are even exclusively aimed at the Larrabee architecture, or whether they are more general in nature.

Latest Thread Comments (15 total)
Posted by Tim Murray on Tuesday, 29-May-07 21:55:39 UTC
Quoting 3dilettante
Or HT wasn't really all that great, and that using the P4 to spearhead the usage of SMT was not the best way to get people to multithread their code.
The problem isn't with HT, though, is it? Isn't it more of a problem with SMT slaughtering cache coherence in general? I remember benchmarks that showed that HT should be disabled on any machine running... Apache, I think? because performance tanked when the number of cache hits decreased enormously.

I am waiting to see just what the graphics-oriented things they're working on are, though. A hybrid raytracer/rasterizer is all well and good, but it will never, ever catch on unless you have a fantastic API that is wonderful for developers to use. And then you need a killer app...

Posted by Gubbi on Wednesday, 30-May-07 11:58:53 UTC
Quoting Tim Murray
The problem isn't with HT, though, is it? Isn't it more of a problem with SMT slaughtering cache coherence in general? I remember benchmarks that showed that HT should be disabled on any machine running... Apache, I think? because performance tanked when the number of cache hits decreased enormously.
That was mostly for Northwood P4s, which suffered heavily from trace cache and D$ thrashing. Prescott added a lot of measures to better support SMT: 4x bigger D$ with higher associativity, *two* trace caches, one for each active context, and many more registers. Of course, all these measures were negated in large part by Prescott's basic performance parameters: twice the D$ load-to-use latency (2 cycles to 4) and the much longer pipeline and associated mispredict latency (although the branch predictor in Prescott is better).

SMT on OOO processors looks like it's dead. In essence, you have to enhance your I and D caches, have a scheduler for each thread (because one big one holding all instructions in flight is simply too slow), and provide architected registers for each thread. All these structures become slower, so you introduce pipestages (or run at a lower clock), which impacts single-thread performance. In the end, the large effort buys better utilization of the execution units, which in a modern CPU take up less than 25% (including massive SIMD FP units). Better to just replicate the entire core and get a guaranteed 2x speedup on independent threads.

Cheers

Posted by 3dilettante on Thursday, 31-May-07 13:34:20 UTC
Quoting Tim Murray
The problem isn't with HT, though, is it? Isn't it more of a problem with SMT slaughtering cache coherence in general? I remember benchmarks that showed that HT should be disabled on any machine running... Apache, I think? because performance tanked when the number of cache hits decreased enormously.
SMT was used successfully in IBM's POWER5. It wasn't for every workload, but it didn't have as many adverse effects as HT did for Netburst.

IBM's method was more flexible, with software-controlled priority levels as well as hardware mechanisms that balanced overall instruction flow to keep a stalled thread from getting in the way of other threads.

POWER5 was also a wider design than Netburst, it had larger caches, and it wasn't as aggressively speculative as P4.
IBM had more spare units that could be used, and it didn't fill its queues as readily with instructions that would have to be replayed.

There were a number of things that were characterized as glass jaws for the Netburst architecture. One big one was its highly speculative instruction scheduling and replay mechanism.

The design's long pipeline and emphasis on speculation made it so that the chip would issue an instruction several cycles before it was known if a cache access would hit.
It is usually the case that cache misses happen more often with SMT, and P4's smaller caches tended to feel the impact more than most.
Since many instructions would be issued incorrectly, the P4's replay mechanism would loop the instructions back into the pipeline on every cache miss and access to memory.
Not only that, but it was possible for the replay loop to get clogged by multiple replays during long dependency chains.

For single-threaded performance, replay could be a headache, since the P4 would sometimes for no visible reason take hundreds or thousands of cycles to complete a simple stretch of code.
For SMT, the massive amount of speculation consumed limited resources on a rather narrow core.

SMT should have filled in stall cycles in the long P4 pipeline. The problem was that it had competition from the replay mechanism, which also tried to fill stall cycles. In pathological cases, the replay mechanism would fill stall cycles with instructions that inevitably stalled again and again.
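
The cost of replay can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the actual Netburst hardware: it assumes a single dependent instruction that is issued at cycle 0 on the guess that its load hit the cache, and that goes around a replay loop of fixed length every time the data turns out not to be ready. The function names and all cycle counts are made-up assumptions for illustration only.

```python
def issues_until_complete(data_ready_cycle, replay_loop_cycles):
    """Count how many times one dependent op is issued before it succeeds.

    Toy assumption: the op is first issued at cycle 0 as if its load had
    hit the cache. If the data is not ready yet, the op circulates through
    a replay loop and is issued again replay_loop_cycles later.
    """
    cycle = 0
    issues = 1  # the initial speculative issue
    while cycle < data_ready_cycle:
        cycle += replay_loop_cycles
        issues += 1
    return issues


def wasted_issue_slots(data_ready_cycle, replay_loop_cycles):
    """Issue slots spent on attempts that failed (every issue but the last)."""
    return issues_until_complete(data_ready_cycle, replay_loop_cycles) - 1


if __name__ == "__main__":
    # Hypothetical numbers: a 7-cycle replay loop and three data latencies.
    for latency in (0, 7, 100):
        print(latency, wasted_issue_slots(latency, 7))
```

In this model a cache hit wastes nothing, while a long miss makes the same instruction occupy an issue slot over and over. On an SMT core, each of those slots is one that the second thread could otherwise have used, which is the competition between replay and SMT described above.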

Prescott theoretically improved its threading resources and replay mechanism.
It was not enough to correct for the thermal ceiling that killed its clock scaling.

edit:
There is an interesting article on the replay mechanism on xbit:

http://www.xbitlabs.com/articles/cpu/display/replay.html

There's a section on its influence on hyperthreading that also includes a comparison between Prescott and Northwood that shows how much Prescott improved HT.

Posted by INKster on Friday, 01-Jun-07 14:43:34 UTC
Huh ?!? (http://www.tgdaily.com/content/view/32282/118/) :shock:

"Larrabee" as a joint Intel-Nvidia effort.
I wonder what this would mean for the future "Fusion" products (beyond the initially "simple" IGP/CPU on the same die)...

Posted by Geo on Friday, 01-Jun-07 17:35:58 UTC
I'm not taking that one to the bank yet. I can see why Intel would like Nvidia's participation. . .it's less clear to me why Nvidia would want to play ball unless there's a pretty sizeable revenue/royalty stream associated with it for them. Would Intel make that kind of deal? Doesn't seem in character for them.

Posted by INKster on Friday, 01-Jun-07 17:43:35 UTC
Quoting Geo
I'm not taking that one to the bank yet. I can see why Intel would like Nvidia's participation. . .it's less clear to me why Nvidia would want to play ball unless there's a pretty sizeable revenue/royalty stream associated with it. Would Intel make that kind of deal? Doesn't seem in character for them.
Extraordinary circumstances call for extraordinary partnerships, and the AMD/ATI merger was certainly one of them. Hannibal, in his late April article about "Larrabee" at arstechnica.com, shared some insider info about it which I found a bit... suspicious, given the constant reference to the G80 architecture. This could be why Intel "named certain names", excluding the R600 (aside from the fact that they are competitors, Intel could have used the AMD "Fusion" project as a bashing bullet on "why our solution is better than theirs"). Let's wait and see.

Posted by Geo on Friday, 01-Jun-07 17:46:31 UTC
Oh, I'm not ruling it out. I'm just saying when I read it I didn't exactly go "Ah, of course. . ."

Posted by Panajev2001a on Friday, 01-Jun-07 18:08:47 UTC
Quoting Arun Demeure
I'm not arguing that a number of apps won't benefit from multithreading. They obviously will. Games will benefit massively - that's just a matter of time, and a large number of non-mass-market applications will. That doesn't justify the purchase of an octo-core CPU for Joe Consumer though, and the problem is that I fail to see not what justifies it today, but what justifies it in 5+ years when it will have become a commodity. The only interesting emerging workloads that might become more important (such as voice recognition) seem to benefit more from throughput cores (or even GPU cores) than CPU cores.
Arun, many do agree that not a lot of everyday applications are going to benefit from having faster and faster cores and more of these cores running in parallel, but this might be true only as far as each application taken alone is concerned.

A lot of people, intentionally or not (tons of programs installed as start-up programs that do work while the PC is idle or that steal a few cycles here and there), are running more and more programs/processes in parallel and multi-core systems for users such as myself do benefit from multiple cores as the whole environment feels more responsive.

Posted by nutball on Friday, 01-Jun-07 18:22:49 UTC
Quoting Panajev2001a
A lot of people, intentionally or not (tons of programs installed as start-up programs that do work while the PC is idle or that steal a few cycles here and there), are running more and more programs/processes in parallel and multi-core systems for users such as myself do benefit from multiple cores as the whole environment feels more responsive.
This is an oft stated argument. It scales to ... maybe two cores on the desktop for a typical user. Four cores tops, but not for a typical user. It's not really a good justification for 8 or 16 CPU cores becoming the default option when buying a PC (unless our favourite operating system vendor can come up with new and even more pretty ways to waste our computing resources for us).

Posted by AlStrong on Friday, 01-Jun-07 18:23:16 UTC
Quoting Panajev2001a
A lot of people, intentionally or not (tons of programs installed as start-up programs that do work while the PC is idle or that steal a few cycles here and there), are running more and more programs/processes in parallel and multi-core systems for users such as myself do benefit from multiple cores as the whole environment feels more responsive.
Even so, I find the hard drives to be quite limiting for carrying out multiple tasks despite dual core. I suppose ideally, you'd have multiple programs and multiple hard drives from which to do those tasks. But there's too much (thrashing?) conflicting use of the single hard drive that is all many computers have (e.g. laptops or other template-built computers from Dell or HP for instance).

