Larrabee and Intel's acquisition of Neoptica

Wednesday 28th November 2007, 12:12:00 PM, written by Arun

On October 19th, Neoptica was acquired by Intel in relation to the Larrabee project, but the news only broke on several websites in the last 2 days. We take a quick look at what Intel bought, and why, in this analysis piece.

Neoptica's Employees

  • 8 employees (including the two co-founders) according to Neoptica's official website.
  • 3 have a background with NVIDIA's Software Architecture group: Matt Pharr (editor of GPU Gems 2) and Craig Kolb, who were also Exluna co-founders, and Geoff Berry. Tim Foley also worked there as an intern.
  • 2 are ex-Electronic Arts employees: Jean-Luc Duprat (who also worked at Dreamworks Feature Animation) and Paul Lalonde.
  • Nat Duca comes from Sony, where he led the development of the RSX tool suite and software development partnerships.
  • Aaron Lefohn comes from Pixar, where he worked on GPU acceleration for rendering and interactive film preview.
  • Pat Hanrahan was also on the technical advisory board. He used to work at Pixar, where he was the chief architect of the RenderMan Interface protocol. His PhD students were also responsible for the creation of both Brook and CUDA.

Neoptica's Vision

  • Neoptica published a technical whitepaper back in March 2007. Matt Pharr also gave a presentation at Graphics Hardware 2006 that highlights some similar points.
  • It explained their perspective on the limitations of current programmable shading and their vision of the future, which they call "programmable graphics". Much of their argument rests on the value of 'irregular algorithms' and the GPU's inability to construct complex data structures on its own.
  • They argue that a faster link to the CPU is thus a key requirement, with efficient parallelism and collaboration between the two. Only the PS3 allows this today.
  • They further claim that the capability to perform many round-trips between the CPU and the GPU every frame could make new algorithms possible and improve efficiency. They plead for the demise of the unidirectional rendering pipeline; a rough sketch of the round-trip pattern they have in mind follows this list.
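
To make the round-trip argument concrete, here is a minimal C++ sketch of the pattern described above: the CPU refines an acceleration structure between GPU passes, several times per frame. This is not Neoptica's code; every type and function below is a hypothetical stub, with the GPU passes faked on the CPU.

```cpp
// Minimal sketch of the "many CPU<->GPU round-trips per frame" pattern the
// whitepaper argues for. All names are hypothetical stubs for illustration.

#include <cstdio>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };
struct Hit { int triangle; float t; };

// --- hypothetical GPU-side passes (stubbed out for illustration) ------------
std::vector<Ray> gpu_generate_secondary_rays(int count) { return std::vector<Ray>(count); }
std::vector<Hit> gpu_trace_rays(const std::vector<Ray>& rays) { return std::vector<Hit>(rays.size()); }
void gpu_shade_and_present(const std::vector<Hit>& hits) { std::printf("shaded %zu hits\n", hits.size()); }

// --- hypothetical CPU-side work: the "irregular" part -----------------------
// Building or refining a complex data structure (a BVH, say) is awkward in a
// unidirectional shader pipeline, but cheap to do on the CPU between passes.
void cpu_refine_acceleration_structure(const std::vector<Hit>& previous_hits) {
    (void)previous_hits;  // e.g. re-fit the BVH around what the last pass actually touched
}

int main() {
    std::vector<Hit> hits;                        // feedback from the previous GPU pass
    for (int frame = 0; frame < 3; ++frame) {
        cpu_refine_acceleration_structure(hits);  // GPU -> CPU round-trip
        std::vector<Ray> rays = gpu_generate_secondary_rays(1024);
        hits = gpu_trace_rays(rays);              // CPU -> GPU -> CPU round-trip
        gpu_shade_and_present(hits);              // final GPU pass for this frame
    }
}
```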

Neoptica's Proposed Solution

  • Neoptica claims to have developed a deadlock-free high-level API that abstracts the concurrency between multiple CPUs and GPUs, even though the two are programmed in C/C++ and Cg/HLSL respectively.
  • These systems "deliver imagery that is impossible using the traditional hardware rendering pipeline, and deliver 10x to 50x speedups of existing GPU-only approaches."
  • Of course, the claimed speed-up likely applies to algorithms that simply don't fit the GPU's architecture, so it would be more accurate to say a traditional renderer couldn't work that way at all than to claim huge performance benefits.
  • Given that only the PS3 and, to a lesser extent, the Xbox 360 have a wide bus between the CPU and the GPU today, we would tend to believe next-generation consoles were their original intended market for this API.
  • Of course, Neoptica doesn't magically change these consoles' capabilities. The advantage of their solution would be to make exotic and hybrid renderers that benefit from both processors (using CELL to approximate high-quality ambient occlusion, for example) much easier to develop; a guess at what such an API might look like follows this list.
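
Neoptica never published the API itself, so the following is nothing more than a guess at its general shape: a per-frame task graph in which CPU work (C/C++) and GPU passes (Cg/HLSL) are declared with explicit dependencies and the runtime handles the scheduling. All names and semantics below are invented for illustration.

```cpp
// Invented sketch of what a CPU/GPU task-graph API *might* look like; Neoptica
// never published theirs, so all names and semantics here are guesses. CPU work
// is plain C++, GPU entries would wrap Cg/HLSL passes in a real system.

#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::vector<std::string> dependencies;  // tasks that must complete first
    std::function<void()> work;             // CPU code, or a stub that kicks off a GPU pass
};

class HybridFrameGraph {
public:
    void add(Task t) { tasks_.push_back(std::move(t)); }

    // A real runtime would schedule independent CPU and GPU tasks concurrently
    // and derive all synchronisation from the declared dependencies (presumably
    // the source of the deadlock-freedom claim); this toy simply runs tasks in
    // insertion order.
    void execute_frame() const {
        for (const Task& t : tasks_) {
            std::printf("running %s\n", t.name.c_str());
            t.work();
        }
    }

private:
    std::vector<Task> tasks_;
};

int main() {
    HybridFrameGraph frame;
    frame.add({"cpu_ambient_occlusion", {}, [] { /* CELL/CPU-side AO approximation */ }});
    frame.add({"gpu_gbuffer", {}, [] { /* launch the Cg/HLSL geometry pass */ }});
    frame.add({"gpu_final_shading",
               {"cpu_ambient_occlusion", "gpu_gbuffer"},
               [] { /* combine the CPU AO result with the G-buffer */ }});
    frame.execute_frame();
}
```

The appeal of declaring dependencies rather than writing synchronisation by hand is that the developer never touches locks or fences, which is presumably how Neoptica can promise deadlock freedom while still overlapping CPU and GPU work.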

Intel's Larrabee

  • Larrabee is in several ways (but not all) a solution looking for a problem. While Intel's upcoming architecture might have inherent advantages in the GPGPU market and parts of the workstation segment, it wouldn't be so hot as a DX10 accelerator.
  • In the consumer market, the short-term Larrabee strategy seems to be to add value rather than to try to replace traditional GPUs. This could be accomplished, for example, through physics acceleration, which ties in with the Havok acquisition.
  • Unlike PhysX, however, Larrabee is fully programmable through a standard ISA. This makes it possible to add more value and possibly accelerate some algorithms GPUs cannot yet handle, thus improving overall visual quality.
  • In the GPGPU market, things will be very different, however, and there is some potential for the OpenGL workstation market too. We'll see whether the rumours about the latter turn out to be true.
  • The short-term consumer strategy seems to position Larrabee as a Tri- and Quad-SLI/Crossfire competitor, rather than a real GPU competitor. But Intel's ambitions don't stop there.
  • While Larrabee is unlikely to excel in DX10 titles (and thus not be cost-competitive for such a market), its unique architecture does give it advantages in more exotic algorithms, ones that don't make sense in DX10 or even DX11. That (and GPGPU) is likely where Intel sees potential.

Intel's Reasoning for Neoptica

  • Great researchers are always nice to have on your side, and Intel would probably love to have a next-generation gaming console contract. Neoptica's expertise in CPU-GPU collaboration is very valuable there.
  • Intel's "GPU" strategy seems to be based around reducing the importance of next-generation graphics APIs, including DX11. Their inherent advantage is greater flexibility, but their disadvantage is lower performance for current workloads. This must have made Neoptica's claims music to their ears.
  • Furthermore, several of Neoptica's employees have experience in offline rendering. Even if Larrabee didn't work out for real-time rendering, it might become a very attractive solution for the Pixars and Dreamworks of the world due to its combination of high performance and (nearly) infinite flexibility.
  • Overall, if Intel is to stand a chance of conquering the world of programmable graphics in the next 5 years, they need an intuitive API that abstracts the details while reminding developers how inflexible some parts of the GPU pipeline remain, both today and in DirectX 11. Neoptica's employees and expertise certainly can't hurt there.

While we remain highly skeptical of Intel's short-term and mid-term prospects for Larrabee in the consumer market, the Neoptica and Havok acquisitions seem to be splendid decisions to us, as they both potentially expand Larrabee's target market and reduce risk. In addition, there is always the possibility that, as much as Intel loves the tech, they also love the instant 'street cred' they get in the graphics world from picking up that group of engineers.

We look forward to the coming years as these events and many others start making their impact.


Latest Thread Comments (48 total)
Posted by ShaidarHaran on Saturday, 01-Dec-07 18:09:43 UTC
Quoting nAo
Not really relevant, how does this lower the need for larger caches?
Assuming SoEMT, it's just as hoho said. In the case of a cache miss, a core can simply switch to another thread until the data for the previous thread is retrieved.

A side-effect of this is that smaller caches could theoretically be used. Otherwise, it can make a CPU core with a medium-sized cache deliver the performance of a core with a larger cache. This all assumes a multi-thread friendly environment, of course.
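
To make that concrete, here's a toy sketch of the switching behaviour (purely illustrative, with made-up latencies; nothing like a real core's scheduler):

```cpp
// Toy model of switch-on-event multithreading (SoEMT): one "thread" runs until
// it hits a simulated cache miss, then the core switches to the other thread
// instead of sitting idle while the miss is serviced.

#include <cstdio>
#include <vector>

struct ThreadCtx {
    int id;
    int progress = 0;      // "instructions" retired by this thread
    int stall_until = 0;   // cycle at which its pending miss resolves
};

int main() {
    std::vector<ThreadCtx> threads = {{0}, {1}};
    int current = 0;

    for (int cycle = 0; cycle < 64; ++cycle) {
        ThreadCtx& t = threads[current];
        if (cycle < t.stall_until) {   // still waiting on memory:
            current = 1 - current;     // switch to the other thread
            continue;
        }
        ++t.progress;                  // execute one instruction
        if (t.progress % 4 == 0) {     // pretend every 4th access misses...
            t.stall_until = cycle + 8; // ...with ~8 cycles of latency to hide
            current = 1 - current;     // switch-on-event rather than stalling
        }
    }
    for (const auto& t : threads)
        std::printf("thread %d retired %d instructions\n", t.id, t.progress);
}
```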

Posted by Arun on Saturday, 01-Dec-07 18:19:47 UTC
Uhm, I think you're both thinking of two different advantages of caches: saving memory bandwidth and hiding memory latency. SoEMT will reduce the importance of caches for hiding memory latency, but may actually increase their importance in terms of saving bandwidth, as thrashing may go up. The same principles also apply to GPUs.

Posted by nAo on Saturday, 01-Dec-07 18:23:38 UTC
Quoting ShaidarHaran
Assuming SoEMT, it's just as hoho said. In the case of a cache miss, a core can simply switch to another thread until the data for the previous thread is retrieved.
The other thread will *at best* re-use the same data the first thread was using. This means that it can potentially evict data from the cache that the first thread is going to need in the near future, at least in a worst case scenario.
In the general case 2 threads working on different data sets will require a larger cache, not a smaller one.
Quote
A side-effect of this is that smaller caches could theoretically be used.
Again, how can a second thread increase the efficiency of your cache so that you might end up using a smaller one?

Posted by Arun on Saturday, 01-Dec-07 18:28:34 UTC
I guess the question is how often you're bandwidth-limited vs how often you're idling because of memory latency. For the quad-core Nehalem though, you're looking at a 192-bit DDR3 IMC... So I don't think latency is too much of a concern for once! ;)

Posted by nAo on Saturday, 01-Dec-07 18:30:45 UTC
Quoting Arun
Uhm, I think you're both thinking of two different advantages of caches: saving memory bandwidth and hiding memory latency. SoEMT will reduce the importance of caches for hiding memory latency, but may actually increase their importance in terms of saving bandwidth, as thrashing may go up. The same principles also apply to GPUs.
I see your point, but SoEMT is mostly a cheap way to put idle units to some use while paying a fairly small cost for it; it's not an opportunity to constantly thrash your caches hoping that your improved ability to hide memory access latency will save your arse, not in the general case for sure. Unless you never re-use your data, but then you don't need a cache in the first place.

Posted by Silent_Buddha on Saturday, 01-Dec-07 19:02:10 UTC
Add to that, does even a 4-core Penryn come even remotely close to the amount of math a years-old R580 can do?

If Folding at Home is any indication of relative real-world workloads and potential physics performance, then it'll still be years (Larrabee, perhaps?) until Intel matches R580, much less anything newer.

Then again I'm not an expert in this area so I could be attributing far too much importance to GPU performance in FAH.

And the amount of physics used in Crysis is indeed orders of magnitude higher than in Farcry. It's not just a simple evolutionary rise in usage. The complexity of the calculations might have only gone up slightly, but the sheer number of calculations going on in any given scene absolutely dwarfs those used in Farcry.

It can, of course, be argued that games don't need that level of physics nor that number of calculations per scene. But I would argue it goes a long way towards immersion and the all-important WOW factor. I'll be a sad buddha if the trend reverts to doing less because it isn't "needed."

Regards,
SB

Posted by Nick on Monday, 03-Dec-07 01:13:18 UTC
Arun, your arguments are correct but you're making the general mistake of only looking at the high-end. The average system sold today does not come with a GeForce 8800, but it does have dual-core. In fact more than half of all systems are laptops, so 200+ Watt GPU's will never become the norm. Lots of people, including occasional gamers, are even content with integrated graphics or a low-end card. Sure we'll see multi-teraflop GPU's in the not too distant future, but we'll see the average system equipped with quad-cores sooner.

So basically the average system has a potent CPU, but a modest GPU. For games this means that the GPU should be used only for what it's most efficient at: graphics. CPU's are more interesting for everything else. I come to the same conclusion when looking at GPGPU applications. Only with high-end GPU's do they achieve a good speedup, but actual work throughput versus available GFLOPS is often laughable.

So I believe that the arrival of teraflop GPU's will have an insignificant effect on the current balance. Advanced multi-core CPU's on the other hand will be adopted relatively quickly, making it ever more interesting not to offload things like physics anywhere else. The megahertz race is over, but the multi-core race has just begun and has some catching up to do.

From an architectural point of view, GPU's have only three limited types of data access:
- Registers: very fast, but not suited for storing actual data structures.
- Texture cache: very important to reduce texture sampling bandwidth. Close to useless for other access patterns.
- RAM: high bandwidth but high latency. Compression techniques only useful for graphics.

For CPU's this becomes:
- Registers: extremely fast and since x64 no longer a big performance limiter.
- L1 cache: very fast and practically an extension of the register set. Suited for holding actual data sets.
- L2 cache: a bit slower but can hold the major part of the working set.
- RAM: not that high bandwidth, but still tons of potential when multi-core increases the need.

So we'd have to see major changes in GPU architecture to make them more suited for non-graphics tasks, likely affecting their graphics performance. That might be ok for the high-end but the mid- and low-end have no excess performance for anything else. The CPU on the other hand is already well on its way to be able to handle larger workloads, and effectively ending up in every system.

Posted by nAo on Monday, 03-Dec-07 02:35:29 UTC
Quoting Nick
From an architectural point of view, GPU's have only three limited types of data access:
- Registers: very fast, but not suited for storing actual data structures.
- Texture cache: very important to reduce texture sampling bandwidth. Close to useless for other access patterns.
- RAM: high bandwidth but high latency. Compression techniques only useful for graphics.
DX10 GPUs have constant buffers and associated caches.
G80 also exposes a fast on-chip memory through CUDA.

Posted by Arun on Monday, 03-Dec-07 08:43:54 UTC
Quoting Nick
Arun, your arguments are correct but you're making the general mistake of only looking at the high-end. The average system sold today does not come with a GeForce 8800, but it does have dual-core.
And they'll play old games that already use the CPU for physics anyway. Your argument only stands if this dynamic remains true going forward, which it very possibly won't IMO (more below).

Quote
In fact more than half of all systems are laptops
I don't know where you get your stats, but I suspect you should reconsider your source. Even the most optimistic estimates don't expect that to happen before 2010.

Quote
Lots of people, including occasional gamers, are even content with integrated graphics or a low-end card.
There's the gaming market, and then there's the gaming market. I could have a lot of fun playing casual games and 3-5 years old games, but that's not what I'm doing. There will ALWAYS be a market for games that run on IGPs. I say IGPs because low-end cards will die within the next 2 years, as they'll become essentially senseless: if you look at AMD's and NVIDIA's upcoming DX10 IGPs, they're practically good enough for Windows 7 and for a few years after that. All you'll see after that are incremental increases in performance & video decoding quality, imo.

But just because there is a segment for games on what will be a $300 PC market doesn't mean there won't be a market for games above that; and as has traditionally been the case, these two will be completely separate.

Quote
Sure we'll see multi-teraflop GPU's in the not too distant future, but we'll see the average system equipped with quad-cores sooner.
I have massive doubts that more than 2-3 cores makes sense in the 'low-end commodity PC market'. If a game aims at the low-end, it should aim at that, not at some artificial segment with average performance that nobody really fits in. Anyway, that's arguable, but on to the next point now...

Quote
So basically the average system has a potent CPU, but a modest GPU. For games this means that the GPU should be used only for what it's most efficient at: graphics.
You assume this to remain true: once again, it will not. Rather than looking at the present, it might be a good idea to try and look at the future instead. In the 2010-2011 timeframe (32nm), dual-cores will remain widely available for the low-end of the market. These will be paired with G86-level graphics performance in the ultra-low-end, with probably a higher ALU:TEX ratio. So in that segment of the market, you'll see maybe 200GFlops on the GPU and 75GFlops on the CPU.

And indeed, I can't really imagine any circumstance where offloading the physics to the GPU makes sense there, but GFlops still aren't massively in the CPU's favor and this market would mostly play casual and old games; amusingly, given the performance of current GPUs in DX10 games, they would presumably still play DX9 games!

Now, look at another segment of the market: $120 CPU, $120 GPU, $60 Chipset. In 2010-2011, that would probably correspond to a 3GHz+ quad-core on the CPU side of things (with a higher IPC than Penryn). That's about 150GFlops, maybe. On the GPU side of things, however, you'll easily have more than 1TFlop: just take RV670, which manages 500GFlops easily on 55nm at 190mm2. It's not exactly hard to predict where things will go with 40nm and 32nm...

Quote
Only with high-end GPU's do they achieve a good speedup, but actual work throughput versus available GFLOPS is often laughable.
Uhm, that's just wrong. In apps that make sense for their current architecture, and there are *plenty* of them, the efficiency in terms of either GFlops or bandwidth (whichever is the bottleneck) is perfectly fine.

Quote
The megahertz race is over, but the multi-core race has just begun and has some catching up to do.
Yes, it has a lot of catching up to do in terms of, as you kinda said yourself, politics and hype. You don't seem to realise that when predicting the future configurations of PCs (i.e. are most consumers going to go with a $300 CPU with a $150 GPU, or a $100 CPU with a $350 GPU?) what matters is what decisions the developers take. If the CPU is never the bottleneck, then why would you want more than a $100 CPU anyway?

If physics acceleration on the GPU doesn't take off, then obviously you'll want more than a $100 CPU. But if it does, then who knows - and that's why NVIDIA and ATI are so interested in it. They want to increase their ASPs at the CPU's expense, and there is no fundamental reason why they cannot succeed. It's all about their execution against Intel's.

And if what happens is that GPUs capture more out of a PC's ASPs, then you're looking at $100 CPUs being paired with $500 GPUs. Heck, as I said I'm already a pioneer in that category - slightly overclocked E4300 ($150) with a $600 GPU. The difference in GFlops between the two is kind of laughable, really, and in this case GPU Physics would clearly make sense. *That* is the dynamic that NVIDIA and AMD are trying to encourage, and that's why it's a political question, not really a technical one (although perf/watt for CPUs vs GPUs for physics also matters).

I might be right, or you might be right, but we aren't personally handling any of these companies' mid-term strategies, so I wouldn't dare claim anything with absolute certainty given that it's not even really a technical debate from my POV. I do agree that a 100GFlops CPU is just fine for very nice physics, but I do not believe that closes the debate either.

Quote
So we'd have to see major changes in GPU architecture to make them more suited for non-graphics tasks, likely affecting their graphics performance.
Those are already happening and barely affecting graphics performance, as they are mostly minor things and the major things can easily be reused for graphics. There is no fundamental reason why this will not keep happening.

_P.S._: I just thought I'd point out that I do NOT consider Larrabee to be a CPU here, but that obviously it might be a very interesting target for GPGPU/Physics. A heterogeneous chip with Sandy Bridge and Larrabee cores on 32nm, if the ISA takes off, ought to be much better than a GPU for Physics and other similar workloads either way (assuming there's enough memory bandwidth). So this is another important dynamic of course, and arguably much nearer to the discussion subject of this thread... :)

Posted by 3dilettante on Tuesday, 04-Dec-07 00:14:11 UTC
Quoting nAo
I see your point, but SoEMT is mostly a cheap way to put idle units to some use while paying a fairly small cost for it; it's not an opportunity to constantly thrash your caches hoping that your improved ability to hide memory access latency will save your arse, not in the general case for sure.
Unless you never re-use your data, but then you don't need a cache in the first place.
Wouldn't fine-grained round-robin or a hybrid scheme like Niagara be more effective?

SoEMT is rather pessimistic about data sharing and coherent thread behavior.
It's suited to long-latency events where speculation within the same thread is mostly pointless, but it also assumes that there's a somewhat limited amount of non-speculative work in other threads, which is why it will stick with a thread for quite a stretch between events.

If the workload has massive amounts of non-speculative work available in other threads with a high likelihood that they are working in the same place, why bother running with just one thread many instructions ahead of the pack when each instruction taken increases the chance of tripping up one of the other threads through a cache invalidation?

