The Khronos Group announce Heterogeneous Computing Initiative

Tuesday 17th June 2008, 11:15:00 PM, written by Rys

The Khronos Group, an organised collection of interested parties that collaborate to push and develop open standards for certain classes of computing, have announced what they call the Heterogeneous Computing Initiative.

The hand-waving impetus behind the initiative is to create a set of open standards around task- and data-parallel computing, via what they're calling the Compute Working Group. The heterogeneous part comes from executing the same code on both CPUs and GPUs.

The Compute Working Group has initial membership from the likes of Intel, AMD, NVIDIA, ImgTec, Nokia, Apple, 3DLabs, ARM, Freescale, TI and Qualcomm (we highlighted the discrete and embedded GPU guys on purpose, sorry to those on the list in the PR that we've not namechecked here).  OpenCL is laid out as the focal point of the group for the time being, as the industry gets behind it, and representatives of the companies mentioned will be at SIGGRAPH to talk about the Compute Working Group, OpenCL and related topics.

The announcement is a big deal, and we'll cover it in a bit more depth in the near future.  Until then, you can read the press release, and the forums are getting warmed up with discussion as you read this.


Latest Thread Comments (36 total)
Posted by Arun on Wednesday, 25-Jun-08 01:19:58 UTC
CUDA is certainly going to evolve, if only because of changes in their DX11 hardware architecture and the fact that individual developers & consumer apps will become even more important in the future (and those have much lower complexity tolerance). However, I don't think it's really necessary to completely hide all of those implementation details; just a layered API with a higher level of abstraction would do the trick. Hitting the right sweet spot for it may be difficult, however.

Posted by MfA on Wednesday, 25-Jun-08 11:39:35 UTC
Quoting nAo
(who cares about number of warps/blocks/grids/wavefront/whatever..)
Anyone using local/shared storage?

Posted by Andrew Lauritzen on Wednesday, 25-Jun-08 15:54:53 UTC
The problem with CUDA IMHO is that it's a little too specific to the G80/92/T200 architecture. It doesn't map naturally to other architectures with different memory hierarchies and although it can be "made to work", something a bit more abstract is needed for a standard that is meant to be targeted to a wide range of parallel processors with varying memory hierarchies.

The other problem with CUDA is that it's just too damn hard to make it fast/optimal ;) This is more a problem with the complexity of the underlying hardware than the language itself, but the point remains that the language does nothing to prevent you from seriously shooting yourself in the foot, which is never a good thing. As it stands, even simple problems require highly non-linear optimization and machine-learning style optimization algorithms (http://www.crhc.uiuc.edu/IMPACT/ftp/conference/cgo-08-ryoo.pdf) to even approach 50% of peak performance. There are just too many variables that affect performance in highly non-linear ways for us mere mortals to get right ;)

Now the above is just a tough problem with parallel programming and complex architectures in general, but it begs the question as to whether we need to be specifying algorithms in something a bit more general and tunable than CUDA, and then the backend/compilers can handle the heavy-lifting as far as optimization and targeting to a specific memory model go.

Anyways there are certainly many interesting topics moving forward, and it will be fascinating to see what falls out of OpenCL and similar initiatives (DX compute shaders, etc).

Posted by TimothyFarrar on Wednesday, 25-Jun-08 16:46:00 UTC
Great paper BTW. Interesting that the difference between worst and peak is only 235%.

As for peak performance, I'm assuming you are referring to ALU utilization? How many graphics programs ever reach peak ALU performance? The point being that it is always tough to reach peak performance on any platform, and in all cases you have to have intimate hardware knowledge to tune (or engineer the algorithm in the first place). I think a great example of this is the potential floating point performance of the Xbox 360 or Cell/PS3. In both cases you need to vectorize. On 360 you have to stay in cache and aligned, and have a huge amount of work going in parallel to hide really long instruction latencies... i.e. you really have to program as you do on a GPU to get anywhere close to peak ALU performance. Most developers either will not or cannot do this for anything but a small amount of code.

Posted by Andrew Lauritzen on Wednesday, 25-Jun-08 17:00:23 UTC
Quoting TimothyFarrar
As for peak performance, I'm assuming you are referring to ALU utilization?
Well not just ALU utilization... I'm also considering things like how cleverly you touch memory, avoid cache misses, etc. Basically everything that makes your algorithm as fast as it can theoretically be on a given set of hardware. I realize this is largely hand-wavy, but I'm just trying to make the distinction between - say - the naive vs. hand-tuned vs. autotuned versions of algorithms.

And yes, it's definitely tough to reach any sort of peak performance, but I'm concerned that on G8x and similarly complex architectures it has gone beyond "tough" into the realm of automated empirical optimization (as the paper that I referenced does). This process can potentially be "guided" or hinted or pruned by the user in the majority of cases, but with all of the factors that come into play when making something fast on G8x/CUDA, it is simply infeasible for even a ninja programmer to find a globally optimal configuration of tuning parameters except in the simplest of cases. The best we can do is a sort of orthogonal gradient ascent (in each dimension) which can be quite suboptimal in the case of something like G8x.

Anyways my only real point here is that CUDA is pretty tied to a specific architecture, and pretty complex in terms of extracting excellent performance out of that architecture. I submit that these are characteristics of a low-level, relatively non-portable language which is great in its own right, but not suitable as-is for something like OpenCL or DX compute shaders.
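As an illustration of the "orthogonal gradient ascent" point above, here is a minimal sketch of a one-dimension-at-a-time search over tuning parameters. Everything in it is an assumption made for the example: the two parameters (`block_size`, `unroll`), the candidate values and the throughput numbers are invented, not measurements from G8x or from the referenced paper. It only shows how such a search can settle on a configuration that an exhaustive sweep would beat.

```python
from itertools import product

# Hypothetical tuning axes: (block_size, unroll). Values are illustrative only.
AXES = [[64, 128, 256], [1, 2, 4]]

# Made-up throughput table with a strong interaction between the two
# parameters: the best configuration is only good when both change together.
THROUGHPUT = {
    (64, 1): 10, (64, 2): 12, (64, 4): 9,
    (128, 1): 11, (128, 2): 8, (128, 4): 7,
    (256, 1): 9, (256, 2): 7, (256, 4): 20,
}


def score(config):
    """Look up the (invented) throughput for a configuration."""
    return THROUGHPUT[config]


def coordinate_ascent(score, axes, start):
    """Improve one parameter at a time, holding the others fixed,
    until a full sweep over all parameters yields no improvement."""
    current = list(start)
    improved = True
    while improved:
        improved = False
        for dim, choices in enumerate(axes):
            def with_value(v):
                # Candidate configuration with only this dimension changed.
                return tuple(current[:dim] + [v] + current[dim + 1:])

            best = max(choices, key=lambda v: score(with_value(v)))
            if score(with_value(best)) > score(tuple(current)):
                current[dim] = best
                improved = True
    return tuple(current)


def exhaustive(score, axes):
    """Try every combination; only feasible for tiny search spaces."""
    return max(product(*axes), key=score)


local_best = coordinate_ascent(score, AXES, start=(64, 1))
global_best = exhaustive(score, AXES)
print(local_best, score(local_best))
print(global_best, score(global_best))
```

With this particular table the per-dimension search starting from `(64, 1)` stops at a configuration well short of the one the exhaustive sweep finds, because no single-parameter change from there improves the score. With more than a handful of parameters the exhaustive sweep stops being practical, which is where automated or guided search comes in.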

Posted by pcchen on Wednesday, 25-Jun-08 17:23:17 UTC
I agree that CUDA is too tied to a specific architecture, which makes it very hard to "generalize." However, the problem of "hard to optimize" is very difficult to solve. Even CPUs have the same problem. For example, a matrix multiplication algorithm, even written in C/C++, without considering SIMD, will not have optimal performance if the cache size is not considered.

Of course, the beautiful thing about a CPU (especially an x86 CPU) is that even a "normal" program (not specifically optimized for a certain architecture) may perform relatively well. The same can't be said for GPUs, or any other more "exotic" architectures, including Cell.

IMHO, it's almost impossible to hide all architecture details while maintaining high performance. To do so would require a lot of "helper" hardware, which sort of defeats the idea of GPGPU. Therefore, the most important problem right now is probably to figure out what the "best" architecture for GPGPU is: one which all major vendors can accept, and which is also useful for most application developers.
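To make the matrix multiplication example above concrete, here is a rough sketch of the usual cache-aware restructuring, loop blocking (tiling), next to the naive triple loop. The function names and the `block` parameter are made up for this illustration. It is written in Python only to show the loop structure: in pure Python the interpreter overhead hides any cache effect, so the performance argument applies to a compiled C/C++ version where `block` would be chosen so that the tiles fit in cache.

```python
def matmul_naive(a, b):
    """Textbook triple loop: walks down a column of b for every output element."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, but performed tile by tile.

    Each (i0, k0, j0) step touches only block-sized pieces of a, b and c,
    so in a compiled implementation those pieces can stay resident in cache
    while they are reused.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(a, b))
print(matmul_blocked(a, b, block=2))
```

Both functions compute the same product; the only difference is the order in which memory is visited, which is exactly the kind of hardware-dependent detail the post says the programmer still has to think about even on a CPU.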

Posted by Andrew Lauritzen on Wednesday, 25-Jun-08 18:35:04 UTC
Quoting pcchen
However, the problem of "hard to optimize" is very difficult to solve.
Oh no doubt! I didn't mean to imply that optimizing for G8x is in any way hindered by CUDA... just that writing optimal CUDA code is sufficiently tied to the G8x platform that I consider it a fairly "low-level" language. Clearly CUDA is the best (and only) language for targeting G8x hardware "to the metal", but I remain unconvinced that it provides a good general-purpose, portable programming model.

Anyways I don't want to come off as anti-CUDA - quite the contrary! I just don't think it makes sense for something like CUDA to be the programming model of choice for writing code to target stuff like AMD GPUs, multicore CPUs, Larrabee and Cell.

Posted by nAo on Wednesday, 25-Jun-08 18:37:36 UTC
Quoting Andrew Lauritzen
I just don't think it makes sense for something like CUDA to be the programming model of choice for writing code to target stuff like AMD GPUs, multicore CPUs, Larrabee and Cell.
Or for whatever NVIDIA will unleash in the next 18/24 months..

Posted by Dave Baumann on Thursday, 17-Jul-08 16:54:53 UTC
http://www.guardian.co.uk/technology/2008/jul/17/news.computing

Worthwhile reading.

Posted by Arun on Thursday, 17-Jul-08 17:17:08 UTC
Good read indeed... :)

However I have one major problem with it: the whole handheld thing is patently absurd. Mostly visionaries who fail IMO to understand the difference between theory and practice... There is no use case for a FP32-centric device in this field, and there are massively better architectures *on the market today* for every single application you could ever imagine. These solutions already are orders of magnitude more efficient than x86 CPUs which GPUs compare favorably to.

It might be useful for non-graphics tasks in games, especially because proprietary hardware won't often be exposed, let alone standardized, but beyond that I'm very very skeptical. I also laughed at this sentence: "such as being able to point the phone's camera at a building and then process the image so that it can tell you which building it is." - right, because GPS and location-aware services (showing nearby buildings) could *never* do that for a billionth the cost and the power while delivering a better user experience... right?

I'm sorry for being a bit mean here, but I'm not a big fan of random predictions that contradict the fundamental dynamics of computer architecture and system design just because they'd benefit you strategically. And I thought Intel had patented that intellectual process, anyway?


