AMD launches FireStream 9250 with 200Gflops DP via RV770

Tuesday 17th June 2008, 10:58:00 PM, written by Rys

AMD have announced a new FireStream product based on their upcoming RV770 GPU, with some seriously impressive single and double precision peak rates.

The FireStream 9250 comes with 1GiB of GDDR3, and the 800 ALUs in the processor combine with a high chip clock to produce peak single precision (FP32) performance of over 1Tflop. Double precision peaks at 200Gflops, a product of how the architecture uses the ALUs to compute those results.
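
As a rough check on those headline numbers, the sketch below works backwards from the quoted peak rates. It assumes each of the 800 ALUs retires one multiply-add (counted as 2 flops) per clock, which is the usual way GPU peak rates are counted but is not stated in the announcement. The function names are purely illustrative.

```python
# Back-of-envelope check of the quoted peak rates.
# Assumption (not from the announcement): each ALU issues one
# multiply-add per clock, counted as 2 floating point operations.

ALUS = 800                 # from the article
FLOPS_PER_ALU_CLOCK = 2    # assumed: one multiply-add = 2 flops
SP_PEAK_GFLOPS = 1000.0    # "over 1Tflop", taken here as a round 1000 Gflops
DP_PEAK_GFLOPS = 200.0     # quoted double precision peak


def implied_clock_mhz(peak_gflops, alus, flops_per_alu_clock):
    """Clock in MHz needed to reach peak_gflops under the assumption above."""
    return peak_gflops * 1000.0 / (alus * flops_per_alu_clock)


def dp_to_sp_ratio(dp_gflops, sp_gflops):
    """Fraction of the single precision peak available in double precision."""
    return dp_gflops / sp_gflops


clock = implied_clock_mhz(SP_PEAK_GFLOPS, ALUS, FLOPS_PER_ALU_CLOCK)
ratio = dp_to_sp_ratio(DP_PEAK_GFLOPS, SP_PEAK_GFLOPS)
print(f"implied chip clock: {clock:.0f} MHz")
print(f"DP/SP ratio: {ratio:.2f}")
```

Under that assumption the quoted figures imply a chip clock in the region of 625MHz, with double precision running at roughly a fifth of the single precision rate.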

Compared to the product's immediate competition, the hardware architecture makes it harder to extract those peak rates, but if you can make the computation fit then performance is undoubtedly going to be impressive. For reference, the headline DP rate is about 2x that of the PowerXCell 8i, which was the fastest DP-capable processor readily available before the FireStream 9250 was announced.

Programming a FireStream 9250 will most commonly be done via Brook+ and the AMD Stream SDK, with CAL available for those with no fear of directly programming a GPU with the raw ISA. AMD have also said that they'll support OpenCL, or Open Computing Language, so OpenCL support on FireStream is guaranteed at some point in the future.

The announcement is mildly premature, since the product won't become available until Q3 2008, but the $1000 price is very keen, and hitting the peak rates will cost you less than 150W. Single and double precision rates therefore come with leading performance per watt figures too. AMD have clearly used a FireStream of some kind to compute their 8Gflops/watt figure, since it seems spot on.
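
The performance per watt and price claims can be sanity-checked with simple arithmetic. The sketch below uses only the figures quoted in this article, treating "over 1Tflop" as a round 1000 Gflops and "less than 150W" as an upper bound of 150W. The helper names are illustrative.

```python
# Rough efficiency arithmetic from the figures quoted in the article.
# 150 W is an upper bound ("less than 150W"), so the per-watt results
# computed from it are lower bounds.

SP_PEAK_GFLOPS = 1000.0        # "over 1Tflop", rounded down
DP_PEAK_GFLOPS = 200.0
POWER_UPPER_BOUND_W = 150.0
PRICE_USD = 1000.0
CLAIMED_GFLOPS_PER_WATT = 8.0  # AMD's quoted figure


def gflops_per_watt(gflops, watts):
    """Efficiency for a given peak rate and power draw."""
    return gflops / watts


def watts_for_efficiency(gflops, efficiency):
    """Power draw at which a peak rate would hit a given Gflops/watt."""
    return gflops / efficiency


sp_eff_floor = gflops_per_watt(SP_PEAK_GFLOPS, POWER_UPPER_BOUND_W)
dp_eff_floor = gflops_per_watt(DP_PEAK_GFLOPS, POWER_UPPER_BOUND_W)
implied_power = watts_for_efficiency(SP_PEAK_GFLOPS, CLAIMED_GFLOPS_PER_WATT)

print(f"SP at 150 W: at least {sp_eff_floor:.2f} Gflops/W")
print(f"DP at 150 W: at least {dp_eff_floor:.2f} Gflops/W")
print(f"8 Gflops/W at 1 Tflop implies {implied_power:.0f} W")
print(f"price: ${PRICE_USD / SP_PEAK_GFLOPS:.2f} per SP Gflop, "
      f"${PRICE_USD / DP_PEAK_GFLOPS:.2f} per DP Gflop")
```

The 8Gflops/watt figure corresponds to a draw of about 125W at 1Tflop, which is consistent with the "less than 150W" claim.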

We'll take a closer look at the non-graphics performance of the hardware after we've taken a better look at the Radeon implementation of RV770.


Latest Thread Comments (135 total)
Posted by itaru on Monday, 16-Jun-08 09:50:23 UTC
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~126593,00.html
AMD Stream Processor First to Break 1 Teraflop Barrier

—Next-generation AMD FireStream™ 9250 processor accelerates scientific
and engineering calculations, efficiently delivering supercomputer performance at
up to eight gigaflops-per-watt —

The AMD FireStream 9250 stream processor includes a second-generation
double-precision floating point hardware implementation delivering
more than 200 gigaflops, building on the capabilities of the earlier
AMD FireStream™ 9170, the industry’s first GP-GPU with double-precision floating point support.
The AMD FireStream 9250’s compact size makes it ideal for small 1U servers
as well as most desktop systems, workstations, and larger servers and
it features 1GB of GDDR3 memory, enabling developers to handle large, complex problems.

AMD is also working closely with world class application and solution providers
to ensure customers can achieve optimum performance results.
Stream computing application and solution providers include CAPS entreprise,
Mercury Computer Systems, RapidMind, RogueWave and VizExperts.
Mercury Computer Systems provides high-performance computing systems
and software designed for complex image, sensor, and signal processing applications.
Its algorithm team reports that it has achieved 174 GFLOPS performance for
large 1D complex single-precision floating point FFTs on the AMD FireStream 9250

Posted by MfA on Monday, 16-Jun-08 13:31:37 UTC
174 GFLOPs is incredibly fast (CUFFT did around 20 on the G80 last I looked).
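
For context on how FFT rates like these are usually quoted, here is a minimal sketch. It assumes the common benchmarking convention of counting 5·N·log2(N) flops for a complex FFT of N points. Neither the press release nor this post says which convention or transform size was used, so the size below is hypothetical.

```python
import math

# Nominal FFT throughput, assuming the common 5 * N * log2(N) flop count
# for an N-point complex transform. This convention and the transform
# size are assumptions, not details from the press release.


def fft_flops(n):
    """Nominal flop count for one complex FFT of n points."""
    return 5.0 * n * math.log2(n)


def fft_gflops(n, seconds):
    """Nominal Gflops for one n-point complex FFT completed in `seconds`."""
    return fft_flops(n) / seconds / 1e9


def seconds_at_rate(n, gflops):
    """Time for one n-point complex FFT at a sustained nominal rate."""
    return fft_flops(n) / (gflops * 1e9)


N = 2 ** 20  # hypothetical size; the release only says "large"
t_9250 = seconds_at_rate(N, 174.0)  # rate quoted for the FireStream 9250
t_g80 = seconds_at_rate(N, 20.0)    # approximate CUFFT figure recalled above

print(f"FireStream 9250: {t_9250 * 1e3:.3f} ms per transform")
print(f"G80 (CUFFT):     {t_g80 * 1e3:.3f} ms per transform")
print(f"ratio: {t_g80 / t_9250:.1f}x")
```

Whatever the transform size, the two quoted rates differ by a factor of about 8.7.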

Posted by Anarchist4000 on Monday, 16-Jun-08 22:02:34 UTC
1 TFLOP,

Posted by Arnold Beckenbauer on Wednesday, 01-Oct-08 00:41:58 UTC
Question: What are CAL Compute Shaders, which are a new shader type on R700 hardware only?
(Current Stream Computing SDK 1.2.0beta)

Posted by Jawed on Wednesday, 01-Oct-08 20:55:21 UTC
Compute shaders provide an explicit "thread ID" based programming model. They're only programmable with CAL, which is basically a bastard offspring of D3D assembly with lots of machine-specific knobs and features (and backwards compatibility for HD2xxx-onwards GPUs).

Compute shaders also do away with the "graphics" sense of a kernel, so inputs can no longer include "interpolated attributes" (mirroring vertex attribute interpolation in graphics programming) and the outputs have to be explicit writes ("memexport") to memory locations, instead of outputs into a "virtual render target". Inputs can consist of "memimport" or sampling using the texturing hardware, but with no "vertex attribute interpolation" all sampling operations are forced to be dependent, i.e. computed based on thread ID and whatever else the programmer decides.

The explicit thread ID model also forms the basis of the "data share" mechanism in CAL. Here any "thread" can write data to any one of its own, 64 vec4 (128 bit) locations. Then, any other thread can read any of these 64 locations. So it's a sort of broadcast model without any explicit destination. Think of it as "write private/read public", which requires explicit synchronisation by the programmer. The Local Data Share and Global Data Share memories in RV7xx are where this action happens. I guess it involves a fair amount of juggling, moving LDS/GDS to/from video memory, and therefore involves a fair amount of latency-hiding, similar to the way GPUs hide the latency of texturing.

Overall, CAL compute shaders could be described as a CUDA-isation in terms of explicit thread ID based programming and the explicit use of shared memory. Or it could be the model that's been drawn up for D3D11 compute shaders. Or maybe just a significant portion of it. Note that RV7xx's shared memory model isn't the same as CUDA's. CUDA allocates a fixed-size block of memory to be shared by all warps extant in a multiprocessor (thread block). So with less warps each thread has more memory to use. And all threads can write to all locations. But data cannot be shared with warps on other multiprocessors or in other clusters. That requires the programmer to do a separate write/read via video memory.

Brook+ already exposes thread IDs and allows for "threads" to exchange data as well as hiding from the programmer the "graphics-ness" of GPGPU programming.

I suppose the changes in RV7xx architecture will increase the efficiency of threaded Brook+ programming. But progress on Brook+ is very slow, and I can imagine OpenCL and D3D11 will gain the lion's share of AMD's internal software engineering resources.

Brook programming was originally about a pure streaming model of computation with no ability to access and manipulate thread IDs and sharing data across threads. CUDA's main break with Brook was to abandon that pure streaming model as being too restrictive. AMD has basically come to the same realisation in Brook+ and is now on the second iteration of supporting this functionality directly in hardware. D3D11 Compute Shader is also based on that realisation. So one way or another AMD had no choice and I suspect CAL compute shader is either a preview of D3D11 CS or is a major step in that direction.

Jawed
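
The "write private/read public" data share described in the post above can be pictured with a small toy model. This is not CAL or IL code: the class and method names are hypothetical, the threads are simulated sequentially, and only the 64 vec4 slots per thread and the need for explicit synchronisation come from the post.

```python
# Toy model of a "write private / read public" data share.
# Hypothetical names throughout; this simulates the idea on the CPU and
# does not correspond to any real CAL or IL interface.

SLOTS_PER_THREAD = 64   # each thread owns 64 vec4 locations (from the post)
VEC4_WIDTH = 4          # four 32-bit components = 128 bits


class DataShare:
    def __init__(self, num_threads):
        zero = (0.0,) * VEC4_WIDTH
        self._slots = [[zero] * SLOTS_PER_THREAD for _ in range(num_threads)]

    def write(self, thread_id, slot, value):
        """Write private: a thread can only address its own slots."""
        if len(value) != VEC4_WIDTH:
            raise ValueError("value must be a vec4")
        self._slots[thread_id][slot] = tuple(value)

    def read(self, owner_id, slot):
        """Read public: any thread may read any owner's slots."""
        return self._slots[owner_id][slot]


def neighbour_sum(num_threads):
    """Each thread adds its own value to its right-hand neighbour's."""
    share = DataShare(num_threads)

    # Phase 1: every thread publishes a value derived from its thread ID.
    for tid in range(num_threads):
        share.write(tid, 0, (float(tid), 0.0, 0.0, 0.0))

    # On real hardware the programmer would have to synchronise here.
    # Running the two phases one after the other stands in for that barrier.

    # Phase 2: every thread reads its own slot and its neighbour's slot.
    results = []
    for tid in range(num_threads):
        right = (tid + 1) % num_threads
        results.append(share.read(tid, 0)[0] + share.read(right, 0)[0])
    return results


print(neighbour_sum(4))
```

Because `write` only takes the writer's own thread ID, there is no way to target another thread's slots, which mirrors the broadcast-without-destination behaviour. CUDA-style shared memory would instead let every thread in a block write to any location.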

Posted by ahu on Thursday, 02-Oct-08 08:49:36 UTC
Wow. Excellent piece of information there, thanks Jawed!

One might clarify though, that Brook+ doesn't currently expose the data share, only CAL does. Nor does it allow access to the whole GPU video memory like CAL does.

Posted by rpg.314 on Thursday, 02-Oct-08 16:38:07 UTC
Thanks Jawed, that was really helpful and informative

Posted by Rufus on Friday, 03-Oct-08 04:15:30 UTC
And this is a perfect example of why choices aren't always a good thing. AMD really needs to focus on 1 language, put a bunch of effort behind it, and support it from here out. CTM being dropped for CAL and now CAL compute shaders, with brook and brook+ on the side is just too confusing. CTM is already dead. What's the chance that any of the rest survive after OpenCL comes out?

Posted by Dave Baumann on Friday, 03-Oct-08 13:03:20 UTC
Quoting Rufus
And this is a perfect example of why choices aren't always a good thing. AMD really needs to focus on 1 language, put a bunch of effort behind it, and support it from here out. CTM being dropped for CAL and now CAL compute shaders, with brook and brook+ on the side is just too confusing.
CTM and CAL are not languages, they are interfaces. CTM was just somewhat lower level, while CAL abstracts a little more because it is intended to be more portable across generations. Irrespective of whether there is Brook+ or, later on, OpenCL, CAL will exist as the interface that those compilers will sit upon to access the hardware. Right now, though, you do not program "in CAL": you program in IL if you want low level access, or you use Brook+ or a 3rd party toolset from the likes of RapidMind or others for higher level access.

Posted by Arnold Beckenbauer on Sunday, 05-Oct-08 00:04:53 UTC
Quoting Jawed
(Jawed's post above, quoted in full)
Too much INPUT...Error...Reset

Thx.

