AMD launches FireStream 9250 with 200Gflops DP via RV770

Tuesday 17th June 2008, 10:58:00 PM, written by Rys

AMD have announced a new FireStream product based on their upcoming RV770 GPU, with some seriously impressive single and double precision peak rates.

The FireStream 9250 comes with 1GiB of GDDR3. The 800 ALUs in the processor combine with a high chip clock to produce peak single precision (FP32) performance of over 1Tflop, with 200Gflops of double precision as a product of how the architecture uses the ALUs to compute those results.
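As a rough check on those headline numbers, the sketch below works backwards from the 800 ALUs and the 1Tflop figure to the chip clock they imply. It assumes the usual convention that one multiply-add per ALU per clock counts as two flops; the function names are made up for illustration, and the clock it derives is an inference, not a figure from the announcement.

```python
# Back-of-the-envelope peak rate arithmetic.
# Assumption: each ALU retires one multiply-add (2 flops) per clock.

def peak_gflops(alus, clock_ghz, flops_per_alu_per_clock=2):
    """Peak rate in Gflops for a given ALU count and clock."""
    return alus * clock_ghz * flops_per_alu_per_clock

def clock_for_target(alus, target_gflops, flops_per_alu_per_clock=2):
    """Clock in GHz needed to reach a target peak rate."""
    return target_gflops / (alus * flops_per_alu_per_clock)

ALUS = 800          # from the article
SP_GFLOPS = 1000.0  # "over 1Tflop", taken as a lower bound
DP_GFLOPS = 200.0   # from the article

clock = clock_for_target(ALUS, SP_GFLOPS)
print("implied minimum clock (GHz):", clock)
print("SP peak at that clock (Gflops):", peak_gflops(ALUS, clock))
print("DP:SP ratio:", DP_GFLOPS / SP_GFLOPS)
```

The 1:5 DP to SP ratio that falls out of the two headline figures is what you would expect if several SP ALUs have to cooperate on each double precision operation, though the exact grouping is not stated here.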

While the hardware architecture doesn't make it as easy to extract those peak rates as the product's immediate competition does, if you can make the computation fit then performance is undoubtedly going to be impressive. For reference, the headline DP rate is about 2x higher than that of the PowerXCell 8i, which was the fastest DP-capable processor readily available before the FireStream 9250 was announced.

Programming a FireStream 9250 will most commonly be done via Brook+ and the AMD Stream SDK, with CAL available for those with no fear of programming a GPU directly in the raw ISA. AMD have also said that they'll support OpenCL, the Open Computing Language, so OpenCL support for FireStream is guaranteed at some point in the future.

The announcement is mildly premature, since the product won't become available until Q3 2008, but the $1000 price is very keen, and hitting the peak rates will cost you less than 150W. Single and double precision rates therefore come with leading performance per watt figures too. AMD have clearly used a FireStream of some kind to compute their 8Gflops/watt figure, since it seems spot on.
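The performance per watt claim can be sanity checked with the figures quoted above. This is only a sketch: "over 1Tflop" and "less than 150W" are bounds rather than exact values, so the code treats them as such, and the helper names are hypothetical.

```python
# Performance-per-watt arithmetic from the article's bounds.

def gflops_per_watt(gflops, watts):
    """Efficiency in Gflops/W."""
    return gflops / watts

def watts_for_efficiency(gflops, target_gflops_per_watt):
    """Board power at which a given rate meets a target efficiency."""
    return gflops / target_gflops_per_watt

SP_GFLOPS = 1000.0   # lower bound on the SP peak ("over 1Tflop")
POWER_CEILING = 150.0  # upper bound on power ("less than 150W")
CLAIMED = 8.0        # AMD's Gflops/W figure

# Worst case using both bounds as if they were exact values.
print("floor on Gflops/W:", round(gflops_per_watt(SP_GFLOPS, POWER_CEILING), 2))

# Power draw at which exactly 1Tflop would deliver the claimed efficiency.
print("watts needed for 8 Gflops/W at 1Tflop:", watts_for_efficiency(SP_GFLOPS, CLAIMED))
```

The second figure sits under the 150W ceiling, so the 8Gflops/watt claim is at least consistent with the other numbers in the announcement.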

We'll take a closer look at the non-graphics performance of the hardware after we've taken a better look at the Radeon implementation of RV770.

Tagging

amd ± firestream, 9250, rv770, double, precision, 1Tflop, really, fast, gpu, zomg


Latest Thread Comments (128 total)
Posted by Jawed on Thursday, 27-Mar-08 19:23:27 UTC
Quoting Farhan
Yeah, obviously you have to do that for an ADD. I was just talking about the MUL.
What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1. You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result). Jawed

Posted by Jawed on Thursday, 27-Mar-08 19:33:59 UTC
Something different
Any chance that modifying/widening the DP4 paths will provide the requisite stages? Jawed

Posted by Farhan on Thursday, 27-Mar-08 23:02:05 UTC
Quoting Jawed
What I'm proposing is that the final stage for the MUL is a pipelined-add, for p1+p2. Hence the exponent adjustment and trading of significant bits in p2 against bits of p1. You queried this addition earlier saying it needs to be done at 54+27 bits precision, but I hope I've shown that treating it as a normal floating point add (de-normalising: shifting one operand and modifying the exponent) allows it to be performed with only 54 bits (for a 53 bit final result). Jawed
Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because that would be incorrect. The alignment for p1 and p2 is always fixed (they are not 2 completely independent FP numbers, think of them as having a shared exponent). The addition is always between the top 54 bits of p1 and the bottom 54 bits of p2, with the carry propagation having to go through all the way to the MSB of p2 (27 bits).
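The fixed-alignment point can be illustrated with plain integers. This is a sketch under one reading of the thread, namely that p1 is B times the low 27 bits of A and p2 is B times the high bits of A; the function names are invented for the example and nothing here describes the actual hardware.

```python
import random

LO_BITS = 27
LO_MASK = (1 << LO_BITS) - 1

def partial_products(a, b):
    """Split a into hi/lo parts and form the two partial products."""
    a_hi, a_lo = a >> LO_BITS, a & LO_MASK
    p1 = b * a_lo   # weight 2^0
    p2 = b * a_hi   # weight 2^27, always, whatever the operand values
    return p1, p2

def combine_fixed(p1, p2):
    """Add p1 and p2 at their fixed 27-bit offset."""
    low = p1 & LO_MASK               # passes straight through
    upper = p2 + (p1 >> LO_BITS)     # carries can ripple across all of p2
    return (upper << LO_BITS) | low

def random_significand():
    """A 53-bit significand with the leading bit set."""
    return (1 << 52) | random.getrandbits(52)

a, b = random_significand(), random_significand()
p1, p2 = partial_products(a, b)
assert combine_fixed(p1, p2) == a * b
print("adder width needed (bits):", (p2 + (p1 >> LO_BITS)).bit_length())
```

The shift between p1 and p2 never depends on the data, and the addition spans the full width of p2 rather than 54 bits, which is the objection being raised.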

Posted by Jawed on Friday, 28-Mar-08 03:58:22 UTC
Quoting Farhan
Regardless of whether it's a pipelined add, you can't do that shifting thing for the MUL because that would be incorrect. The alignment for p1 and p2 is always fixed (they are not 2 completely independent FP numbers, think of them as having a shared exponent).
I've diagrammed a possible set of exponents: Code:
---------
Blo  27
Alo  27
    ---
w    55

Bhi  53
Alo  27
    ---
z    81
    ---
Z+W
   =====
z    82  partial sum 1
   =====

Blo  27
Ahi  53
    ---
y    81

Bhi  53
Ahi  53
    ---
    107
    ---
X+Y
   =====
    108  partial sum 2
   =====

p1  z82
p2 +108
 =======
    109
 =======
---------
For the sake of clarity, both A and B have exponent 53. When split into hi and lo parts, the hi parts keep their exponent, 53, while the lo parts are normalised to exponent 27 (though it could be lower for either of them). I've then worked through the multiplications and additions, calculating the maximum value of each of the resulting exponents.

Doing this I think I've understood my mistake. When I said "the count of significant bits in p2 determines how many bits from p1 are used, i.e. 54-p2+27" that's wrong, it should be the difference in exponents as there's always 54 significant bits in p2.

---

My suggestion is the addition, p1+p2, is done on the final adder in the pipeline (in lanes X and Y). This adder is required to perform a DADD instruction, so in this case it is also used for p1+p2. Since DADD has to support two 53-bit operands by being a 54-bit adder, the addition of p1+p2, 27 bits + 54 bits, requires no extra hardware dedicated to MUL.

So, what I'm thinking is that a conventional single precision DP4 needs to perform a final ADD on 4 MULs. So the DP4 instruction requires a 4 operand adder. I'm wondering if this same adder can also support:
* DADD A, B
* DMUL p1, p2
* DMAD p1, p2, C
C comes from A*B+C. Does DP4 work like that, though?
Jawed
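For anyone following along, the four-lane split in the diagram can be tried out with integers. This is a minimal sketch, assuming the lanes w, z, y and x hold Blo*Alo, Bhi*Alo, Blo*Ahi and Bhi*Ahi respectively (the x label is inferred from "X+Y"); it checks only the arithmetic, not how any real pipeline schedules it, and the bit lengths it prints are for integer significands rather than the exponent figures in the diagram.

```python
import random

LO_BITS = 27
LO_MASK = (1 << LO_BITS) - 1

def split(m):
    """Split a 53-bit significand into (hi, lo), with lo holding 27 bits."""
    return m >> LO_BITS, m & LO_MASK

def split_multiply(a, b):
    """Multiply two significands via four narrower lane products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    w = b_lo * a_lo          # weight 2^0
    z = b_hi * a_lo          # weight 2^27
    y = b_lo * a_hi          # weight 2^27
    x = b_hi * a_hi          # weight 2^54
    p1 = (z << LO_BITS) + w                      # partial sum 1, "Z+W"
    p2 = (x << (2 * LO_BITS)) + (y << LO_BITS)   # partial sum 2, "X+Y"
    return p1, p2, p1 + p2

a = (1 << 52) | random.getrandbits(52)
b = (1 << 52) | random.getrandbits(52)
p1, p2, product = split_multiply(a, b)
assert product == a * b
print("p1 bits:", p1.bit_length(), "p2 bits:", p2.bit_length(),
      "product bits:", product.bit_length())
```

Each lane product fits in 54 bits or fewer, which is the attraction of the split; the open question in the thread is how wide the final p1+p2 addition has to be.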

Posted by itaru on Sunday, 25-May-08 13:35:05 UTC
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=95565&enterthread=y
AMD Stream SDK v1.1-beta Now Available For Download

The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!

The installation files are available for immediate download from:
FTP Download Site For AMD Stream SDK v1.1-beta (ftp://streamcomputing:streamcomputing@ftp-developer.amd.com/AMD_Stream_SDK/v1.01.0-beta)

The AMD Stream Computing website will be updated in the next few days to reflect this new release.

With v1.1-beta comes:

- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support


If you have any questions, please do not hesitate to post your question to the forum.

Sincerely,
AMD Stream Team

Posted by wingless on Saturday, 07-Jun-08 15:22:47 UTC
Quoting itaru
http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=95565&enterthread=y
AMD Stream SDK v1.1-beta Now Available For Download
Awesome. I hope we see more ATI support in GPGPU before CUDA takes over the market.

Posted by Karoshi on Monday, 09-Jun-08 02:31:00 UTC
Quoting itaru
AMD Stream SDK v1.1-beta Now Available For Download
The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!
With v1.1-beta comes:
- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support
If you have any questions, please do not hesitate to post your question to the forum.
Sincerely,
AMD Stream Team
Wishlist:
- Brook CUDA backend.
A quick search around here didn't find any references to this. I think I read a post suggesting CUDA on CTM or CAL a few days ago. Brook on CUDA seems easier.
Disclaimer: I know CUDA and AMD's Stream SDK only at the executive PDF level.
I see advantages to a Brook port to CUDA.

Posted by itaru on Monday, 16-Jun-08 09:50:23 UTC
http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~126593,00.html
AMD Stream Processor First to Break 1 Teraflop Barrier

—Next-generation AMD FireStream™ 9250 processor accelerates scientific
and engineering calculations, efficiently delivering supercomputer performance at
up to eight gigaflops-per-watt —

The AMD FireStream 9250 stream processor includes a second-generation
double-precision floating point hardware implementation delivering
more than 200 gigaflops, building on the capabilities of the earlier
AMD FireStream™ 9170, the industry’s first GP-GPU with double-precision floating point support.
The AMD FireStream 9250’s compact size makes it ideal for small 1U servers
as well as most desktop systems, workstations, and larger servers and
it features 1GB of GDDR3 memory, enabling developers to handle large, complex problems.

AMD is also working closely with world class application and solution providers
to ensure customers can achieve optimum performance results.
Stream computing application and solution providers include CAPS entreprise,
Mercury Computer Systems, RapidMind, RogueWave and VizExperts.
Mercury Computer Systems provides high-performance computing systems
and software designed for complex image, sensor, and signal processing applications.
Its algorithm team reports that it has achieved 174 GFLOPS performance for
large 1D complex single-precision floating point FFTs on the AMD FireStream 9250

Posted by MfA on Monday, 16-Jun-08 13:31:37 UTC
174 GFLOPs is incredibly fast (CUFFT did around 20 on the G80 last I looked).
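For context on how FFT rates like these are usually quoted, here is a small sketch using the common 5*N*log2(N) flop count for a complex 1D transform of length N. That convention is an assumption; the press release does not say how Mercury counted flops or what sizes were run, and the transform length below is purely hypothetical.

```python
import math

def fft_flops(n):
    """Nominal flop count for a complex 1D FFT of length n (5*n*log2(n))."""
    return 5.0 * n * math.log2(n)

def fft_gflops(n, seconds):
    """Effective Gflops for one transform of length n taking `seconds`."""
    return fft_flops(n) / seconds / 1e9

def seconds_at_rate(n, gflops):
    """Time for one transform of length n at a given effective rate."""
    return fft_flops(n) / (gflops * 1e9)

N = 1 << 20  # hypothetical transform length, not taken from the press release
print("nominal flops:", fft_flops(N))
print("time at 174 Gflops (s):", seconds_at_rate(N, 174.0))
print("fraction of a 1Tflop peak:", 174.0 / 1000.0)
```

On that basis 174 GFLOPS is a bit under a fifth of the single precision peak, which is not unusual for a memory-bound workload like a large FFT.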

Posted by Anarchist4000 on Monday, 16-Jun-08 22:02:34 UTC
1 TFLOP,

