AMD OpenCL development platform for CPU and GPU

Tuesday 13th October 2009, 07:35:00 PM, written by Rys

AMD has had OpenCL support for their CPUs available in their Stream SDK for a little while now, with the missing link being support on the GPU. Today's release of a beta version of the brand-new Stream SDK 2.0 fixes that, with compliant OpenCL 1.0 support for everything from the HD 4350 up, including multi-GPU SKUs.

The kit supports Windows and Linux, both 32- and 64-bit, and multiple compilers (including ICC) too, which might be a real boon depending on what you're doing with the host platform and other software.

The Stream SDK still supports driving the GPU via CAL, if that's how you currently drive ATI graphics hardware for general compute, and there are myriad bug fixes and little tweaks and enhancements here and there.

CPU support targets the SSE3 SIMD hardware on AMD's processors for many ops, and with OpenCL already supported on the CPU, the new GPU support means that AMD are the first vendor to release a developer kit with support for CPU and GPU together.

Language support appears to be limited to C at the moment, but that will likely change.

If you're in any way GPU compute inclined, you should take a look at the kit.



Latest Thread Comments (54 total)
Posted by Broken Hope on Monday, 21-Dec-09 15:10:08 UTC
Quoting mhouston
This version supports an initial version of the Khronos ICD which is designed to allow multiple platforms to play nice. i.e. opencl.dll is just a shim that loads the vendor's dlls. To avoid breaking older Nvidia builds, we don't smash opencl.dll and instead place it in the ATI Stream path and the installer will update your path variable.
Any idea when we'll get a driver that actually supports OpenCL out of the box? You shouldn't really expect end users to have to install an SDK in order to get OpenCL support, as the 9.12 Hotfix drivers seem to require.

Posted by cho on Monday, 21-Dec-09 15:38:32 UTC
Is it possible to run an OpenCL app on the Radeon without connecting it to a monitor or forcing an extended desktop?

Posted by mhouston on Monday, 21-Dec-09 16:17:44 UTC
We are working on getting OpenCL into Catalyst. One of the issues we have is that we support x86 as well as ATI GPUs, so we don't want the driver install to be the only way people get OpenCL; otherwise, if you have an Nvidia board, there is no way for you to get CPU support under OpenCL. Not requiring a monitor is also being looked at.

Posted by Arnold Beckenbauer on Monday, 21-Dec-09 21:32:25 UTC
A naive question: Why don't you port some old ATi Physics demos from the year 2006? Just for fun?

Posted by CNCAddict on Monday, 21-Dec-09 22:37:33 UTC
Quoting Arnold Beckenbauer
A naive question: Why don't you port some old ATi Physics demos from the year 2006? Just for fun?
Yes... do what he says! ATI needs something to get people excited about its solution. Nvidia is getting all the lovin' at the moment.

Posted by codedivine on Saturday, 26-Dec-09 01:47:10 UTC
Quoting Mintmaster
That's a bit disappointing, as it's only 1 float per Vec5 unit per clock. In ATI's counter-Fermi presentation they stated the 5870 could access LDS at 960 floats per clock, i.e. 3 per Vec5 per clk.
From the recently released Evergreen ISA document, one can see that each LDS has 32 banks of 4 bytes each. Assuming each bank can return 1 float/cycle, that's 32 floats/SIMD. Add in the 16 floats/SIMD from L1 and you get 48 floats/SIMD, which for a 20-SIMD part such as Cypress is 960 floats/clock as claimed by AMD, but that's an aggregate. In the test above, the LDS only managed 16 floats/SIMD instead of 32, though.
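
The sum behind the 960 floats/clock figure can be checked in a few lines of Python. This is only a sketch of the arithmetic in the post: the constant names are made up for illustration, and the one-float-per-bank-per-clock rate is the same assumption the post makes, not a measured value.

```python
# Figures taken from the post above; nothing here is measured.
LDS_BANKS_PER_SIMD = 32   # 32 banks of 4 bytes, assumed 1 float per bank per clock
L1_FLOATS_PER_SIMD = 16   # L1 contribution per SIMD per clock, as stated in the post
CYPRESS_SIMDS = 20        # SIMD count given for Cypress in the post

# Per-SIMD rate with LDS and L1 added together.
per_simd = LDS_BANKS_PER_SIMD + L1_FLOATS_PER_SIMD

# Chip-wide aggregate (LDS + L1) and the LDS-only share of it.
aggregate = per_simd * CYPRESS_SIMDS
lds_only = LDS_BANKS_PER_SIMD * CYPRESS_SIMDS

print(per_simd, aggregate, lds_only)  # 48 960 640
```

Under these assumptions the LDS alone accounts for 640 of the 960 floats/clock, which is why the 960 figure should be read as an aggregate.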

Posted by Forrest on Sunday, 27-Dec-09 05:45:48 UTC
In my original test, all threads accessed the same position in the LDS at the same time, so there may have been bank conflicts. (I don't know if ATI hardware has a broadcast.)

I changed my code to do linear 32-bit reads from LDS. Since 16 threads are processed in 1 cycle, there won't be any bank conflicts (32 banks, each 32 bits wide). I got around 450GB/s with the newer version. But when I looked at the ISA produced, the LDS_READ_RET instruction is actually issued only on alternate cycles, because of the ALU operations required to maintain a dependency (so that the compiler won't optimize out the reads).

So on doubling the bandwidth number, what I get is fairly close to what ATI says in their document, i.e. 1080GB/s.

This particular ALU clause has 32 cycles in total, of which LDS reads happen on 14.

Code:
---------
00 ALU: ADDR(64) CNT(124) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
0 z: LSHL ____, R0.x, (0x00000002, 2.802596929e-45f).x
t: MULLO_INT ____, R1.x, KC0[1].x
1 x: ADD_INT ____, R0.x, PS0
y: ADD_INT R7.y, KC1[1].x, PV0.z
2 x: LDS_READ_RET QA, PV1.y
y: ADD_INT ____, PV1.y, (0x00000004, 5.605193857e-45f).x
z: ADD_INT ____, PV1.y, (0x00000008, 1.121038771e-44f).y
w: ADD_INT R5.w, PV1.x, KC0[6].x
t: ADD_INT T0.x, PV1.y, (0x0000000C, 1.681558157e-44f).z
3 x: LDS_READ2_RET QAB, PV2.y, PV2.z
y: ADD_INT ____, R7.y, (0x00000010, 2.242077543e-44f).x
z: ADD_INT T0.z, R7.y, (0x00000014, 2.802596929e-44f).y
w: ADD_INT T0.w, R7.y, (0x00000018, 3.363116314e-44f).z
t: ADD_INT T1.x, R7.y, (0x0000001C, 3.923635700e-44f).w
4 x: LDS_READ2_RET QAB, T0.x, PV3.y
y: ADD_INT T0.y, R7.y, (0x00000020, 4.484155086e-44f).x
z: ADD_INT T1.z, R7.y, (0x00000024, 5.044674472e-44f).y
w: ADD_INT T1.w, R7.y, (0x00000028, 5.605193857e-44f).z
t: ADD_INT T2.x, R7.y, (0x0000002C, 6.165713243e-44f).w
5 x: LDS_READ2_RET QAB, T0.z, T0.w
y: ADD_INT T1.y, R7.y, (0x00000030, 6.726232629e-44f).x
z: ADD_INT T2.z, R7.y, (0x00000034, 7.286752014e-44f).y
w: ADD_INT T2.w, R7.y, (0x00000038, 7.847271400e-44f).z
t: ADD_INT T3.x, R7.y, (0x0000003C, 8.407790786e-44f).w
6 x: LDS_READ2_RET QAB, T1.x, T0.y
7 x: MOV ____, QA.pop
8 x: MOV T1.x, QB.pop VEC_120
y: ADD ____, PV7.x, 0.0f
z: MOV ____, QA.pop
w: ADD_INT T3.w, R7.y, (0x00000040, 8.968310172e-44f).x
9 x: MOV T0.x, QB.pop VEC_120
y: ADD ____, PV8.z, PV8.y
w: MOV T0.w, QA.pop
10 x: ADD ____, PV9.y, T1.x
y: MOV T0.y, QA.pop
z: MOV T0.z, QB.pop VEC_120
w: ADD_INT R0.w, R7.y, (0x00000044, 9.528829557e-44f).x
11 x: MOV T1.x, QB.pop VEC_120
y: ADD_INT T2.y, R7.y, (0x00000048, 1.008934894e-43f).x
z: ADD ____, PV10.x, T0.w
w: MOV T0.w, QA.pop
12 x: LDS_READ2_RET QAB, T1.z, T1.w
y: ADD_INT T3.y, R7.y, (0x0000004C, 1.064986833e-43f).x
w: ADD ____, PV11.z, T0.x
13 x: ADD ____, PV12.w, T0.y
y: ADD_INT T0.y, R7.y, (0x00000050, 1.121038771e-43f).x
14 x: LDS_READ2_RET QAB, T2.x, T1.y
y: ADD ____, PV13.x, T0.z
z: ADD_INT T0.z, R7.y, (0x00000054, 1.177090710e-43f).x
15 x: LDS_READ2_RET QAB, T2.z, T2.w
z: ADD ____, PV14.y, T0.w VEC_021
w: ADD_INT T0.w, R7.y, (0x00000058, 1.233142649e-43f).x
16 x: LDS_READ2_RET QAB, T3.x, T3.w
y: ADD_INT T1.y, R7.y, (0x0000005C, 1.289194587e-43f).x
w: ADD T3.w, PV15.z, T1.x
17 x: LDS_READ2_RET QAB, R0.w, T2.y
y: MOV ____, QA.pop
z: MOV T2.z, QB.pop VEC_120
w: ADD_INT T2.w, R7.y, (0x00000060, 1.345246526e-43f).x
18 x: ADD ____, T3.w, PV17.y
19 x: MOV T1.x, QB.pop VEC_120
y: ADD ____, PV18.x, T2.z
z: ADD_INT T3.z, R7.y, (0x00000064, 1.401298464e-43f).x
w: MOV ____, QA.pop
20 x: ADD ____, PV19.y, PV19.w
y: MOV T2.y, QA.pop
z: MOV T2.z, QB.pop VEC_120
w: ADD_INT R1.w, R7.y, (0x00000068, 1.457350403e-43f).x
21 x: MOV T1.x, QB.pop VEC_120
y: ADD ____, PV20.x, T1.x
z: ADD_INT R0.z, R7.y, (0x0000006C, 1.513402341e-43f).x
w: MOV T3.w, QA.pop
22 x: ADD ____, PV21.y, T2.y
y: MOV T2.y, QA.pop
z: MOV T1.z, QB.pop VEC_120
w: ADD_INT R2.w, R7.y, (0x00000070, 1.569454280e-43f).x
23 x: LDS_READ2_RET QAB, T3.y, T0.y
y: ADD ____, PV22.x, T2.z
z: ADD_INT R1.z, R7.y, (0x00000074, 1.625506219e-43f).x VEC_201
24 x: LDS_READ2_RET QAB, T0.z, T0.w
z: ADD ____, PV23.y, T3.w VEC_021
w: ADD_INT R3.w, R7.y, (0x00000078, 1.681558157e-43f).x
25 x: LDS_READ2_RET QAB, T1.y, T2.w
y: ADD_INT R2.y, R7.y, (0x0000007C, 1.737610096e-43f).x VEC_120
w: ADD ____, PV24.z, T1.x
26 x: ADD ____, PV25.w, T2.y
y: ADD_INT R3.y, R7.y, (0x00000080, 1.793662034e-43f).x
27 x: LDS_READ_RET QA, T3.z
y: ADD R0.y, PV26.x, T1.z
z: ADD_INT R3.z, R7.y, (0x00000084, 1.849713973e-43f).x
28 x: MOV R0.x, QB.pop VEC_120
w: MOV R0.w, QA.pop
29 y: MOV R1.y, QA.pop
z: MOV R2.z, QB.pop VEC_120
30 x: MOV R1.x, QB.pop VEC_120
w: MOV R4.w, QA.pop
31 y: MOV R4.y, QA.pop
---------
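
One rough way to relate the listing to the measured figure is to count how many dwords the clause reads against how many it could read if an LDS_READ2_RET issued in every instruction group. The Python sketch below does that under loudly stated assumptions: that LDS_READ2_RET returns two dwords per thread and LDS_READ_RET one, that every instruction group takes the same time, and that the peak is the 1088GB/s Juniper figure quoted elsewhere in this thread. The group numbers are copied from the listing; the variable names are illustrative only.

```python
# Instruction groups in the listing above that contain an LDS read.
SINGLE_READ_GROUPS = [2, 27]                                          # LDS_READ_RET
DUAL_READ_GROUPS = [3, 4, 5, 6, 12, 14, 15, 16, 17, 23, 24, 25]       # LDS_READ2_RET
TOTAL_GROUPS = 32          # groups 0..31 in the clause

# Assumption: peak LDS rate is the Juniper figure quoted in this thread.
PEAK_GB_PER_S = 1088.0

groups_with_reads = len(SINGLE_READ_GROUPS) + len(DUAL_READ_GROUPS)

# Assumption: READ2 returns two dwords per thread, READ returns one.
dwords_read = len(SINGLE_READ_GROUPS) * 1 + len(DUAL_READ_GROUPS) * 2

# Assumption: the best case is one READ2 in every group of the clause.
dwords_peak = TOTAL_GROUPS * 2

utilisation = dwords_read / dwords_peak
expected_gb_per_s = utilisation * PEAK_GB_PER_S

print(groups_with_reads, dwords_read, utilisation, expected_gb_per_s)
```

If those assumptions hold, the estimate lands in the same region as the roughly 450GB/s measurement, but it is a back-of-the-envelope model rather than a description of how the hardware actually schedules the clause.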

Posted by OpenGL guy on Monday, 28-Dec-09 01:30:28 UTC
Note that your shader is "invalid" as you never write to LDS, only read from it. Thus, there is no way to guarantee correct values are in LDS.

LDS is not guaranteed to be coherent between different kernels. Imagine if some other context got swapped in between your kernel that generates the LDS data and the kernel that reads the data.

Posted by Forrest on Monday, 28-Dec-09 03:13:50 UTC
I don't care whether the values are correct or not; I just want to measure the read bandwidth from LDS, which is 32 floats per cycle per SIMD, right?

On Juniper that is about: (32 x 4) bytes x 850MHz x 10 SIMDs = 1088GB/s. Correct me if I am wrong.
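
The same calculation as a small Python helper, in case anyone wants to plug in other parts. It is a sketch only: the function name and parameters are made up for illustration, the inputs are the Juniper numbers from the post, and GB here means 10^9 bytes.

```python
def lds_peak_gb_per_s(banks, bytes_per_bank, clock_mhz, simds):
    """Theoretical peak LDS read bandwidth in GB/s (1 GB = 10^9 bytes).

    Assumes every bank on every SIMD returns bytes_per_bank each clock.
    """
    bytes_per_clock = banks * bytes_per_bank * simds
    # bytes/clock * 10^6 clocks/s / 10^9 bytes/GB == bytes/clock * MHz / 1000
    return bytes_per_clock * clock_mhz / 1000

# Juniper figures from the post: 32 banks x 4 bytes, 850MHz, 10 SIMDs.
juniper = lds_peak_gb_per_s(banks=32, bytes_per_bank=4, clock_mhz=850, simds=10)
print(juniper)  # 1088.0
```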

Posted by OpenGL guy on Monday, 28-Dec-09 04:21:34 UTC
Quoting Forrest
I don't care whether the values are correct or not; I just want to measure the read bandwidth from LDS, which is 32 floats per cycle per SIMD, right?

On Juniper that is about: (32 x 4) bytes x 850MHz x 10 SIMDs = 1088GB/s. Correct me if I am wrong.
Yes, that's the correct number for Juniper.

