AMD announces SSE5 instructions

Thursday 30th August 2007, 06:12:00 AM, written by Tim

AMD today announced a new extension of the SSE SIMD instruction set in the form of SSE5, a fairly radical upgrade that will arrive in 2009 with the "Bulldozer" core. The full instruction set reference is here, and we've also got a quick overview of new features:
  • There are now instructions that take three arguments in addition to the destination. As a result, there are new instructions that multiply two registers and add a third (much like most ALUs on a GPU).
  • FP16, everyone's favorite partial precision format from the NV30 era, is back. All of the instructions for the new FP16 format are related to the new multiply-accumulate class of instructions.
  • There are a number of new instructions to move values within an XMM register. There's a new instruction, PPERM, to generate permutations of the contents of an XMM register, as well as vector rotates, shifts, and conditional moves.
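To make the first and third bullets concrete, here is a rough scalar model in Python. It is only a sketch: the function names `multiply_add` and `permute_bytes` are made up for illustration, lanes are modelled as plain Python lists and bytes, and the permute models simple byte selection within one 16-byte register rather than the actual PPERM encoding.

```python
def multiply_add(a, b, c):
    """Model of a three-source multiply-add: dest[i] = a[i] * b[i] + c[i].

    The three sources are left untouched and the result is returned as a
    separate destination. Python rounds after the multiply and again after
    the add; real hardware may round differently.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all sources must have the same number of lanes")
    return [x * y + z for x, y, z in zip(a, b, c)]


def permute_bytes(src, selector):
    """Simplified byte permute over one 16-byte register.

    Each selector byte picks one source byte by index (low 4 bits only).
    This is an illustrative assumption, not the real PPERM semantics.
    """
    if len(src) != 16 or len(selector) != 16:
        raise ValueError("src and selector must both be 16 bytes long")
    return bytes(src[index & 0x0F] for index in selector)


if __name__ == "__main__":
    # Four single-precision-style lanes, as in one XMM register.
    print(multiply_add([1.0, 2.0, 3.0, 4.0], [2.0] * 4, [0.5] * 4))
    # Reverse the byte order of a register.
    print(permute_bytes(bytes(range(16)), bytes(reversed(range(16)))))
```

The point of the extra operand is that the destination no longer has to overwrite one of the sources, which is what the two-operand SSE forms require.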
It's fair to call this a new version of SSE as opposed to 3DNow, AMD's previous SIMD instruction set, since it uses the XMM registers introduced with SSE. Of course, there's the question of whether or not Intel will support it, or for that matter whether AMD will fully support SSE4. Barcelona supports SSE4a, a subset of SSE4, plus the extra POPCNT instruction, but there's no mention of whether Bulldozer will support SSE4 completely. If we had to guess, though, we'd say that Bulldozer will skip the rest of SSE4 completely. SSE5 defines a number of rounding instructions (ROUNDPS, ROUNDPD, etc.) that were already present in SSE4.

Latest Thread Comments (23 total)
Posted by pjbliverpool on Friday, 31-Aug-07 16:11:27 UTC
Thinking about it, SSE4 launches in a few weeks and after that Intel will be looking towards Nehalem. Is Nehalem likely to add further instructions to SSE? If so, what's stopping Intel calling that SSE6 and getting out of the gate with 6 before AMD launch 5? This all sounds pretty silly to me. AMD really should have supported SSE4 in full. I note SSE4 incorporates a dot product instruction as well. Can anyone tell me what advantages this can bring to games?

Posted by ShaidarHaran on Sunday, 02-Sep-07 06:26:40 UTC
Quoting pjbliverpool
Thinking about it, SSE4 launches in a few weeks and after that Intel will be looking towards Nehalem. Is Nehalem likely to add further instructions to SSE? If so, what's stopping Intel calling that SSE6 and getting out of the gate with 6 before AMD launch 5? This all sounds pretty silly to me. AMD really should have supported SSE4 in full. I note SSE4 incorporates a dot product instruction as well. Can anyone tell me what advantages this can bring to games?
AFAIK the next core scheduled to add significantly to ISA extensions is Sandy Bridge, formerly known as Gesher, hence the previous internal codename "Gesher New Instructions". However, Nehalem is supposed to add 7 new instructions in the form of SSE 4.2 (http://en.wikipedia.org/wiki/SSE4).

Posted by pjbliverpool on Sunday, 02-Sep-07 12:58:02 UTC
Quoting ShaidarHaran
AFAIK the next core scheduled to add significantly to ISA extensions is Sandy Bridge, formerly known as Gesher, hence the previous internal codename "Gesher New Instructions". However, Nehalem is supposed to add 7 new instructions in the form of SSE 4.2 (http://en.wikipedia.org/wiki/SSE4).
Cool, thanks for the clarification. So according to that, Penryn doesn't actually utilise the full SSE4 instruction set as it's missing 7 instructions. These will be added in Nehalem, which will be the first processor to feature all 54 instructions of SSE4. Sandy Bridge would have probably incorporated SSE5 but will now probably call it SSE6. Given that Sandy Bridge isn't due until about 2010 it seems AMD will get a little time out of SSE5 but not much. The question is will SSE6 be a superset of SSE5...

Posted by 3dilettante on Wednesday, 19-Sep-07 18:10:48 UTC
Quotes on whether Intel will or will not support SSE5.

http://www.theinquirer.net/?article=42484

Uh oh.

Posted by Jawed on Wednesday, 19-Sep-07 18:21:54 UTC
I think Larrabee and AMD's "streaming" ambitions using GPU tech mean that the fork is inevitable.

AMD has talked about CPUs having drivers, in the same way GPUs do. Once you've got a driver-based model for a CPU, does the forking actually matter?...

Jawed

Posted by 3dilettante on Wednesday, 19-Sep-07 19:41:34 UTC
Quoting Jawed
I think Larrabee and AMD's "streaming" ambitions using GPU tech mean that the fork is inevitable.

AMD has talked about CPUs having drivers, in the same way GPUs do. Once you've got a driver-based model for a CPU, does the forking actually matter?...

Jawed
I suppose we can ask Transmeta how well that worked.

The driver strategy doesn't make SSE5 any more likely to be adopted.
What does a driver do if there are vendor-specific extensions?

I'm thinking it either faults, wraps the command to an equivalent sequence (this gets dangerous), or requires a fallback path in lieu of a crash.
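The fallback-path option can be sketched as a one-time dispatch on detected features. This is only an illustration under stated assumptions: the feature names in `supported_features` are hypothetical strings, and `dot_extension` is a plain-Python stand-in for a path that would use the vendor extension, not real SSE5 code.

```python
def dot_fallback(a, b):
    """Baseline path that every CPU can run."""
    return sum(x * y for x, y in zip(a, b))


def dot_extension(a, b):
    """Stand-in for a path built on a vendor-specific multiply-accumulate.

    Hypothetical: real code would use the extension's instructions here.
    """
    total = 0.0
    for x, y in zip(a, b):
        total = x * y + total  # one multiply-accumulate step per element
    return total


def select_dot(supported_features):
    """Pick an implementation once, so unsupported code is never executed."""
    if "sse5" in supported_features:
        return dot_extension
    return dot_fallback


if __name__ == "__main__":
    dot = select_dot({"sse2", "sse3"})  # assumed feature set without the extension
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Either way somebody has to write and test both paths, which is the cost that keeps vendor-only extensions from being picked up.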

That is no different from how things are done right now through microcode.
If the driver model works for SSE5, microcode should have gotten 3dnow adopted.

The bigger company ignored the extension, and the vast majority of the market ignored it as well.

If Fusion somehow makes SSE5 palatable in the distant future, Intel will just migrate Larrabee's extensions to SSE6 or 7.

What will AMD do?
Assuming the cross-licensing agreement persists that far ahead in the future, it will probably wind up supporting Larrabee's extensions as well.
That leaves SSE5 hanging out with 3dnow.

edit:

By the way, I will beat Phil Hester with an NV30 if they try to pass off the kind of crap GPU drivers get away with as acceptable for a CPU interface.

Posted by Jawed on Wednesday, 19-Sep-07 20:16:43 UTC
I specifically excluded SSE5 from the point I made. SSE5 is going to have to fight its own battle.

My main point is that *after* SSE5, when AMD puts a streaming/data-parallel core(s) in the CPU - that's going to be driver-dependent. That's my interpretation, anyway.

But AMD's streaming/data-parallel pipes versus Intel's mini-cores implies that x86 *throughput* *computing* is going to be proprietary. AMD plans to use drivers, it seems. AMD doesn't look like it wants to follow Intel down the Larrabee road.

I dunno if Larrabee will need a driver as such (for non-GPU-specific computing, i.e. when operating as a grid of "x86" cores) - each revision to SSE seems to be tantamount to a "driver" revision in itself - whether you actually install a piece of software or not, you've got the same problematic fragmentation in functionality (giving programmers headaches). Clearly the lightest possible drivers are most desirable.

I'm not defending them, by the way. I was sorta hoping that some kind of x86 data parallel instruction set would arise which would be natively executed on the "GPU" - this extension wouldn't be trying to fork-off instructions as they arise onto the "GPU", piecemeal, but would be launching entire clauses of code (10s, 100s, 1000s of instructions) for 1000s-to-billions of threads, being data-centric not thread-centric. But the noises AMD is making make it sound more coarse-grained...

Jawed

Posted by 3dilettante on Wednesday, 19-Sep-07 22:38:36 UTC
Quoting Jawed
I specifically excluded SSE5 from the point I made.
Okay, but I did not see anything in your previous post to that effect.
From the standpoint of ISA support, the driver issue seems almost orthogonal.

Quote
I dunno if Larrabee will need a driver as such (for non-GPU-specific computing, i.e. when operating as a grid of "x86" cores) - each revision to SSE seems to be tantamount to a "driver" revision in itself - whether you actually install a piece of software or not, you've got the same problematic fragmentation in functionality (giving programmers headaches).
But at least extensions don't affect previous core revisions.
Driver updates can FUBAR any chips that they are installed over.
Just one of the bugs in one of the games that any GPU maker has gotten away with would lead to AMD being crucified, because nobody takes the kind of crap from their CPU that GPUs get away with.

Quote
I'm not defending them, by the way. I was sorta hoping that some kind of x86 data parallel instruction set would arise which would be natively executed on the "GPU" - this extension wouldn't be trying to fork-off instructions as they arise onto the "GPU", piecemeal, but would be launching entire clauses of code (10s, 100s, 1000s of instructions) for 1000s-to-billions of threads, being data-centric not thread-centric.
The piecemeal approach sounds closer to a more incremental form of Fusion that falls short of full integration, where the CPU "steals" or arbitrates for GPU units.

edit:
Actually, going with AMD's horrid luck at anything revolutionary, the piecemeal approach up to having a CPU hosting GPU-type units makes sense.

Posted by Jawed on Thursday, 20-Sep-07 00:34:41 UTC
Quoting 3dilettante
The piecemeal approach sounds closer to a more incremental form of Fusion that falls short of full integration, where the CPU "steals" or arbitrates for GPU units.
I was using "piecemeal" to mean "a single thread of a few SIMD-type instructions" or "a limited number of threads that execute the same few short pieces of code". The cost of moving such piecemeal contexts onto a "GPU" is simply too high, no matter how tightly integrated the "GPU".

As far as I can tell from the CAL stuff, for example, the same code can run on the CPU or GPU (much like CUDA) - it's a "runtime" switch effectively. So I imagine this'll end up as heuristics in the driver for a particular CPU that decides whether a context justifies being switched over to the "GPU" or whether the code should stay on the CPU, running "natively". Depends on the capability of the "GPU"(s) in the CPU as well as the number of attached GPUs (via PCI Express, say) + bandwidths, etc. etc.

Future versions of D3D10 make the GPU more tightly slaved to the CPU, as far as I can tell (context switching, virtual memory), normalising the GPU as a compute resource and hiding a multitude of them from the developer. This doesn't sound hugely different from what CAL will do across an array of CPUs and GPUs - regardless of the physical coupling.

Jawed
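A driver heuristic of the kind described above might boil down to a cost comparison like the following. Purely a sketch: the function `should_offload` and all four parameters are hypothetical, costs are in arbitrary time units, and a real driver would presumably weigh bandwidth and device count as well.

```python
def should_offload(work_items, cpu_cost_per_item, gpu_cost_per_item, switch_overhead):
    """Return True if moving the context to the "GPU" is expected to pay off.

    switch_overhead is the fixed cost of moving the context across; it is
    paid once, while the per-item costs scale with the amount of work.
    """
    cpu_time = work_items * cpu_cost_per_item
    gpu_time = switch_overhead + work_items * gpu_cost_per_item
    return gpu_time < cpu_time


if __name__ == "__main__":
    # A "piecemeal" context: a handful of items never amortises the switch.
    print(should_offload(4, 1.0, 0.1, 100.0))
    # A large data-parallel launch does.
    print(should_offload(1_000_000, 1.0, 0.1, 100.0))
```

With these made-up numbers the break-even point is a little over 111 items, which is why small piecemeal contexts stay on the CPU however fast the "GPU" is per item.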

Posted by 3dilettante on Friday, 21-Sep-07 13:22:00 UTC
That sounds like the expected outcome from AMD's backing off from fully heterogeneous multicore solutions.

They are unwilling to sacrifice large swaths of silicon real estate to general-purpose x86 code, probably fearing being pilloried in the benchmark game by getting caught in the Cell processor situation where legacy code only runs on the PPE.

Instead they will try to make a collection of x86 cores where some will likely suck at running most x86 code, but will still be around to chip in some amount of performance.

The driver would be the necessary glue (kludge?) between hardware, OS scheduler, and perhaps specially compiled apps, to make this workable.

The unfortunate side effect of maintaining this through x86 is, as you mentioned, that all the context and semantics slathered onto every operation in x86 are going to slow the cores compared to those in a truly heterogeneous chip.

