Further Tests



Software vs Hardware Geometry Performance

Here we'll look at the performance of Radeon 9700 with hardware geometry acceleration both enabled, so that it uses the Vertex Shader Engines, and disabled, so that the CPU does the work.

First we'll look at a few select tests from 3DMark2001SE.

Dragothic (High Detail)  Software   32.7   32.3   32.4   32.4   32.9
                         Hardware  149.8  145.6  141.1  121.3   99.1
                         %Diff      358%   351%   335%   274%   201%
Nature                   Software   49.2   48.7   50.1   51.3   41.7
                         Hardware  155.5  125.9   91.5   62.2   43.9
                         %Diff      216%   159%    83%    21%     5%
Vertex Shader            Software   61.8   59.3   59.3   59.2   61.0
                         Hardware  175.0  169.0  169.7  160.2  142.5
                         %Diff      183%   185%   186%   171%   134%
Advanced Shader          Software  158.1  159.4  159.9  126.5   91.8
                         Hardware  307.0  250.9  189.4  133.5   95.1
                         %Diff       94%    57%    18%     6%     4%

Going by these numbers, it would appear that the days when the value of Hardware Geometry Acceleration was questionable in the face of increasing CPU performance are long gone. Clearly the 4 Vertex Shaders of Radeon 9700 PRO are able to process the geometry information much faster than the latest Pentium 4 can; sometimes, especially in the case of the Dragothic test, significantly so. Bear in mind as well that, being a technology test and not an actual game, 3DMark2001SE does not have the other elements that a CPU has to run in-game, such as physics and collision detection.
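For readers who want to check the %Diff rows, they appear to be the hardware score's gain over the software score, expressed relative to the software score. The following is a minimal sketch under that assumption; the function name `pct_diff` is our own and the input values are the Dragothic (High Detail) scores quoted above.

```python
def pct_diff(software_fps, hardware_fps):
    """Gain of hardware over software geometry, as a percentage of the software score."""
    return (hardware_fps - software_fps) / software_fps * 100.0


# Dragothic (High Detail) scores taken from the 3DMark2001SE results above.
software = [32.7, 32.3, 32.4, 32.4, 32.9]
hardware = [149.8, 145.6, 141.1, 121.3, 99.1]

gains = [round(pct_diff(s, h)) for s, h in zip(software, hardware)]
print(gains)  # [358, 351, 335, 274, 201]
```

A negative result from the same formula means that hardware geometry processing was slower than software.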

Let's take a look at how this performance translates to a couple of gaming situations.

Max Payne   Software   96.6   96.5   96.1   92.8   87.3
            Hardware  135.8  135.4  131.1  111.4   88.4
            %Diff       41%    40%    36%    20%     1%
SS:SE (DX)  Software   66.3   65.2   62.8   55.9   47.6
            Hardware   65.0   59.7   52.8   44.3   37.7
            %Diff       -2%    -8%   -16%   -20%   -21%

We can see that the trend displayed under 3DMark carries through to Max Payne -- there is a performance improvement under all circumstances. At higher resolutions we become more fill rate limited and less reliant on pure geometry performance, hence the delta between software and hardware geometry processing is reduced.

The performance under the DirectX renderer of Serious Sam: Second Encounter shows a different story though, with the hardware geometry causing performance drops. Normally this could occur because the vertex data being processed on the card increases the local bandwidth requirements, leaving less bandwidth for pure rendering; however, 9700 PRO is hardly bandwidth limited in most cases and its OpenGL performance is much higher. I would guess that this is due to some inefficiencies within the DirectX renderer of Serious Sam, or that the drivers are just not optimised for it yet.

Overdraw Reduction

As yet I've not discovered a method of enabling and disabling the various elements of HYPERZ III in a similar fashion to Radeon 8500, so it is going to be a little difficult to tell the exact effectiveness of the new elements of HYPERZ III. We'll take a look at PowerVR's Villagemark, a popular test for gauging overdraw reduction methods, to see how Radeon 9700 PRO performs here.

9700 PRO          176   110    76
8500               92    59    40
Difference (FPS)   84    51    36
Difference (%)    91%   86%   90%

640x480 and 800x600 are not included in this test because it would seem that Radeon 9700 PRO renders the test so fast that the application is unable to record a score - each time it would report an FPS of 0, despite it clearly being rendered (very fast!). However, we can see that Radeon 9700 PRO renders this very fast at the resolutions that were recorded, with around a 90% performance increase over Radeon 8500. Although it has twice the bandwidth and pixel fill rate, clock for clock Radeon 9700 PRO has the same texel rate as 8500, and yet in this test, which uses 3 texture layers, it's still attaining a 90% improvement.
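The clock-for-clock comparison can be checked with some simple arithmetic. The sketch below assumes the commonly published configurations of 8 pixel pipelines with 1 texture unit each for Radeon 9700 PRO and 4 pipelines with 2 texture units each for Radeon 8500. These pipeline counts are not stated in this section, the `Chip` class is purely illustrative, and core clock differences are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    pipelines: int       # assumed pixel pipeline count
    tmus_per_pipe: int   # assumed texture units per pipeline

    def pixels_per_clock(self):
        # Peak pixels written per clock: one per pipeline.
        return self.pipelines

    def texels_per_clock(self):
        # Peak texture samples per clock: one per texture unit.
        return self.pipelines * self.tmus_per_pipe


r9700 = Chip("Radeon 9700 PRO", pipelines=8, tmus_per_pipe=1)
r8500 = Chip("Radeon 8500", pipelines=4, tmus_per_pipe=2)

for chip in (r9700, r8500):
    print(chip.name, chip.pixels_per_clock(), chip.texels_per_clock())
```

Under these assumptions the pixel rate per clock doubles while the texel rate per clock is unchanged, which is why a gain of around 90% in a multi-textured test points to something other than raw texturing throughput.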

We'll use the GL_EXT_reme benchmark utility developed by 'Humus' to check out the effectiveness of Radeon 9700 PRO's HYPERZ III under different types of render orders. In this case the 'Overdraw factor of 8' numbers are shown.

Back-to-Front  9700PRO   546.61   379.13   244.87   153.31   108.08
               8500      346.45   226.17   139.18       83    56.03
Random         9700PRO  1544.83   1084.1   703.55   441.16   227.89
               8500      898.05   595.13   371.05   224.59   108.33
Front-to-Back  9700PRO  3769.78  2658.47  1752.17  1105.71    353.5
               8500     1854.28  1261.76   805.61   500.19   154.98

Increase over Back-to-Front:
Random         9700PRO     183%     186%     187%     188%     111%
               8500        159%     163%     167%     171%      93%
Front-to-Back  9700PRO     590%     601%     616%     621%     227%
               8500        435%     458%     479%     503%     177%

If we compare the percentage gains over Back-to-Front rendering for the 9700 PRO and 8500, we can see that in all cases the 9700 PRO gains more than the 8500, especially so with Front-to-Back rendering. This is a good indication that HYPERZ III's overdraw reduction schemes are operating more effectively than HYPERZ II.

Stencil Operations

The new Doom engine is known to make quite significant use of Stencil shadows, so the importance of Stencil operations will increase as we start seeing demos and the game. PowerVR recently released a 'FableMark' demo which makes extensive use of stencil buffering. Although this test is in no way indicative of the performance of DoomIII, we can use it as a gauge to see how much faster Radeon 9700 PRO is than Radeon 8500.

9700 PRO  135.2   97.4   60.4   37.6   22.5
8500       37.5     25   17.6   10.8    7.3
% Diff     261%   290%   243%   248%   208%

We can see that the difference between Radeon 9700 PRO and Radeon 8500 in this application is quite significant, which could indicate that Stencil operations have been optimised. As mentioned earlier in the review, Radeon 9700 PRO does have a method for collapsing stencil operations into one pass from their normal two passes; however, this would appear to be a DirectX9 operation and, this being the case, FableMark would not be using it as yet.

It should also be noted that while Radeon 8500 displays considerable corruption of textures within the scene, Radeon 9700 PRO displays no such issues.

Multi-Texturing vs Multi-Pass

It seems there has been some confusion over the actual multi-texturing abilities of boards such as Radeon 9700 PRO and 9000 PRO and what this means for applications. As mentioned earlier, the number of textures that Radeon 9700 PRO can support is 16 per pass. With only one texture unit available per pipeline, as opposed to the two that many other graphics chips support, all this means is that it will take more time (cycles) for all these layers to be applied than on a chip that can support the same number of layers per pass but with more texture units per pipeline.

Since the introduction of KYRO and later GeForce 3/4, Radeon 8500 and onwards (although it may possibly even date back to Voodoo Banshee), 3D chips have been able to support more textures per pass than the actual number of texturing units available per pipeline by use of functionality popularly termed 'loop-back'. Prior to these 3D chips, if a chip supported 2 texture units per pipe and an application called for 4 texture layers, then the application or drivers would split the scene up into 2 passes. The geometry is sent once with the information for the first two textures and rendered to the frame buffer, and then sent again with the information for the second two textures; as the second set of geometry (second pass) is rendered it needs to be blended with the results of the first two textures. This blending can be costly because it requires a write to the frame buffer from the first pass and then another read and write from the second pass. Also, dependent on the precision of the frame buffer, the blending of the two passes can produce incorrect results.
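The precision point is easy to illustrate. The sketch below assumes an 8-bit-per-channel frame buffer and purely multiplicative (modulate) blending between layers; both are assumptions made for the example rather than a description of any particular game or driver, and the layer values are invented to expose the rounding.

```python
def to_8bit(value):
    """Quantise a 0.0-1.0 colour value to an 8-bit frame buffer value."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255)


def single_pass(layers):
    """Combine all layers at full precision, then write to the frame buffer once."""
    colour = 1.0
    for layer in layers:
        colour *= layer
    return to_8bit(colour)


def multi_pass(layers, per_pass=2):
    """Combine per_pass layers at a time, blending each pass with the 8-bit frame buffer."""
    framebuffer = 255  # start from white so the first pass is written unchanged
    for start in range(0, len(layers), per_pass):
        colour = framebuffer / 255  # read back the quantised result of earlier passes
        for layer in layers[start:start + per_pass]:
            colour *= layer
        framebuffer = to_8bit(colour)  # rounding happens again on every pass
    return framebuffer


layers = [0.2, 0.03, 0.9, 0.9]  # hypothetical per-layer intensities
print(single_pass(layers), multi_pass(layers))  # 1 2
```

The two results differ because the intermediate value is rounded to 8 bits after the first pass, and that rounding error is carried into the second pass.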

As mentioned earlier in the review, 'Loop-back' is a method for increasing the number of available textures per pass without having to support plenty of texture units per pipeline. In the case of Radeon 9700 PRO, if an application needs 16 textures then for the pixel being rendered the first texture will be read by the single texture sampling unit and the result of that will be stored on chip so that on the next cycle the second texture layer is read; this occurs for all 16 layers and the resultant multi-textured pixel is passed to the frame buffer. If more than 16 textures are required then this process occurs for up to 16 textures, and then a further geometry pass is needed for each multiple of 16 layers after that (although in reality this is highly unlikely to be necessary for some time to come, if at all in gaming situations).
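The trade-off between geometry passes and texturing cycles can be put into a rough model. This is only a sketch: it assumes each texture unit reads one layer per cycle (ignoring filtering costs), and the function name and parameters are our own rather than anything exposed by the hardware or drivers.

```python
import math


def loopback_cost(layers, tmus_per_pipe=1, max_layers_per_pass=16):
    """Return (geometry_passes, texture_cycles_per_pixel) for a layer count.

    Within a pass, layers beyond the number of texture units are handled by
    looping back through the pipeline on later cycles. A new geometry pass is
    only needed once the per-pass layer limit is exceeded.
    """
    passes = 0
    cycles = 0
    remaining = layers
    while remaining > 0:
        in_this_pass = min(remaining, max_layers_per_pass)
        cycles += math.ceil(in_this_pass / tmus_per_pipe)
        passes += 1
        remaining -= in_this_pass
    return passes, cycles


# One texture unit, 16 layers per pass (the Radeon 9700 PRO case described above).
print(loopback_cost(16))  # (1, 16)
print(loopback_cost(20))  # (2, 20)

# An older style chip: 2 texture units per pipe, no loop-back, 4 layers requested.
print(loopback_cost(4, tmus_per_pipe=2, max_layers_per_pass=2))  # (2, 2)
```

In this model loop-back does not reduce the number of texturing cycles; what it saves is the extra geometry submission and the frame buffer blending between passes.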

Applications don't need to be specifically coded to support 'loop-back' - this is transparent to the application and it is up to the hardware to decide how to handle it. When an application runs, all it needs to do is interrogate the capabilities of the board to find out how many textures per pass the board supports; if the board reports 16 or more, the application will know that 16 texture layers can be applied per pass.

However, it should be noted that many applications do not have this type of configurability; for instance, because of the period it was developed in and the fact that most 3D chips at that time supported 2 texture units per pipe, Quake3 was designed such that it forces another pass if more than two texture layers are required. Because the application is forcing this, newer boards will have to do it as well, regardless of the number of textures they can apply per pass. So, while applications don't need to support anything for a board's 'loop-back' to operate, they do need to be coded in such a fashion that they can handle a configurable number of textures per pass because, with each passing generation, the number of supported textures per pass has been increasing regardless of whether the actual physical number of texturing units per pipe has.
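The difference between a fixed and a configurable application can be sketched as follows. The `plan_passes` helper, its parameters and the layer names are all hypothetical; `hardware_limit` stands in for whatever value the application obtains when it interrogates the board, and `app_limit` for a ceiling hard-coded into the engine.

```python
def plan_passes(layers, hardware_limit, app_limit=None):
    """Group texture layers into rendering passes.

    The number of layers per pass is whichever is lower: what the board
    reports it can do, or what the application was written to allow.
    """
    per_pass = hardware_limit if app_limit is None else min(hardware_limit, app_limit)
    if per_pass < 1:
        raise ValueError("at least one texture layer per pass is required")
    return [layers[i:i + per_pass] for i in range(0, len(layers), per_pass)]


layers = ["base", "lightmap", "detail", "gloss"]  # illustrative layer names

# An engine that never applies more than two layers per pass.
fixed = plan_passes(layers, hardware_limit=16, app_limit=2)

# An engine that takes whatever the board reports.
configurable = plan_passes(layers, hardware_limit=16)

print(len(fixed), len(configurable))  # 2 1
```

On a board reporting 16 textures per pass, the fixed engine still submits the geometry twice, while the configurable one needs a single pass.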

One application that has been around for a while that does feature some configurability in the number of textures per pass is (you guessed it) Serious Sam. Serious Sam uses up to 5 texture layers, and the engine has been coded to be configurable from 1 texture per pass up to 4. We'll take a look at some results to see if they make any difference on the Radeon 9700 PRO.

None    114.3  107.5  70.8
Dual    110.3  106.2  74.2
Triple  110.7  106.6  77.1
Quad    114.3  107.8  78.9

Change relative to None:
Dual      -3%    -1%    5%
Triple    -3%    -1%    9%
Quad       0%     0%   11%

The low resolution results show a slightly odd pattern. Normally multi-pass would result in the geometry information being resent multiple times, i.e. with a pass per texture layer you would expect the geometry to be sent 5 times in Serious Sam. This being the case, you would also expect low resolution rendering to display a notable difference when altering the layers per pass, as this is where CPU/geometry limitations show most. However, that does not really appear to be the case here -- it may be that the geometry is buffered on the 3D card so that the static geometry does not need to be resent by the CPU to the graphics card for each pass.

On the other hand, with higher resolution rendering, where bandwidth is becoming more of an issue, we do see a distinct trend - the more layers done in a single pass, the higher the performance. As we can see, going from 1 texture per pass to 4 gives an 11% performance increase. This goes to show that even though there is only one texture unit per pipe, efficiencies can be gained by allowing multiple textures per pass over multiple clock cycles.