Further Tests
Hardware vs Software Geometry Processing
With four DX9-compliant Vertex Shaders we would expect Parhelia to have good hardware geometry throughput. We'll run the tests with both hardware and CPU-based geometry processing to see what kind of performance gains can be attained by using Parhelia's geometry processing as opposed to that of the P4 2.53GHz CPU.
3DM - Dragothic | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
Software | 27.8 | 28.9 | 29.2 | 29.4 | 28.5 |
Hardware | 101.2 | 88.7 | 73.1 | 55.6 | 43.5 |
%Diff | 264% | 207% | 150% | 89% | 53% |
3DM - Nature | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
Software | 39.6 | 34.6 | 28.1 | 20.1 | 14.9 |
Hardware | 56.4 | 42.2 | 31.1 | 20.9 | 15.1 |
%Diff | 42% | 22% | 11% | 4% | 1% |
3DM - Vertex Shader | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
Software | 41.3 | 41.4 | 41.5 | 41.9 | 40.8 |
Hardware | 109.6 | 103.0 | 91.7 | 76.0 | 62.3 |
%Diff | 165% | 149% | 121% | 81% | 53% |
The 3DMark2001SE tests all show good performance gains when geometry is processed in hardware rather than on the CPU. Even at high resolutions two of the tests still show sizeable percentage increases; the only exception is the 'Nature' test. As explained before, this is probably because the test becomes quite fill-rate/bandwidth limited, which minimises the improvement that hardware geometry processing is able to show.
As is always the case, though, 3DMark2001SE's tests are only demos and hence not representative of an actual game, which has many more calculations running on the CPU than just geometry processing. Let's take a look at the performance of a couple of game titles using software and hardware geometry processing:
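As a quick check on how the percentage rows can be derived, here is a minimal sketch that assumes %Diff is the hardware gain expressed relative to the software score. The function name `percent_gain` is our own; the figures are the Vertex Shader results from the table above.

```python
def percent_gain(software_fps, hardware_fps):
    """Gain of hardware over software geometry, as a percentage of the software score."""
    return (hardware_fps - software_fps) / software_fps * 100.0

# 3DMark2001SE Vertex Shader test, figures taken from the table above.
resolutions = ["640x480", "800x600", "1024x768", "1280x1024", "1600x1200"]
software = [41.3, 41.4, 41.5, 41.9, 40.8]
hardware = [109.6, 103.0, 91.7, 76.0, 62.3]

gains = [round(percent_gain(s, h)) for s, h in zip(software, hardware)]
for res, gain in zip(resolutions, gains):
    print(f"{res}: {gain}%")
```

Under that assumption the rounded results match the Vertex Shader %Diff row.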
Dungeon Siege | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
Software | 67.5 | 68.2 | 67.5 | 59.7 | 47.0 |
Hardware | 80.3 | 79.1 | 77.2 | 63.6 | 48.4 |
%Diff | 19% | 16% | 14% | 7% | 3% |
Max Payne | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
Software | 91.1 | 91.0 | 79.0 | 55.3 | 38.7 |
Hardware | 126.3 | 115.4 | 85.8 | 55.5 | 38.9 |
%Diff | 39% | 27% | 9% | 0% | 1% |
As we've said before, Dungeon Siege is quite a CPU-bound title, and here we can see that using hardware geometry processing accounts for nearly a 20% improvement over software processing at the lower resolutions. As the resolution goes up, this advantage is scaled back as Parhelia becomes a little more fill-rate limited in this game.
Max Payne shows even larger performance gains than Dungeon Siege at lower resolutions, but this is almost negated at high resolutions where fill-rate is of paramount importance.
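One rough way to see the fill-rate ceiling is to convert each frame rate into displayed pixels per second. The sketch below does this for the Max Payne hardware results; it counts only visible screen pixels (it ignores overdraw and multi-pass work, which we have no figures for), so treat it as an indication rather than a measured fill-rate. The function name is our own.

```python
def pixels_per_second(width, height, fps):
    """Displayed pixels per second: screen area multiplied by frame rate."""
    return width * height * fps

# Max Payne, hardware geometry processing, figures from the table above.
modes = [(640, 480), (800, 600), (1024, 768), (1280, 1024), (1600, 1200)]
hardware_fps = [126.3, 115.4, 85.8, 55.5, 38.9]

# Throughput in millions of displayed pixels per second.
throughput = [pixels_per_second(w, h, f) / 1e6
              for (w, h), f in zip(modes, hardware_fps)]

for (w, h), mpix in zip(modes, throughput):
    print(f"{w}x{h}: {mpix:.1f} Mpixels/s")
```

The throughput climbs steeply between the low resolutions and then levels off across the top two, which is what we would expect once the pixel pipelines rather than the geometry stage set the pace.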
Multi-Texturing vs Multi-Pass
Parhelia offers 4 texture units per pipeline, meaning that up to 4 texture maps can be sampled simultaneously. Serious Sam can handle up to four layers in a single pass, which should be a good fit for Parhelia. SS:SE also provides the ability to alter the number of texture layers applied per pass, so we can see how much benefit the 3D architecture gains from applying more texture layers in a single pass.
SS:SE's Citadel demo was used, with Trilinear filtering enabled:
Layers per pass | 640x480 | 1024x768 | 1600x1200 |
None | 71.7 | 49.6 | 25.3 |
Dual | 61.5 | 57.1 | 32.1 |
Triple | 60.6 | 57.3 | 34.3 |
Quad | 60.9 | 59.4 | 34.8 |
Diff from None | 640x480 | 1024x768 | 1600x1200 |
Dual | -14% | 15% | 27% |
Triple | -15% | 16% | 36% |
Quad | -15% | 20% | 38% |
Why performance is so much higher at 640x480 without any multi-texturing is a mystery. In theory we would expect multi-passing at low resolution to perform poorly because the geometry needs to be resent multiple times and hence would put more load on the Vertex Processors.
The gains in the higher resolutions here are quite large, but because of the texture unit arrangement on Parhelia, in a title like Serious Sam, reducing the number of texture layers applied per pass has a direct impact on its fill-rate and not just bandwidth. With future games a flexible number of texture layers per pass is very important for Parhelia.
We can see that the largest single gain comes from going from no multi-texturing to dual texturing. Each of Parhelia's texture units can take one bilinear sample per clock, so Trilinear sampling uses two texture units per map; hence at two texture maps per pass with Trilinear filtering all of Parhelia's texture units are being used simultaneously.
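The texture unit arithmetic can be sketched as follows. The unit counts come from the text above; the idea that a pipeline simply takes extra clocks when a pass needs more units than it has is our own simplifying assumption, not something confirmed for Parhelia, and the function names are hypothetical.

```python
import math

TEXTURE_UNITS_PER_PIPELINE = 4                    # stated in the article
UNITS_PER_MAP = {"bilinear": 1, "trilinear": 2}   # one bilinear sample per unit per clock

def units_required(layers_per_pass, filtering):
    """Texture units needed to sample every layer of one pass."""
    return layers_per_pass * UNITS_PER_MAP[filtering]

def clocks_per_pixel(layers_per_pass, filtering):
    # Assumption: if a pass needs more units than the pipeline has,
    # the pipeline loops for additional clocks.
    needed = units_required(layers_per_pass, filtering)
    return math.ceil(needed / TEXTURE_UNITS_PER_PIPELINE)

for layers in range(1, 5):
    needed = units_required(layers, "trilinear")
    clocks = clocks_per_pixel(layers, "trilinear")
    print(f"{layers} layer(s) per pass: {needed} units, {clocks} clock(s)")
```

With Trilinear filtering a single layer leaves half of the units idle, while two layers fill all four in one clock, which fits the observation that the step from no multi-texturing to dual texturing is the largest.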
Overdraw Reduction
We'll use the VillageMark test to see what Parhelia's performance is like under the conditions presented by this benchmark:
VillageMark | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
FPS | 166 | 119 | 81 | 53 | 39 |
Parhelia's performance in this test isn't too bad, despite it not being advertised with any bandwidth or overdraw reduction optimisations. VillageMark does use several simultaneous texture layers so the raw texturing performance of Parhelia will be of help, as will the raw bandwidth of the 256-bit memory bus.
Here we'll look at "Humus"'s GL_EXT_reme benchmark, which has tests using various different render orders.
GL_EXT_reme | Back-to-Front | Front-to-Back | Random |
Overdraw factor 3 | 242.1 | 249.4 | 245.2 |
Overdraw factor 8 | 100.3 | 102.5 | 102.9 |
From this we can see that the gains from Random or Front-to-Back ordering are minimal in comparison to Back-to-Front ordering, which is a good indication that there are no early per-pixel Z tests to retire an occluded pixel before texturing or Pixel Shading/Fragment processing is done on that pixel.
The next-generation id title 'DoomIII' specifically operates by rendering the entire scene in a very uncomplicated fashion, which is to first lay down only the Z information for the frame, prior to any of the per-pixel operations taking place. The result of this is that boards that have early Z checking / rejection schemes won't need to do the per-pixel operations, which involve very intensive calculations, on occluded pixels once the initial Z pass is complete. Because the Z information for the entire frame will already be present, there will be 100% certainty as to whether the pixels currently being rendered will actually be displayed. However, a board without early Z testing will not know whether or not a pixel is going to be occluded until it's ready to be sent to the frame buffer, so all of the intensive per-pixel operations will be carried out on every pixel regardless of whether or not it will actually be displayed in the final frame.
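To illustrate why render order and a Z-only first pass matter, here is a toy model of a single screen pixel covered by eight overlapping fragments. It is a simplified sketch of the general early Z idea, not a description of any particular board; the function names and the depth values are our own.

```python
def shaded_fragments(depths, early_z):
    """Count fragments that receive full per-pixel shading at one screen pixel.

    depths: fragment depths in submission order (smaller means closer).
    early_z: True if occluded fragments are rejected before shading.
    """
    nearest = float("inf")
    shaded = 0
    for depth in depths:
        visible = depth < nearest
        # Without early Z the fragment is shaded first and depth-tested afterwards.
        if visible or not early_z:
            shaded += 1
        if visible:
            nearest = depth
    return shaded

def shaded_with_z_prepass(depths):
    """Z-only pass first, then shade only the fragment matching the stored depth."""
    nearest = min(depths)            # pass 1: depth only, no shading
    return sum(1 for d in depths if d == nearest)   # pass 2: early Z rejects the rest

front_to_back = [1, 2, 3, 4, 5, 6, 7, 8]
back_to_front = front_to_back[::-1]
mixed_order = [5, 2, 7, 1, 8, 3, 6, 4]

for name, order in [("front-to-back", front_to_back),
                    ("back-to-front", back_to_front),
                    ("mixed", mixed_order)]:
    print(name, shaded_fragments(order, early_z=True),
          shaded_fragments(order, early_z=False))
```

In this model a board with early Z shades far fewer fragments when geometry arrives front-to-back, and exactly one per pixel after a Z-only pass, whereas a board without it shades all eight in every order. Near-identical scores across render orders are therefore what we would expect to see from hardware with no early rejection.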
Stencil Test (FableMark)
Here we'll see how Parhelia performs under the PowerVR FableMark test, which makes extensive use of Stencil Buffers:
FableMark | 640x480 | 800x600 | 1024x768 | 1280x1024 | 1600x1200 |
FPS | 32.5 | 20.5 | 14.2 | 8.7 | 5.8 |
Although the performance seems to be quite low at high resolutions, this is reasonably consistent with other DX8-class hardware. However, it does not appear that Parhelia features anything like single-pass stencil operations.