Fixed-Function vs. Programmable Logic

It would be tempting to believe that fixed-function logic is always the most efficient way to implement a given algorithm; however, that assumption quickly becomes dubious once execution is conditional or recursive, or when the algorithm doesn’t need to run all the time.

Let us consider a simple example: floating-point units. It is obviously possible for any Universal Turing Machine to emulate this functionality, but doing so would also be significantly less efficient than a proper fixed-function implementation. Or would it? Consider a program that only requires one such instruction every 50,000 cycles on average; the rest of the time, the fixed-function FPU would sit idle, reducing performance per transistor and draining power unless it was completely shut off via power islands (eliminating both dynamic power and leakage) or at least clock-gated to get rid of the dynamic power. On the other hand, assuming it can indeed be shut off, a discrete unit like that may actually improve power efficiency despite reducing performance per transistor.
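
To put some rough numbers behind that argument, here is a back-of-the-envelope sketch in C; every figure in it is hypothetical, chosen only to illustrate how the one-op-per-50,000-cycles scenario plays out with and without power gating:

```c
/* Back-of-envelope sketch (hypothetical numbers, not measured data):
 * energy per floating-point operation for a dedicated FPU that is mostly
 * idle versus software emulation, given one FP op every N cycles. */
#include <stdio.h>

int main(void)
{
    const double cycles_between_fp_ops = 50000.0; /* from the example above */

    /* Assumed per-event costs, in arbitrary energy units. */
    const double fpu_active_energy  = 20.0;  /* one FP op on the dedicated unit   */
    const double fpu_leakage_energy = 0.05;  /* per idle cycle, no power gating    */
    const double emulation_energy   = 600.0; /* one FP op emulated in integer code */

    /* Dedicated FPU, never power-gated: pays leakage on every idle cycle. */
    double fpu_no_gating = fpu_active_energy
                         + fpu_leakage_energy * (cycles_between_fp_ops - 1.0);

    /* Dedicated FPU behind a power island: idle leakage assumed negligible. */
    double fpu_gated = fpu_active_energy;

    printf("per FP op: emulation=%.1f, FPU (no gating)=%.1f, FPU (gated)=%.1f\n",
           emulation_energy, fpu_no_gating, fpu_gated);
    return 0;
}
```

With these made-up figures, the always-on FPU ends up costing more energy per operation than emulation, while the power-gated FPU easily wins; the point is the shape of the trade-off, not the numbers themselves.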

More problematic is the case where a given program conditionally executes either many floating-point operations or none at all. The most efficient implementation then depends on the probability of either path being taken. However, the best implementation may not always be the most efficient one (especially in terms of area); if the program’s execution is time-sensitive and the floating-point path wouldn’t be fast enough when emulated, it would be necessary to implement a real FPU despite the lower overall efficiency.
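
A similar sketch makes the dependence on path probability concrete; again, the cycle counts and the deadline below are hypothetical, purely to show how the expected cost of emulation versus a dedicated FPU shifts with the probability p of hitting the floating-point path:

```c
/* Sketch of the trade-off above (all figures hypothetical): expected cycle
 * cost when the floating-point path is taken with probability p, either
 * emulated on the integer pipeline or run on a dedicated FPU. */
#include <stdio.h>

int main(void)
{
    const double fp_cycles_emulated  = 400.0;  /* FP path, software emulation    */
    const double fp_cycles_dedicated = 20.0;   /* FP path, hardware FPU          */
    const double other_cycles        = 100.0;  /* the non-FP path                */
    const double deadline_cycles     = 150.0;  /* if execution is time-sensitive */

    for (double p = 0.0; p <= 1.0; p += 0.25) {
        double expected_emulated  = p * fp_cycles_emulated  + (1.0 - p) * other_cycles;
        double expected_dedicated = p * fp_cycles_dedicated + (1.0 - p) * other_cycles;
        printf("p=%.2f  emulated=%6.1f  dedicated=%6.1f%s\n",
               p, expected_emulated, expected_dedicated,
               (fp_cycles_emulated > deadline_cycles && p > 0.0)
                   ? "  (emulated FP path misses the deadline)" : "");
    }
    return 0;
}
```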

In practice, many processor designers don’t know for sure what software will run on their chips and, even if they do, they might not have sufficient data to make truly optimal design choices (the hardware and software development schedules rarely line up, and there may not be enough time to analyse the software anyway). As a result, these design decisions are based either on guesswork or, ideally, on sophisticated risk management and market analysis.

It should also be considered that some fixed-function logic really has more than one function; it may be slightly configurable (a floating-point unit’s rounding modes, for example) or even be implemented as a lightly programmable finite state machine. The goals are obvious: improve utilisation and simplify the design process. This can also reduce the number of potential hardware bugs (by moving a little of the complexity into software), reducing the probability of product delays and expensive silicon re-spins.
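
As a software-visible stand-in for that kind of light configurability, the C snippet below steers the FPU’s rounding mode through the standard <fenv.h> interface: the datapath stays fixed-function, only a small mode setting changes.

```c
/* Illustration of "slightly configurable" fixed-function hardware: the same
 * divider, steered by a small rounding-mode setting exposed via <fenv.h>. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double x = 1.0, y = 3.0;   /* volatile: keep the divide at run time */

    fesetround(FE_DOWNWARD);            /* configure, don't reprogram */
    double down = x / y;

    fesetround(FE_UPWARD);
    double up = x / y;

    fesetround(FE_TONEAREST);           /* restore the default mode */

    /* Same unit, two results differing in the last bit: the logic is
     * fixed-function, but its behaviour is configurable. */
    printf("round down: %.20f\nround up:   %.20f\n", down, up);
    return 0;
}
```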

The efficiency of this kind of flexibility, whether configuration or light programmability, depends mostly on the relative cost of its implementation. The cost of programmability is not constant, but it certainly doesn’t grow as fast as the execution unit itself. Nowhere was this more obvious than in the evolution of graphics processors, where programmability and arithmetic precision have increased in step. This happened both because higher precision became more valuable as programs were allowed to grow more complex, and because improving one of the two aspects lowers the relative cost of improving the other.

Specialisation vs. Unification

We referred to the improved utilisation of programmable logic above, but now we’ve got to admit that’s really just a convenient way of thinking about it. It’s not completely wrong to look at it that way, mind you, but that doesn’t make it any more rigorous. The real trade-off is between specialisation and unification, and in certain cases improving programmability won’t increase the level of unification, and vice versa.

Consider the decision for a handheld device to run basic video processing either on a general-purpose processor, such as an ARM core, or on a Digital Signal Processor (the DSP being left out of the design entirely in the former case). Both cores are, for all intents and purposes, equally programmable; however, the DSP is still more specialised for this kind of workload because of its instruction set and execution units, so it would likely deliver higher raw performance at lower power consumption. However, if area/cost efficiency is the most important factor and the ARM-only solution is ‘good enough’, then that (i.e. unification) remains the best solution.

At the other extreme, consider bilinear texture filtering on modern graphics processors. Some software developers have asked for this part of the pipeline to be made programmable in order to support more exotic algorithms. One frequently discussed method to achieve that without tons of data moving around the chip is to keep it as a dedicated unit, but add some light programmability; a bit like DirectX 8-level pixel shading, but ideally more efficient. Practically speaking, the unit would remain highly specialised despite the extra overhead (think of its input/output mechanisms and precision compared to a generic processor!) and utilisation would certainly not be improved. We don’t think this will actually happen, but this kind of idea is clearly applicable in other cases for a variety of reasons.
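
For reference, here is roughly what that dedicated unit computes, sketched as simplified single-channel C with clamp-to-edge addressing; the structure and names are illustrative rather than a description of any particular hardware:

```c
/* Simplified software sketch of one bilinear filtering sample, one channel. */
#include <math.h>

typedef struct { int width, height; const float *texels; } Texture;

static float texel(const Texture *t, int x, int y)
{
    /* Clamp-to-edge addressing; real hardware supports more wrap modes. */
    if (x < 0) x = 0; else if (x >= t->width)  x = t->width  - 1;
    if (y < 0) y = 0; else if (y >= t->height) y = t->height - 1;
    return t->texels[y * t->width + x];
}

float bilinear_sample(const Texture *t, float u, float v)
{
    /* Map normalised coordinates to texel space, centred on texel centres. */
    float fx = u * t->width  - 0.5f;
    float fy = v * t->height - 0.5f;
    int   x0 = (int)floorf(fx), y0 = (int)floorf(fy);
    float wx = fx - (float)x0, wy = fy - (float)y0;

    /* Fetch the 2x2 footprint and blend with the fractional weights. */
    float t00 = texel(t, x0,     y0);
    float t10 = texel(t, x0 + 1, y0);
    float t01 = texel(t, x0,     y0 + 1);
    float t11 = texel(t, x0 + 1, y0 + 1);
    float top    = t00 + wx * (t10 - t00);
    float bottom = t01 + wx * (t11 - t01);
    return top + wy * (bottom - top);
}
```

Making this sequence “lightly programmable” would mean letting software replace the final blend (or the weights) while keeping the addressing and fetch logic dedicated, which is exactly why the unit would stay specialised.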

In a more general sense, programmability makes unification possible; the more programmable a piece of logic is, the more likely it is that it could theoretically also execute an algorithm implemented elsewhere in a fixed-function unit or another programmable element. Whether unification is actually desirable, however, depends on both the average level of utilisation and the amount of extra efficiency that may be achieved through specialisation (or achievable raw performance, as we saw in the floating-point example on the previous page).

Furthermore, just as it may be substantially cheaper to implement an algorithm in software than in hardware (especially on leading-edge process nodes), the R&D cost of implementing several specialised processors is also higher than that of running everything on a single general-purpose processor. Especially if you need the latter anyway for other reasons! You may not have a choice if power efficiency is your main selling point (depending on your expected utilisation rate and on whether you use many power islands), but otherwise per-unit R&D amortisation is an extremely important factor.
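
To illustrate that amortisation argument, here is one more back-of-the-envelope sketch; the NRE and per-unit figures are entirely hypothetical:

```c
/* Sketch of per-unit R&D amortisation (all figures hypothetical): the extra
 * engineering cost of a specialised core only pays off above some volume. */
#include <stdio.h>

int main(void)
{
    const double extra_rnd_cost      = 20e6;  /* NRE for the specialised core      */
    const double silicon_saving_unit = 0.40;  /* smaller/cheaper silicon per unit  */

    /* Break-even volume: below this, running everything on the
     * general-purpose core you needed anyway is the cheaper option. */
    printf("break-even volume: %.0f units\n", extra_rnd_cost / silicon_saving_unit);

    for (double units = 1e6; units <= 100e6; units *= 10.0) {
        double amortised = extra_rnd_cost / units;
        printf("%9.0fk units: amortised NRE %.3f/unit vs silicon saving %.2f/unit\n",
               units / 1000.0, amortised, silicon_saving_unit);
    }
    return 0;
}
```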