november Tech: Stanford: ATI's GPU can Calculate Much Faster

Friday, October 06, 2006

Beyond3D recently sat down with Stanford's Mike Houston, GPGPU guru and B3D regular, to discuss Stanford's new Folding@Home client for ATI GPUs.

Beyond3D: Is the X1K series' dynamic branching performance the enabler that lets you really explore and exploit R580's (and R520's) abilities for GPGPU, and specifically GROMACS in BrookGPU in this case, in a way that is impossible on any other hardware right now? After that, which of the chip's other abilities are key for GROMACS performance? The ability to sustain close to peak performance in the fragment hardware? Memory bandwidth? Basically, what does the GROMACS core hit hard on the chip, and how are you exploiting that in the application?

Mike Houston: All GPUs are SIMD, so branching has a performance consequence. We have carefully designed the code to have high branch coherence. The code relies heavily on a tremendous amount of looping in the shader. On ATI, the overhead of looping and branching can be covered with math, and we have lots of math. We run the fragment shaders pretty close to peak for the instruction sequence used, i.e. we can't fully use all the pre-adders on the ALUs. But I wouldn't say branching is the enabler. I'd say the incredible memory system and threading design are what currently make the X1K often the best architecture for GPGPU. Those allow us to run the fragment engines at close to peak.
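The point about branch coherence can be illustrated with a toy model. The Python sketch below is not taken from the Folding@Home code; the function names, the batch width of 4, and the cycle costs are all assumptions chosen for illustration. It assumes that a SIMD batch pays for every branch path that at least one of its fragments takes.

```python
def simd_batch_cost(takes_branch, cost_if, cost_else):
    """Cycles for one SIMD batch: all lanes pay for each path any lane takes."""
    cost = 0
    if any(takes_branch):
        cost += cost_if
    if not all(takes_branch):
        cost += cost_else
    return cost


def total_cost(flags, batch_width, cost_if, cost_else):
    """Sum batch costs over fragments grouped into fixed-width SIMD batches."""
    return sum(
        simd_batch_cost(flags[i:i + batch_width], cost_if, cost_else)
        for i in range(0, len(flags), batch_width)
    )


# 16 fragments, half of which take the branch; only the ordering differs.
coherent = [True] * 8 + [False] * 8    # neighbours agree on the branch
incoherent = [True, False] * 8         # neighbours disagree

# Hypothetical costs: 10 cycles for the "if" path, 30 for the "else" path.
coherent_cost = total_cost(coherent, batch_width=4, cost_if=10, cost_else=30)
incoherent_cost = total_cost(incoherent, batch_width=4, cost_if=10, cost_else=30)
print(coherent_cost, incoherent_cost)  # 80 160
```

With the same number of fragments on each path, the incoherent ordering costs twice as much in this model, because every batch ends up executing both paths.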

What ATI can do that NVIDIA can't, and what currently matters for the folding code, is dynamically execute lots of instructions per fragment. On NVIDIA, the shader terminates after 64K instructions and exits with R0->3 in Color[0]->Color[3]. So, on NVIDIA, we have to multi-pass the shader, which crushes the cache coherence and increases our off-chip bandwidth requirements, which then exacerbates the texture latency problem described below.
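A rough way to see why multi-passing increases off-chip traffic is to count the passes and the state handed from one pass to the next. The sketch below is an assumption-laden illustration, not the actual BrookGPU behaviour: it takes "64K" to mean 65,536 instructions, assumes 32-bit floats, and assumes the state carried between passes is exactly the four float4 outputs (Color[0] to Color[3]), written once and read back once per handoff. All names are hypothetical.

```python
import math

INSTRUCTION_LIMIT = 64 * 1024  # "64K instructions" per pass, as quoted above
BYTES_PER_FLOAT4 = 4 * 4       # assumes 32-bit floats
OUTPUTS_PER_PASS = 4           # Color[0]..Color[3]


def passes_needed(instructions_per_fragment, limit=INSTRUCTION_LIMIT):
    """Number of shader passes required under a per-pass instruction cap."""
    return max(1, math.ceil(instructions_per_fragment / limit))


def extra_offchip_bytes(instructions_per_fragment, fragments):
    """Bytes of intermediate state written and then re-read between passes."""
    handoffs = passes_needed(instructions_per_fragment) - 1
    state_bytes = OUTPUTS_PER_PASS * BYTES_PER_FLOAT4
    return handoffs * fragments * state_bytes * 2  # one write + one read


# Hypothetical workload: 200,000 dynamic instructions per fragment.
print(passes_needed(200_000))              # 4
print(extra_offchip_bytes(200_000, 1000))  # 384000
```

A shader that fits in a single pass adds no such traffic in this model, which is the situation described for ATI.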

The other big thing for us is the way texture latency can be hidden on ATI hardware. With math, we can hide the cost of all texture fetches. We are compute bound by a large margin, and we could actually drive many more ALUs with the same memory system. NVIDIA can't hide the texture latency as well, and perhaps more importantly, even issuing a float4 fetch (which we use almost exclusively to feed the 4-wide vector units) costs 4 cycles. So NVIDIA's cost = ALU + texture + branch, whereas ATI's is MAX(ALU, texture, branch).
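The sum-versus-MAX comparison can be written out directly. The following Python sketch simply encodes the two formulas quoted above; the function names and the cycle counts are hypothetical and are not measurements from either vendor's hardware.

```python
def serialized_cost(alu, texture, branch):
    """Cost when ALU, texture, and branch work cannot overlap (the sum)."""
    return alu + texture + branch


def overlapped_cost(alu, texture, branch):
    """Cost when the three kinds of work fully overlap (the MAX)."""
    return max(alu, texture, branch)


# Hypothetical cycle counts for a math-heavy shader.
alu, texture, branch = 100, 40, 10

print(serialized_cost(alu, texture, branch))  # 150
print(overlapped_cost(alu, texture, branch))  # 100
```

In the overlapped model, texture and branch work is free as long as it stays below the ALU time, which matches the description of hiding fetch cost with math.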

While it would be possible to run the code on the current NVIDIA hardware, we would have to make pretty large changes to the code they want to run, and even past that, the performance is not great. We will have to look at their next architecture and re-evaluate. The next chips from both vendors should be interesting.

Beyond3D: Are you using ATI's Close To the Metal (CTM) API in BrookGPU now, and are you using it for this first Folding@Home implementation? How is it helping BrookGPU get better on R580 and R520 in a theoretical sense?

Mike Houston: There is a BrookGPU CTM backend currently being worked on and we hope to have it public when CTM is public. It is not being used for the current algorithms running in the Folding client though. However, it will enable other algorithms we weren't able to do in the past because of access to larger register files (128 registers!), scatter, and explicit control of the memory formats and memory system. You can do really neat things with CTM that we couldn't before through Direct3D/OpenGL. Being able to render and texture directly from host memory makes debugging much easier, and also allows an easy mechanism for asynchronous transfer to and from the hardware.

The main thing for BrookGPU is that the overheads of GL and D3D go away and we have full control of setting up the board. No extra commands are sent to the board. Also, we can compile directly to the ISA, so we don't have to worry about game optimizations breaking our GPGPU code. And since we talk directly to the board, we are immune to driver changes, which makes verification and shipping of actual applications much easier.

CTM is really going to change the way that GPGPU is done. Honestly, to do GPGPU for real, you must have low-level access to the hardware. Having this access helps you better bend the architecture to your will and, when that fails, better understand how to change your algorithm. Using CTM, we were able to get matrix multiplication on R580 up from ~15 Gflops to ~120 Gflops by having control over the memory system and formats.

Folding@Home is currently written in BrookGPU, and uses the D3D9 backend.

Beyond3D: Can you talk about how you structure GROMACS on a GPU given the GPU's architecture and how it performs?

Mike Houston: The general key is to try to restructure your code to be compute bound rather than bandwidth bound. This is often difficult, but you can work to restructure your data access patterns to get the best use out of the memory system. GPGPU.org is a fantastic resource for tips and tricks for GPGPU.
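One common way to reason about compute bound versus bandwidth bound code is a roofline-style estimate, where attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a generic illustration of that idea, not the method used for GROMACS; the peak and bandwidth figures are hypothetical, and the tiled matrix multiply is only a stand-in for "restructuring data access patterns".

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline estimate: limited by either peak compute or memory traffic."""
    return min(peak_gflops, flops_per_byte * bandwidth_gb_s)


def bound(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Classify a kernel by comparing its intensity to the ridge point."""
    ridge = peak_gflops / bandwidth_gb_s
    return "compute" if flops_per_byte >= ridge else "bandwidth"


def tiled_matmul_intensity(tile, bytes_per_float=4):
    """Flops per byte when a tile-by-tile block of A and of B is loaded once."""
    flops = 2 * tile ** 3                          # one multiply-add per (i, j, k)
    bytes_moved = 2 * tile ** 2 * bytes_per_float  # one tile of A, one tile of B
    return flops / bytes_moved


PEAK, BW = 200.0, 40.0  # hypothetical: 200 Gflops peak, 40 GB/s memory

for tile in (1, 32):
    intensity = tiled_matmul_intensity(tile)
    print(tile, intensity, bound(intensity, PEAK, BW),
          attainable_gflops(intensity, PEAK, BW))
# 1 0.25 bandwidth 10.0
# 32 8.0 compute 200.0
```

Under these assumed numbers, reusing each loaded value more often (a larger tile) moves the same computation from the bandwidth-limited side of the roofline to the compute-limited side.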

Labels: Hardware

This entry was posted on Friday, October 06, 2006 at 6:45 AM.
