OpenCL support for AMD/NVidia GPU rendering is currently on hold. Only a small subset of the rendering kernel can currently be compiled, which leaves OpenCL rendering at the prototype stage. We will need major driver or hardware improvements to get full Cycles support on AMD hardware. On NVidia hardware, CUDA is still faster, and Intel integrated GPUs are unlikely to give any speed improvement over CPU rendering.
In Blender 2.65, OpenCL is not available as a choice in the UI by default. Defining the environment variable CYCLES_OPENCL_TEST makes it show up, which can be useful for developers who want to test it. The OpenCL kernel is located in 2.65/scripts/addons/cycles/kernel. In the file kernel_types.h, specific functionality can be enabled or disabled for testing without recompiling Blender.
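As a rough illustration of the environment variable step, the sketch below starts Blender from Python with CYCLES_OPENCL_TEST defined. The text above only says the variable must be defined, so the value "1" is an assumption, as is the `blender` executable name being on the PATH. The helper names `opencl_test_env` and `launch_blender` are hypothetical and not part of Blender.

```python
import os
import subprocess


def opencl_test_env(base_env=None):
    """Return a copy of the environment with CYCLES_OPENCL_TEST defined."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: any value works, since the variable only needs to be defined.
    env["CYCLES_OPENCL_TEST"] = "1"
    return env


def launch_blender(blender_path="blender"):
    """Start Blender with the OpenCL test option exposed in the UI.

    Assumption: blender_path points at a Blender 2.65 executable.
    """
    return subprocess.Popen([blender_path], env=opencl_test_env())
```

Calling `launch_blender()` starts Blender as a child process; the environment of the calling shell is left unchanged.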
The path tracing kernel is currently a single big kernel, much bigger than typical OpenCL code. There are about 40 shading nodes, 10 BSDFs, etc.
Splitting the kernel up into smaller parts may help, but even then, compiling only the shading node execution code fails, and that code would be quite difficult to split up further. An alternative would be to compile a kernel for each material in the scene, but I don't have much faith that complex node setups would compile reliably that way, and scene startup time would increase considerably.
If at all possible I would like to avoid splitting up the kernel into many pieces, mainly because it makes extending the code much harder (we're only getting started in terms of number of features), and also because I haven't really seen this demonstrated working efficiently in other renderers yet; e.g. NVidia OptiX also uses a single kernel.
The immediate issue that you run into when trying OpenCL is that compilation takes a long time, or the compiler crashes after running out of memory. We can successfully compile a subset of the rendering kernel (thanks to the work of developers at AMD improving the driver), but not enough to consider this usable in practice beyond a demo.
NVidia hardware and compilers support true function calls, which is important for complex kernels. It seems that AMD hardware or compilers do not support them, or not to the same extent.
To see why this is a problem, consider this example. There are 5 places where the shading nodes are executed, and there are 20 places in the shading nodes where Perlin noise is used. Because no true function calls are supported, every call must be inlined, so the compiler ends up copying the Perlin noise code 5 x 20 = 100 times. You can see how this would make the final code size blow up and cause issues for the compiler.
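The arithmetic in this example can be sketched with a toy call graph. This is only an illustrative model, not how any real compiler works: the function names (`path_trace_kernel`, `shade_nodes`, `perlin_noise`) and the helper `inlined_copies` are hypothetical, and the graph simply encodes the 5 and 20 call sites mentioned above.

```python
def inlined_copies(call_graph, root, target):
    """Count how many copies of `target` exist once every call is inlined.

    call_graph maps a function name to the list of functions it calls,
    with one list entry per call site.
    """
    if root == target:
        return 1
    return sum(inlined_copies(call_graph, callee, target)
               for callee in call_graph[root])


# Hypothetical call graph matching the numbers in the example above.
call_graph = {
    "path_trace_kernel": ["shade_nodes"] * 5,   # 5 places that run the shading nodes
    "shade_nodes": ["perlin_noise"] * 20,       # 20 uses of Perlin noise
    "perlin_noise": [],
}

print(inlined_copies(call_graph, "path_trace_kernel", "perlin_noise"))  # prints 100
```

With true function calls the Perlin noise code would exist once and be called from each site; with forced inlining, the number of copies is the product of the call-site counts along each path, which is why nesting makes the code size grow so quickly.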
Note that V-Ray RT at this time also does not support running its full OpenCL kernel on AMD (only an older and simpler version), and LuxRender with OpenCL is also running into kernel size issues when adding more features. So that's a good indication that kernel size is the main issue here.