Note: See the updated, complete list of differences between all CUDA compute capabilities.
The release of Fermi, NVIDIA's next-generation CUDA architecture, shows that CUDA is still an evolving architecture. Fermi, with compute capability 2.0, differs from previous architectures in several ways. It raises the maximum number of threads per block from 512 to 1024, packs 512 CUDA cores onto a single chip, and can run multiple kernels concurrently. Shared memory per SM has grown from 16 KB to up to 48 KB and, most importantly, the number of streaming processors (CUDA cores) per SM has increased from 8 to 32. The comparison below, by NVIDIA, gives a complete picture of the differences between compute capabilities 1.0, 1.1, 1.2, 1.3 and 2.0 of NVIDIA's CUDA-enabled devices.
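To make the per-block differences concrete, here is a minimal C sketch that records a few of the limits mentioned above for each compute capability and checks whether a block configuration fits. The names `cc_limits`, `cc_lookup` and `launch_fits` are hypothetical helpers for illustration, not part of the CUDA API; in a real program you would read these values from the device with `cudaGetDeviceProperties` (fields such as `maxThreadsPerBlock` and `sharedMemPerBlock`) rather than hard-coding them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type: a few per-device limits for the compute
   capabilities compared in this post (1.0-1.3 and 2.0). */
typedef struct {
    int major;
    int minor;
    int max_threads_per_block;   /* 512 on 1.x, 1024 on 2.0 */
    size_t max_shared_per_block; /* bytes: 16 KB on 1.x, 48 KB on 2.0.
                                    Upper bound only: on 1.x, kernel arguments
                                    also live in shared memory, so slightly
                                    less is usable in practice. */
    int cores_per_sm;            /* 8 on 1.x, 32 on 2.0 (Fermi) */
    bool concurrent_kernels;     /* multiple kernels at once: 2.0 only */
} cc_limits;

static const cc_limits CC_TABLE[] = {
    {1, 0, 512, 16 * 1024, 8, false},
    {1, 1, 512, 16 * 1024, 8, false},
    {1, 2, 512, 16 * 1024, 8, false},
    {1, 3, 512, 16 * 1024, 8, false},
    {2, 0, 1024, 48 * 1024, 32, true},
};

/* Return the limits for compute capability major.minor, or NULL if that
   capability is not in the table above. */
const cc_limits *cc_lookup(int major, int minor) {
    size_t n = sizeof CC_TABLE / sizeof CC_TABLE[0];
    for (size_t i = 0; i < n; ++i) {
        if (CC_TABLE[i].major == major && CC_TABLE[i].minor == minor) {
            return &CC_TABLE[i];
        }
    }
    return NULL;
}

/* True if a block of `threads` threads using `shared_bytes` bytes of
   shared memory stays within the per-block limits of major.minor. */
bool launch_fits(int major, int minor, int threads, size_t shared_bytes) {
    const cc_limits *lim = cc_lookup(major, minor);
    if (lim == NULL || threads <= 0) {
        return false;
    }
    return threads <= lim->max_threads_per_block &&
           shared_bytes <= lim->max_shared_per_block;
}
```

For example, `launch_fits(2, 0, 1024, 48 * 1024)` is true on Fermi, while `launch_fits(1, 3, 1024, 0)` is false because compute capability 1.3 allows at most 512 threads per block. A table lookup keeps the comparison explicit; querying the device at run time is the more robust choice once newer capabilities appear.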
For the full Fermi reference, download the following Fermi-related documents:
CUDA Programming Guide for CUDA Toolkit 3.0
Fermi Compatibility Guide
Fermi Tuning Guide